IV LLM refactoring #21

Open
akhkim wants to merge 2 commits into refactoring from iv_refactored

Conversation

@akhkim (Collaborator) commented Mar 28, 2026

No description provided.

Copilot AI review requested due to automatic review settings March 28, 2026 18:58

Copilot AI left a comment


Pull request overview

This PR refactors and extends the instrumental-variables (IV) pipeline by integrating an IV-LLM-based instrument discovery flow into CAIS, adding the supporting IV-LLM package code, and updating related tooling/tests.

Changes:

  • Added an iv_discovery_tool (and Pydantic I/O models) plus an IVDiscovery component that runs an IV-LLM pipeline (hypothesizer + critics).
  • Introduced an iv_llm subpackage (prompts, agents, critics, utilities) and logging setup for IV-LLM runs.
  • Updated agent/tool wiring and dataset cleaning execution behavior; added/updated tests.

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 15 comments.

File Description
tests/cais/test_e2e_iv_new.py New E2E IV test covering the updated IV pipeline (real LLM calls).
tests/cais/methods/test_diff_in_diff.py New unit test intended to validate DiD estimator output structure.
cais/utils/llm_helpers.py Refactors LLM base model import and adds invoke_llm helper.
cais/utils/agent.py Adds a LangChain ReAct agent executor + custom output parser (LCEL-style agent).
cais/tools/iv_discovery_tool.py New tool that discovers instruments via IV-LLM and updates Variables.instrument_variable.
cais/tools/controls_selector_tool.py Enables LangChain @tool decorator and aligns imports.
cais/models.py Adds IVDiscoveryInput / IVDiscoveryOutput models.
cais/iv_llm/src/variable_utils.py Adds helpers for extracting/mapping LLM-proposed variable names to dataset columns.
cais/iv_llm/src/prompts/prompt_loader.py Adds prompt loading/formatting helper for IV-LLM prompts.
cais/iv_llm/src/prompts/independence_critic.txt Adds IV independence-critic prompt template.
cais/iv_llm/src/prompts/hypothesizer.txt Adds IV hypothesizer prompt template with “only from columns list” constraint.
cais/iv_llm/src/prompts/exclusion_critic.txt Adds IV exclusion-restriction prompt template.
cais/iv_llm/src/prompts/confounder_miner.txt Adds confounder miner prompt template with “only from columns list” constraint.
cais/iv_llm/src/prompts/__init__.py Marks prompts as a package.
cais/iv_llm/src/llm/client.py Adds a thin LLM client adapter for IV-LLM code.
cais/iv_llm/src/llm/__init__.py Marks llm as a package.
cais/iv_llm/src/experiments/iv_co_scientist.py Adds an experimental IV co-scientist runner.
cais/iv_llm/src/critics/independence_critic.py Adds critic implementation using IV-LLM prompts.
cais/iv_llm/src/critics/exclusion_critic.py Adds critic implementation using IV-LLM prompts.
cais/iv_llm/src/critics/__init__.py Marks critics as a package.
cais/iv_llm/src/agents/hypothesizer.py Adds IV hypothesizer agent with column-grounding logic.
cais/iv_llm/src/agents/confounder_miner.py Adds confounder-miner agent with column-grounding logic.
cais/iv_llm/src/agents/__init__.py Marks agents as a package.
cais/iv_llm/src/__init__.py Marks IV-LLM src as a package.
cais/iv_llm/__init__.py Adds IV-LLM log file handler configuration on package import.
cais/components/iv_discovery.py Adds IV discovery component orchestrating hypothesizer + critics.
cais/components/dataset_cleaner.py Improves generated-script execution plumbing and adds fallback behavior/logging.
cais/agent.py Wires IV discovery into CausalAgent and adjusts controls/cleaning behavior and logging.
cais/__init__.py Exposes iv_discovery_tool via top-level package imports.


Comment on lines +14 to +16
def test_iv_llm_pipeline_app_engagement_push(self):
    """Run the full CAIS pipeline end-to-end (real API calls, no mocks)."""


Copilot AI Mar 28, 2026


This E2E test performs real LLM/API calls but doesn’t skip when credentials are missing (unlike the other E2E tests in this repo), so it will make CI fail or incur unexpected costs. Add a setUpClass that loads dotenv and raises unittest.SkipTest when the required API key env var isn’t present.
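A minimal sketch of that guard. The env var name `OPENAI_API_KEY` and the optional python-dotenv dependency are assumptions here, not taken from this repo; substitute whatever the other E2E tests actually check:

```python
import os
import unittest


class TestIVLLMPipelineE2E(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Load a .env file if python-dotenv is available (assumed optional dependency).
        try:
            from dotenv import load_dotenv
            load_dotenv()
        except ImportError:
            pass
        # Skip the whole class when credentials are absent, instead of failing CI.
        if not os.environ.get("OPENAI_API_KEY"):
            raise unittest.SkipTest("OPENAI_API_KEY not set; skipping real-API E2E test")
```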

Comment on lines +27 to +34
# Mock all helper/validation functions within diff_in_diff.py
@patch('cais.methods.diff_in_diff.identify_time_variable')
@patch('cais.methods.diff_in_diff.identify_treatment_group')
@patch('cais.methods.diff_in_diff.determine_treatment_period')
@patch('cais.methods.diff_in_diff.validate_parallel_trends')
# Mock estimate_did_model to avoid actual regression, return mock results
@patch('cais.methods.diff_in_diff.estimate_did_model')
def test_estimate_effect_structure_and_types(self, mock_estimate_model, mock_validate_trends,

Copilot AI Mar 28, 2026


The mocked patch paths reference cais.methods.diff_in_diff.*, which does not exist in the codebase. After fixing the import, these decorators should patch the helpers in the actual module where they are defined (e.g., cais.methods.difference_in_differences.estimator.*), otherwise the test won’t mock anything.
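The failure mode is easy to verify: `unittest.mock.patch` imports the dotted module path when the patch is entered, so a nonexistent target errors rather than silently mocking nothing. A standalone check (module names below are illustrative, not from this repo):

```python
from unittest.mock import patch


def patch_target_exists(target: str) -> bool:
    """Return True if `target` can actually be patched (module and attribute resolve)."""
    try:
        with patch(target):
            return True
    except (ModuleNotFoundError, AttributeError):
        return False


# A real stdlib target patches fine; a made-up module path does not.
print(patch_target_exists("json.loads"))                                   # True
print(patch_target_exists("no_such_pkg.diff_in_diff.estimate_did_model"))  # False
```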

Comment on lines +1 to +22
import yaml
import json
import pandas as pd
from ..causal_analysis.correlation import analyze_data
from ..agents.human_proxy import HumanProxy
from ..agents.causal_oracle import CausalOracle
from ..agents.hypothesizer import Hypothesizer
from ..agents.confounder_miner import ConfounderMiner
from ..critics.exclusion_critic import ExclusionCritic
from ..critics.independence_critic import IndependenceCritic
from ..agents.grounder import Grounder
from ..analysis.iv_estimator import IVEstimator
from ..llm.client import LLMClient
import os

class IVCoScientist:
    def __init__(self, config):
        # Accept either config dict or config path
        if isinstance(config, str):
            with open(config, 'r') as f:
                self.config = yaml.safe_load(f)
        else:

Copilot AI Mar 28, 2026


This file imports yaml (PyYAML), but PyYAML is not declared in pyproject.toml dependencies (it’s only pinned in requirements.txt). If this experiment module is meant to be usable from the installed package, add PyYAML to project.dependencies or defer the import behind a runtime guard so installs don’t break when users try to run it.

Comment on lines +6 to +8
# Import the function to test
from cais.methods.diff_in_diff import estimate_effect


Copilot AI Mar 28, 2026


from cais.methods.diff_in_diff import estimate_effect will fail because there is no cais/methods/diff_in_diff.py module in the repo. Import the DiD estimator from cais.methods.difference_in_differences.estimator (or from cais.methods’ exported alias) and update the patch targets accordingly.

Comment on lines 51 to 60
# Set up basic logging
os.makedirs('./logs/', exist_ok=True)
logging.basicConfig(
    filename='./logs/agent_debug.log',
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)



Copilot AI Mar 28, 2026


logging.basicConfig(...) at import time reconfigures global logging for any library consumer and for the whole test suite. Prefer leaving logging configuration to the application entrypoint/CLI; here, set a module logger and emit logs without calling basicConfig (or gate it behind if __name__ == "__main__").

Suggested change
-# Set up basic logging
-os.makedirs('./logs/', exist_ok=True)
-logging.basicConfig(
-    filename='./logs/agent_debug.log',
-    level=logging.INFO,
-    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
-)
-logger = logging.getLogger(__name__)
+# Set up module logger
+logger = logging.getLogger(__name__)
+
+if __name__ == "__main__":
+    # Configure basic logging only when this module is executed as a script
+    os.makedirs('./logs/', exist_ok=True)
+    logging.basicConfig(
+        filename='./logs/agent_debug.log',
+        level=logging.INFO,
+        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+    )

variables=self.variables,
dataset_analysis=self.dataset_analysis,
dataset_description=self.dataset_description,
original_query=query

Copilot AI Mar 28, 2026


IV discovery is invoked via iv_discovery_tool.func(...) without passing the agent’s configured LLM/provider/model. The current IV discovery component creates its own default LLM client, which can lead to inconsistent behavior (and failures) when the agent is configured for a non-default provider/model. Consider passing self.llm through the tool/component so all steps share the same client/config.

Suggested change
-original_query=query
+original_query=query,
+llm=self.llm,

Final Answer: [your response here]
```

DO NOT UNDER ANY CIRCUMSTANCE CALL MORE THAN ONE TOOL IN A STEP

Copilot AI Mar 28, 2026


The system prompt forbids multiple tool calls in a step, but the output parser’s format instructions explicitly allow repeating Action/Action Input blocks (multi-tool calls). This contradiction will confuse the model and increase parsing failures—either enforce single-action parsing or update the prompt/instructions to match the intended behavior.

Suggested change
-DO NOT UNDER ANY CIRCUMSTANCE CALL MORE THAN ONE TOOL IN A STEP
+In each step (each Thought/Action/Action Input/Observation block), you MUST call at most one tool. You MAY repeat this step format multiple times (with new Thoughts, Actions, Action Inputs, and Observations) before giving the Final Answer.
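If the single-action route is chosen instead, the parser itself can enforce it. A minimal sketch: the Action/Action Input block format follows the quoted prompt, but the regex and function names are illustrative, not from this PR:

```python
import re

# One Action / Action Input pair per step; DOTALL is deliberately avoided so
# each capture stops at the end of its own line.
ACTION_RE = re.compile(r"Action:\s*(.+)\nAction Input:\s*(.+)")


def parse_single_action(text: str) -> tuple[str, str]:
    """Return (tool_name, tool_input), rejecting outputs with multiple Action blocks."""
    matches = ACTION_RE.findall(text)
    if len(matches) != 1:
        raise ValueError(f"Expected exactly one Action block, found {len(matches)}")
    name, arg = matches[0]
    return name.strip(), arg.strip()
```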

Comment on lines +86 to +87
# Get confounders from variables if available
confounders = variables.covariates or []

Copilot AI Mar 28, 2026


This tool uses variables.covariates as the confounder list for IV validation. In this codebase, confounders is a distinct field on Variables (used by propensity-score methods), so this likely drops the actual confounders identified by query_interpreter. Use variables.confounders (or a union of confounders and covariates if that’s intended) to avoid validating independence against the wrong set.

Suggested change
-# Get confounders from variables if available
-confounders = variables.covariates or []
+# Get confounders from variables if available.
+# Prefer the dedicated `confounders` field, but also include any covariates.
+base_confounders = variables.confounders or []
+additional_covariates = variables.covariates or []
+# Build a unified list without duplicates, preserving order.
+confounders = list(dict.fromkeys(base_confounders + additional_covariates))
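The ordered de-duplication used in the suggestion, shown standalone (the field values here are made up):

```python
# dict.fromkeys keeps first-seen order and drops repeats, unlike set().
base_confounders = ["age", "income"]
additional_covariates = ["income", "region"]
confounders = list(dict.fromkeys(base_confounders + additional_covariates))
print(confounders)  # ['age', 'income', 'region']
```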

Comment on lines +104 to +112
for m in matches:
    tool_name = m.group(1).strip()
    tool_input = m.group(2).strip().strip('"')
    print('\n--------------------------')
    print(tool_input)
    print('--------------------------')
    actions.append(AgentAction(tool_name, json.loads(tool_input), text))

return actions

Copilot AI Mar 28, 2026


json.loads(tool_input) is called without error handling. If the model emits invalid JSON (common in practice), this will raise JSONDecodeError instead of OutputParserException, bypassing handle_parsing_errors and potentially crashing the agent. Catch JSON decode errors and raise OutputParserException with a helpful message/observation.
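A sketch of the guard, using a stdlib stand-in so it runs standalone; in the real parser the `except` body would raise LangChain's `OutputParserException` so `handle_parsing_errors` can feed the message back to the model:

```python
import json


class ParsingError(Exception):
    """Stand-in for langchain_core.exceptions.OutputParserException."""


def parse_tool_input(tool_input: str):
    try:
        return json.loads(tool_input)
    except json.JSONDecodeError as exc:
        # Surface a recoverable parsing error instead of crashing the agent loop.
        raise ParsingError(f"Invalid JSON in Action Input: {exc}") from exc
```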

Comment on lines +266 to +270
memory=memory, # Pass the memory object
verbose=True,
callbacks=[ConsoleCallbackHandler()], # Optional: for console debugging
handle_parsing_errors=True, # Let AE handle parsing errors
max_retries = 100

Copilot AI Mar 28, 2026


AgentExecutor(...) is constructed with max_retries=100, but in LangChain 0.3.x AgentExecutor doesn’t accept a max_retries kwarg (it uses max_iterations/max_execution_time depending on version). This will raise TypeError at runtime. Use the supported limit parameters or implement retry logic outside the executor.

Suggested change
-memory=memory, # Pass the memory object
-verbose=True,
-callbacks=[ConsoleCallbackHandler()], # Optional: for console debugging
-handle_parsing_errors=True, # Let AE handle parsing errors
-max_retries = 100
+memory=memory, # Pass the memory object
+verbose=True,
+callbacks=[ConsoleCallbackHandler()], # Optional: for console debugging
+handle_parsing_errors=True, # Let AE handle parsing errors
+max_iterations=100
