Claude Code 24: Multiple Agents Auditing Your Diff-in-Diff Code (Part 1)
- Author/Source: Scott Cunningham (Baylor University), via Substack ("Causal Inference")
- Original: https://causalinf.substack.com/p/claude-code-24-multiple-agents-auditing
- Key Ideas
- LLM hallucination in code generation can be modeled as classical measurement error — random syntax mistakes that cascade through pipelines
- If hallucination errors are independent across programming languages, replicating analysis in multiple languages makes joint failure probability vanishingly small (product of individual error rates)
- The proposed workflow: replicate entire research pipelines in Stata, R, and Python, then have an agent verify outputs match across all three
- Five diff-in-diff packages tested: `csdid` (Stata), `csdid2` (Stata), `did` (R), `diff-diff` (Python), `differences` (Python)
- Cross-language replication works for deterministic code (OLS, merges, cleaning) but not for stochastic methods (bootstraps, MCMC, random forests)
- Classic Stata pitfall illustrated: `replace x = 10 if x > 10` also replaces missing values, because Stata treats a missing value as larger than any number. This is a syntax-specific error that would not appear in R or Python
- Code audits and cross-language replication are two separate verification tasks that complement each other
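The Stata pitfall in the list above can be made concrete with a small sketch. It assumes NaN stands in for a missing value; Python (like NumPy and pandas) evaluates `NaN > 10` as false, while Stata orders missing above every number. The helper names and toy values are hypothetical, and `stata_style_cap` only mimics Stata's ordering for illustration.

```python
import math


def python_cap(values, cap=10.0):
    """Cap values above `cap`. NaN > cap is False, so missing values are left alone."""
    return [cap if v > cap else v for v in values]


def stata_style_cap(values, cap=10.0):
    """Mimic Stata's `replace x = cap if x > cap`, where missing sorts above any number."""
    return [cap if (math.isnan(v) or v > cap) else v for v in values]


x = [5.0, 12.0, math.nan]  # hypothetical toy data; NaN plays the role of missing

print(python_cap(x))       # [5.0, 10.0, nan]  -> missing stays missing
print(stata_style_cap(x))  # [5.0, 10.0, 10.0] -> missing silently becomes 10
```

The two results differ only in the missing observation, which is the kind of disagreement a cross-language comparison would surface.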
- Summary
Cunningham proposes a formal framework for using Claude Code to verify research code by treating LLM hallucination as classical measurement error. If code errors are random and language-specific (syntax mistakes in Stata are independent of syntax mistakes in R or Python), then replicating an entire analysis pipeline in multiple languages provides a powerful check: the probability that all three languages produce the same wrong result is the product of their individual error rates — a very small number.
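The independence argument reduces to simple arithmetic, sketched here under stated assumptions: the per-language error rates are hypothetical placeholders (the essay does not report specific values), and errors are taken to be fully independent across languages. Strictly, the product is the probability that all implementations are wrong at once, so it is an upper bound on their agreeing on the same wrong answer.

```python
from math import prod


def joint_failure_probability(error_rates):
    """Probability that every implementation is wrong, assuming independent errors."""
    rates = list(error_rates)
    for p in rates:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"error rate must lie in [0, 1], got {p}")
    return prod(rates)


# Hypothetical per-language error rates, chosen only for illustration.
rates = {"Stata": 0.05, "R": 0.05, "Python": 0.05}

p_all = joint_failure_probability(rates.values())
print(f"single-language error rate: {rates['Stata']:.2f}")
print(f"all three wrong at once:    {p_all:.6f}")  # 0.000125
```

If errors are correlated, for example because the same misreading of the research design is carried into every translation, the product understates the true joint failure probability.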
He demonstrates this with a case study using five difference-in-differences packages across three languages, applied to a Brazilian study on mental health deinstitutionalization and homicides. The essay distinguishes between two complementary verification tasks: (1) having antagonistic subagents audit code logic and reasoning, and (2) having Claude Code replicate the entire pipeline, from data cleaning through estimation, in two additional languages, then comparing outputs digit by digit. He is careful to note the limitation: this approach works only for deterministic computations (OLS, fixed effects, merges, summary statistics) and breaks down for stochastic methods like bootstrapping or MCMC, where random draws differ across languages even when the seeds match.
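The comparison step might look something like the following minimal sketch. It assumes each language's pipeline has already exported its estimates as name-to-value pairs; the function name, the tolerances, and all numbers are hypothetical placeholders rather than results from the study. A small numerical tolerance is used in place of a literal digit-by-digit match, since floating-point output can differ in the last digits across languages.

```python
import math


def compare_estimates(results, rel_tol=1e-8, abs_tol=1e-10):
    """Return {parameter: {language: value}} for parameters on which languages disagree."""
    languages = list(results)
    reference = languages[0]
    params = set().union(*(results[lang].keys() for lang in languages))
    mismatches = {}
    for param in sorted(params):
        values = {lang: results[lang].get(param) for lang in languages}
        # A parameter missing from any language counts as a mismatch.
        if any(v is None for v in values.values()):
            mismatches[param] = values
            continue
        ref = values[reference]
        if not all(
            math.isclose(ref, v, rel_tol=rel_tol, abs_tol=abs_tol)
            for v in values.values()
        ):
            mismatches[param] = values
    return mismatches


# Hypothetical exported estimates; the Python standard error deliberately differs.
results = {
    "stata": {"att": -0.1234567891, "se": 0.0456789012},
    "r": {"att": -0.1234567891, "se": 0.0456789012},
    "python": {"att": -0.1234567891, "se": 0.0456789999},
}

mismatches = compare_estimates(results)
for param, values in mismatches.items():
    print(f"MISMATCH in {param}: {values}")
```

A check like this is only meaningful for the deterministic quantities; bootstrap standard errors or MCMC draws would need to be excluded or compared under a much looser tolerance.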
- Relevance to Economics Research
The essay directly addresses the reproducibility crisis and the problem of code verification in empirical economics. The insight that cross-language replication exploits the independence of syntax-specific errors is both formally justified and practically implementable with AI agents. The diff-in-diff application is immediately relevant to the large community of applied researchers using Callaway and Sant'Anna estimators, and the framework generalizes to any deterministic empirical method.