Claude Code 24: Multiple Agents Auditing Your Diff-in-Diff Code (Part 1)

  • Author/Source: Scott Cunningham (Baylor University), via Substack ("Causal Inference")
  • Original: https://causalinf.substack.com/p/claude-code-24-multiple-agents-auditing

  • Key Ideas

  • LLM hallucination in code generation can be modeled as classical measurement error — random syntax mistakes that cascade through pipelines
  • If hallucination errors are independent across programming languages, replicating analysis in multiple languages makes joint failure probability vanishingly small (product of individual error rates)
  • The proposed workflow: replicate entire research pipelines in Stata, R, and Python, then have an agent verify outputs match across all three
  • Five diff-in-diff packages tested: csdid (Stata), csdid2 (Stata), did (R), diff-diff (Python), differences (Python)
  • Cross-language replication works for deterministic code (OLS, merges, cleaning) but not for stochastic methods (bootstraps, MCMC, random forests)
  • Classic Stata pitfall illustrated: replace x = 10 if x > 10 also replaces missing values — a syntax-specific error that would not appear in R or Python
  • Code audits and cross-language replication are two separate verification tasks that complement each other

  • Summary

Cunningham proposes a formal framework for using Claude Code to verify research code by treating LLM hallucination as classical measurement error. If code errors are random and language-specific (syntax mistakes in Stata are independent of syntax mistakes in R or Python), then replicating an entire analysis pipeline in multiple languages provides a powerful check: the probability that all three languages produce the same wrong result is the product of their individual error rates — a very small number.

He demonstrates this with a case study using five difference-in-differences packages across three languages, applied to a Brazilian study on mental health deinstitutionalization and homicides. The essay distinguishes between two complementary verification tasks: (1) having antagonistic subagents audit code logic and reasoning, and (2) having Claude Code replicate the entire pipeline — from data cleaning through estimation — in two additional languages, then comparing outputs digit-by-digit. He is careful to note the limitation: this approach works only for deterministic computations (OLS, fixed effects, merges, summary statistics) and breaks down for stochastic methods like bootstrapping or MCMC where random seeds differ across languages.

  • Relevance to Economics Research

This directly addresses the reproducibility crisis and code verification in empirical economics. The insight that cross-language replication exploits independence of syntax-specific errors is both formally justified and practically implementable with AI agents. The diff-in-diff application is immediately relevant to the large community of applied researchers using Callaway and Sant'Anna estimators, and the framework generalizes to any deterministic empirical method.