## High-level mapping: docs → implementation

The documented DGP in RESEARCH_DESIGN.md and the formal model in METHODOLOGY.md are implemented in code:

- Population generation: population_generator.R (creates the reference population and latent covariates).
- Panel sampling: panel_generator.R (large_sparse vs small_rich; LS sampling bias implemented).
- Treatment assignment & outcome: outcome_generator.R — implements a logistic propensity for treatment and a logistic outcome with shared covariates (confounding), plus measurement error (add_measurement_error()).
- Estimation: estimation.R implements frequentist GLMs and a Stan model, advertising_model.stan, for Bayesian fits.
- Decision analysis: decision_analysis.R computes expected profit (utility), decision rules (EU, CE, VaR), and an EVSI approximation.
- Orchestration: main_simulation.R runs scenarios, aggregates, and saves results.

The implementation closely follows the intended design and metrics (bias, RMSE, coverage, decision accuracy, expected utility).

## Short answers to your questions

Are the outputs correct and aligned with the docs? Mostly yes: the core DGP, confounding mechanism, measurement-error patterns, estimation methods, and decision rules are implemented as described. However, there are several places where small implementation choices or shortcuts affect correctness or interpretation (detailed below). These are not show-stoppers, but they matter for numerical validity and reproducibility.

Are the outputs sensitive to key assumptions? Yes — heavily. The most important sensitivities:

- The confounding mechanism (multiplicative gender factor) strongly controls bias.
- The measurement-error parameters (treatment_accuracy, outcome_linkage) cause attenuation that does not vanish with n.
- The sample-size sweep vs full-factorial design, and the SE approximations used in EVSI, are impactful.
- The way decision utility is computed (deterministic profit formula using the global DECISION_PARAMS) concentrates sensitivity on ad_cost, n_impressions, revenue_per_conversion, and gross_margin.

How is the advertiser decision model implemented, and does it make sense? The implementation is straightforward and consistent with the docs:

- It turns an estimated effect (log-odds) into an expected incremental profit using baseline_rate, n_impressions, revenue_per_conversion, and gross_margin.
- It computes expected utility from posterior samples (Bayesian) or approximate normal sampling (frequentist).
- Decision rules: expected utility > 0 (EU), certainty equivalent (CARA), VaR.

This is a reasonable decision-theoretic mapping, but there are important modeling and calibration caveats (listed under the findings and recommended checks below); a minimal sketch of the mapping follows.
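As a quick orientation, here is a self-contained sketch of that effect-to-profit mapping and the three decision rules. Parameter values and the CARA coefficient are placeholders, and effect_samples stands in for posterior draws (or normal draws around a frequentist estimate) on the log-odds scale:

```r
params <- list(baseline_rate = 0.01, n_impressions = 1e6,
               revenue_per_conversion = 50, gross_margin = 0.3, ad_cost = 5000)

# Stand-in for posterior draws (Bayesian) or rnorm(n, estimate, se) draws (frequentist),
# both on the log-odds scale.
effect_samples <- rnorm(4000, mean = 0.18, sd = 0.08)

treatment_prob <- plogis(qlogis(params$baseline_rate) + effect_samples)
prob_lift      <- treatment_prob - params$baseline_rate
net_profit     <- params$n_impressions * prob_lift *
  params$revenue_per_conversion * params$gross_margin - params$ad_cost

a <- 1e-4   # CARA risk-aversion coefficient (placeholder)
c(EU  = mean(net_profit) > 0,                       # risk-neutral expected utility
  CE  = -log(mean(exp(-a * net_profit))) / a > 0,   # CARA certainty equivalent
  VaR = unname(quantile(net_profit, 0.05)) > 0)     # run only if the 5% quantile is positive
```

Note how the sign of net_profit is driven almost entirely by ad_cost relative to n_impressions × lift × margin, which is why the DECISION_PARAMS calibration matters so much for the conclusions.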

Below are the detailed findings, prioritized by severity/importance, with recommended checks and how to run them locally.

## Critical findings (must review before trusting results)

### 1. calculate_true_att(): description vs implementation

- What I saw: calculate_true_att() in outcome_generator.R computes an ATT on the probability scale by applying the homogeneous log-odds shift to treated individuals' factual log-odds. The Implementation Summary previously flagged this function as computing an ATE rather than an ATT, so there is room for confusion about which estimand it returns.
- Why it matters: your validation and "true" reference values must be consistent — comparing estimated ATEs against an ATT (or vice versa) measures bias incorrectly.
- How to check: for a generated panel, compute both the sample difference in mean outcomes between treated and control on the probability scale and the calculate_true_att() output. Suggested local command (R): `print(calculate_true_att(panel_with_outcomes))` and compare with `mean(outcome_true[treatment_true == 1]) - mean(outcome_true[treatment_true == 0])` (note the naive contrast also absorbs confounding, so expect the two to agree only when confounding_strength = 1).
- Decision: confirm the naming and use it consistently. If you want an ATE, compute counterfactual outcomes for everyone; if you want an ATT, the current approach is plausible, but check that it matches your validation logic.

### 2. Measurement-error modelling: non-differential, but the observation-generation choices matter

- What I saw: add_measurement_error() uses panel-level exposure_quality and outcome_quality and generates observed values via
  treatment_obs ~ Bernoulli(treatment_true * exposure_quality + (1 - treatment_true) * (1 - exposure_quality)) and
  outcome_obs ~ Bernoulli(outcome_true * outcome_quality + (1 - outcome_true) * (1 - outcome_quality)).
- Consequence: this produces symmetric flip rates around the panel-specific accuracy (i.e., P(obs = 1 | true = 0) = 1 - accuracy), exactly as documented. But note: if the outcome is rare, the observed prevalence is strongly inflated at low accuracies, because false positives dominate the observed positives when the base rate is tiny. Reporting "measurement agreement" as mean(panel$treatment_obs == panel$treatment_true) is fine as a quick check but hides asymmetric error rates when base rates are very low.
- How to check sensitivity: run a small experiment across outcome_quality values and report observed vs true prevalence and the attenuation factor (observed estimate vs true effect) — e.g., iterate outcome_quality over (0.6, 0.8, 0.95) and compute the bias after measurement error. A self-contained sketch of this sweep appears after this group of findings.

### 3. EVSI approximation uses a crude SE shortcut

- What I saw: compute_evsi() approximates the standard error as sqrt(1 / (n * 0.01 * 0.99)) and then simulates observed_effect ~ N(true_effect, se) to create a posterior. This assumes the estimator's variance follows a simple binomial variance at a 1% baseline prevalence and that the mapping from the log-odds effect to the observed estimator is approximately normal with that se.
- Why it matters: EVSI is sensitive to posterior variance. The approximation may be fine for rough guidance but can mis-estimate EVSI for adjusted estimates (covariate adjustment changes the variance), for panels with very different observed prevalences, and in rare-event regimes where asymptotic normal approximations are poor.
- Suggested action: replace or supplement with EVSI computed by generating full synthetic datasets for a smaller n_sims (e.g., 200) and fitting the same estimator — or at least calibrate the SE approximation against the empirical sampling variance from run_single_simulation() for the same scenario.
- Check: pick one scenario, compute the empirical SD of adjusted estimates across many replications, and compare it to the se used in compute_evsi().
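A self-contained sketch of the outcome_quality sweep suggested under finding 2. It re-simulates a simplified rare-outcome DGP inline rather than calling the repo's generators, so the numbers are illustrative only:

```r
# Toy demonstration of prevalence inflation and attenuation under the symmetric
# flip-rate model described above (baseline prevalence 1%, true log-odds effect 0.18).
set.seed(1)
n <- 200000
x <- rbinom(n, 1, 0.5)                                  # treatment (no confounding here)
y <- rbinom(n, 1, plogis(qlogis(0.01) + 0.18 * x))      # true outcome

for (q in c(0.95, 0.80, 0.60)) {
  y_obs <- rbinom(n, 1, ifelse(y == 1, q, 1 - q))       # outcome_quality = q
  fit <- glm(y_obs ~ x, family = binomial())
  cat(sprintf("outcome_quality %.2f | true prev %.4f | obs prev %.4f | estimate %.3f\n",
              q, mean(y), mean(y_obs), coef(fit)["x"]))
}
```

With a 1% base rate, even outcome_quality = 0.95 inflates the observed prevalence roughly five- to six-fold and attenuates the log-odds coefficient severely, which is exactly the pattern the finding warns about.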
### 4. Stan diagnostics and posterior extraction

- What I saw: in estimate_effect_bayesian(), after rstan::extract(fit) the code uses post$treatment_effect and returns posterior_prob_effect = post$prob_effect and posterior_relative_uplift = post$relative_uplift. These names match the Stan model's generated quantities (prob_effect, relative_uplift) and its treatment_effect parameter, so extraction should work as intended. Good.
- Concern: accessing sumtab[, "Rhat"] and sumtab[, "n_eff"] may fail if the structure of summary(fit)$summary changes or some columns are missing. The code guards this with try-catch, but it should still check the column names explicitly.
- Recommended check: after a Bayesian run, examine est$diagnostics for NA values or R-hat > 1.1, and fail/stop if the diagnostics are seriously off.
- Also: the Stan priors center treatment_effect on 0.18. That is defensible, but it will pull small-N posteriors toward that value — document this prior choice explicitly in the outputs.

### 5. Frequentist adjusted estimator: linearization and the probability-scale effect

- What I saw: estimate_effect_frequentist() returns estimate = coef(fit)["treatment_obs"] (log-odds) with the se from vcov, and computes prob_effect by applying the estimated log-odds shift to baseline_prob (the mean outcome among controls): plogis(qlogis(baseline) + est) - baseline. This is acceptable, with caveats: baseline_prob = mean(outcome_obs[treatment_obs == 0]) is an empirical baseline that is noisy in small samples (adding variance to prob_effect), and for rare outcomes and small samples a delta-method or bootstrap treatment of the uncertainty in prob_effect is preferable.
- Check: for small n (the SR panel) a bootstrap option, estimate_effect_frequentist_bootstrap(), exists — good. However, the main pipeline makes decisions from the frequentist point estimate by simulating rnorm(mean = estimate, sd = se) to create effect_samples; this is an approximation, and the transformation to the probability scale is nonlinear. Sampling on the log-odds scale is better than sampling on the probability scale, and the mapping to utility via plogis(qlogis(base) + effect) is consistent with it.

### 6. Seeding in run_single_simulation() is potentially collision-prone

- What I saw: seed_val is computed as as.integer(sum(utf8ToInt(key))) %% .Machine$integer.max. Deterministic hashing is fine in principle, but summing character codes is a weak hash (any two keys whose characters sum to the same total collide), and the seed depends on whether sample_size is NULL or an integer — ensure a consistent string representation to avoid accidental collisions.
- Suggestion: use a robust hash such as digest::digest(key, algo = "xxhash32") converted to an integer, or use a fixed RNG stream per simulation (e.g., with rsample or withr::with_seed()). Not critical for correctness, but helpful for reproducibility. A sketch follows.
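A minimal sketch of the digest-based alternative (assumes the digest package; the key string format is illustrative):

```r
library(digest)

# Map a scenario key to a reproducible, well-mixed integer seed.
scenario_seed <- function(key) {
  h <- digest(key, algo = "xxhash32")        # 8 hex characters
  strtoi(substr(h, 1, 7), base = 16L)        # 7 hex digits stays below .Machine$integer.max
}

set.seed(scenario_seed("large_sparse|confounding=2|n=50000|rep=17"))
```

Logging the resulting seed alongside each saved scenario result (as recommended under reproducibility below) makes reruns easy to verify.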
## Medium-severity findings and suggestions

### Confounding implementation symmetry

- Implementation: generate_treatment_assignment() adds log(confounding_strength) to treatment_logodds for gender == 2, and generate_purchase_outcomes() adds the same log(confounding_strength) to outcome_logodds for gender == 2. This produces the symmetric confounding effect described in the docs.
- Potential issue: using the same log(confounding_strength) on both treatment and outcome makes the confounder's effects perfectly aligned in magnitude on the log-odds scale. That is a modelling assumption; in practice, targeting intensity and outcome association need not be balanced. Consider parameterizing the treatment-side and outcome-side gender effects separately for more flexible sensitivity analysis.
- Suggested experiment: allow separate confound_treatment_strength and confound_outcome_strength parameters, or at least add a comment and sensitivity checks.

### Rare outcomes and the normal approximation for frequentist uncertainty

- The pipeline's frequentist uncertainty for evaluate_decision() simulates rnorm(1000, mean = estimates$estimate, sd = estimates$se) and maps the draws to utility. For SR panels with small n and rare outcomes, the normal approximation to the sampling distribution of the log-odds estimator may be poor (separation issues, extremely wide or effectively infinite se).
- The code includes a bootstrap helper — recommend using bootstrap-based effect_samples when the sample is small or when n_treated or n_outcomes falls below a threshold (e.g., n_treated < 10 or n_outcomes < 5).
- Check: add a small wrapper that switches to the bootstrap in the rare-event case (a sketch appears under the recommendations below).

### Continuous covariates are used without scaling

- In estimation.R the model formula includes media_minutes directly. If media_minutes has large variance relative to the unit-level covariates, its coefficient will simply be small — not wrong, but consider centering/scaling continuous covariates so priors and interpretation are more stable (especially for Stan).
- The Stan covariate prior Normal(0, 1) assumes roughly standardized covariates. With media_minutes at mean ~120 and sd ~40, a Normal(0, 1) prior is far too tight and may bias the Bayesian estimates; similar concerns apply to purchase_history (Poisson with mean 2) and brand_awareness (range 0-1).
- Fix: either standardize covariates before passing them to Stan (recommended) or use wider priors for the covariates. A quick scale check appears after this list.

### Saved outputs and reliance on default directories

- Some filenames and directories are created with here::here(), which is fine. The earlier "Must Fix" list in IMPLEMENTATION_SUMMARY.md flagged hardcoded paths like /home/claude/; those do not appear in these files, but confirm no other absolute paths exist.
- Also confirm that the saved outputs in simulation_results/data/* referenced by the notebooks are not too large, and that long runs checkpoint with saveRDS() at intervals (a CHECKPOINT_FREQUENCY constant exists but is not used in the main loop).

### EVSI: compute_evsi() uses a very simplified posterior update

- Already flagged above; in addition, its internal se uses a 1% baseline prevalence and ignores panel-type-specific measurement error and covariate adjustment. Treat the EVSI numbers as approximate only.

### Minor style/robustness items

- In estimate_effect_bayesian(), the return includes posterior_prob_effect = post$prob_effect and posterior_relative_uplift = post$relative_uplift — good; rstan::extract() does return these names from the generated quantities.
- Several places use unique(panel$panel_type) to detect the panel type. If a combined dataset is passed by accident this returns multiple values and may cause unexpected behavior — consider panel_type <- unique(panel$panel_type); stopifnot(length(panel_type) == 1).
- safe_qlogis() and safe_plogis() are good guards against numerical extremes; keep using them consistently.
- Add an explicit options(stringsAsFactors = FALSE) in the scripts if older R versions are targeted (likely not needed, but safe).
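The quick scale check referenced above, as a sketch (assumes `panel` is any generated panel data frame with the covariate columns named as in estimation.R):

```r
# Summarize the scale of the covariates that will meet the Normal(0, 1) Stan priors.
covariates <- c("media_minutes", "purchase_history", "brand_awareness")
round(sapply(panel[covariates], function(x) c(mean = mean(x), sd = sd(x))), 2)
```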
## Sensitivity analysis — key parameters that dominate the outputs

- confounding_strength (gender factor): the primary driver of bias, and hence of decision accuracy and EVSI. Change it and everything flips.
- Measurement quality (treatment_accuracy and outcome_linkage): low accuracy in the LS panel leads to attenuation and utility loss.
- Sample size (especially for LS): increases precision; the trade-off point against SR depends strongly on confounding_strength and the true effect.
- true_effect (tau): a small true effect increases the value of precision, but bias can dominate; the cross-over threshold shifts nonlinearly with tau.
- Decision parameters (ad_cost, n_impressions, revenue_per_conversion, gross_margin): these convert the estimated effect size into £ profit — small changes here can flip the sign of expected utility (and thus the decision) and are critical for EVSI.

## Focused checks I recommend you run locally

These are short R snippets to run in an interactive session; they require the project packages and functions already in the repo. Put them in a file or run them interactively.

1. Sanity check: one scenario end-to-end (mirrors the demo in simulation_summary.qmd). Purpose: reproduce the numbers printed in the Quarto doc and check consistency. What to run: the demo code in simulation_summary.qmd.
2. Sensitivity to confounding_strength (toy sweep). Purpose: quantify how quickly bias grows with the confounding parameter; inspect how far the estimate deviates from 0.18. (A self-contained sketch follows this list.)
3. Compare bootstrap uncertainty vs the normal approximation for the SR panel. Purpose: see whether normal sampling of the log-odds is acceptable; if the bootstrap CI differs meaningfully, prefer the bootstrap for decision sampling.
4. EVSI calibration check. Purpose: compare the compute_evsi() approximation against an empirical EVSI from a small-scale simulation (n_sims = 200) for one scenario; if discrepancies are large, avoid relying on compute_evsi() for final conclusions.
5. Check covariate scales for Stan. Purpose: ensure the covariates are appropriate for Normal(0, 1) priors; with media_minutes sd ~40, standardize before calling Stan or change the prior.
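A self-contained sketch of check 2. It re-simulates a simplified version of the DGP inline rather than calling the repo generators; the baseline rate, treatment propensity, and the true effect of 0.18 are illustrative:

```r
# Toy confounding sweep: gender shifts both targeting and purchase log-odds by
# log(conf), mirroring the documented mechanism; compare unadjusted vs adjusted
# logistic estimates against the true effect of 0.18.
set.seed(1)
sweep_confounding <- function(conf, n = 100000) {
  gender <- rbinom(n, 1, 0.5)
  tr <- rbinom(n, 1, plogis(qlogis(0.3) + log(conf) * gender))
  y  <- rbinom(n, 1, plogis(qlogis(0.01) + 0.18 * tr + log(conf) * gender))
  c(confounding = conf,
    unadjusted  = unname(coef(glm(y ~ tr, family = binomial()))["tr"]),
    adjusted    = unname(coef(glm(y ~ tr + gender, family = binomial()))["tr"]))
}
round(t(sapply(c(1, 1.5, 2, 3), sweep_confounding)), 3)
```

In this toy version the unadjusted estimate drifts upward with confounding while the gender-adjusted estimate stays near 0.18, which is the qualitative pattern the checks are meant to confirm.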

## Recommendations / next steps (prioritized)

Validation & unit tests (high priority): add small unit tests that verify DGP properties:

- When confounding_strength = 1, estimate_adj should be approximately unbiased on average (a testthat sketch follows).
- Measurement error: check observed vs true prevalences under extreme settings.
- Posterior coverage: for a small number of replications, verify that coverage is near nominal.

The repository already lists some checks; formalize them in tests/testthat (or an R script) and run them in CI.

EVSI: replace the crude SE-based approximation with limited full-data posterior updates for selected scenarios to validate the approximation; use n_sims = 200 for the EVSI calibration runs.
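A self-contained testthat sketch of the first check. It re-simulates a simplified DGP inline; swap in the repo's population/panel/outcome generators and estimate_effect_frequentist() when wiring it into tests/testthat, and treat the tolerance as a loose smoke-test threshold:

```r
library(testthat)

test_that("covariate-adjusted logit is roughly unbiased when confounding_strength = 1", {
  set.seed(42)
  estimates <- replicate(40, {
    n <- 50000
    gender <- rbinom(n, 1, 0.5)
    tr <- rbinom(n, 1, plogis(qlogis(0.3)))                    # no gender effect on targeting
    y  <- rbinom(n, 1, plogis(qlogis(0.01) + 0.18 * tr))       # true effect 0.18, no confounding
    coef(glm(y ~ tr + gender, family = binomial()))["tr"]
  })
  expect_lt(abs(mean(estimates) - 0.18), 0.05)                 # loose tolerance for a smoke test
})
```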

Stan: standardize covariates before passing them to Stan (or widen the priors). This avoids the scale mismatch and unintended shrinkage noted above; a minimal sketch follows.
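A sketch of standardizing before building the Stan data list (column names follow the covariates described for estimation.R; `panel` is any generated panel data frame):

```r
# Center and scale the continuous covariates so the Normal(0, 1) priors in
# advertising_model.stan act as weakly informative priors rather than strong shrinkage.
continuous_covs <- c("media_minutes", "purchase_history", "brand_awareness")
panel[continuous_covs] <- lapply(panel[continuous_covs], function(x) as.numeric(scale(x)))
```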

Decision utilities: run a sensitivity analysis on DECISION_PARAMS (ad_cost, n_impressions, revenue_per_conversion, gross_margin). For instance, compute the break-even effect and show how many scenarios flip their decision if ad_cost moves by ±20% (a sketch of the break-even calculation follows).
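The break-even calculation implied by the profit formula, as a self-contained sketch (the DECISION_PARAMS values here are placeholders, not the repo's calibration):

```r
params <- list(baseline_rate = 0.01, n_impressions = 1e6,
               revenue_per_conversion = 50, gross_margin = 0.3, ad_cost = 5000)

# Smallest probability lift (and the corresponding log-odds effect) at which net profit is zero.
break_even <- function(p) {
  lift <- p$ad_cost / (p$n_impressions * p$revenue_per_conversion * p$gross_margin)
  c(prob_lift = lift,
    log_odds_effect = qlogis(p$baseline_rate + lift) - qlogis(p$baseline_rate))
}

# Sensitivity of the break-even effect to ad_cost +/- 20%.
sapply(c(0.8, 1.0, 1.2),
       function(m) break_even(modifyList(params, list(ad_cost = params$ad_cost * m))))
```

Any scenario whose sampling or posterior distribution of the log-odds effect straddles this break-even value is a scenario whose decision can flip under a ±20% ad_cost change.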

Bootstrap for small samples: use bootstrap-based sampling of effect_samples for evaluate_decision() when n_treated or n_outcomes is small (a wrapper sketch follows).
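A sketch of the switching wrapper. The function names estimate_effect_frequentist() and estimate_effect_frequentist_bootstrap() and the treatment_obs/outcome_obs columns follow the repo as described in this review; the thresholds are illustrative:

```r
# Fall back to the bootstrap estimator when treated units or observed outcomes are scarce,
# i.e. where the normal approximation to the log-odds estimator is least trustworthy.
estimate_effect_auto <- function(panel, min_treated = 10, min_outcomes = 5, ...) {
  n_treated  <- sum(panel$treatment_obs == 1)
  n_outcomes <- sum(panel$outcome_obs == 1)
  if (n_treated < min_treated || n_outcomes < min_outcomes) {
    estimate_effect_frequentist_bootstrap(panel, ...)
  } else {
    estimate_effect_frequentist(panel, ...)
  }
}
```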

Documentation: Document the prior choices and standardization decisions in IMPLEMENTATION_SUMMARY.md and note where approximations exist (EVSI shortcut).

Reproducibility: consider a more robust seed hashing function to eliminate accidental collisions, and add logging of the seed used per scenario in saved results.

## Quick summary of the advertiser decision model (detailed)

- Input: effect estimate(s) on the log-odds scale — either posterior samples from Stan, or a frequentist point estimate ± se simulated as normal on the log-odds scale.
- Transform: convert the effect to a probability lift via
  baseline_prob = DECISION_PARAMS$baseline_rate,
  treatment_prob = plogis(qlogis(baseline_prob) + effect),
  prob_lift = treatment_prob - baseline_prob.
- Utility:
  incremental_conversions = n_impressions * prob_lift,
  incremental_revenue = incremental_conversions * revenue_per_conversion * gross_margin,
  net_profit = incremental_revenue - ad_cost.
- Decision rules: EU (risk-neutral) — run if mean(net_profit) > 0; CE (CARA) — certainty equivalent under exponential utility for risk aversion; VaR — run if the lower quantile > 0.
- EVSI: approximated by simulating observed_effect from Normal(true_effect, se) with se ≈ sqrt(1 / (n * 0.01 * 0.99)) — coarse.

Verdict: this is a coherent, mechanistic decision model and appropriate for the task. The main caveats:

- The deterministic transformation from log-odds to profit is simple and transparent, which is good, but the decision outcome is highly sensitive to the DECISION_PARAMS calibration (ad_cost, n_impressions, revenue_per_conversion, gross_margin). Make sure these are realistic for the class of campaigns you consider.
- For small sample sizes or rare outcomes, the normal approximation for the estimator may be poor; use the bootstrap or a Bayesian posterior for more robust propagation of decision uncertainty.
- If risk aversion is important, specify plausible ranges for the CARA parameter a and show results across them.

## Final checklist you can use to "kick the tyres"

- Run the single-scenario demo (the code in simulation_summary.qmd) and check that the printed numbers match expectations.
- Compare calculate_true_att() output to the empirical ATT and record any mismatch.
- For five scenarios, record estimate_adj across confounding strengths and verify the H1/H2 patterns qualitatively (SR better at high confounding).
- Run the EVSI approximation vs an empirical small-scale EVSI for a few settings and compare.
- Run a Bayesian fit for at least one SR example and check R-hat and n_eff — add a stop if R-hat > 1.1.
- Check covariate scales and consider standardizing before Stan.
- Add a bootstrap option in the pipeline where rare events make the normal approximation suspect (or enable it automatically when counts are small).

## Closing summary

The code implements the documented design and decision framework in a clear, modular way. The main logic of confounding, measurement error, adjusted vs unadjusted estimation, and decision-theoretic utility is present and consistent with the documents. The key numerical risks are the EVSI approximation, covariate scaling for the Stan priors, the normal approximation for estimator sampling under rare outcomes, and the ATT-vs-ATE ambiguity in calculate_true_att(). The advertiser decision model is sensible and appropriately implemented, but its conclusions depend strongly on DECISION_PARAMS and on the method used to propagate estimation uncertainty (bootstrap vs normal approximation vs Bayesian posterior).

If you'd like, I can:

- run a few of the local checks above in this environment (short examples) and paste the numeric outputs here; or
- produce a compact set of test scripts (R) you can run to validate the critical checks (no repo edits).