Valid inference w/ LLM-simulated data:
1. Take subsample of texts, extract vars
2. Construct moments identifying param on those vars
3. Ask LLM to simulate vars on sample & remaining texts
4. Use same moments w/ simulated vars
5. Combine moments, estimate jointly w/ 2-step GMM
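A minimal sketch of steps 2 to 5 for the simplest possible case, assuming the target param is just the mean of one extracted var. This is an illustration, not the paper's implementation: the function name `two_step_gmm_mean`, the nuisance param `eta` (mean of the simulated var), and the assumption that the labeled subsample is a simple random draw are all made up for the example.

```python
import numpy as np

def two_step_gmm_mean(x_labeled, sim_labeled, sim_unlabeled):
    """Hypothetical helper: two-step GMM for theta = E[x], stacking
    one moment on real vars and two moments on LLM-simulated vars."""
    x_labeled = np.asarray(x_labeled, dtype=float)
    sim_labeled = np.asarray(sim_labeled, dtype=float)
    sim_unlabeled = np.asarray(sim_unlabeled, dtype=float)
    n, m = len(x_labeled), len(sim_unlabeled)
    N = n + m
    pi = n / N  # labeled fraction (assumes a random subsample)

    # Stacked sample moments are linear in (theta, eta): g_bar = a - B @ params
    #   real var, labeled texts:        mean(x)   - theta
    #   simulated var, labeled texts:   mean(sim) - eta
    #   simulated var, remaining texts: mean(sim) - eta
    a = np.array([x_labeled.mean(), sim_labeled.mean(), sim_unlabeled.mean()])
    B = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

    def solve(W):
        # Minimizer of (a - B p)' W (a - B p)
        return np.linalg.solve(B.T @ W @ B, B.T @ W @ a)

    # Step 1: identity weighting gives a consistent first-pass estimate
    theta1, eta1 = solve(np.eye(3))

    # Per-text moment contributions at the step-1 estimate
    G = np.zeros((N, 3))
    G[:n, 0] = (x_labeled - theta1) / pi
    G[:n, 1] = (sim_labeled - eta1) / pi
    G[n:, 2] = (sim_unlabeled - eta1) / (1.0 - pi)
    S = G.T @ G / N  # uncentered estimate of the moment covariance

    # Step 2: re-estimate with the inverse moment covariance as weight
    W = np.linalg.inv(S)
    theta2, _ = solve(W)
    cov = np.linalg.inv(B.T @ W @ B) / N  # asymptotic covariance of (theta, eta)
    return theta2, np.sqrt(cov[0, 0])

# Toy usage with made-up data: the simulated var is biased but correlated
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=5200)
sim = 0.5 + x + 0.3 * rng.normal(size=5200)
theta_hat, se = two_step_gmm_mean(x[:200], sim[:200], sim[200:])
print(theta_hat, se)
```

In this toy setup the simulated var never enters the moment that identifies `theta`, so its bias does not contaminate the estimate. It only helps through the off-diagonal terms of `S`, i.e. the covariance between real-data and simulated-data moments on the labeled texts.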
💡Can we trust synthetic data for statistical inference?
We show that synthetic data (e.g., LLM simulations) can significantly improve the performance of inference tasks. The key intuition lies in the interactions between the moments of synthetic data and those of real data.