Missing Data Handling Practices in Management Journals — Oxford RA Screening Task
This pipeline automates the classification and structured variable extraction of 50 academic papers (PDFs) from top management journals (AMJ, ASQ, SMJ, OS, JOM). It implements a 7-category classification framework to identify regression-based empirical papers (EQR), then extracts 65 variables across 11 analytical categories (with a primary focus on missing data handling practices).
Built as a 5-phase agentic pipeline using Claude (Anthropic) for LLM-based extraction, pdfplumber/PyMuPDF for PDF parsing, and openpyxl for structured Excel output.
Pipeline executed on the full 50-paper sample (February 2026, SMJ corpus).
All statistics below are drawn from data/output/extraction_output.xlsx.
| Category | Label | Count | % of Sample |
|---|---|---|---|
| EQR | Empirical Quantitative – Regression | 27 | 54.0% |
| CT | Conceptual / Theoretical | 10 | 20.0% |
| MM | Mixed Methods | 7 | 14.0% |
| EQL | Empirical Qualitative | 2 | 4.0% |
| EQNR-Other | Quantitative, non-regression | 2 | 4.0% |
| EQNR-ML | ML-focused | 1 | 2.0% |
| MA | Meta-Analysis | 1 | 2.0% |
→ 27 papers eligible for full variable extraction (EQR).
| Dimension | n | % of EQR |
|---|---|---|
| Missing data mentioned in any form | 10 | 37.0% |
| Missing data not mentioned | 13 | 48.1% |
| Missing rate explicitly reported | 1 | 3.7% |
| Method justified by authors | 1 | 3.7% |
| Missing pattern tested (MCAR/MAR/MNAR) | 0 | 0.0% |
| Sensitivity analysis on missing data | 1 | 3.7% |
Handling methods identified:
| Method | Count | % of EQR |
|---|---|---|
| Unknown / not reported | 17 | 63.0% |
| Listwise deletion | 6 | 22.2% |
| Other | 4 | 14.8% |
| Multiple imputation (MI) | 0 | 0.0% |
| Full information maximum likelihood (FIML) | 0 | 0.0% |
Key finding: 63.0% of EQR papers (17 of 27) report no identifiable missing data handling method. No paper tests for missing data patterns (MCAR/MAR/MNAR), and neither MI nor FIML was used in any paper.
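The percentages in the handling-methods table follow directly from the counts and the 27 EQR papers. A minimal sketch of that arithmetic, using only the counts reported above; the helper name `share_of_eqr` is hypothetical and the snippet does not read the actual workbook:

```python
# Counts copied from the handling-methods table above.
N_EQR = 27
METHOD_COUNTS = {
    "Unknown / not reported": 17,
    "Listwise deletion": 6,
    "Other": 4,
    "Multiple imputation (MI)": 0,
    "Full information maximum likelihood (FIML)": 0,
}


def share_of_eqr(count: int, total: int = N_EQR) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Percentage of EQR papers per handling method.
shares = {method: share_of_eqr(n) for method, n in METHOD_COUNTS.items()}

if __name__ == "__main__":
    for method, pct in shares.items():
        print(f"{method}: {pct}%")
```

The counts sum to 27, so every EQR paper is assigned exactly one handling method in this table.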
| Model | Count | % of EQR |
|---|---|---|
| Tobit / fixed-effects panel | 16 | 59.3% |
| Unknown | 4 | 14.8% |
| Survival analysis | 2 | 7.4% |
| Factor analysis / SEM | 2 | 7.4% |
| OLS / Linear | 2 | 7.4% |
| Logit / Probit | 1 | 3.7% |
| Indicator | Yes | % | No | % |
|---|---|---|---|---|
| Software reported | 7 | 25.9% | 16 | 59.3% |
| Code available | 5 | 18.5% | 18 | 66.7% |
| Replication feasible | 3 | 11.1% | — | — |
- Papers processed: 50 / 50
- Auto-classification success: 100%
- Full extraction completed: 27 EQR papers
- Human review queue: papers flagged with `FLAG-ML-UPSTREAM`, `FLAG-MULTISTUDY`, `FLAG-MISSING-AMBIGUOUS`, `FLAG-CLASSIFICATION-UNSTABLE`
LLM-Paper-Methodology-Extraction/
├── main.py # End-to-end pipeline runner
├── config.py # Environment and path configuration
├── requirements.txt
├── .gitignore
├── agents/
│ ├── agent_0_parser.py # Phase 1 — PDF parsing + section splitting
│ ├── agent_1_classifier.py # Phase 2 — 7-category classification
│ ├── agent_2b_extractor.py # Phase 3 — 65-variable extraction (4 grouped LLM calls)
│ ├── agent_3_qc.py # Phase 4 — QC rules, auto-correction, human-in-the-loop
│ └── agent_4_exporter.py # Phase 5 — Excel output + summary report
├── schemas/
│ ├── parsed_paper.py # ParsedPaper dataclass
│ ├── extraction_schema.py # 65-variable extraction schema + QC flag injection
│ └── qc_schema.py # QCResult dataclass
├── data/
│ ├── papers/ # Drop PDFs here (0001.pdf … 0050.pdf) — not tracked
│ ├── parsed/ # Intermediate JSON per paper — not tracked
│ ├── extractions/ # Per-paper classification + extraction JSON — not tracked
│ └── output/ # Final deliverables — not tracked
└── tests/
  ├── test_parser.py # Integration test — PDF parsing (requires real PDFs)
  ├── test_classifier.py # Unit tests — classify_paper() (fully offline, mocked)
  ├── test_extractor.py # Unit tests — ExtractionResult schema (no LLM calls)
  └── fixtures/
    └── mock_paper.json # Fictional parsed paper for offline testing
| File | Description |
|---|---|
| `data/output/extraction_output.xlsx` | 3-sheet Excel: Extraction Data, Extraction Log, Summary Statistics |
| `data/output/human_review_queue.xlsx` | Flagged papers requiring manual review |
| `data/output/summary_report.md` | LLM-generated methods summary |
Papers are classified into 7 mutually exclusive categories before extraction:
| Code | Category | Eligible for Extraction |
|---|---|---|
| EQR | Empirical Quantitative – Regression | ✅ Yes |
| EQNR-ML | ML model performance as primary finding | ❌ No |
| EQNR-Other | Quantitative, non-regression, non-ML | ❌ No |
| MM | Mixed Methods | |
| EQL | Empirical Qualitative | ❌ No |
| CT | Conceptual / Methodological | ❌ No |
| MA | Meta-Analysis / Systematic Review | ❌ No |
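The eligibility column above acts as a gate in front of the extraction phase. A minimal sketch of such a gate, assuming a plain lookup table; the names `ELIGIBLE_FOR_EXTRACTION` and `should_extract` are hypothetical and not taken from the repository. The table gives no value for MM, so the sketch leaves it undetermined and does not extract it:

```python
from typing import Dict, Optional

# Eligibility per category code, copied from the table above.
# MM has no value in the table, so it is kept as None (undetermined).
ELIGIBLE_FOR_EXTRACTION: Dict[str, Optional[bool]] = {
    "EQR": True,
    "EQNR-ML": False,
    "EQNR-Other": False,
    "MM": None,
    "EQL": False,
    "CT": False,
    "MA": False,
}


def should_extract(category: str) -> bool:
    """Return True only when the category is explicitly marked eligible."""
    if category not in ELIGIBLE_FOR_EXTRACTION:
        raise ValueError(f"Unknown category code: {category!r}")
    return ELIGIBLE_FOR_EXTRACTION[category] is True
```

Raising on unknown codes keeps a misspelled or unexpected label from silently skipping a paper.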
- Python 3.11+
- Anthropic API key (`ANTHROPIC_API_KEY`)
git clone https://github.com/briacSck/LLM-Paper-Methodology-Extraction
cd LLM-Paper-Methodology-Extraction
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt

ANTHROPIC_API_KEY=sk-ant-...
CLAUDE_MODEL=claude-3-5-sonnet-20241022
PAPERS_DIR=./data/papers/
PARSED_DIR=./data/parsed/
EXTRACTIONS_DIR=./data/extractions/
OUTPUT_DIR=./data/output/
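One way these settings could be read at startup is sketched below. This is an assumption about how a module like `config.py` might work, not a copy of it; `DEFAULTS`, `load_paths`, and `require_api_key` are hypothetical names. The sketch reads from a mapping that defaults to `os.environ`, so it runs without any third-party package:

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

# Default directories mirror the example settings above.
DEFAULTS = {
    "PAPERS_DIR": "./data/papers/",
    "PARSED_DIR": "./data/parsed/",
    "EXTRACTIONS_DIR": "./data/extractions/",
    "OUTPUT_DIR": "./data/output/",
}


def load_paths(env: Optional[Mapping[str, str]] = None) -> Dict[str, Path]:
    """Resolve each directory setting, falling back to its default."""
    env = os.environ if env is None else env
    return {name: Path(env.get(name, default)) for name, default in DEFAULTS.items()}


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key, failing early if it is missing or empty."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY", "")
    if not key:
        raise RuntimeError("ANTHROPIC_API_KEY is not set")
    return key
```

Failing early on a missing key avoids parsing all PDFs only to stop at the first LLM call.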
Drop PDF files into data/papers/. Files must follow the naming convention:
0001.pdf, 0002.pdf, …, 0050.pdf.
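A quick pre-flight check can catch misnamed files before the parser runs. The sketch below assumes the convention means exactly four digits followed by a lowercase `.pdf` extension; both function names are hypothetical:

```python
import re
from pathlib import Path
from typing import List

# Four digits followed by ".pdf", e.g. 0001.pdf (lowercase extension assumed).
PAPER_NAME = re.compile(r"\d{4}\.pdf")


def is_valid_paper_name(filename: str) -> bool:
    """Return True if the whole filename matches the NNNN.pdf convention."""
    return PAPER_NAME.fullmatch(filename) is not None


def find_misnamed(papers_dir: str) -> List[str]:
    """List files in papers_dir whose names do not follow the convention."""
    return sorted(
        p.name
        for p in Path(papers_dir).iterdir()
        if p.is_file() and not is_valid_paper_name(p.name)
    )
```

For example, `find_misnamed("./data/papers/")` would return an empty list when every file follows the convention.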
python main.py

This executes all 5 phases sequentially: parse → classify → extract → QC → export.
python agents/agent_0_parser.py # Parse all PDFs
python agents/agent_1_classifier.py # Classify all parsed papers
python agents/agent_2b_extractor.py # Extract variables (EQR papers only)
python agents/agent_3_qc.py # Run QC + generate review queue
python agents/agent_4_exporter.py # Export final Excel + report

python tests/test_parser.py

| Phase | Agent | Input | Output | Verify |
|---|---|---|---|---|
| 1 | `agent_0_parser` | PDFs | `parsed/*.json` | `test_parser.py` |
| 2 | `agent_1_classifier` | Parsed JSON | `*_classification.json` | 3+ classifications |
| 3 | `agent_2b_extractor` | Parsed + Classification | `*_extraction.json` | `Missing_Handling` values |
| 4 | `agent_3_qc` | Extraction JSON | QC results + review queue | `human_review_queue.xlsx` |
| 5 | `agent_4_exporter` | All JSON + reviewed queue | `extraction_output.xlsx` | 3-sheet Excel |
65 variables across 11 categories:
- Paper Identification — ID, Authors, Year, Journal, Title
- Research Question — RQ summary, hypotheses count, relationship direction
- Dependent Variable — Name, construct, measurement, source, type
- Independent Variable — Name, construct, measurement, source, type
- Mediating Variables — Presence, method (Baron & Kenny / Bootstrap / SEM…)
- Moderating Variables — Presence, method (interaction term / subgroup…)
- Control Variables — Count, list, justification quality
- Sample & Data — N, context, data type, unit of analysis, time period
- Analytical Method — Model type, endogeneity correction, robustness checks
- Missing Data Handling — Mentioned, rate, method, justification, pattern tests
- Replication Potential — Data/code availability, software, feasibility score
Category 10 (Missing Data) is the primary focus of this study.
The pipeline applies 11 automated QC rules and routes flagged papers to a human-in-the-loop review queue. Key flags:
| Flag | Trigger |
|---|---|
| `FLAG-MULTISTUDY` | 2+ distinct empirical studies in one paper |
| `FLAG-ML-UPSTREAM` | ML constructs IV/DV; verify regression is the hypothesis test |
| `FLAG-MISSING-AMBIGUOUS` | Missing data handling inferred, not stated |
| `FLAG-CLASSIFICATION-UNSTABLE` | Two LLM classification runs disagree |
| `FLAG-LPM` | Binary DV with OLS (linear probability model) |
| `FLAG-QC-MISSING-LOGIC` | `Missing_Mentioned=0` but sub-fields are not NA |
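Three of these triggers are simple enough to express as rule functions over an extraction record. The sketch below is illustrative only: `Missing_Mentioned` and `Missing_Handling` appear in this README, while `Missing_Rate`, `Missing_Justified`, `DV_Type`, `Model_Type`, and the value encodings (`"NA"`, `"binary"`, `"OLS"`) are assumptions, as are the function names:

```python
from typing import Dict, List

NA = "NA"

# Missing-data sub-fields checked by the logic rule (names partly hypothetical).
MISSING_SUBFIELDS = ("Missing_Rate", "Missing_Handling", "Missing_Justified")


def qc_flags(record: Dict[str, object]) -> List[str]:
    """Return the QC flags raised by a single extraction record."""
    flags = []
    # FLAG-QC-MISSING-LOGIC: no mention of missing data, yet a sub-field is filled in.
    if record.get("Missing_Mentioned") == 0 and any(
        record.get(field, NA) != NA for field in MISSING_SUBFIELDS
    ):
        flags.append("FLAG-QC-MISSING-LOGIC")
    # FLAG-LPM: binary dependent variable estimated with OLS.
    if record.get("DV_Type") == "binary" and record.get("Model_Type") == "OLS":
        flags.append("FLAG-LPM")
    return flags


def classification_unstable(run_a: str, run_b: str) -> bool:
    """FLAG-CLASSIFICATION-UNSTABLE: two classification runs disagree."""
    return run_a != run_b
```

Keeping each rule as a small, pure check makes it straightforward to unit-test offline, in the same spirit as the mocked classifier tests.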
The extraction agent (agent_2b_extractor.py) includes:
- Retry logic with exponential backoff on Anthropic API rate-limit errors
- Cache protection — rate-limited extractions are not written to disk, preventing corrupt cached results from blocking reruns
- Resumable runs — already-extracted papers are skipped automatically on re-execution
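The three behaviours above fit together in a single wrapper around the LLM call. The following is a sketch under stated assumptions, not the code in `agent_2b_extractor.py`: `RateLimited` stands in for whatever rate-limit exception the API client raises, and `extract_with_cache`, `call_llm`, and the default retry settings are hypothetical:

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


class RateLimited(Exception):
    """Stand-in for the API client's rate-limit error (hypothetical name)."""


def extract_with_cache(
    paper_id: str,
    call_llm: Callable[[str], dict],
    cache_dir: Path,
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Return the extraction for paper_id, or None if every attempt was rate-limited."""
    cache_file = cache_dir / f"{paper_id}_extraction.json"
    # Resumable runs: reuse a result that is already on disk.
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    for attempt in range(max_retries):
        try:
            result = call_llm(paper_id)
        except RateLimited:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
            continue
        # Cache protection: only successful results are ever written.
        cache_dir.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(json.dumps(result), encoding="utf-8")
        return result
    return None
```

Because nothing is written when all attempts fail, a later rerun finds no cache file for that paper and tries it again instead of skipping it.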
anthropic>=0.34.0
pdfplumber>=0.11.0
PyMuPDF>=1.24.0
pydantic>=2.0.0
pandas>=2.0.0
openpyxl>=3.1.0
python-dotenv>=1.0.0