Pharmaceutical companies committing fraud (through illegal kickbacks, off-label promotion, or false claims submissions) leave detectable traces in their SEC financial disclosures years before legal action materializes. This project builds a complete, end-to-end pipeline that (1) automatically extracts domain knowledge from historical DOJ complaint filings using a large language model platform deployed on Google Cloud, (2) maps those domain insights to around 50 signals across SEC 10-K and 8-K filings, and (3) trains and validates a misconduct classifier on those signals using a rigorous nested cross-validation framework.
The central contribution is the domain bridge: instead of applying generic anomaly detection to financial data, we operationalize the fraud playbook as it is written in actual legal records. For example, illegal kickbacks map to abnormal selling and marketing expense trends.
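To make the kickback example concrete, one way such a signal could be computed is a z-score of the latest selling-and-marketing-to-revenue ratio against the preceding fiscal years. This is only an illustrative sketch: the function name `sm_expense_anomaly`, the `window` parameter, and the z-score definition are assumptions for this example, not the project's actual signal definitions.

```python
import math
from statistics import mean, stdev


def sm_expense_anomaly(sm_expense, revenue, window=3):
    """Z-score of the latest S&M-to-revenue ratio against the trailing window.

    Hypothetical signal: inputs are per-fiscal-year values, oldest first.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to estimate a spread")
    ratios = [s / r for s, r in zip(sm_expense, revenue)]
    if len(ratios) < window + 1:
        raise ValueError("need at least window + 1 fiscal years")

    latest = ratios[-1]
    history = ratios[-(window + 1):-1]  # the `window` years before the latest
    baseline, spread = mean(history), stdev(history)
    gap = latest - baseline
    if spread == 0:
        # Flat history: any deviation is unbounded in z-score terms.
        return 0.0 if gap == 0 else math.copysign(math.inf, gap)
    return gap / spread
```

For instance, `sm_expense_anomaly([10, 11, 12, 30], [100, 100, 100, 100])` compares a 30% ratio against a history of 10% to 12% and returns a large positive score, which is the kind of jump such a signal would flag for review.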
The modeling pipeline, implemented in Python (scikit-learn), uses SEC 8-K event items as the primary feature source and applies a custom leakage-safe class so that feature engineering is fit inside each cross-validation fold. It evaluates three classifiers (Logistic Regression, Random Forest, and XGBoost) across multiple random seeds, using nested stratified k-fold or leave-one-out cross-validation depending on class balance.
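The evaluation scheme described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the project's code: `FoldSafeScaler` is a hypothetical stand-in for the custom leakage-safe class (it only standardizes columns), the hyperparameter grids are invented, XGBoost is omitted to keep the dependencies to scikit-learn, and only the stratified k-fold branch is shown.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


class FoldSafeScaler(BaseEstimator, TransformerMixin):
    """Hypothetical leakage-safe step: statistics come from training rows only."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.scale_[self.scale_ == 0] = 1.0  # avoid division by zero
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_


def nested_cv_scores(X, y, seeds=(0, 1, 2), n_splits=3):
    """Mean outer-fold ROC AUC per classifier, averaged over random seeds."""
    # Illustrative candidates and grids (assumptions, not the project's settings).
    candidates = {
        "logreg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0]}),
        "forest": (RandomForestClassifier(n_estimators=50),
                   {"clf__max_depth": [2, None]}),
    }
    results = {}
    for name, (clf, grid) in candidates.items():
        per_seed = []
        for seed in seeds:
            inner = StratifiedKFold(n_splits, shuffle=True, random_state=seed)
            outer = StratifiedKFold(n_splits, shuffle=True, random_state=seed + 1000)
            # Because the feature step lives inside the pipeline, it is refit
            # on the training portion of every inner and outer fold.
            pipe = Pipeline([
                ("features", FoldSafeScaler()),
                ("clf", clone(clf).set_params(random_state=seed)),
            ])
            search = GridSearchCV(pipe, grid, cv=inner, scoring="roc_auc")
            scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")
            per_seed.append(scores.mean())
        results[name] = float(np.mean(per_seed))
    return results
```

Placing the feature step inside the `Pipeline` is what keeps it leakage-safe here: the inner loop tunes hyperparameters and the outer loop scores the tuned model on rows that neither the tuning nor the feature statistics have seen.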
Mentor: Jana Schaich Borg
Project poster (PDF)
