Topic Modeling and Novelty Detection in IT Earnings Call Transcripts

: Finance
: 2026

This project builds a pipeline that identifies what IT company executives discuss on their earnings calls and flags when something new appears. We collected transcripts from 30 IT companies between 2023 and 2025, kept only the executive speech, and split it into sentences. Each sentence is embedded as a 1024-dimensional vector with Qwen3-Embedding-0.6B, then reduced to 40 dimensions through PCA and UMAP. We cluster the 2023 to 2024 data with a Gaussian Mixture Model to get a stable map of topics, then project the 2025 data into that same map without refitting. From there, we measure topic drift and run novelty detection to find the 2025 sentences that do not fit the older topic structure. The output is a set of topic labels, drift metrics, and novel clusters that give our client clear insight into where industry conversation is shifting. The results support the approach. GMM with K=150 and tied covariance gave stable clusters (mean ARI of 0.8191 across bootstrap resamples). The Jensen–Shannon distance between the 2023–24 and 2025 topic distributions was 0.085, with a permutation p-value of 0.002. AI-related topics grew by 4.06 percentage points in combined share across 30 of 36 companies, led by GenAI models and infrastructure.
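The drift measurement can be illustrated with a minimal sketch in plain Python. It assumes each sentence already carries a topic id from the fitted mixture model, computes the Jensen–Shannon distance between the two periods' topic shares, and estimates a permutation p-value by shuffling period membership. The function names, the toy labels, the base-2 logarithm, the sentence-level shuffle, and the default of 499 permutations are all assumptions made for illustration; the project may have permuted at the call or company level and used different settings.

```python
import math
import random
from collections import Counter


def topic_shares(labels, n_topics):
    """Return the fraction of sentences assigned to each topic id."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(n_topics)]


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (log base 2)."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))


def permutation_pvalue(base_labels, new_labels, n_topics, n_perm=499, seed=0):
    """Estimate how often a random split of the pooled labels gives a distance
    at least as large as the observed one (hypothetical sentence-level shuffle)."""
    rng = random.Random(seed)
    observed = js_distance(topic_shares(base_labels, n_topics),
                           topic_shares(new_labels, n_topics))
    pooled = list(base_labels) + list(new_labels)
    n_base = len(base_labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d = js_distance(topic_shares(pooled[:n_base], n_topics),
                        topic_shares(pooled[n_base:], n_topics))
        if d >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (not project data): topic 2 grows from 10% to 40% of sentences.
baseline_labels = [0] * 60 + [1] * 30 + [2] * 10
later_labels = [0] * 30 + [1] * 30 + [2] * 40

distance, p_value = permutation_pvalue(baseline_labels, later_labels, n_topics=3)
print(f"JS distance: {distance:.3f}, permutation p-value: {p_value:.3f}")
```

With the add-one correction, the smallest p-value this sketch can report is 1 / (n_perm + 1), so the number of permutations sets the resolution of the test.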

Mentor: Yue Jiang

Project poster (PDF)