Auto-EHRmonize: Abstracting flowsheet medical concepts using large language models

: Biology
: 2026

Auto-eHRmonize Flowsheet is a Python package that addresses a critical interoperability problem in healthcare informatics: harmonizing clinical flowsheet labels. Healthcare facilities use heterogeneous naming conventions for identical clinical measurements (e.g., “Heart Rate”, “HR”, “Pulse”, “Heart rate BPM”), which makes it difficult to combine data for the same measurement across sources. Mapping labels between datasets manually is time-intensive, costly, and error-prone. Relying only on label names can also be misleading: two labels that look similar may represent different clinical measurements, or their recorded values may follow very different patterns. This package is intended for data scientists, clinical researchers, and practitioners working with multi-source healthcare data.

The tool automatically maps these varying labels to standardized reference terminology using a two-stage semantic search pipeline. It combines transformer-based sentence embeddings with a pre-built vector database (built from MIMIC clinical data) to rank candidate matches by both semantic label similarity and value distribution proximity. The result is a ranked list of harmonized labels from reference sources, enabling downstream clinical data integration and research.
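To make the idea of ranking by both label similarity and value distribution concrete, here is a minimal, self-contained sketch. It is not the package's API. All function names are hypothetical, character-trigram counts stand in for transformer sentence embeddings, a small in-memory dictionary with made-up values stands in for the MIMIC-derived vector database, and the quantile-based distance is just one plausible way to measure distribution proximity. Splitting the work into a label shortlist followed by a value-based re-rank is likewise an assumption about how the two stages might fit together.

```python
import math
from collections import Counter
from statistics import quantiles


def embed(label: str) -> Counter:
    """Toy stand-in for a sentence embedding: character trigram counts."""
    text = f" {label.lower()} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def distribution_distance(x, y, n=4) -> float:
    """Mean absolute gap between matching quantiles of two samples."""
    qx, qy = quantiles(x, n=n), quantiles(y, n=n)
    return sum(abs(a - b) for a, b in zip(qx, qy)) / len(qx)


def rank_candidates(label, values, reference, top_k=2):
    """Stage 1: shortlist reference labels by label similarity.
    Stage 2: re-rank the shortlist by value distribution distance."""
    query = embed(label)
    shortlist = sorted(
        reference, key=lambda name: cosine(query, embed(name)), reverse=True
    )[:top_k]
    return sorted(
        shortlist, key=lambda name: distribution_distance(values, reference[name])
    )


# Hypothetical reference labels with illustrative (made-up) sample values.
REFERENCE = {
    "Heart Rate": [62, 70, 75, 80, 88, 95, 101, 110],
    "Respiratory Rate": [12, 14, 15, 16, 18, 20, 22, 24],
    "Temperature": [36.4, 36.6, 36.8, 37.0, 37.2, 37.5, 38.0, 38.4],
}

# An ambiguous local label: the text alone slightly favors "Heart Rate",
# but the recorded values look like breaths per minute.
print(rank_candidates("Rate", [13, 15, 16, 17, 19, 21, 23, 25], REFERENCE))
# ['Respiratory Rate', 'Heart Rate']
```

In this toy example the label "Rate" is textually closer to "Heart Rate", yet the value distribution moves "Respiratory Rate" to the top, which is the kind of case where label names alone would mislead.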

Mentor: Jana Schaich Borg

Project poster (PDF)