Publishing process signals: STRONG (reflects the venue and review process).

Machine-learning algorithms improved smoking identification in health records

Research area: Machine learning; Data-Driven Disease Surveillance; Reliability and Agreement in Measurement

What the study found

Model-based algorithms using machine learning identified current smokers in administrative health data more sensitively than rule-based algorithms, while rule-based algorithms were more specific.

Why the authors say this matters

The authors conclude that combining more data sources with machine learning may improve smoking identification in administrative health data. They also say that balancing correct identification with false positives is important when choosing an algorithm.

What the researchers tested

The researchers conducted a retrospective cohort study in Manitoba, Canada. They used administrative health data (hospital abstracts, medical claims, and prescription drug records) linked to a clinical registry that recorded self-reported current smoking. They compared rule-based algorithms, built from diagnosis codes and nicotine dependence medication, with model-based algorithms built using Random Forest and LASSO (a regularized regression method that selects important predictors).
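
To make the rule-based idea concrete, here is a minimal sketch of a rule that flags someone as a current smoker when their record contains a smoking-related diagnosis code or a nicotine dependence medication. The code lists, the field names, and the `PatientRecord` type are hypothetical placeholders for illustration. They are not the definitions used in the study, which the summary above does not give.

```python
from dataclasses import dataclass, field

# Hypothetical placeholder codes, NOT the code lists used in the study.
SMOKING_DIAGNOSIS_CODES = {"DX_SMOKING_A", "DX_SMOKING_B"}
NICOTINE_DEPENDENCE_DRUGS = {"RX_NICOTINE_A", "RX_NICOTINE_B"}


@dataclass
class PatientRecord:
    """Linked administrative data for one person (illustrative fields only)."""
    patient_id: str
    diagnosis_codes: set = field(default_factory=set)  # hospital abstracts and medical claims
    dispensed_drugs: set = field(default_factory=set)  # prescription drug records


def rule_based_smoker(record: PatientRecord) -> bool:
    """Flag a current smoker if any smoking diagnosis code or any
    nicotine dependence medication appears in the record."""
    has_diagnosis = bool(record.diagnosis_codes & SMOKING_DIAGNOSIS_CODES)
    has_medication = bool(record.dispensed_drugs & NICOTINE_DEPENDENCE_DRUGS)
    return has_diagnosis or has_medication


records = [
    PatientRecord("p1", {"DX_SMOKING_A", "DX_OTHER"}, set()),
    PatientRecord("p2", {"DX_OTHER"}, {"RX_NICOTINE_B"}),
    PatientRecord("p3", {"DX_OTHER"}, {"RX_OTHER"}),
]
flags = {r.patient_id: rule_based_smoker(r) for r in records}
print(flags)
```

A rule of this shape can only fire when smoking leaves a coded trace in the data, which is one plausible reading of why such rules tend to miss smokers while rarely flagging non-smokers.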

What worked and what didn't

The cohort included 24,718 adults, and 10.0% were current smokers. A comprehensive rule-based algorithm had low sensitivity but high specificity, while the Random Forest model-based algorithm had higher sensitivity but lower specificity; negative predictive values were consistently above 90.0%. Model-based algorithms had higher balanced accuracy than rule-based algorithms, and results differed by sex and residence location. The number of years of administrative health data did not affect the model-based algorithm results.
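
The measures named in this section all come from the same four counts: true positives, false positives, true negatives, and false negatives, with registry self-report as the reference. The sketch below shows the standard definitions. The counts passed in are made up for illustration (chosen to give 10% prevalence) and are not the study's data.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard validation measures from a 2x2 confusion table."""
    sensitivity = tp / (tp + fn)   # share of true smokers who are flagged
    specificity = tn / (tn + fp)   # share of non-smokers who are not flagged
    ppv = tp / (tp + fp)           # share of flagged people who truly smoke
    npv = tn / (tn + fn)           # share of unflagged people who truly do not smoke
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts for 1,000 people, 100 of whom are smokers.
metrics = classification_metrics(tp=60, fp=180, tn=720, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy averages sensitivity and specificity, so it is not inflated by the large non-smoking majority in the way plain accuracy would be.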

What to keep in mind

The abstract does not provide details on all model settings or threshold choices. It also notes that performance differed across sex and residence location, so results may not be identical across subgroups.
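
Threshold choice matters because a model-based algorithm produces a score, and the cut-off applied to that score trades sensitivity against specificity. The sketch below illustrates the trade-off on a tiny invented data set. The scores, labels, and thresholds are assumptions for illustration and say nothing about the settings used in the study.

```python
def sens_spec_at_threshold(scores, labels, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are flagged."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    tn = sum((not p) and (not y) for p, y in zip(predicted, labels))
    fp = sum(p and (not y) for p, y in zip(predicted, labels))
    return tp / (tp + fn), tn / (tn + fp)


# Invented model scores and true smoking status (True = current smoker).
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90]
labels = [False, False, False, True, False, False, True, False, True, True]

for threshold in (0.25, 0.50, 0.70):
    sens, spec = sens_spec_at_threshold(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On this toy data, raising the threshold lowers sensitivity and raises specificity, which is why reported performance is hard to interpret without knowing where the cut-off was set.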

Key points

Machine-learning model-based algorithms were more sensitive for identifying current smokers than rule-based algorithms.
Rule-based algorithms were more specific than model-based algorithms.
The study used linked administrative health data and a clinical registry in Manitoba, Canada.
The Random Forest model-based algorithm had sensitivity of 66.8% and specificity of 77.8%.
Performance differed by sex and residence location.
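
As a rough consistency check on the figures above, the reported Random Forest sensitivity and specificity can be combined with the 10.0% smoking prevalence to estimate balanced accuracy and predictive values. This is a back-of-envelope sketch. It assumes the 10.0% prevalence applies to the data on which sensitivity and specificity were measured, and the resulting values are derived here, not reported by the study.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Derive (PPV, NPV) from sensitivity, specificity, and prevalence."""
    tp = sensitivity * prevalence                # true positives, as a population share
    fn = (1 - sensitivity) * prevalence          # missed smokers
    tn = specificity * (1 - prevalence)          # correctly unflagged non-smokers
    fp = (1 - specificity) * (1 - prevalence)    # non-smokers flagged as smokers
    return tp / (tp + fp), tn / (tn + fn)


# Figures taken from the summary above.
SENSITIVITY = 0.668
SPECIFICITY = 0.778
PREVALENCE = 0.100  # assumed to hold for the validation data

balanced_accuracy = (SENSITIVITY + SPECIFICITY) / 2
ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, PREVALENCE)

print(f"balanced accuracy (derived): {balanced_accuracy:.3f}")
print(f"PPV (derived): {ppv:.3f}")
print(f"NPV (derived): {npv:.3f}")
```

Under that assumption the derived balanced accuracy is about 0.72 and the derived negative predictive value is about 0.95, in line with the statement that negative predictive values were above 90.0%. The derived positive predictive value of roughly 0.25 shows how false positives can outnumber true positives at low prevalence, the balance the authors highlight when choosing an algorithm.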

Disclosure

Research title: Machine-learning algorithms improved smoking identification in health records
Publication date: 2026-03-29
DOI: 10.1186/s12874-026-02839-8