What the study found
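Model-based algorithms using machine learning identified current smokers in administrative health data with higher sensitivity than rule-based algorithms, meaning they caught a larger share of actual smokers. Rule-based algorithms had higher specificity, meaning they wrongly flagged fewer non-smokers.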
Why the authors say this matters
The authors conclude that combining more data sources with machine learning may improve smoking identification in administrative health data. They also say that balancing correct identification with false positives is important when choosing an algorithm.
What the researchers tested
The researchers conducted a retrospective cohort study in Manitoba, Canada, using administrative health data from hospital abstracts, medical claims, and prescription drug records linked to a clinical registry that recorded self-reported current smoking. They compared rule-based algorithms, which flag smokers from diagnosis codes and nicotine dependence medication, with model-based algorithms built using two machine learning methods: Random Forest, and LASSO, a regression method that selects important predictors.
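As a rough illustration of what a rule-based algorithm of this kind can look like, the sketch below flags a person as a current smoker if any diagnosis code or prescription matches a fixed list. The code prefixes, drug names, and the function name `rule_based_smoker` are assumptions chosen for illustration; the summary does not state which codes or medications the study actually used.

```python
# Hypothetical rule-based flag. The lists below are illustrative
# assumptions, NOT the definitions used in the study.
SMOKING_DX_PREFIXES = ("F17", "Z72.0", "305.1")  # assumed tobacco-related ICD codes
NICOTINE_RX = {"varenicline", "nicotine replacement therapy"}  # assumed drug names


def rule_based_smoker(dx_codes, rx_names):
    """Return True if any diagnosis code or prescription matches the lists."""
    # str.startswith accepts a tuple, so one call checks every prefix.
    has_dx = any(code.upper().startswith(SMOKING_DX_PREFIXES) for code in dx_codes)
    has_rx = any(name.lower() in NICOTINE_RX for name in rx_names)
    return has_dx or has_rx


# Usage with made-up records:
flagged = rule_based_smoker(["I10", "F17.2"], ["metformin"])
not_flagged = rule_based_smoker(["I10"], ["metformin"])
```

A rule like this only fires when smoking was explicitly coded or treated, which is one plausible reason such rules tend to miss smokers (low sensitivity) while rarely flagging non-smokers (high specificity).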
What worked and what didn't
The cohort included 24,718 adults, and 10.0% were current smokers. A comprehensive rule-based algorithm had low sensitivity but high specificity, while the Random Forest model-based algorithm had higher sensitivity but lower specificity; negative predictive values were consistently above 90.0%. Model-based algorithms had higher balanced accuracy than rule-based algorithms, and results differed by sex and residence location. The number of years of administrative health data did not affect the model-based algorithm results.
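The four measures named above all come from the same two-by-two comparison of algorithm output against the registry's self-reported status. The sketch below shows the standard definitions; the counts passed in are made up for illustration (chosen to give 10% prevalence) and are not the study's data.

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard accuracy measures from confusion-matrix counts.

    tp: smokers correctly flagged      fn: smokers missed
    tn: non-smokers correctly cleared  fp: non-smokers wrongly flagged
    """
    sensitivity = tp / (tp + fn)   # share of true smokers that were flagged
    specificity = tn / (tn + fp)   # share of non-smokers that were not flagged
    npv = tn / (tn + fn)           # share of "not flagged" who truly do not smoke
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "npv": npv,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts: 100 smokers and 900 non-smokers.
m = screening_metrics(tp=60, fp=180, tn=720, fn=40)
```

Balanced accuracy averages sensitivity and specificity, so it is not inflated by the large non-smoking majority the way plain accuracy would be.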
What to keep in mind
The abstract does not provide details on all model settings or threshold choices. It also notes that performance differed by sex and residence location, so the reported figures may not hold equally for every subgroup.
Key points
- Machine-learning model-based algorithms were more sensitive for identifying current smokers than rule-based algorithms.
- Rule-based algorithms were more specific than model-based algorithms.
- The study used linked administrative health data and a clinical registry in Manitoba, Canada.
- The Random Forest model-based algorithm had sensitivity of 66.8% and specificity of 77.8%.
- Performance differed by sex and residence location.
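The Random Forest figures listed above can be combined with the 10.0% smoking prevalence for a quick back-of-the-envelope check. This is derived arithmetic rather than values reported by the study, and it assumes the 10.0% prevalence applies to the sample in which sensitivity and specificity were measured.

```python
def npv_from_rates(sensitivity, specificity, prevalence):
    """Negative predictive value from rates, via Bayes' rule."""
    true_neg = specificity * (1 - prevalence)    # non-smokers correctly cleared
    false_neg = (1 - sensitivity) * prevalence   # smokers missed
    return true_neg / (true_neg + false_neg)


# Figures from the summary; prevalence is assumed to apply to the same sample.
sens, spec, prev = 0.668, 0.778, 0.100

balanced_accuracy = (sens + spec) / 2        # about 0.723
npv = npv_from_rates(sens, spec, prev)       # about 0.955
```

Under that assumption the implied negative predictive value is roughly 95%, which is consistent with the statement that negative predictive values were above 90.0%.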
Disclosure
- Research title: Machine-learning algorithms improved smoking identification in health records
- Authors: Aminul Haque, Nathan Nickel, Maxime Turgeon, Lisa M. Lix
- Institutions: University of Manitoba
- Publication date: 2026-03-29


