Comprehensive representation of health-related phenotypes in one million dogs using topic modelling of electronic health records

Two volunteers wearing blue shirts and personal protective equipment examine a brown dog lying on a white medical examination table in a clinical room with supplies visible.
Image credit: Photo by Mikhail Nilov on Pexels

AI Summary of Peer-Reviewed Research

This page presents an AI-generated summary of a published research paper. The original authors did not write or review this article.

Journal of Big Data · 2026-02-24 · Peer-reviewed

Publication Signals show what we were able to verify about where this research was published.

STRONG: We verified multiple publication signals for this source, including independently confirmed credentials. Publication Signals reflect the source's verifiable credentials, not the quality of the research.
  • ✔ Peer-reviewed source
  • ✔ Published in indexed journal
  • ✔ No retraction or integrity flags

Overview

This study applied unsupervised machine learning to characterize health-related phenotypes across one million canine electronic health records from the Small Animal Veterinary Surveillance Network (SAVSNET). Traditional veterinary studies are typically small in scale; the researchers address that limitation with a scalable computational approach that extracts disease phenotypes and population-level patterns from clinical text.

Methods and approach

The researchers used BERTopic, a topic-modelling algorithm built on Bidirectional Encoder Representations from Transformers (BERT), to process clinical notes collected by SAVSNET from UK veterinary practices. Because the method is unsupervised, it produced a comprehensive representation of clinical presentations across the canine population without manual disease annotation or predefined hypotheses.
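To make the topic-labelling idea more concrete, the sketch below illustrates one step that BERTopic is generally described as using: once notes have been grouped into clusters, each cluster is summarized by its most distinctive words with a class-based TF-IDF weighting. This is a minimal pure-Python illustration, not the authors' pipeline. The notes are made-up toy text, the cluster assignments are supplied by hand (in BERTopic they would come from embedding and clustering the notes), and the function names `class_tf_idf` and `top_terms` are hypothetical.

```python
import math
from collections import Counter

# Made-up toy notes, already grouped into clusters by hand (hypothetical data).
# In BERTopic the grouping would come from embedding and clustering the notes.
clustered_notes = {
    0: ["itchy skin ears red", "ears itchy skin sore"],
    1: ["vomiting diarrhoea since yesterday", "vomiting not eating diarrhoea"],
}


def class_tf_idf(clusters):
    """Score each term per cluster with a class-based TF-IDF weighting."""
    # Count terms per cluster, treating each cluster as one joined document.
    counts = {c: Counter(" ".join(docs).split()) for c, docs in clusters.items()}

    # Frequency of each term across all clusters.
    total = Counter()
    for term_counts in counts.values():
        total.update(term_counts)

    # Average number of words per cluster.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for c, term_counts in counts.items():
        n_words = sum(term_counts.values())
        scores[c] = {
            # Within-cluster term frequency times a log-scaled rarity weight.
            t: (n / n_words) * math.log(1 + avg_words / total[t])
            for t, n in term_counts.items()
        }
    return scores


def top_terms(scores, k=3):
    """Return the k highest-scoring terms per cluster (ties broken alphabetically)."""
    return {
        c: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for c, s in scores.items()
    }


topics = top_terms(class_tf_idf(clustered_notes))
```

In this toy example the first cluster is summarized by skin- and ear-related words and the second by gastrointestinal words, which is the kind of human-readable label a topic model attaches to a group of clinical notes.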

Key findings

The BERTopic approach recovered established disease phenotypes, including breed predispositions to hypoadrenocorticism, diabetes mellitus, and mitral valve disease. Beyond these known associations, the analysis surfaced previously uncharacterized patterns of disease phenotypes within the population. The method also demonstrated the capacity to detect temporal variation in disease distribution, suggesting that it could help identify emerging infectious or environmental disease signals.
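One common way to quantify a breed predisposition of the kind described above is an odds ratio: how much more likely records from one breed are to fall into a given topic than records from all other dogs. The sketch below is an illustration only. The summary does not state which statistic the authors used, the counts are invented, and the function name `odds_ratio` and the 0.5 continuity correction are assumptions made for this example.

```python
def odds_ratio(breed_in_topic, breed_total, other_in_topic, other_total):
    """Odds ratio for topic membership in one breed versus all other dogs.

    Adds 0.5 to every cell of the 2x2 table (a continuity correction,
    assumed here) so that empty cells do not cause division by zero.
    """
    a = breed_in_topic + 0.5                   # breed, in topic
    b = breed_total - breed_in_topic + 0.5     # breed, not in topic
    c = other_in_topic + 0.5                   # other dogs, in topic
    d = other_total - other_in_topic + 0.5     # other dogs, not in topic
    return (a * d) / (b * c)


# Invented counts, for illustration only: 30 of 1,000 records from one breed
# fall into a topic, compared with 100 of 10,000 records from other dogs.
example = odds_ratio(30, 1000, 100, 10000)
```

A value well above 1 would suggest the breed is over-represented in that topic, while a value near 1 would suggest no association. Any real analysis would also need confidence intervals and adjustment for factors such as age.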

Implications

The scalable, unsupervised approach offers a systematic alternative to traditional hypothesis-driven screening methods that rely on collating multiple small-scale studies. By enabling granular interrogation of large clinical datasets without requiring predetermined disease categories, the methodology facilitates comprehensive phenotyping across populations and supports surveillance for emerging health threats. This computational framework represents a methodological advance for leveraging the intrinsic information density of existing electronic health record systems in veterinary practice.

Disclosure

  • Research title: Comprehensive representation of health-related phenotypes in one million dogs using topic modelling of electronic health records
  • Authors: Peter‐John M. Noble, Sean Farrell, Noura Al-Moubayed, Alan David Radford
  • Publication date: 2026-02-24
  • DOI: https://doi.org/10.1186/s40537-026-01365-0
  • Disclosure: This post was generated by Claude (Anthropic). The original authors did not write or review this post.
