What the study found
The study presents the Microsoft Research Accurate Chemistry Collection (MSR-ACC) and its first release, MSR-ACC/TAE25, a dataset of 73,040 total atomization energies calculated at the CCSD(T)/CBS level using the W1-F12 thermochemical protocol. The dataset is designed to cover a broad space of closed-shell, neutral, covalently bound equilibrium molecules with up to 5 non-hydrogen atoms.
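For readers unfamiliar with the quantity, a total atomization energy is the energy needed to separate a molecule into its free atoms: the sum of the atomic energies minus the molecular energy. The sketch below illustrates that arithmetic only. The function name is hypothetical, the H2 numbers are approximate textbook values chosen for illustration and are not taken from MSR-ACC/TAE25, and zero-point and other corrections are ignored.

```python
# Minimal sketch of the total atomization energy (TAE) arithmetic.
# Names and example values are illustrative assumptions, not part of the dataset.

HARTREE_TO_KCAL_PER_MOL = 627.509474  # standard unit conversion factor


def total_atomization_energy(atom_energies_hartree, molecule_energy_hartree):
    """Return the TAE in kcal/mol: sum of atomic energies minus the molecular energy."""
    tae_hartree = sum(atom_energies_hartree) - molecule_energy_hartree
    return tae_hartree * HARTREE_TO_KCAL_PER_MOL


# Illustration with H2: the hydrogen atom energy is -0.5 hartree, and about
# -1.17447 hartree is an approximate textbook value for H2 at equilibrium.
h_atom_energy = -0.5
h2_energy = -1.17447

tae_h2 = total_atomization_energy([h_atom_energy, h_atom_energy], h2_energy)
print(f"Illustrative TAE of H2: {tae_h2:.1f} kcal/mol")
```

A positive value means the molecule is lower in energy than its separated atoms.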
Why the authors say this matters
The authors say that sub-chemical accuracy means being within 1 kcal mol^-1 of the empirical ground truth, and that datasets with this level of accuracy are still limited in size or scope. The study suggests that MSR-ACC/TAE25 can help develop data-driven computational chemistry methods with greater predictive accuracy across broad chemical space.
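The 1 kcal mol^-1 threshold can be read as a simple test on the error between a predicted and a reference energy. The following is a minimal sketch of that test under stated assumptions: the function names are hypothetical, and the energies are made-up placeholder numbers, not values from the study.

```python
# Minimal sketch: checking predictions against a 1 kcal/mol accuracy threshold.
# All names and numbers here are illustrative assumptions.

ACCURACY_THRESHOLD_KCAL_PER_MOL = 1.0


def within_threshold(predicted, reference, threshold=ACCURACY_THRESHOLD_KCAL_PER_MOL):
    """True if the absolute error is at most the threshold (all in kcal/mol)."""
    return abs(predicted - reference) <= threshold


def mean_absolute_error(predicted_values, reference_values):
    """Mean absolute error over paired predictions and references."""
    if len(predicted_values) != len(reference_values):
        raise ValueError("predicted and reference lists must have equal length")
    errors = [abs(p - r) for p, r in zip(predicted_values, reference_values)]
    return sum(errors) / len(errors)


# Placeholder energies in kcal/mol, invented purely for illustration.
reference = [100.0, 250.0, 400.0]
predicted = [100.5, 249.0, 403.0]

flags = [within_threshold(p, r) for p, r in zip(predicted, reference)]
mae = mean_absolute_error(predicted, reference)
print(flags, mae)
```

Here the third prediction misses by 3 kcal/mol, so it fails the per-molecule check even though the other two pass.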
What the researchers tested
The researchers built an openly available dataset on Zenodo in QCSchema format under the CDLA Permissive 2.0 license. It includes molecules made from elements up to argon and excludes structures with significant multireference character.
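Part of the stated scope can be expressed as a check on a single molecule record. The sketch below assumes a QCSchema-style molecule dictionary with the `symbols`, `molecular_charge`, and `molecular_multiplicity` fields; how the released files actually lay out their records is not described in the summary, so the record shapes and the function name are assumptions. A multiplicity of 1 is used only as a rough proxy for closed-shell, and the check cannot detect bonding type, equilibrium geometry, or multireference character.

```python
# Minimal sketch: testing a QCSchema-style molecule dict against part of the
# stated scope. The function name and example records are hypothetical.

ELEMENTS_UP_TO_ARGON = {
    "H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
    "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar",
}


def in_stated_scope(molecule, max_heavy_atoms=5):
    """Check element range, non-hydrogen atom count, charge, and multiplicity."""
    symbols = molecule["symbols"]
    heavy_atoms = [s for s in symbols if s != "H"]
    return (
        all(s in ELEMENTS_UP_TO_ARGON for s in symbols)
        and len(heavy_atoms) <= max_heavy_atoms
        # QCSchema treats a missing charge as 0 and a missing multiplicity as 1.
        and molecule.get("molecular_charge", 0) == 0
        and molecule.get("molecular_multiplicity", 1) == 1
    )


# Water: coordinates are approximate, in bohr, flattened as x, y, z per atom.
water = {
    "symbols": ["O", "H", "H"],
    "geometry": [0.0, 0.0, 0.0, 1.43, 0.0, 1.11, -1.43, 0.0, 1.11],
    "molecular_charge": 0,
    "molecular_multiplicity": 1,
}

# Hexane has six carbon atoms, so it exceeds the limit (geometry omitted for brevity).
hexane = {"symbols": ["C"] * 6 + ["H"] * 14}

# Hydrogen bromide: bromine lies beyond argon.
hydrogen_bromide = {"symbols": ["H", "Br"]}

print(in_stated_scope(water), in_stated_scope(hexane), in_stated_scope(hydrogen_bromide))
```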
What worked and what didn't
The release contains 73,040 total atomization energies and is described as exhaustively covering the specified chemical space. The abstract does not report performance comparisons, model benchmarks, or failures.
What to keep in mind
The available summary does not describe limitations beyond the dataset scope itself. The release is restricted to closed-shell, charge-neutral, covalently bound equilibrium structures with up to 5 non-hydrogen atoms and without significant multireference character.
Key points
- MSR-ACC/TAE25 contains 73,040 total atomization energies.
- The energies were obtained at the CCSD(T)/CBS level using the W1-F12 thermochemical protocol.
- The dataset covers closed-shell, neutral, covalently bound equilibrium molecules with up to 5 non-hydrogen atoms.
- The covered elements extend up to argon, and molecules with significant multireference character are excluded.
- The dataset and canonical train/validation splits are openly available on Zenodo in QCSchema format under the CDLA Permissive 2.0 license.
Disclosure
- Research title: New dataset covers atomization energies across broad chemical space
- Authors: Sebastian Ehlert, Jan Hermann, Thijs Vogels, Víctor García Satorras, Stephanie Lanius, Marwin Segler, Klaas J. H. Giesbertz, Kenji Takeda, Giulia Luise, Rianne van den Berg, Paola Gori-Giorgi, Amir Karton
- Institutions: Microsoft (Netherlands), Microsoft (United States), Microsoft (Germany), Microsoft Research (United Kingdom), University of New England
- Publication date: 2026-04-25
- Image credit: Photo by Polina Tankilevitch on Pexels · Pexels License