Reviving analytical data of the past with open submission databases and text mining tools

CINF 101

Sam Adams (1), Stefan Kuhn (2), Peter Murray-Rust (3, pm286@cam.ac.uk), Christoph Steinbeck (2, c.steinbeck@uni-koeln.de), Joe A Townsend (1, jat45@cam.ac.uk), and Christopher A. Waudby (4, caw47@cam.ac.uk). (1) Unilever Centre for Molecular Science Informatics, University of Cambridge, Lensfield Road, CB2 1EW Cambridge, United Kingdom, (2) Research Group for Molecular Informatics, Cologne University Bioinformatics Center (CUBIC), Zuelpicher Str. 47, D-50674 Cologne, Germany, (3) Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, CB2 1EW Cambridge, United Kingdom, (4) Department of Chemistry, University of Cambridge, Lensfield Road, CB2 1EW Cambridge, United Kingdom
In contrast to molecular biology, chemistry has few open databases. We have addressed this gap in our own field of research, computer-assisted structure elucidation, by creating NMRShiftDB, an open-access, open-submission database of nuclear magnetic resonance (NMR) spectra. NMR data have been published in the literature for 40 years, but in electronic form they are available only as scanned bitmaps. NMRShiftDB makes it possible to revive this information by letting users enter data through a submission interface, backed by quality-assurance procedures. We also present the application of OSCAR, a data mining tool for analytical data, to produce starting material for NMRShiftDB's authoring process. OSCAR parses organic chemistry papers, summarizes the data it finds, and alerts the user to potential errors in the data. The spectral data that OSCAR discovers are stored as CMLSpect files and used to author NMRShiftDB datasets.
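
As an illustration of the kind of step described above (finding spectral data in the text of a paper and flagging suspicious values), the following is a minimal Python sketch. It is not OSCAR's actual algorithm or API: the function name `extract_c13_shifts`, the regular expression, and the plausibility window `C13_RANGE` are all assumptions made for this example, and real experimental sections vary far more than this pattern allows.

```python
import re

# Hypothetical plausibility window for 13C chemical shifts, in ppm.
# This is an assumption for the sketch, not a value taken from OSCAR.
C13_RANGE = (-10.0, 250.0)

# Matches a simple peak list such as "13C NMR (CDCl3): δ 170.1, 128.4, 21.0".
PEAK_LIST = re.compile(
    r"13C\s+NMR[^:]*:\s*(?:δ|delta)?\s*((?:-?\d+(?:\.\d+)?\s*,?\s*)+)"
)


def extract_c13_shifts(text):
    """Return (shifts, warnings) for the first 13C peak list found in text."""
    match = PEAK_LIST.search(text)
    if match is None:
        return [], ["no 13C NMR peak list found"]
    # Pull every number out of the captured peak list.
    shifts = [float(v) for v in re.findall(r"-?\d+(?:\.\d+)?", match.group(1))]
    low, high = C13_RANGE
    # Flag values outside the assumed window as potential errors.
    warnings = [
        f"shift {s} ppm is outside {low} to {high} ppm"
        for s in shifts
        if not low <= s <= high
    ]
    return shifts, warnings


if __name__ == "__main__":
    sample = "13C NMR (CDCl3): δ 170.1, 128.4, 21.0, 1280.4."
    shifts, warnings = extract_c13_shifts(sample)
    print(shifts)
    print(warnings)
```

In this sketch the last value in the sample, which looks like a misplaced decimal point, would be reported to the user rather than silently dropped, mirroring the idea of alerting a human curator before the data are submitted.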