Introducing CLiDE Pro

CINF 54

Aniko Valko, aniko.valko@keymodule.co.uk, Keymodule Ltd, Leeds, United Kingdom; A Peter Johnson, a.p.johnson@chemistry.leeds.ac.uk, School of Chemistry, University of Leeds, Leeds, LS2 9JT, United Kingdom; and Aniko Simon, SimBioSys Inc, 135 Queen's Plate Dr, Unit 520, Toronto, ON M9W 6V1, Canada.
CLiDE Pro is the latest incarnation of software to emerge from the long-term CLiDE (Chemical Literature Data Extraction) project. Chemical OCR involves three main problems: (a) identification of chemical images within a document, (b) compilation of chemical graphs of individual molecules from chemical images, and (c) interpretation of complex objects such as generic molecules and reaction schemes using the retrieved chemical graphs. The structure recognition methods implemented in CLiDE Pro will be presented. Structural features that frequently cause problems will be discussed together with our solutions to them. These include crossing bonds and lines found in various chemical entities, such as single bonds attached to triple bonds, dashed bonds, and parts of atom labels commonly misclassified as lines (e.g. I and Cl). A key component of the presentation will be CLiDE Pro's approach to the interpretation of generic structures.
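
To make steps (b) and (c) concrete, the following is a minimal illustrative sketch, not CLiDE Pro's actual data model or algorithm. It assumes a chemical graph can be held as a list of atom labels plus a list of bonds, and that a generic structure uses placeholder labels such as R1 whose alternatives are single atoms. The names ChemGraph and expand_generic are hypothetical and do not come from the CLiDE project.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Tuple


@dataclass
class ChemGraph:
    """Hypothetical chemical graph: atom labels plus (i, j, order) bonds."""
    atoms: List[str] = field(default_factory=list)
    bonds: List[Tuple[int, int, int]] = field(default_factory=list)

    def add_atom(self, label: str) -> int:
        # Append the atom and return its index for use in bonds.
        self.atoms.append(label)
        return len(self.atoms) - 1

    def add_bond(self, i: int, j: int, order: int = 1) -> None:
        self.bonds.append((i, j, order))


def expand_generic(core: ChemGraph,
                   r_groups: Dict[str, List[str]]) -> List[ChemGraph]:
    """Enumerate specific graphs by substituting every placeholder label.

    Simplifying assumption: each alternative is a single atom, so the
    substitution only relabels a node and leaves the bonds unchanged.
    """
    names = sorted(r_groups)
    expanded = []
    for choice in product(*(r_groups[name] for name in names)):
        mapping = dict(zip(names, choice))
        atoms = [mapping.get(label, label) for label in core.atoms]
        expanded.append(ChemGraph(atoms, list(core.bonds)))
    return expanded


# Generic structure C-C-R1, where R1 is Cl or I.
core = ChemGraph()
c1 = core.add_atom("C")
c2 = core.add_atom("C")
r1 = core.add_atom("R1")
core.add_bond(c1, c2)
core.add_bond(c2, r1)

expanded = expand_generic(core, {"R1": ["Cl", "I"]})
for graph in expanded:
    print(graph.atoms, graph.bonds)
```

A real generic structure would also need multi-atom substituents, variable attachment points and repeat counts, which this sketch deliberately leaves out.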