German scientists develop software to read chemical compounds
German scientists have developed a new software tool that identifies pictures of chemical structures in patent files. The aim is to make these pictures computer-readable and retrievable.

Patent files and repositories of scientific publications often contain information on chemical structures in image format. Chemists can open such a document and understand the images without difficulty, but computers cannot index the structures, because to a computer each image is only a mass of pixels.

The chemoCR software, developed by the Fraunhofer Institute for Algorithms and Scientific Computing (SCAI) and the German company InfoChem, combines pattern recognition techniques with supervised machine-learning concepts. The method is based on identifying the most significant semantic entities in structural formulae (e.g. chiral bonds, super atoms, reaction arrows). This enables computers to retrieve the information contained in chemical-pharmaceutical patents by performing structure searches.

'Up to now, structures have been drawn by chemists in India, Russia and other low-wage countries, and entered manually in databases. These fast-developing countries are benefiting from the added indexing value. With chemoCR we can now reconstruct chemical structures faster and more cost-effectively, with computers,' says Peter Loew, InfoChem's CEO.

'With our software, for the first time, millions of patents can be searched using the chemical information contained in the pictures. This opens new possibilities for the investigation of patent claims on compounds and synthesis procedures; chemoCR addresses one of the most common challenges of the chemical and pharmaceutical industry,' adds Professor Martin Hofmann-Apitius, Director of SCAI.