During the project, the innovation associate held several workshops with end users to understand the business problem, its context and the required end result. The intermediate result was a mockup of a website and a specification of a smart service. On this basis, a metadata extraction service was created and deployed in an NLP platform and web service that automates the extraction of text data from contract documents, so that our customer can retrieve the desired information with ease.
Metadata extraction is done with an information extraction method called Named Entity Recognition (NER), which consists of two substeps: Named Entity Identification (locating entity mentions in the text) and Named Entity Classification (assigning each mention a type). We worked with three methods for NER: pattern lookup using regular expressions, a rule-based approach, and machine learning models. The text was first preprocessed using tokenization, normalization and noise removal. The models were implemented with the Python library spaCy, using pattern matching, rule-based approaches, machine learning and a hybrid approach. They were trained on labeled data prepared together with the customer and afterwards evaluated on validation data. The result is a working system for finding named entities in contract documents, for which we evaluated the confidence score. Finally, the best-performing model was deployed in the cloud as a web application built with Flask, running in Docker containers on AWS ECS with Fargate.
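The preprocessing steps and the pattern-lookup variant of NER can be illustrated with a minimal sketch that uses only Python's standard `re` module. It is not the project's implementation: the entity labels `DATE` and `AMOUNT`, the regular expressions, and the function names `preprocess`, `tokenize` and `extract_entities` are assumptions chosen for illustration, since the actual patterns and contract fields are not given here.

```python
import re

# Hypothetical patterns for two contract metadata fields (assumed, not from the project).
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"),
    "AMOUNT": re.compile(r"\b(?:EUR|USD|CHF)\s?\d+(?:[.,]\d+)*\b"),
}


def preprocess(text):
    """Noise removal and normalization: drop markup remnants, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)  # noise removal: strip tag-like debris
    text = re.sub(r"\s+", " ", text)      # normalization: unify whitespace
    return text.strip()


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def extract_entities(text):
    """Identify entity spans by pattern lookup and classify them by pattern label."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append(
                {
                    "label": label,
                    "text": match.group(),
                    "start": match.start(),
                    "end": match.end(),
                }
            )
    # Return entities in reading order.
    return sorted(entities, key=lambda e: e["start"])


raw = "<p>This  contract starts on 01.03.2021 and\nhas a value of EUR 12,500.00.</p>"
clean = preprocess(raw)
entities = extract_entities(clean)
```

In this sketch, identification and classification happen in one step, because each pattern both finds a span and fixes its label. A rule-based or machine learning approach would typically separate or learn these decisions, which is one reason to compare the methods on the same validation data.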
The main results are a working metadata extraction service with a clearly defined API and a prototype frontend, which can be used for validation, testing and prototyping of the service. The results have been checked into a code repository and documented for the different stakeholders.