Content archived on 2024-06-18

Tool platform for intelligent Patent Analysis and Summarization

Final Report Summary - TOPAS (Tool platform for intelligent Patent Analysis and Summarization)


Executive Summary:

Patents are the treasury of knowledge that drives modern economies worldwide. Support for targeted search in this treasury has increasingly been a topic of research, with very significant outcomes. But search is just the first stage: it provides the “raw” material that is to be analyzed, digested, assessed and summarized, depending on the concrete task of the user (for instance, prior art or infringement surveillance, product similarity studies, white spot analysis, etc.). So far, the main burden of all of these tasks has been on the manual skills of specialists. Automatic low-level text content analysis such as entity recognition, name identification, etc., which would constitute an essential aid in the task of reviewing patent material, has so far not fully found its way into the patent processing industry, with the exception of, for instance, name recognition in chemical or bio-medical patent material. However, we must also be aware that the users do not ask for a fully automatic means that assigns a relevance score to each patent of the result set. On the contrary, such an automatic means without the involvement of the users in the processing loop is strongly rejected by patent specialists because of the risk of missing a highly relevant patent. Rather, the users need an intelligent assistant that provides all information necessary to enable them to grasp the essence of a given patent in the shortest possible time, so that they can judge its relevance for their purpose. To address these needs, TOPAS pursued a number of industrial/economic, strategic and innovation-oriented objectives that resulted in the development of advanced technologies for (i) entity recognition, (ii) lexical chain (or relation) identification, (iii) patent segmentation, (iv) patent claim – description alignment, and (v) patent summarization.
The developed technologies have been integrated into the product lines of the SMEs of the Consortium and the first tests have been performed with the integrated platforms.

The achievements of the TOPAS Project already show a clear positive impact for both the SMEs and the RTD Performers of the Consortium. In accordance with the projection of the impact in the Work Plan, the impact is expected to increase considerably further when the releases of the developed components are consolidated within the first years after the lifetime of the project. So far, the impact for the SMEs has been of three kinds. Firstly, benefits have already resulted from the preliminary integration of selected TOPAS technologies into the existing product lines of the SMEs. Secondly (and as expected in the Work Plan), the possibility to advertise TOPAS technologies as a future part of their product lines has already helped the SMEs to enter new market segments or to attract the attention of new potential clients in their existing market segment. Thirdly, the SMEs obtained through TOPAS an operational infrastructure for language-oriented document processing, based on the GATE development environment. This infrastructure will directly benefit any future activity of the individual SMEs in the area of patent processing. The RTD Performers have already benefited from TOPAS through an increase in their know-how and skills in the domain of patent processing.

During the lifetime of the project, a number of dissemination measures have been taken. These measures were of the following kinds: (i) presentation of TOPAS technologies and dissemination of information on the project to patent professionals, potential clients and the professional world in general; (ii) presentation of the scientific achievements made during the development of TOPAS technologies to academic audiences; (iii) demonstration of TOPAS technologies to the interested public and potential clients.

The IPRs concerning the Foreground developed in the project are outlined in the Consortium Agreement and in a separate Agreement document signed by the SMEs of the Consortium.

Project Context and Objectives:

Patents are the treasury of knowledge that drives modern economies worldwide. Support for targeted search in this treasury, i.e. in the large patent databases, has increasingly been a topic of research, with very significant outcomes. But search is just the first stage: it provides the “raw” material that is to be analyzed, digested, assessed and summarized, depending on the concrete task of the user (for instance, prior art or infringement surveillance, product similarity studies, white spot analysis, etc.). Due to the potentially wide-reaching consequences when a relevant patent remains undiscovered during search, patent search traditionally aims at high recall at the cost of lower precision. As a result, this raw material tends to contain many more irrelevant hits than one would expect from a general discourse search, and patent specialists have to review large quantities of patents in each phase of the search procedure. So far, the main burden of the reviewing has been on the manual skills of specialists, which makes the analysis, digestion and assessment of patents an extremely costly and protracted affair. Fewer and fewer innovation-oriented companies in Europe can afford it, with all the negative ramifications this implies. Commercial services that have been offered so far by the industry fall short of really easing the situation. For instance, the patent abstracting service offered by the monopolist Thomson Derwent Inc. as a means to partly accelerate the reviewing procedure is too expensive to be a realistic option for the vast majority of the innovation-oriented SMEs in Europe: currently, Thomson bills US$ 528 per hour for the use of the Derwent patent database or US$ 6 per data record from the Derwent World Patent Index. Automatic low-level text content analysis such as entity recognition, name identification, etc., which would constitute an essential aid in the task of reviewing patent material, has so far not fully found its way into the patent processing industry, with the exception of, for instance, name recognition in chemical or bio-medical patent material.

On the other hand, we must be aware that the needs of the users do not demand the availability of a fully automatic means that assigns a relevance score to each patent of the result set. Moreover, such an automatic means without the involvement of the users in the processing loop is strongly rejected by patent specialists, again, due to the risk of missing a highly relevant patent. Rather, the users need an intelligent assistant that provides all information necessary to enable them to grasp in the shortest possible time the essence of a given patent in order to be able to judge its relevance for their purpose. “Intelligent” means here that (i) the quality of the provided information is ensured by using cutting edge natural language processing technologies which meet the challenges innate to the patent genre; (ii) the information is provided when requested by the user, (iii) the process of the acquisition of the information is totally transparent to the user, i.e. the user must always be able to solicit evidence why the system did what it did and which options have been discarded.

The needs concern support at three levels of patent material processing:

– The availability of an automatic means for obtaining, at a reasonable cost, an on-the-fly summary of any patent under inspection and/or a block diagram of a patented invention, in order to avoid the need to examine all patents in the retrieved result set in their entirety. Patent specialists will greatly benefit from an option to request a summary of any patent when needed, without their work being significantly delayed until the summary arrives. A high-quality summary reduces the amount of material that a specialist has to read in order to decide whether the claims, the description and/or other sections of a given patent are worth examining. The available commercial abstracting service by Thomson Derwent is not adequate for search-related patent review for three major reasons: (i) the time required to receive such a summary, (ii) the high costs it implies, and (iii) the fact that its availability is not guaranteed (since it is created semi-manually and stored in the Derwent Patent DB). The abstracts provided by the patent filing institution are not adequate either, because of their very abstract and intentionally obfuscating style.
– The availability of an automatic means for the identification of thematic text passages (or segments) in patent material in accordance with the criteria of the user. The guidance of the user to specific segments in a patent is crucial for efficient examination of a patent. Thus, the position of the description of the novelty of the invention, the outline of the possible applications, the description of the state-of-the-art, etc. vary from patent to patent. The automatic identification of a segment on a theme specified by the user facilitates a speedy acquisition of necessary information by the user.
– The availability of automatic means for the identification and classification of content elements in patent material in general or in previously identified thematic text passages. The types of content elements in which the users are interested are in particular: (i) names of institutions, products, substances, authors of scientific publications, etc.; (ii) titles of scientific publications; (iii) components of the described invention and the ways of their interaction; (iv) kind of invention: method, process, instrument, substance, etc. The need for aid in the identification and classification of content elements is motivated by: (i) knowing who are the competitors / inventors / etc.; (ii) what the others are doing exactly and how this differs from what the user is doing themselves; (iii) references to the state of affairs in a field; and (iv) what the invention described in a patent comprises.

To address these needs, TOPAS pursued a number of industrial/economic, strategic and innovation-oriented objectives. The primary industrial/economic objectives of TOPAS have been:

– EO1: Testing the TOPAS product by a test user pool in order to establish usage patterns across patent specialists with different profiles and across different patent domains, and thus further optimize the performance of the product.
– EO2: Considerable reduction of the amount of patent material that is inspected in detail by patent specialists in a patent search session by means of the TOPAS patent summarization module.
– EO3: Considerable reduction of the time spent by a patent specialist for the assessment of a single patent document by means of the TOPAS functionality of text segmentation and content element identification and classification.
– EO4: Reduction of the costs for a patent search session by using the TOPAS product in its entirety.
– EO5: Determination of the limitations of the application of the TOPAS product due to (i) the varying nature of patent layout and style across different domains, (ii) time constraints on the provision of information by TOPAS, (iii) potential performance bottlenecks of the TOPAS product.
– EO6: Endorsement of the TOPAS Product by major professional bodies of main patent holders and patent attorneys, respectively, achieving thus a maximally broad dissemination among potential users of TOPAS.
– EO7: Achieving a quality of the individual technologies in TOPAS that is competitive when compared to state-of-the-art technologies.
– EO8: Integration of the TOPAS Product into the product lines of the Consortium’s SMEs.
– EO9: Achieving in midterm time that 50% of the clients of the Consortium’s SMEs use the new product, which would imply a very high degree of exploitation of the TOPAS technologies.
– EO10: Achieving an increment of the business volume and economic benefits of the Consortium’s SMEs thanks to TOPAS by at least 20%, within the second year after completion of the project.

The primary strategic objectives of TOPAS have been:

– SO1: Extension of the market for the Consortium’s SMEs. We expect that with the integration of TOPAS into the product lines of the SMEs of the Consortium, the market in which they will be able to operate will be considerably broader.
– SO2: Extension of the business network of the SMEs.
– SO3: Acquisition of additional technical know-how and research skills by the Consortium’s SMEs, enabling them thus to address in the future research topics autonomously.
– SO4: Improvement of the exploitation of research results.
– SO5: Making the TOPAS Product known to a large community of new customers who currently either predominantly use free tools, at the cost of an enormous amount of time and effort due to the poor quality of these tools, or completely outsource the activity, losing direct control of a business-critical task and spending substantial funds.
– SO6: Mid- and long-term dominance of the market of intelligent tools for patent documentation analysis and summarization, including flexible adaptation of the business and exploitation models of the Consortium’s SMEs to upcoming standard solutions.

The primary innovation-oriented (or technical) objectives of TOPAS focused on the development of robust and efficient technologies for patent analysis and summarization, and in particular, for:

– IO1: Entity recognition and classification in patent documentation – including named and unnamed entities and functional and composition content relations.
– IO2: Patent document segmentation and segmentation labelling.
– IO3: Patent material summarization.

These technologies were to be integrated into a holistic workbench equipped with web services to connect to the products of the SMEs involved in the project.

Project Results:

This section describes the achievements of the TOPAS project in the course of its effort to achieve its objectives presented above. In accordance with the Work Plan, these achievements focus, on the one hand, on (i) entity recognition; (ii) content relation (or lexical chain) identification; (iii) extraction and explicit representation of the composition of an invention; (iv) segmentation of patent material at different levels of abstraction; (v) claim – description segment alignment, and (vi) patent summarization. On the other hand, the achievements of the project also consist in the integration of the above technologies into one holistic platform and into the product lines of the SMEs of the Consortium.

In what follows, these achievements are described.

Entity Recognition

Entity recognition in TOPAS is twofold: (a) recognition of entities of a pre-defined entity class that is significant for patent analysis and (b) recognition of general entities that represent the main content of an invention.

Recognition of entities of a pre-defined entity class:

We defined five entity classes that are crucial for patent analysis:

- Patent citations: Patents frequently refer to other patents concerning closely related inventions using patent citations, which are crucial for patent retrieval and research.

We recognize patent citations using a regular expression engine. The components of the overall regular expression let us extract the most important parts of a citation, such as nationality (e.g. German), country code (e.g. DE), patent type (e.g. application), further properties (e.g. laid-open), version (e.g. A1) and patent number (e.g. 10309383). Since the SMEs plan to link each patent citation to the corresponding original patent document, we had to provide a valid country code for each patent citation. However, patent citations frequently do not contain all the information needed for this linking (i.e. country code and patent number). This is the case in lists of patent citations of the same country (e.g. “as in US patents Nos. 6,034,093, 6,020,357, 5,994,375, …”) or in repetitions of patent citations that have been mentioned before. We developed methods for detecting the right antecedent country code in these cases. For patent citations that contain only a nationality adjective instead of a country code, we introduced a mapping between nationality and country code (e.g. "European" is mapped to "EP", "International" to "WO", "German" to "DE", etc.). We iterate over all patent citations that provide a nationality but lack a country code and add the corresponding country code to the citation's feature set. The final system recognizes patent citations in English, German and French.
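The following Python fragment is a minimal sketch of this idea, not the TOPAS implementation. The pattern, the function name find_citations and the three-entry nationality table are illustrative assumptions; it covers only English citations, the list case and the nationality mapping, and it does not handle repeated citations.

```python
import re

# Hypothetical mapping; the three entries are the examples named in the text.
NATIONALITY_TO_CODE = {"European": "EP", "International": "WO", "German": "DE"}

# Illustrative pattern (an assumption): a nationality adjective or a two-letter
# country code, an optional patent type, then one or more patent numbers.
CITATION = re.compile(
    r"\b(?:(?P<nationality>European|International|German)\s+|(?P<code>[A-Z]{2})\s+)"
    r"(?:(?P<type>patent application|patent|application)s?\s+)?"
    r"(?:Nos?\.\s*)?"
    r"(?P<numbers>\d[\d,]*\d(?:\s*,\s*\d[\d,]*\d)*)"
    r"(?:\s+(?P<version>[A-C]\d))?"
)


def find_citations(text):
    """Return one record per cited patent number, each with a country code."""
    citations = []
    for match in CITATION.finditer(text):
        # Use the explicit country code, else map the nationality adjective.
        code = match.group("code") or NATIONALITY_TO_CODE.get(match.group("nationality"))
        # In a list such as "US patents Nos. 1, 2, 3", every number inherits
        # the country code that introduced the list.
        for raw in re.split(r",\s+", match.group("numbers")):
            citations.append({
                "country_code": code,
                "number": raw.replace(",", ""),
                "type": match.group("type"),
                "version": match.group("version"),
            })
    return citations
```

Because each number in a list is emitted as its own record with the inherited country code, every record carries the two fields (country code and patent number) that the text names as necessary for linking.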

- Physical measurements: Physical measurements are of great interest because they help describe the properties of components (e.g. “a laser beam having a width of 10 to 20 nm”), i.e. they help in identifying relations, components and their properties, and in summarizing the content. As these constructions show some regularities, we chose an approach based on a regular expression that captures them. A regular expression engine iterates over a given document and returns the offsets and grouped information of each physical measurement mention. The regular expression is based on a set of possible physical units that are derived from a unit ontology as a grounding source. We want to provide all information that is necessary for comparing several patents based on a physical quality. For this reason, besides the basic physical unit, we also provide information about the factor (i.e. the prefix) of each unit (e.g. kilo in kilometer). Since one patent may mention a length of 20 kilometers whereas another patent may use 20,000 meters, information about the factor is indispensable: it enables the SMEs to convert measurements and, for example, compare 20 kilometers with 19,500 meters. Our final system recognizes physical measurements in English, German and French.
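As a rough illustration of how a unit/prefix-aware pattern makes measurements comparable, consider the Python sketch below. The small PREFIX_FACTORS and BASE_UNITS tables stand in for the unit ontology and are assumptions, as are the function name and the restriction to spelled-out English unit names.

```python
import re

# Hypothetical, tiny stand-ins for the unit ontology mentioned in the text.
PREFIX_FACTORS = {"kilo": 1000.0, "centi": 0.01, "milli": 0.001, "": 1.0}
BASE_UNITS = {"meter": "length", "gram": "mass", "second": "time"}

_prefixes = "|".join(p for p in PREFIX_FACTORS if p)
_units = "|".join(BASE_UNITS)
MEASUREMENT = re.compile(
    r"(?P<value>\d+(?:,\d{3})*(?:\.\d+)?)\s*"
    rf"(?P<prefix>{_prefixes})?(?P<unit>{_units})s?\b"
)


def find_measurements(text):
    """Return offsets, grouped parts and a base-unit value for each mention."""
    results = []
    for match in MEASUREMENT.finditer(text):
        prefix = match.group("prefix") or ""
        value = float(match.group("value").replace(",", ""))
        results.append({
            "start": match.start(),
            "end": match.end(),
            "value": value,
            "prefix": prefix,
            "unit": match.group("unit"),
            # Multiplying by the prefix factor expresses every value in the base unit.
            "base_value": value * PREFIX_FACTORS[prefix],
        })
    return results
```

With the factor applied, "20 kilometers" and "19,500 meters" both become plain numbers of meters and can be compared directly.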

The remaining three entity classes are not based on a regular expression but on an exhaustive list of entities, i.e. a gazetteer. The gazetteers were constructed by adapting a lexicon bootstrapping approach to the patent domain, as described in Ziering et al. (2013) and in Section 2.3.2 of deliverable D3.2. The SMEs are provided with the documented source code of this work and can create new or expanded gazetteers for any entity class on their own.

We have implemented a system component that labels all occurrences of a class instance based on an input gazetteer. Many statistical classifiers, such as CRFs, exploit both gazetteer and contextual information. Instead of utilizing contextual information, we chose to apply a simple gazetteer lookup that iterates over all tokens in a patent document and checks for matches with any gazetteer entry. To improve tagging accuracy, we additionally provide options for longest match only, case sensitivity and merging of adjacent labels. In our decision to use gazetteer lookup for tagging, we follow Qadir and Riloff (2012), who argue convincingly that for a specialized class and domain, ambiguity of terms (which would be the main reason for using a context-dependent method like a CRF) is a limited phenomenon and ignoring it does not greatly affect performance.
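A gazetteer lookup with the three options mentioned above can be sketched in a few lines of Python. The function names, the whitespace tokenization of gazetteer entries and the use of token (rather than character) offsets are assumptions made for illustration only.

```python
def gazetteer_lookup(tokens, gazetteer, label, case_sensitive=False, longest_only=True):
    """Label token spans that match a gazetteer entry; spans are (start, end, label)."""
    norm = (lambda s: s) if case_sensitive else str.lower
    entries = {tuple(norm(word) for word in entry.split()) for entry in gazetteer}
    max_len = max((len(entry) for entry in entries), default=0)
    toks = [norm(token) for token in tokens]
    spans = []
    i = 0
    while i < len(toks):
        # Candidate lengths at position i, longest first.
        lengths = [n for n in range(min(max_len, len(toks) - i), 0, -1)
                   if tuple(toks[i:i + n]) in entries]
        if not lengths:
            i += 1
        elif longest_only:
            spans.append((i, i + lengths[0], label))
            i += lengths[0]          # skip past the matched span
        else:
            spans.extend((i, i + n, label) for n in lengths)
            i += 1
    return spans


def merge_adjacent(spans):
    """Merge touching or overlapping spans that carry the same label."""
    merged = []
    for start, end, label in sorted(spans):
        if merged and merged[-1][2] == label and merged[-1][1] >= start:
            previous = merged.pop()
            merged.append((previous[0], max(previous[1], end), label))
        else:
            merged.append((start, end, label))
    return merged
```

With longest_only enabled, "polyvinyl ether" is labelled once as a whole instead of also producing a separate match for "ether".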

- Technical Quality: A technical quality is a basic or essential attribute or a property which is measurable or shared by all members of a group. Examples of technical qualities are physical quantities of physical measurements such as length or volume, or more specific properties like power consumption, piston speed or light reflection index.

Recognizing technical qualities provides an important complement for physical measurements. Given a suitable association means (e.g. proximity or the scope of a sentence), the SMEs can relate a technical quality with a physical measurement. This way, patents can be compared or ranked by several technical qualities.

- Substance: A substance is a particular kind of physical matter with uniform properties. Examples of substances are metal, air, polyvinyl ether, wood, aluminum chlorohydrate, or gold. Substance is a very predominant class in coordination structures (cf. section 2.3.2.1 in deliverable D3.2) in the patent domain.
- Process: A process is a method or event that results in a change of state. Examples of processes are stretching, molding process or redundancy control. Process is a very important class for patent research since processes provide information about relations/interactions of components of an invention or they highlight an invented method itself.

Recognition of general entities:

We now look at the most general definition of (named) entities, including product names, brand names, components of machines, substances etc. We call this task “automatic terminology acquisition” (ATA) and what it acquires the “technical terminology” of a patent. In contrast to the previously specified part (a) of entity recognition, we are not interested in the actual semantic classes of the recognized entities but in a broad coverage of technical expressions.

Consider the example: “The present invention relates to a charging apparatus of a bicycle dynamo”. Here, the compound nouns “charging apparatus” and “bicycle dynamo” are the main content words because they refer to specific technological concepts. Our task here is to recognize that word sequences like “charging apparatus” are technical terms and others, like “present invention”, are not.

A sequence of words w1, w2, ..., wk is a term of domain D iff (i) it is a head noun (“dynamo”) or a head noun modified by other nouns (“bicycle dynamo”), adjectives (“secondary dynamo”), or present participles (“charging apparatus”) and (ii) it denotes a concept specific to D. Conditions (i) and (ii) describe the syntactic and semantic properties of a term, respectively. Part (i) of the definition restricts terms to parts of noun phrases. This is a reasonable restriction that covers most terms (Daille et al., 1996) and it has been frequently made in the computational terminology literature. Part (ii) of the definition restricts terms to be specific to a domain D. We can set D to a general domain like ‘electricity’ and be on a par with many prior definitions, but we can also set D to a narrow domain like ‘emergency protective circuit arrangements’ (IPC code H02H). In TOPAS, we choose the most general technical domain possible: the domain of all technical subjects. This is a good setting for many downstream tasks in the project; e.g. summarization and claim-segment alignment should benefit from a broad coverage of D. Another advantage of finding all technical expressions in a given document or corpus is that it makes ATA easier.

The input to ATA is a patent; the output consists of annotations of type “Terminology”, which mark (parts of) noun phrases denoting technical concepts. We perform ATA in a multi-step procedure. First, pre-processing is applied. Pre-processing gives us access to various kinds of linguistic information about the text, e.g. what the words are, where a sentence ends, or what part of speech a word has. Next, we apply a linguistic filter in order to collect term candidates. This filter discards most word combinations that are clear non-terms; e.g. the word sequence “present invention relates” in the example above can be discarded because it does not end with a noun. After candidate acquisition, we apply a previously learned statistical model to distinguish the candidates denoting technical concepts from those denoting non-technical concepts, e.g. “invention” in the example above. This statistical model computes the probability that a candidate like “charging apparatus” is a technical term by looking at various kinds of information (called the features of the model). Terminology-related features include, for example, the frequency of a candidate: a candidate is more likely to be a term if it occurs often in a patent and less often in a non-patent domain. Other features capture whether the candidate comes after determiners (“a”, “the”, etc.), whether it comes before figure references, or whether it shares many words with the title of the patent. We developed 74 features and let an automatic procedure select the best subset.
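The candidate filter and two of the feature types can be sketched as follows. This is a simplified illustration, not the project's code: it assumes input that is already tagged with Penn Treebank part-of-speech tags, collects only maximal modifier/noun runs, and the function and feature names are hypothetical.

```python
# Tags allowed inside a candidate: nouns, adjectives and present participles.
MODIFIER_TAGS = {"NN", "NNS", "JJ", "VBG"}
NOUN_TAGS = {"NN", "NNS"}


def term_candidates(tagged):
    """Collect maximal runs of allowed tags, trimmed so that they end with a noun."""
    candidates, run = [], []
    for word, tag in list(tagged) + [("", "END")]:   # sentinel flushes the last run
        if tag in MODIFIER_TAGS:
            run.append((word, tag))
            continue
        while run and run[-1][1] not in NOUN_TAGS:    # a candidate must end with a noun
            run.pop()
        if run:
            candidates.append(" ".join(w for w, _ in run))
        run = []
    return candidates


def candidate_features(candidate, text, title):
    """Two illustrative features: in-patent frequency and overlap with the title."""
    words = candidate.lower().split()
    title_words = set(title.lower().split())
    return {
        "frequency": text.lower().count(candidate.lower()),
        "title_overlap": sum(w in title_words for w in words) / len(words),
    }
```

On the example sentence the filter keeps “present invention” as a candidate; rejecting it is the job of the statistical model, which is not part of this sketch.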

In order to learn how important each feature is, the model must compute feature values for a specific candidate, look up whether the candidate is a term or not, and adjust the feature weights accordingly, before repeating the procedure with the adjusted weights, and so on. Candidate labels usually come from human annotators. This is time-consuming and costly work, because thousands of candidates must be labelled. Our main scientific contribution is a novel method to automatically extract training data from patents for ATA. If a candidate comes before a figure reference (or figure pointer), we know with a precision of >95% that this candidate is a term. By extracting such candidates and their other occurrences in patents, we can easily compile huge amounts of training data. The features we developed help to generalize from the “figure reference terms” to other terms, e.g. those with no figure reference.
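The figure-reference heuristic can be illustrated with a short Python sketch. The pattern for a figure reference (a reference numeral such as “10” or “12a” directly after the candidate) and the function name are assumptions; the text does not specify how figure references are detected or how negative examples are obtained.

```python
import re


def figure_reference_terms(candidates, text):
    """Return the candidates that occur directly before a reference numeral."""
    positives = set()
    for candidate in candidates:
        # Assumed shape of a figure reference: optional parentheses around a
        # number with an optional one-letter suffix, e.g. "10", "(12)", "12a".
        pattern = r"\b" + re.escape(candidate) + r"\s+\(?\d+[a-z]?\)?(?!\w)"
        if re.search(pattern, text):
            positives.add(candidate)
    return positives
```

Every occurrence of a positive candidate, including those without a reference numeral, can then serve as a positive training example.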

Using figure reference terms alone gives us a method with very high precision. However, we miss many technical terms in the patent. By applying our statistical model, we maintain the high precision but significantly decrease the number of missed terms per patent. We show in several experiments that our automatically extracted training data can compete with human annotations and that using more training data leads to better performance.

Lexical Chain identification:

The notion of lexical chain is an extension of the notions of coreference and content relation (Morris and Hirst 1991) and defines a sequence of lexical items in a text related by a semantic relation (including identity, synonymy, part-whole, etc.). The identification of lexical chains is essential for both interactive inspection of a patent by an expert and computational processing of it (e.g. for purposes of summarization).

As the basic infrastructure for the recognition of lexical chains, we take in TOPAS the Stanford Deterministic Coreference Resolution System (henceforth, StCR); see (Raghunathan et al. 2010; Lee et al. 2013; Lee et al. 2011). StCR scored best in the CoNLL 2011 Shared Task on co-reference resolution and is freely available for download and incorporation. It is the last stage of a pipeline of linguistic processing modules in the Stanford CoreNLP platform that makes use of the results of the preceding modules (e.g. tokenization, (named) entity delimitations and dependency parses).

The original StCR consists of three stages: (1) Candidate detection: Detection of nominal, pronominal and named entities using a high-recall algorithm, with the exclusion of undesirable mentions such as impersonal pronouns, partitives, numericals, bare NPs, etc. (2) Coreference resolution: Application of a succession of independent coreference models (the sieves), from highest to lowest precision to all candidate coreference mentions selected in (1), to obtain clusters of related entities. (3) Post-processing: Elimination of singleton clusters, thus leaving only clusters with at least two mentions. A sieve in (2) starts with the first mention of the first sentence and, moving forward one mention at a time, assigns the current mention m_i an antecedent m_{i-1} from a list of candidates (Lee et al. 2011). These are ordered (i) from left-to-right traversing breadth-first the syntactic dependency tree when m_{i-1} is in the same sentence as m_i (thus favoring subjects), (ii) from right-to-left, breadth-first when the candidate is an NP in a sentence preceding that of m_i (thus favoring syntactic salience and proximity), and (iii) from left-to-right if the candidate is in pronominal form in a sentence preceding that of m_i. Each sieve traverses the candidate list until a coreferent antecedent is detected or the end of the list is reached. A mention m_i and its antecedent mentions are clustered.
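The multi-pass sieve architecture of stages (2) and (3) can be sketched in Python as below. This is a toy illustration and not the StCR code: the two sieves, the mention representation and the nearest-first candidate ordering are simplifying assumptions (StCR orders candidates by syntactic traversal, as described above).

```python
def exact_match(mention, antecedent):
    return mention["text"].lower() == antecedent["text"].lower()


def head_match(mention, antecedent):
    return mention["head"].lower() == antecedent["head"].lower()


# Sieves are applied one after another, from highest to lowest precision.
SIEVES = [exact_match, head_match]


def resolve(mentions):
    """Cluster mentions (given in textual order) and drop singleton clusters."""
    cluster_of = list(range(len(mentions)))        # each mention starts alone
    for sieve in SIEVES:
        for i in range(1, len(mentions)):
            # Simplification: candidates are visited nearest-first.
            for j in range(i - 1, -1, -1):
                if cluster_of[i] == cluster_of[j]:
                    continue
                if sieve(mentions[i], mentions[j]):
                    old, new = cluster_of[i], cluster_of[j]
                    cluster_of = [new if c == old else c for c in cluster_of]
                    break                            # stop at the first antecedent
    clusters = {}
    for index, cluster in enumerate(cluster_of):
        clusters.setdefault(cluster, []).append(index)
    # Post-processing: keep only clusters with at least two mentions.
    return [members for members in clusters.values() if len(members) > 1]
```

Because later sieves see the clusters built by earlier ones, low-precision sieves only add links that the high-precision sieves left open.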

Due to patent idiosyncrasies, the original constellation of StCR had a limited performance on patents. In particular:

- NPs were wrongly included or excluded as mentions. For instance, many impersonal pronouns are included because the filter for excluding impersonal pronouns is inadequate for the patent domain:

It is then judged whether the current value is not more than a specified value.

- Bare NPs are also excluded even though they can be found at the beginning of claim sentences as the name of the invention:

A battery charging device […] for continuing supply of air to [batteries] […] comprising: a charge current controlling portion for judging whether an abnormal condition is stored in the memory means incorporated in [the batteries]

Smaller mentions within larger mentions with the same head (a typical phenomenon in patents, where multiple levels of NP embedding are common) are also excluded:

[[a charge controlling portion] for charging [the batteries] at [the current value that has been retrieved by [the current value retrieving portion]]]

The ordering of candidate antecedents of nominal mentions within the same sentence (left-to-right breadth-first) favors the subject of the sentence as antecedent, which is not adequate in the patent domain in which (i) a claim sentence can be a page long and contain many embedded noun phrases, and (ii) there is a lot of repetition of terms that refer to different objects:

- Fig. 1 is a perspective view of the battery charging device according to one form of embodiment of the present invention.
- Fig. 2 is a perspective view of a battery package according to the one form of embodiment of the present invention […]
- Fig. 15 is an explanatory view showing a theory for charging of the battery charging device of the third embodiment.

[…] In the illustrated embodiment […]

Some patent-specific lexical chains are not detected, such as the part-whole relation between an invention and its different components, instance-of relations, or the relation between a set and its members:

[A battery charging device (…)]Head, comprising: [a judging portion for (…)]comp_1, and [an abnormality indicating portion for (…)]comp_2

To adapt StCR to patent material, we (1) substituted the Stanford CoreNLP pipeline by our own pipeline; (2) tuned the three stages of the Coreference Resolution System.

Our pipeline consists of a chunker and Bohnet’s dependency parsing framework (Bohnet, 2011), which includes a tokenization module, a PoS tagger, a lemmatizer and a parser. The chunker is adapted to the peculiar syntactic structures of patent sentences. From the output of the chunker, we derive flat constituency structures that are then used by StCR. These structures include only NPs, which is justified by the fact that StCR operates mostly on NPs when determining the elements of coreference chains (Lee et al. 2013). Bohnet’s parsing environment was chosen because of its ability to handle the very long sentences of up to 900 words found in the claims section, and because it can be retrained (Burga et al. 2013). A GATE plug-in was created as a wrapper around the StCR code. The plug-in integrates StCR into our patent processing pipeline: it converts, on the one hand, our annotations into the format accepted by StCR as input (i.e. the format produced by the Stanford CoreNLP pipeline) and, on the other hand, the lexical chains output by StCR into GATE annotations.

The following adaptations have been performed in the individual stages of the StCR.

In the mention detection stage, we introduced patent-specific exclusion filters, e.g. simple NPs using head nouns that are frequent in the patent domain (e.g. claim, part, method), or awkward mentions containing colons and semi-colons, or mentions with over 30 tokens (to palliate chunking errors). Unlike for general discourse, we kept smaller mentions within larger mentions with the same head, and bare NPs.
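The three exclusion filters can be expressed compactly; the Python sketch below is illustrative only. The head-noun list contains just the examples given above, and the definition of a “simple NP” as a mention of at most two tokens is an assumption, since the text does not define it.

```python
# Examples of frequent patent head nouns named in the text.
EXCLUDED_SIMPLE_HEADS = {"claim", "part", "method"}
MAX_MENTION_TOKENS = 30        # longer mentions are treated as chunking errors
SIMPLE_NP_MAX_TOKENS = 2       # assumption: determiner plus head noun


def keep_mention(tokens, head):
    """Return False for mentions removed by the patent-specific filters."""
    if len(tokens) > MAX_MENTION_TOKENS:
        return False
    if ":" in tokens or ";" in tokens:
        return False
    if len(tokens) <= SIMPLE_NP_MAX_TOKENS and head.lower() in EXCLUDED_SIMPLE_HEADS:
        return False
    return True
```

Note that the sketch deliberately contains no filter for bare NPs or for smaller mentions nested in larger ones, since these are kept in the patent setting.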

In the coreference resolution stage, nominal and pronominal antecedent ordering was adapted to fit patent idiosyncrasies. Except for the sieve for detecting the antecedent of deictic you/I, all the StCR sieves were included in the same order, albeit with some modifications. In addition, a new sieve was created that detects elements related through copulative predicates. The specific adaptations were the following:

- In the sieves that deal with exact and relaxed string match, bare NPs can be linked when they occur at the beginning of sentences.
- In the sieve that implements precise pattern-based constructs, the detection of appositives was disabled; relative pronouns refer back to the nearest antecedent mention, whilst reciprocal pronouns such as “one another” or “each other” refer back to the nearest plural antecedent mention.
- In the sieves that implement relaxed head match, the mention and its antecedent are not required to match in number, so as to allow set-member relations.
- The copula sieve was modified to distinguish between attributive relations and identity relations found in copulative constructions.
- The lexical chain sieve was modified to use elaborate patterns to detect the part-whole relations and instance-of relations between mentions within the same sentence.

The post-processing stage was adapted to refine the obtained clusters. Firstly, clusters that contain a mixture of reference relations are divided into different chains. Secondly, if compatible mentions (i.e. same head, same attributes, and one smaller mention included in the other larger one) are used to establish different cohesion relations, their clusters are merged into one.

To evaluate the performance of the adapted StCR against the original StCR, we followed COALA (Andrews et al. 2007), which compares the coreference annotations of two different algorithms and presents the differences in a format suitable for manual evaluation. We developed a similar tool that, given two annotations, computes all pairs of chains with at least one mention in common and determines the complete context for all mentions involved. To calculate the context, the tool determines, for each mention that belongs to only one of the two chains, which chain of the other annotation contains it.
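The pairing of chains can be sketched as follows. This is not the tool itself: the function name and the data layout (each chain as a set of mention identifiers, for example character offsets) are assumptions made for illustration.

```python
def align_chains(chains_a, chains_b):
    """Pair up chains from two annotations that share at least one mention.

    Each chain is a set of mention identifiers. For every pair, the result
    also records where each non-shared mention ended up in the other
    annotation (None if no chain there contains it).
    """
    index_a = {m: i for i, chain in enumerate(chains_a) for m in chain}
    index_b = {m: j for j, chain in enumerate(chains_b) for m in chain}
    pairs = []
    for i, chain_a in enumerate(chains_a):
        for j, chain_b in enumerate(chains_b):
            if not chain_a & chain_b:
                continue  # no mention in common: not an aligned pair
            only_a = {m: index_b.get(m) for m in chain_a - chain_b}
            only_b = {m: index_a.get(m) for m in chain_b - chain_a}
            pairs.append({"a": i, "b": j, "only_a": only_a, "only_b": only_b})
    return pairs
```

Pairs in which both `only_a` and `only_b` are empty correspond to chains with 100% overlap.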

We focus on identity relations (since the original StCR deals only with them). Both systems were run within our pipeline adapted to the patent domain.

A set of 100 pairs of aligned chains and their contexts produced by both systems was selected at random, excluding those pairs with 100% overlap. Each pair of coreference chains was assessed manually: how many of the mentions in each chain are correct and how many are incorrect. For mentions found in only one of the chains, it was assessed whether their inclusion in other chains is correct or not. In 61% of the pairs, the output of the adapted version of the StCR had a higher hit rate; in 24%, the original StCR did; and in 15%, the two versions were equal.

A more qualitative evaluation was performed by end users. The users received a questionnaire and a set of patents to evaluate. They consider the lexical chains correct enough to help them better understand the patent content; as a drawback, they point out that recall is somewhat low.

The Stanford CoreNLP package supports English only and has no provision for other languages. This is reflected in the source code, where linguistic expressions and language-dependent constructs (e.g. regular expressions) are often hard-coded. In order to add support for French and German, we had to refactor parts of the code, removing all language-specific bits and replacing them with language-independent references to a centralized dictionary which can be exchanged depending on the language. Thus, the resulting adapted coreference module can easily be extended to support additional languages.

We extended an existing dictionary file to include all language-specific bits and replicated the dictionary for both French and German. A manual analysis of patents in those languages was carried out by experts to determine the best translations for the patent domain.

Extraction and explicit representation of the composition of an invention

Recognizing the components of an invention is a difficult task because inventions come in different types. A component of an invention can literally be a component in an apparatus invention, but it can also be a substep in a method invention. We use the results of entity detection as a starting point for detecting the components of an invention in a claim. Our general approach is as follows: We first segment each claim into meaningful units, which are text spans focused on describing one component of an invention. This is possible before detecting the component itself because text spans of this kind are naturally delimited by punctuation symbols and phrases in patent descriptions. Then, we use a simple heuristic to decide whether a segment focuses on describing a part of an apparatus or a substep of a method: If the first terminology element appears before the first gerund in the token stream of the segment, we assume that the segment describes a part of an apparatus; if a gerund appears before the first terminology element, we assume that the segment describes a substep of a method invention. This seems simplistic, but works well in practice because the component focused on in a segment tends to appear very early in the segment, and substeps of a method usually begin with a gerund (like “detecting a voltage”). With this methodology the start of the component in focus can be detected fairly reliably.
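The heuristic amounts to checking which of the two cues comes first in the segment. A minimal sketch, in which the labels "TERM" and "GERUND" are hypothetical stand-ins for the upstream terminology and part-of-speech annotations:

```python
def classify_segment(tagged_tokens):
    """Decide whether a claim segment describes an apparatus part or a method substep.

    tagged_tokens: list of (text, label) pairs, where the label is "TERM"
    for a terminology element, "GERUND" for a gerund, or anything else.
    """
    for _text, label in tagged_tokens:
        if label == "TERM":
            return "component"  # terminology element comes first
        if label == "GERUND":
            return "substep"    # gerund comes first
    return None                 # neither cue found: the heuristic does not apply
```

With the example from the text, `classify_segment([("detecting", "GERUND"), ("a", "OTHER"), ("voltage", "TERM")])` yields `"substep"`.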

A more difficult problem is to detect the end of the component. Looking at the example for a substep “detecting a voltage” it seems fairly simple: We have “detecting”, the gerund, as the beginning, and the argument “a voltage”, which is a simple noun phrase. The problem is that this noun phrase can get fairly complex, like in “detecting the voltage of a battery”, where the argument is the complex noun phrase “the voltage of a battery”. This problem isn’t restricted to substeps: A terminology element can be part of a complex noun phrase as well, and ideally we would want the complex noun phrase marked if we have something like “voltage calculator of the battery”. We solve this problem with a domain specific grammar which defines the linguistic material which can follow a terminology element and a gerund so that the terminology element or the gerund form a complex linguistic phrase which constitutes a component of the invention.

In the following, we see an example for a patent claim which is about a method invention:

A method of generating a pattern in a display apparatus, the method comprising: determining a basic pattern block size to indicate an absolute location of a pixel on a display panel; determining points where a plurality of holes are to be formed in subpixels included in each block region determined based on the basic pattern block size, and the plurality of holes are used for calculating an absolute location of a corresponding pixel; and generating a pattern by forming a corresponding hole in each point based on an absolute location of a pixel included in each block region, among the points where the plurality of holes are to be formed.

We notice that in the first segment the component is actually not a substep of the method but a description of what the method is for. We take this into account by attaching a feature to a substep which indicates whether it is just a part of the invention or if it is the invention itself. This feature has the value “initial” if the component is the invention itself, and “component” or “substep” respectively for parts of an apparatus or substeps of a method.

We also mark the parts of an invention we find in the description so that the user can find discussions about a part of the invention quickly. Thus this markup is both helpful for understanding the composition of the invention and for finding discussions about certain parts of the invention.

Segmentation of patent material

We provide segmentation on 3 different levels of abstraction: The first level segments a patent into five mandatory and two optional zones. Most patent descriptions consist of the mandatory segments, but patent descriptions frequently do not contain the two optional segments.

The mandatory segments are:

- Technical Field
- Background Art
- Summary of the Invention
- Description of Drawings
- Preferred Embodiments

The optional segments are:

- Industrial Applicability
- Examples

We segment a patent description into those segments using the headlines which signal the start of a respective segment of this type (for example a headline like “Technical Field”). If a headline marking the start of a segment is not present we use the following heuristics for the segmentation:

- Finding the end of the Technical Field / start of Background Art: This heuristic is a modified version of the heuristic used in Tseng et al 2007. They take the first paragraph in the description as the Technical Field segment if no headline marking the start of this segment is present. We modify this heuristic by taking into account that information can precede the Technical Field in a patent description: If we face a situation where a headline precedes a headline signaling the Background Art segment (like “Background of the Invention”) then we take the first paragraph after the Background Art headline as the Technical Field segment rather than the first paragraph of the description, which in that case contains information different from information about the technical field.
- Finding end of Background Art / start of Summary of the Invention: This heuristic is based on the observation that if an objective of the invention is mentioned in the Summary of the Invention it is often mentioned in the first paragraph of the Summary of the Invention segment. Thus, if no headline marking the start of the Summary of the Invention is present, we can use the first occurrence of a sentence describing an objective of the invention as the boundary between the Background Art and the Summary of the Invention segment. If no objective segment is present we use the first paragraph after the Technical Field segment which contains the word invention as that boundary, which implements the intuition that most of the time the first time the patent talks about the invention after the Technical Field segment is when the summary of the invention begins. In experiments those heuristics have proven to be very reliable.
- Finding the start and the end of the Description of Drawings: The Description of Drawings has structural properties which make it possible to detect it reliably even if no headlines marking its start and/or end are present. First, the Description of Drawings contains a description of all the figures used to describe the invention. Second, the individual descriptions of the figures are very short compared to paragraphs in the rest of the description which mention a figure or are about what is shown in a figure. Third, the Description of Drawings starts with a description of figure 1. Thus, we search for the smallest textual area which starts with a reference to figure 1, ends with a reference to the highest figure in the patent description and contains references to all figures in the patent description.
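The search for the Description of Drawings can be sketched as a window search over paragraphs. The sketch below rests on two assumptions not stated in the report: figure references have already been extracted into one set of figure numbers per paragraph, and the size of a textual area is measured in paragraphs. The function name is hypothetical.

```python
def find_drawings_span(paragraph_figs):
    """Return (start, end) paragraph indices of the smallest candidate span.

    paragraph_figs: one set of referenced figure numbers per paragraph.
    The span must start at a paragraph referencing figure 1, end at one
    referencing the highest figure, and cover every figure in the description.
    """
    all_figs = set().union(*paragraph_figs) if paragraph_figs else set()
    if 1 not in all_figs:
        return None
    highest = max(all_figs)
    best = None
    for start, figs in enumerate(paragraph_figs):
        if 1 not in figs:
            continue
        seen = set()
        for end in range(start, len(paragraph_figs)):
            seen |= paragraph_figs[end]
            if highest in paragraph_figs[end] and seen == all_figs:
                if best is None or end - start < best[1] - best[0]:
                    best = (start, end)
                break  # extending this start further only gives longer spans
    return best
```

The function returns `None` when no paragraph references figure 1 or when no span satisfies all three conditions.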

Those three heuristics allow us to detect the boundaries for all five mandatory segments even if no headlines are present in the patent description. For the detection of the optional segments we rely on the recognition of headlines.

The second level of segmentation is that of headline and transition segment recognition, which results in a partial table of contents. We use a type system based on the mandatory and optional level 1 segments. The type system includes headlines which signal the start of a discussion about an embodiment of the invention, about what is shown in a figure, or about another topic discussed in the context of the particular patent at hand. Because headlines were not formally marked in our data we used the following methodology for headline recognition: We define every paragraph which is shorter than 100 characters and does not end with a sentence delimiter (., !, :, and ?) as a headline candidate. The candidates are then classified as being of one of the following types of headline:

- 7 types for the 7 level 1 segments (5 mandatory and 2 optional)
- A type “embodiment” for a headline marking the beginning of a discussion about an embodiment of the invention and a type “figure” for a headline marking the beginning of a discussion about what is shown in a figure
- A type “other” for headlines marking the beginning of a topic specific to the invention; for example, the headline “Structure of the carburetor 11” marks the beginning of a discussion about the structure of the carburetor 11
- A type “no_headline” for candidates that are not headlines (most frequently formulas and equations).
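The candidate rule that precedes the classification is simple enough to state directly in code. In this minimal sketch the function name is hypothetical, and rejecting empty paragraphs is an added assumption.

```python
SENTENCE_DELIMITERS = (".", "!", ":", "?")
MAX_HEADLINE_LENGTH = 100


def is_headline_candidate(paragraph):
    """A paragraph is a candidate if it is short and does not end in a sentence delimiter."""
    text = paragraph.strip()
    if not text:
        return False  # assumption: empty paragraphs are never candidates
    return len(text) < MAX_HEADLINE_LENGTH and not text.endswith(SENTENCE_DELIMITERS)
```

Both “Technical Field” and “Structure of the carburetor 11” pass this test, while an ordinary sentence ending in a full stop does not. Deciding the headline type is left to the classifier.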

We manually labeled 2010 instances of headlines in 411 patents and trained a support vector machine classifier on these instances. The results of this classification are:

- Precision = 93, Recall = 93, F = 93

This shows that the classification of headlines is very reliable.

Transition segments are sentences which signal upcoming content. An example is:

“Next, the operation principle of the battery charger in the first embodiment will be described.”

Here the upcoming content is signaled by two features: First, this sentence is the first sentence of the paragraph it occurs in, and second it contains the phrase “will be described”, indicating the upcoming event of a description. The word “next” signals that the position where the description occurs is immediately after the sentence, not somewhere later in the description. Thus we can recognize sentences of this type by recognizing phrases like this. We evaluated this method on a manually annotated test corpus of 17 patents, with the following results:

Recall = 38, Precision = 81, F = 52
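The reported F value is consistent with the balanced F-measure, i.e. the harmonic mean of precision and recall. The report does not say which F variant was used, so the following check rests on that assumption.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Transition segments: precision 81, recall 38
print(round(f_measure(81, 38)))  # → 52
```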

We can also use two very important concepts in patent descriptions, or rather the textual references to them, to recognize transitions: Embodiments and Figures. From observation we hypothesized that a sentence which occurs at the beginning of the paragraph it occurs in and begins with a figure reference or a reference to an embodiment is also a transition, meaning it signals an upcoming discussion about the embodiment or what is shown in the figure.

An example would be: “FIG. 2 presents a chart exemplifying the operating sequence.” with a following discussion of the operating sequence. To verify this hypothesis we made the following experiment: We chose some patents randomly and used the method described before to automatically mark sentences as transition segments. Then we showed an annotator the data and asked her to answer the following question for every sentence: “Does the sentence marked as a TRANSITION indicate the topic which is discussed in at least the first paragraph following the sentence?”

She recorded her answers by manually attaching a corresponding feature to each annotation. We then evaluated our hypothesis by counting how many times she answered “yes” and how many times she answered “no”. For the 128 instances of transition segments found by our method, she answered “yes” 119 times and “no” only 9 times, which gives the method an accuracy of 93%.

Thus both methods have high precision, but they do not find all transition segments present in a patent description. This is why we can only derive a partial table of contents from headlines and transitions. The procedure for deriving a table of contents is as follows: We simply list all headlines and transition segments in the order in which they appear in the patent description and link from each to the text which comes right after the headline or transition. This gives a reader an overview of the topics discussed in a patent and lets him or her navigate to a particular topic of interest very quickly. This can be useful for document navigation, but also potentially for information retrieval, since words appearing in a headline or transition segment may be more relevant for the overall content than a word which appears in a different part of the description.

The level 3 segments are sentences which contain a useful type of information. The recognized level 3 segments are:

- TRANSITION
- ADVANTAGE
- OBJECTIVE
- SOLUTION
- LEGAL

In the following, the types are explained briefly (except for TRANSITION, which was explained above):

ADVANTAGE: ADVANTAGE segments are sentences which express an advantage of the invention. This is the only level 3 segment for which we did not achieve a precision which allows direct extraction of the segments. However, in combination with OBJECTIVE segments the annotations of this segment still allow for relatively fast detection of advantages of the invention.

OBJECTIVE: OBJECTIVE segments are sentences which express an objective of the invention. They usually appear at the beginning of the SUMMARY OF THE INVENTION segment. We manually labeled sentences which contained one of the keywords “object, objective, purpose” as either being an OBJECTIVE or not being an OBJECTIVE. We then trained a support vector machine classifier on this data, using 50% of the data for training and 50% for evaluation. We achieved a precision, recall and F of 97.

SOLUTION: While working on methods for recognizing the advantage of the invention we noticed that sometimes patent authors include sentences like: “To achieve the object mentioned above, according to the first aspect of the present invention, there is provided a portable electronic device having at least one electronic device being a heat source, which operates…” indicating how the objective(s) of the invention are achieved. We call sentences like this SOLUTION segments and recognize them using a simple heuristic: If a sentence directly follows an OBJECTIVE segment and contains one of the keywords “object, objective, purpose” then it is a SOLUTION segment. We tested this heuristic on the 139 instances of SOLUTION segments in a corpus of 411 patents, and the results are: Precision = 0.94, Recall = 0.38, F = 0.54. Note that the precision is again very good, showing that this heuristic is well suited to recognizing segments of this type, although it misses many of them.

LEGAL: This is a type of sentence which expresses that the invention is not restricted to the embodiments but is only restricted by the claims. This is most often expressed either at the start or, more importantly, the end of the PREFERRED EMBODIMENTS. Thus it can be used as a feature for recognizing the end of the PREFERRED EMBODIMENTS and the start of the optional segments.

In addition to the types on the three levels of segmentation we recognize references to embodiments and figures. This is done with regular expressions based on strings and part of speech tags. An example for a reference to an embodiment is:

“a first embodiment of the invention”

An example for a figure reference is: “Figure 1” or, if more than one figure is referred to, something like: “figures 1 to 3”, “figures 1,2,4”, to indicate a range or an enumeration of figures which are referred to.
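A string-only approximation of the figure reference patterns is sketched below. The patterns used in TOPAS also draw on part of speech tags and are not reproduced in the report, so the expression, the accepted spellings (including “FIG.”) and the function name are all assumptions.

```python
import re

# Illustrative pattern only: keyword, then a number optionally followed by
# further numbers joined by ",", "to" or "and".
FIGURE_REF = re.compile(
    r"\b(?:figures?|figs?\.?)\s*"
    r"(\d+(?:\s*(?:,|to|and)\s*\d+)*)",
    re.IGNORECASE,
)


def figure_numbers(text):
    """Return the sorted figure numbers referred to in text, expanding ranges."""
    numbers = set()
    for match in FIGURE_REF.finditer(text):
        body = match.group(1)
        # Expand "a to b" ranges first, then collect the individual numbers.
        for lo, hi in re.findall(r"(\d+)\s*to\s*(\d+)", body):
            numbers.update(range(int(lo), int(hi) + 1))
        numbers.update(int(n) for n in re.findall(r"\d+", body))
    return sorted(numbers)
```

Here `figure_numbers("figures 1 to 3")` gives `[1, 2, 3]` and `figure_numbers("figures 1,2,4")` gives `[1, 2, 4]`.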

We also provide an advanced version of the level 1 segmentation, which utilizes a conditional random field (CRF), a state of the art machine learning method. The CRF takes as input the sequence of paragraphs the description consists of, where each paragraph is represented by a small number of features like “does the paragraph contain a figure reference?”. In the training data, each paragraph of such a sequence is labeled with the level 1 segment it belongs to. The CRF then learns the characteristic properties of the paragraph sequences belonging to each segment, for example those of the Description of Drawings. Then, given a previously unseen sequence of paragraphs representing the description of a new patent, the CRF can label each paragraph with the level 1 segment it belongs to.

This leads to two distinct methods for level 1 segmentation being available: The heuristics based method and the CRF. This enables the computation of a confidence value in the following way: If headlines mark every level 1 segment we assign the highest level of confidence. If the two methods agree the level of confidence is still high, but when the two methods do not agree it is only mediocre. Our experiments showed that the heuristics are more reliable than the CRF, so we assign a low level of confidence if the heuristics can’t find a segmentation at all so that we must rely on the CRF for the segmentation. The latter is based on the observation that patents tend to conform to the heuristics, so the patents for which the heuristics can’t find a segmentation at all have, from the point of view of our methods, an unusual structure.
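The confidence assignment can be summarized as a small decision function. The label strings and the function signature are hypothetical; the report only ranks the four cases.

```python
def segmentation_confidence(all_headlines_present, heuristic_result, crf_result):
    """Map the agreement between the two level 1 segmenters to a confidence label.

    heuristic_result is None when the heuristics find no segmentation at all.
    """
    if all_headlines_present:
        return "highest"   # every level 1 segment is marked by a headline
    if heuristic_result is None:
        return "low"       # only the CRF output is available
    if heuristic_result == crf_result:
        return "high"      # both methods agree
    return "mediocre"      # the two methods disagree
```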

The level 1 segmentation was evaluated qualitatively on 40 patents chosen from a set of patents provided by Brügmann Software. This set was intentionally designed to be challenging, so the results should be considered a lower bound for the real results. Of the 40 patents, 5 had a very unusual structure and were therefore not segmented by the method, leaving 35 segmented patents. Of those 35, 28 were labeled correctly or almost correctly (where “almost correctly” means that only a minor error occurred, like labeling a single short paragraph incorrectly), which is an accuracy of 80%. Note that on the previously mentioned manually annotated corpus of 17 patents every description received a level 1 segmentation and the method achieved a lenient F-score of 93. This indicates that the true accuracy lies somewhere between the lenient F-score of 93, which is an optimistic estimate, and the accuracy of 80% measured on the subset of patents chosen from those provided by Brügmann Software. These results suggest that the performance of the method is around 85%.

Thus we provide robust methods for segmenting a patent description into a small number of segments relevant for users and a mechanism for providing an overview over the topics discussed in the patent (the partial table of contents).

Claim – description alignment

The alignment between claims and description has been done at different levels of granularity: paragraphs, sentences and sub-sentential segments.

Claim-segment alignment is the task of aligning the sections of claims and description that are directly related to each other. In order to do so, it is necessary to find the linguistic structures in the description that are related to a claim. The purpose of claim-segment alignment is twofold: on the one hand, it can provide a way for the user to explore claims by highlighting parts in the description that explain and/or elaborate on a claim or a part of the claim, which can be used for an interactive navigation through the claim; on the other hand, it can also provide features for abstractive summarization by dividing a claim into parts that describe the components of the invention (or steps, if the invention is a method), allowing for the alignment to be more fine-grained through relating the elements of the claims to sentences or parts of sentences in the description and choosing which information is more relevant to be included in a summary.

Claim-paragraph alignment is a very coarse-grained kind of alignment which is not sufficient for supporting summarization given that the information units to be extracted by a summarization system are more on the level of sentences or even below the sentence-level. Thus, it is necessary to apply a fine-grained alignment based on the notion of an element of the invention (e.g. a component, a step, a subcomponent, a sub-step) and a textual part in the claim which focuses on describing such element of the invention. The claim segment composed by an element of the invention and any of its respective characteristics is called an aspect of the invention. Elements and aspects are then used to find sentences and sub-sentential units in the description which provide additional information about those elements and aspects, or express in a different way the information contained in the claim.

The similarity computation is based on the important element of the invention in the aspect intended to be aligned to sentences or sub-sentential units in the description. The algorithm works as follows:

1. For a given sentence or sub-sentential unit and a given aspect, do:
1.1. compute the similarity between the important element of the invention in the aspect and all linguistic structures in the sentence or sub-sentential unit in the description which have the same syntactic structure (e.g. nominal phrases, gerundive phrases) and thus would qualify as an element of the invention;
1.2. choose the highest similarity value from step 1.1 and combine it with a lexical overlap measure between the aspect and the sentence or sub-sentential unit to compute the similarity score between the given aspect and the given sentence or sub-sentential unit.
2. Carry out step 1 for every aspect in the claims and every sentence or sub-sentential unit in the description. This process is computationally efficient because every individual step consists of simple comparison operations that are cheap to execute.

The particular similarity function used to combine the maximum similarity found in step 1.1 and the lexical similarity referenced in step 1.2 is:

Similarity(aspect, description segment) = element_similarity(aspect, description segment) * 0.5 + lexical_similarity(aspect, description segment) * 0.5

The function element_similarity is the one used for comparing the important element of the invention to the analogous linguistic constructions in the description; lexical_similarity is a cosine-similarity based lexical overlap function. Since both aspect and description segments can contain much more text than just the element of the invention, the score effectively weights two aspects equally:

- Does a linguistic construction corresponding to the important element of the invention appear in the description segment? (element_similarity)
- Does the lexical material in the description segment correspond to the lexical material in the aspect? (lexical_similarity)
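The equal-weight combination can be sketched as follows. Whitespace tokenization and raw term counts are assumptions, since the report only says that lexical_similarity is based on cosine similarity. The element similarity is passed in as a number because its computation is not specified.

```python
import math
from collections import Counter


def lexical_similarity(aspect, segment):
    """Cosine similarity between bag-of-words vectors of the two texts."""
    a = Counter(aspect.lower().split())
    b = Counter(segment.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity(aspect, segment, element_similarity):
    """Equal-weight combination of element similarity and lexical similarity.

    element_similarity: the highest element-level similarity found for this
    aspect and segment, supplied by the caller.
    """
    return 0.5 * element_similarity + 0.5 * lexical_similarity(aspect, segment)
```

Because both weights are 0.5, a segment that mentions the element but shares little other vocabulary with the aspect scores at most around 0.5 plus a small lexical contribution.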

An example for an aspect in the claims and a sub-sentential unit having a high similarity:

- Sub-sentential unit: “The sensor may be mechanically, pneumatically, or electrically coupled to the flow restrictor in any manner such that”
- Aspect: “a flow restrictor coupled to the sensor”
- Similarity value: 0.825

Note that in both the aspect and the sub-sentential unit, the flow restrictor, which is the important element of the invention in the aspect, plays a significant role. The sub-sentential unit provides additional information regarding how the flow restrictor can be coupled to the sensor. This is exactly the type of relation between aspect and sub-sentential unit that is wanted, and the score shows that it is possible to find segments in the description that elaborate on a claim aspect.

Having explained the computation of claim segments and their alignment with parts of the description, we now give an overview of the complete process of claim-segment alignment:

1. Compute the cosine similarity between claims and paragraphs in the description and mark a paragraph as related to a given claim if its cosine similarity to the claim is above a defined threshold.
2. Segment the claims into aspects.
3. Mark the elements of the invention on which the aspects focus. Such an element is called a “part of the invention”.
4. Align the important elements of the invention with their corresponding linguistic structures in the description.
5. Align the aspects in the claims with the sentences or sub-sentential units in the description to which they have a high similarity value.

The first step yields the basic output of claim-segment alignment for the purpose of enabling a user to interactively navigate through the claims. Steps 2 to 5 primarily compute alignments and structures needed in abstractive summarization (see section Abstractive Summaries). Note however that the user can also access this alignment by defining a threshold which defines when a sentence should be considered a match for a claim aspect. This allows an explicit alignment between claim aspects and sentences which the user can use for interactively navigating the claims.

The claim-segment technology developed in the TOPAS project fulfills the goals formulated in the description of task 4.3 and even goes beyond those goals with the detection and markup of important elements of the invention, which can be used for purposes other than interactively navigating a claim or abstractive summarization (possible uses are keyword extraction or tracking the introduction of new terms in a field to see how the field develops). The utility of the alignment for an end-user can only be evaluated by concretely using the alignment in applications with results presented to an end-user, but it is reasonable to expect that useful products can be developed on the basis of the alignment technology.

Patent Summarization:

The TOPAS advanced summarization module first selects units (sentential or phrasal segments) from the patent under consideration based on a combination of a number of features and then aggregates the selected segments into one coherent and cohesive summary using deep linguistic techniques.

Selection of features for summarization:

The features take into account linguistic and semantic information as well as the structure of the patent to measure relevance. In particular, we consider annotations of type Mention as the basis for feature computation since they are approximations to noun phrases and therefore should refer to the concepts in the patent (e.g. the invention itself, the methods, components, etc.), but also consider more subtle ways of measuring relevance through semantic relations such as coreference and lexical chains.

Mention-based features:

Frequency-based Mention Relevance:

The first Mention feature we consider is frequency distribution. Given a patent and a list of target patent sections (e.g. abstract, claims, description), we compute the relative frequency of each Mention string in a section as the number of occurrences of the Mention divided by the total number of mentions in the section. These relevance figures are then transferred (copied) to the Mention annotations in a target annotation set which holds the content units to summarize. The names of the features generated are chunk_relevance_<section>, where <section> is the name of the section in which we compute the frequency of the Mention. For the final version of the summarizer, the features computed for each mention are chunk_relevance_description, chunk_relevance_abstract and chunk_relevance_claims, which correspond to the frequencies in the three main parts of the patent (i.e. the description, the abstract, and the claims section).
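A minimal sketch of the frequency feature follows, assuming that mentions are compared by their exact string and that a Mention absent from a section receives 0.0 (the report does not say how unseen mentions are handled). The function names and the dictionary layout are hypothetical.

```python
from collections import Counter


def chunk_relevance(section_mentions):
    """Relative frequency of each Mention string within one section."""
    total = len(section_mentions)
    if total == 0:
        return {}
    return {mention: n / total for mention, n in Counter(section_mentions).items()}


def annotate(target_mentions, sections):
    """Attach chunk_relevance_<section> features to each target Mention string.

    sections: mapping from section name to the list of Mention strings in it.
    """
    tables = {name: chunk_relevance(ms) for name, ms in sections.items()}
    return [
        {"string": m, **{f"chunk_relevance_{name}": t.get(m, 0.0) for name, t in tables.items()}}
        for m in target_mentions
    ]
```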

Coreference-based Mention Relevance: There are several ways in which a concept can be referred to in discourse; for example, mentions such as “the engine stopping signal” and “the engine” might refer to the same entity in the patent although their surface forms differ. The association of these different string mentions by way of coreference relations (i.e. identity) makes it possible to measure the relevance of a concept in the patent. For each mention in the patent we identify the coreference chain it belongs to and consider the set of mentions in the chain as referring to the same concept. The mention is annotated with a relevance feature (corefRel) whose value is the number of mentions in the chain divided by the number of mentions in the patent (concept relative frequency). We also annotate the mention with a feature indicating the position of the mention in the chain and a location relevance feature (corefRelPos) which decreases with the position of the mention in the chain (i.e. mentions appearing earlier in the chain are considered more relevant). The formulas used for computing these features are:

corefRel= (number of mentions in chain)/(number of mentions in patent)

corefRelPos= 1/√(position in chain)
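As a small worked example of the two formulas, the following Python sketch computes both features for one chain. It assumes that positions in the chain are counted from 1 (the report does not state the numbering, but a position of 0 would make corefRelPos undefined); the function name and the dictionary layout are illustrative only.

```python
import math


def coref_features(chain, total_mentions):
    """Compute corefRel and corefRelPos for each mention of one chain.

    chain is the ordered list of mention strings in the coreference chain;
    total_mentions is the number of mentions in the whole patent.
    Positions are assumed to start at 1.
    """
    coref_rel = len(chain) / total_mentions
    return [
        {
            "mention": mention,
            "corefRel": coref_rel,
            "corefRelPos": 1 / math.sqrt(position),
        }
        for position, mention in enumerate(chain, start=1)
    ]


feats = coref_features(
    ["the engine stopping signal", "the engine", "it", "the signal"], 20
)
```

With a chain of 4 mentions in a patent of 20 mentions, every mention gets corefRel = 0.2, while corefRelPos falls from 1.0 for the first mention to 0.5 for the fourth.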

Lexical Chain-based Mention Relevance: Another source of information about mentions we take into account in order to assess relevance is lexical relations which occur between mentions as in the following sentence fragment: “The conventional differential device comprises a housing that…” where a whole-part lexical relation is established between the mentions “the conventional differential device” and “a housing”. We identify the mentions involved in the relation and associate relevance scores to them which depend on the role they play in the relation. The mention playing the role of whole receives a weight of 1.0 while the mentions playing the role of part (or component) receive a weight of 0.5. These values are parameters of the system which can be modified. These relevance features will in theory make the content unit containing the related mentions more valuable for the summary. The feature computed in this case is lexRelevance.

Claim Structure-based Mention Relevance: We consider the structure of the claims in order to measure relevance. On the one hand we consider that mentions appearing in the claims section of a patent can be considered relevant (or more relevant than other mentions) and on the other hand we consider that mentions appearing in independent claims are more relevant than those appearing in dependent claims, and at the same time that mentions appearing at deeper levels in the claim structure are less relevant than those appearing at upper levels of the claim structure. The feature computed is called chunkRel and its value is 1/L where L is the level of the claim the mention appears in (independent claims are at level 1 while dependent claims can be at levels 2, 3, etc.). The value of the feature is first computed for the mentions appearing in the claim section and then transferred to the rest of the mentions in the patent. For transferring the values we use as key the superficial form of the mention. A mention in a non claim section will receive the maximum possible chunkRel value of a mention in the claims with identical superficial form.
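The computation and transfer of chunkRel can be sketched as follows. The input layout (pairs of surface form and claim level) and the default value of 0.0 for mentions whose surface form never occurs in the claims are assumptions; the report does not say what value such mentions receive.

```python
def chunk_rel_by_form(claim_mentions):
    """Return the maximum chunkRel (1/L) observed for each surface form.

    claim_mentions is an iterable of (surface_form, claim_level) pairs,
    where level 1 is an independent claim and 2, 3, ... are dependent claims.
    """
    best = {}
    for form, level in claim_mentions:
        best[form] = max(best.get(form, 0.0), 1 / level)
    return best


def transfer(best, other_forms, default=0.0):
    """Copy chunkRel to mentions outside the claims, keyed by surface form.

    The default for forms unseen in the claims is an assumption.
    """
    return {form: best.get(form, default) for form in other_forms}


claims = [("a housing", 1), ("a housing", 3), ("a valve", 2)]
best = chunk_rel_by_form(claims)
description = transfer(best, ["a housing", "a valve", "a spring"])
```

Here "a housing" appears at levels 1 and 3, so mentions of it in the description receive the maximum value, 1.0.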

As our objective is to score content units based on Mention relevance and given that each content unit may contain more than one mention we have devised an aggregated score for each individual mention relevance feature computed. The aggregator sums up the values of a feature and smoothes this value in order to balance the weight of units which contain many mentions vs those containing only a few. The exact formula for computing an aggregated value for feature F in a content unit is:

aggregatedF = \frac{\sum_{i=1}^{n} F_i}{\sqrt[k]{n}}

Where n is the number of mentions in the content unit, Fi is the feature value of mention i, and k is a smoothing factor (a parameter of the system that has been set to 2 for the summarizer). The computed aggregated features are the following: aggrChunkRel (for feature chunkRel), aggrCorefRel (for feature corefRel), AggrLexRel (for feature lexRelevance), aggrRelAbstract (for chunk_relevance_abstract), aggrRelClaims (for chunk_relevance_claims), aggrRelDESC (for chunk_relevance_description).
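A minimal sketch of the aggregation follows. It reads the denominator of the formula as the k-th root of n, which reduces to the square root of n for k = 2; this reading is an interpretation of the formula as printed, and returning 0.0 for a unit without mentions is a further assumption.

```python
def aggregate(values, k=2):
    """Sum the feature values of a content unit and damp by the k-th root of n.

    values holds one feature value per mention in the content unit.
    An empty unit is given 0.0 (an assumption, not stated in the report).
    """
    n = len(values)
    if n == 0:
        return 0.0
    return sum(values) / n ** (1 / k)


# Four mentions with lexRelevance weights 1.0, 0.5, 0.5 and 1.0.
aggr_lex_rel = aggregate([1.0, 0.5, 0.5, 1.0])
```

The plain sum for this unit would be 3.0; dividing by the square root of 4 gives 1.5, so a unit with many mentions gains less than proportionally.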

Content unit relevance features

Claim alignment similarity features:

The process of alignment of content units from the description (sentences or partial sentences) with claims produces two quantities: score_rank_1 and score_rank_2 which indicate the degree of association of the content unit with the claims (score_rank_1 indicates the maximum alignment score with a claim while score_rank_2 indicates the second best alignment score with a claim). We take these numbers as features to measure content relevance with respect to individual claims as opposed to the whole claim section that we consider in another feature.

Length features:

Long content units with many mentions could receive an artificially high weight, which would put units composed of few mentions at a disadvantage. We have therefore designed a feature that measures the length of the content units. This feature is inversely proportional to the number of mentions contained in the unit. The feature name is lengthRel and its value is computed as 1/N where N is the number of mentions in the content unit.

Semantic features:

Having access to the semantic-based segmentation of the patent (background, embodiments, summary of the invention, description of drawings, etc.) is very useful in order to inform the summarization process. It is well known that in specific textual genres knowing the type of information conveyed in a sentence is important to decide on its inclusion in the summary. We have designed a family of features to indicate whether or not content units belong to specific segments in the patent. Thus features relSeg<section> have a value of 1.0 if the content unit belongs to section <section>, while a value of 0.5 indicates that the content unit does not belong to section <section>. We have computed the following features: relSegBKGR (for the background art section), relSegDRW (for the description of drawings section), relSegEMBD (for the preferred embodiments section), relSegSUMM (for the summary of the invention section), and relSegFIELD (for the field of invention section).

Invention feature:

Through the segmentation process we are able to identify the content unit that introduces the method or apparatus of the invention being reported. The feature explanation_rel has the value 1.0 if the content unit contains the invention and 0.0 otherwise.

Classical features:

Apart from mention- and content-oriented features, we use a number of features that are also used in classical extractive summarization. Many of these features are based on the computation of similarities between different units in the patent. In order to compute similarities, the units to be weighted (sentences, partial sentences, whole sections, etc.) are represented as vectors of terms and weights, where the terms used for representation are the stem forms of the strings in the patents (with stop words and other meaningless units removed using specific tables). The weight of each term in the representation is its term frequency multiplied by its inverted document frequency (tf * idf), where the inverted document frequencies were obtained from a collection of patents at the beginning of the project. The similarity between units is calculated as the cosine of the angle between two vector representations.
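The vector representation and the cosine similarity can be sketched in a few lines of Python. The idf table below contains made-up values, and stemming and stop-word removal are assumed to have been done beforehand; terms missing from the idf table are given a weight of 0.0, which is an illustrative choice.

```python
import math
from collections import Counter


def tfidf_vector(stems, idf):
    """Represent a unit as {stem: term frequency * inverted document frequency}."""
    tf = Counter(stems)
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical idf values for three stems.
idf = {"brake": 2.0, "vehicl": 1.0, "fluid": 3.0}
unit = tfidf_vector(["brake", "vehicl", "brake"], idf)
abstract = tfidf_vector(["brake", "fluid"], idf)
abs_sim = cosine(unit, abstract)
```

The same cosine function would serve for abs_sim, title_sim and claims_sim; only the second vector changes.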

The computed similarity features are:

- abs_sim for the similarity of the content unit to the author abstract
- title_sim for the similarity of the content unit to the title of the patent
- claims_sim for the similarity of the content unit to the claims section of the patent

The other features computed are:

- position_score for the position of the content unit in the patent, whose value is (n-pos-1)/n, where n is the number of units to summarize and pos is the absolute position of the unit in the patent.
- tf_score for the term distribution, which is the sum of the tf*idf values of the terms in the content unit
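The two remaining classical features can be written down directly. The sketch below assumes that pos is counted from 0 (the report does not state this), so that the first unit of a patent with 10 units scores 0.9 and the last one 0.0; the idf values are invented for the example.

```python
def position_score(pos, n):
    """(n - pos - 1) / n, with pos assumed to be counted from 0."""
    return (n - pos - 1) / n


def tf_score(stems, idf):
    """Sum of tf*idf over the terms of a unit.

    Adding the idf of every token occurrence equals summing tf * idf
    over the distinct terms.
    """
    return sum(idf.get(term, 0.0) for term in stems)


first = position_score(0, 10)
last = position_score(9, 10)
weight = tf_score(["brake", "brake", "fluid"], {"brake": 2.0, "fluid": 3.0})
```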

Setting up the selection of the segments

For selection (and thus summarization), all units are weighted using the following formula that implements linear regression:

score(U) = \sum_{i=0}^{l} W_i \cdot F_i

Where U is a content unit, Fi is the value of feature i and Wi is the weight associated with feature i. In order to decide which weight to use for each feature, we use training data and a linear regression procedure.

We rely on a set of patents for which professional abstracts produced by Thomson (Derwent abstracts) are available. For each content unit in the patent we measure the similarity of the content unit to the Derwent abstract using cosine similarity and annotate the content unit with this value. We assume that a content unit which is “similar” to a Derwent abstract will be worth extracting, and we therefore want to use the features to approximate this similarity. Each patent is then processed with the TOPAS components in order to compute all summarization features reported in this document, and the feature values are extracted to create an ARFF file for identifying the most appropriate weights for the features. We use the standard WEKA Linear Regression algorithm to estimate the feature weights from the ARFF file.
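To make the training data format concrete, the following sketch writes feature values and the similarity target into ARFF text. The relation name, the target attribute name derwent_sim and the choice of features are hypothetical; placing the target as the last attribute follows a common WEKA convention and is an assumption about the actual setup.

```python
def to_arff(relation, feature_names, rows, target="derwent_sim"):
    """Serialize rows of (feature dict, target value) as ARFF text.

    All attributes are declared numeric; the target comes last.
    """
    lines = ["@relation " + relation, ""]
    for name in list(feature_names) + [target]:
        lines.append("@attribute " + name + " numeric")
    lines += ["", "@data"]
    for features, target_value in rows:
        values = [features[name] for name in feature_names] + [target_value]
        lines.append(",".join(repr(float(v)) for v in values))
    return "\n".join(lines) + "\n"


arff = to_arff(
    "topas_units",
    ["abs_sim", "lengthRel"],
    [({"abs_sim": 0.5, "lengthRel": 1.0}, 0.25)],
)
```

Each data line holds one content unit: its feature values followed by its cosine similarity to the Derwent abstract.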

The weights obtained by linear regression are used in our final content unit scoring mechanism in order to weight units and select the heaviest ones.
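The scoring and selection step can be sketched as a weighted sum followed by a top-N cut. The weights and feature values below are invented, the selection budget is a hypothetical parameter, and returning the chosen units in document order is an assumption about how the selection is presented.

```python
def score(features, weights):
    """Weighted sum of feature values; features without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def select_top(units, weights, budget):
    """Pick the `budget` heaviest units and return them in document order."""
    ranked = sorted(units, key=lambda u: score(u["features"], weights), reverse=True)
    return sorted(ranked[:budget], key=lambda u: u["position"])


weights = {"abs_sim": 2.0, "lengthRel": 0.5}  # hypothetical regression output
units = [
    {"id": "u1", "position": 0, "features": {"abs_sim": 0.5, "lengthRel": 1.0}},
    {"id": "u2", "position": 1, "features": {"abs_sim": 0.25, "lengthRel": 0.5}},
    {"id": "u3", "position": 2, "features": {"abs_sim": 1.0, "lengthRel": 0.25}},
]
selected = [u["id"] for u in select_top(units, weights, budget=2)]
```

The three units score 1.5, 0.75 and 2.125 respectively, so the first and third are kept.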

The summarization module assumes that all semantic and linguistic information as well as the segmentation in partial sentences and alignment have been produced. Most importantly, Mention annotations are assumed to be present in the default annotation set of the patent. The summarizer creates several annotation sets during its operation, most importantly, the annotation set called description_claim to store partial sentences - from descriptions and claims - and Mentions. All features for the mentions and for the content units are stored in this annotation set. The summarization features for partial sentences can be seen in the following figure.

Fig. 1

The following two figures show partial sentences selected for a summary.

Fig. 2 and 3

The extractive summarization module is basically language independent, as it relies on the results of the previous modules that perform all the linguistic analysis. Tf_score is the only feature that is language dependent, as it needs a dictionary of word frequencies. This dictionary can be obtained from a set of patents in the target language already tokenized, and does not need any human action.

The extractive summarization module produces as output a list of selected segments or sentences.

When the output consists of sentences, the resulting summary is fluently readable, even though it can contain references to elements not mentioned before. However, sentences are as long as they are in patents, and it may be difficult to adjust the content of the summary when sentences are so long. When the output consists of segments, these are incomplete and unconnected portions of text that may be difficult to read fluently. For this reason they need to be aggregated and processed to become a coherent discourse.

Linguistic processing of selected segments:

For the aggregation of the selected segments into a coherent and cohesive text, the summarization module uses the following input:

- the full sentences which contain the selected segments, that is, not only the segments for the final summary but also the context they appear in;
- the order between the words and between the segments;
- the syntactic dependencies between the words;
- the morpho-syntactic features associated to each word (part-of-speech, number, gender, lemma);
- the origin of the segments (claims or description) and their relevance score;
- the type of the invention described in the patent (apparatus or method); each mention of the invention through the patent is marked;
- a unique identifier for each word, chunk, lexical chain, segment, and sentence;
- all detected chunks and the lexical chains they are involved in (if any);
- the antecedent of the segments (if any).

The generation module comprises two types of resources: (i) mapping grammars, and (ii) dictionaries, which are used to transform the input segments into a readable text (i.e. full-fledged sentences), in several steps. This task is handled as graph transduction thanks to a new transducer currently being developed at UPF. This transducer is an improved implementation of the MATE platform (Bohnet and Wanner 2010) already used in previous projects. It allows for editing grammars and dictionaries (as well as graphs) and for applying them to the input graphs.

The mapping grammars are collections of rules that identify configurations in the input structure and perform adequate changes (e.g. conjugate a verb, substitute a pronoun by its antecedent, etc.). The rules use all the information which is present in the input structure: syntactic relations, morpho-syntactic features, word order, word labels themselves, etc. The grammars apply in a sequence: each grammar creates a structure that is taken as input by the next one. This allows great flexibility in the application of the transformations to be performed: a first grammar identifies every change that could possibly be made; a second grammar chooses which change(s) to perform, ensuring that no contradictory changes are performed on the same bit of the text; finally, a third grammar actually performs the chosen change(s). This sequential approach also allows for having other modules introduced in between the existing grammars, if necessary.

The first type of change is paraphrasis, which can be triggered by: (i) the presence of an antecedent marker in the structure; indeed, when segmentation is performed, some segments are left incomplete but an antecedent is sometimes identified and stored in the structure (antecedent-based paraphrasis); (ii) the presence of an explanation marker in the structure; the explanation feature indicates if an element is one of the mentions of the invention in the patent, and if this element is a device or a method (explanation-based paraphrasis); (iii) the syntactic configuration of the selected segment (syntactic-based paraphrasis); (iv) the configuration of the lexical chains in the segment (coreference-based paraphrasis).

Antecedent-based paraphrasis consists in introducing a noun or a noun group in a segment which needs it in order to be complete. If a segment starts with a gerund or a bare infinitive with no syntactic governor, the antecedent is introduced and the verb is inflected so as to agree with the antecedent. If the antecedent-bearing segment starts with a relative clause, the relative pronoun is simply replaced with the antecedent.

- (i.a) [contain a signal processing unit]Ant=device
- (i.b) [containing a signal processing unit]Ant=device
- (i.c) [which contains a signal processing unit]Ant=device

If the segment starts with a past participle, the copula be is also introduced in order to create a passive sentence.

- (i.d) [contained in a rectangular device]Ant=unit

All these transformations happen if and only if the syntactic structure of the segment is connected. If not, it means that the syntactic analysis failed, and it is not possible to manipulate the segment's elements anymore. In this case, we introduce the segment with a generic sentence One feature of the followed by the antecedent. For example, if the segment exemplified in (i.b) is not analyzed correctly by the syntactic parser, it will be paraphrased as One feature of the device: containing a signal processing unit.

Explanation-based paraphrasis simply consists in adding a sentence for introducing an invention or a part of an invention in the summary. The first mention of the invention in the patent is introduced by What is claimed is. The next mentions which have been marked as components (of a device) or steps (of a method) are respectively introduced by The device contains and The method consists in. If the sentence does not mention a device or a method, the segment is introduced by the phrase “The invention covers …”.
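The choice of the introductory phrase can be pictured as a simple lookup. In the sketch below the marker names ("first", "component", "step") are hypothetical labels for the information carried by the explanation marker, and the plain string concatenation stands in for what the grammars do on graph structures.

```python
# Introductory phrases quoted in the text; marker names are hypothetical.
INTRODUCTIONS = {
    "first": "What is claimed is",
    "component": "The device contains",
    "step": "The method consists in",
}
DEFAULT_INTRODUCTION = "The invention covers"


def introduce(segment, marker=None):
    """Prefix a segment with the phrase matching its explanation marker."""
    phrase = INTRODUCTIONS.get(marker, DEFAULT_INTRODUCTION)
    return phrase + " " + segment + "."


sentence = introduce("a signal processing unit", "component")
```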

Syntactic-based paraphrasis only applies to segments to which explanation-based and antecedent-based paraphrases cannot apply. This module looks for segments which look incomplete, based exclusively on their syntactic structure. It performs the same kind of changes as antecedent-based paraphrasis, that is, inflecting some non-finite verbs (gerunds and infinitives), combining past participles with be and adding introductory groups to some segments. It also removes some relative pronouns and replaces the determiner said with its regular form the.

- (ii.a) [a device (for) containing a signal processing unit]
- (ii.b) [a device to contain a signal processing unit]
- (ii.c) [a device which contains a signal processing unit]
- (ii.d) [a unit contained in a rectangular device]
- (ii.e) [a unit for the shading of the light]

Around 20 syntactic configurations are covered by the final system of rules, which can be improved easily with any kind of new configuration.

Finally, coreference-based paraphrasis uses the information encoded as chains of coreferring elements in order to introduce some pronouns or discourse markers which make the final text more fluid. For example, if after the application of the three above-mentioned types of paraphrasis, two consecutive sentences have syntactic subjects which are part of the same coreference chain, the second subject is pronominalized, and sometimes an adverb is added to the sentence.

(iii.a)
The devicei contains a signal processing unit […].
The devicei is located next to the shading unit […].

(iii.b)
The devicei contains a signal processing unit […].
The devicei contains a shading unit […].

The second type of change performed by the grammars is filtering, which applies only when none of the three types of paraphrasis can apply. The objective of this module is to remove words or sentences in order to correct or eliminate ungrammatical sentences.

Word-based filtering aims at removing some words used as delimiters during segmentation which were left at the end of some selected segments. This means that the whole beginning of the sentence is kept in the final summary. The words or groups of words which are filtered are the following: when, wherein, such that, characterized in that, gerunds and the coordinating conjunction and.
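A string-level approximation of word-based filtering is sketched below. It only covers the fixed delimiters listed in the text; gerunds are left out because detecting them requires part-of-speech information, and the real rules operate on graph structures rather than on raw strings.

```python
# Delimiters named in the text, longest first; gerunds need POS tags and are omitted.
TRAILING_DELIMITERS = ("characterized in that", "such that", "wherein", "when", "and")


def strip_trailing_delimiters(segment):
    """Remove delimiter words left at the end of a selected segment."""
    text = segment.rstrip(" ,;:")
    changed = True
    while changed:
        changed = False
        lowered = text.lower()
        for delimiter in TRAILING_DELIMITERS:
            # Require a preceding space so that words like "brand" are not cut.
            if lowered == delimiter or lowered.endswith(" " + delimiter):
                text = text[: len(text) - len(delimiter)].rstrip(" ,;:")
                changed = True
                break
    return text


cleaned = strip_trailing_delimiters("a housing fixed to the frame, wherein")
```

A delimiter in the middle of a segment, such as the conjunction in "a brake and a valve", is left untouched.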

Segment-based filtering aims at removing entire segments which are identified as being ungrammatical, mainly because they are incomplete. A segment is considered incomplete if it begins with: (1) a relative pronoun without the noun it modifies; (2) a finite verb with no subject; (3) a coordinating conjunction without the first conjunct; (4) a non-finite verb (infinitive, gerund or past participle) which is at the same time the syntactic root of the segment and has no first argument in it; (5) a noun group directly followed by a coordinating conjunction and a verb. A segment is also filtered out if it does not contain any finite verb which is not subordinated, if it ends with a modal verb, a conjunction or a determiner alone, or if it contains incomplete constructions, such as the preposition between not followed by a coordination or an element expressing duality. For instance, all the segments shown in (i.a-d) would be filtered out if they had no antecedent feature.

Sentences which cannot be eliminated or paraphrased are left as they are in the final summary, which is why in spite of the good coverage that the grammars provide, some sentences can look awkward or incomplete.

Dictionaries contain language-specific information that allows for generating the right words and their correct form in the summary without multiplying the number of rules in the grammar.

The first kind of dictionary we use in TOPAS describes some parts of the syntax and the generic combinatorial properties of a particular language. For instance, some rules introduce a definite determiner, a copula, or a coordinating conjunction. We store these words in the dictionary and can retrieve them easily with a rule.

language{
  id {
    name = english
    iso = ENG}
  syntax {
    pronominalize = yes
    verbform = analytic
    coord_conj = "and"
    copula = "be"
    determiners = {
      definite=the
      indefinite=a}
    exist = {
      singular = "There is"
      plural = "There are"}}

The second kind of dictionary we use is actually an interactive morphological dictionary. It is a two-level automaton which builds the inflected form of a word and its morpho-syntactic features. For instance, consider the copula be with its part-of-speech (verb), its mood (indicative), its tense (present), its number (plural) and its person (3rd): be<V><IND><PRES><PL><3>. The automaton receives this information and returns the inflected form. It is able to inflect regular and many kinds of irregular verbs, as well as nouns, determiners, adjectives, etc.

As already mentioned, dictionaries are language specific. The syntactic dictionaries of German and French have exactly the same structure as the English one, only the words have been translated; see e.g. the French dictionary:

language{
  id {
    name = french
    iso = FRE}
  syntax {
    pronominalize = yes
    verbform = analytic
    coord_conj = "et"
    copula = "être"
    determiners = {
      definite=le
      indefinite=un}
    exist = {
      singular = "Il y a"
      plural = "Il y a"}}

The morphologic automata are different for English, German and French, since the inflection patterns are also highly language-specific. Only the dictionaries of the requested language are loaded for one particular generation.

However, the grammars are widely multilingual. Paraphrasis rules are based on the syntactic structure of the segments and on features introduced during segmentation and lexical chain detection, and this information is most of the time encoded the same way independently of the language. One case of divergence is that of the syntactic tags: since the training corpora are not the same in the three languages, the syntactic relations between the nodes are not necessarily named the same from one language to another. Fortunately, this is easy to solve in our grammars: if a syntactic tag in French or German differs from an English one, it is only a matter of allowing for matching several tags instead of just one in a particular rule. For instance, the rule condition ?Xl {c:"SBJ"-> c:?Yl} matches a part of the syntactic structure that is a word ?Xl which has a syntactic relation “SBJ” (subject) with another word ?Yl (the question mark ‘?’ defines a variable in our graph-transduction language). If the subject is named “subj”, as is the case in the French annotation, the rule can remain the same; only the condition has to be adapted: ?Xl ({c:"SBJ"-> c:?Yl} OR {c:"subj"-> c:?Yl}). For the other three types of paraphrasis (antecedent-, explanation- and coreference-based), the markers are exactly the same in the three languages, so the rules remain the same; but if a rule also uses some syntactic information, it has to be updated the same way as for SBJ/subj.

The previous paragraph shows inter-lingual divergence in the matching part of the rules. The other case of divergence between languages concerns the transducing part of the rules, in other words the part which introduces new information in the output structure. Obviously a copula or a determiner is different in English, German or French, so a rule that introduces such elements in the output structure has to take this into account. But since some information is encoded in a dictionary, the same rule can retrieve different values from different dictionaries. For instance, a rule which introduces a copula in the structure will retrieve the word be in the English dictionary, but the word être in the French dictionary (because they are located under the same path language-> syntax-> copula, see the dictionary samples above), depending on which dictionary was loaded, that is, depending on what language the user requested.
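The way one rule retrieves different words from different dictionaries can be illustrated with nested Python mappings. The dictionaries below only mirror a few entries of the samples shown earlier; the lookup function and the "->" path syntax are illustrative and do not reproduce the actual transducer API.

```python
# Partial copies of the English and French syntactic dictionaries.
DICTIONARIES = {
    "ENG": {"language": {"syntax": {
        "coord_conj": "and",
        "copula": "be",
        "determiners": {"definite": "the", "indefinite": "a"},
    }}},
    "FRE": {"language": {"syntax": {
        "coord_conj": "et",
        "copula": "être",
        "determiners": {"definite": "le", "indefinite": "un"},
    }}},
}


def lookup(dictionary, path):
    """Follow a path such as 'language->syntax->copula' through a dictionary."""
    node = dictionary
    for key in path.split("->"):
        node = node[key.strip()]
    return node


def copula_for(iso_code):
    """The same 'rule' yields a different word depending on the loaded dictionary."""
    return lookup(DICTIONARIES[iso_code], "language->syntax->copula")
```

Calling copula_for("ENG") yields be and copula_for("FRE") yields être, without any change to the rule itself.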

Finally, the word-based filtering rules have to be language specific, since it is necessary to target specific words, but the segment-based filtering rules can be shared among different languages because they rely mainly on syntactic configuration.

Here is a sample of an abstractive summary obtained after the full process of extracting the segments and turning them into a coherent discourse. The word “either” is the only leftover from segmenting the original sentence that should not be there; the rest of the text is highly fluent:

What is claimed is a vehicle brake equipment for a vehicle fitted with fluid suspension. The invention covers a first braking-force generating means for generating a braking force of a maximum value which is that required to effect maximum braking of the vehicle [either]. It also covers a fluid pressure responsive member to which an applied fluid pressure generates a force which is subtractive from the braking force generated by the first braking-force generating means. The vehicle is fully loaded and the output pressure being applied to the pressure responsive member of the second fluid-operated braking means. The total braking force exerted by the output member will be that appropriate for maximum braking of the vehicle in accordance with its degree. [...].

Evaluation of the summarization module:

In what follows, the evaluation of the selection of segments for the summaries and the linguistic post processing of these segments is presented.

The evaluation of the summarization module is a non-trivial task, as there is no unique “correct summary” and there are different aspects to evaluate. Different evaluations have therefore been carried out. First, a quantitative evaluation compared the selected segments to the contents of the Derwent abstracts. Then a qualitative evaluation of the summaries was performed, in which the users were asked to evaluate the selection of segments and how the segments were combined to construct a coherent discourse.

Quantitative evaluation of the selection of segments for summarization: This evaluation measures the overlap of the selected segments with the contents of the Derwent abstract. The result shows that the use of patent-specific features improves on the summaries that can be obtained using state-of-the-art general discourse features.

The qualitative evaluation of the selection of segments, together with the evaluation of the linguistic post-processing of the summary segments, was carried out by users from the SMEs, who answered a questionnaire. These users evaluated the selection positively and considered that the sentences are well ordered and that the summary can be well understood, while text fluency, readability, and sense fidelity were rated fairly well.

Operational platform:

At the beginning of the project, GATE was selected as a framework for the development of all the functionalities. GATE (Bontcheva et al. 2002) is a natural language processing framework that includes a set of NLP packages, customizable tools like JAPE, and an API that allows the development or integration of other processing units.

All the components in TOPAS share their information as annotations over the patent text that GATE manages. These components have been implemented as GATE modules, either developed from scratch or written as wrappers that adapt the input/output needs of an existing component to the GATE annotation API.

GATE can be run through a user interface (GATE-Developer) or embedded in another application. Using GATE-Developer, the RTD partners have written several pipelines, which combine different components with optimized parameters, and saved them as GATE “gapp” files. A gapp file is thus the specification of the sequence of steps to perform for a desired analysis, including the location of all the resources needed.

The integration of TOPAS in any existing system is based on the integration of GATE. Once GATE is embedded in a system, executing the TOPAS gapps over a patent is straightforward. Some sample programs were written at the beginning of the project to demonstrate this integration and to show the SMEs how to perform it and obtain the results. As all components are executed within GATE, the results of any component are accessed through a single, common API: that of GATE.

Next, an explanation of how the different components have been integrated into GATE is given.

Integrating the Basic Natural Language Processing Tools:

The advanced linguistic processing starts from a text that has been tokenized, segmented into sentences and parsed. GATE already had a tokenizer (the ANNIE tokenizer) and a sentence splitter (OpenNLP (ASF 2013)); for parsing, a wrapper was written around Bohnet’s parser (Bohnet 2011). All of them were adapted to the patent domain.

A Nominal Chunks detector was developed as a GATE component, using as input the dependencies. The development of an in-house chunker was done because existing chunkers do not provide embedded chunks or are not prepared to detect chunks as long as the ones that can appear in patents.

Integrating the Lexical Chain Recognition Module:

The adapted Stanford Coref is integrated as a GATE plugin in the project platform. The plugin reads the annotations found in the input GATE document and executes the coreference module using the information contained in them. The resulting lexical chains are annotated in the GATE document.

More specifically, the plugin reads the token and sentence annotations from the document and converts them to token and sentence annotations in Stanford format, incorporating information such as form, lemma and POS tag. The nominal chunks annotated in the GATE document are converted to a shallow constituency structure in Penn Treebank bracketed annotation format and passed to the coref module along with each sentence annotation. The adapted Stanford Coref module processes and annotates each sentence, one at a time.

Converting the chunks into a constituency structure guarantees that the mentions detected by the coref module will match the chunks, given that the mention detection stage takes as input the NPs found in the constituency parse of each sentence. Once a sentence has been processed, the plugin iterates over the resulting lexical chains, finding the corresponding chunk for each mention in the chain and adding information to the chunk annotation in the GATE document about what chains it is part of and their type. The set of chains is filtered so that only chains with two or more mentions are annotated and only top-level mentions (not embedded into other mentions) are considered.
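The final filtering of the chains can be sketched as follows. Mentions are represented here as dictionaries with character offsets, which is a hypothetical layout; the order of the two filters (embedded mentions are dropped first, then chains with fewer than two remaining mentions) is also an assumption, since the report does not specify it.

```python
def is_embedded(mention, all_mentions):
    """True if the mention lies strictly inside the span of another mention."""
    span = (mention["start"], mention["end"])
    return any(
        other["start"] <= mention["start"]
        and mention["end"] <= other["end"]
        and (other["start"], other["end"]) != span
        for other in all_mentions
    )


def filter_chains(chains):
    """Keep top-level mentions only, then keep chains with two or more mentions."""
    all_mentions = [m for chain in chains for m in chain]
    kept = []
    for chain in chains:
        top_level = [m for m in chain if not is_embedded(m, all_mentions)]
        if len(top_level) >= 2:
            kept.append(top_level)
    return kept


chains = [
    [{"start": 0, "end": 10}, {"start": 50, "end": 60}],
    [{"start": 2, "end": 8}, {"start": 70, "end": 80}],  # first mention is embedded
    [{"start": 90, "end": 95}],                          # singleton chain
]
kept = filter_chains(chains)
```

Only the first chain survives: the second loses its embedded mention and becomes a singleton, and the third was a singleton from the start.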

Integrating the Summarization Module

The summarization module is composed of different tools. On the one hand, there is the SUMMA component (Saggion 2008), already developed for GATE, which has been tuned and expanded to include patent-specific features in order to generate extractive summaries of patent segments. On the other hand, these segments are combined into a coherent discourse using MATE.

MATE (Bohnet and Wanner 2010), a graph and tree transducer, was used to perform the regeneration of the text. To this end, MATE was also integrated into GATE. The selected segments with the surrounding sentences to which they belong, the mentions (composed of nominal chunks) and the coreference chains are extracted from GATE to build a MATE graph. After processing, the resulting summary is included in the patent as a GATE annotation. It is important to note that extractive summaries consist of a selection of already existing text and can be annotated as a highlighting of the original text, whereas generative summaries produce new text that was not previously present with the same wording in the patent itself and therefore has to be stored as a feature of the global patent annotation.

Integration of TOPAS Technologies into the Environments of the SMEs

The TOPAS technologies have been integrated into the Working Environments of the SMEs. This integration is described in this section.

Integrating TOPAS into BS’ Environment

As a first step BS integrated and applied TOPAS technologies for augmenting patent documents in PDF format as they are usually published. From the technical point of view the augmentation developed by BS comprises two techniques: annotation for specific parts of the content and for a document as a whole.

The first is realized by means of the annotation feature provided by the PDF format standard. Annotations for the document in general are placed on new pages which are added to the PDF document afterwards. The annotation of parts of the content is done as follows:

- Text Recognition

The first step of the process covers the recognition of textual information and its subsequent alignment to the graphical representation. This is only needed if the PDF document does not include aligned textual information. Unfortunately many patent authorities still publish patent documents as image data. By means of OCR software textual information can be conveniently extracted and attached to the respective graphical objects. BS uses Abbyy Finereader for this task. An internal study suggested that this software provides high quality results especially with respect to technical formulas.

- Text extraction

Next, the text to be processed is extracted. Provided that a PDF document contains aligned textual information, every graphical object representing textual information should be assigned to the respective character string as well as to its location on the page in terms of coordinates. Based on this information, the number and dimensions of the columns are detected in order to determine the word order and to exclude other textual information such as headers, footers and line numbers. This is done by means of a statistical analysis of the local distribution of graphical objects on the page.

- Identification of sections and claims

When the relevant textual content is fully restored, the description part and the claims part are identified. The claims part undergoes further processing to frame each claim separately. For this, the numbers introducing each claim are used in particular.

- Conversion to XML

Since TOPAS requires structured patent documents, the textual information is then converted to XML format.

- TOPAS processing

Once the patent document is available in XML format, it can be passed to TOPAS for processing. As a result, TOPAS returns a GATE document, which is also based on XML but contains comprehensive additional data that semantically describe specific parts of the content. This XML document is encapsulated in the PDF.

- Alignment & Annotation in PDF

To be able to annotate specific parts of the original PDF document, it must be aligned with the GATE document generated by the previous processing step. For this, the character strings resulting from processing step 2 are located in the GATE document, since these strings are bound to graphical objects. The character strings (or parts of strings) that are annotated are collected and grouped by annotation type. Finally, the graphical objects representing the textual information are marked by means of PDF’s annotation feature. All PDF annotations are assigned to specific layers so that users can show and hide annotations by type on demand.
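The alignment step can be illustrated with a minimal sketch. It assumes that each extracted string carries a character offset and a bounding box, and that the annotations are stand-off spans with a type. The `Glyph` and `Annotation` classes and the function name are hypothetical and do not reflect the actual BS or GATE data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """One extracted string and its place on the page (hypothetical model)."""
    text: str
    start: int   # character offset of the string in the extracted text
    page: int
    box: tuple   # (x0, y0, x1, y1) in page coordinates


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: a type plus a character span (hypothetical)."""
    type: str
    start: int
    end: int


def group_boxes_by_type(glyphs, annotations):
    """Map each annotation type to the page boxes of the strings it covers."""
    layers = defaultdict(list)
    for ann in annotations:
        for g in glyphs:
            g_end = g.start + len(g.text)
            # Two half-open spans overlap if each starts before the other ends.
            if g.start < ann.end and ann.start < g_end:
                layers[ann.type].append((g.page, g.box))
    return dict(layers)


# Extracted text: "fuel cell at 80 kW"
glyphs = [
    Glyph("fuel", 0, 1, (10, 700, 30, 712)),
    Glyph("cell", 5, 1, (33, 700, 50, 712)),
    Glyph("at", 10, 1, (53, 700, 61, 712)),
    Glyph("80", 13, 1, (64, 700, 74, 712)),
    Glyph("kW", 16, 1, (77, 700, 90, 712)),
]
annotations = [
    Annotation("terminology", 0, 9),
    Annotation("measurement", 13, 18),
]
layers = group_boxes_by_type(glyphs, annotations)
```

Each key of `layers` would correspond to one PDF layer, so that all boxes of one annotation type can be shown or hidden together.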

The extension of the document by additional pages that summarize its content comprises in particular the following items:

Type of patent:

The type of patent is printed on a separate page. TOPAS identifies the type of a patent (artefact, process and/or substance) based on its independent claims.

Tag Clouds:

Tag clouds have become very popular for summarizing textual information. BS uses TOPAS to build multiple tag clouds for technical terms, qualities and measurement units, and attaches these tag clouds to the documents on separate pages.

Title & Summarizations:

Finally, there is another page providing a title and several summaries of the respective patent. The summaries differ in length.

Hence, steps one to four cover the pre-processing needed to transform original documents, even those without any textual information, into structured data so that a) TOPAS technology can be applied to them and b) the calculated annotations can be merged with the original document.

The screenshot below (Fig. 4) shows an annotated patent document in PDF format displayed in Acrobat Reader. In this example, terms relating to terminology, process, substance, physical measurement and technical quality are marked in different colours. Via a selection box placed at the top of the document, the user can select specific semantic classes. Only annotations of the selected classes are shown; all others are hidden. Additional information is provided for every annotation. A pop-up window presents this information when the user selects the yellow balloon attached to the annotation. For a measurement, for example, the balloon provides the recognized number or range and the unit separately.

Fig. 4
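To illustrate what presenting "the recognized number or range and the unit separately" can look like, the following is a minimal regex-based sketch. It is not the TOPAS measurement recognizer; the pattern, the function name and the returned dictionary layout are assumptions made for illustration only.

```python
import re

# Hypothetical pattern: a number, an optional upper bound, then a unit token.
MEASUREMENT = re.compile(
    r"(?P<low>\d+(?:\.\d+)?)"                       # first number
    r"(?:\s*(?:-|to)\s*(?P<high>\d+(?:\.\d+)?))?"   # optional upper bound
    r"\s*(?P<unit>[A-Za-zµ°%]+)"                    # unit token
)


def split_measurement(text):
    """Return the first measurement in text as low/high/unit, or None."""
    match = MEASUREMENT.search(text)
    if match is None:
        return None
    low = float(match.group("low"))
    # A single value is treated as a range whose bounds coincide.
    high = float(match.group("high")) if match.group("high") else low
    return {"low": low, "high": high, "unit": match.group("unit")}


single = split_measurement("a pressure of 2.5 bar")
ranged = split_measurement("an output of 10 to 20 kW")
```

A production recognizer would need a unit lexicon and context checks; this sketch only shows the separation of value and unit.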

In order to expose the passages of the description that explicate particular claims, the matching fragments are highlighted in a similar way at the user’s request. As an example, the following two screenshots (Fig. 5) illustrate this functionality for the first claim of a patent document.

Fig. 5

Another type of annotation is the so-called “mention” of a technical term. TOPAS determines which terms and pronouns reference the same entity, an entity that is part of another entity, or an attribute of an entity. To present this information adequately, a visualization highlights and connects associated terms. As an example, the screenshot below (Fig. 6) shows this feature for the term highlighted in green and connected by a green spline. The style of the line indicates the type of relation: a solid line connects all terms representing the same entity, a tightly dotted line connects two terms in an attribute-of relationship, and a sparsely dotted line indicates that term A can be semantically regarded as part of term B.

Fig. 6

As the screenshots indicate, BS successfully integrated TOPAS technology. The integration, however, turned out to be more complex than the result might suggest. In particular, considerable effort was needed to identify the correct order of text items, owing to the various layouts of patent documents in terms of line numbering, headers, footers, columns and paragraphs. Moreover, technical formulas with subscripted and superscripted items complicated this task.

As a second integration task, BS has meanwhile begun indexing the documents by means of PatOrg’s search engine, which is part of its document management system. For this, the indexer is currently being adapted to handle in particular the XML documents with the semantic information that are encapsulated in the PDF document (processing step 5).

In this regard, tagged chunks of text are stored in separate fields. On the one hand, this makes it possible to narrow a search to text fragments with a specific semantic type. For users, this is particularly useful with respect to document segments, so that, for example, the state-of-the-art part may be excluded from the search. On the other hand, faceted search can be realized. A search engine with faceted search shows, for a given search, the distribution of the data over one or more fields. For example, if a general search is run for documents containing the term “fuel cell”, the search engine would provide, in addition to the results, information on which materials are most often used in the retrieved patents.
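The two uses of separately stored fields can be sketched as follows. This is a minimal in-memory illustration, not the PatOrg indexer; the document layout (`segments`, `material`) and the function names are assumptions.

```python
from collections import Counter


def search(documents, term, exclude_segments=()):
    """Return documents containing term in any segment that is not excluded."""
    needle = term.lower()
    hits = []
    for doc in documents:
        texts = [
            text
            for name, text in doc["segments"].items()
            if name not in exclude_segments
        ]
        if any(needle in text.lower() for text in texts):
            hits.append(doc)
    return hits


def facet_counts(results, field):
    """Count in how many result documents each value of a field occurs."""
    counts = Counter()
    for doc in results:
        counts.update(set(doc.get(field, [])))
    return counts


# Hypothetical documents; "material" holds values from tagged text chunks.
documents = [
    {"id": "P1",
     "segments": {"background": "Known fuel cell stacks are heavy.",
                  "description": "A fuel cell with a platinum catalyst."},
     "material": ["platinum", "graphite"]},
    {"id": "P2",
     "segments": {"background": "Prior fuel cell designs exist.",
                  "description": "A battery housing."},
     "material": ["aluminium"]},
    {"id": "P3",
     "segments": {"background": "Membranes degrade over time.",
                  "description": "The fuel cell membrane uses platinum."},
     "material": ["platinum"]},
]

# Narrowed search: ignore the state-of-the-art (background) segment.
hits = search(documents, "fuel cell", exclude_segments=("background",))
# Facet: distribution of materials over the result set.
materials = facet_counts(hits, "material")
```

Here the narrowed search drops P2, whose only match lies in the background segment, and the facet reports how many of the remaining patents mention each material.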

Integrating TOPAS into IALE’s Environment:

The integration of TOPAS technologies by IALE is twofold: On the one hand, IALE internally benefits from using TOPAS technologies to support key tasks related to its consultancy services. On the other hand, IALE integrates TOPAS technologies into VIGIALE, its strategic environment watch web platform.

Integration of TOPAS technologies to support consultancy services:

Applying TOPAS content extraction technologies allows entities such as product and brand names, components of machines, substances, technical fields, processes, advantages, etc. to be identified in patents. When related to players (inventors, assignees) or countries in specific technology areas, this kind of information is useful for Competitive Intelligence purposes. It therefore significantly widens the analytical scope of some of the indicators considered in commonly performed studies and adds extra value to the resulting patent-based reports that IALE is now able to provide.

To achieve that, IALE has defined a set of strategies for combining the patent annotations provided by TOPAS with other patent metadata in order to obtain different kinds of indicators or insights for different purposes (e.g. analysing patents within a technology area or comparing portfolios amongst companies, as exemplified in the use cases described). For example, named entities and content relations are observed together with publication years in order to track their evolution over time. This yields relevant insights into which assignees use which materials, processes, techniques, etc. Physical measurement ranges can be used for clustering patents by scope of activity (e.g. in the energy sector, MW would relate to the energy generation level, such as nuclear, carbon or wind, whereas kW would more likely relate to the urban distribution level, smart grids, etc.).
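A minimal sketch of how annotations might be combined with metadata is given below. The record layout, the assignee names and the function names are hypothetical; they only illustrate counting entities per publication year and per assignee.

```python
from collections import Counter, defaultdict


def entity_trend(patents, entity_type):
    """Count mentions of each entity of one type per publication year."""
    trend = defaultdict(Counter)
    for patent in patents:
        for ent_type, value in patent["entities"]:
            if ent_type == entity_type:
                trend[patent["year"]][value] += 1
    return {year: dict(counts) for year, counts in sorted(trend.items())}


def who_uses_what(patents, entity_type):
    """Collect, per assignee, the set of entities of one type they mention."""
    usage = defaultdict(set)
    for patent in patents:
        for ent_type, value in patent["entities"]:
            if ent_type == entity_type:
                usage[patent["assignee"]].add(value)
    return dict(usage)


# Hypothetical annotated patents (entities as (type, value) pairs).
patents = [
    {"year": 2011, "assignee": "Company A",
     "entities": [("material", "platinum"), ("process", "sintering")]},
    {"year": 2012, "assignee": "Company A",
     "entities": [("material", "platinum"), ("material", "graphite")]},
    {"year": 2012, "assignee": "Company B",
     "entities": [("material", "graphite")]},
]

trend = entity_trend(patents, "material")
usage = who_uses_what(patents, "material")
```

Tables of this kind could then be exported to a visualization tool for charting.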

The analytical process is as follows. The set of patents under study is searched and retrieved from any original source (as long as the records contain claims, description and abstract) and processed with the TOPAS linguistic modules. The results are extracted and imported in suitable data formats into analysis and visualization software such as Matheo Analyzer, Tableau or KMX (support tools that IALE uses on a daily basis). The resulting indicators include: top players (applicants, inventors, countries, ...), publication dates (evolution over time), top topics, top substances, main measurements and measurement categories, top processes, the technical qualities most frequently mentioned, which technologies are used by whom (applicants, countries, etc.), which technologies (materials, processes, etc.) are associated with each other, how patents are grouped by technical field, which areas are covered by competing patents, and which patents share similar or related advantages (Fig. 7).

Fig. 7: Competitive Intelligence visualization of indicators obtained from TOPAS-processed patents

Integration of TOPAS technologies into VIGIALE:

The VIGIALE platform monitors patents, among other relevant information sources, for specific domains or technology areas. VIGIALE users can generate automatic reports, called Smartscans, that display the activity of the monitored sources for a given period. These reports contain concise, digested information related to the specific domain.

Fig. 8: TOPAS annotated patent listings in the new Smartscan page design

Before TOPAS, patents were simply listed in Smartscans with basic metadata such as title, number, applicant and abstract. The goal has always been to help the user gain a quick overall understanding of patent contents. With TOPAS, a further analytical step can now be offered. TOPAS linguistic annotations and summarization are incorporated into the patent listings. In the new Smartscan page layout (Fig. 8 above), this takes the form of four elements characterizing each monitored patent: i) an automatic summary, ii) the technical field sentence, iii) the main advantages and iv) the associated terminology. These four features are offered for each patent listed and are accompanied in the page layout by the statement “Analysis powered with TOPAS technology”. It is also planned to integrate hyperlinks leading to representative annotated parts of each patent.

The technology is integrated but not yet fully automated. One of the constraints is the amount of time currently required for generating summaries. Monitoring and processing are therefore not carried out together in real time; instead, patents are continuously monitored and sent for processing at the end of each monitoring period. The TOPAS module on VIGIALE consists of an API, currently being implemented, through which VIGIALE sends out the list of patents monitored in a given period and receives back the annotated and summarized patents, which are then displayed in the Smartscan when it is generated.
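The deferred, period-based processing can be sketched as follows. This is only an illustration of the batching idea; the class, its methods and the `fake_topas` stand-in are hypothetical and do not describe the actual VIGIALE API.

```python
class MonitoringPeriod:
    """Collects patents during a period and processes them in one batch."""

    def __init__(self, process_batch):
        # process_batch: callable taking a list of patent ids and returning
        # {patent_id: annotated_record}; a stand-in for the TOPAS API call.
        self._process_batch = process_batch
        self._pending = []
        self.results = {}

    def monitor(self, patent_id):
        """Register a patent seen during the period (no processing yet)."""
        if patent_id not in self._pending:
            self._pending.append(patent_id)

    def close_period(self):
        """Send everything collected so far as one batch; store the results."""
        batch, self._pending = self._pending, []
        if batch:
            self.results.update(self._process_batch(batch))
        return batch


def fake_topas(batch):
    """Placeholder for the slow summarization service (hypothetical)."""
    return {pid: {"summary": f"summary of {pid}"} for pid in batch}


period = MonitoringPeriod(fake_topas)
for pid in ["P1", "P2", "P1"]:   # the duplicate is registered only once
    period.monitor(pid)
sent = period.close_period()
```

Decoupling monitoring from processing in this way keeps the monitoring loop responsive while the time-consuming summarization runs once per period.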

Integrating TOPAS into IS’s Environment

IntelliSemantic is developing extensions of its solution MyIntelliPatent using TOPAS technologies. The extensions are intended to ease patent analysis by supporting both patent search and inspection. The first task comprises in particular the preliminary analysis of results and the consequent refinement of queries in order to locate patent sets that are potentially relevant. It usually takes multiple iterations to identify those patents. The second task covers the detailed analysis of patent documents, which in principle requires access to the whole content.

TOPAS technology significantly facilitates both patent search and inspection. As a tool for patent search, MyIntelliPatent can thus profit substantially from TOPAS technology. Within the scope of the project, IntelliSemantic integrated some TOPAS technologies that support users in these tasks.

As far as patent set analysis and refinement are concerned, IntelliSemantic significantly extended the textual presentation of search results currently available in MyIntelliPatent.

Without TOPAS technology, search results in MyIntelliPatent were previously shown only in terms of title, abstract and bibliographic data. By means of TOPAS’ segmentation and abstracting functionalities, results now additionally feature either “technical field”, “objective”, “background art”, “summary” or “automatic abstract”, according to the user’s choice. This kind of presentation makes it easier to evaluate search results because it provides quick and target-oriented access to the content. The following screenshots (Fig. 9, 10) illustrate this feature for the segments “technical field” and “background art”.

Fig. 9
Fig. 10

As the next integration step in this regard, IntelliSemantic intends to exploit the segmentation functionality for search as well, so that users can narrow a search to specific segments. When the patent searcher identifies a patent document in a search result that might be relevant, a detailed patent analysis usually follows. Leveraging TOPAS technology, MyIntelliPatent now provides more fine-grained navigation in patent documents, which so far were structured only into the general sections abstract, description, claims and figures. As the screenshot below (Fig. 11) shows, users can now navigate through the document by selecting specific segments using the buttons in the sidebar on the left. Moreover, there are two additional buttons for extracting citations and measurements that may be included in the document.

Fig. 11

Potential Impact:

Analysis of the impact achieved by TOPAS technologies within TOPAS’ lifetime

The achievements of the TOPAS Project already show a clear positive impact for both the SMEs and the RTD Performers of the Consortium. In accordance with the projection of the impact in the Work Plan, the impact is expected to further increase considerably when the  releases of the developed components are consolidated into and  releases within the first years after the life time of the project.

The impact for the SMEs has so far been of three kinds. Firstly, benefits have already arisen from the preliminary integration of selected TOPAS technologies into the existing product lines of the SMEs. For instance, BS makes the functionality of TOPAS technologies for patent analysis and summarization accessible to its clients via an interactive interface incorporated into Adobe PDF. So far, this feature has been launched as an illustration of the innovation drive of the company and as a demonstration of its reputation as one of the cutting-edge companies in patent processing software. The clients acknowledge the effect of this demonstration. With the consolidation and extension of the technologies in the time to come, the clients are expected to use this feature increasingly not only as an “add-on”, but as a central feature of their work with patents.

Secondly (and as expected in the Work Plan), the possibility of advertising TOPAS technologies as a future part of their product lines has already helped the SMEs to enter new market segments or to attract the attention of new potential clients in the same market segment. For instance, IS was invited to present TOPAS technologies at the Annual 2013 EPO Conference and at the Annual Meeting of the working group “Networks and Online Retrieval” of the Patent Documentation Group, which is the umbrella organization of the world’s largest companies with outstanding patent portfolios. The latter in particular generated valuable contacts for IS with potential clients from many different industry areas. Furthermore, IS is currently in negotiation with the EPO about a potential use of TOPAS technologies in the EPO’s tools.

Thirdly, the SMEs obtained through TOPAS an operational infrastructure for language-oriented document processing, based on the GATE development environment. This infrastructure will directly benefit any future activity of the individual SMEs in the area of patent processing.

The RTD Performers have already benefitted from TOPAS through an increase in their know-how and skills in the domain of patent processing.

Prospective impact of TOPAS technologies:

The impact achieved during the lifetime of the project is expected to intensify further during the first four years after its completion (see Work Plan). In particular, as already observed, the SMEs will benefit both from intensified business relationships with existing clients and from the acquisition of new clients in established and new market segments. This impact will be reflected in: (i) an increase in revenues, (ii) an increase in the share of the market sector they will dominate with their products, (iii) the intensified European collaboration they have built up with the other SMEs of the consortium, and (iv) support for consolidating their incipient entry into new international markets.

In the midterm, TOPAS will further enhance the competitiveness of the consortium SMEs by providing them with (i) the enabling technologies that allow future and broader applications of their existing products; and (ii) a competitive advantage through ready access for their staff to unique training in state-of-the-art natural language processing technologies, which forms the basis for their own incremental work on new applications exploiting these technologies.

The impact for the clients of the SMEs will be that they can ensure cost-effective, high-quality IPR protection in-house, avoiding expensive outsourcing and purchase of information or the use of low-quality free tools.

As already observed, the tremendous strategic advantage of TOPAS with respect to its introduction into the market is that it is not perceived as yet another tool in patent processing from yet another provider, but rather as a complement to established products of well-known players in the field, adding functionalities requested by their clients.

List of Websites:

www.topasproject.eu