Application of NLP in entity extraction from structured documents

This one's written by Igor Jovin, a machine learning engineer at Vivify Ideas with over three years of experience as a developer and data scientist. He also teaches data science, scientific computing and web programming as a teaching assistant at the Faculty of Technical Sciences in Novi Sad, Serbia.

One of the most common problems tackled in Natural Language Processing today is entity extraction: identifying specific named entity types, such as persons, organizations and locations, in a body of text. Extracting these entities helps the reader focus on the important parts of a text and quickly grasp what it is about, saving time.

While this problem has been tackled many times, there isn't really a general solution that works for all types of text content and can extract all entity types out of the box.

In this post, I will briefly demonstrate how to perform entity extraction on a specific type of structured text: legal text.

Let’s start with examples and look at a few key parts of a sample contract. 

This Agreement is made as of September 27, 2009 (the "START DATE"), between, LLC, a California limited liability company, located at 1361 5th Street, Suite 25, Santa Monica, CA 90401 ("COMPANY"), and MediaMarket, Ltd., a West Virginia corporation, located at 818 Santa Monica Blvd., Hollywood, CA 90046 ("CONSULTANT").

The arbitration shall be initiated in and take place in Los Angeles County, California, or any other place selected by mutual, written agreement. All costs incurred in connection with any proceedings shall be divided evenly between the parties.

Signed by:, LLC | MediaMarket, Ltd.

By: John Smith             By: Peter Johnson

Looking at the excerpts, which types of entities would be interesting to extract from such blocks of text? Well, people who look through them would most probably like to examine the start date of the contract, who the parties involved are, their addresses, the jurisdiction location and who signed the contract.

From a data scientist’s perspective, this looks like a problem that has been solved before: we need to extract dates, organizations, persons, addresses and a geographical location.

In the clause samples above we see typical occurrences of the entities we want: specific DATE, ORGANIZATION, PERSON, LOCATION and ADDRESS entities. These entity types are pretty common in most text content out there; they appear in blogs, reviews, Wikipedia articles, Facebook posts, etc. Considering that, it is a good idea to check out some pre-existing libraries and models that already implement the detection of such entity types and see whether they work for our domain.

Introducing spaCy

While searching for pre-existing models and libraries for entity extraction, you will most likely stumble upon spaCy. spaCy is a pretty amazing library for a wide variety of natural language processing tasks, such as text classification, POS (part-of-speech) tagging, and named entity recognition and extraction. Let’s explore how it works on our examples from the previous section:

For the first example:

[displaCy entity visualization of the first example]

For the second example:

[displaCy entity visualization of the second example]

For the third example:

[displaCy entity visualization of the third example]

Here I used a very neat tool embedded in spaCy itself, called the displaCy visualizer. With it you can visualize relationships between entities, a dependency parse tree (how the words in a sentence are connected and what their types are), POS tags, named entity tags and so on. Here I activated the entity visualization. Let’s see how this extraction and visualization is done in code:

[Code: entity extraction and displaCy visualization]

With just a few lines of code we could perform the extraction of (almost) all of our desired entities. Pretty good, huh? Let’s analyze exactly how spaCy works in our examples.

When we first import spaCy (after successfully installing it with pip), we need to load the pretrained model we want to work with. spaCy offers a wide variety of models for a number of different languages. Here we want to extract entities from English legal text, so we need a model trained on English text. In this example, I loaded the large spaCy English model (en_core_web_lg), which is actually a convolutional neural network trained on OntoNotes (an annotated dataset that encompasses different text sources, such as blogs, news and comments). This model assigns POS tags, a dependency parse and named entities, which is exactly what we want here.

The load function from spacy returns an object that contains the spaCy pipeline (all of the steps described in the previous sentence). Using this object, we can invoke the pipeline on our examples: we define the examples as strings and pass them to the pipeline as arguments. The return value is our examples as processed documents, with entities, POS tags and dependency parse tags assigned.

To filter out and display only our desired entities (without the displaCy visualizer), we need to print them out manually. We also see that some of the entities are broken into parts; to assemble them, we need to write a helper function. Let’s see how we can write a function that displays only the text and type of each entity, correctly merging entities using our custom rules:

[Code: merging entities using custom rules]
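A minimal sketch of such a helper could look like this (the connection_map entries and the function names are illustrative; the exact rules in a real system would differ):

```python
from itertools import tee


def pairwise(iterable):
    """Iterate over consecutive pairs: (s0, s1), (s1, s2), ..."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


# Which pairs of consecutive entity labels should be merged into one
# entity, and which label the merged entity gets. Illustrative entry:
# two adjacent ORG fragments are really one organization name.
connection_map = {
    ("ORG", "ORG"): "ORG",
}


def merge_entities(doc):
    """Return (text, label) tuples, merging entity pairs per connection_map."""
    ents = list(doc.ents)
    if len(ents) < 2:
        return [(e.text, e.label_) for e in ents]
    merged, skip_next = [], False
    for first, second in pairwise(ents):
        if skip_next:
            # The second entity of the previous pair was already consumed.
            skip_next = False
            continue
        label = connection_map.get((first.label_, second.label_))
        if label is not None:
            # The merged text runs from the start token of the first
            # entity to the end token of the second, so we slice the
            # document itself using the Span offsets.
            merged.append((doc[first.start:second.end].text, label))
            skip_next = True
        else:
            merged.append((first.text, first.label_))
    if not skip_next:
        merged.append((ents[-1].text, ents[-1].label_))
    return merged
```

Calling `merge_entities(nlp(text))` then yields the assembled (text, label) pairs instead of the raw, fragmented entity list.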

As a first step, we implemented a function that iterates pairwise through an iterable in Python (here, a list).

We used this function to iterate through the list of recognized entities, which we access via the .ents property of the processed document.

We go through the entity list and form tuples consisting of the entity text and entity label. As you can see, the label of an entity is accessed via its label_ property, which holds the exact label as defined in spaCy’s documentation (ORG, PERSON, etc.). If you want to see the full list of entity types spaCy supports, you can go to its documentation page.

If two entities should be connected, we refer to our connection_map, in which we defined which entities should be merged. Here we want to merge and LLC, as they are clearly parts of the same entity, but spaCy was unable to detect them as one.

To merge the entities, we use their start and end offsets: the text of the merged entity starts at the beginning index of the first entity and ends at the end index of the second. The entities we iterate through are spaCy Span objects (slices of the document we applied the pipeline to), so we have access to their start and end properties. With them, we can take a slice of the processed document, which gives us the text of the merged entity.

We see a pretty big disadvantage here, though: some of the entities are mislabeled, and no ADDRESS entity is recognized at all. Part of the mislabeled entities actually belongs to an address. This entity type is definitely useful in legal text, and we would like to be able to extract it. Let’s see how we can do that with spaCy.

Using spaCy to introduce new entity types

As we’ve seen in the examples above, an address entity type is not supported by spaCy. One way to support it is to write custom rules on top of the entity types that spaCy does support. We will do this by introducing a new rule in the connection_map. As we saw in the first example, our addresses are usually comprised of CARDINAL, FAC, LOC and GPE spaCy entities: CARDINAL is any number in a text, FAC stands for facility, LOC for location and GPE for geopolitical entity (countries, cities, states). Let’s write a rule to merge such entities:

[Code: rule for merging CARDINAL, FAC, LOC and GPE entities]
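As a self-contained sketch, an equivalent rule can collapse any run of consecutive address-like entities into a single ADDRESS entity (the label set and the two-entity threshold are assumptions, not spaCy built-ins):

```python
# Entity labels that, when they appear back to back, typically form an
# address in our legal text (e.g. "1361 5th Street, Suite 25, Santa
# Monica, CA 90401" = CARDINAL + FAC + GPE pieces).
ADDRESS_PARTS = {"CARDINAL", "FAC", "LOC", "GPE"}


def extract_with_addresses(doc):
    """Return (text, label) tuples, collapsing runs of two or more
    consecutive address-like entities into a single ADDRESS entity."""
    results, run = [], []

    def flush():
        if len(run) >= 2:
            # Slice the document from the first to the last entity of
            # the run to recover the full address text.
            results.append((doc[run[0].start:run[-1].end].text, "ADDRESS"))
        elif run:
            # A lone address-like entity is kept with its original label.
            results.append((run[0].text, run[0].label_))
        run.clear()

    for ent in doc.ents:
        if ent.label_ in ADDRESS_PARTS:
            run.append(ent)
        else:
            flush()
            results.append((ent.text, ent.label_))
    flush()
    return results
```

The same effect can be reached by adding pair rules to the connection_map; a run-based rule is just easier to show in one place.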

Let’s test how this works on our previous examples:

[Output: successfully extracted ADDRESS entities]

Seems like we succeeded in the extraction of addresses in this example.

While this kind of rule-based extraction will work well on a number of examples, it will also fail on quite a few: a rule-based approach cannot be expected to generalize well. To achieve high accuracy, we would need to build our own model for address extraction. That is of course tricky, because it requires a lot of manually labeled data, and it is especially tricky for addresses, whose structure can be very specific and varies widely.


Existing NLP libraries can definitely do the trick for building a quick prototype of a custom entity recognition and extraction system. spaCy is a great library, providing functionalities such as POS tagging, named entity recognition and dependency tree parsing, and it makes it easy to visualize every step of the process.

There are, however, many drawbacks to relying only on a library when building an accurate entity extraction system. The rules we’ve written for merging entities and extracting new entity types can only be partly reliable and accurate, and they generate many false positives.

To build a more reliable and robust entity extraction system, we would need to train our own models and use more state-of-the-art solutions. In the next part, I will go through some techniques, algorithms and tools you can use to build an entity recognition system from scratch, as well as how to extend pre-existing models and libraries to support new entity types.