Non-Named Entities – The Silent Majority

This is joint work by Pierre-Henri Paris and Fabian Suchanek (both at Télécom Paris, DIG team).

Non-Named Entities?

Human-written texts contain many noun phrases, which function as verb subjects and objects, as predicative expressions, and as complements of prepositions. When knowledge bases are built, the proper nouns in these noun phrases are extracted and become entities equipped with facts. Noun phrases that are not proper nouns are left unmapped in this process, so a vast amount of information is ignored. Consider the following example, which contains a named entity and two non-named entities:

« The Arab Spring resulted in contentious battle between a consolidation of power by religious elites and the growing support for democracy. »

The Problem

The previous example contains a lot of information that will not be extracted despite its importance. Indeed, the two non-named entities explain the Arab Spring event, but will not be mapped to the knowledge bases.

In this position paper:

  1. We manually analyze this phenomenon in 30 Wikipedia abstracts, which allows us to quantify the information lost,
  2. We propose a partial solution for simple non-named entities,
  3. Finally, we discuss the remaining challenges in representing all non-named entities.

Manual Study of Non-Named Entities in Wikipedia

We conducted a manual analysis of the noun phrases in Wikipedia articles. We chose Wikipedia because it is a widely used standard reference in both research and industry applications. We focused on the Wikipedia articles of the highest quality (the “featured articles”).

Figure 1: Annotation workflow.

We chose one article from each of the 30 topics, and automatically extracted (and manually verified) the noun phrases from the abstract of the article. We considered noun phrases that are sequences of nouns, adjectives, adverbs, determiners, and prepositions (see Figure 1).
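The extraction pattern described above can be illustrated with a minimal sketch. It assumes the tokens have already been tagged with Universal POS tags (`NOUN`, `PROPN`, `ADJ`, `ADV`, `DET`, `ADP`); the function name `extract_noun_phrases` and the trimming rules (drop leading prepositions, end each phrase on a noun) are illustrative assumptions, not the tooling used in the study.

```python
# Tags allowed inside a noun phrase: nouns, adjectives, adverbs,
# determiners, and prepositions (Universal POS tag names, assumed here).
NP_TAGS = {"NOUN", "PROPN", "ADJ", "ADV", "DET", "ADP"}
NOUN_TAGS = {"NOUN", "PROPN"}


def extract_noun_phrases(tagged):
    """Return maximal runs of NP_TAGS tokens that end on a noun."""
    phrases, run = [], []
    # A sentinel with a non-NP tag flushes the last run.
    for token, tag in list(tagged) + [("", "EOS")]:
        if tag in NP_TAGS:
            run.append((token, tag))
            continue
        # Trim trailing tokens until the run ends on a noun.
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        # Trim leading prepositions, which belong to the enclosing phrase.
        while run and run[0][1] == "ADP":
            run.pop(0)
        if run:
            phrases.append(" ".join(tok for tok, _ in run))
        run = []
    return phrases


# Hand-tagged fragment of the example sentence (tags are assumptions).
tagged_example = [
    ("The", "DET"), ("Arab", "PROPN"), ("Spring", "PROPN"),
    ("resulted", "VERB"), ("in", "ADP"),
    ("contentious", "ADJ"), ("battle", "NOUN"),
]
noun_phrases = extract_noun_phrases(tagged_example)
```

On this fragment, the sketch yields one named candidate ("The Arab Spring") and one non-named candidate ("contentious battle"). Candidates produced this way would still need manual verification, as in the workflow above.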

Overall, we annotated 1925 noun phrases, at an inter-annotator agreement (Cohen’s kappa) of 0.88, which is considered excellent.
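For readers unfamiliar with the agreement measure, here is a minimal sketch of how Cohen's kappa is computed for two annotators. The label names and the four toy annotations are invented for illustration and are not data from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always used the same single label
    return (observed - expected) / (1 - expected)


# Toy annotations (hypothetical): the annotators disagree on one item of four.
annotator_1 = ["named", "non-named", "non-named", "named"]
annotator_2 = ["named", "non-named", "named", "named"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 0.5
```

In the toy case, the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5. A value of 0.88, as reported above, means the annotators agreed far more often than chance would predict.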

Figure 2: Distribution of named and non-named entities.

Figure 3: Non-named entities by Yago class.

Figure 4: Non-named entities by nature and modifiers.

A Partial Solution

As a solution for the simplest non-named entities, we propose the following steps:

  1. Replace anaphoras and determined noun phrases with their referent: « the coffee brand » becomes the G. Washington Coffee Company entity in this context.
  2. Link simple nested noun phrases together: « a mansion in Brooklyn » becomes an anonymous entity linked to the Brooklyn entity.
  3. Use ad-hoc classes for plural noun phrases, based on the head noun: « German scientists » becomes a subclass of the scientist class.
  4. Make singular noun phrases anonymous instances of ad-hoc classes or top classes (as in Figure 3): « a German scientist » becomes an anonymous instance of the German scientist class.
  5. Replicate numbered noun phrases: « the four royal houses… » becomes four instances of the royal house class.
  6. Attach to each noun phrase a list of its modifiers (adjectives, adverbs), allowing for a rich description of the entity.
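Steps 3 to 6 can be sketched on a toy in-memory knowledge base. Everything in this sketch is a hypothetical illustration: the `ToyKB` class, its field names, and the blank-node identifiers (`_:b0`, ...) are assumptions, and the head noun is assumed to arrive already lemmatized ("scientists" as "scientist"). Steps 1 and 2 (resolving referents and linking nested phrases) are not covered.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKB:
    subclass_of: dict = field(default_factory=dict)  # ad-hoc class -> parent class
    instance_of: dict = field(default_factory=dict)  # anonymous node -> class
    modifiers: dict = field(default_factory=dict)    # class or node -> modifier list

    def ad_hoc_class(self, modifiers, head):
        """Step 3: build an ad-hoc class below the head noun's class."""
        name = " ".join([*modifiers, head])
        if modifiers:
            self.subclass_of[name] = head
        return name

    def add_noun_phrase(self, modifiers, head, plural=False, count=None):
        """Represent one simple noun phrase; return the identifiers it maps to."""
        cls = self.ad_hoc_class(modifiers, head)
        self.modifiers[cls] = list(modifiers)  # step 6: keep the modifiers
        if plural and count is None:
            return [cls]  # step 3: an unnumbered plural is just the class
        created = []
        # Step 4 (singular, one instance) and step 5 (numbered, `count` instances).
        for _ in range(count or 1):
            node = f"_:b{len(self.instance_of)}"
            self.instance_of[node] = cls
            self.modifiers[node] = list(modifiers)
            created.append(node)
        return created


kb = ToyKB()
plural_ids = kb.add_noun_phrase(["German"], "scientist", plural=True)
single_ids = kb.add_noun_phrase(["German"], "scientist")
houses_ids = kb.add_noun_phrase(["royal"], "house", plural=True, count=4)
```

Here « German scientists » maps to the ad-hoc class "German scientist" (a subclass of "scientist"), « a German scientist » maps to one anonymous instance of that class, and « the four royal houses » maps to four anonymous instances of "royal house".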

The Gap That Remains

However, many pitfalls remain before we can represent all non-named entities in knowledge bases.

Knowledge representation

  • Some knowledge bases contain mainly classes, others mix classes and instances, and others duplicate them (like dbr:Book and dbo:Book). These different models are not all usable as is and may need to be adapted before adding non-named entities.
  • Plural noun phrases like “hundreds of soldiers” need axioms to express “hundreds”. OWL 2 could help with this task.
  • Vague noun phrases like “large-scale settlement” cannot be represented in current knowledge bases without Generalized Quantifiers or Fuzzy Logic.
  • Nested entities like “the growing support for democracy in many Muslim-majority states” or “contentious battle between a consolidation of power by religious elites” remain difficult to represent.

Canonicalization

  • Similar non-named entities in different contexts, e.g., two different rises of the same stock market.
  • Distinct entities must be kept apart, e.g., two rises of different stock markets.

Facts

  • Comparatives, superlatives, and temporal comparisons, such as “A rose is more beautiful than a daisy”, need elaborate axioms.
  • Complex statements about classes: how can statements be expressed for non-named entities such as “dormant volcano” or “his characteristic surrealist style”?

To cite this work

Pierre-Henri Paris, Fabian M. Suchanek. Non-named entities - the silent majority. In ESWC 2021.

Pierre-Henri Paris
Postdoctoral Researcher in Artificial Intelligence

My research interests include Knowledge Graphs, Information Extraction, and NLP.