This is joint work by Pierre-Henri Paris and Fabian Suchanek (both at Télécom Paris, DIG team).
Natural-language texts contain many noun phrases, which often function as verb subjects and objects, as predicative expressions, and as complements of prepositions. When knowledge bases are built, the proper nouns among these noun phrases are extracted and become entities, which are then equipped with facts. Noun phrases that are not proper nouns are left unmapped in this process, so a vast amount of information is ignored. Consider the following example, which contains one named entity and two non-named entities:
« The Arab Spring resulted in contentious battle between a consolidation of power by religious elites and the growing support for democracy. »
The previous example contains a lot of information that will not be extracted despite its importance. Indeed, the two non-named entities explain the Arab Spring event, but will not be mapped to the knowledge bases.
In this position paper:
- We manually analyze this phenomenon in 30 Wikipedia abstracts, which allows us to quantify the information lost,
- We propose a partial solution for simple non-named entities,
- Finally, we discuss the remaining challenges in representing all non-named entities.
Manual Study of Non-Named Entities in Wikipedia
We conducted a manual analysis of the noun phrases in Wikipedia articles. Our choice is motivated by the fact that Wikipedia is a widely used standard reference, in both research and industry applications.
We focused on the highest-quality Wikipedia articles (the “featured articles”).
We chose one article from each of the 30 topics, automatically extracted the noun phrases from its abstract, and verified them manually. We considered noun phrases that are sequences of nouns, adjectives, adverbs, determiners, and prepositions (see Figure 1).
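The extraction criterion above can be illustrated with a minimal sketch. It is not the pipeline used in the study. It assumes that tokens are already tagged with Universal POS tags, that proper nouns count as nouns, and that prepositions at the edge of a candidate are dropped. The function name `noun_phrase_candidates` and the tag sets are hypothetical choices made for this illustration.

```python
from typing import List, Tuple

# Assumption: Universal POS tags; these are the categories allowed inside a noun phrase.
NP_TAGS = {"NOUN", "PROPN", "ADJ", "ADV", "DET", "ADP"}
NOUN_TAGS = {"NOUN", "PROPN"}


def noun_phrase_candidates(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return maximal runs of allowed tags that contain at least one noun."""
    phrases: List[str] = []
    run: List[Tuple[str, str]] = []
    # A sentinel token with a disallowed tag flushes the last run.
    for token, tag in tagged + [("", "END")]:
        if tag in NP_TAGS:
            run.append((token, tag))
            continue
        # Assumption: a preposition at the edge of a run is not part of the phrase.
        while run and run[0][1] == "ADP":
            run.pop(0)
        while run and run[-1][1] == "ADP":
            run.pop()
        if any(t in NOUN_TAGS for _, t in run):
            phrases.append(" ".join(tok for tok, _ in run))
        run = []
    return phrases


sentence = [
    ("The", "DET"), ("Arab", "PROPN"), ("Spring", "PROPN"),
    ("resulted", "VERB"), ("in", "ADP"),
    ("contentious", "ADJ"), ("battle", "NOUN"),
]
print(noun_phrase_candidates(sentence))
```

Candidates produced this way would still need the manual verification step described above, since a purely tag-based rule cannot tell where one nested phrase ends and the next begins.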
Overall, we annotated 1925 noun phrases, with an inter-annotator agreement (Cohen’s kappa) of 0.88, which is considered excellent.
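For readers unfamiliar with the statistic, Cohen’s kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two label sequences. The labels are invented toy data, not the annotations of the study, and the function name `cohen_kappa` is chosen for this illustration only.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label frequencies, summed over labels.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): four noun phrases labeled by two annotators.
annotator_1 = ["named", "non-named", "non-named", "non-named"]
annotator_2 = ["named", "non-named", "named", "non-named"]
print(cohen_kappa(annotator_1, annotator_2))
```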
Figure 2 shows the distribution of named and non-named entities. We found that 78% of the entities are non-named, and are thus left unmapped. Most of the non-named entities are singular (63%), and thus refer to a single entity (« a bus », « the tall girl »). The plural non-named entities refer to ad-hoc concepts (« German scientists ») or groups of entities (« the four horses »).
Figure 3 shows the distribution of non-named entities across classes from schema.org and BioSchema.org. As expected, intangible and creative work entities are ubiquitous.
Figure 4 shows the distribution of non-named entities by nature and modifiers. 32% of non-named entities are determined (« the man »), which makes it more likely that they are central to the text. Only a few entities are anaphoras or qualified by anaphoras.
A Partial Solution
As a solution for the simplest non-named entities, we propose the following steps:
- Replace anaphoras and determined noun phrases with their referent: « the coffee brand » becomes the G. Washington Coffee Company entity in this context.
- Link simple nested noun phrases together: « a mansion in Brooklyn » becomes an anonymous entity linked to the Brooklyn entity.
- Create ad-hoc classes for plural noun phrases based on the head noun: « German scientists » becomes a subclass of the scientist class.
- Make singular noun phrases anonymous instances of ad-hoc classes or top classes (as in Figure 3): « a German scientist » becomes an anonymous instance of the German scientist class.
- Replicate numbered noun phrases: « the four royal houses » becomes four instances of the royal house class.
- Map mass nouns to classes: « the propagation of light » becomes an instance of the propagation class, and « the fame that Elvis achieved » becomes an instance of the fame class.
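Several of the steps above can be sketched as operations that emit RDF-style triples. This is only an illustration under stated assumptions, not an implementation from the paper. Triples are plain Python tuples, the `ex:` prefix and the `ex:locatedIn` predicate are hypothetical, and the class `KBSketch` with its method names is invented for this example. Anaphora resolution is left out, since it requires a coreference component.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


class KBSketch:
    """Toy triple store illustrating the mapping steps (hypothetical helper)."""

    def __init__(self) -> None:
        self.triples: List[Triple] = []
        self._next_id = 1

    def _blank_node(self) -> str:
        # Anonymous entities are modeled as numbered blank nodes.
        node = f"_:b{self._next_id}"
        self._next_id += 1
        return node

    def ad_hoc_class(self, modifier: str, head: str) -> str:
        """Plural noun phrase: create a subclass of the head noun's class."""
        cls = f"ex:{modifier}{head}"
        self.triples.append((cls, "rdfs:subClassOf", f"ex:{head}"))
        return cls

    def anonymous_instance(self, cls: str) -> str:
        """Singular noun phrase: create an anonymous instance of a class."""
        node = self._blank_node()
        self.triples.append((node, "rdf:type", cls))
        return node

    def replicate(self, cls: str, n: int) -> List[str]:
        """Numbered noun phrase: create n anonymous instances of a class."""
        return [self.anonymous_instance(cls) for _ in range(n)]

    def link_nested(self, cls: str, predicate: str, target: str) -> str:
        """Nested noun phrase: link an anonymous instance to an existing entity."""
        node = self.anonymous_instance(cls)
        self.triples.append((node, predicate, target))
        return node


kb = KBSketch()
german_scientist = kb.ad_hoc_class("German", "Scientist")   # « German scientists »
kb.anonymous_instance(german_scientist)                      # « a German scientist »
kb.replicate("ex:RoyalHouse", 4)                             # « the four royal houses »
kb.link_nested("ex:Mansion", "ex:locatedIn", "ex:Brooklyn")  # « a mansion in Brooklyn »
for triple in kb.triples:
    print(triple)
```

Choosing the predicate for a nested phrase (here `ex:locatedIn` for the preposition « in ») is itself an open decision; the sketch simply takes it as an input.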
The Gap That Remains
However, many pitfalls remain before we can represent all non-named entities in knowledge bases.
- Some knowledge bases contain mainly classes, others mix classes and instances, and others duplicate them (like dbr:Book and dbo:Book). These different models are not all usable as is and may need to be adapted before adding non-named entities.
- Plural noun phrases like « hundreds of soldiers » need axioms to express « hundreds ». OWL 2 could help with this task.
- Vague noun phrases like « large-scale settlement » cannot be represented in current knowledge bases without Generalized Quantifiers or Fuzzy Logic.
- Nested entities like « the growing support for democracy in many Muslim-majority states » or « contentious battle between a consolidation of power by religious elites » remain hard to represent.
- Similar non-named entities in different contexts, e.g. two different rises of the same stock market.
- Distinct entities must be kept apart, e.g. two rises of different stock markets.
- Comparatives, superlatives, and temporal comparisons, such as « A rose is more beautiful than a daisy », need elaborate axioms.
- Complex statements about classes: how can we express statements about non-named entities such as « dormant volcano » or « his characteristic surrealist style »?