Introduction

Why graphs?

Graphs are a general language for describing and analyzing entities and the relations or interactions between them.

Knowledge graphs

Semantic Perspective

A knowledge graph is a formal representation of knowledge, where entities are represented as nodes, and the relationships between them are represented as edges. These relationships are often imbued with semantic meaning, drawn from controlled vocabularies, taxonomies, or ontologies, allowing for rich, contextual understanding of the data.

Data Integration Perspective

A knowledge graph is a framework for integrating heterogeneous, distributed, and complex data into a unified, easy-to-understand visual structure. It creates a single ‘source of truth’ where relationships and connections between disparate data points can be analyzed and exploited.

Machine Learning Perspective

A knowledge graph is a structured and interconnected data repository that can be used to enhance machine learning algorithms. It provides contextual and relational data that improve algorithmic accuracy and prediction capability, particularly in tasks like recommendation systems or natural language understanding.

Business Intelligence Perspective

A knowledge graph is a tool for turning data into actionable insights. It structures and connects data in ways that align with business objectives, thereby supporting decision-making, predictive analytics, and operational efficiency.

User Interface Perspective

A knowledge graph is a means of presenting complex data in a visually intuitive and interactive way, making it easier for users to navigate, explore, and derive insights from large data sets.

Logic

What is reasoning?

Inductive reasoning
- “If it has wheels, doors, seats, windows, engine, it must be a car.”
Deductive reasoning
- “All cars are vehicles. My Lada™ is a car. Therefore, my Lada™ is a vehicle.”
Abductive reasoning
- “My car is not in my garage and my wife is at home. It may have been stolen!”
In this course, we mainly focus on deductive reasoning!

There are different forms of reasoning
- Inductive reasoning
  - Build general theories (knowledge) from particular observations (facts) e.g. If it has wheels, doors, seats, windows, engine, it must be a car.
- Deductive reasoning
  - Use general theories (knowledge) on particular cases to infer new facts e.g. All cars are vehicles. Emmanuel Macron’s DS 7 Crossback is a car. Therefore, Macron’s DS 7 Crossback is a vehicle.
- Abductive reasoning
  - Use general theories and incomplete observations to get most plausible conclusion e.g. My car is not in my garage. Most likely, my wife must still be at her work place. e.g. My car is not in my garage and my wife is at home. It may have been stolen!

What is an ontology?

Describes the domain of discourse of a particular field
Structured representation (with symbols)
Concepts, categories, properties, and relationships
Shared understanding, facilitate interoperability, and support reasoning

Logic Programming

In Datalog / Prolog style

Symbols	`RobertDowneyJr`, `NewYork`, `SanFrancisco`
Predicates	`lives/2`, `born/2`
Dataset	`lives(RobertDowneyJr,SanFrancisco)`, `born(RobertDowneyJr,NewYork)`

Logically consistent collection of facts
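
As a rough illustration, the dataset above can be held as a set of ground facts and queried by predicate. This is only a sketch in Python, not a Datalog engine; the `query` helper and its wildcard convention are assumptions made for this example.

```python
# Ground facts from the slide, stored as (predicate, arg1, arg2) tuples.
facts = {
    ("lives", "RobertDowneyJr", "SanFrancisco"),
    ("born", "RobertDowneyJr", "NewYork"),
}

def query(predicate, first=None, second=None):
    """Return the facts of a predicate; None acts as a wildcard argument."""
    return sorted(
        (p, a, b)
        for (p, a, b) in facts
        if p == predicate and first in (None, a) and second in (None, b)
    )

print(query("lives", "RobertDowneyJr"))
```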

Description Logics

Based on logical formalisms, e.g., Description Logics (DL), RDFS, OWL

TBox (schema, ontology, theory):
- SuccessfulAuthor ⊑ ≥1 notableWork.Bestseller
ABox (instances, facts, assertions):
- SuccessfulAuthor(StanLee)
RBox (role axioms: property hierarchy and constraints):
- notableWork ⊑ created

Databases and Data Integration

Author	Notable work	Date of birth
Stan Lee	Iron Man	12/28/1922
Bob Kane	Batman	10/24/1915

Entities
- Cells - classes, instances
Relations
- Column headers
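
The mapping from a table to graph statements can be sketched as follows. The sketch assumes the first column identifies the entity each row describes (an assumption of this example, not something the table itself states) and turns every other column header into a relation.

```python
# Rows of the table above; the column headers become relations.
header = ["Author", "Notable work", "Date of birth"]
rows = [
    ["Stan Lee", "Iron Man", "12/28/1922"],
    ["Bob Kane", "Batman", "10/24/1915"],
]

def table_to_triples(header, rows):
    """Use the first cell of each row as subject, other cells as objects."""
    triples = []
    for row in rows:
        subject = row[0]
        for relation, value in zip(header[1:], row[1:]):
            triples.append((subject, relation, value))
    return triples

triples = table_to_triples(header, rows)
for triple in triples:
    print(triple)
```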

NLP

Building KGs from texts

Albert Einstein was a German-born theoretical physicist who developed the theory of relativity.

Einstein

Named Entity Recognition

Coco, studying at Sorbonne, designs for Chanel in Paris, France, dreaming of the Champs-Élysées. Meanwhile, Paris of Troy admires Paris, Texas’s own Eiffel Tower, and a couple explores cobblestoned Paris, Ontario, inspired by Paris Hilton.

Relation Linking

Name all the movies in which Robert Downey Jr acted?
Find me all the films casting Robert Downey Jr ?
List all the movies starring Robert Downey Junior?
RDJ has acted in which movies?

Question Answering

How many Marvel movies was Robert Downey Jr. cast in?


SELECT COUNT(?uri) WHERE {
  ?uri dbp:studio dbr:Marvel_Studios .
  ?uri dbo:starring dbr:Robert_Downey_Jr .
}

Language Modeling

Robert Downey Jr. portrayed [MASK] in the Marvel movie in 2008.

Knowledge Graph

Precise facts
Entities & relations
Explainability

Unstructured Sources

Large-scale text corpora
- Wikipedia,
- OpenBooks,
- Reddit,
- CommonCrawl,
- etc.

Examples of knowledge graphs

Google Knowledge Graph
Amazon Product Graph
Facebook Graph API
IBM Watson
Microsoft Satori
Project Hanover/Literome
LinkedIn Knowledge Graph
Yandex Object Answer
IKEA Knowledge Graph

Applications

Serving information:

The Beatles

Applications

Question answering and conversation agents

Source: Medium

Applications

information extraction,
semantic search,
knowledge injection into language models
…

Summary

RDF, RDFS, OWL

Based on Antoine Zimmermann's course

RDF

RDF is a data model (not a file format!)
RDF is a logical formalism (formal semantics)
RDF is a Web standard
RDF is the HTML of the Web of Data

RDF basics

Identify things (resources)
Express relations
Assign data values to things (literals)
Organise things in categories (i.e., classes or types)
Add simple knowledge about categories and relations

Identify things

RDF is used to describe resources
A resource may be anything (a real or imaginary entity, abstract or concrete)
To describe a resource, it must be named or identified
On the Web, the identification mechanism must be uniform at Web scale: an identifier must identify the same thing everywhere on the Web
RDF uses Internationalized Resource Identifiers or IRIs (RFC 3987)

Internationalized Resource Identifiers

IRIs generalise URIs (Uniform Resource Identifiers, RFC 3986) by allowing any Unicode characters
IRIs and URIs identify things but may be used as locators (i.e., as URLs) at the same time
Examples:
- urn:ietf:rfc:3987
- svn://yadiyada.foo.bar/
- mailto:antoine.zimmermann@emse.fr
- ftp://ftp.liris.fr/#meta
- http://en.wikipedia.org/wiki/User:Wikiuser100
Note: to shorten notations, we use namespace prefixes
- rdf: is for http://www.w3.org/1999/02/22-rdf-syntax-ns#
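
Expanding a prefixed name back to a full IRI is plain string concatenation, as this small Python sketch shows. The `ex` namespace and the `expand` helper are assumptions for illustration only.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example.org/data/",  # assumed namespace, for illustration
}

def expand(prefixed_name, prefixes=PREFIXES):
    """Replace the declared prefix by its namespace IRI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix!r}")
    return prefixes[prefix] + local

print(expand("rdf:type"))
```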

How to choose an IRI for something?

If possible, reuse an existing IRI from an authoritative source, e.g.:
- from a national library for books (library of congress, BNF, BNL, DNB)
- from a government website for a ministry
If not, make your own IRI:
- use HTTP IRIs
- use a namespace under your control
- Cool URIs don’t change
- Refer to the guide on Cool URIs for the Semantic Web

Relate things

Binary relations between things
- “Laura loves Helmut”
- “Steven works for Google Inc.”
This is written as a triple:
(subject, predicate, object)
where subject and object identify the resources in the relationship, and predicate identifies the relation
The predicate of an RDF triple is always an IRI; the subject is an IRI or a blank node, and the object can be an IRI, a blank node, or a literal

Example

“Laura loves Helmut”

(http://example.org/data/Laura,         subject
  http://social.relations.com/loves,    predicate
    http://example.org/data/Helmut)     object

RDF triples

Compact syntax:
- use namespace prefixes
- write subject, predicate, and object side by side, separated by spaces
- ex:Laura rel:loves ex:Helmut

Data values

Like everything else, a data value (number, string, date) is a resource
A specific data value can be identified with a literal, a character string that represents the value
Every literal is typed so that its string representation can be interpreted as the correct value
- “42” represents the number forty-two if its type is decimal integer, but sixty-six if it is a hexadecimal integer
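
The dependence of the value on the datatype can be checked directly in Python, where the base passed to `int` plays the role of the datatype:

```python
lexical_form = "42"

as_decimal = int(lexical_form, 10)  # read as a decimal integer
as_hex = int(lexical_form, 16)      # read as a hexadecimal integer

print(as_decimal, as_hex)
```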

RDF literals

An RDF literal has 2 or 3 components which are:
- A lexical form which is a Unicode string
- A datatype IRI that can be any IRI
- When the datatype IRI is rdf:langString, there is a language tag which is a BCP 47 tag
Usually, we use standard datatype IRIs from the xsd: namespace (XML Schema Datatypes) and the rdf: namespace
We will write literals "lexical form"^^datatypeIRI and when it is an rdf:langString, "lexical form"@langTag
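
The two-or-three component structure can be modelled with a small record type. This is a sketch under assumptions: `SimpleLiteral` is a hypothetical class written for this example (it is not the rdflib `Literal`), datatype IRIs are kept in prefixed form, and quotes inside the lexical form are not escaped.

```python
from dataclasses import dataclass
from typing import Optional

RDF_LANGSTRING = "rdf:langString"

@dataclass(frozen=True)
class SimpleLiteral:
    lexical_form: str
    datatype: str = "xsd:string"
    language: Optional[str] = None

    def __post_init__(self):
        # A language tag is present exactly when the datatype is rdf:langString.
        if (self.language is not None) != (self.datatype == RDF_LANGSTRING):
            raise ValueError("a language tag requires rdf:langString, and vice versa")

    def notation(self):
        """Write the literal in the notation used in these slides."""
        if self.language is not None:
            return f'"{self.lexical_form}"@{self.language}'
        return f'"{self.lexical_form}"^^{self.datatype}'

print(SimpleLiteral("42", "xsd:integer").notation())
print(SimpleLiteral("chat", RDF_LANGSTRING, "fr").notation())
```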

RDF literals - Examples

"42"^^xsd:integer
"THX 1138"^^xsd:string
"chat"@fr,"chat"@en
"<p>The <em>beautiful</em> literal!</p>"^^rdf:HTML

RDF graphs

An RDF graph is a set of RDF triples
RDF graphs can be drawn as directed, edge-labelled multi-graphs
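
Both statements can be illustrated with a Python set of tuples: adding the same triple twice has no effect, while two different edge labels between the same pair of nodes are kept as separate edges. The prefixed names below are made up for the example.

```python
graph = set()
graph.add(("ex:Laura", "rel:loves", "ex:Helmut"))
graph.add(("ex:Laura", "rel:loves", "ex:Helmut"))  # same triple: no effect
graph.add(("ex:Laura", "rel:knows", "ex:Helmut"))  # second edge between the same nodes
graph.add(("ex:Steven", "rel:worksFor", "ex:Google"))

print(len(graph))
```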

Unidentified resources

RDF can describe entities that are known to exist but whose identity is unknown (or is irrelevant/unimportant)
- E.g., a book has at least one author, but they may not be known
The existence of a thing can be indicated in the subject or object position of a triple with a blank node
- E.g. “something is in my bag”

The Turtle syntax (1)

Full IRIs: http://www.example.com/test#this

A simple triple:

<http://www.example.com/test#this>
      <http://relations.example.com/in>
              <http://www.example.com/test#box> .

Abbreviated IRIs (declare prefixes at the beginning of the file):

# This is a comment
@prefix ex: <http://www.example.com/test#> . # end dot!
PREFIX rel: <http://relations.example.com/> # alternative notation (no dot!)
ex:this rel:in ex:box . # dot ends statement

The Turtle syntax (2)

# Literals:
ex:this rel:date "2019-09-13"^^xsd:date . # normal literal
ex:this rel:name "this"@en . # language-tagged literal
ex:this rel:code "TX32" . # xsd:string can be omitted
ex:this rel:number 42 . # xsd:integer (no quotes)
ex:this rel:sizeInMeters 3.75 . # xsd:decimal (use a dot)
ex:this rel:isGood true . # xsd:boolean
ex:this rel:isBoring false . # xsd:boolean

# Blank nodes:
[] rel:in ex:box .
_:b1  rel:in ex:box . # a blank node identifier...
ex:me rel:likes _:b1 . # ...lets you reuse the same blank node

The Turtle syntax (3)

# Repeat the same subject and predicate:
ex:box rel:contains ex:this .
ex:box rel:contains ex:that .
# can be written
ex:box rel:contains ex:this, ex:that . # comma


# Repeat subject:
ex:this rel:date "2019-09-13"^^xsd:date;
    rel:name "this"@en; # new lines are optional
    rel:code "TX32";
    rel:nextTo ex:that, ex:thoot, ex:thus .

The Turtle syntax (4)

# More on blank nodes:

# assume prefixes are declared
ex:johnDoe rel:worksFor [
        a ex:University; # the IRI rdf:type can be replaced by 'a'
        rel:name "Berkeley";
        rel:locatedIn ex:California
] .

# is the same as:
ex:johnDoe rel:worksFor _:bnode .
_:bnode rdf:type ex:University . # 'a' and 'rdf:type' represent the same IRI
_:bnode rel:name "Berkeley" .
_:bnode rel:locatedIn ex:California .

The Turtle syntax (5)

#Declaring a base IRI:
@base <http://example.com/base/> . # ends with dot
BASE <http://example.com/base/> # alternative syntax (no dot!)
# prefixes must be declared
<bob> a vocab:Person; # relative IRI
        rel:knows <claire> .
BASE <http://example.com/base2#> # base can be redefined
<#bob> rel:knows <http://example.com/base/bob> . # different bobs; a bare <bob> would resolve to <http://example.com/bob>

# is the same as:
<http://example.com/base/bob> a vocab:Person;
    rel:knows <http://example.com/base/claire> .
<http://example.com/base2#bob>
    rel:knows <http://example.com/base/bob> .
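
Relative IRIs are resolved against the base following the reference-resolution rules of RFC 3986. Python's `urllib.parse.urljoin` implements these rules for URIs, so it can serve as a quick check (a sketch only: it is not an IRI-aware Turtle parser). Note that a relative reference without `#` replaces the last path segment of the base and drops the base's fragment.

```python
from urllib.parse import urljoin

base = "http://example.com/base/"
bob = urljoin(base, "bob")
claire = urljoin(base, "claire")

base2 = "http://example.com/base2#"
fragment_bob = urljoin(base2, "#bob")  # only the fragment is replaced
path_bob = urljoin(base2, "bob")       # last path segment is replaced

for iri in (bob, claire, fragment_bob, path_bob):
    print(iri)
```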

RDFS (RDF Schema)

RDFS is a semantic extension of RDF, and it provides a way to describe semantic relationships between things and provides a basic type system for RDF models.

Basic Components

Resources: Anything can be a resource such as a person, a car, a website, etc.
Classes: They are used to categorize resources.
Properties: They describe the relationship between resources.
Literals: They are basic values such as strings, numbers, etc.

Classes and Subclasses

In RDFS, we can define a class using rdfs:Class. The rdfs:subClassOf property is used to represent inheritance between classes.

ex:Person a rdfs:Class .
ex:Student a rdfs:Class ;
    rdfs:subClassOf ex:Person .

Properties

RDFS includes the ability to describe properties (also called predicates), which are the named relations that link resources together:

rdf:Property: The class of all RDF properties.
rdfs:domain: Every subject of a triple using the property is an instance of the given class.
rdfs:range: Every object of a triple using the property is an instance of the given class.

Example:

ex:author rdf:type rdf:Property;
    rdfs:domain ex:Book;
    rdfs:range ex:Person .

Inference in RDFS

One of the key advantages of RDFS is the ability to make inferences, or to derive additional information from the existing knowledge base.

Example:

ex:HarryPotter ex:author ex:JKRowling.
ex:author rdfs:domain ex:Book.
ex:author rdfs:range ex:Person.

From this information, we can infer:

ex:HarryPotter rdf:type ex:Book.
ex:JKRowling rdf:type ex:Person.
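
The two inferences above follow from the RDFS domain and range rules. A minimal sketch of those two rules in Python, assuming triples are plain tuples of prefixed names (this covers only domain and range, not the full set of RDFS entailment rules):

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"
RDFS_RANGE = "rdfs:range"

graph = {
    ("ex:HarryPotter", "ex:author", "ex:JKRowling"),
    ("ex:author", RDFS_DOMAIN, "ex:Book"),
    ("ex:author", RDFS_RANGE, "ex:Person"),
}

def infer_types(triples):
    """Apply the domain and range rules once; return only the new triples."""
    domains = {(s, o) for s, p, o in triples if p == RDFS_DOMAIN}
    ranges = {(s, o) for s, p, o in triples if p == RDFS_RANGE}
    inferred = set()
    for s, p, o in triples:
        for prop, cls in domains:
            if p == prop:
                inferred.add((s, RDF_TYPE, cls))  # subject gets the domain class
        for prop, cls in ranges:
            if p == prop:
                inferred.add((o, RDF_TYPE, cls))  # object gets the range class
    return inferred - set(triples)

for triple in sorted(infer_types(graph)):
    print(triple)
```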

Labels and Comments

In RDFS, it is possible to add human-readable labels and comments to resources. This makes the RDF document easier to understand for individuals reviewing the data. It can be particularly helpful for understanding the semantics of an RDF document without needing to look up the definitions of resources and properties in the schema.

rdfs:label: Provides a human-readable version of a resource’s name.
rdfs:comment: Gives a brief description of a resource.

Example:

ex:Person a rdfs:Class;
    rdfs:label "Person";
    rdfs:comment "Represents a person" .

ex:Student a rdfs:Class;
    rdfs:subClassOf ex:Person;
    rdfs:label "Student";
    rdfs:comment "Represents a student, which is a type of person" .

In the above example, rdfs:label and rdfs:comment are used to provide a human-readable name and description for the ex:Person and ex:Student classes.

RDFS Limitations

It doesn’t allow the description of properties of properties (i.e., it cannot say that a property is transitive, symmetric, etc.).
It doesn’t allow the definition of constraints (i.e., it cannot limit the number of instances of a class, cannot enforce a property to have a single value, etc.).
It doesn’t support logical operators to combine classes (i.e., it cannot create a new class as a union, intersection, or complement of other classes).

OWL (Web Ontology Language)

OWL is a more expressive language than RDFS and is used to create ontologies. An ontology is a specification of a conceptualization, or a way of representing knowledge.

Overview of OWL

OWL provides more complex classes and relationships than RDFS, including:

Symmetry, transitivity, and inverses for properties: for example, if A is a sibling of B, then B is a sibling of A (symmetry).
Enumerated classes: that is, classes that have a specific, predefined list of members.
Boolean combinations of classes: intersections (AND), unions (OR), and complements (NOT) of classes.
Cardinality restrictions: for example, stating that each instance of a certain class must be related to exactly two instances of another class.

OWL 2 Profiles

OWL 2, the most recent version of the Web Ontology Language, includes three profiles designed to meet different use case requirements and computational needs.

OWL 2 EL

This profile is designed for applications that require very large ontologies. The expressivity of the language is restricted to ensure that all reasoning tasks can be performed in polynomial time. This profile is especially relevant in fields like bioinformatics where ontologies can contain millions of classes.

OWL 2 QL

This profile is optimized for query answering over large datasets. It is mainly intended for applications that use data repositories managed through relational database systems. OWL 2 QL is a tractable language with a lower computational complexity, ensuring queries can be answered efficiently even when dealing with voluminous data.

OWL 2 RL

This profile is aimed at rule-based reasoning. The expressivity of the language is reduced to enable implementation of reasoners using rule-based technologies. It is designed for scalable reasoning while maintaining an acceptable level of expressivity.

OWL Classes and Properties

OWL builds upon RDFS by adding additional class and property types.

ex:Parent a owl:Class ;
    rdfs:subClassOf [
      a owl:Restriction ;
      owl:onProperty ex:hasChild ;
      owl:someValuesFrom ex:Person 
    ] .

Equivalent Classes and Properties

OWL allows for specifying that two classes or properties are equivalent.

owl:equivalentClass: The classes have the same instances.
owl:equivalentProperty: The properties relate the same pairs of instances.

ex:Mother a owl:Class ;
    owl:equivalentClass [
      a owl:Class ;
      # Mother = Parent AND (gender value Female)
      owl:intersectionOf (
        ex:Parent
        [ a owl:Restriction ;
          owl:onProperty ex:gender ;
          owl:hasValue ex:Female ]
      )
    ] .

Disjoint Classes

In OWL, the owl:disjointWith property specifies that two classes have no instances in common.

ex:Male a owl:Class .
ex:Female a owl:Class .
ex:Male owl:disjointWith ex:Female .

In the above example, the classes ex:Male and ex:Female are specified as being disjoint, meaning an individual cannot be an instance of both these classes.

OWL Individual

OWL individuals represent the instances of classes. Individuals can have properties associated with them.

ex:John a ex:Person ;
    ex:name "John Doe" ;
    ex:hasChild ex:David .

Identity and Interlinking

owl:sameAs is used to declare that two URI references actually refer to the same thing. If you state that A owl:sameAs B, you’re stating that any property that A has, B also has, and vice versa.

ex:JohnDoe owl:sameAs ex:JohnathanDoe .

In this example, ex:JohnDoe and ex:JohnathanDoe are considered to be the same individual.

Property Characteristics

OWL allows for the specification of certain characteristics of properties.

owl:FunctionalProperty

This specifies that a property is functional, meaning that a given subject can have at most one value for this property. For example, a person has exactly one biological mother, so two stated biological mothers of the same person must denote the same individual.

owl:InverseFunctionalProperty

This specifies that a property is inverse-functional, meaning that a given value can have at most one subject for this property. For example, take a property such as “isBiologicalMotherOf”: a mother can have many children (so the property is not functional), but each child is the value for exactly one biological mother.

owl:TransitiveProperty

This specifies that a property is transitive, meaning that if A is related to B, and B is related to C, then A is related to C. An example would be the property “ancestorOf”.
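
What a reasoner derives from a transitive property can be sketched as a fixed-point computation over pairs. The names in the data are hypothetical, and the naive loop below is meant for clarity rather than efficiency.

```python
def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) are present, until nothing changes."""
    closure = set(pairs)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new

# Hypothetical ancestorOf assertions.
ancestor_of = {("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Dan")}
closed = transitive_closure(ancestor_of)

print(sorted(closed - ancestor_of))
```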

Negative Assertions

OWL 2 introduces the possibility of stating negative information. This is done through the owl:NegativePropertyAssertion construct:

[ a owl:NegativePropertyAssertion ;
  owl:sourceIndividual ex:John ;
  owl:assertionProperty ex:hasSibling ;
  owl:targetIndividual ex:Mary ] .

In the above example, the statement asserts that John does not have Mary as a sibling.

RDFS and OWL Conclusion

In conclusion, RDFS and OWL are critical components in creating and maintaining the Semantic Web. RDFS provides a basic way to create a vocabulary for describing resources and their relationships. OWL takes it a step further, allowing for a more expressive and detailed way to describe resources and the relationships between them.

SPARQL

SPARQL basics

The syntax looks similar to SQL
The features are similar to SQL
A family of standards:
- SELECT queries
- Update (INSERT / DELETE) queries
- Protocols
- Reasoning at query time
Standards for managing RDF data in general
SQL and SQL DBMS are to the relational data model what SPARQL and its standards are to the RDF data model

SPARQL SELECT

Variable: an element of a set disjoint from IRIs, literals and blank nodes
Basic graph pattern: an RDF graph where the subject, predicate, or object can be replaced by a variable
An answer to a SELECT query is a mapping from variables in the query to IRIs union literals union blank nodes in the queried graph
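
These definitions can be made concrete with a naive matcher: each triple pattern is tried against every triple, and variable bindings are carried from one pattern to the next. This is only a sketch in Python with made-up data; real SPARQL engines use indexes and query planning.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result

def evaluate(patterns, graph):
    """Return every mapping from variables to terms that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in graph
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings

graph = {
    ("ex:Laura", "rel:loves", "ex:Helmut"),
    ("ex:Helmut", "rel:worksFor", "ex:Acme"),
    ("ex:Steven", "rel:worksFor", "ex:Google"),
}
# Who loves someone, and where does that someone work?
answers = evaluate([("?x", "rel:loves", "?y"), ("?y", "rel:worksFor", "?z")], graph)
print(answers)
```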

SPARQL example

#Ex. 1
#Associate URIs with prefixes
PREFIX space: <http://purl.org/net/schemas/space/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

#Example of a SELECT query, retrieving 2 variables
#Selected variables should appear in the graph pattern (otherwise they stay unbound)
SELECT ?subject ?label
WHERE {
    #This is our graph pattern
    ?subject rdfs:label ?label;
        rdf:type space:Discipline .
}

SPARQL example

#Ex. 2
PREFIX space: <http://purl.org/net/schemas/space/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

#Example of a SELECT query, retrieving all variables
SELECT *
WHERE {
    ?subject rdfs:label ?label;
        rdf:type space:Discipline .
}

OPTIONAL bindings

How do we allow for missing or unknown information?

#Ex. 3
PREFIX space: <http://purl.org/net/schemas/space/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?name ?country
WHERE {
    #This pattern must be bound
    ?thing rdfs:label ?name .
    #Anything in this block doesn't have to be bound
    OPTIONAL {
        ?thing space:country ?country .
    }
}

UNION queries

How do we allow for alternatives or variations in the graph?

#Ex. 4
PREFIX space: <http://purl.org/net/schemas/space/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?subject ?displayLabel
WHERE {
    {
        ?subject foaf:name ?displayLabel .
    }
    UNION
    {
        ?subject rdfs:label ?displayLabel .
    }
}

Sorting & Restrictions

How do we apply a sort order to the results and restrict the number of results returned?

#Ex. 5
#Select the URI and the mass of the 11th to 20th heaviest spacecraft
PREFIX space: <http://purl.org/net/schemas/space/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?spacecraft ?mass
WHERE {
    ?spacecraft space:mass ?mass .
}
#Use an ORDER BY clause to apply a sort. Can be ASC or DESC
ORDER BY DESC(?mass)
#Limit to ten results
LIMIT 10
#Apply an offset to get next "page"
OFFSET 10
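
The effect of the three solution modifiers corresponds to a sort followed by a slice. A small Python sketch over made-up (spacecraft, mass) rows:

```python
# Hypothetical query solutions: 30 spacecraft with masses 10.0 ... 300.0
solutions = [(f"craft{i:02d}", float(i * 10)) for i in range(1, 31)]

# ORDER BY DESC(?mass)
ordered = sorted(solutions, key=lambda row: row[1], reverse=True)

# OFFSET 10 skips the ten heaviest, LIMIT 10 keeps the next ten
offset, limit = 10, 10
page = ordered[offset:offset + limit]

print(page[0], page[-1])
```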

Filtering

How do we restrict results based on aspects of the data rather than the graph, e.g., string matching?

#Sample data for Sputnik launch
<http://purl.org/net/schemas/space/launch/1957-001> rdf:type space:Launch;
#Assign a datatype to the literal, to indicate it is a date
    space:launched "1957-10-04"^^xsd:date;
    space:spacecraft
        <http://purl.org/net/schemas/space/spacecraft/1957-001B>.

Filtering

How do we restrict results based on aspects of the data rather than the graph, e.g., string matching?

#Ex. 6
#Select name of spacecraft launched between 1st Jan 1969 and 1st Jan 1970
PREFIX space: <http://purl.org/net/schemas/space/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?name
WHERE {
    ?launch space:launched ?date;
        space:spacecraft ?spacecraft .
    ?spacecraft foaf:name ?name .
    FILTER (?date > "1969-01-01"^^xsd:date &&
        ?date < "1970-01-01"^^xsd:date)
}

Filtering

How do we restrict results based on aspects of the data rather than the graph, e.g., string matching?

#Ex. 7
#Select spacecraft with a mass of less than 90kg
PREFIX space: <http://purl.org/net/schemas/space/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?spacecraft ?name
WHERE {
    ?spacecraft foaf:name ?name;
        space:mass ?mass .
    #Note that we have to cast the data to the right type
    #As it is not declared in the data
    FILTER( xsd:double(?mass) < 90.0 )
}

Filtering

How do we restrict results based on aspects of the data rather than the graph, e.g., string matching?

#Ex. 8
#Select spacecraft with a name like “ollo”
PREFIX space: <http://purl.org/net/schemas/space/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?name
WHERE {
    ?spacecraft foaf:name ?name .
    FILTER( regex(?name, "ollo", "i" ) )
}
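
The same case-insensitive substring test can be tried in Python on a few made-up names. SPARQL's regex follows the XPath/XQuery regular expression syntax, which differs from Python's `re` in some details, so this is an approximation for simple patterns only.

```python
import re

names = ["Apollo 11", "APOLLO 13", "Sputnik 1", "Voyager 2"]

# Equivalent in spirit to FILTER( regex(?name, "ollo", "i") )
matches = [name for name in names if re.search("ollo", name, re.IGNORECASE)]

print(matches)
```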

Built-In Filters

Logical: !, &&, ||
Math: +, -, *, /
Comparison: =, !=, >, <, …
SPARQL tests: isURI, isBlank, isLiteral, bound
SPARQL accessors: str, lang, datatype
Other: sameTerm, langMatches, regex

DISTINCT

How do we remove duplicate results?

#Ex. 9
#Select each agency only once, however many spacecraft it operates
PREFIX space: <http://purl.org/net/schemas/space/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?agency
WHERE {
    ?spacecraft space:agency ?agency .
}

Extended Query Language Power (SPARQL 1.1)

Aggregates
Sub-queries
Negation and filtering
Property paths
Introducing new variables
Basic federated query
Graph Patterns inside FILTERs

Aggregates

AVG(expr)
COUNT(*) and COUNT(expr)
GROUP_CONCAT(expr)
MAX(expr)
MIN(expr)
SAMPLE(expr)
SUM(expr)

Aggregates (cont.)

All are allowed with and without DISTINCT across the arguments.
Grouping of results is optionally done with GROUP BY otherwise the entire result set is 1 group (like SQL). This may bind a variable too.
HAVING executes a filter expression over the results of an aggregation (like SQL)

Sub-queries

SPARQL 1.1 allows sub-SELECTs

#Ex. 10
PREFIX : <http://people.example/>

SELECT ?y ?minName
WHERE {
    :alice :knows ?y .
    {
        SELECT ?y (MIN(?name) AS ?minName)
        WHERE {
            ?y :name ?name .
        }
        GROUP BY ?y
    }
}
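
The evaluation order of Ex. 10 (inner query first, then a join with the outer pattern) can be mimicked in Python. The people and names below are invented for the illustration.

```python
from collections import defaultdict

knows = {("alice", "bob"), ("alice", "carol")}
names = [("bob", "Bob"), ("bob", "Bobby"), ("carol", "Carol"), ("dave", "Dave")]

# Inner query: GROUP BY ?y and keep MIN(?name) per group
grouped = defaultdict(list)
for person, name in names:
    grouped[person].append(name)
min_name = {person: min(values) for person, values in grouped.items()}

# Outer query: join the sub-query result with ":alice :knows ?y"
result = sorted((y, min_name[y]) for (x, y) in knows if x == "alice" and y in min_name)

print(result)
```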

Negation and Filtering

Ways to express negation / exclusion:
- OPTIONAL { graph-pattern } followed by FILTER (!bound(?var)) (1.0)
- FILTER … !expr (1.0)
- FILTER … NOT EXISTS { graph-pattern } (1.1)
Aggregation using HAVING with either of the above (1.1)
graph-pattern MINUS graph-pattern (1.1)
(Some of these can be done with complex UNION and OPTIONAL patterns)

Property path

This changes the fundamental SPARQL matching
- From: Triple pattern matches a triple to bind variables.
- To: Triple patterns with property paths match chains of several triples, in a regex-like way, to bind variables.
Depending on the data, the query engine could do a simple match or do a lot of searching for matches.
New syntax to select different properties from a subject node:
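- a/b ^a a|b a* a+ a? !a (a) where a and b are property IRIs.
- The counted forms a{m,n} a{n} a{m,} a{,n} come from SPARQL 1.1 working drafts and are not in the final Recommendation; some engines accept them as extensions.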
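
How a path expression turns into "a lot of searching" can be seen in a naive evaluator for three operators: sequence (a/b), inverse (^a), and one-or-more (a+). This is a sketch over a toy graph with made-up node and property names; it is not how a production engine is implemented.

```python
def step(graph, nodes, prop, inverse=False):
    """Follow one `prop` edge forward (or backward, for ^prop) from each node."""
    if inverse:
        return {s for s, p, o in graph if p == prop and o in nodes}
    return {o for s, p, o in graph if p == prop and s in nodes}

def one_or_more(graph, start, prop):
    """Evaluate `start prop+ ?x` by expanding until no new node is reached."""
    reached, frontier = set(), {start}
    while frontier:
        frontier = step(graph, frontier, prop) - reached
        reached |= frontier
    return reached

graph = {
    ("a", "knows", "b"), ("b", "knows", "c"), ("c", "knows", "a"),
    ("c", "worksFor", "acme"),
}
seq = step(graph, step(graph, {"a"}, "knows"), "knows")  # a knows/knows ?x
plus = one_or_more(graph, "a", "knows")                  # a knows+ ?x
inv = step(graph, {"acme"}, "worksFor", inverse=True)    # acme ^worksFor ?x

print(seq, sorted(plus), inv)
```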

Basic Federated Queries

A graph pattern that invokes a SPARQL protocol call and remote query returning the usual result formats

Allows querying multiple SPARQL databases in one query

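#Ex. 11
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
#Placeholder namespace for the birthdate property
PREFIX ex: <http://example.org/>

SELECT ?person
WHERE {
  #Matched against the local data
  ?person foaf:knows ?x .
  #Matched remotely, at the given SPARQL endpoint
  SERVICE <http://social-db.com/sparql/> {
      ?x foaf:name ?name;
          ex:birthdate ?b .
  }
}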

More

More functions and operators
Introducing new variables
RDF graph database management:
- INSERT triples / graphs
- DELETE triples / graphs
ASK, DESCRIBE, CONSTRUCT

Storage

In Files

Turtle: a compact, human-friendly format.
N-Triples: a very simple, easy-to-parse, line-based format that is not as compact as Turtle.
TriG: an extension of Turtle to datasets.
N-Quads: a superset of N-Triples, for serializing multiple RDF graphs.
RDF/XML: the first standard format for serializing RDF.
RDF/JSON: an alternative syntax for expressing RDF triples using a simple JSON notation.
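
To show why N-Triples is easy to parse and to produce, here is a rough serializer sketch: one triple per line, full IRIs in angle brackets, a dot at the end. Representing a literal as a `(lexical form, datatype IRI)` tuple is an assumption of this example, and only the most common string escapes are handled.

```python
def nt_term(term):
    """Serialize an IRI (plain str) or a (lexical_form, datatype_iri) tuple."""
    if isinstance(term, tuple):
        lexical, datatype = term
        escaped = lexical.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        return f'"{escaped}"^^<{datatype}>'
    return f"<{term}>"

def to_ntriples(triples):
    """Emit one line per triple: subject, predicate, object, dot."""
    return "".join(f"{nt_term(s)} {nt_term(p)} {nt_term(o)} .\n" for s, p, o in triples)

data = [
    ("http://example.org/data/Laura",
     "http://social.relations.com/loves",
     "http://example.org/data/Helmut"),
    ("http://example.org/data/Laura",
     "http://example.org/data/age",
     ("42", "http://www.w3.org/2001/XMLSchema#integer")),
]
print(to_ntriples(data), end="")
```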

Embedded Annotations

Embedded annotations refer to the process of integrating structured data into web pages. This integration is crucial for web crawlers or other machines to understand the content of the web page and its context better.

Schema.org

Schema.org is a collaborative effort, founded by Google, Microsoft, Yahoo, and Yandex, aiming to create, maintain, and promote schemas for structured data on the Internet. It provides a collection of shared vocabularies webmasters can use to mark up their pages in ways that can be understood by the major search engines.

JSON-LD

JSON-LD is a World Wide Web Consortium (W3C) standard for encoding Linked Data using JSON.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://example.com/article"
  },
  "headline": "Example Article",
  "image": "https://example.com/photos/1x1/photo.jpg",
  "author": {
    "@type": "Person",
    "name": "John Doe"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Example.com",
    "logo": {
      "@type": "ImageObject",
      "url": "https://example.com/logo.jpg"
    }
  },
  "datePublished": "2022-05-10T08:00:00+08:00",
  "dateModified": "2022-05-20T09:20:00+08:00"
}
</script>

Microdata

Microdata is an HTML specification used to nest structured data within HTML content.

<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">John Doe</span>
  <span itemprop="email">john@example.com</span>
</div>

RDFa

RDFa (Resource Description Framework in Attributes) is an HTML5 extension that supports linked data.

<div vocab="http://schema.org/" typeof="Person">
  <span property="name">John Doe</span>
  <span property="email">john@example.com</span>
</div>

Use Cases

In addition to SEO and search engine discovery:

Rich Search Results: Inclusion in search results of features like images, reviews, and more.
Social Media Cards: Use of structured data to create a preview of the link with a title, description, and image.
Voice Search and Virtual Assistants: Use of structured data to understand and respond to voice queries.
Email Marketing: To provide users with actions they can take directly from their inbox.

Rich Search Results: Google uses structured data to generate rich search results, which include features like images, reviews, and more. This can lead to higher click-through rates and more traffic.
Social Media Cards: When you share a link on platforms like Facebook, Twitter, or LinkedIn, they use structured data to create a preview of the link with a title, description, and image.
Voice Search and Virtual Assistants: Devices like Google Home and Amazon Echo use structured data to understand and respond to voice queries.
Email Marketing: Gmail uses Schema.org structured data to provide users with actions they can take directly from their inbox, like tracking a package or reviewing a product.

Implementing structured data on your website is a powerful way to control how your information is displayed and interacted with across the web.

Native RDF Stores

Also known as triplestores
Specifically designed to store, retrieve, and manage RDF data
Optimized for SPARQL queries
Highly efficient

Virtuoso
Jena TDB
Stardog

Native RDF stores, also known as triplestores, are specifically designed to store, retrieve, and manage RDF data. They natively understand RDF triples and are often optimized for SPARQL queries. Native RDF stores are highly efficient when working with complex and interconnected RDF datasets due to their underlying design optimized for graph-like data structures.

Examples of Native RDF Stores:

Virtuoso: An open-source native RDF store with high-performance characteristics. It supports SPARQL, RDF graphs, and has an in-built inference engine. Virtuoso can also serve as a relational database or a document store.
Jena TDB: Apache Jena TDB is another efficient native RDF store. It provides full SPARQL support and a Java API for direct interaction with your data.
Stardog: Stardog is a graph database designed for enterprise data unification. Stardog supports SPARQL and includes an inference engine that can handle OWL 2 and user-defined rules.

RDF-Enabled Relational Databases

Traditional relational databases that have been extended to store RDF data
Different techniques like specific relational schemas

D2RQ
Virtuoso RDF Views

RDF-enabled relational databases are traditional relational databases that have been extended to store RDF data. They typically use a variety of techniques, like storing RDF triples in specific relational schemas, to support SPARQL queries and other operations on RDF data.

Examples of RDF-Enabled Relational Databases:

D2RQ: D2RQ is a system for accessing relational databases as virtual, read-only RDF graphs. It offers SPARQL query support over the mapped relational database.
Virtuoso RDF Views: In addition to being a native RDF store, Virtuoso can also create RDF views over traditional relational databases, providing a way to query existing relational data with SPARQL.

NoSQL Databases for RDF Storage

Graph databases
Scalability
Schema-less

AllegroGraph
Neo4j

RDF in the Cloud

Cost-effectiveness
Scalability
Distribution

Amazon Neptune
Google Cloud Datastore
RDF4J Server

As more organizations move their infrastructure to the cloud, there are now options for RDF storage in the cloud, which provide scalability, distribution, and often cost-effectiveness.

Examples of RDF in the Cloud:

Amazon Neptune: Amazon Neptune is a fully managed graph database service that supports RDF. It provides high availability, scalability, and supports SPARQL.
Google Cloud Datastore: While not an RDF store per se, Google Cloud Datastore is a NoSQL database that can be used to store RDF data.
RDF4J Server: The RDF4J Server is a server application that allows clients to interact with RDF databases over HTTP. It can be deployed on any cloud platform that supports Java applications.

These RDF storage options offer different trade-offs of performance, scalability, complexity, and cost, making it essential to choose the right solution for your specific RDF data needs.

Programming languages

Python

rdflib

from rdflib import Graph, Literal, BNode, Namespace, RDF, URIRef

n = Namespace("http://example.org/people/")
g = Graph()

john = BNode()
g.add((john, RDF.type, n.Person))
g.add((john, n.name, Literal('John')))

RDF example

Python

from rdflib import Graph

g = Graph()
g.parse("http://example.org/")

qres = g.query(
    """
    SELECT ?subject ?predicate ?object
    WHERE {
        ?subject ?predicate ?object.
    }
    """)

# Each row holds three terms (subject, predicate, object)
for row in qres:
    print("%s %s %s" % row)

SPARQL example

Java

Apache Jena

import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.VCARD;

Model model = ModelFactory.createDefaultModel();
Resource johnSmith = model.createResource(
          "http://example.org/people/JohnSmith");
johnSmith.addProperty(VCARD.FN, "John Smith");

RDF example

Java

import org.apache.jena.query.*;

String sparqlQueryString = 
    "SELECT ?subject ?predicate ?object\n" +
    "WHERE {\n" +
    "    ?subject ?predicate ?object .\n" +
    "}\n";

// "dataset" is assumed to be an existing Dataset (or Model) that holds the data to query
Query query = QueryFactory.create(sparqlQueryString);
try (QueryExecution qexec = QueryExecutionFactory.create(query, dataset)) {
    ResultSet results = qexec.execSelect();
    ResultSetFormatter.out(System.out, results, query);
}

SPARQL example

C#

dotNetRDF

using VDS.RDF;

var g = new Graph();
var dotNetRDF = g.CreateUriNode(
  UriFactory.Create("http://example.org/people/JohnSmith"));
g.Assert(new Triple(
  dotNetRDF, 
  g.CreateUriNode(
    UriFactory.Create("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")), 
  g.CreateUriNode(UriFactory.Create("http://example.org/Person"))));

RDF example

C#

using VDS.RDF.Query;

SparqlQueryParser parser = new SparqlQueryParser();
SparqlQuery query = parser.ParseFromString(
  @"SELECT ?subject ?predicate ?object 
  WHERE { ?subject ?predicate ?object . }");

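// "endpoint" is assumed to be an existing SparqlRemoteEndpoint, which takes the query as a string
SparqlResultSet resultSet = endpoint.QueryWithResultSet(query.ToString());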
foreach (SparqlResult result in resultSet)
{
    Console.WriteLine(result.ToString());
}

SPARQL example

JavaScript

rdflib.js

var $rdf = require('rdflib');

var store  = $rdf.graph();
var person = $rdf.sym('http://example.org/people/JohnSmith');
var name = $rdf.sym('http://schema.org/name');

store.add(person, name, 'John Smith', person.doc());

RDF example

JavaScript

var $rdf = require('rdflib');

var store  = $rdf.graph();
$rdf.parse(`your RDF data here`, store, 'http://example.org/', 'text/turtle');

var query = $rdf.SPARQLToQuery(
  `SELECT ?subject ?predicate ?object 
  WHERE { ?subject ?predicate ?object . }`, false, store);

store.query(query, function(result) {
    console.log(
      result['?subject'].value, 
      result['?predicate'].value, result['?object'].value);
});

SPARQL example

RDF-star and SHACL

RDF-star and SPARQL-star

RDF-star Basics

Extension of RDF
Express more complex RDF graphs
Triples about triples
Being standardized by the W3C (RDF 1.2)

RDF and RDF-star: A Quick Comparison

ex:Alice ex:knows ex:Bob .

RDF Graph

<< ex:Alice ex:knows ex:Bob >> ex:assertedBy ex:Carol .

RDF-star Graph

In RDF, the basic unit of information is a triple, which consists of a subject, a predicate, and an object. Every statement (or fact) is represented using these triples. For example, consider the statement: “Alice knows Bob.” This could be represented in RDF as:

<Alice> <knows> <Bob> .

RDF-star enhances this model by allowing a triple to appear as the subject or object of another triple. So, for instance, we could represent the statement: “Alice asserts that Bob knows Carol,” using RDF-star as follows:

<< <Bob> <knows> <Carol> >> <isAssertedBy> <Alice> .

Here, << <Bob> <knows> <Carol> >> is a nested triple that acts as the subject of the outer triple.
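
A nested triple can be pictured, very roughly, as a tuple placed inside another tuple. The sketch below uses plain Python tuples, not an RDF library; the ex:confidence property and its value are hypothetical and only illustrate metadata about a statement.

```python
# A triple is a (subject, predicate, object) tuple; a quoted triple is simply
# a tuple used in the subject (or object) position of another tuple.
bob_knows_carol = ("ex:Bob", "ex:knows", "ex:Carol")

graph = {
    ("ex:Alice", "ex:knows", "ex:Bob"),
    (bob_knows_carol, "ex:assertedBy", "ex:Alice"),
    (bob_knows_carol, "ex:confidence", 0.8),  # hypothetical metadata
}

def annotations(graph, quoted):
    """Return {predicate: object} for every triple whose subject is `quoted`."""
    return {p: o for (s, p, o) in graph if s == quoted}

meta = annotations(graph, bob_knows_carol)
print(meta)

# Quoting a triple does not assert it: the inner triple is not itself in the graph.
print(bob_knows_carol in graph)
```

In this sketch, talking about “Bob knows Carol” does not add that statement to the graph, which mirrors how RDF-star separates quoting a triple from asserting it.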

Key Concepts of RDF-star

IRIs (Internationalized Resource Identifiers)
Literals
Blank Nodes
Triples: In RDF-star, a triple consists of a subject, a predicate, and an object. The subject can be an IRI, a blank node, or another triple; the object can additionally be a literal.

IRIs (Internationalized Resource Identifiers): IRIs generalize URIs (and therefore URLs) to the Unicode character set and uniquely identify resources in the RDF-star model. They can be used as the subject, the predicate, or the object of a triple.
Literals: Literals represent concrete data values such as strings, numbers, dates, etc. They can only be used as objects in RDF-star triples.
Blank Nodes: These are placeholders for things that exist but are not identified by any IRI. They can be used as subjects or objects.
Triples: In RDF-star, a triple consists of a subject, a predicate, and an object. The subject can be an IRI, a blank node, or another triple; the object can additionally be a literal. The predicate is always an IRI.

Application of RDF-star

To represent metadata about statements
- The source of a statement
- The time the statement was made
- The level of confidence in the statement
- …
More precise and nuanced knowledge representation

SPARQL-star Basics

Extension of SPARQL
Supports RDF-star data
Being standardized by the W3C (SPARQL 1.2)

SPARQL and SPARQL-star: A Quick Comparison

SELECT ?object
WHERE {
    ex:Alice ex:knows ?object .
}

SPARQL query

SELECT ?assertedBy
WHERE {
    << ex:Alice ex:knows ex:Bob >> 
        ex:assertedBy  ?assertedBy .
}

SPARQL-star query

In SPARQL, queries are designed to match patterns in RDF data. For example, you might query for all triples where the subject is “Alice” and the predicate is “knows” like so:

SELECT ?object
WHERE {
    <Alice> <knows> ?object .
}

This would return all objects that Alice knows, according to the dataset.

SPARQL-star enhances this model by allowing queries over triples that have other triples as subjects or objects. So, for instance, we could find who has asserted that “Bob knows Carol” using SPARQL-star:

SELECT ?assertedBy
WHERE {
    << <Bob> <knows> <Carol> >> <isAssertedBy> ?assertedBy .
}

Here, << <Bob> <knows> <Carol> >> is a nested triple that acts as the subject of the outer triple, and ?assertedBy is a variable that will match with the object of the outer triple.
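
To make the matching of nested triple patterns more concrete, here is a minimal sketch of a pattern matcher over tuples. It is not a SPARQL-star engine: the functions is_var, match, and query are hypothetical helpers, and triples are modelled as plain Python tuples.

```python
def is_var(term):
    """Variables are strings starting with '?', as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")

def match(pattern, term, bindings):
    """Match one pattern term against one data term; return extended bindings or None."""
    if is_var(pattern):
        if pattern in bindings:
            return bindings if bindings[pattern] == term else None
        return {**bindings, pattern: term}
    if isinstance(pattern, tuple):
        # A nested triple pattern only matches a nested (quoted) triple.
        if not isinstance(term, tuple) or len(term) != len(pattern):
            return None
        for sub_pattern, sub_term in zip(pattern, term):
            bindings = match(sub_pattern, sub_term, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == term else None

def query(graph, pattern):
    """Return one bindings dict per triple in `graph` that matches `pattern`."""
    results = []
    for triple in graph:
        bindings = match(pattern, triple, {})
        if bindings is not None:
            results.append(bindings)
    return results

graph = [
    ("ex:Alice", "ex:knows", "ex:Bob"),
    (("ex:Bob", "ex:knows", "ex:Carol"), "ex:assertedBy", "ex:Alice"),
]

pattern = (("ex:Bob", "ex:knows", "ex:Carol"), "ex:assertedBy", "?assertedBy")
print(query(graph, pattern))
```

Variables may also sit inside the nested pattern, for example (("?who", "ex:knows", "ex:Carol"), "ex:assertedBy", "?by"), in which case both levels are bound in one pass.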

Key Concepts of SPARQL-star

Query Forms: SELECT, CONSTRUCT, DESCRIBE, and ASK
Variables: placeholders used to capture and return parts of the data, including nested triples
Triple Patterns

Query Forms: SPARQL-star supports the same query forms as SPARQL, including SELECT (return specific variables), CONSTRUCT (create new RDF data), DESCRIBE (return a description of a resource), and ASK (return a boolean indicating whether a pattern matches).
Variables: Variables are placeholders used to capture and return parts of the data. In a SPARQL-star query, variables can match with any component of an RDF-star triple, including nested triples.
Triple Patterns: In SPARQL-star, a triple pattern is a triple in which any of the subject, predicate, and object may be a variable. In addition, the subject and the object may be a nested triple pattern, which can itself contain variables.

SHACL

Shapes Constraint Language
W3C standard
Validation of RDF
- Users can describe and enforce constraints on RDF graphs
- Ensure data quality and consistency

Why is SHACL Useful?

Data Quality Assurance: ensure data integrity
Schema Documentation: self-documentation (implicitly define the schema of the RDF graph)
Form Generation: generation of forms in a UI
Data Integration: validation of the results of data integration processes

As the Semantic Web grows and evolves, maintaining the consistency and quality of data becomes increasingly important. This is where SHACL plays an integral role:

Data Quality Assurance: SHACL provides mechanisms to define constraints that ensure data integrity. This could range from simple checks like verifying the datatype of a property to complex conditions involving multiple nodes and properties. If the data doesn’t meet these constraints, SHACL validation engines can reject it or flag it for review, thereby enhancing data quality.

Schema Documentation: With SHACL, the shape of the data acts as a form of documentation. By defining the shape of the data, you implicitly create a blueprint of the RDF graph’s structure. This serves as a self-documenting guide, which is especially helpful for large projects where the RDF graphs may have complex structures and dependencies.

Form Generation: SHACL shapes are useful beyond validation. They can be used to generate forms in a User Interface (UI). By defining the structure and properties of the data using SHACL, forms can be automatically generated to match these definitions. This not only simplifies the process of data entry but also ensures that the entered data follows the specified structure.

Data Integration: SHACL can play a significant role in data integration. When merging data from different sources, there could be variations in data structure and format. SHACL can validate the integrated data to ensure it conforms to a predefined shape. This guarantees a consistent and uniform structure of data post-integration.

Interoperability: As a W3C standard, SHACL enables better interoperability across different systems. When different systems agree on the same shape definitions, they can exchange data more reliably, knowing that the data adheres to certain constraints.

Overall, SHACL is a powerful tool for enhancing data quality, documentation, and interoperability in RDF-based systems. Its ability to define complex constraints on RDF graphs makes it an essential tool for any Semantic Web project.

Basic Concepts

SHACL describes the shape of an RDF graph through a set of constraints. Each constraint is associated with a SHACL shape, and each shape is associated with one or more target nodes.

SHACL Shapes

A SHACL shape is a collection of conditions that the data must satisfy.

the type of data that a property can have
the number of values a property can have
the format of a string (regex)
…

Target Nodes

Target nodes are the RDF nodes that a shape applies to.

node type (sh:targetClass)
property usage (sh:targetSubjectsOf, sh:targetObjectsOf)
explicit declaration (sh:targetNode)

SHACL example

@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# The following will pass
ex:John a ex:Person ;
  ex:age 30 ;
  ex:email "john@example.com"^^xsd:string .

# The following will fail
ex:Bob a ex:Person ;
  ex:age 17 .

# The following will fail
ex:Alice a ex:Person ;
  ex:age 21 ;
  ex:email "sdffsd"^^xsd:string .

@prefix ex: <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:PersonShape a sh:NodeShape ;
  sh:targetClass ex:Person ;
  sh:property [
    sh:path ex:age ;
    sh:datatype xsd:integer ;
    sh:minInclusive 18 ;
    sh:maxInclusive 99 ;
    sh:severity sh:Violation ;
  ] ;
  sh:property [
    sh:path ex:email ;
    sh:pattern ".+@.+\\..+" ;
    sh:severity sh:Violation ;
  ] .
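
What a validation engine checks for ex:PersonShape can be sketched in plain Python. This is not a SHACL implementation: the people dictionary and the validate_person helper are hypothetical stand-ins for the data graph and the shape above.

```python
import re

# Data from the example above, as plain dictionaries (property -> list of values).
people = {
    "ex:John": {"ex:age": [30], "ex:email": ["john@example.com"]},
    "ex:Bob": {"ex:age": [17]},
    "ex:Alice": {"ex:age": [21], "ex:email": ["sdffsd"]},
}

EMAIL_PATTERN = re.compile(r".+@.+\..+")  # same regex as sh:pattern in ex:PersonShape

def validate_person(node, properties):
    """Return a list of violation messages for one node (empty list = conforms)."""
    violations = []
    for age in properties.get("ex:age", []):
        # sh:datatype xsd:integer (bool is excluded because it subclasses int)
        if not isinstance(age, int) or isinstance(age, bool):
            violations.append(f"{node}: ex:age {age!r} is not an integer")
        # sh:minInclusive 18 and sh:maxInclusive 99
        elif not 18 <= age <= 99:
            violations.append(f"{node}: ex:age {age} is outside [18, 99]")
    for email in properties.get("ex:email", []):
        # sh:pattern matches anywhere in the value, hence search() rather than fullmatch()
        if not EMAIL_PATTERN.search(email):
            violations.append(f"{node}: ex:email {email!r} does not match the pattern")
    return violations

report = {node: validate_person(node, props) for node, props in people.items()}
for node, violations in report.items():
    print(node, "conforms" if not violations else violations)
```

Note that ex:Bob fails only because of his age: the shape declares no sh:minCount on ex:email, so a missing email is not a violation.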

Advanced SHACL Concepts

Advanced SHACL concepts include logical constraints (such as AND, OR, and NOT), complex property paths, and shape hierarchies.

Comparison: SHACL vs OWL

Complexity and Performance
- OWL has more expressive power
- SHACL offers better performance
Validation vs Inference
- OWL is an inference language (derive new facts)
- SHACL checks whether data conforms to a specific shape or not (no new facts)
User Friendliness: SHACL is easier to learn
Tool Support: broader tool support for OWL

OWL (Web Ontology Language) and SHACL both offer ways to describe and validate RDF graphs. However, there are several key differences between them.

Complexity and Performance: OWL has more expressive power, meaning it can represent more complex relationships and constraints. However, this expressive power comes with a performance cost. SHACL, on the other hand, is less expressive but offers better performance. Therefore, it’s often more suitable for large RDF graphs.
Validation vs Inference: SHACL is primarily a validation language, which means it checks whether data conforms to a specific shape or not. OWL, in contrast, is an inference language, meaning it can derive new facts based on the existing ones.
User Friendliness: SHACL tends to be more straightforward and user-friendly than OWL. The learning curve for SHACL is generally considered to be less steep.
Tool Support: OWL has been around for longer and has broader tool support across various programming languages. SHACL is relatively new and is still growing its ecosystem of tools and libraries. However, the available SHACL tools are often designed with performance in mind.
Specification Clarity: SHACL has a relatively simpler and clearer specification compared to OWL. This can make it easier to understand and implement, especially for newcomers to the Semantic Web field.
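
The difference between inference and validation can be sketched in a few lines of Python. This is only an analogy: infer_types mimics a single RDFS/OWL-style rule and validate_has_owner mimics a single SHACL-style constraint; both functions and the ex:owner property are hypothetical.

```python
facts = {
    ("ex:Lada", "rdf:type", "ex:Car"),
    ("ex:Car", "rdfs:subClassOf", "ex:Vehicle"),
}

def infer_types(facts):
    """Inference: derive new rdf:type facts from rdfs:subClassOf until nothing changes."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(derived):
            if p != "rdf:type":
                continue
            for (sub, q, sup) in list(derived):
                if q == "rdfs:subClassOf" and sub == o and (s, "rdf:type", sup) not in derived:
                    derived.add((s, "rdf:type", sup))
                    changed = True
    return derived

def validate_has_owner(facts, target_class="ex:Car"):
    """Validation: report instances of target_class lacking ex:owner; no fact is added."""
    instances = {s for (s, p, o) in facts if p == "rdf:type" and o == target_class}
    owned = {s for (s, p, o) in facts if p == "ex:owner"}
    return sorted(instances - owned)

inferred = infer_types(facts)
print(("ex:Lada", "rdf:type", "ex:Vehicle") in inferred)  # a new fact was derived
print(validate_has_owner(facts))                          # nodes that violate the constraint
```

Inference enlarges the graph, whereas validation leaves it untouched and only returns a report.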

The ultimate choice between SHACL and OWL will depend on your specific needs. If performance and ease of use are your primary concerns, SHACL is likely the better choice. However, if you need the powerful expressiveness and inference capabilities of OWL and are willing to invest more time into understanding its complexities, then OWL may be the better fit.

R2RML and RML

R2RML

R2RML Basics

RDB to RDF Mapping Language
W3C standard
Language for expressing customized mappings from relational databases to RDF graphs

R2RML, or RDB to RDF Mapping Language, is a W3C recommended language for expressing customized mappings from relational databases to RDF datasets. R2RML allows data stored in relational databases to be integrated with and published on the Semantic Web, enabling the data to be interpreted as meaningful and machine-readable information according to the rules and standards of the RDF data model.

R2RML plays a significant role when it comes to linking relational databases to the Semantic Web, enabling structured data to be interpreted and shared according to Semantic Web standards. This contributes to the creation of a web of linked data, enabling machines and people to uncover more meaningful insights. It’s especially valuable for organizations that want to integrate their existing relational data into a semantic technology stack, or that want to share their data as linked open data.

In summary, understanding R2RML is key for those working with relational databases and Semantic Web technologies. It provides the means to convert structured relational data into a format that can be widely shared and easily combined with other datasets.

Key Components of R2RML

Triples Maps: rules to generate RDF triples
- from the rows of a database table
- or a SQL query’s result
Logical Table: base table (or view) in the database, or a custom SQL query that provides the data
Subject Map: a template that generates the RDF subject for each row
Predicate-Object Maps (POM): define how to generate the RDF predicate and object for each row
RefObjectMap: generate RDF triples when you have foreign key relationships

R2RML mappings consist of several key components, each serving a unique purpose in the transformation process:

Triples Maps: These form the backbone of an R2RML mapping. A triples map describes the rules to generate RDF triples from the rows of a database table or a SQL query’s result.
Logical Table: A logical table is either a base table (or view) in the database, or a custom SQL query that provides the data to be mapped. Each triples map must be linked to one logical table.
Subject Map: The subject map is a template that generates the RDF subject for each row in the logical table. The subject can be either an IRI or a blank node.
Predicate-Object Maps (POM): POMs define how to generate the RDF predicate and object for each row in the logical table.
RefObjectMap: RefObjectMap is used to generate RDF triples when you have foreign key relationships in your database schema. It helps to describe the relationship between different entities in RDF format.

Example of R2RML

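Let’s assume we have a simple database table “Student” with columns “id”, “name”, and “email”.

@prefix rr: <http://www.w3.org/ns/r2rml#> .
# The ex: namespace is assumed for this example
@prefix ex: <http://example.com/ns#> .

<#TriplesMap1>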
  rr:logicalTable [ rr:tableName "Student" ];
  rr:subjectMap [ rr:template "http://example.com/student/{id}" ];
  rr:predicateObjectMap [
    rr:predicate ex:name;
    rr:objectMap [ rr:column "name" ]
  ];
  rr:predicateObjectMap [
    rr:predicate ex:email;
    rr:objectMap [ rr:column "email" ]
  ].
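
What an R2RML processor does with this mapping can be sketched with Python's built-in sqlite3 module. This is not an R2RML engine: the mapping dictionary mirrors <#TriplesMap1> by hand, the apply_triples_map helper is hypothetical, and the row inserted into Student is made-up sample data.

```python
import sqlite3

# Hypothetical content for the "Student" table described above.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us access columns by name
conn.execute("CREATE TABLE Student (id INTEGER, name TEXT, email TEXT)")
conn.execute("INSERT INTO Student VALUES (1, 'Alice', 'alice@example.com')")

# Python mirror of <#TriplesMap1>: a subject template plus (predicate, column) pairs.
mapping = {
    "table": "Student",
    "subject_template": "http://example.com/student/{id}",
    "predicate_object_maps": [("ex:name", "name"), ("ex:email", "email")],
}

def apply_triples_map(conn, mapping):
    """Generate one triple per (row, predicate-object map), skipping NULL values."""
    triples = []
    table = mapping["table"]  # trusted value from our own mapping, not user input
    for row in conn.execute(f'SELECT * FROM "{table}"'):
        subject = mapping["subject_template"].format(**dict(row))
        for predicate, column in mapping["predicate_object_maps"]:
            if row[column] is not None:
                triples.append((subject, predicate, row[column]))
    return triples

triples = apply_triples_map(conn, mapping)
for triple in triples:
    print(triple)
conn.close()
```

Each row of the logical table yields one subject, and each predicate-object map contributes at most one triple for that subject.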

RML

RML Basics

RDF Mapping Language
Extends R2RML
Mappings from various structured data formats (such as JSON, CSV, XML) to RDF datasets
The key components of RML are the same as those of R2RML
- Except for the Logical Source (which replaces R2RML’s logical table)

RML, or RDF Mapping Language, is a mapping language based on the popular R2RML standard. It extends R2RML by allowing mappings from various structured data formats (such as JSON, CSV, XML) to RDF datasets, not just from relational databases. RML facilitates the integration and publication of a wide range of data sources on the Semantic Web, enabling them to be interpreted as meaningful, machine-readable information conforming to the RDF data model.

Logical Source: This replaces R2RML’s logical table. A logical source can be a database table, a file (like a CSV or JSON file), or even the results of an HTTP request. Each triples map must be linked to one logical source.

Example of RML

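Suppose we have a simple CSV file “students.csv” with columns “id”, “name”, and “email”.

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql: <http://semweb.mmlab.be/ns/ql#> .
# The ex: namespace is assumed for this example
@prefix ex: <http://example.com/ns#> .

<#TriplesMap1>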
  rml:logicalSource [
    rml:source "students.csv";
    rml:referenceFormulation ql:CSV
  ];
  rr:subjectMap [
    rr:template "http://example.com/student/{id}"
  ];
  rr:predicateObjectMap [
    rr:predicate ex:name;
    rr:objectMap [ rml:reference "name" ]
  ];
  rr:predicateObjectMap [
    rr:predicate ex:email;
    rr:objectMap [ rml:reference "email" ]
  ].
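
The effect of this RML mapping on a CSV source can be sketched with Python's csv module. This is not an RML processor: the CSV content is made up and inlined so that the sketch is self-contained, and treating empty cells as missing values is an assumption of this sketch.

```python
import csv
import io

# Hypothetical content of students.csv, inlined so the sketch is self-contained.
students_csv = io.StringIO(
    "id,name,email\n"
    "1,Alice,alice@example.com\n"
    "2,Bob,\n"
)

subject_template = "http://example.com/student/{id}"
predicate_object_maps = [("ex:name", "name"), ("ex:email", "email")]

triples = []
for record in csv.DictReader(students_csv):
    subject = subject_template.format(**record)
    for predicate, reference in predicate_object_maps:
        value = record[reference]
        if value:  # assumption: an empty CSV cell generates no triple
            triples.append((subject, predicate, value))

for triple in triples:
    print(triple)
```

Compared with the R2RML case, only the way records are read changes; the subject template and the predicate-object maps play the same role.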

KG Embeddings

What are KG embeddings?

a technique used to represent the entities and relations in a knowledge graph as vectors in a continuous vector space
translate the high-dimensional, sparse, and often symbolic information in a knowledge graph ⇒ a low-dimensional, dense, and continuous space where semantic relationships are preserved

A Knowledge Graph (KG) embedding is a technique used to represent the entities (nodes) and relations (edges) in a knowledge graph as vectors (or embeddings) in a continuous vector space. The primary goal of KG embeddings is to translate the high-dimensional, sparse, and often symbolic information in a knowledge graph into a low-dimensional, dense, and continuous space where semantic relationships are preserved.

In a knowledge graph, knowledge is usually represented in the form of triples, consisting of two entities and one relation. For example, a triple could be (Barack Obama, Born In, Hawaii), where “Barack Obama” and “Hawaii” are entities, and “Born In” is the relation.

Knowledge graph embedding methods learn embeddings for these entities and relations such that the structure and semantics of the graph are reflected in the vector space. For instance, if the method works correctly, entities that are similar or closely related in the graph will have embeddings that are close in the vector space.

Different KG embedding methods use different strategies to achieve this goal. Some methods, like TransE, interpret relations as translations in the embedding space; others, like DistMult or ComplEx, model relations as interactions between entity embeddings. These methods are usually trained so that observed triples receive better scores than corrupted triples (triples that are assumed to be false).
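
As a minimal sketch of the translation idea behind TransE, the snippet below compares an observed triple with a corrupted one. The 2-dimensional vectors are hand-picked for illustration; they are not learned embeddings, and “Paris” is an entity added only to build the corrupted triple.

```python
import math

# Hand-picked 2-dimensional embeddings, chosen only to illustrate the arithmetic.
entity = {
    "Barack Obama": [0.9, 0.1],
    "Hawaii": [1.0, 1.0],
    "Paris": [-1.0, 0.5],
}
relation = {"Born In": [0.1, 0.9]}

def transe_distance(head, rel, tail):
    """TransE: Euclidean distance between (head + relation) and tail; lower is more plausible."""
    h, r, t = entity[head], relation[rel], entity[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

true_triple = transe_distance("Barack Obama", "Born In", "Hawaii")
corrupted = transe_distance("Barack Obama", "Born In", "Paris")
print(true_triple, corrupted)
```

Training adjusts the vectors so that observed triples end up with small distances and corrupted triples with large ones.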

Once trained, these embeddings can be used for various tasks such as link prediction (predicting missing relationships in the graph), entity resolution (identifying whether different names refer to the same entity), entity classification, recommendation systems, and more.

Common features

Vector Space Representation
Learning from Triples
Distance or Similarity Measure
Predictive Modeling
Optimization Problem
Unsupervised Learning
Scalability

Vector Space Representation: All these methods represent entities and relations in a knowledge graph as vectors (or embeddings) in a continuous vector space. The main goal is to encode the semantics of the entities and the relations into these continuous representations.
Learning from Triples: All these methods learn embeddings from triples that consist of two entities and a relation that connects them (usually denoted as (head, relation, tail)). These triples represent factual knowledge.
Distance or Similarity Measure: The embeddings are learned such that the distance or similarity measure in the embedding space reflects the likelihood of the triples. For example, in TransE, the embeddings are learned such that the sum of the head and relation embeddings is close to the tail embedding for true triples.
Predictive Modeling: Once the embeddings are learned, they can be used to make predictions about new triples, such as inferring missing links in the knowledge graph (link prediction), or predicting the most likely type of relationship between two entities.
Optimization Problem: The learning process for these embeddings usually involves solving an optimization problem, where the objective is to maximize the likelihood of the observed triples and minimize the likelihood of corrupted triples (triples that are assumed to be false).
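Unsupervised Learning: The learning process is generally unsupervised (more precisely, self-supervised), which means it does not require manually labeled data. The models learn purely from the structure of the knowledge graph: the triples it contains serve as positive examples, and corrupted triples are generated automatically as negative examples.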
Scalability: These methods are designed to be scalable and able to handle large knowledge graphs with millions of entities and relations.
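
The optimization step can be illustrated with a margin-based ranking loss, which is one common choice (used by TransE, for instance) and not the only one. The corrupt and margin_ranking_loss helpers and the distances passed to them are hypothetical values for illustration.

```python
import random

def corrupt(triple, entities, rng):
    """Build a negative example by replacing the head or the tail with a random other entity."""
    head, rel, tail = triple
    if rng.random() < 0.5:
        candidates = [e for e in entities if e != head]
        return (rng.choice(candidates), rel, tail)
    candidates = [e for e in entities if e != tail]
    return (head, rel, rng.choice(candidates))

def margin_ranking_loss(positive_distance, negative_distance, margin=1.0):
    """Zero once the negative triple is at least `margin` farther away than the positive one."""
    return max(0.0, margin + positive_distance - negative_distance)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
entities = ["Barack Obama", "Hawaii", "Paris"]
positive = ("Barack Obama", "Born In", "Hawaii")
negative = corrupt(positive, entities, rng)
print(negative)

well_separated = margin_ranking_loss(0.2, 2.0)  # negative is far enough: loss is zero
too_close = margin_ranking_loss(0.9, 1.1)       # negative is too close: loss is positive
print(well_separated, too_close)
```

A randomly corrupted triple may happen to be true in reality; such false negatives are usually tolerated or filtered out.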

Why KG embeddings?

Link Prediction
Entity Resolution
Entity Classification
Recommendation Systems
Question Answering Systems
Drug Discovery

Knowledge graph embeddings have wide applications in various fields. Here are a few examples of their usage:

Link Prediction: This is one of the most common applications of knowledge graph embeddings. It involves predicting missing relationships (or links) between entities in a knowledge graph based on the embeddings. The vector representations capture semantic similarities and can therefore be used to infer possible relationships.
Entity Resolution: Knowledge graph embeddings can be used to determine whether different names or representations actually refer to the same real-world entity (also known as deduplication). For instance, “Barack Obama” and “Obama, Barack” should be recognized as the same entity.
Entity Classification: The embeddings can also be used to classify entities into various categories. For instance, in a scientific literature knowledge graph, you might want to classify entities into categories like “researchers,” “papers,” “conferences,” etc.
Recommendation Systems: Knowledge graph embeddings can be used to develop recommendation systems. For example, given a user’s past behavior on a website, you can use embeddings to suggest items that are similar to the user’s previous interests.
Question Answering Systems: Knowledge graph embeddings can also be used to develop systems that answer questions. This is done by mapping the question to a query in the embedded space and retrieving the most relevant answers.
Drug Discovery: In the biomedical domain, knowledge graph embeddings can be used to predict unknown drug-drug interactions or drug-disease associations, which could be critical for drug discovery and repurposing.
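
Link prediction boils down to ranking candidate entities for an incomplete triple. The sketch below ranks candidate tails for the query (Aspirin, treats, ?) with a TransE-style distance. The entities, the relation, and the 2-dimensional vectors are all made up for illustration and carry no medical meaning.

```python
# Hand-picked 2-dimensional embeddings, for illustration only.
entity = {
    "Aspirin": [0.0, 0.0],
    "Headache": [1.0, 0.1],
    "Fever": [0.9, -0.2],
    "Ibuprofen": [0.1, 0.1],
}
relation = {"treats": [1.0, 0.0]}

def distance(head, rel, tail):
    """Squared TransE distance between head + relation and tail (lower = more plausible)."""
    return sum(
        (h + r - t) ** 2
        for h, r, t in zip(entity[head], relation[rel], entity[tail])
    )

def rank_tails(head, rel):
    """Rank every other entity as a candidate tail for the query (head, rel, ?)."""
    candidates = [e for e in entity if e != head]
    return sorted(candidates, key=lambda tail: distance(head, rel, tail))

ranking = rank_tails("Aspirin", "treats")
print(ranking)
```

The top-ranked candidates are the predicted missing links; evaluation metrics such as mean rank or Hits@k are computed from rankings of this kind.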

Knowledge Graph Embedding Techniques

TransE

How it works: Represents relationships as translations in the embedding space.
Pros: Simple, efficient, effectively captures semantic relationships between entities.
Cons: Struggles with modeling 1-to-N, N-to-1, and N-to-N relationships, as well as symmetric relations (h + r ≈ t and t + r ≈ h together force r ≈ 0).
Paper: Translating Embeddings for Modeling Multi-relational Data

TransR

How it works: Introduces relation-specific embedding spaces.
Pros: Handles complex relational patterns more effectively than TransE.
Cons: Involves higher computational costs due to additional mapping matrices.
Paper: Learning Entity and Relation Embeddings for Knowledge Graph Completion

DistMult

How it works: Uses a bilinear model to represent relations, treats relation as a diagonal matrix.
Pros: Simplifies the tensor product operation, reduced computational complexity.
Cons: Struggles with asymmetric relationships because it’s inherently symmetric.
Paper: Embedding Entities and Relations for Learning and Inference in Knowledge Bases
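
The symmetry of DistMult follows directly from its scoring function, as this small sketch shows. The vectors and the parent_of relation are hand-picked for illustration.

```python
def distmult_score(head, rel, tail):
    """DistMult: sum over dimensions of head_i * rel_i * tail_i (higher = more plausible)."""
    return sum(h * r * t for h, r, t in zip(head, rel, tail))

# Hand-picked vectors, for illustration only.
alice = [0.5, -1.0, 2.0]
bob = [1.5, 0.3, -0.7]
parent_of = [0.8, 0.1, -0.4]

forward = distmult_score(alice, parent_of, bob)   # (Alice, parentOf, Bob)
backward = distmult_score(bob, parent_of, alice)  # (Bob, parentOf, Alice)
print(forward, backward)
```

Swapping head and tail only reorders the factors of each product, so both directions get the same score (up to floating-point rounding), even though “parent of” is clearly not a symmetric relation.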

HolE

How it works: Uses circular correlation of entity embeddings to model relationships.
Pros: Effective at modeling complex and asymmetric relationships, reduces number of parameters.
Cons: Could be computationally intensive due to the correlation operation.
Paper: Holographic Embeddings of Knowledge Graphs

ComplEx

How it works: Uses complex-valued embeddings to better handle asymmetric relationships.
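Pros: Models both symmetric and asymmetric relationships while keeping a simple, DistMult-like scoring function.
Cons: The complex embeddings can be more challenging to interpret and require more computational resources due to the need to handle complex numbers.
Paper: Complex Embeddings for Simple Link Prediction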

RDF2Vec

How it works: Generates sequences of entities (walks) from the graph, and then applies the Word2Vec model on these walks to create the embeddings.
Pros: Captures both local and global semantic information, flexible, can work with different types of graphs.
Cons: Quality of embeddings depends on the walks generated, does not explicitly model relations.
Paper: RDF2Vec: RDF Graph Embeddings for Data Mining
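
The walk-generation step of RDF2Vec can be sketched as follows. The graph is a tiny made-up example and random_walk is a hypothetical helper; the second step, training a Word2Vec model on the walks, requires an external library and is not shown.

```python
import random

# Tiny hypothetical graph as (subject, predicate, object) triples.
triples = [
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Bob", "ex:knows", "ex:Carol"),
    ("ex:Bob", "ex:worksAt", "ex:Acme"),
    ("ex:Carol", "ex:worksAt", "ex:Acme"),
]

# Index outgoing edges by subject.
outgoing = {}
for s, p, o in triples:
    outgoing.setdefault(s, []).append((p, o))

def random_walk(start, depth, rng):
    """Follow up to `depth` random outgoing edges, recording predicates and entities."""
    walk = [start]
    node = start
    for _ in range(depth):
        edges = outgoing.get(node)
        if not edges:
            break  # dead end: no outgoing edge
        predicate, node = rng.choice(edges)
        walk.extend([predicate, node])
    return walk

rng = random.Random(42)  # fixed seed so the sketch is reproducible
walks = [random_walk("ex:Alice", depth=2, rng=rng) for _ in range(3)]
for walk in walks:
    print(" -> ".join(walk))
```

Each walk is then treated like a sentence whose “words” are entities and predicates, which is what allows a Word2Vec model to be applied.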

RESCAL

How it works: Models each relationship as a full (dense) matrix.
Pros: Expresses complex interaction types.
Cons: Can be computationally expensive and prone to overfitting due to large number of parameters.
Paper: A Three-Way Model for Collective Learning on Multi-Relational Data

RotatE

How it works: Represents relations as rotations in the complex space.
Pros: Can model various types of relations, including symmetric, antisymmetric, inverse, and composed relations.
Cons: May require fine-tuning of hyperparameters.
Paper: RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space
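
The rotation idea can be sketched with Python's built-in complex numbers. The 2-dimensional complex vectors and the phases are hand-picked for illustration, and rotate_distance is a simplified stand-in for the model's distance function.

```python
import cmath
import math

def rotate_distance(head, phases, tail):
    """Rotate each complex head coordinate by a phase, then sum the distances to tail."""
    return sum(
        abs(h * cmath.exp(1j * phase) - t)
        for h, phase, t in zip(head, phases, tail)
    )

# Hand-picked 2-dimensional complex embeddings, for illustration only.
alice = [1 + 0j, 0 + 2j]
bob = [0 + 1j, -2 + 0j]  # alice rotated by 90 degrees in every dimension

quarter_turn = [math.pi / 2, math.pi / 2]    # an asymmetric relation
inverse_turn = [-math.pi / 2, -math.pi / 2]  # its inverse: rotate back

forward = rotate_distance(alice, quarter_turn, bob)   # close to 0: plausible
backward = rotate_distance(bob, quarter_turn, alice)  # far from 0: direction matters
inverse = rotate_distance(bob, inverse_turn, alice)   # close to 0: inverse relation
print(forward, backward, inverse)
```

Because a rotation has a direction, the same relation scores differently when head and tail are swapped, and the inverse relation is obtained simply by negating the phases.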

QuatE

How it works: Uses quaternion algebra to model entities and relations.
Pros: Captures more complex interactions and dependencies.
Cons: More computationally intensive and complex.
Paper: Quaternion Knowledge Graph Embeddings

KG embeddings vs OWL & SPARQL

KG embeddings: Pros

Scalability
Predictive Power
Robustness to Noise
Inductive and abductive reasoning

KG embeddings: Cons

Lack of Explicit Semantics
Difficulty in Incorporating Prior Knowledge
No deductive reasoning

OWL and SPARQL: Pros

Explicit Semantics
Incorporation of Prior Knowledge
Standardization
Deductive reasoning

OWL and SPARQL: Cons

Scalability
Lack of Predictive Power
Sensitivity to Noise
No inductive and abductive reasoning

Tools

W3C Tools: tool repository
YATE: Turtle editor
LOD-Cloud: LOD repository
Protégé: ontology editor
BioPortal: biology repository
JSON-LD Generator
Google Structured Data Testing Tool
French Government tools: collection of publications, practical guides and tools

TL;DR

RDF, RDFS, OWL

RDF (Resource Description Framework), RDFS (RDF Schema), and OWL (Web Ontology Language) are the foundational technologies that enable us to define and structure our data in a way that is both human-readable and machine-interpretable, creating rich, interconnected webs of data.

SPARQL

SPARQL is a powerful query language for RDF. SPARQL allows us to interrogate our data, ask complex questions, and extract valuable insights.

KG Storage

There are many storage solutions for our Knowledge Graphs. Understanding different approaches to KG storage is crucial for ensuring the performance, scalability, and long-term maintainability of our datasets.

RDF-star and SHACL

RDF-star is an extension of RDF that allows statements about other statements.

SHACL (Shapes Constraint Language) is a language for validating RDF graphs against a set of conditions. These technologies provide us with even more expressivity and reliability in our data handling.

R2RML and RML

In the R2RML and RML section, we learnt how to map our existing relational databases to RDF using R2RML (RDB to RDF Mapping Language), and how to transform various data formats (CSV, JSON, XML, etc.) into RDF using RML (RDF Mapping Language).

KG Embeddings

KG Embeddings allow us to represent nodes and relationships from our Knowledge Graph in a numerical, dense vector space. This is a powerful technique for applying machine learning methods to our KG, opening up possibilities for tasks like link prediction, entity resolution, and recommendation systems.

Outro

Not seen

Ontology Design and Engineering
Data Integration and Interlinking
KG Visualization
Privacy and Ethics in KGs

Additional resources

This course is largely based on the following resources:

Knowledge Graphs by Hogan et al. (2021)

CS 520 Knowledge Graphs and CS224W Machine Learning with Graphs (Stanford courses)

Knowledge Representation and Reasoning and Semantic Web by Antoine Zimmermann

Embedding Knowledge Graphs with RDF2vec by Heiko Paulheim, Petar Ristoski, Jan Portisch

Contact

If you have any questions or comments, please do not hesitate to contact me:

pierre[dash]henri[dot]paris[at]telecom[dash]paris[dot]fr