Machine Learning on Linked Data

Tensors and their Applications in Graph-Structured Domains

Maximilian Nickel, Ludwig Maximilian University Munich

Volker Tresp, Siemens AG

What the tutorial will be about

This tutorial is dedicated to how machine learning on Linked Data can be realized using tensor factorizations. The focus rests on applications that are important to Linked Data; the necessary theory is introduced along the way.

  1. From Linked Data to Tensors
    Some tensor theory and data representation of Linked Data as tensors
  2. Explanatory Analysis of Linked Data
    Starting from graphs and moving to multigraphs for ranking of nodes in RDF data
  3. Relational Learning & Link Prediction
    Prediction of unknown links, a central problem when learning on relational data.
  4. Entity Resolution & Taxonomy Learning
    Unsupervised instance matching and learning of taxonomies
  5. Scaling to complete Knowledge Bases
    Scaling tensor factorizations to the size of complete knowledge bases
  6. Analysis of time-evolving graphs

Machine Learning and Linked Data

  • Machine Learning depends on repeatable statistical patterns to generalize from examples
  • The party membership of a U.S. vice president can be predicted reasonably well, given knowledge about his president's party
  • Predicting the first name of US (vice) presidents is not meaningful (one-to-one relationship, no recurring patterns)
  • Machine Learning depends strongly on the data at hand and the task that should be solved

Linked Data and Machine Learning

The Linked Data principles are:

  • Use URIs as names for things
  • Use HTTP URIs, so that people can look up those names
  • When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
  • Include links to other URIs, so that they can discover more things

From a machine learning perspective:

  • Linked Data is (semi-)structured, i.e. (subject, predicate, object) triples
  • Data consists of entities and relations of different types between those entities
    • Entities = All subjects and objects that are not literals
      (In this tutorial: no special treatment of ontological knowledge)
    • Relations = Individual links between entities
    • Relation Types = Properties in RDF

Relations, the Beauty & the Challenge

  • Knowledge models such as Linked Data and many problems in machine learning have a natural representation as relational data
  • Relations between entities are often more important for a prediction task than attributes
  • For instance, it can be easier to predict the party of a vice president from the party of his president than from his own attributes
  • Linked Data is not independent and identically distributed (iid), as
    • it does not consist of only one type of objects (e.g only persons)
    • the entities are related to each other

Learning, Linked Data & Heterogeneity

  • Heterogeneity introduces additional noise to the data
    • Duplicate entities: URIs from different data sources refer to the same underlying entity (without specifying owl:sameAs)
    • Duplicate predicates: Same case as for duplicate entities
    • Predicates with similar semantics: Probably more common case, for instance, different domain/range constraints, super/sub-predicates etc.
  • Need of Linked Data: Instance Matching and Ontology Alignment to remove the noise introduced by heterogeneity
  • Need of Machine Learning: Joint disambiguation of predicates and entities

The Open World of Linked Data

  • Linked Data follows an open-world assumption, i.e. data cannot generally be assumed to be complete
  • In fact, Linked Data can be quite incomplete
  • Need of Linked Data: Prediction of unknown triples in incomplete knowledge bases $\Rightarrow$ Link Prediction
  • Problem for ML: Linked Data rarely contains negative examples or negations (i.e. !partyOf(Al Gore, Republican))

Challenges !?

Many of the features that make Linked Data so valuable (e.g. relations, linked data sources, size) can be a problem for machine learning and for Linked Data itself

The rest of this tutorial is dedicated to showing how most of these problems can be approached by using
tensor factorizations

Data Representation

Linked Data
$\downarrow$
Tensors

Linked Data is a Multigraph

From a mathematical perspective, Linked Data can be regarded as a labeled directed multigraph. A minimal definition for the purpose of this tutorial is

A labeled directed multigraph is a tuple $\mathcal{G} := (V, E, L)$
where

  • $V$ is a set of vertices
  • $L$ is a set of edge labels
  • $E \subseteq V \times V \times L$ is a set of ordered triples
  • $V$ corresponds to the entities in the domain
  • $L$ corresponds to the different relation types
  • $E$ corresponds to the known facts about the entities (RDF triples)

Multigraph Example

(Figure: multigraph of the running example with the edge labels presidentOf, vicePresidentOf, partyOf, and knows)

The set of vertices, edge labels and edges for this multigraph are \begin{align} V = \{&\text{Bill Clinton}, \text{Al Gore}, \text{Democratic Party }\} \\ \\ L = \{&\text{presidentOf}, \text{vicePresidentOf}, \text{partyOf}, \text{knows }\}\\ \\ E = \{&\text{(Bill Clinton, Al Gore, presidentOf)},\\ & \text{(Al Gore, Democratic Party, partyOf)}, \\ & \vdots \\ & \text{(Bill Clinton, Al Gore, knows) } \} \end{align}
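
As a small illustration (hypothetical snippet, not from the original slides; the triples elided by the dots are filled in plausibly), the running example can be written down directly as the three sets from the definition above:

    # Minimal sketch: the running example as a labeled directed multigraph G = (V, E, L)
    V = {"Bill Clinton", "Al Gore", "Democratic Party"}
    L = {"presidentOf", "vicePresidentOf", "partyOf", "knows"}
    E = {
        ("Bill Clinton", "Al Gore", "presidentOf"),
        ("Al Gore", "Democratic Party", "partyOf"),
        ("Bill Clinton", "Democratic Party", "partyOf"),
        ("Bill Clinton", "Al Gore", "knows"),
    }

    # every edge must connect two known vertices with a known label
    assert all(s in V and o in V and p in L for (s, o, p) in E)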

Matrix Representation of Graphs

Adjacency matrices are standard representations for graphs, i.e. data with only one type of edge or relation

(Figure: adjacency matrix $A$ of a likes relation; filled cells $=1$, empty cells $=0$)
  • Some graph properties translate directly to the adjacency matrix
    • Undirected $\rightarrow$ Symmetric
    • Directed $\rightarrow$ Asymmetric
    • Unweighted $\rightarrow$ Binary
  • Analysis of graphs using adjacency matrices
    • Ranking: PageRank, HITS
    • Link Prediction: Common Neighbours, Katz, Adamic/Adar, low-rank matrix factorization
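
A minimal sketch in Python/numpy (illustrative values only) of how the properties above show up in an adjacency matrix:

    import numpy as np

    # Toy adjacency matrix of a directed, unweighted "likes" relation over 4 nodes
    A = np.array([[0, 1, 0, 0],
                  [0, 0, 1, 1],
                  [1, 0, 0, 0],
                  [0, 1, 0, 0]])

    print(np.array_equal(A, A.T))          # False -> the relation is directed
    print(set(np.unique(A)) <= {0, 1})     # True  -> the graph is unweighted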

Multigraphs and Matrices

Multigraphs cannot be represented by a single matrix without information loss, unless constraints on their structure are enforced

(Figure: separate adjacency matrices for the relations Loves, Hates, and Indifferent, which cannot simply be summed into one matrix)

Moving from Matrices to Tensors

  • Underlying problem: Two modes, as in the matrix case, are not expressive enough to model multi-relational data without information loss
  • Possible solution: Move to tensors, which are $n$-modal generalizations of matrices

What are tensors?

Formally, tensors are defined as

An $n$-th order tensor is an element of the tensor product of $n$ vector spaces, i.e.

$\mathcal{T} \in V_1 \otimes ... \otimes V_n$
  • Less formally: tensors are multidimensional arrays, i.e. data structures with (possibly) more than two indices ($\mathcal{T}_{ijqrs}$)
  • The order of a tensor is the number of modes (i.e., the number of indices needed to identify an element of the array)

What are tensors?

  • Vectors are 1st order tensors: $x_i$ (one mode)
  • Matrices are 2nd order tensors: $X_{ij}$ (two modes)
  • Tensors of order $\geq$ 3 are called higher-order tensors $\mathcal{T}_{ijqrs}$

In the following we will only use third-order tensors, i.e. "cubes"

Example for a third-order tensor

$\mathcal{T} \in \mathbb{R}^{10 \times 7 \times 5}$
(Figure: highlighted cells $\mathcal{T}_{1,1,1}$, $\mathcal{T}_{3,7,2}$, and $\mathcal{T}_{1,5,4}$)
Mode-1 has dimension $I = 10$
Mode-2 has dimension $J = 7$
Mode-3 has dimension $K = 5$

Tensor Representation of RDF Data

  • RDF data, or any other multigraph, can be represented as a third-order tensor
  • Two modes refer to the entities, one mode to the relation types \begin{align} \Rightarrow & |Mode\ 1| = |Mode\ 2| = |V|\\ & |Mode\ 3| = |L| \end{align}
  • We also require that the ordering of the entities on both modes is identical
$\mathcal{T}_{ijk}=\begin{cases}1, & \text{if the triple ($i$-th entity, $k$-th predicate, $j$-th entity) exists}\\ 0, & \text{otherwise}\end{cases}$
In the following we will refer to this modeling as adjacency tensor
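
A minimal sketch (assuming numpy; the helper name adjacency_tensor and the triples are illustrative) of how such an adjacency tensor can be filled from a list of (subject, predicate, object) triples:

    import numpy as np

    def adjacency_tensor(triples, entities, predicates):
        """Build a binary tensor T with T[i, j, k] = 1 iff the triple
        (i-th entity, k-th predicate, j-th entity) is observed."""
        e_idx = {e: i for i, e in enumerate(entities)}
        p_idx = {p: k for k, p in enumerate(predicates)}
        T = np.zeros((len(entities), len(entities), len(predicates)))
        for s, p, o in triples:
            T[e_idx[s], e_idx[o], p_idx[p]] = 1.0
        return T

    entities = ["Bill Clinton", "Al Gore", "Democratic Party"]
    predicates = ["presidentOf", "vicePresidentOf", "partyOf"]
    triples = [("Bill Clinton", "presidentOf", "Al Gore"),
               ("Al Gore", "vicePresidentOf", "Bill Clinton"),
               ("Al Gore", "partyOf", "Democratic Party"),
               ("Bill Clinton", "partyOf", "Democratic Party")]

    T = adjacency_tensor(triples, entities, predicates)
    print(T.shape)   # (3, 3, 3): entity x entity x predicate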

Running Example $\rightarrow$ Tensor

The adjacency tensor for the running example used so far is

(Figure: the multigraph $\mathcal{G}$ of the running example and its adjacency tensor $\mathcal{T}$, with one frontal slice per relation type presidentOf, vicePresidentOf, partyOf)

An adjacency tensor contains a cell for every possible triple that could be created from the entities and relation types that are present in the data

Notation & Terminology

Slices are two-dimensional sections of a tensor (i.e. matrices)
Horizontal slices
$\mathcal{T}_{i,:,:}$
Lateral slices
$\mathcal{T}_{:,j,:}$
Frontal slices
$\mathcal{T}_{:,:,k}$

Notation & Terminology

Fibers are higher-order analogues of rows and columns in matrices
Mode-1 fibers
$\mathcal{T}_{:,j,k}$
Mode-2 fibers
$\mathcal{T}_{i,:,k}$
Mode-3 fibers
$\mathcal{T}_{i,j,:}$
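
The slicing and fiber notation maps directly onto array indexing; a small numpy sketch (illustrative only):

    import numpy as np

    T = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # a small third-order tensor

    horizontal_slice = T[0, :, :]   # T_{i,:,:}, a 3 x 4 matrix
    lateral_slice    = T[:, 1, :]   # T_{:,j,:}, a 2 x 4 matrix
    frontal_slice    = T[:, :, 2]   # T_{:,:,k}, a 2 x 3 matrix

    mode1_fiber = T[:, 1, 2]        # T_{:,j,k}, analogue of a column
    mode2_fiber = T[0, :, 2]        # T_{i,:,k}, analogue of a row
    mode3_fiber = T[0, 1, :]        # T_{i,j,:}, a "tube" along the third mode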

Machine Learning on Tensors

The general focus of this tutorial will be learning on adjacency tensors of multigraphs with tensor factorizations
  • Tensor factorizations can be considered higher-order generalizations of matrix factorization methods
  • Matrix factorizations have a long tradition in machine learning
  • Matrix factorizations try to express data in terms of factor matrices
  • Factorization methods have an appealing interpretation as latent variable models

The CP Decomposition & Explanatory Analysis

Hubs, Authorities, Topics & Co.

The Beatles on DBpedia

What are the important nodes of this multigraph?

In the Case of Graphs

Authoritative Ranking with HITS

  • Task: Ranking of nodes relative to their relevance for a query
  • Main Idea: Describe nodes in the graph in terms of hub- and authority-scores
  • Authority score $a_j$: Importance of node $j$ for incoming links (web context: value of the content of a page), i.e. $a_j = \sum_{i \rightarrow j} h_i$
  • Hub score $h_i$: Importance of node $i$ for outgoing links (web context: value of its links to other pages), i.e. $h_i = \sum_{i \rightarrow j} a_j$
  • Mutual Reinforcement: Good hubs link to many good authorities, good authorities receive many links from good hubs

(Rank-reduced) Singular Value Decomposition

Any matrix $X$ can be decomposed into a weighted sum of outer products of pairwise orthonormal vectors, i.e.

$$X = \sum_r \lambda_r u_r \circ v_r = \lambda_1 u_1 \circ v_1 + \dots + \lambda_r u_r \circ v_r$$
  • $\lambda_{i}$ is the $i$-th singular value of $X$
  • $u_i$ is the $i$-th left singular vector of $X$
  • $v_i$ is the $i$-th right singular vector of $X$
  • the sets $u_i$, $v_i$ are pairwise orthonormal, e.g. $u_i^Tu_i = 1$, $u_i^Tu_{j\neq i} = 0$
  • the rank of $X$ can be defined as the minimal number $r$ of outer products $\lambda_r u_r \circ v_r$ that generate $X$ as their sum

Some Properties of SVD

  • The left singular vectors $u_r$ of $X$ are the eigenvectors of the kernel matrix $XX^T$, i.e. $$XX^T u_r = \lambda_r^2 u_r$$
  • The right singular vectors $v_r$ of $X$ are the eigenvectors of the covariance matrix $X^TX$, i.e. $$X^TX v_r = \lambda_r^2 v_r$$
  • SVD has many applications, e.g. LSI, dimensionality reduction, recommendation engines, HITS
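
A small numerical check of these two properties (a sketch assuming numpy; note that the eigenvalues of $XX^T$ are the squared singular values):

    import numpy as np

    X = np.random.rand(5, 4)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # left singular vectors are eigenvectors of X X^T with eigenvalues s**2
    print(np.allclose(X @ X.T @ U[:, 0], s[0]**2 * U[:, 0]))   # True
    # right singular vectors are eigenvectors of X^T X
    print(np.allclose(X.T @ X @ Vt[0], s[0]**2 * Vt[0]))       # True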

In the Case of Graphs

Authoritative Ranking with HITS

  • The hub- and authority-scores are defined as \begin{align*} a_j & = \sum_{i \rightarrow j} h_i & h_i & = \sum_{i \rightarrow j} a_j \\ \end{align*}
  • In terms of the adjacency matrix $A$, this can be stated as \begin{align*} a & = A^T h & h & = A a \\ \Rightarrow a & = A^TAa & h & = AA^Th \end{align*}
  • $\Rightarrow$ The vectors $a$ and $h$ correspond to the singular vectors of $A$, i.e.

    $$A = \sum_r \lambda_r u_r \circ v_r = \lambda_1 u_1 \circ v_1 + \dots + \lambda_r u_r \circ v_r$$

    where $h = u_1$ and $a = v_1$ capture the dominant aspect of the graph, since $\lambda_1 \geq \lambda_{r \neq 1}$
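
A minimal sketch (assuming numpy; illustrative adjacency matrix) of computing hub and authority scores as the dominant singular vectors:

    import numpy as np

    # toy directed adjacency matrix (illustrative values)
    A = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [1, 0, 0, 0]], dtype=float)

    U, s, Vt = np.linalg.svd(A)
    hubs        = np.abs(U[:, 0])   # h = u_1 (up to sign)
    authorities = np.abs(Vt[0, :])  # a = v_1 (up to sign)
    print(hubs, authorities)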

Toy Data

The Beatles Soap Opera

  • dislikes
  • doesn't like
  • likes

A completely made-up social network for demonstration purposes

HITS on the Beatles Soap Opera

(Figure: HITS hub and authority scores computed separately on the likes, dislikes, and doesn't-like adjacency matrices)

Graphs $\rightarrow$ Multigraphs

Authoritative Ranking with CP

  • Difficulties with HITS
    • HITS is a graph algorithm and doesn't know about multigraphs
  • TOPHITS: Models a multigraph not only in terms of hubs and authorities but also in terms of its edge labels, i.e. the relation types
  • \begin{align*} \text{HITS: }\quad A & \approx \sum\nolimits_r \lambda_r h_r \circ a_r\\ \text{TOPHITS: }\quad \mathcal{T} & \approx \sum\nolimits_r \lambda_r h_r \circ a_r \circ \color{greenyellow}{t_r} \end{align*}
  • TOPHITS is computed by applying the CP decomposition to the adjacency tensor

Outer Product / Rank-1 Tensors

Rank-1 tensors of order $n$ are the result of the outer product of $n$ vectors. For example, in the three-way case

\begin{align*} \mathcal{T} & = {\color{orangered}a} \circ {\color{yellow}b} \circ {\color{turquoise}c} \\ \mathcal{T}_{ijk} & = a_i b_j c_k \end{align*}

where $\circ$ denotes the outer product
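
A short numpy sketch (illustrative vectors) of building a rank-1 tensor as an outer product:

    import numpy as np

    a, b, c = np.array([1., 2.]), np.array([0., 1., 3.]), np.array([2., 5.])

    # outer product a ∘ b ∘ c via broadcasting (or np.einsum('i,j,k->ijk', a, b, c))
    T = a[:, None, None] * b[None, :, None] * c[None, None, :]

    print(T.shape)                                 # (2, 3, 2)
    print(np.isclose(T[1, 2, 0], a[1]*b[2]*c[0]))  # True: T_ijk = a_i b_j c_k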


CANDECOMP / PARAFAC (CP)

CP is a decomposition of a tensor $\mathcal{T}$ into a sum of rank-1 tensors

\begin{align*} \mathcal{T} & \approx \sum_{r=1}^{R} \lambda_r a_r \circ b_r \circ c_r \\ \mathcal{T}_{ijk} & \approx \sum_{r=1}^{R} \lambda_r a_{ir} b_{jr} c_{kr} \end{align*}
(Figure: $\mathcal{T} \approx \lambda_1 a_1 \circ b_1 \circ c_1 + \lambda_2 a_2 \circ b_2 \circ c_2 + \lambda_3 a_3 \circ b_3 \circ c_3$ as a sum of rank-1 tensors)
  • The rank $r$ of a CP decomposition is the number of rank-1 tensors used to approximate $\mathcal{T}$
  • "$\approx$" is the best approximation under some loss, e.g. $\|\mathcal{T} - \sum_r \lambda_r a_r \circ b_r \circ c_r \|^2$
  • Tensor rank $rank(\mathcal{T})$ is the smallest number $r$ of rank‑1 tensors that are needed to exactly reconstruct $\mathcal{T}$ as their sum
  • CP can be regarded a multilinear generalization of SVD
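
A minimal sketch (assuming numpy; names are illustrative) of reconstructing a tensor from CP factor matrices whose columns are the vectors $a_r$, $b_r$, $c_r$:

    import numpy as np

    def cp_reconstruct(lmbda, A, B, C):
        """Reconstruct sum_r lambda_r a_r ∘ b_r ∘ c_r from the factor matrices."""
        return np.einsum('r,ir,jr,kr->ijk', lmbda, A, B, C)

    I, J, K, R = 4, 3, 2, 2
    lmbda = np.ones(R)
    A, B, C = np.random.rand(I, R), np.random.rand(J, R), np.random.rand(K, R)

    T_hat = cp_reconstruct(lmbda, A, B, C)
    print(T_hat.shape)   # (4, 3, 2): a tensor of rank at most 2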

Authoritative Ranking with CP

  • For a rank-1 decomposition of an adjacency tensor $\mathcal{T} = \lambda h \circ a \circ t$ it holds that \begin{align} h_i & = \sum_{i \overset{k}{\rightarrow} j} a_j t_k & a_j & = \sum_{i \overset{k}{\rightarrow} j} h_i t_k & t_k & = \sum_{i \overset{k}{\rightarrow} j} h_i a_j \end{align}
  • Rank-reduced factorization performs simultaneous dimensionality reduction on entities and relation types, such that similar entities or predicates are grouped together
(Figure: a rank-2 CP approximation $\mathcal{T} \approx \lambda_1 a_1 \circ b_1 \circ c_1 + \lambda_2 a_2 \circ b_2 \circ c_2$ of an adjacency tensor $\mathcal{T} \in \mathbb{R}^{3 \times 3 \times 3}$)

TOPHITS on the Beatles Soap Opera

TripleRank - CP on Semantic Web Data

  • Basic Idea: Apply TOPHITS approach to RDF data to enable faceted browsing in knowledge bases
  • Example: The Beatles in DBpedia
  • TripleRank is applied to a subset of the data (e.g. all nodes with at most 2 hops distance to the query node)
  • Approach closer to original HITS application
  • Correlations between predicates can be important

TripleRank - CP on Semantic Web Data

(Figure: example facets found by TripleRank, some of them unfortunately problematic)

Properties of CP / Tensor Rank

  • Determining tensor rank is NP-hard! In practice: find tensor rank numerically by fitting multiple models
  • Non-nestedness: the best rank-$(n+1)$ approximation does not necessarily contain the best rank-$n$ approximation (the factors of a higher-rank model cannot simply be truncated)
  • Weak upper bound of tensor rank for third-order tensors: $rank(\mathcal{T}) \leq \min\{IJ, IK, JK\}$
  • Degeneracy: A tensor may be approximated arbitrarily well by a lower rank factorization


Relational Learning

Or "tell me what is your friend's Erdös-Number"

Collective Learning

Relations between entities are a rich source of information, as they allow information that is distant in the graph to be included

Collective Learning is the ability of a learning algorithm to include information such as classes, relations or attributes of related entities in the learning and prediction task for a particular entity

Example: Party membership of US presidents

Factorizations & Relational Learning

Using a Link Prediction Example

  • Matrix factorizations decompose an observed matrix $X$ into latent (or hidden) factors, e.g. \begin{align*}X \approx SO^T && X_{ij} \approx s_i^T o_j\end{align*}$S$ and $O$ are called factor matrices and $s_i$, $o_j$ are the $i$-th and $j$-th rows, written as column vectors
  • Latent factors can be interpreted as new features that have been invented to describe the data (statistical predicate invention)
  • The rows of the factor matrices $s_i$, $o_j$ can be considered latent-variable representations of (in our case) entities that explain the observed variables $X_{ij}$
  • The columns of the factor matrices can be considered the invented latent features
  • The entries of the factor matrices specify how much an entity participates in a latent feature

Collective Learning & Bipartite Model

  • CP actually models a bipartite multigraph, i.e. CP assumes that the entities indexed by the 1st mode are distinct from the entities indexed by the 2nd mode (different latent representations $a_r$, $b_r$ in $\sum_r a_r \circ b_r \circ c_r$)
  • $\Rightarrow$ CP does not account for identities between subjects and objects
  • For general relational data, such as Linked Data, this assumption is violated
  • Bipartite modeling prevents information propagation over latent variables, or at least makes it very difficult

Collective Learning and CP

  • CP model for relational data with unique entity-representation would be \[ \mathcal{T} \approx \sum_r e_r \circ e_r \circ p_r \]
  • All frontal slices would be symmetric
  • $\Rightarrow$ CP cannot simultaneously enforce the unique entity-representation constraint and still model directed relations
  • Tucker Decompositions offer more expressiveness than CP, such that the required constraints can be enforced $\Rightarrow$ RESCAL

Interlude - Unfolding

The mode-$n$ unfolding of a tensor $\mathcal{T}$, denoted by $\mathcal{T}_{(n)}$, reorders the elements of $\mathcal{T}$ into a new matrix $M$, by using the mode-$n$ fibers as columns of $M$. For $\mathcal{T} \in \mathbb{R}^{2 \times 3 \times 2}$

Mode-1 fibers
$\overset{\mathcal{T}_{(1)}}{\rightarrow}$ $\in \mathbb{R}^{2 \times 6}$
Mode-2 fibers
$\overset{\mathcal{T}_{(2)}}{\rightarrow}$ $\in \mathbb{R}^{3 \times 4}$
Mode-3 fibers
$\overset{\mathcal{T}_{(3)}}{\rightarrow}$ $\in \mathbb{R}^{2 \times 6}$
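
A minimal sketch of a mode-$n$ unfolding in numpy (the exact column ordering is a convention; this version orders the columns so that the unfolded identities used later in the tutorial hold):

    import numpy as np

    def unfold(T, mode):
        """Mode-n unfolding T_(n): arrange the mode-n fibers of T as columns."""
        return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order='F')

    T = np.arange(2 * 3 * 2).reshape(2, 3, 2)
    print(unfold(T, 0).shape)   # (2, 6)
    print(unfold(T, 1).shape)   # (3, 4)
    print(unfold(T, 2).shape)   # (2, 6)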

Interlude - Mode-n Product

The mode-$n$ product, denoted by $\times_n$, is the multiplication of a tensor by a matrix in mode $n$. Let $\mathcal{T} \in \mathbb{R}^{I_1 \times \dots \times I_N}$ and $M \in \mathbb{R}^{{\color{yellow}Q} \times I_n}$

$$\mathcal{Y} = \mathcal{T} \times_n M \in \mathbb{R}^{I_1 \times \dots \times I_{n-1} \times {\color{yellow}Q} \times I_{n+1} \times \dots \times I_{N}}$$ $$Y_{(n)} = MT_{(n)}$$
(Example: multiplying a tensor in $\mathbb{R}^{2 \times 3 \times 2}$ along mode 2 by a matrix in $\mathbb{R}^{2 \times 3}$ yields a tensor in $\mathbb{R}^{2 \times 2 \times 2}$)
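
A minimal sketch (assuming numpy; unfold and fold are hypothetical helpers) of computing the mode-$n$ product via $Y_{(n)} = M T_{(n)}$:

    import numpy as np

    def unfold(T, mode):
        return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order='F')

    def fold(M, mode, shape):
        full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
        return np.moveaxis(np.reshape(M, full, order='F'), 0, mode)

    def mode_n_product(T, M, mode):
        """Y = T x_n M, computed via the unfolded identity Y_(n) = M T_(n)."""
        new_shape = list(T.shape)
        new_shape[mode] = M.shape[0]
        return fold(M @ unfold(T, mode), mode, new_shape)

    T = np.random.rand(2, 3, 2)
    M = np.random.rand(2, 3)                 # maps the second mode (size 3) down to size 2
    print(mode_n_product(T, M, 1).shape)     # (2, 2, 2)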

RESCAL

Main Idea: RESCAL is a factorization of multi-relational data, or multigraphs, into unique entity and predicate representations

\begin{align} \mathcal{T} & \approx \mathcal{R} \times_1 A \times_2 A \\ \mathcal{T}_k & \approx A R_k A^T \\ \mathcal{T}_{ijk} & \approx \sum\nolimits_{q,r} a_{iq} \mathcal{R}_{qrk} a_{jr} \end{align}
(Figure: frontal-slice view $\mathcal{T}_k \approx A R_k A^T$)
  • $A \in \mathbb{R}^{|V| \times r}$ represents the entity-latent-component space
  • $R_k \in \mathbb{R}^{r \times r}$ is an asymmetric matrix that specifies the interaction of the latent components for the $k$-th predicate
  • where $r$ is the number of latent-components of the factorization
  • $\approx$ denotes the best approximation under some loss, e.g. $\sum_k \|\mathcal{T}_k - A R_k A^T\|^2$
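
A minimal sketch (assuming numpy; random factors for illustration) of reconstructing the frontal slices from a RESCAL-style factorization:

    import numpy as np

    n_entities, n_predicates, rank = 5, 3, 2
    A = np.random.rand(n_entities, rank)            # latent entity representations
    R = np.random.rand(n_predicates, rank, rank)    # one interaction matrix per predicate

    # reconstruct every frontal slice: T_k ≈ A R_k A^T
    T_hat = np.stack([A @ R[k] @ A.T for k in range(n_predicates)], axis=-1)
    print(T_hat.shape)    # (5, 5, 3): entity x entity x predicate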

Link Prediction

Link Prediction using RESCAL

(Figure: $\mathcal{T}_k \approx A R_k A^T$)
  • Basic procedure of link prediction with RESCAL
    • Compute "rank-reduced" factorization ($\triangleq$ latent-variable representations)
    • Entries after reconstruction (i.e. $\hat{\mathcal{T}}_{ijk} = a_i^T R_k a_j$) can be regarded as confidence values of the model that the corresponding triple exists (see the sketch after this list)
  • Method can be motivated from a latent variable interpretation of factorizations
  • Due to the independence of the random variables given the latent variables, prediction is very fast, as it reduces to simple matrix-vector products
  • Similar procedures are possible for CP and other tensor decompositions (with no or only a reduced collective learning effect)
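
A minimal sketch (assuming numpy; the helper names score and rank_objects are hypothetical) of scoring and ranking triples from the reconstructed entries:

    import numpy as np

    def score(A, R, i, j, k):
        """Confidence that (entity i, predicate k, entity j) exists: a_i^T R_k a_j."""
        return A[i] @ R[k] @ A[j]

    def rank_objects(A, R, i, k):
        """Rank all candidate objects j for a fixed subject i and predicate k."""
        scores = A @ R[k].T @ A[i]        # scores[j] = a_i^T R_k a_j for every entity j
        return np.argsort(-scores)

    A = np.random.rand(5, 2)
    R = np.random.rand(3, 2, 2)
    print(score(A, R, 0, 1, 2))
    print(rank_objects(A, R, 0, 2))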

U.S. Presidents - Data

  • Data extracted from DBpedia
  • Consists of URIs of (vice) presidents and parties, the party memberships and the information who was (vice) president of whom
  • Importantly, the data contains no other information, therefore only a collective learning algorithm will learn something meaningful on this dataset
  • Task: Predict party-membership for persons in dataset (link prediction)

U.S. Presidents - Experiments

  • Experimental setting: 10-fold cross-validation over all persons
  • Evaluation measure: Area under precision-recall curve (AUC-PR)
(Figure: AUC-PR results on the U.S. presidents link-prediction task)

Link Prediction Benchmarks

(Figure: AUC-PR results on the Kinships, UMLS, and Nations benchmark datasets)

Entity Resolution & Taxonomy Learning

Entity Resolution

  • Problem: Data contains duplicate instances of the true underlying entities
  • Example: Names as identifiers in citation databases: "A. Einstein", "Einstein, A.", "Albert Einstein" all refer to the identical underlying entity Albert Einstein
  • Task: Determine which instances refer to the identical underlying entity
  • Important task in the Linked Data context, where this task is commonly referred to as Instance Matching
  • Relational data provides important information for this task, e.g. how similar are the co-authors of "A. Einstein" and "Einstein, A."
  • General approach: Use some similarity measure to determine which instances are most likely to be identical

Similarity of Entities in RESCAL

(Figure: $\mathcal{T}_k \approx A R_k A^T$)
  • RESCAL has a unique representation of entities via $A$
  • $A$ can be regarded as an embedding of the entities into a simple vector space
  • Similarity of entities in the latent space $A$ corresponds to similarity in the relational domain. Why?
  • $\Rightarrow$ Any feature-based machine-learning method (e.g. for clustering, classification etc.) can already be applied to $A$
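
A minimal sketch (assuming numpy; the helper most_similar is hypothetical) of ranking entities by the cosine similarity of their rows in $A$:

    import numpy as np

    def most_similar(A, i, topn=3):
        """Rank entities by cosine similarity to entity i in the latent space A."""
        A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
        sims = A_norm @ A_norm[i]
        order = np.argsort(-sims)
        return [j for j in order if j != i][:topn]

    A = np.random.rand(6, 2)     # latent representations of 6 entities
    print(most_similar(A, 0))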

Interlude - Kronecker Product

The Kronecker product $\otimes$ is a generalization of the outer product to matrices. Let $A \in \mathbb{R}^{Q \times R}, B \in \mathbb{R}^{S \times T}$. The Kronecker Product is defined as

$A \otimes B = \begin{bmatrix} a_{11} \mathbf{B} & \cdots & a_{1R}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{Q1} \mathbf{B} & \cdots & a_{QR} \mathbf{B} \end{bmatrix}$

where $A \otimes B$ is a $QS \times RT$ block-matrix
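
A short numpy example of the Kronecker product and its block structure:

    import numpy as np

    A = np.array([[1, 2],
                  [3, 4]])        # Q x R = 2 x 2
    B = np.array([[0, 1, 0],
                  [1, 0, 1]])     # S x T = 2 x 3

    K = np.kron(A, B)
    print(K.shape)                             # (4, 6) = (QS, RT)
    print(np.array_equal(K[:2, :3], 1 * B))    # upper-left block is a_11 * B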

RESCAL = Constrained Tucker

  • RESCAL can be regarded a variant of the Tucker Decomposition with the constraint that two factor matrices have to be identical
  • Equivalent formulations for $\mathcal{T} = \mathcal{R} \times_1 A \times_2 A$, with the identity matrix $I \in \mathbb{R}^{|L| \times |L|}$ \begin{align} \mathcal{T}_{(1)} & = A R_{(1)} (I \otimes A)^T \\ \mathcal{T}_{(2)} & = A R_{(2)} (I \otimes A)^T \\ \mathcal{T}_{(3)} & = R_{(3)}(A \otimes A)^T \end{align}
  • For instance \[ \mathcal{T}_{(1)} = A \begin{bmatrix}R_1 & R_2\end{bmatrix} \begin{bmatrix} A^T & 0 \\ 0 & A^T \end{bmatrix} = \begin{bmatrix} AR_1A^T & AR_2A^T \end{bmatrix} \]

Tucker Decomposition

Decomposition of a tensor $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$ into a core tensor, multiplied by factor matrices along each mode

\begin{align} \mathcal{T} & = \mathcal{G} \times_1 A \times_2 B \times_3 C \\ \mathcal{T}_{\color{greenyellow}{ijk}} & = \sum_{p,q,r} g_{pqr} a_{ip} b_{jq} c_{kr} \end{align}
(Figure: $\mathcal{T} \approx \mathcal{G} \times_1 A \times_2 B \times_3 C$)
  • The core tensor $\mathcal{G} \in \mathbb{R}^{P \times Q \times R}$ can be regarded a compressed version of $\mathcal{T}$
  • The factor matrices $A \in \mathbb{R}^{I \times P}$, $B \in \mathbb{R}^{J \times Q}$, $C \in \mathbb{R}^{K \times R}$ can be thought of as the principal components in the respective modes

Uniqueness

A decomposition is unique if it is the only possible combination of rank-1 tensors that sums to $\mathcal{T}$
  • Except basic indeterminacies of scaling and permutation
  • Matrix decompositions generally are not unique, e.g. \[M \approx AB = (AQ^{-1}) \, (QB)\]
  • Interesting property for explanatory data analysis
  • CP is unique under mild conditions
  • Uniqueness is not identical to a global optimum!

Properties of Tucker

  • Tucker is not unique \begin{align} \mathcal{G} \times_1 A \times_2 B \times _3 C = & \hat{\mathcal{G}}\times_1 AQ^{-1} \times_2 BR^{-1} \times_3 CS^{-1} \\ \mbox{where } & \hat{\mathcal{G}} = (\mathcal{G} \times_1 Q \times_2 R \times_3 S) \end{align}
  • Tucker subsumes CP, i.e. CP is a Tucker model with the additional constraints that the core tensor $\mathcal{G}$ is superdiagonal and the factor matrices have equal rank, i.e. $P = Q = R$
  • Different variants of Tucker, based on number of modes that are factorized, i.e. Tucker-3, Tucker-2, Tucker-1
  • Tucker does not enforce unique entity constraints

Similarity of Entities in RESCAL

General approach: similarity of entities should be based on the similarity of their relational patterns

Outgoing links

\begin{align*} X_{i,:,:} = & \color{orangered}{a_i} R_{(1)} (I \otimes A^T) \\ X_{j,:,:} = & \color{greenyellow}{a_j} R_{(1)} (I \otimes A^T) \end{align*}

Incoming links

\begin{align*} X_{:,i,:} = & \color{orangered}{a_i} R_{(2)} (I \otimes A^T) \\ X_{:,j,:} = & \color{greenyellow}{a_j} R_{(2)} (I \otimes A^T) \end{align*}
  • For both incoming and outgoing links, the similarity of entities $e_i$ and $e_j$ is uniquely determined by the similarity of their latent-component representations $a_i$ and $a_j$
  • Due to the collective learning effect of RESCAL, the latent-variable representation also contains information about distant relations

Cora Publication Data

  • Cora is a hand-labeled dataset consisting of machine learning publications, authors, venues, and paper titles
  • Prior to entity resolution, the data contains ~90 authors, ~210 venues, ~400 publications
  • True number of entities: 50 authors, ~100 venues and ~130 publications
  • In total ~61,000 matching decisions
  • Instance matching: Ranking of entities by their similarity in the entity-latent-component space

Cora Publication Data

(Figure: AUC-PR entity-resolution results on Cora for authors, citations, and venues)

Due to noisy citations, a collective learning effect can significantly improve entity resolution results for authors and venues

Learning Taxonomies

  • Taxonomies are an integral part of ontologies
  • Taxonomies consist of
    • a list of concepts
    • a concept hierarchy
    • an assignment of entities to concepts
  • Approach: Learn taxonomies completely unsupervised by hierarchical clustering of entities
  • Since the similarity of entities is completely contained within the matrix $A$, any feature-based hierarchical clustering method can be used (see the sketch below)
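
A minimal sketch of this idea (the slides use OPTICS; this sketch substitutes a generic agglomerative clustering from scipy for brevity, with illustrative data):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    A = np.random.rand(20, 5)            # latent entity representations from RESCAL

    Z = linkage(A, method='ward')        # agglomerative clustering on the embedding
    concepts = fcluster(Z, t=4, criterion='maxclust')   # cut the dendrogram into 4 "concepts"
    print(concepts)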

IIMB 2010 Benchmark Data

  • Instance matching benchmark from movie domain, consisting of $\approx$ 1400 entities, organized in 80 concepts
  • Top Level Concepts: Budget, Creature, Film, Language, Location
  • The concepts Creature, Film and Location are again subdivided into concepts like Person, Character, Country or Action Movie
  • Task: Unsupervised taxonomy learning via hierarchical clustering of entities
  • Method: Hierarchical clustering on $A$ using the OPTICS method

Results per taxonomy level (score per level, then per concept):

  • Level 1 (0.982): Budget 1.0, Creature 1.0, Film 1.0, Language 0.963, Location 0.953
  • Level 2 (0.806): Character 1.0, City 0.992, Country 0.632, Person 0.991, Parody 0.33, Zombie 1.0
  • Level 3 (0.946): Actor 0.987, Capital 0.992, Character Creator 0.533, Director 0.781

Scalability

or "Tensors? I thought these things don't scale?"

The Size of Graphs

  • Multigraphs / Linked Data can be of large scale, consisting of millions of entities, thousands of predicates, billions of known facts
  • Number of random variables in a multigraph:
    $|V| \times |V| \times |L|$ !
  • Real graphs are sparse, i.e. $|E| = O(|V|)$ instead of $O(|V|^2)$
  • Same property holds for most multigraphs and Linked Data, e.g.:
    • DBpedia: ~3.75 million entities, 800 object properties, 1.89 billion triples (counting all triples, not only entity-entity triples)
    • YAGO 2: ~3 million entities, ~40 object properties, 124 million triples (counting all triples, not only entity-entity triples)
  • By using sparse linear algebra, Alternating Least Squares (ALS) algorithms can easily exploit the sparsity of graphs

Alternating Least Squares

  • Typically, factorization methods are computed by optimizing the squared Frobenius norm, e.g. $\|Y - AB\|^2 = \sum_{ij}(Y_{ij} - a_i^T b_j)^2$
  • Alternating least squares is an alternating optimization scheme (non-linear block Gauss-Seidel method) to compute such least-squares problems in multiple variables
  • Basic procedure: Alternatingly keep all but one variable fixed and solve only for that variable
  • "Alternating Least Squares" because the algorithm solves a linear least squares subproblem for each factor matrix
  • Scales well for sparse matrices and tensors, easy to implement
  • Can show slow convergence (swamps); when using normal equations (next slide), ill-conditioned matrices can occur $\rightarrow$ regularization


Interlude - Khatri-Rao Product

Let
\begin{align*} A \in \mathbb{R}^{Q \times {\color{yellow}R}} & = \begin{bmatrix} a_1 & a_2 & \dots & a_R \end{bmatrix} ,\\ B \in \mathbb{R}^{S \times {\color{yellow}R}} & = \begin{bmatrix} b_1 & b_2 & \dots & b_R \end{bmatrix} \end{align*}
be two column-wise partitioned matrices.
The Khatri-Rao Product is defined as
$A \odot B = \begin{bmatrix} a_1 \otimes b_1 & a_2 \otimes b_2 & \dots & a_R \otimes b_R \end{bmatrix}$
where $A \odot B$ is a $QS \times R$ block-matrix
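
A minimal numpy sketch of the Khatri-Rao product (recent scipy versions also provide scipy.linalg.khatri_rao for the same operation):

    import numpy as np

    def khatri_rao(A, B):
        """Column-wise Kronecker product: column r of the result is kron(a_r, b_r)."""
        Q, R = A.shape
        S, R2 = B.shape
        assert R == R2, "A and B need the same number of columns"
        return np.einsum('qr,sr->qsr', A, B).reshape(Q * S, R)

    A = np.random.rand(3, 2)
    B = np.random.rand(4, 2)
    print(khatri_rao(A, B).shape)   # (12, 2) = (QS, R)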

Scalability: CP-ALS

  • CP-ALS: Model $\min_{a_r, b_r, c_r}\|\mathcal{T} - \sum_r a_r \circ b_r \circ c_r\|^2$
  • Equivalent formulations in unfolded form, by using the Khatri-Rao product and by forming new matrices $A$, $B$, $C$ from $a_r$, $b_r$, $c_r$, where e.g. \[ A = \begin{bmatrix}a_1 & a_2 & \dots & a_R\end{bmatrix} \] \begin{align*} T_{(1)} & = {\color{greenyellow}A} (C \odot B)^T & \Rightarrow \min_A \|T_{(1)} - A(C \odot B)^T\|^2 \\ T_{(2)} & = {\color{greenyellow}B} (C \odot A)^T & \Rightarrow \min_B \|T_{(2)} - B(C \odot A)^T\|^2 \\ T_{(3)} & = {\color{greenyellow}C} (B \odot A)^T & \Rightarrow \min_C \|T_{(3)} - C(B \odot A)^T\|^2 \end{align*}
  • CP-ALS algorithm: solve alternatingly for $A$, $B$, $C$
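
A compact CP-ALS sketch (assuming numpy; dense tensors only, no $\lambda$ normalization or convergence check, names are illustrative) that solves the three subproblems above with the normal equations:

    import numpy as np

    def unfold(T, mode):
        return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order='F')

    def khatri_rao(A, B):
        return np.einsum('qr,sr->qsr', A, B).reshape(A.shape[0] * B.shape[0], A.shape[1])

    def cp_als(T, rank, n_iter=50):
        """Plain CP-ALS: alternately solve the three least-squares subproblems."""
        I, J, K = T.shape
        A, B, C = np.random.rand(I, rank), np.random.rand(J, rank), np.random.rand(K, rank)
        for _ in range(n_iter):
            A = unfold(T, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
            B = unfold(T, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
            C = unfold(T, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
        return A, B, C

    T = np.random.rand(4, 4, 3)
    A, B, C = cp_als(T, rank=2)
    print(np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C)))   # approximation error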

Normal Equations and Sparsity

  • Let $\min_W \|Y - WX\|^2$ be a linear least squares problem, where $Y \in \mathbb{R}^{P \times Q}$, $W \in \mathbb{R}^{P \times rank}$, $X \in \mathbb{R}^{rank \times Q}$
  • $W$ can be computed by using the method of normal equations \[ W \leftarrow Y X^T (XX^T)^{-1} \]
  • $P$ and $Q$ can be large, e.g. for a Mode-1 unfolding of an adjacency tensor: $P = |V|$, $Q = |V| \times |L|$
  • Normal equations scale well for sparse tensor factorizations, as
    • $XX^T \in \mathbb{R}^{rank \times rank} \Rightarrow$ Inversion is cheap
    • $Y$ is large but sparse $\Rightarrow$ matrix multiplication $YX^T$ can be computed in only $O(nnz(Y) \times rank)$
    • $X$ is large in one dimension ($Q$) and dense but structured $\Rightarrow$ allows for very efficient optimizations to compute $XX^T$ for CP, Tucker and RESCAL

Normal Equations and CP

  • CP Model: $\mathcal{T}_{(1)} = A(C \odot B)^T$
  • For instance, the update of $A$ is computed by \[ A \leftarrow \underbrace{\mathcal{T}_{(1)}}_{=Y} \underbrace{(C \odot B)}_{=X^T} \underbrace{(C^TC * B^TB)^{-1}}_{=(XX^T)^{-1}} \] where $*$ denotes the element-wise (Hadamard) product, $C \odot B \in \mathbb{R}^{|V||L| \times rank}$ and $C^TC * B^TB \in \mathbb{R}^{rank \times rank}$
  • The structure of the factorization decreases the complexity: computing $C^TC * B^TB$ costs $O\left((|V| + |L|)\,rank^2\right)$, whereas computing $(C \odot B)^T (C \odot B)$ directly would cost $O\left(|V||L|\,rank^2\right)$
  • Similar, even slightly more scalable results exist for RESCAL and Tucker decompositions

They do scale!

(given suitable data & the right algorithm)

  • RESCAL-ALS scales linearly with the number of entities, predicates and known facts, superlinearly with the number of latent components
  • Scalability experiments using RESCAL on synthetic data
  • Base tensor $\mathcal{T} \in \mathbb{R}^{|V| \times |V| \times |L|}$, with $|V| = 1000$, $|L| = 10$
  • In each experiment, vary one of the parameters $|V|$, $|L|$, $nnz(\mathcal{T})$ and the number of latent components

Implications of Using ALS

  • The objective function of linear least squares is of the form \[ \|Y - WX\|^2 \] which has various implications
  • Normality Assumption: The objective is only valid under the assumption that errors on random variables are normally distributed. For relational data a better error model is binomial
  • Open-World Assumption: By default, the objective treats unobserved triples as negative examples. Due to the open-world assumption of LD, this is not entirely justified
  • Both issues can be approached by using different objective functions, e.g. a pairwise preference-ranking objective or a binomial error term. However, there is usually a trade-off between better modeling and scalability.
  • Does not account for some properties of real-world graphs, such as scale-free degree distributions

Tensors & Attributes of Entities

Literally

Tensors + Attributes of Entities

  • Much of the information in Linked Data is in the form of literals
  • In a machine learning setting, literals correspond to attributes
  • Naive handling:
    • Discretization of continuous or textual attributes
      • $20 \rightarrow$ young, $99 \rightarrow$ old
      • "Albert Einstein" $\rightarrow$ {"Albert", "Einstein"}
    • Use results of discretization step as entities and fill tensor accordingly
      • (Bob, inAgeGroup, old)
      • (Albert_Einstein, hasWord, Albert)

Handling of Attributes

  • Naive handling has serious scalability issues
  • Include attributes as additional constraint on entity representation \[ D \approx AV \]
  • Joint or "coupled" tensor factorization
  • For instance, with RESCAL \[ \min_{A,R,V} \sum_k \|X_k - A R_k A^T \|^2 + \|D - AV\|^2 \]
  • Similar methods exist for CP and Tucker

Graphs, Time & Tensors

Time-dependent Analysis of Graphs

  • Common ways to model time and graphs
    • Time-slices, i.e. assign points in time to bins (days, months etc.)
    • Sequential modeling, such as user - item - last item
  • Common assumptions made
    • Smoothness: Data changes smoothly over time
    • Markov property: Data at time point $t_i$ depends only on $t_{i-1}$
  • Here we will only consider the time-slices approach

DEDICOM

Basic Idea: Factorization of relational data into unique entity and predicate representations

\begin{align} \mathcal{T}_k & = A D_k R D_k A^T \\ \mathcal{T}_{ijk} & = a_i^T D_k R D_k a_j \end{align}
  • DEDICOM has one global interaction model $R$
  • Global patterns vary linearly over the frontal slices of $\mathcal{T}$
  • Well suited to model time-varying graphs
  • Only defined for 2nd- and 3rd-order tensors
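
A minimal sketch (assuming numpy; random factors for illustration) of reconstructing the time slices from DEDICOM factors:

    import numpy as np

    n_entities, n_slices, rank = 5, 4, 2
    A = np.random.rand(n_entities, rank)        # role loadings of the entities
    R = np.random.rand(rank, rank)              # one global role-interaction pattern
    D = np.random.rand(n_slices, rank)          # per-time-slice scaling of the roles

    # reconstruct the k-th time slice: T_k = A D_k R D_k A^T with D_k = diag(D[k])
    T_hat = np.stack([A @ np.diag(D[k]) @ R @ np.diag(D[k]) @ A.T
                      for k in range(n_slices)], axis=-1)
    print(T_hat.shape)    # (5, 5, 4): entity x entity x time slice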

Enron Email Communication Network

  • Email data from ~150 users, mostly Enron senior management
  • Collected and made public by FERC during its investigation
  • Original Data ~500,000 messages, containing
    • To: / From:
    • Date
    • Subject / Body
  • In the following, a smaller subset of the data is used, consisting of 184 users and 34,427 messages
  • Modeling of data via 44 time-slices

Enron Network Analysis

  • $A$ assigns persons to roles, i.e. trade executive etc.
  • $R$ models global interactions of these roles
(Figures: assignment quality of $A$; global communication patterns in $R$)

Enron Network Analysis

$D$ expresses the variation of patterns over time

Sequence Models

  • Modeling with time-slices does not take the temporal order into account (the fibers along the time mode are exchangeable)
  • Sequence-oriented modeling: entity $\times$ entity $\times$ last entity
  • For instance: user $\times$ item $\times$ last item in recommendation engines
  • Ternary Relation $\rightarrow$ Hypergraph!

Concluding Remarks

  • Tensors offer a flexible and scalable approach to machine learning on multi-relational data
  • Tensors offer a natural way to represent multi-graphs via adjacency tensors, without losing information as in the matrix case
  • Tensor factorizations like CP can be used for explanatory analysis on multi-graphs, e.g. to rank nodes by their importance
  • Rank-reduced factorizations of all three modes of a tensor offer a way to do joint disambiguation of entities and predicates
  • CP provides interpretable results via uniqueness property

Concluding Remarks

  • Tensor factorization methods have applications to many important problems in the Linked Data context such as link prediction, instance matching or information retrieval
    • RESCAL shows state-of-the-art results in link prediction and entity resolution
    • Embedding $A$ of RESCAL enables standard machine learning algorithms on relational data
  • Due to the sparsity of adjacency tensors, factorizations can scale up to complete knowledge bases
  • Time-evolving graphs and networks can be analysed by inspection of the factor matrices of appropriate decompositions like DEDICOM

Code & Tools

If You Want More...

  • Python and Matlab code for RESCAL available from my website and Github
    
    from rescal import rescal
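    # X: assumed to be the adjacency tensor of the data (one frontal slice per predicate)
    # 20 latent components, factors initialized via eigenvectors ('nvecs')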
    A, R = rescal(X, 20, init='nvecs')
    				
  • Easy to use tools to convert from N-Triples or other formats to Python and Matlab tensors available from my website and Github
  • Tensor Toolbox for Matlab available from Tamara Kolda's site

Discussion ?!