Machine Learning on Linked Data

Tensors and their Applications in Graph-Structured Domains

Maximilian Nickel, Ludwig Maximilian University Munich

Volker Tresp, Siemens AG

What the tutorial will be about

This tutorial is dedicated to how machine learning on Linked Data can be realized using tensor factorizations. The focus rests on applications that are important to Linked Data; the necessary theory is introduced along the way.

  1. From Linked Data to Tensors
    Some tensor theory and data representation of Linked Data as tensors
  2. Explanatory Analysis of Linked Data
    Starting from graphs and moving to multigraphs for ranking of nodes in RDF data
  3. Relational Learning & Link Prediction
    Prediction of unknown links, a central problem when learning on relational data.
  4. Entity Resolution & Taxonomy Learning
    Unsupervised instance matching and learning of taxonomies
  5. Scaling to complete Knowledge Bases
    Scaling tensor factorizations to the size of complete knowledge bases
  6. Analysis of time-evolving graphs

Machine Learning and Linked Data

  • Machine Learning depends on repeatable statistical patterns to generalize from examples
  • The party membership of a U.S. vice president can be predicted reasonably well, given knowledge about his president's party
  • Predicting the first name of US (vice) presidents is not meaningful (one-to-one relationship, no recurring patterns)
  • Machine Learning depends strongly on the data at hand and the task that should be solved

Linked Data and Machine Learning

The Linked Data principles are:

  • Use URIs as names for things
  • Use HTTP URIs, so that people can look up those names
  • When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
  • Include links to other URIs, so that they can discover more things

From a machine learning perspective:

  • Linked Data is (semi-)structured, i.e. (subject, predicate, object) triples
  • Data consists of entities and relations of different types between those entities
    • Entities = All subjects and objects that are not literals
      (In this tutorial: no special treatment of ontological knowledge)
    • Relations = Individual links between entities
    • Relation Types = Properties in RDF

Relations, the Beauty & the Challenge

  • Knowledge models such as Linked Data and many problems in machine learning have a natural representation as relational data
  • Relations between entities are often more important for a prediction task than attributes
  • For instance, it can be easier to predict the party of a vice president from the party of his president than from his own attributes
  • Linked Data is not independent and identically distributed (iid), as
    • it does not consist of only one type of objects (e.g only persons)
    • the entities are related to each other

Learning, Linked Data & Heterogeneity

  • Heterogeneity introduces additional noise to the data
    • Duplicate entities: URIs from different data sources refer to the same underlying entity (without specifying owl:sameAs)
    • Duplicate predicates: Same case as for duplicate entities
    • Predicates with similar semantics: Probably more common case, for instance, different domain/range constraints, super/sub-predicates etc.
  • Need of Linked Data: Instance Matching and Ontology Alignment to remove the noise introduced by heterogeneity
  • Need of Machine Learning: Joint disambiguation of predicates and entities

The Open World of Linked Data

  • Linked Data follows an open-world assumption, i.e. data cannot generally be assumed to be complete
  • In fact, Linked Data can be quite incomplete
  • Need of Linked Data: Prediction of unknown triples in incomplete knowledge bases $\Rightarrow$ Link Prediction
  • Problem for ML: Linked Data rarely contains negative examples or negations (i.e. !partyOf(Al Gore, Republican))

Challenges !?

Many of the features that make Linked Data so valuable (e.g. relations, linked data sources, size) can be a problem for machine learning and for Linked Data itself

The rest of this tutorial is dedicated to showing how most of these problems can be approached by using
tensor factorizations

Data Representation

Linked Data
$\downarrow$
Tensors

Linked Data is a Multigraph

From a mathematical perspective, Linked Data can be regarded as a labeled directed multigraph. A minimal definition for the purpose of this tutorial is

A labeled directed multigraph is a tuple $\mathcal{G} := (V, E, L)$
where

  • $V$ is a set of vertices
  • $L$ is a set of edge labels
  • $E \subseteq V \times V \times L$ is a set of ordered triples
  • $V$ corresponds to the entities in the domain
  • $L$ corresponds to the different relation types
  • $E$ corresponds to the known facts about the entities (RDF triples)

Multigraph Example

(Figure: multigraph of the running example with the edge labels presidentOf, vicePresidentOf, partyOf, and knows)

The set of vertices, edge labels and edges for this multigraph are \begin{align} V = \{&\text{Bill Clinton}, \text{Al Gore}, \text{Democratic Party }\} \\ \\ L = \{&\text{presidentOf}, \text{vicePresidentOf}, \text{partyOf}, \text{knows }\}\\ \\ E = \{&\text{(Bill Clinton, Al Gore, presidentOf)},\\ & \text{(Al Gore, Democratic Party, partyOf)}, \\ & \vdots \\ & \text{(Bill Clinton, Al Gore, knows) } \} \end{align}
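
As a small illustration (hypothetical snippet, not from the original slides; the triples elided by the dots are filled in plausibly), the running example can be written down directly as the three sets from the definition above:

    # Minimal sketch: the running example as a labeled directed multigraph G = (V, E, L)
    V = {"Bill Clinton", "Al Gore", "Democratic Party"}
    L = {"presidentOf", "vicePresidentOf", "partyOf", "knows"}
    E = {
        ("Bill Clinton", "Al Gore", "presidentOf"),
        ("Al Gore", "Democratic Party", "partyOf"),
        ("Bill Clinton", "Democratic Party", "partyOf"),
        ("Bill Clinton", "Al Gore", "knows"),
    }

    # every edge must connect two known vertices with a known label
    assert all(s in V and o in V and p in L for (s, o, p) in E)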

Matrix Representation of Graphs

Adjacency matrices are standard representations for graphs, i.e. data with only one type of edge or relation

(Figure: adjacency matrix $A$ of a likes relation; filled cells $=1$, empty cells $=0$)
  • Some graph properties translate directly to the adjacency matrix
    • Undirected $\rightarrow$ Symmetric
    • Directed $\rightarrow$ Asymmetric
    • Unweighted $\rightarrow$ Binary
  • Analysis of graphs using adjacency matrices
    • Ranking: PageRank, HITS
    • Link Prediction: Common Neighbours, Katz, Adamic/Adar, low-rank matrix factorization
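
A minimal sketch in Python/numpy (illustrative values only) of how the properties above show up in an adjacency matrix:

    import numpy as np

    # Toy adjacency matrix of a directed, unweighted "likes" relation over 4 nodes
    A = np.array([[0, 1, 0, 0],
                  [0, 0, 1, 1],
                  [1, 0, 0, 0],
                  [0, 1, 0, 0]])

    print(np.array_equal(A, A.T))          # False -> the relation is directed
    print(set(np.unique(A)) <= {0, 1})     # True  -> the graph is unweighted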

Multigraphs and Matrices

Multigraphs cannot be represented by a single matrix without information loss, unless constraints on their structure are enforced

(Figure: separate adjacency matrices for the relations Loves, Hates, and Indifferent, which cannot simply be summed into one matrix)

Moving from Matrices to Tensors

  • Underlying problem: Two modes, as in the matrix case, are not expressive enough to model multi-relational data without information loss
  • Possible solution: Move to tensors, which are $n$-modal generalizations of matrices

What are tensors?

Formally, tensors are defined as

An $n$-th order tensor is an element of the tensor product of $n$ vector spaces, i.e.

$\mathcal{T} \in V_1 \otimes ... \otimes V_n$
  • Less formally: tensors are multidimensional arrays, i.e. data structures with (possibly) more than two indices ($\mathcal{T}_{ijqrs}$)
  • The order of a tensor is the number of modes (i.e., the number of indices needed to identify an element of the array)

What are tensors?

  • Vectors are 1st order tensors: $x_i$ (one mode)
  • Matrices are 2nd order tensors: $X_{ij}$ (two modes)
  • Tensors of order $\geq$ 3 are called higher-order tensors $\mathcal{T}_{ijqrs}$

In the following we will only use third-order tensors, i.e. "cubes"

Example for a third-order tensor

$\mathcal{T} \in \mathbb{R}^{10 \times 7 \times 5}$
(Figure: highlighted cells $\mathcal{T}_{1,1,1}$, $\mathcal{T}_{3,7,2}$, and $\mathcal{T}_{1,5,4}$)
Mode-1 has dimension $I = 10$
Mode-2 has dimension $J = 7$
Mode-3 has dimension $K = 5$

Tensor Representation of RDF Data

  • RDF data, or any other multigraph, can be represented as a third-order tensor
  • Two modes refer to the entities, one mode to the relation types \begin{align} \Rightarrow & |Mode\ 1| = |Mode\ 2| = |V|\\ & |Mode\ 3| = |L| \end{align}
  • We also require that the ordering of the entities on both modes is identical
$\mathcal{T}_{ijk}=\begin{cases}1, & \text{if the triple ($i$-th entity, $k$-th predicate, $j$-th entity) exists}\\ 0, & \text{otherwise}\end{cases}$
In the following we will refer to this modeling as adjacency tensor
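
A minimal sketch (assuming numpy; the helper name adjacency_tensor and the triples are illustrative) of how such an adjacency tensor can be filled from a list of (subject, predicate, object) triples:

    import numpy as np

    def adjacency_tensor(triples, entities, predicates):
        """Build a binary tensor T with T[i, j, k] = 1 iff the triple
        (i-th entity, k-th predicate, j-th entity) is observed."""
        e_idx = {e: i for i, e in enumerate(entities)}
        p_idx = {p: k for k, p in enumerate(predicates)}
        T = np.zeros((len(entities), len(entities), len(predicates)))
        for s, p, o in triples:
            T[e_idx[s], e_idx[o], p_idx[p]] = 1.0
        return T

    entities = ["Bill Clinton", "Al Gore", "Democratic Party"]
    predicates = ["presidentOf", "vicePresidentOf", "partyOf"]
    triples = [("Bill Clinton", "presidentOf", "Al Gore"),
               ("Al Gore", "vicePresidentOf", "Bill Clinton"),
               ("Al Gore", "partyOf", "Democratic Party"),
               ("Bill Clinton", "partyOf", "Democratic Party")]

    T = adjacency_tensor(triples, entities, predicates)
    print(T.shape)   # (3, 3, 3): entity x entity x predicate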

Running Example $\rightarrow$ Tensor

The adjacency tensor for the running example used so far is

(Figure: the multigraph $\mathcal{G}$ of the running example and its adjacency tensor $\mathcal{T}$, with one frontal slice per relation type presidentOf, vicePresidentOf, partyOf)

An adjacency tensor contains a cell for every possible triple that could be created from the entities and relation types that are present in the data

Notation & Terminology

Slices are two-dimensional sections of a tensor (i.e. matrices)
Horizontal slices
$\mathcal{T}_{i,:,:}$
Lateral slices
$\mathcal{T}_{:,j,:}$
Frontal slices
$\mathcal{T}_{:,:,k}$

Notation & Terminology

Fibers are higher-order analogues of rows and columns in matrices
Mode-1 fibers
$\mathcal{T}_{:,j,k}$
Mode-2 fibers
$\mathcal{T}_{i,:,k}$
Mode-3 fibers
$\mathcal{T}_{i,j,:}$
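
The slicing and fiber notation maps directly onto array indexing; a small numpy sketch (illustrative only):

    import numpy as np

    T = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # a small third-order tensor

    horizontal_slice = T[0, :, :]   # T_{i,:,:}, a 3 x 4 matrix
    lateral_slice    = T[:, 1, :]   # T_{:,j,:}, a 2 x 4 matrix
    frontal_slice    = T[:, :, 2]   # T_{:,:,k}, a 2 x 3 matrix

    mode1_fiber = T[:, 1, 2]        # T_{:,j,k}, analogue of a column
    mode2_fiber = T[0, :, 2]        # T_{i,:,k}, analogue of a row
    mode3_fiber = T[0, 1, :]        # T_{i,j,:}, a "tube" along the third mode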

Machine Learning on Tensors

The general focus of this tutorial will be learning on adjacency tensors of multigraphs with tensor factorizations
  • Tensor factorizations can be considered higher-order generalizations of matrix factorization methods
  • Matrix factorizations have a long tradition in machine learning
  • Matrix factorizations try to express data in terms of factor matrices
  • Factorization methods have an appealing interpretation as latent variable models

The CP Decomposition & Explanatory Analysis

Hubs, Authorities, Topics & Co.

The Beatles on DBpedia

What are the important nodes of this multigraph?

In the Case of Graphs

Authoritative Ranking with HITS

  • Task: Ranking of nodes relative to their relevance for a query
  • Main Idea: Describe nodes in the graph in terms of hub- and authority-scores
  • Authority score $a_j$: Importance of node $j$ for incoming links (web context: value of the content of a page), i.e. $a_j = \sum_{i \rightarrow j} h_i$
  • Hub score $h_i$: Importance of node $i$ for outgoing links (web context: value of its links to other pages), i.e. $h_i = \sum_{i \rightarrow j} a_j$
  • Mutual Reinforcement: Good hubs link to many good authorities, good authorities receive many links from good hubs

(Rank-reduced) Singular Value Decomposition

Any matrix $X$ can be decomposed into a weighted sum of outer products of pairwise orthonormal vectors, i.e.

$$X = \sum_r \lambda_r u_r \circ v_r = \lambda_1 u_1 \circ v_1 + \dots + \lambda_r u_r \circ v_r$$
  • $\lambda_{i}$ is the $i$-th singular value of $X$
  • $u_i$ is the $i$-th left singular vector of $X$
  • $v_i$ is the $i$-th right singular vector of $X$
  • the sets $u_i$, $v_i$ are pairwise orthonormal, e.g. $u_i^Tu_i = 1$, $u_i^Tu_{j\neq i} = 0$
  • the rank of $X$ can be defined as the minimal number $r$ of outer products $\lambda_r u_r \circ v_r$ that generate $X$ as their sum

Some Properties of SVD

  • The left singular vectors $u_r$ of $X$ are the eigenvectors of the kernel matrix $XX^T$, i.e. $$XX^T u_r = \lambda_r^2 u_r$$
  • The right singular vectors $v_r$ of $X$ are the eigenvectors of the covariance matrix $X^TX$, i.e. $$X^TX v_r = \lambda_r^2 v_r$$
  • SVD has many applications, e.g. LSI, dimensionality reduction, recommendation engines, HITS
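
A small numerical check of these two properties (a sketch assuming numpy; note that the eigenvalues of $XX^T$ are the squared singular values):

    import numpy as np

    X = np.random.rand(5, 4)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    # left singular vectors are eigenvectors of X X^T with eigenvalues s**2
    print(np.allclose(X @ X.T @ U[:, 0], s[0]**2 * U[:, 0]))   # True
    # right singular vectors are eigenvectors of X^T X
    print(np.allclose(X.T @ X @ Vt[0], s[0]**2 * Vt[0]))       # True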

In the Case of Graphs

Authoritative Ranking with HITS

  • The hub- and authority-scores are defined as \begin{align*} a_j & = \sum_{i \rightarrow j} h_i & h_i & = \sum_{i \rightarrow j} a_j \\ \end{align*}
  • In terms of the adjacency matrix $A$, this can be stated as \begin{align*} a & = A^T h & h & = A a \\ \Rightarrow a & = A^TAa & h & = AA^Th \end{align*}
  • $\Rightarrow$ The vectors $a$ and $h$ correspond to the singular vectors of $A$, i.e.

    $$A = \sum_r \lambda_r u_r \circ v_r = \lambda_1 u_1 \circ v_1 + \dots + \lambda_r u_r \circ v_r$$

    where $h = u_1$ and $a = v_1$ capture the dominant aspect of the graph, since $\lambda_1 \geq \lambda_{r \neq 1}$
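
A minimal sketch (assuming numpy; illustrative adjacency matrix) of computing hub and authority scores as the dominant singular vectors:

    import numpy as np

    # toy directed adjacency matrix (illustrative values)
    A = np.array([[0, 1, 1, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [1, 0, 0, 0]], dtype=float)

    U, s, Vt = np.linalg.svd(A)
    hubs        = np.abs(U[:, 0])   # h = u_1 (up to sign)
    authorities = np.abs(Vt[0, :])  # a = v_1 (up to sign)
    print(hubs, authorities)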

Toy Data

The Beatles Soap Opera

  • dislikes
  • doesn't like
  • likes

A completely made-up social network for demonstration purposes

HITS on the Beatles Soap Opera

(Figure: HITS hub and authority scores computed separately on the likes, dislikes, and doesn't-like adjacency matrices)

Graphs $\rightarrow$ Multigraphs

Authoritative Ranking with CP

  • Difficulties with HITS
    • HITS is a graph algorithm and doesn't know about multigraphs
  • TOPHITS: Models a multigraph not only in terms of hubs and authorities but also in terms of its edge labels, i.e. the relation types
  • \begin{align*} \text{HITS: }\quad A & \approx \sum\nolimits_r \lambda_r h_r \circ a_r\\ \text{TOPHITS: }\quad \mathcal{T} & \approx \sum\nolimits_r \lambda_r h_r \circ a_r \circ \color{greenyellow}{t_r} \end{align*}
  • TOPHITS is computed by applying the CP decomposition to the adjacency tensor

Outer Product / Rank-1 Tensors

Rank-1 tensors of order $n$ are the result of the outer product of $n$ vectors. For example, in the three-way case

\begin{align*} \mathcal{T} & = {\color{orangered}a} \circ {\color{yellow}b} \circ {\color{turquoise}c} \\ \mathcal{T}_{ijk} & = a_i b_j c_k \end{align*}

where $\circ$ denotes the outer product
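
A short numpy sketch (illustrative vectors) of building a rank-1 tensor as an outer product:

    import numpy as np

    a, b, c = np.array([1., 2.]), np.array([0., 1., 3.]), np.array([2., 5.])

    # outer product a ∘ b ∘ c via broadcasting (or np.einsum('i,j,k->ijk', a, b, c))
    T = a[:, None, None] * b[None, :, None] * c[None, None, :]

    print(T.shape)                                 # (2, 3, 2)
    print(np.isclose(T[1, 2, 0], a[1]*b[2]*c[0]))  # True: T_ijk = a_i b_j c_k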


CANDECOMP / PARAFAC (CP)

CP is a decomposition of a tensor $\mathcal{T}$ into a sum of rank-1 tensors

\begin{align*} \mathcal{T} & \approx \sum_{r=1}^{R} \lambda_r a_r \circ b_r \circ c_r \\ \mathcal{T}_{ijk} & \approx \sum_{r=1}^{R} \lambda_r a_{ir} b_{jr} c_{kr} \end{align*}
(Figure: $\mathcal{T} \approx \lambda_1 a_1 \circ b_1 \circ c_1 + \lambda_2 a_2 \circ b_2 \circ c_2 + \lambda_3 a_3 \circ b_3 \circ c_3$ as a sum of rank-1 tensors)
  • The rank $r$ of a CP decomposition is the number of rank-1 tensors used to approximate $\mathcal{T}$
  • "$\approx$" is the best approximation under some loss, e.g. $\|\mathcal{T} - \sum_r \lambda_r a_r \circ b_r \circ c_r \|^2$
  • Tensor rank $rank(\mathcal{T})$ is the smallest number $r$ of rank‑1 tensors that are needed to exactly reconstruct $\mathcal{T}$ as their sum
  • CP can be regarded a multilinear generalization of SVD
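
A minimal sketch (assuming numpy; names are illustrative) of reconstructing a tensor from CP factor matrices whose columns are the vectors $a_r$, $b_r$, $c_r$:

    import numpy as np

    def cp_reconstruct(lmbda, A, B, C):
        """Reconstruct sum_r lambda_r a_r ∘ b_r ∘ c_r from the factor matrices."""
        return np.einsum('r,ir,jr,kr->ijk', lmbda, A, B, C)

    I, J, K, R = 4, 3, 2, 2
    lmbda = np.ones(R)
    A, B, C = np.random.rand(I, R), np.random.rand(J, R), np.random.rand(K, R)

    T_hat = cp_reconstruct(lmbda, A, B, C)
    print(T_hat.shape)   # (4, 3, 2): a tensor of rank at most 2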

Authoritative Ranking with CP

  • For a rank-1 decomposition of an adjacency tensor $\mathcal{T} = \lambda h \circ a \circ t$ it holds that \begin{align} h_i & = \sum_{i \overset{k}{\rightarrow} j} a_j t_k & a_j & = \sum_{i \overset{k}{\rightarrow} j} h_i t_k & t_k & = \sum_{i \overset{k}{\rightarrow} j} h_i a_j \end{align}
  • Rank-reduced factorization performs simultaneous dimensionality reduction on entities and relation types, such that similar entities or predicates are grouped together
(Figure: a rank-2 CP approximation $\mathcal{T} \approx \lambda_1 a_1 \circ b_1 \circ c_1 + \lambda_2 a_2 \circ b_2 \circ c_2$ of an adjacency tensor $\mathcal{T} \in \mathbb{R}^{3 \times 3 \times 3}$)

TOPHITS on the Beatles Soap Opera

TripleRank - CP on Semantic Web Data

  • Basic Idea: Apply TOPHITS approach to RDF data to enable faceted browsing in knowledge bases
  • Example: The Beatles in DBpedia
  • TripleRank is applied to a subset of the data (e.g. all nodes with at most 2 hops distance to the query node)
  • Approach closer to original HITS application
  • Correlations between predicates can be important

TripleRank - CP on Semantic Web Data

(Figure: example facets found by TripleRank, some of them unfortunately problematic)

Properties of CP / Tensor Rank

  • Determining tensor rank is NP-hard! In practice: find tensor rank numerically by fitting multiple models
  • Non-nestedness: the best rank-$(n+1)$ approximation does not necessarily contain the best rank-$n$ approximation (the factors of a higher-rank model cannot simply be truncated)
  • Weak upper bound of tensor rank for third-order tensors: $rank(\mathcal{T}) \leq \min\{IJ, IK, JK\}$
  • Degeneracy: A tensor may be approximated arbitrarily well by a lower rank factorization


Relational Learning

Or "tell me what is your friend's Erdös-Number"

Collective Learning

Relations between entities are a rich source of information, as they allow information that is distant in the graph to be included

Collective Learning is the ability of a learning algorithm to include information such as classes, relations or attributes of related entities in the learning and prediction task for a particular entity

Example: Party membership of US presidents

Factorizations & Relational Learning

Using a Link Prediction Example

  • Matrix factorizations decompose an observed matrix $X$ into latent (or hidden) factors, e.g. \begin{align*}X \approx SO^T && X_{ij} \approx s_i^T o_j\end{align*}$S$ and $O$ are called factor matrices and $s_i$, $o_j$ are the $i$-th and $j$-th rows, written as column vectors
  • Latent factors can be interpreted as new features that have been invented to describe the data (statistical predicate invention)
  • The rows of the factor matrices $s_i$, $o_j$ can be considered latent-variable representations of (in our case) entities that explain the observed variables $X_{ij}$
  • The columns of the factor matrices can be considered the invented latent features
  • The entries of the factor matrices specify how much an entity participates in a latent feature

Collective Learning & Bipartite Model

  • CP actually models a bipartite multigraph, i.e. CP assumes that the entities indexed by the 1st mode are distinct from the entities indexed by the 2nd mode (different latent representations $a_r$, $b_r$ in $\sum_r a_r \circ b_r \circ c_r$)
  • $\Rightarrow$ CP does not account for identities between subjects and objects
  • For general relational data, such as Linked Data, this assumption is violated
  • Bipartite modeling prevents information propagation over latent variables, or at least makes it very difficult

Collective Learning and CP

  • CP model for relational data with unique entity-representation would be \[ \mathcal{T} \approx \sum_r e_r \circ e_r \circ p_r \]
  • All frontal slices would be symmetric
  • $\Rightarrow$ CP cannot simultaneously enforce the unique entity-representation constraint and still model directed relations
  • Tucker Decompositions offer more expressiveness than CP, such that the required constraints can be enforced $\Rightarrow$ RESCAL

Interlude - Unfolding

The mode-$n$ unfolding of a tensor $\mathcal{T}$, denoted by $\mathcal{T}_{(n)}$, reorders the elements of $\mathcal{T}$ into a new matrix $M$, by using the mode-$n$ fibers as columns of $M$. For $\mathcal{T} \in \mathbb{R}^{2 \times 3 \times 2}$

Mode-1 fibers
$\overset{\mathcal{T}_{(1)}}{\rightarrow}$ $\in \mathbb{R}^{2 \times 6}$
Mode-2 fibers
$\overset{\mathcal{T}_{(2)}}{\rightarrow}$ $\in \mathbb{R}^{3 \times 4}$
Mode-3 fibers
$\overset{\mathcal{T}_{(3)}}{\rightarrow}$ $\in \mathbb{R}^{2 \times 6}$
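
A minimal sketch of a mode-$n$ unfolding in numpy (the exact column ordering is a convention; this version orders the columns so that the unfolded identities used later in the tutorial hold):

    import numpy as np

    def unfold(T, mode):
        """Mode-n unfolding T_(n): arrange the mode-n fibers of T as columns."""
        return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order='F')

    T = np.arange(2 * 3 * 2).reshape(2, 3, 2)
    print(unfold(T, 0).shape)   # (2, 6)
    print(unfold(T, 1).shape)   # (3, 4)
    print(unfold(T, 2).shape)   # (2, 6)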

Interlude - Mode-n Product

The mode-$n$ product, denoted by $\times_n$, is the multiplication of a tensor by a matrix in mode $n$. Let $\mathcal{T} \in \mathbb{R}^{I_1 \times \dots \times I_N}$ and $M \in \mathbb{R}^{{\color{yellow}Q} \times I_n}$

$$\mathcal{Y} = \mathcal{T} \times_n M \in \mathbb{R}^{I_1 \times \dots \times I_{n-1} \times {\color{yellow}Q} \times I_{n+1} \times \dots \times I_{N}}$$ $$Y_{(n)} = MT_{(n)}$$
(Example: multiplying a tensor in $\mathbb{R}^{2 \times 3 \times 2}$ along mode 2 by a matrix in $\mathbb{R}^{2 \times 3}$ yields a tensor in $\mathbb{R}^{2 \times 2 \times 2}$)
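
A minimal sketch (assuming numpy; unfold and fold are hypothetical helpers) of computing the mode-$n$ product via $Y_{(n)} = M T_{(n)}$:

    import numpy as np

    def unfold(T, mode):
        return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order='F')

    def fold(M, mode, shape):
        full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
        return np.moveaxis(np.reshape(M, full, order='F'), 0, mode)

    def mode_n_product(T, M, mode):
        """Y = T x_n M, computed via the unfolded identity Y_(n) = M T_(n)."""
        new_shape = list(T.shape)
        new_shape[mode] = M.shape[0]
        return fold(M @ unfold(T, mode), mode, new_shape)

    T = np.random.rand(2, 3, 2)
    M = np.random.rand(2, 3)                 # maps the second mode (size 3) down to size 2
    print(mode_n_product(T, M, 1).shape)     # (2, 2, 2)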

RESCAL

Main Idea: RESCAL is a factorization of multi-relational data, or multigraphs, into unique entity and predicate representations

\begin{align} \mathcal{T} & \approx \mathcal{R} \times_1 A \times_2 A \\ \mathcal{T}_k & \approx A R_k A^T \\ \mathcal{T}_{ijk} & \approx \sum\nolimits_{q,r} a_{iq} \mathcal{R}_{qrk} a_{jr} \end{align}
(Figure: frontal-slice view $\mathcal{T}_k \approx A R_k A^T$)
  • $A \in \mathbb{R}^{|V| \times r}$ represents the entity-latent-component space
  • $R_k \in \mathbb{R}^{r \times r}$ is an asymmetric matrix that specifies the interaction of the latent components for the $k$-th predicate
  • where $r$ is the number of latent-components of the factorization
  • $\approx$ denotes the best approximation under some loss, e.g. $\sum_k \|\mathcal{T}_k - A R_k A^T\|^2$
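
A minimal sketch (assuming numpy; random factors for illustration) of reconstructing the frontal slices from a RESCAL-style factorization:

    import numpy as np

    n_entities, n_predicates, rank = 5, 3, 2
    A = np.random.rand(n_entities, rank)            # latent entity representations
    R = np.random.rand(n_predicates, rank, rank)    # one interaction matrix per predicate

    # reconstruct every frontal slice: T_k ≈ A R_k A^T
    T_hat = np.stack([A @ R[k] @ A.T for k in range(n_predicates)], axis=-1)
    print(T_hat.shape)    # (5, 5, 3): entity x entity x predicate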

Link Prediction

Link Prediction using RESCAL

(Figure: $\mathcal{T}_k \approx A R_k A^T$)
  • Basic procedure of link prediction with RESCAL
    • Compute "rank-reduced" factorization ($\triangleq$ latent-variable representations)
    • Entries after reconstruction (i.e. $\hat{\mathcal{T}}_{ijk} = a_i^T R_k a_j$) can be regarded as confidence values of the model that the corresponding triple exists (see the sketch after this list)
  • Method can be motivated from a latent variable interpretation of factorizations
  • Due to the independence of the random variables given the latent variables, prediction is very fast, as it reduces to simple matrix-vector products
  • Similar procedures are possible for CP and other tensor decompositions (with no or only a reduced collective learning effect)
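
A minimal sketch (assuming numpy; the helper names score and rank_objects are hypothetical) of scoring and ranking triples from the reconstructed entries:

    import numpy as np

    def score(A, R, i, j, k):
        """Confidence that (entity i, predicate k, entity j) exists: a_i^T R_k a_j."""
        return A[i] @ R[k] @ A[j]

    def rank_objects(A, R, i, k):
        """Rank all candidate objects j for a fixed subject i and predicate k."""
        scores = A @ R[k].T @ A[i]        # scores[j] = a_i^T R_k a_j for every entity j
        return np.argsort(-scores)

    A = np.random.rand(5, 2)
    R = np.random.rand(3, 2, 2)
    print(score(A, R, 0, 1, 2))
    print(rank_objects(A, R, 0, 2))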

U.S. Presidents - Data

  • Data extracted from DBpedia
  • Consists of URIs of (vice) presidents and parties, the party memberships and the information who was (vice) president of whom
  • Importantly, the data contains no other information, therefore only a collective learning algorithm will learn something meaningful on this dataset
  • Task: Predict party-membership for persons in dataset (link prediction)

U.S. Presidents - Experiments

  • Experimental setting: 10-fold cross-validation over all persons
  • Evaluation measure: Area under precision-recall curve (AUC-PR)
(Figure: AUC-PR results on the U.S. presidents link-prediction task)

Link Prediction Benchmarks

(Figure: AUC-PR results on the Kinships, UMLS, and Nations benchmark datasets)

Entity Resolution & Taxonomy Learning

Entity Resolution

  • Problem: Data contains duplicate instances of the true underlying entities
  • Example: Names as identifiers in citation databases: "A. Einstein", "Einstein, A.", "Albert Einstein" all refer to the identical underlying entity Albert Einstein
  • Task: Determine which instances refer to the identical underlying entity
  • Important task in the Linked Data context, where this task is commonly referred to as Instance Matching
  • Relational data provides important information for this task, e.g. how similar are the co-authors of "A. Einstein" and "Einstein, A."
  • General approach: Use some similarity measure to determine which instances are most likely to be identical

Similarity of Entities in RESCAL

(Figure: $\mathcal{T}_k \approx A R_k A^T$)
  • RESCAL has a unique representation of entities via $A$
  • $A$ can be regarded as an embedding of the entities into a simple vector space
  • Similarity of entities in the latent space $A$ corresponds to similarity in the relational domain. Why?
  • $\Rightarrow$ Any feature-based machine-learning method (e.g. for clustering, classification etc.) can already be applied to $A$
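
A minimal sketch (assuming numpy; the helper most_similar is hypothetical) of ranking entities by the cosine similarity of their rows in $A$:

    import numpy as np

    def most_similar(A, i, topn=3):
        """Rank entities by cosine similarity to entity i in the latent space A."""
        A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
        sims = A_norm @ A_norm[i]
        order = np.argsort(-sims)
        return [j for j in order if j != i][:topn]

    A = np.random.rand(6, 2)     # latent representations of 6 entities
    print(most_similar(A, 0))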

Interlude - Kronecker Product

The Kronecker product $\otimes$ is a generalization of the outer product to matrices. Let $A \in \mathbb{R}^{Q \times R}, B \in \mathbb{R}^{S \times T}$. The Kronecker Product is defined as

$A \otimes B = \begin{bmatrix} a_{11} \mathbf{B} & \cdots & a_{1R}\mathbf{B} \\ \vdots & \ddots & \vdots \\ a_{Q1} \mathbf{B} & \cdots & a_{QR} \mathbf{B} \end{bmatrix}$

where $A \otimes B$ is a $QS \times RT$ block-matrix
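
A short numpy example of the Kronecker product and its block structure:

    import numpy as np

    A = np.array([[1, 2],
                  [3, 4]])        # Q x R = 2 x 2
    B = np.array([[0, 1, 0],
                  [1, 0, 1]])     # S x T = 2 x 3

    K = np.kron(A, B)
    print(K.shape)                             # (4, 6) = (QS, RT)
    print(np.array_equal(K[:2, :3], 1 * B))    # upper-left block is a_11 * B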

RESCAL = Constrained Tucker

  • RESCAL can be regarded a variant of the Tucker Decomposition with the constraint that two factor matrices have to be identical
  • Equivalent formulations for $\mathcal{T} = \mathcal{R} \times_1 A \times_2 A$, with the identity matrix $I \in \mathbb{R}^{|L| \times |L|}$ \begin{align} \mathcal{T}_{(1)} & = A R_{(1)} (I \otimes A)^T \\ \mathcal{T}_{(2)} & = A R_{(2)} (I \otimes A)^T \\ \mathcal{T}_{(3)} & = R_{(3)}(A \otimes A)^T \end{align}
  • For instance \[ \mathcal{T}_{(1)} = A \begin{bmatrix}R_1 & R_2\end{bmatrix} \begin{bmatrix} A^T & 0 \\ 0 & A^T \end{bmatrix} = \begin{bmatrix} AR_1A^T & AR_2A^T \end{bmatrix} \]

Tucker Decomposition

Decomposition of a tensor $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$ into a core tensor, multiplied by factor matrices along each mode

\begin{align} \mathcal{T} & = \mathcal{G} \times_1 A \times_2 B \times_3 C \\ \mathcal{T}_{\color{greenyellow}{ijk}} & = \sum_{p,q,r} g_{pqr} a_{ip} b_{jq} c_{kr} \end{align}
(Figure: $\mathcal{T} \approx \mathcal{G} \times_1 A \times_2 B \times_3 C$)
  • The core tensor $\mathcal{G} \in \mathbb{R}^{P \times Q \times R}$ can be regarded a compressed version of $\mathcal{T}$
  • The factor matrices $A \in \mathbb{R}^{I \times P}$, $B \in \mathbb{R}^{J \times Q}$, $C \in \mathbb{R}^{K \times R}$ can be thought of as the principal components in the respective modes

Uniqueness

A decomposition is unique if it is the only possible combination of rank-1 tensors that sums to $\mathcal{T}$
  • Except basic indeterminacies of scaling and permutation
  • Matrix decompositions generally are not unique, e.g. \[M \approx AB = (AQ^{-1}) \, (QB)\]
  • Interesting property for explanatory data analysis
  • CP is unique under mild conditions
  • Uniqueness is not identical to a global optimum!

Properties of Tucker

  • Tucker is not unique \begin{align} \mathcal{G} \times_1 A \times_2 B \times _3 C = & \hat{\mathcal{G}}\times_1 AQ^{-1} \times_2 BR^{-1} \times_3 CS^{-1} \\ \mbox{where } & \hat{\mathcal{G}} = (\mathcal{G} \times_1 Q \times_2 R \times_3 S) \end{align}
  • Tucker subsumes CP, i.e. CP is a Tucker model with the additional constraints that the core tensor $\mathcal{G}$ is superdiagonal and the factor matrices have equal rank, i.e. $P = Q = R$
  • Different variants of Tucker, based on number of modes that are factorized, i.e. Tucker-3, Tucker-2, Tucker-1
  • Tucker does not enforce unique entity constraints

Similarity of Entities in RESCAL

General approach: similarity of entities should be based on the similarity of their relational patterns

Outgoing links

\begin{align*} X_{i,:,:} = & \color{orangered}{a_i} R_{(1)} (I \otimes A^T) \\ X_{j,:,:} = & \color{greenyellow}{a_j} R_{(1)} (I \otimes A^T) \end{align*}

Incoming links

\begin{align*} X_{:,i,:} = & \color{orangered}{a_i} R_{(2)} (I \otimes A^T) \\ X_{:,j,:} = & \color{greenyellow}{a_j} R_{(2)} (I \otimes A^T) \end{align*}
  • For both incoming and outgoing links, the similarity of entities $e_i$ and $e_j$ is uniquely determined by the similarity of their latent-component representations $a_i$ and $a_j$
  • Due to the collective learning effect of RESCAL, the latent-variable representation also contains information about distant relations

Cora Publication Data

  • Cora is a hand-labeled dataset consisting of machine learning publications, authors, venues, and paper titles
  • Prior to entity resolution, the data contains ~90 authors, ~210 venues, ~400 publications
  • True number of entities: 50 authors, ~100 venues and ~130 publications
  • In total ~61,000 matching decisions
  • Instance matching: Ranking of entities by their similarity in the entity-latent-component space

Cora Publication Data

(Figure: AUC-PR entity-resolution results on Cora for authors, citations, and venues)

Due to noisy citations, a collective learning effect can significantly improve entity resolution results for authors and venues

Learning Taxonomies

  • Taxonomies are an integral part of ontologies
  • Taxonomies consist of
    • a list of concepts
    • a concept hierarchy
    • an assignment of entities to concepts
  • Approach: Learn taxonomies completely unsupervised by hierarchical clustering of entities
  • Since the similarity of entities is completely contained within the matrix $A$, any feature-based hierarchical clustering method can be used (see the sketch below)
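
A minimal sketch of this idea (the slides use OPTICS; this sketch substitutes a generic agglomerative clustering from scipy for brevity, with illustrative data):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    A = np.random.rand(20, 5)            # latent entity representations from RESCAL

    Z = linkage(A, method='ward')        # agglomerative clustering on the embedding
    concepts = fcluster(Z, t=4, criterion='maxclust')   # cut the dendrogram into 4 "concepts"
    print(concepts)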

IIMB 2010 Benchmark Data

  • Instance matching benchmark from movie domain, consisting of $\approx$ 1400 entities, organized in 80 concepts
  • Top Level Concepts: Budget, Creature, Film, Language, Location
  • The concepts Creature, Film and Location are again subdivided into concepts like Person, Character, Country or Action Movie
  • Task: Unsupervised taxonomy learning via hierarchical clustering of entities
  • Method: Hierarchical clustering on $A$ using the OPTICS method

Results per taxonomy level (score per level, then per concept):

  • Level 1 (0.982): Budget 1.0, Creature 1.0, Film 1.0, Language 0.963, Location 0.953
  • Level 2 (0.806): Character 1.0, City 0.992, Country 0.632, Person 0.991, Parody 0.33, Zombie 1.0
  • Level 3 (0.946): Actor 0.987, Capital 0.992, Character Creator 0.533, Director 0.781

Scalability

or "Tensors? I thought these things don't scale?"

The Size of Graphs

  • Multigraphs / Linked Data can be of large scale, consisting of millions of entities, thousands of predicates, billions of known facts
  • Number of random variables in a multigraph:
    $|V| \times |V| \times |L|$ !
  • Real graphs are sparse, i.e. $|E| = O(|V|)$ instead of $O(|V|^2)$
  • Same property holds for most multigraphs and Linked Data, e.g.:
    • DBpedia: ~3.75 million entities, 800 object properties, 1.89 billion triples (counting all triples, not only entity-entity triples)
    • YAGO 2: ~3 million entities, ~40 object properties, 124 million triples (counting all triples, not only entity-entity triples)
  • By using sparse linear algebra, Alternating Least Squares (ALS) algorithms can easily exploit the sparsity of graphs

Alternating Least Squares

  • Typically, factorization methods are computed by optimizing the squared Frobenius norm, e.g. $\|Y - AB\|^2 = \sum_{ij}(Y_{ij} - a_i^T b_j)^2$
  • Alternating least squares is an alternating optimization scheme (non-linear block Gauss-Seidel method) to compute such least-squares problems in multiple variables
  • Basic procedure: Alternatingly keep all but one variable fixed and solve only for that variable
  • "Alternating Least Squares" because the algorithm solves a linear least squares subproblem for each factor matrix
  • Scales well for sparse matrices and tensors, easy to implement
  • Can show slow convergence (swamps); when using normal equations (next slide), ill-conditioned matrices can occur $\rightarrow$ regularization


Interlude - Khatri-Rao Product

Let
\begin{align*} A \in \mathbb{R}^{Q \times {\color{yellow}R}} & = \begin{bmatrix} a_1 & a_2 & \dots & a_R \end{bmatrix} ,\\ B \in \mathbb{R}^{S \times {\color{yellow}R}} & = \begin{bmatrix} b_1 & b_2 & \dots & b_R \end{bmatrix} \end{align*}
be two column-wise partitioned matrices.
The Khatri-Rao Product is defined as
$A \odot B = \begin{bmatrix} a_1 \otimes b_1 & a_2 \otimes b_2 & \dots & a_R \otimes b_R \end{bmatrix}$
where $A \odot B$ is a $QS \times R$ block-matrix
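
A minimal numpy sketch of the Khatri-Rao product (recent scipy versions also provide scipy.linalg.khatri_rao for the same operation):

    import numpy as np

    def khatri_rao(A, B):
        """Column-wise Kronecker product: column r of the result is kron(a_r, b_r)."""
        Q, R = A.shape
        S, R2 = B.shape
        assert R == R2, "A and B need the same number of columns"
        return np.einsum('qr,sr->qsr', A, B).reshape(Q * S, R)

    A = np.random.rand(3, 2)
    B = np.random.rand(4, 2)
    print(khatri_rao(A, B).shape)   # (12, 2) = (QS, R)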

Scalability: CP-ALS

  • CP-ALS: Model $\min_{a_r, b_r, c_r}\|\mathcal{T} - \sum_r a_r \circ b_r \circ c_r\|^2$
  • Equivalent formulations in unfolded form, by using the Khatri-Rao product and by forming new matrices $A$, $B$, $C$ from $a_r$, $b_r$, $c_r$, where e.g. \[ A = \begin{bmatrix}a_1 & a_2 & \dots & a_R\end{bmatrix} \] \begin{align*} T_{(1)} & = {\color{greenyellow}A} (C \odot B)^T & \Rightarrow \min_A \|T_{(1)} - A(C \odot B)^T\|^2 \\ T_{(2)} & = {\color{greenyellow}B} (C \odot A)^T & \Rightarrow \min_B \|T_{(2)} - B(C \odot A)^T\|^2 \\ T_{(3)} & = {\color{greenyellow}C} (B \odot A)^T & \Rightarrow \min_C \|T_{(3)} - C(B \odot A)^T\|^2 \end{align*}
  • CP-ALS algorithm: solve alternatingly for $A$, $B$, $C$
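
A compact CP-ALS sketch (assuming numpy; dense tensors only, no $\lambda$ normalization or convergence check, names are illustrative) that solves the three subproblems above with the normal equations:

    import numpy as np

    def unfold(T, mode):
        return np.reshape(np.moveaxis(T, mode, 0), (T.shape[mode], -1), order='F')

    def khatri_rao(A, B):
        return np.einsum('qr,sr->qsr', A, B).reshape(A.shape[0] * B.shape[0], A.shape[1])

    def cp_als(T, rank, n_iter=50):
        """Plain CP-ALS: alternately solve the three least-squares subproblems."""
        I, J, K = T.shape
        A, B, C = np.random.rand(I, rank), np.random.rand(J, rank), np.random.rand(K, rank)
        for _ in range(n_iter):
            A = unfold(T, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
            B = unfold(T, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
            C = unfold(T, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
        return A, B, C

    T = np.random.rand(4, 4, 3)
    A, B, C = cp_als(T, rank=2)
    print(np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C)))   # approximation error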

Normal Equations and Sparsity

  • Let $\min_W \|Y - WX\|^2$ be a linear least squares problem, where $Y \in \mathbb{R}^{P \times Q}$, $W \in \mathbb{R}^{P \times rank}$, $X \in \mathbb{R}^{rank \times Q}$
  • $W$ can be computed by using the method of normal equations \[ W \leftarrow Y X^T (XX^T)^{-1} \]
  • $P$ and $Q$ can be large, e.g. for a Mode-1 unfolding of an adjacency tensor: $P = |V|$, $Q = |V| \times |L|$
  • Normal equations scale well for sparse tensor factorizations, as
    • $XX^T \in \mathbb{R}^{rank \times rank} \Rightarrow$ Inversion is cheap
    • $Y$ is large but sparse $\Rightarrow$ matrix multiplication $YX^T$ can be computed in only $O(nnz(Y) \times rank)$
    • $X$ is large in one dimension ($Q$) and dense but structured $\Rightarrow$ allows for very efficient optimizations to compute $XX^T$ for CP, Tucker and RESCAL

Normal Equations and CP

  • CP Model: $\mathcal{T}_{(1)} = A(C \odot B)^T$
  • For instance, the update of $A$ is computed by \[ A \leftarrow \underbrace{\mathcal{T}_{(1)}}_{=Y} \underbrace{(C \odot B)}_{=X^T} \underbrace{(C^TC * B^TB)^{-1}}_{=(XX^T)^{-1}} \] where $*$ denotes the element-wise (Hadamard) product, $C \odot B \in \mathbb{R}^{|V||L| \times rank}$ and $C^TC * B^TB \in \mathbb{R}^{rank \times rank}$
  • The structure of the factorization decreases the complexity: computing $C^TC * B^TB$ costs $O\left((|V| + |L|)\,rank^2\right)$, whereas computing $(C \odot B)^T (C \odot B)$ directly would cost $O\left(|V||L|\,rank^2\right)$
  • Similar, even slightly more scalable results exist for RESCAL and Tucker decompositions

They do scale!

(given suitable data & the right algorithm)

  • RESCAL-ALS scales linearly with the number of entities, predicates and known facts, superlinearly with the number of latent components
  • Scalability experiments using RESCAL on synthetic data
  • Base tensor $\mathcal{T} \in \mathbb{R}^{|V| \times |V| \times |L|}$, with $|V| = 1000$, $|L| = 10$
  • In each experiment, vary one of the parameters $|V|$, $|L|$, $nnz(\mathcal{T})$ and the number of latent components

Implications of Using ALS

  • The objective function of linear least squares is of the form \[ \|Y - WX\|^2 \] which has various implications
  • Normality Assumption: The objective is only valid under the assumption that errors on random variables are normally distributed. For relational data a better error model is binomial
  • Open-World Assumption: By default, the objective treats unobserved triples as negative examples. Due to the open-world assumption of LD, this is not entirely justified
  • Both issues can be approached by using different objective functions, e.g. a pairwise preference-ranking objective or a binomial error term. However, there is usually a trade-off between better modeling and scalability.
  • Does not account for some properties of real-world graphs, such as scale-free degree distributions

Tensors & Attributes of Entities

Literally

Tensors + Attributes of Entities

  • Much of the information in Linked Data is in the form of literals
  • In a machine learning setting, literals correspond to attributes
  • Naive handling:
    • Discretization of continuous or textual attributes
      • $20 \rightarrow$ young, $99 \rightarrow$ old
      • "Albert Einstein" $\rightarrow$ {"Albert", "Einstein"}
    • Use results of discretization step as entities and fill tensor accordingly
      • (Bob, inAgeGroup, old)
      • (Albert_Einstein, hasWord, Albert)

Handling of Attributes

  • Naive handling has serious scalability issues
  • Include attributes as additional constraint on entity representation \[ D \approx AV \]
  • Joint or "coupled" tensor factorization
  • For instance, with RESCAL \[ \min_{A,R,V} \sum_k \|X_k - A R_k A^T \|^2 + \|D - AV\|^2 \]
  • Similar methods exist for CP and Tucker

Graphs, Time & Tensors

Time-dependent Analysis of Graphs

  • Common ways to model time and graphs
    • Time-slices, i.e. assign points in time to bins (days, months etc.)
    • Sequential modeling, such as user - item - last item
  • Common assumptions made
    • Smoothness: Data changes smoothly over time
    • Markov property: Data at time point $t_i$ depends only on $t_{i-1}$
  • Here we will only consider the time-slices approach

DEDICOM

Basic Idea: Factorization of relational data into unique entity and predicate representations

\begin{align} \mathcal{T}_k & = A D_k R D_k A^T \\ \mathcal{T}_{ijk} & = a_i^T D_k R D_k a_j \end{align}
  • DEDICOM has one global interaction model $R$
  • Global patterns vary linearly over the frontal slices of $\mathcal{T}$
  • Well suited to model time-varying graphs
  • Only defined for 2nd- and 3rd-order tensors
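
A minimal sketch (assuming numpy; random factors for illustration) of reconstructing the time slices from DEDICOM factors:

    import numpy as np

    n_entities, n_slices, rank = 5, 4, 2
    A = np.random.rand(n_entities, rank)        # role loadings of the entities
    R = np.random.rand(rank, rank)              # one global role-interaction pattern
    D = np.random.rand(n_slices, rank)          # per-time-slice scaling of the roles

    # reconstruct the k-th time slice: T_k = A D_k R D_k A^T with D_k = diag(D[k])
    T_hat = np.stack([A @ np.diag(D[k]) @ R @ np.diag(D[k]) @ A.T
                      for k in range(n_slices)], axis=-1)
    print(T_hat.shape)    # (5, 5, 4): entity x entity x time slice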

Enron Email Communication Network

  • Email data from ~150 users, mostly Enron senior management
  • Collected and made public by FERC during its investigation
  • Original Data ~500,000 messages, containing
    • To: / From:
    • Date
    • Subject / Body
  • In the following, a smaller subset of the data is used, consisting of 184 users and 34,427 messages
  • Modeling of data via 44 time-slices

Enron Network Analysis

  • $A$ assigns persons to roles, i.e. trade executive etc.
  • $R$ models global interactions of these roles
(Figures: assignment quality of $A$; global communication patterns in $R$)

Enron Network Analysis

$D$ expresses the variation of patterns over time

Sequence Models

  • Modeling with time-slices does not take the temporal order into account (the fibers along the time mode are exchangeable)
  • Sequence-oriented modeling: entity $\times$ entity $\times$ last entity
  • For instance: user $\times$ item $\times$ last item in recommendation engines
  • Ternary Relation $\rightarrow$ Hypergraph!

Concluding Remarks

  • Tensors offer a flexible and scalable approach to machine learning on multi-relational data
  • Tensors offer a natural way to represent multi-graphs via adjacency tensors, without losing information as in the matrix case
  • Tensor factorizations like CP can be used for explanatory analysis on multi-graphs, e.g. to rank nodes by their importance
  • Rank-reduced factorizations of all three modes of a tensor offer a way to do joint disambiguation of entities and predicates
  • CP provides interpretable results via uniqueness property

Concluding Remarks

  • Tensor factorization methods have applications to many important problems in the Linked Data context such as link prediction, instance matching or information retrieval
    • RESCAL shows state-of-the-art results in link prediction and entity resolution
    • Embedding $A$ of RESCAL enables standard machine learning algorithms on relational data
  • Due to the sparsity of adjacency tensors, factorizations can scale up to complete knowledge bases
  • Time-evolving graphs and networks can be analysed by inspection of the factor matrices of appropriate decompositions like DEDICOM

Code & Tools

If You Want More...

  • Python and Matlab code for RESCAL available from my website and Github
    
    from rescal import rescal
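    # X: assumed to be the adjacency tensor of the data (one frontal slice per predicate)
    # 20 latent components, factors initialized via eigenvectors ('nvecs')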
    A, R = rescal(X, 20, init='nvecs')
    				
  • Easy to use tools to convert from N-Triples or other formats to Python and Matlab tensors available from my website and Github
  • Tensor Toolbox for Matlab available from Tamara Kolda's site

Discussion ?!