This tutorial is dedicated to how machine learning on Linked Data can be realized using tensor factorizations. The focus rests on applications that are important to Linked Data, and we introduce the necessary theory along the way.
From Linked Data to Tensors
Some tensor theory and data representation of Linked Data as tensors
Exploratory Analysis of Linked Data
Starting from graphs and moving to multigraphs for ranking of nodes in RDF data
Relational Learning & Link Prediction
Prediction of unknown links, a central problem when learning on relational data.
Entity Resolution & Taxonomy Learning
Unsupervised instance matching and learning of taxonomies
Scaling to complete Knowledge Bases
Scaling tensor factorizations to the size of complete knowledge bases
Analysis of time-evolving graphs
Machine Learning and Linked Data
Machine Learning depends on repeatable statistical patterns to generalize from examples
The party membership of a U.S. vice president can be predicted reasonably well, given knowledge about his president's party
Predicting the first name of US (vice) presidents is not meaningful (one-to-one relationship, no recurring patterns)
Machine Learning depends strongly on the data at hand and the task that should be solved
Linked Data and Machine Learning
The Linked Data principles are:
Use URIs as names for things
Use HTTP URIs, so that people can look up those names
When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
Include links to other URIs, so that they can discover more things
From a machine learning perspective:
Linked Data is (semi-)structured, i.e. (subject, predicate, object) triples
Data consists of entities and relations of different types between those entities
Entities = All subjects and objects that are not literals (In this tutorial: no special treatment of ontological knowledge)
Relations = Individual links between entities
Relation Types = Properties in RDF
Relations, the Beauty & the Challenge
Knowledge models such as Linked Data and many problems in machine learning have a natural representation as relational data
Relations between entities are often more important for a prediction task than attributes
For instance, it can be easier to predict the party of a vice president from the party of his president than from his attributes
Linked Data is not independent and identically distributed (iid), as
it does not consist of only one type of object (e.g. only persons)
the entities are related to each other
Learning, Linked Data & Heterogeneity
Heterogeneity introduces additional noise to the data
Duplicate entities: URIs from different data sources refer to the same underlying entity (without specifying owl:sameAs)
Duplicate predicates: Same case as for duplicate entities
Predicates with similar semantics: Probably the more common case; for instance, different domain/range constraints, super-/sub-predicates, etc.
Need of Linked Data: Instance matching and ontology alignment to remove the noise introduced by heterogeneity
Need of Machine Learning: Joint disambiguation of predicates and entities
The Open World of Linked Data
Linked Data follows an open-world assumption, i.e. data cannot generally be assumed to be complete
In fact, Linked Data can be quite incomplete
Need of Linked Data: Prediction of unknown triples in incomplete knowledge bases $\Rightarrow$ Link Prediction
Problem for ML: Linked Data rarely contains negative examples or negations (e.g. $\neg$partyOf(Al Gore, Republican))
Challenges !?
Many of the features that make Linked Data so valuable (e.g. relations, linked data sources, size), can be a problem for machine learning and also Linked Data itself
The rest of this tutorial is dedicated to showing how most of these problems can be approached using tensor factorizations
Data Representation
Linked Data
$\downarrow$
Tensors
Linked Data is a Multigraph
From a mathematical perspective, Linked Data can be regarded as a labeled directed multigraph. A minimal definition for the purpose of this tutorial is
A labeled directed multigraph is a tuple $\mathcal{G} := (V, E, L)$
where
$V$ is a set of vertices
$L$ is a set of edge labels
$E \subseteq V \times V \times L$ is a set of ordered triples
$V$ corresponds to the entities in the domain
$L$ corresponds to the different relation types
$E$ corresponds to the known facts about the entities (RDF triples)
Multigraph Example
(Figure: a multigraph connecting Bill Clinton, Al Gore, and the Democratic Party via the edge labels presidentOf, vicePresidentOf, partyOf, and knows)
The sets of vertices, edge labels, and edges for this multigraph are
\begin{align}
V = \{&\text{Bill Clinton}, \text{Al Gore}, \text{Democratic Party}\} \\
L = \{&\text{presidentOf}, \text{vicePresidentOf}, \text{partyOf}, \text{knows}\} \\
E = \{&\text{(Bill Clinton, Al Gore, presidentOf)},\\
& \text{(Al Gore, Democratic Party, partyOf)}, \\
& \vdots \\
& \text{(Bill Clinton, Al Gore, knows)}\}
\end{align}
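As a concrete illustration, here is a minimal Python sketch of this multigraph as plain sets; triples follow the $(v_1, v_2, \text{label})$ convention from the definition above:

V = {"Bill Clinton", "Al Gore", "Democratic Party"}
L = {"presidentOf", "vicePresidentOf", "partyOf", "knows"}
E = {
    ("Bill Clinton", "Al Gore", "presidentOf"),
    ("Al Gore", "Democratic Party", "partyOf"),
    ("Bill Clinton", "Al Gore", "knows"),
}

# sanity check: E must be a subset of V x V x L
assert all(v1 in V and v2 in V and lab in L for (v1, v2, lab) in E)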
Matrix Representation of Graphs
Adjacency matrices are standard representations for graphs, i.e. data with only one type of edges or relations
(Figure: adjacency matrix $A$ of the likes relation; legend: filled cell $= 1$, empty cell $= 0$)
Some graph properties translate directly to properties of the adjacency matrix
Undirected $\rightarrow$ Symmetric
Directed $\rightarrow$ Asymmetric
Unweighted $\rightarrow$ Binary
Analysis of graphs using adjacency matrices
Ranking: PageRank, HITS
Link Prediction: Common Neighbours, Katz, Adamic/Adar, low-rank matrix factorization
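To make two of these heuristics concrete, a small numpy sketch (the toy graph and the damping factor $\beta$ are made up for illustration):

import numpy as np

# toy directed graph as an adjacency matrix (1 = edge, 0 = no edge)
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

# common neighbours: entry (i, j) counts shared successors of nodes i and j
common_neighbours = A @ A.T

# Katz index: weighted sum over all paths, with weight beta^length per path;
# converges for beta < 1 / (largest eigenvalue of A)
beta = 0.1
katz = np.linalg.inv(np.eye(3) - beta * A) - np.eye(3)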
Multigraphs and Matrices
Multigraphs cannot be represented by a single matrix without information loss, unless constraints on the structure are enforced
(Figure: projection example, the adjacency matrices of Loves $+$ Hates $+$ Indifferent collapsed into a single matrix)
Simple method: Projection of multigraph onto graph structure
Loses information about multiple predicates
Alternative method: Concatenation of adjacency matrices
Loses information about subject / object identities without additional constraints
Moving from Matrices to Tensors
Underlying problem: Two modes, as in the matrix case, are not expressive enough to model multi-relational data without information loss
Possible solution: Move to tensors, which are $n$-modal generalizations of matrices
What are tensors?
Formally, tensors are defined as follows:
An $n$-th order tensor is an element of the tensor product of $n$ vector spaces, i.e. $\mathcal{T} \in V_1 \otimes \dots \otimes V_n$
Less formally: tensors are multidimensional arrays, i.e. data structures with (possibly) more than two indices ($\mathcal{T}_{ijqrs}$)
The order of a tensor is the number of modes (i.e., the number of indices needed to identify elements in the array)
What are tensors?
Vectors are 1st order tensors: $x_i$ (one mode)
Matrices are 2nd order tensors: $X_{ij}$ (two modes)
Tensors of order $\geq$ 3 are called higher-order tensors $\mathcal{T}_{ijqrs}$
In the following we will only use third-order tensors, i.e. "cubes"
Mode-1 has dimension $I = 10$, mode-2 has dimension $J = 7$, mode-3 has dimension $K = 5$
Tensor Representation of RDF Data
RDF data, or any other multigraph, can be represented as a third-order tensor
Two modes refer to the entities, one mode to the relation types
\begin{align}
\Rightarrow & |Mode\ 1| = |Mode\ 2| = |V|\\
& |Mode\ 3| = |L|
\end{align}
We also require that the ordering of the entities on both modes is identical
In the following we will refer to this modeling as adjacency tensor
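A minimal sketch of this construction in Python (a dense numpy array for readability; real knowledge bases require sparse storage, see the scalability section):

import numpy as np

triples = [("Bill Clinton", "Al Gore", "presidentOf"),
           ("Al Gore", "Democratic Party", "partyOf"),
           ("Bill Clinton", "Al Gore", "knows")]

entities = sorted({s for s, o, p in triples} | {o for s, o, p in triples})
predicates = sorted({p for s, o, p in triples})
ent_idx = {e: i for i, e in enumerate(entities)}    # same ordering on modes 1 and 2
pred_idx = {p: k for k, p in enumerate(predicates)}

# adjacency tensor: |V| x |V| x |L|, one frontal slice per relation type
T = np.zeros((len(entities), len(entities), len(predicates)))
for s, o, p in triples:
    T[ent_idx[s], ent_idx[o], pred_idx[p]] = 1.0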
Running Example $\rightarrow$ Tensor
The adjacency tensor for the running example used so far is
(Figure: the multigraph $\mathcal{G}$ with relations presidentOf, vicePresidentOf, and partyOf is mapped to its adjacency tensor $\mathcal{T}$)
An adjacency tensor contains a cell for every possible triple that could be created from the entities and relation types that are present in the data
Notation & Terminology
Slices are two-dimensional sections of a tensor (i.e. matrices)
Horizontal slices $\mathcal{T}_{i,:,:}$
Lateral slices $\mathcal{T}_{:,j,:}$
Frontal slices $\mathcal{T}_{:,:,k}$
Notation & Terminology
Fibers are higher-order analogues of rows and columns in matrices
Mode-1 fibers $\mathcal{T}_{:,j,k}$
Mode-2 fibers $\mathcal{T}_{i,:,k}$
Mode-3 fibers $\mathcal{T}_{i,j,:}$
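In numpy, slices and fibers are plain index expressions, e.g.:

import numpy as np

T = np.arange(2 * 3 * 4).reshape(2, 3, 4)

horizontal_slice = T[0, :, :]   # T_{i,:,:}, a 3x4 matrix
lateral_slice    = T[:, 1, :]   # T_{:,j,:}, a 2x4 matrix
frontal_slice    = T[:, :, 2]   # T_{:,:,k}, a 2x3 matrix

mode1_fiber = T[:, 1, 2]        # T_{:,j,k}, a "column" of length 2
mode2_fiber = T[0, :, 2]        # T_{i,:,k}, a "row" of length 3
mode3_fiber = T[0, 1, :]        # T_{i,j,:}, a "tube" of length 4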
Machine Learning on Tensors
The general focus of this tutorial will be learning on adjacency tensors of multigraphs with tensor factorizations
Tensor factorizations can be considered higher-order generalizations of matrix factorization methods
Matrix factorizations have a long tradition in machine learning
Matrix factorizations try to express data in terms of factor matrices
Factorization methods have an appealing interpretation as latent variable models
The CP Decomposition & Exploratory Analysis
Hubs, Authorities, Topics & Co.
The Beatles on DBpedia
What are the important nodes of this multigraph?
In the Case of Graphs
Authoritative Ranking with HITS
Task: Ranking of nodes relative to their relevance for a query
Main Idea: Describe nodes in the graph in terms of hub- and authority-scores
Authority score $a_j$: Importance of node $j$ for incoming links (web context: value of the content of a page), i.e.
$a_j = \sum_{i \rightarrow j} h_i$
Hub score $h_i$: Importance of node $i$ for outgoing links (web context: value of links to other pages), i.e.
$h_i = \sum_{i \rightarrow j} a_j$
Mutual Reinforcement: Good hubs link to many good authorities, good authorities receive many links from good hubs
(rank-reduced)
Singular Value Decomposition
Any matrix $X$ has a decomposition into a weighted sum of the outer products of pairwise orthonormal vectors, i.e. $X = \sum_r \lambda_r u_r \circ v_r$, such that
\[ X = \lambda_1 u_1 \circ v_1 + \dots + \lambda_r u_r \circ v_r \]
$\lambda_{i}$ is the $i$-th singular value of $X$
$u_i$ is the $i$-th left singular vector of $X$
$v_i$ is the $i$-th right singular vector of $X$
the sets $u_i$, $v_i$ are pairwise orthonormal, e.g. $u_i^Tu_i = 1$, $u_i^Tu_{j\neq i} = 0$
the rank of $X$ can be defined as the minimal number $r$ of outer products $\lambda_r u_r \circ v_r$ that generate $X$ as their sum
Some Properties of SVD
The left singular vectors $u_r$ of $X$ are the eigenvectors of the kernel matrix $XX^T$, i.e. $$XX^T u_r = \lambda_r^2 u_r$$
The right singular vectors $v_r$ of $X$ are the eigenvectors of the covariance matrix $X^TX$, i.e. $$X^TX v_r = \lambda_r^2 v_r$$
SVD has many applications, e.g. LSI, dimensionality reduction, recommendation engines, HITS
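These properties are easy to verify numerically (a random matrix for illustration):

import numpy as np

X = np.random.rand(5, 4)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# left singular vectors are eigenvectors of X X^T with eigenvalue lambda_r^2
assert np.allclose(X @ X.T @ U[:, 0], s[0] ** 2 * U[:, 0])
# right singular vectors are eigenvectors of X^T X
assert np.allclose(X.T @ X @ Vt[0], s[0] ** 2 * Vt[0])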
In the Case of Graphs
Authoritative Ranking with HITS
The hub- and authority-scores are defined as
\begin{align*}
a_j & = \sum_{i \rightarrow j} h_i &
h_i & = \sum_{i \rightarrow j} a_j \\
\end{align*}
In terms of the adjacency matrix $A$, this can be stated as
\begin{align*}
a & = A^T h & h & = A a \\
\Rightarrow a & = A^TAa & h & = AA^Th
\end{align*}
$\Rightarrow$ The vectors $a$ and $h$ correspond to the dominant right and left singular vectors of $A$
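A short numpy sketch of this correspondence (toy adjacency matrix; signs of singular vectors are only determined up to a flip, hence the absolute values):

import numpy as np

A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(A)
hubs = np.abs(U[:, 0])         # dominant left singular vector
authorities = np.abs(Vt[0])    # dominant right singular vector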
The CP decomposition approximates a tensor by a sum of rank-1 tensors, i.e. $\mathcal{T} \approx \sum_r \lambda_r\, a_r \circ b_r \circ c_r$
The rank $r$ of a CP decomposition is the number of rank-1 tensors used to approximate $\mathcal{T}$
"$\approx$" is the best approximation under some loss, e.g. $\|\mathcal{T} - \sum_r \lambda_r a_r \circ b_r \circ c_r \|^2$
Tensor rank $rank(\mathcal{T})$ is the smallest number $r$ of rank‑1 tensors that are needed to exactly reconstruct $\mathcal{T}$ as their sum
CP can be regarded as a multilinear generalization of the SVD
Authoritative Ranking with CP
For a rank-1 decomposition of an adjacency tensor, $\mathcal{T} \approx \lambda\, h \circ a \circ t$, it holds (up to scaling) that
\begin{align}
h_i & = \sum_{i \overset{k}{\rightarrow} j} a_j t_k &
a_j & = \sum_{i \overset{k}{\rightarrow} j} h_i t_k &
t_k & = \sum_{i \overset{k}{\rightarrow} j} h_i a_j
\end{align}
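These updates can be implemented directly as an alternating (higher-order power) iteration; a minimal sketch, with normalization added for numerical stability:

import numpy as np

def rank1_scores(T, iters=50):
    # alternating updates for hub (h), authority (a) and topic (t) scores
    # of an adjacency tensor T, following the three sums above
    I, J, K = T.shape
    h, a, t = np.ones(I), np.ones(J), np.ones(K)
    for _ in range(iters):
        h = np.einsum('ijk,j,k->i', T, a, t); h /= np.linalg.norm(h)
        a = np.einsum('ijk,i,k->j', T, h, t); a /= np.linalg.norm(a)
        t = np.einsum('ijk,i,j->k', T, h, a); t /= np.linalg.norm(t)
    return h, a, t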
Rank-reduced factorization performs simultaneous dimensionality reduction on entities and relation types, such that similar entities or predicates are grouped together
\[ \mathcal{T} \approx \lambda_1\, a_1 \circ b_1 \circ c_1 + \lambda_2\, a_2 \circ b_2 \circ c_2, \quad \mathcal{T} \in \mathbb{R}^{3 \times 3 \times 3} \]
TOPHITS on the Beatles Soap Opera
TripleRank - CP on Semantic Web Data
Basic Idea: Apply TOPHITS approach to RDF data to enable faceted browsing in knowledge bases
TripleRank is applied to a subset of the data (e.g. all nodes with at most 2 hops distance to the query node)
Approach closer to original HITS application
Correlations between predicates can be important
Properties of CP / Tensor Rank
(some unfortunately problematic)
Determining tensor rank is NP-hard! In practice: find tensor rank numerically by fitting multiple models
Non-nestedness: The best rank-$(n+1)$ approximation does not necessarily contain the best rank-$n$ approximation, i.e. the factors cannot be computed incrementally
Weak upper bound of tensor rank for third-order tensors: $rank(\mathcal{T}) \leq \min\{IJ, IK, JK\}$
Degeneracy: A tensor may be approximated arbitrarily well by a lower rank factorization
(True) Relational Learning
Or "tell me what your friend's Erdős number is"
Collective Learning
Relations between entities are a rich source of information, as they make information available that is distant in the graph
Collective Learning is the ability of a learning algorithm to include information such as classes, relations or attributes of related entities in the learning and prediction task for a particular entity
Example: Party membership of US presidents
Factorizations & Relational Learning
Using a Link Prediction Example
Matrix factorizations decompose an observed matrix $X$ into latent (or hidden) factors, e.g. \begin{align*}X \approx SO^T && X_{ij} \approx s_i^T o_j\end{align*} $S$ and $O$ are called factor matrices and $s_i$, $o_j$ are the $i$-th and $j$-th rows as column vectors
Latent factors can be interpreted as new features that have been invented to describe the data (statistical predicate invention)
The rows of the factor matrices $s_i$, $o_j$ can be considered latent-variable representations of (in our case) entities that explain the observed variables $X_{ij}$
The columns of the factor matrices can be considered the invented latent features
The entries of the factor matrices specify how much an entity participates in a latent feature
Collective Learning & Bipartite Model
CP actually models a bipartite multigraph, i.e. CP assumes that the entities on the 1st mode are not identical to the entities on the 2nd mode (different latent representations $a_r$, $b_r$ in $\sum_r a_r \circ b_r \circ c_r$)
$\Rightarrow$ CP does not account for identities between subjects and objects
For general relational data, such as Linked Data, this assumption is violated
Bipartite modeling prevents information propagation over latent variables, or at least makes it very difficult
Collective Learning and CP
CP model for relational data with unique entity-representation would be
\[
\mathcal{T} \approx \sum_r e_r \circ e_r \circ p_r
\]
All frontal slices would be symmetric
$\Rightarrow$ CP cannot simultaneously enforce the unique entity-representation constraint and still model directed relations
Tucker Decompositions offer more expressiveness than CP, such that the required constraints can be enforced $\Rightarrow$ RESCAL
Interlude - Unfolding
The mode-$n$ unfolding of a tensor $\mathcal{T}$, denoted by $\mathcal{T}_{(n)}$, reorders the elements of $\mathcal{T}$ into a new matrix $M$, by using the mode-$n$ fibers as columns of $M$. For $\mathcal{T} \in \mathbb{R}^{2 \times 3 \times 2}$
Mode-1 fibers $\rightarrow \mathcal{T}_{(1)} \in \mathbb{R}^{2 \times 6}$
Mode-2 fibers $\rightarrow \mathcal{T}_{(2)} \in \mathbb{R}^{3 \times 4}$
Mode-3 fibers $\rightarrow \mathcal{T}_{(3)} \in \mathbb{R}^{2 \times 6}$
Interlude - Mode-n Product
The mode-$n$ product, denoted by $\times_n$, is the multiplication of a tensor by a matrix in mode $n$. Let $\mathcal{T} \in \mathbb{R}^{I_1 \times \dots \times I_N}$ and $M \in \mathbb{R}^{Q \times I_n}$. Then $\mathcal{T} \times_n M \in \mathbb{R}^{I_1 \times \dots \times I_{n-1} \times Q \times I_{n+1} \times \dots \times I_N}$, defined via the unfolding as $(\mathcal{T} \times_n M)_{(n)} = M \mathcal{T}_{(n)}$
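A sketch of both interludes in numpy; the column ordering of the unfolding follows one common convention, other texts may order the columns differently:

import numpy as np

def unfold(T, n):
    # mode-n unfolding: mode-n fibers become the columns of a matrix
    return np.reshape(np.moveaxis(T, n, 0), (T.shape[n], -1), order='F')

def fold(M, n, shape):
    # inverse of unfold for a tensor of the given shape
    moved = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(np.reshape(M, moved, order='F'), 0, n)

def mode_n_product(T, M, n):
    # T x_n M: multiply M into mode n, replacing dimension I_n by Q
    new_shape = list(T.shape)
    new_shape[n] = M.shape[0]
    return fold(M @ unfold(T, n), n, tuple(new_shape))

T = np.random.rand(2, 3, 2)
assert unfold(T, 1).shape == (3, 4)           # matches the example above
M = np.random.rand(4, 3)                      # Q x I_2
assert mode_n_product(T, M, 1).shape == (2, 4, 2)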
RESCAL factorizes each frontal slice of the adjacency tensor as $\mathcal{T}_k \approx A R_k A^T$, with a shared entity factor $A$ and one interaction matrix $R_k$ per relation type
Entries after reconstruction (i.e. $\hat{\mathcal{T}}_{ijk} = a_i^T R_k a_j$) can be regarded as confidence values of the model that the corresponding triple exists
Method can be motivated from a latent variable interpretation of factorizations
Due to the independence of the random variables given the latent variables, prediction is very fast, since it is equivalent to simple matrix-vector products
Similar procedures possible for CP and other tensor decompositions (without or reduced collective learning effect)
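As a sketch, given learned factors $A$ and $R_k$, prediction amounts to a few matrix products (the helper names below are illustrative):

import numpy as np

def predict_slice(A, R_k):
    # confidence scores for all triples of predicate k: scores[i, j] = a_i^T R_k a_j
    return A @ R_k @ A.T

def predict_for_subject(A, R_k, i):
    # scores of all objects for subject e_i: a matrix-vector product chain
    return A[i] @ R_k @ A.T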
U.S. Presidents - Dataset
The dataset consists of URIs of (vice) presidents and parties, the party memberships, and the information who was (vice) president of whom
Importantly, the data contains no other information, therefore only a collective learning algorithm will learn something meaningful on this dataset
Task: Predict party-membership for persons in dataset (link prediction)
U.S. Presidents - Experiments
Experimental setting: 10-fold cross-validation over all persons
Evaluation measure: Area under precision-recall curve (AUC-PR)
Link Prediction Benchmarks
(Figure: AUC-PR results on the Kinships, UMLS, and Nations benchmark datasets)
Entity Resolution & Taxonomy Learning
Entity Resolution
Problem: Data contains duplicate instances of the true underlying entities
Example: Names as identifiers in citation databases: "A. Einstein", "Einstein, A.", "Albert Einstein" all refer to the identical underlying entity Albert Einstein
Task: Determine which instances refer to the identical underlying entity
Important task in the Linked Data context, where this task is commonly referred to as Instance Matching
Relational data provides important information for this task, e.g. how similar are the co-authors of "A. Einstein" and "Einstein, A."
General approach: Use some similarity measure to determine which instances are most likely to be identical
Similarity of Entities in RESCAL
\[ \mathcal{T}_k \approx A R_k A^T \]
RESCAL has a unique representation of entities via $A$
$A$ can be regarded as an embedding of the entities into a simple vector space
Similarity of entities in the latent space $A$ corresponds to similarity in the relational domain. Why?
$\Rightarrow$ Any feature-based machine-learning method (e.g. for clustering, classification etc.) can already be applied to $A$
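For instance, a sketch using scikit-learn on the rows of $A$; the embedding here is a random placeholder standing in for a learned factor:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

A = np.random.rand(100, 20)    # placeholder for a learned entity factor

sim = cosine_similarity(A)     # entity-entity similarity matrix
labels = KMeans(n_clusters=5, n_init=10).fit_predict(A)   # off-the-shelf clustering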
Interlude - Kronecker Product
The Kronecker product $\otimes$ is a generalization of the outer product to matrices. Let $A \in \mathbb{R}^{Q \times R}, B \in \mathbb{R}^{S \times T}$. The Kronecker product is defined as
\[
A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1R}B \\ \vdots & \ddots & \vdots \\ a_{Q1}B & \cdots & a_{QR}B \end{bmatrix} \in \mathbb{R}^{QS \times RT}
\]
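In numpy, the Kronecker product is built in; a quick shape check:

import numpy as np

A = np.random.rand(2, 3)   # Q x R
B = np.random.rand(4, 5)   # S x T
K = np.kron(A, B)          # (Q*S) x (R*T)
assert K.shape == (8, 15)
assert np.isclose(K[0, 0], A[0, 0] * B[0, 0])   # upper-left block is a_11 * B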
The Tucker decomposition factorizes a tensor into a core tensor multiplied by a factor matrix in each mode, i.e. $\mathcal{T} \approx \mathcal{G} \times_1 A \times_2 B \times_3 C$
The core tensor $\mathcal{G} \in \mathbb{R}^{P \times Q \times R}$ can be regarded as a compressed version of $\mathcal{T}$
The factor matrices $A \in \mathbb{R}^{I \times P}$, $B \in \mathbb{R}^{J \times Q}$, $C \in \mathbb{R}^{K \times R}$ can be thought of as the principal components in the respective modes
Uniqueness
Only possible combination of rank-1 tensors that sums to $\mathcal{T}$
Except basic indeterminacies of scaling and permutation
Matrix decompositions generally are not unique, e.g.
\[M \approx AB = (AQ^{-1}) \, (QB)\]
Interesting property for exploratory data analysis
CP is unique under mild conditions
Uniqueness is not identical to a global optimum!
Properties of Tucker
Tucker is not unique
\begin{align}
\mathcal{G} \times_1 A \times_2 B \times _3 C = & \hat{\mathcal{G}}\times_1 AQ^{-1} \times_2 BR^{-1} \times_3 CS^{-1} \\
\mbox{where } &
\hat{\mathcal{G}} = (\mathcal{G} \times_1 Q \times_2 R \times_3 S)
\end{align}
Tucker subsumes CP, i.e. CP is a Tucker model with the additional constraints that the core tensor $\mathcal{G}$ is superdiagonal and the factor matrices have equal rank, i.e. $P = Q = R$
Different variants of Tucker exist, based on the number of modes that are factorized, i.e. Tucker-3, Tucker-2, Tucker-1
Tucker does not enforce unique entity constraints
Similarity of Entities in RESCAL
General approach: similarity of entities should be based on the similarity of their relational patterns
Outgoing links
\begin{align*}
X_{i,:,:} = & \color{orangered}{a_i} R_{(1)} (I \otimes A^T) \\
X_{j,:,:} = & \color{greenyellow}{a_j} R_{(1)} (I \otimes A^T)
\end{align*}
Incoming links
\begin{align*}
X_{:,i,:} = & \color{orangered}{a_i} R_{(2)} (I \otimes A^T) \\
X_{:,j,:} = & \color{greenyellow}{a_j} R_{(2)} (I \otimes A^T)
\end{align*}
For both incoming and outgoing links, the similarity of entities $e_i$ and $e_j$ is uniquely determined by the similarity of their latent-component representations $a_i$ and $a_j$
Due to the collective learning effect of RESCAL, the latent-variable representation also contains information about distant relations
Cora Publication Data
Cora is a hand-labeled dataset consisting of machine learning publications, their authors, venues, and paper titles
Prior to entity resolution, the data contains ~90 authors, ~210 venues, ~400 publications
True number of entities: 50 authors, ~100 venues and ~130 publications
In total, ~61,000 matching decisions
Instance matching: Ranking of entities by their similarity in the entity-latent-component space
Cora Publication Data
(Figure: AUC-PR results for entity resolution of authors, citations, and venues)
Due to noisy citations, a collective learning effect can significantly improve entity resolution results for authors and venues
Learning Taxonomies
Taxonomies are an integral part of ontologies
Taxonomies consist of
a list of concepts
a concept hierarchy
an assignment of entities to concepts
Approach: Learn taxonomies completely unsupervised by hierarchical clustering of entities
Since similarity of entities is completely contained within the matrix $A$, any feature based hierarchical clustering method can be used
IIMB 2010 Benchmark Data
Instance matching benchmark from movie domain, consisting of $\approx$ 1400 entities, organized in 80 concepts
Top Level Concepts: Budget, Creature, Film, Language, Location
The concepts Creature, Film and Location are again subdivided into concepts like Person, Character, Country or Action Movie
Task: Unsupervised taxonomy learning via hierarchical clustering of entities
Method: Hierarchical clustering on $A$ using the OPTICS method
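A sketch of this step with scikit-learn's OPTICS implementation; the matrix $A$ below is a random placeholder for the learned factor:

import numpy as np
from sklearn.cluster import OPTICS

A = np.random.rand(1400, 50)          # placeholder latent representations

clustering = OPTICS(min_samples=5).fit(A)
labels = clustering.labels_           # -1 marks noise points
# clustering.reachability_ encodes the cluster hierarchy (reachability plot)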
Results (score per level and concept):
Level 1 (0.982): Budget 1.0, Creature 1.0, Film 1.0, Language 0.963, Location 0.953
Level 2 (0.806): Character 1.0, City 0.992, Country 0.632, Person 0.991, Parody 0.33, Zombie 1.0
Level 3 (0.946): Actor 0.987, Capital 0.992, Character Creator 0.533, Director 0.781
Scalability
or "Tensors? I thought these things don't scale?"
The Size of Graphs
Multigraphs / Linked Data can be of large scale, consisting of millions of entities, thousands of predicates, billions of known facts
Number of random variables in a multigraph:
$|V| \times |V| \times |L|$ !
Real graphs are sparse, i.e. $|E| = O(|V|)$ instead of $O(|V|^2)$
Same property holds for most multigraphs and Linked Data, e.g.:
DBpedia: ~3.75 million entities, 800 object properties, 1.89 billion triples
YAGO 2: ~3 million entities, ~40 object properties, 124 million triples
By using sparse linear algebra, Alternating Least Squares (ALS) algorithms can easily exploit the sparsity of graphs
Alternating Least Squares
Typically, factorization methods are computed by optimizing the squared Frobenius norm, e.g. $\|Y - AB\|^2 = \sum_{ij}(Y_{ij} - a_i^T b_j)^2$
Alternating least squares is an alternating optimization scheme (non-linear block Gauss-Seidel method) to solve such least-squares problems in multiple variables
Basic procedure: Alternatingly keep all but one variable fixed and solve only for that variable
"Alternating Least Squares" because the algorithm solves a linear least squares subproblem for each factor matrix
Scales well for sparse matrices and tensors, easy to implement
Can show slow convergence (swamps); when using the normal equations (next slide), ill-conditioned matrices can occur $\rightarrow$ regularization
Equivalent formulations exist in unfolded form, by using the Khatri-Rao product $\odot$ and forming matrices $A$, $B$, $C$ from the vectors $a_r$, $b_r$, $c_r$, where e.g.
\[
A = \begin{bmatrix}a_1 & a_2 & \dots & a_R\end{bmatrix}
\]
\begin{align*}
T_{(1)} & = {\color{greenyellow}A} (C \odot B)^T & \Rightarrow \min_A \|T_{(1)} - A(C \odot B)^T\|^2 \\
T_{(2)} & = {\color{greenyellow}B} (C \odot A)^T & \Rightarrow \min_B \|T_{(2)} - B(C \odot A)^T\|^2 \\
T_{(3)} & = {\color{greenyellow}C} (B \odot A)^T & \Rightarrow \min_C \|T_{(3)} - C(B \odot A)^T\|^2
\end{align*}
CP-ALS algorithm: solve alternatingly for $A$, $B$, $C$
Normal Equations and Sparsity
Let $\min_W \|Y - WX\|^2$ be a linear least squares problem, where $Y \in \mathbb{R}^{P \times Q}$, $W \in \mathbb{R}^{P \times rank}$, $X \in \mathbb{R}^{rank \times Q}$
$W$ can be computed by using the method of normal equations
\[ W \leftarrow Y X^T (XX^T)^{-1} \]
$P$ and $Q$ can be large, e.g. for a Mode-1 unfolding of an adjacency tensor: $P = |V|$, $Q = |V| \times |L|$
Normal equations scale well for sparse tensor factorizations, as
$XX^T \in \mathbb{R}^{rank \times rank} \Rightarrow$ Inversion is cheap
$Y$ is large but sparse $\Rightarrow$ matrix multiplication $YX^T$ can be computed in only $O(nnz(Y) \times rank)$
$X$ is large in one dimension ($Q$) and dense but structured $\Rightarrow$ allows for very efficient optimizations to compute $XX^T$ for CP, Tucker and RESCAL
Normal Equations and CP
CP Model: $\mathcal{T}_{(1)} = A(C \odot B)^T$
For instance, the update of $A$ is computed by
\[
A \leftarrow
\underbrace{\mathcal{T}_{(1)}}_{=Y}\,
\underbrace{(C \odot B)}_{=X^T}\,
\underbrace{(C^TC * B^TB)^{-1}}_{=(XX^T)^{-1}}
\]
where $C \odot B \in \mathbb{R}^{|V||L| \times rank}$ and $C^TC * B^TB \in \mathbb{R}^{rank \times rank}$
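Putting the pieces together, a minimal (dense, unregularized) sketch of one CP-ALS sweep; unfold uses the same convention as in the interlude above, and pinv stands in for the regularized inverse one would use in practice:

import numpy as np

def unfold(T, n):
    return np.reshape(np.moveaxis(T, n, 0), (T.shape[n], -1), order='F')

def khatri_rao(C, B):
    # column-wise Kronecker product, shape (K*J) x rank
    return np.einsum('kr,jr->kjr', C, B).reshape(-1, C.shape[1])

def cp_als_sweep(T, A, B, C):
    A = unfold(T, 0) @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
    B = unfold(T, 1) @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
    C = unfold(T, 2) @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
    return A, B, C

# for sparse T, the product unfold(T, n) @ khatri_rao(...) is the expensive step
# and costs only O(nnz(T) * rank) with sparse matrix formats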
Similar, and even slightly more scalable, update rules exist for RESCAL and Tucker decompositions
They do scale!
(given suitable data & the right algorithm)
RESCAL-ALS scales linearly with the number of entities, predicates and known facts, superlinearly with the number of latent components
Scalability experiments using RESCAL on synthetic data
Base tensor $\mathcal{T} \in \mathbb{R}^{|V| \times |V| \times |L|}$, with $|V| = 1000$, $|L| = 10$
In each experiment, vary one of the parameters $|V|$, $|L|$, $nnz(\mathcal{T})$ and the number of latent components
Implications of Using ALS
The objective function of linear least squares is of the form
\[ \|Y - WX\|^2 \]
which has various implications
Normality Assumption: The objective is only valid under the assumption that errors on random variables are normally distributed. For relational data a better error model is binomial
Open-World Assumption: By default, the objective treats unobserved triples as negative examples. Due to the open-world assumption of LD, this is not entirely justified
Both issues can be approached by using different objective functions, e.g. a pairwise preference-ranking objective or a binomial error term. However, there is usually a trade-off between better modeling and scalability.
Does not account for some properties of real-world graphs, such as scale-free degree distributions etc.
Tensors & Attributes of Entities
Literally
Tensors + Attributes of Entities
Much of the information in Linked Data is in form of literals
In a machine learning setting, literals correspond to attributes
Naive handling:
Discretization of continuous or textual attributes
Use results of discretization step as entities and fill tensor accordingly
(Bob, inAgeGroup, old)
(Albert_Einstein, hasWord, Albert)
Handling of Attributes
Naive handling has serious scalability issues
Include attributes as an additional constraint on the entity representation
\[
D \approx AV
\]
Joint or "coupled" tensor factorization
For instance, with RESCAL
\[
\min_{A,R,V} \sum_k \|X_k - A R_k A^T \|^2 + \|D - AV\|^2
\]
Similar methods exist for CP and Tucker
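As a sketch, the update for $V$ given $A$ is an ordinary least-squares problem; the update of $A$ itself would additionally involve the relational term:

import numpy as np

def update_V(A, D):
    # solves min_V ||D - A V||^2 via the normal equations
    return np.linalg.solve(A.T @ A, A.T @ D)

A = np.random.rand(50, 10)   # placeholder entity factor
D = np.random.rand(50, 30)   # entity-attribute matrix, e.g. bag-of-words counts
V = update_V(A, D)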
Graphs, Time & Tensors
Time-dependent Analysis of Graphs
Common ways to model time and graphs
Time-slices, i.e. assign points in time to bins (days, months etc.)
Sequential modeling, such as user - item - last item
Common assumptions made
Smoothness: Data changes smoothly over time
Markov property: Data at time point $t_i$ depends only on $t_{i-1}$
Here we will only consider the time-slices approach
DEDICOM
Basic Idea: Factorization of relational data into unique entity and predicate representations
\begin{align}
\mathcal{T}_k & = A D_k R D_k A^T \\
\mathcal{T}_{ijk} & = a_i^T D_k R D_k a_j
\end{align}
DEDICOM has one global interaction model $R$
Global patterns vary linearly over the frontal slices of $\mathcal{T}$
Well suited to model time-varying graphs
Only defined for 2nd- and 3rd-order tensors
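For illustration, reconstructing the $k$-th time slice from DEDICOM factors is a chain of small matrix products (random factors below):

import numpy as np

def dedicom_slice(A, R, d_k):
    # T_k = A D_k R D_k A^T with D_k = diag(d_k)
    D_k = np.diag(d_k)
    return A @ D_k @ R @ D_k @ A.T

A = np.random.rand(5, 3)     # entity factor
R = np.random.rand(3, 3)     # global interaction model
d = np.random.rand(4, 3)     # per-slice scalings, one row per time slice
T_hat = np.stack([dedicom_slice(A, R, d[k]) for k in range(4)], axis=2)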
Enron Email Communication Network
Email data from ~150 users, mostly Enron senior management
Collected and made public by FERC during its investigation
Original data: ~500,000 messages, containing
To: / From:
Date
Subject / Body
In the following, a smaller part of the data is used, consisting of
184 users and 34,427 messages
Modeling of data via 44 time-slices
Enron Network Analysis
$A$ assigns persons to roles, e.g. trader, executive, etc.
$R$ models global interactions of these roles
(Figure: assignment quality of $A$ and the global communication patterns in $R$)
Enron Network Analysis
$D$ expresses the variation of patterns over time
Sequence Models
Modeling with time-slices does not take temporal order into account (Predicate fibers are exchangeable)
Sequence-oriented modeling: entity $\times$ entity $\times$ last entity
For instance: user $\times$ item $\times$ last item in recommendation engines
Ternary Relation $\rightarrow$ Hypergraph!
Concluding Remarks
Tensors offer a flexible and scalable approach to machine learning on multi-relational data
Tensors offer a natural way to represent multigraphs via adjacency tensors, without losing information as in the matrix case
Tensor factorizations like CP can be used for exploratory analysis of multigraphs, e.g. to rank nodes by their importance
Rank-reduced factorization of all three modes of a tensor offers a way to do joint disambiguation of entities and predicates
CP provides interpretable results via its uniqueness property
Concluding Remarks
Tensor factorization methods have applications to many important problems in the Linked Data context such as link prediction, instance matching or information retrieval
RESCAL shows state-of-the-art results in link prediction and entity resolution
The embedding $A$ of RESCAL enables standard machine learning algorithms to be applied to relational data
Due to the sparsity of adjacency tensors, factorizations can scale up to complete knowledge bases
Time-evolving graphs and networks can be analysed by inspection of the factor matrices of appropriate decompositions like DEDICOM
Code & Tools
If You Want More...
Python and Matlab code for RESCAL available from my website and Github
from rescal import rescal
A, R = rescal(X, 20, init='nvecs')
Easy to use tools to convert from N-Triples or other formats to Python and Matlab tensors available from my website and Github