\( \newcommand{\R}{\mathbb{R}} \newcommand{\ten}[1]{\mathbf{#1}} \newcommand{\tenp}{\otimes} \newcommand{\bigtenp}{\bigotimes} \newcommand{\Set}[1]{\mathcal{#1}} \newcommand{\re}[3]{\vec{a}^T_{#1}R^{\vphantom{T}}_{#3}\vec{a}^{\vphantom{T}}_{#2}} \newcommand{\ttm}{\times} \newcommand{\vect}{\mathbf{vec}} \newcommand{\pred}[1]{\widehat{#1}} \newcommand{\BigO}{\mathcal{O}} \newcommand{\hadamard}{*} \newcommand{\discrep}{\mathcal{D}} \newcommand{\unfold}[2]{#1_{(#2)}} \newcommand{\from}{\sim} \)

Tensor Factorization for
Multi-Relational Learning

ECML/PKDD Nectar Track
25 September 2013, Prague
  • Hello everyone! First of all, it is great to be here
  • My name is Max Nickel, and I am just starting my postdoc in Tomaso Poggio's group at MIT. However, what I want to talk about today is work I did for my PhD, which I just finished at LMU under the supervision of Volker Tresp, who is at Siemens of course.
  • Unfortunately, I've caught a really bad cold, so I hope the talk doesn't get too difficult to follow; if you have questions, please ask.
  • Anyway, the title of the talk is ... and it covers work over the last two years, summarizing papers from ICML, NIPS and WWW

Relational Learning

  • Focus on dyadic multi-relational data, i.e. (subject, predicate, object) triples
  • Important relational learning tasks include
    • Link Prediction: Prediction of relationships between entities
    • Collective Classification: Classification of entities based on their relationships
    • Link-Based Clustering: Clustering of entities based on their relationships
    • Entity Resolution: Deduplication of entities
Relationships are a rich source of information: they allow us to include information that is distant in the relational graph, which can significantly improve learning results.
Example:
Party membership of US presidents
Relations:
  • presidentOf
  • partyOf
  • Relational learning is a branch of machine learning concerned with learning from data where information is represented in the form of relationships between entities
  • Now, a very important property of relational modeling is that relations allow us to access information that is more distant in the graph
  • And this is a large reason why relations are such a rich source of information
  • For instance, consider that we want to predict the party membership of Al Gore: his attributes alone tell us little, but his relations (e.g. being vice president under Bill Clinton, a Democrat) are very informative
  • This ability of an algorithm to exploit information from distant parts of the relational graph is what we refer to as collective learning
  • So, this collective learning effect is something that we really want to have, for example for link prediction, but also for other learning tasks where we think the relations hold important information, like clustering, entity resolution and so on

Why Tensor Factorization?

  • Problem: Existing relational learning methods
    • do not scale (easily) to large data
    • often require extensive prior information about the problem
    • can be complex and difficult to implement
  • Tensor factorization in general:
    • Modeling simplicity: Multi-relational data has an efficient tensor representation
    • No structure learning: Tensor factorization does not require structure learning or extensive prior knowledge
    • Weak collective learning: Most tensor factorizations show no or only a weak collective learning effect, or impose unreasonable constraints
  • Our approach:
    • Strong collective learning via unique latent representation of entities
    • $\Rightarrow$ State-of-the-art results on benchmark data
    • High scalability as algorithm scales linearly with the data size

Tensor Representation
of Relational Data

  • RDF, or any other dyadic relational data, has a natural and efficient representation as a sparse third-order tensor
  • Two modes refer to the entities, one mode to the relation types
    \begin{align} \Rightarrow & |Mode\ 1| = |Mode\ 2| = |Entities| = n\\ & |Mode\ 3| = |Predicates| = m \end{align}
  • We also require that the ordering of the entities on both modes is identical
$\color{cornflowerblue}{x_{ijk}}=\cases{1, & if triple ($\color{yellowgreen}{i}$-th entity, $\color{gold}{k}$-th predicate, $\color{orangered}{j}$-th entity) exists\cr 0, & otherwise\cr}$
In the following, we will refer to this construction as the adjacency tensor
  • Another way to look at this construction is that each frontal slice of the tensor corresponds exactly to the adjacency matrix of the predicate at that index; we simply take all of these adjacency matrices and stack them behind each other (a toy construction in code is sketched below)
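As a minimal illustration (not the converter tooling mentioned later; the triples and variable names are hypothetical), the adjacency tensor can be stored as one sparse frontal slice per predicate:

      from scipy.sparse import lil_matrix

      # Hypothetical toy triples (subject, predicate, object)
      triples = [("Al Gore", "vicePresidentOf", "Bill Clinton"),
                 ("Bill Clinton", "partyOf", "Democratic Party")]

      # Identical entity ordering on modes 1 and 2; predicates index mode 3
      entities = sorted({t for s, _, o in triples for t in (s, o)})
      predicates = sorted({p for _, p, _ in triples})
      e_idx = {e: i for i, e in enumerate(entities)}
      p_idx = {p: k for k, p in enumerate(predicates)}

      # One sparse n x n adjacency matrix (frontal slice X_k) per predicate
      n = len(entities)
      X = [lil_matrix((n, n)) for _ in predicates]
      for s, p, o in triples:
          X[p_idx[p]][e_idx[s], e_idx[o]] = 1  # x_ijk = 1 iff the triple exists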

RESCAL

Main Idea: RESCAL is a factorization of multi-relational data, or multigraphs, into unique entity and predicate representations

\begin{align} X_{k\phantom{ij}} \approx & A_{\vphantom{k}} R_k A^T \\ \ten{X}_{\phantom{ijk}} \approx & \ten{R} \ttm_1 A \ttm_2 A \\ x_{ijk} \approx & \sum\nolimits_{q,r} a_{iq} r_{qrk} a_{jr} \end{align}
  • $\color{orangered}{A} \in \R^{n \times r}$ represents the entity-latent-component space
  • $\color{orangered}{R_k} \in \R^{r \times r}$ is an asymmetric matrix that specifies the interaction of the latent components for the $k$-th predicate
  • $\color{orangered}{r}$ denotes the number of latent components of the factorization
  • $\color{orangered}{\approx}$ means best approximation under some loss function
  • To compute the factorization, we solve the optimization problem \[\min_{A,R_k} \sum_k \text{loss}(X_k; A_{\vphantom{k}} R_k A^T) + \lambda_A \|A\|^2 + \lambda_{\ten{R}} \|\ten{R}\|^2 \]
  • RESCAL factorizes each frontal slice $X_k$ of $\ten{X}$ into the matrix product $A_{\vphantom{k}}R_kA^T$; the reconstructed entries can be used directly to score candidate triples (see the sketch below)
  • Graphical view
  • Interpretation
  • Optimization problem - algorithm deferred until later
  • Unique representation enables strong collective learning effect
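As a minimal sketch (assuming the factor matrices $A$ and $R_k$ have already been computed; `score` is a hypothetical helper, not the library API), a candidate triple is scored by its reconstructed entry:

      import numpy as np

      def score(A, R_k, i, j):
          """Reconstructed entry x_ijk = a_i^T R_k a_j for the triple
          (i-th entity, k-th predicate, j-th entity); higher = more likely."""
          return A[i] @ R_k @ A[j]

      # Toy usage with random factors (n=4 entities, rank r=2)
      A, R_k = np.random.rand(4, 2), np.random.rand(2, 2)
      print(score(A, R_k, 0, 1))
      # Ranking all objects j for subject i=0 is one matrix-vector product
      print(A @ (R_k.T @ A[0]))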

Probabilistic Interpretation

  • MAP estimate of $A$ and $\ten{R}$ for joint distribution
    \[ P(\ten{X}|A,\ten{R}) = \prod_{i=1}^n \prod_{j=1}^n \prod_{k=1}^m P(x_{ijk}|\re{i}{j}{k}) \]
  • The reconstructed values $\re{i}{j}{k}$ are the natural parameters of an exponential-family distribution
  • Nature of the distribution is determined by choice of loss function
    • Least Squares Loss $\Rightarrow$ Gaussian distribution
    • Logistic Loss $\Rightarrow$ Bernoulli distribution
  • Because of colliders in the graphical model, $x_{ijk}$ depends indirectly on every other variable via its associated latent variables
  • When looking at it probabilistically, we can interpret each entry of the adjacency tensor as a random variable that represents the state of the relationship
  • Under an appropriate loss function, matrix and tensor factorization can be interpreted as maximum a posteriori estimation of the latent factors
  • Because of the collider structure, the model can capture global dependencies in the data, since the latent variables can possibly depend on any other variable

  • Latent variable structure decouples prediction and learning
    • $x_{ijk}$ is conditionally independent of all variables given $\re{i}{j}{k}$
    • Simultaneously, latent variables are not $d$-separated from any other variable
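For example, with the logistic loss the reconstructed value acts as the natural parameter of a Bernoulli variable; a small sketch (notation as above; a hypothetical helper, not the library API):

      import numpy as np

      def triple_probability(A, R_k, i, j):
          """P(x_ijk = 1) under the Bernoulli/logistic model: the logistic
          sigmoid applied to the natural parameter a_i^T R_k a_j."""
          eta = A[i] @ R_k @ A[j]
          return 1.0 / (1.0 + np.exp(-eta))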

Similarity of Entities in RESCAL

$X_k \approx A_{\vphantom{k}} R_k A^T$
  • $A$ can be regarded as an embedding of the entities into a vector space representation
  • Similarity of entities in the latent-space $A$ corresponds to similarity of entities in the relational domain
  • $\Rightarrow$ Any feature-based machine-learning method (e.g. for clustering, classification etc.) can instantly be applied to $A$
  • Due to the collective learning effect of RESCAL, the latent-variable representation also contains information about distant relations
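For instance (a sketch using scikit-learn as a stand-in; the random matrix only stands in for a computed factor matrix $A$):

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.neighbors import NearestNeighbors

      A = np.random.rand(100, 20)  # placeholder for the RESCAL factor matrix (n x r)

      # Link-based clustering: cluster entities by their latent representations
      clusters = KMeans(n_clusters=5).fit_predict(A)

      # Entity similarity / resolution: nearest neighbours in the latent space
      nn = NearestNeighbors(n_neighbors=10).fit(A)
      distances, similar_entities = nn.kneighbors(A[:1])  # neighbours of entity 0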

Collective Learning - U.S. Presidents

  • Data on party memberships of U.S. presidents extracted from DBpedia
  • Task: Predict party-membership for persons in dataset (link prediction)
  • Relations: partyOf, presidentOf, vicePresidentOf
  • Experimental setting: 10-fold cross-validation over all persons
  • Evaluation measure: Area under precision-recall curve (AUC-PR)

Link Prediction Benchmarks

  • RESCAL shows state-of-the-art results, compared to MLN, IRM, BCTF
  • RESCAL shows very fast training times, i.e. under one minute on these datasets
[Chart: AUC-PR on the Kinships, UMLS, and Nations benchmark datasets]

Learning Taxonomies on IIMB Benchmark

  • Taxonomies are an integral part of ontologies, consist of
    • a list of concepts
    • a concept hierarchy
    • an assignment of entities to concepts
  • Task: Unsupervised taxonomy learning via hierarchical clustering of entities
  • Method: Hierarchical clustering on $A$ using the OPTICS method (sketched below, after the results)
  • Data: Instance matching benchmark from movie domain, consisting of $\approx$ 1400 entities, organized in 80 concepts
    • Top Level Concepts: Film, Language, Budget, Creature, Location
    • The concepts Creature, Film and Location are again subdivided into concepts like Person, Character, Country or Action Movie
Results (per-level and per-concept scores):
Level 1 (0.982): Budget 1.0, Creature 1.0, Film 1.0, Language 0.963, Location 0.953
Level 2 (0.806): Character 1.0, City 0.992, Country 0.632, Person 0.991, Parody 0.33, Zombie 1.0
Level 3 (0.946): Actor 0.987, Capital 0.992, Character Creator 0.533, Director 0.781
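A rough sketch of this clustering step, using scikit-learn's OPTICS as a stand-in for the original implementation (parameters and the random embedding are illustrative only):

      import numpy as np
      from sklearn.cluster import OPTICS

      A = np.random.rand(1400, 50)  # placeholder for the entity embeddings (n x r)

      # Density-based hierarchical clustering of the latent entity representations;
      # the reachability plot induces the levels of the learned concept hierarchy
      optics = OPTICS(min_samples=10).fit(A)
      reachability = optics.reachability_[optics.ordering_]
      concept_labels = optics.labels_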

RESCAL-ALS Algorithm

RESCAL-ALS is an alternating least-squares algorithm to compute the RESCAL factorization, i.e. it optimizes

\[ \min_{A,\ten{R}}\sum_k \|{X}_k - AR_kA^T\|^2 + \lambda_A \|A\|^2 + \lambda_{\ten{R}} \|\ten{R}\|^2 \]
Updates for factor matrices $A$ and $R_k$: \begin{align*} A_{\phantom{k}} & \leftarrow \left( \sum_{k=1}^m X_k^{\vphantom{T}}AR_k^T + X_k^TAR_k^{\vphantom{T}} \right) \left( \sum_{k=1}^m B_k + C_k + \lambda I \right)^{-1} \\ R_k & \leftarrow \left( A^TA \otimes A^TA + \lambda I \right)^{-1} \vect(A^T X_k A) \end{align*} where \begin{align*} B_k = R_k^{\vphantom{T}}A^TAR_k^T,\quad C_k = R_k^TA^TAR_k^{\vphantom{T}} \end{align*}
  • Exploit sparsity of $X_k$
  • Inversions can be computed in $\color{gold}{\mathcal{O}(r^3)}$
  • Largest dense matrix in memory: $\color{gold}{n \times r}$
  • ALS is block coordinate descent (BCD), with the blocks naturally given by the factors
  • To compute the factorization we basically optimize a regularized least-squares problem
  • A nice thing about this approach is that we can provide an efficient alternating least-squares algorithm
  • The update steps of this ALS algorithm for the factor matrices $A$ and $R_k$ are given at the bottom
  • At first they might look expensive, but they can be computed efficiently; a dense reference sketch follows, and the scalable updates are detailed in the backup slides
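For reference, a dense, unoptimized sketch of one ALS sweep implementing exactly the updates above (the actual algorithm exploits sparsity and the Kronecker structure; `rescal_als_sweep` is a hypothetical name, not the library API):

      import numpy as np

      def rescal_als_sweep(X, A, R, lmbda):
          """One alternating least-squares sweep over A and all R_k.
          X: list of dense n x n frontal slices, A: n x r, R: list of r x r."""
          n, r = A.shape

          # A <- (sum_k X_k A R_k^T + X_k^T A R_k)(sum_k B_k + C_k + lambda*I)^{-1}
          AtA = A.T @ A
          numer = np.zeros((n, r))
          denom = lmbda * np.eye(r)
          for X_k, R_k in zip(X, R):
              numer += X_k @ A @ R_k.T + X_k.T @ A @ R_k
              denom += R_k @ AtA @ R_k.T + R_k.T @ AtA @ R_k  # B_k + C_k
          A = numer @ np.linalg.inv(denom)

          # vec(R_k) <- (A^T A kron A^T A + lambda*I)^{-1} vec(A^T X_k A)
          AtA = A.T @ A
          G = np.linalg.inv(np.kron(AtA, AtA) + lmbda * np.eye(r * r))
          R = [(G @ (A.T @ X_k @ A).reshape(-1, order="F")).reshape(r, r, order="F")
               for X_k in X]
          return A, R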

Scalability Experiments

[Plots: runtime in seconds vs. parameter value]

Computational Complexity of RESCAL-ALS for sparse adjacency tensors

  • Entities: linear
  • Predicates: linear
  • Known Facts: linear
  • Latent Components: cubic
  • It turns out, and we can actually show this, that the algorithm scales linearly with the data size, i.e. linearly with the number of entities, the number of predicates, and the number of known facts

Factorizing YAGO2 (core)

  • Scalability allowed us to factorize the YAGO 2 core ontology with approximately
    • 3 million entities
    • 38 relations
    • 41 million known facts
  • Runtime for a rank-20 factorization on commodity hardware ~3 hours
  • Successfully predicted typeOf relationships for various higher-level classes in the ontology

Concluding Remarks & Future Work

  • RESCAL offers a flexible and scalable approach to machine learning on multi-relational data at the scale of complete knowledge bases
  • RESCAL shows state-of-the-art results in link prediction and entity resolution due to a strong collective learning effect
  • Scalability and state-of-the-art relational learning combined within a single model
  • Future work with regard to tensor factorizations for structured and relational data
    • Scalable Logistic Model
    • Handling scale-free nature of many multigraphs $\rightarrow$ Regularization
    • Reduce cubic complexity for $r$ by including prior knowledge in the factorization
  • Factor matrices enable standard machine learning algorithms on relational data
  • Implications of using ALS (i.e. a least-squares loss):
  • The normal distribution is the default model in many cases, for good reason, and often does reasonably well
  • It is the limiting distribution of sums of random variables
  • $X^TX$ can be ill-conditioned, which can be overcome by using regularization methods
  • Possible justification: the Big Data assumption that more data beats better algorithms, so it might be beneficial to trade exact modeling for scalability; but this really depends on the problem

Code & Tools

  • Python and Matlab code for RESCAL available from my website and Github
    
      from rescal import rescal
      
      # compute rank-20 factorization of tensor X
      A, R = rescal(X, 20, init='nvecs')
  • Easy to use tools to convert from N-Triples to Python and Matlab tensors available from my website and Github

Thank You!

Questions?

References

[1]
Nickel, M. et al. 2011. A Three-Way Model for Collective Learning on Multi-Relational Data. Proceedings of the 28th International Conference on Machine Learning (Bellevue, WA, USA, 2011), 809–816.
[2]
Nickel, M. et al. 2012. Factorizing YAGO: scalable machine learning for linked data. Proceedings of the 21st international conference on World Wide Web (New York, NY, USA, 2012), 271–280.
[3]
Nickel, M. and Tresp, V. 2013. An Analysis of Tensor Models for Learning on Structured Data. Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2013 (2013), to appear.
[4]
Nickel, M. and Tresp, V. 2011. Learning Taxonomies from Multi-Relational Data via Hierarchical Link-Based Clustering. NIPS Workshop - Learning Semantics (Granada, Spain, 2011).
[5]
Nickel, M. and Tresp, V. 2013. Logistic Tensor-Factorization for Multi-Relational Data. ICML Workshop -  Structured Learning: Inferring Graphs from Structured and Unstructured Inputs (Atlanta, GA, USA, 2013).
[6]
Nickel, M. and Tresp, V. 2012. Machine Learning on Linked Data: Tensors and their Applications in Graph-Structured Domains.
[7]
Nickel, M. and Tresp, V. 2013. Tensor Factorization for Multi-Relational Learning. Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2013 (2013), to appear.
[8]
Nickel, M. and Tresp, V. 2010. Three-Way DEDICOM for Relational Learning. NIPS 2010 Workshop - Tensors, Kernels and Machine Learning (Whistler, Canada, 2010).
[9]
Tresp, V. and Nickel, M. 2013. Relational Models. Encyclopedia of Social Network Analysis and Mining. J. Rokne and R. Alhajj, eds. Springer. to appear.

References

[1]
Airoldi, E.M. et al. 2008. Mixed Membership Stochastic Blockmodels. Journal of Machine Learning Research. 9, (Sep. 2008), 1981–2014.
[2]
Alon, N. 1995. Handbook of combinatorics (vol. 2). R.L. Graham et al., eds. MIT Press. 1749–1783.
[3]
Getoor, L. and Taskar, B. eds. 2007. Introduction to statistical relational learning. MIT Press.
[4]
Kemp, C. et al. 2006. Learning systems of concepts with an infinite relational model. Proceedings of the 21st national conference on Artificial intelligence - Volume 1 (Boston, Massachusetts, 2006), 381–388.
[5]
Kok, S. and Domingos, P. 2007. Statistical predicate invention. Proceedings of the 24th international conference on Machine learning (New York, NY, USA, 2007), 433–440.
[6]
Kolda, T.G. et al. 2005. Higher-order web link analysis using multilinear algebra. Proceedings of the Fifth International Conference on Data Mining (2005), 242–249.
[7]
Koller, D. and Pfeffer, A. 1998. Probabilistic frame-based systems. Proceedings of the National Conference on Artificial Intelligence (1998), 580–587.
[8]
Richardson, M. and Domingos, P. 2006. Markov logic networks. Machine Learning. 62, 1 (2006), 107–136.
[9]
Singh, A.P. and Gordon, G.J. 2008. A Unified View of Matrix Factorization Models. Proceedings of the European conference on Machine Learning and Knowledge Discovery in Databases - Part II (Berlin, Heidelberg, 2008), 358–373.
[10]
Singla, P. and Domingos, P. 2006. Entity Resolution with Markov Logic. Proceedings of the Sixth International Conference on Data Mining (Washington, DC, USA, 2006), 572–582.
[11]
Srebro, N. et al. 2005. Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices. Advances in Neural Information Processing Systems (Cambridge, MA, 2005), 1321–1328.
[12]
Srebro, N. 2004. Learning with matrix factorizations. Massachusetts Institute of Technology.

References

[13]
Sutskever, I. et al. 2009. Modelling Relational Data using Bayesian Clustered Tensor Factorization. Advances in Neural Information Processing Systems 22 (2009), 1821–1828.
[14]
Warren, H.E. 1968. Lower Bounds for Approximation by Nonlinear Manifolds. Transactions of the American Mathematical Society. 133, 1 (Aug. 1968), 167–178.
[15]
Xu, Z. et al. 2006. Infinite Hidden Relational Models. Proceedings of the Twenty-Second Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-06) (Arlington, Virginia, 2006), 544–551.
[16]
Yılmaz, Y.K. et al. 2011. Generalised coupled tensor factorisation. Advances in Neural Information Processing Systems (2011).

Backup Slides

Linked Data & Semantic Web

Relational Models

  • Approach of most relational learning methods: Template mechanism to create a grounded graphical model
    • Probabilistic Relational Models:
      Class diagram $\mapsto$ grounded Bayesian network
    • Markov Logic Networks:
      Logical formulas $\mapsto$ grounded Markov random field
  • Alternative approach: latent variable methods such as IHRM, IRM or mixed-membership stochastic block models
Problems of existing methods
  • Require extensive prior knowledge or structure learning
  • Require expensive inference methods such as MCMC or variational inference
  • Relational learning is hard!
  • Typical independence assumptions of ML do not hold
  • Some method is needed to reduce computational complexity
  • However, these existing methods have serious drawbacks that hinder their application
  • They are also often difficult and complicated to implement
  • Collective Learning

    • Probabilistically, each possible edge in the multigraph can be interpreted as a binary random variable
    • "Traditional" Machine Learning: Observations are assumed to be independent
      • Joint distribution factorizes nicely
      • $P(x_{11},\ldots,x_{nm}) = \prod_{i=1}^n P(x_{i1},\ldots,x_{im})$
      • Binary random variables: $n2^{m}$ possible states
    • Relational Learning: Each random variable is possibly dependent on any other random variable
      • Without additional information, the joint distribution cannot be factorized
      • Previous example: $2^{n^2}$ possible states
      • For $n$ entities over $m$ dyadic relations: $2^{n^2m}$
    • Collective learning implies that every random variable can possibly be dependent on every other random variable in the data set
    • The problem is now that the usual i.i.d. assumption of most machine learning methods no longer holds
    • This assumption is made to simplify computations: with independent observations, the joint distribution factorizes nicely
    • For instance, consider an $n \times m$ matrix; usually we would assume that we have $n$ observations involving $m$ features. The features may depend on each other, but the observations, i.e. the rows of the matrix, are independent

    Relational Learning & Bipartite Models

    • Tensor factorizations usually model a bipartite multigraph, i.e. they assume that the entities on the 1st mode are not identical to the entities on the 2nd mode
    • For instance, Tucker-2: $X_k \approx A_{\vphantom{k}}R_kB^T$, similar for CP
    • Bipartite Models do not account for the identity of subjects and objects
    • For general relational data, such as Linked Data, this assumption is violated
    • Bipartite modeling prevents information propagation over latent variables, or at least makes it very difficult (indirect effects can still exist)
    • Now, the problem with nearly all tensor factorizations in this context is that, by applying e.g. the CP decomposition to an adjacency tensor, we implicitly assume that the graph is bipartite
    • This is the case because CP uses a different latent vector representation for each of the modes, i.e. one time the vectors $a$, the other time the vectors $b$
    • This makes sense for User-Movie-Time data
    • It is violated on adjacency tensors $\Rightarrow$ it breaks the information flow over the latent variables that we really want to have for collective learning
    • Let's say the observed triples are (Al, vicePresidentOf, Bill) and (Bill, partyOf, Democratic Party)
    • During learning, the latent factors reflect that Bill is a member of the Democratic Party by some suitable representation
    • However, when we model the graph as bipartite, occurrences of an entity as a subject and as an object get different latent representations
    • Therefore, we have no guarantee that the latent representation of Bill as an object also reflects the fact that he is a member of the Democratic Party (there might be some distant effects, but they are quite unlikely, and this was only one hop!)
    • So, the information flow through the latent factors breaks here, or at least is limited
    • What we would want instead, of course, is that every entity has a unique representation whether it occurs as a subject or an object, because then the fact that Bill is a Democrat is more or less directly connected

    Similarity of Entities in RESCAL

    General approach: similarity of entities $\approx$ similarity of their relational patterns

    Outgoing links

    \begin{align*} \vect(X_{i,:,:}) = & \color{deeppink}{\vec{a}_i} R_{(1)} (I \otimes A^T) \\ \vect(X_{j,:,:}) = & \color{cornflowerblue}{\vec{a}_j} R_{(1)} (I \otimes A^T) \end{align*}

    Incoming links

    \begin{align*} \vect(X_{:,i,:}) = & \color{deeppink}{\vec{a}_i} R_{(2)} (I \otimes A^T) \\ \vect(X_{:,j,:}) = & \color{cornflowerblue}{\vec{a}_j} R_{(2)} (I \otimes A^T) \end{align*}
    ${X}_k \approx A_{\vphantom{k}} R_k A^T$
    • For both incoming and outgoing links, the similarity of entities $e_i$ and $e_j$ is uniquely determined by the similarity of $\vec{a}_i$ and $\vec{a}_j$
    • Due to the collective learning effect of RESCAL, the latent-variable representation also contains information about distant relations
    • $\Rightarrow$ Any feature-based machine-learning method (e.g. for clustering, classification etc.) can instantly be applied to $A$

    Entity Resolution - Cora

    • Cora is a hand-labeled dataset consisting of machine learning publications, authors, venues and paper titles
      • Authors: Noisy approx. 90 $\mapsto$ True approx. 50
      • Venues: Noisy approx. 210 $\mapsto$ True approx. 100
      • Citations: Noisy approx. 400 $\mapsto$ True approx. 130
    AUC-PR results:
    Method      Author   Citation   Venue
    NB          0.986    0.913      0.738
    MLN(B)      0.987    0.915      0.736
    MLN(BCTS)   0.992    0.988      0.807
    CP          0.984    0.991      0.746
    RESCAL      0.997    0.991      0.810
    • Due to noisy citations, collective learning can significantly improve entity resolution results for authors and venues

    Data

    • Instance matching: Ranking of entities by their similarity in the entity-latent-component space
    • In total 61,000 matching decisions

    Visualization

    • Node size corresponds to the degree of the node
    • Red nodes: authors, e.g. the large node in the center is Michael Kearns
    • Green nodes: venues, e.g. the large node at the top right is the "Machine Learning" journal
    • Blue nodes: paper titles, e.g. the blue node above the ML journal is "The Strength of Weak Learnability" by Robert E. Schapire, 1990

    Scalable Updates

    • Exploit sparsity of relational data in computations such as $X_k^{\vphantom{T}}AR_k^T$
    • Complexity of straightforward implementation of updates for $R_k$: \begin{align*} \text{Computational: } \color{orangered}{\BigO(r^5)} && \text{Memory: } \color{orangered}{\BigO(r^4)} \end{align*}
    • Fortunately, $(A^TA \otimes A^TA + \lambda I )^{-1}\vect(A^T{X}_kA)$ has a lot of structure that we can exploit
    • By using properties of the Kronecker product and the vectorization operator we can show that updates for $R_k$ can be computed via

      \[ R_k \gets V (P \hadamard U^T X_k U) V^T \]

      where $A = U \Sigma V^T$ and $P_{ij} = \frac{\Sigma_{ii} \Sigma_{jj}}{\Sigma_{ii}^2\Sigma_{jj}^2 + \lambda}$

    • Complexity of the improved updates for $R_k$ (sketched below): \begin{align*} \text{Computational: } \color{cornflowerblue}{\BigO(r^3)} && \text{Memory: } \color{cornflowerblue}{\BigO(r^2)} \end{align*}
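    In code, the efficient update could look roughly as follows (a sketch under the slide's notation; `update_R_k` is a hypothetical helper, not the library API):

        import numpy as np

        def update_R_k(X_k, A, lmbda):
            """R_k <- V (P * U^T X_k U) V^T with A = U diag(s) V^T and
            P_ij = s_i s_j / (s_i^2 s_j^2 + lambda)."""
            U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD of A
            S = np.outer(s, s)
            P = S / (S ** 2 + lmbda)
            return Vt.T @ (P * (U.T @ X_k @ U)) @ Vt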

    Efficient Handling of Attributes

    • Much of the information in Linked Data is in the form of attributes (literals)
    • New approach: Include attributes via coupled tensor-matrix factorization
      \begin{align*} \ten{X} \approx \ten{R} \ttm_1 A \ttm_2 A \quad \land \quad D \approx AV \end{align*}
    • Entity-latent-component-space $A$ is shared between factorization of $\ten{X}$ and $D$
    • Updated optimization objective for RESCAL: \[ \min_{A,\ten{R},V} \sum_k \|X_k - A R_k A^T \|^2 + \|D - AV\|^2 + \lambda_A \|A\|^2 + \ldots \]
    • Computational complexity of coupled updates: $\BigO(mn + \ell)$
    • For example, with 40 predicates, the coupled algorithm is 40 times faster
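    For concreteness, the coupled objective can be written down as follows (a sketch; the attribute factor is called V here, matching $D \approx AV$ above, and a single regularization weight is used for brevity):

        import numpy as np

        def coupled_objective(X, D, A, R, V, lmbda):
            """Relational reconstruction error + attribute reconstruction error
            + regularization, as in the updated RESCAL objective above."""
            f = sum(np.linalg.norm(X_k - A @ R_k @ A.T) ** 2 for X_k, R_k in zip(X, R))
            f += np.linalg.norm(D - A @ V) ** 2
            f += lmbda * (np.linalg.norm(A) ** 2
                          + sum(np.linalg.norm(R_k) ** 2 for R_k in R)
                          + np.linalg.norm(V) ** 2)
            return f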