Introduction

Word meaning is embedded in context. This simple assumption is rooted in contemporary philosophy of language as well as cognitive linguistics, and it has recently led to very successful models of distributed word representations, most notably word embeddings trained with deep feedforward neural networks.

Word embeddings are trained on large unannotated corpora. Each update is conditioned on the immediate context of the current word, most often the history of preceding words within a limited window. The core idea is that a sequence of words contains enough information to predict the next word, or, in other words, that the meaning of a word can be defined by the words that typically precede (or surround) it in a corpus.
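
As a toy illustration of this training signal, the following sketch (in PyTorch; vocabulary size, window size, and all names are placeholders rather than any of the models discussed below) predicts the next word from a fixed window of preceding words, so that the embedding table is learned as a by-product:

import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, WINDOW = 10_000, 100, 4            # placeholder sizes

class WindowModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(VOCAB_SIZE, EMB_DIM)     # the word embeddings being learned
        self.predict = nn.Linear(WINDOW * EMB_DIM, VOCAB_SIZE)  # scores for the next word

    def forward(self, history):                          # history: (batch, WINDOW) word ids
        context = self.embeddings(history).flatten(1)    # concatenate the window's embeddings
        return self.predict(context)

model = WindowModel()
loss_fn = nn.CrossEntropyLoss()                          # softmax + negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

history = torch.randint(0, VOCAB_SIZE, (32, WINDOW))     # stand-in for real text windows
next_word = torch.randint(0, VOCAB_SIZE, (32,))
loss = loss_fn(model(history), next_word)
loss.backward()
optimizer.step()                                         # one update, conditioned only on the window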

Despite its success, this approach has several conceptual shortcomings that suggest poor performance on the linguistic tasks they affect:

Humans exploit much longer contexts when assessing the meaning of a word [1]. This insight gave rise to recurrent neural networks for language modelling that are able to dynamically keep track of the context history [2]. However, implementation issues, such as the vanishing/exploding gradient problem, pose very tight restrictions on the length of the usable history [3].

Optimization techniques such as Hessian-free optimization [4] are not well suited to language modelling tasks because of their much higher computational cost [5].

Another solution to this problem is to compress the history into an abstract representation that is fed into the current time step via a separate context layer [6]. This, however, blurs the impact of specific words from the history, which defeats the goal of pinpointing long-term dependencies, especially across sentence boundaries.

Outline

We claim that common models of word embeddings are still relatively local in nature and do not make use of important information given in the discourse or even in the same sentence.

Long-term dependencies between words contain valuable information about the meaning of a word. They should prove useful not only to enrich the representation of a word, but also to model representations of context that can be used for several context-dependent NLP tasks, such as word sense disambiguation (WSD).

To address the shortcomings mentioned above, we propose the investigation of Long Short-Term Memory (LSTM) networks, first developed by Hochreiter and Schmidhuber [7]. LSTM networks make use of dedicated node types that learn to control the impact and memorization of specific information from the past. They have been applied successfully to a variety of NLP tasks, such as language modeling [5], handwriting recognition [8], named entity recognition [9] and speech recognition [10].
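
For reference, the gating mechanism can be summarized by the following cell equations; this is a common modern formulation that includes the forget gate (added to the architecture after [7]), with σ the logistic sigmoid and ⊙ element-wise multiplication:

\[
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state, the CEC)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(cell output)}
\end{aligned}
\]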

Our goal is to create task-independent distributed representations of contexts from running text (context embeddings, CE) and to enhance existing models of distributed word representations with long-term dependency information.

This idea is biologically motivated by conceptual short-term memory, which describes how pieces of input information cycle inside the memory system and inform the interpretation of future input [11].

Related work

LSTM for language modeling

Sundermeyer, Schlüter, and Ney [5] train an LSTM neural network for language modeling. It predicts the probability distribution of the next word given the history via a softmax output layer. The training criterion is the cross-entropy error (maximum likelihood).

Note that the projection layer is not described as a representation layer and is not explicitly kept after training as a source of usable word representations.

The projection layer does not seem to be a distributed representation drawn from a lookup table [???]. Its activations are simply computed by applying the identity function to the weighted linear combination of the current one-hot input vector. The weight matrix between the input and the projection layer is tied across all history words.
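
For concreteness, a minimal sketch of such an architecture (written in PyTorch; layer sizes are placeholders, and details necessarily differ from the original system in [5]): a projection layer whose weight matrix is shared across all positions, an LSTM layer, and a softmax output trained with the cross-entropy criterion.

import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=10_000, proj_dim=100, hidden_dim=200):
        super().__init__()
        # one projection matrix, applied identically to every history word
        self.projection = nn.Embedding(vocab_size, proj_dim)
        self.lstm = nn.LSTM(proj_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)  # softmax layer (softmax applied inside the loss)

    def forward(self, word_ids, state=None):             # word_ids: (batch, seq_len)
        projected = self.projection(word_ids)
        hidden, state = self.lstm(projected, state)
        return self.output(hidden), state                # next-word scores at every position

model = LSTMLanguageModel()
criterion = nn.CrossEntropyLoss()                        # cross-entropy training criterion
tokens = torch.randint(0, 10_000, (8, 21))               # stand-in for a batch of text
scores, _ = model(tokens[:, :-1])                        # predict word i+1 from words up to i
loss = criterion(scores.reshape(-1, 10_000), tokens[:, 1:].reshape(-1))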

Multitask learning

Collobert and Weston [12] train one deep network on several tasks at once: language modelling, POS tagging, chunking, named entity recognition, and semantic role labelling. The idea is that knowledge learned for one task helps in learning another.

All tasks except the language model are trained on labelled text.

The first hidden layer consists of lookup tables (one for each task) that learn relevant features for each word. Features are represented as vectors of different sizes over fixed dictionaries; they deliver the supervised part of the training.

Their concatenation defines the word's position in the feature space.

Some lookup tables are shared across tasks; others, as well as the subsequent layers, are task-specific.
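
A minimal sketch of this first layer (PyTorch; the feature names, dictionary sizes, and vector sizes below are invented for illustration, not the original configuration):

import torch
import torch.nn as nn

# One lookup table per discrete feature; the outputs are concatenated per word.
lookup_tables = nn.ModuleDict({
    "word":   nn.Embedding(30_000, 50),   # shared across tasks
    "caps":   nn.Embedding(4, 5),         # capitalisation feature
    "suffix": nn.Embedding(1_000, 10),    # e.g. a task-specific feature
})

def embed(feature_ids):
    # feature_ids: dict mapping feature name -> (batch, seq_len) index tensor
    parts = [lookup_tables[name](ids) for name, ids in feature_ids.items()]
    return torch.cat(parts, dim=-1)        # the concatenation locates the word in feature space

ids = {name: torch.randint(0, table.num_embeddings, (2, 7))
       for name, table in lookup_tables.items()}
vectors = embed(ids)                       # shape: (2, 7, 50 + 5 + 10)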

Multiple word prototypes

Huang et al. [13] learn multiple embeddings per word to capture homonymy and polysemy. For each word, the syntactic representation (local context) and the semantic representation (global context) are kept separate.

The local context model is trained via a contrastive estimation network on a large unlabeled corpus and preserves order information. The global context model is a network that takes the local context as input, plus a vector computed as the weighted average of all word vectors in the document (bag of words). Both models compute a score, and the two scores are summed into the final score of the overall network.
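
The scoring scheme can be sketched as follows (PyTorch; the single hidden layer and all dimensions are simplifications of the original architecture in [13]):

import torch
import torch.nn as nn

EMB, WINDOW = 50, 5                                        # placeholder dimensions
embeddings = nn.Embedding(10_000, EMB)
score_local = nn.Sequential(nn.Linear(WINDOW * EMB, 64), nn.Tanh(), nn.Linear(64, 1))
score_global = nn.Sequential(nn.Linear(WINDOW * EMB + EMB, 64), nn.Tanh(), nn.Linear(64, 1))

def total_score(window_ids, document_ids, weights):
    local_in = embeddings(window_ids).flatten(-2)          # ordered local context
    doc_vec = (embeddings(document_ids) * weights.unsqueeze(-1)).sum(0)  # weighted bag of words
    global_in = torch.cat([local_in, doc_vec], dim=-1)
    return score_local(local_in) + score_global(global_in) # scores summed into the final score

window = torch.randint(0, 10_000, (WINDOW,))
document = torch.randint(0, 10_000, (120,))
weights = torch.softmax(torch.rand(120), dim=0)            # e.g. an idf-style weighting
score = total_score(window, document, weights)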

Collobert et al. [14] later extend this multitask architecture into a unified system that covers the same tasks with almost no task-specific feature engineering.

Experimental setup

Definitions:

1. Unsupervised training of the LSTM network

Given:

Procedure:

  1. For each token t_i (i ≥ 0)

    1. Retrieve WE w_i for t_i from the lookup table
    2. Input w_i
    3. Predict the next word via softmax over a probability distribution layer P_i(w | h)
      • Alternatively: predict the embedding of the next word, WE_{i+1}, via a log-bilinear output
    4. Backpropagate the error (cross entropy)
    5. Update the network parameters (input weights of cells and gates, output weights)
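
A minimal sketch of this procedure (PyTorch; all sizes, names, and the toy token stream are placeholders, and the gradient is truncated at each step for brevity):

import torch
import torch.nn as nn

VOCAB, EMB, HIDDEN = 10_000, 100, 200                 # placeholder sizes
lookup = nn.Embedding(VOCAB, EMB)                     # WE lookup table
cell = nn.LSTMCell(EMB, HIDDEN)                       # LSTM cells and gates
to_vocab = nn.Linear(HIDDEN, VOCAB)                   # probability distribution layer (softmax in the loss)
params = list(lookup.parameters()) + list(cell.parameters()) + list(to_vocab.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, VOCAB, (50,))               # stand-in for the running text
h, c = torch.zeros(1, HIDDEN), torch.zeros(1, HIDDEN)

for i in range(len(tokens) - 1):
    w_i = lookup(tokens[i].view(1))                   # 1. retrieve WE w_i for t_i
    h, c = cell(w_i, (h, c))                          # 2. input w_i
    scores = to_vocab(h)                              # 3. predict the next word
    loss = loss_fn(scores, tokens[i + 1].view(1))     # 4. cross-entropy error
    optimizer.zero_grad()
    loss.backward()                                   #    backpropagate
    optimizer.step()                                  # 5. update the network parameters
    h, c = h.detach(), c.detach()                     # truncate the gradient here for brevity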

LSTM networks for language modelling

It is possible to exploit the topic chunks given in the training data. For example, the forget gate at the constant error carousel (CEC) can be reset to 1 ("do not forget") after each topic chunk.

Important keywords in the discourse are expected to have a larger impact on the update. The network is expected to learn, in the course of processing the discourse, when a token is worth remembering.
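
One possible way to realize such a reset, assuming a hand-written LSTM step rather than an off-the-shelf implementation, is to override the forget gate whenever a chunk-boundary flag is set:

import torch
import torch.nn as nn

class ResettableLSTMStep(nn.Module):
    # A hand-written LSTM step whose forget gate is forced to 1 ("do not
    # forget") whenever a topic-chunk boundary flag is set.
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.gates = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)

    def forward(self, x, h, c, at_chunk_boundary=False):
        z = self.gates(torch.cat([x, h], dim=-1))
        i, f, o, g = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        if at_chunk_boundary:
            f = torch.ones_like(f)                    # reset the forget gate at the CEC to 1
        c = f * c + i * torch.tanh(g)                 # constant error carousel
        h = o * torch.tanh(c)
        return h, c

step = ResettableLSTMStep(100, 200)
h, c = torch.zeros(1, 200), torch.zeros(1, 200)
x = torch.randn(1, 100)
h, c = step(x, h, c, at_chunk_boundary=True)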

2. Retrieval of training set for WSD

Given:

Procedure:

  1. For each token t_i with label l_i (i ≥ 0)

    1. Retrieve WE w_i for t_i from the lookup table
    2. Input w_i
    3. Output: CE_i
    4. Save the tuple (l_i, CE_i)
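
A minimal sketch of this retrieval step, reusing the illustrative lookup and cell modules from the training sketch above and taking the LSTM hidden state as CE_i:

import torch

def collect_context_embeddings(tokens, labels, lookup, cell, hidden_dim=200):
    # Runs the trained network over labelled text and stores the context
    # embedding CE_i (here: the LSTM hidden state) for every labelled token.
    # `lookup` and `cell` are assumed to be the trained modules from procedure 1.
    h, c = torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim)
    samples = []
    with torch.no_grad():                             # retrieval only, no training
        for t_i, l_i in zip(tokens, labels):
            w_i = lookup(t_i.view(1))                 # 1. retrieve WE w_i for t_i
            h, c = cell(w_i, (h, c))                  # 2. input w_i
            ce_i = h.squeeze(0)                       # 3. output: CE_i
            samples.append((l_i, ce_i))               # 4. save the tuple (l_i, CE_i)
    return samples

# e.g., with the modules from the training sketch above:
# pairs = collect_context_embeddings(tokens, ["sense_1"] * len(tokens), lookup, cell)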

3. Supervised training of binary classifier

Given:

Procedure:

  1. Split the set of tuples into a training set and a test set
  2. Train an SVM on the training set
  3. Compute precision and recall on the test set
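
A minimal sketch of this step with scikit-learn, using random placeholders for the collected tuples:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score

X = np.random.randn(500, 200)                         # context embeddings CE_i (placeholders)
y = np.random.randint(0, 2, size=500)                 # binary sense labels l_i (placeholders)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)   # 1. split into training and test set
classifier = SVC(kernel="linear").fit(X_train, y_train)                    # 2. train an SVM on the training set
predictions = classifier.predict(X_test)
print("precision:", precision_score(y_test, predictions))                  # 3. precision and recall on the test set
print("recall:", recall_score(y_test, predictions))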

Future experiments

We will train generic context embeddings and cluster them via k-means. Context embeddings of homonyms are expected to group according to their respective meanings in the context meaning space. Each cluster provides a representative centroid.
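
A minimal sketch of this clustering step with scikit-learn (the number of clusters and the placeholder embeddings are assumptions):

import numpy as np
from sklearn.cluster import KMeans

context_embeddings = np.random.randn(1000, 200)       # CEs of one homonym's occurrences (placeholders)
kmeans = KMeans(n_clusters=3, n_init=10).fit(context_embeddings)
sense_of_occurrence = kmeans.labels_                  # cluster index per occurrence
sense_centroids = kmeans.cluster_centers_             # one representative centroid per sense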

Further applications would include enhancing existing word embeddings by concatenating or convolving both types of representation. Several tasks are possible for comparison.

Sketch of work plan

Parallel work: collect and outline related work and research around recurrent and LSTM networks.

Funding

References

[1] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, “Recurrent neural network based language model.” in INTERSPEECH, 2010, pp. 1045–1048.

[2] T. Mikolov and G. Zweig, “Context dependent recurrent neural network language model.” in SLT, 2012, pp. 234–239.

[3] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks.” in ICML (3), 2013, vol. 28, pp. 1310–1318.

[4] J. Martens and I. Sutskever, “Learning recurrent neural networks with hessian-free optimization.” in ICML, 2011, pp. 1033–1040.

[5] M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM neural networks for language modeling.” in INTERSPEECH, 2012.

[6] J. L. Elman, “Distributed representations, simple recurrent networks, and grammatical structure,” Machine learning, vol. 7, no. 2-3, pp. 195–225, 1991.

[7] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[8] V. Frinken, F. Zamora-Martínez, S. E. Boquera, M. J. C. Bleda, A. Fischer, and H. Bunke, “Long-short term memory neural networks language modeling for handwriting recognition.” in ICPR, 2012, pp. 701–704.

[9] J. Hammerton, “Named entity recognition with long short-term memory,” in Proceedings of CoNLL-2003, 2003, pp. 172–175.

[10] A. Graves, D. Eck, N. Beringer, and J. Schmidhuber, “Biologically plausible speech recognition with LSTM neural nets.” in BioADIT, 2004, vol. 3141, pp. 127–136.

[11] M. C. Potter, “Conceptual short term memory,” Scholarpedia, vol. 5, no. 2, p. 3334, 2010.

[12] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

[13] E. Huang, R. Socher, C. Manning, and A. Ng, “Improving word representations via global context and multiple word prototypes,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 2012, pp. 873–882.

[14] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.