David Kaumanns, 5.11.2013
Wittgenstein
“the meaning of a word is its use in the language”
Technical interpretation
\(\Rightarrow\) word embeddings
Children learn meaning by semantic bootstrapping.
Abstract concepts are built upon more concrete concepts.
… beginning with (and supported by) concrete physical properties.
Linguistic word representations can be improved by experiential data.
This has been shown for
But what about
Raw image representations
Idea
From a set of images
Extract feature descriptor vectors for iconic regions.
Quantize the feature vectors into clusters (k-means).
For each image
Linguistic contexts
Visual contexts
Caption
Women standing in line to vote in Bangladesh.
Tags
1 Bangladesh
1 line
1 standing
1 vote
1 Women
Tags
1 robot
1 pinball
1 people
1 men
1 man
1 machine
1 light
1 game
1 color
1 car
Are the image-text corpora adequate?
Are visual embeddings for image tags useful by themselves?
How to fine-tune the parameters?
How to finally integrate visual semantics into linguistic semantics?