Lexical Semantics
How should we represent the meaning of a word?
- Words, lemmas, senses, definitions
- Relationships between words or senses
- Taxonomy: abstract -> concrete
- Semantic frames and roles
Discrete Representations
Discrete encodings such as one-hot vectors suffer from the following problems (see the sketch after this list):
- Subjective
- Sparse
- Expensive: the vector length is $V$, the vocabulary size
- Hard to compute word relationships
- Too coarse, e.g., Expert vs. Skillful
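A minimal sketch of one-hot encoding, assuming a tiny made-up vocabulary, showing why the vectors are sparse and uninformative about similarity:

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies have tens of thousands of entries.
vocab = ["expert", "skillful", "cat", "dog", "the"]
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """A |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

# The dot product of any two *different* one-hot vectors is always 0,
# so "expert" looks exactly as unrelated to "skillful" as to "the".
print(one_hot("expert") @ one_hot("skillful"))  # 0.0
print(one_hot("expert") @ one_hot("the"))       # 0.0
```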
Distributional Hypothesis
“The meaning of a word is its use in the language” (Wittgenstein, 1953)
“You shall know a word by the company it keeps” (Firth, 1957)
Also called Distributional Representations.
Word Vectors
- Each word = a vector
- Similar words are “nearby in space”
- The standard way to represent meaning in NLP
Approaches for encoding words as vectors
- Counting-based methods (e.g., Tf-idf)
- Matrix factorization (e.g., topic modeling)
- Brown clusters
- Word2Vec
The entries of the count matrix can be weighted in several ways, e.g., with tf-idf or PMI (Pointwise Mutual Information).
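As a rough illustration of such count-based representations, the sketch below builds a co-occurrence matrix from an invented two-sentence corpus and reweights it with PPMI (positive PMI); the corpus, window size, and variable names are all assumptions for the example:

```python
import numpy as np
from collections import Counter

# Toy corpus and window size, made up for illustration.
corpus = ["the cat sat on the mat", "the dog sat on the log"]
window = 2

# Count (word, context) co-occurrences within the window.
counts = Counter()
for sent in corpus:
    toks = sent.split()
    for i, w in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if j != i:
                counts[(w, toks[j])] += 1

vocab = sorted({w for pair in counts for w in pair})
idx = {w: k for k, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))
for (w, c), n in counts.items():
    C[idx[w], idx[c]] = n

# PPMI reweighting: PMI(w, c) = log P(w, c) / (P(w) P(c)), clipped at 0.
P = C / C.sum()
pw = P.sum(axis=1, keepdims=True)
pc = P.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    ppmi = np.maximum(np.log(P / (pw * pc)), 0.0)
# Each row of `ppmi` is a (sparse, high-dimensional) vector for one word.
```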
The first three approaches are covered in the slides and are not discussed in detail here; the focus is on the last one, Word2Vec.
Distributed Word Embeddings
CBOW (Continuous Bag of Words)
Models $p(v|c)$: predict the center word $v$ from its context $c$.
Similar to the feedforward neural LM from Lecture 3, but without the feedforward layers.
Mikolov et al. [1] report that, in general, Skip-gram performs better than CBOW.
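A rough sketch of the CBOW forward pass under stated assumptions (random placeholder embeddings, illustrative dimensions): average the context embeddings, then apply a softmax over the vocabulary to score candidate center words.

```python
import numpy as np

V, d = 10000, 100                      # vocab size, embedding dimension (illustrative)
E_in = np.random.randn(V, d) * 0.01    # input (context) embeddings
E_out = np.random.randn(V, d) * 0.01   # output (center-word) embeddings

def cbow_probs(context_ids):
    """p(center word | context) for every word in the vocabulary."""
    h = E_in[context_ids].mean(axis=0)      # average the context vectors
    scores = E_out @ h                      # one score per vocabulary word
    exp = np.exp(scores - scores.max())     # numerically stable softmax
    return exp / exp.sum()

p = cbow_probs([12, 57, 900, 3])            # hypothetical context word ids
```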
Skip-gram
Models $p(c|v)$: predict the context words $c$ from the center word $v$.
The objective function is:
$$J(\Theta) = -\frac{1}{T} \sum_{t=1}^T\sum_{-m\leq j\leq m, j\not= 0}\log p(w_{t+j}|w_t;\Theta)$$
The probability is computed with a softmax:
$$p(o|c) = \frac{\exp(u_o^Tv_c)}{\sum_{i=1}^V \exp(u_i^Tv_c) }$$
Notation:
$o$ = index of the outside (context) word
$c$ = index of the center word ($w_t$)
$u_w$, $v_w$ = the “outside” and “center” vectors of word $w$ (each word has two vectors)
One drawback at this stage: $V$ is the number of distinct words in the training data (the vocabulary size, which can be 50K-30M), so the denominator requires a dot product with every word in the vocabulary and is very expensive to compute. This normalization term is also hard to handle during optimization, which motivates a new method: Negative Sampling.
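A sketch of this naive softmax computation with random placeholder vectors and illustrative sizes; the point is that every evaluation of $p(o|c)$ touches all $V$ output vectors:

```python
import numpy as np

V, d = 50000, 100                     # vocab size, embedding dimension (illustrative)
U = np.random.randn(V, d) * 0.01      # u_i: "outside" (context) vectors
Vc = np.random.randn(V, d) * 0.01     # v_c: "center" vectors

def log_p_softmax(o, c):
    """log p(o | c) with the full softmax: O(V * d) work per training pair."""
    scores = U @ Vc[c]                          # dot product with every vocabulary word
    scores -= scores.max()                      # numerical stability
    return scores[o] - np.log(np.exp(scores).sum())

loss = -log_p_softmax(o=123, c=456)             # one (center, outside) pair
```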
Goal: describe a word by the words that occur near it. We use the center word to predict its surrounding words, but the real product is the word vectors themselves, which can then be used for other tasks, e.g., visualizing embeddings, where the learned vectors should exhibit consistent relationships between related words, or measuring word similarity.
Word vectors are typically used as the first layer of a model, with other neural network layers stacked on top.
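A sketch, assuming PyTorch and a made-up pretrained matrix, of plugging word vectors in as the first layer of a downstream classifier:

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10000, 100)   # stand-in for learned word2vec vectors (V x d)

class Classifier(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        # First layer: embedding lookup initialized from the pretrained vectors.
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.out = nn.Linear(pretrained.size(1), num_classes)

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        x = self.emb(token_ids).mean(dim=1)      # average the word vectors
        return self.out(x)                       # any network can sit on top

logits = Classifier()(torch.randint(0, 10000, (2, 7)))
```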
Skip-gram with Negative Sampling
Convert the task to binary classification rather than multiclass:
$$\mathbb{P}(o|c) = \frac{\exp(u_o^Tv_c)}{\sum_{i=1}^V\exp(u_i^Tv_c)}\rightarrow \mathbb{P} (o|c) = \frac{1}{1+\exp(-u_o^Tv_c)} = \sigma(u^T_ov_c)$$
The new objective function:
$$\log \mathbb{P}(o_+|c) + \sum_{i=1}^k\log(1-\mathbb{P}(o_i|c))$$
The first term corresponds to the positive sample; it is paired with $k$ negative samples to optimize the objective function. The negative samples are drawn at random from the vocabulary.
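A sketch of this objective for a single positive pair plus $k$ sampled negatives, using random placeholder vectors and an illustrative dimension:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_c, u_pos, U_neg):
    """log sigma(u_o . v_c) + sum_i log(1 - sigma(u_i . v_c))."""
    pos = np.log(sigmoid(u_pos @ v_c))
    neg = np.log(1.0 - sigmoid(U_neg @ v_c)).sum()
    return pos + neg                     # maximize this (or minimize its negative)

d, k = 100, 5                            # embedding dim, number of negatives (illustrative)
v_c = np.random.randn(d) * 0.01          # center vector
u_pos = np.random.randn(d) * 0.01        # observed outside word
U_neg = np.random.randn(k, d) * 0.01     # k negative samples drawn from the vocabulary
obj = neg_sampling_objective(v_c, u_pos, U_neg)
```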
The simplest choice is to pick negative samples according to unigram frequency; it is more common to sample from a smoothed distribution:
$$\mathbb P_\alpha(w) = \frac{\mathrm{count}(w)^\alpha}{\sum_{w'}\mathrm{count}(w')^\alpha}$$
$\alpha = 0.75$ works well empirically.
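A sketch of sampling negatives from $\mathbb P_\alpha$ with invented unigram counts:

```python
import numpy as np

counts = np.array([1000, 200, 50, 5, 1], dtype=float)   # made-up unigram counts
alpha = 0.75

p_unigram = counts / counts.sum()
p_smoothed = counts**alpha / (counts**alpha).sum()

# Raising counts to the power 0.75 boosts rare words relative to raw frequency.
print(np.round(p_unigram, 3))
print(np.round(p_smoothed, 3))

negatives = np.random.choice(len(counts), size=5, p=p_smoothed)
```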
Evaluating word vectors
- Intrinsic Evaluation: test whether the representations align with our intuitions about word meaning (see the sketch after this list).
- Extrinsic Evaluation: test whether the representations are useful for downstream tasks, such as tagging, parsing, QA, …
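A sketch of a word-similarity style intrinsic evaluation, with invented vectors, word pairs, and human scores: rank pairs by cosine similarity and compare against human judgments using Spearman correlation.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=100) for w in ["cat", "dog", "car", "truck"]}  # placeholder vectors

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical human similarity judgments for each pair (higher = more similar).
pairs = [("cat", "dog", 8.5), ("car", "truck", 8.0), ("cat", "car", 2.0)]
model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]

rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.2f}")
```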
References
- Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26, Curran Associates, Inc.