Lexical Semantics
How should we represent the meaning of a word?
- Words, lemmas, senses, definitions
- Relationships between words or senses
- Taxonomy: abstract -> concrete
- Semantic frames and roles
Discrete Representations
Discrete encodings such as one-hot vectors suffer from the following problems (see the sketch after this list):
- Subjective
- Sparse
- Expensive: the vector length is $V$, the vocabulary size
- Hard to compute word relationships
- Too coarse, e.g., Expert vs. Skillful
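A minimal sketch of one-hot encoding, assuming a tiny made-up vocabulary, showing why the vectors are sparse and uninformative about similarity:

```python
import numpy as np

# Hypothetical toy vocabulary; real vocabularies have tens of thousands of entries.
vocab = ["expert", "skillful", "cat", "dog", "the"]
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """A |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

# The dot product of any two *different* one-hot vectors is always 0,
# so "expert" looks exactly as unrelated to "skillful" as to "the".
print(one_hot("expert") @ one_hot("skillful"))  # 0.0
print(one_hot("expert") @ one_hot("the"))       # 0.0
```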
Distributional Hypothesis
“The meaning of a word is its use in the language” (Wittgenstein, 1953)
“You shall know a word by the company it keeps” (Firth, 1957)
Also called Distributional Representations.
Word Vectors
- Each word = a vector
- Similar words are “nearby in space”
- The standard way to represent meaning in NLP
Approaches for encoding words as vectors
- Counting-based methods (e.g., Tf-idf)
- Matrix factorization (e.g., topic modeling)
- Brown clusters
- Word2Vec
The entries of the count matrix can be weighted in several ways, e.g., with tf-idf or PMI (Pointwise Mutual Information).
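As a rough illustration of such count-based representations, the sketch below builds a co-occurrence matrix from an invented two-sentence corpus and reweights it with PPMI (positive PMI); the corpus, window size, and variable names are all assumptions for the example:

```python
import numpy as np
from collections import Counter

# Toy corpus and window size, made up for illustration.
corpus = ["the cat sat on the mat", "the dog sat on the log"]
window = 2

# Count (word, context) co-occurrences within the window.
counts = Counter()
for sent in corpus:
    toks = sent.split()
    for i, w in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if j != i:
                counts[(w, toks[j])] += 1

vocab = sorted({w for pair in counts for w in pair})
idx = {w: k for k, w in enumerate(vocab)}
C = np.zeros((len(vocab), len(vocab)))
for (w, c), n in counts.items():
    C[idx[w], idx[c]] = n

# PPMI reweighting: PMI(w, c) = log P(w, c) / (P(w) P(c)), clipped at 0.
P = C / C.sum()
pw = P.sum(axis=1, keepdims=True)
pc = P.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore"):
    ppmi = np.maximum(np.log(P / (pw * pc)), 0.0)
# Each row of `ppmi` is a (sparse, high-dimensional) vector for one word.
```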
The first three approaches are covered in the slides and are not discussed in detail here; the focus is on the last one, Word2Vec.
Distributed Word Embeddings
CBOW (Continuous Bag of Words)
Models $p(v|c)$: predict the center word $v$ from its context $c$.
Similar to the feedforward neural LM from Lecture 3, but without the feedforward layers.
Mikolov et al. [1] report that, in general, Skip-gram performs better than CBOW.
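A rough sketch of the CBOW forward pass under stated assumptions (random placeholder embeddings, illustrative dimensions): average the context embeddings, then apply a softmax over the vocabulary to score candidate center words.

```python
import numpy as np

V, d = 10000, 100                      # vocab size, embedding dimension (illustrative)
E_in = np.random.randn(V, d) * 0.01    # input (context) embeddings
E_out = np.random.randn(V, d) * 0.01   # output (center-word) embeddings

def cbow_probs(context_ids):
    """p(center word | context) for every word in the vocabulary."""
    h = E_in[context_ids].mean(axis=0)      # average the context vectors
    scores = E_out @ h                      # one score per vocabulary word
    exp = np.exp(scores - scores.max())     # numerically stable softmax
    return exp / exp.sum()

p = cbow_probs([12, 57, 900, 3])            # hypothetical context word ids
```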
Skip-gram
Models $p(c|v)$: predict the context words $c$ from the center word $v$.
The objective function is:
$$J(\Theta) = -\frac{1}{T} \sum_{t=1}^T\sum_{-m\leq j\leq m, j\not= 0}\log p(w_{t+j}|w_t;\Theta)$$
The probability is computed with a softmax:
$$p(o|c) = \frac{\exp(u_o^Tv_c)}{\sum_{i=1}^V \exp(u_i^Tv_c) }$$
Notation:
$o$ = index of the outside (context) word
$c$ = index of the center word ($w_t$)
$u_w$, $v_w$ = the “outside” and “center” vectors of word $w$ (each word has two vectors)
One drawback at this stage: $V$ is the number of distinct words in the training data (the vocabulary size, which can be 50K-30M), so the denominator requires a dot product with every word in the vocabulary and is very expensive to compute. This normalization term is also hard to handle during optimization, which motivates a new method: Negative Sampling.
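A sketch of this naive softmax computation with random placeholder vectors and illustrative sizes; the point is that every evaluation of $p(o|c)$ touches all $V$ output vectors:

```python
import numpy as np

V, d = 50000, 100                     # vocab size, embedding dimension (illustrative)
U = np.random.randn(V, d) * 0.01      # u_i: "outside" (context) vectors
Vc = np.random.randn(V, d) * 0.01     # v_c: "center" vectors

def log_p_softmax(o, c):
    """log p(o | c) with the full softmax: O(V * d) work per training pair."""
    scores = U @ Vc[c]                          # dot product with every vocabulary word
    scores -= scores.max()                      # numerical stability
    return scores[o] - np.log(np.exp(scores).sum())

loss = -log_p_softmax(o=123, c=456)             # one (center, outside) pair
```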
Goal: describe a word by the words that occur near it. We use the center word to predict its surrounding words, but the real product is the word vectors themselves, which can then be used for other tasks, e.g., visualizing embeddings, where the learned vectors should exhibit consistent relationships between related words, or measuring word similarity.
Word vectors are typically used as the first layer of a model, with other neural network layers stacked on top.
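A sketch, assuming PyTorch and a made-up pretrained matrix, of plugging word vectors in as the first layer of a downstream classifier:

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10000, 100)   # stand-in for learned word2vec vectors (V x d)

class Classifier(nn.Module):
    def __init__(self, num_classes=5):
        super().__init__()
        # First layer: embedding lookup initialized from the pretrained vectors.
        self.emb = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.out = nn.Linear(pretrained.size(1), num_classes)

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        x = self.emb(token_ids).mean(dim=1)      # average the word vectors
        return self.out(x)                       # any network can sit on top

logits = Classifier()(torch.randint(0, 10000, (2, 7)))
```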
Skip-gram with Negative Sampling
Convert the task to binary classification rather than multiclass:
$$\mathbb{P}(o|c) = \frac{\exp(u_o^Tv_c)}{\sum_{i=1}^V\exp(u_i^Tv_c)}\rightarrow \mathbb{P} (o|c) = \frac{1}{1+\exp(-u_o^Tv_c)} = \sigma(u^T_ov_c)$$
The new objective function:
$$\log \mathbb{P}(o_+|c) + \sum_{i=1}^k\log(1-\mathbb{P}(o_i|c))$$
The first term corresponds to the positive sample; it is paired with $k$ negative samples to optimize the objective function. The negative samples are drawn at random from the vocabulary.
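A sketch of this objective for a single positive pair plus $k$ sampled negatives, using random placeholder vectors and an illustrative dimension:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_c, u_pos, U_neg):
    """log sigma(u_o . v_c) + sum_i log(1 - sigma(u_i . v_c))."""
    pos = np.log(sigmoid(u_pos @ v_c))
    neg = np.log(1.0 - sigmoid(U_neg @ v_c)).sum()
    return pos + neg                     # maximize this (or minimize its negative)

d, k = 100, 5                            # embedding dim, number of negatives (illustrative)
v_c = np.random.randn(d) * 0.01          # center vector
u_pos = np.random.randn(d) * 0.01        # observed outside word
U_neg = np.random.randn(k, d) * 0.01     # k negative samples drawn from the vocabulary
obj = neg_sampling_objective(v_c, u_pos, U_neg)
```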
The simplest choice is to pick negative samples according to unigram frequency; it is more common to sample from a smoothed distribution:
$$\mathbb P_\alpha(w) = \frac{\mathrm{count}(w)^\alpha}{\sum_{w'}\mathrm{count}(w')^\alpha}$$
$\alpha = 0.75$ works well empirically.
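A sketch of sampling negatives from $\mathbb P_\alpha$ with invented unigram counts:

```python
import numpy as np

counts = np.array([1000, 200, 50, 5, 1], dtype=float)   # made-up unigram counts
alpha = 0.75

p_unigram = counts / counts.sum()
p_smoothed = counts**alpha / (counts**alpha).sum()

# Raising counts to the power 0.75 boosts rare words relative to raw frequency.
print(np.round(p_unigram, 3))
print(np.round(p_smoothed, 3))

negatives = np.random.choice(len(counts), size=5, p=p_smoothed)
```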
Evaluating word vectors
- Intrinsic Evaluation: test whether the representations align with our intuitions about word meaning (see the sketch after this list).
- Extrinsic Evaluation: test whether the representations are useful for downstream tasks, such as tagging, parsing, QA, …
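A sketch of a word-similarity style intrinsic evaluation, with invented vectors, word pairs, and human scores: rank pairs by cosine similarity and compare against human judgments using Spearman correlation.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=100) for w in ["cat", "dog", "car", "truck"]}  # placeholder vectors

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical human similarity judgments for each pair (higher = more similar).
pairs = [("cat", "dog", 8.5), ("car", "truck", 8.0), ("cat", "car", 2.0)]
model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]

rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.2f}")
```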
References
- Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems 26, Curran Associates, Inc.