Lexical Semantics

How should we represent the meaning of a word?

  • Words, lemmas, senses, definition
  • Relationships between words or senses
  • Taxonomy: abstract -> concrete
  • Semantic frames and roles

Discrete Representations

Discrete encodings such as one-hot encoding tend to suffer from the following problems:

  • Subjective
  • Sparse
  • Expensive: the vector length $V$ equals the vocabulary size, so storage cost is high
  • Hard to compute word relationships
  • Too coarse, e.g., expert vs. skillful
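A minimal sketch of why one-hot vectors make word relationships hard to compute: any two distinct words have a dot product of zero, so related and unrelated words look equally dissimilar. The toy vocabulary and the helper names `one_hot` and `dot` are assumptions made for illustration.

```python
def one_hot(index, vocab_size):
    """Return a one-hot list of length vocab_size with a 1 at position index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy vocabulary; the words are illustrative only.
vocab = ["expert", "skillful", "banana"]
vectors = {w: one_hot(i, len(vocab)) for i, w in enumerate(vocab)}

# Both similarities are 0: one-hot vectors cannot tell related words apart.
sim_related = dot(vectors["expert"], vectors["skillful"])
sim_unrelated = dot(vectors["expert"], vectors["banana"])
```

Each vector also has length $V$ with a single non-zero entry, which is the sparsity and cost problem listed above.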

Distributional Hypothesis

“The meaning of a word is its use in the language”

Wittgenstein 1953

“You shall know a word by the company it keeps”

Firth 1957

Also called distributional representations.


  • Each word = a vector
  • Similar words are “nearby in space”
  • The standard way to represent meaning in NLP
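"Nearby in space" is commonly measured with cosine similarity. The sketch below assumes made-up 3-dimensional vectors (they are not trained embeddings) and a hypothetical helper `cosine`.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up dense vectors, chosen by hand for illustration only.
emb = {
    "expert": [0.9, 0.8, 0.1],
    "skillful": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

sim_close = cosine(emb["expert"], emb["skillful"])  # close to 1
sim_far = cosine(emb["expert"], emb["banana"])      # much smaller
```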

Approaches for encoding words as vectors

  • Counting-based methods (e.g., Tf-idf)
  • Matrix factorization (e.g., topic modeling)
  • Brown clusters
  • Word2Vec

The entries of the count matrix can be weighted in many ways, e.g., tf-idf or PMI (pointwise mutual information).
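As a rough sketch of the PMI weighting, the code below builds word-context co-occurrence counts from a toy sentence and computes $\mathrm{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w)P(c)}$ for the observed pairs. The window size, the toy sentence, and the function names are assumptions for illustration.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, window=1):
    """Count (word, context) pairs within a symmetric window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

def pmi_table(counts):
    """PMI in bits for every observed (word, context) pair."""
    total = sum(counts.values())
    word_totals = Counter()
    ctx_totals = Counter()
    for (w, c), n in counts.items():
        word_totals[w] += n
        ctx_totals[c] += n
    # P(w,c) / (P(w) P(c)) simplifies to n * total / (word_total * ctx_total).
    return {
        (w, c): math.log2(n * total / (word_totals[w] * ctx_totals[c]))
        for (w, c), n in counts.items()
    }

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)
pmi = pmi_table(counts)
```

Unobserved pairs are simply absent from `pmi` here; in practice they are often set to zero (positive PMI) to avoid $\log 0$.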


Distributed Word Embeddings

CBOW (Continuous Bag of Words)


Similar to the feedforward neural LM from Lecture 3, but without the feedforward layers.
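A minimal sketch of a CBOW forward pass, assuming the usual formulation in which the context word vectors are averaged and scored against every output vector with a softmax. The embedding sizes, random initialization, and function names are assumptions; no training is shown.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_probs(context_ids, in_emb, out_emb):
    """Distribution over the center word given the context word ids."""
    dim = len(in_emb[0])
    # Average the input embeddings of the context words (no hidden layers).
    hidden = [sum(in_emb[i][d] for i in context_ids) / len(context_ids)
              for d in range(dim)]
    # Score every vocabulary word with a dot product, then normalize.
    scores = [sum(u[d] * hidden[d] for d in range(dim)) for u in out_emb]
    return softmax(scores)

rng = random.Random(0)
V, D = 5, 3  # hypothetical vocabulary size and embedding dimension
in_emb = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
out_emb = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]

# Context words 0, 1, 3, 4 are used to predict the center word (id 2).
probs = cbow_probs([0, 1, 3, 4], in_emb, out_emb)
```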

Mikolov et al. [1] report that the Skip-gram model generally performs better than CBOW. The Skip-gram objective is:




$$J(\Theta) = -\frac{1}{T} \sum_{t=1}^T\sum_{-m\leq j\leq m, j\not= 0}\log p(w_{t+j}|w_t;\Theta)$$
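The double sum in $J(\Theta)$ runs over every (center, outside) pair within a window of size $m$. A sketch of enumerating those pairs, with a toy sentence and a hypothetical helper `skipgram_pairs`:

```python
def skipgram_pairs(tokens, m):
    """List (center, outside) pairs for offsets -m <= j <= m, j != 0."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0:
                continue
            # Skip offsets that fall outside the sentence boundaries.
            if 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

tokens = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(tokens, m=1)
```

Each pair contributes one $\log p(w_{t+j}|w_t;\Theta)$ term to the objective.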


$$p(o|c) = \frac{\exp(u_o^Tv_c)}{\sum_{i=1}^V \exp(u_i^Tv_c) }$$


$o$ = index of outside (context) word
$c$ = index of center word ($w_t$)
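The softmax $p(o|c)$ can be sketched directly from the formula. The tiny output matrix `U` (one row $u_i$ per vocabulary word) and the center vector `v_c` below are made-up values for illustration.

```python
import math

def p_outside_given_center(o, v_c, U):
    """Softmax probability of outside word o given center vector v_c."""
    scores = [sum(a * b for a, b in zip(u, v_c)) for u in U]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[o] / sum(exps)

# Hypothetical 3-word vocabulary with 2-dimensional vectors.
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
v_c = [2.0, 0.0]

probs = [p_outside_given_center(o, v_c, U) for o in range(len(U))]
```

Note that every call touches all $V$ rows of `U` to compute the denominator.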

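A drawback at this stage: $V$ is the number of distinct words in the training data (the vocabulary size, which can be 50K-30M), so computing all the dot products in the denominator becomes very expensive.

The normalization term is likewise hard to compute during optimization, so we introduce a new method: negative sampling.

Goal: describe a word by the words that appear near it. We predict the surrounding words from the center word, but the real aim is to learn word vectors that are useful for other tasks, such as embedding visualization. The learned vectors should also capture roughly consistent relations between words.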



Skip-gram with Negative Sampling

Convert the task to binary classification rather than multiclass:

$$\mathbb{P}(o|c) = \frac{\exp(u_o^Tv_c)}{\sum_{i=1}^V\exp(u_i^Tv_c)}\rightarrow \mathbb{P} (o|c) = \frac{1}{1+\exp(-u_o^Tv_c)} = \sigma(u^T_ov_c)$$

The new Objective function:

$$\log \mathbb{P}(o_+|c) + \sum_{i=1}^k\log(1-\mathbb{P}(o_i|c))$$

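The first term is a positive sample; it is paired with $k$ negative samples to optimize the objective function. The negative samples are drawn at random from the vocabulary.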
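A sketch of evaluating this objective for one positive pair and $k$ negatives, using $\mathbb{P}(o|c) = \sigma(u_o^Tv_c)$. The vectors are made-up values and the function names are assumptions; the objective is to be maximized.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_objective(v_c, u_pos, u_negs):
    """log P(o+|c) + sum_i log(1 - P(o_i|c)) for one training pair."""
    obj = math.log(sigmoid(dot(u_pos, v_c)))
    for u_neg in u_negs:
        obj += math.log(1.0 - sigmoid(dot(u_neg, v_c)))
    return obj

v_c = [1.0, 0.0]  # hypothetical center word vector

# Positive vector aligned with v_c, negatives not aligned: high objective.
good = neg_sampling_objective(v_c, [2.0, 0.0], [[-2.0, 0.0], [0.0, 1.0]])
# Roles swapped: the objective drops sharply.
bad = neg_sampling_objective(v_c, [-2.0, 0.0], [[2.0, 0.0], [0.0, 1.0]])
```

Only $k + 1$ dot products are needed per pair, instead of the $V$ required by the full softmax.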

Pick negative samples according to unigram frequency

More common to choose according to:

$$\mathbb P_\alpha(w) = \frac{\mathrm{count}(w)^\alpha}{\sum_{w'}\mathrm{count}(w')^\alpha}$$

$\alpha = 0.75$ works well empirically.
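A sketch of the smoothed sampling distribution, with made-up word counts. Raising counts to $\alpha < 1$ shifts probability mass from frequent words to rare ones. The helper names and the counts are assumptions for illustration.

```python
import random
from collections import Counter

def smoothed_unigram(counts, alpha=0.75):
    """P_alpha(w) = count(w)^alpha / sum over w' of count(w')^alpha."""
    weights = {w: c ** alpha for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}

def sample_negatives(probs, k, rng):
    """Draw k negative words (with replacement) from the distribution."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=k)

# Made-up counts for illustration only.
counts = Counter({"the": 1000, "cat": 10, "aardvark": 1})

p_raw = smoothed_unigram(counts, alpha=1.0)      # plain unigram frequency
p_smooth = smoothed_unigram(counts, alpha=0.75)  # smoothed version
negatives = sample_negatives(p_smooth, k=5, rng=random.Random(0))
```

This sketch does not exclude the positive word from the draw; whether to filter it out is a separate implementation choice.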

Evaluating word vectors

  • Intrinsic Evaluation: test whether the representations align with our intuitions about word meaning.
  • Extrinsic Evaluation: test whether the representations are useful for downtream tasks, such as tagging, parsing, QA, …


  1. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Curran Associates, Inc.

