
Peter Norvig's Spelling Corrector

Today I read Peter Norvig's article on writing a spelling corrector (via a Chinese translation of the original). It is not only well written, it also happens to be closely related to a Naive-Bayes noun-phrase classifier I implemented not long ago, so I got a lot out of it. Here is a brief summary of the points the article makes.

The whole thing is 21 lines of Python, fully functional, processes 10-plus words per second, and reaches roughly 80%~90% accuracy.

Notes at the language level:
The statistical model behind the spell corrector is just Bayes' Rule: P(c|w) = P(c) * P(w|c) / P(w). Since P(w) is the same for every candidate correction c, picking the best correction amounts to maximizing P(w|c) * P(c) (spelled out right after this list).
Sources of error and possible improvements.
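
To spell out the selection rule mentioned above (my restatement in LaTeX, not wording taken from the article): the denominator P(w) does not depend on the candidate, so it drops out of the maximization.

    \hat{c} = \arg\max_{c} P(c \mid w)
            = \arg\max_{c} \frac{P(w \mid c)\,P(c)}{P(w)}
            = \arg\max_{c} P(w \mid c)\,P(c)

In the code further down, P(c) is estimated from word counts over big.txt (the NWORDS table), while P(w|c) is never modeled explicitly: candidates at smaller edit distance are simply assumed to be overwhelmingly more likely than those at larger edit distance.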

Update @ 1/6/2014
Attached below is the code from the original article.

    import re, collections

    def words(text): return re.findall('[a-z]+', text.lower())

    def train(features):
        # Count word occurrences; unseen words default to a count of 1
        # (simple smoothing), so novel words never get probability zero.
        model = collections.defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model

    # The language model P(c): word counts over the big.txt corpus.
    # (file() is Python 2; use open() on Python 3.)
    NWORDS = train(words(file('big.txt').read()))

    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def edits1(word):
        # Everything at edit distance 1: single-letter deletions,
        # adjacent transpositions, replacements, and insertions.
        splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes    = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
        inserts    = [a + c + b     for a, b in splits for c in alphabet]
        return set(deletes + transposes + replaces + inserts)

    def known_edits2(word):
        # Words at edit distance 2 that actually occur in the corpus.
        return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

    def known(words): return set(w for w in words if w in NWORDS)

    def correct(word):
        # Prefer edit distance 0, then 1, then 2; within the winning set,
        # pick the word with the highest corpus count.
        candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
        return max(candidates, key=NWORDS.get)
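
One detail worth calling out in correct() (my gloss, not the article's wording): Python's or returns the first non-empty value, so candidates at edit distance 0 always beat distance 1, which always beat distance 2, and only within the winning set does the corpus count in NWORDS decide. An equivalent, more explicit version of the same selection logic might look like this (a sketch of mine, not code from the article; correct_verbose is my own name):

    def correct_verbose(word):
        # Same priority order as the or-chain in correct(): the exact word,
        # then edit distance 1, then 2, then give up and echo the input.
        for candidates in (known([word]), known(edits1(word)),
                           known_edits2(word), [word]):
            if candidates:
                return max(candidates, key=NWORDS.get)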

Usage:

    >>> correct('speling')
    'spelling'
    >>> correct('korrecter')
    'corrector'
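
To get rough numbers like the speed and accuracy quoted at the top, a small timing harness is enough. The sketch below is my own (the test pairs are made-up placeholders, not Norvig's actual test sets, which are much larger):

    import time

    # Hypothetical (misspelling, expected correction) pairs for a quick check.
    tests = [('speling', 'spelling'), ('korrecter', 'corrector'),
             ('acess', 'access'), ('tehm', 'them')]

    start = time.time()
    right = sum(correct(wrong) == expected for wrong, expected in tests)
    elapsed = time.time() - start

    print('accuracy: %d of %d' % (right, len(tests)))
    print('time: %.3f seconds for %d words' % (elapsed, len(tests)))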