A Beginner's Guide to Stemming in Natural Language Processing

My job focuses almost exclusively on NLP. So I work with text. A lot.

Text ML comes with its own challenges and tools. But one recurring problem is excessive dimensionality.

In real life (not Kaggle), we often have a limited number of annotated examples on which to train models. And each example often contains a large amount of text.

This sparsity, and the lack of shared features across examples, make it difficult to find patterns in the data. We call this the curse of dimensionality.

One technique for reducing dimensionality is stemming.

It removes the suffix from a word to get its “stem”.

So fisher becomes fish.

Let’s learn how stemming works, why to use it, and how to use it.

What is stemming?

Suffixes

Stemming reduces words to their stem (root) by removing suffixes.

Suffixes are word endings which modify a word’s meaning. These include but are not limited to...

-able
-ation
-ed
-er
-est
-iest
-ese
-ily
-ful
-ing
...

Obviously the words runner and running have different meanings.

But when it comes to text classification tasks, both words imply that a sentence is talking about something similar.

The Algorithm

How do stemming algorithms work?

In a nutshell, algorithms contain a list of rules for removing suffixes, which are conditionally applied to input tokens.

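In that spirit, here’s a toy sketch of a rule-based stemmer. These are made-up rules for illustration only, not the actual Porter rules; real algorithms add extra conditions on the shape of the remaining stem.

```python
# A toy rule-based stemmer -- an ordered list of (suffix, replacement)
# rules, NOT the real Porter algorithm, just an illustration of the idea.
RULES = [
    ("ies", "y"),   # ponies -> pony
    ("ing", ""),    # fishing -> fish
    ("ed", ""),     # fished -> fish
    ("es", ""),     # fishes -> fish
    ("s", ""),      # runs -> run
]

def toy_stem(token):
    """Apply the first matching suffix rule, guarding against
    stripping a suffix that would leave too short a stem."""
    for suffix, replacement in RULES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)] + replacement
    return token  # no rule matched; leave the token alone

print([toy_stem(w) for w in ["fishing", "fished", "fishes", "ponies", "runs"]])
#=> ['fish', 'fish', 'fish', 'pony', 'run']
```

Note the rules are ordered and only the first match fires, which is why “ponies” becomes “pony” rather than losing a bare “s”.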
There’s nothing statistical going on here, so algorithms are pretty basic. That makes stemming fast to run on large bodies of text.

If you’re interested, the source of NLTK’s Porter stemming algorithm is freely available to dig into, as is the original paper explaining it.

That said, there are different stemming algorithms. NLTK alone contains Porter, Snowball, and Lancaster. Each is more aggressive in its stemming than the last.

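As a quick sketch of that difference, the three NLTK stemmers can be run side by side on the same words (the word list here is arbitrary, chosen to show where the stemmers disagree):

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer

# Three stemmers, roughly in order of increasing aggressiveness
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

for word in ["fishing", "fairly", "generously"]:
    # Stem the same word with each algorithm and compare
    results = {name: s.stem(word) for name, s in stemmers.items()}
    print(word, results)
```

A classic example: Porter stems “fairly” to “fairli”, while Snowball (sometimes called Porter2) produces the cleaner “fair”.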
Reasons for stemming text

Context

A large part of NLP is figuring out what a body of text is talking about.

While not always true, a sentence containing the word planting is often talking about something similar to another sentence containing the word plant.

Given this, why not reduce all words to their stems before training a classification model on them?

There’s also another benefit.

Dimensionality

Large bodies of text contain huge numbers of different words. Combined with a limited number of training examples, sparsity makes it hard for models to find patterns and do accurate classifications.

Reducing words to their stem decreases sparsity and makes it easier to find patterns and make predictions.

Stemming allows each string of text to be represented in a smaller bag of words.

Example: After stemming, the sentence "the fishermen fished for fish" can be represented in a bag of words like this.

[the, fishermen, fish, for]

Instead of.

[the, fishermen, fished, for, fish]

This both decreases sparsity across examples, and increases training algorithm speed.

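To make the shrink concrete, here’s a small sketch of that bag-of-words example, using a plain whitespace split rather than a full tokenizer:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Simple whitespace tokenization of the example sentence
tokens = "the fishermen fished for fish".split()
stems = [stemmer.stem(w) for w in tokens]

# "fished" and "fish" collapse to the same stem,
# so the stemmed vocabulary is smaller than the raw one.
print(sorted(set(tokens)))
print(sorted(set(stems)))
```

Shrinking the vocabulary like this is exactly what reduces sparsity when these bags of words are fed to a classifier.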
In my experience, very little signal is lost via stemming, compared to, say, using word embeddings.

A Python Example

We’ll start with a string of words, each with the stem, “fish”.

'fish fishing fishes fisher fished fishy'

And then stem the tokens in that string.

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

# word_tokenize needs the NLTK "punkt" tokenizer data;
# if it's missing, run nltk.download('punkt') first
text = 'fish fishing fishes fisher fished fishy'

# Tokenize the string
tokens = word_tokenize(text)
print(tokens)
#=> ['fish', 'fishing', 'fishes', 'fisher', 'fished', 'fishy']

# Stem each token
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in tokens]
print(stems)
#=> ['fish', 'fish', 'fish', 'fisher', 'fish', 'fishi']

The result is.

['fish', 'fish', 'fish', 'fisher', 'fish', 'fishi']

Note that it didn’t reduce all tokens to the stem “fish”. Porter is one of the most conservative stemming algorithms. That said, this is still an improvement.

Conclusion

Anecdotally, preprocessing is the most important (and neglected) part of the NLP pipeline.

It determines the shape of the data that is eventually fed to ML models, and it can be the difference between feeding a model quality data and feeding it garbage.

Garbage in, garbage out — as they say.

After down-casing and removing punctuation and stopwords, stemming is a key component of most NLP pipelines.

Stemming (and its cousin, lemmatization) will give you better results on less data, and decrease model training time.

Translated from: https://towardsdatascience.com/a-beginners-guide-to-stemming-in-natural-language-processing-34ddee4acd37
