Tokenization in NLP

Posted Jun 28, 2020 · 9 min read

Source | Analytics Vidhya


  • Tokenization is key to processing text data
  • We will discuss various nuances of tokenization, including how to deal with out-of-vocabulary (OOV) words


Mastering a new language from scratch is daunting. If you have ever learned a language that is not your mother tongue, you will understand! There are too many levels to consider, such as grammar, which makes it a considerable challenge.

In order for a computer to understand any text, we need to break the text down in a way the machine can process. This is the concept of tokenization in natural language processing (NLP).

Simply put, tokenization is very important for processing text data.

Here is the interesting thing about tokenization: it is not just about breaking text apart. Tokenization plays an important role in processing text data. So, in this article, we will explore tokenization in natural language processing and how to implement it in Python.

Table of Contents

  1. What is tokenization?
  2. The real reason behind tokenization
  3. Which one should we use (word, character, or sub-word)?
  4. Implementing Byte Pair Encoding in Python


Tokenization is a common task in natural language processing (NLP). It is a basic step in both traditional NLP methods (such as Count Vectorizer) and advanced deep learning-based architectures (such as Transformers).

Tokens are the building blocks of natural language.

Tokenization is a method of splitting text into smaller units called tokens. Here, a token can be a word, a character, or a sub-word. Tokenization can therefore be roughly divided into three types: word, character, and sub-word (n-gram character) tokenization.

For example, consider this sentence: "Never give up".

The most common way to form tokens is to split on spaces. Assuming space as the delimiter, tokenizing the sentence produces 3 tokens: Never-give-up. Since each token is a word, this is an example of word tokenization.

Similarly, tokens can be characters or sub-words. For example, let's consider "smarter":

  1. Character tokenization: s-m-a-r-t-e-r
  2. Sub-word tokenization: smart-er
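To make the three granularities concrete, here is a minimal sketch in plain Python (the sub-word split is chosen by hand for illustration; real sub-word tokenizers learn the split points from a corpus):

```python
sentence = "Never give up"

# Word-level: split on whitespace
word_tokens = sentence.split(" ")
print(word_tokens)  # ['Never', 'give', 'up']

# Character-level: every character becomes a token
char_tokens = list("smarter")
print(char_tokens)  # ['s', 'm', 'a', 'r', 't', 'e', 'r']

# Sub-word-level: a frequent stem plus a suffix
# (hand-picked here; learned from data in practice)
subword_tokens = ["smart", "er"]
print(subword_tokens)  # ['smart', 'er']
```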

But is this necessary? Do we really need tokenization to accomplish all this?

The real reason behind tokenization

Since tokens are the building blocks of natural language, the most common way of processing raw text happens at the token level.

For example, Transformer-based models (the state-of-the-art (SOTA) deep learning architectures in NLP) process raw text at the token level. Similarly, the most popular deep learning architectures for NLP, such as RNNs, GRUs, and LSTMs, also process raw text at the token level.

As shown in the figure, an RNN receives and processes each token at a specific time step.

Therefore, tokenization is the first step in modeling text data. Tokenization is performed on the corpus to obtain tokens, and these tokens are then used to prepare a vocabulary. The vocabulary is the set of unique tokens that appear in the corpus. Remember, the vocabulary can be constructed by considering every unique token in the corpus or by considering only the top K most frequent words.

Creating a vocabulary is the ultimate goal of tokenization.

One of the simplest techniques to improve the performance of an NLP model is to create the vocabulary from the top K most frequent words.
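The top-K idea can be sketched in a few lines (the toy token list below is invented for illustration):

```python
from collections import Counter

# A toy tokenized corpus (invented for illustration)
corpus_tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# Keep only the K most frequent words as the vocabulary
K = 3
vocab = [word for word, freq in Counter(corpus_tokens).most_common(K)]
print(vocab)  # ['the', 'cat', 'sat']
```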

Now, let's understand the usage of vocabulary in traditional and advanced deep learning-based NLP methods.

  • Traditional NLP methods, such as Count Vectorizer and TF-IDF, use the vocabulary as features. Each word in the vocabulary is treated as a unique feature.

  • In advanced deep learning-based NLP architectures, the vocabulary is used to create the tokenized input sentences. Finally, these tokens are passed to the model as input
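The first bullet can be sketched as a minimal bag-of-words model, where each vocabulary word becomes one count feature per document (the two documents are invented for illustration):

```python
docs = ["never give up", "never say never"]

# Build the vocabulary: one feature per unique word
vocab = sorted(set(" ".join(docs).split()))

# Count-based features: one count per vocabulary word per document
features = [[doc.split().count(word) for word in vocab] for doc in docs]
print(vocab)     # ['give', 'never', 'say', 'up']
print(features)  # [[1, 1, 0, 1], [0, 2, 1, 0]]
```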

Which one should we use (word, character, or sub-word)?

As mentioned earlier, tokenization can be performed at the word, character, or sub-word level. This raises a common question: which kind of tokenization should be used when solving an NLP task? Let us discuss this here.

Word-level tokenization

Word tokenization is the most commonly used tokenization algorithm. It splits a piece of text into individual words based on a specific delimiter. Depending on the delimiter, different word-level tokens are formed. Pre-trained word embeddings, such as Word2Vec and GloVe, are based on word tokenization.

But it has a few disadvantages.

Disadvantages of word-level tokenization

One of the main problems with word tokenization is dealing with out-of-vocabulary (OOV) words. OOV words are new words encountered at test time that do not exist in the vocabulary. Therefore, these methods cannot handle OOV words.

But wait, don't jump to conclusions!

  • A little trick can save word tokenizers from OOV words: form the vocabulary from the top K frequent words and replace the rare words in the training data with an unknown token (UNK). This helps the model learn a representation for OOV words through UNK
  • Then, at test time, any word that does not exist in the vocabulary is mapped to the UNK token. This is how the OOV problem is handled in a word tokenizer
  • The problem with this approach is that all the information in a word is lost when it is mapped to UNK, even though the structure of the word might have helped represent it accurately. Another problem is that every OOV word gets the same representation
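The UNK trick described above can be sketched as follows (the token lists and the `<UNK>` symbol are illustrative choices):

```python
from collections import Counter

# Toy training tokens (invented for illustration)
train_tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# Vocabulary = top-K frequent words; everything else maps to UNK
K = 2
vocab = {word for word, _ in Counter(train_tokens).most_common(K)}

def map_to_vocab(tokens, vocab, unk="<UNK>"):
    # Replace any token outside the vocabulary with the UNK token
    return [tok if tok in vocab else unk for tok in tokens]

# Rare training words and unseen test words both become <UNK>
print(map_to_vocab(["the", "cat", "jumped"], vocab))
# ['the', 'cat', '<UNK>']
```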

Another problem with word tokenization is the size of the vocabulary. In general, pre-trained models are trained on huge text corpora. So, imagine building a vocabulary from all the unique words in such a large corpus. This blows up the vocabulary size!

This opens the door to character-level tokenization.

Character-level tokenization

Character tokenization splits a piece of text into a set of characters. It overcomes the drawbacks we saw above with word tokenization.

  • A character tokenizer handles OOV words coherently by preserving the information in the word: it breaks an OOV word down into characters and represents the word in terms of these characters
  • It also limits the size of the vocabulary. Want to guess the vocabulary size? For lowercase English letters, the answer is 26.

Disadvantages of character tokenization

Character tokenization solves the OOV problem, but representing a sentence as a sequence of characters makes the input and output sequences much longer. This makes it very challenging to learn the relationships between characters needed to form meaningful words.
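The length blow-up is easy to see by comparing the sequence lengths of the same sentence at the two granularities (the sentence is chosen for illustration):

```python
sentence = "never give up"

# Word-level sequence: one token per word
word_seq = sentence.split(" ")

# Character-level sequence: one token per character (including spaces)
char_seq = list(sentence)

print(len(word_seq))  # 3
print(len(char_seq))  # 13
```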

This brings us to another kind of tokenization, sub-word tokenization, which sits between word and character tokenization.

Sub-word tokenization

Sub-word tokenization splits text into sub-words (n-gram characters). For example, a word like lower can be divided into low-er, smartest into smart-est, and so on.

Transformer-based models (the SOTA in NLP) rely on sub-word tokenization algorithms to prepare the vocabulary. Now, I will discuss one of the most popular sub-word tokenization algorithms, known as Byte Pair Encoding (BPE).

Using BPE

Byte Pair Encoding (BPE) is a widely used tokenization method in Transformer-based models. BPE addresses the problems of both word and character tokenizers:

  • BPE effectively handles the OOV problem: it splits an OOV word into sub-words and represents the word in terms of these sub-words
  • Compared with character tokenization, the input and output sequences after BPE are shorter

BPE is a tokenization algorithm that iteratively merges the most frequently occurring characters or character sequences. Below is a step-by-step tutorial for learning BPE.

Steps to learn BPE

  1. Append the end-of-word symbol </w> to every word in the corpus
  2. Initialize the vocabulary with the unique characters in the corpus
  3. Calculate the frequency of each pair of characters or character sequences in the corpus
  4. Merge the most frequent pair in the corpus
  5. Save the best pair in the vocabulary
  6. Repeat steps 3 to 5 for a certain number of iterations

We will understand these steps through an example.

Consider a corpus:

1a) Append the end-of-word symbol </w> after each word in the corpus:

1b) Divide the words in the corpus into characters:

  2. Initialize the vocabulary:

Iteration 1:

  3. Calculate the pair frequencies:

  4. Merge the most frequent pair:

  5. Save the best pair:

Repeat steps 3-5 for every iteration from here on. Let me demonstrate one more iteration.

Iteration 2:

  3. Calculate the pair frequencies:

  4. Merge the most frequent pair:

  5. Save the best pair:

After 10 iterations, the BPE merge operation is as follows:

Very straightforward, right?

Applying BPE to OOV words

But how do we use BPE to represent OOV words during testing? Any ideas? Let's answer this question now.

During testing, the OOV word is split into a sequence of characters. Then the learned merge operations are applied to combine the characters into larger known symbols.

The following is the process of representing OOV words:

  1. Split the OOV word into characters after appending </w>
  2. Compute the pairs of characters or character sequences in the word
  3. Select the pairs that are present in the learned operations
  4. Merge the most frequent pair
  5. Repeat steps 2 and 3 until no more merges are possible

Next let's take a look at all this!

Implementing Byte Pair encoding in Python

We now know how BPE learns merge operations and how it handles OOV words. So, it's time to implement it in Python.

Python code for BPE was published along with the original paper.

Read the corpus

We will consider a simple corpus to illustrate the idea of BPE. The same idea applies to any other corpus:

#Import library
import pandas as pd

#Read the .txt file
text = pd.read_csv("sample.txt", header=None)

#Convert the data frame to a single list of words
corpus = []
for row in text.values:
    tokens = row[0].split(" ")
    for token in tokens:
        corpus.append(token)

Text preprocessing

Split each word in the corpus into characters and append </w> at the end of each word:

#Split each word into characters
corpus = [" ".join(token) for token in corpus]

#Append the end-of-word symbol </w>
corpus = [token + ' </w>' for token in corpus]

#Initialize the vocabulary with the unique symbols in the corpus
vocab = list(set(" ".join(corpus).split()))

Learning BPE

Calculate the frequency of each word in the corpus:

import collections

#Return the frequency of each word
corpus = collections.Counter(corpus)

#Convert counter object to dictionary
corpus = dict(corpus)


Let us define a function to compute the frequency of each pair of symbols. It accepts the corpus and returns the pair frequencies:

#Frequency of each pair of symbols
#Accepts the corpus and returns the frequency of each pair
def get_stats(corpus):
    pairs = collections.defaultdict(int)
    for word, freq in corpus.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i], symbols[i+1]] += freq
    return pairs

Now, the next task is to merge the most frequent pair in the corpus. We will define a function that accepts the corpus and the best pair, and returns the modified corpus:

#Merge the most frequent pair in the corpus
#Accepts the best pair and the corpus
import re
def merge_vocab(pair, corpus_in):
    corpus_out = {}
    #Match the pair only when it appears as two separate symbols
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')

    for word in corpus_in:
        w_out = p.sub(''.join(pair), word)
        corpus_out[w_out] = corpus_in[word]

    return corpus_out
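To see how these two functions behave before running the full loop, here is a self-contained sanity check on a tiny hand-made corpus (the corpus of "low" and "lower" is invented for illustration; the functions are repeated so the snippet runs on its own):

```python
import collections
import re

def get_stats(corpus):
    # Count each adjacent symbol pair, weighted by word frequency
    pairs = collections.defaultdict(int)
    for word, freq in corpus.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, corpus_in):
    # Replace every occurrence of "a b" with the merged symbol "ab"
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {p.sub(''.join(pair), word): freq for word, freq in corpus_in.items()}

# Toy corpus: words split into characters with </w> appended
corpus = {"l o w </w>": 5, "l o w e r </w>": 2}

pairs = get_stats(corpus)
best = max(pairs, key=pairs.get)
print(best)    # ('l', 'o')  -- appears 5 + 2 = 7 times
corpus = merge_vocab(best, corpus)
print(corpus)  # {'lo w </w>': 5, 'lo w e r </w>': 2}
```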

Next, it's time to learn the BPE operations. Since BPE is an iterative process, we will perform and examine one iteration first. Let us compute the bi-gram frequencies:

#bi-gram frequency
pairs = get_stats(corpus)


Find the most frequent pair:

#Calculate the best pair
best = max(pairs, key=pairs.get)
print("Most Frequent pair:",best)

Output: ('e', 's')

Finally, merge the best pair and save it in the vocabulary:

#Merge the most frequent pair in the corpus
corpus = merge_vocab(best, corpus)
print("After Merging:", corpus)

#Save the learned operation in merges and the new symbol in vocab
merges = []
merges.append(best)
vocab.append("".join(best))


We will repeat these steps for the remaining merges:

num_merges = 10
for i in range(num_merges):

    #Compute the frequency of each bi-gram
    pairs = get_stats(corpus)

    #Find the best pair
    best = max(pairs, key=pairs.get)

    #Merge the frequent pair in the corpus
    corpus = merge_vocab(best, corpus)

    #Save the learned operation in merges and the new symbol in vocab
    merges.append(best)
    vocab.append("".join(best))

#Convert tuples to strings for display
merges_in_string = ["".join(list(i)) for i in merges]
print("BPE Merge Operations:", merges_in_string)


The most interesting part is still to come: applying BPE to OOV words.

Applying BPE to OOV words

Now, we will see how to apply BPE to OOV words. For example, the OOV word is "lowest":

#BPE for an OOV word
oov = 'lowest'

#Split the OOV word into characters
oov = " ".join(list(oov))

#Add </w>
oov = oov + ' </w>'

#Create a dictionary in the same format as the corpus
oov = {oov: 1}

Applying BPE to OOV words is also an iterative process. We will perform the steps discussed earlier in this article:


i = 0
while True:

    #Compute the pair frequencies
    pairs = get_stats(oov)

    #Extract the pairs
    pairs = pairs.keys()

    #Find the pairs present in the learned operations
    ind = [merges.index(pair) for pair in pairs if pair in merges]

    #Stop when no learned operation applies
    if not ind:
        print("\nBPE Completed...")
        break

    #Choose the earliest learned operation
    best = merges[min(ind)]

    #Merge the best pair
    oov = merge_vocab(best, oov)

    i += 1
    print("Iteration", i, list(oov.keys())[0])


As you can see, the OOV word "lowest" is tokenized into low-est.


Tokenization is a powerful way of processing text data. We saw this throughout the article and implemented tokenization in Python.

Keep trying this method on any text-based dataset you come across. The more you practice, the better your understanding of how tokenization works (and why it is such a key NLP concept).

Original link: https://www.analyticsvidhya.c...
