This post is Grassroots, meaning a reader posted it directly. If you see an issue with it, contact an editor.

July 1, 2021

Know the Difference Between Word, Subword, and Character-Based Tokenization

Photo by SHVETS production on Pexels.

Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) that enables machines (computers) to understand written and spoken human language in much the same way that humans do. NLP is nearly everywhere, and it assists people in their daily work. It is now such a commonplace technology that we frequently take it for granted. Spell check, autocomplete, spam detection, Alexa, and Google Assistant are a few examples. It is important to remember, however, that machines work with numbers rather than letters, words, or sentences. So, in order to work with the vast amount of text data readily available on the internet, we need to manipulate and clean the text, which is usually referred to as text pre-processing in NLP.
The first stage in dealing with text and developing a model to solve our business problem is pre-processing. Pre-processing is itself a multi-stage procedure. In this article we discuss only tokenization and tokenizers. So let us get started.

Please keep in mind that we are mostly concerned with the English language.

Tokenization

Tokenization is a key step in text pre-processing. You cannot skip this stage, whether you are using traditional NLP approaches or modern deep-learning techniques.

Tokenization is the process of dividing a phrase, sentence, paragraph, or one or more text documents into smaller components. Each of these smaller components is referred to as a token. These tokens can be anything: a word, a subword, or even a single character. Different algorithms follow different procedures to perform tokenization, but the example below will give you a good idea of the differences between these three.

Take a look at the following sentence/raw text.

“Let us learn tokenization.”

A word-based tokenization algorithm will break the sentence into words. The most common approach is splitting on spaces.

[“Let”, “us”, “learn”, “tokenization.”]

A character-based tokenization algorithm will break the sentence into characters.

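[“L”, “e”, “t”, “u”, “s”, “l”, “e”, “a”, “r”, “n”, “t”, “o”, “k”, “e”, “n”, “i”, “z”, “a”, “t”, “i”, “o”, “n”, “.”]

A subword-based tokenization algorithm will break the sentence into subwords.

[“Let”, “us”, “learn”, “token”, “ization.”]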

Tokens are the foundation of NLP, and all NLP models process raw text at the token level. These tokens are used to create the vocabulary, which is the set of distinct tokens in a corpus (a dataset in NLP). This vocabulary is then mapped to integers (IDs), which are what the model actually works with.
In this section, we introduced three possible tokenization strategies. Each of them works differently and has its own set of benefits and drawbacks. Let us get into the specifics of each to learn more.
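The step from tokens to a vocabulary and then to integer IDs can be sketched in a few lines of Python. This is a minimal illustration under simple assumptions: the helper name build_vocab is made up for this example, and IDs are assigned in order of first appearance, which is only one of many possible conventions.

```python
sentence = "Let us learn tokenization."

# Word-level tokens: split on whitespace.
word_tokens = sentence.split()

def build_vocab(tokens):
    """Map each distinct token to an integer ID, in order of first appearance.

    Hypothetical helper for illustration; real libraries use their own schemes.
    """
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

word_vocab = build_vocab(word_tokens)
word_ids = [word_vocab[token] for token in word_tokens]

print(word_tokens)  # ['Let', 'us', 'learn', 'tokenization.']
print(word_ids)     # [0, 1, 2, 3]
```

The model never sees the strings themselves, only the list of IDs.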
Tokenization based on words

This is the most commonly used tokenization method. It divides a text string into words based on a delimiter. Space is the most widely used delimiter. You can also use more than one delimiter, such as space and punctuation marks, to split your text. You will obtain different word-level tokens depending on the delimiter you select.

Word-based tokenization is simple to implement using a custom RegEx or Python’s split() method. Aside from that, Python has a plethora of libraries that can assist you with tokenization, such as NLTK, spaCy, Keras, and Gensim.

Example:

“Is it weird I don’t like coffee?”

We get the following results from word-based tokenization with space as a delimiter:

[“Is”, “it”, “weird”, “I”, “don’t”, “like”, “coffee?”]

Looking at the tokens “don’t” and “coffee?”, we can see that these words have punctuation attached to them. What if there is another raw text (sentence) in our corpus that looks like this: “I love coffee.” This time there will be a token “coffee.”, which will cause the model to learn different representations of the word coffee (“coffee?” and “coffee.”) and will make the representation of words (tokens) suboptimal.
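The effect is easy to reproduce with Python’s split(). The snippet below is a small sketch using a sentence like the one in the example plus “I love coffee.”; the two-sentence corpus and the variable names are assumptions made purely for illustration.

```python
corpus = ["Is it weird I don't like coffee?", "I love coffee."]

# Whitespace splitting leaves punctuation glued to the neighbouring word.
tokens = [token for text in corpus for token in text.split()]

# The vocabulary is the set of distinct tokens.
vocab = sorted(set(tokens))

# The same word ends up as two unrelated vocabulary entries.
coffee_variants = [token for token in vocab if token.startswith("coffee")]
print(coffee_variants)  # ['coffee.', 'coffee?']
```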

We should take punctuation into account when tokenizing, since we do not want our model to learn a distinct representation of the same word for every punctuation mark that can follow it. If we allow that, the number of representations the model has to learn (each word multiplied by the number of punctuation marks used in the language) will explode. So, let us take punctuation into account.

[“Is”, “it”, “weird”, “I”, “don”, “’”, “t”, “like”, “coffee”, “?”]

This is better than what we had previously. However, as we can see, tokenization has produced three tokens for the word “don’t”: “don”, “’”, and “t”. A better tokenization of “don’t” would be “do” and “n’t”. If the model then saw the word “doesn’t” in the future, it would tokenize it into “does” and “n’t”, and since it had already learnt about “n’t”, it could apply that knowledge here. The problem appears complex, yet it can be handled with a few simple rules.

You may have observed that modern NLP models each have their own tokenizers, because each model applies its own rules in addition to splitting on spaces. As a result, tokenizers from different NLP models can generate different tokens for the same text. Space-and-punctuation tokenization and rule-based tokenization are both examples of word-based tokenization.
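Both variants can be sketched with Python’s re module. The function names and the single contraction rule below are assumptions for illustration; real tokenizers in libraries such as NLTK or spaCy apply much larger rule sets.

```python
import re

def punct_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def rule_tokenize(text):
    """Like punct_tokenize, but keep the contraction suffix n't as one token.

    Hypothetical single rule: put a space before "n't", so "don't" -> "do n't".
    """
    text = re.sub(r"(\w)n't\b", r"\1 n't", text)
    # Try "n't" first so it is not broken up into "n", "'", "t".
    return re.findall(r"n't|\w+|[^\w\s]", text)

sentence = "Is it weird I don't like coffee?"
print(punct_tokenize(sentence))
# ['Is', 'it', 'weird', 'I', 'don', "'", 't', 'like', 'coffee', '?']
print(rule_tokenize(sentence))
# ['Is', 'it', 'weird', 'I', 'do', "n't", 'like', 'coffee', '?']
print(rule_tokenize("It doesn't matter."))
# ['It', 'does', "n't", 'matter', '.']
```

With the rule in place, “don’t” and “doesn’t” share the token “n’t”, so whatever the model learns about it carries over from one word to the other.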

Each word is then represented by an ID, and each ID carries a lot of information, since a word in a sentence typically holds a great deal of contextual and semantic information.

The technique works well, yet on a massive corpus this form of tokenization results in a very large vocabulary. Transformer-XL, for example, uses space and punctuation tokenization and has a vocabulary size of 267,735. That is enormous! This large vocabulary size requires a large embedding matrix for both the input and output layers, making the model heavier and more demanding in computational resources.
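A quick back-of-the-envelope calculation shows why the vocabulary size matters. The vocabulary size is the figure quoted above; the embedding width of 1,024 and the 32-bit storage are assumptions chosen only to make the arithmetic concrete, not the configuration of any particular model.

```python
vocab_size = 267_735    # vocabulary size quoted in the text
embedding_dim = 1024    # hypothetical embedding width, for illustration only
bytes_per_param = 4     # assuming 32-bit floating-point parameters

# One embedding vector per vocabulary entry.
params = vocab_size * embedding_dim
size_mib = params * bytes_per_param / 2**20

print(f"{params:,} parameters")  # 274,160,640 parameters
print(f"{size_mib:.0f} MiB for a single embedding matrix")
```

Under these assumptions a single embedding matrix already holds more than a quarter of a billion parameters, and an output layer of the same shape would add as many again unless the weights are shared.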

This tokenization also assigns different IDs to words like “boy” and “boys,” which are nearly identical in the English language (one is singular and the other is plural). We want our model to understand that words like these are related.

To address this massive vocabulary issue, we can restrict the number of words that can be added to the vocabulary. For example, we can keep only the 5,000 most frequent words in the corpus. The tokenizer will then generate IDs for those 5,000 words and mark the remaining words as OOV (Out Of Vocabulary). However, this results in information loss because the model will not learn anything about the OOV words. This is a significant compromise, because the model will learn the same OOV representation for all unknown words.

Another disadvantage concerns misspelt words. If the word “knowledge” is misspelt as “knowldge” in the corpus, the model will assign an OOV token to the latter.
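A frequency-capped vocabulary with an OOV fallback can be sketched as follows. The function names, the “<OOV>” label, the choice of ID 0 for it, and the tiny corpus are all assumptions for illustration; the limit is set to 2 instead of 5,000 so the effect is visible.

```python
from collections import Counter

def build_limited_vocab(tokens, max_size):
    """Keep only the max_size most frequent tokens; ID 0 is reserved for OOV."""
    counts = Counter(tokens)
    vocab = {"<OOV>": 0}
    for token, _ in counts.most_common(max_size):
        vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token by its ID, falling back to the OOV ID."""
    return [vocab.get(token, vocab["<OOV>"]) for token in tokens]

train_tokens = "the cat sat on the mat the cat ran".split()
vocab = build_limited_vocab(train_tokens, max_size=2)
print(vocab)  # {'<OOV>': 0, 'the': 1, 'cat': 2}

print(encode("the dog sat on the cat".split(), vocab))
# [1, 0, 0, 0, 1, 2]
```

“dog”, “sat”, and “on” all collapse to the same ID 0, which is exactly the information loss described above.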

To address all of these challenges, researchers devised character-based tokenization.

Tokenization based on characters

Tokenizers based on characters split the raw text into individual characters. The idea behind this tokenization is that a language has a large number of words but only a limited number of characters. As a result, the vocabulary is very small.

For example, in the English language we use around 256 different characters (letters, numerals, and special characters), whereas the language has a vocabulary of about 170,000 words. As a result, character-based tokenization needs far fewer distinct tokens in its vocabulary than word-based tokenization.

One of the primary benefits of character-based tokenization is that there will be no or very few unknown or OOV words. The model can use the representation of each character to build a representation of unknown words (words not encountered during training). Another benefit is that a misspelt word can still be represented character by character instead of being marked as an OOV token and losing information.
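As a minimal sketch of why OOV tokens mostly disappear, the snippet below encodes both the correct and the misspelt form of “knowledge” with the same tiny character vocabulary. The vocabulary is assumed to contain only the 26 lowercase ASCII letters, and encode_chars is a made-up helper name.

```python
import string

# Assumed character inventory: lowercase ASCII letters only, for brevity.
char_vocab = {ch: i for i, ch in enumerate(string.ascii_lowercase)}

def encode_chars(word):
    """Encode a word as a list of character IDs."""
    return [char_vocab[ch] for ch in word]

print(encode_chars("knowledge"))  # [10, 13, 14, 22, 11, 4, 3, 6, 4]
print(encode_chars("knowldge"))   # [10, 13, 14, 22, 11, 3, 6, 4]
```

The misspelt word gets a sequence that differs by a single ID, rather than being replaced wholesale by an OOV token.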

This type of tokenization is straightforward and can significantly reduce memory and time complexity. So, is it the best or perfect tokenization algorithm? The answer is no (at least for the English language)! A character, unlike a word, normally does not carry any meaning or information on its own.

Note: In a few languages, each character carries a lot of information, so character-based tokenization can be advantageous there.

In character-based tokenization, reducing the vocabulary size entails a trade-off with the sequence length. Because each word is split into its characters, the tokenized sequence is much longer than the corresponding sequence of words. The word “knowledge,” for example, becomes nine tokens.
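The trade-off can be checked directly on the example sentence from the beginning of the article. This is a small sketch; dropping the spaces from the character sequence is an assumption that matches the character list shown earlier.

```python
text = "Let us learn tokenization."

word_tokens = text.split()
# Character tokens, ignoring the spaces between words.
char_tokens = [ch for ch in text if not ch.isspace()]

print(len(word_tokens))        # 4
print(len(char_tokens))        # 23
print(len(list("knowledge")))  # 9
```

The same sentence takes almost six times as many positions at the character level, and the model has to process every one of them.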

Karpathy, Radford et al., Kalchbrenner et al., and Lee et al. demonstrated the use of character-based tokenization and obtained some impressive results. To learn more, read these papers!

Despite its flaws, character-based tokenization solves many of the problems that affect word-based tokenization. Let us see whether we can also tackle the problems associated with character-based tokenization.

Tokenization based on subwords

Subword-based tokenization is another prominent tokenization method that bridges the gap between word-based and character-based tokenization. The main idea is to solve the problems of word-based tokenization (very large vocabulary size, large number of OOV tokens, and different representations for very similar words) and of character-based tokenization (very long sequences and less meaningful individual tokens).

The subword-based tokenization algorithms work on the basis of the following concepts.

Do not break up commonly used words into smaller subwords.

Divide the uncommon words into smaller meaningful subwords.

For example, the word “boy” should not be split, but the word “boys” should be split into “boy” and “s.” This will help the model understand that the word “boys” is formed from the word “boy,” with a slightly different meaning but the same root.

We split the word “tokenization” into “token” and “ization” at the beginning of this article, with “token” being the root word and “ization” being the second subword, marked as additional information for the root word. The subword splitting will help the model learn that words with the same root as “token,” such as “tokens” and “tokenizing,” have similar meanings. It will also help the model understand that the words “tokenization” and “modernization” are derived from different root words but share the suffix “ization” and are used in the same syntactic situations. Another example is the word “surprisingly.” Because these stand-alone subwords appear more frequently, subword-based tokenization will split it into “surprising” and “ly.”

To indicate which subword begins a word and which subword continues it, subword-based tokenization algorithms typically use a special symbol. For example, the word “tokenization” can be split into “token” and “##ization,” indicating that “token” is the start of the word and “##ization” is its continuation.

Different NLP models use different special symbols to mark subwords. BERT uses “##” for every subword after the first. Please keep in mind that special symbols can also be added to the beginning of a word.
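The splitting with “##” can be sketched as a greedy longest-match-first lookup against a fixed vocabulary, which is the general idea behind WordPiece-style tokenization at inference time. The toy vocabulary, the function name, and the “[UNK]” fallback below are assumptions for illustration; this is not the actual BERT implementation or vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split one word into the longest subwords found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation marker
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # nothing fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary.
toy_vocab = {"token", "boy", "modern", "surprising",
             "##ization", "##izing", "##s", "##ly"}

for word in ["boy", "boys", "tokenization", "modernization", "surprisingly"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
# boy -> ['boy']
# boys -> ['boy', '##s']
# tokenization -> ['token', '##ization']
# modernization -> ['modern', '##ization']
# surprisingly -> ['surprising', '##ly']
```

Note that “modernization” is handled even though it is not in the vocabulary as a whole word, because both of its pieces are.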

Most models that have achieved state-of-the-art results in the English language use some form of subword tokenization. WordPiece, which is used by BERT and DistilBERT, Unigram, which is used by XLNet and ALBERT, and Byte-Pair Encoding, which is used by GPT-2 and RoBERTa, are a few common subword-based tokenization algorithms.

Subword-based tokenization enables the model to keep a reasonably sized vocabulary while still learning meaningful context-independent representations. A model can even process a word it has never seen before, because the word can be decomposed into known subwords.

So, we have seen how tokenization methods evolved over time to meet the changing needs of the NLP domain and to offer better solutions to its problems.
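To give a feel for how such a subword vocabulary can be learnt, here is a heavily simplified sketch of the merge-learning loop behind Byte-Pair Encoding: start from characters and repeatedly merge the most frequent adjacent pair. The word frequencies and function names are made up for illustration, and production implementations add details (such as word-boundary markers or byte-level symbols) that are omitted here.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols by one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merges from a {word: frequency} dictionary."""
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = {merge_pair(s, best): f for s, f in corpus.items()}
    return merges, corpus

# Hypothetical word frequencies.
merges, corpus = learn_bpe({"boy": 5, "boys": 3, "toy": 2}, num_merges=2)
print(merges)  # [('o', 'y'), ('b', 'oy')]
print(corpus)  # {('boy',): 5, ('boy', 's'): 3, ('t', 'oy'): 2}
```

After two merges the frequent word “boy” is a single token, while “boys” is represented as “boy” plus “s”, matching the two principles listed earlier.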
