#tokenizer #nlp #machine-learning #text-preprocessing #language-modeling #sentiment-analysis #python #code-explanation #ai #llm

Created at 110323

# [Anonymous feedback](https://www.admonymous.co/louis030195)

# [[Epistemic status]] #shower-thought

Last modified date: 110323
Commit: 0

# Related

- [[Readwise/Articles/en.wikipedia.org - Bigram - Wikipedia]]
- [[Computing/Embeddings in the human mind]]
- [[Readwise/Articles/nlp.stanford.edu - Tokenization]]
- [[Business/Entrepreneurship/Pair-programming]]
- [[Readwise/Articles/arxiv.org - BEiT V2 Masked Image Modeling With Vector-Quantized Visual Tokenizers]]

# TODO

> [!TODO] TODO

# Tokenizer

A tokenizer is a software tool that breaks a piece of text into its constituent parts, called tokens. These tokens are usually words, subwords, or characters that form the building blocks of the text. Tokenizers are commonly used in natural language processing (NLP) and machine learning applications to preprocess text before analysis.

Tokenization typically involves normalizing the text (e.g., lowercasing), handling punctuation, and, for subword tokenizers, breaking words into smaller units such as roots, prefixes, or suffixes. This helps machines better capture the structure and meaning of text.

A simple tokenizer can be implemented like this:

```py
def simple_tokenizer(text):
    # Convert text to lowercase
    text = text.lower()
    # Split text into words on whitespace
    words = text.split()
    # Strip leading and trailing punctuation from each word
    words = [word.strip('.,!?:;()') for word in words]
    return words
```

This simple tokenizer takes in a piece of text and returns a list of tokens. It first converts the text to lowercase and then splits it into words on whitespace. It then strips leading and trailing punctuation from each word using the `strip()` method. Finally, it returns the list of tokens.

Here's an example of how to use this tokenizer:

```py
text = "This is an example sentence. It shows how a simple tokenizer works!"
tokens = simple_tokenizer(text)
print(tokens)
```

Output:

```
['this', 'is', 'an', 'example', 'sentence', 'it', 'shows', 'how', 'a', 'simple', 'tokenizer', 'works']
```

A more advanced one can be implemented from scratch for sentences:

```py
import re

def advanced_tokenizer(text):
    # Convert text to lowercase
    text = text.lower()
    # Split text into sentences on sentence-ending punctuation
    sentences = re.split(r'[.!?]+', text)
    # Replace any remaining unwanted characters with spaces and split each sentence into words
    words = [re.sub(r'[^a-zA-Z0-9\s]', ' ', sentence).split() for sentence in sentences]
    # Drop empty sentences left over from trailing punctuation
    return [sentence for sentence in words if sentence]
```

This advanced tokenizer takes in a piece of text and returns a list of sentences, each of which is a list of tokens (words). It first converts the text to lowercase, then splits it into sentences on sentence-ending punctuation (`.`, `!`, `?`) using `re.split()`. Each sentence then has any remaining unwanted characters (e.g., punctuation) replaced with spaces using `re.sub()` and is split into individual words with `split()`. Finally, it returns the list of sentences as a list of lists of tokens.

Here's an example of how to use this tokenizer:

```py
text = "This is an example sentence. It shows how an advanced tokenizer works! It can handle punctuation, numbers (like 123), and symbols %&$#@!"
sentences = advanced_tokenizer(text)
print(sentences)
```

Output:

```
[['this', 'is', 'an', 'example', 'sentence'], ['it', 'shows', 'how', 'an', 'advanced', 'tokenizer', 'works'], ['it', 'can', 'handle', 'punctuation', 'numbers', 'like', '123', 'and', 'symbols']]
```

## When do you want to train a Tokenizer?
> If a language model is not available in the language you are interested in, or if your corpus is very different from the one your language model was trained on, you will most likely want to retrain the model from scratch using a tokenizer adapted to your data. That will require training a new tokenizer on your dataset. But what exactly does that mean? When we first looked at tokenizers in [Chapter 2](https://huggingface.co/course/chapter2), we saw that most Transformer models use a _subword tokenization algorithm_. To identify which subwords are of interest and occur most frequently in the corpus at hand, the tokenizer needs to take a hard look at all the texts in the corpus — a process we call _training_. The exact rules that govern this training depend on the type of tokenizer used, and we'll go over the three main algorithms later in this chapter.
>
> ~ [[Huggingface]]
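To make the quoted training step concrete, here is a minimal sketch of training a subword (BPE) tokenizer with the Hugging Face `tokenizers` library. The tiny `corpus` list and the `vocab_size` of 1000 are placeholder values chosen for illustration, not anything from the course:

```py
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Placeholder corpus; in practice this would be an iterator over your own dataset
corpus = [
    "This is an example sentence.",
    "It shows how a subword tokenizer is trained on a corpus.",
    "Tokenizers learn which subwords occur most frequently.",
]

# Byte-Pair Encoding (BPE) is one of the main subword tokenization algorithms
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Training scans the corpus and learns the most frequent subword merges
trainer = trainers.BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# The trained tokenizer splits new text into the subwords it learned
print(tokenizer.encode("Tokenization example").tokens)
```

Note that 🤗 Transformers also exposes `train_new_from_iterator` on fast tokenizers, which retrains an existing tokenizer's algorithm on a new corpus while keeping its configuration; the sketch above instead builds a tokenizer from scratch.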