A Primer on NLP for Clinicians: Part 1 - Tokenization

In this first part we talk about tokenization, the process we use to convert text into a form a machine can read.

Chris McMaster

What’s this and why should I care?

If you’re reading this, I presume you have at least a cursory interest in natural language processing (NLP). Either that or I have made you read this. Regardless, this primer on NLP for clinicians will show you how modern NLP works and how it can be applied in clinical medicine. The reasons why are almost too many to mention. Aside from the interesting knowledge we might gain, NLP is an extremely practical tool. We spend our lives reading text — journals, articles, progress notes, you name it, we’re constantly reading to absorb new information. So much of this can be simplified, streamlined, automated and improved with NLP. From conducting audits to summarising the latest literature, NLP can make our lives easier. Hopefully this sparks some ideas and becomes the start of that path for you. We will start with tokenization.


Let’s pretend we have a large number of clinical notes (we will refer to this as a corpus) and we want to represent those in such a way that they can be used by a machine learning model. We can think about this by examining the first snippet from our corpus (there are 10 other brief notes that I won’t show):

('HOPC: 2 weeks of progressive exertional dysponea, now SOB at rest w/ maximum '
 'ET ~10m. Periphral oedema w/ pitting above knees bilaterally.')

One natural way to turn this text into data might be to split it every time we encounter a space, essentially splitting into words. For example:

['HOPC:', '2', 'weeks', 'of', 'progressive', 'exertional', 'dysponea,', 'now',
'SOB', 'at', 'rest', 'w/', 'maximum', 'ET', '~10m.', 'Periphral', 'oedema',
'w/', 'pitting', 'above', 'knees', 'bilaterally.']

Each word is a data point; we could even convert this into a table and feed it in that way. Some cleverer ideas involve counting the frequency of each word and using those counts as features (see tf-idf), although important information is lost in this process, namely the position of each word.
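Counting word frequencies across a corpus is straightforward. A minimal sketch, using a made-up two-note corpus and the standard library:

```python
from collections import Counter

# A toy "corpus" of two short notes (hypothetical examples)
notes = [
    "SOB at rest w/ pitting oedema",
    "oedema improving, no SOB at rest",
]

# Split each note on whitespace and count word frequencies across the corpus
counts = Counter(word for note in notes for word in note.split())

print(counts.most_common(3))
```

Approaches like tf-idf start from exactly these counts, reweighting them by how rare each word is across documents.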

Our simple tokenizer has problems, particularly as the number of documents begins to explode and therefore the number of unique words becomes unreasonably large. Additionally, each misspelling will be its own unique token, which might be okay for very common typos, but we would ideally like some way of coping with small variations in spelling that might occur less frequently.

Subword tokenization refers to a group of tokenization techniques that allow us to optimally tokenize a corpus of text with a limited vocabulary (where we get to choose how large that is). We’re going to explain this using the simplest subword tokenizer, byte pair encoding (BPE).

Byte Pair Encoding

We will get into the details of Byte Pair Encoding (BPE) later. For now, if you’re following along with the code, we will import our libraries and instantiate a tokenizer with a BPE model.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, Lowercase, StripAccents, Sequence
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE())

Before we can train a BPE algorithm, we need to clean up our text. We will do this in two steps:

  1. Normalization

  2. Pre-tokenization


Normalization is the process of removing and replacing characters in text to reduce it to a standard set of characters. This often involves several steps, such as unicode normalization, lowercasing and removing accents.

Here, we create a normalization schema by stringing together NFD (a unicode normalization standard, we don’t need to know the details), the Lowercase function and the StripAccents function to remove accents.

tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])

This is what our string looks like after normalization:

('hopc: 2 weeks of progressive exertional dysponea, now sob at rest w/ maximum '
 'et ~10m. periphral oedema w/ pitting above knees bilaterally.')
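Under the hood, accent stripping works by decomposing each character and discarding the combining marks. A plain-Python sketch of the NFD, lowercase, strip-accents pipeline, using only the standard library:

```python
import unicodedata

def normalize(text: str) -> str:
    """Mimic the NFD -> strip-accents -> lowercase pipeline in plain Python."""
    # NFD decomposes characters like 'o-umlaut' into 'o' plus a combining mark
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (unicode category 'Mn'), then lowercase
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return stripped.lower()

print(normalize("Sjögren's syndrome"))  # sjogren's syndrome
```

This matters clinically: "Sjögren" and "Sjogren" should end up as the same token.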


We can now pre-tokenize our text, that is, split it into parts. These parts are the largest units we can tokenize, with the rule that each part is composed of 1 or more tokens, but no token can span across parts. If you think about it, what we’re really describing here are words: we’re going to split our text into words. We will do this by splitting our text every time we encounter whitespace, although you can also include rules about punctuation.

tokenizer.pre_tokenizer = Whitespace()

So here our text becomes:

('hopc', ':', '2', 'weeks', 'of', 'progressive', 'exertional', 'dysponea', ',',
'now', 'sob', 'at', 'rest', 'w', '/', 'maximum', 'et', '~', '10m', '.',
'periphral', 'oedema', 'w', '/', 'pitting', 'above', 'knees', 'bilaterally',
'.')
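Notice that punctuation has been split off as well: the Whitespace pre-tokenizer splits on runs of word characters or runs of punctuation, not just spaces. A plain-Python sketch of that behaviour, using the regular expression `\w+|[^\w\s]+`:

```python
import re

def pre_tokenize(text: str) -> list[str]:
    """Split text into words and punctuation runs, mimicking Whitespace."""
    # \w+ matches a run of word characters, [^\w\s]+ a run of punctuation
    return re.findall(r"\w+|[^\w\s]+", text)

print(pre_tokenize("hopc: 2 weeks of sob w/ maximum et ~10m."))
```

This is why "w/" becomes the two parts 'w' and '/' in the output above.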

We’re now ready to train a tokenizer, so let’s try to understand the algorithm.

After pre-tokenization, we count the number of instances of each word; this just helps with calculations down the line. We then begin with a candidate vocabulary consisting of every character that appears at least once in our corpus. Next we look at token pairs, that is, two consecutive tokens occurring within a word, and count the number of times each pair appears. The token pair that appears most frequently gets merged into a new token and added to our vocabulary (if two pairs such as at and on share the highest frequency, each merge still happens in a separate step, condensed to one step here for brevity). This process then repeats, with the newly created tokens also counted at the next iteration.

This process continues until we either run out of new tokens to create (i.e. each word is its own token), or we reach our maximum vocabulary size.
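The merge loop described above can be sketched in a few lines of plain Python. This is a toy trainer for illustration (the tokenizers library implements an optimized version), and the example corpus frequencies are made up:

```python
from collections import Counter

def bpe_merges(words: Counter, num_merges: int) -> list:
    """Toy BPE trainer: `words` maps a tuple of symbols to its corpus frequency."""
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge wherever the best pair occurs
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

# Tiny made-up corpus: ('a', 't') is the most frequent pair, so it merges first
corpus = Counter({("c", "a", "t"): 5, ("m", "a", "t"): 4, ("m", "e", "t"): 2})
print(bpe_merges(corpus, 2))  # [('a', 't'), ('c', 'at')]
```

Each merge produces one new vocabulary entry, so the vocabulary size directly controls how many iterations of this loop we run.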

We can now see what this looks like with our corpus. Our corpus is very small, so we will use a vocabulary of size 150, although for an appropriately large corpus we usually use a vocabulary in the tens of thousands.

# `notes` is our corpus: a list of strings, one per clinical note
trainer = BpeTrainer(vocab_size=150)
tokenizer.train_from_iterator(notes, trainer)

We can look at the first 10 tokens in the vocabulary:

['th', 'fee', 'd', '.', 'p', 'or', 'to', 'is', 'st', 'ys']

As you can see, these aren’t words. They are parts of words, carefully chosen by our algorithm to tokenize the most frequently occurring parts within our vocabulary limit.

Finally, we can see what our text looks like after tokenization:

['h', 'o', 'p', 'c', ':', '2', 'week', 's', 'of', 'pro', 'gres', 'si', 've',
'exer', 'tion', 'al', 'dys', 'pone', 'a', ',', 'now', 'sob', 'at', 'rest', 'w',
'/', 'ma', 'x', 'im', 'u', 'm', 'e', 't', '~', '1', '0m', '.', 'per', 'ip', 'h',
'ral', 'oedema', 'w', '/', 'pit', 'ting', 'abo', 've', 'kn', 'ees', 'b', 'i',
'l', 'at', 'erally', '.']

Actually, this isn’t what the machine learning model sees. Each token in our vocabulary is represented by a number, so this is what it will see:

[20, 26, 27, 15, 12, 7, 103, 30, 55, 80, 118, 75, 49, 100, 96, 43, 110, 134, 13,
1, 98, 99, 61, 101, 34, 4, 62, 35, 120, 32, 24, 17, 31, 37, 6, 105, 3, 130, 121,
20, 135, 102, 34, 4, 132, 145, 108, 49, 122, 111, 14, 21, 23, 61, 97, 3]
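In other words, once trained, the tokenizer is essentially a lookup table between tokens and integer ids (plus the inverse table for decoding). A minimal sketch, using the first few token-id pairs from the output above:

```python
# A tiny slice of the trained vocabulary: token -> integer id
vocab = {"h": 20, "o": 26, "p": 27, "c": 15, ":": 12}
inverse = {i: t for t, i in vocab.items()}  # id -> token, for decoding

def encode(tokens):
    """Map a list of tokens to the integer ids the model actually sees."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Map ids back to tokens, e.g. to inspect a model's output."""
    return [inverse[i] for i in ids]

ids = encode(["h", "o", "p", "c", ":"])
print(ids)          # [20, 26, 27, 15, 12]
print(decode(ids))  # ['h', 'o', 'p', 'c', ':']
```

With the tokenizers library, this round trip is handled for you: encoding a string returns an object carrying both the tokens and their ids.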

Next Steps

That’s tokenization at a basic level. Next, we need to find a way to make these tokens meaningful with embeddings.


For attribution, please cite this work as

McMaster (2022, Feb. 12). chrismcmaster.com: A Primer on NLP for Clinicians: Part 1 - Tokenization. Retrieved from https://chrismcmaster.com/posts/2022-02-12-how-machine-learning-can-read-your-notes/

BibTeX citation

@misc{mcmaster2022tokenization,
  author = {McMaster, Chris},
  title = {chrismcmaster.com: A Primer on NLP for Clinicians: Part 1 - Tokenization},
  url = {https://chrismcmaster.com/posts/2022-02-12-how-machine-learning-can-read-your-notes/},
  year = {2022}
}