Breaking Down Transformer Models for Intelligent Language Systems

Sudarsanam Bharath
9 min read · May 21, 2024

Hey, my friend! We’re in an exciting time where generative AI has changed the way we interact with digital devices. Isn’t it cool how tools like copilots and chatbots have become common, changing how we work? Looking back, it’s impressive to see how far AI has come. Remember the ’90s? Back then, the focus was on creating smart systems. Now, we have powerful tools like Copilot+ PCs, Gemini, and ChatGPT showing AI’s potential. We’ve moved from simple systems to complex language models, which shows our continuous interest and innovation in AI.

So what is generative AI? You chat with the interface in natural language, and it replies in natural language, code, or pictures. Large language models (LLMs) are the engines behind these applications. They are specialized machine learning models used for language tasks, such as:

  • Understanding or grouping text
  • Summarizing long text
  • Comparing different texts for similarity
  • Creating new natural language
  • Translating text between languages

But have you ever wondered how these AI models work, or what makes these large language models give correct results? Let’s take a look at the transformers behind them!

What is a transformer?

The transformer we all know

The moment we hear the word “transformers,” cinema lovers will likely find themselves thinking about Michael Bay’s Transformers, where various super vehicles combine to form a giant and powerful robot. Similarly, large language models possess a transformer-like architecture where the transformer model is made up of several parts, such as encoder blocks, decoder blocks, etc.

The transformer we don’t know

Transformer models are trained on large volumes of text, mostly from the internet, with the help of powerful GPUs (graphics processing units). This training enables them to represent the semantic relationships between words and to use those relationships to determine probable sequences of text that make sense. The result is responses that are difficult to distinguish from human ones.

Numbers for the Llama 2 70B parameter model

Transformer Blocks

A transformer model consists of two components or blocks:

  • An encoder block, which creates semantic representations of the training vocabulary.
  • A decoder block, which generates new language sequences.

While the architecture remains the same, specific implementations vary greatly depending on the use case. A good example is Google’s Bidirectional Encoder Representations from Transformers (BERT) model, which supports their search engine and uses only the encoder block. The encoder block is primarily responsible for taking in information and processing it, which matches a search engine’s job of processing and understanding queries. Conversely, OpenAI’s Generative Pre-trained Transformer (GPT) model uses only the decoder block. The decoder block’s primary function is to generate output based on the processed information, which matches GPT’s goal of generating human-like text. These variations show how adaptable the transformer model is to different applications.

While a comprehensive explanation of all aspects of Transformer models is beyond this document’s scope, understanding these key elements can provide insight into how they support generative AI.

Process of generating a response with a transformer

Tokenization

Let’s start our journey into training a transformer model! First things first, we need to break down the training text into tokens. Think of tokens as unique text values. To keep things simple, let’s consider each unique word in the training text as a token. But remember, tokens can also represent parts of words or combinations of words and punctuation.

Let’s use a sentence to illustrate this:

I heard rcb won the knockout match against csk

We can tokenize this sentence by identifying each unique word and giving it its own token ID:

  • I (1)
  • heard (2)
  • rcb (3)
  • won (4)
  • the (5)
  • knockout (6)
  • match (7)
  • against (8)
  • csk (9)

Now, we can represent the sentence with these tokens: [1 2 3 4 5 6 7 8 9]. In the same way, the sentence “I heard rcb won against csk” can be represented as [1 2 3 4 8 9].

As we keep training the model, each new token we discover in the training text gets added to the vocabulary with its own unique token ID:

  • taylor (10)
  • swift (11)
  • and so on…

With a large training text set, our vocabulary could end up with thousands of these tokens.
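
The word-level scheme above can be sketched in a few lines of Python. This is only an illustration of the simplified approach in this post: the function names `build_vocab` and `encode` are made up for the example, and real tokenizers usually split text into sub-word pieces rather than whole words.

```python
def build_vocab(text, vocab=None):
    """Give each previously unseen word the next free token ID (starting at 1)."""
    vocab = {} if vocab is None else vocab
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1
    return vocab


def encode(text, vocab):
    """Replace each word with its token ID."""
    return [vocab[word] for word in text.split()]


vocab = build_vocab("I heard rcb won the knockout match against csk")
print(encode("I heard rcb won the knockout match against csk", vocab))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(encode("I heard rcb won against csk", vocab))
# [1, 2, 3, 4, 8, 9]

# Words discovered later in the training text extend the same vocabulary.
build_vocab("taylor swift", vocab)
print(vocab["taylor"], vocab["swift"])  # 10 11
```

Note that `encode` would fail on a word that is not in the vocabulary yet, which is one reason real tokenizers fall back to smaller pieces of words.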

Embeddings

Tokens are like individual words or parts of words. We could simply give each token a unique ID, but that doesn’t tell us anything about what the token means or how it relates to other tokens. Instead, we give each token a vector, which is just a list of numbers, like [10, 3, 1]. Each number in this list tells us something about the token’s meaning. The specific meanings of these numbers are learned during training, based on how often words are used together or in similar ways.

It’s helpful to think of these vectors like coordinates on a map, where each token has its own specific “address.” Tokens that are closer together on the map have similar meanings.

For example, consider these tokens and their vectors:

  • “dog”: [10,3,2]
  • “bark”: [10,2,2]
  • “cat”: [10,3,1]
  • “meow”: [10,2,1]
  • “skateboard”: [3,3,1]

If we plotted these vectors on a 3D chart, tokens with similar meanings (“dog” and “bark”, “cat” and “meow”) would be closer together, and unrelated tokens (“skateboard”) would be further away.

Note: This is a simple example model in which each embedding has only three dimensions. Real LLMs have many more dimensions.

Tokens in the embedding space are placed based on how closely related they are. For instance, “dog” is near “cat” and “bark”, while “cat” and “bark” are near “meow”. “Skateboard” is farther from these tokens.
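
To make "closer together" concrete, here is a small sketch that measures the distance between the example vectors above. Using straight-line (Euclidean) distance is an assumption made for this illustration; cosine similarity is another common choice. The helper name `nearest` is made up for the example.

```python
import math

# The example vectors from this post.
embeddings = {
    "dog": [10, 3, 2],
    "bark": [10, 2, 2],
    "cat": [10, 3, 1],
    "meow": [10, 2, 1],
    "skateboard": [3, 3, 1],
}


def nearest(token):
    """Return the other tokens, sorted from closest to farthest."""
    others = [t for t in embeddings if t != token]
    return sorted(others, key=lambda t: math.dist(embeddings[token], embeddings[t]))


print(math.dist(embeddings["dog"], embeddings["cat"]))  # 1.0
print(math.dist(embeddings["dog"], embeddings["skateboard"]))  # about 7.07
print(nearest("dog"))  # ['bark', 'cat', 'meow', 'skateboard']
```

"dog" sits one unit away from both "bark" and "cat", while "skateboard" is about seven units away, which matches the picture described above.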

There are several methods to find suitable embeddings for tokens, such as the Word2Vec language modeling algorithm or the encoder block in a transformer model.

Attention

The encoder and decoder in a transformer model are the parts that make up the brain of the model. One key feature they both use is the attention layer. This layer looks at a sentence and tries to decide how much each word in the sentence is related to the others.

In the encoder, each word is looked at in its context, which means where it sits in the sentence and which words are around it. Depending on this context, the word can have different meanings. For example, “bark” in “the bark of a tree” is different from “bark” in “I heard a dog bark”.

In the decoder, the attention layer tries to guess the next word in a sentence. It looks at the words that have already been processed and decides which are the most helpful for guessing the next word. For example, in the sentence “I heard a dog,” the words “heard” and “dog” might be quite important when guessing the next word:

I heard a dog [bark]

The attention layer works with numbers, not the actual words. It starts with a sequence of numbers representing the sentence to be completed. A layer called positional encoding adds a number to each word to show its position in the sentence. For example:

  • [1,5,6,2] (I)
  • [2,9,3,1] (heard)
  • [3,1,1,2] (a)
  • [4,10,3,2] (dog)
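
The list above can be reproduced with a short sketch. It assumes that the first number in each vector is the word’s position and that the remaining three numbers are its embedding, which is one reading of the example. The name `add_positions` is made up for this illustration.

```python
# Assumed three-number embeddings, taken from the vectors listed above.
token_embeddings = {
    "I": [5, 6, 2],
    "heard": [9, 3, 1],
    "a": [1, 1, 2],
    "dog": [10, 3, 2],
}


def add_positions(tokens):
    """Put each token's 1-based position in front of its embedding."""
    return [
        [position] + token_embeddings[token]
        for position, token in enumerate(tokens, start=1)
    ]


sentence = ["I", "heard", "a", "dog"]
for token, vector in zip(sentence, add_positions(sentence)):
    print(vector, token)
# [1, 5, 6, 2] I
# [2, 9, 3, 1] heard
# [3, 1, 1, 2] a
# [4, 10, 3, 2] dog
```

This mirrors the simplified picture in this post. The original Transformer design does not prepend a counter; it adds a vector of sine and cosine values to each embedding, but the purpose is the same: telling the model where each word sits.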

During training, the model tries to predict the final word based on the previous ones. The attention layer gives each word a score that shows how important it is, and the scores are used to guess the next word. This is done several times using different parts of each word’s vector, a technique called multi-head attention, to get several sets of scores. Then, a neural network evaluates all candidate words to decide the most likely next one. This process is repeated for each word in the sentence, building the output one word at a time.
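
Here is a minimal sketch of how such importance scores can be computed, using the common "scaled dot-product" form of attention with a single head. The vectors are made up for the example, and a real model would first pass them through learned weight matrices, which this sketch leaves out.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    context = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, context


# Hypothetical vectors for "I heard a dog" (not taken from a real model).
sequence = [[0.1, 0.5, 0.6], [0.9, 0.3, 0.1], [0.1, 0.1, 0.2], [1.0, 0.3, 0.2]]

# The last word ("dog") looks back over the whole sequence, itself included.
weights, context = attend(sequence[-1], sequence, sequence)
for token, weight in zip(["I", "heard", "a", "dog"], weights):
    print(f"{token}: {weight:.2f}")
```

With these made-up vectors, "dog" and "heard" receive the largest weights, so they contribute the most to `context`, the blended vector used to guess the next word.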

Here is how this process works in a simplified way. Real calculations are more complex, but these are the basic steps:

  1. A series of tokens, or units of information, is put into the attention layer. Each token is represented as a series of numbers.
  2. The goal is to guess the next token in the series, which will also be a series of numbers that matches a token in the model’s vocabulary.
  3. The attention layer looks at the series so far and gives a weight to each token based on how much it might influence the next token.
  4. These weights are used to compute a new vector for the next token, along with an attention score. Multi-head attention uses different parts of the tokens to come up with multiple possible tokens.
  5. A fully connected neural network uses these scores to predict the most likely token from all possible tokens.
  6. The predicted output is added to the series so far, and then used as the input for the next step.
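
Steps 5 and 6 form a loop: predict a token, add it to the series, and repeat. The sketch below shows only that loop. The table `NEXT_TOKEN_SCORES` is a hypothetical stand-in for a trained model and looks only at the last token, whereas a real transformer computes its scores from the whole series using attention. The `<end>` marker is also an assumption made for the example.

```python
# Hypothetical stand-in for a trained model: scores for the next token,
# keyed by the last token seen so far.
NEXT_TOKEN_SCORES = {
    "I": {"heard": 0.9, "dog": 0.1},
    "heard": {"a": 0.8, "bark": 0.2},
    "a": {"dog": 0.7, "bark": 0.3},
    "dog": {"bark": 0.9, "<end>": 0.1},
    "bark": {"<end>": 1.0},
}


def predict_next(tokens):
    """Return the highest-scoring candidate for the next token."""
    candidates = NEXT_TOKEN_SCORES[tokens[-1]]
    return max(candidates, key=candidates.get)


def generate(prompt, max_new_tokens=5):
    """Add one predicted token at a time until <end> or the limit is reached."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


print(generate(["I", "heard", "a", "dog"]))
# ['I', 'heard', 'a', 'dog', 'bark']
```

Always picking the single most likely token is the simplest strategy. Chat applications often sample from the likely candidates instead, which is why the same prompt can produce different replies.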

Limitations and Future of Transformers

Despite their impressive abilities, transformer models are not without limitations. One of the main challenges is their inability to understand or interpret the world in the way humans do. While they can generate coherent and contextually relevant sentences, they don’t truly comprehend the information they process. Additionally, these models require massive amounts of computational resources for training, making them less accessible for smaller entities. In the future, we might see innovations that make these models more efficient and accessible, as well as advancements that bring us closer to models that can comprehend information in a more human-like manner.

Put simply, a transformer model like GPT-4 (the model behind ChatGPT and Bing) works by taking in text (a prompt) and creating a grammatically correct reply (a completion). Think of it like a really good sentence maker. It doesn’t mean the model is “smart” or “knows” things. It just has a lot of words to use and can put them together in a way that makes sense. What makes GPT-4 really strong is the huge amount of data it has learned from (public and licensed data from the Internet) and its complex design. This lets it build replies based on how words relate to each other in that data. Often, its replies sound just like something a human would say.

Conclusion

Transformer models are the powerful brains behind today’s smart language systems. They use building blocks like tokens, embeddings, attention layers, and encoder-decoder blocks. These models are at the heart of AI’s progress.

These models let machines process and create text similar to humans, changing how we use technology. They’re used in chatbots like ChatGPT and in models like Google’s BERT, which supports search. They’re versatile, but they need a lot of computing power and don’t truly understand text.

The future of these models is promising. We’re working on making them more efficient and easier to use, which could spread AI’s benefits further. As we improve these technologies, we’re getting closer to AI that understands and interacts with the world like humans do.

To summarize, transformer models show our progress towards smarter, more intuitive AI. By understanding their parts and how they work, we can see how far we’ve come in AI and look forward to the changes they’ll keep making in our digital lives.

Thanks to everyone who read this blog. Your curiosity is greatly appreciated. Let’s explore the future of AI together!

Sudarsanam Bharath

Technology, Astrophysics, and Life. Data Science Intern@Lamarr #itookcs50