
Transformer Models


GitHub Repo - https://github.com/huggingface/notebooks

Citing: Hugging Face Course, 2022


Understanding NLP and LLMs
NLP (Natural Language Processing)

The broader field that focuses on enabling computers to understand, interpret, and generate human language.

NLP encompasses many techniques such as sentiment analysis, named entity recognition, and machine translation.
LLMs (Large Language Models)

A powerful subset of NLP models characterized by massive size, extensive training data, and the ability to perform a wide range of language tasks with minimal task-specific training.

Models like the Llama, GPT, or Claude series are examples of LLMs.


Important Limitations for LLMs:

  • Hallucinations: They can generate incorrect information confidently

  • Lack of true understanding: They lack a true understanding of the world and operate purely on statistical patterns

  • Bias: They may reproduce biases present in their training data or inputs

  • Context windows: They have limited context windows

  • Computational resources: They require significant computational resources



Transformer Types

Zero-shot Classification - allows you to specify which labels to use for a classification, so you don't have to rely on the labels of the pre-trained model.
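As a minimal sketch, this task (like the others below) can be tried through the pipeline() function of the 🤗 Transformers library; the example text and candidate labels here are purely illustrative, and the default checkpoint is whatever the library currently ships for the task.

```python
from transformers import pipeline

# Downloads a default zero-shot classification checkpoint chosen by the library
classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
print(result)  # labels ranked with their scores
```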


Text-generation - provide a prompt and the model will auto-complete it by generating the remaining text.
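A quick sketch with the text-generation pipeline; the prompt and the generation settings are arbitrary examples.

```python
from transformers import pipeline

generator = pipeline("text-generation")
# max_length and num_return_sequences are optional knobs shown for illustration
print(generator("In this course, we will teach you how to",
                max_length=30, num_return_sequences=2))
```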


Mask filling - the model fills in a missing word in a sentence, where the blank is marked by a special mask token.
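A sketch with the fill-mask pipeline; note that the mask token depends on the checkpoint (the library's default here expects <mask>).

```python
from transformers import pipeline

unmasker = pipeline("fill-mask")
# top_k controls how many candidate completions are returned
print(unmasker("This course will teach you all about <mask> models.", top_k=2))
```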


Named entity recognition - a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.
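A sketch with the ner pipeline; grouped_entities=True merges sub-word pieces that belong to the same entity, and the example sentence is illustrative.

```python
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
print(ner("My name is Sylvain and I work at Hugging Face in Brooklyn."))
```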


Question answering - model answers questions using information from a given context.
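A sketch with the question-answering pipeline, which extracts the answer span from the supplied context rather than generating new text.

```python
from transformers import pipeline

question_answerer = pipeline("question-answering")
print(question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
))
```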


Summarization - the task of reducing a text into a shorter text while keeping the important aspects referenced in it.
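A sketch with the summarization pipeline; the input text is a placeholder, and min_length/max_length are optional length constraints.

```python
from transformers import pipeline

summarizer = pipeline("summarization")
long_text = "Replace this placeholder with any article or report you want condensed. " * 10
print(summarizer(long_text, max_length=60, min_length=20))
```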


Translation - the model translates text from one language to another.
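A sketch with the translation pipeline; the Helsinki-NLP/opus-mt-fr-en checkpoint is one example choice for French-to-English, not the only option.

```python
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
print(translator("Ce cours est produit par Hugging Face."))
```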


Image classification - assigns labels to an image, each with a probability score.


Automatic speech recognition - transcribes spoken audio into text.
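Both of these can also be sketched with pipelines; the file paths below are hypothetical and would need to point at real local files (or a URL, in the image case).

```python
from transformers import pipeline

# Image classification: returns labels with probability scores
image_classifier = pipeline("image-classification")
print(image_classifier("path/to/photo.jpg"))  # hypothetical image path

# Automatic speech recognition: transcribes spoken audio into text
transcriber = pipeline("automatic-speech-recognition")
print(transcriber("path/to/recording.wav"))  # hypothetical audio path
```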



Transformer History

The Transformer architecture was introduced in June 2017. The first pretrained Transformer model, GPT, followed in 2018 and was fine-tuned on various NLP tasks.


Throughout the years, multiple transformer models have been released (BERT, T5, etc.) that have been trained as language models. This means they have been trained on large amounts of raw text in a self-supervised fashion, meaning humans are not needed to label the data.


The general strategy for achieving better performance has been to increase model size and the amount of data models are pretrained on (DistilBERT being a notable exception). Note that this becomes increasingly expensive in both training time and compute resources.


[Figures taken from the Hugging Face Course]

Pretraining

Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge. It requires a very large corpus of data, and training can take up to several weeks.



Fine-tuning

This is done after a model has been pretrained. To perform fine-tuning, you need to:

  1. Acquire a pretrained language model

  2. Perform additional training with a dataset specific to your task

As a result, the amount of time and resources needed to get good results is much lower (see the sketch below).
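A minimal fine-tuning sketch using the Trainer API, assuming a sentence-pair classification task; the bert-base-uncased checkpoint and the GLUE MRPC dataset are example choices, not requirements.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# 1. Acquire a pretrained language model (example checkpoint)
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# 2. Perform additional training on a task-specific dataset (GLUE MRPC here)
raw_datasets = load_dataset("glue", "mrpc")

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

tokenized = raw_datasets.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments("test-trainer"),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```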



General Transformer architecture

Encoder: receives an input and builds a representation of its features.

  • bi-directional

  • self-attention


Decoder: the decoder uses the encoder's representation (features) along with other inputs to generate a target sequence.

  • uni-directional

  • auto-regressive

  • masked self-attention


Encoder-only models - good for sentence classification and named entity recognition

Decoder-only models - good for generative tasks such as text generation

Encoder-decoder models (or sequence-to-sequence models) - good for generative tasks that require an input, such as translation or summarization
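To make the three variants concrete, here is a sketch that loads one example checkpoint of each kind with the Auto classes; the checkpoint names are just common examples of each architecture.

```python
from transformers import AutoModel

encoder_only = AutoModel.from_pretrained("bert-base-uncased")  # encoder-only (BERT)
decoder_only = AutoModel.from_pretrained("gpt2")               # decoder-only (GPT-2)
encoder_decoder = AutoModel.from_pretrained("t5-small")        # encoder-decoder (T5)

for name, model in [("BERT", encoder_only), ("GPT-2", decoder_only), ("T5", encoder_decoder)]:
    n_params = sum(p.numel() for p in model.parameters())
    print(name, model.config.model_type, f"{n_params:,} parameters")
```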


Attention layers

Attention layers tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when building the representation of each word. This applies to any task associated with natural language.
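One way to peek at these weights is to ask a model to return its attention tensors; this sketch assumes a BERT-style encoder checkpoint and PyTorch, with an arbitrary example sentence.

```python
from transformers import AutoModel, AutoTokenizer

checkpoint = "bert-base-uncased"  # example encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, output_attentions=True)

inputs = tokenizer("The animal didn't cross the street because it was too tired.",
                   return_tensors="pt")
outputs = model(**inputs)

# One attention tensor per layer, shaped (batch, num_heads, seq_len, seq_len)
print(len(outputs.attentions), outputs.attentions[0].shape)
```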




