About the Recipe

Ingredients
Preparation
GitHub Repo - https://github.com/huggingface/notebooks
Citing: Hugging Face Course, 2022
Understanding NLP and LLMs
Important Limitations for LLMs:
Hallucinations: They can generate incorrect information confidently
Lack of true understanding: They lack true understanding of the world and operate purely on statistical patterns
Bias: They may reproduce biases present in their training data or inputs
Context windows: They have limited context windows
Computational resources: They require significant computational resources
Transformer Tasks
Zero-shot Classification - allows you to specify which labels to use for a classification, so you don't have to rely on the labels of the pre-trained model (a pipeline sketch for these tasks follows this list).
Text-generation - provide a prompt and the model will auto-complete it by generating the remaining text.
Mask filling - the model fills in a missing word in a sentence, marked by a special mask token.
Named entity recognition - the model finds which parts of the input text correspond to entities such as persons, locations, or organizations.
Question answering - model answers questions using information from a given context.
Summarization - reduces a text into a shorter text while keeping the important aspects referenced in the text.
Translation - the model translates text from one language to another.
Image classification - assigns a label to an image along with a probability score.
Audio classification - assigns a label to an audio clip (as opposed to speech recognition, which transcribes audio into text).
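All of the tasks above are exposed through the pipeline() function in the transformers library. Below is a minimal sketch for a few of them, relying on the library's default checkpoints (which models get loaded, and the exact outputs, depend on your transformers version); the <mask> token in the fill-mask example assumes a RoBERTa-style default model.

```python
from transformers import pipeline

# Zero-shot classification: supply your own candidate labels
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

# Text generation: the model auto-completes the prompt
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

# Mask filling: the model proposes words for the masked position
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

# Named entity recognition: groups tokens belonging to the same entity
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
```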
Transformer History
The Transformer architecture was introduced in 2017. GPT, released in 2018, was the first pretrained Transformer model and was fine-tuned on various NLP tasks.
Throughout the years, multiple transformer models have been released (BERT, T5, etc.) that have been trained as language models. This means they have been trained on large amounts of raw text in a self-supervised fashion, meaning humans are not needed to label the data.
The general strategy to achieve better performance has been to increase model size and the amount of data models are pretrained on (with the exception of DistilBERT). Note that this becomes expensive in both time and compute resources.

Pretraining
Pretraining is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge. It requires a very large corpus of data, so training can take up to several weeks.
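As a small sketch of what "randomly initialized" means in practice with the transformers library (using BERT purely as an example architecture): building a model from a configuration alone gives untrained weights, whereas from_pretrained loads weights that already went through pretraining.

```python
from transformers import BertConfig, BertModel

config = BertConfig()      # default BERT architecture hyperparameters
model = BertModel(config)  # weights are randomly initialized -> would need full pretraining

# Loading pretrained weights instead skips that expensive step
pretrained = BertModel.from_pretrained("bert-base-cased")
```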

Fine-tuning
This is done after a model has been pretrained. To perform fine-tuning, you:
Acquire a pretrained language model
Perform additional training with a dataset specific to your task
As a result, the amount of time and resources needed to get good results is much lower (see the sketch below).
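A minimal fine-tuning sketch of those two steps, assuming the GLUE MRPC paraphrase dataset and the bert-base-uncased checkpoint purely as stand-ins for your own task and model:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-uncased"
raw_datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

tokenized = raw_datasets.map(tokenize, batched=True)

# Step 1: acquire a pretrained language model (a fresh classification head is added)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Step 2: perform additional training on the task-specific dataset
args = TrainingArguments("test-trainer")
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```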

General Transformer architecture
Encoder: receives an input and builds a representation of its features.
bi-directional
self-attention
Decoder: the decoder uses the encoder's representation (features) along with the other inputs to generate a target sequence.
uni-directional
auto-regressive
masked self-attention
Encoder-only models - good for sentence classification and named entity recognition
Decoder-only models - good for generative tasks such as text generation
Encoder-decoder models (or sequence-to-sequence models) - good for generative tasks that require an input, such as translation or summarization (example checkpoints for each family are sketched after this list).
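For reference, one representative checkpoint per family (these particular model names are just examples, not the only options):

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoModelForSeq2SeqLM

# Encoder-only (bi-directional self-attention): sentence classification, NER
encoder_only = AutoModel.from_pretrained("bert-base-uncased")

# Decoder-only (auto-regressive, masked self-attention): text generation
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")

# Encoder-decoder (sequence-to-sequence): translation, summarization
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```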
Attention layers
Attention layers tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when building the representation of each word. This applies to any task associated with natural language.
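A toy sketch of the idea behind self-attention (scaled dot-product attention with queries = keys = values, no learned projections, no multiple heads, no masking), just to show how each word's output becomes a weighted mix of the other words:

```python
import torch
import torch.nn.functional as F

def self_attention(x):
    # x: (sequence_length, hidden_size); queries = keys = values = x in this toy version
    d = x.size(-1)
    scores = x @ x.transpose(0, 1) / d**0.5  # how strongly each word attends to each other word
    weights = F.softmax(scores, dim=-1)      # attention weights sum to 1 for each word
    return weights @ x                       # each output is a weighted mix of all words

x = torch.randn(5, 16)   # 5 "words", hidden size 16
out = self_attention(x)  # same shape: (5, 16)
```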
