Embedding a word: It's the process of encoding an input (text, image, audio) into a vector (an array of numerical values) with hundreds of dimensions
This process is performed by an embeddings model
All the embeddings are stored in a vector database
Data (e.g., words) that have a semantic relationship have similar embeddings
It's hard to visualize multiple dimensions, so sometimes dimensionality reduction is applied to the vector to visualize it in 2D or 3D
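A minimal sketch of the two ideas above, using numpy, made-up 4-dimensional vectors (real embeddings have hundreds of dimensions), and plain PCA via SVD for the dimensionality reduction:

```python
import numpy as np

# Toy embeddings: 4 dimensions instead of hundreds, values made up for illustration
embeddings = {
    "cat": np.array([0.9, 0.1, 0.30, 0.00]),
    "dog": np.array([0.8, 0.2, 0.35, 0.05]),
    "car": np.array([0.1, 0.9, 0.00, 0.40]),
}

def cosine_similarity(a, b):
    # Semantically related words should score close to 1.0
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # lower

# Dimensionality reduction to 2D (PCA via SVD on centered data) for visualization
X = np.stack(list(embeddings.values()))
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
points_2d = X_centered @ Vt[:2].T  # each row is an (x, y) point that could be plotted
print(points_2d)
```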
$W_E$ is the embedding matrix (a tensor learned during training); the vectors it produces are then refined layer by layer until the final result is found
For word-prediction models, the embedding matrix contains one vector for every possible word (token) in the vocabulary
Each input token selects its vector from this matrix (equivalent to multiplying a one-hot vector by the matrix), producing the first layer of vectors
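A minimal sketch of this lookup step, assuming a tiny made-up vocabulary and embedding size (real models use tens of thousands of tokens and hundreds of dimensions):

```python
import numpy as np

vocab = ["the", "cat", "sat"]           # toy vocabulary
d_model = 4                             # toy embedding size

rng = np.random.default_rng(0)
W_E = rng.normal(size=(len(vocab), d_model))  # embedding matrix: one vector per token

tokens = [0, 1, 2]                      # indices for "the cat sat"

# Lookup by index...
first_layer = W_E[tokens]               # shape (3, d_model): one vector per input token

# ...which is the same as multiplying one-hot vectors by the embedding matrix
one_hot = np.eye(len(vocab))[tokens]
assert np.allclose(one_hot @ W_E, first_layer)
print(first_layer)
```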
Attention
The vectors pass through an operation called the attention block, in which the vectors exchange information with each other in order to adjust their values/weights.
E.g., words that tend to appear together side by side, or words whose meaning changes depending on the context
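A minimal sketch of this step as single-head scaled dot-product attention in numpy, with random stand-in weight matrices; it shows each vector gathering information from the others as a weighted average:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 3, 4                 # toy sizes
X = rng.normal(size=(seq_len, d_model)) # vectors coming out of the embedding layer

# Learned projection matrices (random stand-ins here)
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Each position scores every other position, then takes a weighted average of their values
# (a real word-prediction model would also apply a causal mask here)
scores = Q @ K.T / np.sqrt(d_model)     # (seq_len, seq_len): "how much should i attend to j?"
weights = softmax(scores, axis=-1)
attended = weights @ V                  # information exchanged between positions

X = X + attended                        # residual connection: adjust each vector's values
print(X)
```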
Multilayer perceptron (Feedforward layer)
In this phase the vectors don't talk to each other
The objective is to fine-tune the meaning of the vector (its value)
The values are adjusted according to a series of questions that are, in effect, asked of each vector
E.g., is it English? Is it a noun? Does it refer to a person?
This operation can be parallelized on GPUs!
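A minimal sketch of this feedforward/MLP step in numpy, with random stand-in weights; the same function is applied to each vector independently, which is why all positions can be processed in parallel:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 3, 4, 16   # toy sizes; real models use much larger hidden layers

X = rng.normal(size=(seq_len, d_model)) # vectors coming out of the attention block

# Learned weights (random stand-ins here)
W_1, b_1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W_2, b_2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

def mlp(v):
    # Applied to one vector at a time: no interaction between positions
    h = np.maximum(0, v @ W_1 + b_1)    # non-linearity (the "answers" to the questions)
    return h @ W_2 + b_2

# Because each vector is independent, this loop can be replaced by one big
# matrix multiplication and run in parallel on a GPU
X = X + np.stack([mlp(v) for v in X])   # residual connection
print(X)
```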
Repeat
The attention and multilayer perceptron steps are repeated multiple times
The final result is a last vector that is mapped to a probability distribution over the possible next tokens
The whole process can be repeated from the start, now including the new token, and in this way a complete output can be generated (a sentence, an image, etc.)
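Putting the pieces together, a minimal sketch of this generation loop; `transformer_block` is a hypothetical stand-in for the attention + MLP blocks above, and `W_U` is an assumed "unembedding" matrix that maps the last vector to one score per token:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_layers = 50, 4, 2        # toy sizes

W_E = rng.normal(size=(vocab_size, d_model))    # embedding matrix
W_U = rng.normal(size=(d_model, vocab_size))    # unembedding matrix: vector -> one score per token
W_block = rng.normal(size=(d_model, d_model))   # stand-in weights for attention + MLP

def transformer_block(X):
    # Stand-in for the attention + MLP sketches above
    return X + 0.1 * np.tanh(X @ W_block)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

tokens = [1, 7, 3]                              # some starting tokens
for _ in range(5):                              # generate 5 new tokens
    X = W_E[tokens]                             # embed the sequence so far
    for _ in range(n_layers):                   # repeat attention + MLP multiple times
        X = transformer_block(X)
    probs = softmax(X[-1] @ W_U)                # last vector -> distribution over next token
    next_token = rng.choice(vocab_size, p=probs)
    tokens.append(int(next_token))              # feed the new token back in and repeat
print(tokens)
```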