Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.
Different versions of Gemma are designed for different use cases and modalities, such as:
- Single modality (Text in, Text out)
- Specialization for coding use cases
- Multi modality (Text and Image in, Text out)
- Various sizes for different hardware types, inference needs, and other constraints
- “Novel” architectures
Because all these models share a similar DNA, the Gemma family offers a unique way to learn about the architectures and design choices available in modern LLM systems. We hope this contributes to a rich ecosystem of open models and promotes a greater understanding of how LLM systems work.
This series will cover:
- Gemma 1 (2B, 7B) – Transformer-based text-to-text models.
- CodeGemma (2B and 7B) – A fine-tuned version of Gemma, optimized for code completion and generation.
- Gemma 2 (2B, 9B, 27B) – Updated text-to-text models trained with a newer architecture, with the 2B and 9B versions trained through distillation from larger models.
- RecurrentGemma (2B, 9B) – A model built on the novel Griffin architecture. This architecture uses a mixture of local attention and linear recurrences to achieve fast inference when generating long sequences.
- PaliGemma (3B) – A vision-language model that can take in text and images and produce a text output.
How to use this guide
In this series, we will:
- Collate the specific architectures of the various models
- Explain how these parameters affect model generations (e.g. num embeddings, Multi Query vs Multi Head vs Grouped Query)
- Provide code examples of the models for further exploration
To get information about a model, we print it with Hugging Face Transformers, as in the simple code below.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
print(mannequin)
You can also explore inside the model with torchinfo, or with summary() in the Keras Model class API.
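For example, here is a minimal torchinfo sketch (our own addition, not part of the original snippet; it assumes torchinfo is installed and reuses the model loaded above):

from torchinfo import summary

# Print a layer-by-layer summary of the Gemma model loaded above,
# including module names and parameter counts.
summary(model, depth=3)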
What this guide isn't
This guide is not an introduction to AI. It assumes working knowledge of neural networks, Transformers, and related terms like tokens. If you need a refresher on these concepts, here are some resources to get you started:
A hands-on neural network learning tool that works in the browser
An introduction to Transformers
Gemma
Gemma is an open-weights LLM. It comes in both instruction-tuned and raw, pretrained variants at various parameter sizes. It is based on the transformer architecture introduced by Google Research in the "Attention Is All You Need" paper. Its primary function is to generate text token by token, based on a prompt provided by a user. In tasks like translation, Gemma takes a sentence in one language as input and outputs its equivalent in another language.
As you will soon see, Gemma is both a great model on its own and one that lends itself to custom extensions to meet different user needs.
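As a quick illustration of this token-by-token generation, here is a minimal sketch using Hugging Face Transformers (the prompt and generation settings are our own assumptions, not from the original post):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")

# The model generates one token at a time until max_new_tokens or an end-of-sequence token.
inputs = tokenizer("Translate to French: Good morning!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))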
Gemma Architecture
First, let's look at the transformer decoder that Gemma models are based on.
Unlike the original encoder-decoder transformer architecture introduced in "Attention Is All You Need", Gemma is purely a "decoder-only" model.
The core parameters of the architecture are summarized in the table below.
Models are trained on a context length of 8192 tokens. This means they can process up to approximately 6144 words at a time (using the rule of thumb of 100 tokens ~= 75 words).
It is worth noting that the practical input limit can vary based on the task and usage. This is because text generation consumes tokens within the context window, effectively reducing the space available for new input. Although the technical input limit stays constant, generated output becomes part of the subsequent input, influencing further generations.
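To see how much of the 8192-token context window a given prompt consumes, you can count tokens with the Gemma tokenizer; a small sketch (the example text is an arbitrary assumption):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b")
prompt = "The quick brown fox jumps over the lazy dog."
n_tokens = len(tokenizer(prompt)["input_ids"])
# Whatever the model generates also counts against the same 8192-token window.
print(f"{n_tokens} tokens used, {8192 - n_tokens} tokens left for input and generation")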
d_model (2B: 2048, 7B: 3072)
d_model represents the size of the embeddings (vector representations of words or subwords, a.k.a. tokens) used as input to the decoder. It also determines the size of the internal representation within the decoder layers.
"d_model x Num heads x Head size" defines the parameter count in self_attn.
A larger d_model value means the model has more "space" to represent the nuances of different words and their relationships. This can lead to better performance, especially for complex language tasks. However, increasing d_model also makes the model larger and more computationally expensive to train and use.
Layers (2B: 18, 7B: 28)
Transformers consist of multiple stacked layers. Deeper models have more layers, and therefore more parameters, and can learn more intricate patterns. However, these additional parameters also make them more prone to overfitting and more demanding in computational resources.
This added representational capacity can cause the model to learn noise or patterns specific to the training data that do not generalize to new examples.
Additionally, deeper models usually require more training data to avoid overfitting. Where available data is limited, the model may not see enough examples to learn a generalizable representation and may memorize the training data instead.
Feedforward hidden dims (2B: 32768, 7B: 49152)
Each Transformer layer includes a feedforward network after the attention mechanism. This network has its own dimensionality, often larger than d_model, to increase the model's expressive power.
It is implemented as a multi-layer perceptron (MLP) to further transform the embeddings and extract more intricate patterns.
In Gemma, the standard ReLU non-linearity is replaced by the GeGLU activation function, a variation of GLU (Gated Linear Unit). GeGLU splits the activation into two parts: a GELU-activated gate and a linear projection. The output of the gate is multiplied element-wise with the linear projection, resulting in a non-linear, gated activation.
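A minimal PyTorch sketch of a GeGLU-style feedforward block, mirroring the gate_proj / up_proj / down_proj layout in the 7B printout later in this post. This is an illustration under those assumptions, not Gemma's reference implementation; the 24576 per-branch width appears to correspond to the table's 49152 counting both branches.

import torch
import torch.nn as nn

class GeGLUFeedForward(nn.Module):
    def __init__(self, d_model=3072, hidden=24576):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, hidden, bias=False)  # gated branch
        self.up_proj = nn.Linear(d_model, hidden, bias=False)    # linear branch
        self.down_proj = nn.Linear(hidden, d_model, bias=False)  # project back to d_model
        self.act_fn = nn.GELU(approximate="tanh")                # GELU gate, as in GeGLU

    def forward(self, x):
        # Element-wise product of the activated gate and the linear projection.
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))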
Num heads (2B: 8, 7B: 16)
Each Transformer layer contains multiple attention mechanisms working in parallel. These "heads" allow the model to focus on different aspects of the input sequence simultaneously. Increasing the number of heads can enhance the model's ability to capture diverse relationships in the data.
Num KV heads (2B: 1, 7B: 16)
The 7B model uses multi-head attention (MHA), while the 2B model uses multi-query attention (MQA). MQA shares a single key and value projection across all heads, which means every head attends over the same underlying keys and values but with its own query projection.
The original MHA offers richer representation learning but comes with higher computational and memory costs. MQA provides an efficient alternative that has been shown to work well.
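The difference is easiest to see in the projection shapes. Here is a small, illustrative sketch using the 2B numbers from the table (random weights, not the real model):

import torch

d_model, head_dim = 2048, 256
num_q_heads, num_kv_heads = 8, 1            # Gemma 2B: MQA has a single shared KV head

x = torch.randn(1, 10, d_model)             # (batch, seq, d_model)
w_q = torch.randn(d_model, num_q_heads * head_dim)   # q_proj: 2048 -> 2048
w_k = torch.randn(d_model, num_kv_heads * head_dim)  # k_proj: 2048 -> 256
w_v = torch.randn(d_model, num_kv_heads * head_dim)  # v_proj: 2048 -> 256

q = (x @ w_q).view(1, 10, num_q_heads, head_dim)     # 8 distinct query heads
k = (x @ w_k).view(1, 10, num_kv_heads, head_dim)    # one key head...
v = (x @ w_v).view(1, 10, num_kv_heads, head_dim)    # ...and one value head, shared by all queries
# With MHA (Gemma 7B), num_kv_heads equals num_q_heads (16), giving 4096-wide K/V projections.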
Head size (2B: 256, 7B: 256)
This is the dimensionality of each attention head within the multi-head attention mechanism. It is often set to the embedding dimension divided by the number of heads: for example, with an embedding dimension of 2048 and 8 heads, each head has a size of 256. (Gemma does not strictly tie the two, however: the 7B model pairs a d_model of 3072 with 16 heads of size 256.)
Vocab size (2B: 256128, 7B: 256128)
This defines the number of unique tokens (words, subwords, or characters) that the model understands and can process. The Gemma tokenizer is based on SentencePiece. The size of the vocabulary is fixed before training; SentencePiece then learns the optimal subword segmentation based on the chosen vocabulary size and the training data. Gemma's large 256k vocabulary lets it handle diverse text inputs and can improve performance on a range of tasks, e.g. handling multilingual text inputs.
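All of the parameters above can also be read straight from the model configuration without downloading the weights. A short sketch using Transformers' AutoConfig (field names follow the GemmaConfig in Transformers; Hub access to the checkpoint is assumed):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("google/gemma-7b")
print(config.hidden_size)          # d_model
print(config.num_hidden_layers)    # layers
print(config.intermediate_size)    # feedforward hidden dim (per branch)
print(config.num_attention_heads)  # num heads
print(config.num_key_value_heads)  # num KV heads
print(config.head_dim)             # head size
print(config.vocab_size)           # vocab size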
Gemma 7B
GemmaForCausalLM(
(model): GemmaModel(
(embed_tokens): Embedding(256000, 3072, padding_idx=0)
(layers): ModuleList(
(0-27): 28 x GemmaDecoderLayer(
(self_attn): GemmaSdpaAttention(
(q_proj): Linear(in_features=3072, out_features=4096, bias=False)
(k_proj): Linear(in_features=3072, out_features=4096, bias=False)
(v_proj): Linear(in_features=3072, out_features=4096, bias=False)
(o_proj): Linear(in_features=4096, out_features=3072, bias=False)
(rotary_emb): GemmaRotaryEmbedding()
)
(mlp): GemmaMLP(
(gate_proj): Linear(in_features=3072, out_features=24576, bias=False)
(up_proj): Linear(in_features=3072, out_features=24576, bias=False)
(down_proj): Linear(in_features=24576, out_features=3072, bias=False)
(act_fn): PytorchGELUTanh()
)
(input_layernorm): GemmaRMSNorm()
(post_attention_layernorm): GemmaRMSNorm()
)
)
(norm): GemmaRMSNorm()
)
(lm_head): Linear(in_features=3072, out_features=256000, bias=False)
)
embed_tokens (Embedding Layer)
This layer converts the input tokens (words or subwords) into dense numerical representations (embeddings) that the model can process. It has a vocabulary size of 256,000 and produces embeddings of size 3072.
layers
This is the heart of the model, consisting of 28 stacked GemmaDecoderLayer blocks. Each of these layers refines the token embeddings to capture complex relationships between words and their context.
self_attn
In the self-attention mechanism, the model assigns different weights to the words in the input when producing the next word. Using scaled dot-product attention, the model applies linear projections (q_proj, k_proj, v_proj, and o_proj) to generate query, key, value, and output representations.
The out_features values are the same, 4096, for q_proj, k_proj, and v_proj, because this model uses Multi Head Attention (MHA): 16 heads of size 256 in parallel, totaling 4096 (256 x 16).
Additionally, the model encodes positional information using rotary_emb (GemmaRotaryEmbedding), a.k.a. RoPE.
Finally, the o_proj layer projects the attention output back to the original dimension (3072).
Note that the Gemma 2B model uses Multi Query Attention (MQA).
k_proj and v_proj share a single head of size 256, resulting in out_features of 256. In contrast, q_proj and o_proj have 8 heads (256 x 8 = 2048) in parallel.
(self_attn): GemmaSdpaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=256, bias=False)
(v_proj): Linear(in_features=2048, out_features=256, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
(rotary_emb): GemmaRotaryEmbedding()
)
mlp
It uses gate_proj and up_proj for a gating mechanism, followed by down_proj to reduce the dimension back to 3072.
input_layernorm, post_attention_layernorm and norm
These normalization layers stabilize training and improve the model's ability to learn effectively.
lm_head
This final layer maps the refined embeddings (3072) back to a probability distribution over the vocabulary (256000) for the next token.
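A short sketch of that final step, reusing the model and tokenizer loaded earlier (the prompt is an assumption): the decoder's hidden states pass through lm_head to give one logit per vocabulary entry, and a softmax over the last position yields the next-token distribution.

import torch

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

print(out.logits.shape)                                   # (batch, seq_len, 256000)
next_token_probs = torch.softmax(out.logits[0, -1], dim=-1)
next_token_id = next_token_probs.argmax().item()
print(tokenizer.decode([next_token_id]))                  # most likely next token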
CodeGemma (2B and 7B)
CodeGemma models are fine-tuned Gemma models that are optimized for code completion and coding chat assistance. CodeGemma models are trained on more than 500 billion tokens of primarily code. In addition, CodeGemma adds fill-in-the-middle (FIM) capability, allowing completions that sit between two pieces of existing text.
CodeGemma highlights the fine-tunability of the Gemma checkpoints. Through additional training, the models become specialized at a certain task, learning a more complex kind of completion than pure suffix completion.
CodeGemma Usage
You can use 4 user-defined tokens – 3 for FIM and a "<|file_separator|>" token for multi-file context support.
BEFORE_CURSOR = "<|fim_prefix|>"
AFTER_CURSOR = "<|fim_suffix|>"
AT_CURSOR = "<|fim_middle|>"
FILE_SEPARATOR = "<|file_separator|>"
Imagine that you are completing a file where the cursor sits right after "import " and the rest of the file ends with an if __name__ == "__main__": block.
The input prompt should then look like this:
<|fim_prefix|>import <|fim_suffix|>if __name__ == "__main__":\n    sys.exit(0)<|fim_middle|>
The model will suggest "sys" as the code completion.
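A minimal sketch of building that FIM prompt and asking the 2B code model for the infill (the checkpoint name and generation settings are assumptions; see the CodeGemma quickstart for the official recipe):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/codegemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Prefix = text before the cursor, suffix = text after it; the model fills the middle.
prompt = (
    "<|fim_prefix|>import <|fim_suffix|>"
    'if __name__ == "__main__":\n    sys.exit(0)<|fim_middle|>'
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
# Decode only the newly generated tokens -- the expected infill is "sys".
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))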
You can explore more about CodeGemma in the CodeGemma quickstart.
What's Next?
This article discussed the Gemma architecture.
In the next post, you will explore the latest model, Gemma 2. With substantial enhancements in safety measures, it surpasses its predecessor in performance and efficiency during inference.
Stay tuned and thank you for reading!
References
Papers
Code Examples
Gemma
CodeGemma
📋 The complete Gemma architecture series
- Gemma explained: An overview of Gemma model family architectures
- Gemma explained: What's new in Gemma 2
- Gemma explained: RecurrentGemma architecture
- Gemma explained: PaliGemma architecture