
Thoughts on Training Large Language Models

Back in February of 2023, Meta released the first version of the LLaMA language model. It wasn’t the first open source LLM, but it was the first capable model that someone could realistically run on their own hardware. From the moment it was released, I was obsessed with it. I wanted to understand how it worked, why it worked, and how I could use it to do things.

Over the last two years, I have read just about every paper I could find on the subject. Hundreds of papers, from the original “Attention Is All You Need” paper to the “Physics of Language Models” series and everything in between. Inspired by blog posts by Eric Hartford, I began tinkering with training my own models.

Before we get into the details, let’s review what a large language model is so we are all on the same page.

What is a Large Language Model?

A large language model is a series of smaller sub-models stacked on top of each other in layers, with special embedding layers at the beginning and end that shape the input and output of the model. The input is a series of tokens, each of which is an encoded chunk of text. In the case of LLaMA and most other modern LLMs, these tokens are subwords produced by a Byte Pair Encoding (BPE) algorithm.

For example, the sentence “I am training a stupendous large language model” would be broken down into the tokens:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

sample_text = "I am training a stupendous large language model"

result = tokenizer.encode(sample_text)
print(result)
# [40, 1079, 4862, 264, 90032, 408, 782, 3460, 4128, 1614]

results = [tokenizer.decode(r) for r in result]
print(results)

# ['I', ' am', ' training', ' a', ' stup', 'end', 'ous', ' large', ' language', ' model']

Most of the tokens are single words, with a space at the beginning of the word, but longer and less common words are broken down into smaller tokens. For those of you who have taken CS classes, it is very similar to what a lexer does in a compiler. This input is processed by the model.embed_tokens.weight matrix, which embeds the input tokens into a higher-dimensional space. For our example Qwen3-1.7B model, the input tokens are embedded into a 2048-dimensional space: each token ID selects a row of a 151,936 x 2048 matrix (equivalent to multiplying a one-hot vector by it), producing a 2048-dimensional vector for each input token.

This is where the idea of the context window comes in. The context window is the number of tokens that the model can see at one time. For our example input sequence, we had 10 tokens, so the current context window is 10. This means we will have 10 vectors of 2048 dimensions (a 10 x 2048 matrix) flowing into our transformer layers.

The output of an LLM is a set of scores, called logits, which are unnormalized log-probabilities over the next token. At each position, the final hidden state (hidden_size dimensions) is multiplied by the lm_head matrix (hidden_size x vocab_size) to produce one logit per vocabulary entry. For our example model, the hidden size is 2048 and the vocab size is 151,936, so the logits for the next token form a single vector of 151,936 values. You then apply a softmax and “sample” from that distribution to get the next token.
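
To make that concrete, here is a minimal sketch (assuming the same Qwen/Qwen3-1.7B checkpoint used above) of pulling the next-token logits out of the model and sampling from them:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")

input_ids = tokenizer("I am training a stupendous", return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits      # shape: [1, sequence_length, 151936]

next_token_logits = logits[0, -1]         # scores for the next token only
probs = torch.softmax(next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)   # "sample" one token
print(tokenizer.decode(next_token))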

The Transformer - The Latent Space

So, with the inputs and outputs defined, we can now talk about the transformer. The transformer is a type of neural network that is used to plot the course we are traversing in what is called the latent space. Latent space is this nebulous n-dimensional space that represents all of the information that has been encoded across all of the documents that have been used to train the model. It is a bit of a difficult concept to grasp, so let’s break it down.

When a model is trained, it is trained on a corpus of text: a collection of documents that are tokenized and fed into the model during the pre-training phase, where the model steps through the input tokens and applies back-propagation to update its weights. The pre-training objective is to minimize a loss function called cross-entropy, which is equivalent (up to a constant) to minimizing the KL divergence between the data distribution and the model distribution. This training regime has the effect of minimizing perplexity, which is a measure of how well the model is able to predict the next token.
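
As a rough sketch of that objective (illustrative tensors only, assuming PyTorch): cross-entropy is computed between each position’s logits and the actual next token, and perplexity is just its exponential.

import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 16, 151936
logits = torch.randn(batch, seq_len, vocab)             # stand-in for model output
input_ids = torch.randint(0, vocab, (batch, seq_len))   # stand-in for token ids

# Predict token t+1 from the logits at position t
shift_logits = logits[:, :-1, :].reshape(-1, vocab)
shift_labels = input_ids[:, 1:].reshape(-1)

loss = F.cross_entropy(shift_logits, shift_labels)
perplexity = torch.exp(loss)    # perplexity is exp(cross-entropy)
print(loss.item(), perplexity.item())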

Trillions of tokens are used to train a model. Often these documents start at around 8,192 tokens, and then techniques like Rotary Position Embedding (RoPE) are used to extend the context window; larger documents are fed in, stretching the context window even further. Techniques like this are used to break purely local contextual relationships between tokens and instead build a global understanding of the latent space.

Why do we have to do it like this? Because the latent space is sampled in patches of information: the smallest discrete units of information are what the attention mechanism of the transformer operates on, and the context around that state is then played out through subsequent generations.

Internals of the Transformer

The transformer is effectively two different types of neural networks. First, we have the attention mechanism, which is a series of matrices that are multiplied together and stacked on top of each other. The input is projected into a query, which is compared against the keys to produce attention scores; those scores weight the values, and the result is passed through the output projection. You will often hear these referred to as QKV matrices or (q, k, v, o) layers.
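
Here is a minimal single-head sketch of how those four projections fit together; real models add multiple heads, causal masking, and RoPE, but the data flow is the same:

import math
import torch
import torch.nn as nn

hidden_size = 2048
x = torch.randn(10, hidden_size)   # 10 tokens, one hidden vector each

q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

q, k, v = q_proj(x), k_proj(x), v_proj(x)
scores = (q @ k.T) / math.sqrt(hidden_size)   # attention scores
weights = torch.softmax(scores, dim=-1)       # each row sums to 1
out = o_proj(weights @ v)                     # weighted values -> output projection
print(out.shape)   # torch.Size([10, 2048])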

The next type of neural network is the feed-forward network (FFN), which is a standard multi-layer perceptron (MLP). Its inputs are the outputs of the attention layer, and its outputs feed the next layer. They are sized to the hidden size of the model and follow your standard feed-forward ‘upscaling’ process: the up and gate layers project in, and the down layer projects out. The up and down layers work together in a form of SVD-like decomposition, spreading the information out into a generally larger hidden space in between so that the model can learn more complex relationships between input and output. The gate layer operates as a controlled mask, shaping which parts of that hidden space are used as data is processed through the MLP.
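
A minimal sketch of that gated FFN, using illustrative sizes (the 6144 intermediate size is an assumption for this example):

import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, intermediate_size = 2048, 6144   # intermediate size is illustrative
x = torch.randn(10, hidden_size)              # hidden states for 10 tokens

up_proj   = nn.Linear(hidden_size, intermediate_size, bias=False)
gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

# the gate shapes the expanded representation, down projects it back
out = down_proj(F.silu(gate_proj(x)) * up_proj(x))
print(out.shape)   # torch.Size([10, 2048])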

The Layering Effect

So, each of these transformer blocks is stacked on top of the last, permuting the information in the context to tweak and update the hidden state of the model, making small updates to this hidden state iteratively. This is an important point, because each layer is in effect doing a different thing, and the computation it is doing depends on its priors: the previous layers determine pieces of information that are then critical to the downstream layers.

So, even though they are all operating on and tweaking this same hidden state, they are doing it in different ways. Some might be considering “this token is related to this other token”, while others might be considering “this token is at this position in the document”, or whatever other of the infinite eigenfunctions of the latent space it has reward-hacked its way into learning.

That is an important insight you have to understand about LLMs and training them. These attention mechanisms are not just a simple matrix multiplication, but a complex interplay of information, and every step is trying to nudge and shuffle the model to fit this document vector. In the process, it discovers all sorts of creases and corners, ways of sneaking information into the gaps that are available in the model’s parameters. When you push information into those gaps, the model might learn something, but it might also forget something else.

Transformer Architecture TL;DR

The TL;DR is that facts and relationships are encoded in the QKV attention, and style and language is encoded in the FFN.

Dataset Curation

When preparing to train a model, the first thing you need is a dataset. When selecting or curating a dataset, there are a few things you need to keep in mind.

When I think about what each entry in the dataset represents, I think about a serialized kernel of information. The LLM is a complex system, and skips through these kernels as it generates new tokens into the context window, activating whichever ones reduce the next-token uncertainty the most. If a kernel is not well represented, it will not have a strong enough signal to the model to guide it to use the information from that kernel.

So, when designing your corpus, you first need to consider what you want to accomplish with the model. Your dataset, as a whole, is a stack of these information kernels that forms a policy function in the model’s latent space. This policy function will structurally warp the latent space in a way that encourages the model to reproduce the structures it discovers. This leads to the first question: what specific patterns do you want to encourage the model to master?

Understanding Data Requirements: Pre-training vs Fine-tuning

The data requirements for training depend dramatically on whether you’re doing pre-training or fine-tuning, and recent research has fundamentally changed our understanding of optimal data sizing.

Pre-training (Chinchilla Rule)

For full pre-training where all parameters start random, the “Training Compute‑Optimal Large Language Models” paper (known informally as Chinchilla) suggests that we need approximately 20 tokens worth of data per trainable parameter to achieve compute-optimal training. This is because every parameter matrix must be sculpted from scratch to encode linguistic patterns and world knowledge.

Thus, if we are performing full pre-training on a 1.7B parameter model, we would need 34B tokens of data. That is an enormous amount of data - far more than most people have access to or have the compute to process in a reasonable amount of time.

Fine-tuning (Information-Capacity Rule)

However, for LoRA fine-tuning, the optimal ratio is dramatically different. Recent research by Morris et al. (2025) reveals that the optimal training regime occurs when dataset information bits ≈ adapter capacity bits. This gives us a new rule:

Optimal tokens ≈ 0.8 × trainable_parameters

Using empirical constants of 3.6 bits per parameter storage capacity and corpus-specific entropy rates:

  • Instruction datasets: ~4-5 bits/token (highly templated)
  • General text: ~6-7 bits/token
  • Code: ~2-3 bits/token

This means a rank-16 LoRA adapter with ~70M trainable parameters needs only ~56M tokens total, giving an optimal T/P ratio of ≈ 0.8, not 20.
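
A quick back-of-the-envelope check of that arithmetic (the constants are the empirical values quoted above; the ~70M-parameter adapter is just an example):

BITS_PER_PARAM = 3.6       # adapter storage capacity (Morris et al. 2025)
ENTROPY_PER_TOKEN = 4.5    # instruction-style data, bits per token

trainable_params = 70e6    # e.g. a ~70M-parameter adapter
optimal_tokens = (BITS_PER_PARAM / ENTROPY_PER_TOKEN) * trainable_params
print(f"{optimal_tokens/1e6:.0f}M tokens, T/P ratio = {optimal_tokens/trainable_params:.2f}")
# 56M tokens, T/P ratio = 0.80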

Why the Difference?

The fundamental difference is that LoRA adapters only need to learn task-specific behavioral patterns, not world knowledge. The base model already encodes most linguistic and factual information, so the adapters require far less data to specialize effectively.

Dataset Quality

When constructing your dataset, an extremely important consideration is judging the quality of the data and repairing or excluding anything poor. The data you feed into the model is the data the model will use to generate new tokens. If the data is not clean, the model will learn to replicate the bad patterns in it, effectively wiping out the good work you did in the earlier steps. Hand-curated, human-produced data from a subject matter expert is generally considered the best. Synthetic data is an option that is often used, but it does not have the same level of diversity as human-produced data.

By contrast, the lowest sediment of internet scraping, like Twitter hot-takes and Reddit karma-chasing, tends to be stylistically narrow and generally low quality. If you find a lot of duplication of phrasing, typos, or other issues, you need to consider the negative impact it will have on the quality of the model. Bad data in means bad results out.

Large language models are shown to generalize well on distributions they have seen, but may produce errors known as hallucinations when prompted to generate text that was absent from the corpus, as this is out-of-distribution data. Most modern base models are trained on massive datasets and have seen enough data to generalize well in most areas, but you will still need to consider the type and quality of the data you are feeding into the model, as all training will cause the model to forget some information it already knows.

Given these points, your dataset needs to be appropriately sized, focused, and clean if it is going to provide a clear representative sample of the patterns you want the model to produce at inference time.

Instruct Datasets - How to get the model to do what you want

Another factor to consider is how you get the model to do what you want. The context window is effectively rendering one giant document, but in practice we programmatically operate with models using request/response chat turns. The bridge between the two is a type of structured text called an instruct dataset.

Instruct formatting combines templates, datasets that use specific formats, and special tokens to produce this structured text, which then notifies the inference interface when to expect user input and when to halt generation and return the response.

There are four main types of instruct templates: ChatML, Llama, Mistral, and Alpaca (legacy). Each uses their own special tokens and formatting. There are also a few different types of instruct dataset formats, with each generally having a system prompt, and then a series of user and assistant messages.

from transformers import AutoTokenizer, Qwen2TokenizerFast

tokenizer : Qwen2TokenizerFast = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

turns = [
    {'role': 'system', 'content': 'You are a helpful assistant'},
    {'role': 'user', 'content': 'How are you today?'},
    {'role': 'assistant', 'content': 'I am doing great!'},
    {'role': 'user', 'content': 'What is on my schedule for this morning?'},
]

# Huggingface tokenizer uses jinja2 templates under the hood
# The add_generation_prompt=True is important for inference, because it will add the generation prompt to the end of the document, which is required to put the model in the correct generation context
# This would not be used in a training dataset, but is used when you are generating text with the model

tokenized_turns = tokenizer.apply_chat_template(turns, tokenize=False, add_generation_prompt=True)

print(tokenized_turns)

# '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHow are you today?<|im_end|>\n<|im_start|>assistant\nI am doing great!<|im_end|>\n<|im_start|>user\nWhat is on my schedule for this morning?<|im_end|>\n<|im_start|>assistant\n'

The inference engine will format this document and pass it to the model, which will then generate one logit matrix. The engine will use some specified sampler settings to determine which token to pick, and will add that to the end of the context and repeat the process. It does this until some end token is reached, or some maximum number of tokens are generated.
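
A minimal sketch of that loop, reusing the turns and tokenizer from above and using a plain greedy “sampler” for simplicity (a real engine applies temperature, top-p, and so on):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")

prompt = tokenizer.apply_chat_template(turns, tokenize=False, add_generation_prompt=True)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(64):                                  # maximum number of new tokens
    with torch.no_grad():
        logits = model(input_ids).logits
    next_token = logits[0, -1].argmax()              # greedy pick
    if next_token.item() == tokenizer.eos_token_id:  # stop at the end token
        break
    input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))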

So, to train the model, we need to generate this document and pass it to the model trainer.

System Prompts and Soft Prompting

The system prompt (and instruct formatting in general) is a special form of what is known as soft prompting. Soft prompting is a method by which a path is etched into the latent space of the model, which encourages the model to generate text in a specific way.

This technique works by ‘framing’ the model in a specific context. Because some tokens are set at the beginning of the document, they encourage the model to generate text in a specific way, with the initial tokens being the most important and acting as an anchor that biases the rest of the document that is being generated.

When training your own model, you can utilize system prompts to guide your model into the patterns you want to see. You do this by creating a system prompt that informs the model about the contents of the rest of the document. By using the same or highly similar system prompts across all of the documents in your dataset, you create a policy that ’etches’ a light but consistent path through the latent space, encouraging the model to select your desired tokens more often without overfitting to the dataset or damaging the model’s ability to generalize.

Soft prompting is shown to be strongest for the first 20 or so tokens, and then its effect diminishes rapidly; so creating system prompts longer than about 100 tokens is not recommended, as the extra tokens will have little effect and waste room in the context window.

Preprocessing and Pre-Tokenization

Before you can train the model, you have to preprocess your dataset. This is a pretty straightforward process if you are using a tokenizer that supports the instruct format, and have your dataset in a jsonl or parquet file.

from datasets import load_dataset, load_from_disk

# The most common way, if a dataset is already uploaded to huggingface hub
dataset = load_dataset("my_dataset", split="train")

# If you have a local dataset saved to disk (e.g. with save_to_disk), you can load it like this
dataset = load_from_disk("my_local_dataset")

# If you have some loose jsonl files, you can load them like this
dataset = load_dataset("json", data_files=["my_local_dataset.jsonl"], split="train")

# For my example, I have a jsonl with {"entry_id": "foo", "turns": [{"role": "system", "content": "bar"}, ...]}
print(dataset)
# Dataset({
#     features: ['entry_id', 'turns'],
#     num_rows: 1330
# })

Then you apply the chat template to the dataset.


def formatting_function(samples):
    formatted_text = [
        tokenizer.apply_chat_template(entry, tokenize=False)
        for entry in samples["turns"]
    ]
    return {"text": formatted_text}

# Process the dataset in parallel, removing the turns and entry_id columns
dataset_formatted = dataset.map(
    formatting_function, batched=True,
    num_proc=8, remove_columns=["turns", "entry_id"]
)

print(dataset_formatted)
# Dataset({
#     features: ['text'],
#     num_rows: 1330
# })

# Save the dataset to a new file - this will be compressed as one or more arrow files
dataset_formatted.save_to_disk("my_dataset_formatted")

# Or you can save it to a jsonl file to be more portable
dataset_formatted.to_json("my_dataset_formatted.jsonl")

The model is not trained on raw text; internally it uses the tokenized representation of the text. If you want to save some compute at training time, you can pre-tokenize your dataset before training. This is not required, but it can save time during training.

MAX_SEQ_LENGTH = 32768

def tokenize_for_sft(examples):
    tokenized_output = tokenizer(
        examples["text"],
        truncation=True,
        max_length=MAX_SEQ_LENGTH,
        padding=False,
    )
    return tokenized_output

sft_ready_dataset = dataset_formatted.map(
    tokenize_for_sft,
    batched=True,
    num_proc=8,
    remove_columns=['text']
)

print(sft_ready_dataset)
# Dataset({
#     features: ['input_ids', 'attention_mask'],
#     num_rows: 1330
# })

In the resulting tokenized dataset, you will have two columns: input_ids and attention_mask. The input_ids are the tokenized representation of the text, and the attention_mask is a binary mask that indicates which tokens are real (trainable) and which are padding.

Parameter Tuning

Now that we have a formatted dataset, we are ready to think about training a model. To do this, we will create what is known as an adapter for the model. An adapter is an overlay added on top of a pretrained base model, containing the changes to the weights generated during training.

There are various types of adapters, and they differ in size and complexity. The most common type is a LoRA (Low-Rank Adaptation) adapter, which consists of a pair of low-rank down- and up-projection matrices for each parameter matrix it targets in the model.

The advantages of using a lower-rank adapter are several. It reduces the number of trainable parameters, as it is a rank-deficient projection of the parameter matrix. This reduces both the amount of GPU memory required to train the model and the space it takes up on disk. It also effectively reduces the number of parameters being trained, which, going back to the information-capacity principle, means we can train the model with much less data without causing overfitting.
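
A minimal sketch of what a LoRA adapter adds to a single frozen linear layer: a low-rank pair of matrices whose product is added to the frozen layer’s output, scaled by alpha / rank.

import torch
import torch.nn as nn

hidden_size, rank, alpha = 2048, 16, 16
x = torch.randn(10, hidden_size)

frozen = nn.Linear(hidden_size, hidden_size, bias=False)
frozen.weight.requires_grad_(False)                 # base weights stay fixed

lora_A = nn.Linear(hidden_size, rank, bias=False)   # trainable "down" projection
lora_B = nn.Linear(rank, hidden_size, bias=False)   # trainable "up" projection
nn.init.zeros_(lora_B.weight)                       # adapter starts as a no-op

out = frozen(x) + (alpha / rank) * lora_B(lora_A(x))
print(out.shape)   # torch.Size([10, 2048])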

Optimal LoRA Rank Selection

Rather than guessing rank based on GPU memory or following outdated heuristics, you can calculate the optimal rank mathematically using information theory.

The Information-Capacity Rule

The key insight from Morris et al. 2025 is that optimal generalization occurs when:

Dataset Information ≈ Adapter Capacity

Where:

  • Dataset Information = Total Tokens × Entropy per Token
  • Adapter Capacity = Trainable Parameters × Bits per Parameter

Using empirical constants:

  • 3.6 bits per parameter (model storage capacity)
  • 4-5 bits per token for instruction data
  • 6-7 bits per token for general text

Calculating Your Optimal Rank

def optimal_rank_for_dataset(total_tokens, target_modules_params,
                             entropy_per_token=4.5, bits_per_param=3.6):
    """
    Find LoRA rank where dataset bits ≈ adapter capacity bits
    """
    target_capacity_bits = total_tokens * entropy_per_token
    optimal_params = target_capacity_bits / bits_per_param
    optimal_rank = optimal_params / target_modules_params

    return int(optimal_rank)

Generalization Score

You can track where you are on the memorization-to-generalization curve:

import math

def generalization_score(tokens, trainable_params, entropy_per_token=4.5):
    """Returns 0-1 where 1 = pure generalization, 0 = pure memorization"""
    capacity_tokens = (3.6 / entropy_per_token) * trainable_params
    ratio = tokens / capacity_tokens

    # Logistic centered at ratio=1 (the information capacity knee)
    return 1 / (1 + math.exp(-math.log(ratio) / 0.6))

Practical Guidelines

Generalization Score | Interpretation | Action
< 0.3                | Under-training | Add more data or reduce rank
0.3 - 0.7            | Sweet spot     | Optimal training regime
> 0.7                | Over-training  | Stop training or increase rank

Multiple Epochs: Less is More

For LoRA training, 1-2 epochs are typically sufficient. The information-capacity curve shows that once you exceed the optimal token budget, additional epochs primarily increase memorization risk rather than improving generalization.

The rule of thumb for epochs:

  • First epoch: Model learns the basic task patterns
  • Second epoch: Refinement and stabilization
  • Third epoch and beyond: Diminishing returns with increased memorization risk

If your T/P ratio is already > 2, additional epochs waste compute and may harm generalization.

Learning Rate and Alpha

The learning rate is the most important parameter to tune in the training process. It is the step size taken during each optimization step. If it is too large, the model will overfit to the training data, etching too hard and causing a massive amount of catastrophic forgetting. If it is too small, the data will not be imprinted into the weights, and the model will learn little to nothing.

For various types of training, the learning rate is set differently. For our examples we will assume we are using an AdamW optimizer and a cosine learning rate scheduler, as these are the most common and widely accepted as best practice.

The alpha is the scaling factor that is applied to the adapter matrices during the forward pass. It is like the volume knob on the adapter, and how loud you turn it up or down is going to be dependent on the rank of the adapter you use.

For adapter training, the learning rate is generally set at 1e-4 to 2e-4 for lower rank (<16) adapters, and 5e-5 to 8e-5 for higher rank (16+) adapters. The alpha is set to the rank of the adapter.
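
As a rough illustration of those rules of thumb, here is a hypothetical helper (not a library function) that picks a starting point:

def suggest_lr_and_alpha(rank: int):
    """Starting-point learning rate and alpha, per the rules of thumb above."""
    lr = 1.5e-4 if rank < 16 else 6.5e-5   # midpoints of the 1e-4..2e-4 and 5e-5..8e-5 ranges
    alpha = rank                            # alpha set to the rank
    return lr, alpha

print(suggest_lr_and_alpha(8))    # (0.00015, 8)
print(suggest_lr_and_alpha(32))   # (6.5e-05, 32)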

As you train your model against your data, you might notice that your loss curve is dropping too quickly. This is a sign that the model is overfitting to the training data, and you need to either adjust the learning rate or the alpha down. In general, if you have a diverse dataset you should see your loss drop at the very beginning, and then level out with a very gentle slope; but then see a significant drop between epochs.

If you notice that your loss did not drop at your epoch boundary, you need to either adjust the learning rate or the alpha up.

The rule of thumb is that for small ranks you would adjust the learning rate, and for larger ranks you would adjust the alpha.

Information-Capacity vs Chinchilla

The fundamental difference between the two approaches:

Approach             | Optimization Target        | T/P Ratio | Use Case
Chinchilla           | Compute efficiency         | 20        | Pre-training from scratch
Information-Capacity | Generalization efficiency  | 0.5 - 2.0 | LoRA fine-tuning

The Chinchilla paper suggests that we need 20 tokens worth of data per trainable parameter. This is the threshold at which a model training from scratch will generalize well instead of purely memorizing sequences. However, this represents the lower bound of the compute optimal range for pre-training, not fine-tuning.

For LoRA fine-tuning, the information-capacity rule shows that optimal generalization occurs at much lower T/P ratios because:

  1. Base knowledge exists: The pretrained model already encodes linguistic and world knowledge
  2. Limited capacity: Only adapter parameters can store new information
  3. Task-specific learning: Adapters only need to learn behavioral patterns, not facts

At the information-capacity knee (T/P ≈ 0.8), you get maximum generalization pressure without starving the adapters of information.

Overcoming Data Limitations

When working below the optimal token count, there are techniques to improve training:

Dropout techniques can be used to overcome these limitations. Standard (model) dropout randomly masks some percentage of activations on each training step, which keeps any single pathway from dominating the update. lora_dropout is the same idea applied to the LoRA matrices, masking a portion of the adapter’s input during each training step.

Model dropout is generally done for full model training, whereas LoRA dropout is done to prevent overfitting from data insufficiency.

A second common technique is to apply weight_decay to the model. This is an L2-style regularization technique: during each optimization step, the parameters are pulled back towards zero by a small amount. This helps prevent overfitting by keeping the parameters from getting too large. For an adapter-only fine-tune, set it at 0.01-0.02.

Using both weight decay and dropout together is a common recipe when your information-capacity ratio suggests you’re in the memorization regime.

LoRA Dropout Values

In general, only one dropout should be set - so pick the LoRA dropout or the model dropout.

LoRA dropout values can be tuned based on your position on the memorization-generalization curve. When you have an optimal information-capacity balance, a small LoRA dropout of 0.01-0.02 is recommended. As you move into the memorization regime, your dropout might be set at 0.05 to offset overfitting. For a larger rank (>=32) with rsLoRA and a small corpus, you might need to set it as high as 0.1.

rsLoRA

rsLoRA (rank-stabilized LoRA) is a technique that stabilizes the training process for larger-rank adapters, generally rank 16 and above. Instead of scaling the adapter output by alpha divided by the rank, it scales by alpha divided by the square root of the rank, which keeps the size of the update stable as the rank (and thus the adapter parameter count) grows.
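
An illustrative comparison of the effective scaling factor applied to the adapter update for standard LoRA (alpha / rank) versus rsLoRA (alpha / sqrt(rank)):

import math

alpha = 32
for rank in [8, 16, 32, 64, 128]:
    standard = alpha / rank            # shrinks quickly as rank grows
    rslora = alpha / math.sqrt(rank)   # shrinks much more gently
    print(f"rank {rank:>3}: LoRA scale {standard:.3f}, rsLoRA scale {rslora:.3f}")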

DoRA

DoRA (Weight-Decomposed Low-Rank Adaptation) is a technique that I have not experimented with much. It decomposes each weight into a magnitude and a direction and learns updates to them separately. The idea that it operates in a more signal-focused way intuitively makes sense, and explains why it tends to converge faster than plain LoRA and achieve higher-quality results.

It is suggested in the paper that a lower learning rate, half of the rank, and more dropout should be used.

Model Breakdown

We will use Qwen 3 as our example, since it is one of the most modern and well performing models and trains well.

Qwen released six models in this family at the end of April 2025, along with this chart:

Model      | Layers | Attention Heads (Q / KV) | Tied Embeddings | Context Length
Qwen3-0.6B | 28     | 16 / 8                   | Yes             | 32K
Qwen3-1.7B | 28     | 16 / 8                   | Yes             | 32K
Qwen3-4B   | 36     | 32 / 8                   | Yes             | 32K
Qwen3-8B   | 36     | 32 / 8                   | No              | 128K
Qwen3-14B  | 40     | 40 / 8                   | No              | 128K
Qwen3-32B  | 64     | 64 / 8                   | No              | 128K

Advanced Model Features - Attention Heads and GQA

Modern models use a technique called Grouped Query Attention (GQA) to make the attention mechanism cheaper. GQA groups query heads together and shares a single key/value head among each group: the (k, v) tensors are subdivided into fewer heads, so multiple queries can still be executed simultaneously against a smaller set of key/value vectors.

This effectively makes each transformer layer a block of smaller lookups instead of a single large lookup. The tradeoff has a lot of memory and cache implications that speed up inference at the cost of some long-range context understanding. It also means that you have to be more careful about the rank of the adapter you use, as it can cause the model to become unstable and forget the facts that were encoded in the base model.

You will notice from the table above that the number of query and key/value heads are not the same. This is the difference between ‘full’ Multi-Head Attention (MHA) and GQA, wherein GQA runs several query heads against each single key/value head. This creates a bottleneck in the attention mechanism which forces more information into the k_proj tensor.

In practice this forces us to consider whether we want to train our k_proj layer much (or even at all). The q_proj is going to end up doing most of your heavy lifting, which is then balanced by the v_proj and corrected by the o_proj.
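
A rough illustration of why GQA shrinks the key/value cache, using the Qwen3-1.7B shape from the table above (the 128 head dimension and fp16 cache are assumptions for this example):

num_q_heads, num_kv_heads, head_dim = 16, 8, 128
context_len, bytes_per_value = 32_768, 2           # fp16/bf16 cache entries

def kv_cache_bytes(num_heads: int) -> int:
    # keys + values, for one layer, across the full context
    return 2 * num_heads * head_dim * context_len * bytes_per_value

mha = kv_cache_bytes(num_q_heads)    # if KV heads matched the query heads
gqa = kv_cache_bytes(num_kv_heads)   # the actual grouped-query configuration
print(f"per-layer KV cache: MHA {mha/1e6:.0f} MB vs GQA {gqa/1e6:.0f} MB")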

Tied Embeddings

Tied embeddings mean that the embed_tokens and lm_head layers share the same weights: instead of two separate matrices, there is a single matrix and its transpose. This dramatically reduces the size of the model, as the input and output matrices are usually among the largest in the model, and it also regularizes the model, because the input and output token geometry have to live in the same space.

I generally do not even train the lm_head or the embed_tokens layer, as it is the most delicate part of the model. Unless you are training new special tokens, it is best to leave it alone.
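
A quick way to check whether a model ties its embeddings; the attribute paths can vary between architectures, but this works for the Qwen3 family:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B")
print(model.config.tie_word_embeddings)   # True for Qwen3-1.7B
# When tied, lm_head and embed_tokens point at the same underlying tensor
print(model.lm_head.weight.data_ptr() == model.model.embed_tokens.weight.data_ptr())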

Layer Sizes and Information-Capacity Analysis

To accurately determine your optimal configuration, you can count the number of parameters in each layer and calculate the information-capacity balance.

Running an updated parameter counter script on a dataset yields results like:

Layer Name                | Parameters (M)  | Excluded  
--------------------------+-----------------+-----------
embed_tokens              | 311.16          | Yes
q_proj                    | 117.44          | No
k_proj                    | 58.72           | No
v_proj                    | 58.72           | No
o_proj                    | 117.44          | No
q_norm                    | 0.00            | No
k_norm                    | 0.00            | No
gate_proj                 | 352.32          | No
up_proj                   | 352.32          | No
down_proj                 | 352.32          | No
input_layernorm           | 0.06            | No
post_attention_layernorm  | 0.06            | No
norm                      | 0.00            | No
lm_head                   | 311.16          | Yes
--------------------------+----------------
Total parameters          | 1409.41M
--------------------------+----------------
Dataset length            | 140.42M
2 epochs                  | 280.85M
--------------------------+----------------
T/P ratio                 | 0.20
Using ENTROPY_PER_TOKEN = 6.4 bits/token

--- LoRA Information-Capacity Analysis ---

Rank  | Params(M) | T/P    | Ratio  | MemScore | GenScore | Status 
------+-----------+--------+--------+----------+----------+--------
4     | 4.4       | 64.44  | 114.56 | 0.036    | 0.964    | Over   
8     | 8.7       | 32.22  | 57.28  | 0.059    | 0.941    | Over   
16    | 17.4      | 16.11  | 28.64  | 0.096    | 0.904    | Over   
32    | 34.9      | 8.06   | 14.32  | 0.155    | 0.845    | Over   
64    | 69.7      | 4.03   | 7.16   | 0.252    | 0.748    | Over   
128   | 139.5     | 2.01   | 3.58   | 0.410    | 0.590    | Over   
256   | 278.9     | 1.01   | 1.79   | 0.665    | 0.335    | Optimal
512   | 557.8     | 0.50   | 0.90   | 0.895    | 0.105    | Knee   
1024  | 1115.7    | 0.25   | 0.45   | 0.448    | 0.552    | Under  

======================================================================
INTERPRETATION:
• Ratio = tokens / capacity_tokens (1.0 = knee, >1.0 = past knee)
• MemScore = memorization level (1.0 = max memorization at knee)
• GenScore = generalization level (higher = better generalization)
• Optimal status = maximum generalization zone (ratio ~1.5-2.0)
======================================================================

>> OPTIMAL RANK for MAX GENERALIZATION: 256
   - Trainable params: 278.9M
   - T/P ratio: 1.01
   - Capacity ratio: 1.79
   - Generalization score: 0.335
   - Status: Optimal

>> HIGHEST GENERALIZATION SCORE: Rank 4 (GenScore: 0.964)

For this dataset with 140M tokens and English-style entropy (6.4 bits/token), rank 256 provides the optimal information-capacity balance, but you might select rank 32-64 for a higher degree of generalization.

Batch Size

The effective batch size is a combination of per_device_train_batch_size and gradient_accumulation_steps. per_device_train_batch_size is the number of samples processed in parallel on each GPU, and gradient_accumulation_steps is the number of those batches accumulated before the optimizer step is taken, which is what updates the weights of the model.

This is a trade-off between memory, speed, and model updates. The more frequently the model is updated, the more variation is pushed into the model. Accumulating multiple steps before the optimizer step helps smooth over quirks in the data distribution, leading to better generalization.

It is quite common to see a total batch size (these two values multiplied together) of around 12 to create a stable training process, but this can be adjusted up or down depending on the size and diversity of the dataset.

The other factor to consider is that your GPU memory is a limited resource, and you may need to adjust the batch size to stay within that limit. While you generally want to set your per_device_train_batch_size as high as possible to maximize speed, you may need to adjust it down if you are running out of memory.

For my training run, I am going to shoot for 6 since that is what will fit in my GPU without OOM (Out of Memory) errors and should provide enough variety to not over-optimize on any one document.

Tooling

There are several different tooling options for training your own model. The most common is to use the trl library, which is a library that is built on top of accelerate and peft, and supported by HuggingFace.

On top of trl, unsloth is a library that has several memory optimizations that can be used to speed up training and reduce memory usage, particularly for training on one GPU.

For those that want the easiest option, axolotl is a configurable training platform that can be used to train a model with a single command driven from a yaml config file.

There is a huge ecosystem of tools and libraries that can be mixed and matched if you are willing to put in the effort. LinkedIn has released the liger-kernel project, which rewrites several portions of the HuggingFace training pipeline and has some interesting techniques for efficient training.

Unsloth Example

I generally use unsloth to train my models. This specific example will require 12GB of VRAM, and on my 3060 takes around 4 days to run.

# train.py
import os
import glob
from datasets import load_from_disk, Dataset
from dotenv import dotenv_values
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import TrainingArguments
from trl import SFTTrainer
import wandb

# --- Configuration for Preprocessed Data Paths ---
PREPROCESSED_SFT_DATASET_PATH = 'sft_ready_dataset'
MAX_SEQ_LENGTH = 16384 + 2048
LOAD_IN_4BIT = True
LORA_RANK = 256                             # Updated based on information-capacity analysis
LORA_ALPHA = 256                            # Set to 1 to 2x rank for balanced training
OUTPUT_DIR = "outputs"
EVAL_SPLIT_SIZE = 32
USE_RSLORA = True                           # We have a large rank (>16), so we should use rsLoRA

MODEL_NAME = "Qwen/Qwen3-1.7B"
PROJECT_NAME = "qwen3-1.7B-blog"

# Store your keys in an .env file for security
envconfig = dict(dotenv_values(".env"))

os.makedirs(OUTPUT_DIR, exist_ok=True)

# 1. Load dataset
print(f"Loading SFT-ready (tokenized) dataset from: '{PREPROCESSED_SFT_DATASET_PATH}'")
sft_ready_dataset_full: Dataset = load_from_disk(PREPROCESSED_SFT_DATASET_PATH)
print("\nSFT-ready dataset loaded:")
print(sft_ready_dataset_full)

# 2. Create train/eval splits
# Just a small sample to act as a sanity check
if len(sft_ready_dataset_full) <= EVAL_SPLIT_SIZE:
    raise ValueError(f"SFT-ready dataset too small for an eval split of size {EVAL_SPLIT_SIZE}.")
train_sft_size = len(sft_ready_dataset_full) - EVAL_SPLIT_SIZE
ds_train_sft = sft_ready_dataset_full.select(range(train_sft_size))
ds_eval_sft = sft_ready_dataset_full.select(range(train_sft_size, len(sft_ready_dataset_full)))
print(f"\nSFT Training set size (tokenized): {len(ds_train_sft)}")
print(f"SFT Evaluation set size (tokenized): {len(ds_eval_sft)}")

# 3. Load Model and Tokenizer
print(f"\nLoading model and tokenizer: {MODEL_NAME}")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = MODEL_NAME,
    max_seq_length = MAX_SEQ_LENGTH,
    dtype = None,
    load_in_4bit = LOAD_IN_4BIT,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print("Tokenizer pad_token set to eos_token in main.")

# 4. Apply LoRA
model = FastLanguageModel.get_peft_model(
    model, r = LORA_RANK,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"],
    lora_alpha = LORA_ALPHA, lora_dropout = 0.02, bias = "none",
    rank_pattern     = {"k_proj": LORA_RANK // 2},    # halve the k_proj rank (see the GQA discussion above)
    alpha_pattern    = {"k_proj": LORA_ALPHA // 2},
    use_gradient_checkpointing = "unsloth", random_state = 3407,
    use_rslora = USE_RSLORA
)

# 5. Training arguments - optimized for information-capacity balance
targs = TrainingArguments(
    per_device_train_batch_size = 2, gradient_accumulation_steps = 3,
    learning_rate = 1e-5,
    weight_decay = 0.001, gradient_checkpointing = True,
    max_grad_norm = 1.0, warmup_steps = 50, num_train_epochs = 2,
    optim = "adamw_torch", # other common options to save memory: adamw_8bit, paged_adamw_32bit
    lr_scheduler_type = "cosine", seed = 3407,
    fp16 = not is_bfloat16_supported(), bf16 = is_bfloat16_supported(),
    logging_steps = 1, per_device_eval_batch_size = 1, eval_strategy = "steps",
    eval_steps = 25, save_strategy = "steps", save_steps = 25,
    save_total_limit = 3, output_dir = OUTPUT_DIR,
    report_to="wandb", remove_unused_columns=False,
)

# 6. Initialize wandb
wandb_key = envconfig.get('wandb_key')
if wandb_key: wandb.login(key=wandb_key)
else: print("⚠️ WandB key not found in .env file.")
wandb.init(
    project=PROJECT_NAME,
    config={
        "learning_rate": targs.learning_rate, 
        "architecture": MODEL_NAME,
        "dataset": 'blogdata', "epochs": targs.num_train_epochs,
        "gradient_accumulation_steps": targs.gradient_accumulation_steps,
        "effective_batch_size": targs.per_device_train_batch_size * targs.gradient_accumulation_steps * targs.world_size,
        "lora_rank": LORA_RANK,
        "lora_alpha": model.peft_config["default"].lora_alpha, "max_seq_length": MAX_SEQ_LENGTH
    }
)

# 7. Create SFTTrainer
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer,
    train_dataset=ds_train_sft, eval_dataset=ds_eval_sft,
    max_seq_length=MAX_SEQ_LENGTH, packing=False, args=targs,
)

# 8. Check for existing checkpoints
checkpoint_dirs = glob.glob(os.path.join(OUTPUT_DIR, "checkpoint-*"))
resume_from_checkpoint = None
if checkpoint_dirs:
    checkpoint_dirs.sort(key=lambda x: int(x.split("-")[-1]))
    latest_checkpoint = checkpoint_dirs[-1]
    print(f"\n🔄 Found existing checkpoint: {latest_checkpoint}")
    resume_from_checkpoint = latest_checkpoint
else:
    print(f"\n🆕 No checkpoints found in {OUTPUT_DIR}, starting fresh training")

# 9. Train
print("\nStarting SFTTrainer training with pre-tokenized data...")
if resume_from_checkpoint:
    trainer_stats = trainer.train(resume_from_checkpoint=resume_from_checkpoint)
else:
    trainer_stats = trainer.train()

if wandb.run:
    wandb.finish()

At the end of the optimized training run, you end up with a well-balanced adapter that generalizes effectively without overfitting.

There are some arguments to be made for different learning rates (from 5e-6 to 1e-4), alpha values (from 0.5x to 2x rank), and numbers of epochs (2.5-3+) you could use. Using different optimizers is also an option, and they have their own parameters to tune related to information decay rates.

Merging The Adapter

To merge the adapter back into the base model, you can use the merge_and_unload method.

# merge_and_unload.py
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-1.7B"

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# load the adapter
model = PeftModel.from_pretrained(model, "outputs")

model = model.merge_and_unload()

model.save_pretrained("merged_model")
tokenizer.save_pretrained("merged_model")

Now you can push it to the HuggingFace hub. I usually use the CLI to do this, because most of the other tooling is brittle. Before that, though, you will want to make a README.md file and set a license.

huggingface-cli upload --repo-type model --token $HF_TOKEN <username>/<repo_name> merged_model

Resources

Script - Information-Capacity Parameter Counter

# param_counter.py
from collections import defaultdict
from datasets import load_from_disk, load_dataset
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM
import torch.nn as nn
import math

BITS_PER_PARAM = 3.6                    # Morris et al. 2025, Fig. 7
ENTROPY_PER_TOKEN = 6.4                 # BPE bits of information (measured for your dataset)
NUM_EPOCHS = 2

def capacity_tokens(p_train: int) -> float:
    """Tokens that would exactly fill the model's bit capacity."""
    return (BITS_PER_PARAM / ENTROPY_PER_TOKEN) * p_train

def capacity_ratio(tokens: float, p_train: int) -> float:
    """Calculate ratio of tokens to capacity tokens (ratio=1 is the knee)."""
    cap_tok = capacity_tokens(p_train)
    return tokens / cap_tok

def memorization_score(tokens: float, p_train: int) -> float:
    """
    Memorization score based on Morris et al. Fig 2.
    Peaks at ratio=1 (knee), falls off on both sides.
    Returns 0-1 where 1 = maximum memorization, 0 = minimum memorization.
    """
    ratio = capacity_ratio(tokens, p_train)
    
    if ratio <= 1.0:
        # Left side: memorization grows linearly to peak
        return ratio
    else:
        # Right side: memorization falls off as power law
        return ratio ** -0.7  # alpha=0.7 from Morris et al.

def generalization_score(tokens: float, p_train: int) -> float:
    """
    Generalization score = 1 - memorization_score.
    Maximum generalization occurs when memorization is minimized (ratio ~1.5-2.0).
    """
    return 1.0 - memorization_score(tokens, p_train)

def training_status(tokens: float, p_train: int) -> str:
    """
    Determine training status based on capacity ratio and generalization score.
    """
    ratio = capacity_ratio(tokens, p_train)
    gen_score = generalization_score(tokens, p_train)
    
    if ratio < 0.8:
        return "Under"      # Well before knee, under-training
    elif ratio < 1.2:
        return "Knee"       # At the knee, memorization peak
    elif ratio < 2.5:
        return "Optimal"    # Past knee, maximum generalization zone
    else:
        return "Over"       # Way past knee, diminishing returns

def best_rank_for_dataset(tok_cnt: int, lora_results: list,
                          h_token: float = ENTROPY_PER_TOKEN,
                          b_param: float = BITS_PER_PARAM) -> dict:
    """
    Return the lora_results entry that gives maximum generalization.
    This occurs when ratio is around 1.5-2.0 (past the memorization peak).
    """
    target_ratio = 1.8  # Sweet spot for max generalization
    
    return min(
        lora_results,
        key=lambda r: abs(capacity_ratio(tok_cnt, r["lora_params_M"] * 1e6) - target_ratio)
    )

MODEL_NAME = "Qwen/Qwen3-1.7B"

snapshot_download(MODEL_NAME, max_workers=11)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# go through the model and count the number of parameters on each layer
exclude_layers = ["embed_tokens", "lm_head"]

params = defaultdict(int)
for name, param in model.state_dict().items():
    path = name.split(".")
    name = path[-2]
    params[name] += param.numel()
    
params = dict(params)
print(f"\n{'Layer Name':<25} | {'Parameters (M)':<15} | {'Excluded':<10}")
print(f"{'-'*25}-+-{'-'*15}-+-{'-'*10}")
parameter_count = 0
for name, count in params.items():
    if name not in exclude_layers:
        parameter_count += count
    print(f"{name:<25} | {count/1e6:<15.2f} | {'Yes' if name in exclude_layers else 'No'}")
print(f"{'-'*25}-+-{'-'*15}")
print(f"{'Total parameters':<25} | {parameter_count/1e6:.2f}M")
print(f"{'-'*25}-+-{'-'*15}")

# Load the dataset and get the total length of the input ids
dataset = load_from_disk("/mnt/biggy/ai/notebook/train/sft_ready_dataset")
get_length = lambda x: {"length": [len(r) for r in x["input_ids"]]}
lengths = dataset.map(get_length, batched=True, num_proc=10)
length_total = sum(lengths["length"])

print(f"{'Dataset length':<25} | {length_total/1e6:.2f}M")

length_total = length_total * NUM_EPOCHS
print(f"{str(NUM_EPOCHS) + ' epochs':<25} | {length_total/1e6:.2f}M")
print(f"{'-'*25}-+-{'-'*15}")

# The T/P ratio is the total length of the input ids divided by the total number of parameters
print(f"{'T/P ratio':<25} | {length_total/parameter_count:.2f}")

print(f"Using ENTROPY_PER_TOKEN = {ENTROPY_PER_TOKEN} bits/token")

print("\n--- LoRA Information-Capacity Analysis ---")

ranks = [4, 8, 16, 32, 64, 128, 256, 512, 1024]
lora_target_module_suffixes = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]

# Calculate LoRA parameters for each rank
lora_results = []
for rank_value in ranks:
    lora_params_for_rank = 0
    for module_name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            is_target_type = any(module_name.endswith(f".{suffix}") for suffix in lora_target_module_suffixes)
            is_lm_head = module_name == "lm_head" or module_name.endswith(".lm_head")

            if is_target_type and not is_lm_head:
                in_features = module.in_features
                out_features = module.out_features
                lora_params_for_rank += rank_value * (in_features + out_features)
    
    # Information-capacity metrics
    cap_tok = capacity_tokens(lora_params_for_rank)
    ratio = capacity_ratio(length_total, lora_params_for_rank)
    tp_ratio_lora_specific = length_total / lora_params_for_rank if lora_params_for_rank > 0 else float('inf')
    
    # Memorization and generalization scores
    mem_score = memorization_score(length_total, lora_params_for_rank)
    gen_score_val = generalization_score(length_total, lora_params_for_rank)
    status = training_status(length_total, lora_params_for_rank)
    
    lora_results.append({
        "rank": rank_value,
        "lora_params_M": lora_params_for_rank / 1e6,
        "tp_ratio": tp_ratio_lora_specific,
        "cap_tokens_M": cap_tok / 1e6,
        "ratio": ratio,
        "mem_score": mem_score,
        "gen_score": gen_score_val,
        "status": status
    })

# Print the results
print(f"\n{'Rank':<5} | {'Params(M)':<9} | {'T/P':<6} | {'Ratio':<6} | {'MemScore':<8} | {'GenScore':<8} | {'Status':<7}")
print(f"{'-'*5}-+-{'-'*9}-+-{'-'*6}-+-{'-'*6}-+-{'-'*8}-+-{'-'*8}-+-{'-'*7}")
for result in lora_results:
    print(f"{result['rank']:<5} | {result['lora_params_M']:<9.1f} | {result['tp_ratio']:<6.2f} | "
          f"{result['ratio']:<6.2f} | {result['mem_score']:<8.3f} | {result['gen_score']:<8.3f} | {result['status']:<7}")

print(f"\n{'='*70}")
print("INTERPRETATION:")
print("• Ratio = tokens / capacity_tokens (1.0 = knee, >1.0 = past knee)")
print("• MemScore = memorization level (1.0 = max memorization at knee)")  
print("• GenScore = generalization level (higher = better generalization)")
print("• Optimal status = maximum generalization zone (ratio ~1.5-2.0)")
print(f"{'='*70}")

opt = best_rank_for_dataset(length_total, lora_results)
print(f"\n>> OPTIMAL RANK for MAX GENERALIZATION: {opt['rank']}")
print(f"   - Trainable params: {opt['lora_params_M']:.1f}M")
print(f"   - T/P ratio: {opt['tp_ratio']:.2f}")
print(f"   - Capacity ratio: {opt['ratio']:.2f}")
print(f"   - Generalization score: {opt['gen_score']:.3f}")
print(f"   - Status: {opt['status']}")

# Find the rank with highest generalization score
max_gen_rank = max(lora_results, key=lambda r: r['gen_score'])
print(f"\n>> HIGHEST GENERALIZATION SCORE: Rank {max_gen_rank['rank']} (GenScore: {max_gen_rank['gen_score']:.3f})")