Bitflip-Aware LoRA Fine-Tuning#

This tutorial walks through bitflip-aware LoRA fine-tuning on a pretrained LLM (e.g., unsloth/Llama-3.1-8B).

Note

If you have not set up the environment yet, follow Installation first.

Overview#

Bitflip-aware LoRA fine-tuning combines two ideas:

Random Bitflip Simulation — During the forward pass, random bit flips are injected into both activations and weights of every linear layer (except lm_head), emulating hardware-level bit errors in approximate or unreliable compute substrates.
Low-Rank Adaptation (LoRA) — Instead of updating all parameters, small low-rank matrices (lora_A, lora_B) are attached to each linear layer; only these are trained. The original pretrained weights are frozen.

By training with bitflip noise present, the LoRA adapters learn to compensate for hardware-induced errors, making the model more resilient at inference time.

How It Works#

Each nn.Linear layer (except lm_head) is replaced by a BitFlipLinearLora layer whose forward pass computes:

\[Y = \text{bitflip}(X) \cdot \text{bitflip}(W + B \cdot A \cdot \text{scaling})^T\]

where:

\(X\) — input activation (with optional bitflip noise)
\(W\) — frozen pretrained weight
\(A\) (lora_A) and \(B\) (lora_B) — trainable low-rank matrices
\(\text{scaling} = \text{lora\_alpha} / r\)
\(\text{bitflip}(\cdot)\) — applies random bit flips with configurable per-component probabilities
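The formula above can be illustrated with a small pure-Python sketch. It is not the BitFlipLinearLora implementation: it assumes float32 storage (the real kernel may operate on another format such as bfloat16), flips each bit independently, and assumes the zero-out threshold replaces NaN values and magnitudes above the threshold with zero. The names bitflip_fp32 and bitflip_lora_forward are hypothetical.

```python
import math
import random
import struct


def bitflip_fp32(value, p_exp, p_frac, zero_out_t, rng):
    """Flip each bit of a float32 independently, then zero out NaN/outliers."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    for i in range(32):
        # Bits 0-22 hold the mantissa; bits 23-31 hold the exponent and sign.
        p = p_frac if i < 23 else p_exp
        if rng.random() < p:
            bits ^= 1 << i
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits))
    # Assumed semantics of the threshold: drop NaN and large magnitudes.
    if math.isnan(flipped) or abs(flipped) > zero_out_t:
        return 0.0
    return flipped


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def bitflip_lora_forward(x, w, lora_a, lora_b, lora_alpha, r, cfg, rng):
    """Y = bitflip(X) . bitflip(W + B . A . scaling)^T for nested-list inputs."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    w_eff = [
        [w_ij + scaling * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]
    x_noisy = [
        [bitflip_fp32(v, cfg["x_p_exp"], cfg["x_p_frac"], cfg["x_zero_out_t"], rng)
         for v in row]
        for row in x
    ]
    w_noisy = [
        [bitflip_fp32(v, cfg["w_p_exp"], cfg["w_p_frac"], cfg["w_zero_out_t"], rng)
         for v in row]
        for row in w_eff
    ]
    return matmul(x_noisy, transpose(w_noisy))
```

With all probabilities set to zero the function reduces to an ordinary LoRA linear layer, which is a convenient sanity check before enabling noise.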

The model transformation is handled by transform_llama, which replaces all nn.Linear modules (excluding lm_head) with BitFlipLinearLora.

Entry Points#

File	Description
experiments/llm-bitflip/lora_finetune/run_clm_no_trainer.py	Main training script (HuggingFace Accelerate, no Trainer).
experiments/llm-bitflip/lora_finetune/fine-tune-bitflip-clm.sh	Shell wrapper that computes training steps and launches the run.
experiments/llm-bitflip/lora_finetune/transform_cfg.toml	Bitflip + LoRA configuration file.

Step-by-Step Guide#

Step 1 — Configure the Bitflip & LoRA Transform#

The transform configuration is defined in a TOML file. The default configuration is at experiments/llm-bitflip/lora_finetune/transform_cfg.toml:

use_lora = true

[fc]
    w_p_exp = 1.52587890625e-05
    w_p_frac = 1.52587890625e-05
    w_zero_out_t = 1.25
    x_p_exp = 1.52587890625e-05
    x_p_frac = 1.52587890625e-05
    x_zero_out_t = 30.0

[lora]
    r = 32
    lora_alpha = 32

Configuration parameters#
Section	Parameter	Description
(top-level)	`use_lora`	Enable LoRA adaptation. When `false`, all parameters are trained.
`[fc]`	`w_p_exp`	Bitflip probability for the sign-exponent bits of the weight.
`[fc]`	`w_p_frac`	Bitflip probability for the mantissa bits of the weight.
`[fc]`	`w_zero_out_t`	Threshold for zeroing out weight outliers / NaN values.
`[fc]`	`x_p_exp`	Bitflip probability for the sign-exponent bits of the activation.
`[fc]`	`x_p_frac`	Bitflip probability for the mantissa bits of the activation.
`[fc]`	`x_zero_out_t`	Threshold for zeroing out activation outliers / NaN values.
`[lora]`	`r`	LoRA rank.
`[lora]`	`lora_alpha`	LoRA scaling factor (effective scaling = `lora_alpha / r`).

Note

The bitflip probability must be a power of 0.5 (e.g., 0.5^16 ≈ 1.526e-05). The kernel snaps to the nearest valid value automatically. Minimum supported probability is 0.5^24 ≈ 5.96e-08. See Mase-Triton for details.
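To check in advance which value a requested probability would map to, a sketch like the following may help. It assumes that "nearest" means rounding the exponent in the log2 domain and that exponents are clamped to the range 0 to 24; the actual rounding rule lives in the kernel, so treat snap_bitflip_probability as a hypothetical helper rather than the library's behavior.

```python
import math

MAX_EXPONENT = 24  # 0.5**24 is the minimum supported probability


def snap_bitflip_probability(p):
    """Return the power of 0.5 closest to p (rounding in the log2 domain)."""
    if not 0.0 < p <= 1.0:
        raise ValueError(f"probability must be in (0, 1], got {p}")
    k = round(-math.log2(p))
    k = min(max(k, 0), MAX_EXPONENT)  # assumed clamp to the supported range
    return 0.5 ** k
```

For example, the default value 1.52587890625e-05 is exactly 0.5^16 and is returned unchanged, while a request such as 1e-9 falls below the supported minimum and is clamped to 0.5^24.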

Step 2 — Understand the Training Budget#

The shell script fine-tune-bitflip-clm.sh automatically calculates the number of training steps from a token budget equal to 1% of the model's parameter count. For unsloth/Llama-3.1-8B (8B parameters):

fine-tune tokens = 8,000,000,000 / 100 = 80,000,000
tokens per step  = num_gpus × per_device_batch_size × block_size
max_train_steps  = fine-tune tokens / tokens per step

For 8 GPUs, batch size 1, block size 2048:

tokens per step = 8 × 1 × 2048 = 16,384
max_train_steps = 80,000,000 / 16,384 ≈ 4,883 steps
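The same arithmetic can be reproduced in a few lines of Python, which is handy when planning a run with a different GPU count or block size. The function name compute_max_train_steps is hypothetical, and rounding up to the next whole step is an assumption that matches the 4,883 figure above; the shell script may round differently.

```python
import math


def compute_max_train_steps(num_params, num_gpus, per_device_batch_size, block_size):
    """Steps needed to consume a token budget of 1% of the parameter count."""
    fine_tune_tokens = num_params // 100
    tokens_per_step = num_gpus * per_device_batch_size * block_size
    # Assumed rounding: take the next whole step so the budget is fully covered.
    return math.ceil(fine_tune_tokens / tokens_per_step)


steps_8_gpus = compute_max_train_steps(8_000_000_000, 8, 1, 2048)
steps_4_gpus = compute_max_train_steps(8_000_000_000, 4, 1, 2048)
```

Halving the number of GPUs halves the tokens per step, so the step count roughly doubles while the total token budget stays the same.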

Step 3 — Launch Fine-Tuning#

cd experiments/llm-bitflip/lora_finetune

The shell script accepts positional arguments:

./fine-tune-bitflip-clm.sh [num_processes] [model_name_or_path] \
    [per_device_train_batch_size] [learning_rate] [weight_decay] \
    [gradient_accumulation_steps] [block_size]

Example: fine-tune Llama-3.1-8B on 8 GPUs with default settings:

./fine-tune-bitflip-clm.sh 8 unsloth/Llama-3.1-8B 1 1e-5 0.01 2 2048

This is equivalent to:

uv run accelerate launch --num_processes=8 \
    run_clm_no_trainer.py \
    --model_name_or_path unsloth/Llama-3.1-8B \
    --dataset_name Cheng98/fineweb-edu-1.25B \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-5 \
    --weight_decay 0.01 \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type linear \
    --output_dir ./output/Llama-3.1-8B-bitflip-lora \
    --preprocessing_num_workers 32 \
    --trust_remote_code \
    --with_tracking \
    --report_to wandb \
    --transform_cfg ./transform_cfg.toml \
    --block_size 2048 \
    --log_train_loss_steps 50 \
    --max_train_steps 4883 \
    --wandb_tags unsloth/Llama-3.1-8B,lr1e-5,steps4883
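For readers who prefer to launch from Python, the mapping from the wrapper's positional arguments to the flags above can be sketched as follows. The function build_launch_command is hypothetical, and the way it derives the output directory and W&B tags is inferred from the single example shown here, so the real script may construct them differently.

```python
def build_launch_command(num_processes, model, batch_size, learning_rate,
                         weight_decay, grad_accum_steps, block_size,
                         max_train_steps):
    """Assemble the accelerate command line as a list of strings."""
    model_short = model.rstrip("/").split("/")[-1]  # e.g. "Llama-3.1-8B"
    return [
        "uv", "run", "accelerate", "launch", f"--num_processes={num_processes}",
        "run_clm_no_trainer.py",
        "--model_name_or_path", model,
        "--dataset_name", "Cheng98/fineweb-edu-1.25B",
        "--per_device_train_batch_size", str(batch_size),
        "--per_device_eval_batch_size", str(batch_size),
        "--learning_rate", str(learning_rate),
        "--weight_decay", str(weight_decay),
        "--num_train_epochs", "1",
        "--gradient_accumulation_steps", str(grad_accum_steps),
        "--lr_scheduler_type", "linear",
        # Assumed naming scheme, inferred from the example output path.
        "--output_dir", f"./output/{model_short}-bitflip-lora",
        "--preprocessing_num_workers", "32",
        "--trust_remote_code",
        "--with_tracking",
        "--report_to", "wandb",
        "--transform_cfg", "./transform_cfg.toml",
        "--block_size", str(block_size),
        "--log_train_loss_steps", "50",
        "--max_train_steps", str(max_train_steps),
        "--wandb_tags", f"{model},lr{learning_rate},steps{max_train_steps}",
    ]
```

The resulting list can be passed to subprocess.run from the lora_finetune directory; learning_rate is best given as the string "1e-5" so that the tag matches the one shown above.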

Key arguments#
Argument	Description
`--model_name_or_path`	HuggingFace model identifier or local path.
`--dataset_name`	Training dataset. We use a 1.25B-token subset of FineWeb-Edu.
`--transform_cfg`	Path to the TOML config for bitflip + LoRA.
`--block_size`	Context length for training samples.
`--log_train_loss_steps`	Log training loss to W&B every N steps.
`--max_train_steps`	Total optimizer steps (auto-calculated by the shell script).

Tip

The first argument to fine-tune-bitflip-clm.sh controls --num_processes. The script automatically recalculates max_train_steps to maintain the same total token budget regardless of the number of GPUs.

Step 4 — Monitor Training#

If W&B is configured (wandb login), training loss and validation perplexity are logged automatically to the Bitflip-CLM-Fine-tune project.

Training loss is logged every 50 steps (configurable via --log_train_loss_steps).
Validation perplexity is evaluated at the end of each epoch on the first 64 batches.
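Perplexity is conventionally the exponential of the mean token-level cross-entropy loss. The sketch below shows that relationship for a list of per-batch mean losses, assuming equally sized batches and the 64-batch cap mentioned above; perplexity_from_losses is a hypothetical helper, not a function from the training script.

```python
import math


def perplexity_from_losses(batch_losses, max_batches=64):
    """exp(mean loss) over at most the first max_batches evaluation batches."""
    used = batch_losses[:max_batches]
    if not used:
        raise ValueError("need at least one evaluation batch")
    return math.exp(sum(used) / len(used))
```

Under this definition, a validation perplexity of 11.01 corresponds to a mean loss of roughly 2.40 nats per token.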

Step 5 — Output#

After training, the fine-tuned model (LoRA weights merged into the base model) and tokenizer are saved to the output directory:

./output/Llama-3.1-8B-bitflip-lora/
├── config.json
├── model.safetensors
├── tokenizer.json
├── tokenizer_config.json
└── all_results.json         # Final perplexity

Results#

Training Curves#

Metric	Value
Final Training Loss (↓)	2.50
Final Validation Perplexity (↓)	11.01
Total Training Steps	4883

Comparison with Baselines#

Bitflipped	Fine-tuned	Bitflip config	Fine-tune config	Val PPL (↓)
✘	✘	N/A	N/A	7.91
✔	✘	`w/x_p_exp=1.53e-5, w/x_p_frac=1.53e-5`	N/A	1008.95
✔	✔	`w/x_p_exp=1.53e-5, w/x_p_frac=1.53e-5`	LoRA rank=32	11.01

LoRA fine-tuning reduces validation perplexity from 1008.95 to 11.01 on the 8B model, a reduction of about 99%, although the result remains above the 7.91 of the clean baseline. Increasing the LoRA rank or using full fine-tuning would likely improve results further.
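The headline numbers can be checked directly from the table values. The short calculation below is only arithmetic on the reported perplexities; the helper name relative_reduction is chosen for this example.

```python
def relative_reduction(before, after):
    """Fraction by which a metric dropped, relative to its starting value."""
    return (before - after) / before


clean_ppl = 7.91          # no bitflips, no fine-tuning
bitflip_ppl = 1008.95     # bitflips, no fine-tuning
finetuned_ppl = 11.01     # bitflips, LoRA fine-tuned

reduction = relative_reduction(bitflip_ppl, finetuned_ppl)
gap_to_clean = finetuned_ppl / clean_ppl
```

Here reduction comes out at about 0.989, and gap_to_clean at about 1.39, meaning the fine-tuned model's perplexity under bitflips is still roughly 39% higher than that of the clean baseline.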

Resources#

W&B logs: https://wandb.ai/cz98/Bitflip-CLM-Fine-tune
Training config: transform_cfg.toml

Appendix: Evaluation Scripts#

The baseline comparison above was generated with two evaluation-only scripts that reuse run_clm_no_trainer.py but skip all optimizer steps. Both share the signature:

./script.sh [num_processes] [model_name_or_path] [per_device_batch_size] \
            [block_size] [eval_max_steps]

Script	Purpose
eval-bitflip-no-finetune.sh	Measures perplexity with bitflips injected but no fine-tuning (bitflipped ✔, fine-tuned ✘).
eval-no-biflip-no-finetune.sh	Clean baseline — no bitflips, no fine-tuning (bitflipped ✘, fine-tuned ✘).
