Random Bitflip on CLM#
This tutorial covers two workflows:
1. Post-training bitflip evaluation: load a pretrained checkpoint, inject random bitflips, and evaluate the transformed model.
2. Bitflip-aware pretraining: pretrain a model from scratch with bitflip noise injected during every forward pass.
Note
If you have not set up the environment yet, follow Installation first.
Overview#
Post-training evaluation entry point: experiments/llm-bitflip/transform/minimal.py
Bitflip-aware pretraining entry point: experiments/llm-bitflip/pretrain/run.py
Random bitflip kernels are implemented in mase-triton. The core function is
mase_triton.random_bitflip.core.random_bitflip_fn, which supports independent bitflip probabilities for sign-exponent bits and mantissa bits, and can zero out outliers / NaN values via a threshold.
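To make the behaviour described above concrete, here is a minimal pure-Python sketch of a random bitflip applied to a single value. It is an illustration only, not the mase-triton kernel: the helper name `flip_float32`, the float32 bit layout, and the exact zero-out rule (NaN or magnitude above the threshold becomes 0) are assumptions made for this example.

```python
import random
import struct

# Assumed IEEE-754 float32 layout: bit 31 is the sign, bits 23-30 the exponent.
SIGN_EXP_BITS = range(23, 32)
MANTISSA_BITS = range(0, 23)


def flip_float32(x, p_exp, p_frac, zero_out_t=None, rng=random):
    """Hypothetical helper: flip each bit of x (viewed as float32) independently.

    p_exp is the flip probability for sign/exponent bits and p_frac the one
    for mantissa bits. If zero_out_t is given, a NaN result or a result whose
    magnitude exceeds zero_out_t is replaced by 0.0 (assumed rule).
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    for i in SIGN_EXP_BITS:
        if rng.random() < p_exp:
            bits ^= 1 << i
    for i in MANTISSA_BITS:
        if rng.random() < p_frac:
            bits ^= 1 << i
    (y,) = struct.unpack("<f", struct.pack("<I", bits))
    if zero_out_t is not None and (y != y or abs(y) > zero_out_t):
        return 0.0
    return y
```

Flipping a single exponent bit can change a value by many orders of magnitude, which is why a threshold that zeroes out outliers is useful alongside the two separate probabilities.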
Note
The bitflip probability must be a power of 0.5 (e.g., 0.5, 0.5², 0.5³, …).
If you pass any other value, the kernel snaps it to the nearest valid value automatically.
The minimum supported probability is 0.5²⁴ ≈ 5.96 × 10⁻⁸ due to the Philox
pseudo-random number generator used internally.
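If you want to know in advance which probability will effectively be used, a sketch like the following may help. It assumes that "nearest" means nearest on a log2 scale and that exponents are limited to 1 through 24; the kernel's actual rounding rule is not specified here, so treat `snap_to_power_of_half` as a hypothetical helper rather than the library's behaviour.

```python
import math

MAX_HALVINGS = 24  # 0.5**24 is the minimum supported probability stated above


def snap_to_power_of_half(p):
    """Hypothetical helper: return (n, 0.5**n) closest to p on a log2 scale.

    Assumes valid exponents are n = 1, 2, ..., 24.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in (0, 1)")
    n = round(-math.log2(p))
    n = min(max(n, 1), MAX_HALVINGS)  # clamp to the assumed valid range
    return n, 0.5 ** n
```

For example, a requested probability of 1e-4 would map to n = 13 under this rule, and anything below 0.5²⁴ would be clamped to n = 24.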
Post-Training Bitflip Evaluation#
We provide minimal scripts that apply a bitflip transform to all linear layers
(>90% of FLOPs in Transformers) in a HuggingFace model, then evaluate with
lm-eval-harness.
Transform and evaluate#
cd experiments/llm-bitflip/transform
# Use the HuggingFace pretrained checkpoint:
model_name="AICrossSim/clm-60m"
# Or use a locally trained checkpoint (convert first — see the pretraining tutorial):
# model_name="/path/to/experiments/llm-digital/pretrain/outputs/hf/aixsim-60M"
batch_size="8"
x_p_exp=$(bc <<< "scale=15; 0.5^12")
w_p_exp=$(bc <<< "scale=15; 0.5^12")
x_p_frac=$(bc <<< "scale=15; 0.5^12")
w_p_frac=$(bc <<< "scale=15; 0.5^12")
x_zero_out_t="30"
w_zero_out_t="1.25"
python minimal.py eval-bitflip \
--model_name ${model_name} \
--batch_size ${batch_size} \
--bitflip_config "default" \
--default_bitflip_config.x_p_exp=${x_p_exp} \
--default_bitflip_config.x_p_frac=${x_p_frac} \
--default_bitflip_config.x_zero_out_t=${x_zero_out_t} \
--default_bitflip_config.w_p_exp=${w_p_exp} \
--default_bitflip_config.w_p_frac=${w_p_frac} \
--default_bitflip_config.w_zero_out_t=${w_zero_out_t} \
--tasks ['wikitext']
Note
eval-bitflip uses lm-eval-harness’s simple_evaluate.
See the evaluation section of LLM Pretraining & Evaluation for argument details.
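To get a feel for the noise level that 0.5^12 implies, the short calculation below estimates how often a single value is hit. It assumes a float32 layout (9 sign-exponent bits, 23 mantissa bits) and independent flips per bit; the function names are made up for this example, and the scripts may run in another dtype (for bfloat16, pass `n_frac_bits=7`).

```python
def expected_flips_per_value(p_exp, p_frac, n_sign_exp_bits=9, n_frac_bits=23):
    """Expected number of flipped bits in one value (bits flip independently)."""
    return n_sign_exp_bits * p_exp + n_frac_bits * p_frac


def prob_any_flip(p_exp, p_frac, n_sign_exp_bits=9, n_frac_bits=23):
    """Probability that at least one bit of a value is flipped."""
    p_untouched = (1.0 - p_exp) ** n_sign_exp_bits * (1.0 - p_frac) ** n_frac_bits
    return 1.0 - p_untouched


p = 0.5 ** 12  # same value as $(bc <<< "scale=15; 0.5^12")
print(expected_flips_per_value(p, p))  # 0.0078125, i.e. 32 / 4096
print(prob_any_flip(p, p))
```

Under these assumptions, roughly one value in 128 has a bit flipped, in both the activations (`x_*`) and the weights (`w_*`).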
Evaluate the original model (clean baseline)#
model_name="AICrossSim/clm-60m"
# Or your local checkpoint:
# model_name="/path/to/experiments/llm-digital/pretrain/outputs/hf/aixsim-60M"
batch_size="8"
python minimal.py eval-ori \
--model_name ${model_name} \
--batch_size ${batch_size} \
--tasks ['wikitext']
Text generation with bitflip#
model_name="AICrossSim/clm-60m"
# Or your local checkpoint:
# model_name="/path/to/experiments/llm-digital/pretrain/outputs/hf/aixsim-60M"
prompt="London is"
max_new_tokens="100"
x_p_exp=$(bc <<< "scale=15; 0.5^12")
w_p_exp=$(bc <<< "scale=15; 0.5^12")
x_p_frac=$(bc <<< "scale=15; 0.5^12")
w_p_frac=$(bc <<< "scale=15; 0.5^12")
x_zero_out_t="30"
w_zero_out_t="1.25"
python minimal.py hf-gen \
${model_name} \
--prompt "${prompt}" \
--max_new_tokens ${max_new_tokens} \
--do_sample true \
--temperature 0.6 \
--top_k 50 \
--top_p 0.9 \
--bitflip_config "default" \
--default_bitflip_config.x_p_exp=${x_p_exp} \
--default_bitflip_config.x_p_frac=${x_p_frac} \
--default_bitflip_config.x_zero_out_t=${x_zero_out_t} \
--default_bitflip_config.w_p_exp=${w_p_exp} \
--default_bitflip_config.w_p_frac=${w_p_frac} \
--default_bitflip_config.w_zero_out_t=${w_zero_out_t}
Tip
We swept x_p_frac and w_p_frac on AICrossSim/clm-1.1b and observed
that, as long as perplexity increases by only about 1%, the generated text remains
consistent with that of the clean model.
Sample outputs: Google Sheets
Bitflip-Aware Pretraining#
The script experiments/llm-bitflip/pretrain/run.py extends the standard CLM
pretraining script (see LLM Pretraining & Evaluation) with an additional
argument for the bitflip transform configuration.
We demonstrate with AICrossSim-CLM-60M on 2 × H100 96 GB.
Generate a config with bitflip settings:
cd experiments/llm-bitflip/pretrain
bitflip_transform_config="./configs/meta/fc-only-w-a-exp-frac.yaml"
python run.py generate-cfg \
${bitflip_transform_config} \
--model_arch "aixsim" \
--model_flavor "60M" \
--batch_size 48 \
--data_parallel_replicate_degree 2 \
--data_parallel_shard_degree -1 \
--token_num_scale 22 \
--compile "false" \
--save_path "./configs/tutorial-60m.yaml"
Launch pretraining:
num_gpus="2"
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" STREAM_HF_DATA="1" \
torchrun --nproc_per_node=${num_gpus} --rdzv_backend c10d \
--rdzv_endpoint="localhost:0" --local-ranks-filter 0 \
--role rank --tee 3 \
run.py pretrain \
--config configs/tutorial-60m.yaml \
--metrics_args.enable_wandb false
Convert the checkpoint to HuggingFace format:
python run.py convert-ckpt pt2hf \
aixsim 60M \
./outputs/checkpoints/aixsim-60M/<timestamp>/<step-xxx> \
./outputs/hf/bitflip-60M
Evaluating and comparing the three settings#
Once you have converted checkpoints, you can compare three settings using the same
bitflip parameters and wikitext perplexity as the metric:
1. Digital baseline + post-training bitflip transform (no bitflip-aware training):
cd experiments/llm-bitflip/transform
model_name="AICrossSim/clm-60m" # or your local digital checkpoint
batch_size="8"
x_p_exp=$(bc <<< "scale=15; 0.5^12")
w_p_exp=$(bc <<< "scale=15; 0.5^12")
x_p_frac=$(bc <<< "scale=15; 0.5^12")
w_p_frac=$(bc <<< "scale=15; 0.5^12")
x_zero_out_t="30"
w_zero_out_t="1.25"
python minimal.py eval-bitflip \
--model_name ${model_name} \
--batch_size ${batch_size} \
--bitflip_config "default" \
--default_bitflip_config.x_p_exp=${x_p_exp} \
--default_bitflip_config.x_p_frac=${x_p_frac} \
--default_bitflip_config.x_zero_out_t=${x_zero_out_t} \
--default_bitflip_config.w_p_exp=${w_p_exp} \
--default_bitflip_config.w_p_frac=${w_p_frac} \
--default_bitflip_config.w_zero_out_t=${w_zero_out_t} \
--tasks ['wikitext']
2. Bitflip-aware pretrained model + post-training bitflip:
model_name="AICrossSim/bitflip-fc-clm-60m" # or your local bitflip checkpoint
batch_size="8"
x_p_exp=$(bc <<< "scale=15; 0.5^12")
w_p_exp=$(bc <<< "scale=15; 0.5^12")
x_p_frac=$(bc <<< "scale=15; 0.5^12")
w_p_frac=$(bc <<< "scale=15; 0.5^12")
x_zero_out_t="30"
w_zero_out_t="1.25"
python minimal.py eval-bitflip \
--model_name ${model_name} \
--batch_size ${batch_size} \
--bitflip_config "default" \
--default_bitflip_config.x_p_exp=${x_p_exp} \
--default_bitflip_config.x_p_frac=${x_p_frac} \
--default_bitflip_config.x_zero_out_t=${x_zero_out_t} \
--default_bitflip_config.w_p_exp=${w_p_exp} \
--default_bitflip_config.w_p_frac=${w_p_frac} \
--default_bitflip_config.w_zero_out_t=${w_zero_out_t} \
--tasks ['wikitext']
3. Bitflip-aware pretrained model, clean evaluation (no bitflip at inference):
model_name="AICrossSim/bitflip-fc-clm-60m" # or your local bitflip checkpoint
batch_size="8"
python minimal.py eval-ori \
--model_name ${model_name} \
--batch_size ${batch_size} \
--tasks ['wikitext']
Expected outcome: setting 2 should have lower perplexity than setting 1 under the same bitflip noise, showing that bitflip-aware pretraining improves robustness. Setting 3 reports the clean perplexity of the bitflip-aware model, which serves as the best-case reference for setting 2: a lower bound on its perplexity, i.e., an upper bound on its quality.
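Once the three runs have finished, the comparison can be summarized with a few lines of Python. This is a sketch with made-up function names; it assumes you copy one perplexity value per setting from the lm-eval-harness output by hand (for wikitext, typically the word perplexity) and it contains no measured results.

```python
def relative_increase(noisy_ppl, clean_ppl):
    """Relative perplexity increase of a noisy run over a clean run."""
    return (noisy_ppl - clean_ppl) / clean_ppl


def summarize(ppl_digital_bitflip, ppl_aware_bitflip, ppl_aware_clean):
    """Compare settings 1, 2 and 3 (argument order follows the list above)."""
    return {
        # True if bitflip-aware pretraining helps under the same noise.
        "aware_beats_digital_under_noise": ppl_aware_bitflip < ppl_digital_bitflip,
        # How much the bitflip-aware model loses when noise is switched on.
        "aware_noise_penalty": relative_increase(ppl_aware_bitflip, ppl_aware_clean),
    }
```

Call `summarize(ppl_1, ppl_2, ppl_3)` with the perplexities of settings 1, 2 and 3. A small `aware_noise_penalty` means the bitflip-aware model is close to its clean performance even with noise injected.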
Results Summary#
| Model | Environment | Training time | Config | W&B | HuggingFace checkpoint |
|---|---|---|---|---|---|
| 60M | 2× H100 96 GB | 2.5 hours | | | |
| 200M | 2× H100 96 GB | 14.3 hours | | | |
| 400M | 6× A6000 48 GB | 33 hours | | | |
| 1.1B | 8× H200 141 GB | 51 hours | | | |