LLM Pretraining & Evaluation#
This tutorial covers pretraining AICrossSim-CLM models and evaluating them on
language modeling benchmarks.
Note
If you have not set up the environment yet, follow Installation first.
Overview#
We pretrain AICrossSim-CLM (60M, 200M, 400M, 1.1B) on the FineWeb-Edu dataset.
We follow the Chinchilla scaling law to determine the number of training tokens: num_tokens = 22 × num_params.
The entry point is experiments/llm-digital/pretrain/run.py.
Run python run.py -h to see all subcommands.
Run python run.py <subcommand> -h for subcommand-specific help.
We use torchrun for distributed training.
Pretrained checkpoints are available on HuggingFace: NewComputeBench-CLM-Digital.
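As a quick illustration of the token budget rule from the overview, the sketch below applies num_tokens = 22 × num_params to each model size. The parameter counts are an assumption: they are the nominal values read off the model names, not exact counts from the model definitions.

```python
# Token budget rule from this tutorial: num_tokens = 22 * num_params.
TOKENS_PER_PARAM = 22

# Assumption: nominal parameter counts taken from the model names,
# not the exact counts reported by the model code.
NOMINAL_PARAMS = {
    "60M": 60_000_000,
    "200M": 200_000_000,
    "400M": 400_000_000,
    "1.1B": 1_100_000_000,
}


def token_budget(num_params: int, scale: int = TOKENS_PER_PARAM) -> int:
    """Return the number of training tokens for a model with num_params parameters."""
    return scale * num_params


budgets = {name: token_budget(n) for name, n in NOMINAL_PARAMS.items()}

for name, tokens in budgets.items():
    print(f"{name}: {tokens / 1e9:.2f}B tokens")
```

Under these nominal counts the 60M model trains on roughly 1.32B tokens and the 1.1B model on roughly 24.2B tokens.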
Pretraining#
The workflow is the same for all model sizes: generate a config, then launch training.
We demonstrate with AICrossSim-CLM-60M.
AICrossSim-CLM-60M#
Change to the pretraining directory:
cd experiments/llm-digital/pretrain
Generate the training config:
Fast development run
Use these flags to reduce memory usage and shorten training for quick tests:
--batch_size: smaller batch size to avoid out-of-memory (OOM) errors.
--data_parallel_replicate_degree: number of data-parallel replicas (typically equal to the number of GPUs).
--data_parallel_shard_degree: shard model parameters across GPUs (FSDP). The default, -1, disables sharding.
--token_num_scale: controls training length via num_tokens = scale × num_params. Set to 1 for a short run.
data_parallel="1"
batch_size="8"
token_num_scale="22"
python run.py generate-cfg \
--model_flavor 60M \
--batch_size ${batch_size} \
--data_parallel_replicate_degree ${data_parallel} \
--token_num_scale ${token_num_scale} \
--compile true \
--save_path ./configs/tutorial-60M.yaml
This generates configs/tutorial-60M.yaml for pretraining on a FineWeb-Edu subset of 22 × 60M tokens with per-device batch size 8 and 1-GPU data parallelism. The --compile flag enables torch.compile for faster training.
Launch pretraining:
num_gpus="1"
cuda_devices="0" # GPU indices to use, e.g. "0,1,2,3"
CUDA_VISIBLE_DEVICES=${cuda_devices} \
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" STREAM_HF_DATA="1" \
torchrun --nproc_per_node=${num_gpus} --rdzv_backend c10d \
--rdzv_endpoint="localhost:0" --local-ranks-filter 0 \
--role rank --tee 3 \
run.py pretrain --config configs/tutorial-60M.yaml \
--metrics_args.enable_wandb false
STREAM_HF_DATA=1 streams the FineWeb-Edu dataset instead of downloading it.
Checkpoints are saved to ./outputs/checkpoints/aixsim-60M/<timestamp>/.
Disable W&B logging with --metrics_args.enable_wandb false if you have not run wandb login.
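To get a feel for how --batch_size, the data-parallel degrees, and --token_num_scale determine run length, here is a rough sketch that estimates the number of optimizer steps. It rests on several assumptions: seq_len=2048 is a hypothetical value (read the real one from the generated YAML), gradient accumulation is ignored, every data-parallel rank (replicated or sharded) is assumed to consume its own per-device batch, and -1 is treated as "no sharding" following the flag description in this tutorial.

```python
import math


def estimate_steps(num_params, token_num_scale, batch_size,
                   dp_replicate=1, dp_shard=-1, seq_len=2048):
    """Estimate the optimizer steps needed to consume the token budget.

    seq_len is a hypothetical placeholder, not a value taken from the repo.
    """
    # Per this tutorial, a shard degree of -1 means sharding is disabled.
    shard = 1 if dp_shard == -1 else dp_shard
    num_tokens = token_num_scale * num_params
    # Tokens consumed per step across all data-parallel ranks.
    tokens_per_step = batch_size * dp_replicate * shard * seq_len
    return math.ceil(num_tokens / tokens_per_step)


# Nominal 60M parameters, per-device batch size 8, single GPU.
full_run = estimate_steps(60_000_000, token_num_scale=22, batch_size=8)
quick_run = estimate_steps(60_000_000, token_num_scale=1, batch_size=8)
print(f"full run: ~{full_run} steps, quick run: ~{quick_run} steps")
```

Under these assumptions, setting the token scale to 1 cuts the step count by roughly a factor of 22, which is why it is convenient for smoke tests.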
Troubleshooting: Fatal Python error: Aborted
After training finishes, torchrun may raise Fatal Python error: Aborted when destroying the process group. This does not affect the training results as long as the error appears after the final checkpoint is saved. Look for log lines similar to:
[rank0]: Finished saving the checkpoint ... in 5.53 seconds.
[rank0]: Training completed
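If you capture the training output to a file, the check described above can be automated. The following is a minimal sketch, not part of the repository: it assumes the log contains the three message fragments quoted in this section and simply compares their positions. The function name is hypothetical.

```python
def run_finished_cleanly(log_text: str) -> bool:
    """Return True if training completed after a checkpoint save and any
    'Fatal Python error: Aborted' message appears only after that point."""
    last_save = completed = first_abort = None
    for i, line in enumerate(log_text.splitlines()):
        if "Finished saving the checkpoint" in line:
            last_save = i  # remember the most recent checkpoint save
        if "Training completed" in line:
            completed = i
        if "Fatal Python error: Aborted" in line and first_abort is None:
            first_abort = i  # remember the first abort message
    if last_save is None or completed is None or completed < last_save:
        return False
    return first_abort is None or first_abort > completed


example_log = """\
[rank0]: Finished saving the checkpoint ... in 5.53 seconds.
[rank0]: Training completed
Fatal Python error: Aborted
"""
print(run_finished_cleanly(example_log))
```

A log in which the abort message precedes "Training completed", or in which no checkpoint save is reported, is flagged as unsafe and should be inspected by hand.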
(Optional) Convert the checkpoint to HuggingFace format:
Why convert?
The training code uses custom distributed model classes. Converting to HuggingFace format lets you use the full HuggingFace ecosystem (generation, evaluation, etc.). This is also required if you want to use your locally trained checkpoint in the bitflip simulation tutorials (see Random Bitflip on CLM).
python run.py convert-ckpt pt2hf \
aixsim 60M \
./outputs/checkpoints/aixsim-60M/<timestamp>/<step-xxx> \
./outputs/hf/aixsim-60M
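The <timestamp> and <step-xxx> placeholders in the command above have to be filled in by hand. As a convenience, the sketch below picks the highest-numbered step directory from the newest run directory. Two assumptions are baked in and should be checked against your own outputs folder: run directory names sort chronologically as plain strings, and step directories are named step-<integer>. The helper name is hypothetical.

```python
import re
from pathlib import Path


def latest_checkpoint(root):
    """Return the highest-numbered step directory of the newest run, or None.

    Assumes run directories sort chronologically by name and that
    checkpoints live in subdirectories named 'step-<integer>'.
    """
    root = Path(root)
    if not root.is_dir():
        return None
    runs = sorted(p for p in root.iterdir() if p.is_dir())
    for run in reversed(runs):  # newest run first
        steps = []
        for p in run.iterdir():
            match = re.fullmatch(r"step-(\d+)", p.name)
            if p.is_dir() and match:
                steps.append((int(match.group(1)), p))
        if steps:
            # Compare numerically so that step-2000 beats step-500.
            return max(steps, key=lambda item: item[0])[1]
    return None


print(latest_checkpoint("./outputs/checkpoints/aixsim-60M"))
```

The printed path can then be passed as the checkpoint argument of convert-ckpt pt2hf. If no run contains a matching step directory, the function returns None.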
Tip
Our 60M results — pretrained on 2 × H100 96 GB for 1 hour.
Config: experiments/llm-digital/pretrain/configs/aixsim-60M.yaml
W&B logs: link
HuggingFace checkpoint: AICrossSim/clm-60m
AICrossSim-CLM-200M#
The 200M model uses Fully Sharded Data Parallel (FSDP) to reduce per-GPU memory at the cost of slightly longer training.
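To illustrate why sharding reduces per-GPU memory, here is a back-of-the-envelope sketch. It assumes about 16 bytes of model state per parameter (weights, gradients, and Adam optimizer states under a common mixed-precision setup), which is a rule of thumb and not a figure measured on this codebase. It also ignores activations, which sharding does not divide in the same way and which can dominate at small model sizes.

```python
def model_state_gib(num_params, shard_degree=1, bytes_per_param=16):
    """Rough per-GPU memory (GiB) for parameters, gradients and optimizer states.

    bytes_per_param=16 is an assumed rule of thumb; activations are excluded.
    """
    return num_params * bytes_per_param / shard_degree / 2**30


# Nominal 200M parameters, as suggested by the model name.
unsharded = model_state_gib(200_000_000, shard_degree=1)
sharded = model_state_gib(200_000_000, shard_degree=2)
print(f"unsharded: {unsharded:.2f} GiB, sharded over 2 GPUs: {sharded:.2f} GiB")
```

The model-state share scales as 1 / shard_degree, while the extra communication needed to gather shards is what lengthens training slightly.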
batch_size="32"
data_parallel_replicate="1"
data_parallel_shard="2"
python run.py generate-cfg \
--model_flavor 200M \
--batch_size ${batch_size} \
--data_parallel_replicate_degree ${data_parallel_replicate} \
--data_parallel_shard_degree ${data_parallel_shard} \
--compile true \
--save_path ./configs/tutorial-200M.yaml
num_gpus="2"
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" STREAM_HF_DATA="1" \
torchrun --nproc_per_node=${num_gpus} --rdzv_backend c10d \
--rdzv_endpoint="localhost:0" --local-ranks-filter 0 \
--role rank --tee 3 \
run.py pretrain --config configs/tutorial-200M.yaml \
--metrics_args.enable_wandb false
Tip
Our 200M results — pretrained on 2 × H100 96 GB for 6.5 hours.
W&B logs: link
HuggingFace checkpoint: AICrossSim/clm-200m
AICrossSim-CLM-400M#
batch_size="12"
data_parallel_replicate="1"
data_parallel_shard="8"
python run.py generate-cfg \
--model_flavor 400M \
--batch_size ${batch_size} \
--data_parallel_replicate_degree ${data_parallel_replicate} \
--data_parallel_shard_degree ${data_parallel_shard} \
--compile true \
--save_path ./configs/tutorial-400M.yaml
num_gpus="8"
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" STREAM_HF_DATA="1" \
torchrun --nproc_per_node=${num_gpus} --rdzv_backend c10d \
--rdzv_endpoint="localhost:0" --local-ranks-filter 0 \
--role rank --tee 3 \
run.py pretrain --config configs/tutorial-400M.yaml \
--metrics_args.enable_wandb false
Tip
Our 400M results — pretrained on 8 × A6000 for 21 hours.
W&B logs: link
HuggingFace checkpoint: AICrossSim/clm-400m
AICrossSim-CLM-1.1B#
batch_size="24"
data_parallel_replicate="1"
data_parallel_shard="8"
python run.py generate-cfg \
--model_flavor 1.1B \
--batch_size ${batch_size} \
--data_parallel_replicate_degree ${data_parallel_replicate} \
--data_parallel_shard_degree ${data_parallel_shard} \
--compile true \
--save_path ./configs/tutorial-1.1B.yaml
num_gpus="8"
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" STREAM_HF_DATA="1" \
torchrun --nproc_per_node=${num_gpus} --rdzv_backend c10d \
--rdzv_endpoint="localhost:0" --local-ranks-filter 0 \
--role rank --tee 3 \
run.py pretrain --config configs/tutorial-1.1B.yaml \
--metrics_args.enable_wandb false
Tip
Our 1.1B results — pretrained on 8 × H100 96 GB for 33 hours.
W&B logs: link
HuggingFace checkpoint: AICrossSim/clm-1.1b
Raw torchrun checkpoints (for resuming): AICrossSim/clm-1.1b-torch-ckpt
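For planning purposes, the training times quoted in the tips can be turned into GPU-hours and an approximate throughput. The sketch below uses the GPU counts and durations reported above. It assumes nominal parameter counts from the model names and a token scale of 22 for every run, and the wall-clock times include overheads such as compilation and checkpointing, so the throughput figures are only indicative. Note that the 400M run used different hardware (A6000) from the others.

```python
TOKENS_PER_PARAM = 22  # token scale stated in this tutorial

# (nominal num_params, num_gpus, wall_clock_hours) from the tips in this section.
RUNS = {
    "60M": (60_000_000, 2, 1.0),
    "200M": (200_000_000, 2, 6.5),
    "400M": (400_000_000, 8, 21.0),
    "1.1B": (1_100_000_000, 8, 33.0),
}


def summarize(num_params, num_gpus, hours):
    """Return (gpu_hours, approximate tokens per second per GPU)."""
    gpu_hours = num_gpus * hours
    tokens = TOKENS_PER_PARAM * num_params
    return gpu_hours, tokens / (gpu_hours * 3600)


summary = {name: summarize(*run) for name, run in RUNS.items()}

for name, (gpu_hours, tok_per_s) in summary.items():
    print(f"{name}: {gpu_hours:g} GPU-hours, ~{tok_per_s:,.0f} tokens/s per GPU")
```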
Evaluation#
Pretraining dataset perplexity#
Evaluate a checkpoint on the pretraining dataset:
# torchrun checkpoint
python run.py eval pt-ppl \
aixsim 60M \
./outputs/checkpoints/aixsim-60M/<timestamp>/<step-xxx>
# HuggingFace checkpoint
python run.py eval hf-ppl \
--batch_size 8 \
AICrossSim/clm-60m
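For readers unfamiliar with the metric, perplexity is conventionally defined as the exponential of the mean per-token negative log-likelihood. The sketch below shows that standard definition in isolation; it is not the repository's implementation, and how the eval subcommands batch and average the loss is not shown here.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every target token has
# a per-token loss of ln(4), i.e. a perplexity of 4.
uniform_losses = [math.log(4)] * 10
print(perplexity(uniform_losses))
```

Intuitively, a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.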
Downstream tasks (lm-eval-harness)#
We integrate lm-eval-harness for downstream evaluation:
# To evaluate a locally converted HuggingFace checkpoint instead, set model_name to its path,
# e.g. model_name="/workspace/experiments/llm-digital/pretrain/outputs/hf/aixsim-60M"
model_name="AICrossSim/clm-60m"
python run.py eval hf-lm-eval \
${model_name} \
--tasks "['wikitext']" \
--dtype float16
Run python run.py eval hf-lm-eval -h for all available arguments.
Note
Under the hood, hf-lm-eval calls lm-eval-harness’s simple_evaluate.
Key arguments:
--tasks: list of task names (same naming as lm-eval-harness).
--num_fewshot: few-shot count; None uses the task default.
--limit: if > 1, the maximum number of examples; if ≤ 1, the fraction of the dataset to use.
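Since hf-lm-eval wraps simple_evaluate, a comparable evaluation can be sketched directly in Python. The mapping below from the command-line arguments to simple_evaluate keyword arguments is an assumption about how the wrapper behaves, not a copy of its code; the two helper names are hypothetical. lm_eval is imported lazily so the argument-building part can be inspected without the package installed.

```python
def build_simple_evaluate_kwargs(model_name, tasks, dtype="float16",
                                 num_fewshot=None, limit=None):
    """Build keyword arguments for lm_eval.simple_evaluate (assumed mapping)."""
    return {
        "model": "hf",  # lm-eval-harness's HuggingFace model backend
        "model_args": f"pretrained={model_name},dtype={dtype}",
        "tasks": list(tasks),
        "num_fewshot": num_fewshot,  # None keeps each task's default
        "limit": limit,              # None evaluates the full dataset
    }


def run_evaluation(**kwargs):
    """Run the evaluation; requires lm-eval-harness to be installed."""
    import lm_eval

    return lm_eval.simple_evaluate(**kwargs)


kwargs = build_simple_evaluate_kwargs("AICrossSim/clm-60m", ["wikitext"])
print(kwargs)
```

Calling run_evaluation(**kwargs) downloads the model and dataset and returns a dictionary whose "results" entry holds the per-task metrics.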
Simple text generation#
prompt="London is"
python run.py hf-gen \
--model_name AICrossSim/clm-60m \
--prompt "${prompt}" \
--max_new_tokens 100 \
--do_sample true \
--temperature 0.6 \
--top_k 50 \
--top_p 0.9
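The same sampling settings can be used from Python with the HuggingFace transformers API. This is a minimal sketch, assuming the published checkpoint loads through AutoModelForCausalLM and AutoTokenizer; it is not the implementation behind hf-gen. The heavy imports sit inside the function so the settings can be inspected without loading a model.

```python
# Sampling settings mirroring the command-line flags above.
GENERATION_KWARGS = {
    "max_new_tokens": 100,
    "do_sample": True,
    "temperature": 0.6,
    "top_k": 50,
    "top_p": 0.9,
}


def generate(prompt, model_name="AICrossSim/clm-60m"):
    """Generate a continuation of prompt; requires torch and transformers."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **GENERATION_KWARGS)
    # The returned sequence includes the prompt tokens followed by the new ones.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling generate("London is") downloads the checkpoint on first use. Because sampling is enabled, the output differs between calls unless a random seed is fixed.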