LLM Pretraining & Evaluation#
This tutorial covers pretraining AICrossSim-CLM models and evaluating them on
language modeling benchmarks.
Note
If you have not set up the environment yet, follow Installation first.
Overview#
We pretrain AICrossSim-CLM (60M, 200M, 400M, 1.1B) on the FineWeb-Edu dataset.
We follow the Chinchilla scaling law to determine the number of training tokens: num_tokens = 22 × num_params.
The entry point is experiments/llm-digital/pretrain/run.py.
Run python run.py -h to see all subcommands.
Run python run.py <subcommand> -h for subcommand-specific help.
We use torchrun for distributed training.
Pretrained checkpoints are available on HuggingFace: NewComputeBench-CLM-Digital.
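To make the token rule concrete, the sketch below evaluates num_tokens = 22 × num_params for the four model sizes. The parameter counts are nominal values read off the model names, which is an assumption; the exact counts of the released models may differ slightly.

```python
# Nominal parameter counts taken from the model names (an assumption);
# the real models may have slightly different counts.
NOMINAL_PARAMS = {
    "60M": 60_000_000,
    "200M": 200_000_000,
    "400M": 400_000_000,
    "1.1B": 1_100_000_000,
}


def token_budget(num_params: int, scale: int = 22) -> int:
    """Return the training-token budget: num_tokens = scale * num_params."""
    return scale * num_params


for name, params in NOMINAL_PARAMS.items():
    print(f"{name}: {token_budget(params) / 1e9:.2f}B tokens")
```

Under these nominal counts, the 60M model trains on about 1.32B tokens and the 1.1B model on about 24.2B tokens.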
Pretraining#
The workflow is the same for all model sizes: generate a config, then launch training.
We demonstrate with AICrossSim-CLM-60M.
AICrossSim-CLM-60M#
Change to the pretraining directory and activate the environment:
cd experiments/llm-digital/pretrain
uv:
source .venv/bin/activate
conda:
conda activate new-compute
Generate the training config:
Fast development run
Use these flags to reduce memory usage and shorten training for quick tests:
--batch_size: smaller batch size to avoid OOM.
--data_parallel_replicate_degree: number of data-parallel replicas (typically equal to the number of GPUs).
--data_parallel_shard_degree: shard model parameters across GPUs (FSDP). The default, -1, disables sharding.
--token_num_scale: controls training length via num_tokens = scale × num_params. Set to 1 for a short run.
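A rough way to see how these flags change the run length is to estimate the number of optimizer steps. The sketch below assumes each step consumes batch_size × replicas × seq_len tokens. The sequence length of 2048 is a hypothetical value chosen for illustration, and run.py may compute the step count differently.

```python
import math


def estimate_steps(
    num_params: int,
    token_num_scale: int,
    batch_size: int,
    dp_replicas: int,
    seq_len: int = 2048,  # hypothetical sequence length, not taken from the tutorial
) -> int:
    """Back-of-the-envelope step count for a given token budget."""
    num_tokens = token_num_scale * num_params
    # Assumption: every replica processes batch_size sequences of seq_len tokens per step.
    tokens_per_step = batch_size * dp_replicas * seq_len
    return math.ceil(num_tokens / tokens_per_step)


full_run = estimate_steps(60_000_000, token_num_scale=22, batch_size=48, dp_replicas=2)
quick_run = estimate_steps(60_000_000, token_num_scale=1, batch_size=48, dp_replicas=2)
print(full_run, quick_run)
```

Under these assumptions, setting token_num_scale to 1 shortens the run by roughly a factor of 22, which is why it is useful for quick tests.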
data_parallel="2"
batch_size="48"
token_num_scale="22"
python run.py generate-cfg \
--model_flavor 60M \
--batch_size ${batch_size} \
--data_parallel_replicate_degree ${data_parallel} \
--token_num_scale ${token_num_scale} \
--compile true \
--save_path ./configs/tutorial-60M.yaml
This generates configs/tutorial-60M.yaml for pretraining on a FineWeb-Edu subset of 22 × 60M tokens with per-device batch size 48 and 2-GPU data parallelism. The --compile flag enables torch.compile for faster training.
Launch pretraining:
num_gpus="2"
cuda_devices="1,2" # GPU indices to use, e.g. "0,1" for the first two GPUs
CUDA_VISIBLE_DEVICES=${cuda_devices} \
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" STREAM_HF_DATA="1" \
torchrun --nproc_per_node=${num_gpus} --rdzv_backend c10d \
--rdzv_endpoint="localhost:0" --local-ranks-filter 0 \
--role rank --tee 3 \
run.py pretrain --config configs/tutorial-60M.yaml \
--metrics_args.enable_wandb false
STREAM_HF_DATA=1 streams the FineWeb-Edu dataset instead of downloading it.
Checkpoints are saved to ./outputs/checkpoints/aixsim-60M/<timestamp>/.
Disable W&B logging with --metrics_args.enable_wandb false if you have not run wandb login.
Troubleshooting: Fatal Python error: Aborted
After training finishes, torchrun may raise Fatal Python error: Aborted when destroying the process group. This does not affect the training results as long as the error appears after the final checkpoint is saved. Look for log lines similar to:
[rank0]: Finished saving the checkpoint ... in 5.53 seconds.
[rank0]: Training completed
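If the training log was captured to a file, this check can be automated. The sketch below is a hypothetical helper, not part of run.py. It assumes the log contains the exact message fragments quoted above and only compares their positions in the text.

```python
SAVED = "Finished saving the checkpoint"
DONE = "Training completed"
ABORT = "Fatal Python error: Aborted"


def results_are_safe(log_text: str) -> bool:
    """Return True if the run completed and any abort came after completion."""
    saved_at = log_text.rfind(SAVED)   # last checkpoint-saved message
    done_at = log_text.rfind(DONE)     # last completion message
    abort_at = log_text.find(ABORT)    # first abort message, -1 if absent
    # A checkpoint must have been saved, and completion must follow it.
    if saved_at == -1 or done_at < saved_at:
        return False
    # An abort is harmless only if it appears after the completion message.
    return abort_at == -1 or abort_at > done_at
```

For example, results_are_safe(open("train.log").read()) would check a captured log, where train.log is a placeholder file name.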
(Optional) Convert the checkpoint to HuggingFace format:
Why convert?
The training code uses custom distributed model classes. Converting to HuggingFace format lets you use the full HuggingFace ecosystem (generation, evaluation, etc.).
python run.py convert-ckpt aixsim 60M \
./outputs/checkpoints/aixsim-60M/<timestamp>/<step-xxx> \
path/to/huggingface/checkpoint
Tip
Our 60M results — pretrained on 2 × H100 96 GB for 1 hour.
Config: experiments/llm-digital/pretrain/configs/aixsim-60M.yaml
W&B logs: link
HuggingFace checkpoint: AICrossSim/clm-60m
AICrossSim-CLM-200M#
The 200M model uses Fully Sharded Data Parallel (FSDP) to reduce per-GPU memory at the cost of slightly longer training.
batch_size="32"
data_parallel_replicate="1"
data_parallel_shard="2"
python run.py generate-cfg \
--model_flavor 200M \
--batch_size ${batch_size} \
--data_parallel_replicate_degree ${data_parallel_replicate} \
--data_parallel_shard_degree ${data_parallel_shard} \
--compile true \
--save_path ./configs/tutorial-200M.yaml
num_gpus="2"
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" STREAM_HF_DATA="1" \
torchrun --nproc_per_node=${num_gpus} --rdzv_backend c10d \
--rdzv_endpoint="localhost:0" --local-ranks-filter 0 \
--role rank --tee 3 \
run.py pretrain --config configs/tutorial-200M.yaml \
--metrics_args.enable_wandb false
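In the commands in this tutorial, the replicate and shard degrees appear to multiply to the number of GPUs passed to torchrun. The sketch below encodes that relationship as a sanity check. It is inferred from the examples here rather than from the run.py source, and it treats a shard degree of -1 as "no sharding", as described for the default.

```python
def required_gpus(replicate_degree: int, shard_degree: int = -1) -> int:
    """GPU count implied by the data-parallel settings (inferred rule, see lead-in)."""
    # Assumption: -1 means sharding is disabled, i.e. an effective shard degree of 1.
    effective_shard = 1 if shard_degree == -1 else shard_degree
    return replicate_degree * effective_shard


# Settings used in this tutorial: (replicate, shard) -> --nproc_per_node
print(required_gpus(2))      # 60M: replicate only
print(required_gpus(1, 2))   # 200M: FSDP across 2 GPUs
print(required_gpus(1, 8))   # 400M and 1.1B: FSDP across 8 GPUs
```

Checking this product against num_gpus before launching can catch a mismatch between the generated config and the torchrun command.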
Tip
Our 200M results — pretrained on 2 × H100 96 GB for 6.5 hours.
W&B logs: link
HuggingFace checkpoint: AICrossSim/clm-200m
AICrossSim-CLM-400M#
batch_size="12"
data_parallel_replicate="1"
data_parallel_shard="8"
python run.py generate-cfg \
--model_flavor 400M \
--batch_size ${batch_size} \
--data_parallel_replicate_degree ${data_parallel_replicate} \
--data_parallel_shard_degree ${data_parallel_shard} \
--compile true \
--save_path ./configs/tutorial-400M.yaml
num_gpus="8"
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" STREAM_HF_DATA="1" \
torchrun --nproc_per_node=${num_gpus} --rdzv_backend c10d \
--rdzv_endpoint="localhost:0" --local-ranks-filter 0 \
--role rank --tee 3 \
run.py pretrain --config configs/tutorial-400M.yaml \
--metrics_args.enable_wandb false
Tip
Our 400M results — pretrained on 8 × A6000 for 21 hours.
W&B logs: link
HuggingFace checkpoint: AICrossSim/clm-400m
AICrossSim-CLM-1.1B#
batch_size="24"
data_parallel_replicate="1"
data_parallel_shard="8"
python run.py generate-cfg \
--model_flavor 1.1B \
--batch_size ${batch_size} \
--data_parallel_replicate_degree ${data_parallel_replicate} \
--data_parallel_shard_degree ${data_parallel_shard} \
--compile true \
--save_path ./configs/tutorial-1.1B.yaml
num_gpus="8"
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" STREAM_HF_DATA="1" \
torchrun --nproc_per_node=${num_gpus} --rdzv_backend c10d \
--rdzv_endpoint="localhost:0" --local-ranks-filter 0 \
--role rank --tee 3 \
run.py pretrain --config configs/tutorial-1.1B.yaml \
--metrics_args.enable_wandb false
Tip
Our 1.1B results — pretrained on 8 × H100 96 GB for 33 hours.
W&B logs: link
HuggingFace checkpoint: AICrossSim/clm-1.1b
Raw torchrun checkpoints (for resuming): AICrossSim/clm-1.1b-torch-ckpt
Evaluation#
Pretraining dataset perplexity#
Evaluate a checkpoint on the pretraining dataset:
# torchrun checkpoint
python run.py eval pt-ppl \
aixsim 60M \
./outputs/checkpoints/aixsim-60M/<timestamp>/<step-xxx>
# HuggingFace checkpoint
python run.py eval hf-ppl \
AICrossSim/clm-60m
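For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below shows that standard definition. Whether pt-ppl and hf-ppl aggregate losses in exactly this way is an assumption.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity from per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every target token has perplexity 4.
print(perplexity([math.log(4)] * 3))
```

Lower is better: a perplexity of 1 means the model assigned probability 1 to every target token.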
Downstream tasks (lm-eval-harness)#
We integrate lm-eval-harness for downstream evaluation:
model_name="AICrossSim/clm-60m"
python run.py eval hf-lm-eval \
${model_name} \
--tasks "['wikitext']" \
--dtype float16
Run python run.py eval hf-lm-eval -h for all available arguments.
Note
Under the hood hf-lm-eval calls lm-eval-harness’s simple_evaluate.
Key arguments:
--tasks: list of task names (same naming as lm-eval-harness).
--num_fewshot: few-shot count; None uses the task default.
--limit: if > 1, maximum number of examples; if ≤ 1, fraction of the dataset.
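The two meanings of --limit can be sketched as a small function. This follows the rule exactly as stated above and is not lm-eval-harness code. The helper name and the rounding of fractional sizes are assumptions, and the behaviour at exactly 1 is worth checking against the installed lm-eval-harness version.

```python
import math
from typing import Optional


def resolve_limit(limit: Optional[float], dataset_size: int) -> int:
    """Number of examples evaluated for a given --limit (hypothetical helper)."""
    if limit is None:
        return dataset_size  # no limit: use the whole dataset
    if limit > 1:
        return min(int(limit), dataset_size)  # absolute cap on examples
    # limit <= 1 is read as a fraction; rounding up is an assumption.
    return math.ceil(dataset_size * limit)


print(resolve_limit(100, 1000))   # absolute count
print(resolve_limit(0.25, 1000))  # fraction of the dataset
```

A small fractional limit is a convenient way to smoke-test a task before running the full evaluation.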
Simple text generation#
prompt="London is"
python run.py hf-gen \
--model_name AICrossSim/clm-60m \
--prompt "${prompt}" \
--max_new_tokens 100 \
--do_sample true \
--temperature 0.6 \
--top_k 50 \
--top_p 0.9
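The sampling flags interact in a fixed order in most implementations: temperature first, then top-k, then top-p. The sketch below illustrates the usual meaning of each flag on a plain list of logits. It is a simplified illustration, not the code that hf-gen or HuggingFace runs.

```python
import math
import random
from typing import Sequence


def sample_token(
    logits: Sequence[float],
    temperature: float = 0.6,
    top_k: int = 50,
    top_p: float = 0.9,
    rng=random,
) -> int:
    """Sample one token index using temperature, top-k and top-p filtering."""
    # 1. Temperature: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # 2. Top-k: keep the indices of the k largest scaled logits.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]
    # 3. Softmax over the kept logits (shifted by the max for stability).
    peak = scaled[kept[0]]
    weights = [math.exp(scaled[i] - peak) for i in kept]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 4. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus_ids, nucleus_probs, cumulative = [], [], 0.0
    for idx, p in zip(kept, probs):
        nucleus_ids.append(idx)
        nucleus_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break
    # 5. Sample from the surviving tokens; choices() renormalizes the weights.
    return rng.choices(nucleus_ids, weights=nucleus_probs, k=1)[0]


print(sample_token([2.0, 1.0, 0.5, -1.0], rng=random.Random(0)))
```

Lowering --temperature or --top_p, or reducing --top_k, narrows the set of candidate tokens and makes the output more deterministic.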