Skip to content

Evaluation commands

Every PLENA evaluation is a Python module under quant_eval.cli. They share a common pattern: pass --model_name, a TOML via --quant_config, and any eval-specific flags. The reference below is auto-generated from each module's docstrings, so it stays in sync with the code.

Perplexity — eval_ppl

Language-modeling perplexity evaluation with optional MX quantization.

Sets up an HF causal language model, optionally applies a MASE quantization pass driven by a TOML recipe, then computes perplexity on the chosen dataset (WikiText by default).

Example — baseline (fp16):

python -m quant_eval.cli.eval_ppl --model_name unsloth/Llama-3.2-1B

Example — quantized:

python -m quant_eval.cli.eval_ppl \
    --model_name unsloth/Llama-3.2-1B \
    --quant_config quant_eval/configs/llama_mxint4.toml

main(model_name='Qwen/Qwen3-30B-A3B', dataset='wikitext', device_id='cuda:0', dtype='bfloat16', quant_config=None, model_parallel=False, seqlen=2048, log_dir=None)

Evaluate language-modeling perplexity, optionally with MX quantization.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID (e.g. meta-llama/Llama-2-7b-hf).

'Qwen/Qwen3-30B-A3B'
dataset str

Dataset for perplexity scoring. Default "wikitext".

'wikitext'
device_id str

CUDA device string, e.g. "cuda:0".

'cuda:0'
dtype str

Model dtype — one of "float16", "bfloat16", "float32".

'bfloat16'
quant_config Union[str, None]

Path to a TOML quantization recipe. None runs the unquantized baseline.

None
model_parallel bool

Distribute the model across all visible GPUs using HF device_map="auto".

False
seqlen int

Maximum sequence length passed to the perplexity evaluator.

2048
log_dir Union[str, None]

Directory in which to write args.json, quant_config.toml (when quantized), and results.json. None disables logging.

None

Returns:

Type Description

Dict of metric name to scalar. For wikitext: {"ppl": float}.

Source code in quant_eval/cli/eval_ppl.py
def main(
    model_name: str = "Qwen/Qwen3-30B-A3B",
    dataset: str = "wikitext",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    quant_config: Union[str, None] = None,
    model_parallel: bool = False,
    seqlen: int = 2048,
    log_dir: Union[str, None] = None,
):
    """
    Evaluate language-modeling perplexity, optionally with MX quantization.

    Args:
        model_name: HuggingFace model ID (e.g. ``meta-llama/Llama-2-7b-hf``).
        dataset: Dataset for perplexity scoring. Default ``"wikitext"``.
        device_id: CUDA device string, e.g. ``"cuda:0"``.
        dtype: Model dtype — one of ``"float16"``, ``"bfloat16"``, ``"float32"``.
        quant_config: Path to a TOML quantization recipe. ``None`` runs the
            unquantized baseline.
        model_parallel: Distribute the model across all visible GPUs using HF
            ``device_map="auto"``.
        seqlen: Maximum sequence length passed to the perplexity evaluator.
        log_dir: Directory in which to write ``args.json``, ``quant_config.toml``
            (when quantized), and ``results.json``. ``None`` disables logging.

    Returns:
        Dict of metric name to scalar. For wikitext: ``{"ppl": float}``.
    """
    print("=" * 60)
    print("Perplexity Evaluation")
    print("=" * 60)
    print(f"Model: {model_name}")
    print(f"Dataset: {dataset}")

    quantize = quant_config is not None
    if quantize:
        print(f"Quantization config: {quant_config}")
    else:
        print("Quantization: None (baseline)")
    print("=" * 60)

    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        if quant_config:
            import shutil
            shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    dtype_map = {"float16": torch.float16, "bfloat16": torch.bfloat16, "float32": torch.float32}
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    attn_impl = "eager" if quant_config else "sdpa"

    tokenizer, model = setup_model(
        model_name, model_parallel, dtype=torch_dtype,
        device=device_id if not model_parallel else None,
        attn_implementation=attn_impl,
    )
    model.eval()

    if quantize:
        from chop.passes.module.transforms import quantize_module_transform_pass

        pass_args = load_quant_config(quant_config)
        if "gptq" in pass_args:
            pass_args["gptq"]["device"] = device_id
        if "rotation_search" in pass_args:
            pass_args["rotation_search"]["device"] = device_id

        n_linear = sum(1 for _, m in model.named_modules() if isinstance(m, torch.nn.Linear))
        logger.info("Quantizing %d linear layers...", n_linear)
        t0 = time.time()
        model, _ = quantize_module_transform_pass(model, pass_args)
        logger.info("Quantization complete in %.1fs", time.time() - t0)

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    if quantize:
        print_all_layers(model)

    results = evaluate_perplexity(
        model=model,
        tokenizer=tokenizer,
        dataset_name=dataset,
        max_length=seqlen,
        verbose=True,
    )

    print("\n" + "=" * 60)
    print("Results:")
    print("=" * 60)
    for k, v in results.items():
        print(f"  {k}: {v}")

    if log_dir:
        save_results(log_dir, results)

    return results

lm-eval-harness — eval_lm

lm-eval-harness driver with optional MX quantization.

Applies a TOML quantization recipe once before evaluation; activation precision stays fixed for the whole run.

Example:

python -m quant_eval.cli.eval_lm \
    --model_name unsloth/Llama-3.2-1B \
    --quant_config quant_eval/configs/llama_mxint4.toml \
    --tasks arc_easy,hellaswag,winogrande \
    --limit 500

main(model_name='Qwen/Qwen2.5-1.5B', tasks='wikitext', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, seqlen=2048, batch_size=64, limit=None, log_dir=None)

Run lm-eval-harness on an optionally MX-quantized HF model.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID.

'Qwen/Qwen2.5-1.5B'
tasks Union[str, list[str]]

lm-eval task name(s) — comma-separated string or list (e.g. "arc_easy,hellaswag").

'wikitext'
device_id str

CUDA device string.

'cuda:0'
dtype str

Model dtype — "float16", "bfloat16", or "float32".

'bfloat16'
quant_config Union[str, None]

Path to a TOML quantization recipe. None runs the unquantized baseline.

'quant_eval/configs/llama_mxint4.toml'
model_parallel bool

Distribute across GPUs with device_map="auto".

False
seqlen int

Maximum context length passed to lm-eval.

2048
batch_size Union[int, str]

Eval batch size. Pass an int for a fixed size, or the string "auto" for lm-eval's auto-batching.

64
limit Union[int, float, None]

Cap samples per task. Int = absolute count; float in (0, 1) = fraction of the full dataset; None = full.

None
log_dir Union[str, None]

Directory for args.json and results.json. None disables logging.

None

Returns:

Type Description

lm-eval results dict — per-task metrics plus aggregate scores.

Source code in quant_eval/cli/eval_lm.py
def main(
    model_name: str = "Qwen/Qwen2.5-1.5B",
    tasks: Union[str, list[str]] = "wikitext",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    quant_config: Union[str, None] = "quant_eval/configs/llama_mxint4.toml",
    model_parallel: bool = False,
    seqlen: int = 2048,
    batch_size: Union[int, str] = 64,
    limit: Union[int, float, None] = None,
    log_dir: Union[str, None] = None,
):
    """
    Run lm-eval-harness on an optionally MX-quantized HF model.

    Args:
        model_name: HuggingFace model ID.
        tasks: lm-eval task name(s) — comma-separated string or list
            (e.g. ``"arc_easy,hellaswag"``).
        device_id: CUDA device string.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        quant_config: Path to a TOML quantization recipe. ``None`` runs the
            unquantized baseline.
        model_parallel: Distribute across GPUs with ``device_map="auto"``.
        seqlen: Maximum context length passed to lm-eval.
        batch_size: Eval batch size. Pass an int for a fixed size, or the
            string ``"auto"`` for lm-eval's auto-batching.
        limit: Cap samples per task. Int = absolute count; float in
            ``(0, 1)`` = fraction of the full dataset; ``None`` = full.
        log_dir: Directory for ``args.json`` and ``results.json``. ``None``
            disables logging.

    Returns:
        lm-eval results dict — per-task metrics plus aggregate scores.
    """
    print("=" * 64)
    print("lm-eval — fixed activation precision (no phase switch)")
    print("=" * 64)
    print(f"  Model  : {model_name}")
    print(f"  Tasks  : {tasks}")
    print(f"  Weights: {quant_config or 'none (fp)'}")
    print(f"  Seqlen : {seqlen}")
    print("=" * 64)

    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        if quant_config:
            import shutil
            shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    dtype_map = {
        "float16":  torch.float16,
        "bfloat16": torch.bfloat16,
        "float32":  torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    # Quantized attention modules (MXInt / MXFP / *Rotate) replace the
    # eager forward path and assert _attn_implementation == "eager". Force
    # eager whenever a quant_config is supplied, regardless of TOML pattern.
    attn_impl = "eager" if quant_config else "sdpa"

    tokenizer, model = setup_model(
        model_name,
        model_parallel,
        dtype=torch_dtype,
        device=device_id if not model_parallel else None,
        attn_implementation=attn_impl,
    )
    model.eval()

    # ``token_collector`` is a side-effect pass that just attaches a hook;
    # if it's the *only* thing in the TOML this run is "calibration mode"
    # and we'll skip module quantization entirely.
    collector_info = None
    pass_args = load_quant_config(quant_config) if quant_config else None

    if pass_args and "token_collector" in pass_args:
        from chop.passes.module.transforms import attach_token_collector_pass

        tc_cfg = pass_args.pop("token_collector")
        logger.info("Attaching TokenCollector: %s", tc_cfg)
        model.to(device_id)
        model, collector_info = attach_token_collector_pass(model, tc_cfg)

    # Run quant pass only if there are real quant blocks left (selectors or
    # gptq); if pass_args is just {"by": ...} after popping token_collector,
    # we're in pure calibration mode and skip quantization.
    has_quant = pass_args is not None and (
        "gptq" in pass_args
        or any(k != "by" for k in pass_args.keys())
    )
    if has_quant:
        from chop.passes.module.transforms import quantize_module_transform_pass

        if "gptq" in pass_args:
            pass_args["gptq"]["device"] = device_id
        # Plumb device + model_name into rotation_search the same way; the
        # MASE pass needs them but they don't belong in the TOML schema.
        if "rotation_search" in pass_args:
            pass_args["rotation_search"]["device"] = device_id
            pass_args["rotation_search"].setdefault("model_name", model_name)

        n_linear = sum(
            1 for _, m in model.named_modules() if isinstance(m, torch.nn.Linear)
        )
        logger.info("Quantizing %d linear layers...", n_linear)
        t0 = time.time()
        model, _ = quantize_module_transform_pass(model, pass_args)
        logger.info("Quantization complete in %.1fs", time.time() - t0)

        # Surface which classes the dispatch landed on (so you can confirm
        # rotate variants are wired in when the TOML asks for them).
        from collections import Counter
        cls_count = Counter(
            type(m).__name__ for _, m in model.named_modules()
            if "MX" in type(m).__name__
        )
        logger.info(
            "Post-quant module classes:\n%s",
            "\n".join(f"  {c}: {n}" for c, n in cls_count.most_common()),
        )

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    # In calibration-only mode, the TokenCollector hook will raise
    # ``CollectorFull`` from inside model.forward once enough tokens have
    # been buffered — we catch it here so the eval pass exits cleanly with
    # the calibration file already on disk.
    from chop.passes.module.transforms.gptq import CollectorFull
    try:
        results = evaluate_with_lm_eval(
            model=model,
            tokenizer=tokenizer,
            tasks=tasks,
            max_length=seqlen,
            batch_size=batch_size,
            log_samples=False,
            limit=limit,
        )
    except CollectorFull as e:
        logger.info("[calibration mode] aborted lm-eval as planned: %s", e)
        results = {"calibration_only": True}

    if collector_info is not None and not collector_info["collector"].complete:
        # lm-eval finished its limit without filling the buffer — flush whatever
        # we have to disk so downstream GPTQ has *something* to work with.
        collector_info["collector"].finalize()

    print("\n" + "=" * 64)
    print("Results:")
    print("=" * 64)
    if "results" in results:
        for task_name, task_results in results["results"].items():
            print(f"  {task_name}:")
            for metric, value in task_results.items():
                if isinstance(value, (int, float)):
                    print(f"    {metric}: {value:.4f}")
    else:
        for k, v in results.items():
            print(f"  {k}: {v}")

    if log_dir:
        save_results(log_dir, results)

    return results

Code generation — eval_evalplus

HumanEval+/MBPP+ code-generation evaluation with optional MX quantization.

Routes through evalplus to score pass@1 (or pass@k) on the HumanEval+ or MBPP+ benchmarks under a single fixed-precision quantization profile. Use this when you want to check whether a quantization recipe still preserves the reasoning required for code generation.

Requires the evalplus extra:

uv sync --extra evalplus

Example:

python -m quant_eval.cli.eval_evalplus \
    --model_name unsloth/Llama-3.2-1B \
    --quant_config quant_eval/configs/llama_mxint4.toml \
    --dataset humaneval \
    --greedy \
    --evalplus_output_dir logs/evalplus

main(model_name='Qwen/Qwen2.5-1.5B', dataset='humaneval', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, batch_size=1, greedy=False, n_samples=1, max_new_tokens=4096, evalplus_output_dir=None, overwrite=False, base_only=False, parallel=None, version='default', log_dir=None)

Run evalplus (HumanEval+ / MBPP+) on an optionally MX-quantized HF model.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID.

'Qwen/Qwen2.5-1.5B'
dataset str

"humaneval" or "mbpp".

'humaneval'
device_id str

CUDA device string.

'cuda:0'
dtype str

Model dtype — "float16", "bfloat16", or "float32".

'bfloat16'
quant_config Union[str, None]

Path to a TOML quantization recipe. None runs the unquantized baseline.

'quant_eval/configs/llama_mxint4.toml'
model_parallel bool

Distribute across GPUs with device_map="auto".

False
batch_size int

Generation batch size (samples per task per call).

1
greedy bool

Greedy decoding (forces temperature=0 and n_samples=1).

False
n_samples int

Samples per task. Ignored when greedy=True.

1
max_new_tokens int

Maximum tokens generated per sample.

4096
evalplus_output_dir Union[str, None]

Directory where evalplus writes the generated solutions JSONL and per-problem evaluation results.

None
overwrite bool

Regenerate solutions even if a previous JSONL exists.

False
base_only bool

Score against base tests only (skip the +/plus tests).

False
parallel Union[int, None]

Worker count for evalplus's code-execution stage. None runs serially.

None
version str

evalplus dataset version (e.g. "default").

'default'
log_dir Union[str, None]

Directory for args.json and results.json.

None

Returns:

Type Description

evalplus results dict — pass@1 (and pass@k when applicable) plus

per-problem outcomes.

Raises:

Type Description
ValueError

dataset is not "humaneval" or "mbpp".

Source code in quant_eval/cli/eval_evalplus.py
def main(
    model_name: str = "Qwen/Qwen2.5-1.5B",
    dataset: str = "humaneval",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    quant_config: Union[str, None] = "quant_eval/configs/llama_mxint4.toml",
    model_parallel: bool = False,
    batch_size: int = 1,
    greedy: bool = False,
    n_samples: int = 1,
    max_new_tokens: int = 4096,
    evalplus_output_dir: Union[str, None] = None,
    overwrite: bool = False,
    base_only: bool = False,
    parallel: Union[int, None] = None,
    version: str = "default",
    log_dir: Union[str, None] = None,
):
    """
    Run evalplus (HumanEval+ / MBPP+) on an optionally MX-quantized HF model.

    Args:
        model_name: HuggingFace model ID.
        dataset: ``"humaneval"`` or ``"mbpp"``.
        device_id: CUDA device string.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        quant_config: Path to a TOML quantization recipe. ``None`` runs the
            unquantized baseline.
        model_parallel: Distribute across GPUs with ``device_map="auto"``.
        batch_size: Generation batch size (samples per task per call).
        greedy: Greedy decoding (forces ``temperature=0`` and ``n_samples=1``).
        n_samples: Samples per task. Ignored when ``greedy=True``.
        max_new_tokens: Maximum tokens generated per sample.
        evalplus_output_dir: Directory where evalplus writes the generated
            solutions JSONL and per-problem evaluation results.
        overwrite: Regenerate solutions even if a previous JSONL exists.
        base_only: Score against base tests only (skip the +/plus tests).
        parallel: Worker count for evalplus's code-execution stage. ``None``
            runs serially.
        version: evalplus dataset version (e.g. ``"default"``).
        log_dir: Directory for ``args.json`` and ``results.json``.

    Returns:
        evalplus results dict — pass@1 (and pass@k when applicable) plus
        per-problem outcomes.

    Raises:
        ValueError: ``dataset`` is not ``"humaneval"`` or ``"mbpp"``.
    """
    if dataset not in ("humaneval", "mbpp"):
        raise ValueError(f"dataset must be 'humaneval' or 'mbpp', got {dataset!r}")

    print("=" * 64)
    print("evalplus — fixed activation precision (no phase switch)")
    print("=" * 64)
    print(f"  Model  : {model_name}")
    print(f"  Dataset: {dataset}")
    print(f"  Weights: {quant_config or 'none (fp)'}")
    print(f"  Greedy : {greedy}  (n_samples={n_samples}, batch_size={batch_size})")
    print("=" * 64)

    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        if quant_config:
            import shutil
            shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    dtype_map = {
        "float16":  torch.float16,
        "bfloat16": torch.bfloat16,
        "float32":  torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    # Quantized attention modules (MXInt / *Rotate) replace the eager forward
    # path and assert _attn_implementation == "eager". Force eager whenever a
    # quant_config is supplied.
    attn_impl = "eager" if quant_config else "sdpa"

    tokenizer, model = setup_model(
        model_name,
        model_parallel,
        dtype=torch_dtype,
        device=device_id if not model_parallel else None,
        attn_implementation=attn_impl,
    )
    model.eval()

    # ``token_collector`` is a side-effect pass that attaches a hook; if it's
    # the *only* thing in the TOML this run is calibration mode and we skip
    # module quantization entirely.
    collector_info = None
    pass_args = load_quant_config(quant_config) if quant_config else None

    if pass_args and "token_collector" in pass_args:
        from chop.passes.module.transforms import attach_token_collector_pass

        tc_cfg = pass_args.pop("token_collector")
        logger.info("Attaching TokenCollector: %s", tc_cfg)
        model.to(device_id)
        model, collector_info = attach_token_collector_pass(model, tc_cfg)

    has_quant = pass_args is not None and (
        "gptq" in pass_args
        or any(k != "by" for k in pass_args.keys())
    )
    if has_quant:
        from chop.passes.module.transforms import quantize_module_transform_pass

        if "gptq" in pass_args:
            pass_args["gptq"]["device"] = device_id

        n_linear = sum(
            1 for _, m in model.named_modules() if isinstance(m, torch.nn.Linear)
        )
        logger.info("Quantizing %d linear layers...", n_linear)
        t0 = time.time()
        model, _ = quantize_module_transform_pass(model, pass_args)
        logger.info("Quantization complete in %.1fs", time.time() - t0)

        from collections import Counter
        cls_count = Counter(
            type(m).__name__ for _, m in model.named_modules()
            if "MX" in type(m).__name__
        )
        logger.info(
            "Post-quant module classes:\n%s",
            "\n".join(f"  {c}: {n}" for c, n in cls_count.most_common()),
        )

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    # In calibration-only mode, the hook raises CollectorFull from inside
    # forward once the buffer is full — catch it so eval exits cleanly with
    # the calibration file already saved.
    from chop.passes.module.transforms.gptq import CollectorFull
    try:
        results = evaluate_with_evalplus(
            model=model,
            tokenizer=tokenizer,
            dataset=dataset,
            batch_size=batch_size,
            greedy=greedy,
            n_samples=n_samples,
            max_new_tokens=max_new_tokens,
            output_dir=evalplus_output_dir,
            parallel=parallel,
            base_only=base_only,
            version=version,
            overwrite=overwrite,
        )
    except CollectorFull as e:
        logger.info("[calibration mode] aborted evalplus as planned: %s", e)
        results = {"calibration_only": True}

    if collector_info is not None and not collector_info["collector"].complete:
        collector_info["collector"].finalize()

    print("\n" + "=" * 64)
    print("Results:")
    print("=" * 64)
    if isinstance(results, dict):
        # evalplus stores per-task pass@k under "pass_at_k"; surface anything
        # numeric so the log is useful even if the schema shifts.
        if "pass_at_k" in results:
            for split, metrics in results["pass_at_k"].items():
                print(f"  {split}:")
                for k, v in metrics.items():
                    print(f"    {k}: {v}")
        else:
            for k, v in results.items():
                if isinstance(v, (int, float, str)):
                    print(f"  {k}: {v}")

    if log_dir:
        save_results(log_dir, results)

    return results

Phase-dependent precision — eval_phase_lm

lm-eval-harness with phase- and layer-type-dependent MX quantization.

Activation precision switches based on (phase, layer_type):

  • phase: prefill (seq len > 1) vs decode (seq len == 1), detected from input shape at runtime.
  • layer_type: attention vs FFN, detected from module names.

The four resulting widths (prefill-attn, prefill-ffn, decode-attn, decode-ffn) are set independently. Weight quantization comes from the TOML recipe; activation widths come from CLI flags. lm-eval itself is unmodified.

Example — disaggregated W4 prefill / W8 decode:

python -m quant_eval.cli.eval_phase_lm \
    --model_name Qwen/Qwen2.5-1.5B \
    --quant_config quant_eval/configs/llama_mxint4.toml \
    --prefill_attn_width 4 --prefill_ffn_width 4 \
    --decode_attn_width  8 --decode_ffn_width  8 \
    --tasks gsm8k --limit 200

main(model_name='Qwen/Qwen2.5-1.5B', tasks='wikitext', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, seqlen=2048, batch_size=64, prefill_attn_width=4, prefill_ffn_width=4, prefill_attn_block_size=32, prefill_ffn_block_size=32, decode_attn_width=8, decode_ffn_width=8, decode_attn_block_size=32, decode_ffn_block_size=32, attn_keywords=None, ffn_keywords=None, limit=None, log_dir=None)

Run lm-eval with phase- and layer-type-dependent activation precision.

Phase is detected from input sequence length at runtime; layer type is detected from module names (override via attn_keywords / ffn_keywords for non-standard architectures). The weight quantization recipe from quant_config is applied once; the activation widths and block sizes specified here override the recipe's activation sections per (phase, layer_type) pair.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID.

'Qwen/Qwen2.5-1.5B'
tasks Union[str, list[str]]

lm-eval task name(s) — comma-separated string or list.

'wikitext'
device_id str

CUDA device string.

'cuda:0'
dtype str

Model dtype — "float16", "bfloat16", or "float32".

'bfloat16'
quant_config str

Path to a TOML quantization recipe.

'quant_eval/configs/llama_mxint4.toml'
model_parallel bool

Distribute across GPUs with device_map="auto".

False
seqlen int

Maximum context length passed to lm-eval.

2048
batch_size Union[int, str]

lm-eval batch size — int or "auto".

64
prefill_attn_width int

Activation bit-width for attention layers during prefill.

4
prefill_ffn_width int

Activation bit-width for FFN layers during prefill.

4
prefill_attn_block_size int

MX block size for attention during prefill.

32
prefill_ffn_block_size int

MX block size for FFN during prefill.

32
decode_attn_width int

Activation bit-width for attention layers during decode.

8
decode_ffn_width int

Activation bit-width for FFN layers during decode.

8
decode_attn_block_size int

MX block size for attention during decode.

32
decode_ffn_block_size int

MX block size for FFN during decode.

32
attn_keywords Union[list[str], None]

Module-name substrings that identify attention blocks. None uses the built-in defaults.

None
ffn_keywords Union[list[str], None]

Module-name substrings that identify FFN blocks. None uses the built-in defaults.

None
limit Union[int, float, None]

Cap samples per task. Int = absolute count; float in (0, 1) = fraction; None = full dataset.

None
log_dir Union[str, None]

Directory for args.json and results.json.

None

Returns:

Type Description

lm-eval results dict — per-task metrics plus aggregate scores.

Source code in quant_eval/cli/eval_phase_lm.py
def main(
    model_name: str = "Qwen/Qwen2.5-1.5B",
    tasks: Union[str, list[str]] = "wikitext",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    quant_config: str = "quant_eval/configs/llama_mxint4.toml",
    model_parallel: bool = False,
    seqlen: int = 2048,
    batch_size: Union[int, str] = 64,
    # ── Activation precision: prefill ──────────────────────────────
    prefill_attn_width:      int = 4,
    prefill_ffn_width:       int = 4,
    prefill_attn_block_size: int = 32,
    prefill_ffn_block_size:  int = 32,
    # ── Activation precision: decode ───────────────────────────────
    decode_attn_width:       int = 8,
    decode_ffn_width:        int = 8,
    decode_attn_block_size:  int = 32,
    decode_ffn_block_size:   int = 32,
    # ── Optional keyword overrides (for non-standard architectures) ─
    attn_keywords: Union[list[str], None] = None,
    ffn_keywords:  Union[list[str], None] = None,
    limit: Union[int, float, None] = None,
    log_dir: Union[str, None] = None,
):
    """
    Run lm-eval with phase- and layer-type-dependent activation precision.

    Phase is detected from input sequence length at runtime; layer type is
    detected from module names (override via ``attn_keywords`` / ``ffn_keywords``
    for non-standard architectures). The weight quantization recipe from
    ``quant_config`` is applied once; the activation widths and block sizes
    specified here override the recipe's activation sections per
    (phase, layer_type) pair.

    Args:
        model_name: HuggingFace model ID.
        tasks: lm-eval task name(s) — comma-separated string or list.
        device_id: CUDA device string.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        quant_config: Path to a TOML quantization recipe.
        model_parallel: Distribute across GPUs with ``device_map="auto"``.
        seqlen: Maximum context length passed to lm-eval.
        batch_size: lm-eval batch size — int or ``"auto"``.
        prefill_attn_width: Activation bit-width for attention layers during
            prefill.
        prefill_ffn_width: Activation bit-width for FFN layers during prefill.
        prefill_attn_block_size: MX block size for attention during prefill.
        prefill_ffn_block_size: MX block size for FFN during prefill.
        decode_attn_width: Activation bit-width for attention layers during
            decode.
        decode_ffn_width: Activation bit-width for FFN layers during decode.
        decode_attn_block_size: MX block size for attention during decode.
        decode_ffn_block_size: MX block size for FFN during decode.
        attn_keywords: Module-name substrings that identify attention blocks.
            ``None`` uses the built-in defaults.
        ffn_keywords: Module-name substrings that identify FFN blocks. ``None``
            uses the built-in defaults.
        limit: Cap samples per task. Int = absolute count; float in
            ``(0, 1)`` = fraction; ``None`` = full dataset.
        log_dir: Directory for ``args.json`` and ``results.json``.

    Returns:
        lm-eval results dict — per-task metrics plus aggregate scores.
    """
    # ------------------------------------------------------------------
    # Build the nested phase × layer config
    # ------------------------------------------------------------------
    phase_configs = {
        "prefill": {
            "attn": {
                "data_in_width":      prefill_attn_width,
                "data_in_block_size": prefill_attn_block_size,
            },
            "ffn": {
                "data_in_width":      prefill_ffn_width,
                "data_in_block_size": prefill_ffn_block_size,
            },
        },
        "decode": {
            "attn": {
                "data_in_width":      decode_attn_width,
                "data_in_block_size": decode_attn_block_size,
            },
            "ffn": {
                "data_in_width":      decode_ffn_width,
                "data_in_block_size": decode_ffn_block_size,
            },
        },
    }

    # ------------------------------------------------------------------
    # Print header
    # ------------------------------------------------------------------
    _pa = f"MXInt{prefill_attn_width}(bs={prefill_attn_block_size})"
    _pf = f"MXInt{prefill_ffn_width}(bs={prefill_ffn_block_size})"
    _da = f"MXInt{decode_attn_width}(bs={decode_attn_block_size})"
    _df = f"MXInt{decode_ffn_width}(bs={decode_ffn_block_size})"

    print("=" * 64)
    print("lm-eval — Phase × Layer-Type Disaggregated Quantization")
    print("=" * 64)
    print(f"  Model  : {model_name}")
    print(f"  Tasks  : {tasks}")
    print(f"  Weights: {quant_config}")
    print()
    print(f"  {'':10s}  {'attn':>24s}  {'ffn':>24s}")
    print(f"  {'prefill':10s}  {_pa:>24s}  {_pf:>24s}")
    print(f"  {'decode':10s}  {_da:>24s}  {_df:>24s}")
    print("=" * 64)

    # ------------------------------------------------------------------
    # Logging / experiment directory
    # ------------------------------------------------------------------
    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        import shutil
        shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    # ------------------------------------------------------------------
    # Model setup
    # ------------------------------------------------------------------
    dtype_map = {
        "float16":  torch.float16,
        "bfloat16": torch.bfloat16,
        "float32":  torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    tokenizer, model = setup_model(
        model_name,
        model_parallel,
        dtype=torch_dtype,
        device=device_id if not model_parallel else None,
    )
    model.eval()

    # ------------------------------------------------------------------
    # Apply weight quantization (activation configs are set by the hook)
    # ------------------------------------------------------------------
    from chop.passes.module.transforms import quantize_module_transform_pass

    pass_args = load_quant_config(quant_config)
    if "gptq" in pass_args:
        pass_args["gptq"]["device"] = device_id

    n_linear = sum(1 for _, m in model.named_modules() if isinstance(m, torch.nn.Linear))
    logger.info("Quantizing %d linear layers...", n_linear)
    t0 = time.time()
    model, _ = quantize_module_transform_pass(model, pass_args)
    logger.info("Quantization complete in %.1fs", time.time() - t0)

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    # ------------------------------------------------------------------
    # Enable disaggregated quantization hook
    # ------------------------------------------------------------------
    switch_kwargs = {}
    if attn_keywords:
        switch_kwargs["attn_keywords"] = tuple(attn_keywords)
    if ffn_keywords:
        switch_kwargs["ffn_keywords"] = tuple(ffn_keywords)

    switch = PhaseLayerAutoSwitch(model, phase_configs, **switch_kwargs)
    switch.enable()
    logger.info("\n%s", switch.summary())

    # ------------------------------------------------------------------
    # Run lm-eval (hook fires transparently on every forward pass)
    # ------------------------------------------------------------------
    results = evaluate_with_lm_eval(
        model=model,
        tokenizer=tokenizer,
        tasks=tasks,
        max_length=seqlen,
        batch_size=128,
        log_samples=False,
        limit=limit,
    )

    switch.disable()

    # ------------------------------------------------------------------
    # Print results
    # ------------------------------------------------------------------
    print("\n" + "=" * 64)
    print("Results:")
    print("=" * 64)
    print(f"\n  {'':10s}  {'attn':>24s}  {'ffn':>24s}")
    print(f"  {'prefill':10s}  {_pa:>24s}  {_pf:>24s}")
    print(f"  {'decode':10s}  {_da:>24s}  {_df:>24s}")
    print()

    if "results" in results:
        for task_name, task_results in results["results"].items():
            print(f"  {task_name}:")
            for metric, value in task_results.items():
                if isinstance(value, (int, float)):
                    print(f"    {metric}: {value:.4f}")
    else:
        for k, v in results.items():
            print(f"  {k}: {v}")

    if log_dir:
        results["phase_layer_configs"] = phase_configs
        save_results(log_dir, results)

    return results

BFCL with phase-dependent precision — eval_phase_bfcl

BFCL web-search evaluation with phase- and layer-type-dependent MX quantization.

Serves an MX-quantized model through a lightweight OpenAI-compatible HTTP server (backed by HuggingFace generate), then drives the standard BFCL CLI against it. Activation precision is set independently for each (phase, layer_type) pair via the prefill/decode × attn/FFN flags.

BFCL is a two-step flow:

  1. bfcl generate calls the local server to produce model responses.
  2. bfcl evaluate scores those responses (no model needed).

This script orchestrates both steps automatically and exposes the local server on server_host:server_port.

Requires the bfcl extra and the bfcl-eval package:

uv sync --extra bfcl
pip install bfcl-eval

Example:

python -m quant_eval.cli.eval_phase_bfcl \
    --model_name Qwen/Qwen2.5-1.5B \
    --quant_config quant_eval/configs/llama_mxint4.toml \
    --prefill_attn_width 4 --prefill_ffn_width 4 \
    --decode_attn_width  8 --decode_ffn_width  8 \
    --bfcl_test_categories web_search_base \
    --limit 50

main(model_name='Qwen/Qwen3-8B-FC', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, bfcl_test_categories=None, bfcl_num_threads=1, server_host=DEFAULT_HOST, server_port=DEFAULT_PORT, prefill_attn_width=4, prefill_ffn_width=4, prefill_attn_block_size=32, prefill_ffn_block_size=32, decode_attn_width=8, decode_ffn_width=8, decode_attn_block_size=32, decode_ffn_block_size=32, attn_keywords=None, ffn_keywords=None, limit=None, log_dir=None)

Run BFCL web-search evaluation with phase- and layer-type-dependent activation precision.

Spawns a local OpenAI-compatible HTTP server backed by HF generate so that the unmodified bfcl generate CLI can drive inference, then runs bfcl evaluate to score responses.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID. For function-calling, must be an instruction-tuned model with a function-call template (e.g. Qwen/Qwen3-8B-FC).

'Qwen/Qwen3-8B-FC'
device_id str

CUDA device string.

'cuda:0'
dtype str

Model dtype — "float16", "bfloat16", or "float32".

'bfloat16'
quant_config str

Path to a TOML quantization recipe.

'quant_eval/configs/llama_mxint4.toml'
model_parallel bool

Distribute across GPUs with device_map="auto".

False
bfcl_test_categories Union[list[str], None]

BFCL category names to evaluate (e.g. ["web_search_base", "web_search_no_snippet"]). None uses the default web-search category set.

None
bfcl_num_threads int

Parallel inference threads for bfcl generate.

1
server_host str

Host for the local OpenAI-compatible server.

DEFAULT_HOST
server_port int

Port for the local OpenAI-compatible server.

DEFAULT_PORT
prefill_attn_width int

Activation bit-width for attention during prefill.

4
prefill_ffn_width int

Activation bit-width for FFN during prefill.

4
prefill_attn_block_size int

MX block size for attention during prefill.

32
prefill_ffn_block_size int

MX block size for FFN during prefill.

32
decode_attn_width int

Activation bit-width for attention during decode.

8
decode_ffn_width int

Activation bit-width for FFN during decode.

8
decode_attn_block_size int

MX block size for attention during decode.

32
decode_ffn_block_size int

MX block size for FFN during decode.

32
attn_keywords Union[list[str], None]

Module-name substrings that identify attention blocks. None uses the built-in defaults.

None
ffn_keywords Union[list[str], None]

Module-name substrings that identify FFN blocks. None uses the built-in defaults.

None
limit Union[int, None]

Cap the number of samples per category. None = full dataset.

None
log_dir Union[str, None]

Directory for args.json and results.json.

None

Returns:

Type Description

BFCL evaluation summary — per-category scores plus aggregate metrics,

with phase_layer_configs recording the resolved (phase, layer)

precision table.

Source code in quant_eval/cli/eval_phase_bfcl.py
def main(
    model_name:  str = "Qwen/Qwen3-8B-FC",
    device_id:   str = "cuda:0",
    dtype:       str = "bfloat16",
    quant_config: str = "quant_eval/configs/llama_mxint4.toml",
    model_parallel: bool = False,
    # ── BFCL settings ──────────────────────────────────────────────────────
    bfcl_test_categories: Union[list[str], None] = None,
    bfcl_num_threads:     int   = 1,
    server_host:          str   = DEFAULT_HOST,
    server_port:          int   = DEFAULT_PORT,
    # ── Activation precision: prefill ──────────────────────────────────────
    prefill_attn_width:      int = 4,
    prefill_ffn_width:       int = 4,
    prefill_attn_block_size: int = 32,
    prefill_ffn_block_size:  int = 32,
    # ── Activation precision: decode ───────────────────────────────────────
    decode_attn_width:       int = 8,
    decode_ffn_width:        int = 8,
    decode_attn_block_size:  int = 32,
    decode_ffn_block_size:   int = 32,
    # ── Optional keyword overrides ─────────────────────────────────────────
    attn_keywords: Union[list[str], None] = None,
    ffn_keywords:  Union[list[str], None] = None,
    limit: Union[int, None] = None,
    log_dir: Union[str, None] = None,
):
    """
    Run BFCL web-search evaluation with phase- and layer-type-dependent
    activation precision.

    Spawns a local OpenAI-compatible HTTP server backed by HF ``generate``
    so that the unmodified ``bfcl generate`` CLI can drive inference, then
    runs ``bfcl evaluate`` to score responses.

    Args:
        model_name: HuggingFace model ID. For function-calling, must be an
            instruction-tuned model with a function-call template
            (e.g. ``Qwen/Qwen3-8B-FC``).
        device_id: CUDA device string.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        quant_config: Path to a TOML quantization recipe.
        model_parallel: Distribute across GPUs with ``device_map="auto"``.
        bfcl_test_categories: BFCL category names to evaluate (e.g.
            ``["web_search_base", "web_search_no_snippet"]``). ``None`` uses
            the default web-search category set.
        bfcl_num_threads: Parallel inference threads for ``bfcl generate``.
        server_host: Host for the local OpenAI-compatible server.
        server_port: Port for the local OpenAI-compatible server.
        prefill_attn_width: Activation bit-width for attention during prefill.
        prefill_ffn_width: Activation bit-width for FFN during prefill.
        prefill_attn_block_size: MX block size for attention during prefill.
        prefill_ffn_block_size: MX block size for FFN during prefill.
        decode_attn_width: Activation bit-width for attention during decode.
        decode_ffn_width: Activation bit-width for FFN during decode.
        decode_attn_block_size: MX block size for attention during decode.
        decode_ffn_block_size: MX block size for FFN during decode.
        attn_keywords: Module-name substrings that identify attention blocks.
            ``None`` uses the built-in defaults.
        ffn_keywords: Module-name substrings that identify FFN blocks. ``None``
            uses the built-in defaults.
        limit: Cap the number of samples per category. ``None`` = full dataset.
        log_dir: Directory for ``args.json`` and ``results.json``.

    Returns:
        BFCL evaluation summary — per-category scores plus aggregate metrics,
        with ``phase_layer_configs`` recording the resolved (phase, layer)
        precision table.
    """
    if bfcl_test_categories is None:
        bfcl_test_categories = list(BFCL_WEB_SEARCH_CATEGORIES)

    # ------------------------------------------------------------------
    # Build the nested phase × layer config
    # ------------------------------------------------------------------
    phase_configs = {
        "prefill": {
            "attn": {
                "data_in_width":      prefill_attn_width,
                "data_in_block_size": prefill_attn_block_size,
            },
            "ffn": {
                "data_in_width":      prefill_ffn_width,
                "data_in_block_size": prefill_ffn_block_size,
            },
        },
        "decode": {
            "attn": {
                "data_in_width":      decode_attn_width,
                "data_in_block_size": decode_attn_block_size,
            },
            "ffn": {
                "data_in_width":      decode_ffn_width,
                "data_in_block_size": decode_ffn_block_size,
            },
        },
    }

    # ------------------------------------------------------------------
    # Print header
    # ------------------------------------------------------------------
    _pa = f"MXInt{prefill_attn_width}(bs={prefill_attn_block_size})"
    _pf = f"MXInt{prefill_ffn_width}(bs={prefill_ffn_block_size})"
    _da = f"MXInt{decode_attn_width}(bs={decode_attn_block_size})"
    _df = f"MXInt{decode_ffn_width}(bs={decode_ffn_block_size})"

    print("=" * 64)
    print("BFCL Web Search — Phase × Layer-Type Disaggregated Quantization")
    print("=" * 64)
    print(f"  Model      : {model_name}")
    print(f"  Categories : {bfcl_test_categories}")
    print(f"  Weights    : {quant_config}")
    print(f"  Server     : http://{server_host}:{server_port}")
    print()
    print(f"  {'':10s}  {'attn':>24s}  {'ffn':>24s}")
    print(f"  {'prefill':10s}  {_pa:>24s}  {_pf:>24s}")
    print(f"  {'decode':10s}  {_da:>24s}  {_df:>24s}")
    print("=" * 64)
    logger.info("Model Parallel", model_parallel)

    # ------------------------------------------------------------------
    # Resolve output directories (persistent if log_dir given)
    # ------------------------------------------------------------------
    _tmpdir_ctx = tempfile.TemporaryDirectory()
    _tmpdir     = Path(_tmpdir_ctx.name)

    result_dir = _tmpdir / "bfcl_results"
    score_dir  = _tmpdir / "bfcl_scores"
    result_dir.mkdir(parents=True)
    score_dir.mkdir(parents=True)

    if log_dir:
        log_dir    = create_experiment_log_dir(log_dir)
        result_dir = log_dir / "bfcl_results"
        score_dir  = log_dir / "bfcl_scores"
        result_dir.mkdir(parents=True)
        score_dir.mkdir(parents=True)
        save_args(log_dir, locals().copy())
        import shutil
        shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    # ------------------------------------------------------------------
    # Model setup
    # ------------------------------------------------------------------
    dtype_map = {
        "float16":  torch.float16,
        "bfloat16": torch.bfloat16,
        "float32":  torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    tokenizer, model = setup_model(
        model_name,
        model_parallel,
        dtype=torch_dtype,
        device=device_id if not model_parallel else None,
    )
    model.eval()

    # ------------------------------------------------------------------
    # Weight quantization
    # ------------------------------------------------------------------
    from chop.passes.module.transforms import quantize_module_transform_pass

    pass_args = load_quant_config(quant_config)
    if "gptq" in pass_args:
        pass_args["gptq"]["device"] = device_id

    n_linear = sum(
        1 for _, m in model.named_modules()
        if isinstance(m, torch.nn.Linear)
    )
    logger.info("Quantizing %d linear layers...", n_linear)
    t0 = time.time()
    model, _ = quantize_module_transform_pass(model, pass_args)
    logger.info("Quantization complete in %.1fs", time.time() - t0)

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    # ------------------------------------------------------------------
    # Enable disaggregated quantization hook
    # ------------------------------------------------------------------
    switch_kwargs = {}
    if attn_keywords:
        switch_kwargs["attn_keywords"] = tuple(attn_keywords)
    if ffn_keywords:
        switch_kwargs["ffn_keywords"] = tuple(ffn_keywords)

    switch = PhaseLayerAutoSwitch(model, phase_configs, **switch_kwargs)
    switch.enable()
    logger.info("\n%s", switch.summary())

    # ------------------------------------------------------------------
    # Start the OpenAI-compatible server (hook fires on every request)
    # ------------------------------------------------------------------
    device_str = device_id if not model_parallel else "cuda"
    app = _build_server_app(model, tokenizer, device_str)
    _start_server(app, server_host, server_port)


    # ------------------------------------------------------------------
    # Step 1: bfcl generate  (calls the local server)
    # ------------------------------------------------------------------
    print("\n[1/2] Generating BFCL responses via local server...")
    gen_rc = _run_bfcl_generate(
        model_name      = model_name,
        test_categories = "web_search_base",
        host            = server_host,
        port            = server_port,
        result_dir      = result_dir,
        num_threads     = bfcl_num_threads,
        limit           = limit,
    )
    if gen_rc != 0:
        logger.error("bfcl generate exited with code %d", gen_rc)

#     # ------------------------------------------------------------------
#     # Step 2: bfcl evaluate  (pure scoring, no model needed)
#     # ------------------------------------------------------------------
    print("[2/2] Evaluating BFCL responses...")
    eval_rc, scores = _run_bfcl_evaluate(
        model_name      = model_name,
        test_categories = bfcl_test_categories,
        result_dir      = result_dir,
        score_dir       = score_dir,
    )

    switch.disable()

    # ------------------------------------------------------------------
    # Print results
    # ------------------------------------------------------------------
    print("\n" + "=" * 64)
    print("Results:")
    print("=" * 64)
    print(f"\n  {'':10s}  {'attn':>24s}  {'ffn':>24s}")
    print(f"  {'prefill':10s}  {_pa:>24s}  {_pf:>24s}")
    print(f"  {'decode':10s}  {_da:>24s}  {_df:>24s}")
    print()

    per_cat = scores.pop("per_category", {})
    for cat, cat_scores in per_cat.items():
        print(f"  {cat}:")
        if isinstance(cat_scores, dict):
            for metric, value in cat_scores.items():
                if isinstance(value, (int, float)):
                    print(f"    {metric}: {value:.4f}")
                else:
                    print(f"    {metric}: {value}")
        else:
            print(f"    {cat_scores}")

    if scores:
        print("\n  Overall (from data_overall.csv):")
        for k, v in scores.items():
            print(f"    {k}: {v}")

    # Restore per_category before saving.
    scores["per_category"] = per_cat
    scores["phase_layer_configs"] = phase_configs

    if log_dir:
        save_results(log_dir, scores)

    _tmpdir_ctx.cleanup()
    return scores

Diffusion LLMs — eval_dllm

Fast-dLLM v2 (block-diffusion language model) evaluation with optional MX quantization.

Evaluates diffusion-based language models via lm-eval-harness, using block-diffusion sampling instead of standard autoregressive decoding. Quantization is applied via the same TOML-config interface as the rest of the toolkit.

Example — baseline:

python -m quant_eval.cli.eval_dllm \
    --model_name Efficient-Large-Model/Fast_dLLM_v2_1.5B \
    --tasks gsm8k

Example — quantized:

python -m quant_eval.cli.eval_dllm \
    --model_name Efficient-Large-Model/Fast_dLLM_v2_1.5B \
    --quant_config quant_eval/configs/llama_mxint4.toml \
    --tasks gsm8k

main(model_name='Efficient-Large-Model/Fast_dLLM_v2_1.5B', tasks='gsm8k', device_id='cuda:0', dtype='bfloat16', quant_config=None, model_parallel=False, batch_size=32, max_new_tokens=2048, num_fewshot=0, mask_id=151665, bd_size=32, small_block_size=8, threshold=1.0, show_speed=True, log_dir=None)

Evaluate a Fast-dLLM v2 model with optional MX quantization.

Decoding is block-diffusion: bd_size tokens are generated per outer block, then refined through small_block_size sub-blocks of iterative unmasking.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID (must be a Fast-dLLM v2 checkpoint).

'Efficient-Large-Model/Fast_dLLM_v2_1.5B'
tasks Union[str, list[str]]

lm-eval task name(s) — comma-separated string or list (e.g. "gsm8k,minerva_math").

'gsm8k'
device_id str

CUDA device string.

'cuda:0'
dtype str

Model dtype — "float16", "bfloat16", or "float32".

'bfloat16'
quant_config Union[str, None]

Path to a TOML quantization recipe. None runs the unquantized baseline.

None
model_parallel bool

Distribute across GPUs with device_map="auto".

False
batch_size int

lm-eval batch size.

32
max_new_tokens int

Maximum tokens generated per sample.

2048
num_fewshot int

Few-shot examples prepended to each task prompt.

0
mask_id int

Token ID used as the diffusion mask. Default 151665 (matches Qwen-based Fast-dLLM checkpoints).

151665
bd_size int

Outer block-diffusion block size — tokens generated per outer sampling step.

32
small_block_size int

Inner block size for iterative unmasking within each outer block.

8
threshold float

Confidence threshold for committing unmasked tokens.

1.0
show_speed bool

Log throughput metrics (tokens/second).

True
log_dir Union[str, None]

Directory for args.json and results.json.

None

Returns:

Type Description

lm-eval results dict — per-task metrics plus aggregate scores.

Source code in quant_eval/cli/eval_dllm.py
def main(
    model_name: str = "Efficient-Large-Model/Fast_dLLM_v2_1.5B",
    tasks: Union[str, list[str]] = "gsm8k",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    quant_config: Union[str, None] = None,
    model_parallel: bool = False,
    # dLLM specific
    batch_size: int = 32,
    max_new_tokens: int = 2048,
    num_fewshot: int = 0,
    mask_id: int = 151665,
    bd_size: int = 32,
    small_block_size: int = 8,
    threshold: float = 1.0,
    show_speed: bool = True,
    log_dir: Union[str, None] = None,
):
    """
    Evaluate a Fast-dLLM v2 model with optional MX quantization.

    Decoding is block-diffusion: ``bd_size`` tokens are generated per outer
    block, then refined through ``small_block_size`` sub-blocks of iterative
    unmasking.

    Args:
        model_name: HuggingFace model ID (must be a Fast-dLLM v2 checkpoint).
        tasks: lm-eval task name(s) — comma-separated string or list
            (e.g. ``"gsm8k,minerva_math"``).
        device_id: CUDA device string.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        quant_config: Path to a TOML quantization recipe. ``None`` runs the
            unquantized baseline.
        model_parallel: Distribute across GPUs with ``device_map="auto"``.
        batch_size: lm-eval batch size.
        max_new_tokens: Maximum tokens generated per sample.
        num_fewshot: Few-shot examples prepended to each task prompt.
        mask_id: Token ID used as the diffusion mask. Default ``151665``
            (matches Qwen-based Fast-dLLM checkpoints).
        bd_size: Outer block-diffusion block size — tokens generated per
            outer sampling step.
        small_block_size: Inner block size for iterative unmasking within
            each outer block.
        threshold: Confidence threshold for committing unmasked tokens.
        show_speed: Log throughput metrics (tokens/second).
        log_dir: Directory for ``args.json`` and ``results.json``.

    Returns:
        lm-eval results dict — per-task metrics plus aggregate scores.
    """
    print("=" * 60)
    print("Fast-dLLM Evaluation")
    print("=" * 60)
    print(f"Model: {model_name}")
    print(f"Tasks: {tasks}")
    print(f"Block size: {bd_size}, Sub-block: {small_block_size}, Threshold: {threshold}")

    quantize = quant_config is not None
    if quantize:
        print(f"Quantization config: {quant_config}")
    else:
        print("Quantization: None (baseline)")
    print("=" * 60)

    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        if quant_config:
            import shutil
            shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    dtype_map = {
        "float16": torch.float16,
        "bfloat16": torch.bfloat16,
        "float32": torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    tokenizer, model = setup_model(
        model_name,
        model_parallel,
        dtype=torch_dtype,
        device=device_id if not model_parallel else None,
    )
    model.eval()

    if quantize:
        from chop.passes.module.transforms import quantize_module_transform_pass

        pass_args = load_quant_config(quant_config)
        if "gptq" in pass_args:
            pass_args["gptq"]["device"] = device_id

        n_linear = sum(
            1 for _, m in model.named_modules() if isinstance(m, torch.nn.Linear)
        )
        logger.info("Quantizing %d linear layers...", n_linear)
        t0 = time.time()
        model, _ = quantize_module_transform_pass(model, pass_args)
        logger.info("Quantization complete in %.1fs", time.time() - t0)

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    if quantize:
        print_all_layers(model)

    # Attach block diffusion sampling method
    setup_dllm_generation(model)

    device = torch.device(device_id)
    results = evaluate_dllm(
        model=model,
        tokenizer=tokenizer,
        tasks=tasks,
        device=device,
        model_name=model_name,
        batch_size=batch_size,
        max_new_tokens=max_new_tokens,
        num_fewshot=num_fewshot,
        mask_id=mask_id,
        bd_size=bd_size,
        small_block_size=small_block_size,
        threshold=threshold,
        show_speed=show_speed,
    )

    print("\n" + "=" * 60)
    print("Results:")
    print("=" * 60)
    for task_name, task_results in results.get("results", {}).items():
        print(f"\n{task_name}:")
        for metric, value in task_results.items():
            if isinstance(value, (int, float)):
                print(f"  {metric}: {value:.4f}")

    if log_dir:
        save_results(log_dir, results)

    return results

LLaDA diffusion — eval_llada

LLaDA (diffusion-style language model) evaluation with optional MX quantization.

Wraps lm-eval-harness's CLI. Use --model llada_dist and pass model and quantization options through lm-eval's --model_args flag.

Example — baseline (prefix cache):

python -m quant_eval.cli.eval_llada \
    --tasks gsm8k --num_fewshot 0 \
    --model llada_dist \
    --model_args model_path='GSAI-ML/LLaDA-8B-Instruct',gen_length=256,steps=256,block_length=32,use_cache=True

Example — with MXINT4 KV-cache quantization:

python -m quant_eval.cli.eval_llada \
    --tasks gsm8k --num_fewshot 0 \
    --model llada_dist \
    --model_args model_path='GSAI-ML/LLaDA-8B-Instruct',gen_length=256,steps=256,block_length=32,use_cache=True,quant_config='quant_eval/configs/llama_mxint4.toml'

Agentic — eval_osworld

OSWorld agentic evaluation with optional MX quantization.

Runs the OSWorld desktop-task benchmark in text-only (a11y_tree) mode, so quantized language models can serve as OSWorld agents without vision capabilities. The agent observes the desktop via the accessibility tree and generates pyautogui code to act on it; the VM is rolled back between tasks.

Prerequisites:

  • The OSWorld repository cloned at osworld_path.
  • A configured VM provider (Docker recommended) with the corresponding image.
  • An instruction-tuned / chat model.

Example:

python -m quant_eval.cli.eval_osworld \
    --model_name Qwen/Qwen2.5-7B-Instruct \
    --osworld_path quant_eval/benchmarks/OSWorld \
    --quant_config quant_eval/configs/llama_mxint4.toml \
    --domain chrome --max_steps 15

main(model_name='Qwen/Qwen3-30B-A3B-Instruct-2507', osworld_path='quant_eval/benchmarks/OSWorld', device_id='cuda:0', dtype='bfloat16', quant_config=None, model_parallel=False, provider_name='docker', path_to_vm=None, domain='all', max_steps=15, max_tokens=1500, temperature=0.5, top_p=0.9, max_trajectory_length=3, a11y_tree_max_tokens=10000, result_dir='./results', client_password='password', screen_width=1920, screen_height=1080, headless=True, sleep_after_execution=0.0, test_all_meta_path=None, log_dir=None)

Run OSWorld agentic evaluation with optional MX quantization.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID — must be an instruction-tuned / chat model.

'Qwen/Qwen3-30B-A3B-Instruct-2507'
osworld_path str

Local path to the OSWorld repository checkout.

'quant_eval/benchmarks/OSWorld'
device_id str

CUDA device string.

'cuda:0'
dtype str

Model dtype — "float16", "bfloat16", or "float32".

'bfloat16'
quant_config Union[str, None]

Path to a TOML quantization recipe. None runs the unquantized baseline.

None
model_parallel bool

Distribute across GPUs with device_map="auto".

False
provider_name str

VM provider — one of "docker", "vmware", "virtualbox", "aws".

'docker'
path_to_vm Union[str, None]

Path to the VM image (used by vmware / virtualbox; ignored for Docker).

None
domain str

Task domain to evaluate — "all" or one of OSWorld's domains ("chrome", "libreoffice_calc", "gimp", ...).

'all'
max_steps int

Maximum agent steps per task.

15
max_tokens int

Maximum tokens generated per agent step.

1500
temperature float

Sampling temperature for agent generation.

0.5
top_p float

Top-p sampling for agent generation.

0.9
max_trajectory_length int

Steps of action history retained in the prompt context.

3
a11y_tree_max_tokens int

Maximum tokens for the serialized accessibility-tree observation.

10000
result_dir str

Directory for per-task outputs (trajectories, screenshots, scores).

'./results'
client_password str

VM user password (for sudo operations inside the VM).

'password'
screen_width int

VM display width in pixels.

1920
screen_height int

VM display height in pixels.

1080
headless bool

Run the VM without a visible GUI window.

True
sleep_after_execution float

Seconds to pause after each pyautogui action (useful when the VM needs time to render).

0.0
test_all_meta_path Union[str, None]

Path to a test_all.json task list. None uses OSWorld's default meta file for the chosen domain.

None
log_dir Union[str, None]

Directory for top-level args.json and aggregate results.json.

None

Returns:

Type Description

Dict with keys avg_score, total_tasks, total_success,

all_scores (per-task score list), and per_domain (mapping

each domain to {avg_score, num_tasks, num_success}).

Source code in quant_eval/cli/eval_osworld.py
def main(
    model_name: str = "Qwen/Qwen3-30B-A3B-Instruct-2507",
    osworld_path: str = "quant_eval/benchmarks/OSWorld",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    quant_config: Union[str, None] = None,
    model_parallel: bool = False,
    # OSWorld environment settings
    provider_name: str = "docker",
    path_to_vm: Union[str, None] = None,
    domain: str = "all",
    max_steps: int = 15,
    max_tokens: int = 1500,
    temperature: float = 0.5,
    top_p: float = 0.9,
    max_trajectory_length: int = 3,
    a11y_tree_max_tokens: int = 10000,
    result_dir: str = "./results",
    client_password: str = "password",
    screen_width: int = 1920,
    screen_height: int = 1080,
    headless: bool = True,
    sleep_after_execution: float = 0.0,
    test_all_meta_path: Union[str, None] = None,
    log_dir: Union[str, None] = None,
):
    """
    Run OSWorld agentic evaluation with optional MX quantization.

    Args:
        model_name: HuggingFace model ID — must be an instruction-tuned /
            chat model.
        osworld_path: Local path to the OSWorld repository checkout.
        device_id: CUDA device string.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        quant_config: Path to a TOML quantization recipe. ``None`` runs the
            unquantized baseline.
        model_parallel: Distribute across GPUs with ``device_map="auto"``.
        provider_name: VM provider — one of ``"docker"``, ``"vmware"``,
            ``"virtualbox"``, ``"aws"``.
        path_to_vm: Path to the VM image (used by ``vmware`` / ``virtualbox``;
            ignored for Docker).
        domain: Task domain to evaluate — ``"all"`` or one of OSWorld's
            domains (``"chrome"``, ``"libreoffice_calc"``, ``"gimp"``, ...).
        max_steps: Maximum agent steps per task.
        max_tokens: Maximum tokens generated per agent step.
        temperature: Sampling temperature for agent generation.
        top_p: Top-p sampling for agent generation.
        max_trajectory_length: Steps of action history retained in the
            prompt context.
        a11y_tree_max_tokens: Maximum tokens for the serialized
            accessibility-tree observation.
        result_dir: Directory for per-task outputs (trajectories,
            screenshots, scores).
        client_password: VM user password (for sudo operations inside the VM).
        screen_width: VM display width in pixels.
        screen_height: VM display height in pixels.
        headless: Run the VM without a visible GUI window.
        sleep_after_execution: Seconds to pause after each ``pyautogui``
            action (useful when the VM needs time to render).
        test_all_meta_path: Path to a ``test_all.json`` task list. ``None``
            uses OSWorld's default meta file for the chosen domain.
        log_dir: Directory for top-level ``args.json`` and aggregate
            ``results.json``.

    Returns:
        Dict with keys ``avg_score``, ``total_tasks``, ``total_success``,
        ``all_scores`` (per-task score list), and ``per_domain`` (mapping
        each domain to ``{avg_score, num_tasks, num_success}``).
    """
    print("=" * 60)
    print("OSWorld Agentic Evaluation")
    print("=" * 60)
    print(f"Model: {model_name}")
    print(f"OSWorld path: {osworld_path}")
    print(f"Domain: {domain}")
    print(f"Observation: a11y_tree (text-only)")
    print(f"Action space: pyautogui")

    quantize = quant_config is not None
    if quantize:
        print(f"Quantization config: {quant_config}")
    else:
        print("Quantization: None (baseline)")
    print("=" * 60)

    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        if quant_config:
            import shutil
            shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    dtype_map = {
        "float16": torch.float16,
        "bfloat16": torch.bfloat16,
        "float32": torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    tokenizer, model = setup_model(
        model_name,
        model_parallel,
        dtype=torch_dtype,
        device=device_id if not model_parallel else None,
    )
    model.eval()

    if quantize:
        from chop.passes.module.transforms import quantize_module_transform_pass

        pass_args = load_quant_config(quant_config)
        if "gptq" in pass_args:
            pass_args["gptq"]["device"] = device_id

        n_linear = sum(
            1 for _, m in model.named_modules() if isinstance(m, torch.nn.Linear)
        )
        logger.info("Quantizing %d linear layers...", n_linear)
        t0 = time.time()
        model, _ = quantize_module_transform_pass(model, pass_args)
        logger.info("Quantization complete in %.1fs", time.time() - t0)

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    if quantize:
        print_all_layers(model)

    results = evaluate_osworld(
        model=model,
        tokenizer=tokenizer,
        osworld_path=osworld_path,
        provider_name=provider_name,
        path_to_vm=path_to_vm,
        domain=domain,
        max_steps=max_steps,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        max_trajectory_length=max_trajectory_length,
        a11y_tree_max_tokens=a11y_tree_max_tokens,
        result_dir=result_dir,
        client_password=client_password,
        screen_width=screen_width,
        screen_height=screen_height,
        headless=headless,
        sleep_after_execution=sleep_after_execution,
        test_all_meta_path=test_all_meta_path,
    )

    print("\n" + "=" * 60)
    print("Results:")
    print("=" * 60)
    print(f"  Average score: {results['avg_score']:.4f}")
    print(f"  Tasks completed: {results['total_success']}/{results['total_tasks']}")
    print()
    if results.get("per_domain"):
        print("Per-domain breakdown:")
        for domain_name, domain_results in results["per_domain"].items():
            print(
                f"  {domain_name}: "
                f"avg={domain_results['avg_score']:.4f}, "
                f"success={domain_results['num_success']}/{domain_results['num_tasks']}"
            )

    if log_dir:
        save_results(log_dir, results)

    return results

Rotation search — search_rotation

Calibration-aware per-matmul Hadamard rotation search.

Loads a base TOML quantization config (with all online rotations off), runs rotation_search_transform_pass from MASE, and writes a JSON summary listing the per-matmul-type perplexities, the greedy winners, and the final combined perplexity.

This is a calibration tool, not an evaluation — its output (rotation_decisions.json) is consumed by subsequent eval_* runs whose configs include a [rotation_search] block.

Example:

python -m quant_eval.cli.search_rotation \
    --model_name unsloth/Llama-3.2-1B \
    --base_config quant_eval/configs/llama_mxint4.toml \
    --calib_data wikitext2 \
    --calib_nsamples 128 \
    --output_json checkpoints/rotation_decisions.json

main(model_name='Qwen/Qwen3-8B', base_config='plena_experiments/table9/configs/gsm8k/05_w4_act4_kv4_gptq_erryclip.toml', calib_data='file:calib/Qwen_Qwen3-8B_gsm8k_n64_s1024.pt', device_id='cuda:0', dtype='bfloat16', calib_nsamples=32, calib_seqlen=1024, matmul_types=None, output_json=None, improvement_eps=0.0, log_dir=None)

Greedy forward search for per-matmul online Hadamard rotations that minimise calibration perplexity.

Each round tries enabling rotation on every remaining matmul type, commits the one with the largest ppl drop above improvement_eps, and repeats until no candidate helps.

Parameters:

Name Type Description Default
model_name str

HuggingFace model ID — must match the model that base_config targets.

'Qwen/Qwen3-8B'
base_config str

TOML quantization recipe with all matmul rotations currently off.

'plena_experiments/table9/configs/gsm8k/05_w4_act4_kv4_gptq_erryclip.toml'
calib_data str

Calibration data spec — a saved-token-batch path ("file:calib/foo.pt") or a dataset name ("wikitext2", "c4").

'file:calib/Qwen_Qwen3-8B_gsm8k_n64_s1024.pt'
device_id str

CUDA device for forward passes.

'cuda:0'
dtype str

Model dtype — "float16", "bfloat16", or "float32".

'bfloat16'
calib_nsamples int

Number of calibration samples used to score ppl.

32
calib_seqlen int

Sequence length for the calibration loader.

1024
matmul_types Union[str, None]

Comma-separated subset of matmul types to search (e.g. "q_proj,o_proj,qk_matmul"). None searches all.

None
output_json Union[str, None]

Path where the JSON summary is written.

None
improvement_eps float

Minimum ppl drop required to commit a rotation as a winner. 0.0 accepts any improvement.

0.0
log_dir Union[str, None]

Directory for args.json and results.json.

None

Returns:

Type Description

Dict summarising the search. Core keys: baseline_ppl,

final_ppl, winners (matmul-type names in commit order),

rounds (per-round candidate-vs-ppl history), n_trials,

per_type_swap_count, improvement_eps,

matmul_types_searched. When the search resumes from an existing

output_json, additionally from_cache: True.

Source code in quant_eval/cli/search_rotation.py
def main(
    model_name: str = "Qwen/Qwen3-8B",
    base_config: str = "plena_experiments/table9/configs/gsm8k/05_w4_act4_kv4_gptq_erryclip.toml",
    calib_data: str = "file:calib/Qwen_Qwen3-8B_gsm8k_n64_s1024.pt",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    calib_nsamples: int = 32,
    calib_seqlen: int = 1024,
    matmul_types: Union[str, None] = None,
    output_json: Union[str, None] = None,
    improvement_eps: float = 0.0,
    log_dir: Union[str, None] = None,
):
    """
    Greedy forward search for per-matmul online Hadamard rotations that
    minimise calibration perplexity.

    Each round tries enabling rotation on every remaining matmul type,
    commits the one with the largest ppl drop above ``improvement_eps``,
    and repeats until no candidate helps.

    Args:
        model_name: HuggingFace model ID — must match the model that
            ``base_config`` targets.
        base_config: TOML quantization recipe with all matmul rotations
            currently off.
        calib_data: Calibration data spec — a saved-token-batch path
            (``"file:calib/foo.pt"``) or a dataset name (``"wikitext2"``,
            ``"c4"``).
        device_id: CUDA device for forward passes.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        calib_nsamples: Number of calibration samples used to score ppl.
        calib_seqlen: Sequence length for the calibration loader.
        matmul_types: Comma-separated subset of matmul types to search
            (e.g. ``"q_proj,o_proj,qk_matmul"``). ``None`` searches all.
        output_json: Path where the JSON summary is written.
        improvement_eps: Minimum ppl drop required to commit a rotation
            as a winner. ``0.0`` accepts any improvement.
        log_dir: Directory for ``args.json`` and ``results.json``.

    Returns:
        Dict summarising the search. Core keys: ``baseline_ppl``,
        ``final_ppl``, ``winners`` (matmul-type names in commit order),
        ``rounds`` (per-round candidate-vs-ppl history), ``n_trials``,
        ``per_type_swap_count``, ``improvement_eps``,
        ``matmul_types_searched``. When the search resumes from an existing
        ``output_json``, additionally ``from_cache: True``.
    """
    print("=" * 64)
    print("Calibration-aware per-matmul rotation search")
    print("=" * 64)
    print(f"  Model       : {model_name}")
    print(f"  Base config : {base_config}")
    print(f"  Calib data  : {calib_data}")
    print(f"  Calib n/seq : {calib_nsamples} x {calib_seqlen}")
    print(f"  Output JSON : {output_json}")
    print("=" * 64)

    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        import shutil
        shutil.copy(base_config, log_dir / "base_config.toml")

    transformers.set_seed(0)

    dtype_map = {
        "float16":  torch.float16,
        "bfloat16": torch.bfloat16,
        "float32":  torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    # Quantized attention modules (MXInt + *Rotate variants) replace the
    # eager forward and assert _attn_implementation == "eager".
    tokenizer, model = setup_model(
        model_name,
        model_parallel=False,
        dtype=torch_dtype,
        device=device_id,
        attn_implementation="eager",
    )
    model.eval()

    # Load the calibration loader the same way GPTQ does — same data
    # everyone in this project already feeds into quantization.
    from chop.passes.module.transforms.gptq.data_utils import get_loaders
    calib_loader = get_loaders(
        calib_data,
        nsamples=calib_nsamples,
        seed=0,
        seqlen=calib_seqlen,
        model=model_name,
    )
    logger.info("Loaded %d calibration batches.", len(calib_loader))

    base_pass_args = load_quant_config(base_config)
    if "gptq" in base_pass_args:
        # Plumb the device through so GPTQ runs on the same GPU.
        base_pass_args["gptq"]["device"] = device_id

    selected_types = None
    if matmul_types:
        selected_types = [t.strip() for t in matmul_types.split(",") if t.strip()]

    from chop.passes.module.transforms import rotation_search_transform_pass
    from chop.passes.module.transforms.quantize import ALL_MATMUL_TYPES

    search_args = {
        "base_quantize_args": base_pass_args,
        "calib_loader": calib_loader,
        "device": device_id,
        "matmul_types": selected_types or ALL_MATMUL_TYPES,
        "output_json": output_json,
        "improvement_eps": improvement_eps,
    }

    t0 = time.time()
    model, results = rotation_search_transform_pass(model, search_args)
    logger.info("Rotation search complete in %.1fs", time.time() - t0)

    print("\n" + "=" * 64)
    print("Rotation search results:")
    print("=" * 64)
    print(f"  baseline_ppl : {results['baseline_ppl']:.4f}")
    print(f"  final_ppl    : {results['final_ppl']:.4f}  "
          f"(Δ={results['baseline_ppl']-results['final_ppl']:+.4f} from baseline)")
    print(f"  winners      : {results['winners']}  (in commit order)")
    print(f"  total trials : {results.get('n_trials', 'n/a')}")

    print("\n  Round history:")
    for r in results.get("rounds", []):
        sel = r["selected"]
        if sel is None:
            print(
                f"   round {r['round']}: stopped at ppl={r['current_ppl_after']:.4f}"
            )
            continue
        delta = r["current_ppl_before"] - r["current_ppl_after"]
        print(
            f"   round {r['round']}: +{sel:<11s}  "
            f"ppl {r['current_ppl_before']:.4f} -> {r['current_ppl_after']:.4f}  "
            f"Δ={delta:+.4f}"
        )

    if log_dir:
        save_results(log_dir, results)

    return results