Evaluation commands¶

Every PLENA evaluation is a Python module under quant_eval.cli. They share a common pattern: pass --model_name, a TOML via --quant_config, and any eval-specific flags. The reference below is auto-generated from each module's docstrings, so it stays in sync with the code.

Perplexity — `eval_ppl`¶

Language-modeling perplexity evaluation with optional MX quantization.

Sets up an HF causal language model, optionally applies a MASE quantization pass driven by a TOML recipe, then computes perplexity on the chosen dataset (WikiText by default).

Example — baseline (fp16):

python -m quant_eval.cli.eval_ppl --model_name unsloth/Llama-3.2-1B

Example — quantized:

python -m quant_eval.cli.eval_ppl \
    --model_name unsloth/Llama-3.2-1B \
    --quant_config quant_eval/configs/llama_mxint4.toml

`main(model_name='Qwen/Qwen3-30B-A3B', dataset='wikitext', device_id='cuda:0', dtype='bfloat16', quant_config=None, model_parallel=False, seqlen=2048, log_dir=None)` ¶

Evaluate language-modeling perplexity, optionally with MX quantization.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	HuggingFace model ID (e.g. `meta-llama/Llama-2-7b-hf`).	`'Qwen/Qwen3-30B-A3B'`
`dataset`	`str`	Dataset for perplexity scoring. Default `"wikitext"`.	`'wikitext'`
`device_id`	`str`	CUDA device string, e.g. `"cuda:0"`.	`'cuda:0'`
`dtype`	`str`	Model dtype — one of `"float16"`, `"bfloat16"`, `"float32"`.	`'bfloat16'`
`quant_config`	`Union[str, None]`	Path to a TOML quantization recipe. `None` runs the unquantized baseline.	`None`
`model_parallel`	`bool`	Distribute the model across all visible GPUs using HF `device_map="auto"`.	`False`
`seqlen`	`int`	Maximum sequence length passed to the perplexity evaluator.	`2048`
`log_dir`	`Union[str, None]`	Directory in which to write `args.json`, `quant_config.toml` (when quantized), and `results.json`. `None` disables logging.	`None`

Returns:

Type	Description
	Dict of metric name to scalar. For wikitext: `{"ppl": float}`.

Source code in quant_eval/cli/eval_ppl.py

def main(
    model_name: str = "Qwen/Qwen3-30B-A3B",
    dataset: str = "wikitext",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    quant_config: Union[str, None] = None,
    model_parallel: bool = False,
    seqlen: int = 2048,
    log_dir: Union[str, None] = None,
):
    """
    Evaluate language-modeling perplexity, optionally with MX quantization.

    Args:
        model_name: HuggingFace model ID (e.g. ``meta-llama/Llama-2-7b-hf``).
        dataset: Dataset for perplexity scoring. Default ``"wikitext"``.
        device_id: CUDA device string, e.g. ``"cuda:0"``.
        dtype: Model dtype — one of ``"float16"``, ``"bfloat16"``, ``"float32"``.
        quant_config: Path to a TOML quantization recipe. ``None`` runs the
            unquantized baseline.
        model_parallel: Distribute the model across all visible GPUs using HF
            ``device_map="auto"``.
        seqlen: Maximum sequence length passed to the perplexity evaluator.
        log_dir: Directory in which to write ``args.json``, ``quant_config.toml``
            (when quantized), and ``results.json``. ``None`` disables logging.

    Returns:
        Dict of metric name to scalar. For wikitext: ``{"ppl": float}``.
    """
    print("=" * 60)
    print("Perplexity Evaluation")
    print("=" * 60)
    print(f"Model: {model_name}")
    print(f"Dataset: {dataset}")

    quantize = quant_config is not None
    if quantize:
        print(f"Quantization config: {quant_config}")
    else:
        print("Quantization: None (baseline)")
    print("=" * 60)

    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        if quant_config:
            import shutil
            shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    dtype_map = {"float16": torch.float16, "bfloat16": torch.bfloat16, "float32": torch.float32}
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    attn_impl = "eager" if quant_config else "sdpa"

    tokenizer, model = setup_model(
        model_name, model_parallel, dtype=torch_dtype,
        device=device_id if not model_parallel else None,
        attn_implementation=attn_impl,
    )
    model.eval()

    if quantize:
        from chop.passes.module.transforms import quantize_module_transform_pass

        pass_args = load_quant_config(quant_config)
        if "gptq" in pass_args:
            pass_args["gptq"]["device"] = device_id
        if "rotation_search" in pass_args:
            pass_args["rotation_search"]["device"] = device_id

        n_linear = sum(1 for _, m in model.named_modules() if isinstance(m, torch.nn.Linear))
        logger.info("Quantizing %d linear layers...", n_linear)
        t0 = time.time()
        model, _ = quantize_module_transform_pass(model, pass_args)
        logger.info("Quantization complete in %.1fs", time.time() - t0)

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    if quantize:
        print_all_layers(model)

    results = evaluate_perplexity(
        model=model,
        tokenizer=tokenizer,
        dataset_name=dataset,
        max_length=seqlen,
        verbose=True,
    )

    print("\n" + "=" * 60)
    print("Results:")
    print("=" * 60)
    for k, v in results.items():
        print(f"  {k}: {v}")

    if log_dir:
        save_results(log_dir, results)

    return results

lm-eval-harness — `eval_lm`¶

lm-eval-harness driver with optional MX quantization.

Applies a TOML quantization recipe once before evaluation; activation precision stays fixed for the whole run.

Example:

python -m quant_eval.cli.eval_lm \
    --model_name unsloth/Llama-3.2-1B \
    --quant_config quant_eval/configs/llama_mxint4.toml \
    --tasks arc_easy,hellaswag,winogrande \
    --limit 500

`main(model_name='Qwen/Qwen2.5-1.5B', tasks='wikitext', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, seqlen=2048, batch_size=64, limit=None, log_dir=None)` ¶

Run lm-eval-harness on an optionally MX-quantized HF model.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	HuggingFace model ID.	`'Qwen/Qwen2.5-1.5B'`
`tasks`	`Union[str, list[str]]`	lm-eval task name(s) — comma-separated string or list (e.g. `"arc_easy,hellaswag"`).	`'wikitext'`
`device_id`	`str`	CUDA device string.	`'cuda:0'`
`dtype`	`str`	Model dtype — `"float16"`, `"bfloat16"`, or `"float32"`.	`'bfloat16'`
`quant_config`	`Union[str, None]`	Path to a TOML quantization recipe. `None` runs the unquantized baseline.	`'quant_eval/configs/llama_mxint4.toml'`
`model_parallel`	`bool`	Distribute across GPUs with `device_map="auto"`.	`False`
`seqlen`	`int`	Maximum context length passed to lm-eval.	`2048`
`batch_size`	`Union[int, str]`	Eval batch size. Pass an int for a fixed size, or the string `"auto"` for lm-eval's auto-batching.	`64`
`limit`	`Union[int, float, None]`	Cap samples per task. Int = absolute count; float in `(0, 1)` = fraction of the full dataset; `None` = full.	`None`
`log_dir`	`Union[str, None]`	Directory for `args.json` and `results.json`. `None` disables logging.	`None`

Returns:

Type	Description
	lm-eval results dict — per-task metrics plus aggregate scores.

Source code in quant_eval/cli/eval_lm.py

def main(
    model_name: str = "Qwen/Qwen2.5-1.5B",
    tasks: Union[str, list[str]] = "wikitext",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    quant_config: Union[str, None] = "quant_eval/configs/llama_mxint4.toml",
    model_parallel: bool = False,
    seqlen: int = 2048,
    batch_size: Union[int, str] = 64,
    limit: Union[int, float, None] = None,
    log_dir: Union[str, None] = None,
):
    """
    Run lm-eval-harness on an optionally MX-quantized HF model.

    Args:
        model_name: HuggingFace model ID.
        tasks: lm-eval task name(s) — comma-separated string or list
            (e.g. ``"arc_easy,hellaswag"``).
        device_id: CUDA device string.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        quant_config: Path to a TOML quantization recipe. ``None`` runs the
            unquantized baseline.
        model_parallel: Distribute across GPUs with ``device_map="auto"``.
        seqlen: Maximum context length passed to lm-eval.
        batch_size: Eval batch size. Pass an int for a fixed size, or the
            string ``"auto"`` for lm-eval's auto-batching.
        limit: Cap samples per task. Int = absolute count; float in
            ``(0, 1)`` = fraction of the full dataset; ``None`` = full.
        log_dir: Directory for ``args.json`` and ``results.json``. ``None``
            disables logging.

    Returns:
        lm-eval results dict — per-task metrics plus aggregate scores.
    """
    print("=" * 64)
    print("lm-eval — fixed activation precision (no phase switch)")
    print("=" * 64)
    print(f"  Model  : {model_name}")
    print(f"  Tasks  : {tasks}")
    print(f"  Weights: {quant_config or 'none (fp)'}")
    print(f"  Seqlen : {seqlen}")
    print("=" * 64)

    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        if quant_config:
            import shutil
            shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    dtype_map = {
        "float16":  torch.float16,
        "bfloat16": torch.bfloat16,
        "float32":  torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    # Quantized attention modules (MXInt / MXFP / *Rotate) replace the
    # eager forward path and assert _attn_implementation == "eager". Force
    # eager whenever a quant_config is supplied, regardless of TOML pattern.
    attn_impl = "eager" if quant_config else "sdpa"

    tokenizer, model = setup_model(
        model_name,
        model_parallel,
        dtype=torch_dtype,
        device=device_id if not model_parallel else None,
        attn_implementation=attn_impl,
    )
    model.eval()

    # ``token_collector`` is a side-effect pass that just attaches a hook;
    # if it's the *only* thing in the TOML this run is "calibration mode"
    # and we'll skip module quantization entirely.
    collector_info = None
    pass_args = load_quant_config(quant_config) if quant_config else None

    if pass_args and "token_collector" in pass_args:
        from chop.passes.module.transforms import attach_token_collector_pass

        tc_cfg = pass_args.pop("token_collector")
        logger.info("Attaching TokenCollector: %s", tc_cfg)
        model.to(device_id)
        model, collector_info = attach_token_collector_pass(model, tc_cfg)

    # Run quant pass only if there are real quant blocks left (selectors or
    # gptq); if pass_args is just {"by": ...} after popping token_collector,
    # we're in pure calibration mode and skip quantization.
    has_quant = pass_args is not None and (
        "gptq" in pass_args
        or any(k != "by" for k in pass_args.keys())
    )
    if has_quant:
        from chop.passes.module.transforms import quantize_module_transform_pass

        if "gptq" in pass_args:
            pass_args["gptq"]["device"] = device_id
        # Plumb device + model_name into rotation_search the same way; the
        # MASE pass needs them but they don't belong in the TOML schema.
        if "rotation_search" in pass_args:
            pass_args["rotation_search"]["device"] = device_id
            pass_args["rotation_search"].setdefault("model_name", model_name)

        n_linear = sum(
            1 for _, m in model.named_modules() if isinstance(m, torch.nn.Linear)
        )
        logger.info("Quantizing %d linear layers...", n_linear)
        t0 = time.time()
        model, _ = quantize_module_transform_pass(model, pass_args)
        logger.info("Quantization complete in %.1fs", time.time() - t0)

        # Surface which classes the dispatch landed on (so you can confirm
        # rotate variants are wired in when the TOML asks for them).
        from collections import Counter
        cls_count = Counter(
            type(m).__name__ for _, m in model.named_modules()
            if "MX" in type(m).__name__
        )
        logger.info(
            "Post-quant module classes:\n%s",
            "\n".join(f"  {c}: {n}" for c, n in cls_count.most_common()),
        )

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    # In calibration-only mode, the TokenCollector hook will raise
    # ``CollectorFull`` from inside model.forward once enough tokens have
    # been buffered — we catch it here so the eval pass exits cleanly with
    # the calibration file already on disk.
    from chop.passes.module.transforms.gptq import CollectorFull
    try:
        results = evaluate_with_lm_eval(
            model=model,
            tokenizer=tokenizer,
            tasks=tasks,
            max_length=seqlen,
            batch_size=batch_size,
            log_samples=False,
            limit=limit,
        )
    except CollectorFull as e:
        logger.info("[calibration mode] aborted lm-eval as planned: %s", e)
        results = {"calibration_only": True}

    if collector_info is not None and not collector_info["collector"].complete:
        # lm-eval finished its limit without filling the buffer — flush whatever
        # we have to disk so downstream GPTQ has *something* to work with.
        collector_info["collector"].finalize()

    print("\n" + "=" * 64)
    print("Results:")
    print("=" * 64)
    if "results" in results:
        for task_name, task_results in results["results"].items():
            print(f"  {task_name}:")
            for metric, value in task_results.items():
                if isinstance(value, (int, float)):
                    print(f"    {metric}: {value:.4f}")
    else:
        for k, v in results.items():
            print(f"  {k}: {v}")

    if log_dir:
        save_results(log_dir, results)

    return results

Code generation — `eval_evalplus`¶

HumanEval+/MBPP+ code-generation evaluation with optional MX quantization.

Routes through evalplus to score pass@1 (or pass@k) on the HumanEval+ or MBPP+ benchmarks under a single fixed-precision quantization profile. Use this when you want to check whether a quantization recipe still preserves the reasoning required for code generation.

Requires the evalplus extra:

uv sync --extra evalplus

Example:

python -m quant_eval.cli.eval_evalplus \
    --model_name unsloth/Llama-3.2-1B \
    --quant_config quant_eval/configs/llama_mxint4.toml \
    --dataset humaneval \
    --greedy \
    --evalplus_output_dir logs/evalplus

`main(model_name='Qwen/Qwen2.5-1.5B', dataset='humaneval', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, batch_size=1, greedy=False, n_samples=1, max_new_tokens=4096, evalplus_output_dir=None, overwrite=False, base_only=False, parallel=None, version='default', log_dir=None)` ¶

Run evalplus (HumanEval+ / MBPP+) on an optionally MX-quantized HF model.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	HuggingFace model ID.	`'Qwen/Qwen2.5-1.5B'`
`dataset`	`str`	`"humaneval"` or `"mbpp"`.	`'humaneval'`
`device_id`	`str`	CUDA device string.	`'cuda:0'`
`dtype`	`str`	Model dtype — `"float16"`, `"bfloat16"`, or `"float32"`.	`'bfloat16'`
`quant_config`	`Union[str, None]`	Path to a TOML quantization recipe. `None` runs the unquantized baseline.	`'quant_eval/configs/llama_mxint4.toml'`
`model_parallel`	`bool`	Distribute across GPUs with `device_map="auto"`.	`False`
`batch_size`	`int`	Generation batch size (samples per task per call).	`1`
`greedy`	`bool`	Greedy decoding (forces `temperature=0` and `n_samples=1`).	`False`
`n_samples`	`int`	Samples per task. Ignored when `greedy=True`.	`1`
`max_new_tokens`	`int`	Maximum tokens generated per sample.	`4096`
`evalplus_output_dir`	`Union[str, None]`	Directory where evalplus writes the generated solutions JSONL and per-problem evaluation results.	`None`
`overwrite`	`bool`	Regenerate solutions even if a previous JSONL exists.	`False`
`base_only`	`bool`	Score against base tests only (skip the +/plus tests).	`False`
`parallel`	`Union[int, None]`	Worker count for evalplus's code-execution stage. `None` runs serially.	`None`
`version`	`str`	evalplus dataset version (e.g. `"default"`).	`'default'`
`log_dir`	`Union[str, None]`	Directory for `args.json` and `results.json`.	`None`

Returns:

Type	Description
	evalplus results dict — pass@1 (and pass@k when applicable) plus
	per-problem outcomes.

Raises:

Type	Description
`ValueError`	`dataset` is not `"humaneval"` or `"mbpp"`.

Source code in quant_eval/cli/eval_evalplus.py

def main(
    model_name: str = "Qwen/Qwen2.5-1.5B",
    dataset: str = "humaneval",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    quant_config: Union[str, None] = "quant_eval/configs/llama_mxint4.toml",
    model_parallel: bool = False,
    batch_size: int = 1,
    greedy: bool = False,
    n_samples: int = 1,
    max_new_tokens: int = 4096,
    evalplus_output_dir: Union[str, None] = None,
    overwrite: bool = False,
    base_only: bool = False,
    parallel: Union[int, None] = None,
    version: str = "default",
    log_dir: Union[str, None] = None,
):
    """
    Run evalplus (HumanEval+ / MBPP+) on an optionally MX-quantized HF model.

    Args:
        model_name: HuggingFace model ID.
        dataset: ``"humaneval"`` or ``"mbpp"``.
        device_id: CUDA device string.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        quant_config: Path to a TOML quantization recipe. ``None`` runs the
            unquantized baseline.
        model_parallel: Distribute across GPUs with ``device_map="auto"``.
        batch_size: Generation batch size (samples per task per call).
        greedy: Greedy decoding (forces ``temperature=0`` and ``n_samples=1``).
        n_samples: Samples per task. Ignored when ``greedy=True``.
        max_new_tokens: Maximum tokens generated per sample.
        evalplus_output_dir: Directory where evalplus writes the generated
            solutions JSONL and per-problem evaluation results.
        overwrite: Regenerate solutions even if a previous JSONL exists.
        base_only: Score against base tests only (skip the +/plus tests).
        parallel: Worker count for evalplus's code-execution stage. ``None``
            runs serially.
        version: evalplus dataset version (e.g. ``"default"``).
        log_dir: Directory for ``args.json`` and ``results.json``.

    Returns:
        evalplus results dict — pass@1 (and pass@k when applicable) plus
        per-problem outcomes.

    Raises:
        ValueError: ``dataset`` is not ``"humaneval"`` or ``"mbpp"``.
    """
    if dataset not in ("humaneval", "mbpp"):
        raise ValueError(f"dataset must be 'humaneval' or 'mbpp', got {dataset!r}")

    print("=" * 64)
    print("evalplus — fixed activation precision (no phase switch)")
    print("=" * 64)
    print(f"  Model  : {model_name}")
    print(f"  Dataset: {dataset}")
    print(f"  Weights: {quant_config or 'none (fp)'}")
    print(f"  Greedy : {greedy}  (n_samples={n_samples}, batch_size={batch_size})")
    print("=" * 64)

    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        if quant_config:
            import shutil
            shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    dtype_map = {
        "float16":  torch.float16,
        "bfloat16": torch.bfloat16,
        "float32":  torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    # Quantized attention modules (MXInt / *Rotate) replace the eager forward
    # path and assert _attn_implementation == "eager". Force eager whenever a
    # quant_config is supplied.
    attn_impl = "eager" if quant_config else "sdpa"

    tokenizer, model = setup_model(
        model_name,
        model_parallel,
        dtype=torch_dtype,
        device=device_id if not model_parallel else None,
        attn_implementation=attn_impl,
    )
    model.eval()

    # ``token_collector`` is a side-effect pass that attaches a hook; if it's
    # the *only* thing in the TOML this run is calibration mode and we skip
    # module quantization entirely.
    collector_info = None
    pass_args = load_quant_config(quant_config) if quant_config else None

    if pass_args and "token_collector" in pass_args:
        from chop.passes.module.transforms import attach_token_collector_pass

        tc_cfg = pass_args.pop("token_collector")
        logger.info("Attaching TokenCollector: %s", tc_cfg)
        model.to(device_id)
        model, collector_info = attach_token_collector_pass(model, tc_cfg)

    has_quant = pass_args is not None and (
        "gptq" in pass_args
        or any(k != "by" for k in pass_args.keys())
    )
    if has_quant:
        from chop.passes.module.transforms import quantize_module_transform_pass

        if "gptq" in pass_args:
            pass_args["gptq"]["device"] = device_id

        n_linear = sum(
            1 for _, m in model.named_modules() if isinstance(m, torch.nn.Linear)
        )
        logger.info("Quantizing %d linear layers...", n_linear)
        t0 = time.time()
        model, _ = quantize_module_transform_pass(model, pass_args)
        logger.info("Quantization complete in %.1fs", time.time() - t0)

        from collections import Counter
        cls_count = Counter(
            type(m).__name__ for _, m in model.named_modules()
            if "MX" in type(m).__name__
        )
        logger.info(
            "Post-quant module classes:\n%s",
            "\n".join(f"  {c}: {n}" for c, n in cls_count.most_common()),
        )

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    # In calibration-only mode, the hook raises CollectorFull from inside
    # forward once the buffer is full — catch it so eval exits cleanly with
    # the calibration file already saved.
    from chop.passes.module.transforms.gptq import CollectorFull
    try:
        results = evaluate_with_evalplus(
            model=model,
            tokenizer=tokenizer,
            dataset=dataset,
            batch_size=batch_size,
            greedy=greedy,
            n_samples=n_samples,
            max_new_tokens=max_new_tokens,
            output_dir=evalplus_output_dir,
            parallel=parallel,
            base_only=base_only,
            version=version,
            overwrite=overwrite,
        )
    except CollectorFull as e:
        logger.info("[calibration mode] aborted evalplus as planned: %s", e)
        results = {"calibration_only": True}

    if collector_info is not None and not collector_info["collector"].complete:
        collector_info["collector"].finalize()

    print("\n" + "=" * 64)
    print("Results:")
    print("=" * 64)
    if isinstance(results, dict):
        # evalplus stores per-task pass@k under "pass_at_k"; surface anything
        # numeric so the log is useful even if the schema shifts.
        if "pass_at_k" in results:
            for split, metrics in results["pass_at_k"].items():
                print(f"  {split}:")
                for k, v in metrics.items():
                    print(f"    {k}: {v}")
        else:
            for k, v in results.items():
                if isinstance(v, (int, float, str)):
                    print(f"  {k}: {v}")

    if log_dir:
        save_results(log_dir, results)

    return results

Phase-dependent precision — `eval_phase_lm`¶

lm-eval-harness with phase- and layer-type-dependent MX quantization.

Activation precision switches based on (phase, layer_type):

phase: prefill (seq len > 1) vs decode (seq len == 1), detected from input shape at runtime.
layer_type: attention vs FFN, detected from module names.

The four resulting widths (prefill-attn, prefill-ffn, decode-attn, decode-ffn) are set independently. Weight quantization comes from the TOML recipe; activation widths come from CLI flags. lm-eval itself is unmodified.

Example — disaggregated W4 prefill / W8 decode:

python -m quant_eval.cli.eval_phase_lm \
    --model_name Qwen/Qwen2.5-1.5B \
    --quant_config quant_eval/configs/llama_mxint4.toml \
    --prefill_attn_width 4 --prefill_ffn_width 4 \
    --decode_attn_width  8 --decode_ffn_width  8 \
    --tasks gsm8k --limit 200

`main(model_name='Qwen/Qwen2.5-1.5B', tasks='wikitext', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, seqlen=2048, batch_size=64, prefill_attn_width=4, prefill_ffn_width=4, prefill_attn_block_size=32, prefill_ffn_block_size=32, decode_attn_width=8, decode_ffn_width=8, decode_attn_block_size=32, decode_ffn_block_size=32, attn_keywords=None, ffn_keywords=None, limit=None, log_dir=None)` ¶

Run lm-eval with phase- and layer-type-dependent activation precision.

Phase is detected from input sequence length at runtime; layer type is detected from module names (override via attn_keywords / ffn_keywords for non-standard architectures). The weight quantization recipe from quant_config is applied once; the activation widths and block sizes specified here override the recipe's activation sections per (phase, layer_type) pair.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	HuggingFace model ID.	`'Qwen/Qwen2.5-1.5B'`
`tasks`	`Union[str, list[str]]`	lm-eval task name(s) — comma-separated string or list.	`'wikitext'`
`device_id`	`str`	CUDA device string.	`'cuda:0'`
`dtype`	`str`	Model dtype — `"float16"`, `"bfloat16"`, or `"float32"`.	`'bfloat16'`
`quant_config`	`str`	Path to a TOML quantization recipe.	`'quant_eval/configs/llama_mxint4.toml'`
`model_parallel`	`bool`	Distribute across GPUs with `device_map="auto"`.	`False`
`seqlen`	`int`	Maximum context length passed to lm-eval.	`2048`
`batch_size`	`Union[int, str]`	lm-eval batch size — int or `"auto"`.	`64`
`prefill_attn_width`	`int`	Activation bit-width for attention layers during prefill.	`4`
`prefill_ffn_width`	`int`	Activation bit-width for FFN layers during prefill.	`4`
`prefill_attn_block_size`	`int`	MX block size for attention during prefill.	`32`
`prefill_ffn_block_size`	`int`	MX block size for FFN during prefill.	`32`
`decode_attn_width`	`int`	Activation bit-width for attention layers during decode.	`8`
`decode_ffn_width`	`int`	Activation bit-width for FFN layers during decode.	`8`
`decode_attn_block_size`	`int`	MX block size for attention during decode.	`32`
`decode_ffn_block_size`	`int`	MX block size for FFN during decode.	`32`
`attn_keywords`	`Union[list[str], None]`	Module-name substrings that identify attention blocks. `None` uses the built-in defaults.	`None`
`ffn_keywords`	`Union[list[str], None]`	Module-name substrings that identify FFN blocks. `None` uses the built-in defaults.	`None`
`limit`	`Union[int, float, None]`	Cap samples per task. Int = absolute count; float in `(0, 1)` = fraction; `None` = full dataset.	`None`
`log_dir`	`Union[str, None]`	Directory for `args.json` and `results.json`.	`None`

Returns:

Type	Description
	lm-eval results dict — per-task metrics plus aggregate scores.

Source code in quant_eval/cli/eval_phase_lm.py

def main(
    model_name: str = "Qwen/Qwen2.5-1.5B",
    tasks: Union[str, list[str]] = "wikitext",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    quant_config: str = "quant_eval/configs/llama_mxint4.toml",
    model_parallel: bool = False,
    seqlen: int = 2048,
    batch_size: Union[int, str] = 64,
    # ── Activation precision: prefill ──────────────────────────────
    prefill_attn_width:      int = 4,
    prefill_ffn_width:       int = 4,
    prefill_attn_block_size: int = 32,
    prefill_ffn_block_size:  int = 32,
    # ── Activation precision: decode ───────────────────────────────
    decode_attn_width:       int = 8,
    decode_ffn_width:        int = 8,
    decode_attn_block_size:  int = 32,
    decode_ffn_block_size:   int = 32,
    # ── Optional keyword overrides (for non-standard architectures) ─
    attn_keywords: Union[list[str], None] = None,
    ffn_keywords:  Union[list[str], None] = None,
    limit: Union[int, float, None] = None,
    log_dir: Union[str, None] = None,
):
    """
    Run lm-eval with phase- and layer-type-dependent activation precision.

    Phase is detected from input sequence length at runtime; layer type is
    detected from module names (override via ``attn_keywords`` / ``ffn_keywords``
    for non-standard architectures). The weight quantization recipe from
    ``quant_config`` is applied once; the activation widths and block sizes
    specified here override the recipe's activation sections per
    (phase, layer_type) pair.

    Args:
        model_name: HuggingFace model ID.
        tasks: lm-eval task name(s) — comma-separated string or list.
        device_id: CUDA device string.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        quant_config: Path to a TOML quantization recipe.
        model_parallel: Distribute across GPUs with ``device_map="auto"``.
        seqlen: Maximum context length passed to lm-eval.
        batch_size: lm-eval batch size — int or ``"auto"``.
        prefill_attn_width: Activation bit-width for attention layers during
            prefill.
        prefill_ffn_width: Activation bit-width for FFN layers during prefill.
        prefill_attn_block_size: MX block size for attention during prefill.
        prefill_ffn_block_size: MX block size for FFN during prefill.
        decode_attn_width: Activation bit-width for attention layers during
            decode.
        decode_ffn_width: Activation bit-width for FFN layers during decode.
        decode_attn_block_size: MX block size for attention during decode.
        decode_ffn_block_size: MX block size for FFN during decode.
        attn_keywords: Module-name substrings that identify attention blocks.
            ``None`` uses the built-in defaults.
        ffn_keywords: Module-name substrings that identify FFN blocks. ``None``
            uses the built-in defaults.
        limit: Cap samples per task. Int = absolute count; float in
            ``(0, 1)`` = fraction; ``None`` = full dataset.
        log_dir: Directory for ``args.json`` and ``results.json``.

    Returns:
        lm-eval results dict — per-task metrics plus aggregate scores.
    """
    # ------------------------------------------------------------------
    # Build the nested phase × layer config
    # ------------------------------------------------------------------
    phase_configs = {
        "prefill": {
            "attn": {
                "data_in_width":      prefill_attn_width,
                "data_in_block_size": prefill_attn_block_size,
            },
            "ffn": {
                "data_in_width":      prefill_ffn_width,
                "data_in_block_size": prefill_ffn_block_size,
            },
        },
        "decode": {
            "attn": {
                "data_in_width":      decode_attn_width,
                "data_in_block_size": decode_attn_block_size,
            },
            "ffn": {
                "data_in_width":      decode_ffn_width,
                "data_in_block_size": decode_ffn_block_size,
            },
        },
    }

    # ------------------------------------------------------------------
    # Print header
    # ------------------------------------------------------------------
    _pa = f"MXInt{prefill_attn_width}(bs={prefill_attn_block_size})"
    _pf = f"MXInt{prefill_ffn_width}(bs={prefill_ffn_block_size})"
    _da = f"MXInt{decode_attn_width}(bs={decode_attn_block_size})"
    _df = f"MXInt{decode_ffn_width}(bs={decode_ffn_block_size})"

    print("=" * 64)
    print("lm-eval — Phase × Layer-Type Disaggregated Quantization")
    print("=" * 64)
    print(f"  Model  : {model_name}")
    print(f"  Tasks  : {tasks}")
    print(f"  Weights: {quant_config}")
    print()
    print(f"  {'':10s}  {'attn':>24s}  {'ffn':>24s}")
    print(f"  {'prefill':10s}  {_pa:>24s}  {_pf:>24s}")
    print(f"  {'decode':10s}  {_da:>24s}  {_df:>24s}")
    print("=" * 64)

    # ------------------------------------------------------------------
    # Logging / experiment directory
    # ------------------------------------------------------------------
    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        import shutil
        shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    # ------------------------------------------------------------------
    # Model setup
    # ------------------------------------------------------------------
    dtype_map = {
        "float16":  torch.float16,
        "bfloat16": torch.bfloat16,
        "float32":  torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    tokenizer, model = setup_model(
        model_name,
        model_parallel,
        dtype=torch_dtype,
        device=device_id if not model_parallel else None,
    )
    model.eval()

    # ------------------------------------------------------------------
    # Apply weight quantization (activation configs are set by the hook)
    # ------------------------------------------------------------------
    from chop.passes.module.transforms import quantize_module_transform_pass

    pass_args = load_quant_config(quant_config)
    if "gptq" in pass_args:
        pass_args["gptq"]["device"] = device_id

    n_linear = sum(1 for _, m in model.named_modules() if isinstance(m, torch.nn.Linear))
    logger.info("Quantizing %d linear layers...", n_linear)
    t0 = time.time()
    model, _ = quantize_module_transform_pass(model, pass_args)
    logger.info("Quantization complete in %.1fs", time.time() - t0)

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    # ------------------------------------------------------------------
    # Enable disaggregated quantization hook
    # ------------------------------------------------------------------
    switch_kwargs = {}
    if attn_keywords:
        switch_kwargs["attn_keywords"] = tuple(attn_keywords)
    if ffn_keywords:
        switch_kwargs["ffn_keywords"] = tuple(ffn_keywords)

    switch = PhaseLayerAutoSwitch(model, phase_configs, **switch_kwargs)
    switch.enable()
    logger.info("\n%s", switch.summary())

    # ------------------------------------------------------------------
    # Run lm-eval (hook fires transparently on every forward pass)
    # ------------------------------------------------------------------
    results = evaluate_with_lm_eval(
        model=model,
        tokenizer=tokenizer,
        tasks=tasks,
        max_length=seqlen,
        batch_size=128,
        log_samples=False,
        limit=limit,
    )

    switch.disable()

    # ------------------------------------------------------------------
    # Print results
    # ------------------------------------------------------------------
    print("\n" + "=" * 64)
    print("Results:")
    print("=" * 64)
    print(f"\n  {'':10s}  {'attn':>24s}  {'ffn':>24s}")
    print(f"  {'prefill':10s}  {_pa:>24s}  {_pf:>24s}")
    print(f"  {'decode':10s}  {_da:>24s}  {_df:>24s}")
    print()

    if "results" in results:
        for task_name, task_results in results["results"].items():
            print(f"  {task_name}:")
            for metric, value in task_results.items():
                if isinstance(value, (int, float)):
                    print(f"    {metric}: {value:.4f}")
    else:
        for k, v in results.items():
            print(f"  {k}: {v}")

    if log_dir:
        results["phase_layer_configs"] = phase_configs
        save_results(log_dir, results)

    return results

BFCL with phase-dependent precision — `eval_phase_bfcl`¶

BFCL web-search evaluation with phase- and layer-type-dependent MX quantization.

Serves an MX-quantized model through a lightweight OpenAI-compatible HTTP server (backed by HuggingFace generate), then drives the standard BFCL CLI against it. Activation precision is set independently for each (phase, layer_type) pair via the prefill/decode × attn/FFN flags.

BFCL is a two-step flow:

bfcl generate calls the local server to produce model responses.
bfcl evaluate scores those responses (no model needed).

This script orchestrates both steps automatically and exposes the local server on server_host:server_port.

Requires the bfcl extra and the bfcl-eval package:

uv sync --extra bfcl
pip install bfcl-eval

Example:

python -m quant_eval.cli.eval_phase_bfcl \
    --model_name Qwen/Qwen2.5-1.5B \
    --quant_config quant_eval/configs/llama_mxint4.toml \
    --prefill_attn_width 4 --prefill_ffn_width 4 \
    --decode_attn_width  8 --decode_ffn_width  8 \
    --bfcl_test_categories web_search_base \
    --limit 50

main(model_name='Qwen/Qwen3-8B-FC', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, bfcl_test_categories=None, bfcl_num_threads=1, server_host=DEFAULT_HOST, server_port=DEFAULT_PORT, prefill_attn_width=4, prefill_ffn_width=4, prefill_attn_block_size=32, prefill_ffn_block_size=32, decode_attn_width=8, decode_ffn_width=8, decode_attn_block_size=32, decode_ffn_block_size=32, attn_keywords=None, ffn_keywords=None, limit=None, log_dir=None) ¶

Run BFCL web-search evaluation with phase- and layer-type-dependent activation precision.

Spawns a local OpenAI-compatible HTTP server backed by HF generate so that the unmodified bfcl generate CLI can drive inference, then runs bfcl evaluate to score responses.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	HuggingFace model ID. For function-calling, must be an instruction-tuned model with a function-call template (e.g. `Qwen/Qwen3-8B-FC`).	`'Qwen/Qwen3-8B-FC'`
`device_id`	`str`	CUDA device string.	`'cuda:0'`
`dtype`	`str`	Model dtype — `"float16"`, `"bfloat16"`, or `"float32"`.	`'bfloat16'`
`quant_config`	`str`	Path to a TOML quantization recipe.	`'quant_eval/configs/llama_mxint4.toml'`
`model_parallel`	`bool`	Distribute across GPUs with `device_map="auto"`.	`False`
`bfcl_test_categories`	`Union[list[str], None]`	BFCL category names to evaluate (e.g. `["web_search_base", "web_search_no_snippet"]`). `None` uses the default web-search category set.	`None`
`bfcl_num_threads`	`int`	Parallel inference threads for `bfcl generate`.	`1`
`server_host`	`str`	Host for the local OpenAI-compatible server.	`DEFAULT_HOST`
`server_port`	`int`	Port for the local OpenAI-compatible server.	`DEFAULT_PORT`
`prefill_attn_width`	`int`	Activation bit-width for attention during prefill.	`4`
`prefill_ffn_width`	`int`	Activation bit-width for FFN during prefill.	`4`
`prefill_attn_block_size`	`int`	MX block size for attention during prefill.	`32`
`prefill_ffn_block_size`	`int`	MX block size for FFN during prefill.	`32`
`decode_attn_width`	`int`	Activation bit-width for attention during decode.	`8`
`decode_ffn_width`	`int`	Activation bit-width for FFN during decode.	`8`
`decode_attn_block_size`	`int`	MX block size for attention during decode.	`32`
`decode_ffn_block_size`	`int`	MX block size for FFN during decode.	`32`
`attn_keywords`	`Union[list[str], None]`	Module-name substrings that identify attention blocks. `None` uses the built-in defaults.	`None`
`ffn_keywords`	`Union[list[str], None]`	Module-name substrings that identify FFN blocks. `None` uses the built-in defaults.	`None`
`limit`	`Union[int, None]`	Cap the number of samples per category. `None` = full dataset.	`None`
`log_dir`	`Union[str, None]`	Directory for `args.json` and `results.json`.	`None`

Returns:

Type	Description
	BFCL evaluation summary — per-category scores plus aggregate metrics,
	with `phase_layer_configs` recording the resolved (phase, layer)
	precision table.

Source code in quant_eval/cli/eval_phase_bfcl.py

def main(
    model_name:  str = "Qwen/Qwen3-8B-FC",
    device_id:   str = "cuda:0",
    dtype:       str = "bfloat16",
    quant_config: str = "quant_eval/configs/llama_mxint4.toml",
    model_parallel: bool = False,
    # ── BFCL settings ──────────────────────────────────────────────────────
    bfcl_test_categories: Union[list[str], None] = None,
    bfcl_num_threads:     int   = 1,
    server_host:          str   = DEFAULT_HOST,
    server_port:          int   = DEFAULT_PORT,
    # ── Activation precision: prefill ──────────────────────────────────────
    prefill_attn_width:      int = 4,
    prefill_ffn_width:       int = 4,
    prefill_attn_block_size: int = 32,
    prefill_ffn_block_size:  int = 32,
    # ── Activation precision: decode ───────────────────────────────────────
    decode_attn_width:       int = 8,
    decode_ffn_width:        int = 8,
    decode_attn_block_size:  int = 32,
    decode_ffn_block_size:   int = 32,
    # ── Optional keyword overrides ─────────────────────────────────────────
    attn_keywords: Union[list[str], None] = None,
    ffn_keywords:  Union[list[str], None] = None,
    limit: Union[int, None] = None,
    log_dir: Union[str, None] = None,
):
    """
    Run BFCL web-search evaluation with phase- and layer-type-dependent
    activation precision.

    Spawns a local OpenAI-compatible HTTP server backed by HF ``generate``
    so that the unmodified ``bfcl generate`` CLI can drive inference, then
    runs ``bfcl evaluate`` to score responses.

    Args:
        model_name: HuggingFace model ID. For function-calling, must be an
            instruction-tuned model with a function-call template
            (e.g. ``Qwen/Qwen3-8B-FC``).
        device_id: CUDA device string.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        quant_config: Path to a TOML quantization recipe.
        model_parallel: Distribute across GPUs with ``device_map="auto"``.
        bfcl_test_categories: BFCL category names to evaluate (e.g.
            ``["web_search_base", "web_search_no_snippet"]``). ``None`` uses
            the default web-search category set.
        bfcl_num_threads: Parallel inference threads for ``bfcl generate``.
        server_host: Host for the local OpenAI-compatible server.
        server_port: Port for the local OpenAI-compatible server.
        prefill_attn_width: Activation bit-width for attention during prefill.
        prefill_ffn_width: Activation bit-width for FFN during prefill.
        prefill_attn_block_size: MX block size for attention during prefill.
        prefill_ffn_block_size: MX block size for FFN during prefill.
        decode_attn_width: Activation bit-width for attention during decode.
        decode_ffn_width: Activation bit-width for FFN during decode.
        decode_attn_block_size: MX block size for attention during decode.
        decode_ffn_block_size: MX block size for FFN during decode.
        attn_keywords: Module-name substrings that identify attention blocks.
            ``None`` uses the built-in defaults.
        ffn_keywords: Module-name substrings that identify FFN blocks. ``None``
            uses the built-in defaults.
        limit: Cap the number of samples per category. ``None`` = full dataset.
        log_dir: Directory for ``args.json`` and ``results.json``.

    Returns:
        BFCL evaluation summary — per-category scores plus aggregate metrics,
        with ``phase_layer_configs`` recording the resolved (phase, layer)
        precision table.
    """
    if bfcl_test_categories is None:
        bfcl_test_categories = list(BFCL_WEB_SEARCH_CATEGORIES)

    # ------------------------------------------------------------------
    # Build the nested phase × layer config
    # ------------------------------------------------------------------
    phase_configs = {
        "prefill": {
            "attn": {
                "data_in_width":      prefill_attn_width,
                "data_in_block_size": prefill_attn_block_size,
            },
            "ffn": {
                "data_in_width":      prefill_ffn_width,
                "data_in_block_size": prefill_ffn_block_size,
            },
        },
        "decode": {
            "attn": {
                "data_in_width":      decode_attn_width,
                "data_in_block_size": decode_attn_block_size,
            },
            "ffn": {
                "data_in_width":      decode_ffn_width,
                "data_in_block_size": decode_ffn_block_size,
            },
        },
    }

    # ------------------------------------------------------------------
    # Print header
    # ------------------------------------------------------------------
    _pa = f"MXInt{prefill_attn_width}(bs={prefill_attn_block_size})"
    _pf = f"MXInt{prefill_ffn_width}(bs={prefill_ffn_block_size})"
    _da = f"MXInt{decode_attn_width}(bs={decode_attn_block_size})"
    _df = f"MXInt{decode_ffn_width}(bs={decode_ffn_block_size})"

    print("=" * 64)
    print("BFCL Web Search — Phase × Layer-Type Disaggregated Quantization")
    print("=" * 64)
    print(f"  Model      : {model_name}")
    print(f"  Categories : {bfcl_test_categories}")
    print(f"  Weights    : {quant_config}")
    print(f"  Server     : http://{server_host}:{server_port}")
    print()
    print(f"  {'':10s}  {'attn':>24s}  {'ffn':>24s}")
    print(f"  {'prefill':10s}  {_pa:>24s}  {_pf:>24s}")
    print(f"  {'decode':10s}  {_da:>24s}  {_df:>24s}")
    print("=" * 64)
    logger.info("Model Parallel", model_parallel)

    # ------------------------------------------------------------------
    # Resolve output directories (persistent if log_dir given)
    # ------------------------------------------------------------------
    _tmpdir_ctx = tempfile.TemporaryDirectory()
    _tmpdir     = Path(_tmpdir_ctx.name)

    result_dir = _tmpdir / "bfcl_results"
    score_dir  = _tmpdir / "bfcl_scores"
    result_dir.mkdir(parents=True)
    score_dir.mkdir(parents=True)

    if log_dir:
        log_dir    = create_experiment_log_dir(log_dir)
        result_dir = log_dir / "bfcl_results"
        score_dir  = log_dir / "bfcl_scores"
        result_dir.mkdir(parents=True)
        score_dir.mkdir(parents=True)
        save_args(log_dir, locals().copy())
        import shutil
        shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    # ------------------------------------------------------------------
    # Model setup
    # ------------------------------------------------------------------
    dtype_map = {
        "float16":  torch.float16,
        "bfloat16": torch.bfloat16,
        "float32":  torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    tokenizer, model = setup_model(
        model_name,
        model_parallel,
        dtype=torch_dtype,
        device=device_id if not model_parallel else None,
    )
    model.eval()

    # ------------------------------------------------------------------
    # Weight quantization
    # ------------------------------------------------------------------
    from chop.passes.module.transforms import quantize_module_transform_pass

    pass_args = load_quant_config(quant_config)
    if "gptq" in pass_args:
        pass_args["gptq"]["device"] = device_id

    n_linear = sum(
        1 for _, m in model.named_modules()
        if isinstance(m, torch.nn.Linear)
    )
    logger.info("Quantizing %d linear layers...", n_linear)
    t0 = time.time()
    model, _ = quantize_module_transform_pass(model, pass_args)
    logger.info("Quantization complete in %.1fs", time.time() - t0)

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    # ------------------------------------------------------------------
    # Enable disaggregated quantization hook
    # ------------------------------------------------------------------
    switch_kwargs = {}
    if attn_keywords:
        switch_kwargs["attn_keywords"] = tuple(attn_keywords)
    if ffn_keywords:
        switch_kwargs["ffn_keywords"] = tuple(ffn_keywords)

    switch = PhaseLayerAutoSwitch(model, phase_configs, **switch_kwargs)
    switch.enable()
    logger.info("\n%s", switch.summary())

    # ------------------------------------------------------------------
    # Start the OpenAI-compatible server (hook fires on every request)
    # ------------------------------------------------------------------
    device_str = device_id if not model_parallel else "cuda"
    app = _build_server_app(model, tokenizer, device_str)
    _start_server(app, server_host, server_port)


    # ------------------------------------------------------------------
    # Step 1: bfcl generate  (calls the local server)
    # ------------------------------------------------------------------
    print("\n[1/2] Generating BFCL responses via local server...")
    gen_rc = _run_bfcl_generate(
        model_name      = model_name,
        test_categories = "web_search_base",
        host            = server_host,
        port            = server_port,
        result_dir      = result_dir,
        num_threads     = bfcl_num_threads,
        limit           = limit,
    )
    if gen_rc != 0:
        logger.error("bfcl generate exited with code %d", gen_rc)

#     # ------------------------------------------------------------------
#     # Step 2: bfcl evaluate  (pure scoring, no model needed)
#     # ------------------------------------------------------------------
    print("[2/2] Evaluating BFCL responses...")
    eval_rc, scores = _run_bfcl_evaluate(
        model_name      = model_name,
        test_categories = bfcl_test_categories,
        result_dir      = result_dir,
        score_dir       = score_dir,
    )

    switch.disable()

    # ------------------------------------------------------------------
    # Print results
    # ------------------------------------------------------------------
    print("\n" + "=" * 64)
    print("Results:")
    print("=" * 64)
    print(f"\n  {'':10s}  {'attn':>24s}  {'ffn':>24s}")
    print(f"  {'prefill':10s}  {_pa:>24s}  {_pf:>24s}")
    print(f"  {'decode':10s}  {_da:>24s}  {_df:>24s}")
    print()

    per_cat = scores.pop("per_category", {})
    for cat, cat_scores in per_cat.items():
        print(f"  {cat}:")
        if isinstance(cat_scores, dict):
            for metric, value in cat_scores.items():
                if isinstance(value, (int, float)):
                    print(f"    {metric}: {value:.4f}")
                else:
                    print(f"    {metric}: {value}")
        else:
            print(f"    {cat_scores}")

    if scores:
        print("\n  Overall (from data_overall.csv):")
        for k, v in scores.items():
            print(f"    {k}: {v}")

    # Restore per_category before saving.
    scores["per_category"] = per_cat
    scores["phase_layer_configs"] = phase_configs

    if log_dir:
        save_results(log_dir, scores)

    _tmpdir_ctx.cleanup()
    return scores

Diffusion LLMs — `eval_dllm`¶

Fast-dLLM v2 (block-diffusion language model) evaluation with optional MX quantization.

Evaluates diffusion-based language models via lm-eval-harness, using block-diffusion sampling instead of standard autoregressive decoding. Quantization is applied via the same TOML-config interface as the rest of the toolkit.

Example — baseline:

python -m quant_eval.cli.eval_dllm \
    --model_name Efficient-Large-Model/Fast_dLLM_v2_1.5B \
    --tasks gsm8k

Example — quantized:

python -m quant_eval.cli.eval_dllm \
    --model_name Efficient-Large-Model/Fast_dLLM_v2_1.5B \
    --quant_config quant_eval/configs/llama_mxint4.toml \
    --tasks gsm8k

`main(model_name='Efficient-Large-Model/Fast_dLLM_v2_1.5B', tasks='gsm8k', device_id='cuda:0', dtype='bfloat16', quant_config=None, model_parallel=False, batch_size=32, max_new_tokens=2048, num_fewshot=0, mask_id=151665, bd_size=32, small_block_size=8, threshold=1.0, show_speed=True, log_dir=None)` ¶

Evaluate a Fast-dLLM v2 model with optional MX quantization.

Decoding is block-diffusion: bd_size tokens are generated per outer block, then refined through small_block_size sub-blocks of iterative unmasking.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	HuggingFace model ID (must be a Fast-dLLM v2 checkpoint).	`'Efficient-Large-Model/Fast_dLLM_v2_1.5B'`
`tasks`	`Union[str, list[str]]`	lm-eval task name(s) — comma-separated string or list (e.g. `"gsm8k,minerva_math"`).	`'gsm8k'`
`device_id`	`str`	CUDA device string.	`'cuda:0'`
`dtype`	`str`	Model dtype — `"float16"`, `"bfloat16"`, or `"float32"`.	`'bfloat16'`
`quant_config`	`Union[str, None]`	Path to a TOML quantization recipe. `None` runs the unquantized baseline.	`None`
`model_parallel`	`bool`	Distribute across GPUs with `device_map="auto"`.	`False`
`batch_size`	`int`	lm-eval batch size.	`32`
`max_new_tokens`	`int`	Maximum tokens generated per sample.	`2048`
`num_fewshot`	`int`	Few-shot examples prepended to each task prompt.	`0`
`mask_id`	`int`	Token ID used as the diffusion mask. Default `151665` (matches Qwen-based Fast-dLLM checkpoints).	`151665`
`bd_size`	`int`	Outer block-diffusion block size — tokens generated per outer sampling step.	`32`
`small_block_size`	`int`	Inner block size for iterative unmasking within each outer block.	`8`
`threshold`	`float`	Confidence threshold for committing unmasked tokens.	`1.0`
`show_speed`	`bool`	Log throughput metrics (tokens/second).	`True`
`log_dir`	`Union[str, None]`	Directory for `args.json` and `results.json`.	`None`

Returns:

Type	Description
	lm-eval results dict — per-task metrics plus aggregate scores.

Source code in quant_eval/cli/eval_dllm.py

def main(
    model_name: str = "Efficient-Large-Model/Fast_dLLM_v2_1.5B",
    tasks: Union[str, list[str]] = "gsm8k",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    quant_config: Union[str, None] = None,
    model_parallel: bool = False,
    # dLLM specific
    batch_size: int = 32,
    max_new_tokens: int = 2048,
    num_fewshot: int = 0,
    mask_id: int = 151665,
    bd_size: int = 32,
    small_block_size: int = 8,
    threshold: float = 1.0,
    show_speed: bool = True,
    log_dir: Union[str, None] = None,
):
    """
    Evaluate a Fast-dLLM v2 model with optional MX quantization.

    Decoding is block-diffusion: ``bd_size`` tokens are generated per outer
    block, then refined through ``small_block_size`` sub-blocks of iterative
    unmasking.

    Args:
        model_name: HuggingFace model ID (must be a Fast-dLLM v2 checkpoint).
        tasks: lm-eval task name(s) — comma-separated string or list
            (e.g. ``"gsm8k,minerva_math"``).
        device_id: CUDA device string.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        quant_config: Path to a TOML quantization recipe. ``None`` runs the
            unquantized baseline.
        model_parallel: Distribute across GPUs with ``device_map="auto"``.
        batch_size: lm-eval batch size.
        max_new_tokens: Maximum tokens generated per sample.
        num_fewshot: Few-shot examples prepended to each task prompt.
        mask_id: Token ID used as the diffusion mask. Default ``151665``
            (matches Qwen-based Fast-dLLM checkpoints).
        bd_size: Outer block-diffusion block size — tokens generated per
            outer sampling step.
        small_block_size: Inner block size for iterative unmasking within
            each outer block.
        threshold: Confidence threshold for committing unmasked tokens.
        show_speed: Log throughput metrics (tokens/second).
        log_dir: Directory for ``args.json`` and ``results.json``.

    Returns:
        lm-eval results dict — per-task metrics plus aggregate scores.
    """
    print("=" * 60)
    print("Fast-dLLM Evaluation")
    print("=" * 60)
    print(f"Model: {model_name}")
    print(f"Tasks: {tasks}")
    print(f"Block size: {bd_size}, Sub-block: {small_block_size}, Threshold: {threshold}")

    quantize = quant_config is not None
    if quantize:
        print(f"Quantization config: {quant_config}")
    else:
        print("Quantization: None (baseline)")
    print("=" * 60)

    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        if quant_config:
            import shutil
            shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    dtype_map = {
        "float16": torch.float16,
        "bfloat16": torch.bfloat16,
        "float32": torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    tokenizer, model = setup_model(
        model_name,
        model_parallel,
        dtype=torch_dtype,
        device=device_id if not model_parallel else None,
    )
    model.eval()

    if quantize:
        from chop.passes.module.transforms import quantize_module_transform_pass

        pass_args = load_quant_config(quant_config)
        if "gptq" in pass_args:
            pass_args["gptq"]["device"] = device_id

        n_linear = sum(
            1 for _, m in model.named_modules() if isinstance(m, torch.nn.Linear)
        )
        logger.info("Quantizing %d linear layers...", n_linear)
        t0 = time.time()
        model, _ = quantize_module_transform_pass(model, pass_args)
        logger.info("Quantization complete in %.1fs", time.time() - t0)

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    if quantize:
        print_all_layers(model)

    # Attach block diffusion sampling method
    setup_dllm_generation(model)

    device = torch.device(device_id)
    results = evaluate_dllm(
        model=model,
        tokenizer=tokenizer,
        tasks=tasks,
        device=device,
        model_name=model_name,
        batch_size=batch_size,
        max_new_tokens=max_new_tokens,
        num_fewshot=num_fewshot,
        mask_id=mask_id,
        bd_size=bd_size,
        small_block_size=small_block_size,
        threshold=threshold,
        show_speed=show_speed,
    )

    print("\n" + "=" * 60)
    print("Results:")
    print("=" * 60)
    for task_name, task_results in results.get("results", {}).items():
        print(f"\n{task_name}:")
        for metric, value in task_results.items():
            if isinstance(value, (int, float)):
                print(f"  {metric}: {value:.4f}")

    if log_dir:
        save_results(log_dir, results)

    return results

LLaDA diffusion — `eval_llada`¶

LLaDA (diffusion-style language model) evaluation with optional MX quantization.

Wraps lm-eval-harness's CLI. Use --model llada_dist and pass model and quantization options through lm-eval's --model_args flag.

Example — baseline (prefix cache):

python -m quant_eval.cli.eval_llada \
    --tasks gsm8k --num_fewshot 0 \
    --model llada_dist \
    --model_args model_path='GSAI-ML/LLaDA-8B-Instruct',gen_length=256,steps=256,block_length=32,use_cache=True

Example — with MXINT4 KV-cache quantization:

python -m quant_eval.cli.eval_llada \
    --tasks gsm8k --num_fewshot 0 \
    --model llada_dist \
    --model_args model_path='GSAI-ML/LLaDA-8B-Instruct',gen_length=256,steps=256,block_length=32,use_cache=True,quant_config='quant_eval/configs/llama_mxint4.toml'

Agentic — `eval_osworld`¶

OSWorld agentic evaluation with optional MX quantization.

Runs the OSWorld desktop-task benchmark in text-only (a11y_tree) mode, so quantized language models can serve as OSWorld agents without vision capabilities. The agent observes the desktop via the accessibility tree and generates pyautogui code to act on it; the VM is rolled back between tasks.

Prerequisites:

The OSWorld repository cloned at osworld_path.
A configured VM provider (Docker recommended) with the corresponding image.
An instruction-tuned / chat model.

Example:

python -m quant_eval.cli.eval_osworld \
    --model_name Qwen/Qwen2.5-7B-Instruct \
    --osworld_path quant_eval/benchmarks/OSWorld \
    --quant_config quant_eval/configs/llama_mxint4.toml \
    --domain chrome --max_steps 15

main(model_name='Qwen/Qwen3-30B-A3B-Instruct-2507', osworld_path='quant_eval/benchmarks/OSWorld', device_id='cuda:0', dtype='bfloat16', quant_config=None, model_parallel=False, provider_name='docker', path_to_vm=None, domain='all', max_steps=15, max_tokens=1500, temperature=0.5, top_p=0.9, max_trajectory_length=3, a11y_tree_max_tokens=10000, result_dir='./results', client_password='password', screen_width=1920, screen_height=1080, headless=True, sleep_after_execution=0.0, test_all_meta_path=None, log_dir=None) ¶

Run OSWorld agentic evaluation with optional MX quantization.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	HuggingFace model ID — must be an instruction-tuned / chat model.	`'Qwen/Qwen3-30B-A3B-Instruct-2507'`
`osworld_path`	`str`	Local path to the OSWorld repository checkout.	`'quant_eval/benchmarks/OSWorld'`
`device_id`	`str`	CUDA device string.	`'cuda:0'`
`dtype`	`str`	Model dtype — `"float16"`, `"bfloat16"`, or `"float32"`.	`'bfloat16'`
`quant_config`	`Union[str, None]`	Path to a TOML quantization recipe. `None` runs the unquantized baseline.	`None`
`model_parallel`	`bool`	Distribute across GPUs with `device_map="auto"`.	`False`
`provider_name`	`str`	VM provider — one of `"docker"`, `"vmware"`, `"virtualbox"`, `"aws"`.	`'docker'`
`path_to_vm`	`Union[str, None]`	Path to the VM image (used by `vmware` / `virtualbox`; ignored for Docker).	`None`
`domain`	`str`	Task domain to evaluate — `"all"` or one of OSWorld's domains (`"chrome"`, `"libreoffice_calc"`, `"gimp"`, ...).	`'all'`
`max_steps`	`int`	Maximum agent steps per task.	`15`
`max_tokens`	`int`	Maximum tokens generated per agent step.	`1500`
`temperature`	`float`	Sampling temperature for agent generation.	`0.5`
`top_p`	`float`	Top-p sampling for agent generation.	`0.9`
`max_trajectory_length`	`int`	Steps of action history retained in the prompt context.	`3`
`a11y_tree_max_tokens`	`int`	Maximum tokens for the serialized accessibility-tree observation.	`10000`
`result_dir`	`str`	Directory for per-task outputs (trajectories, screenshots, scores).	`'./results'`
`client_password`	`str`	VM user password (for sudo operations inside the VM).	`'password'`
`screen_width`	`int`	VM display width in pixels.	`1920`
`screen_height`	`int`	VM display height in pixels.	`1080`
`headless`	`bool`	Run the VM without a visible GUI window.	`True`
`sleep_after_execution`	`float`	Seconds to pause after each `pyautogui` action (useful when the VM needs time to render).	`0.0`
`test_all_meta_path`	`Union[str, None]`	Path to a `test_all.json` task list. `None` uses OSWorld's default meta file for the chosen domain.	`None`
`log_dir`	`Union[str, None]`	Directory for top-level `args.json` and aggregate `results.json`.	`None`

Returns:

Type	Description
	Dict with keys `avg_score`, `total_tasks`, `total_success`,
	`all_scores` (per-task score list), and `per_domain` (mapping
	each domain to `{avg_score, num_tasks, num_success}`).

Source code in quant_eval/cli/eval_osworld.py

def main(
    model_name: str = "Qwen/Qwen3-30B-A3B-Instruct-2507",
    osworld_path: str = "quant_eval/benchmarks/OSWorld",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    quant_config: Union[str, None] = None,
    model_parallel: bool = False,
    # OSWorld environment settings
    provider_name: str = "docker",
    path_to_vm: Union[str, None] = None,
    domain: str = "all",
    max_steps: int = 15,
    max_tokens: int = 1500,
    temperature: float = 0.5,
    top_p: float = 0.9,
    max_trajectory_length: int = 3,
    a11y_tree_max_tokens: int = 10000,
    result_dir: str = "./results",
    client_password: str = "password",
    screen_width: int = 1920,
    screen_height: int = 1080,
    headless: bool = True,
    sleep_after_execution: float = 0.0,
    test_all_meta_path: Union[str, None] = None,
    log_dir: Union[str, None] = None,
):
    """
    Run OSWorld agentic evaluation with optional MX quantization.

    Args:
        model_name: HuggingFace model ID — must be an instruction-tuned /
            chat model.
        osworld_path: Local path to the OSWorld repository checkout.
        device_id: CUDA device string.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        quant_config: Path to a TOML quantization recipe. ``None`` runs the
            unquantized baseline.
        model_parallel: Distribute across GPUs with ``device_map="auto"``.
        provider_name: VM provider — one of ``"docker"``, ``"vmware"``,
            ``"virtualbox"``, ``"aws"``.
        path_to_vm: Path to the VM image (used by ``vmware`` / ``virtualbox``;
            ignored for Docker).
        domain: Task domain to evaluate — ``"all"`` or one of OSWorld's
            domains (``"chrome"``, ``"libreoffice_calc"``, ``"gimp"``, ...).
        max_steps: Maximum agent steps per task.
        max_tokens: Maximum tokens generated per agent step.
        temperature: Sampling temperature for agent generation.
        top_p: Top-p sampling for agent generation.
        max_trajectory_length: Steps of action history retained in the
            prompt context.
        a11y_tree_max_tokens: Maximum tokens for the serialized
            accessibility-tree observation.
        result_dir: Directory for per-task outputs (trajectories,
            screenshots, scores).
        client_password: VM user password (for sudo operations inside the VM).
        screen_width: VM display width in pixels.
        screen_height: VM display height in pixels.
        headless: Run the VM without a visible GUI window.
        sleep_after_execution: Seconds to pause after each ``pyautogui``
            action (useful when the VM needs time to render).
        test_all_meta_path: Path to a ``test_all.json`` task list. ``None``
            uses OSWorld's default meta file for the chosen domain.
        log_dir: Directory for top-level ``args.json`` and aggregate
            ``results.json``.

    Returns:
        Dict with keys ``avg_score``, ``total_tasks``, ``total_success``,
        ``all_scores`` (per-task score list), and ``per_domain`` (mapping
        each domain to ``{avg_score, num_tasks, num_success}``).
    """
    print("=" * 60)
    print("OSWorld Agentic Evaluation")
    print("=" * 60)
    print(f"Model: {model_name}")
    print(f"OSWorld path: {osworld_path}")
    print(f"Domain: {domain}")
    print(f"Observation: a11y_tree (text-only)")
    print(f"Action space: pyautogui")

    quantize = quant_config is not None
    if quantize:
        print(f"Quantization config: {quant_config}")
    else:
        print("Quantization: None (baseline)")
    print("=" * 60)

    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        if quant_config:
            import shutil
            shutil.copy(quant_config, log_dir / "quant_config.toml")

    transformers.set_seed(0)

    dtype_map = {
        "float16": torch.float16,
        "bfloat16": torch.bfloat16,
        "float32": torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    tokenizer, model = setup_model(
        model_name,
        model_parallel,
        dtype=torch_dtype,
        device=device_id if not model_parallel else None,
    )
    model.eval()

    if quantize:
        from chop.passes.module.transforms import quantize_module_transform_pass

        pass_args = load_quant_config(quant_config)
        if "gptq" in pass_args:
            pass_args["gptq"]["device"] = device_id

        n_linear = sum(
            1 for _, m in model.named_modules() if isinstance(m, torch.nn.Linear)
        )
        logger.info("Quantizing %d linear layers...", n_linear)
        t0 = time.time()
        model, _ = quantize_module_transform_pass(model, pass_args)
        logger.info("Quantization complete in %.1fs", time.time() - t0)

    if model_parallel:
        model = move_to_gpu(model, model_parallel)
    else:
        model.to(device_id)

    if quantize:
        print_all_layers(model)

    results = evaluate_osworld(
        model=model,
        tokenizer=tokenizer,
        osworld_path=osworld_path,
        provider_name=provider_name,
        path_to_vm=path_to_vm,
        domain=domain,
        max_steps=max_steps,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        max_trajectory_length=max_trajectory_length,
        a11y_tree_max_tokens=a11y_tree_max_tokens,
        result_dir=result_dir,
        client_password=client_password,
        screen_width=screen_width,
        screen_height=screen_height,
        headless=headless,
        sleep_after_execution=sleep_after_execution,
        test_all_meta_path=test_all_meta_path,
    )

    print("\n" + "=" * 60)
    print("Results:")
    print("=" * 60)
    print(f"  Average score: {results['avg_score']:.4f}")
    print(f"  Tasks completed: {results['total_success']}/{results['total_tasks']}")
    print()
    if results.get("per_domain"):
        print("Per-domain breakdown:")
        for domain_name, domain_results in results["per_domain"].items():
            print(
                f"  {domain_name}: "
                f"avg={domain_results['avg_score']:.4f}, "
                f"success={domain_results['num_success']}/{domain_results['num_tasks']}"
            )

    if log_dir:
        save_results(log_dir, results)

    return results

Rotation search — `search_rotation`¶

Calibration-aware per-matmul Hadamard rotation search.

Loads a base TOML quantization config (with all online rotations off), runs rotation_search_transform_pass from MASE, and writes a JSON summary listing the per-matmul-type perplexities, the greedy winners, and the final combined perplexity.

This is a calibration tool, not an evaluation — its output (rotation_decisions.json) is consumed by subsequent eval_* runs whose configs include a [rotation_search] block.

Example:

python -m quant_eval.cli.search_rotation \
    --model_name unsloth/Llama-3.2-1B \
    --base_config quant_eval/configs/llama_mxint4.toml \
    --calib_data wikitext2 \
    --calib_nsamples 128 \
    --output_json checkpoints/rotation_decisions.json

`main(model_name='Qwen/Qwen3-8B', base_config='plena_experiments/table9/configs/gsm8k/05_w4_act4_kv4_gptq_erryclip.toml', calib_data='file:calib/Qwen_Qwen3-8B_gsm8k_n64_s1024.pt', device_id='cuda:0', dtype='bfloat16', calib_nsamples=32, calib_seqlen=1024, matmul_types=None, output_json=None, improvement_eps=0.0, log_dir=None)` ¶

Greedy forward search for per-matmul online Hadamard rotations that minimise calibration perplexity.

Each round tries enabling rotation on every remaining matmul type, commits the one with the largest ppl drop above improvement_eps, and repeats until no candidate helps.

Parameters:

Name	Type	Description	Default
`model_name`	`str`	HuggingFace model ID — must match the model that `base_config` targets.	`'Qwen/Qwen3-8B'`
`base_config`	`str`	TOML quantization recipe with all matmul rotations currently off.	`'plena_experiments/table9/configs/gsm8k/05_w4_act4_kv4_gptq_erryclip.toml'`
`calib_data`	`str`	Calibration data spec — a saved-token-batch path (`"file:calib/foo.pt"`) or a dataset name (`"wikitext2"`, `"c4"`).	`'file:calib/Qwen_Qwen3-8B_gsm8k_n64_s1024.pt'`
`device_id`	`str`	CUDA device for forward passes.	`'cuda:0'`
`dtype`	`str`	Model dtype — `"float16"`, `"bfloat16"`, or `"float32"`.	`'bfloat16'`
`calib_nsamples`	`int`	Number of calibration samples used to score ppl.	`32`
`calib_seqlen`	`int`	Sequence length for the calibration loader.	`1024`
`matmul_types`	`Union[str, None]`	Comma-separated subset of matmul types to search (e.g. `"q_proj,o_proj,qk_matmul"`). `None` searches all.	`None`
`output_json`	`Union[str, None]`	Path where the JSON summary is written.	`None`
`improvement_eps`	`float`	Minimum ppl drop required to commit a rotation as a winner. `0.0` accepts any improvement.	`0.0`
`log_dir`	`Union[str, None]`	Directory for `args.json` and `results.json`.	`None`

Returns:

Type	Description
	Dict summarising the search. Core keys: `baseline_ppl`,
	`final_ppl`, `winners` (matmul-type names in commit order),
	`rounds` (per-round candidate-vs-ppl history), `n_trials`,
	`per_type_swap_count`, `improvement_eps`,
	`matmul_types_searched`. When the search resumes from an existing
	`output_json`, additionally `from_cache: True`.

Source code in quant_eval/cli/search_rotation.py

def main(
    model_name: str = "Qwen/Qwen3-8B",
    base_config: str = "plena_experiments/table9/configs/gsm8k/05_w4_act4_kv4_gptq_erryclip.toml",
    calib_data: str = "file:calib/Qwen_Qwen3-8B_gsm8k_n64_s1024.pt",
    device_id: str = "cuda:0",
    dtype: str = "bfloat16",
    calib_nsamples: int = 32,
    calib_seqlen: int = 1024,
    matmul_types: Union[str, None] = None,
    output_json: Union[str, None] = None,
    improvement_eps: float = 0.0,
    log_dir: Union[str, None] = None,
):
    """
    Greedy forward search for per-matmul online Hadamard rotations that
    minimise calibration perplexity.

    Each round tries enabling rotation on every remaining matmul type,
    commits the one with the largest ppl drop above ``improvement_eps``,
    and repeats until no candidate helps.

    Args:
        model_name: HuggingFace model ID — must match the model that
            ``base_config`` targets.
        base_config: TOML quantization recipe with all matmul rotations
            currently off.
        calib_data: Calibration data spec — a saved-token-batch path
            (``"file:calib/foo.pt"``) or a dataset name (``"wikitext2"``,
            ``"c4"``).
        device_id: CUDA device for forward passes.
        dtype: Model dtype — ``"float16"``, ``"bfloat16"``, or ``"float32"``.
        calib_nsamples: Number of calibration samples used to score ppl.
        calib_seqlen: Sequence length for the calibration loader.
        matmul_types: Comma-separated subset of matmul types to search
            (e.g. ``"q_proj,o_proj,qk_matmul"``). ``None`` searches all.
        output_json: Path where the JSON summary is written.
        improvement_eps: Minimum ppl drop required to commit a rotation
            as a winner. ``0.0`` accepts any improvement.
        log_dir: Directory for ``args.json`` and ``results.json``.

    Returns:
        Dict summarising the search. Core keys: ``baseline_ppl``,
        ``final_ppl``, ``winners`` (matmul-type names in commit order),
        ``rounds`` (per-round candidate-vs-ppl history), ``n_trials``,
        ``per_type_swap_count``, ``improvement_eps``,
        ``matmul_types_searched``. When the search resumes from an existing
        ``output_json``, additionally ``from_cache: True``.
    """
    print("=" * 64)
    print("Calibration-aware per-matmul rotation search")
    print("=" * 64)
    print(f"  Model       : {model_name}")
    print(f"  Base config : {base_config}")
    print(f"  Calib data  : {calib_data}")
    print(f"  Calib n/seq : {calib_nsamples} x {calib_seqlen}")
    print(f"  Output JSON : {output_json}")
    print("=" * 64)

    if log_dir:
        log_dir = create_experiment_log_dir(log_dir)
        save_args(log_dir, locals().copy())
        import shutil
        shutil.copy(base_config, log_dir / "base_config.toml")

    transformers.set_seed(0)

    dtype_map = {
        "float16":  torch.float16,
        "bfloat16": torch.bfloat16,
        "float32":  torch.float32,
    }
    torch_dtype = dtype_map.get(dtype, torch.bfloat16)

    # Quantized attention modules (MXInt + *Rotate variants) replace the
    # eager forward and assert _attn_implementation == "eager".
    tokenizer, model = setup_model(
        model_name,
        model_parallel=False,
        dtype=torch_dtype,
        device=device_id,
        attn_implementation="eager",
    )
    model.eval()

    # Load the calibration loader the same way GPTQ does — same data
    # everyone in this project already feeds into quantization.
    from chop.passes.module.transforms.gptq.data_utils import get_loaders
    calib_loader = get_loaders(
        calib_data,
        nsamples=calib_nsamples,
        seed=0,
        seqlen=calib_seqlen,
        model=model_name,
    )
    logger.info("Loaded %d calibration batches.", len(calib_loader))

    base_pass_args = load_quant_config(base_config)
    if "gptq" in base_pass_args:
        # Plumb the device through so GPTQ runs on the same GPU.
        base_pass_args["gptq"]["device"] = device_id

    selected_types = None
    if matmul_types:
        selected_types = [t.strip() for t in matmul_types.split(",") if t.strip()]

    from chop.passes.module.transforms import rotation_search_transform_pass
    from chop.passes.module.transforms.quantize import ALL_MATMUL_TYPES

    search_args = {
        "base_quantize_args": base_pass_args,
        "calib_loader": calib_loader,
        "device": device_id,
        "matmul_types": selected_types or ALL_MATMUL_TYPES,
        "output_json": output_json,
        "improvement_eps": improvement_eps,
    }

    t0 = time.time()
    model, results = rotation_search_transform_pass(model, search_args)
    logger.info("Rotation search complete in %.1fs", time.time() - t0)

    print("\n" + "=" * 64)
    print("Rotation search results:")
    print("=" * 64)
    print(f"  baseline_ppl : {results['baseline_ppl']:.4f}")
    print(f"  final_ppl    : {results['final_ppl']:.4f}  "
          f"(Δ={results['baseline_ppl']-results['final_ppl']:+.4f} from baseline)")
    print(f"  winners      : {results['winners']}  (in commit order)")
    print(f"  total trials : {results.get('n_trials', 'n/a')}")

    print("\n  Round history:")
    for r in results.get("rounds", []):
        sel = r["selected"]
        if sel is None:
            print(
                f"   round {r['round']}: stopped at ppl={r['current_ppl_after']:.4f}"
            )
            continue
        delta = r["current_ppl_before"] - r["current_ppl_after"]
        print(
            f"   round {r['round']}: +{sel:<11s}  "
            f"ppl {r['current_ppl_before']:.4f} -> {r['current_ppl_after']:.4f}  "
            f"Δ={delta:+.4f}"
        )

    if log_dir:
        save_results(log_dir, results)

    return results

Evaluation commands¶

Perplexity — eval_ppl¶

main(model_name='Qwen/Qwen3-30B-A3B', dataset='wikitext', device_id='cuda:0', dtype='bfloat16', quant_config=None, model_parallel=False, seqlen=2048, log_dir=None) ¶

lm-eval-harness — eval_lm¶

main(model_name='Qwen/Qwen2.5-1.5B', tasks='wikitext', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, seqlen=2048, batch_size=64, limit=None, log_dir=None) ¶

Code generation — eval_evalplus¶

Phase-dependent precision — eval_phase_lm¶

BFCL with phase-dependent precision — eval_phase_bfcl¶

Diffusion LLMs — eval_dllm¶

LLaDA diffusion — eval_llada¶

Agentic — eval_osworld¶

Rotation search — search_rotation¶

Perplexity — `eval_ppl`¶

`main(model_name='Qwen/Qwen3-30B-A3B', dataset='wikitext', device_id='cuda:0', dtype='bfloat16', quant_config=None, model_parallel=False, seqlen=2048, log_dir=None)` ¶

lm-eval-harness — `eval_lm`¶

`main(model_name='Qwen/Qwen2.5-1.5B', tasks='wikitext', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, seqlen=2048, batch_size=64, limit=None, log_dir=None)` ¶

Code generation — `eval_evalplus`¶

Phase-dependent precision — `eval_phase_lm`¶

BFCL with phase-dependent precision — `eval_phase_bfcl`¶

Diffusion LLMs — `eval_dllm`¶

LLaDA diffusion — `eval_llada`¶

Agentic — `eval_osworld`¶

Rotation search — `search_rotation`¶