Evaluation commands
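Each PLENA evaluation is a Python module under quant_eval.cli. Most share a
common pattern: pass --model_name, a TOML recipe via --quant_config, and any
eval-specific flags. The exceptions are eval_llada, which forwards its options
through lm-eval's --model_args, and search_rotation, which takes its recipe via
--base_config. The reference below is auto-generated from each module's
docstrings, so it stays in sync with the code.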
Perplexity — eval_ppl
Language-modeling perplexity evaluation with optional MX quantization.
Sets up an HF causal language model, optionally applies a MASE quantization pass driven by a TOML recipe, then computes perplexity on the chosen dataset (WikiText by default).
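As a rough illustration of the metric itself, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below is not the code in eval_ppl.py; the helper names and the choice of non-overlapping windows of `seqlen` tokens are assumptions made for the example.

```python
import math


def split_into_windows(token_ids, seqlen):
    """Cut a token stream into non-overlapping windows of `seqlen` tokens.

    The incomplete tail is dropped (an assumption of this sketch).
    """
    n_windows = len(token_ids) // seqlen
    return [token_ids[i * seqlen:(i + 1) * seqlen] for i in range(n_windows)]


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_nlls:
        raise ValueError("need at least one token NLL")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that assigns probability 1/4 to every token has perplexity ~4.
uniform_nlls = [math.log(4.0)] * 10
print(perplexity(uniform_nlls))
print(split_into_windows(list(range(10)), seqlen=4))
```

Lower is better: a quantization recipe that damages the model shows up as a higher mean NLL and therefore a higher perplexity.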
Example — baseline (unquantized, default bfloat16):
python -m quant_eval.cli.eval_ppl --model_name unsloth/Llama-3.2-1B
Example — quantized:
python -m quant_eval.cli.eval_ppl \
--model_name unsloth/Llama-3.2-1B \
--quant_config quant_eval/configs/llama_mxint4.toml
main(model_name='Qwen/Qwen3-30B-A3B', dataset='wikitext', device_id='cuda:0', dtype='bfloat16', quant_config=None, model_parallel=False, seqlen=2048, log_dir=None)
Evaluate language-modeling perplexity, optionally with MX quantization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | HuggingFace model ID (e.g. …). | 'Qwen/Qwen3-30B-A3B' |
| dataset | str | Dataset for perplexity scoring. Default … | 'wikitext' |
| device_id | str | CUDA device string, e.g. … | 'cuda:0' |
| dtype | str | Model dtype — one of … | 'bfloat16' |
| quant_config | Union[str, None] | Path to a TOML quantization recipe. | None |
| model_parallel | bool | Distribute the model across all visible GPUs using HF … | False |
| seqlen | int | Maximum sequence length passed to the perplexity evaluator. | 2048 |
| log_dir | Union[str, None] | Directory in which to write … | None |
Returns:
| Type | Description |
|---|---|
| … | Dict of metric name to scalar. For wikitext: … |
Source code in quant_eval/cli/eval_ppl.py
lm-eval-harness — eval_lm
lm-eval-harness driver with optional MX quantization.
Applies a TOML quantization recipe once before evaluation; activation precision stays fixed for the whole run.
Example:
python -m quant_eval.cli.eval_lm \
--model_name unsloth/Llama-3.2-1B \
--quant_config quant_eval/configs/llama_mxint4.toml \
--tasks arc_easy,hellaswag,winogrande \
--limit 500
main(model_name='Qwen/Qwen2.5-1.5B', tasks='wikitext', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, seqlen=2048, batch_size=64, limit=None, log_dir=None)
Run lm-eval-harness on an optionally MX-quantized HF model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | HuggingFace model ID. | 'Qwen/Qwen2.5-1.5B' |
| tasks | Union[str, list[str]] | lm-eval task name(s) — comma-separated string or list (e.g. …). | 'wikitext' |
| device_id | str | CUDA device string. | 'cuda:0' |
| dtype | str | Model dtype — … | 'bfloat16' |
| quant_config | Union[str, None] | Path to a TOML quantization recipe. | 'quant_eval/configs/llama_mxint4.toml' |
| model_parallel | bool | Distribute across GPUs with … | False |
| seqlen | int | Maximum context length passed to lm-eval. | 2048 |
| batch_size | Union[int, str] | Eval batch size. Pass an int for a fixed size, or the string … | 64 |
| limit | Union[int, float, None] | Cap samples per task. Int = absolute count; float in … | None |
| log_dir | Union[str, None] | Directory for … | None |
Returns:
| Type | Description |
|---|---|
| … | lm-eval results dict — per-task metrics plus aggregate scores. |
Source code in quant_eval/cli/eval_lm.py
Code generation — eval_evalplus
HumanEval+/MBPP+ code-generation evaluation with optional MX quantization.
Routes through evalplus to score pass@1 (or pass@k) on the HumanEval+ or
MBPP+ benchmarks under a single fixed-precision quantization profile. Use
this when you want to check whether a quantization recipe still preserves
the reasoning required for code generation.
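For readers unfamiliar with the metric, the sketch below shows the commonly used unbiased pass@k estimator, computed per problem from n generated samples of which c pass all tests, then averaged over problems. It is an illustration only: the function names are made up for this example, and whether this module reports exactly this estimator is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed, k: budget being scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem, k):
    """Average pass@k over a list of (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)


# Three problems, 10 samples each, with 3, 0, and 10 passing samples.
print(mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1))
```

With greedy decoding and a single sample per task, pass@1 reduces to the fraction of problems whose one solution passes.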
Requires the evalplus extra:
uv sync --extra evalplus
Example:
python -m quant_eval.cli.eval_evalplus \
--model_name unsloth/Llama-3.2-1B \
--quant_config quant_eval/configs/llama_mxint4.toml \
--dataset humaneval \
--greedy \
--evalplus_output_dir logs/evalplus
main(model_name='Qwen/Qwen2.5-1.5B', dataset='humaneval', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, batch_size=1, greedy=False, n_samples=1, max_new_tokens=4096, evalplus_output_dir=None, overwrite=False, base_only=False, parallel=None, version='default', log_dir=None)
Run evalplus (HumanEval+ / MBPP+) on an optionally MX-quantized HF model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | HuggingFace model ID. | 'Qwen/Qwen2.5-1.5B' |
| dataset | str | … | 'humaneval' |
| device_id | str | CUDA device string. | 'cuda:0' |
| dtype | str | Model dtype — … | 'bfloat16' |
| quant_config | Union[str, None] | Path to a TOML quantization recipe. | 'quant_eval/configs/llama_mxint4.toml' |
| model_parallel | bool | Distribute across GPUs with … | False |
| batch_size | int | Generation batch size (samples per task per call). | 1 |
| greedy | bool | Greedy decoding (forces …). | False |
| n_samples | int | Samples per task. Ignored when … | 1 |
| max_new_tokens | int | Maximum tokens generated per sample. | 4096 |
| evalplus_output_dir | Union[str, None] | Directory where evalplus writes the generated solutions JSONL and per-problem evaluation results. | None |
| overwrite | bool | Regenerate solutions even if a previous JSONL exists. | False |
| base_only | bool | Score against base tests only (skip the +/plus tests). | False |
| parallel | Union[int, None] | Worker count for evalplus's code-execution stage. | None |
| version | str | evalplus dataset version (e.g. …). | 'default' |
| log_dir | Union[str, None] | Directory for … | None |
Returns:
| Type | Description |
|---|---|
| … | evalplus results dict — pass@1 (and pass@k when applicable) plus per-problem outcomes. |
Raises:
| Type | Description |
|---|---|
| ValueError | … |
Source code in quant_eval/cli/eval_evalplus.py
Phase-dependent precision — eval_phase_lm
lm-eval-harness with phase- and layer-type-dependent MX quantization.
Activation precision switches based on (phase, layer_type):
- phase: prefill (seq len > 1) vs decode (seq len == 1), detected from input shape at runtime.
- layer_type: attention vs FFN, detected from module names.
The four resulting widths (prefill-attn, prefill-ffn, decode-attn, decode-ffn) are set independently. Weight quantization comes from the TOML recipe; activation widths come from CLI flags. lm-eval itself is unmodified.
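The dispatch described above can be sketched in a few lines. This is a minimal illustration, not the module's implementation: the keyword tuples and function names are hypothetical (the substrings are typical of HuggingFace Llama/Qwen module names), and the width table simply mirrors the CLI defaults.

```python
# Hypothetical keyword lists; the real defaults may differ.
ATTN_KEYWORDS = ("self_attn", "attention")
FFN_KEYWORDS = ("mlp", "ffn")

# (phase, layer_type) -> activation bit-width, mirroring the CLI defaults.
WIDTHS = {
    ("prefill", "attn"): 4,
    ("prefill", "ffn"): 4,
    ("decode", "attn"): 8,
    ("decode", "ffn"): 8,
}


def detect_phase(seq_len):
    """Prefill processes many tokens at once; decode processes exactly one."""
    return "prefill" if seq_len > 1 else "decode"


def detect_layer_type(module_name):
    """Classify a module by substring match on its qualified name."""
    if any(key in module_name for key in ATTN_KEYWORDS):
        return "attn"
    if any(key in module_name for key in FFN_KEYWORDS):
        return "ffn"
    return None  # neither attention nor FFN, e.g. an embedding or output head


def activation_width(module_name, seq_len):
    """Look up the activation width for one forward call, or None if unmatched."""
    layer_type = detect_layer_type(module_name)
    if layer_type is None:
        return None
    return WIDTHS[(detect_phase(seq_len), layer_type)]


print(activation_width("model.layers.0.self_attn.q_proj", seq_len=128))
print(activation_width("model.layers.0.mlp.down_proj", seq_len=1))
```

Because the phase is read from the input shape on each call, the same module can run at one width while the prompt is processed and at another once token-by-token generation starts.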
Example — disaggregated precision, 4-bit activations in prefill and 8-bit in decode:
python -m quant_eval.cli.eval_phase_lm \
--model_name Qwen/Qwen2.5-1.5B \
--quant_config quant_eval/configs/llama_mxint4.toml \
--prefill_attn_width 4 --prefill_ffn_width 4 \
--decode_attn_width 8 --decode_ffn_width 8 \
--tasks gsm8k --limit 200
main(model_name='Qwen/Qwen2.5-1.5B', tasks='wikitext', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, seqlen=2048, batch_size=64, prefill_attn_width=4, prefill_ffn_width=4, prefill_attn_block_size=32, prefill_ffn_block_size=32, decode_attn_width=8, decode_ffn_width=8, decode_attn_block_size=32, decode_ffn_block_size=32, attn_keywords=None, ffn_keywords=None, limit=None, log_dir=None)
Run lm-eval with phase- and layer-type-dependent activation precision.
Phase is detected from input sequence length at runtime; layer type is
detected from module names (override via attn_keywords / ffn_keywords
for non-standard architectures). The weight quantization recipe from
quant_config is applied once; the activation widths and block sizes
specified here override the recipe's activation sections per
(phase, layer_type) pair.
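To make the width and block-size knobs concrete, here is a simplified integer block-quantization sketch: each block of `block_size` values shares one power-of-two scale, and each element is rounded to a signed `width`-bit integer code. It is a sketch under stated assumptions, not the scheme MASE implements; in particular the way the shared scale is chosen here is a simplification, and both function names are invented for the example.

```python
import math


def quantize_block(values, width):
    """Quantize one block with a shared power-of-two scale (simplified)."""
    if width < 2:
        raise ValueError("width must be at least 2 bits")
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0 for _ in values]
    qmax = 2 ** (width - 1) - 1  # largest signed code, e.g. 7 for width=4
    # Smallest power-of-two scale that keeps max_abs / scale within qmax.
    scale = 2.0 ** math.ceil(math.log2(max_abs / qmax))
    # Round to the nearest code, clip, then map back to real values.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quantize_activations(values, width, block_size=32):
    """Apply quantize_block independently to consecutive blocks."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_block(values[start:start + block_size], width))
    return out


acts = [0.1 * i for i in range(32)]
for width in (4, 8):
    quantized = quantize_activations(acts, width)
    error = sum(abs(a - q) for a, q in zip(acts, quantized))
    print(width, error)
```

The wider setting leaves a much smaller reconstruction error on the same block, which is the trade-off the per-phase width flags let you explore.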
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | HuggingFace model ID. | 'Qwen/Qwen2.5-1.5B' |
| tasks | Union[str, list[str]] | lm-eval task name(s) — comma-separated string or list. | 'wikitext' |
| device_id | str | CUDA device string. | 'cuda:0' |
| dtype | str | Model dtype — … | 'bfloat16' |
| quant_config | str | Path to a TOML quantization recipe. | 'quant_eval/configs/llama_mxint4.toml' |
| model_parallel | bool | Distribute across GPUs with … | False |
| seqlen | int | Maximum context length passed to lm-eval. | 2048 |
| batch_size | Union[int, str] | lm-eval batch size — int or … | 64 |
| prefill_attn_width | int | Activation bit-width for attention layers during prefill. | 4 |
| prefill_ffn_width | int | Activation bit-width for FFN layers during prefill. | 4 |
| prefill_attn_block_size | int | MX block size for attention during prefill. | 32 |
| prefill_ffn_block_size | int | MX block size for FFN during prefill. | 32 |
| decode_attn_width | int | Activation bit-width for attention layers during decode. | 8 |
| decode_ffn_width | int | Activation bit-width for FFN layers during decode. | 8 |
| decode_attn_block_size | int | MX block size for attention during decode. | 32 |
| decode_ffn_block_size | int | MX block size for FFN during decode. | 32 |
| attn_keywords | Union[list[str], None] | Module-name substrings that identify attention blocks. … | None |
| ffn_keywords | Union[list[str], None] | Module-name substrings that identify FFN blocks. | None |
| limit | Union[int, float, None] | Cap samples per task. Int = absolute count; float in … | None |
| log_dir | Union[str, None] | Directory for … | None |
Returns:
| Type | Description |
|---|---|
| … | lm-eval results dict — per-task metrics plus aggregate scores. |
Source code in quant_eval/cli/eval_phase_lm.py
BFCL with phase-dependent precision — eval_phase_bfcl
BFCL web-search evaluation with phase- and layer-type-dependent MX quantization.
Serves an MX-quantized model through a lightweight OpenAI-compatible HTTP
server (backed by HuggingFace generate), then drives the standard BFCL
CLI against it. Activation precision is set independently for each
(phase, layer_type) pair via the prefill/decode × attn/FFN flags.
BFCL is a two-step flow:
- bfcl generate calls the local server to produce model responses.
- bfcl evaluate scores those responses (no model needed).
This script orchestrates both steps automatically and exposes the local
server on server_host:server_port.
Requires the bfcl extra and the bfcl-eval package:
uv sync --extra bfcl
pip install bfcl-eval
Example:
python -m quant_eval.cli.eval_phase_bfcl \
--model_name Qwen/Qwen2.5-1.5B \
--quant_config quant_eval/configs/llama_mxint4.toml \
--prefill_attn_width 4 --prefill_ffn_width 4 \
--decode_attn_width 8 --decode_ffn_width 8 \
--bfcl_test_categories web_search_base \
--limit 50
main(model_name='Qwen/Qwen3-8B-FC', device_id='cuda:0', dtype='bfloat16', quant_config='quant_eval/configs/llama_mxint4.toml', model_parallel=False, bfcl_test_categories=None, bfcl_num_threads=1, server_host=DEFAULT_HOST, server_port=DEFAULT_PORT, prefill_attn_width=4, prefill_ffn_width=4, prefill_attn_block_size=32, prefill_ffn_block_size=32, decode_attn_width=8, decode_ffn_width=8, decode_attn_block_size=32, decode_ffn_block_size=32, attn_keywords=None, ffn_keywords=None, limit=None, log_dir=None)
Run BFCL web-search evaluation with phase- and layer-type-dependent activation precision.
Spawns a local OpenAI-compatible HTTP server backed by HF generate
so that the unmodified bfcl generate CLI can drive inference, then
runs bfcl evaluate to score responses.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | HuggingFace model ID. For function-calling, must be an instruction-tuned model with a function-call template (e.g. …). | 'Qwen/Qwen3-8B-FC' |
| device_id | str | CUDA device string. | 'cuda:0' |
| dtype | str | Model dtype — … | 'bfloat16' |
| quant_config | str | Path to a TOML quantization recipe. | 'quant_eval/configs/llama_mxint4.toml' |
| model_parallel | bool | Distribute across GPUs with … | False |
| bfcl_test_categories | Union[list[str], None] | BFCL category names to evaluate (e.g. …). | None |
| bfcl_num_threads | int | Parallel inference threads for … | 1 |
| server_host | str | Host for the local OpenAI-compatible server. | DEFAULT_HOST |
| server_port | int | Port for the local OpenAI-compatible server. | DEFAULT_PORT |
| prefill_attn_width | int | Activation bit-width for attention during prefill. | 4 |
| prefill_ffn_width | int | Activation bit-width for FFN during prefill. | 4 |
| prefill_attn_block_size | int | MX block size for attention during prefill. | 32 |
| prefill_ffn_block_size | int | MX block size for FFN during prefill. | 32 |
| decode_attn_width | int | Activation bit-width for attention during decode. | 8 |
| decode_ffn_width | int | Activation bit-width for FFN during decode. | 8 |
| decode_attn_block_size | int | MX block size for attention during decode. | 32 |
| decode_ffn_block_size | int | MX block size for FFN during decode. | 32 |
| attn_keywords | Union[list[str], None] | Module-name substrings that identify attention blocks. … | None |
| ffn_keywords | Union[list[str], None] | Module-name substrings that identify FFN blocks. | None |
| limit | Union[int, None] | Cap the number of samples per category. | None |
| log_dir | Union[str, None] | Directory for … | None |
Returns:
| Type | Description |
|---|---|
| … | BFCL evaluation summary — per-category scores plus aggregate metrics, with … precision table. |
Source code in quant_eval/cli/eval_phase_bfcl.py
Diffusion LLMs — eval_dllm
Fast-dLLM v2 (block-diffusion language model) evaluation with optional MX quantization.
Evaluates diffusion-based language models via lm-eval-harness, using block-diffusion sampling instead of standard autoregressive decoding. Quantization is applied via the same TOML-config interface as the rest of the toolkit.
Example — baseline:
python -m quant_eval.cli.eval_dllm \
--model_name Efficient-Large-Model/Fast_dLLM_v2_1.5B \
--tasks gsm8k
Example — quantized:
python -m quant_eval.cli.eval_dllm \
--model_name Efficient-Large-Model/Fast_dLLM_v2_1.5B \
--quant_config quant_eval/configs/llama_mxint4.toml \
--tasks gsm8k
main(model_name='Efficient-Large-Model/Fast_dLLM_v2_1.5B', tasks='gsm8k', device_id='cuda:0', dtype='bfloat16', quant_config=None, model_parallel=False, batch_size=32, max_new_tokens=2048, num_fewshot=0, mask_id=151665, bd_size=32, small_block_size=8, threshold=1.0, show_speed=True, log_dir=None)
Evaluate a Fast-dLLM v2 model with optional MX quantization.
Decoding is block-diffusion: bd_size tokens are generated per outer
block, then refined through small_block_size sub-blocks of iterative
unmasking.
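The nesting of outer blocks and sub-blocks can be pictured with a small scheduling sketch. It covers only how the token range is partitioned, not the unmasking itself, and it assumes that bd_size is a multiple of small_block_size; the function name is invented for this example.

```python
def block_schedule(max_new_tokens, bd_size=32, small_block_size=8):
    """List (outer_index, start, end) token ranges in generation order.

    Each outer block of `bd_size` tokens is split into sub-blocks of
    `small_block_size` tokens; `end` is exclusive.
    """
    if bd_size % small_block_size != 0:
        raise ValueError("this sketch assumes bd_size is a multiple of small_block_size")
    schedule = []
    for outer_idx, outer_start in enumerate(range(0, max_new_tokens, bd_size)):
        outer_end = min(outer_start + bd_size, max_new_tokens)
        for sub_start in range(outer_start, outer_end, small_block_size):
            sub_end = min(sub_start + small_block_size, outer_end)
            schedule.append((outer_idx, sub_start, sub_end))
    return schedule


# 64 new tokens with the default sizes: 2 outer blocks of 4 sub-blocks each.
for entry in block_schedule(64):
    print(entry)
```

Within each sub-block the sampler would then unmask tokens iteratively, committing those whose confidence clears the threshold parameter.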
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | HuggingFace model ID (must be a Fast-dLLM v2 checkpoint). | 'Efficient-Large-Model/Fast_dLLM_v2_1.5B' |
| tasks | Union[str, list[str]] | lm-eval task name(s) — comma-separated string or list (e.g. …). | 'gsm8k' |
| device_id | str | CUDA device string. | 'cuda:0' |
| dtype | str | Model dtype — … | 'bfloat16' |
| quant_config | Union[str, None] | Path to a TOML quantization recipe. | None |
| model_parallel | bool | Distribute across GPUs with … | False |
| batch_size | int | lm-eval batch size. | 32 |
| max_new_tokens | int | Maximum tokens generated per sample. | 2048 |
| num_fewshot | int | Few-shot examples prepended to each task prompt. | 0 |
| mask_id | int | Token ID used as the diffusion mask. Default … | 151665 |
| bd_size | int | Outer block-diffusion block size — tokens generated per outer sampling step. | 32 |
| small_block_size | int | Inner block size for iterative unmasking within each outer block. | 8 |
| threshold | float | Confidence threshold for committing unmasked tokens. | 1.0 |
| show_speed | bool | Log throughput metrics (tokens/second). | True |
| log_dir | Union[str, None] | Directory for … | None |
Returns:
| Type | Description |
|---|---|
| … | lm-eval results dict — per-task metrics plus aggregate scores. |
Source code in quant_eval/cli/eval_dllm.py
LLaDA diffusion — eval_llada
LLaDA (diffusion-style language model) evaluation with optional MX quantization.
Wraps lm-eval-harness's CLI. Use --model llada_dist and pass model and
quantization options through lm-eval's --model_args flag.
Example — baseline (prefix cache):
python -m quant_eval.cli.eval_llada \
--tasks gsm8k --num_fewshot 0 \
--model llada_dist \
--model_args model_path='GSAI-ML/LLaDA-8B-Instruct',gen_length=256,steps=256,block_length=32,use_cache=True
Example — with MXINT4 KV-cache quantization:
python -m quant_eval.cli.eval_llada \
--tasks gsm8k --num_fewshot 0 \
--model llada_dist \
--model_args model_path='GSAI-ML/LLaDA-8B-Instruct',gen_length=256,steps=256,block_length=32,use_cache=True,quant_config='quant_eval/configs/llama_mxint4.toml'
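Since every model and quantization option for this module travels inside one comma-separated --model_args string, it can help to see how such a string decomposes into key/value pairs. The parser below is a hypothetical stand-in written for illustration; lm-eval has its own parsing logic, which may treat types and edge cases differently.

```python
def parse_model_args(arg_string):
    """Split 'key=value,key=value' into a dict with light type conversion."""

    def convert(text):
        # Booleans and integers are converted; everything else stays a string.
        if text in ("True", "False"):
            return text == "True"
        try:
            return int(text)
        except ValueError:
            return text

    pairs = (item.split("=", 1) for item in arg_string.split(",") if item)
    return {key.strip(): convert(value.strip()) for key, value in pairs}


# The shell strips the single quotes before the program sees the string.
args = parse_model_args(
    "model_path=GSAI-ML/LLaDA-8B-Instruct,gen_length=256,steps=256,"
    "block_length=32,use_cache=True"
)
print(args)
```

Adding quantization is then just one more key in the same string, as in the second example above.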
Agentic — eval_osworld
OSWorld agentic evaluation with optional MX quantization.
Runs the OSWorld desktop-task benchmark in text-only (a11y_tree) mode, so
quantized language models can serve as OSWorld agents without vision
capabilities. The agent observes the desktop via the accessibility tree and
generates pyautogui code to act on it; the VM is rolled back between
tasks.
Prerequisites:
- The OSWorld repository cloned at osworld_path.
- A configured VM provider (Docker recommended) with the corresponding image.
- An instruction-tuned / chat model.
Example:
python -m quant_eval.cli.eval_osworld \
--model_name Qwen/Qwen2.5-7B-Instruct \
--osworld_path quant_eval/benchmarks/OSWorld \
--quant_config quant_eval/configs/llama_mxint4.toml \
--domain chrome --max_steps 15
main(model_name='Qwen/Qwen3-30B-A3B-Instruct-2507', osworld_path='quant_eval/benchmarks/OSWorld', device_id='cuda:0', dtype='bfloat16', quant_config=None, model_parallel=False, provider_name='docker', path_to_vm=None, domain='all', max_steps=15, max_tokens=1500, temperature=0.5, top_p=0.9, max_trajectory_length=3, a11y_tree_max_tokens=10000, result_dir='./results', client_password='password', screen_width=1920, screen_height=1080, headless=True, sleep_after_execution=0.0, test_all_meta_path=None, log_dir=None)
Run OSWorld agentic evaluation with optional MX quantization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | HuggingFace model ID — must be an instruction-tuned / chat model. | 'Qwen/Qwen3-30B-A3B-Instruct-2507' |
| osworld_path | str | Local path to the OSWorld repository checkout. | 'quant_eval/benchmarks/OSWorld' |
| device_id | str | CUDA device string. | 'cuda:0' |
| dtype | str | Model dtype — … | 'bfloat16' |
| quant_config | Union[str, None] | Path to a TOML quantization recipe. | None |
| model_parallel | bool | Distribute across GPUs with … | False |
| provider_name | str | VM provider — one of … | 'docker' |
| path_to_vm | Union[str, None] | Path to the VM image (used by …). | None |
| domain | str | Task domain to evaluate — … | 'all' |
| max_steps | int | Maximum agent steps per task. | 15 |
| max_tokens | int | Maximum tokens generated per agent step. | 1500 |
| temperature | float | Sampling temperature for agent generation. | 0.5 |
| top_p | float | Top-p sampling for agent generation. | 0.9 |
| max_trajectory_length | int | Steps of action history retained in the prompt context. | 3 |
| a11y_tree_max_tokens | int | Maximum tokens for the serialized accessibility-tree observation. | 10000 |
| result_dir | str | Directory for per-task outputs (trajectories, screenshots, scores). | './results' |
| client_password | str | VM user password (for sudo operations inside the VM). | 'password' |
| screen_width | int | VM display width in pixels. | 1920 |
| screen_height | int | VM display height in pixels. | 1080 |
| headless | bool | Run the VM without a visible GUI window. | True |
| sleep_after_execution | float | Seconds to pause after each … | 0.0 |
| test_all_meta_path | Union[str, None] | Path to a … | None |
| log_dir | Union[str, None] | Directory for top-level … | None |
Returns:
| Type | Description |
|---|---|
| … | Dict with keys … each domain to … |
Source code in quant_eval/cli/eval_osworld.py
Rotation search — search_rotation
Calibration-aware per-matmul Hadamard rotation search.
Loads a base TOML quantization config (with all online rotations off), runs
rotation_search_transform_pass from MASE, and writes a JSON summary
listing the per-matmul-type perplexities, the greedy winners, and the
final combined perplexity.
This is a calibration tool, not an evaluation — its output
(rotation_decisions.json) is consumed by subsequent eval_* runs whose
configs include a [rotation_search] block.
Example:
python -m quant_eval.cli.search_rotation \
--model_name unsloth/Llama-3.2-1B \
--base_config quant_eval/configs/llama_mxint4.toml \
--calib_data wikitext2 \
--calib_nsamples 128 \
--output_json checkpoints/rotation_decisions.json
main(model_name='Qwen/Qwen3-8B', base_config='plena_experiments/table9/configs/gsm8k/05_w4_act4_kv4_gptq_erryclip.toml', calib_data='file:calib/Qwen_Qwen3-8B_gsm8k_n64_s1024.pt', device_id='cuda:0', dtype='bfloat16', calib_nsamples=32, calib_seqlen=1024, matmul_types=None, output_json=None, improvement_eps=0.0, log_dir=None)
Greedy forward search for per-matmul online Hadamard rotations that minimise calibration perplexity.
Each round tries enabling rotation on every remaining matmul type,
commits the one with the largest ppl drop above improvement_eps,
and repeats until no candidate helps.
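The greedy loop described above can be sketched independently of the model. In this illustration `score` is a caller-supplied function that returns calibration perplexity for a given set of enabled rotations; the function name, the matmul type names, and the toy scoring table are all invented for the example and do not reflect the actual MASE pass.

```python
def greedy_rotation_search(matmul_types, score, improvement_eps=0.0):
    """Greedy forward selection: enable one rotation per round while it helps."""
    enabled = []
    remaining = list(matmul_types)
    best_ppl = score(enabled)  # baseline with every rotation off
    while remaining:
        # Try each remaining candidate on top of the already committed winners.
        trials = {name: score(enabled + [name]) for name in remaining}
        winner = min(trials, key=trials.get)
        if best_ppl - trials[winner] <= improvement_eps:
            break  # no candidate improves perplexity by more than the threshold
        enabled.append(winner)
        remaining.remove(winner)
        best_ppl = trials[winner]
    return enabled, best_ppl


# Toy scorer: each rotation shifts perplexity by a fixed, independent amount.
GAINS = {"qk": 1.0, "av": 0.5, "down_proj": -0.25}


def toy_score(enabled):
    return 10.0 - sum(GAINS[name] for name in enabled)


print(greedy_rotation_search(GAINS, toy_score))  # (['qk', 'av'], 8.5)
```

Each round costs one scoring pass per remaining candidate, so the number of calibration evaluations grows roughly quadratically with the number of matmul types searched.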
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | HuggingFace model ID — must match the model that … | 'Qwen/Qwen3-8B' |
| base_config | str | TOML quantization recipe with all matmul rotations currently off. | 'plena_experiments/table9/configs/gsm8k/05_w4_act4_kv4_gptq_erryclip.toml' |
| calib_data | str | Calibration data spec — a saved-token-batch path (…). | 'file:calib/Qwen_Qwen3-8B_gsm8k_n64_s1024.pt' |
| device_id | str | CUDA device for forward passes. | 'cuda:0' |
| dtype | str | Model dtype — … | 'bfloat16' |
| calib_nsamples | int | Number of calibration samples used to score ppl. | 32 |
| calib_seqlen | int | Sequence length for the calibration loader. | 1024 |
| matmul_types | Union[str, None] | Comma-separated subset of matmul types to search (e.g. …). | None |
| output_json | Union[str, None] | Path where the JSON summary is written. | None |
| improvement_eps | float | Minimum ppl drop required to commit a rotation as a winner. | 0.0 |
| log_dir | Union[str, None] | Directory for … | None |
Returns:
| Type | Description |
|---|---|
| … | Dict summarising the search. Core keys: … |
Source code in quant_eval/cli/search_rotation.py