Optical Neural Networks on RoBERTa#

This tutorial demonstrates how to apply optical transformer modifications to RoBERTa for sequence classification. The implementation simulates photonic computing with quantization-aware attention mechanisms and linear layers.

Note

If you have not set up the environment yet, follow Installation first.

Overview#

The simulation uses custom Triton kernels from mase-triton to accelerate quantization-aware operations (see Mase-Triton).

Optical Transform Configuration#

The transform is controlled through a YAML config file.

Configuration parameters#

  • q_levels — number of quantization levels (default: 256)

  • q_lut_min — minimum lookup-table value for quantization (default: 0.020040)

  • q_quantiles — optional quantile-based range setting (default: null)

  • q_smooth_factor — smoothing factor for statistics updates (default: 0.9)

  • q_init_seed — random seed for initialization (default: 0)

  • q_bypass — bypass the optical transform (default: false)
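
To make q_levels concrete, here is a minimal sketch of uniform fake quantization in plain Python. It assumes q_levels is the number of evenly spaced values inside a [lo, hi] range. The actual mase-triton kernels may build their grid differently (for example from q_lut_min), and `fake_quantize` is a hypothetical helper, not part of the library.

```python
def fake_quantize(x: float, lo: float, hi: float, q_levels: int = 256) -> float:
    """Snap x to the nearest of q_levels evenly spaced values in [lo, hi].

    Hypothetical helper for illustration; not the mase-triton implementation.
    """
    if q_levels < 2:
        raise ValueError("q_levels must be at least 2")
    if not lo < hi:
        raise ValueError("expected lo < hi")
    x = min(max(x, lo), hi)            # clamp into the representable range
    step = (hi - lo) / (q_levels - 1)  # spacing between adjacent levels
    index = round((x - lo) / step)     # nearest level index, 0..q_levels-1
    return lo + index * step


if __name__ == "__main__":
    # Fewer levels mean a coarser grid and a larger rounding error.
    for levels in (4, 16, 256):
        q = fake_quantize(0.3, -1.0, 1.0, q_levels=levels)
        print(f"q_levels={levels:3d}  0.3 -> {q:.4f}  (error {abs(q - 0.3):.4f})")
```

With this kind of grid the rounding error is bounded by half a step, so it shrinks as q_levels grows.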

Default configuration#

Config file: experiments/roberta-optical-transformer/transform_cfg.yaml

"attn":
  q_levels: 256
  q_lut_min: 0.020040
  q_quantiles: null
  q_smooth_factor: 0.9
  q_init_seed: 0
  q_bypass: false
"fc":
  q_levels: 256
  q_lut_min: 0.020040
  q_quantiles: null
  q_smooth_factor: 0.9
  q_init_seed: 0
  q_bypass: false
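
Because both sections share the same six keys, it can help to validate a config before launching a run. The following is a sketch under stated assumptions: `OpticalQuantConfig` and `parse_transform_cfg` are hypothetical names, the range checks are guesses at sensible values rather than rules enforced by run_glue.py, and the dict stands in for what a YAML loader would return for the file above.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass(frozen=True)
class OpticalQuantConfig:
    """Hypothetical container mirroring one section of transform_cfg.yaml."""
    q_levels: int = 256
    q_lut_min: float = 0.020040
    q_quantiles: Optional[Any] = None
    q_smooth_factor: float = 0.9
    q_init_seed: int = 0
    q_bypass: bool = False

    def __post_init__(self) -> None:
        # Assumed sanity checks, not constraints documented by the project.
        if self.q_levels < 2:
            raise ValueError("q_levels must be at least 2")
        if not 0.0 <= self.q_smooth_factor <= 1.0:
            raise ValueError("q_smooth_factor must lie in [0, 1]")


def parse_transform_cfg(raw: dict) -> dict:
    """Turn {'attn': {...}, 'fc': {...}} into validated config objects."""
    known = {f.name for f in fields(OpticalQuantConfig)}
    parsed = {}
    for section in ("attn", "fc"):
        values = raw.get(section) or {}      # missing section -> defaults
        unknown = set(values) - known
        if unknown:
            raise KeyError(f"unknown keys in '{section}': {sorted(unknown)}")
        parsed[section] = OpticalQuantConfig(**values)
    return parsed


if __name__ == "__main__":
    raw_cfg = {
        "attn": {"q_levels": 256, "q_smooth_factor": 0.9},
        "fc": {"q_levels": 256, "q_bypass": False},
    }
    for name, cfg in parse_transform_cfg(raw_cfg).items():
        print(name, cfg)
```

Rejecting unknown keys early catches typos such as `q_level` before they silently fall back to a default.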

Fine-Tuning RoBERTa with Optical Transform#

Single task#

cd experiments/roberta-optical-transformer

TASK_NAME="mrpc"
MODEL_NAME="FacebookAI/roberta-base"
LEARNING_RATE="2e-5"
BATCH_SIZE="16"
NUM_EPOCHS="3"
TRANSFORM_CONFIG="transform_cfg.yaml"

python run_glue.py \
    --model_name_or_path "${MODEL_NAME}" \
    --task_name "${TASK_NAME}" \
    --do_train \
    --do_eval \
    --max_seq_length 128 \
    --per_device_train_batch_size "${BATCH_SIZE}" \
    --learning_rate "${LEARNING_RATE}" \
    --num_train_epochs "${NUM_EPOCHS}" \
    --output_dir "./output/${TASK_NAME}_optical" \
    --overwrite_output_dir \
    --transform_config "${TRANSFORM_CONFIG}" \
    --eval_strategy epoch \
    --save_strategy epoch \
    --logging_steps 50 \
    --seed 42

Multiple GLUE tasks#

cd experiments/roberta-optical-transformer

export USE_SINGLE_TASK=false
export TASK_LIST="stsb mrpc cola"
export LR_LIST="1e-3 2e-5 1e-5"
export MODEL_NAME="FacebookAI/roberta-base"
export BATCH_SIZE=16

bash finetune_base.sh
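
finetune_base.sh is not shown in this tutorial, so the following is only a guess at how the two lists relate. It assumes TASK_LIST and LR_LIST are paired positionally (stsb with 1e-3, mrpc with 2e-5, cola with 1e-5) and builds one run_glue.py command per pair from flags used elsewhere on this page. `build_commands` is a hypothetical helper, not part of the repository.

```python
import shlex


def build_commands(task_list: str, lr_list: str, model_name: str,
                   batch_size: int) -> list:
    """Pair each task with the learning rate at the same position.

    Assumption: the real script may pair or schedule runs differently.
    """
    tasks, lrs = task_list.split(), lr_list.split()
    if len(tasks) != len(lrs):
        raise ValueError("TASK_LIST and LR_LIST must have the same length")
    commands = []
    for task, lr in zip(tasks, lrs):
        argv = [
            "python", "run_glue.py",
            "--model_name_or_path", model_name,
            "--task_name", task,
            "--do_train", "--do_eval",
            "--per_device_train_batch_size", str(batch_size),
            "--learning_rate", lr,
            "--output_dir", f"./output/{task}_optical",
            "--transform_config", "transform_cfg.yaml",
        ]
        commands.append(shlex.join(argv))  # shell-safe quoting
    return commands


if __name__ == "__main__":
    for cmd in build_commands("stsb mrpc cola", "1e-3 2e-5 1e-5",
                              "FacebookAI/roberta-base", 16):
        print(cmd)
```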

Evaluation only (post-transform, no fine-tuning)#

Evaluate the optical transform applied directly to a task-fine-tuned RoBERTa (no extra fine-tuning). MODEL_NAME must point to a checkpoint that already contains a fine-tuned classifier head for the task — using the bare FacebookAI/roberta-base here gives near-random results because its classifier weights are randomly initialized.

--calibration_steps N runs N forward passes over the training split in train mode (no optimizer step) before evaluation, so that the running *_min_max statistics of the optical transformer (OT) layers are populated. Without calibration, the buffers stay at their [+inf, -inf] initialization and the kernel produces NaN logits, so the flag is required for this flow. Roughly 16–64 steps is enough for GLUE-sized tasks; larger calibration sets give marginally tighter quantization ranges.

cd experiments/roberta-optical-transformer

TASK_NAME="mrpc"
MODEL_NAME="Intel/roberta-base-mrpc"
BATCH_SIZE="16"
TRANSFORM_CONFIG="transform_cfg.yaml"

python run_glue.py \
    --model_name_or_path "${MODEL_NAME}" \
    --task_name "${TASK_NAME}" \
    --do_eval \
    --max_seq_length 128 \
    --per_device_eval_batch_size "${BATCH_SIZE}" \
    --output_dir "./output/${TASK_NAME}_eval" \
    --transform_config "${TRANSFORM_CONFIG}" \
    --calibration_steps 32 \
    --overwrite_output_dir
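
The NaN failure mode described above follows directly from floating-point arithmetic on the uninitialized range. The sketch below is plain Python, not the OT layer code: `RunningMinMax` is a hypothetical stand-in for the *_min_max buffers, and it assumes the first observed batch replaces the infinite initial values while later batches are blended in with q_smooth_factor as an exponential-moving-average weight.

```python
import math


class RunningMinMax:
    """Hypothetical stand-in for an OT layer's running *_min_max buffer."""

    def __init__(self, smooth_factor: float = 0.9) -> None:
        self.smooth_factor = smooth_factor
        self.lo = math.inf    # matches the [+inf, -inf] initialization
        self.hi = -math.inf

    def update(self, batch: list) -> None:
        """Fold one batch of activations into the running range."""
        b_lo, b_hi = min(batch), max(batch)
        if math.isinf(self.lo):
            # Assumed behaviour: the first batch replaces the sentinels.
            self.lo, self.hi = b_lo, b_hi
        else:
            s = self.smooth_factor
            self.lo = s * self.lo + (1.0 - s) * b_lo
            self.hi = s * self.hi + (1.0 - s) * b_hi

    def normalize(self, x: float) -> float:
        """Map x into [0, 1] relative to the current range."""
        return (x - self.lo) / (self.hi - self.lo)


if __name__ == "__main__":
    stats = RunningMinMax()
    # Uncalibrated: (x - inf) / (-inf - inf) is (-inf) / (-inf), i.e. NaN.
    print("before calibration:", stats.normalize(0.5))

    # Calibration: forward passes only, no optimizer step.
    for batch in ([-1.0, 0.2, 0.9], [-0.8, 0.1, 1.1], [-1.2, 0.0, 1.0]):
        stats.update(batch)
    print("after calibration: ", stats.normalize(0.5))
```

Once a single batch has been observed the range is finite, which is why even a small number of calibration steps is enough to avoid NaN outputs.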

Evaluation with fine-tuned optical weights#

Evaluate a checkpoint produced by the Single task block above. --model_weights_path loads the saved state dict after the optical transform is re-applied, so the calibrated *_min_max and seed buffers from training are restored. Without this flag, the standard from_pretrained load drops those buffers as “unexpected keys” and they are re-initialized from scratch on the eval set, which gives noticeably worse numbers than the evaluation reported at the end of training.

cd experiments/roberta-optical-transformer

TASK_NAME="mrpc"
MODEL_NAME="FacebookAI/roberta-base"
BATCH_SIZE="16"
TRANSFORM_CONFIG="transform_cfg.yaml"

python run_glue.py \
    --model_name_or_path "${MODEL_NAME}" \
    --task_name "${TASK_NAME}" \
    --do_eval \
    --max_seq_length 128 \
    --per_device_eval_batch_size "${BATCH_SIZE}" \
    --output_dir "./output/${TASK_NAME}_eval" \
    --transform_config "${TRANSFORM_CONFIG}" \
    --model_weights_path "./output/${TASK_NAME}_optical" \
    --overwrite_output_dir

Baseline comparison (no transform)#

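# Assumes TASK_NAME, MODEL_NAME, BATCH_SIZE, LEARNING_RATE and NUM_EPOCHS
# are still set as in the Single task block.
# No --transform_config: the model is fine-tuned without the optical transform.
python run_glue.py \
    --model_name_or_path "${MODEL_NAME}" \
    --task_name "${TASK_NAME}" \
    --do_train \
    --do_eval \
    --max_seq_length 128 \
    --per_device_train_batch_size "${BATCH_SIZE}" \
    --learning_rate "${LEARNING_RATE}" \
    --num_train_epochs "${NUM_EPOCHS}" \
    --output_dir "./output/${TASK_NAME}_baseline" \
    --overwrite_output_dir \
    --eval_strategy epoch \
    --save_strategy epoch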

Results#

Post-training transform (optical transform applied to a trained RoBERTa, no fine-tuning)#

Model                   MNLI     QNLI      RTE      SST     MRPC     CoLA      QQP     STSB      Avg
Original              0.8728   0.9244   0.7978   0.9357   0.9019   0.6232   0.9153   0.9089   0.8600
Random                0.3266   0.4946   0.5271   0.4908   0.3162   0.0000   0.6318   0.0332   0.3525
Optical Transformer   0.8000   0.7966   0.4801   0.8704   0.7770   0.2034   0.9075   0.8485   0.7104
SqueezeLight          0.3200   0.4961   0.4404   0.5126   0.5025   0.0213   0.5890  −0.0543   0.3582

Transform-aware fine-tuning#

Model                   MNLI     QNLI      RTE      SST     MRPC     CoLA      QQP     STSB      Avg
Original              0.8728   0.9244   0.7978   0.9357   0.9019   0.6232   0.9153   0.9089   0.8600
Random                0.3266   0.4946   0.5271   0.4908   0.3162   0.0000   0.6318   0.0332   0.3525
Optical Transformer   0.8510   0.9032   0.5813   0.9140   0.8677   0.4441   0.9060   0.0332   0.6876
SqueezeLight          0.3212   0.4961   0.4676   0.5131   0.5025   0.0000   0.5932   0.0514   0.3681

Takeaways

  • The Optical Transformer reaches roughly twice SqueezeLight's average score in both evaluation modes, while SqueezeLight stays within about 0.02 of the Random baseline on average. SqueezeLight was designed for convolutional networks and does not transfer well to Transformers.

  • Transform-aware fine-tuning outperforms the post-training transform on six of the eight tasks, but the noisy forward pass with straight-through estimators can occasionally cause instability: on STSB the score collapses from 0.8485 to 0.0332, which pulls the fine-tuned average (0.6876) below the post-training one (0.7104).

  • We carry the Optical Transformer forward to large-scale causal language modeling (CLM) experiments. See Scaling Optical Transformers to Causal Language Models for the follow-up.
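
The per-task comparison between the two evaluation modes can be reproduced from the Optical Transformer rows of the two results tables. The snippet only re-tabulates numbers already reported in this tutorial; the variable names are ad hoc.

```python
TASKS = ["MNLI", "QNLI", "RTE", "SST", "MRPC", "CoLA", "QQP", "STSB"]

# Optical Transformer rows copied from the two results tables.
post_training = [0.8000, 0.7966, 0.4801, 0.8704, 0.7770, 0.2034, 0.9075, 0.8485]
fine_tuned = [0.8510, 0.9032, 0.5813, 0.9140, 0.8677, 0.4441, 0.9060, 0.0332]


def mean(values: list) -> float:
    """Arithmetic mean of a non-empty list of scores."""
    return sum(values) / len(values)


# Tasks where transform-aware fine-tuning beats the post-training transform.
improved = [t for t, p, f in zip(TASKS, post_training, fine_tuned) if f > p]
regressed = [t for t in TASKS if t not in improved]

if __name__ == "__main__":
    print("improved: ", improved)
    print("regressed:", regressed)
    print(f"avg post-training: {mean(post_training):.4f}")
    print(f"avg fine-tuned:    {mean(fine_tuned):.4f}")
```

Six tasks improve and two regress (QQP marginally, STSB sharply), and the STSB drop alone is large enough to lower the fine-tuned average.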