Optical Neural Networks on RoBERTa#

This tutorial demonstrates how to apply optical transformer modifications to RoBERTa for sequence classification. The implementation simulates photonic computing hardware by replacing the attention mechanism and linear layers with quantization-aware equivalents.

Note

If you have not set up the environment yet, follow Installation first.

Overview#

The simulation uses custom Triton kernels from mase-triton to accelerate quantization-aware operations (see Mase-Triton).
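
To make "quantization-aware" concrete, here is a minimal NumPy sketch of the underlying idea: a uniform fake quantizer for the forward pass plus a straight-through gradient rule for the backward pass. This is an illustration only, not the mase-triton kernels; the function names and the uniform-grid scheme are assumptions.

```python
import numpy as np

def fake_quantize(x: np.ndarray, q_levels: int = 256) -> np.ndarray:
    # Illustrative uniform quantizer: snap x onto a grid of q_levels points
    # spanning [x.min(), x.max()]. NOT the actual mase-triton kernel.
    lo, hi = float(x.min()), float(x.max())
    step = (hi - lo) / (q_levels - 1)
    return np.round((x - lo) / step) * step + lo

def fake_quantize_backward(grad_output: np.ndarray) -> np.ndarray:
    # Straight-through estimator: round() has zero derivative almost
    # everywhere, so the backward pass treats the quantizer as identity.
    return grad_output

x = np.linspace(-1.0, 1.0, 1000)
q = fake_quantize(x, q_levels=8)
print(np.unique(q).size)  # at most q_levels distinct values survive
```

The round-off error of such a quantizer is bounded by half the grid step, which is why the number of levels (`q_levels`) directly controls simulation fidelity.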

Optical Transform Configuration#

The transform is controlled through a YAML config file.

Configuration parameters#

  • q_levels — number of quantization levels (default: 256)

  • q_lut_min — minimum lookup-table value for quantization (default: 0.020040)

  • q_quantiles — optional quantile-based range setting (default: null)

  • q_smooth_factor — smoothing factor for statistics updates (default: 0.9)

  • q_init_seed — random seed for initialization (default: 0)

  • q_bypass — bypass the optical transform (default: false)
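
Assuming `q_smooth_factor` acts as a standard exponential-moving-average coefficient for the running quantization statistics (an interpretation, not confirmed by the source), the update would look like:

```python
def smooth_update(running: float, observed: float, q_smooth_factor: float = 0.9) -> float:
    # Exponential moving average: keep 90% of the running statistic and
    # blend in 10% of the newly observed value (hypothetical helper).
    return q_smooth_factor * running + (1 - q_smooth_factor) * observed

stat = 1.0
for obs in (2.0, 2.0, 2.0):
    stat = smooth_update(stat, obs)
# stat drifts from 1.0 toward 2.0; higher q_smooth_factor means slower drift
```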

Default configuration (experiments/roberta-optical-transformer/transform_cfg.yaml)#

"attn":
  q_levels: 256
  q_lut_min: 0.020040
  q_quantiles: null
  q_smooth_factor: 0.9
  q_init_seed: 0
  q_bypass: false
"fc":
  q_levels: 256
  q_lut_min: 0.020040
  q_quantiles: null
  q_smooth_factor: 0.9
  q_init_seed: 0
  q_bypass: false
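
For reference, a config with this shape can be loaded with PyYAML; this is a sketch of how the file parses into nested dictionaries, not the actual loading code inside `run_glue.py`:

```python
import yaml  # PyYAML

# Inline copy of the config structure so the snippet is self-contained.
cfg_text = """
attn:
  q_levels: 256
  q_lut_min: 0.020040
  q_bypass: false
fc:
  q_levels: 256
  q_bypass: false
"""

cfg = yaml.safe_load(cfg_text)
print(cfg["attn"]["q_levels"], cfg["fc"]["q_bypass"])
```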

Fine-Tuning RoBERTa with Optical Transform#

Single task#

cd experiments/roberta-optical-transformer

TASK_NAME="mrpc"
MODEL_NAME="FacebookAI/roberta-base"
LEARNING_RATE="2e-5"
BATCH_SIZE="16"
NUM_EPOCHS="3"
TRANSFORM_CONFIG="transform_cfg.yaml"

python run_glue.py \
    --model_name_or_path ${MODEL_NAME} \
    --task_name ${TASK_NAME} \
    --do_train \
    --do_eval \
    --max_seq_length 128 \
    --per_device_train_batch_size ${BATCH_SIZE} \
    --learning_rate ${LEARNING_RATE} \
    --num_train_epochs ${NUM_EPOCHS} \
    --output_dir ./output/${TASK_NAME}_optical \
    --overwrite_output_dir \
    --transform_config ${TRANSFORM_CONFIG} \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --logging_steps 50 \
    --seed 42

Multiple GLUE tasks#

cd experiments/roberta-optical-transformer

export USE_SINGLE_TASK=false
export TASK_LIST="stsb mrpc cola"
export LR_LIST="1e-3 2e-5 1e-5"
export MODEL_NAME="FacebookAI/roberta-base"
export BATCH_SIZE=16

bash finetune_base.sh
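
`TASK_LIST` and `LR_LIST` are presumably consumed positionally, i.e. each task is paired with the learning rate at the same index (an assumption about how `finetune_base.sh` iterates). In Python terms:

```python
# Positional pairing of tasks and learning rates (assumed script behavior).
tasks = "stsb mrpc cola".split()
lrs = "1e-3 2e-5 1e-5".split()
pairs = list(zip(tasks, lrs))
print(pairs)  # [('stsb', '1e-3'), ('mrpc', '2e-5'), ('cola', '1e-5')]
```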

Evaluation only#

To evaluate a fine-tuned checkpoint rather than the pretrained model, set MODEL_NAME to the training output directory (e.g. ./output/${TASK_NAME}_optical) so --model_name_or_path loads the transformed weights.

python run_glue.py \
    --model_name_or_path ${MODEL_NAME} \
    --task_name ${TASK_NAME} \
    --do_eval \
    --max_seq_length 128 \
    --per_device_eval_batch_size ${BATCH_SIZE} \
    --output_dir ./output/${TASK_NAME}_eval \
    --transform_config ${TRANSFORM_CONFIG} \
    --overwrite_output_dir

Baseline comparison (no transform)#

python run_glue.py \
    --model_name_or_path ${MODEL_NAME} \
    --task_name ${TASK_NAME} \
    --do_train \
    --do_eval \
    --max_seq_length 128 \
    --per_device_train_batch_size ${BATCH_SIZE} \
    --learning_rate ${LEARNING_RATE} \
    --num_train_epochs ${NUM_EPOCHS} \
    --output_dir ./output/${TASK_NAME}_baseline \
    --overwrite_output_dir \
    --evaluation_strategy epoch \
    --save_strategy epoch

Results#

Post-training transform (optical transform applied to a trained RoBERTa, no fine-tuning)#

| Model               | MNLI   | QNLI   | RTE    | SST    | MRPC   | CoLA   | QQP    | STSB    | Avg    |
|---------------------|--------|--------|--------|--------|--------|--------|--------|---------|--------|
| Original            | 0.8728 | 0.9244 | 0.7978 | 0.9357 | 0.9019 | 0.6232 | 0.9153 | 0.9089  | 0.8600 |
| Random              | 0.3266 | 0.4946 | 0.5271 | 0.4908 | 0.3162 | 0.0000 | 0.6318 | 0.0332  | 0.3525 |
| Optical Transformer | 0.8000 | 0.7966 | 0.4801 | 0.8704 | 0.7770 | 0.2034 | 0.9075 | 0.8485  | 0.7104 |
| SqueezeLight        | 0.3200 | 0.4961 | 0.4404 | 0.5126 | 0.5025 | 0.0213 | 0.5890 | −0.0543 | 0.3582 |

Transform-aware fine-tuning#

| Model               | MNLI   | QNLI   | RTE    | SST    | MRPC   | CoLA   | QQP    | STSB   | Avg    |
|---------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| Original            | 0.8728 | 0.9244 | 0.7978 | 0.9357 | 0.9019 | 0.6232 | 0.9153 | 0.9089 | 0.8600 |
| Random              | 0.3266 | 0.4946 | 0.5271 | 0.4908 | 0.3162 | 0.0000 | 0.6318 | 0.0332 | 0.3525 |
| Optical Transformer | 0.8510 | 0.9032 | 0.5813 | 0.9140 | 0.8677 | 0.4441 | 0.9060 | 0.0332 | 0.6876 |
| SqueezeLight        | 0.3212 | 0.4961 | 0.4676 | 0.5131 | 0.5025 | 0.0000 | 0.5932 | 0.0514 | 0.3681 |

Takeaways#

  • The Optical Transformer significantly outperforms SqueezeLight in both evaluation modes. SqueezeLight was designed for convolutional networks and does not transfer well to Transformers.

  • Transform-aware fine-tuning generally outperforms post-training transform, but the noisy forward pass with straight-through estimators can occasionally cause instability (e.g., STSB).

  • We carry the Optical Transformer forward to large-scale CLM experiments. See Scaling Optical Transformers to Causal Language Models for the follow-up.