Multimodal Training

CrossEncoder models can be trained on multimodal data, enabling cross-modal reranking where the model scores pairs involving different modalities. Each element in a pair can be:

  • Text: strings.

  • Image: PIL images, file paths, URLs, or numpy/torch arrays.

  • Audio: file paths, numpy/torch arrays, dicts with "array" and "sampling_rate" keys, or torchcodec.AudioDecoder instances.

  • Video: file paths, numpy/torch arrays, dicts with "array" and "video_metadata" keys, or torchcodec.VideoDecoder instances.

  • Multimodal dicts: a dict mapping modality names to values, e.g. {"text": ..., "image": ...}. The keys must be "text", "image", "audio", or "video".
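As a rough illustration of the shapes listed above, the sketch below builds a few pairs using only strings and dicts. The file paths, captions, and the `check_multimodal_dict` helper are hypothetical and not part of the library; they only show what a pair might look like and how the allowed-key rule for multimodal dicts could be checked.

```python
# Hypothetical paths and captions, used only to illustrate pair shapes.
text_to_image_pair = ("a doodle of a cat", "doodles/cat_001.png")
image_to_text_pair = ("doodles/cat_001.png", "a doodle of a cat")

# A pair whose first element is a multimodal dict combining text and an image.
multimodal_pair = (
    {"text": "Which caption matches this drawing?", "image": "doodles/cat_001.png"},
    "a doodle of a cat",
)

# The only modality names a multimodal dict may use, per the list above.
ALLOWED_KEYS = {"text", "image", "audio", "video"}


def check_multimodal_dict(element):
    """Hypothetical helper: True if a non-empty dict uses only supported modality keys."""
    return isinstance(element, dict) and len(element) > 0 and set(element) <= ALLOWED_KEYS


print(check_multimodal_dict(multimodal_pair[0]))
print(check_multimodal_dict({"caption": "a doodle of a cat"}))
```

The first check passes because both keys are supported modality names; the second fails because "caption" is not one of them.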

Two architectural approaches are demonstrated here. Both train on the doodles-captions-manual dataset with BinaryCrossEntropyLoss, using multi-dataset training over two directions: image-to-text and text-to-image.

Transformer (Any-to-Any) + LogitScore

  • training_doodles_any_to_any.py:

    This example builds a multimodal CrossEncoder from Qwen/Qwen3.5-0.8B using the module chain Transformer(transformer_task="any-to-any") + LogitScore.

    The "any-to-any" task loads the full causal LM, including its language model head, via AutoModelForMultimodalLM. Setting add_generation_prompt=True appends the assistant turn start token, so the next-token prediction is taken at the position where the model would begin its answer. LogitScore then reads the next-token logits and computes the relevance score as the log-odds of generating "1" (match) versus "0" (no match).
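The log-odds computation can be sketched in plain Python. This is not the library's LogitScore implementation; it assumes that "log-odds" here means log(p("1") / p("0")) over the next-token distribution, which reduces to the difference of the two logits because the softmax normalizer cancels. The toy vocabulary and token ids are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_odds_score(next_token_logits, id_one, id_zero):
    """log(p1 / p0) equals logit1 - logit0, since the normalizer cancels."""
    return next_token_logits[id_one] - next_token_logits[id_zero]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Toy next-token logits over a 4-token vocabulary (hypothetical ids).
logits = [0.2, 2.5, 1.0, -0.3]
ID_ZERO, ID_ONE = 2, 1

score = log_odds_score(logits, ID_ONE, ID_ZERO)

# Check against the explicit probability ratio.
probs = softmax(logits)
assert math.isclose(score, math.log(probs[ID_ONE] / probs[ID_ZERO]))

# Sigmoid of the log-odds is p1 / (p1 + p0): the match probability
# restricted to the two candidate tokens.
match_probability = sigmoid(score)
```

Because the score is already a logit, a binary cross-entropy objective can be applied to it directly, with the sigmoid supplying the probability.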

    The model is trained with BinaryCrossEntropyLoss using multi-dataset training with two sub-datasets:

    • image_to_text: given an image query, score text candidates

    • text_to_image: given a text query, score image candidates

    Each sample is expanded with negatives at a 1:4 positive-to-negative ratio. Evaluation uses CrossEncoderRerankingEvaluator on both directions.
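The 1:4 expansion can be pictured with the sketch below. It assumes negatives are drawn uniformly at random from the other candidates in the same split; the training script may select negatives differently. The function name, column names, and sample data are hypothetical.

```python
import random


def expand_with_negatives(pairs, num_negatives=4, seed=0):
    """Turn (query, positive) pairs into labeled rows with random negatives.

    Each positive row (label 1.0) is followed by up to `num_negatives`
    rows (label 0.0) that pair the same query with other candidates.
    """
    rng = random.Random(seed)
    candidates = [candidate for _, candidate in pairs]
    rows = []
    for query, positive in pairs:
        rows.append({"query": query, "candidate": positive, "label": 1.0})
        pool = [c for c in candidates if c != positive]
        for negative in rng.sample(pool, k=min(num_negatives, len(pool))):
            rows.append({"query": query, "candidate": negative, "label": 0.0})
    return rows


# Hypothetical image-to-text samples: image path as query, caption as candidate.
image_to_text = [
    ("doodles/cat.png", "a cat"),
    ("doodles/dog.png", "a dog"),
    ("doodles/sun.png", "a sun"),
    ("doodles/tree.png", "a tree"),
    ("doodles/boat.png", "a boat"),
    ("doodles/house.png", "a house"),
]

rows = expand_with_negatives(image_to_text)
num_positive = sum(1 for row in rows if row["label"] == 1.0)
num_negative = sum(1 for row in rows if row["label"] == 0.0)
print(num_positive, num_negative)
```

Swapping the two elements of each input pair gives the text_to_image direction with the same helper.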

Transformer (Feature Extraction) + Pooling + Dense

  • training_doodles_feature_extraction.py:

    This example builds a multimodal CrossEncoder from Qwen/Qwen3.5-0.8B using the module chain Transformer(transformer_task="feature-extraction") + Pooling (lasttoken) + Dense.

    The "feature-extraction" task loads only the base model via AutoModel without the LM head, making this approach more memory-efficient. The Pooling layer extracts the last token’s hidden state, and the Dense layer projects it to a single score.

    To approximate the LogitScore behavior at initialization, the Dense layer’s weight is initialized as embed("1") - embed("0") using the model’s input embeddings. Because most causal LMs tie input embeddings with the LM head weights, this gives a starting point equivalent to computing log-odds over the "1" and "0" tokens.
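A toy numeric check makes the equivalence concrete. The sketch assumes tied weights, an LM head without bias, and that the pooled last-token hidden state is the same vector the LM head would receive. The dimensions and token ids are hypothetical, and plain Python lists stand in for tensors.

```python
import random

rng = random.Random(0)
HIDDEN, VOCAB = 8, 5
ID_ZERO, ID_ONE = 3, 4  # hypothetical token ids for "0" and "1"

# Toy embedding matrix, one row per token. With tied weights, the LM head
# reuses these rows as its output projection.
embeddings = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]

# Stand-in for the last token's hidden state produced by the base model.
hidden = [rng.gauss(0, 1) for _ in range(HIDDEN)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# LogitScore route: full-vocabulary logits, then logit("1") - logit("0").
lm_logits = [dot(row, hidden) for row in embeddings]
logit_score = lm_logits[ID_ONE] - lm_logits[ID_ZERO]

# Dense route: a single weight vector initialized as embed("1") - embed("0").
dense_weight = [a - b for a, b in zip(embeddings[ID_ONE], embeddings[ID_ZERO])]
dense_score = dot(dense_weight, hidden)

# Both routes agree up to floating-point rounding.
assert abs(logit_score - dense_score) < 1e-9
```

The Dense route needs only one dot product per pair instead of a projection over the whole vocabulary. After initialization the Dense weight is trained freely, so the two routes can diverge during training.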

    The dataset, loss, and evaluation setup are identical to the LogitScore variant above.

Comparing the Two Approaches

                | Any-to-Any + LogitScore          | Feature Extraction + Pooling + Dense
LM head         | Loaded (full vocabulary logits)  | Not loaded (hidden states only)
Memory usage    | Higher                           | Lower
Score mechanism | Log-odds from generative output  | Learned Dense projection
Initialization  | Uses pretrained LM head directly | Approximates LM head via embedding init

For large models where GPU memory is a concern, the feature extraction approach may be preferred. Both approaches produce comparable results.

Other Module Chains

These two approaches are not the only options. CrossEncoder supports several module chain patterns depending on the task:

  • Transformer (Sequence Classification): The traditional encoder-based approach (e.g. BERT, RoBERTa). A single Transformer module loads a model via AutoModelForSequenceClassification with a pretrained classification head, which produces scores without any subsequent modules. This is the default for text-only reranking.

  • Transformer (Text Generation) + LogitScore: Like the Any-to-Any variant above, but for text-only CausalLM rerankers loaded with AutoModelForCausalLM. Uses transformer_task="text-generation" instead of "any-to-any".

See Creating Custom CrossEncoder Models for details on the modular architecture.
