Multimodal Training
CrossEncoder models can be trained on multimodal data, enabling cross-modal reranking where the model scores pairs involving different modalities. Each element in a pair can be:
- Text: strings.
- Image: PIL images, file paths, URLs, or numpy/torch arrays.
- Audio: file paths, numpy/torch arrays, dicts with `"array"` and `"sampling_rate"` keys, or `torchcodec.AudioDecoder` instances.
- Video: file paths, numpy/torch arrays, dicts with `"array"` and `"video_metadata"` keys, or `torchcodec.VideoDecoder` instances.
- Multimodal dicts: a dict mapping modality names to values, e.g. `{"text": ..., "image": ...}`. The keys must be `"text"`, `"image"`, `"audio"`, or `"video"`.
Two architectural approaches are demonstrated here, both training on the doodles-captions-manual dataset with `BinaryCrossEntropyLoss` and multi-dataset training (image-to-text and text-to-image directions).
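The two training directions can be sketched in plain Python. The data below is hypothetical stand-in data (the real scripts load doodles-captions-manual); the sketch only shows how each positive pair is expanded with negatives, here at the 1:4 ratio used later in both examples:

```python
# Illustrative sketch with stand-in data: build the two sub-datasets
# (image_to_text and text_to_image), expanding each positive pair with
# in-batch negatives at a 1:4 positive-to-negative ratio.
import random

random.seed(0)
captions = ["a cat", "a dog", "a house", "a tree", "a car"]
images = [f"doodle_{i}.png" for i in range(5)]  # stand-in image file paths

def expand_with_negatives(query, positive, candidates, num_negatives=4):
    negatives = random.sample([c for c in candidates if c != positive], num_negatives)
    rows = [{"query": query, "candidate": positive, "label": 1}]
    rows += [{"query": query, "candidate": n, "label": 0} for n in negatives]
    return rows

# image_to_text: image query, text candidates
image_to_text = [row for img, cap in zip(images, captions)
                 for row in expand_with_negatives(img, cap, captions)]
# text_to_image: text query, image candidates
text_to_image = [row for img, cap in zip(images, captions)
                 for row in expand_with_negatives(cap, img, images)]
```

With binary labels like these, each sub-dataset can be scored with a binary cross-entropy objective over the model's relevance scores.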
Transformer (Any-to-Any) + LogitScore
`training_doodles_any_to_any.py`:
This example builds a multimodal CrossEncoder from `Qwen/Qwen3.5-0.8B` using the module chain `Transformer(transformer_task="any-to-any")` + `LogitScore`.

The `"any-to-any"` task loads the full causal LM via `AutoModelForMultimodalLM` with its language model head, and `add_generation_prompt=True` appends the assistant turn start token so the model generates from the right position. `LogitScore` then takes the next-token logits and computes a relevance score as the log-odds of generating `"1"` (match) vs `"0"` (no match).

The model is trained with `BinaryCrossEntropyLoss` using multi-dataset training with two sub-datasets:

- `image_to_text`: given an image query, score text candidates
- `text_to_image`: given a text query, score image candidates

Each sample is expanded with negatives at a 1:4 positive-to-negative ratio. Evaluation uses `CrossEncoderRerankingEvaluator` on both directions.
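The log-odds scoring step can be sketched with plain torch tensors. The token ids below are placeholders; the real module resolves the tokenizer's ids for `"0"` and `"1"`:

```python
import torch

torch.manual_seed(0)
vocab_size = 32
# One row of next-token logits per (query, candidate) pair.
next_token_logits = torch.randn(2, vocab_size)

# Hypothetical token ids for "0" and "1".
zero_id, one_id = 15, 16

# Relevance score as the log-odds of generating "1" over "0". Restricting
# attention to just these two tokens, log p("1") - log p("0") under a softmax
# reduces to the raw logit difference.
scores = next_token_logits[:, one_id] - next_token_logits[:, zero_id]
```

Because only the difference matters, no softmax over the full vocabulary is needed at inference time.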
Transformer (Feature Extraction) + Pooling + Dense
`training_doodles_feature_extraction.py`:
This example builds a multimodal CrossEncoder from `Qwen/Qwen3.5-0.8B` using the module chain `Transformer(transformer_task="feature-extraction")` + `Pooling` (last token) + `Dense`.

The `"feature-extraction"` task loads only the base model via `AutoModel` without the LM head, making this approach more memory-efficient. The `Pooling` layer extracts the last token's hidden state, and the `Dense` layer projects it to a single score.

To approximate the `LogitScore` behavior at initialization, the Dense layer's weight is initialized as `embed("1") - embed("0")` using the model's input embeddings. Because most causal LMs tie input embeddings with the LM head weights, this gives a starting point equivalent to computing log-odds over the `"1"` and `"0"` tokens.

The dataset, loss, and evaluation setup are identical to the LogitScore variant above.
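The equivalence behind this initialization can be verified with a toy tied-weight model. Dimensions and token ids are placeholders:

```python
import torch

torch.manual_seed(0)
hidden_dim, vocab_size = 16, 32

# Tied weights: the input embedding matrix doubles as the LM head weight.
embeddings = torch.randn(vocab_size, hidden_dim)
lm_head = torch.nn.Linear(hidden_dim, vocab_size, bias=False)
lm_head.weight.data = embeddings

# Hypothetical token ids for "0" and "1".
zero_id, one_id = 15, 16

# Dense layer initialized as embed("1") - embed("0").
dense = torch.nn.Linear(hidden_dim, 1, bias=False)
dense.weight.data = (embeddings[one_id] - embeddings[zero_id]).unsqueeze(0)

# For any pooled last-token hidden state, the Dense projection equals the
# log-odds the LM head would assign to "1" vs "0".
last_hidden = torch.randn(3, hidden_dim)
logits = lm_head(last_hidden)
log_odds = logits[:, one_id] - logits[:, zero_id]
dense_scores = dense(last_hidden).squeeze(-1)
```

The equality holds exactly at initialization; once training starts, the Dense weight is free to move away from this starting point.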
Comparing the Two Approaches
| | Any-to-Any + LogitScore | Feature Extraction + Pooling + Dense |
|---|---|---|
| LM head | Loaded (full vocabulary logits) | Not loaded (hidden states only) |
| Memory usage | Higher | Lower |
| Score mechanism | Log-odds from generative output | Learned Dense projection |
| Initialization | Uses pretrained LM head directly | Approximates LM head via embedding init |
For large models where GPU memory is a concern, the feature extraction approach may be preferred. Both approaches produce comparable results.
Other Module Chains
These two approaches are not the only options. CrossEncoder supports several module chain patterns depending on the task:
- Transformer (Sequence Classification): The traditional encoder-based approach (e.g. BERT, RoBERTa). A single `Transformer` module loads a model via `AutoModelForSequenceClassification` with a pretrained classification head, which produces scores without any subsequent modules. This is the default for text-only reranking.
- Transformer (Text Generation) + LogitScore: Like the Any-to-Any variant above, but for text-only CausalLM rerankers loaded with `AutoModelForCausalLM`. Uses `transformer_task="text-generation"` instead of `"any-to-any"`.
See Creating Custom CrossEncoder Models for details on the modular architecture.
References
- `Transformer` `transformer_task` parameter
- `LogitScore` API reference