Losses
sentence_transformers.cross_encoder.losses defines different loss functions that can be used to fine-tune cross-encoder models on training data. The choice of loss function plays a critical role when fine-tuning the model. It determines how well our model will work for the specific downstream task.
Sadly, there is no “one size fits all” loss function. Which loss function is suitable depends on the available training data and on the target task. Consider checking out the Loss Overview to help narrow down your choice of loss function(s).
BinaryCrossEntropyLoss
- class sentence_transformers.cross_encoder.losses.BinaryCrossEntropyLoss(model: CrossEncoder, activation_fn: Module = Identity(), pos_weight: Tensor | None = None, **kwargs)[source]
Computes the Binary Cross Entropy Loss for a CrossEncoder model. This loss is used to train a model to predict a high logit for positive pairs and a low logit for negative pairs. The model should be initialized with num_labels = 1 (a.k.a. the default) to predict one class. It has been used to train many of the strong CrossEncoder MS MARCO Reranker models.
- Parameters:
model (CrossEncoder) – A CrossEncoder model to be trained.
activation_fn (Module) – Activation function applied to the logits before computing the loss. Defaults to Identity.
pos_weight (Tensor, optional) – A weight of positive examples. Must be a torch.Tensor like torch.tensor(4) for a weight of 4. Defaults to None.
**kwargs – Additional keyword arguments passed to the underlying torch.nn.BCEWithLogitsLoss.
References
Cross Encoder > Training Examples > Semantic Textual Similarity
Cross Encoder > Training Examples > Quora Duplicate Questions
- Requirements:
Your model must be initialized with num_labels = 1 (a.k.a. the default) to predict one class.
- Inputs:
| Texts | Labels | Number of Model Output Labels |
| --- | --- | --- |
| (anchor, positive/negative) pairs | 1 if positive, 0 if negative | 1 |
| (sentence_A, sentence_B) pairs | float similarity score between 0 and 1 | 1 |
- Recommendations:
Use mine_hard_negatives with output_format="labeled-pair" to convert question-answer pairs to the (anchor, positive/negative) pairs format with labels as 1 or 0, using hard negatives (see the sketch after the example below).
Example
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, losses
from datasets import Dataset

model = CrossEncoder("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "query": ["What are pandas?", "What are pandas?"],
    "response": ["Pandas are a kind of bear.", "Pandas are a kind of fish."],
    "label": [1, 0],
})
loss = losses.BinaryCrossEntropyLoss(model)
trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
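To follow the recommendation above, here is a minimal sketch of mining hard negatives into labeled pairs. It assumes a separate SentenceTransformer embedding model is used for the mining step and that mine_hard_negatives accepts a num_negatives keyword; the dataset and model names are only illustrative, so verify against your installed version.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

# Illustrative dataset with "query" and "answer" columns of positive pairs
pairs = load_dataset("sentence-transformers/natural-questions", split="train")
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # used only to mine the negatives

# Expected result: (anchor, positive/negative) pairs plus a 0/1 label column
labeled_pairs = mine_hard_negatives(
    pairs,
    embedding_model,
    num_negatives=5,               # assumed keyword; check the signature of your version
    output_format="labeled-pair",
)
loss = losses.BinaryCrossEntropyLoss(model)  # reuses the CrossEncoder from the example above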
CrossEntropyLoss
- class sentence_transformers.cross_encoder.losses.CrossEntropyLoss(model: CrossEncoder, activation_fn: Module = Identity(), **kwargs)[source]
Computes the Cross Entropy Loss for a CrossEncoder model. This loss is used to train a model to predict the correct class label for a given pair of sentences. The number of classes should be equal to the number of model output labels.
- Parameters:
model (CrossEncoder) – A CrossEncoder model to be trained.
activation_fn (Module) – Activation function applied to the logits before computing the loss. Defaults to Identity.
**kwargs – Additional keyword arguments passed to the underlying torch.nn.CrossEntropyLoss.
References
- Requirements:
Your model can be initialized with num_labels > 1 to predict multiple classes.
The number of dataset classes should be equal to the number of model output labels (model.num_labels).
- Inputs:
| Texts | Labels | Number of Model Output Labels |
| --- | --- | --- |
| (sentence_A, sentence_B) pairs | class | num_classes |
Example
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, losses
from datasets import Dataset

model = CrossEncoder("microsoft/mpnet-base", num_labels=2)
train_dataset = Dataset.from_dict({
    "sentence1": ["How can I be a good geologist?", "What is the capital of France?"],
    "sentence2": ["What should I do to be a great geologist?", "What is the capital of Germany?"],
    "label": [1, 0],  # 1: duplicate, 0: not duplicate
})
loss = losses.CrossEntropyLoss(model)
trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
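After training, a model initialized with num_labels=2 returns one value per class from predict(). The following is a minimal inference sketch assuming the setup from the example above; the (n_pairs, num_labels) output shape is the expected behaviour, but verify it on your version.

import numpy as np

# predict() on (sentence_A, sentence_B) pairs returns an array of shape (n_pairs, num_labels)
logits = model.predict([
    ("How can I be a good geologist?", "What should I do to be a great geologist?"),
    ("What is the capital of France?", "What is the capital of Germany?"),
])
predicted_classes = np.argmax(logits, axis=1)  # e.g. 1: duplicate, 0: not duplicate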
LambdaLoss
- class sentence_transformers.cross_encoder.losses.LambdaLoss(model: CrossEncoder, weighting_scheme: BaseWeightingScheme | None = NDCGLoss2PPScheme(), k: int | None = None, sigma: float = 1.0, eps: float = 1e-10, reduction_log: Literal['natural', 'binary'] = 'binary', activation_fn: Module | None = Identity(), mini_batch_size: int | None = None)[source]
The LambdaLoss Framework for Ranking Metric Optimization. This loss function implements the LambdaLoss framework for ranking metric optimization, which provides various weighting schemes including LambdaRank and NDCG variations. The implementation is optimized to handle padded documents efficiently by only processing valid documents during model inference.
Note
The number of documents per query can vary between samples with the LambdaLoss.
- Parameters:
model (CrossEncoder) – CrossEncoder model to be trained
weighting_scheme (BaseWeightingScheme, optional) – Weighting scheme to use for the loss:
NoWeightingScheme: No weighting scheme (weights = 1.0)
NDCGLoss1Scheme: NDCG Loss1 weighting scheme
NDCGLoss2Scheme: NDCG Loss2 weighting scheme
LambdaRankScheme: LambdaRank weighting scheme
NDCGLoss2PPScheme: NDCG Loss2++ weighting scheme
Defaults to NDCGLoss2PPScheme. In the original LambdaLoss paper, the NDCGLoss2PPScheme was shown to reach the strongest performance, with the NDCGLoss2Scheme following closely. A sketch of selecting a different scheme follows the weighting scheme classes below.
k (int, optional) – Number of documents to consider for NDCG@K. Defaults to None (use all documents).
sigma (float) – Score difference weight used in sigmoid
eps (float) – Small constant for numerical stability
reduction_log (str) – Type of logarithm to use - “natural”: Natural logarithm (log) - “binary”: Binary logarithm (log2)
activation_fn (Module) – Activation function applied to the logits before computing the loss. Defaults to Identity.
mini_batch_size (int, optional) – Number of samples to process in each forward pass. This has a significant impact on the memory consumption and speed of the training process. Three cases are possible:
If mini_batch_size is None, the mini_batch_size is set to the batch size.
If mini_batch_size is greater than 0, the batch is split into mini-batches of size mini_batch_size.
If mini_batch_size is <= 0, the entire batch is processed at once.
Defaults to None.
References
The LambdaLoss Framework for Ranking Metric Optimization: https://marc.najork.org/papers/cikm2018.pdf
Context-Aware Learning to Rank with Self-Attention: https://arxiv.org/abs/2005.10084
- Requirements:
Query with multiple documents (listwise approach)
Documents must have relevance scores/labels. Both binary and continuous labels are supported.
- Inputs:
| Texts | Labels | Number of Model Output Labels |
| --- | --- | --- |
| (query, [doc1, doc2, …, docN]) | [score1, score2, …, scoreN] | 1 |
- Recommendations:
Use mine_hard_negatives with output_format="labeled-list" to convert question-answer pairs to the required input format with hard negatives.
- Relations:
LambdaLoss anecdotally performs better than the other losses with the same input format.
Example
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, losses
from datasets import Dataset

model = CrossEncoder("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "query": ["What are pandas?", "What is the capital of France?"],
    "docs": [
        ["Pandas are a kind of bear.", "Pandas are kind of like fish."],
        ["The capital of France is Paris.", "Paris is the capital of France.", "Paris is quite large."],
    ],
    "labels": [[1, 0], [1, 1, 0]],
})
loss = losses.LambdaLoss(model)
trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
- class sentence_transformers.cross_encoder.losses.LambdaLoss.BaseWeightingScheme(*args, **kwargs)[source]
Base class for implementing weighting schemes in LambdaLoss.
- class sentence_transformers.cross_encoder.losses.NoWeightingScheme(*args, **kwargs)[source]
Implementation of no weighting scheme (weights = 1.0).
- class sentence_transformers.cross_encoder.losses.NDCGLoss1Scheme(*args, **kwargs)[source]
Implementation of NDCG Loss1 weighting scheme.
It is used to optimize for the NDCG metric, but this weighting scheme is not recommended as the NDCGLoss2Scheme and NDCGLoss2PPScheme were shown to reach superior performance in the original LambdaLoss paper.
- class sentence_transformers.cross_encoder.losses.NDCGLoss2Scheme(*args, **kwargs)[source]
Implementation of NDCG Loss2 weighting scheme.
This scheme uses a tighter bound than NDCGLoss1Scheme and was shown to reach superior performance in the original LambdaLoss paper. It is used to optimize for the NDCG metric.
- class sentence_transformers.cross_encoder.losses.LambdaRankScheme(*args, **kwargs)[source]
Implementation of LambdaRank weighting scheme.
This weighting optimizes a coarse upper bound of NDCG.
- class sentence_transformers.cross_encoder.losses.NDCGLoss2PPScheme(mu: float = 10.0)[source]
Implementation of NDCG Loss2++ weighting scheme.
It is a hybrid weighting scheme that combines the NDCGLoss2 and LambdaRank schemes. It was shown to reach the strongest performance in the original LambdaLoss paper.
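If you want to deviate from the default NDCGLoss2PPScheme, here is a minimal sketch of selecting one of the weighting schemes documented above, assuming the model and losses imports from the LambdaLoss example; the value k=10 is only an illustrative choice for truncating NDCG.

# NDCG Loss2 weighting, optimizing NDCG@10 (k is illustrative)
loss = losses.LambdaLoss(
    model,
    weighting_scheme=losses.NDCGLoss2Scheme(),
    k=10,
)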
ListMLELoss
- class sentence_transformers.cross_encoder.losses.ListMLELoss(model: CrossEncoder, activation_fn: Module | None = Identity(), mini_batch_size: int | None = None, respect_input_order: bool = True)[source]
This loss function implements the ListMLE learning to rank algorithm, which uses a list-wise approach based on maximum likelihood estimation of permutations. It maximizes the likelihood of the permutation induced by the ground truth labels.
Note
The number of documents per query can vary between samples with the ListMLELoss.
- Parameters:
model (CrossEncoder) – CrossEncoder model to be trained
activation_fn (Module) – Activation function applied to the logits before computing the loss. Defaults to Identity.
mini_batch_size (int, optional) – Number of samples to process in each forward pass. This has a significant impact on the memory consumption and speed of the training process. Three cases are possible:
If mini_batch_size is None, the mini_batch_size is set to the batch size.
If mini_batch_size is greater than 0, the batch is split into mini-batches of size mini_batch_size.
If mini_batch_size is <= 0, the entire batch is processed at once.
Defaults to None.
respect_input_order (bool) – Whether to respect the original input order of documents. If True, assumes the input documents are already ordered by relevance (most relevant first). If False, sorts documents by label values. Defaults to True. A sketch of the False case follows the example below.
References
Listwise approach to learning to rank: theory and algorithm: https://dl.acm.org/doi/abs/10.1145/1390156.1390306
- Requirements:
Query with multiple documents (listwise approach)
Documents must have relevance scores/labels. Both binary and continuous labels are supported.
Documents must be sorted in a defined rank order.
- Inputs:
| Texts | Labels | Number of Model Output Labels |
| --- | --- | --- |
| (query, [doc1, doc2, …, docN]) | [score1, score2, …, scoreN] | 1 |
- Recommendations:
Use mine_hard_negatives with output_format="labeled-list" to convert question-answer pairs to the required input format with hard negatives.
- Relations:
The PListMLELoss is an extension of the ListMLELoss and allows for positional weighting of the loss. PListMLELoss generally outperforms ListMLELoss and is recommended over it.
LambdaLoss takes the same inputs, and generally outperforms this loss.
Example
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, losses
from datasets import Dataset

model = CrossEncoder("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "query": ["What are pandas?", "What is the capital of France?"],
    "docs": [
        ["Pandas are a kind of bear.", "Pandas are kind of like fish."],
        ["The capital of France is Paris.", "Paris is the capital of France.", "Paris is quite large."],
    ],
    "labels": [[1, 0], [1, 1, 0]],
})

# Standard ListMLE loss respecting input order
loss = losses.ListMLELoss(model)
trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
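If your documents are not already sorted by relevance, here is a minimal sketch of the respect_input_order=False case mentioned in the parameters above, assuming the model from the example; the loss then sorts documents by their label values itself.

# Documents are passed in arbitrary order; the loss sorts them by label values internally
loss = losses.ListMLELoss(model, respect_input_order=False)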
PListMLELoss
- class sentence_transformers.cross_encoder.losses.PListMLELoss(model: CrossEncoder, lambda_weight: PListMLELambdaWeight | None = PListMLELambdaWeight(), activation_fn: Module | None = Identity(), mini_batch_size: int | None = None, respect_input_order: bool = True)[source]
PListMLE loss for learning to rank with position-aware weighting. This loss function implements the ListMLE ranking algorithm which uses a list-wise approach based on maximum likelihood estimation of permutations. It maximizes the likelihood of the permutation induced by the ground truth labels with position-aware weighting.
This loss is also known as Position-Aware ListMLE or p-ListMLE.
Note
The number of documents per query can vary between samples with the PListMLELoss.
- Parameters:
model (CrossEncoder) – CrossEncoder model to be trained
lambda_weight (PListMLELambdaWeight, optional) – Weighting scheme to use. When specified, implements Position-Aware ListMLE which applies different weights to different rank positions. Defaults to PListMLELambdaWeight().
activation_fn (Module) – Activation function applied to the logits before computing the loss. Defaults to Identity.
mini_batch_size (int, optional) – Number of samples to process in each forward pass. This has a significant impact on the memory consumption and speed of the training process. Three cases are possible:
If mini_batch_size is None, the mini_batch_size is set to the batch size.
If mini_batch_size is greater than 0, the batch is split into mini-batches of size mini_batch_size.
If mini_batch_size is <= 0, the entire batch is processed at once.
Defaults to None.
respect_input_order (bool) – Whether to respect the original input order of documents. If True, assumes the input documents are already ordered by relevance (most relevant first). If False, sorts documents by label values. Defaults to True.
References
Position-Aware ListMLE: A Sequential Learning Process for Ranking: https://auai.org/uai2014/proceedings/individuals/164.pdf
- Requirements:
Query with multiple documents (listwise approach)
Documents must have relevance scores/labels. Both binary and continuous labels are supported.
Documents must be sorted in a defined rank order.
- Inputs:
| Texts | Labels | Number of Model Output Labels |
| --- | --- | --- |
| (query, [doc1, doc2, …, docN]) | [score1, score2, …, scoreN] | 1 |
- Recommendations:
Use mine_hard_negatives with output_format="labeled-list" to convert question-answer pairs to the required input format with hard negatives.
- Relations:
The PListMLELoss is an extension of the ListMLELoss and allows for positional weighting of the loss. PListMLELoss generally outperforms ListMLELoss and is recommended over it.
LambdaLoss takes the same inputs, and generally outperforms this loss.
Example
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, losses
from datasets import Dataset
import torch

model = CrossEncoder("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "query": ["What are pandas?", "What is the capital of France?"],
    "docs": [
        ["Pandas are a kind of bear.", "Pandas are kind of like fish."],
        ["The capital of France is Paris.", "Paris is the capital of France.", "Paris is quite large."],
    ],
    "labels": [[1, 0], [1, 1, 0]],
})

# Either: Position-Aware ListMLE with default weighting
lambda_weight = losses.PListMLELambdaWeight()
loss = losses.PListMLELoss(model, lambda_weight=lambda_weight)

# or: Position-Aware ListMLE with custom weighting function
def custom_discount(ranks):  # e.g. ranks: [1, 2, 3, 4, 5]
    return 1.0 / torch.log1p(ranks)

lambda_weight = losses.PListMLELambdaWeight(rank_discount_fn=custom_discount)
loss = losses.PListMLELoss(model, lambda_weight=lambda_weight)

trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
- class sentence_transformers.cross_encoder.losses.PListMLELambdaWeight(rank_discount_fn=None)[source]
Base class for implementing weighting schemes in Position-Aware ListMLE Loss.
Initialize a lambda weight for PListMLE loss.
- Parameters:
rank_discount_fn – Function that computes a discount for each rank position. If None, uses default discount of 2^(num_docs - rank) - 1.
ListNetLoss
- class sentence_transformers.cross_encoder.losses.ListNetLoss(model: CrossEncoder, activation_fn: Module | None = Identity(), mini_batch_size: int | None = None)[source]
ListNet loss for learning to rank. This loss function implements the ListNet ranking algorithm which uses a list-wise approach to learn ranking models. It minimizes the cross entropy between the predicted ranking distribution and the ground truth ranking distribution. The implementation is optimized to handle padded documents efficiently by only processing valid documents during model inference.
Note
The number of documents per query can vary between samples with the ListNetLoss.
- Parameters:
model (CrossEncoder) – CrossEncoder model to be trained
activation_fn (Module) – Activation function applied to the logits before computing the loss. Defaults to Identity.
mini_batch_size (int, optional) – Number of samples to process in each forward pass. This has a significant impact on the memory consumption and speed of the training process. Three cases are possible:
If mini_batch_size is None, the mini_batch_size is set to the batch size.
If mini_batch_size is greater than 0, the batch is split into mini-batches of size mini_batch_size.
If mini_batch_size is <= 0, the entire batch is processed at once.
Defaults to None.
References
Learning to Rank: From Pairwise Approach to Listwise Approach: https://www.microsoft.com/en-us/research/publication/learning-to-rank-from-pairwise-approach-to-listwise-approach/
Context-Aware Learning to Rank with Self-Attention: https://arxiv.org/abs/2005.10084
- Requirements:
Query with multiple documents (listwise approach)
Documents must have relevance scores/labels. Both binary and continuous labels are supported.
- Inputs:
| Texts | Labels | Number of Model Output Labels |
| --- | --- | --- |
| (query, [doc1, doc2, …, docN]) | [score1, score2, …, scoreN] | 1 |
- Recommendations:
Use mine_hard_negatives with output_format="labeled-list" to convert question-answer pairs to the required input format with hard negatives.
- Relations:
LambdaLoss takes the same inputs, and generally outperforms this loss.
Example
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, losses
from datasets import Dataset

model = CrossEncoder("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "query": ["What are pandas?", "What is the capital of France?"],
    "docs": [
        ["Pandas are a kind of bear.", "Pandas are kind of like fish."],
        ["The capital of France is Paris.", "Paris is the capital of France.", "Paris is quite large."],
    ],
    "labels": [[1, 0], [1, 1, 0]],
})
loss = losses.ListNetLoss(model)
trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
MultipleNegativesRankingLoss
- class sentence_transformers.cross_encoder.losses.MultipleNegativesRankingLoss(model: CrossEncoder, num_negatives: int | None = 4, scale: int = 10.0, activation_fn: Module | None = Sigmoid())[source]
Given a list of (anchor, positive) pairs or (anchor, positive, negative) triplets, this loss optimizes the following:
Given an anchor (e.g. a question), assign the highest similarity to the corresponding positive (i.e. answer) out of every single positive and negative (e.g. all answers) in the batch.
If you provide the optional negatives, they will all be used as extra options from which the model must pick the correct positive. Within reason, the harder this “picking” is, the stronger the model will become. Because of this, a higher batch size results in more in-batch negatives, which then increases performance (to a point).
This loss function works great for training rerankers in retrieval setups where you have positive pairs (e.g. (query, answer)), as it samples a number of in-batch negative documents (num_negatives, 4 by default) for each anchor.
This loss is also known as InfoNCE loss, SimCSE loss, Cross-Entropy Loss with in-batch negatives, or simply in-batch negatives loss.
- Parameters:
model (CrossEncoder) – A CrossEncoder model to be trained.
num_negatives (int, optional) – Number of in-batch negatives to sample for each anchor. Defaults to 4.
scale (int, optional) – Output of similarity function is multiplied by scale value. Defaults to 10.0.
activation_fn (Module) – Activation function applied to the logits before computing the loss. Defaults to Sigmoid.
Note
The current default values are subject to change in the future. Experimentation is encouraged.
References
Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf
- Requirements:
Your model must be initialized with num_labels = 1 (a.k.a. the default) to predict one class.
- Inputs:
| Texts | Labels | Number of Model Output Labels |
| --- | --- | --- |
| (anchor, positive) pairs | none | 1 |
| (anchor, positive, negative) triplets | none | 1 |
| (anchor, positive, negative_1, …, negative_n) | none | 1 |
- Recommendations:
Use BatchSamplers.NO_DUPLICATES (docs) to ensure that no in-batch negatives are duplicates of the anchor or positive samples (see the sketch after the example below).
Use mine_hard_negatives with output_format="n-tuple" or output_format="triplet" to convert question-answer pairs to triplets with hard negatives.
- Relations:
CachedMultipleNegativesRankingLoss is equivalent to this loss, but it uses caching that allows for much higher batch sizes (and thus better performance) without extra memory usage. However, it is slightly slower.
Example
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, losses
from datasets import Dataset

model = CrossEncoder("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "query": ["What are pandas?", "What is the capital of France?"],
    "answer": ["Pandas are a kind of bear.", "The capital of France is Paris."],
})
loss = losses.MultipleNegativesRankingLoss(model)
trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
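To apply the recommended NO_DUPLICATES batch sampler, here is a minimal sketch that wires it in via the training arguments, reusing the model, dataset, and loss from the example above. It assumes CrossEncoderTrainingArguments exposes batch_sampler like the SentenceTransformer training arguments; the output_dir and batch size are illustrative.

from sentence_transformers.cross_encoder import CrossEncoderTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = CrossEncoderTrainingArguments(
    output_dir="models/mpnet-base-mnrl",        # illustrative output path
    per_device_train_batch_size=64,             # more in-batch negative candidates per anchor
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # avoid duplicate anchors/positives in a batch
)
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()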
CachedMultipleNegativesRankingLoss
- class sentence_transformers.cross_encoder.losses.CachedMultipleNegativesRankingLoss(model: CrossEncoder, num_negatives: int | None = 4, scale: float = 10.0, activation_fn: Module | None = Sigmoid(), mini_batch_size: int = 32, show_progress_bar: bool = False)[source]
Boosted version of MultipleNegativesRankingLoss that caches the gradients of the logits w.r.t. the loss. This allows for much higher batch sizes without extra memory usage. However, it is slightly slower.
In detail:
It first does a quick prediction step without gradients/computation graphs to get all the logits;
It then calculates the loss, runs backward up to the logits, and caches the gradients with respect to the logits;
Finally, it runs a second prediction step with gradients/computation graphs and connects the cached gradients into the backward chain.
Notes: All steps are done with mini-batches. In the original implementation of GradCache, (2) is not done in mini-batches and requires a lot of memory when the batch size is large. The gradient caching sacrifices around 20% computation time according to the paper.
Given a list of (anchor, positive) pairs or (anchor, positive, negative) triplets, this loss optimizes the following:
Given an anchor (e.g. a question), assign the highest similarity to the corresponding positive (i.e. answer) out of every single positive and negative (e.g. all answers) in the batch.
If you provide the optional negatives, they will all be used as extra options from which the model must pick the correct positive. Within reason, the harder this “picking” is, the stronger the model will become. Because of this, a higher batch size results in more in-batch negatives, which then increases performance (to a point).
This loss function works great for training rerankers in retrieval setups where you have positive pairs (e.g. (query, answer)), as it samples a number of in-batch negative documents (num_negatives, 4 by default) for each anchor.
This loss is also known as InfoNCE loss with GradCache.
- Parameters:
model (CrossEncoder) – A CrossEncoder model to be trained.
num_negatives (int, optional) – Number of in-batch negatives to sample for each anchor. Defaults to 4.
scale (int, optional) – Output of similarity function is multiplied by scale value. Defaults to 10.0.
activation_fn (Module) – Activation function applied to the logits before computing the loss. Defaults to Sigmoid.
mini_batch_size (int, optional) – Mini-batch size for the forward pass. This informs the memory usage. Defaults to 32.
show_progress_bar (bool, optional) – Whether to show a progress bar during the forward pass. Defaults to False.
Note
The current default values are subject to change in the future. Experimentation is encouraged.
References
Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4: https://arxiv.org/pdf/1705.00652.pdf
Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup: https://arxiv.org/pdf/2101.06983.pdf
- Requirements:
Your model must be initialized with num_labels = 1 (a.k.a. the default) to predict one class.
Should be used with a large per_device_train_batch_size and a low mini_batch_size for superior performance, but with slower training than MultipleNegativesRankingLoss (see the sketch after the example below).
- Inputs:
| Texts | Labels | Number of Model Output Labels |
| --- | --- | --- |
| (anchor, positive) pairs | none | 1 |
| (anchor, positive, negative) triplets | none | 1 |
| (anchor, positive, negative_1, …, negative_n) | none | 1 |
- Recommendations:
Use BatchSamplers.NO_DUPLICATES (docs) to ensure that no in-batch negatives are duplicates of the anchor or positive samples.
Use mine_hard_negatives with output_format="n-tuple" or output_format="triplet" to convert question-answer pairs to triplets with hard negatives.
- Relations:
Equivalent to MultipleNegativesRankingLoss, but with caching that allows for much higher batch sizes (and thus better performance) without extra memory usage. This loss also trains slower than MultipleNegativesRankingLoss.
Example
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, losses
from datasets import Dataset

model = CrossEncoder("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "query": ["What are pandas?", "What is the capital of France?"],
    "answer": ["Pandas are a kind of bear.", "The capital of France is Paris."],
})
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)
trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
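Per the requirement above, here is a minimal sketch of combining a large per_device_train_batch_size with a small mini_batch_size, reusing the model and dataset from the example. The concrete numbers and output_dir are illustrative, and CrossEncoderTrainingArguments is assumed to be available as in the other trainer sketches.

from sentence_transformers.cross_encoder import CrossEncoderTrainingArguments

# Large effective batch for many in-batch negatives, small mini-batches to bound memory
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)
args = CrossEncoderTrainingArguments(
    output_dir="models/mpnet-base-cmnrl",  # illustrative output path
    per_device_train_batch_size=256,       # illustrative; as large as memory allows
)
trainer = CrossEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()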
MSELoss
- class sentence_transformers.cross_encoder.losses.MSELoss(model: CrossEncoder, activation_fn: Module = Identity(), **kwargs)[source]
Computes the MSE loss between the computed query-passage score and a target query-passage score. This loss is used to distill a cross-encoder model from a teacher cross-encoder model or gold labels.
- Parameters:
model (CrossEncoder) – A CrossEncoder model to be trained.
activation_fn (Module) – Activation function applied to the logits before computing the loss.
**kwargs – Additional keyword arguments passed to the underlying torch.nn.MSELoss.
Note
Be mindful of the magnitude of both the labels and what the model produces. If the teacher model produces scores bounded to [0, 1] (e.g. via Sigmoid), then you may wish to use a Sigmoid activation function in this loss as well (see the sketch after the example below).
References
Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation: https://arxiv.org/abs/2010.02666
- Requirements:
Your model must be initialized with num_labels = 1 (a.k.a. the default) to predict one class.
Usually uses a finetuned CrossEncoder teacher M in a knowledge distillation setup.
- Inputs:
| Texts | Labels | Number of Model Output Labels |
| --- | --- | --- |
| (sentence_A, sentence_B) pairs | similarity score | 1 |
- Relations:
MarginMSELoss is similar to this loss, but with a margin through a negative pair.
Example
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, losses
from datasets import Dataset

student_model = CrossEncoder("microsoft/mpnet-base")
teacher_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L12-v2")
train_dataset = Dataset.from_dict({
    "query": ["What are pandas?", "What is the capital of France?"],
    "answer": ["Pandas are a kind of bear.", "The capital of France is Paris."],
})

def compute_labels(batch):
    return {
        "label": teacher_model.predict(list(zip(batch["query"], batch["answer"])))
    }

train_dataset = train_dataset.map(compute_labels, batched=True)
loss = losses.MSELoss(student_model)
trainer = CrossEncoderTrainer(
    model=student_model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
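Per the note above, if the teacher's scores are already bounded to [0, 1], here is a minimal sketch of bounding the student's outputs to the same range via the documented activation_fn parameter, reusing the student_model from the example.

import torch

# Squash the student's logits into [0, 1] so they live on the same scale as the teacher scores
loss = losses.MSELoss(student_model, activation_fn=torch.nn.Sigmoid())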
MarginMSELoss
- class sentence_transformers.cross_encoder.losses.MarginMSELoss(model: CrossEncoder, activation_fn: Module = Identity(), **kwargs)[source]
Computes the MSE loss between |sim(Query, Pos) - sim(Query, Neg)| and |gold_sim(Query, Pos) - gold_sim(Query, Neg)|. This loss is often used to distill a cross-encoder model from a teacher cross-encoder model or gold labels.
In contrast to MultipleNegativesRankingLoss, the two passages do not have to be strictly positive and negative; both can be relevant or not relevant for a given query. This can be an advantage of MarginMSELoss over MultipleNegativesRankingLoss.
Note
Be mindful of the magnitude of both the labels and what the model produces. If the teacher model produces scores bounded to [0, 1] (e.g. via Sigmoid), then you may wish to use a Sigmoid activation function in this loss as well.
- Parameters:
model (CrossEncoder) – A CrossEncoder model to be trained.
activation_fn (Module) – Activation function applied to the logits before computing the loss.
**kwargs – Additional keyword arguments passed to the underlying torch.nn.MSELoss.
References
Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation: https://arxiv.org/abs/2010.02666
- Requirements:
Your model must be initialized with num_labels = 1 (a.k.a. the default) to predict one class.
Usually uses a finetuned CrossEncoder teacher M in a knowledge distillation setup.
- Inputs:
| Texts | Labels | Number of Model Output Labels |
| --- | --- | --- |
| (query, passage_one, passage_two) triplets | gold_sim(query, passage_one) - gold_sim(query, passage_two) | 1 |
- Relations:
MSELoss is similar to this loss, but without a margin through the negative pair.
Example
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, losses
from datasets import Dataset

student_model = CrossEncoder("microsoft/mpnet-base")
teacher_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L12-v2")
train_dataset = Dataset.from_dict({
    "query": ["What are pandas?", "What is the capital of France?"],
    "positive": ["Pandas are a kind of bear.", "The capital of France is Paris."],
    "negative": ["Pandas are a kind of fish.", "The capital of France is Berlin."],
})

def compute_labels(batch):
    positive_scores = teacher_model.predict(list(zip(batch["query"], batch["positive"])))
    negative_scores = teacher_model.predict(list(zip(batch["query"], batch["negative"])))
    return {
        "label": positive_scores - negative_scores
    }

train_dataset = train_dataset.map(compute_labels, batched=True)
loss = losses.MarginMSELoss(student_model)
trainer = CrossEncoderTrainer(
    model=student_model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
RankNetLoss
- class sentence_transformers.cross_encoder.losses.RankNetLoss(model: CrossEncoder, k: int | None = None, sigma: float = 1.0, eps: float = 1e-10, reduction_log: Literal['natural', 'binary'] = 'binary', activation_fn: Module | None = Identity(), mini_batch_size: int | None = None)[source]
RankNet loss implementation for learning to rank. This loss function implements the RankNet algorithm, which learns a ranking function by optimizing pairwise document comparisons using a neural network. The implementation is optimized to handle padded documents efficiently by only processing valid documents during model inference.
- Parameters:
model (CrossEncoder) – CrossEncoder model to be trained
sigma (float) – Score difference weight used in sigmoid (default: 1.0)
eps (float) – Small constant for numerical stability (default: 1e-10)
activation_fn (Module) – Activation function applied to the logits before computing the loss. Defaults to Identity.
mini_batch_size (int, optional) – Number of samples to process in each forward pass. This has a significant impact on the memory consumption and speed of the training process. Three cases are possible:
If mini_batch_size is None, the mini_batch_size is set to the batch size.
If mini_batch_size is greater than 0, the batch is split into mini-batches of size mini_batch_size.
If mini_batch_size is <= 0, the entire batch is processed at once.
Defaults to None.
References
Learning to Rank using Gradient Descent: https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf
- Requirements:
Query with multiple documents (pairwise approach)
Documents must have relevance scores/labels. Both binary and continuous labels are supported.
- Inputs:
| Texts | Labels | Number of Model Output Labels |
| --- | --- | --- |
| (query, [doc1, doc2, …, docN]) | [score1, score2, …, scoreN] | 1 |
- Recommendations:
Use mine_hard_negatives with output_format="labeled-list" to convert question-answer pairs to the required input format with hard negatives.
- Relations:
LambdaLoss can be seen as an extension of this loss where each score pair is weighted. Alternatively, this loss can be seen as a special case of LambdaLoss without a weighting scheme.
LambdaLoss with its default NDCGLoss2++ weighting scheme anecdotally performs better than the other losses with the same input format.
Example
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer, losses
from datasets import Dataset

model = CrossEncoder("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "query": ["What are pandas?", "What is the capital of France?"],
    "docs": [
        ["Pandas are a kind of bear.", "Pandas are kind of like fish."],
        ["The capital of France is Paris.", "Paris is the capital of France.", "Paris is quite large."],
    ],
    "labels": [[1, 0], [1, 1, 0]],
})
loss = losses.RankNetLoss(model)
trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()