Losses

sentence_transformers.losses defines different loss functions that can be used to fine-tune the network on training data. The loss function plays a critical role when fine-tuning the model: it determines how well our embedding model will work for the specific downstream task.

Sadly there is no “one size fits all” loss function. Which loss function is suitable depends on the available training data and on the target task.

BatchAllTripletLoss

class sentence_transformers.losses.BatchAllTripletLoss(model: sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function BatchHardTripletLossDistanceFunction.eucledian_distance>, margin: float = 5)

BatchAllTripletLoss takes a batch with (label, sentence) pairs and computes the loss for all possible, valid triplets, i.e., the anchor and the positive must have the same label, while the anchor and the negative must have different labels. The labels must be integers, with the same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class.

Parameters
  • model – SentenceTransformer model

  • distance_metric – Function that returns a distance between two embeddings. The class BatchHardTripletLossDistanceFunction contains pre-defined metrics that can be used

  • margin – Negative samples should be at least this margin further away from the anchor than the positive samples.

Example:

from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.readers import InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_batch_size = 16
train_examples = [InputExample(texts=['Sentence from class 0'], label=0), InputExample(texts=['Another sentence from class 0'], label=0),
    InputExample(texts=['Sentence from class 1'], label=1), InputExample(texts=['Sentence from class 2'], label=2)]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
train_loss = losses.BatchAllTripletLoss(model=model)
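
The dataloader and loss can then be passed to model.fit to start fine-tuning (the epochs and warmup_steps values below are illustrative, not recommendations):

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)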


BatchHardSoftMarginTripletLoss

class sentence_transformers.losses.BatchHardSoftMarginTripletLoss(model: sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function BatchHardTripletLossDistanceFunction.eucledian_distance>)

BatchHardSoftMarginTripletLoss takes a batch with (label, sentence) pairs and computes the loss for all possible, valid triplets, i.e., the anchor and the positive must have the same label, while the anchor and the negative must have different labels. The labels must be integers, with the same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class. Instead of a fixed margin, it uses a soft-margin formulation.

Source: https://github.com/NegatioN/OnlineMiningTripletLoss/blob/master/online_triplet_loss/losses.py Paper: In Defense of the Triplet Loss for Person Re-Identification, https://arxiv.org/abs/1703.07737 Blog post: https://omoindrot.github.io/triplet-loss

Parameters
  • model – SentenceTransformer model

  • distance_metric – Function that returns a distance between two embeddings. The class BatchHardTripletLossDistanceFunction contains pre-defined metrics that can be used

Example:

from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.readers import InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_batch_size = 16
train_examples = [InputExample(texts=['Sentence from class 0'], label=0), InputExample(texts=['Another sentence from class 0'], label=0),
    InputExample(texts=['Sentence from class 1'], label=1), InputExample(texts=['Sentence from class 2'], label=2)]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
train_loss = losses.BatchHardSoftMarginTripletLoss(model=model)


BatchHardTripletLoss

class sentence_transformers.losses.BatchHardTripletLoss(model: sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function BatchHardTripletLossDistanceFunction.eucledian_distance>, margin: float = 5)

BatchHardTripletLoss takes a batch with (label, sentence) pairs and computes the loss for all possible, valid triplets, i.e., the anchor and the positive must have the same label, while the anchor and the negative must have different labels. For each anchor, it then uses the hardest positive and the hardest negative to compute the loss. The labels must be integers, with the same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class.

Source: https://github.com/NegatioN/OnlineMiningTripletLoss/blob/master/online_triplet_loss/losses.py Paper: In Defense of the Triplet Loss for Person Re-Identification, https://arxiv.org/abs/1703.07737 Blog post: https://omoindrot.github.io/triplet-loss

Parameters
  • model – SentenceTransformer model

  • distance_metric – Function that returns a distance between two embeddings. The class BatchHardTripletLossDistanceFunction contains pre-defined metrics that can be used

Example:

from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.readers import InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_batch_size = 16
train_examples = [InputExample(texts=['Sentence from class 0'], label=0), InputExample(texts=['Another sentence from class 0'], label=0),
    InputExample(texts=['Sentence from class 1'], label=1), InputExample(texts=['Sentence from class 2'], label=2)]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
train_loss = losses.BatchHardTripletLoss(model=model)


BatchSemiHardTripletLoss

class sentence_transformers.losses.BatchSemiHardTripletLoss(model: sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function BatchHardTripletLossDistanceFunction.eucledian_distance>, margin: float = 5)

BatchSemiHardTripletLoss takes a batch with (label, sentence) pairs and computes the loss for all possible, valid triplets, i.e., the anchor and the positive must have the same label, while the anchor and the negative must have different labels. It then selects the semi-hard positives and negatives. The labels must be integers, with the same label indicating sentences from the same class. Your train dataset must contain at least 2 examples per label class.

Source: https://github.com/NegatioN/OnlineMiningTripletLoss/blob/master/online_triplet_loss/losses.py Paper: In Defense of the Triplet Loss for Person Re-Identification, https://arxiv.org/abs/1703.07737 Blog post: https://omoindrot.github.io/triplet-loss

Parameters
  • model – SentenceTransformer model

  • distance_metric – Function that returns a distance between two embeddings. The class BatchHardTripletLossDistanceFunction contains pre-defined metrics that can be used

Example:

from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.readers import InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_batch_size = 16
train_examples = [InputExample(texts=['Sentence from class 0'], label=0), InputExample(texts=['Another sentence from class 0'], label=0),
    InputExample(texts=['Sentence from class 1'], label=1), InputExample(texts=['Sentence from class 2'], label=2)]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
train_loss = losses.BatchSemiHardTripletLoss(model=model)


ContrastiveLoss

class sentence_transformers.losses.ContrastiveLoss(model: sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function SiameseDistanceMetric.<lambda>>, margin: float = 0.5, size_average: bool = True)

Contrastive loss. Expects as input two texts and a label of either 0 or 1. If the label == 1, then the distance between the two embeddings is reduced. If the label == 0, then the distance between the embeddings is increased.

Further information: http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf

Parameters
  • model – SentenceTransformer model

  • distance_metric – Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrics that can be used

  • margin – Negative samples (label == 0) should have a distance of at least the margin value.

  • size_average – Average by the size of the mini-batch.

Example:

from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.readers import InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_batch_size = 16
train_examples = [InputExample(texts=['This is a positive pair', 'Where the distance will be minimized'], label=1),
    InputExample(texts=['This is a negative pair', 'Their distance will be increased'], label=0)]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
train_loss = losses.ContrastiveLoss(model=model)


CosineSimilarityLoss

[Figure: SBERT Siamese Network Architecture]

For each sentence pair, we pass sentence A and sentence B through our network, which yields the embeddings u and v. The similarity of these embeddings is computed using cosine similarity, and the result is compared to the gold similarity score.

This allows our network to be fine-tuned to recognize the similarity of sentences.

class sentence_transformers.losses.CosineSimilarityLoss(model: sentence_transformers.SentenceTransformer.SentenceTransformer, loss_fct=MSELoss(), cos_score_transformation=Identity())

CosineSimilarityLoss expects that the InputExamples consist of two texts and a float label.

It computes the vectors u = model(input_text[0]) and v = model(input_text[1]) and measures the cosine-similarity between the two. By default, it minimizes the following loss: ||input_label - cos_score_transformation(cosine_sim(u,v))||_2.

Parameters
  • model – SentenceTransformer model

  • loss_fct – Which pytorch loss function should be used to compare the cosine_similarity(u,v) with the input_label? By default, MSE: ||input_label - cosine_sim(u,v)||_2

  • cos_score_transformation – The cos_score_transformation function is applied on top of cosine_similarity. By default, the identity function is used (i.e. no change).

Example:

from sentence_transformers import SentenceTransformer, SentencesDataset, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_batch_size = 16
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
    InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)


MegaBatchMarginLoss

class sentence_transformers.losses.MegaBatchMarginLoss(model, positive_margin: float = 0.8, negative_margin: float = 0.3, use_mini_batched_version: bool = True, mini_batch_size: int = 50)

Loss function inspired from ParaNMT paper: https://www.aclweb.org/anthology/P18-1042/

Given a large batch (e.g. 500 or more examples) of (anchor_i, positive_i) pairs, find for each pair in the batch the hardest negative, i.e., find j != i such that cos_sim(anchor_i, positive_j) is maximal. Then create from this the triplet (anchor_i, positive_i, positive_j), where positive_j serves as the negative for this triplet.

Then train as with the usual triplet loss.

Parameters
  • model – SentenceTransformerModel

  • positive_margin – Positive margin, cos(anchor, positive) should be > positive_margin

  • negative_margin – Negative margin, cos(anchor, negative) should be < negative_margin

  • use_mini_batched_version – As large batch sizes require a lot of memory, we can use a mini-batched version. The large batch (e.g. 500 examples) is broken down into smaller mini-batches with fewer examples.

  • mini_batch_size – Size of the mini-batches. Should be a divisor of the batch size in your data loader.
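
Example (a minimal sketch with illustrative sentences and batch size; in practice this loss is designed for much larger batches, e.g. several hundred anchor/positive pairs):

from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.readers import InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_batch_size = 256  # MegaBatchMarginLoss works best with large batches
train_examples = [InputExample(texts=['A sentence', 'A paraphrase of that sentence']),
    InputExample(texts=['Another sentence', 'Its paraphrase'])]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
train_loss = losses.MegaBatchMarginLoss(model=model)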

MSELoss

class sentence_transformers.losses.MSELoss(model)

Computes the MSE loss between the computed sentence embedding and a target sentence embedding. This loss is used when extending sentence embeddings to new languages as described in our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation: https://arxiv.org/abs/2004.09813

For an example, see the documentation on extending language models to new languages.
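
A minimal sketch of how MSELoss is typically wired up for knowledge distillation, assuming a teacher model and a tab-separated file of parallel sentences (the model names and the file path below are illustrative; see the multilingual training documentation for the full recipe):

from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset
from torch.utils.data import DataLoader

teacher_model = SentenceTransformer('bert-base-nli-stsb-mean-tokens')  # provides the target embeddings
word_embedding_model = models.Transformer('xlm-roberta-base')  # student model to be trained
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data('parallel-sentences.tsv')  # e.g. one "source_sentence\ttranslated_sentence" pair per line
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)
train_loss = losses.MSELoss(model=student_model)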


MultipleNegativesRankingLoss

MultipleNegativesRankingLoss is a great loss function if you only have positive pairs, for example, only pairs of similar texts like pairs of paraphrases, pairs of duplicate questions, pairs of (query, response), or pairs of (source_language, target_language).

class sentence_transformers.losses.MultipleNegativesRankingLoss(model: sentence_transformers.SentenceTransformer.SentenceTransformer, scale: float = 20.0, similarity_fct=<function pytorch_cos_sim>)

This loss expects as input a batch consisting of sentence pairs (a_1, b_1), (a_2, b_2)…, (a_n, b_n) where we assume that (a_i, b_i) are a positive pair and (a_i, b_j) for i!=j a negative pair.

For each a_i, it uses all other b_j as negative samples, i.e., for a_i, we have 1 positive example (b_i) and n-1 negative examples (b_j). It then minimizes the negative log-likelihood of the softmax-normalized scores.

This loss function works great to train embeddings for retrieval setups where you have positive pairs (e.g. (query, relevant_doc)), as it will use the other n-1 docs in each batch as negatives.

The performance usually increases with increasing batch sizes.

For more information, see: https://arxiv.org/pdf/1705.00652.pdf (Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4)

The error function is equivalent to:

scores = self.similarity_fct(embeddings_a, embeddings_b) * self.scale
labels = torch.tensor(range(len(scores)), dtype=torch.long).to(self.model.device) #Example a[i] should match with b[i]
cross_entropy_loss = nn.CrossEntropyLoss()
return cross_entropy_loss(scores, labels)

Example:

from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.readers import InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_batch_size = 16
train_examples = [InputExample(texts=['Anchor 1', 'Positive 1']),
    InputExample(texts=['Anchor 2', 'Positive 2'])]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
train_loss = losses.MultipleNegativesRankingLoss(model=model)

Parameters
  • model – SentenceTransformer model

  • scale – Output of similarity function is multiplied by scale value

  • similarity_fct – Similarity function between sentence embeddings. By default, cos_sim. Can also be set to dot product (and then set scale to 1)
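
For instance, a sketch of switching to dot-product similarity (the lambda below is a stand-in; any function that maps two batches of embeddings to a score matrix works):

import torch

train_loss = losses.MultipleNegativesRankingLoss(model=model, scale=1,
    similarity_fct=lambda a, b: torch.mm(a, b.transpose(0, 1)))  # dot product instead of cosine similarity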

OnlineContrastiveLoss

class sentence_transformers.losses.OnlineContrastiveLoss(model: sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function SiameseDistanceMetric.<lambda>>, margin: float = 0.5)

Online contrastive loss. Similar to ContrastiveLoss, but it selects hard positive pairs (positives that are far apart) and hard negative pairs (negatives that are close) and computes the loss only for these pairs. This often yields better performance than ContrastiveLoss.

Parameters
  • model – SentenceTransformer model

  • distance_metric – Function that returns a distance between two embeddings. The class SiameseDistanceMetric contains pre-defined metrics that can be used

  • margin – Negative samples (label == 0) should have a distance of at least the margin value.


Example:

from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.readers import InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_batch_size = 16
train_examples = [InputExample(texts=['This is a positive pair', 'Where the distance will be minimized'], label=1),
    InputExample(texts=['This is a negative pair', 'Their distance will be increased'], label=0)]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
train_loss = losses.OnlineContrastiveLoss(model=model)


SoftmaxLoss

class sentence_transformers.losses.SoftmaxLoss(model: sentence_transformers.SentenceTransformer.SentenceTransformer, sentence_embedding_dimension: int, num_labels: int, concatenation_sent_rep: bool = True, concatenation_sent_difference: bool = True, concatenation_sent_multiplication: bool = False)

This loss was used in our SBERT publication (https://arxiv.org/abs/1908.10084) to train the SentenceTransformer model on NLI data. It adds a softmax classifier on top of the output of two transformer networks.

Parameters
  • model – SentenceTransformer model

  • sentence_embedding_dimension – Dimension of your sentence embeddings

  • num_labels – Number of different labels

  • concatenation_sent_rep – Concatenate vectors u,v for the softmax classifier?

  • concatenation_sent_difference – Add abs(u-v) for the softmax classifier?

  • concatenation_sent_multiplication – Add u*v for the softmax classifier?

Example:

from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.readers import InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_batch_size = 16
train_num_labels = 4  # labels in the training examples must lie in [0, train_num_labels)
train_examples = [InputExample(texts=['First pair, sent A', 'First pair, sent B'], label=0),
    InputExample(texts=['Second Pair, sent A', 'Second Pair, sent B'], label=3)]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
train_loss = losses.SoftmaxLoss(model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=train_num_labels)


TripletLoss

class sentence_transformers.losses.TripletLoss(model: sentence_transformers.SentenceTransformer.SentenceTransformer, distance_metric=<function TripletDistanceMetric.<lambda>>, triplet_margin: float = 5)

This class implements triplet loss. Given a triplet of (anchor, positive, negative), the loss minimizes the distance between anchor and positive while it maximizes the distance between anchor and negative. It computes the following loss function:

loss = max(||anchor - positive|| - ||anchor - negative|| + margin, 0).

The margin is an important hyperparameter and needs to be tuned for your data and task.

For further details, see: https://en.wikipedia.org/wiki/Triplet_loss

Parameters
  • model – SentenceTransformerModel

  • distance_metric – Function to compute the distance between two embeddings. The class TripletDistanceMetric contains common distance metrics that can be used.

  • triplet_margin – The negative should be at least this much further away from the anchor than the positive.

Example:

from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.readers import InputExample
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_batch_size = 16
train_examples = [InputExample(texts=['Anchor 1', 'Positive 1', 'Negative 1']),
    InputExample(texts=['Anchor 2', 'Positive 2', 'Negative 2'])]
train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
train_loss = losses.TripletLoss(model=model)
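
If Euclidean distance is not a good fit for your embeddings, the distance metric can be swapped, e.g. for cosine distance (a minimal sketch; the margin value is illustrative and should be tuned):

from sentence_transformers.losses import TripletDistanceMetric

train_loss = losses.TripletLoss(model=model, distance_metric=TripletDistanceMetric.COSINE, triplet_margin=0.5)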
