Datasets

sentence_transformers.datasets contains classes to organize your training input examples.

ParallelSentencesDataset

ParallelSentencesDataset is used for multilingual training. For details, see multilingual training.

class sentence_transformers.datasets.ParallelSentencesDataset(*args, **kwds)

This dataset reader can be used to read in parallel sentences, i.e., it reads a file of tab-separated sentences in which each line holds the same sentence in different languages. For example, the file can look like this (EN DE ES):

    hello world	hallo welt	hola mundo
    second sentence	zweiter satz	segunda oración

The sentence in the first column will be mapped to a sentence embedding using the given embedder. For example, the embedder can be a monolingual sentence embedding method for English. The sentences in the other languages will also be mapped to this English sentence embedding.

When getting a sample from the dataset, we get one sentence together with the corresponding sentence embedding for that sentence.

teacher_model can be any class that implements an encode function. The encode function takes a list of sentences and returns a list of sentence embeddings.

Parallel sentences dataset reader to train a student model given a teacher model.

Parameters
  • student_model – Student sentence embedding model that should be trained

  • teacher_model – Teacher model that provides the sentence embeddings for the first column in the dataset file
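The mapping described above can be illustrated with a minimal, standard-library-only sketch. It does not use sentence_transformers and is not the library's implementation: ToyTeacher and iter_parallel_samples are hypothetical names introduced here, and the toy "embedding" (character count and word count) only stands in for a real teacher's output. The one thing taken from the description is the contract: the teacher exposes encode, which takes a list of sentences and returns a list of embeddings, and every sentence on a line shares the embedding of the first column.

```python
import io
from typing import Iterable, Iterator, List, Sequence, Tuple


class ToyTeacher:
    """Hypothetical stand-in for a teacher model: any object with encode()."""

    def encode(self, sentences: Sequence[str]) -> List[List[float]]:
        # Toy "embedding": [character count, word count]. A real teacher
        # would return dense sentence embeddings instead.
        return [[float(len(s)), float(len(s.split()))] for s in sentences]


def iter_parallel_samples(
    lines: Iterable[str], teacher: ToyTeacher
) -> Iterator[Tuple[str, List[float]]]:
    """Yield (sentence, target_embedding) pairs from tab-separated lines."""
    for line in lines:
        sentences = [s.strip() for s in line.rstrip("\n").split("\t") if s.strip()]
        if not sentences:
            continue
        # Only the first column is embedded by the teacher.
        target = teacher.encode([sentences[0]])[0]
        # The source sentence and all its translations share that target.
        for sentence in sentences:
            yield sentence, target


# In-memory stand-in for the example file (EN DE ES).
data = io.StringIO(
    "hello world\thallo welt\thola mundo\n"
    "second sentence\tzweiter satz\tsegunda oración\n"
)
samples = list(iter_parallel_samples(data, ToyTeacher()))
for sentence, embedding in samples:
    print(sentence, embedding)
```

Each sample pairs one sentence with the teacher's embedding of the first-column sentence, which is the target a student model would be trained to reproduce.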

SentenceLabelDataset

SentenceLabelDataset can be used if you have labeled sentences and want to train with triplet loss.

class sentence_transformers.datasets.SentenceLabelDataset(*args, **kwds)

This dataset can be used for certain triplet losses, such as BATCH_HARD_TRIPLET_LOSS, which require multiple examples with the same label in a batch.

It draws n consecutive, random and unique samples from one label at a time. This is repeated for each label.

Labels with fewer than n unique samples are ignored. This also applies when drawing without replacement: once fewer than n samples remain for a label, that label is skipped.

This DOES NOT check whether there are more labels than the batch size, or whether the batch size is divisible by the number of samples drawn per label.

Creates a LabelSampler for a SentenceLabelDataset.

Parameters
  • examples – a list with InputExamples

  • samples_per_label – the number of consecutive, random and unique samples drawn per label. The batch size should be a multiple of samples_per_label.

  • with_replacement – if this is False, then each sample is drawn at most once (depending on the total number of samples per label). If this is True, then one sample can be drawn in multiple draws, but still not multiple times in the same draw.
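The label-wise drawing described for SentenceLabelDataset can be sketched in plain Python. This is an illustrative approximation, not the library's code: iter_label_groups, its n and seed parameters, and the (text, label) tuples are hypothetical stand-ins for the real class and its InputExamples. The sketch covers only the case in which every sample is used at most once.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def iter_label_groups(
    examples: List[Tuple[str, str]], n: int, seed: int = 0
) -> Iterator[Tuple[str, str]]:
    """Yield examples so each label contributes n consecutive, unique samples."""
    rng = random.Random(seed)
    remaining: Dict[str, List[str]] = defaultdict(list)
    for text, label in examples:
        remaining[label].append(text)
    # Labels with fewer than n samples are ignored from the start.
    active = [label for label, texts in remaining.items() if len(texts) >= n]
    while active:
        still_active = []
        for label in active:
            pool = remaining[label]
            for text in rng.sample(pool, n):  # n unique samples of this label
                pool.remove(text)  # never reuse a sample in a later draw
                yield text, label
            # A label is skipped once fewer than n samples remain.
            if len(pool) >= n:
                still_active.append(label)
        active = still_active


examples = [
    ("a1", "a"), ("a2", "a"), ("a3", "a"), ("a4", "a"),
    ("b1", "b"), ("b2", "b"), ("b3", "b"),
    ("c1", "c"),
]
ordered = list(iter_label_groups(examples, n=2))
print([label for _, label in ordered])
```

With n=2, label "a" is drawn twice, label "b" once (its third sample is left over), and label "c" never appears. Feeding such a stream to a loader whose batch size is a multiple of n keeps each label's group inside a single batch.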