Datasets

sentence_transformers.datasets contains classes to organize your training input examples.

SentencesDataset

SentencesDataset is the main class to store the training examples. For details, see the training overview.

class sentence_transformers.datasets.SentencesDataset(examples: List[sentence_transformers.readers.InputExample], model: SentenceTransformer)

Dataset for smart batching, that is, each batch is only padded to the length of its longest sequence instead of padding all sequences to the maximum length. SentenceTransformer.smart_batching_collate is required as the collate_fn for this to work; SentencesDataset does not work without it.

Creates a new SentencesDataset with the tokenized texts and the labels as tensors.

Parameters
  • examples – a list of sentence_transformers.readers.InputExample

  • model – the SentenceTransformer model
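For illustration, a minimal training sketch (the model name, sentence pairs, and labels are invented; it assumes the usual model.fit() training loop, which assigns smart_batching_collate as the DataLoader's collate_fn):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import SentencesDataset
from sentence_transformers.readers import InputExample

model = SentenceTransformer('bert-base-nli-mean-tokens')  # illustrative model name

# Each InputExample holds the texts of one training instance plus a label
train_examples = [
    InputExample(texts=['A plane is taking off.', 'An air plane is taking off.'], label=1.0),
    InputExample(texts=['A man is playing a flute.', 'A man is eating pasta.'], label=0.0),
]

train_dataset = SentencesDataset(examples=train_examples, model=model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16)

train_loss = losses.CosineSimilarityLoss(model=model)
# fit() sets smart_batching_collate as the collate_fn, enabling smart batching
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)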

ParallelSentencesDataset

ParallelSentencesDataset is used for multilingual training. For details, see multilingual training.

class sentence_transformers.datasets.ParallelSentencesDataset(student_model: SentenceTransformer, teacher_model: SentenceTransformer, batch_size: int = 8, use_embedding_cache: bool = True)

This dataset reader can be used to read in parallel sentences, i.e., it reads a file with tab-separated columns containing the same sentence in different languages. For example, a file with the columns EN, DE, ES can look like this:

    hello world	hallo welt	hola mundo
    second sentence	zweiter satz	segunda oración

The sentence in the first column is mapped to a sentence embedding using the given teacher model; for example, the teacher can be a monolingual sentence embedding model for English. The sentences in the other languages are then mapped to this same English sentence embedding.

When getting a sample from the dataset, we get one sentence with the corresponding sentence embedding for this sentence.

teacher_model can be any class that implements an encode function. The encode function gets a list of sentences and returns a list of sentence embeddings.

Parallel sentences dataset reader to train a student model given a teacher model.

Parameters
  • student_model – student sentence embedding model that should be trained

  • teacher_model – teacher model that provides the sentence embeddings for the first column in the dataset file
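A sketch of multilingual distillation with this reader (model names and the file path are illustrative; it uses the reader's load_data() method together with MSELoss, following the library's multilingual training example):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: a monolingual (English) sentence embedding model; illustrative name
teacher_model = SentenceTransformer('bert-base-nli-mean-tokens')

# Student: a multilingual transformer with mean pooling that should mimic the teacher
word_embedding_model = models.Transformer('xlm-roberta-base')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data('parallel-sentences-en-de-es.tsv')  # hypothetical tab-separated file

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)
# The student is trained to reproduce the teacher's embeddings for every column
train_loss = losses.MSELoss(model=student_model)
student_model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)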

SentenceLabelDataset

SentenceLabelDataset can be used if you have labeled sentences and want to train with triplet loss.

class sentence_transformers.datasets.SentenceLabelDataset(examples: List[sentence_transformers.readers.InputExample], model: SentenceTransformer, provide_positive: bool = True, provide_negative: bool = True, parallel_tokenization: bool = True, max_processes: int = 4, chunk_size: int = 5000)

Dataset for training with triplet loss. This dataset takes a list of sentences grouped by their label and, for each selected anchor sentence, dynamically selects a positive example from the same group and a negative example from the other sentences.

This dataset should be used in combination with dataset_reader.LabelSentenceReader.

One iteration over this dataset selects every sentence as anchor once.

It also uses smart batching, like SentencesDataset.

Converts input examples to a SentenceLabelDataset usable to train the model, with SentenceTransformer.smart_batching_collate as the collate_fn for the DataLoader.

Assumes only one sentence per InputExample and labels as integers from 0 to max_num_labels.

Labels with only one example are ignored.

smart_batching_collate is required as collate_fn because it transforms the tokenized texts into tensors.

Parameters
  • examples – the input examples for the training

  • model – the SentenceTransformer model for the conversion

  • provide_positive – set this to False if you don’t need a positive example (e.g. for BATCH_HARD_TRIPLET_LOSS).

  • provide_negative – set this to False if you don’t need a negative example (e.g. for BATCH_HARD_TRIPLET_LOSS or MULTIPLE_NEGATIVES_RANKING_LOSS).

  • parallel_tokenization – if true, multiple processes will be started for tokenization

  • max_processes – maximum number of processes started for tokenization; cannot be larger than cpu_count()

  • chunk_size – chunk_size examples are sent to each process; larger values increase overall tokenization speed
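A minimal triplet-loss sketch (sentences and labels are invented; each label needs at least two sentences, since labels with a single example are ignored):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import SentenceLabelDataset
from sentence_transformers.readers import InputExample

model = SentenceTransformer('bert-base-nli-mean-tokens')  # illustrative model name

# One sentence per InputExample; integer labels group sentences that belong together
train_examples = [
    InputExample(texts=['How do I reset my password?'], label=0),
    InputExample(texts=['I forgot my password.'], label=0),
    InputExample(texts=['Where can I track my order?'], label=1),
    InputExample(texts=['What is the status of my shipment?'], label=1),
]

# With the defaults, each sample provides an (anchor, positive, negative) triple
train_dataset = SentenceLabelDataset(examples=train_examples, model=model)
train_dataloader = DataLoader(train_dataset, shuffle=False, batch_size=2)

train_loss = losses.TripletLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)

For batch-hard style losses, the parameters above suggest setting provide_positive and provide_negative to False, so the dataset yields single labeled sentences and the loss builds the triples within each batch.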