Datasets
Note
The sentence_transformers.datasets classes have been deprecated, and only exist for compatibility with the deprecated training method.
Instead of SentenceLabelDataset, you can now use BatchSamplers.GROUP_BY_LABEL to use the GroupByLabelBatchSampler.
Instead of NoDuplicatesDataLoader, you can now use BatchSamplers.NO_DUPLICATES to use the NoDuplicatesBatchSampler.
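For example, with the newer Trainer-based training these batch samplers are selected via SentenceTransformerTrainingArguments. A minimal sketch, where the output directory and batch size are placeholders:

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# Replaces NoDuplicatesDataLoader; use BatchSamplers.GROUP_BY_LABEL
# instead to replace SentenceLabelDataset.
args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder path
    per_device_train_batch_size=32,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```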
sentence_transformers.datasets contains classes to organize your training input examples.
ParallelSentencesDataset
ParallelSentencesDataset is used for multilingual training. For details, see multilingual training.
- class sentence_transformers.datasets.ParallelSentencesDataset(student_model: SentenceTransformer, teacher_model: SentenceTransformer, batch_size: int = 8, use_embedding_cache: bool = True)[source]
This dataset reader can be used to read in parallel sentences, i.e., it reads a file with tab-separated sentences that contain the same sentence in different languages. For example, the file can look like this (EN DE ES):
hello world    hallo welt    hola mundo
second sentence    zweiter satz    segunda oración
The sentence in the first column will be mapped to a sentence embedding using the given embedder. For example, the embedder can be a monolingual sentence embedding method for English. The sentences in the other languages will then be mapped to this same English sentence embedding.
When getting a sample from the dataset, we get one sentence together with the corresponding sentence embedding for this sentence.
teacher_model can be any class that implements an encode function. The encode function gets a list of sentences and returns a list of sentence embeddings.
Parallel sentences dataset reader to train a student model given a teacher model.
- Parameters:
student_model (SentenceTransformer) – The student sentence embedding model that should be trained.
teacher_model (SentenceTransformer) – The teacher model that provides the sentence embeddings for the first column in the dataset file.
batch_size (int, optional) – The batch size for training. Defaults to 8.
use_embedding_cache (bool, optional) – Whether to use an embedding cache. Defaults to True.
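A minimal usage sketch following the multilingual (knowledge distillation) training setup with the deprecated training API; the model names and the file path are placeholders, and load_data is assumed to accept a tab-separated (optionally gzipped) file as in the multilingual training example:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: a (monolingual) model that produces the target embeddings.
# Student: the multilingual model that should learn to mimic the teacher.
teacher_model = SentenceTransformer("paraphrase-distilroberta-base-v2")  # placeholder model name
student_model = SentenceTransformer("xlm-roberta-base")                  # placeholder model name

train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data("parallel-sentences-en-de-es.tsv.gz")  # placeholder path, tab-separated columns

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=8)
train_loss = losses.MSELoss(model=student_model)

student_model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```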
SentenceLabelDataset
SentenceLabelDataset can be used if you have labeled sentences and want to train with triplet loss.
- class sentence_transformers.datasets.SentenceLabelDataset(examples: list[InputExample], samples_per_label: int = 2, with_replacement: bool = False)[source]
This dataset can be used for some specific triplet losses like BATCH_HARD_TRIPLET_LOSS, which require multiple examples with the same label in a batch.
It draws n consecutive, random and unique samples from one label at a time. This is repeated for each label.
Labels with fewer than n unique samples are ignored. This also applies when drawing without replacement: once fewer than n samples remain for a label, it is skipped.
This does NOT check whether there are more labels than the batch size, or whether the batch size is divisible by the number of samples drawn per label.
Creates a LabelSampler for a SentenceLabelDataset.
- Parameters:
examples (List[InputExample]) – A list of InputExamples.
samples_per_label (int, optional) – The number of consecutive, random, and unique samples drawn per label. The batch size should be a multiple of samples_per_label. Defaults to 2.
with_replacement (bool, optional) – If True, each sample is drawn at most once (depending on the total number of samples per label). If False, one sample can be drawn in multiple draws, but not multiple times in the same drawing. Defaults to False.
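A minimal usage sketch with the deprecated training API; the model name and the example sentences are illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.datasets import SentenceLabelDataset

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name

# Labeled sentences: at least `samples_per_label` examples per label.
train_examples = [
    InputExample(texts=["A dog runs in the park"], label=0),
    InputExample(texts=["A puppy plays outside"], label=0),
    InputExample(texts=["The stock market rose today"], label=1),
    InputExample(texts=["Shares climbed on Wall Street"], label=1),
]

train_data = SentenceLabelDataset(train_examples, samples_per_label=2)
# The batch size should be a multiple of samples_per_label.
train_dataloader = DataLoader(train_data, batch_size=4, drop_last=True)
train_loss = losses.BatchHardTripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```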
DenoisingAutoEncoderDataset
DenoisingAutoEncoderDataset is used for unsupervised training with the TSDAE method.
- class sentence_transformers.datasets.DenoisingAutoEncoderDataset(sentences: list[str], noise_fn=<function DenoisingAutoEncoderDataset.<lambda>>)[source]
The DenoisingAutoEncoderDataset returns InputExamples in the format: texts=[noise_fn(sentence), sentence]. It is used in combination with the DenoisingAutoEncoderLoss, where a decoder tries to reconstruct the original sentence without noise.
- Parameters:
sentences – A list of sentences
noise_fn – A noise function: Given a string, it returns a string with noise, e.g. deleted words
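A minimal sketch of the TSDAE setup with the deprecated training API, following the TSDAE training example; the base transformer name and the sentences are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

# Build a sentence embedding model with CLS pooling, as in the TSDAE example.
word_embedding_model = models.Transformer("bert-base-uncased")  # placeholder model name
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_sentences = [
    "A sentence from your unlabeled corpus.",
    "Another sentence from the corpus.",
]  # placeholder data

# Each item becomes texts=[noise_fn(sentence), sentence].
train_dataset = DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```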
NoDuplicatesDataLoader
NoDuplicatesDataLoader can be used together with MultipleNegativesRankingLoss to ensure that there are no duplicates within the same batch.
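A minimal usage sketch with the deprecated training API; the model name and the sentence pairs are illustrative:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.datasets import NoDuplicatesDataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model name

# (anchor, positive) pairs; other in-batch examples act as negatives.
train_examples = [
    InputExample(texts=["What is the capital of France?", "Paris is the capital of France."]),
    InputExample(texts=["How many planets are there?", "There are eight planets in the solar system."]),
]

# Drop-in replacement for a regular DataLoader that avoids duplicate texts within a batch.
train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```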