Datasets

sentence_transformers.datasets contains classes to organize your training input examples.

SentencesDataset

SentencesDataset is the main class to store the training examples. For details, see the training overview.

class sentence_transformers.datasets.SentencesDataset(examples: List[sentence_transformers.readers.InputExample], model: SentenceTransformer)

Dataset for smart batching, that is, each batch is only padded to the length of its longest sequence instead of padding all sequences to the maximum length. SentenceTransformer.smart_batching_collate is required as the collate_fn for this to work; SentencesDataset does not work without it.

Creates a new SentencesDataset with the tokenized texts and the labels as tensors.

Parameters
  • examples – a list of sentence_transformers.readers.InputExample

  • model – the SentenceTransformer model
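For illustration, a minimal training sketch (the model name, sentence pairs, and labels are invented; it assumes the usual model.fit() training loop, which assigns smart_batching_collate as the DataLoader's collate_fn):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import SentencesDataset
from sentence_transformers.readers import InputExample

model = SentenceTransformer('bert-base-nli-mean-tokens')  # illustrative model name

# Each InputExample holds the texts of one training instance plus a label
train_examples = [
    InputExample(texts=['A plane is taking off.', 'An air plane is taking off.'], label=1.0),
    InputExample(texts=['A man is playing a flute.', 'A man is eating pasta.'], label=0.0),
]

train_dataset = SentencesDataset(examples=train_examples, model=model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16)

train_loss = losses.CosineSimilarityLoss(model=model)
# fit() sets smart_batching_collate as the collate_fn, enabling smart batching
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)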

ParallelSentencesDataset

ParallelSentencesDataset is used for multilingual training. For details, see multilingual training.

class sentence_transformers.datasets.ParallelSentencesDataset(student_model: SentenceTransformer, teacher_model: SentenceTransformer, batch_size: int = 8, use_embedding_cache: bool = True)

This dataset reader can be used to read in parallel sentences, i.e., it reads a file with tab-separated columns containing the same sentence in different languages. For example, a file with the columns EN, DE, ES can look like this:

    hello world	hallo welt	hola mundo
    second sentence	zweiter satz	segunda oración

The sentence in the first column is mapped to a sentence embedding using the given teacher model; for example, the teacher can be a monolingual sentence embedding model for English. The sentences in the other languages are then mapped to this same English sentence embedding.

When getting a sample from the dataset, we get one sentence with the corresponding sentence embedding for this sentence.

teacher_model can be any class that implements an encode function. The encode function gets a list of sentences and returns a list of sentence embeddings.

Parallel sentences dataset reader to train a student model given a teacher model.

Parameters
  • student_model – student sentence embedding model that should be trained

  • teacher_model – teacher model that provides the sentence embeddings for the first column in the dataset file
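A sketch of multilingual distillation with this reader (model names and the file path are illustrative; it uses the reader's load_data() method together with MSELoss, following the library's multilingual training example):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: a monolingual (English) sentence embedding model; illustrative name
teacher_model = SentenceTransformer('bert-base-nli-mean-tokens')

# Student: a multilingual transformer with mean pooling that should mimic the teacher
word_embedding_model = models.Transformer('xlm-roberta-base')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data('parallel-sentences-en-de-es.tsv')  # hypothetical tab-separated file

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)
# The student is trained to reproduce the teacher's embeddings for every column
train_loss = losses.MSELoss(model=student_model)
student_model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)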

SentenceLabelDataset

SentenceLabelDataset can be used if you have labeled sentences and want to train with triplet loss.

class sentence_transformers.datasets.SentenceLabelDataset(examples: List[sentence_transformers.readers.InputExample], model: SentenceTransformer, provide_positive: bool = True, provide_negative: bool = True, parallel_tokenization: bool = True, max_processes: int = 4, chunk_size: int = 5000)

Dataset for training with triplet loss. This dataset takes a list of sentences grouped by their label and, for each selected anchor sentence, dynamically selects a positive example from the same group and a negative example from the other sentences.

This dataset should be used in combination with dataset_reader.LabelSentenceReader.

One iteration over this dataset selects every sentence as anchor once.

It also uses smart batching, like SentencesDataset.

Converts input examples to a SentenceLabelDataset usable to train the model, with SentenceTransformer.smart_batching_collate as the collate_fn for the DataLoader.

Assumes only one sentence per InputExample and labels as integers from 0 to max_num_labels.

Labels with only one example are ignored.

smart_batching_collate is required as collate_fn because it transforms the tokenized texts into tensors.

Parameters
  • examples – the input examples for the training

  • model – the SentenceTransformer model for the conversion

  • provide_positive – set this to False if you don’t need a positive example (e.g. for BATCH_HARD_TRIPLET_LOSS).

  • provide_negative – set this to False if you don’t need a negative example (e.g. for BATCH_HARD_TRIPLET_LOSS or MULTIPLE_NEGATIVES_RANKING_LOSS).

  • parallel_tokenization – if true, multiple processes will be started for tokenization

  • max_processes – maximum number of processes started for tokenization; cannot be larger than cpu_count()

  • chunk_size – chunk_size examples are sent to each process; larger values increase overall tokenization speed
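A minimal triplet-loss sketch (sentences and labels are invented; each label needs at least two sentences, since labels with a single example are ignored):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import SentenceLabelDataset
from sentence_transformers.readers import InputExample

model = SentenceTransformer('bert-base-nli-mean-tokens')  # illustrative model name

# One sentence per InputExample; integer labels group sentences that belong together
train_examples = [
    InputExample(texts=['How do I reset my password?'], label=0),
    InputExample(texts=['I forgot my password.'], label=0),
    InputExample(texts=['Where can I track my order?'], label=1),
    InputExample(texts=['What is the status of my shipment?'], label=1),
]

# With the defaults, each sample provides an (anchor, positive, negative) triple
train_dataset = SentenceLabelDataset(examples=train_examples, model=model)
train_dataloader = DataLoader(train_dataset, shuffle=False, batch_size=2)

train_loss = losses.TripletLoss(model=model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)

For batch-hard style losses, the parameters above suggest setting provide_positive and provide_negative to False, so the dataset yields single labeled sentences and the loss builds the triples within each batch.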