Training Datasets

Most dataset configurations will take one of four forms:

  • Case 1: The example is a pair of sentences and a label indicating how similar they are. The label can be either an integer or a float. This case applies to datasets originally prepared for Natural Language Inference (NLI), since they contain pairs of sentences with a label indicating whether one entails the other or not. Case Example: SNLI.

  • Case 2: The example is a pair of positive (similar) sentences without a label. For example, pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (query, response), or pairs of (source_language, target_language). Natural Language Inference datasets can also be formatted this way by pairing entailing sentences. Case Examples: Sentence Compression, COCO Captions, Flickr30k captions.

  • Case 3: The example is a sentence with an integer label indicating the class to which it belongs. This data format is easily converted by loss functions into three sentences (triplets) where the first is an “anchor”, the second a “positive” of the same class as the anchor, and the third a “negative” of a different class. Case Examples: TREC, Yahoo Answers Topics.

  • Case 4: The example is a triplet (anchor, positive, negative) without classes or labels for the sentences. Case Example: Quora Triplets.

Note that Sentence Transformers models can be trained with human labeling (cases 1 and 3) or with labels automatically deduced from text formatting (cases 2 and 4).
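To make the four formats concrete, here is a small sketch of what a single training example might look like in each case; the field names and sentences are illustrative and will differ between actual datasets.

# Illustrative examples for each case; field names and values are made up.

# Case 1: a pair of sentences plus a similarity / entailment label (e.g., SNLI)
case_1 = {
    "sentence1": "A man is playing a guitar.",
    "sentence2": "A person plays an instrument.",
    "label": 0,  # e.g., 0 = entailment
}

# Case 2: a positive (similar) pair without a label (e.g., duplicate questions)
case_2 = {
    "sentence1": "How do I learn Python?",
    "sentence2": "What is the best way to learn Python?",
}

# Case 3: a single sentence with an integer class label (e.g., TREC);
# loss functions can build (anchor, positive, negative) triplets from the classes
case_3 = {"sentence": "What is the capital of France?", "label": 3}

# Case 4: an explicit (anchor, positive, negative) triplet (e.g., Quora triplets)
case_4 = {
    "anchor": "How can I improve my memory?",
    "positive": "What are good ways to strengthen my memory?",
    "negative": "How do I cook pasta?",
}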

You can get almost ready-to-train datasets from various sources. One of them is the Hugging Face Hub.

Datasets on the Hugging Face Hub

The Datasets library (pip install datasets) allows you to load datasets from the Hugging Face Hub with the load_dataset function:

from datasets import load_dataset

# Repo id of the dataset on the Hugging Face Hub
dataset_id = "embedding-data/QQP_triplets"

# Download and load the dataset (all available splits)
dataset = load_dataset(dataset_id)
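
As a quick sanity check after loading, you can print the dataset object and look at a single example. Note that the available splits and the column layout vary between datasets; the snippet below assumes a "train" split.

# Show the available splits and their sizes
print(dataset)

# Look at the first example (column names depend on the dataset)
print(dataset["train"][0])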

For more information on how to manipulate your dataset, see the Datasets documentation.

These are popular datasets used to train and fine-tune Sentence Transformers models:

  • altlex pairs
  • sentence compression pairs
  • QQP triplets
  • PAQ pairs
  • SPECTER triplets
  • Amazon QA pairs
  • Simple Wiki pairs
  • Wiki Answers equivalent sentences
  • COCO Captions quintets
  • Flickr30k Captions quintets
  • MS Marco
  • GOOAQ
  • Yahoo Answers topics
  • Search QA
  • Stack Exchange
  • ELI5
  • MultiNLI
  • SNLI
  • S2ORC
  • Trivia QA
  • Code Search Net
  • Natural Questions
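
Once a dataset in one of these formats is loaded, its rows are usually wrapped into training examples before being handed to a loss function. As a minimal sketch, assuming the InputExample-based API of sentence-transformers and a hypothetical Case 2 pair dataset whose rows expose two text columns named sentence1 and sentence2 (the repo id and column names below are placeholders; adjust them to the dataset you actually use):

from datasets import load_dataset
from sentence_transformers import InputExample

# Hypothetical pair dataset; replace the repo id and column names
# with those of the dataset you are actually training on.
pairs = load_dataset("your-username/your-pair-dataset", split="train")

# Wrap each (sentence1, sentence2) pair as an unlabeled positive pair
train_examples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]])
    for row in pairs
]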