Training Datasets

Most dataset configurations will take one of four forms:

  • Case 1: The example is a pair of sentences and a label indicating how similar they are. The label can be either an integer or a float. This case applies to datasets originally prepared for Natural Language Inference (NLI), since they contain pairs of sentences with a label indicating whether one entails the other or not. Case Example: SNLI.

  • Case 2: The example is a pair of positive (similar) sentences without a label. For example, pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (query, response), or pairs of (source_language, target_language). Natural Language Inference datasets can also be formatted this way by pairing entailing sentences. Case Examples: Sentence Compression, COCO Captions, Flickr30k captions.

  • Case 3: The example is a sentence with an integer label indicating the class to which it belongs. This data format is easily converted by loss functions into three sentences (triplets) where the first is an “anchor”, the second a “positive” of the same class as the anchor, and the third a “negative” of a different class. Case Examples: TREC, Yahoo Answers Topics.

  • Case 4: The example is a triplet (anchor, positive, negative) without classes or labels for the sentences. Case Example: Quora Triplets.

Note that Sentence Transformers models can be trained with human labeling (cases 1 and 3) or with labels automatically deduced from text formatting (cases 2 and 4).
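To make the four formats concrete, here is a small sketch of what a single training example might look like in each case; the field names and sentences are illustrative and will differ between actual datasets.

# Illustrative examples for each case; field names and values are made up.

# Case 1: a pair of sentences plus a similarity / entailment label (e.g., SNLI)
case_1 = {
    "sentence1": "A man is playing a guitar.",
    "sentence2": "A person plays an instrument.",
    "label": 0,  # e.g., 0 = entailment
}

# Case 2: a positive (similar) pair without a label (e.g., duplicate questions)
case_2 = {
    "sentence1": "How do I learn Python?",
    "sentence2": "What is the best way to learn Python?",
}

# Case 3: a single sentence with an integer class label (e.g., TREC);
# loss functions can build (anchor, positive, negative) triplets from the classes
case_3 = {"sentence": "What is the capital of France?", "label": 3}

# Case 4: an explicit (anchor, positive, negative) triplet (e.g., Quora triplets)
case_4 = {
    "anchor": "How can I improve my memory?",
    "positive": "What are good ways to strengthen my memory?",
    "negative": "How do I cook pasta?",
}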

You can get almost ready-to-train datasets from various sources. One of them is the Hugging Face Hub.

Datasets on the Hugging Face Hub

The Datasets library (pip install datasets) allows you to load datasets from the Hugging Face Hub with the load_dataset function:

from datasets import load_dataset

# Repo id of the dataset on the Hugging Face Hub
dataset_id = "embedding-data/QQP_triplets"

# Download and load the dataset (all available splits)
dataset = load_dataset(dataset_id)
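
As a quick sanity check after loading, you can print the dataset object and look at a single example. Note that the available splits and the column layout vary between datasets; the snippet below assumes a "train" split.

# Show the available splits and their sizes
print(dataset)

# Look at the first example (column names depend on the dataset)
print(dataset["train"][0])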

For more information on how to manipulate your dataset, see the Datasets documentation.

These are popular datasets used to train and fine-tune Sentence Transformers models:

  • altlex pairs
  • sentence compression pairs
  • QQP triplets
  • PAQ pairs
  • SPECTER triplets
  • Amazon QA pairs
  • Simple Wiki pairs
  • Wiki Answers equivalent sentences
  • COCO Captions quintets
  • Flickr30k Captions quintets
  • MS Marco
  • GOOAQ
  • Yahoo Answers topics
  • Search QA
  • Stack Exchange
  • ELI5
  • MultiNLI
  • SNLI
  • S2ORC
  • Trivia QA
  • Code Search Net
  • Natural Questions
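
Once a dataset in one of these formats is loaded, its rows are usually wrapped into training examples before being handed to a loss function. As a minimal sketch, assuming the InputExample-based API of sentence-transformers and a hypothetical Case 2 pair dataset whose rows expose two text columns named sentence1 and sentence2 (the repo id and column names below are placeholders; adjust them to the dataset you actually use):

from datasets import load_dataset
from sentence_transformers import InputExample

# Hypothetical pair dataset; replace the repo id and column names
# with those of the dataset you are actually training on.
pairs = load_dataset("your-username/your-pair-dataset", split="train")

# Wrap each (sentence1, sentence2) pair as an unlabeled positive pair
train_examples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]])
    for row in pairs
]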