Training

This folder contains various examples for fine-tuning SentenceTransformers on specific tasks.

To get started, I recommend looking at the Semantic Textual Similarity (STS) or the Natural Language Inference (NLI) examples.

For documentation on how to train your own models, see Training Overview.

Training Examples

adaptive_layer - Examples to train models whose layers can be removed on the fly for faster inference.
avg_word_embeddings - This folder contains examples to train models based on classical word embeddings like GloVe. These models are extremely fast, but less accurate than transformer-based models.
clip - Examples to train CLIP image models.
cross-encoder - Examples to train CrossEncoder models.
data_augmentation - Examples of how to apply data augmentation strategies to improve embedding models.
distillation - Examples to make models smaller, faster and lighter.
hpo - Examples with hyperparameter search to find the best hyperparameters for your task.
matryoshka - Examples of training embedding models whose embeddings can be truncated (allowing for faster search) with minimal performance loss.
ms_marco - Example training scripts for training on the MS MARCO information retrieval dataset.
multilingual - Existing monolingual models can be extended to various languages (paper). This folder contains a step-by-step guide to extend existing models to new languages.
nli - Natural Language Inference (NLI) data can be quite helpful to pre-train and fine-tune models to create meaningful sentence embeddings.
other - Various small examples, each showcasing one specific training case.
paraphrases - Examples for training models capable of recognizing paraphrases, i.e., of understanding when texts have the same meaning despite using different words.
quora_duplicate_questions - Quora Duplicate Questions is a large corpus of duplicate questions from the Quora community. The folder contains examples of how to train models for duplicate question mining and for semantic search.
sts - The most basic method to train models is using Semantic Textual Similarity (STS) data. Here, we have a sentence pair and a score indicating the semantic similarity.
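
To make the STS setup concrete, here is a minimal, library-free sketch of what such training data looks like and how a model could be scored against it. Everything in it is an assumption for illustration: `StsExample`, `toy_embed`, `sts_mse_loss`, and the tiny vocabulary are hypothetical names, the gold scores are assumed to be normalized to the range 0 to 1, and the bag-of-words "embedding" merely stands in for a real model. It is not the code used by the scripts in the sts folder.

```python
import math
from dataclasses import dataclass


@dataclass
class StsExample:
    """One STS training example: a sentence pair plus a similarity score."""
    sentence_a: str
    sentence_b: str
    score: float  # assumption: gold similarity normalized to [0, 1]


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def toy_embed(sentence, vocab):
    """Hypothetical stand-in for a model: word counts over a fixed vocabulary."""
    words = sentence.lower().split()
    return [float(words.count(w)) for w in vocab]


def sts_mse_loss(examples, vocab):
    """Mean squared error between predicted cosine similarity and gold score."""
    total = 0.0
    for ex in examples:
        emb_a = toy_embed(ex.sentence_a, vocab)
        emb_b = toy_embed(ex.sentence_b, vocab)
        predicted = cosine_similarity(emb_a, emb_b)
        total += (predicted - ex.score) ** 2
    return total / len(examples)


vocab = ["a", "cat", "dog", "sleeps", "runs"]
examples = [
    StsExample("a cat sleeps", "a cat sleeps", 1.0),  # identical meaning
    StsExample("a cat sleeps", "dog runs", 0.0),      # unrelated
]
loss = sts_mse_loss(examples, vocab)
```

During fine-tuning, a real model's weights would be adjusted so that the cosine similarity of its two sentence embeddings moves closer to the gold score; the loss above only measures that gap for a fixed toy embedding.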