Pretrained Models

We provide various pre-trained models. Using these models is easy:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('model_name')

Alternatively, you can download and unzip them from here.

Choosing the Right Model

Sadly, there is no universal model that performs well on all possible tasks. A model that is strong on one task may be weak on another. Hence, it is important to select the right model for your task.

Paraphrase Identification

The following models are recommended for various applications, as they were trained on millions of paraphrase examples. They produce very good results for various similarity and retrieval tasks. They are still under development; better versions and more details will be released in the future. But for many tasks they already work better than the NLI / STSb models.

  • distilroberta-base-paraphrase-v1 - Trained on large scale paraphrase data.

  • xlm-r-distilroberta-base-paraphrase-v1 - Multilingual version of distilroberta-base-paraphrase-v1, trained on parallel data for 50+ languages.
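
A minimal sketch of how these paraphrase models can be used (the example sentences are made up): encode two sentences and compare them with cosine similarity, in the same way as the MSMARCO example further below.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distilroberta-base-paraphrase-v1')

# Encode two sentences and compare them with cosine similarity;
# a higher score indicates a more likely paraphrase
emb1 = model.encode('The weather today is beautiful')
emb2 = model.encode('It is a lovely day outside')

print("Similarity:", util.pytorch_cos_sim(emb1, emb2))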

Semantic Textual Similarity

The following models were optimized for Semantic Textual Similarity (STS). They were trained on SNLI+MultiNLI and then fine-tuned on the STS benchmark train set.

The best available models for STS are:

  • roberta-large-nli-stsb-mean-tokens - STSb performance: 86.39

  • roberta-base-nli-stsb-mean-tokens - STSb performance: 85.44

  • bert-large-nli-stsb-mean-tokens - STSb performance: 85.29

  • distilbert-base-nli-stsb-mean-tokens - STSb performance: 85.16

» Full List of STS Models

Duplicate Questions Detection

The following models were trained for duplicate questions mining and duplicate questions retrieval. You can use them to detect duplicate questions in a large corpus (see paraphrase mining) or to search for similar questions (see semantic search).

Available models:

  • distilbert-base-nli-stsb-quora-ranking - Model first tuned on NLI+STSb data, then fine-tuned for Quora duplicate questions detection and retrieval.

  • distilbert-multilingual-nli-stsb-quora-ranking - Multilingual version of distilbert-base-nli-stsb-quora-ranking. Fine-tuned with parallel data for 50+ languages.
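
A minimal sketch for the models above, assuming the util.paraphrase_mining helper from this package (see the paraphrase mining page); the questions are made-up examples. The helper compares all questions against each other and returns the best-scoring pairs.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distilbert-base-nli-stsb-quora-ranking')

questions = [
    'How do I learn Python?',
    'What is the best way to learn Python?',
    'How can I improve my English?',
]

# Returns (score, index1, index2) triplets, sorted by decreasing similarity
pairs = util.paraphrase_mining(model, questions)
for score, i, j in pairs:
    print(score, questions[i], '<->', questions[j])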

Information Retrieval

The following models were trained on MSMARCO Passage Ranking: Given a search query (which can be anything like keywords, a sentence, a question), find the relevant passages. You can index the embeddings and use them for dense information retrieval, outperforming lexical approaches like BM25.

  • distilroberta-base-msmarco-v1 - First version, trained on the MSMARCO train set. MRR on the MSMARCO dev dataset: 23.28

To use, you have to prepend [QRY] to the queries and [DOC] to passages:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('distilroberta-base-msmarco-v1')

query_embedding = model.encode('[QRY] ' + 'How big is London')
passage_embedding = model.encode('[DOC] ' + 'London has 9,787,426 inhabitants at the 2011 census')

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))

You can index the passages as shown here.
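
For small collections, a minimal sketch with util.semantic_search (the helper behind the semantic search page; its use here is an assumption, and the passages are illustrative) could look like this:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distilroberta-base-msmarco-v1')

passages = [
    'London has 9,787,426 inhabitants at the 2011 census',
    'Berlin is the capital and largest city of Germany',
]

# Remember to prepend [DOC] to passages and [QRY] to queries
passage_embeddings = model.encode(['[DOC] ' + p for p in passages])
query_embedding = model.encode('[QRY] ' + 'How big is London')

# Retrieve the top-scoring passages for the query
hits = util.semantic_search(query_embedding, passage_embeddings, top_k=2)
for hit in hits[0]:
    print(passages[hit['corpus_id']], hit['score'])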

More details

Multi-Lingual Models

The following models generate aligned vector spaces, i.e., similar inputs in different languages are mapped close together in vector space. You do not need to specify the input language. Details are in our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation.

Currently, there are models for two use-cases:

Semantic Similarity

These models find semantically similar sentences within one language or across languages:

  • distiluse-base-multilingual-cased-v2: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. While the original mUSE model only supports 16 languages, this multilingual knowledge distilled version supports 50+ languages.

  • xlm-r-distilroberta-base-paraphrase-v1 - Multilingual version of distilroberta-base-paraphrase-v1, trained on parallel data for 50+ languages.

  • xlm-r-bert-base-nli-stsb-mean-tokens: Produces similar embeddings as the bert-base-nli-stsb-mean-tokens model. Trained on parallel data for 50+ languages.

  • distilbert-multilingual-nli-stsb-quora-ranking - Multilingual version of distilbert-base-nli-stsb-quora-ranking. Fine-tuned with parallel data for 50+ languages.

Bitext Mining

Bitext mining describes the process of finding translated sentence pairs in two languages. If this is your use-case, the following model gives the best performance:

  • LaBSE - LaBSE Model. Supports 109 languages. Works well for finding translation pairs in multiple languages. As detailed here, LaBSE works less well for assessing the similarity of sentence pairs that are not translations of each other.
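
A minimal bitext-mining sketch with LaBSE (the sentence pairs are made up; a real pipeline would add score thresholds or margin-based filtering): encode both sides, compute all pairwise cosine similarities, and take the best match per source sentence.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('LaBSE')

english = ['The cat sits outside', 'I love pasta']
german = ['Ich liebe Pasta', 'Die Katze sitzt draussen']

emb_en = model.encode(english)
emb_de = model.encode(german)

# For each English sentence, pick the German sentence with the highest cosine similarity
cos_scores = util.pytorch_cos_sim(emb_en, emb_de)
for i, sentence in enumerate(english):
    best = int(cos_scores[i].argmax())
    print(sentence, '->', german[best], float(cos_scores[i][best]))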


XLM-R models support the following 100 languages.

Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Burmese (Zawgyi font), Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

We used the following languages for Multilingual Knowledge Distillation: ar, bg, ca, cs, da, de, el, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw.

Extending a model to new languages is easy by following the description here.

Wikipedia Sections

The following model was trained on the dataset from Dor et al. 2018, Learning Thematic Similarity Metric Using Triplet Networks, and learns whether two sentences belong to the same section of a Wikipedia page. It can be used for fine-grained clustering of similar sentences into sections / topics. Further details

  • bert-base-wikipedia-sections-mean-tokens: 80.42% accuracy on Wikipedia Triplets test set.
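
A minimal clustering sketch, assuming scikit-learn's KMeans for the clustering step (scikit-learn, the sentences, and the number of clusters are illustrative assumptions, not part of this model card):

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans  # assumption: scikit-learn is installed

model = SentenceTransformer('bert-base-wikipedia-sections-mean-tokens')

sentences = [
    'He was born in London in 1946.',
    'She grew up in a small village in Bavaria.',
    'His first novel was published in 1972.',
    'Her debut album appeared in 1998.',
]

embeddings = model.encode(sentences)

# Group sentences into topical clusters based on their embeddings
kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)
for sentence, label in zip(sentences, kmeans.labels_):
    print(label, sentence)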

Average Word Embeddings Models

The following models compute the average word embedding for some well-known word embedding methods. They are much faster than the transformer-based models, but the quality of the embeddings is worse.

  • average_word_embeddings_glove.6B.300d

  • average_word_embeddings_komninos

  • average_word_embeddings_levy_dependency

  • average_word_embeddings_glove.840B.300d