We provide various pre-trained models. Using these models is easy:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('model_name')
```
Alternatively, you can download and unzip them from here.
Sentence Embedding Models¶
The following models have been tuned to embed sentences and short paragraphs up to a length of 128 word pieces.
Use paraphrase-mpnet-base-v2 for the best quality, and paraphrase-MiniLM-L6-v2 if you want a quick model with high quality.
Question-Answer Retrieval - MSMARCO¶
The following models were trained on MS MARCO Passage Ranking, a dataset with 500k real queries from Bing search. Given a search query, these models find the relevant passages.
Models tuned to be used with cosine-similarity:
msmarco-distilbert-base-v4: MRR@10: 33.79 on MS MARCO dev set
Models tuned to be used with dot-product:
msmarco-distilbert-base-dot-prod-v3: MRR@10: 33.04 on MS MARCO dev set
msmarco-distilbert-base-tas-b: MRR@10: 34.43 on MS MARCO dev set
msmarco-roberta-base-ance-firstp: MRR@10: 33.03 on MS MARCO dev set
Models tuned for cosine-similarity tend to prefer the retrieval of shorter documents, while models tuned for dot-product tend to prefer the retrieval of longer documents. Depending on your task, one type or the other may be preferable.
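The reason for this length preference can be sketched with plain vectors: cosine similarity normalizes away vector length, while the dot-product rewards larger norms (and longer passages tend to produce larger-norm embeddings). The vector values below are illustrative stand-ins, not real model output:

```python
import numpy as np

def cos_sim(a, b):
    # Cosine similarity: dot-product of length-normalized vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_score(a, b):
    # Raw dot-product: sensitive to vector length
    return float(np.dot(a, b))

query = np.array([1.0, 1.0])
short_doc = np.array([1.0, 0.9])   # small norm, almost same direction as query
long_doc = np.array([3.0, 2.0])    # similar direction, but much larger norm

# Cosine ranks the short document higher; dot-product ranks the long one higher.
print(cos_sim(query, short_doc), cos_sim(query, long_doc))
print(dot_score(query, short_doc), dot_score(query, long_doc))
```

The same score functions are available in the library as `util.cos_sim` and `util.dot_score`.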
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('msmarco-distilbert-base-v4')

query_embedding = model.encode('How big is London')
passage_embedding = model.encode('London has 9,787,426 inhabitants at the 2011 census')

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))
```
You can index the passages as shown here.
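Once the passages are encoded, the retrieval step itself is just a top-k nearest-neighbour search over the embeddings (the library provides this as `util.semantic_search`). A minimal NumPy re-implementation, with illustrative stand-in vectors in place of `model.encode` output:

```python
import numpy as np

def semantic_search(query_emb, corpus_embs, top_k=3):
    """Return the top_k corpus indices ranked by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q                     # cosine scores against every passage
    top = np.argsort(-scores)[:top_k]  # indices of the best-scoring passages
    return [(int(i), float(scores[i])) for i in top]

# Toy stand-ins for encoded query and passages:
query_emb = np.array([0.2, 0.9, 0.1])
corpus_embs = np.array([
    [0.1, 0.8, 0.2],   # most similar to the query
    [0.9, 0.1, 0.0],   # unrelated
    [0.3, 0.6, 0.4],
])
print(semantic_search(query_emb, corpus_embs, top_k=2))
```

For large corpora, an approximate-nearest-neighbour index (e.g. FAISS or Annoy) replaces the exhaustive scan, but the scoring logic is the same.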
Question-Answer Retrieval - Natural Questions¶
The following models were trained on Google’s Natural Questions dataset, a dataset with 100k real queries from Google search together with the relevant passages from Wikipedia.
nq-distilbert-base-v1: MRR@10: 72.36 on NQ dev set (small)
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('nq-distilbert-base-v1')

query_embedding = model.encode('How many people live in London?')

# The passages are encoded as [[title1, text1], [title2, text2], ...]
passage_embedding = model.encode([['London', 'London has 9,787,426 inhabitants at the 2011 census.']])

print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))
```
You can index the passages as shown here.
In Dense Passage Retrieval for Open-Domain Question Answering, Karpukhin et al. trained models based on Google’s Natural Questions dataset.
They also trained models on the combination of Natural Questions, TriviaQA, WebQuestions, and CuratedTREC.
Multilingual Models¶
The following models generate aligned vector spaces, i.e., similar inputs in different languages are mapped close to each other in vector space. You do not need to specify the input language. Details are in our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation:
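The distillation objective from that publication can be sketched in miniature: a student encoder is trained so that its embedding of a translated sentence matches the (frozen) teacher's embedding of the English source, using a mean-squared-error loss. The linear "student" and the random toy vectors below are illustrative assumptions, not the real training setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: fixed teacher embeddings for English sentences, and input
# features for the corresponding translated sentences (both illustrative).
teacher_embs = rng.normal(size=(8, 4))     # teacher(english_sentence)
student_inputs = rng.normal(size=(8, 6))   # features of the translations

W = rng.normal(size=(4, 6)) * 0.1          # linear "student" encoder

def mse(W):
    # Distillation loss: student(translation) should equal teacher(source)
    diff = student_inputs @ W.T - teacher_embs
    return float(np.mean(diff ** 2))

losses = [mse(W)]
lr = 0.05
for _ in range(200):
    diff = student_inputs @ W.T - teacher_embs         # (8, 4) residuals
    grad = 2.0 / diff.size * diff.T @ student_inputs   # dL/dW
    W -= lr * grad                                     # gradient step
    losses.append(mse(W))

print(f"MSE before: {losses[0]:.3f}  after: {losses[-1]:.3f}")
```

Because the student is pulled toward the teacher's vector space for every language pair, sentences and their translations end up close together in the shared space.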
Currently, there are models for two use-cases:
Semantic Similarity Models¶
These models find semantically similar sentences within one language or across languages:
distiluse-base-multilingual-cased-v1: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. Supports 15 languages: Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish.
distiluse-base-multilingual-cased-v2: Multilingual knowledge distilled version of multilingual Universal Sentence Encoder. While the v1 model supports 15 languages, this version supports 50+ languages. However, performance on the 15 languages mentioned above is reported to be a bit lower.
paraphrase-xlm-r-multilingual-v1 - Multilingual version of paraphrase-distilroberta-base-v1, trained on parallel data for 50+ languages.
stsb-xlm-r-multilingual: Produces similar embeddings as the stsb-bert-base model. Trained on parallel data for 50+ languages.
quora-distilbert-multilingual - Multilingual version of quora-distilbert-base. Fine-tuned with parallel data for 50+ languages.
T-Systems-onsite/cross-en-de-roberta-sentence-transformer - Multilingual model for English and German. [More]
paraphrase-multilingual-MiniLM-L12-v2 - Multilingual version of paraphrase-MiniLM-L12-v2, trained on parallel data for 50+ languages.
paraphrase-multilingual-mpnet-base-v2 - Multilingual version of paraphrase-mpnet-base-v2, trained on parallel data for 50+ languages.
Bitext Mining¶
Bitext mining describes the process of finding translated sentence pairs in two languages. If this is your use-case, the following model gives the best performance:
LaBSE - LaBSE Model. Supports 109 languages. Works well for finding translation pairs in multiple languages. As detailed here, LaBSE works less well for assessing the similarity of sentence pairs that are not translations of each other.
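Given LaBSE embeddings for sentences in two languages, a simple mining strategy is to keep pairs that are mutual nearest neighbours with a high cosine score. A minimal NumPy sketch; the vectors and the 0.8 threshold are illustrative assumptions, not values from the model:

```python
import numpy as np

def mine_bitext(src_embs, tgt_embs, threshold=0.8):
    """Pair sentences whose embeddings are mutual nearest neighbours."""
    s = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    t = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    scores = s @ t.T                   # cosine scores, shape (n_src, n_tgt)
    pairs = []
    for i in range(len(s)):
        j = int(np.argmax(scores[i]))  # best target for source i
        # keep the pair only if the match is mutual and confident
        if int(np.argmax(scores[:, j])) == i and scores[i, j] >= threshold:
            pairs.append((i, j, float(scores[i, j])))
    return pairs

# Toy stand-ins for embeddings of English / German sentences:
src = np.array([[1.0, 0.1], [0.0, 1.0]])
tgt = np.array([[0.1, 1.0], [1.0, 0.0]])
print(mine_bitext(src, tgt))
```

Production bitext-mining systems typically add margin-based scoring on top of raw cosine, but the mutual-nearest-neighbour idea is the same.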
XLM-R models support 100 languages, among them Burmese (Zawgyi font), Catalan, Chinese (Simplified and Traditional), Croatian, Tamil (incl. romanized), Telugu (incl. romanized), Thai, and Turkish.
We used the following languages for Multilingual Knowledge Distillation: ar, bg, ca, cs, da, de, el, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw.
Extending a model to new languages is easy by following the description here.
SPECTER is a model trained on scientific citations and can be used to estimate the similarity of two publications. We can use it to find similar papers.
Average Word Embeddings Models¶
The following models compute the average word embedding using some well-known word embedding methods. They are much faster than the transformer-based models, but the quality of the embeddings is worse.
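The averaging itself is straightforward: look up a vector for each known word and take the element-wise mean. A minimal sketch with a toy word-vector table standing in for a real GloVe or word2vec lookup:

```python
import numpy as np

# Toy word-vector table (illustrative values, not real GloVe vectors):
word_vecs = {
    "london": np.array([0.9, 0.1, 0.3]),
    "big": np.array([0.2, 0.8, 0.1]),
    "city": np.array([0.8, 0.2, 0.4]),
}

def avg_embedding(sentence, vecs, dim=3):
    """Sentence embedding = mean of the known word vectors (zeros if none)."""
    tokens = [w for w in sentence.lower().split() if w in vecs]
    if not tokens:
        return np.zeros(dim)
    return np.mean([vecs[w] for w in tokens], axis=0)

print(avg_embedding("How big is London", word_vecs))
```

This is why these models are so fast: encoding is a table lookup plus a mean, with no transformer forward pass.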
Image & Text-Models¶
The following models can embed images and text into a joint vector space. See Image Search for more details on how to use them for text2image search, image2image search, image clustering, and zero-shot image classification.
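Because images and label texts share one vector space, zero-shot classification reduces to picking the label prompt whose embedding is closest to the image embedding. A minimal NumPy sketch; the vectors are illustrative stand-ins for the model's actual image and text embeddings:

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs, labels):
    """Pick the label whose text embedding is closest (cosine) to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    scores = txt @ img            # cosine score of each label prompt
    best = int(np.argmax(scores))
    return labels[best], float(scores[best])

# Toy stand-ins for joint-space embeddings of an image and two label prompts:
image_emb = np.array([0.9, 0.2, 0.1])
label_embs = np.array([
    [0.8, 0.3, 0.0],   # "a photo of a dog"
    [0.1, 0.9, 0.2],   # "a photo of a cat"
])
labels = ["a photo of a dog", "a photo of a cat"]
print(zero_shot_classify(image_emb, label_embs, labels))
```

No label-specific training is needed: adding a new class is just encoding one more text prompt.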