Natural Questions Models

Google’s Natural Questions dataset consists of about 100k real Google search queries, each paired with the relevant passage from Wikipedia. Models trained on this dataset work well for question-answer retrieval.


from sentence_transformers import SentenceTransformer, util

# Load the model trained on Natural Questions
model = SentenceTransformer('nq-distilbert-base-v1')

# Encode the question as a plain string
query_embedding = model.encode('How many people live in London?')

# The passages are encoded as [[title1, text1], [title2, text2], ...]
passage_embedding = model.encode([['London', 'London has 9,787,426 inhabitants at the 2011 census.']])

# Cosine similarity between the query and each passage
print("Similarity:", util.pytorch_cos_sim(query_embedding, passage_embedding))

Note: Each passage must be encoded as the Wikipedia article title together with a text paragraph from that article.


The models are evaluated on the Natural Questions development dataset using MRR@10.

Approach                 MRR@10 (NQ dev set small)
nq-distilbert-base-v1    72.36
Other models
DPR                      48.88
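
For readers unfamiliar with the metric, the following is a minimal sketch of how MRR@10 is commonly computed: for each query, take the reciprocal of the rank of the first relevant passage within the top 10 results (0 if none appears), then average over all queries. The helper name `mrr_at_k` and the toy passage ids are illustrative assumptions and not part of sentence-transformers. This is not the evaluation script behind the numbers above.

```python
from typing import Hashable, Sequence, Set


def mrr_at_k(
    ranked_ids: Sequence[Sequence[Hashable]],
    relevant_ids: Sequence[Set[Hashable]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank, cut off at rank k (hypothetical helper).

    ranked_ids[i]   : passage ids returned for query i, best first
    relevant_ids[i] : set of passage ids that answer query i
    """
    if len(ranked_ids) != len(relevant_ids):
        raise ValueError("need one ranking and one relevance set per query")
    if not ranked_ids:
        return 0.0

    total = 0.0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        # Only the first relevant hit inside the top k counts.
        for rank, passage_id in enumerate(ranking[:k], start=1):
            if passage_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids)


# Toy data (made up for illustration): three queries with ranked passage ids.
rankings = [["p3", "p1", "p7"], ["p2", "p9", "p4"], ["p5", "p6", "p8"]]
relevant = [{"p1"}, {"p2"}, {"p0"}]

score = mrr_at_k(rankings, relevant, k=10)
print(f"MRR@10: {score:.4f}")
```

In the toy data, the first query finds its relevant passage at rank 2, the second at rank 1, and the third not at all, so the score is (0.5 + 1.0 + 0) / 3 = 0.5. The values in the table appear to be reported on a 0 to 100 scale, in which case a result from this sketch would be multiplied by 100 before comparing.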