Natural Questions Models

Google’s Natural Questions dataset consists of about 100k real Google search queries, each paired with a relevant passage from Wikipedia. Models trained on this dataset work well for question-answer retrieval.

Usage

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nq-distilbert-base-v1")

query_embedding = model.encode("How many people live in London?")

# The passages are encoded as [ [title1, text1], [title2, text2], ...]
passage_embedding = model.encode(
    [["London", "London has 9,787,426 inhabitants at the 2011 census."]]
)

print("Similarity:", util.cos_sim(query_embedding, passage_embedding))

Note: Each passage has to be encoded as a pair of the Wikipedia article title and a text paragraph from that article.
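With more than one candidate passage, the same [title, text] convention applies to every entry. The following is a minimal sketch, not part of the original example: `build_passage_inputs` and `search` are hypothetical helper names, and the input layout (a dict mapping each article title to its paragraphs) is an assumption. The ranking uses `util.semantic_search` from sentence-transformers.

```python
def build_passage_inputs(articles):
    """Turn {title: [paragraph, ...]} into [[title, paragraph], ...] pairs.

    `articles` is a hypothetical input layout chosen for illustration.
    """
    pairs = []
    for title, paragraphs in articles.items():
        for paragraph in paragraphs:
            pairs.append([title, paragraph])
    return pairs


def search(model, query, pairs, top_k=3):
    """Rank [title, text] pairs against a query; returns (pair, score) tuples."""
    # Imported lazily so build_passage_inputs works without the library installed.
    from sentence_transformers import util

    query_embedding = model.encode(query, convert_to_tensor=True)
    passage_embeddings = model.encode(pairs, convert_to_tensor=True)
    # semantic_search returns one hit list per query; we pass a single query.
    hits = util.semantic_search(query_embedding, passage_embeddings, top_k=top_k)[0]
    return [(pairs[hit["corpus_id"]], hit["score"]) for hit in hits]
```

Assuming `model` is the `SentenceTransformer` instance loaded above, `search(model, "How many people live in London?", build_passage_inputs(articles))` would return the best-matching pairs with their cosine similarity scores.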

Performance

The models are evaluated on the Natural Questions development dataset using MRR@10.

Approach	MRR@10 (NQ dev set small)
nq-distilbert-base-v1	72.36
Other models
DPR	58.96
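
For reference, MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage among the top 10 results, counting 0 when none appears. The sketch below shows that computation under assumed inputs; `mrr_at_k` is a hypothetical helper, not the evaluation script used for the table, and the table's scores appear to be this value multiplied by 100.

```python
def mrr_at_k(ranked_ids_per_query, relevant_ids_per_query, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k.

    ranked_ids_per_query: one ranked list of passage ids per query (best first).
    relevant_ids_per_query: one set of relevant passage ids per query.
    """
    if not ranked_ids_per_query:
        return 0.0
    total = 0.0
    for ranked_ids, relevant_ids in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, passage_id in enumerate(ranked_ids[:k], start=1):
            if passage_id in relevant_ids:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids_per_query)


# Three toy queries: relevant hit at rank 1, at rank 3, and not found.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
relevant = [{"a"}, {"z"}, {"missing"}]
score = mrr_at_k(rankings, relevant)  # (1 + 1/3 + 0) / 3
```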