MS MARCO

MS MARCO Passage Ranking is a large dataset for training models for information retrieval. It consists of about 500k real search queries from the Bing search engine, together with the relevant text passages that answer them.

This page shows how to train models (Cross-Encoders and Sentence Embedding Models) on this dataset so that they can be used to search text passages given queries (keywords, phrases, or questions).

If you are interested in how to use these models, see Application - Retrieve & Re-Rank.

There are pre-trained models available, which you can use directly without training your own. For more information, see: Pretrained Models | Pretrained Cross-Encoders

Bi-Encoder

Cross-Encoders are only suitable for re-ranking a small set of passages. To retrieve suitable documents from a large collection, we have to use a bi-encoder: the documents are independently encoded into fixed-size embeddings, and a query is embedded into the same vector space. Relevant documents can then be found using cosine similarity.

(Figure: Bi-Encoder architecture)

To train a bi-encoder on the MS MARCO dataset, see: train_bi-encoder-v2.py.
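The following is a minimal sketch of how retrieval with a bi-encoder works. The model name is just an example of an MS MARCO bi-encoder and the passages are toy data:

```python
from sentence_transformers import SentenceTransformer, util

# Example model name; any MS MARCO-trained bi-encoder works here
model = SentenceTransformer("msmarco-distilbert-base-v4")

passages = [
    "Python is a high-level programming language.",
    "The Eiffel Tower is located in Paris.",
]
query = "Where is the Eiffel Tower?"

# Encode all passages once, independently of any query
passage_embeddings = model.encode(passages, convert_to_tensor=True)

# Embed the query into the same vector space
query_embedding = model.encode(query, convert_to_tensor=True)

# Find relevant passages via cosine similarity
scores = util.cos_sim(query_embedding, passage_embeddings)[0]
best = scores.argmax().item()
print(passages[best], scores[best].item())
```

Because the passage embeddings do not depend on the query, they can be computed once offline and reused for every incoming query.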

Cross-Encoder

A Cross-Encoder accepts both inputs, the query and a possibly relevant passage, and returns a score between 0 and 1 indicating how relevant the passage is to the given query.

(Figure: Cross-Encoder architecture)

Cross-Encoders are often used for re-ranking: given a list of possibly relevant passages for a query, for example retrieved with BM25 / ElasticSearch, the cross-encoder re-ranks this list so that the most relevant passages are at the top of the result list.
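A minimal sketch of this re-ranking step is shown below. The model name is an example MS MARCO cross-encoder and the candidate passages are toy data standing in for a BM25 result list:

```python
from sentence_transformers import CrossEncoder

# Example model name; any MS MARCO-trained cross-encoder works here
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How many people live in Berlin?"
# Candidate passages, e.g. retrieved with BM25 / ElasticSearch
passages = [
    "Berlin is well known for its museums.",
    "Berlin had a population of 3.5 million registered inhabitants.",
]

# Score every (query, passage) pair; higher score = more relevant
scores = model.predict([(query, passage) for passage in passages])

# Re-rank the candidate list by the cross-encoder scores
reranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
for passage, score in reranked:
    print(f"{score:.2f}\t{passage}")
```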

To train a cross-encoder on the MS MARCO dataset, see the knowledge distillation example below.

Cross-Encoder Knowledge Distillation

(Figure: Cross-Encoder knowledge distillation, https://github.com/UKPLab/sentence-transformers/raw/master/docs/img/msmarco-training-ce-distillation.png)

  • train_cross-encoder-v2.py uses a knowledge distillation setup: Hofstätter et al. trained an ensemble of 3 (large) models on the MS MARCO dataset and predicted the scores for various (query, passage)-pairs (50% positive, 50% negative). In this example, we use knowledge distillation with a small & fast model that learns the logit scores from the teacher ensemble. This yields performance comparable to the large models while being 18 times faster. A sketch of this setup follows the list.
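The following is a minimal sketch of the knowledge distillation idea, not the full train_cross-encoder-v2.py script: the student model name is an example, and the (query, passage, teacher_score) triples stand in for the ensemble predictions of Hofstätter et al. The student's logit is regressed onto the teacher score with an MSE loss:

```python
from torch import nn
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Toy stand-ins for the teacher ensemble's (query, passage, score) triples
train_samples = [
    InputExample(texts=["what is python", "Python is a programming language."], label=9.2),
    InputExample(texts=["what is python", "Pythons are large snakes."], label=-4.1),
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=2)

# Small & fast student model (example base model)
student = CrossEncoder("distilbert-base-uncased", num_labels=1)

# Distillation: regress the student's logit onto the teacher's score
student.fit(
    train_dataloader=train_dataloader,
    loss_fct=nn.MSELoss(),
    epochs=1,
    warmup_steps=10,
)
```

Because the teacher scores are continuous logits rather than binary labels, the regression loss lets the student absorb the teacher ensemble's fine-grained relevance judgments.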