Quora Duplicate Questions

This folder contains scripts that demonstrate how to train CrossEncoder models for Information Retrieval. As a simple example, we will use the Quora Duplicate Questions dataset. It contains over 500,000 sentences with over 400,000 pairwise annotations indicating whether two questions are duplicates of each other.

Models trained on this dataset can be used for mining duplicate questions, i.e., given a large set of sentences (in this case questions), identify all pairs that are duplicates. Because CrossEncoder models only work on pairs of texts, they are best deployed after an initial filtering with a SentenceTransformer model. See Sentence Transformer > Usage > Paraphrase Mining for an example of how to use sentence transformers to mine for duplicate questions / paraphrases across hundreds of thousands of sentences.

After the initial filtering, a CrossEncoder model can be used to rerank the candidates, e.g. narrowing the top 100 down to the top 10. Because a CrossEncoder applies attention across both sentences of a pair, it can produce more accurate scores than a SentenceTransformer.
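For illustration, a minimal sketch of this two-stage approach could look as follows; the model names, the small question list, and top_k=100 are example choices, not fixed requirements:

from sentence_transformers import SentenceTransformer, CrossEncoder
from sentence_transformers.util import paraphrase_mining

questions = [
    'How do I learn Python?',
    'What is the best way to learn Python?',
    'How far away is the Moon from Earth?',
    'What is the distance between Earth and the Moon?',
]

# Step 1: fast candidate generation with a SentenceTransformer (bi-encoder)
bi_encoder = SentenceTransformer('all-MiniLM-L6-v2')
# Returns a list of [cosine_score, index_a, index_b] candidate pairs
candidates = paraphrase_mining(bi_encoder, questions, top_k=100)

# Step 2: rerank the candidate pairs with a CrossEncoder
cross_encoder = CrossEncoder('cross-encoder/quora-distilroberta-base')
pairs = [(questions[i], questions[j]) for _, i, j in candidates]
scores = cross_encoder.predict(pairs)

# Keep e.g. the 10 pairs with the highest CrossEncoder scores
reranked = sorted(zip(scores, pairs), key=lambda item: item[0], reverse=True)
for score, (question_a, question_b) in reranked[:10]:
    print(f'{score:.4f}\t{question_a}\t{question_b}')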

To train a CrossEncoder on the Quora Duplicate Questions dataset, see the example training script in this folder.

You can also train and use SentenceTransformer models for this task. See Sentence Transformer > Training Examples > Quora Duplicate Questions for more details.

Training

Choosing the right loss function is crucial for finetuning useful models. BinaryCrossEntropyLoss is a solid choice for training any CrossEncoder model with a single output class, i.e. one that outputs a single score.

(Figure: CrossEncoder architecture)

For each question pair, we pass question A and question B together through the BERT-based model, after which a classifier head converts the intermediate representation into a similarity score. With this loss, we apply torch.nn.BCEWithLogitsLoss, which accepts logits (a.k.a. outputs, raw predictions) and gold labels (1 if duplicate, 0 if not duplicate) and computes a loss denoting how well the model has done. This loss is then minimized to improve the performance of the model.
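As a rough sketch, a training run with BinaryCrossEntropyLoss can look like the following. This assumes the CrossEncoderTrainer API from sentence-transformers v4+ and the sentence-transformers/quora-duplicates dataset on the Hugging Face Hub; the base model and output path are placeholder choices.

from datasets import load_dataset
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# num_labels=1: the model outputs a single similarity score per pair
model = CrossEncoder('distilroberta-base', num_labels=1)

# Columns: sentence1, sentence2, label (1 = duplicate, 0 = not duplicate)
train_dataset = load_dataset('sentence-transformers/quora-duplicates', 'pair-class', split='train')

# Applies torch.nn.BCEWithLogitsLoss to the model's raw logits
loss = BinaryCrossEntropyLoss(model)

trainer = CrossEncoderTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()

# Placeholder output path
model.save_pretrained('models/quora-distilroberta-base')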

Inference

You can perform inference with any of the pre-trained CrossEncoder models for duplicate question detection like so:

from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/quora-distilroberta-base')
# Each pair receives one score: the predicted probability that the two questions are duplicates
scores = model.predict([
    ('What do apples consist of?', 'What are in Apple devices?'),
    ('How do I get good at programming?', 'How to become a good programmer?'),
])
print(scores)
# [0.00056, 0.97536]
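
If you instead have one query and a set of candidate questions, recent sentence-transformers versions also provide CrossEncoder.rank, which scores every (query, candidate) pair and returns the candidates sorted by score. A brief sketch with made-up inputs:

from sentence_transformers import CrossEncoder

model = CrossEncoder('cross-encoder/quora-distilroberta-base')

query = 'How can I improve my programming skills?'
candidates = [
    'How do I get good at programming?',
    'What do apples consist of?',
    'How do I learn to cook?',
]

# Each result is a dict with 'corpus_id' and 'score', sorted by score descending
results = model.rank(query, candidates)
for result in results:
    print(f"{result['score']:.4f}\t{candidates[result['corpus_id']]}")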