CT (In-Batch Negatives)
Carlsson et al. present in Semantic Re-Tuning With Contrastive Tension (CT) an unsupervised learning approach for sentence embeddings that just requires sentences.
Background
During training, CT builds two independent encoders (‘Model1’ and ‘Model2’) with initial parameters shared to encode a pair of sentences. If Model1 and Model2 encode the same sentence, then the dot-product of the two sentence embeddings should be large. If Model1 and Model2 encode different sentences, then their dot-product should be small.
In the original CT paper, specially created batches are used. We implemented an improved version that uses in-batch negative sampling: Model1 and Model2 both encode the same set of sentences. We maximize the scores for matching indexes (i.e. Model1(S_i) and Model2(S_i)) while we minimize the scores for different indexes (i.e. Model1(S_i) and Model2(S_j) for i != j).
Using in-batch negative sampling gives a stronger training signal than the original loss function proposed by Carlsson et al.
After training, the model 2 will be used for inference, which usually has better performance.
Performance
In some preliminary experiments, we compare performance on the STSbenchmark dataset (trained with 1 million sentences from Wikipedia) and on the Quora duplicate questions dataset (trained with questions from Quora).
Method | STSb (Spearman) | Quora-Duplicate-Question (Avg. Precision) |
---|---|---|
CT | 75.7 | 36.5 |
CT (In-Batch Negatives) | 78.5 | 40.1 |
Note: We used the code provided in this repository, not the official code from the authors.
CT from Sentences File
train_ct-improved_from_file.py loads sentences from a provided text file. It is expected, that the there is one sentence per line in that text file.
SimCSE will be training using these sentences. Checkpoints are stored every 500 steps to the output folder.
Further Training Examples
train_stsb_ct-improved.py: This example uses 1 million sentences from Wikipedia to train with CT. It evaluate the performance on the STSbenchmark dataset.
train_askubuntu_ct-improved.py: This example trains on AskUbuntu Questions dataset, a dataset with questions from the AskUbuntu Stackexchange forum.