CT (In-Batch Negatives)

Carlsson et al. present in Semantic Re-Tuning With Contrastive Tension (CT) an unsupervised learning approach for sentence embeddings that just requires sentences.

Background

During training, CT builds two independent encoders (‘Model1’ and ‘Model2’) with intial parameters shared to encode a pair of sentences. If Model1 and Model2 encode the same sentence, then the dot-product of the two sentence embeddings should be large. If Model1 and Model2 encode different sentences, then their dot-product should be small.

In the original CT paper, specially created batches are used. We implemented an improved version that uses in-batch negative sampling: Model1 and Model2 both encode the same set of sentences. We maximize the scores for matching indexes (i.e. Model1(S_i) and Model2(S_i)) while we minimize the scores for different indexes (i.e. Model1(S_i) and Model2(S_j) for i != j).

Using in-batch negative sampling gives a stronger training signal than the original loss function proposed by Carlsson et al.

CT working

After training, the model 2 will be used for inference, which usually has better performance.

Performance

In some preliminary experiments, we compate performance on the STSbenchmark dataset (trained with 1 million sentences from Wikipedia) and on the Quora duplicate questions dataset (trained with questions from Quora).

Method STSb (Spearman) Quora-Duplicate-Question (Avg. Precision)
CT 75.7 36.5
CT (In-Batch Negatives) 78.5 40.1

Note: We used the code provided in this repository, not the official code from the authors.

CT from Sentences File

train_ct-improved_from_file.py loads sentences from a provided text file. It is expected, that the there is one sentence per line in that text file.

SimCSE will be training using these sentences. Checkpoints are stored every 500 steps to the output folder.

Further Training Examples