Sentence-Transformers can be used in different ways to cluster small or large sets of sentences.

k-Means contains an example of using the k-means clustering algorithm. k-means requires that the number of clusters be specified beforehand. The sentences are clustered into groups of about equal size.
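As a minimal sketch of this idea (not the example script itself), the snippet below clusters embeddings with scikit-learn's `KMeans`. The helper name `kmeans_cluster` and the toy 2-d vectors are assumptions made for illustration; the toy vectors stand in for real sentence embeddings so the snippet runs without downloading a model.

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_cluster(sentences, embeddings, num_clusters, seed=0):
    """Group sentences by the k-means label of their embedding (hypothetical helper)."""
    model = KMeans(n_clusters=num_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(np.asarray(embeddings, dtype=float))

    clusters = [[] for _ in range(num_clusters)]
    for sentence, label in zip(sentences, labels):
        clusters[label].append(sentence)
    return clusters


sentences = [
    "A man is eating food.",
    "A man is eating pasta.",
    "A cheetah chases its prey.",
    "A cheetah is running behind its prey.",
]
# Assumption: toy vectors replace real sentence embeddings for this sketch.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

clusters = kmeans_cluster(sentences, toy_embeddings, num_clusters=2)
for i, cluster in enumerate(clusters):
    print(f"Cluster {i + 1}: {cluster}")
```

With real data, the embeddings would typically come from something like `SentenceTransformer("all-MiniLM-L6-v2").encode(sentences)`, where the model name is only one possible choice.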

Agglomerative Clustering shows an example of hierarchical clustering using the agglomerative clustering algorithm. In contrast to k-means, we can specify a threshold for the clustering: clusters below that threshold are merged. This algorithm can be useful if the number of clusters is unknown. With the threshold, we can control whether we get many small, fine-grained clusters or a few coarse-grained clusters.
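The effect of the threshold can be sketched with scikit-learn's `AgglomerativeClustering`, using `n_clusters=None` together with `distance_threshold`. The helper name `threshold_cluster`, the toy vectors, and the two threshold values are assumptions chosen for illustration, not values taken from the example script.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def threshold_cluster(embeddings, distance_threshold):
    """Cluster unit-normalized embeddings; the cluster count follows from the threshold."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    model = AgglomerativeClustering(n_clusters=None, distance_threshold=distance_threshold)
    return model.fit_predict(emb)


# Assumption: toy vectors replace real sentence embeddings for this sketch.
toy_embeddings = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0], [0.05, 0.95]]

fine = threshold_cluster(toy_embeddings, distance_threshold=0.5)    # low threshold: more clusters
coarse = threshold_cluster(toy_embeddings, distance_threshold=3.0)  # high threshold: fewer clusters

print("fine-grained labels:  ", fine)
print("coarse-grained labels:", coarse)
```

A low threshold stops merging early and keeps the two tight pairs apart, while a high threshold lets everything merge into a single cluster.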

Fast Clustering

Agglomerative clustering is quite slow for larger datasets, so it is only practical for up to a few thousand sentences.

The fast clustering example presents a clustering algorithm that is tuned for large datasets (50k sentences in less than 5 seconds). It searches a large list of sentences for local communities, where a local community is a set of highly similar sentences.

You can configure the cosine-similarity threshold above which two sentences are considered similar. You can also specify the minimum size of a local community. This allows you to get either large coarse-grained clusters or small fine-grained clusters.
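To make the two parameters concrete, here is a simplified NumPy sketch of threshold-based community detection. It is not the library's implementation, which is optimized for large inputs. The function name `find_communities`, the toy vectors, and the parameter values are assumptions for illustration.

```python
import numpy as np


def find_communities(embeddings, threshold=0.75, min_community_size=2):
    """Return lists of indices whose pairwise cosine similarity to a seed is >= threshold."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T  # cosine similarity of every pair

    # Each row proposes one candidate community: all items similar enough to it.
    candidates = []
    for row in sim:
        members = np.flatnonzero(row >= threshold)
        if len(members) >= min_community_size:
            members = members[np.argsort(-row[members], kind="stable")]
            candidates.append(members.tolist())

    # Largest candidates first, then drop items already claimed by a bigger community.
    candidates.sort(key=len, reverse=True)
    communities, assigned = [], set()
    for members in candidates:
        fresh = [i for i in members if i not in assigned]
        if len(fresh) >= min_community_size:
            communities.append(fresh)
            assigned.update(fresh)
    return communities


# Assumption: toy vectors replace real sentence embeddings for this sketch.
toy_embeddings = [[1, 0], [0.98, 0.2], [0.95, 0.3], [0, 1], [0.1, 0.99], [-1, -1]]
communities = find_communities(toy_embeddings, threshold=0.9, min_community_size=2)
print(communities)
```

The last vector is not similar to anything else, so it ends up in no community. Raising `min_community_size` or the threshold would discard more items in the same way.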

We apply it to the Quora Duplicate Questions dataset, and the output looks something like this:

Cluster 1, #109 Elements
         How do I improve my English speaking?
         How could I improve my English?
         How can I improve my English speaking ability?

Cluster 2, #99 Elements
         Will the decision to demonetize 500 and 1000 rupee notes help to curb black money?
         The decision of Indian Government to demonetize ₹500 and ₹1000 notes? Is Right or wrong?
         What do you think about Modi's new policy on the ban of Rs 500 and Rs 1000 notes?

Cluster 3, #61 Elements
         What are the best way of loose the weight?
         What is the best method of losing weight?
         What are the best simple ways to loose weight?


Cluster 21, #25 Elements
         Why is Saltwater Taffy candy imported in Portugal?
         Why is saltwater taffy candy imported in Brazil?
         Why is Saltwater taffy candy imported in Japan?

Topic Modeling

Topic modeling is the process of discovering topics in a collection of documents.

An example is shown in the following picture, which shows the identified topics in the 20 Newsgroups dataset: (image: 20news)

For each topic, you want to extract the words that describe this topic: (image: 20news)

Sentence-Transformers can be used to identify these topics in a collection of sentences, paragraphs or short documents.
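Once sentences have been assigned to clusters, one simple way to describe each topic is to score words by how often they occur in a topic and how few topics they occur in. The sketch below is a simplified, TF-IDF-style scoring written for illustration; it is an assumption, not the method used by any particular library. The function name `top_words_per_topic`, the toy documents, and the labels are hypothetical, and the labels are assumed to come from clustering the sentence embeddings.

```python
import math
import re
from collections import Counter


def top_words_per_topic(docs, labels, top_n=2):
    """Rank words per topic by term frequency times log(num_topics / topics containing word)."""
    topic_counts = {}
    for doc, label in zip(docs, labels):
        words = re.findall(r"[a-z]+", doc.lower())
        topic_counts.setdefault(label, Counter()).update(words)

    # Number of topics in which each word appears at least once.
    topic_freq = Counter()
    for counts in topic_counts.values():
        topic_freq.update(counts.keys())

    num_topics = len(topic_counts)
    result = {}
    for label, counts in topic_counts.items():
        total = sum(counts.values())
        scores = {
            word: (count / total) * math.log(num_topics / topic_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for a stable result.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[label] = ranked[:top_n]
    return result


docs = [
    "the rocket launch to orbit",
    "the orbit of the rocket",
    "the goalie saved the puck",
    "the puck hit the goalie",
]
labels = [0, 0, 1, 1]  # assumption: produced by clustering the sentence embeddings

topics = top_words_per_topic(docs, labels)
print(topics)
```

Words that appear in every topic, such as "the", score zero and drop out of the description.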

For an excellent tutorial, see Topic Modeling with BERT as well as the repositories Top2Vec and BERTopic.

Image source: Top2Vec: Distributed Representations of Topics