SentenceTransformer based on sentence-transformers/clip-ViT-B-32

This is a sentence-transformers model finetuned from sentence-transformers/clip-ViT-B-32 on the unsplash-lite dataset. It maps text and images to a shared 512-dimensional dense vector space and can be used for retrieval.

Model Details

Model Description

  • Model Type: Sentence Transformer

  • Base model: sentence-transformers/clip-ViT-B-32

  • Maximum Sequence Length: 77 tokens

  • Output Dimensionality: 512 dimensions

  • Similarity Function: Cosine Similarity

  • Supported Modalities: Text, Image

  • Training Dataset: unsplash-lite

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'get_text_features', 'method_output_name': 'pooler_output'}, 'image': {'method': 'get_image_features', 'method_output_name': 'pooler_output'}}, 'module_output_name': 'sentence_embedding', 'architecture': 'CLIPModel'})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
queries = [
    'assets/image_0.jpg',
]
documents = [
    'assets/image_1.jpg',
    'assets/image_2.jpg',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 512) (2, 512)

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[ 0.0571, -0.0265]])
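
The similarity function listed under Model Description is cosine similarity. As a minimal sketch of that computation, the NumPy snippet below scores one query against two documents. The random arrays are stand-ins for real embeddings, and `cosine_similarity_matrix` is a hypothetical helper written for illustration, not a library function.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return cosine similarities between every row of a and every row of b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


rng = np.random.default_rng(0)
# Stand-ins for the outputs of model.encode_query(...) and model.encode_document(...).
query_embeddings = rng.normal(size=(1, 512))
document_embeddings = rng.normal(size=(2, 512))

similarities = cosine_similarity_matrix(query_embeddings, document_embeddings)
ranking = np.argsort(-similarities, axis=1)  # document indices, best match first
print(similarities.shape)
print(ranking)
```

For retrieval, the document with the highest score in each row is the top match for that query.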

Evaluation

Metrics

Information Retrieval

Metric                Value
cosine_accuracy@1     0.63
cosine_accuracy@3     0.83
cosine_accuracy@5     0.89
cosine_accuracy@10    0.91
cosine_precision@1    0.63
cosine_precision@3    0.2767
cosine_precision@5    0.178
cosine_precision@10   0.091
cosine_recall@1       0.63
cosine_recall@3       0.83
cosine_recall@5       0.89
cosine_recall@10      0.91
cosine_ndcg@10        0.7791
cosine_mrr@10         0.7358
cosine_map@100        0.7392
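
In the table, accuracy@k equals recall@k at every cutoff, which is what one would expect if each query has exactly one relevant document. Under that assumption (inferred from the numbers, not stated by the card), precision@k is simply accuracy@k divided by k. The sketch below computes the metrics from the rank of the single relevant document and then checks the reported precision values. `single_relevant_metrics` and the example ranks are hypothetical.

```python
import math


def single_relevant_metrics(ranks, k):
    """Retrieval metrics at cutoff k when each query has one relevant document.

    ranks: 1-based rank of the relevant document for each query.
    """
    n = len(ranks)
    accuracy = sum(r <= k for r in ranks) / n
    return {
        "accuracy": accuracy,
        "precision": accuracy / k,  # at most one relevant item among k retrieved
        "recall": accuracy,         # one relevant item in total
        "mrr": sum(1.0 / r for r in ranks if r <= k) / n,
        # Ideal DCG is 1 with a single relevant item, so DCG needs no normalization.
        "ndcg": sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / n,
    }


# Hypothetical ranks for four queries (not taken from the actual evaluation).
example = single_relevant_metrics([1, 2, 1, 7], k=3)
print(example)

# Consistency check against the reported table: precision@k = accuracy@k / k.
reported_accuracy = {1: 0.63, 3: 0.83, 5: 0.89, 10: 0.91}
derived_precision = {k: round(acc / k, 4) for k, acc in reported_accuracy.items()}
print(derived_precision)
```

The derived values (0.63, 0.2767, 0.178, 0.091) match the cosine_precision rows above.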

Training Details

Training Dataset

unsplash-lite

  • Dataset: unsplash-lite at 3afcfc7

  • Size: 24,896 training samples

  • Columns: image and caption

  • Approximate statistics based on the first 100 samples:

                image          caption
    type        image          string
    modality    image          text
    min         640x340 px     2 tokens
    mean        640x679 px     25.48 tokens
    max         640x1138 px    73 tokens

  • Samples (captions; the paired images are omitted):

    • lighthouse, beach, moody, outdoors, water, wave, beacon, ocean, sea, stormy, crashing, porthcawl, sea waves, tower, nature, swell, building, architecture, surf

    • experimental, person, human, bubble

    • nature, outdoors, night, aurora

  • Loss: MultipleNegativesRankingLoss with these parameters:

    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false,
        "directions": [
            "query_to_doc"
        ],
        "partition_mode": "joint",
        "hardness_mode": null,
        "hardness_strength": 0.0
    }
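
With "scale": 20.0 and "similarity_fct": "cos_sim", this loss can be read as a softmax cross-entropy over scaled cosine similarities within a batch: each anchor should score its own positive higher than every other positive in the batch. The NumPy sketch below illustrates only that core idea for the "query_to_doc" direction. It is not the library implementation and ignores the other parameters shown above. Which column acts as anchor is an assumption here (the first column, image).

```python
import numpy as np


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """In-batch softmax cross-entropy over scaled cosine similarities.

    Row i of `anchors` is paired with row i of `positives`; every other row
    of `positives` in the batch acts as a negative for anchor i.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                           # shape (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # target for row i is column i


rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(4, 512))  # stand-ins for image embeddings
caption_embeddings = image_embeddings + 0.1 * rng.normal(size=(4, 512))

aligned_loss = multiple_negatives_ranking_loss(image_embeddings, caption_embeddings)
shuffled_loss = multiple_negatives_ranking_loss(image_embeddings, caption_embeddings[::-1])
print(aligned_loss, shuffled_loss)
```

Correctly paired embeddings give a loss near zero, while mismatched pairs give a large loss, which is the signal that pulls matching images and captions together.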
    

Evaluation Dataset

unsplash-lite

  • Dataset: unsplash-lite at 3afcfc7

  • Size: 100 evaluation samples

  • Columns: image and caption

  • Approximate statistics based on the first 100 samples:

                image          caption
    type        image          string
    modality    image          text
    min         640x301 px     2 tokens
    mean        640x687 px     28.81 tokens
    max         640x1138 px    70 tokens

  • Samples (captions; the paired images are omitted):

    • windbreaker, style, person, female, clothing, fur coat, apparel, raincoat, human

    • nature, backgrounds, outdoors, wallpapers, fog

    • waves, texture, wind wave, cloudy, water housing, blue, barrel, experimental, california, swimming, southern california, bubble, canon, wave, surf, sky, reflection, surfing

  • Loss: MultipleNegativesRankingLoss with these parameters:

    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false,
        "directions": [
            "query_to_doc"
        ],
        "partition_mode": "joint",
        "hardness_mode": null,
        "hardness_strength": 0.0
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 32

  • num_train_epochs: 1

  • learning_rate: 2e-05

  • warmup_steps: 0.1

  • bf16: True

  • per_device_eval_batch_size: 32

  • batch_sampler: no_duplicates
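
The no_duplicates batch sampler matters for a loss that uses in-batch negatives: if the same caption appeared twice in one batch, one copy would be treated as a negative for the image paired with the other. The pure-Python sketch below shows the idea with a greedy grouping. `no_duplicate_batches` and the sample pairs are hypothetical, and this is not the library's sampler.

```python
def no_duplicate_batches(samples, batch_size):
    """Greedily group (image_id, caption) pairs so no value repeats within a batch.

    Pairs that would collide with the current batch are deferred to a later
    one. A trailing partial batch is kept (as with dataloader_drop_last: False).
    """
    remaining = list(samples)
    batches = []
    while remaining:
        batch, seen, deferred = [], set(), []
        for image_id, caption in remaining:
            fits = len(batch) < batch_size
            if fits and image_id not in seen and caption not in seen:
                batch.append((image_id, caption))
                seen.update((image_id, caption))
            else:
                deferred.append((image_id, caption))
        batches.append(batch)
        remaining = deferred
    return batches


# Hypothetical pairs; two images share the same caption.
pairs = [
    ("img0", "nature, fog"),
    ("img1", "nature, fog"),
    ("img2", "surf, wave"),
    ("img3", "aurora, night"),
]
batches = no_duplicate_batches(pairs, batch_size=3)
print(batches)
```

Here the second "nature, fog" pair is pushed into its own later batch, so it never serves as a false negative.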

All Hyperparameters

  • per_device_train_batch_size: 32

  • num_train_epochs: 1

  • max_steps: -1

  • learning_rate: 2e-05

  • lr_scheduler_type: linear

  • lr_scheduler_kwargs: None

  • warmup_steps: 0.1

  • optim: adamw_torch_fused

  • optim_args: None

  • weight_decay: 0.0

  • adam_beta1: 0.9

  • adam_beta2: 0.999

  • adam_epsilon: 1e-08

  • optim_target_modules: None

  • gradient_accumulation_steps: 1

  • average_tokens_across_devices: True

  • max_grad_norm: 1.0

  • label_smoothing_factor: 0.0

  • bf16: True

  • fp16: False

  • bf16_full_eval: False

  • fp16_full_eval: False

  • tf32: None

  • gradient_checkpointing: False

  • gradient_checkpointing_kwargs: None

  • torch_compile: False

  • torch_compile_backend: None

  • torch_compile_mode: None

  • use_liger_kernel: False

  • liger_kernel_config: None

  • use_cache: False

  • neftune_noise_alpha: None

  • torch_empty_cache_steps: None

  • auto_find_batch_size: False

  • log_on_each_node: True

  • logging_nan_inf_filter: True

  • include_num_input_tokens_seen: no

  • log_level: passive

  • log_level_replica: warning

  • disable_tqdm: False

  • project: huggingface

  • trackio_space_id: None

  • trackio_bucket_id: None

  • trackio_static_space_id: None

  • per_device_eval_batch_size: 32

  • prediction_loss_only: True

  • eval_on_start: False

  • eval_do_concat_batches: True

  • eval_use_gather_object: False

  • eval_accumulation_steps: None

  • include_for_metrics: []

  • batch_eval_metrics: False

  • save_only_model: False

  • save_on_each_node: False

  • enable_jit_checkpoint: False

  • push_to_hub: False

  • hub_private_repo: None

  • hub_model_id: None

  • hub_strategy: every_save

  • hub_always_push: False

  • hub_revision: None

  • load_best_model_at_end: False

  • ignore_data_skip: False

  • restore_callback_states_from_checkpoint: False

  • full_determinism: False

  • seed: 42

  • data_seed: None

  • use_cpu: False

  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}

  • parallelism_config: None

  • dataloader_drop_last: False

  • dataloader_num_workers: 0

  • dataloader_pin_memory: True

  • dataloader_persistent_workers: False

  • dataloader_prefetch_factor: None

  • remove_unused_columns: True

  • label_names: None

  • train_sampling_strategy: random

  • length_column_name: length

  • ddp_find_unused_parameters: None

  • ddp_bucket_cap_mb: None

  • ddp_broadcast_buffers: False

  • ddp_static_graph: None

  • ddp_backend: None

  • ddp_timeout: 1800

  • fsdp: []

  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}

  • deepspeed: None

  • debug: []

  • skip_memory_metrics: True

  • do_predict: False

  • resume_from_checkpoint: None

  • warmup_ratio: None

  • local_rank: -1

  • prompts: None

  • batch_sampler: no_duplicates

  • multi_dataset_batch_sampler: proportional

  • router_mapping: {}

  • learning_rate_mapping: {}
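
The combination of learning_rate: 2e-05, lr_scheduler_type: linear and warmup_steps: 0.1 suggests a linear warmup followed by a linear decay to zero. The sketch below works through that schedule under two assumptions that the card does not confirm: that the fractional warmup_steps value is a fraction of the total number of steps (rounded up here), and that training ran on a single device, giving 24,896 / 32 = 778 steps for one epoch. `linear_schedule_lr` is a hypothetical helper.

```python
import math


def linear_schedule_lr(step, total_steps, warmup_steps, base_lr=2e-5):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    decay = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, decay)


total_steps = math.ceil(24_896 / 32)         # 778 optimizer steps for one epoch
warmup_steps = math.ceil(0.1 * total_steps)  # 78, assuming 0.1 is a fraction of total steps

for step in (0, 39, 78, 312, 778):
    print(step, linear_schedule_lr(step, total_steps, warmup_steps))
```

Under these assumptions the learning rate peaks at 2e-05 around step 78 and then falls linearly.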

Training Logs

Epoch    Step   Training Loss   Validation Loss   unsplash-dev_cosine_ndcg@10
-1       -1     -               -                 0.7142
0.0501   39     0.9333          -                 -
0.1003   78     0.5194          0.3630            0.7751
0.1504   117    0.6026          -                 -
0.2005   156    0.5822          0.3062            0.8042
0.2506   195    0.5841          -                 -
0.3008   234    0.5302          0.2984            0.7983
0.3509   273    0.5120          -                 -
0.4010   312    0.5243          0.3266            0.7791
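
The Epoch values in the training log are consistent with 778 steps per epoch, which is 24,896 training samples divided by a batch size of 32. This assumes a single device; gradient_accumulation_steps is 1 according to the hyperparameters. A short check of that arithmetic:

```python
steps_per_epoch = 24_896 // 32  # 778; the division is exact
logged_steps = [39, 78, 117, 156, 195, 234, 273, 312]

# Fraction of an epoch completed at each logged step, rounded as in the log.
epoch_fractions = [round(step / steps_per_epoch, 4) for step in logged_steps]
for step, fraction in zip(logged_steps, epoch_fractions):
    print(step, fraction)
```

The results reproduce the logged Epoch values, from 0.0501 at step 39 to 0.4010 at step 312, the last logged row.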

Training Time

  • Training: 12.0 minutes

Framework Versions

  • Python: 3.11.6

  • Sentence Transformers: 5.6.0.dev0

  • Transformers: 5.8.0.dev0

  • PyTorch: 2.10.0+cu128

  • Accelerate: 1.13.0.dev0

  • Datasets: 4.8.4

  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{oord2019representationlearningcontrastivepredictive,
      title={Representation Learning with Contrastive Predictive Coding},
      author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
      year={2019},
      eprint={1807.03748},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1807.03748},
}