SentenceTransformer based on sentence-transformers/clip-ViT-B-32
This is a sentence-transformers model finetuned from sentence-transformers/clip-ViT-B-32 on the unsplash-lite dataset. It maps text and images to a shared 512-dimensional dense vector space and can be used for retrieval.
Model Details
Model Description
Model Type: Sentence Transformer
Base model: sentence-transformers/clip-ViT-B-32
Maximum Sequence Length: 77 tokens
Output Dimensionality: 512 dimensions
Similarity Function: Cosine Similarity
Supported Modalities: Text, Image
Training Dataset: unsplash-lite
Model Sources
Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'get_text_features', 'method_output_name': 'pooler_output'}, 'image': {'method': 'get_image_features', 'method_output_name': 'pooler_output'}}, 'module_output_name': 'sentence_embedding', 'architecture': 'CLIPModel'})
)
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
queries = [
'assets/image_0.jpg',
]
documents = [
'assets/image_1.jpg',
'assets/image_2.jpg',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 512) (2, 512)
# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[ 0.0571, -0.0265]])
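The similarity call above relies on cosine similarity, the function listed under Model Description. As a rough sketch of what that score measures, the snippet below computes it by hand in plain Python. The helper names and the toy 3-dimensional vectors are made up for illustration; they are not model outputs or part of the library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(queries, documents):
    """One row per query, one column per document."""
    return [[cosine_similarity(q, d) for d in documents] for q in queries]

# Toy vectors standing in for the 512-dimensional embeddings (hypothetical values).
toy_queries = [[1.0, 0.0, 0.0]]
toy_documents = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]

scores = similarity_matrix(toy_queries, toy_documents)
print(scores)  # [[1.0, 0.0]]
```

Because the score depends only on direction, the first document scores 1.0 despite having twice the query's length, while the orthogonal second document scores 0.0.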
Evaluation
Metrics
Information Retrieval
Dataset: unsplash-dev
Evaluated with InformationRetrievalEvaluator
| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.63 |
| cosine_accuracy@3 | 0.83 |
| cosine_accuracy@5 | 0.89 |
| cosine_accuracy@10 | 0.91 |
| cosine_precision@1 | 0.63 |
| cosine_precision@3 | 0.2767 |
| cosine_precision@5 | 0.178 |
| cosine_precision@10 | 0.091 |
| cosine_recall@1 | 0.63 |
| cosine_recall@3 | 0.83 |
| cosine_recall@5 | 0.89 |
| cosine_recall@10 | 0.91 |
| cosine_ndcg@10 | 0.7791 |
| cosine_mrr@10 | 0.7358 |
| cosine_map@100 | 0.7392 |
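In the table above, recall@k equals accuracy@k and precision@k equals accuracy@k divided by k (for example 0.83 / 3 ≈ 0.2767). That pattern is what one would expect if every query has exactly one relevant document, which is an assumption here rather than something the card states. A minimal sketch of these metrics under that assumption, with a hypothetical helper name and made-up ranks:

```python
def metrics_at_k(ranks, k):
    """Compute top-k retrieval metrics.

    ranks: for each query, the 1-based rank of its single relevant
    document, or None if it was not retrieved at all.
    """
    n = len(ranks)
    hit_ranks = [r for r in ranks if r is not None and r <= k]
    hits = len(hit_ranks)
    return {
        "accuracy": hits / n,                       # queries with a hit in the top k
        "precision": hits / (n * k),                # one relevant item among k slots
        "recall": hits / n,                         # one relevant item per query
        "mrr": sum(1.0 / r for r in hit_ranks) / n, # mean reciprocal rank within top k
    }

# Hypothetical ranks for four queries (not taken from the evaluation above).
example = metrics_at_k([1, 2, 4, None], k=3)
print(example["accuracy"], example["recall"], example["mrr"])  # 0.5 0.5 0.375

# Consistency check against the reported table values.
print(round(0.83 / 3, 4), round(0.89 / 5, 4), round(0.91 / 10, 4))  # 0.2767 0.178 0.091
```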
Training Details
Training Dataset
unsplash-lite
Dataset: unsplash-lite at 3afcfc7
Size: 24,896 training samples
Columns: image and caption
Approximate statistics based on the first 100 samples:
| | image | caption |
|---|---|---|
| type | image | string |
| modality | image | text |
| details | min: 640x340 px, mean: 640x679 px, max: 640x1138 px | min: 2 tokens, mean: 25.48 tokens, max: 73 tokens |
Samples:
| image | caption |
|---|---|
| | lighthouse, beach, moody, outdoors, water, wave, beacon, ocean, sea, stormy, crashing, porthcawl, sea waves, tower, nature, swell, building, architecture, surf |
| | experimental, person, human, bubble |
| | nature, outdoors, night, aurora |
Loss: MultipleNegativesRankingLoss with these parameters:
{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "gather_across_devices": false,
    "directions": [ "query_to_doc" ],
    "partition_mode": "joint",
    "hardness_mode": null,
    "hardness_strength": 0.0
}
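The loss parameters above describe an in-batch contrastive objective: each anchor is scored against every candidate in the batch with cosine similarity, the scores are multiplied by the scale (20.0), and cross-entropy pushes the matching pair to the top. The following is a simplified plain-Python sketch of that idea, not the library implementation. It covers only the single query_to_doc direction and ignores the gather_across_devices, partition_mode, and hardness options. The function names and toy embeddings are hypothetical.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(v - top) for v in logits))
        total += log_z - logits[i]
    return total / len(anchors)

# Toy 2-dimensional embeddings, made up for illustration.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each anchor matches the other positive

print(mnr_loss(anchors, aligned) < mnr_loss(anchors, swapped))  # True
```

With the scale at 20, a correctly aligned batch yields a loss close to zero, while the swapped batch yields a loss close to 20, which shows how strongly the scale sharpens the softmax over in-batch candidates.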
Evaluation Dataset
unsplash-lite
Dataset: unsplash-lite at 3afcfc7
Size: 100 evaluation samples
Columns: image and caption
Approximate statistics based on the first 100 samples:
| | image | caption |
|---|---|---|
| type | image | string |
| modality | image | text |
| details | min: 640x301 px, mean: 640x687 px, max: 640x1138 px | min: 2 tokens, mean: 28.81 tokens, max: 70 tokens |
Samples:
| image | caption |
|---|---|
| | windbreaker, style, person, female, clothing, fur coat, apparel, raincoat, human |
| | nature, backgrounds, outdoors, wallpapers, fog |
| | waves, texture, wind wave, cloudy, water housing, blue, barrel, experimental, california, swimming, southern california, bubble, canon, wave, surf, sky, reflection, surfing |
Loss: MultipleNegativesRankingLoss with these parameters:
{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "gather_across_devices": false,
    "directions": [ "query_to_doc" ],
    "partition_mode": "joint",
    "hardness_mode": null,
    "hardness_strength": 0.0
}
Training Hyperparameters
Non-Default Hyperparameters
- per_device_train_batch_size: 32
- num_train_epochs: 1
- learning_rate: 2e-05
- warmup_steps: 0.1
- bf16: True
- per_device_eval_batch_size: 32
- batch_sampler: no_duplicates
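To make the interaction between learning_rate, warmup_steps, and the linear scheduler (lr_scheduler_type: linear in the full list) more concrete, here is a small sketch of a linear warmup followed by linear decay. It rests on several assumptions that the card does not confirm: that a warmup_steps value below 1 is treated as a fraction of the total steps, that one full epoch is 24,896 samples at batch size 32 on a single device, and that the schedule decays to zero at the final step. The function name is hypothetical.

```python
import math

def linear_schedule_lr(step, total_steps, warmup_steps, peak_lr=2e-05):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

# Assumed: one epoch, single device, last partial batch kept.
total_steps = math.ceil(24896 / 32)
# Assumed: 0.1 is read as a fraction of total steps, rounded up.
warmup_steps = math.ceil(0.1 * total_steps)
print(total_steps, warmup_steps)  # 778 78

for step in (0, 39, 78, 312, 778):
    print(step, linear_schedule_lr(step, total_steps, warmup_steps))
```

Under these assumptions the learning rate reaches its peak of 2e-05 at step 78 and returns to zero at step 778.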
All Hyperparameters
- per_device_train_batch_size: 32
- num_train_epochs: 1
- max_steps: -1
- learning_rate: 2e-05
- lr_scheduler_type: linear
- lr_scheduler_kwargs: None
- warmup_steps: 0.1
- optim: adamw_torch_fused
- optim_args: None
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- optim_target_modules: None
- gradient_accumulation_steps: 1
- average_tokens_across_devices: True
- max_grad_norm: 1.0
- label_smoothing_factor: 0.0
- bf16: True
- fp16: False
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- use_liger_kernel: False
- liger_kernel_config: None
- use_cache: False
- neftune_noise_alpha: None
- torch_empty_cache_steps: None
- auto_find_batch_size: False
- log_on_each_node: True
- logging_nan_inf_filter: True
- include_num_input_tokens_seen: no
- log_level: passive
- log_level_replica: warning
- disable_tqdm: False
- project: huggingface
- trackio_space_id: None
- trackio_bucket_id: None
- trackio_static_space_id: None
- per_device_eval_batch_size: 32
- prediction_loss_only: True
- eval_on_start: False
- eval_do_concat_batches: True
- eval_use_gather_object: False
- eval_accumulation_steps: None
- include_for_metrics: []
- batch_eval_metrics: False
- save_only_model: False
- save_on_each_node: False
- enable_jit_checkpoint: False
- push_to_hub: False
- hub_private_repo: None
- hub_model_id: None
- hub_strategy: every_save
- hub_always_push: False
- hub_revision: None
- load_best_model_at_end: False
- ignore_data_skip: False
- restore_callback_states_from_checkpoint: False
- full_determinism: False
- seed: 42
- data_seed: None
- use_cpu: False
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- dataloader_prefetch_factor: None
- remove_unused_columns: True
- label_names: None
- train_sampling_strategy: random
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- ddp_static_graph: None
- ddp_backend: None
- ddp_timeout: 1800
- fsdp: []
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- deepspeed: None
- debug: []
- skip_memory_metrics: True
- do_predict: False
- resume_from_checkpoint: None
- warmup_ratio: None
- local_rank: -1
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional
- router_mapping: {}
- learning_rate_mapping: {}
Training Logs
| Epoch | Step | Training Loss | Validation Loss | unsplash-dev_cosine_ndcg@10 |
|---|---|---|---|---|
| -1 | -1 | - | - | 0.7142 |
| 0.0501 | 39 | 0.9333 | - | - |
| 0.1003 | 78 | 0.5194 | 0.3630 | 0.7751 |
| 0.1504 | 117 | 0.6026 | - | - |
| 0.2005 | 156 | 0.5822 | 0.3062 | 0.8042 |
| 0.2506 | 195 | 0.5841 | - | - |
| 0.3008 | 234 | 0.5302 | 0.2984 | 0.7983 |
| 0.3509 | 273 | 0.5120 | - | - |
| 0.4010 | 312 | 0.5243 | 0.3266 | 0.7791 |
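The Epoch values in the log above can be reproduced from the Step values, the training set size (24,896 samples), and the per-device batch size (32). The short check below assumes a single device and that the last partial batch is kept (dataloader_drop_last: False), which gives 778 steps per epoch; the variable names are illustrative only.

```python
import math

train_samples = 24896
batch_size = 32
steps_per_epoch = math.ceil(train_samples / batch_size)
print(steps_per_epoch)  # 778

# Steps at which the training loss was logged.
logged_steps = [39, 78, 117, 156, 195, 234, 273, 312]
epochs = [round(step / steps_per_epoch, 4) for step in logged_steps]
print(epochs)
# [0.0501, 0.1003, 0.1504, 0.2005, 0.2506, 0.3008, 0.3509, 0.401]
```

The computed fractions match the logged Epoch column, so the last logged row (step 312) corresponds to roughly 40% of one pass over the training data.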
Training Time
Training: 12.0 minutes
Framework Versions
Python: 3.11.6
Sentence Transformers: 5.6.0.dev0
Transformers: 5.8.0.dev0
PyTorch: 2.10.0+cu128
Accelerate: 1.13.0.dev0
Datasets: 4.8.4
Tokenizers: 0.22.2
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MultipleNegativesRankingLoss
@misc{oord2019representationlearningcontrastivepredictive,
title={Representation Learning with Contrastive Predictive Coding},
author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
year={2019},
eprint={1807.03748},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1807.03748},
}