SentenceTransformer based on sentence-transformers/all-distilroberta-v1

This is a sentence-transformers model fine-tuned from sentence-transformers/all-distilroberta-v1. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: sentence-transformers/all-distilroberta-v1
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
HF China mirror site: Sentence Transformers on the HF China mirror site

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
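The Pooling and Normalize stages listed above can be illustrated with a small sketch. This is a simplified pure-Python stand-in, not the library's implementation: the function names `mean_pool` and `l2_normalize` and the toy 2-dimensional vectors are assumptions made for illustration, whereas the real model works on 768-dimensional token embeddings.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in total]

def l2_normalize(vec):
    """Scale a vector to unit length (the role of the Normalize step)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

# Toy input (hypothetical): three token vectors, the last one is padding.
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_embedding = l2_normalize(mean_pool(tokens, mask))
```

Because the output has unit length, the cosine similarity between two such embeddings reduces to their dot product.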

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("krshahvivek/distilroberta-ai-job-embeddings")
# Run inference
sentences = [
    'Senior Data Scientist, Statistical Analysis, Data Interpretation, TS/SCI Clearance',
    'experience to solve some of the most challenging intelligence issues around data.\n\nJob Responsibilities & Duties\n\nDevise strategies for extracting meaning and value from large datasets. Make and communicate principled conclusions from data using elements of mathematics, statistics, computer science, and application specific knowledge. Through analytic modeling, statistical analysis, programming, and/or another appropriate scientific method, develop and implement qualitative and quantitative methods for characterizing, exploring, and assessing large datasets in various states of organization, cleanliness, and structure that account for the unique features and limitations inherent in data holdings. Translate practical needs and analytic questions related to large datasets into technical requirements and, conversely, assist others with drawing appropriate conclusions from the analysis of such data. Effectively communicate complex technical information to non-technical audiences.\n\nMinimum Qualifications\n\n10 years relevant experience with Bachelors in related field; or 8 years experience with Masters in related field; or 6 years experience with a Doctoral degree in a related field; or 12 years of relevant experience and an Associates may be considered for individuals with in-depth experienceDegree in an Mathematics, Applied Mathematics, Statistics, Applied Statistics, Machine Learning, Data Science, Operations Research, or Computer Science, or related field of technical rigorAbility/willingness to work full-time onsite in secure government workspacesNote: A broader range of degrees will be considered if accompanied by a Certificate in Data Science from an accredited college/university.\n\nClearance Requirements\n\nThis position requires a TS/SCI with Poly\n\nLooking for other great opportunities? Check out Two Six Technologies Opportunities for all our Company’s current openings!\n\nReady to make the first move towards growing your career? If so, check out the Two Six Technologies Candidate Journey! This will give you step-by-step directions on applying, what to expect during the application process, information about our rich benefits and perks along with our most frequently asked questions. If you are undecided and would like to learn more about us and how we are contributing to essential missions, check out our Two Six Technologies News page! We share information about the tech world around us and how we are making an impact! Still have questions, no worries! You can reach us at Contact Two Six Technologies. We are happy to connect and cover the information needed to assist you in reaching your next career milestone.\n\nTwo Six Technologies is \n\nIf you are an individual with a disability and would like to request reasonable workplace accommodation for any part of our employment process, please send an email to [email protected]. Information provided will be kept confidential and used only to the extent required to provide needed reasonable accommodations.\n\nAdditionally, please be advised that this business uses E-Verify in its hiring practices.\n\n\n\nBy submitting the following application, I hereby certify that to the best of my knowledge, the information provided is true and accurate.',
    'Skills :8+ years of relevant experienceExperience with big data technology(s) or ecosystem in Hadoop, HDFS (also an understanding of HDFS Architecture), Hive, Map Reduce, Base - this is considering all of AMP datasets are in HDFS/S3Advanced SQL and SQL performance tuningStrong experience in Spark and Scala',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
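One way to use such similarity scores is to rank job descriptions against a short skills query. The sketch below shows only the ranking step on hand-made toy vectors that stand in for `model.encode` output. The helper names `cosine` and `rank_by_cosine` are hypothetical, and the vectors are assumed to be non-zero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins for one query embedding and three document embeddings.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_by_cosine(query, docs)
best_index = ranking[0][0]
```

With real embeddings, the same ordering could be read from one row of the matrix returned by `model.similarity`.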

Evaluation

Metrics

Triplet

Datasets: ai-job-validation and ai-job-test
Evaluated with TripletEvaluator

Metric	ai-job-validation	ai-job-test
cosine_accuracy	0.9901	1.0
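As a rough illustration of what a triplet cosine accuracy measures, the sketch below counts the fraction of (anchor, positive, negative) triplets in which the anchor is more similar to the positive than to the negative. It assumes a margin of zero and uses made-up 2-dimensional vectors. The function `triplet_cosine_accuracy` is a hypothetical helper, not the TripletEvaluator code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_cosine_accuracy(triplets):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(1 for a, p, n in triplets if cosine(a, p) > cosine(a, n))
    return correct / len(triplets)

# Toy triplets (hypothetical data): (anchor, positive, negative).
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # positive is closer: counted
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),   # positive is closer: counted
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),  # negative is closer: not counted
    ([1.0, 0.0], [1.0, 0.2], [-1.0, 0.0]),  # positive is closer: counted
]
accuracy = triplet_cosine_accuracy(toy_triplets)
```

On this toy data three of the four triplets are ordered correctly, so the accuracy is 0.75.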

Training Details

Training Dataset

Unnamed Dataset

Size: 809 training samples
Columns: sentence_0 and sentence_1
Approximate statistics based on the first 809 samples:
	sentence_0	sentence_1
type	string	string
details	min: 8 tokens mean: 15.02 tokens max: 40 tokens	min: 7 tokens mean: 348.14 tokens max: 512 tokens

Samples:

sentence_0	sentence_1
`GCP Data Engineer, BigQuery, Airflow DAG, Hadoop ecosystem`	requirements for our direct client, please go through the below Job Description. If you are interested please send me your updated word format resume to [email protected] and reach me @ 520-231-4672. Title: GCP Data EngineerLocation: Hartford, CTDuration: Full Time 6-8 Years of experience in data extraction and creating data pipeline workflows on Bigdata (Hive, HQL/PySpark) with knowledge of Data Engineering concepts.Experience in analyzing large data sets from multiple data sources, perform validation of data.Knowledge of Hadoop eco-system components like HDFS, Spark, Hive, Sqoop.Experience writing codes in Python.Knowledge of SQL/HQL to write optimized queries.Hands on with GCP Cloud Services such as Big Query, Airflow DAG, Dataflow, Beam etc.
`Data analysis for legal documents, meticulous data entry, active Top-Secret security clearance`	Requirements NOTE: Applicants with an Active TS Clearance preferred Requirements * High School diploma or GED, Undergraduate degree preferred Ability to grasp and understand the organization and functions of the customer Meticulous data entry skills Excellent communication skills; oral and written Competence to review, interpret, and evaluate complex legal and non-legal documents Attention to detail and the ability to read and follow directions is extremely important Strong organizational and prioritization skills Experience with the Microsoft Office suite of applications (Excel, PowerPoint, Word) and other common software applications, to include databases, intermediate skills preferred Proven commitment and competence to provide excellent customer service; positive and flexible Ability to work in a team environment and maintain a professional dispositionThis position requires U.S. Citizenship and a 7 (or 10) year minimum background investigation ** NOTE: The 20% pay differential is d...
`Trust & Safety, Generative AI, Recommender Systems`	experiences achieve more in their careers. Our vision is to create economic opportunity for every member of the global workforce. Every day our members use our products to make connections, discover opportunities, build skills and gain insights. We believe amazing things happen when we work together in an environment where everyone feels a true sense of belonging, and that what matters most in a candidate is having the skills needed to succeed. It inspires us to invest in our talent and support career growth. Join us to challenge yourself with work that matters. Location: At LinkedIn, we trust each other to do our best work where it works best for us and our teams. This role offers a hybrid work option, meaning you can work from home and commute to a LinkedIn office, depending on what’s best for you and when it is important for your team to be together. This role is based in Sunnyvale, CA. Team Information: The mission of the Anti-Abuse AI team is to build trust in every inte...

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
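The idea behind this loss, with the parameters above, can be sketched in plain Python: every anchor is scored against all positives in the batch using scaled cosine similarity, and cross-entropy pushes the matching pair above the others. This is a simplified illustration under those assumptions. `mnr_loss` is a hypothetical name, and the toy vectors stand in for model embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Scaled similarities to every positive in the batch act as logits.
        logits = [scale * cosine(anchor, p) for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two pairs (hypothetical data).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])  # each anchor matches its positive
swapped = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])  # each anchor matches the wrong positive
```

The aligned batch gives a loss near zero, while the swapped batch gives a loss near the scale value of 20. With a per-device batch size of 2, as listed under the hyperparameters, each anchor would be contrasted with a single in-batch negative.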

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 2
per_device_eval_batch_size: 2
num_train_epochs: 2
multi_dataset_batch_sampler: round_robin

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 2
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin

Training Logs

Epoch	Step	Training Loss	ai-job-validation_cosine_accuracy	ai-job-test_cosine_accuracy
-1	-1	-	0.8812	-
1.0	405	-	0.9901	-
1.2346	500	0.07	-	-
2.0	810	-	0.9901	-
-1	-1	-	0.9901	1.0
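The step counts in this table follow from the dataset size and batch size reported earlier. The arithmetic below is a quick check that assumes a single device, along with the listed gradient_accumulation_steps of 1 and dataloader_drop_last of False.

```python
import math

train_samples = 809  # "Size: 809 training samples"
batch_size = 2       # per_device_train_batch_size
epochs = 2           # num_train_epochs

# The last batch of each epoch is partial because dataloader_drop_last is False.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Fractional epoch at which the training loss was logged (step 500).
epoch_at_step_500 = round(500 / steps_per_epoch, 4)
```

This gives 405 steps per epoch and 810 steps in total, and places step 500 at epoch 1.2346, matching the logged rows.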

Framework Versions

Python: 3.10.12
Sentence Transformers: 3.4.1
Transformers: 4.48.3
PyTorch: 2.6.0+cu124
Accelerate: 1.3.0
Datasets: 3.2.0
Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
