Commit 78a1cf9 · Parent: e3a68be

[up]: training section

Files changed:
- README.md (+141 -17)
- calculate_metrics.py (+0 -0)
- test_query_parser.py (+0 -0)
README.md (CHANGED):

base_model: tiiuae/falcon-7b-instruct
license: apache-2.0
language:
- en
pipeline_tag: text-generation
datasets:
- EmbeddingStudio/query-parsing-instructions-falcon
tags:
- search-queries
- instruct-fine-tuned
### Training Data

We used synthetically generated query parsing instructions:

* We generated lists of possible filters for 63 customer categories:
  * [Raw version of filters dataset](https://huggingface.co/datasets/EmbeddingStudio/synthetic-search-filters-raw)
  * [Split by representations](https://huggingface.co/datasets/EmbeddingStudio/synthetic-search-filters)
* We randomly selected up to 150 combinations of filters (1-3 filters per combination) per category, such that each filter representation appears at most twice.
* For a given category and combination we [generated](https://huggingface.co/datasets/EmbeddingStudio/synthetic-search-queries) with GPT-4 Turbo:
  * 2 search queries and their parsed versions with unstructured parts.
  * 2 search queries and their parsed versions without an unstructured part.
* Using these filters, queries, and parsed versions we prepared [72.5k Falcon-format instructions](https://huggingface.co/datasets/EmbeddingStudio/query-parsing-instructions-falcon).
**Warning:** The EmbeddingStudio team advises you that the generated queries **were not fully curated**; they will be curated once we finish our product-market-fit stage.
#### Principles of train / test splitting

As we are fine-tuning the LLM to follow zero-shot query parsing instructions, we want to test:

* The ability to work well with an unseen domain
* The ability to work well with unseen filters
* The ability to work well with unseen queries

For these purposes we:

1. Put 5 categories into the test split, completely separated from train: `Telecommunication Companies, Legal Services, Enterprise Software Development, Artificial Intelligence and Machine Learning, Documentation and Knowledge Sharing`.
2. For each company category appearing in train, set aside / removed one filter and the queries related to it.
3. Selected 5% of the remaining queries and put them into test.
#### Filters generation details

We used GPT-4 Turbo to generate several possible filters for 63 company categories. For each filter we also generated some possible representations. For example, the filter `Date` can be represented as `dd/mm/YYYY`, as `YYYY-mm-dd`, in words like `2024 Jan 17`, etc.
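As an illustration (the filter name and field layout below are our assumption, not the actual dataset schema), a generated filter with several representations might look like:

```python
# Hypothetical example of a generated filter and its representations
# (illustrative only -- not taken from the actual dataset).
date_filter = {
    "name": "Date",
    "representations": [
        {"name": "date-iso", "type": "string", "pattern": "YYYY-mm-dd"},
        {"name": "date-european", "type": "string", "pattern": "dd/mm/YYYY"},
        {"name": "date-words", "type": "string", "examples": ["2024 Jan 17"]},
    ],
}

# Each representation is one surface form the parser must recognize.
pattern_names = [r["pattern"] for r in date_filter["representations"] if "pattern" in r]
print(pattern_names)  # ['YYYY-mm-dd', 'dd/mm/YYYY']
```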
#### Queries generation details

We also used GPT-4 Turbo to generate search queries and their parsed versions. The main principles were:

* If the passed schema does not contain a possible filter, neither the query nor its parsed version should invent one.
* If a selected representation combination contains an enumeration, we ask GPT-4 Turbo to map values between the search query and the parsed version.
* If a selected representation combination contains a pattern, we ask GPT-4 Turbo to align with that pattern.
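For instance (hypothetical filter and enumeration values, not the dataset's actual contents), the enumeration-mapping principle means a query word like "cheap" should be mapped onto the corresponding enum value in the parsed version:

```python
# Hypothetical query / parsed-version pair with an enumeration mapping
# (field names and enum values are our illustration, not the dataset schema).
ENUM_VALUES = {"cheap": "LOW", "moderate": "MID", "expensive": "HIGH"}

query = "cheap family restaurants open late"
parsed = {
    "Price": {"value": ENUM_VALUES["cheap"]},   # query wording mapped to enum value
    "Unstructured": "family restaurants open late",
}
print(parsed["Price"]["value"])  # LOW
```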
#### Instructions generation details

For instruction generation we used the following ideas:

1. A zero-shot query parser should be schema agnostic. Cases like `snake_case, CamelCase, http-headers-like` should not break the generation process.
2. A zero-shot query parser should be insensitive to spelling errors.
3. Training instructions should be in the following order:
   * Category
   * Schema
   * Query
This ordering means the LLM can be used efficiently: the embedding of the category -> schema part can be generated once and reused, so inference will be faster.

We assume that the term `schema agnostic` means something wider: being able to work not only with JSON, but also with HTML, Markdown, YAML, etc. We are working on it.
Our approach to achieving these abilities was:

1. For each query we generated a version with a mistake.
2. We added to each parsed version an additional field `Correct`, which contains the corrected version of the search query.
3. For each query we randomly selected and used a letter case for schema fields and a case for filter and representation names.
4. For each query we additionally generated two instructions:
   * One where we removed one filter from the provided schema and the parsed version
   * One where we removed all related filters from the provided schema and the parsed version

**Warning:** The EmbeddingStudio team asks you to curate the datasets carefully on your own.
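Under the Category -> Schema -> Query ordering, a single training example might be assembled like this (the prompt template, helper name, and field layout are our illustration, not the exact dataset format):

```python
# Sketch of assembling a Category -> Schema -> Query instruction
# (template is an assumption; see the dataset for the real format).
import json

def build_instruction(category: str, schema: dict, query: str) -> str:
    # Category and schema come first so this prefix can be encoded
    # once and reused across many queries for the same company.
    return (
        f"Category: {category}\n"
        f"Schema: {json.dumps(schema)}\n"
        f"Query: {query}"
    )

schema = {"Date": ["dd/mm/YYYY", "YYYY-mm-dd"]}
# The query deliberately contains a spelling mistake ("contrcts").
prompt = build_instruction("Legal Services", schema, "contrcts signed 2023-05-12")

# The parsed target carries a `Correct` field with the fixed spelling.
target = {
    "Date": {"value": "2023-05-12"},
    "Correct": "contracts signed 2023-05-12",
}
print(prompt.splitlines()[0])  # Category: Legal Services
```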
### Training Procedure

1. Mixed precision regime
2. Supervised fine-tuning
3. Three epochs with a cosine scheduler

All details are listed under Training Hyperparameters.
#### Preprocessing [optional]

The preprocessing steps are not detailed in the provided code. Typically, preprocessing involves tokenization, normalization, data augmentation, and handling of special tokens. In this training setup, the tokenizer was configured with `add_prefix_space=True` and `use_fast=False`, which may indicate special considerations for tokenizing certain languages or text formats.
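The tokenizer settings named above could be reproduced roughly as follows. This is a configuration sketch: only `add_prefix_space=True` and `use_fast=False` are stated by this card; the pad-token choice is an assumption.

```python
# Configuration sketch based on the settings named in this section;
# only add_prefix_space and use_fast are confirmed by the card.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "tiiuae/falcon-7b-instruct",
    add_prefix_space=True,  # stated in the card
    use_fast=False,         # stated in the card
)
tokenizer.pad_token = tokenizer.eos_token  # common SFT choice; assumption
```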
#### Training Hyperparameters

| Hyperparameter | Value | Description |
|--------------------------------------|------------------------------|-------------------------------------------------------|
| **Training Regime** | Mixed Precision (bfloat16) | Utilizes bfloat16 for efficient memory usage and training speed. |
| **Model Configuration** | Causal Language Model | Incorporates LoRA (Low-Rank Adaptation) for training efficiency. |
| **Quantization Configuration** | Bits and Bytes (BnB) | Uses settings like `load_in_4bit` and `bnb_4bit_quant_type` for model quantization. |
| **Training Environment** | CUDA-enabled Device | Indicates GPU acceleration for training. |
| **Learning Rate** | 2e-4 | Determines the step size at each iteration while moving toward a minimum of the loss function. |
| **Weight Decay** | 0.001 | Helps in regularizing and preventing overfitting. |
| **Warmup Ratio** | 0.03 | Fraction of total training steps used for the learning rate warmup. |
| **Optimizer** | Paged AdamW (32-bit) | Optimizes the training process with efficient memory usage. |
| **Gradient Accumulation Steps** | 2 | Reduces memory consumption and allows for larger effective batch sizes. |
| **Max Grad Norm** | 0.3 | Maximum norm for the gradients. |
| **LR Scheduler Type** | Cosine | Specifies the learning rate schedule. |
| **PEFT Configurations** | LoraConfig | Details like `lora_alpha`, `lora_dropout`, and `r` for LoRA adaptations. |
| **Training Dataset Segmentation** | Train and Test Sets | Segmentation of the dataset for training and evaluation. |
| **Max Sequence Length** | 1024 | Maximum length of the input sequences. |
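The quantization and PEFT rows above could be expressed with `transformers` and `peft` roughly as follows. Note this is a sketch: the card only names `load_in_4bit`, `bnb_4bit_quant_type`, `lora_alpha`, `lora_dropout`, and `r` without giving their values, so every value below is a placeholder.

```python
# Configuration sketch only -- the numeric values are placeholders,
# not the confirmed training settings from this card.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # parameter named in the table
    bnb_4bit_quant_type="nf4",              # common choice; assumption
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches the bfloat16 regime
)

peft_config = LoraConfig(
    r=16,               # placeholder rank
    lora_alpha=32,      # placeholder
    lora_dropout=0.05,  # placeholder
    task_type="CAUSAL_LM",
)
```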
### Testing Data, Factors & Metrics

#### Metrics
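The commit also adds `calculate_metrics.py` (contents not shown). One plausible way to score a parsed query against ground truth at the filter level is sketched below; this is our illustration, not the repository's actual implementation:

```python
# Sketch of filter-level precision / recall / F1 for one query
# (our illustration -- calculate_metrics.py itself is not shown in the commit).
def filter_level_scores(predicted: set, expected: set):
    tp = len(predicted & expected)  # correctly extracted (filter, value) pairs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

pred = {("Date", "2023-05-12"), ("Price", "LOW")}
gold = {("Date", "2023-05-12"), ("Brand", "Acme")}
p, r, f1 = filter_level_scores(pred, gold)
print(p, r, f1)  # 0.5 0.5 0.5
```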
##### Total metrics

Categories marked `[+]` are the five unseen test domains.

| Category | Recall | Precision | F1 | Accuracy |
| ------------------------------------------------ | ------ | --------- | ----- | -------- |
| Telecommunication Companies [+] | 0.70 | 0.67 | 0.68 | 0.52 |
| Legal Services [+] | 0.80 | 0.74 | 0.77 | 0.63 |
| Enterprise Software Development [+] | 0.78 | 0.71 | 0.74 | 0.59 |
| Artificial Intelligence and Machine Learning [+] | 0.77 | 0.78 | 0.78 | 0.63 |
| Documentation and Knowledge Sharing [+] | 0.68 | 0.65 | 0.66 | 0.50 |
| Educational Institutions | 0.55 | 0.51 | 0.53 | 0.36 |
| Job Recruitment Agencies | 0.58 | 0.51 | 0.54 | 0.37 |
| Banking Services | 0.73 | 0.81 | 0.76 | 0.62 |
| Investment Services | 0.50 | 0.50 | 0.50 | 0.33 |
| Insurance Services | 0.77 | 0.77 | 0.77 | 0.62 |
| Financial Planning and Advisory | 0.65 | 0.67 | 0.66 | 0.49 |
| Credit Services | 0.60 | 0.65 | 0.63 | 0.45 |
| Payment Processing | 0.79 | 0.74 | 0.76 | 0.62 |
| Mortgage and Real Estate Services | 1.00 | 1.00 | 1.00 | 1.00 |
| Taxation Services | 0.52 | 0.57 | 0.54 | 0.37 |
| Risk Management and Compliance | 1.00 | 0.95 | 0.98 | 0.95 |
| Digital and Mobile Banking | 0.72 | 0.71 | 0.71 | 0.55 |
| Retail Stores (Online and Offline) | 0.96 | 0.87 | 0.92 | 0.85 |
| Automotive Dealerships | 0.52 | 0.53 | 0.53 | 0.36 |
| Restaurants and Food Delivery Services | 0.76 | 0.77 | 0.76 | 0.62 |
| Entertainment and Media Platforms | 0.80 | 0.84 | 0.82 | 0.70 |
| Government Services | 0.58 | 0.65 | 0.61 | 0.44 |
| Travelers and Consumers | 0.89 | 0.89 | 0.89 | 0.80 |
| Logistics and Supply Chain Management | 0.56 | 0.59 | 0.58 | 0.41 |
| Customer Support Services | 0.60 | 0.54 | 0.57 | 0.40 |
| Market Research Firms | 0.52 | 0.49 | 0.51 | 0.34 |
| Mobile App Development | 0.81 | 0.79 | 0.80 | 0.67 |
| Game Development | 0.94 | 0.94 | 0.94 | 0.88 |
| Cloud Computing Services | 0.64 | 0.62 | 0.63 | 0.46 |
| Data Analytics and Business Intelligence | 0.63 | 0.61 | 0.62 | 0.45 |
| Cybersecurity Software | 0.54 | 0.59 | 0.57 | 0.39 |
| User Interface/User Experience Design | 0.63 | 0.64 | 0.63 | 0.46 |
| Internet of Things (IoT) Development | 0.89 | 0.71 | 0.79 | 0.65 |
| Project Management Tools | 0.80 | 0.83 | 0.81 | 0.69 |
| Version Control Systems | 0.77 | 0.73 | 0.75 | 0.60 |
| Continuous Integration/Continuous Deployment | 0.85 | 0.83 | 0.84 | 0.72 |
| Issue Tracking and Bug Reporting | 0.64 | 0.62 | 0.63 | 0.46 |
| Collaborative Development Environments | 0.68 | 0.67 | 0.68 | 0.51 |
| Team Communication and Chat Tools | 0.94 | 0.91 | 0.93 | 0.87 |
| Task and Time Management | 0.78 | 0.78 | 0.78 | 0.64 |
| Customer Support and Feedback | 0.88 | 0.82 | 0.85 | 0.74 |
| Cloud-based Development Environments | 0.81 | 0.81 | 0.81 | 0.68 |
| Image Stock Platforms | 0.88 | 0.85 | 0.87 | 0.76 |
| Video Hosting and Portals | 0.86 | 0.88 | 0.87 | 0.77 |
| Social Networks | 0.60 | 0.57 | 0.59 | 0.41 |
| Professional Social Networks | 0.68 | 0.69 | 0.68 | 0.52 |
| Dating Apps | 0.90 | 0.90 | 0.90 | 0.82 |
| Aggregate | 0.73 | 0.72 | 0.73 | 0.59 |
##### Unseen domains metrics

| Category | Recall | Precision | F1 | Accuracy |
| ------------------------------------------------ | ------ | --------- | ----- | -------- |
| Telecommunication Companies [+] | 0.70 | 0.67 | 0.68 | 0.52 |
| Legal Services [+] | 0.80 | 0.74 | 0.77 | 0.63 |
| Enterprise Software Development [+] | 0.78 | 0.71 | 0.74 | 0.59 |
| Artificial Intelligence and Machine Learning [+] | 0.77 | 0.78 | 0.78 | 0.63 |
| Documentation and Knowledge Sharing [+] | 0.68 | 0.65 | 0.66 | 0.50 |
| Aggregate | 0.75 | 0.71 | 0.73 | 0.57 |

### Results
calculate_metrics.py
ADDED (empty file)

test_query_parser.py
ADDED (empty file)