Commit 3fef804 · Parent: e68e2a4

[add]: risks, limitations, recommendations and how to get started sections

README.md (CHANGED)
# Model Card for the Query Parser LLM using Falcon-7B-Instruct

EmbeddingStudio is the [open-source framework](https://github.com/EulerSearch/embedding_studio/tree/main) that allows you to transform a joint "Embedding Model + Vector DB" into a full-cycle search engine: collect clickstream -> improve search experience -> adapt the embedding model, and repeat, out of the box.
### Bias

Again, this model was fine-tuned to follow zero-shot query parsing instructions, so all ethical biases are inherited from the original model.

The model was fine-tuned to work with unknown company domains and filter schemas, but it can perform better on the company categories used in training:

Educational Institutions, Job Recruitment Agencies, Banking Services, Investment Services, Insurance Services, Financial Planning and Advisory, Credit Services, Payment Processing, Mortgage and Real Estate Services, Taxation Services, Risk Management and Compliance, Digital and Mobile Banking, Retail Stores (Online and Offline), Automotive Dealerships, Restaurants and Food Delivery Services, Entertainment and Media Platforms, Government Services, Travelers and Consumers, Logistics and Supply Chain Management, Customer Support Services, Market Research Firms, Mobile App Development, Game Development, Cloud Computing Services, Data Analytics and Business Intelligence, Cybersecurity Software, User Interface/User Experience Design, Internet of Things (IoT) Development, Project Management Tools, Version Control Systems, Continuous Integration/Continuous Deployment, Issue Tracking and Bug Reporting, Collaborative Development Environments, Team Communication and Chat Tools, Task and Time Management, Customer Support and Feedback, Cloud-based Development Environments, Image Stock Platforms, Video Hosting and Portals, Social Networks, Professional Social Networks, Dating Apps
### Risks and Limitations

Known limitations:

1. Can add or remove spaces: `1-2` -> `1 - 2` (a normalization sketch follows this list).
2. Can add extra words: `5` -> `5 years`.
3. Cannot differentiate between `<`, `>`, `=` and their HTML-escaped forms `&lt;`, `&gt;`, `&eq;`.
4. Handles abbreviations poorly.
5. Can add an extra `.0` to floats and integers.
6. Can add an extra `0` to, or drop a `0` from, integers with a character postfix: `10M` -> `1m`.
7. Can hallucinate with integers. For a query like `list of positions exactly 7 openings available`, the result can be `{'Name': 'Job_Type.Exact_Match', 'Value': 'Full Time'}`.
8. The model was fine-tuned with a max sequence length of 1024, so long responses may be truncated and fail to parse as JSON.

The list will be extended in the future.
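
Limitations 1 and 5 concern formatting noise in parsed values and can be partially mitigated in post-processing. A minimal sketch, assuming you clean each parsed `Value`; the `normalize_value` helper is hypothetical, not part of the model or the EmbeddingStudio API:

```python
import re


def normalize_value(value):
    """Best-effort cleanup of known formatting quirks (hypothetical helper)."""
    if isinstance(value, str):
        # Limitation 1: collapse spaces the model may insert around '-'
        value = re.sub(r'\s*-\s*', '-', value)
        # Limitation 5: strip a spurious trailing '.0' from numeric strings
        value = re.sub(r'^(\d+)\.0$', r'\1', value)
    return value


print(normalize_value('1 - 2'))  # -> '1-2'
print(normalize_value('5.0'))    # -> '5'
```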
### Recommendations

1. We used synthetic data for the first version of this model, so we suggest you test this model carefully on your company's domain, even if it is in the list above.
2. Use meaningful names for filters and their representations (see the schema sketch after this list).
3. Provide examples for each representation.
4. Try to be compact; the model was fine-tuned with a max sequence length of 1024.
5. During generation, use a near-greedy strategy with temperature 0.05 (as in the `model.generate` call below).
6. Results will be better if you align your filters schema with the schema format of the training data.
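
To illustrate recommendations 2 and 3, here is a minimal filters-schema sketch in the same format as the full example schema below; the `Delivery_Time` filter and its representations are hypothetical:

```python
# Hypothetical filter: a meaningful name plus examples for each representation
schema_fragment = [
    {
        "Name": "Delivery_Time",
        "Representations": [
            {"Name": "Days", "Type": "int", "Examples": [1, 3, 7]},
            {
                "Name": "Speed_Label",
                "Type": "str",
                "Examples": ["express", "standard"],
                "Enum": ["express", "standard", "economy"],
            },
        ],
    },
]
```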
## How to Get Started with the Model
Use the code below to get started with the model.

```python
MODEL_ID = 'EmbeddingStudio/query-parser-falcon-7b-instruct'
```

Initialize tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    add_prefix_space=True,
    use_fast=False,
)
```

Initialize model:

```python
import torch

from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# LoRA configuration (defined for reference; not used directly below)
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# 4-bit quantization settings (NF4 with double quantization)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

device_map = {"": 0}

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch.float16,
)
```

Use for parsing:

````python
import json
from json import JSONDecodeError

INSTRUCTION_TEMPLATE = """
### System: Master in Query Analysis
### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
#### Category: {0}
#### Schema: ```{1}```
#### Query: {2}
### Response:
"""


def parse(
    query: str,
    company_category: str,
    filter_schema: list,
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer
):
    input_text = INSTRUCTION_TEMPLATE.format(
        company_category,
        json.dumps(filter_schema),
        query
    )
    input_ids = tokenizer.encode(input_text, return_tensors='pt')

    # Generating text (near-greedy: temperature 0.05)
    output = model.generate(
        input_ids.to('cuda'),
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.05,
        pad_token_id=50256
    )
    try:
        parsed = json.loads(tokenizer.decode(output[0], skip_special_tokens=True).split('### Response:\n')[-1])
    except JSONDecodeError:
        parsed = dict()

    return parsed


category = 'Logistics and Supply Chain Management'
# The stray space before '?' is intentional; the "Correct" entry in the output returns the cleaned query
query = 'Which logistics companies in the US have a perfect 5.0 rating ?'
schema = [{"Name": "Customer_Ratings", "Representations": [{"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 3.2, 5.0, "4.5", "Unstructured"]}, {"Name": "Minimum_Rating", "Type": "float", "Examples": [4.0, 3.0, 5.0, "4.5"]}, {"Name": "Star_Rating", "Type": "int", "Examples": [4, 3, 5], "Enum": [1, 2, 3, 4, 5]}]}, {"Name": "Date", "Representations": [{"Name": "Day_Month_Year", "Type": "str", "Examples": ["01.01.2024", "15.06.2023", "31.12.2022", "25.12.2021", "20.07.2024", "15.06.2023"], "Pattern": "dd.mm.YYYY"}, {"Name": "Day_Name", "Type": "str", "Examples": ["Monday", "Wednesday", "Friday", "Thursday", "Monday", "Tuesday"], "Enum": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]}]}, {"Name": "Date_Period", "Representations": [{"Name": "Specific_Period", "Type": "str", "Examples": ["01.01.2024 - 31.01.2024", "01.06.2023 - 30.06.2023", "01.12.2022 - 31.12.2022"], "Pattern": "dd.mm.YYYY - dd.mm.YYYY"}, {"Name": "Month", "Type": "str", "Examples": ["January", "June", "December"], "Enum": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]}, {"Name": "Quarter", "Type": "str", "Examples": ["Q1", "Q2", "Q3"], "Enum": ["Q1", "Q2", "Q3", "Q4"]}, {"Name": "Season", "Type": "str", "Examples": ["Winter", "Summer", "Autumn"], "Enum": ["Winter", "Spring", "Summer", "Autumn"]}]}, {"Name": "Destination_Country", "Representations": [{"Name": "Country_Name", "Type": "str", "Examples": ["United States", "Germany", "China"]}, {"Name": "Country_Code", "Type": "str", "Examples": ["US", "DE", "CN"]}, {"Name": "Country_Abbreviation", "Type": "str", "Examples": ["USA", "GER", "CHN"]}]}]

output = parse(query, category, schema, model, tokenizer)
print(output)

# [out]: [{"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"}, {"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0}, {"Name": "Destination_Country.Country_Code", "Value": "US"}]
````
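
Since the model can hallucinate filters (limitation 7 above), it may be worth discarding results whose names do not occur in your schema. A minimal sketch under that assumption; `known_filter_names` and `keep_known` are hypothetical helpers, not part of the model card:

```python
def known_filter_names(filter_schema: list) -> set:
    # Collect "Filter.Representation" names that exist in the schema
    return {
        f'{f["Name"]}.{rep["Name"]}'
        for f in filter_schema
        for rep in f.get("Representations", [])
    }


def keep_known(parsed: list, filter_schema: list) -> list:
    # Keep schema-backed entries plus the "Correct" spell-checked query echo
    allowed = known_filter_names(filter_schema)
    return [p for p in parsed if p.get("Name") in allowed or p.get("Name") == "Correct"]


print(keep_known(output, schema))
```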
## Training Details