chilly-magician committed on
Commit 3fef804 · 1 Parent(s): e68e2a4

[add]: risks, limitations, recommendations and how to get started sections

Files changed (1): README.md (+127 −5)
README.md CHANGED
@@ -5,8 +5,6 @@ base_model: tiiuae/falcon-7b-instruct
 
 # Model Card for the Query Parser LLM using Falcon-7B-Instruct
 
-[![version](https://img.shields.io/badge/version-0.0.1-red.svg)]()[![Python 3.9](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)![CUDA 11.7.1](https://img.shields.io/badge/CUDA-11.7.1-green.svg)
-
 EmbeddingStudio is the [open-source framework](https://github.com/EulerSearch/embedding_studio/tree/main) that allows you to transform a joint "Embedding Model + Vector DB" into
 a full-cycle search engine: collect clickstream -> improve search experience -> adapt embedding model, and repeat, out of the box.
 
@@ -200,19 +198,143 @@ def parse(
 ### Bias
 
-### Risks
 
-### Recommendations
 
 ## How to Get Started with the Model
 
 Use the code below to get started with the model.
 
-[More Information Needed]
  ## Training Details
 
 
# Model Card for the Query Parser LLM using Falcon-7B-Instruct

EmbeddingStudio is the [open-source framework](https://github.com/EulerSearch/embedding_studio/tree/main) that allows you to transform a joint "Embedding Model + Vector DB" into a full-cycle search engine: collect clickstream -> improve search experience -> adapt embedding model, and repeat, out of the box.
 
### Bias
Again, this model was fine-tuned to follow zero-shot query-parsing instructions,
so all ethical biases of the original model are inherited.

The model was fine-tuned to work with unknown company domains and filter schemas, but it performs better on the company categories used during training:

Educational Institutions, Job Recruitment Agencies, Banking Services, Investment Services, Insurance Services, Financial Planning and Advisory, Credit Services, Payment Processing, Mortgage and Real Estate Services, Taxation Services, Risk Management and Compliance, Digital and Mobile Banking, Retail Stores (Online and Offline), Automotive Dealerships, Restaurants and Food Delivery Services, Entertainment and Media Platforms, Government Services, Travelers and Consumers, Logistics and Supply Chain Management, Customer Support Services, Market Research Firms, Mobile App Development, Game Development, Cloud Computing Services, Data Analytics and Business Intelligence, Cybersecurity Software, User Interface/User Experience Design, Internet of Things (IoT) Development, Project Management Tools, Version Control Systems, Continuous Integration/Continuous Deployment, Issue Tracking and Bug Reporting, Collaborative Development Environments, Team Communication and Chat Tools, Task and Time Management, Customer Support and Feedback, Cloud-based Development Environments, Image Stock Platforms, Video Hosting and Portals, Social Networks, Professional Social Networks, Dating Apps
 
### Risks and Limitations

Known limitations:

1. Can add or remove spaces: `1-2` -> `1 - 2`.
2. Can add extra words: `5` -> `5 years`.
3. Cannot differentiate between `<`, `>`, `=` and their HTML-escaped versions `&lt;`, `&gt;`, `&eq;`.
4. Handles abbreviations poorly.
5. Can append an extra `.0` to floats and integers.
6. Can add or drop a `0` in integers with a character postfix: `10M` -> `1m`.
7. Can hallucinate with integers. For a query like `list of positions exactly 7 openings available` the result can be `{'Name': 'Job_Type.Exact_Match', 'Value': 'Full Time'}`.
8. We fine-tuned this model with a max sequence length of 1024, so the response may be cut off and not JSON-readable.

The list will be extended in the future.
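Limitation 3 above (HTML-escaped comparison operators) can be mitigated by unescaping queries before they reach the model. A minimal pre-processing sketch; the `normalize_query` helper is a hypothetical example, not part of EmbeddingStudio:

```python
import html
import re


def normalize_query(query: str) -> str:
    """Unescape HTML entities so `&lt;`/`&gt;` become `<`/`>`,
    and collapse repeated whitespace before sending the query to the model."""
    unescaped = html.unescape(query)
    return re.sub(r"\s+", " ", unescaped).strip()


print(normalize_query("price &gt;= 100  and rating &lt; 5"))
# -> price >= 100 and rating < 5
```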
 
### Recommendations

1. We used synthetic data for the first version of this model, so test it carefully on your company's domain, even if that domain is in the list above.
2. Use meaningful names for filters and their representations.
3. Provide examples for each representation.
4. Try to be compact; the model was fine-tuned with a max sequence length of 1024.
5. During generation, use a near-greedy strategy with temperature 0.05.
6. Results are better if your filter schema is aligned with the schema type of the training data.
 
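Recommendations 2-4 can be combined: serialize the filter schema without extra whitespace, so more of the 1024-token budget is left for the query. A sketch with a hypothetical `Customer_Ratings` filter, following the schema format used in the parsing example:

```python
import json

# Hypothetical filter schema: each filter has a Name and a list of
# Representations, each with a meaningful Name, a Type, and Examples.
filter_schema = [
    {
        "Name": "Customer_Ratings",
        "Representations": [
            {"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 5.0]},
            {"Name": "Minimum_Rating", "Type": "float", "Examples": [4.0]},
        ],
    }
]

# Compact serialization: no spaces after ',' and ':'.
compact = json.dumps(filter_schema, separators=(",", ":"))
pretty = json.dumps(filter_schema, indent=2)
print(len(compact) < len(pretty))  # the compact form spends fewer tokens
```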
## How to Get Started with the Model

Use the code below to get started with the model.
```python
MODEL_ID = 'EmbeddingStudio/query-parser-falcon-7b-instruct'
```

Initialize tokenizer:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    add_prefix_space=True,
    use_fast=False,
)
```

Initialize model:
```python
import torch

from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

device_map = {"": 0}

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map=device_map,
    torch_dtype=torch.float16,
)
```

Use for parsing:
```python
import json
from json import JSONDecodeError

INSTRUCTION_TEMPLATE = """
### System: Master in Query Analysis
### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
#### Category: {0}
#### Schema: ```{1}```
#### Query: {2}
### Response:
"""


def parse(
    query: str,
    company_category: str,
    filter_schema: dict,
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
):
    input_text = INSTRUCTION_TEMPLATE.format(
        company_category,
        json.dumps(filter_schema),
        query,
    )
    input_ids = tokenizer.encode(input_text, return_tensors='pt')

    # Generate the response with near-greedy sampling
    output = model.generate(
        input_ids.to('cuda'),
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.05,
        pad_token_id=50256,
    )
    try:
        parsed = json.loads(
            tokenizer.decode(output[0], skip_special_tokens=True).split('### Response:\n')[-1]
        )
    except JSONDecodeError:
        parsed = dict()

    return parsed


category = 'Logistics and Supply Chain Management'
query = 'Which logistics companies in the US have a perfect 5.0 rating?'
schema = [{"Name": "Customer_Ratings", "Representations": [{"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 3.2, 5.0, "4.5", "Unstructured"]}, {"Name": "Minimum_Rating", "Type": "float", "Examples": [4.0, 3.0, 5.0, "4.5"]}, {"Name": "Star_Rating", "Type": "int", "Examples": [4, 3, 5], "Enum": [1, 2, 3, 4, 5]}]}, {"Name": "Date", "Representations": [{"Name": "Day_Month_Year", "Type": "str", "Examples": ["01.01.2024", "15.06.2023", "31.12.2022", "25.12.2021", "20.07.2024", "15.06.2023"], "Pattern": "dd.mm.YYYY"}, {"Name": "Day_Name", "Type": "str", "Examples": ["Monday", "Wednesday", "Friday", "Thursday", "Monday", "Tuesday"], "Enum": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]}]}, {"Name": "Date_Period", "Representations": [{"Name": "Specific_Period", "Type": "str", "Examples": ["01.01.2024 - 31.01.2024", "01.06.2023 - 30.06.2023", "01.12.2022 - 31.12.2022"], "Pattern": "dd.mm.YYYY - dd.mm.YYYY"}, {"Name": "Month", "Type": "str", "Examples": ["January", "June", "December"], "Enum": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]}, {"Name": "Quarter", "Type": "str", "Examples": ["Q1", "Q2", "Q3"], "Enum": ["Q1", "Q2", "Q3", "Q4"]}, {"Name": "Season", "Type": "str", "Examples": ["Winter", "Summer", "Autumn"], "Enum": ["Winter", "Spring", "Summer", "Autumn"]}]}, {"Name": "Destination_Country", "Representations": [{"Name": "Country_Name", "Type": "str", "Examples": ["United States", "Germany", "China"]}, {"Name": "Country_Code", "Type": "str", "Examples": ["US", "DE", "CN"]}, {"Name": "Country_Abbreviation", "Type": "str", "Examples": ["USA", "GER", "CHN"]}]}]

output = parse(query, category, schema, model, tokenizer)
print(output)

# [out]: [{"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"}, {"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0}, {"Name": "Destination_Country.Country_Code", "Value": "US"}]
```
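Because the model can hallucinate filter names and values (see Risks and Limitations), it is worth validating the parsed output against the schema before using it. A minimal sketch; the `validate_parsed` helper is a hypothetical example, not part of the model:

```python
def validate_parsed(parsed: list, filter_schema: list) -> list:
    """Keep only entries whose Name matches a `Filter.Representation`
    pair that exists in the schema; `Correct` (the spell-checked query)
    is passed through."""
    valid_names = {
        f"{f['Name']}.{r['Name']}"
        for f in filter_schema
        for r in f["Representations"]
    }
    return [
        entry for entry in parsed
        if entry["Name"] == "Correct" or entry["Name"] in valid_names
    ]


schema = [{"Name": "Customer_Ratings", "Representations": [
    {"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 5.0]}]}]
parsed = [
    {"Name": "Correct", "Value": "perfect 5.0 rating?"},
    {"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0},
    {"Name": "Job_Type.Exact_Match", "Value": "Full Time"},  # hallucinated
]
# The hallucinated Job_Type entry is filtered out
print(validate_parsed(parsed, schema))
```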

## Training Details