Support for custom column names in harness for csv by Prikshit7766 · Pull Request #650 · PacificAI/langtest

Prikshit7766 · 2023-07-20T06:54:26Z

Checklist:

I've added Google style docstrings to my code.
I've used pydantic for typing when/where necessary.
I have linted my code
I have added tests to cover my changes.

This PR allows users to specify custom column names for various tasks when loading data from a CSV file.

text-classification

with custom column

harness = Harness(
    task="text-classification",
    model={"model": "lvwerra/distilbert-imdb", "hub": "huggingface"},
    data={
        "data_source": "data_file.csv",
        "feature_column": "text",
        "target_column": "label",
    },
)

Directly providing the data path

harness = Harness(
    task="text-classification",
    model={"model": "lvwerra/distilbert-imdb", "hub": "huggingface"},
    data={"data_source": "data_file.csv"},
)

question-answering

with custom column

harness = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo", "hub": "openai"},
    data={
        "data_source": "data_file.csv",
        "feature_column": {"passage": "context", "question": "question"},
        "target_column": "answer_start",
    },
)
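To illustrate what the dictionary form of feature_column expresses in the question-answering example, here is a minimal sketch using only the standard library. It is not the langtest implementation: the helper name read_qa_rows, the "target" key, and the sample CSV text are all hypothetical and exist only for this illustration.

```python
import csv
import io


def read_qa_rows(csv_text, feature_column, target_column):
    """Map custom CSV column names onto canonical field names.

    feature_column maps canonical names ("passage", "question") to the
    column names actually used in the CSV. This helper is hypothetical.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        # Rename each custom column to its canonical name.
        row = {canonical: record[custom] for canonical, custom in feature_column.items()}
        row["target"] = record[target_column]
        rows.append(row)
    return rows


# Made-up sample data whose columns match the example above.
sample = "context,question,answer_start\nParis is in France.,Where is Paris?,12\n"
rows = read_qa_rows(
    sample,
    feature_column={"passage": "context", "question": "question"},
    target_column="answer_start",
)
print(rows)
```

Values come back as strings because csv.DictReader does no type conversion; a real loader would likely cast answer_start to an integer.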

Directly providing the data path

harness = Harness(
    task="question-answering",
    model={"model": "gpt-3.5-turbo", "hub": "openai"},
    data={"data_source": "data_file.csv"},
)

summarization

with custom column

harness = Harness(
    task="summarization",
    model={"model": "gpt-3.5-turbo", "hub": "openai"},
    data={
        "data_source": "data_file.csv",
        "feature_column": "headlines",
        "target_column": "text",
    },
)

Directly providing the data path

harness = Harness(
    task="summarization",
    model={"model": "gpt-3.5-turbo", "hub": "openai"},
    data={"data_source": "data_file.csv"},
)

ner

with custom column

harness = Harness(
    task="ner",
    model={"model": "dslim/bert-base-NER", "hub": "huggingface"},
    data={
        "data_source": "data_file.csv",
        "feature_column": "tokens",
        "target_column": "ner_tags",
    },
)

Directly providing the data path

harness = Harness(
    task="ner",
    model={"model": "dslim/bert-base-NER", "hub": "huggingface"},
    data={"data_source": "data_file.csv"},
)


JulesBelveze

Please merge the latest changes from release/1.2.0.
I don't understand why we aren't extending the CSVDataset we already have.
Let's put this on hold for a moment, as we are still debating with Arshaan which parameters the Harness class should take. We'll think about it on Monday and get back to you, but for now we're thinking of something like this:

Harness(
    task="ner",
    model=[
        {"model": "bert-base-cased", "hub": "huggingface"},
        {"model": "path/to/local/model", "hub": "johnsnowlabs"},
    ],
    data={
        "data_source": "mydataset",
        "subset": "sst2",
        "feature_column": "sentence",
        "target_column": "label",
        "split": "train",
    },
)
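One way to pin down the keys of such a data dictionary is a small typed container. The sketch below uses a standard-library dataclass rather than pydantic to stay dependency-free. The class name DataConfig, the default values, and the choice to reject unknown keys are assumptions made for illustration, not decisions taken in this PR.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class DataConfig:
    """Hypothetical container for the `data` argument sketched above."""

    data_source: str
    subset: Optional[str] = None
    feature_column: str = "text"   # assumed default
    target_column: str = "label"   # assumed default
    split: str = "test"            # assumed default

    @classmethod
    def from_dict(cls, raw):
        # Reject keys the container does not know, so typos surface early.
        known = {field.name for field in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"unknown data keys: {sorted(unknown)}")
        return cls(**raw)


config = DataConfig.from_dict(
    {
        "data_source": "mydataset",
        "subset": "sst2",
        "feature_column": "sentence",
        "target_column": "label",
        "split": "train",
    }
)
print(config)
```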

JulesBelveze · 2023-07-21T13:00:31Z

+
+
+class CustomCSVDataset(_IDataset):
+    """
+    A class to handle CSV files datasets. Subclass of _IDataset.


How is that different from the CSVDataset we already have?

I think CSVDataset is called dynamically based on the .csv extension in the file path;
you can check the load method of DataFactory.
For custom column names we pass the data as a dictionary, so it will not match the extension check.
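The mismatch described here can be shown with a minimal sketch of extension-based dispatch. This is not the DataFactory code: the registry LOADERS, the placeholder classes, and pick_loader are hypothetical names used only to show why a dictionary cannot be matched against a file extension until the path is pulled out of it.

```python
import os


class CSVLoader:
    """Placeholder standing in for a CSV dataset class."""


class JSONLoader:
    """Placeholder standing in for a JSON dataset class."""


# Hypothetical registry keyed by file extension.
LOADERS = {".csv": CSVLoader, ".json": JSONLoader}


def pick_loader(file_path):
    """Choose a loader class from the extension of a string path."""
    if not isinstance(file_path, str):
        raise TypeError("extension-based dispatch needs a string path")
    extension = os.path.splitext(file_path)[1].lower()
    if extension not in LOADERS:
        raise ValueError(f"unsupported file extension: {extension!r}")
    return LOADERS[extension]


config = {"data_source": "data_file.csv", "feature_column": "text"}
loader_for_path = pick_loader("data_file.csv")          # a plain path dispatches directly
loader_for_config = pick_loader(config["data_source"])  # a dict must be unpacked first
```

Passing the whole dictionary to pick_loader raises TypeError in this sketch, which mirrors the "it will not match" problem.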

What do you think @JulesBelveze?

I would rather have only one class that handles CSV files @Prikshit7766

we can try that

JulesBelveze

Hmm, I'm not sure where we are going here; we need a standard way to load data across all tasks and hubs.
Looking at the current codebase, I think each XXXDataset object should have a load_data method to which we pass feature_column and target_column.
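As a rough illustration of that suggestion, the sketch below gives a dataset class a load_data method that receives the column names. It is not the interface adopted in langtest: BaseDataset, SimpleCSVDataset, the default column names, and the temporary sample file are assumptions for this example.

```python
import csv
import os
import tempfile
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Hypothetical common interface: every dataset exposes load_data."""

    @abstractmethod
    def load_data(self, feature_column, target_column):
        ...


class SimpleCSVDataset(BaseDataset):
    def __init__(self, file_path):
        self._file_path = file_path  # always a plain string path

    def load_data(self, feature_column="text", target_column="label"):
        # Return (feature, target) pairs read from the named columns.
        with open(self._file_path, newline="", encoding="utf-8") as handle:
            return [
                (row[feature_column], row[target_column])
                for row in csv.DictReader(handle)
            ]


# Demo with a throwaway CSV file that uses non-default column names.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("review,sentiment\ngreat film,pos\nboring plot,neg\n")

pairs = SimpleCSVDataset(tmp.name).load_data(
    feature_column="review", target_column="sentiment"
)
os.remove(tmp.name)
print(pairs)
```

With this shape the constructor only knows where the file is, and the column names travel with the load call.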

JulesBelveze · 2023-07-28T08:06:20Z

+        self._custom_label = file_path
+        if isinstance(self._custom_label, dict):
+            self._file_path = file_path["name"]
+        else:
+            self._file_path = file_path


I don't get what this _custom_label attribute is.

_custom_label contains:

{"name": r"data\imdb.csv", "feature_column": "text", "target_column": "label"}

Well, then file_path is not a string anymore and doesn't refer to the file location.

I think file_path contains only the path.
Also, based on the path extension, we call the load_data method of the respective class.
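The disagreement is about one argument holding either a path or a dictionary. A minimal sketch of separating the two up front follows, assuming the "name" key shown in the comment above. The helper split_data_config is hypothetical and is not part of this PR.

```python
def split_data_config(data):
    """Return (file_path, column_options), with file_path always a str.

    Accepts either a plain path or a dict holding the path under "name".
    Hypothetical helper for illustration only.
    """
    if isinstance(data, str):
        return data, {}
    options = dict(data)  # copy so the caller's dict is left untouched
    file_path = options.pop("name")
    return file_path, options


path, columns = split_data_config(
    {"name": "data/imdb.csv", "feature_column": "text", "target_column": "label"}
)
print(path, columns)
```

After this split, extension checks can keep operating on a string while the column names are passed along separately.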


JulesBelveze

I am sorry, guys, but to me this approach is not user-friendly, and having the file_path parameter actually be a dictionary is a bit weird.
I'll write down what I have in mind and share it with you so we can discuss it.

Prikshit7766 added 2 commits July 20, 2023 12:16

custom-column-names for text-classification and summarization

5b76e46

updated langtest.py for custom-column-names for csv

2f1db05

Prikshit7766 marked this pull request as draft July 20, 2023 06:54

This was linked to issues Jul 20, 2023

Support for Custom Column Names in Harness for CSV #625

Closed

Add Support for QA and Summarization Tasks for CSV Dataset #626

Closed

Prikshit7766 added 4 commits July 20, 2023 12:32

Format: datasource.py and langtest.py

ad7da52

added default_question_answering_prompt

d62598e

added support for question answering for csv dataset

902795b

Test(test/test_harness.py): added test

3382635

Prikshit7766 requested review from JulesBelveze July 21, 2023 09:23

Prikshit7766 added the ⭐ Feature Indicates new feature requests label Jul 21, 2023

Prikshit7766 self-assigned this Jul 21, 2023

Prikshit7766 added 3 commits July 21, 2023 15:43

updated test

b459bbc

minor change

ea9e607

Merge branch 'release/1.2.0' of https://github.com/JohnSnowLabs/langtest

373bd56

into support-for-custom-column-names-in-harness-for-csv

JulesBelveze suggested changes Jul 21, 2023

View reviewed changes

Prikshit7766 added 4 commits July 21, 2023 18:41

chore(datasource): add load raw method for CustomCSVDataset

9c8a6b3

updated test for HuggingFaceDataset

f45509f

updated CSVDataset for custom column names

108f6b0

tests\test_datasource.py reformatted

234177f

JulesBelveze suggested changes Jul 28, 2023

View reviewed changes

Prikshit7766 added 6 commits July 28, 2023 13:57

datasource.py updated

30add66

re-arranged code and directly load csv for summarization, question-an…

c4692be

…swering

re-arranged classes

49a3ced

added some checks and support for custom columns ner

f82ac20

Test(test/test_harness.py): added some test

c1b0d8f

file path updated

55ce03e

Prikshit7766 requested review from JulesBelveze July 30, 2023 19:19

resolve conflicts

dc332eb

JulesBelveze suggested changes Jul 31, 2023

View reviewed changes

ArshaanNazir added the v2.1.0 Issue or request to be done in v2.1.0 release label Aug 7, 2023

Prikshit7766 added 3 commits August 17, 2023 22:00

conflict resolved

f15585c

updated test_harness.py and datasource.py

aeefe00

fix lint

b457911

Prikshit7766 changed the base branch from release/1.2.0 to release/1.3.0 August 17, 2023 18:20

Prikshit7766 requested review from chakravarthik27 and removed request for JulesBelveze August 17, 2023 18:27

Prikshit7766 marked this pull request as ready for review August 17, 2023 18:36

Prikshit7766 added 4 commits August 18, 2023 18:06

added Loading_Data_with_Custom_Columns notebook

299eafb

updated docstring and added tests

f1a4c78

updated test_datasource.py

9279cde

updated load_raw_data for csv

0f904dd

chakravarthik27 approved these changes Aug 18, 2023

View reviewed changes

ArshaanNazir merged commit 21fe1bd into release/1.3.0 Aug 18, 2023

ArshaanNazir deleted the support-for-custom-column-names-in-harness-for-csv branch August 22, 2023 05:28

ArshaanNazir removed the v2.1.0 Issue or request to be done in v2.1.0 release label Aug 31, 2023

ArshaanNazir merged 27 commits into release/1.3.0 from support-for-custom-column-names-in-harness-for-csv
