Support for custom column names in harness for csv #650

Merged
ArshaanNazir merged 27 commits into release/1.3.0 from support-for-custom-column-names-in-harness-for-csv
Aug 18, 2023

Conversation

Contributor

@Prikshit7766 Prikshit7766 commented Jul 20, 2023

Checklist:

  • I've added Google style docstrings to my code.
  • I've used pydantic for typing when/where necessary.
  • I have linted my code
  • I have added tests to cover my changes.

This PR allows users to specify custom column names in CSV files for various tasks.

Notebook

text-classification

  • with custom column
harness = Harness(task="text-classification",
                  model={"model":"lvwerra/distilbert-imdb", "hub":"huggingface"},
                  data={"data_source": "data_file.csv",
                        "feature_column": "text",    
                        "target_column": 'label',     
                       })

  • Directly providing the data path
harness = Harness(task="text-classification",
                  model={"model":"lvwerra/distilbert-imdb", "hub":"huggingface"},
                  data={"data_source": "data_file.csv"}
                  )

question-answering

  • with custom column
harness = Harness(task="question-answering", 
                  model={"model":"gpt-3.5-turbo","hub":"openai"},
                  data={"data_source":"data_file.csv",
                  "feature_column":{"passage": "context", "question": "question"},
                  "target_column":'answer_start',
                  })
  • Directly providing the data path
harness = Harness(task="question-answering",
                  model={"model":"gpt-3.5-turbo","hub":"openai"},
                  data={"data_source":"data_file.csv"}
                  )
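
For readers wondering how the dictionary form of feature_column might be interpreted, here is a minimal sketch. It assumes each key is the field name the task expects and each value is the column name in the CSV; that direction is an assumption, not something the PR description confirms. The CSV content and the map_row helper are hypothetical and are not part of langtest.

```python
import csv
import io

# Hypothetical CSV content; the rows are invented for illustration.
CSV_TEXT = """context,question,answer_start
Paris is the capital of France.,What is the capital of France?,0
"""

# Assumed reading of the mapping: expected field name -> CSV column name.
feature_column = {"passage": "context", "question": "question"}
target_column = "answer_start"


def map_row(row, feature_column, target_column):
    """Build one sample dict keyed by the expected field names (hypothetical helper)."""
    sample = {field: row[column] for field, column in feature_column.items()}
    sample["target"] = row[target_column]
    return sample


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
samples = [map_row(row, feature_column, target_column) for row in rows]
print(samples[0]["passage"])
```

Under this reading, the "context" column of the file ends up in the "passage" field of each sample, while "question" maps to itself.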

summarization

  • with custom column
harness = Harness(task="summarization",
                  model={"model":"gpt-3.5-turbo","hub":"openai"},
                  data={"data_source":"data_file.csv",
                  "feature_column":"headlines",
                  "target_column":'text',
                  })
  • Directly providing the data path
harness = Harness(task="summarization",
                   model={"model":"gpt-3.5-turbo","hub":"openai"},
                  data={"data_source":"data_file.csv"}
                  )

ner

  • with custom column
harness = Harness( 
                  task="ner",
                  model={"model":"dslim/bert-base-NER", "hub":"huggingface"},
                  data={"data_source": "data_file.csv",
                        "feature_column": "tokens",    
                        "target_column": 'ner_tags',     
                       })
  • Directly providing the data path
harness = Harness(
                  task="ner",
                  model={"model":"dslim/bert-base-NER", "hub":"huggingface"},
                  data={"data_source":"data_file.csv"}
                  )

@Prikshit7766 Prikshit7766 added the ⭐ Feature Indicates new feature requests label Jul 21, 2023
@Prikshit7766 Prikshit7766 self-assigned this Jul 21, 2023
Contributor

@JulesBelveze JulesBelveze left a comment

  1. Please merge the latest changes from release/1.2.0
  2. I don't understand why we are not extending the existing CSVDataset.
  3. Let's put this on hold for a moment as we are still debating with Arshaan on the parameters the Harness class should take. We'll think about it on Monday and get back to you but for now we're thinking of something like this:
Harness(
    task="ner",
    model=[
        {"model": "bert-base-cased", "hub": "huggingface"},
        {"model": "path/to/local/model", "hub": "johnsnowlabs"},
    ],
    data={
        "data_source": "mydataset",
        "subset": "sst2",
        "feature_column": "sentence",
        "target_column": "label",
        "split": "train",
    }
)

Comment thread langtest/datahandler/datasource.py Outdated
Comment on lines +879 to +883


class CustomCSVDataset(_IDataset):
    """
    A class to handle CSV files datasets. Subclass of _IDataset.
Contributor

How is that different from the CSVDataset we already have?

Contributor Author

@Prikshit7766 Prikshit7766 Jul 21, 2023

I think CSVDataset is selected dynamically based on the .csv extension in the file path; you can check the load method of DataFactory.
For custom column names we pass a dictionary in data, so the extension check will not match.

What do you think @JulesBelveze?
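
To make the dispatch problem described above concrete, here is a minimal sketch of extension-based loader selection. The LOADERS registry, the pick_loader function, and every class name other than CSVDataset are hypothetical; this is not the actual DataFactory code, only an illustration of why a dictionary cannot be matched against a file extension.

```python
import os

# Hypothetical registry: file extension -> name of a loader class.
LOADERS = {".csv": "CSVDataset", ".conll": "ConllDataset", ".json": "JSONDataset"}


def pick_loader(file_path):
    """Return the loader name registered for the extension of file_path."""
    if not isinstance(file_path, str):
        # A dict such as {"data_source": ...} has no extension to inspect.
        raise TypeError(f"expected a path string, got {type(file_path).__name__}")
    _, extension = os.path.splitext(file_path)
    try:
        return LOADERS[extension.lower()]
    except KeyError:
        raise ValueError(f"unsupported file extension: {extension!r}") from None


print(pick_loader("data_file.csv"))
```

Passing a dictionary to pick_loader raises TypeError in this sketch, which mirrors the mismatch mentioned in the comment.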

Contributor

@JulesBelveze JulesBelveze Jul 21, 2023

I would rather have only one class that handles CSV files @Prikshit7766

Contributor Author

We can try that.

Contributor

@JulesBelveze JulesBelveze left a comment

Hmm, I'm not sure where we're going here; we need a standard way to load data across all tasks and hubs.
Looking at the current codebase, I think each XXXDataset object should have a load_data method to which we pass feature_column and target_column.
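
A minimal sketch of what such a load_data method could look like for CSV files follows. SketchCSVDataset is a hypothetical stand-in, not the langtest CSVDataset class, and the default column names "text" and "label" are assumptions made for illustration.

```python
import csv
import os
import tempfile


class SketchCSVDataset:
    """Illustrative stand-in; not the langtest CSVDataset class."""

    def __init__(self, file_path):
        self._file_path = file_path  # always a plain path string

    def load_data(self, feature_column="text", target_column="label"):
        """Read the CSV and return a list of (feature, target) pairs."""
        with open(self._file_path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            missing = {feature_column, target_column} - set(reader.fieldnames or [])
            if missing:
                raise KeyError(f"columns not found in CSV: {sorted(missing)}")
            return [(row[feature_column], row[target_column]) for row in reader]


# Usage with a throwaway file whose column names differ from the defaults.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_file.csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows([["review", "sentiment"], ["great film", "pos"]])
    pairs = SketchCSVDataset(path).load_data(
        feature_column="review", target_column="sentiment"
    )
print(pairs)
```

With this shape the constructor keeps receiving a plain path, and the column names only appear where the rows are actually read.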

Comment thread langtest/datahandler/datasource.py Outdated
Comment on lines +101 to +105
self._custom_label = file_path
if isinstance(self._custom_label, dict):
    self._file_path = file_path["name"]
else:
    self._file_path = file_path
Contributor

I don't understand what this _custom_label attribute is.

Contributor Author

_custom_label contains:

{"name": r"data\imdb.csv",
 "feature_column": "text",
 "target_column": "label"}

Contributor

Well then file_path is not a string anymore and doesn't refer to the file location

Contributor Author

I think file_path contains only the path.
Also, based on the path extension, we call the load_data method of the respective class.
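
One way to keep file_path a plain string while still accepting a dictionary from the user is to split the dictionary before it reaches the dataset class. The sketch below is an assumption about how that could be done, not what the PR implements; split_data_config is a hypothetical helper, and only the "data_source", "feature_column" and "target_column" keys come from the PR description.

```python
def split_data_config(data):
    """Separate the file path from the column options in a data config."""
    if isinstance(data, str):
        return data, {}
    if not isinstance(data, dict) or "data_source" not in data:
        raise ValueError("data must be a path string or a dict with 'data_source'")
    options = {key: value for key, value in data.items() if key != "data_source"}
    return data["data_source"], options


file_path, column_options = split_data_config(
    {"data_source": "data_file.csv", "feature_column": "text", "target_column": "label"}
)
print(file_path, column_options)
```

After the split, extension-based dispatch can still inspect file_path, and column_options can be forwarded to whichever loading method is chosen.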

Contributor

@JulesBelveze JulesBelveze left a comment

I'm sorry, guys, but to me the approach is not user friendly, and having the file_path parameter actually be a dictionary is a bit weird.
I'll write down what I have in mind and share it with you so we can discuss it.

@ArshaanNazir ArshaanNazir added the v2.1.0 Issue or request to be done in v2.1.0 release label Aug 7, 2023
@Prikshit7766 Prikshit7766 changed the base branch from release/1.2.0 to release/1.3.0 August 17, 2023 18:20
@Prikshit7766 Prikshit7766 requested review from chakravarthik27 and removed request for JulesBelveze August 17, 2023 18:27
@Prikshit7766 Prikshit7766 marked this pull request as ready for review August 17, 2023 18:36
@ArshaanNazir ArshaanNazir merged commit 21fe1bd into release/1.3.0 Aug 18, 2023
@ArshaanNazir ArshaanNazir deleted the support-for-custom-column-names-in-harness-for-csv branch August 22, 2023 05:28
@ArshaanNazir ArshaanNazir removed the v2.1.0 Issue or request to be done in v2.1.0 release label Aug 31, 2023
Labels

⭐ Feature Indicates new feature requests

Development

Successfully merging this pull request may close these issues.

  • Add Support for QA and Summarization Tasks for CSV Dataset
  • Support for Custom Column Names in Harness for CSV

4 participants