Support for custom column names in harness for csv #650
ArshaanNazir merged 27 commits into release/1.3.0
JulesBelveze
left a comment
- Please merge the latest changes from `release/1.2.0`.
- I don't understand why we are not extending the `CSVDataset` we already have.
- Let's put this on hold for a moment, as we are still debating with Arshaan which parameters the `Harness` class should take. We'll think about it on Monday and get back to you, but for now we're thinking of something like this:
    Harness(
        task="ner",
        model=[
            {"model": "bert-base-cased", "hub": "huggingface"},
            {"model": "path/to/local/model", "hub": "johnsnowlabs"},
        ],
        data={
            "data_source": "mydataset",
            "subset": "sst2",
            "feature_column": "sentence",
            "target_column": "label",
            "split": "train",
        },
    )
    class CustomCSVDataset(_IDataset):
        """
        A class to handle CSV files datasets. Subclass of _IDataset.
There was a problem hiding this comment.
How is that different from the CSVDataset we already have?
There was a problem hiding this comment.
I think CSVDataset is selected dynamically based on the .csv extension in the file path;
you can check the load method of DataFactory.
For custom column names we pass data as a dictionary, so it will not match the extension check.
What do you think @JulesBelveze?
There was a problem hiding this comment.
I would rather have only one class that handles CSV files @Prikshit7766
There was a problem hiding this comment.
we can try that
JulesBelveze
left a comment
Hmmm, not sure where we are going here; we need a standard way to load data across all tasks and hubs.
Looking at the current codebase, I think each XXXDataset object should have a load_data method to which we pass feature_column and target_column.
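To make the suggestion concrete, here is a minimal sketch of what such a load_data method could look like for CSV files. It is only an illustration under stated assumptions: the class below is a hypothetical stand-in rather than the project's actual CSVDataset, and the default column names and the returned record shape are made up for the example.

```python
import csv
from typing import Dict, List


class CSVDataset:
    """Hypothetical stand-in for a CSV dataset class (not the project's code)."""

    def __init__(self, file_path: str) -> None:
        # file_path stays a plain string that points at the file on disk.
        self._file_path = file_path

    def load_data(
        self, feature_column: str = "text", target_column: str = "label"
    ) -> List[Dict[str, str]]:
        """Read the CSV and map the user's column names to fixed keys."""
        with open(self._file_path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            # Fail early if the requested columns are not in the header row.
            missing = {feature_column, target_column} - set(reader.fieldnames or [])
            if missing:
                raise ValueError(f"Columns not found in CSV: {sorted(missing)}")
            # Assumed record shape: one dict per row with "text" and "label".
            return [
                {"text": row[feature_column], "label": row[target_column]}
                for row in reader
            ]
```

With this shape, the column names travel as arguments to load_data, so the constructor argument can remain a real path.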
    self._custom_label = file_path
    if isinstance(self._custom_label, dict):
        self._file_path = file_path["name"]
    else:
        self._file_path = file_path
There was a problem hiding this comment.
I don't get what this _custom_label attribute is.
There was a problem hiding this comment.
_custom_label contains:
```python
{
    "name": r"data\imdb.csv",
    "feature_column": "text",
    "target_column": "label",
}
```
There was a problem hiding this comment.
Well then file_path is not a string anymore and doesn't refer to the file location
There was a problem hiding this comment.
file_path contains only the path, I think.
Also, based on the path extension, we call the load_data method of the respective class.
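For readers following the thread, extension-based dispatch of the kind described here can be sketched roughly as below. This is not the actual DataFactory code; the loader functions, the registry, and the pick_loader name are hypothetical and only illustrate why the dispatch expects a string path.

```python
import os
from typing import Callable, Dict


# Hypothetical loaders; the real project has dataset classes instead.
def load_csv(path: str) -> str:
    return f"csv:{path}"


def load_conll(path: str) -> str:
    return f"conll:{path}"


# Assumed registry that maps a file extension to the loader handling it.
_LOADERS: Dict[str, Callable[[str], str]] = {
    ".csv": load_csv,
    ".conll": load_conll,
}


def pick_loader(file_path: str) -> Callable[[str], str]:
    """Choose a loader from the extension of a string path."""
    # os.path.splitext raises TypeError for non-path input such as a dict.
    ext = os.path.splitext(file_path)[1].lower()
    try:
        return _LOADERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {ext!r}") from None
```

Calling pick_loader with a dictionary instead of a path fails inside os.path.splitext, which is one reason to keep the path and the column names as separate arguments.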
JulesBelveze
left a comment
I am sorry guys, but to me the approach is not user friendly, and having this file_path parameter actually be a dictionary is a bit weird.
I'll write down and share with you what I have in mind and we can discuss it.
Checklist:
- pydantic for typing when/where necessary.
- Allowing users to specify custom column names for various tasks.
Notebook
text-classification
question-answering
summarization
ner