Context
load_dwca_data() in src/dataset_tools/utils.py:20 hardcodes the columns it selects from the DwC-A occurrence data, and verbatimScientificName is not among them. As a result, none of the downstream CLI commands (clean-dataset, verify-images, split-dataset, etc.) can natively use verbatimScientificName as a label column.
Currently, this is worked around by scripts/build_species_list.py, which reads the DwC-A independently and joins the name column onto the annotations CSV after the clean-dataset step. This works but adds an extra step to the pipeline.
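For readers unfamiliar with the bridge script, the join it performs can be sketched roughly as follows. This is not the code in scripts/build_species_list.py: the function name join_name_column, the gbifID key, the filename column, and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def join_name_column(occurrence_rows, annotation_rows, key="gbifID",
                     name_column="verbatimScientificName"):
    """Attach name_column from the occurrence rows to each annotation row.

    Rows are matched on `key` (assumed here to be gbifID). Annotation rows
    with no matching occurrence get an empty string.
    """
    names = {row[key]: row.get(name_column, "") for row in occurrence_rows}
    return [{**row, name_column: names.get(row[key], "")} for row in annotation_rows]


# Made-up sample data: a tab-separated occurrence table and an annotations CSV.
occurrence_txt = (
    "gbifID\tverbatimScientificName\n"
    "1\tApis mellifera L.\n"
    "2\tBombus terrestris\n"
)
annotations_csv = "gbifID,filename\n1,a.jpg\n3,c.jpg\n"

occurrences = list(csv.DictReader(io.StringIO(occurrence_txt), delimiter="\t"))
annotations = list(csv.DictReader(io.StringIO(annotations_csv)))
joined = join_name_column(occurrences, annotations)
```

The second read of the occurrence data is the extra step this issue proposes to eliminate.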
Proposed Changes
- Add an extra_columns parameter (or a name_column option) to load_dwca_data() in src/dataset_tools/utils.py:20 so that additional DwC-A columns can be carried through the pipeline.
- Update the CLI decorators/options in src/dataset_tools/cli.py to expose this option on relevant commands (clean-dataset, split-dataset, etc.).
- Once supported natively, the build_species_list.py bridge script could be simplified or removed.
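One possible shape for the extra_columns parameter, as a minimal sketch only. load_dwca_rows is a hypothetical stand-in for load_dwca_data(), whose real signature and return type are not shown in this issue. DEFAULT_COLUMNS is an invented placeholder for the hardcoded column list, and the tab-separated input is an assumption.

```python
import csv
import io

# Placeholder for the hardcoded column list; the real columns are not shown here.
DEFAULT_COLUMNS = ("gbifID", "species")


def load_dwca_rows(occurrence_file, extra_columns=()):
    """Read occurrence rows, keeping the default columns plus any extras.

    Raises KeyError if a requested column is absent from the file header.
    """
    reader = csv.DictReader(occurrence_file, delimiter="\t")
    wanted = list(DEFAULT_COLUMNS) + [
        c for c in extra_columns if c not in DEFAULT_COLUMNS
    ]
    missing = [c for c in wanted if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found in occurrence data: {missing}")
    return [{c: row[c] for c in wanted} for row in reader]


# Made-up sample occurrence table with one unrelated column.
SAMPLE = (
    "gbifID\tspecies\tverbatimScientificName\tcountryCode\n"
    "1\tApis mellifera\tApis mellifera L.\tDE\n"
)

rows = load_dwca_rows(io.StringIO(SAMPLE), extra_columns=("verbatimScientificName",))
default_rows = load_dwca_rows(io.StringIO(SAMPLE))
```

Defaulting extra_columns to an empty tuple leaves existing callers unaffected, and failing early on a missing column surfaces a mistyped name at load time instead of in a later pipeline step.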
Related