Possible issue with UTF-8 encoding under Windows while reading dataset #80

@sstiene

I am trying to load a dataset with agml on Windows 11 with Python 3.12, but I get an error.

How to reproduce it:

import agml

dataset_name = 'carrot_weeds_germany'
loader = agml.data.AgMLDataLoader(dataset_name)
dataset_path = loader.dataset_root

print(f"Dataset downloaded to: {dataset_path}")

The error is:

Traceback (most recent call last):
  File "c:\Users\stefa\HSOS\Vorbereitung Einführung in die KI (BAT) - General\Praktika\label_studio_51\agml_test.py", line 4, in <module>
    loader = agml.data.AgMLDataLoader(dataset_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\stefa\cv-env\Lib\site-packages\agml\data\loader.py", line 149, in __init__
    self._info = make_metadata(dataset, kwargs.get('meta', None))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\stefa\cv-env\Lib\site-packages\agml\data\metadata.py", line 49, in make_metadata
    return DatasetMetadata(name)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\stefa\cv-env\Lib\site-packages\agml\data\metadata.py", line 105, in __init__
    self._load_source_info(name)
  File "C:\Users\stefa\cv-env\Lib\site-packages\agml\data\metadata.py", line 148, in _load_source_info
    **load_citation_sources()[name], dataset = name)
      ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\stefa\cv-env\Lib\site-packages\agml\utils\data.py", line 38, in load_citation_sources
    return json.load(f)
           ^^^^^^^^^^^^
  File "C:\Users\stefa\AppData\Local\Programs\Python\Python312\Lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
                 ^^^^^^^^^
  File "C:\Users\stefa\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 22968: character maps to <undefined>

So it seems that there is an encoding issue when reading the citation sources under Windows: the file is decoded with cp1252 (the Windows default) instead of UTF-8. I tested it with several different datasets and got the same error each time.
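For illustration, here is a minimal sketch of what I think is happening and of a possible fix. I have not checked the actual implementation of `load_citation_sources`; the traceback only shows that it calls `json.load(f)` and that the file is decoded with cp1252, so the assumption is that the file is opened without an explicit `encoding`. The file contents and the helper `load_json_utf8` below are hypothetical stand-ins, not agml code. Byte 0x9d is the last byte of the UTF-8 encoding of the right double quotation mark (U+201D, bytes E2 80 9D), and it has no mapping in Python's cp1252 codec.

```python
import json
import tempfile
from pathlib import Path


def load_json_utf8(path):
    """Read a JSON file as UTF-8, independent of the platform default encoding.

    Hypothetical helper; shows the kind of change that might fix the issue.
    """
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Hypothetical stand-in for the citation file. The closing quote U+201D
# is stored as the bytes E2 80 9D when the file is written as UTF-8.
sample = {"example_dataset": {"citation": "\u201cSome title\u201d"}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "citation_sources.json"
    path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")

    # Simulate the Windows default by decoding explicitly with cp1252.
    try:
        with open(path, "r", encoding="cp1252") as f:
            json.load(f)
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Decoding explicitly as UTF-8 reads the same file without error.
    loaded = load_json_utf8(path)

print("cp1252 decode failed:", cp1252_failed)
print("utf-8 decode matches:", loaded == sample)
```

If this assumption is right, passing `encoding="utf-8"` to the `open()` call in `agml/utils/data.py` should be enough. As a possible workaround until then, setting the environment variable `PYTHONUTF8=1` before starting Python enables Python's UTF-8 mode, which makes `open()` default to UTF-8 on Windows.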
