This repository was archived by the owner on Jul 15, 2024. It is now read-only.

tables_to_parquet() fails sporadically for large datasets #78

@galamit86

In some cases, tables_to_parquet() fails on a large dataset due to a restart of the task. The pathology is:

  1. A long time (over an hour) passes between this message (indicating that the heaviest dask computation is starting):
Dumping bag to textfiles for cbs.v3.84750NED_TypedDataSet

and the next one (indicating that the task has been restarted):

Task 'tables_to_parquet[8]': Starting task run...
  2. The task continues, reaching the first message again (Dumping bag...).
  3. This time around it finishes, logging:
Finished dumping bag to textfiles for cbs.v3.84750NED_TypedDataSet
Starting to concatanate files for cbs.v3.84750NED_TypedDataSet
Concluded concatanating files for cbs.v3.84750NED_TypedDataSet
  4. It then throws an error:
Unexpected error: FileNotFoundError(2, "Failed to open local file '/tmp/cbs/v3/84750NED/20210320/json/cbs.v3.84750NED_TypedDataSet/cbs.v3.84750NED_TypedDataSet.json'. Detail: [errno 2] No such file or directory")
Traceback (most recent call last):
  File "/home/amitgalmail/nl-open-data/.venv/lib/python3.8/site-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/home/amitgalmail/nl-open-data/.venv/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 865, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/home/amitgalmail/nl-open-data/.venv/lib/python3.8/site-packages/prefect/utilities/executors.py", line 299, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "/home/amitgalmail/nl-open-data/.venv/lib/python3.8/site-packages/statline_bq/utils.py", line 1390, in tables_to_parquet
    pq_path = convert_table_to_parquet(
  File "/home/amitgalmail/nl-open-data/.venv/lib/python3.8/site-packages/statline_bq/utils.py", line 629, in convert_table_to_parquet
    pa_table = pa_json.read_json(json_path)
  File "pyarrow/_json.pyx", line 238, in pyarrow._json.read_json
  File "pyarrow/_json.pyx", line 193, in pyarrow._json._get_reader
  File "pyarrow/io.pxi", line 1493, in pyarrow.lib.get_input_stream
  File "pyarrow/io.pxi", line 1464, in pyarrow.lib.get_native_file
  File "pyarrow/io.pxi", line 827, in pyarrow.lib.OSFile.__cinit__
  File "pyarrow/io.pxi", line 837, in pyarrow.lib.OSFile._open_readable
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
FileNotFoundError: [Errno 2] Failed to open local file '/tmp/cbs/v3/84750NED/20210320/json/cbs.v3.84750NED_TypedDataSet/cbs.v3.84750NED_TypedDataSet.json'. Detail: [errno 2] No such file or directory

The full log from Prefect is attached.
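
To narrow this down, a small diagnostic could be run just before the pa_json.read_json(json_path) call in convert_table_to_parquet. The sketch below is only a suggestion: describe_missing_file is a hypothetical helper (it is not part of statline_bq) and it uses only the standard library. It reports whether the expected JSON file exists and, if it does not, what the nearest existing directory contains. That would help tell apart "the concatenated file was never written" from "the file or its directory was removed", for example by the restarted task run. Whether the restart is actually what removes the file is an assumption, not something the log confirms.

```python
from pathlib import Path


def describe_missing_file(json_path):
    """Describe the state of json_path and its surroundings (hypothetical helper)."""
    path = Path(json_path)
    if path.is_file():
        # The file is there: report its size so an empty file is easy to spot.
        return f"{path} exists ({path.stat().st_size} bytes)"
    # Walk up from the immediate parent until a directory that exists is found.
    for parent in path.parents:
        if parent.is_dir():
            entries = sorted(child.name for child in parent.iterdir())
            return (
                f"{path} is missing; nearest existing directory is "
                f"{parent} containing {entries}"
            )
    return f"{path} is missing and no existing parent directory was found"
```

Logging the returned string right before the read (with whatever logger the task already uses) would show how much of /tmp/cbs/v3/84750NED/20210320/json/ is still in place when the error occurs.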


Labels: bug (Something isn't working)
