
Bug: Packaging issue when handling large files plus small files (zip + large file) #49

@kenlhlui

Description


I found a bug when uploading a large file along with several small files in the same upload session: only the large file ends up in the upload queue, while the small files, which should be zipped together, are silently dropped.

Reproduction Script

Prerequisites:

  • A dataset PID in a Dataverse instance where you have write access
  • dvuploader and python-dotenv installed
    • pip install dvuploader python-dotenv
  • A .env file with the following variables (example below):
    • DV_URL='your_dataverse_url'
    • API_TOKEN='your_api_token'
    • PID='your_pid'
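
For reference, the .env file would look like this (all values are placeholders):

DV_URL='https://demo.dataverse.org'
API_TOKEN='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
PID='doi:10.5072/FK2/XXXXXX'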

To reproduce the issue, use the following script:

# Create dummy files (large and small) for testing purposes
import os
from pathlib import Path

# Create a directory for the files if it doesn't exist
Path("./files").mkdir(parents=True, exist_ok=True)


def create_dummy_file(file_path: Path, size_in_bytes: int):
    # Write the random data in 64 MB chunks so the 4 GB file
    # does not have to fit in memory all at once
    chunk_size = 64 * 1024 * 1024
    with open(file_path, "wb") as f:
        remaining = size_in_bytes
        while remaining > 0:
            f.write(os.urandom(min(chunk_size, remaining)))
            remaining -= chunk_size


create_dummy_file(Path("./files/4gb_dummy_file.bin"), 4 * 1024 * 1024 * 1024)  # 4 GB file
create_dummy_file(Path("./files/1gb_dummy_file.bin"), 1 * 1024 * 1024 * 1024)  # 1 GB file
create_dummy_file(Path("./files/10mb_dummy_file.bin"), 10 * 1024 * 1024)  # 10 MB file
create_dummy_file(Path("./files/1kb_dummy_file.bin"), 1024)  # 1 KB file
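
This writes roughly 5 GB of random data, which can take a while. If that is too slow, a sparse file of the same nominal size should reproduce the issue just as well, assuming only the file size matters to the packaging logic (an untested assumption on my part):

def create_sparse_file(file_path: Path, size_in_bytes: int):
    # Seek to the last byte and write a single zero; the OS records the
    # full size without physically writing the data (on filesystems
    # that support sparse files)
    with open(file_path, "wb") as f:
        f.seek(size_in_bytes - 1)
        f.write(b"\0")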

from dotenv import load_dotenv

load_dotenv()

# Load the credentials from the .env file (os is already imported above)
DV_URL = os.getenv("DV_URL", "")
API_TOKEN = os.getenv("API_TOKEN", "")
PID = os.getenv("PID", "")

import dvuploader as dv

files = [
    *dv.add_directory("./files/"),  # Add an entire directory
]


dvuploader = dv.DVUploader(files=files)
dvuploader.upload(
    api_token=API_TOKEN,
    dataverse_url=DV_URL,
    persistent_id=PID,
    n_parallel_uploads=4,  # Whatever your instance can handle
    replace_existing=False,
)
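
To rule out add_directory as the source of the loss, an optional sanity check can be added right before the upload call (printing the File objects is enough to see their paths):

# Optional sanity check, placed before dvuploader.upload(...):
# confirm that add_directory actually picked up all four files
print(f"{len(files)} files queued")
for file in files:
    print(file)

The DVUploader banner below already reports Files: 4, so the files are found; they go missing later, during the upload phase.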

Only the 4 GB file then shows up in the upload queue; the other three files are lost (they should have been zipped together):

╭───────────── DVUploader ──────────────╮
│ Server: https://demo.borealisdata.ca/ │
│ PID: doi:10.80240/FK2/JBIRNC          │
│ Files: 4                              │
╰───────────────────────────────────────╯
🔎 Checking dataset files
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ File                ┃ Status ┃ Action ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ 10mb_dummy_file.bin │ New    │ Upload │
│ 4gb_dummy_file.bin  │ New    │ Upload │
│ 1gb_dummy_file.bin  │ New    │ Upload │
│ 1kb_dummy_file.bin  │ New    │ Upload │
└─────────────────────┴────────┴────────┘

⚠️  Direct upload not supported. Falling back to Native API.

🚀 Uploading files

4gb_dummy_file.bin ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--

Note: I tried uploading to both demo.dataverse.org and demo.borealisdata.ca, and the error is the same, so it is not specific to a particular repository.
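
For context, the behavior I would expect from the Native API fallback is roughly the following: small files get bundled into a single zip, and large files are uploaded individually. This is only a sketch of the expected logic, not dvuploader's actual implementation; the function name and the 1 GB threshold are my assumptions:

import zipfile
from pathlib import Path

ZIP_THRESHOLD = 1024 * 1024 * 1024  # hypothetical 1 GB cutoff


def package_for_native_api(paths: list[Path], zip_path: Path) -> list[Path]:
    """Bundle small files into one zip; pass large files through individually."""
    small = [p for p in paths if p.stat().st_size < ZIP_THRESHOLD]
    large = [p for p in paths if p.stat().st_size >= ZIP_THRESHOLD]
    uploads = list(large)
    if small:
        with zipfile.ZipFile(zip_path, "w") as zf:
            for p in small:
                zf.write(p, arcname=p.name)
        uploads.append(zip_path)
    # Every input file should end up in exactly one uploaded artifact
    return uploads

Whatever the real implementation looks like, every input file should end up in exactly one uploaded artifact; in my runs, the three smaller files never make it into any.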
