186 changes: 186 additions & 0 deletions docs/howto/asyncDownloadMultiple.md
@@ -0,0 +1,186 @@
## Asynchronous Multiple File Downloads

The Gen3 SDK provides `async_download_multiple`, an asynchronous download method designed to download large numbers of files with high throughput and bounded memory use.

## Overview

The `async_download_multiple` method implements a hybrid architecture combining:

- **Multiprocessing**: Multiple Python subprocesses for CPU utilization
- **Asyncio**: High I/O concurrency within each process
- **Queue-based memory management**: Efficient handling of large file sets
- **Just-in-time presigned URL generation**: Optimized authentication flow

## Architecture

### Concurrency Model

The implementation uses a three-tier architecture:

1. **Producer Thread**: Feeds GUIDs to worker processes via bounded queues
2. **Worker Processes**: Multiple Python subprocesses with asyncio event loops
3. **Queue System**: Memory-efficient streaming of work items

```text
# Architecture overview
Producer Thread → Input Queue    → Worker Processes → Output Queue   → Results
(1)               (configurable)   (configurable)     (configurable)   (final)
```
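The flow above can be sketched with the standard library alone. This is a minimal illustration, not the SDK's implementation: threads stand in for the worker subprocesses to keep the example short, and the names `run_pipeline`, `producer`, `worker`, and `queue_size` are hypothetical. What it shows is the backpressure effect of a bounded queue: `put()` blocks once the queue is full, so the producer never holds more than `queue_size` pending GUIDs in the queue.

```python
import queue
import threading

SENTINEL = None  # marks the end of the work stream


def producer(guids, work_queue, num_workers):
    """Feed GUIDs into the bounded queue; put() blocks while the queue is full."""
    for guid in guids:
        work_queue.put(guid)
    for _ in range(num_workers):
        work_queue.put(SENTINEL)  # one stop signal per worker


def worker(work_queue, result_queue):
    """Pull GUIDs until the sentinel arrives."""
    while True:
        guid = work_queue.get()
        if guid is SENTINEL:
            break
        # Placeholder: a real worker would download the file here.
        result_queue.put((guid, "downloaded"))


def run_pipeline(guids, num_workers=2, queue_size=4):
    """Hypothetical driver: threads stand in for the worker subprocesses."""
    work_queue = queue.Queue(maxsize=queue_size)  # bounded: caps pending items
    result_queue = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(work_queue, result_queue))
        for _ in range(num_workers)
    ]
    for thread in workers:
        thread.start()
    feeder = threading.Thread(target=producer, args=(guids, work_queue, num_workers))
    feeder.start()
    feeder.join()
    for thread in workers:
        thread.join()
    # All threads have finished, so draining the result queue is safe.
    results = []
    while not result_queue.empty():
        results.append(result_queue.get())
    return results
```

Because `guids` can be a generator, the input side never needs to materialize the full list. A multiprocess version would presumably swap in `multiprocessing.Process` and `multiprocessing.Queue`, with the same sentinel pattern.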

### Key Features

- **Memory Efficiency**: Bounded queues prevent memory explosion with large file sets
- **True Parallelism**: Multiprocessing bypasses Python GIL limitations
- **High Concurrency**: Configurable concurrent downloads per process
- **Resume Support**: Skip completed files with `--skip-completed` flag
- **Progress Tracking**: Real-time progress bars and detailed reporting
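Per-process concurrency limits of this kind are commonly built on `asyncio.Semaphore`. The sketch below is an assumption about the general technique rather than a copy of the SDK code: `download_all` and `fake_download` are hypothetical names, and `asyncio.sleep` stands in for the HTTP transfer. It records the peak number of simultaneous downloads to show that the limit holds.

```python
import asyncio


async def download_all(guids, max_concurrent_requests=3):
    """Start one task per GUID, but let only a few proceed at once."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)
    in_flight = 0
    peak = 0

    async def fake_download(guid):
        nonlocal in_flight, peak
        async with semaphore:  # waits here once the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the HTTP transfer
            in_flight -= 1
            return guid

    # gather() returns results in the same order as the inputs.
    completed = await asyncio.gather(*(fake_download(g) for g in guids))
    return completed, peak
```

Calling `asyncio.run(download_all(guids, max_concurrent_requests=3))` returns the completed GUIDs in input order together with the observed peak, which never exceeds the configured limit.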

## Usage

### Command Line Interface

Download multiple files using a manifest:

```bash
gen3 --endpoint my-commons.org --auth credentials.json download-multiple \
    --manifest files.json \
    --download-path ./downloads \
    --max-concurrent-requests 10 \
    --filename-format original \
    --skip-completed \
    --no-prompt
```
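Before starting a large download it can help to check what the manifest contains. The sketch below assumes the common Gen3 manifest layout, a JSON array whose entries carry an `object_id` field; the exact schema this command expects is not stated here, so both the field name and the helper `guids_from_manifest` are assumptions.

```python
import json


def guids_from_manifest(manifest_text):
    """Return the object IDs found in a manifest.

    Assumes the manifest is a JSON array of objects with an "object_id" key;
    entries without that key are ignored.
    """
    entries = json.loads(manifest_text)
    return [entry["object_id"] for entry in entries if "object_id" in entry]
```

For a file on disk, pass its contents, for example `guids_from_manifest(open("files.json", encoding="utf-8").read())`.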

### Python API

The `async_download_multiple` method is available in the `Gen3File` class for programmatic use. Refer to the Python SDK documentation for the complete API reference.

## Parameters

For detailed parameter information and current default values, run:

```bash
gen3 download-multiple --help
```

The command supports various options for customizing download behavior, including concurrency settings, file naming strategies, and progress controls.

## Performance Characteristics

### Throughput Optimization

The method is optimized for high-throughput scenarios:

- **Concurrent Downloads**: Configurable number of simultaneous downloads
- **Memory Usage**: Bounded by queue sizes (typically < 100MB)
- **CPU Utilization**: Leverages multiple CPU cores
- **Network Efficiency**: Just-in-time presigned URL generation

### Scalability

Performance scales with:

- **File Count**: Linear time complexity with constant memory usage
- **File Size**: Independent of individual file sizes
- **Network Bandwidth**: Limited by available bandwidth and concurrent connections
- **System Resources**: Scales with available CPU cores and memory

## Error Handling

### Robust Error Recovery

The implementation includes comprehensive error handling:

- **Network Failures**: Automatic retry with exponential backoff
- **Authentication Errors**: Token refresh and retry
- **File System Errors**: Graceful handling of permission and space issues
- **Process Failures**: Automatic worker process restart
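Retry with exponential backoff, as listed for network failures, generally follows the pattern below. This is a generic sketch under stated assumptions, not the SDK's retry code: `retry_with_backoff`, its parameters, and the choice of retryable exceptions are all hypothetical.

```python
import asyncio
import random


async def retry_with_backoff(
    operation,
    max_attempts=4,
    base_delay=0.01,
    retry_on=(ConnectionError, TimeoutError),
):
    """Await operation(); on a retryable error, wait and try again.

    The wait doubles after each failed attempt (base_delay * 2**attempt),
    with random jitter added so that many clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return await operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay))
```

`operation` is any zero-argument coroutine function, such as a closure that fetches one file. Errors outside `retry_on` propagate immediately, which keeps permanent failures (for example, a missing file) from being retried.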

### Result Reporting

The method returns a structured result object containing lists of succeeded, failed, and skipped downloads with detailed information about each operation.
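One way to work with such a result is sketched below. The actual return type and field names are not specified here, so `DownloadResults`, the `guid` key on each entry, and both helper methods are hypothetical; only the three categories (succeeded, failed, skipped) come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class DownloadResults:
    """Hypothetical container mirroring the succeeded/failed/skipped lists."""

    succeeded: list = field(default_factory=list)
    failed: list = field(default_factory=list)
    skipped: list = field(default_factory=list)

    def summary(self):
        """Return a one-line count of each outcome."""
        total = len(self.succeeded) + len(self.failed) + len(self.skipped)
        return (
            f"{len(self.succeeded)}/{total} succeeded, "
            f"{len(self.failed)} failed, {len(self.skipped)} skipped"
        )

    def retry_candidates(self):
        """Return the GUIDs of failed entries (assumes each entry has a "guid" key)."""
        return [entry["guid"] for entry in self.failed]
```

Collecting the failed GUIDs this way makes it straightforward to build a smaller manifest for a second attempt.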

## Best Practices

### Configuration Recommendations

For optimal performance, adjust the concurrency and process settings based on your specific use case:

- **Small files**: Use higher concurrent request limits
- **Large files**: Use lower concurrent request limits to avoid overwhelming the system
- **High-bandwidth networks**: Increase the number of worker processes
- **Limited memory**: Reduce queue sizes to manage memory usage


## Comparison with Synchronous Downloads

### Performance Advantages

| Metric | Synchronous | Asynchronous |
| ------------------ | ---------------------------- | ---------------------------- |
| Memory Usage | O(n) - grows with file count | O(1) - bounded by queue size |
| CPU Utilization | Single core | Multiple cores |
| Network Efficiency | Sequential | Parallel |
| Scalability | Limited by GIL | Scales with CPU cores |

## Troubleshooting

### Common Issues

**Slow Downloads:**

- Check network bandwidth and server limits
- Reduce concurrent request limits if server is overwhelmed

**Memory Issues:**

- Reduce queue sizes and batch sizes
- Lower the number of worker processes if system memory is limited
- Monitor system memory usage during downloads

**Authentication Errors:**

- Verify credentials file is valid and not expired
- Check endpoint URL is correct
- Ensure proper permissions for target files

**Process Failures:**

- Check system resources (CPU, memory, file descriptors)
- Verify network connectivity to Gen3 commons
- Review logs for specific error messages

### Debugging

Enable verbose logging for detailed debugging:

```bash
gen3 -vv --endpoint my-commons.org --auth credentials.json download-multiple \
    --manifest files.json \
    --download-path ./downloads
```

## Examples

### Basic Usage

```bash
# Download files with default settings
gen3 --endpoint data.commons.io --auth creds.json download-multiple \
    --manifest my_files.json \
    --download-path ./data
```

### High-Performance Configuration

```bash
# Optimized for high-throughput downloads
gen3 --endpoint data.commons.io --auth creds.json download-multiple \
    --manifest large_dataset.json \
    --download-path ./large_downloads \
    --max-concurrent-requests 8 \
    --no-progress \
    --skip-completed
```

**Note**: The specific values shown in the examples (such as `--max-concurrent-requests 8`) are for demonstration only. For current options and default values, refer to the command-line help: `gen3 download-multiple --help`.
3 changes: 3 additions & 0 deletions gen3/cli/__main__.py
@@ -15,6 +15,7 @@
import gen3.cli.drs_pull as drs_pull
import gen3.cli.users as users
import gen3.cli.wrap as wrap
import gen3.cli.download as download
import gen3
from gen3 import logging as sdklogging
from gen3.cli import nih
@@ -142,6 +143,8 @@ def main(
main.add_command(objects.objects)
main.add_command(drs_pull.drs_pull)
main.add_command(file.file)
main.add_command(download.download_single, name="download-single")
main.add_command(download.download_multiple, name="download-multiple")
main.add_command(nih.nih)
main.add_command(users.users)
main.add_command(wrap.run)