An AWS Lambda function that processes visit event logs stored in S3 and creates queryable checkpoint parquet files for analytical reporting. The Lambda processes events incrementally, validates them against schema specifications, and aggregates data for monthly reporting queries.
This Lambda function reads event log files from the NACC Flywheel instance that are stored in S3, validates them using Pydantic models, and creates or updates a checkpoint parquet file optimized for analytical queries. Processing is incremental: the function reads the previous checkpoint (if one exists), retrieves only the JSON event files that are newer than that checkpoint, validates them, and merges them into an updated checkpoint file.
- Incremental Processing: Only processes new events since the last checkpoint, significantly improving performance
- Schema Validation: Uses Pydantic models to validate event structure and data types
- Error Resilience: Continues processing valid events even when some files fail validation
- Analytical Optimization: Outputs parquet files optimized for monthly reporting queries
- Observability: Full logging, tracing, and metrics using AWS Lambda Powertools
- Event Evolution Support: Preserves all event versions to track data completeness over time
The system follows a pipeline architecture with these components:
- CheckpointReader: Reads previous checkpoint and determines what new events to process
- S3EventRetriever: Retrieves event log files from S3 with timestamp filtering
- EventValidator: Validates events using Pydantic models
- CheckpointMerger: Merges previous checkpoint data with newly validated events
- ParquetWriter: Writes merged events to parquet format and uploads to S3
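
The way these components hand data to one another can be sketched with plain callables. This is a minimal illustration, not the project's code: `run_pipeline`, `PipelineResult`, and every callable signature below are hypothetical, and in-memory lists stand in for S3 and parquet.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

Event = dict  # stand-in for a validated visit event


@dataclass
class PipelineResult:
    new_events_processed: int
    events_failed: int
    total_events: int


def run_pipeline(
    read_checkpoint: Callable[[], tuple[list[Event], Optional[str]]],
    retrieve_since: Callable[[Optional[str]], Iterable[dict]],
    validate: Callable[[dict], Optional[Event]],
    merge: Callable[[list[Event], list[Event]], list[Event]],
    write_checkpoint: Callable[[list[Event]], None],
) -> PipelineResult:
    """Run the five stages in order; every callable is a hypothetical stand-in."""
    previous, since = read_checkpoint()          # CheckpointReader
    valid: list[Event] = []
    failed = 0
    for raw in retrieve_since(since):            # S3EventRetriever
        event = validate(raw)                    # EventValidator
        if event is None:
            failed += 1                          # keep going after a bad event
        else:
            valid.append(event)
    merged = merge(previous, valid)              # CheckpointMerger
    write_checkpoint(merged)                     # ParquetWriter
    return PipelineResult(len(valid), failed, len(merged))


# In-memory fakes in place of S3 and parquet.
stored = [{"ptid": "A1", "timestamp": "2024-01-01T00:00:00Z"}]
incoming = [
    {"ptid": "A2", "timestamp": "2024-01-02T00:00:00Z"},
    {"timestamp": "2024-01-03T00:00:00Z"},  # no ptid, so the fake validator rejects it
]
written: list[Event] = []

result = run_pipeline(
    read_checkpoint=lambda: (list(stored), "2024-01-01T00:00:00Z"),
    retrieve_since=lambda since: [
        e for e in incoming if since is None or e["timestamp"] > since
    ],
    validate=lambda raw: raw if "ptid" in raw else None,
    merge=lambda old, new: old + new,
    write_checkpoint=written.extend,
)
print(result)
```

With these fakes the run reports one new event, one failed event, and two events in the merged checkpoint. Passing the stages in as callables is only a convenience for the sketch; it keeps each stage testable without S3.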
- Runtime: Python 3.12
- Build System: Pants 2.29
- Development Environment: Dev container with devcontainer CLI
- Infrastructure: Terraform
- Libraries:
  - AWS Lambda Powertools (logging, tracing, metrics)
  - Pydantic (data validation)
  - Polars (DataFrame operations and parquet generation)
  - Boto3 (AWS SDK - included with Powertools)
All development uses the established dev container workflow:
./bin/start-devcontainer.sh
./bin/exec-in-devcontainer.sh pants fix lint check test ::
./bin/exec-in-devcontainer.sh pants package lambda/event_log_checkpoint/src/python/checkpoint_lambda::
./bin/exec-in-devcontainer.sh pants test lambda/event_log_checkpoint/test/python::
./bin/terminal.sh # For exploration and debugging only
./bin/stop-devcontainer.sh

Important: Use ./bin/exec-in-devcontainer.sh <command> for all pants and terraform commands. Terraform is only available inside the dev container.
lambda/event_log_checkpoint/
├── src/python/checkpoint_lambda/
│ ├── lambda_function.py # Main Lambda handler
│ ├── models.py # Pydantic VisitEvent model
│ ├── checkpoint_reader.py # Reads previous checkpoints
│ ├── s3_retriever.py # Retrieves events from S3
│ ├── validator.py # Validates events
│ ├── merger.py # Merges checkpoint data
│ └── parquet_writer.py # Writes parquet files
├── test/python/ # Unit and property tests
├── main.tf # Terraform configuration
├── variables.tf # Terraform variables
└── outputs.tf # Terraform outputs
The Lambda processes visit events with the following structure:
| Field | Type | Required | Description |
|---|---|---|---|
| action | str | Yes | Event type: submit, pass-qc, not-pass-qc, delete |
| study | str | Yes | Study identifier (default: "adrc") |
| pipeline_adcid | int | Yes | Pipeline/center identifier |
| project_label | str | Yes | Flywheel project label |
| center_label | str | Yes | Center/group label |
| gear_name | str | Yes | Gear that logged the event |
| ptid | str | Yes | Participant ID (pattern: ^[A-Z0-9]+$, max 10 chars) |
| visit_date | date | Yes | Visit date (ISO format YYYY-MM-DD) |
| visit_number | str | Yes | Visit number |
| datatype | str | Yes | Data type (form, dicom, etc.) |
| module | str | No | Module name (UDS, FTLD, etc.) |
| packet | str | No | Packet type (I, F, etc.) |
| timestamp | datetime | Yes | When action occurred (ISO 8601) |
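
The constraints in the table can be expressed as a small validation routine. The project uses a Pydantic `VisitEvent` model in `models.py`; the sketch below is a standard-library stand-in so it runs without dependencies, and it is not that model. `VisitEventSketch`, `parse_event`, and the sample values are hypothetical, and treating a missing `study` as `"adrc"` is an assumption drawn from the "default" note in the table.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional

ACTIONS = {"submit", "pass-qc", "not-pass-qc", "delete"}
PTID_PATTERN = re.compile(r"^[A-Z0-9]+$")
REQUIRED = (
    "action", "pipeline_adcid", "project_label", "center_label", "gear_name",
    "ptid", "visit_date", "visit_number", "datatype", "timestamp",
)


@dataclass(frozen=True)
class VisitEventSketch:
    action: str
    study: str
    pipeline_adcid: int
    project_label: str
    center_label: str
    gear_name: str
    ptid: str
    visit_date: date
    visit_number: str
    datatype: str
    timestamp: datetime
    module: Optional[str] = None
    packet: Optional[str] = None


def parse_event(raw: dict) -> VisitEventSketch:
    """Apply the table's constraints; raise ValueError on the first violation."""
    missing = [name for name in REQUIRED if raw.get(name) is None]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if raw["action"] not in ACTIONS:
        raise ValueError(f"unknown action: {raw['action']!r}")
    ptid = raw["ptid"]
    if not isinstance(ptid, str) or len(ptid) > 10 or not PTID_PATTERN.fullmatch(ptid):
        raise ValueError(f"invalid ptid: {ptid!r}")
    return VisitEventSketch(
        action=raw["action"],
        study=raw.get("study", "adrc"),  # assumption: absent study uses the default
        pipeline_adcid=int(raw["pipeline_adcid"]),
        project_label=raw["project_label"],
        center_label=raw["center_label"],
        gear_name=raw["gear_name"],
        ptid=ptid,
        visit_date=date.fromisoformat(raw["visit_date"]),
        visit_number=raw["visit_number"],
        datatype=raw["datatype"],
        # Map a trailing "Z" to an explicit offset so older Pythons parse it too.
        timestamp=datetime.fromisoformat(raw["timestamp"].replace("Z", "+00:00")),
        module=raw.get("module"),
        packet=raw.get("packet"),
    )


# Illustrative values only; none of them come from real event logs.
sample = {
    "action": "submit", "pipeline_adcid": 42, "project_label": "ingest-form",
    "center_label": "example-center", "gear_name": "example-gear",
    "ptid": "ABC123", "visit_date": "2024-01-15", "visit_number": "01",
    "datatype": "form", "timestamp": "2024-01-15T14:30:00Z",
}
event = parse_event(sample)
```

Optional fields (`module`, `packet`) come back as `None` when absent, which matches the note elsewhere in this document that they may be populated by later events.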
The deployment uses a multi-layer approach for optimal performance:
Powertools layer:
- Contents: AWS Lambda Powertools (includes boto3)
- Size: ~5MB
- Update Frequency: Low

Data processing layer:
- Contents: Pydantic and Polars libraries
- Size: ~15-20MB
- Update Frequency: Medium
# Build Lambda function
./bin/exec-in-devcontainer.sh pants package lambda/event_log_checkpoint/src/python/checkpoint_lambda:lambda
# Build Powertools layer
./bin/exec-in-devcontainer.sh pants package lambda/event_log_checkpoint/src/python/checkpoint_lambda:powertools
# Build data processing layer
./bin/exec-in-devcontainer.sh pants package lambda/event_log_checkpoint/src/python/checkpoint_lambda:data_processing

From the lambda directory:
# Standard deployment (reuses existing layers)
./bin/exec-in-devcontainer.sh bash -c "cd lambda/event_log_checkpoint && terraform apply"
# Force layer updates
./bin/exec-in-devcontainer.sh bash -c "cd lambda/event_log_checkpoint && terraform apply -var='reuse_existing_layers=false'"
# Function-only deployment (fastest for code changes)
./bin/exec-in-devcontainer.sh bash -c "cd lambda/event_log_checkpoint && terraform apply -target=aws_lambda_function.event_log_checkpoint"

| Variable | Description | Example |
|---|---|---|
| SOURCE_BUCKET | S3 bucket containing event logs | nacc-events |
| CHECKPOINT_BUCKET | S3 bucket for checkpoint files | nacc-checkpoints |
| CHECKPOINT_KEY | S3 key for checkpoint parquet file | checkpoints/events.parquet |
| LOG_LEVEL | Logging level | INFO |
| POWERTOOLS_SERVICE_NAME | Service name for tracing | event-log-checkpoint |
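
One way a handler might turn these variables into settings is sketched below. It assumes, without confirmation from the source, that the `source_bucket`, `checkpoint_bucket`, and `checkpoint_key` fields of the invocation event take precedence over the environment; `Settings` and `resolve_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    source_bucket: str
    checkpoint_bucket: str
    checkpoint_key: str
    log_level: str


def resolve_settings(
    event: Mapping[str, str], env: Mapping[str, str] = os.environ
) -> Settings:
    """Prefer values from the event, fall back to environment variables."""

    def pick(event_key: str, env_key: str) -> str:
        value = event.get(event_key) or env.get(env_key)
        if not value:
            raise KeyError(f"set {env_key} or pass {event_key!r} in the event")
        return value

    return Settings(
        source_bucket=pick("source_bucket", "SOURCE_BUCKET"),
        checkpoint_bucket=pick("checkpoint_bucket", "CHECKPOINT_BUCKET"),
        checkpoint_key=pick("checkpoint_key", "CHECKPOINT_KEY"),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )


# The event overrides SOURCE_BUCKET; the other values come from the environment.
settings = resolve_settings(
    event={"source_bucket": "nacc-event-logs"},
    env={
        "SOURCE_BUCKET": "nacc-events",
        "CHECKPOINT_BUCKET": "nacc-checkpoints",
        "CHECKPOINT_KEY": "checkpoints/events.parquet",
    },
)
```

Failing early with a message that names the missing variable tends to be easier to diagnose than an S3 error raised later in the run.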
Example invocation event:
{
  "source_bucket": "nacc-event-logs",
  "prefix": "logs/",
  "checkpoint_bucket": "nacc-checkpoints",
  "checkpoint_key": "checkpoints/events.parquet"
}

Example response:
{
  "statusCode": 200,
  "checkpoint_path": "s3://nacc-checkpoints/checkpoints/events.parquet",
  "new_events_processed": 150,
  "total_events": 10250,
  "events_failed": 2,
  "last_processed_timestamp": "2024-01-15T14:30:00Z",
  "execution_time_ms": 45000
}

The Lambda supports efficient incremental processing:
First run (no existing checkpoint):
- Processes all event log files in S3
- Creates initial checkpoint parquet file
- Returns total events processed

Subsequent runs:
- Reads previous checkpoint to get last processed timestamp
- Only retrieves and processes events with timestamp > last checkpoint
- Merges new events with previous checkpoint data
- Overwrites checkpoint file with merged data
This approach provides significant performance benefits:
- Reduced S3 Operations: Only reads new files
- Lower Memory Usage: Only processes new events
- Faster Execution: Typical incremental runs complete in 30-60 seconds
- Cost Savings: Shorter execution time = lower Lambda costs
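
The first-run and incremental behavior described above comes down to a watermark comparison. The sketch below uses plain dictionaries and the standard library instead of Polars and parquet; `parse_ts`, `last_processed`, and `incremental_update` are hypothetical helpers, not the project's functions.

```python
from datetime import datetime
from typing import Optional


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def last_processed(checkpoint: list[dict]) -> Optional[datetime]:
    """Latest timestamp in the checkpoint, or None when no checkpoint exists."""
    return max((parse_ts(e["timestamp"]) for e in checkpoint), default=None)


def incremental_update(
    checkpoint: list[dict], candidates: list[dict]
) -> tuple[list[dict], int]:
    """Keep only events strictly newer than the watermark, then merge."""
    watermark = last_processed(checkpoint)
    new = [
        e for e in candidates
        if watermark is None or parse_ts(e["timestamp"]) > watermark
    ]
    merged = sorted(checkpoint + new, key=lambda e: parse_ts(e["timestamp"]))
    return merged, len(new)


logs = [
    {"ptid": "A1", "timestamp": "2024-01-10T09:00:00Z"},
    {"ptid": "A2", "timestamp": "2024-01-12T09:00:00Z"},
]
first, n_first = incremental_update([], logs)        # first run: everything is new
logs.append({"ptid": "A3", "timestamp": "2024-01-15T14:30:00Z"})
second, n_second = incremental_update(first, logs)   # later run: only A3 is new
```

One consequence of the strict `>` comparison is that an event carrying exactly the watermark timestamp, but written after the previous run, would be skipped. Whether that can happen in practice depends on the timestamp resolution of the event logs.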
The system handles evolving event data intelligently:
- Multiple Events: Preserves all events for the same visit with different timestamps
- Data Evolution: Events may become more complete over time (e.g., module/packet fields populated later)
- Audit Trail: All event versions preserved for compliance and analysis
- Latest Data: Analysts can query for most recent event per visit to get complete data
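
Selecting the most recent event per visit can be sketched in plain Python. The visit key of `ptid`, `visit_date`, and `visit_number` follows the partitioning used in the queries in this document; `latest_per_visit` and the sample events are hypothetical.

```python
from datetime import datetime


def latest_per_visit(events: list[dict]) -> list[dict]:
    """Return one event per visit: the one with the greatest timestamp."""
    latest: dict[tuple, dict] = {}
    for event in events:
        key = (event["ptid"], event["visit_date"], event["visit_number"])
        current = latest.get(key)
        if current is None or event["timestamp"] > current["timestamp"]:
            latest[key] = event
    return list(latest.values())


events = [
    # The submit event arrives first, before the module is known.
    {"ptid": "ABC123", "visit_date": "2024-01-15", "visit_number": "01",
     "action": "submit", "module": None,
     "timestamp": datetime(2024, 1, 15, 14, 30)},
    # A later event for the same visit carries the module.
    {"ptid": "ABC123", "visit_date": "2024-01-15", "visit_number": "01",
     "action": "pass-qc", "module": "UDS",
     "timestamp": datetime(2024, 1, 16, 9, 0)},
    {"ptid": "XYZ789", "visit_date": "2024-01-20", "visit_number": "02",
     "action": "submit", "module": "FTLD",
     "timestamp": datetime(2024, 1, 20, 8, 0)},
]
current = latest_per_visit(events)
```

All three events stay in the checkpoint as the audit trail; `latest_per_visit` is only a read-side view over them.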
The checkpoint parquet file supports various analytical patterns:
-- Most recent event for each visit
SELECT *
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY ptid, visit_date, visit_number
      ORDER BY timestamp DESC
    ) AS rn
  FROM checkpoint_events
) ranked
WHERE rn = 1;

-- QC failures per center for visits dated in January 2024
SELECT center_label, COUNT(*) AS error_count
FROM checkpoint_events
WHERE action = 'not-pass-qc'
  AND visit_date >= '2024-01-01'
  AND visit_date < '2024-02-01'
GROUP BY center_label;

-- Average days between visit date and QC event, per center
-- (the date-difference function name and argument order vary by SQL engine)
SELECT
  center_label,
  AVG(DATEDIFF(timestamp, visit_date)) AS avg_qc_days
FROM checkpoint_events
WHERE action IN ('pass-qc', 'not-pass-qc')
GROUP BY center_label;

The project includes comprehensive testing:
Unit tests:
- Component-specific tests for each module
- Error handling and edge cases
- S3 integration tests

Property-based tests:
- Uses Hypothesis library for property verification
- Tests universal properties across many inputs
- Minimum 100 iterations per property test

Integration tests:
- End-to-end Lambda execution
- Real S3 integration (using LocalStack or test buckets)
- CloudWatch logs and metrics verification
# Run all tests
./bin/exec-in-devcontainer.sh pants test lambda/event_log_checkpoint/test/python::
# Run specific test file
./bin/exec-in-devcontainer.sh pants test lambda/event_log_checkpoint/test/python:test_validator
# Run with coverage
./bin/exec-in-devcontainer.sh pants test --use-coverage --coverage-py-report=html lambda/event_log_checkpoint/test/python::

Metrics:
- NewEventsProcessed: Count of newly processed events
- TotalEvents: Total events in checkpoint
- EventsFailed: Count of failed events
- ExecutionTime: Total execution time
- CheckpointFileSize: Size of checkpoint file

Tracing:
- S3 operations tracing
- Validation performance
- Parquet write operations

Alarm conditions:
- Lambda errors > 0
- Lambda duration > 10 minutes
- EventsFailed > 10% of NewEventsProcessed
- First run: 5-10 minutes for 10,000+ historical events
- Incremental runs: 30-60 seconds for 100-1000 new events
- Memory: 3GB to handle large datasets
- Timeout: 15 minutes maximum
- Incremental processing reduces processing time by 90%+
- Efficient parquet merge operations
- Snappy compression for optimal size/speed balance
- Connection pooling for S3 operations
The Lambda handles errors gracefully:
- Malformed JSON: Logs error and continues processing
- Schema validation failures: Logs specific errors and continues
- S3 permission errors: Logs error and fails execution
- Partial failures: Includes valid events in output, logs failed events
- Unexpected exceptions: Logs full error details and fails execution
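
The split between recoverable and fatal errors can be sketched as follows. This is an illustration under stated assumptions rather than the Lambda's code: `process_files`, `fetch`, and `require_ptid` are hypothetical, a dictionary stands in for S3, and `PermissionError` stands in for an S3 access failure (boto3 typically surfaces those as `botocore.exceptions.ClientError`).

```python
import json
import logging
from typing import Callable, Iterable

logger = logging.getLogger("checkpoint_sketch")


def process_files(
    keys: Iterable[str],
    fetch: Callable[[str], str],
    validate: Callable[[dict], dict],
) -> tuple[list[dict], list[str]]:
    """Collect valid events; log and skip files that fail to parse or validate."""
    valid: list[dict] = []
    failed: list[str] = []
    for key in keys:
        body = fetch(key)  # access errors are not caught: they fail the run
        try:
            valid.append(validate(json.loads(body)))
        except json.JSONDecodeError as exc:  # must precede ValueError, its parent
            logger.error("Malformed JSON in %s: %s", key, exc)
            failed.append(key)
        except ValueError as exc:
            logger.error("Validation failed for %s: %s", key, exc)
            failed.append(key)
    return valid, failed


def require_ptid(raw: dict) -> dict:
    """Toy validator: accept any event that has a ptid."""
    if "ptid" not in raw:
        raise ValueError("ptid is required")
    return raw


bodies = {
    "a.json": '{"ptid": "ABC123"}',
    "b.json": "{not json",
    "c.json": '{"action": "submit"}',
}
valid, failed = process_files(bodies, bodies.__getitem__, require_ptid)
```

Here `valid` holds the single good event and `failed` lists `b.json` and `c.json`, so a partial failure still yields a usable checkpoint while the failed keys remain available for the `EventsFailed` count.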
- Follow the established dev container workflow
- Run quality checks before committing:
./bin/exec-in-devcontainer.sh pants fix lint check test ::
- Add tests for new functionality
- Update documentation for significant changes
- Use property-based tests for universal behaviors
See LICENSE file for details.