# BiodiversityASSET

LLM-powered analysis of biodiversity-related investment activities in financial reports.
BiodiversityASSET is a comprehensive pipeline for extracting, classifying, and analyzing biodiversity-related content from investor reports. The system uses LLMs to evaluate paragraphs across three key dimensions:
- **Biodiversity relevance** - Identifies content related to biodiversity and environmental impact
- **Investment activity** - Classifies paragraphs containing concrete investment activities
- **Assetization characteristics** - Scores content on intrinsic value, cash flow, and ownership/control
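To make these dimensions concrete, a fully processed paragraph might end up looking like the record sketched below. The field names and values are illustrative assumptions, not the pipeline's exact output schema:

```python
# Hypothetical shape of one fully processed paragraph; field names and values
# are illustrative, not the pipeline's actual output schema.
example_record = {
    "source_pdf": "data/raw/pdfs/investor_report_2024.pdf",
    "paragraph": "In 2024 we allocated EUR 40M to a wetland restoration fund ...",
    "biodiversity_related": True,       # Step 2: biodiversity relevance filter
    "investment_activity": True,        # Step 3: concrete investment activity
    "assetization": {                   # Step 4: assetization characteristics
        "intrinsic_value": 2,
        "cash_flow": 3,
        "ownership_control": 1,
    },
}
```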
## Contents

- Key Features
- Quick Start
- Processing Pipeline
- Batch Job Management
- Prompt Customization
- Project Structure
- Output Organization
- Documentation
## Key Features

### Modular Architecture
- Submit batch jobs and monitor progress independently
- Resume workflows from any step using batch IDs
- Cancel running jobs with safety confirmations
### LLM-Powered Processing
- OpenAI Batch API integration for cost-effective analysis (~50% cost reduction)
- External prompt system for easy customization
- Support for multiple models and configurations
### Organized Output
- Results saved in batch-specific subfolders
- Clean filenames without ID conflicts
- Individual chunk processing for large datasets
### Developer-Friendly
- Comprehensive CLI tools with intuitive options
- Detailed progress monitoring and error handling
- Flexible configuration and custom prompt support
## Processing Pipeline

The processing pipeline consists of four sequential steps, with LLM-powered batch processing for steps 3-4:
| Step | Purpose | Script | Input | Output |
|------|---------|--------|-------|--------|
| 1 | Extract paragraphs from PDFs | `extract_pdfs.py` | `data/raw/pdfs/` | `extracted_paragraphs_from_pdfs/` |
| 2 | Filter biodiversity content | `filter_biodiversity_paragraphs.py` | `extracted_paragraphs_from_pdfs/` | `biodiversity_related_paragraphs/` |
| 3a | Submit investment classification | `submit_batch_job.py` | `biodiversity_related_paragraphs/` | Returns batch ID |
| 3b | Monitor batch progress | `check_batch_status.py` | Batch ID | Status updates |
| 3c | Download investment results | `download_batch_results.py` | Batch ID | `investment_activity_classification/` |
| 4a | Submit assetization scoring | `submit_batch_job.py` | `investment_activity_classification/` | Returns batch ID |
| 4b | Monitor batch progress | `check_batch_status.py` | Batch ID | Status updates |
| 4c | Download assetization results | `download_batch_results.py` | Batch ID | `assetization_features_scoring/` |
**Key Points:**
- Steps 3-4 use OpenAI's Batch API for cost-effective processing (see the request sketch after this list)
- Each batch step can be run independently
- Step 4 requires a completed investment activity classification batch ID
- All results are organized in batch-specific subfolders
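For context, the Batch API consumes a JSONL file in which every line is one self-contained chat-completion request. A minimal sketch of how a paragraph could be turned into such a request line is shown below; the `custom_id` scheme, model name, and prompt text are assumptions for illustration, not the exact requests built by `submit_batch_job.py`:

```python
import json

def to_batch_request(paragraph: str, index: int, system_prompt: str) -> str:
    """Build one OpenAI Batch API request line (JSONL) for a single paragraph."""
    request = {
        "custom_id": f"paragraph-{index}",   # assumed ID scheme
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",          # illustrative model choice
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": paragraph},
            ],
            "max_tokens": 500,
        },
    }
    return json.dumps(request)

# One request line per biodiversity-related paragraph goes into the batch input file.
with open("batch_input.jsonl", "w", encoding="utf-8") as f:
    f.write(to_batch_request(
        "We allocated EUR 40M to a wetland restoration fund.", 0,
        "Classify whether this paragraph describes a concrete investment activity.",
    ) + "\n")
```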
## Quick Start

Ensure you have uv installed:

**Windows:**

```powershell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

**Linux/macOS:**

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

**Clone and install dependencies:**

```bash
# Clone the repository
git clone <repository-url>
cd BiodiversityASSET
# Install dependencies
uv sync
```

**Set your OpenAI API key:**

```bash
# Set your OpenAI API key
export OPENAI_API_KEY="your-openai-api-key"
# Or create a .env file
echo "OPENAI_API_KEY=your-openai-api-key" > .envpython scripts/extract_pdfs.pypython scripts/filter_biodiversity_paragraphs.py# Submit the batch job
python scripts/submit_batch_job.py --task investment_activity_classification
# Monitor progress (replace <batch-id> with actual ID)
python scripts/check_batch_status.py --batch-id <batch-id> --wait
# Download results
python scripts/download_batch_results.py --batch-id <batch-id>
```

**Step 4 - Assetization features scoring:**

```bash
# Submit dependent job (requires investment batch ID)
python scripts/submit_batch_job.py --task assetization_features_scoring --batch-id <investment_batch_id>
# Monitor and download
python scripts/check_batch_status.py --batch-id <assetization_batch_id> --wait
python scripts/download_batch_results.py --batch-id <assetization_batch_id>
```
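The same Quick Start sequence can also be scripted. A minimal sketch using `subprocess` is shown below; the script names and flags come from this README, but the idea that `submit_batch_job.py` reports the batch ID on stdout is an assumption, and in practice you may simply copy it from the terminal:

```python
import subprocess

def run(*args: str) -> None:
    """Run one pipeline script and stop immediately if it fails."""
    subprocess.run(["python", *args], check=True)

# Steps 1-2: local preprocessing
run("scripts/extract_pdfs.py")
run("scripts/filter_biodiversity_paragraphs.py")

# Step 3a: submit the investment activity classification batch job,
# then monitor and download it with the batch ID reported by the script.
run("scripts/submit_batch_job.py", "--task", "investment_activity_classification")
```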
## Batch Job Management

```bash
# List all batch jobs with LAST-CHECKED status and timestamps
python scripts/check_batch_status.py --list-jobs
# Check CURRENT status of a specific job
python scripts/check_batch_status.py --batch-id <batch-id>
# Wait for job completion (polls every 30 seconds)
python scripts/check_batch_status.py --batch-id <batch-id> --wait
# Custom polling interval
python scripts/check_batch_status.py --batch-id <batch-id> --wait --poll-interval 60

# Cancel a running batch job (requires confirmation)
python scripts/check_batch_status.py --batch-id <batch-id> --cancel
```

Example output of `--list-jobs`:

```
=== Batch Jobs (3 found) ===
Batch ID                                  Task                         Status        Last Checked    Submitted           Paragraphs
------------------------------------------------------------------------------------------------------------------------------------
batch_686fc36b2da08190903bc237510c52f5    investment_activity_class    completed     07-10 16:55     2025-07-10T15:43    120
batch_686fd9e4f814819088b69150a57753d6    assetization_features_sc     submitted     never           2025-07-10T17:19    3
batch_686fdd5143248190aae3f8185f24a415    investment_activity_class    in_progress   07-10 14:30     2025-07-10T14:15    274
```
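Internally, `check_batch_status.py --wait` presumably does little more than poll OpenAI's batch endpoint. A minimal sketch of such a loop with the official Python SDK is shown below (the 30-second default mirrors the `--wait` behaviour above; everything else is an assumption about the script's internals):

```python
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}

def wait_for_batch(batch_id: str, poll_interval: int = 30):
    """Poll the OpenAI Batch API until the job reaches a terminal status."""
    while True:
        batch = client.batches.retrieve(batch_id)
        counts = batch.request_counts
        print(f"{batch_id}: {batch.status} "
              f"({counts.completed}/{counts.total} requests, {counts.failed} failed)")
        if batch.status in TERMINAL_STATUSES:
            return batch
        time.sleep(poll_interval)
```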
## Prompt Customization

BiodiversityASSET uses external text files for prompts, making them easy to customize without code changes:
- `prompts/investment_activity_classification_system_prompt.txt` - System prompt for investment activity classification
- `prompts/assetization_features_scoring_system_prompt.txt` - System prompt for assetization features scoring
- `prompts/user_prompt_template.txt` - User prompt template applied to each paragraph
```bash
# Use custom system prompt
python scripts/submit_batch_job.py --task investment_activity_classification \
--system-prompt prompts/my_custom_system.txt
# Use both custom system and user prompts
python scripts/submit_batch_job.py --task investment_activity_classification \
--system-prompt prompts/my_custom_system.txt \
--user-prompt prompts/my_custom_user.txt
# Use different model with custom prompts
python scripts/submit_batch_job.py --task assetization_features_scoring \
--batch-id <investment_batch_id> \
--model gpt-4o \
--max-tokens 750 \
    --system-prompt prompts/my_custom_system.txt
```
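How the user prompt template is filled in for each paragraph is defined by the scripts. A plausible sketch, assuming the template contains a standard `str.format` placeholder such as `{paragraph}` (an assumption; check `prompts/user_prompt_template.txt`), looks like this:

```python
from pathlib import Path

# Assumption: the user template exposes a "{paragraph}" placeholder.
system_prompt = Path("prompts/investment_activity_classification_system_prompt.txt").read_text(encoding="utf-8")
user_template = Path("prompts/user_prompt_template.txt").read_text(encoding="utf-8")

paragraph = "We allocated EUR 40M to a wetland restoration fund."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_template.format(paragraph=paragraph)},
]
```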
## Project Structure

```
BiodiversityASSET/
├── data/
│   ├── raw/
│   │   └── pdfs/                                   # Input: PDF investor reports
│   ├── processed/
│   │   ├── extracted_paragraphs_from_pdfs/         # Step 1: Extracted paragraphs
│   │   ├── biodiversity_related_paragraphs/        # Step 2: Filtered biodiversity content
│   │   ├── investment_activity_classification/     # Step 3: Investment classification results
│   │   │   └── <batch_id>/
│   │   │       ├── batch_results.jsonl
│   │   │       ├── chunk_1.csv
│   │   │       └── chunk_2.csv
│   │   └── assetization_features_scoring/          # Step 4: Assetization scoring results
│   │       └── <batch_id>/
│   │           ├── batch_results.jsonl
│   │           └── assetization_features_scored.csv
│   └── human_annotations/                          # Manual annotations for evaluation
├── prompts/                                        # LLM prompt templates
│   ├── investment_activity_classification_system_prompt.txt
│   ├── assetization_features_scoring_system_prompt.txt
│   └── user_prompt_template.txt
├── results/
│   ├── batch_jobs/                                 # Batch job metadata and raw results
│   │   ├── <batch_id>.json
│   │   ├── investment_activity_classification_processing/
│   │   └── assetization_features_scoring_processing/
│   └── evaluation/                                 # Evaluation results (future)
├── scripts/                                        # Python processing scripts
├── pyproject.toml                                  # Project dependencies
├── uv.lock                                         # Lock file for dependencies
├── README.md                                       # Project documentation
├── BATCH_WORKFLOW.md                               # Detailed batch processing workflow
└── REFACTORING_SUMMARY.md                          # Summary of refactoring changes
```
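The per-job metadata stored in `results/batch_jobs/<batch_id>.json` is what feeds the `--list-jobs` table shown earlier. Its exact schema is defined by `submit_batch_job.py`; the sketch below only illustrates the kind of fields implied by that table's columns, and all field names are guesses:

```python
# Hypothetical contents of results/batch_jobs/<batch_id>.json; field names are
# guesses based on the --list-jobs columns, not the actual schema.
example_job_metadata = {
    "batch_id": "batch_686fc36b2da08190903bc237510c52f5",
    "task": "investment_activity_classification",
    "status": "completed",
    "submitted_at": "2025-07-10T15:43",
    "last_checked": "2025-07-10T16:55",
    "num_paragraphs": 120,
}
```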
## Output Organization

Results are organized in batch-specific subfolders to prevent conflicts and enable easy tracking:
### Investment activity classification (Step 3)

```
data/processed/investment_activity_classification/<batch_id>/
├── batch_results.jsonl    # Raw API responses
├── chunk_1.csv            # Processed results for chunk 1
└── chunk_2.csv            # Processed results for chunk 2
```
Contains: Investment activity scores, explanations, and original paragraph metadata
### Assetization features scoring (Step 4)

```
data/processed/assetization_features_scoring/<batch_id>/
├── batch_results.jsonl                 # Raw API responses
└── assetization_features_scored.csv    # Scored paragraphs with all dimensions
```
Contains: Intrinsic value, cash flow, and ownership/control scores with detailed reasoning
- **Conflict-free:** Each batch job gets its own subfolder
- **Clean naming:** Filenames without batch ID suffixes
- **Traceable:** Easy to identify which batch produced which results
- **Resumable:** Can re-run or reference specific batch outputs
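For downstream analysis, the raw `batch_results.jsonl` can be parsed line by line (it follows the OpenAI Batch API output format) and the processed CSVs concatenated. A minimal sketch, assuming `pandas` is available and without knowing the exact CSV column names:

```python
import json
from pathlib import Path

import pandas as pd

# Point this at a concrete batch folder.
batch_dir = Path("data/processed/investment_activity_classification") / "<batch_id>"

# Each line of batch_results.jsonl is one OpenAI Batch API response object.
raw = [json.loads(line) for line in (batch_dir / "batch_results.jsonl").read_text(encoding="utf-8").splitlines()]
first_answer = raw[0]["response"]["body"]["choices"][0]["message"]["content"]
print(first_answer)

# The processed per-chunk CSVs can simply be concatenated for analysis.
results = pd.concat(
    (pd.read_csv(p) for p in sorted(batch_dir.glob("chunk_*.csv"))),
    ignore_index=True,
)
print(f"{len(results)} classified paragraphs")
```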
## Documentation

- **BATCH_WORKFLOW.md** - Detailed step-by-step workflow guide with examples
- **REFACTORING_SUMMARY.md** - Complete summary of system architecture and changes
## Contributing

We welcome contributions! Please see our contribution guidelines for more information.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Citation

If you use BiodiversityASSET in your research, please cite:
```bibtex
@software{biodiversityasset,
  title  = {BiodiversityASSET: LLM-powered analysis of biodiversity-related investment activities},
  author = {SoDa},
  year   = {2025},
  url    = {https://github.com/yourusername/BiodiversityASSET}
}
```

This project is developed and maintained by the ODISSEI Social Data Science (SoDa) team.
Do you have questions, suggestions, or remarks? File an issue or feel free to contact Qixiang Fang or Catalina Papari.