Super-parallel corpus crawler for multilingual NLP and Computational Linguistics research
A type-safe ReScript implementation for building massive parallel corpora across 1500+ languages from Bible translation sources.
- 1500+ Languages: Crawl parallel texts from multiple Bible corpus sources
- Type-Safe: Built with ReScript for compile-time correctness
- Proof-Verified: Echidna integration for mathematical verification
- RSR Gold Compliant: Follows Rhodium Standard Repository specifications
- Semantic Grounding: OpenCyc integration for common-sense reasoning
- Container-Ready: Podman-native (no Docker required)
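Bible translations share a book/chapter/verse numbering, which is what makes them usable as a parallel corpus: texts in different languages can be paired by verse identifier. The sketch below illustrates that idea in Python, chosen for illustration only since the project itself is written in ReScript. The verse ID format (`GEN.1.1`), the `align_by_verse` function, and the sample data are assumptions, not the API of `Alignment.res`.

```python
from typing import Dict, List, Tuple


def align_by_verse(
    source: Dict[str, str],
    target: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (verse_id, source_text, target_text) for verses present in both texts."""
    # Only verses that exist in both translations can form a parallel pair.
    shared = sorted(source.keys() & target.keys())
    # Note: plain string sorting is not canonical verse order
    # (e.g. "GEN.1.10" sorts before "GEN.1.2"); it is used here for determinism only.
    return [(verse_id, source[verse_id], target[verse_id]) for verse_id in shared]


# Hypothetical sample data keyed by an assumed BOOK.CHAPTER.VERSE identifier.
english = {
    "GEN.1.1": "In the beginning God created the heaven and the earth.",
    "GEN.1.2": "And the earth was without form, and void; ...",
    "GEN.1.3": "And God said, Let there be light: and there was light.",
}
german = {
    "GEN.1.1": "Am Anfang schuf Gott Himmel und Erde.",
    "GEN.1.3": "Und Gott sprach: Es werde Licht! Und es ward Licht.",
}

pairs = align_by_verse(english, german)
for verse_id, en_text, de_text in pairs:
    print(f"{verse_id}\t{en_text}\t{de_text}")
```

Verses missing from either side (here `GEN.1.2`) are dropped rather than padded, so every output row is a true translation pair.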
# Clone the repository
git clone https://github.com/Hyperpolymath/1000Langs.git
cd 1000Langs
# Using Nix (recommended)
nix develop
# Or using npm directly
npm install

1000Langs/
├── src/
│ ├── Lang1000.res # Main entry point
│ ├── crawlers/ # Web crawler implementations
│ │ ├── Crawler.res # Base crawler module
│ │ ├── BibleCloud.res # bible.cloud crawler
│ │ ├── BibleCom.res # bible.com crawler
│ │ └── PngScriptures.res # pngscriptures.org crawler
│ ├── api/ # API client wrappers
│ │ └── DigitalBiblePlatform.res
│ ├── corpus/ # Corpus management
│ │ └── Alignment.res # Parallel text alignment
│ ├── utils/ # Utility modules
│ │ ├── Iso639.res # Language code handling
│ │ └── Statistics.res # Statistical functions
│ ├── proofs/ # Mathematical proofs
│ └── cyc/ # OpenCyc integration
├── test/ # Test suites
├── proofs/ # Echidna proof files
├── config/ # Nickel configuration
├── meta/ # Reference data
├── .well-known/ # Discovery files
├── justfile # Task automation
├── flake.nix # Nix development environment
├── Containerfile # Podman container definition
└── rescript.json # ReScript configuration

| Source | URL | Type | Languages |
|---|---|---|---|
| Bible Cloud | | API | 1500+ |
| Bible.com | | Scraper | 2000+ |
| PNG Scriptures | | Download | 800+ |
| eBible | | Download | 1000+ |
| Find.Bible | | API | 1200+ |
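The source table can be modelled as a small typed registry, which makes it easy to select sources by access method. This is a minimal Python sketch for illustration, not the project's ReScript code. The `AccessType` and `Source` names are hypothetical, and each "N+" language count is read as a lower bound.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class AccessType(Enum):
    """How a source is retrieved, taken from the table's Type column."""
    API = "API"
    SCRAPER = "Scraper"
    DOWNLOAD = "Download"


@dataclass(frozen=True)
class Source:
    name: str
    access: AccessType
    min_languages: int  # the table's "N+" figure, treated as a lower bound


# Rows copied from the table above; URLs are omitted because none are listed.
SOURCES: List[Source] = [
    Source("Bible Cloud", AccessType.API, 1500),
    Source("Bible.com", AccessType.SCRAPER, 2000),
    Source("PNG Scriptures", AccessType.DOWNLOAD, 800),
    Source("eBible", AccessType.DOWNLOAD, 1000),
    Source("Find.Bible", AccessType.API, 1200),
]


def by_access(sources: List[Source], access: AccessType) -> List[Source]:
    """Return the sources that use the given access method, in table order."""
    return [source for source in sources if source.access is access]


download_names = [source.name for source in by_access(SOURCES, AccessType.DOWNLOAD)]
print(download_names)
```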
Configuration is managed through Nickel for type-safe, validated settings:
# Validate configuration
just nickel-check
# Export to JSON
just nickel-export
# Show resolved config
just nickel-show

See config/main.ncl for all configuration options.
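Once the configuration is exported to JSON, a consumer may still want to check its shape before using it. The Python sketch below shows one way to do that. The keys `output_dir`, `max_concurrency`, and `sources` are hypothetical and chosen only for illustration; the real option names are defined in config/main.ncl.

```python
import json
from typing import Any, Dict

# Hypothetical required keys and their expected JSON types (assumed, not from main.ncl).
REQUIRED_KEYS = {
    "output_dir": str,
    "max_concurrency": int,
    "sources": list,
}


def validate_config(raw_json: str) -> Dict[str, Any]:
    """Parse exported JSON and check that each required key has the expected type."""
    config = json.loads(raw_json)
    if not isinstance(config, dict):
        raise ValueError("configuration must be a JSON object")
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in config:
            raise ValueError(f"missing required key: {key}")
        value = config[key]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"key {key!r} must be of type {expected_type.__name__}")
    return config


# Hypothetical exported configuration, inlined so the sketch is self-contained.
example = '{"output_dir": "corpus/", "max_concurrency": 4, "sources": ["eBible"]}'
config = validate_config(example)
print(config["max_concurrency"])
```

Nickel already validates the settings at export time; a check like this only guards the boundary where another tool reads the JSON.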
# Run all tests
just test
# Run with coverage
just test-coverage
# Run proof verification
just prove

This project integrates with Echidna for mathematical proof verification:
- Data integrity proofs
- Alignment correctness verification
- Statistical property validation
- Type safety guarantees
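To make "alignment correctness" concrete, the properties involved can be stated as invariants over an alignment result. The Python sketch below checks three such invariants at runtime. It does not use Echidna or reflect its proof format, and the invariants, the `check_alignment` name, and the tuple layout are assumptions about what a correct verse alignment should satisfy.

```python
from typing import Dict, List, Tuple

Aligned = List[Tuple[str, str, str]]  # (verse_id, source_text, target_text)


def check_alignment(
    source: Dict[str, str],
    target: Dict[str, str],
    aligned: Aligned,
) -> List[str]:
    """Return a list of invariant violations; an empty list means all checks pass."""
    violations: List[str] = []
    seen = set()
    for verse_id, source_text, target_text in aligned:
        # Invariant 1: each verse ID appears at most once.
        if verse_id in seen:
            violations.append(f"duplicate verse id: {verse_id}")
        seen.add(verse_id)
        # Invariant 2: aligned texts are exactly the texts stored under that ID.
        if source.get(verse_id) != source_text:
            violations.append(f"source text mismatch: {verse_id}")
        if target.get(verse_id) != target_text:
            violations.append(f"target text mismatch: {verse_id}")
    # Invariant 3: every verse present on both sides has been aligned.
    for verse_id in sorted((source.keys() & target.keys()) - seen):
        violations.append(f"shared verse not aligned: {verse_id}")
    return violations


# Hypothetical data: one correct alignment and one with a corrupted target text.
src = {"GEN.1.1": "source one", "GEN.1.2": "source two"}
tgt = {"GEN.1.1": "target one", "GEN.1.2": "target two"}
good = [("GEN.1.1", "source one", "target one"), ("GEN.1.2", "source two", "target two")]
bad = [("GEN.1.1", "source one", "corrupted")]

print(check_alignment(src, tgt, good))
print(check_alignment(src, tgt, bad))
```

A runtime check like this only tests the inputs it is given, whereas a proof is meant to establish the property for all inputs.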
# Run all proofs
just prove
# Check specific proof
just prove-check alignment_correctness

Uses Podman (not Docker) for container operations:
# Build container
just container-build
# Run container
just container-run
# Deploy with volume mounts
just container-dev

This project targets Gold (100%) compliance with the Rhodium Standard Repository specification:
# Run compliance audit
just rsr-audit
# Generate HTML report
just rsr-audit-html

See CONTRIBUTING.adoc for guidelines.
This project uses the Tri-Perimeter Contribution Framework (TPCF):
- Perimeter 1 (Core): Maintainers only
- Perimeter 2 (Expert): Trusted contributors
- Perimeter 3 (Community): Open contributions
Dual licensed under:
- Palimpsest-MPL-1.0 License
- Palimpsest License v0.8
See LICENSE.txt for details.
Commercial use with attribution is permitted. Proprietary AI training without attribution is prohibited.