LOL

Super-parallel corpus crawler for multilingual NLP and computational linguistics research

A type-safe ReScript implementation for building massive parallel corpora across 1500+ languages from Bible translation sources.

Features

  • 1500+ Languages - Crawl parallel texts from multiple Bible corpus sources

  • Type-Safe - Built with ReScript for compile-time correctness

  • Proof-Verified - Echidna integration for mathematical verification

  • RSR Gold Compliant - Follows Rhodium Standard Repository specifications

  • Semantic Grounding - OpenCyc integration for common-sense reasoning

  • Container-Ready - Podman-native (no Docker required)

Quick Start

Prerequisites

  • Node.js 20+

  • Just command runner

  • Nix (recommended) or npm

  • Podman (for containers)

Installation

# Clone the repository
git clone https://github.com/Hyperpolymath/1000Langs.git
cd 1000Langs

# Using Nix (recommended)
nix develop

# Or using npm directly
npm install

Build & Run

# Build the project
just build

# Run tests
just test

# Run the crawler
just crawl-all

Project Structure

1000Langs/
├── src/
│   ├── Lang1000.res          # Main entry point
│   ├── crawlers/             # Web crawler implementations
│   │   ├── Crawler.res       # Base crawler module
│   │   ├── BibleCloud.res    # bible.cloud crawler
│   │   ├── BibleCom.res      # bible.com crawler
│   │   └── PngScriptures.res # pngscriptures.org crawler
│   ├── api/                  # API client wrappers
│   │   └── DigitalBiblePlatform.res
│   ├── corpus/               # Corpus management
│   │   └── Alignment.res     # Parallel text alignment
│   ├── utils/                # Utility modules
│   │   ├── Iso639.res        # Language code handling
│   │   └── Statistics.res    # Statistical functions
│   ├── proofs/               # Mathematical proofs
│   └── cyc/                  # OpenCyc integration
├── test/                     # Test suites
├── proofs/                   # Echidna proof files
├── config/                   # Nickel configuration
├── meta/                     # Reference data
├── .well-known/              # Discovery files
├── justfile                  # Task automation
├── flake.nix                 # Nix development environment
├── Containerfile             # Podman container definition
└── rescript.json             # ReScript configuration

Supported Sources

Source          URL                        Type      Languages
Bible Cloud     https://bible.cloud        API       1500+
Bible.com       https://bible.com          Scraper   2000+
PNG Scriptures  https://pngscriptures.org  Download  800+
eBible          https://ebible.org         Download  1000+
Find.Bible      https://find.bible         API       1200+
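
For readers who want to work with the source list programmatically, it can be modelled as typed data. The sketch below is illustrative TypeScript, not the project's ReScript code. The names `SourceType`, `CorpusSource`, `SOURCES`, and `sourcesByType` are hypothetical, and the language counts are the lower bounds listed above.

```typescript
// Hypothetical names throughout; only the values come from the source list above.
type SourceType = "API" | "Scraper" | "Download";

interface CorpusSource {
  name: string;
  url: string;
  type: SourceType;
  // Lower bound on the language count as listed (for example "1500+" becomes 1500).
  minLanguages: number;
}

const SOURCES: readonly CorpusSource[] = [
  { name: "Bible Cloud", url: "https://bible.cloud", type: "API", minLanguages: 1500 },
  { name: "Bible.com", url: "https://bible.com", type: "Scraper", minLanguages: 2000 },
  { name: "PNG Scriptures", url: "https://pngscriptures.org", type: "Download", minLanguages: 800 },
  { name: "eBible", url: "https://ebible.org", type: "Download", minLanguages: 1000 },
  { name: "Find.Bible", url: "https://find.bible", type: "API", minLanguages: 1200 },
];

// Select the sources that use a given access method.
function sourcesByType(type: SourceType): CorpusSource[] {
  return SOURCES.filter((source) => source.type === type);
}

// Lists the names of the API-backed sources: Bible Cloud and Find.Bible.
console.log(sourcesByType("API").map((source) => source.name));
```

Using a string-literal union for the access method means a misspelled type is rejected at compile time, which is the same kind of guarantee the project gets from ReScript variants.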

Configuration

Configuration is managed through Nickel for type-safe, validated settings:

# Validate configuration
just nickel-check

# Export to JSON
just nickel-export

# Show resolved config
just nickel-show

See config/main.ncl for all configuration options.

Testing

# Run all tests
just test

# Run with coverage
just test-coverage

# Run proof verification
just prove

Proof Verification

This project integrates with Echidna for mathematical proof verification:

  • Data integrity proofs

  • Alignment correctness verification

  • Statistical property validation

  • Type safety guarantees

# Run all proofs
just prove

# Check specific proof
just prove-check alignment_correctness
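
The exact statements being proved live in the proof files. As a rough illustration of what an alignment-correctness property could look like, the TypeScript sketch below (not the project's ReScript code) assumes that each language's text is keyed by a verse ID such as `GEN.1.1` and that a correct alignment contains only verses present in every language. The types `VerseMap` and `AlignedVerse`, the functions `alignByVerse` and `isFullyAligned`, and the verse ID format are all assumptions made for this example.

```typescript
// Hypothetical type: verse ID (e.g. "GEN.1.1") mapped to its text in one language.
type VerseMap = Map<string, string>;

interface AlignedVerse {
  verseId: string;
  // Language code mapped to the verse text in that language.
  texts: Record<string, string>;
}

// Keep only verses present in every language, so each row is fully parallel.
function alignByVerse(corpora: Record<string, VerseMap>): AlignedVerse[] {
  const languages = Object.keys(corpora);
  if (languages.length === 0) return [];
  const [first, ...rest] = languages;
  const aligned: AlignedVerse[] = [];
  for (const verseId of corpora[first].keys()) {
    // Skip verses that are missing from at least one other language.
    if (!rest.every((lang) => corpora[lang].has(verseId))) continue;
    const texts: Record<string, string> = {};
    for (const lang of languages) texts[lang] = corpora[lang].get(verseId)!;
    aligned.push({ verseId, texts });
  }
  return aligned;
}

// The property a correctness check might assert: every row covers every language.
function isFullyAligned(rows: AlignedVerse[], languages: string[]): boolean {
  return rows.every((row) => languages.every((lang) => lang in row.texts));
}

// Placeholder texts; the second verse is missing from "deu" and is dropped.
const corpora: Record<string, VerseMap> = {
  eng: new Map([["GEN.1.1", "verse 1 (English)"], ["GEN.1.2", "verse 2 (English)"]]),
  deu: new Map([["GEN.1.1", "verse 1 (German)"]]),
};
const rows = alignByVerse(corpora);
console.log(rows.length, isFullyAligned(rows, ["eng", "deu"])); // prints: 1 true
```

A runtime check like `isFullyAligned` only tests one dataset. A proof would establish the same property for all inputs.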

Container Deployment

The project uses Podman (not Docker) for container operations:

# Build container
just container-build

# Run container
just container-run

# Deploy with volume mounts
just container-dev

RSR Compliance

This project targets Gold (100%) compliance with the Rhodium Standard Repository specification:

# Run compliance audit
just rsr-audit

# Generate HTML report
just rsr-audit-html

Contributing

See CONTRIBUTING.adoc for guidelines.

This project uses the Tri-Perimeter Contribution Framework (TPCF):

  • Perimeter 1 (Core): Maintainers only

  • Perimeter 2 (Expert): Trusted contributors

  • Perimeter 3 (Community): Open contributions

License

Dual licensed under:

  • Palimpsest-MPL-1.0 License

  • Palimpsest License v0.8

See LICENSE.txt for details.

Commercial use with attribution is permitted. Proprietary AI training without attribution is prohibited.

Acknowledgments

  • Original Python implementation by Ehsaneddin Asgari (LMU Munich)

  • Bible corpus data from various open Bible translation projects

  • Echidna for proof verification

  • RSR for compliance framework
