TanyTree

A specialized tool to scrape genealogical data from baalhatanya.org.il and export it to the standard GEDCOM format. This project is designed to help researchers and genealogists interested in the Chabad dynasty preserve and utilize family tree data in their own genealogy software.

Features

- Automated Crawling: Recursively traverses individual and family pages to build a comprehensive tree.
- Data Extraction: Captures names, birth/death dates, locations, and gender.
- Relationship Mapping: Automatically links parents, spouses, and children.
- Resumable Crawling: Skips individuals that were already scraped successfully and tracks URLs that were discovered but not yet scraped.
- Hebrew Support: Designed to handle the Hebrew text and cultural context of the source site.
- GEDCOM Export: Generates valid GEDCOM files compatible with Gramps, MyHeritage, Ancestry, and other standard software.
- SQLite Storage: Stores scraped data in a local database, so interrupted crawls can be resumed and the tree can be updated incrementally.

Installation

Clone the repository:
```
git clone <repository-url>
cd TanyTree
```
Install dependencies (a virtual environment is recommended):
```
pip install -r requirements.txt
```

Usage

The project provides a Command Line Interface (CLI) via src/cli.py. All commands automatically skip individuals already present in the database.

1. Scrape a Single Person

Extract data for a specific individual using their URL.

```
python -m src.cli scrape "https://baalhatanya.org.il/.../?i=111815"
```
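The example URL identifies a person through the `i` query parameter. As a minimal sketch, that value can be read with the standard library. The helper name `person_id_from_url` is hypothetical, and whether the project uses this parameter as its database key is an assumption.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def person_id_from_url(url: str) -> Optional[str]:
    """Return the value of the `i` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("i")
    return values[0] if values else None


# The elided path ("...") is kept exactly as it appears in the usage example.
print(person_id_from_url("https://baalhatanya.org.il/.../?i=111815"))  # 111815
```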

2. Crawl the Tree

Start from a specific URL and recursively crawl connected individuals.

```
python -m src.cli crawl "https://baalhatanya.org.il/.../?i=111815" --limit 100
```

- --limit: Maximum number of people to crawl (default: 100).
- --workers: Number of concurrent workers (default: 2).
- --delay: Delay between requests in seconds (default: 1.0).
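To illustrate how a limit and a delay can bound a recursive crawl, here is a simplified, single-threaded sketch. It is not the project's engine: the function `crawl`, the `fetch_links` callback, and the toy graph are assumptions made for illustration, and concurrent workers are left out.

```python
import time
from collections import deque
from typing import Callable, Iterable, List


def crawl(start_url: str,
          fetch_links: Callable[[str], Iterable[str]],
          limit: int = 100,
          delay: float = 1.0) -> List[str]:
    """Visit pages breadth-first and stop once `limit` pages were visited."""
    queue = deque([start_url])
    seen = {start_url}          # every URL ever queued, to avoid repeats
    visited: List[str] = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        if visited:
            time.sleep(delay)   # pause between consecutive requests
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        visited.append(url)
    return visited


# A toy link graph stands in for real pages so the sketch runs offline.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": []}
print(crawl("a", lambda u: graph.get(u, []), limit=3, delay=0.0))  # ['a', 'b', 'c']
```

URLs left in the queue when the limit is reached correspond to the "discovered but pending" pages that a later run can pick up.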

3. Retry Failed/Pending Pages

If a crawl was interrupted or some pages failed due to network errors, use the retry command to attempt them again without re-crawling the entire tree.

```
python -m src.cli retry --limit 100
```
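One way to make retries possible is to record a status for every discovered URL and select the unfinished ones later. The sketch below uses an in-memory SQLite database. The `pages` table, its `status` values, and the helper names are hypothetical and may differ from the schema in database.py.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the hypothetical `pages` table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, "
        "status TEXT NOT NULL)"  # expected values: 'done', 'pending', 'failed'
    )
    return conn


def mark(conn, url, status):
    """Insert the URL or overwrite its previous status."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, status) VALUES (?, ?)",
        (url, status),
    )


def urls_to_retry(conn, limit=100):
    """Return up to `limit` URLs that are still pending or have failed."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE status IN ('pending', 'failed') "
        "ORDER BY url LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows]


conn = open_db()
mark(conn, "page-1", "done")
mark(conn, "page-2", "pending")
mark(conn, "page-3", "failed")
print(urls_to_retry(conn))  # ['page-2', 'page-3']
```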

4. Export to GEDCOM

Convert the stored database records into a GEDCOM file.

```
python -m src.cli export genealogy.ged
```
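For readers unfamiliar with GEDCOM, the following sketch shows the general shape of the output: a header, one INDI record per person, one FAM record per family, and a trailer. The function `to_gedcom`, the dictionary keys, and the placeholder people are assumptions for illustration and do not reflect gedcom_exporter.py. The sketch also omits parts that a strictly valid file would need, such as a submitter record and the FAMS/FAMC links on each individual.

```python
def to_gedcom(people, families):
    """Serialize plain dictionaries into a small GEDCOM-style document."""
    lines = [
        "0 HEAD",
        "1 SOUR TanyTree",
        "1 GEDC",
        "2 VERS 5.5.1",
        "2 FORM LINEAGE-LINKED",
        "1 CHAR UTF-8",
    ]
    for p in people:
        lines.append(f"0 @I{p['id']}@ INDI")
        lines.append(f"1 NAME {p['given']} /{p['surname']}/")
        if p.get("sex"):
            lines.append(f"1 SEX {p['sex']}")
        if p.get("birth"):
            lines.append("1 BIRT")
            lines.append(f"2 DATE {p['birth']}")
    for f in families:
        lines.append(f"0 @F{f['id']}@ FAM")
        if f.get("husband"):
            lines.append(f"1 HUSB @I{f['husband']}@")
        if f.get("wife"):
            lines.append(f"1 WIFE @I{f['wife']}@")
        for child in f.get("children", []):
            lines.append(f"1 CHIL @I{child}@")
    lines.append("0 TRLR")
    return "\n".join(lines) + "\n"


# Placeholder records, not data from the source site.
people = [
    {"id": 1, "given": "Parent", "surname": "Example", "sex": "M", "birth": "1 JAN 1900"},
    {"id": 2, "given": "Child", "surname": "Example", "sex": "F"},
]
families = [{"id": 1, "husband": 1, "children": [2]}]
print(to_gedcom(people, families))
```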

Project Structure

src/: Core logic
- cli.py: Command-line interface.
- scraper.py: HTML parsing and data extraction logic.
- engine.py: Orchestrates crawling and scraping workflows.
- database.py: Handles SQLite storage.
- gedcom_exporter.py: Converts database records to GEDCOM format.
tests/: Unit tests for each component.
requirements.txt: Python dependencies.

Testing

Run the test suite using pytest:

```
pytest
```

Disclaimer

This tool is for personal and research use only. Please respect the terms of service of the source website and avoid placing excessive load on their servers.
