Dhee - A platform for linguistic analysis of Vedic Sanskrit texts

Dhee is a platform for studying and analyzing ancient Vedic Sanskrit texts. Currently the Rigveda is supported. The long-term goal is to support any Vedic Sanskrit text with a well-defined chapter/verse hierarchy and English translations.

Design goals

Simple, efficient and useful UI (technology-wise: no SPA, no NPM, no need for server rendering, no bloat).
Performant backend (written in Go, using an embedded SQLite3 database that resides on the same machine; aiming at <30ms response time for most pages).
Extensible and general enough to support texts other than the Rigveda. (Please contact me if you know of datasets for Vedic Sanskrit texts other than the Rigveda, by which I mean the samhitas or brahmanas.)

Implementation roadmap

Short term

View one / many verses directly along with translations.
Search (regexp and / or text based).
Hierarchical navigation (i.e. show the mandala/sukta/rik hierarchy).
Show Monier-Williams dictionary hints along with Padapatha text.
Integrate the Multi-layer annotation of the Rigveda to show shorter lexicon meanings before the dictionary entries.
Integrate anukramaNi data on verse authors for the Rigveda.
Use protocol buffer encoding instead of JSON for the non-queryable blobs in the SQLite database.
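
The mandala/sukta/rik hierarchy mentioned above can be illustrated with a small Go sketch. This is not taken from the Dhee codebase: the `VerseRef` type, the `ParseVerseRef` function and the dotted `mandala.sukta.rik` reference format are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// VerseRef is a hypothetical type that locates a verse in the
// mandala/sukta/rik hierarchy of the Rigveda.
type VerseRef struct {
	Mandala, Sukta, Rik int
}

// ParseVerseRef parses a dotted reference such as "10.129.1".
// The dotted format is an assumption of this sketch.
func ParseVerseRef(s string) (VerseRef, error) {
	parts := strings.Split(strings.TrimSpace(s), ".")
	if len(parts) != 3 {
		return VerseRef{}, fmt.Errorf("expected mandala.sukta.rik, got %q", s)
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 1 {
			return VerseRef{}, fmt.Errorf("invalid component %q in %q", p, s)
		}
		nums[i] = n
	}
	return VerseRef{Mandala: nums[0], Sukta: nums[1], Rik: nums[2]}, nil
}

// String formats the reference back into its dotted form.
func (r VerseRef) String() string {
	return fmt.Sprintf("%d.%d.%d", r.Mandala, r.Sukta, r.Rik)
}

func main() {
	ref, err := ParseVerseRef("10.129.1")
	if err != nil {
		panic(err)
	}
	fmt.Println(ref.Mandala, ref.Sukta, ref.Rik)
}
```

A structured reference like this keeps navigation and lookups independent of how a reference happens to be written in a URL or search box.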

Long term

Embedding and textual (TF-IDF) based recommendations of similar verses. (Currently using this model: Snowflake/snowflake-arctic-embed-l-v2.0)
Graphing and visualization wizard using D3.js / uPlot, for analyzing word frequency and grammatical forms across multiple scriptures using an advanced form input.
Highlight and allow analysis of repeated refrains (N-grams where N >= 3).
Advanced search using a custom query syntax (boolean operators, grouping and column filters).
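
As a rough idea of how repeated refrains could be detected, the sketch below collects word N-grams that occur in at least two different verses. It is a minimal illustration, not the planned implementation: the function name `RepeatedNGrams`, whitespace tokenization and the placeholder tokens in `main` are all assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// RepeatedNGrams returns every n-word sequence that occurs in at least two
// different verses, mapped to the sorted indices of the verses containing it.
// Verses are split on whitespace, which is a simplifying assumption.
func RepeatedNGrams(verses []string, n int) map[string][]int {
	out := make(map[string][]int)
	if n < 1 {
		return out
	}
	// seen maps each n-gram to the set of verse indices where it appears.
	seen := make(map[string]map[int]bool)
	for i, v := range verses {
		words := strings.Fields(v)
		for j := 0; j+n <= len(words); j++ {
			gram := strings.Join(words[j:j+n], " ")
			if seen[gram] == nil {
				seen[gram] = make(map[int]bool)
			}
			seen[gram][i] = true
		}
	}
	for gram, set := range seen {
		if len(set) < 2 {
			continue // not repeated across verses
		}
		idx := make([]int, 0, len(set))
		for i := range set {
			idx = append(idx, i)
		}
		sort.Ints(idx)
		out[gram] = idx
	}
	return out
}

func main() {
	// Placeholder tokens stand in for padapatha words.
	verses := []string{"w1 w2 w3 w4 w5", "w9 w3 w4 w5", "w6 w7 w8"}
	for gram, idx := range RepeatedNGrams(verses, 3) {
		fmt.Println(gram, idx)
	}
}
```

Running the function for several values of N and keeping only the longest matches would avoid reporting every sub-sequence of a long refrain separately.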

Very long term

Find and include data for Yajurveda and Atharvaveda samhitas.
Arbitrary embedding search.
Port INRIA's inflected forms generator to Go and use it to analyze arbitrary word forms.
Support auto-detecting variations and verse references across texts.

How to run?

# create a bleve search index of all data
go run ./cmd/dhee index --data-dir ./data
# run server
go run ./cmd/dhee server --data-dir ./data

Regenerating embeddings

You will need Python to generate embeddings.

python3 ./script/cosine_similarity.py --input-file data/rv.jsonl --embedding-model Snowflake/snowflake-arctic-embed-l-v2.0 --output-file data/rv.emb.jsonl --auxiliaries griffith

go run ./cmd/dhee preprocess --input ./data --output ./data --embeddings-file ./data/rv.emb.jsonl
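
For readers unfamiliar with embedding-based recommendations, the following Go sketch shows how verse vectors could be ranked by cosine similarity. It does not reflect the layout of rv.emb.jsonl or the internals of the preprocess command; the `Cosine` and `TopK` functions and the tiny two-dimensional vectors are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Cosine returns the cosine similarity of two equal-length vectors,
// or 0 if the lengths differ or either vector is all zeros.
func Cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// TopK returns the indices of the k vectors most similar to
// vectors[query], excluding the query itself.
func TopK(vectors [][]float64, query, k int) []int {
	if query < 0 || query >= len(vectors) || k <= 0 {
		return nil
	}
	// Compute each score once, then sort candidate indices by score.
	scores := make([]float64, len(vectors))
	idx := make([]int, 0, len(vectors))
	for i, v := range vectors {
		if i == query {
			continue
		}
		scores[i] = Cosine(vectors[query], v)
		idx = append(idx, i)
	}
	sort.SliceStable(idx, func(x, y int) bool {
		return scores[idx[x]] > scores[idx[y]]
	})
	if k < len(idx) {
		idx = idx[:k]
	}
	return idx
}

func main() {
	vectors := [][]float64{{1, 0}, {0.9, 0.1}, {0, 1}}
	fmt.Println(TopK(vectors, 0, 2))
}
```

A brute-force scan like this is linear in the number of verses per query, so precomputing the top matches for each verse offline is one way to keep page responses fast.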

Acknowledgements

Much of the data currently present is taken from VedaWeb data and the Monier-Williams dictionary by Cologne University.

Some transliteration mappings were taken / adapted from indic-transliteration.

Favicon from Anil Sharma on Pixabay: https://pixabay.com/photos/eagle-bird-golden-eagle-bird-flying-6979972/

In the present state of this project (WIP), Cologne VedaWeb's tekst may indeed be a better resource for any serious analysis.
