YouTube content extraction and knowledge graph pipeline.
Extract transcripts from YouTube videos, store them in ArangoDB, then build per-category knowledge graphs by extracting entities and relationships from the transcript text using an LLM.
Goal: Given a YouTube URL, produce a transcript and persist the video + transcript to ArangoDB.
Steps:
- Parse the video ID from the URL.
- Fetch video metadata (`youtube.Client.GetVideoMetadata`).
- If captions exist → fetch transcript (`youtube.Client.GetTranscript`).
- If no captions → download audio (`youtube.Client.DownloadAudio`) → call whisper-service (`transcribe.Client.Transcribe`) to get transcript text.
- Store a `videos` vertex in ArangoDB via `graphdb.Client.UpsertVertex`:

```json
{
  "_key": "<videoID>",
  "title": "...",
  "url": "...",
  "transcript": "...",
  "language": "...",
  "duration": 1234,
  "processed": false,
  "category": ""
}
```

Implement in: `internal/service/extract_youtube.go`
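The first step can be sketched as a small helper. `ParseVideoID` is a hypothetical name, and the URL shapes handled (watch, youtu.be, shorts) are assumptions about what the pipeline will encounter, not existing code:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// ParseVideoID extracts the video ID from common YouTube URL shapes:
// youtube.com/watch?v=ID, youtu.be/ID, and youtube.com/shorts/ID.
func ParseVideoID(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("parse url: %w", err)
	}
	host := strings.TrimPrefix(u.Hostname(), "www.")
	switch host {
	case "youtu.be":
		// Short links carry the ID as the whole path.
		if id := strings.Trim(u.Path, "/"); id != "" {
			return id, nil
		}
	case "youtube.com", "m.youtube.com":
		// Standard watch links carry the ID in the v= query parameter.
		if id := u.Query().Get("v"); id != "" {
			return id, nil
		}
		// Shorts carry the ID as a path segment.
		if rest, ok := strings.CutPrefix(u.Path, "/shorts/"); ok && rest != "" {
			return strings.Trim(rest, "/"), nil
		}
	}
	return "", fmt.Errorf("no video ID found in %q", raw)
}

func main() {
	id, _ := ParseVideoID("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
	fmt.Println(id) // dQw4w9WgXcQ
}
```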
Goal: Given a YouTube URL (video already extracted), run LLM-based entity/relationship extraction and populate the knowledge graph. Graph is scoped per category.
Before entity extraction, classify the video into one of the supported categories:
| Category | Description |
|---|---|
| economy | macroeconomics, finance, markets, trade |
| technology | software, hardware, AI, engineering |
The LLM assigns the category based on transcript content. Store it back on the `videos` vertex (`category` field).
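Since the classifier's reply feeds the `category` field directly, it is worth validating against the supported set before writing it back. A minimal sketch, assuming the reply is the JSON shape described below; `parseCategory` is a hypothetical helper, not existing code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// classifyResponse mirrors the expected Classify prompt output,
// e.g. {"category": "technology"}.
type classifyResponse struct {
	Category string `json:"category"`
}

// The two supported per-category graphs.
var supportedCategories = map[string]bool{
	"economy":    true,
	"technology": true,
}

// parseCategory decodes the LLM's classification reply and rejects anything
// outside the supported set, so a hallucinated label never reaches the graph.
func parseCategory(raw []byte) (string, error) {
	var r classifyResponse
	if err := json.Unmarshal(raw, &r); err != nil {
		return "", fmt.Errorf("decode classify response: %w", err)
	}
	if !supportedCategories[r.Category] {
		return "", fmt.Errorf("unsupported category %q", r.Category)
	}
	return r.Category, nil
}

func main() {
	cat, err := parseCategory([]byte(`{"category": "technology"}`))
	fmt.Println(cat, err) // technology <nil>
}
```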
Prompt the LLM with the transcript (chunked if needed) to extract entities. Supported entity types:
| Type | Examples |
|---|---|
| concept | "machine learning", "inflation", "open source" |
| person | "Elon Musk", "Jerome Powell" |
| organization | "OpenAI", "Federal Reserve", "Apple" |
| event | "2008 financial crisis", "WWDC 2024" |
Each entity is stored as a vertex in the entities collection:
```json
{
  "_key": "<normalized-slug>",
  "name": "OpenAI",
  "type": "organization",
  "category": "technology"
}
```

Entity normalization: lowercase, trim whitespace, deduplicate by slug key before upsert.
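The normalization rules above can be sketched as a slug helper. `Slugify` is a hypothetical name, and the camelCase split (so "OpenAI" yields the "open-ai" key shown in the schema) is an inference from that example, not a documented rule:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Slugify normalizes an entity name into a stable _key: trimmed, lowercased,
// camelCase boundaries and whitespace/punctuation runs turned into hyphens.
func Slugify(name string) string {
	var b strings.Builder
	var prev rune
	for _, r := range strings.TrimSpace(name) {
		switch {
		case unicode.IsUpper(r):
			// Split camelCase: "OpenAI" → "open-ai".
			if unicode.IsLower(prev) || unicode.IsDigit(prev) {
				b.WriteByte('-')
			}
			b.WriteRune(unicode.ToLower(r))
		case unicode.IsLower(r) || unicode.IsDigit(r):
			b.WriteRune(r)
		default:
			// Collapse whitespace/punctuation runs into a single hyphen.
			if s := b.String(); s != "" && !strings.HasSuffix(s, "-") {
				b.WriteByte('-')
			}
		}
		prev = r
	}
	return strings.TrimSuffix(b.String(), "-")
}

func main() {
	fmt.Println(Slugify("OpenAI"))            // open-ai
	fmt.Println(Slugify("  Federal Reserve ")) // federal-reserve
}
```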
Relationships between entities are stored as edges:
| Edge collection | Meaning | Direction |
|---|---|---|
| related_to | entity is related to another entity | entities/&lt;key&gt; → entities/&lt;key&gt; |
related_to edges are created when the LLM identifies co-occurrence or explicit semantic relationships between entities within the same category graph.
Graphs are per-category, not global. The named graph in ArangoDB (`knowledge`) spans all categories, but queries should filter by `category` on vertex documents to stay within a category boundary.
Implement in: `internal/service/build_graph.go`
Two prompts needed (implement in `pkg/llm` or a new `pkg/prompt` package):
- Classify — input: transcript → output: `{"category": "technology"}`
- Extract — input: transcript chunk → output:

```json
{
  "entities": [
    {"name": "OpenAI", "type": "organization"},
    {"name": "GPT-4", "type": "concept"}
  ],
  "relationships": [
    {"from": "GPT-4", "to": "OpenAI", "type": "related_to"}
  ]
}
```
Use pkg/chunker.RecursiveChunker to split long transcripts before extraction. Merge and deduplicate entities across chunks before writing to ArangoDB.
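The merge step can be sketched as follows. `MergeEntities` is a hypothetical name, and the simplified slug normalization here is illustrative (the repo's real slug rules may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// Entity mirrors one item of the Extract prompt's "entities" array.
type Entity struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

// slugKey is a simplified stand-in for the real normalization:
// lowercase, trim, spaces to hyphens.
func slugKey(name string) string {
	return strings.ReplaceAll(strings.ToLower(strings.TrimSpace(name)), " ", "-")
}

// MergeEntities flattens per-chunk extraction results into one upsert batch,
// keeping the first occurrence of each slug key so repeated mentions across
// chunks produce a single entities vertex.
func MergeEntities(chunks [][]Entity) []Entity {
	seen := map[string]bool{}
	var merged []Entity
	for _, chunk := range chunks {
		for _, e := range chunk {
			key := slugKey(e.Name)
			if seen[key] {
				continue
			}
			seen[key] = true
			merged = append(merged, e)
		}
	}
	return merged
}

func main() {
	got := MergeEntities([][]Entity{
		{{Name: "OpenAI", Type: "organization"}},
		{{Name: " openai ", Type: "organization"}, {Name: "GPT-4", Type: "concept"}},
	})
	fmt.Println(len(got)) // 2
}
```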
`videos` (vertex)

| Field | Type |
|---|---|
| _key | videoID |
| title | string |
| url | string |
| transcript | string |
| language | string |
| duration | int |
| category | string ← "economy" or "technology" |
| processed | bool |

`entities` (vertex)

| Field | Type |
|---|---|
| _key | slug (e.g. "open-ai") |
| name | string |
| type | string ← "concept", "person", "organization", or "event" |
| category | string |

`related_to` (edge): entities/* → entities/*
| Component | Status |
|---|---|
| pkg/youtube | Done — caption fetch + yt-dlp audio download |
| pkg/transcribe | Done — whisper-service HTTP client |
| pkg/llm | Done — Ollama client + chunker |
| pkg/graphdb | Done — ArangoDB driver, UpsertVertex/Edge |
| internal/service/extract_youtube.go | Stub — needs implementation |
| internal/service/build_graph.go | Stub — needs implementation |
| LLM prompts (classify + extract) | Not yet written |
| Tests | None yet |