-
Notifications
You must be signed in to change notification settings - Fork 1
feat(glowfic): Add Glowfic ingestion with multi-character HuggingFace dataset export #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…wfic, --assistant, --assistant-regex), and segmented conversations per assistant reply; write OAI JSONL per segment
…MD and correct relative image paths to satisfy tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds Glowfic ingestion support to the splice tool, enabling conversion of Glowfic threads, sections, and boards into training data. The implementation uses the glowfic-dl library to fetch and parse Glowfic content, then segments conversations based on a selected assistant character.
Key changes:
- Adds comprehensive Glowfic source adapter with URL detection, fetching, normalization, and conversation generation
- Implements CLI flags (
--glowfic,--assistant,--assistant-regex) for Glowfic ingestion - Creates segmented conversations where each segment contains user messages followed by assistant replies
Reviewed Changes
Copilot reviewed 7 out of 11 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/sources/glowfic.ts | New Glowfic source adapter with detection, fetching, normalization, and conversation generation logic |
| src/externals/glowfic-dl.d.ts | TypeScript type declarations for the glowfic-dl external library |
| src/cli/splice.ts | CLI argument parsing and Glowfic pipeline integration with dynamic import |
| src/core/types.ts | New CLI options and SourceId type extension for Glowfic support |
| src/index.ts | Export of Glowfic source adapter functions |
| package.json | Addition of glowfic-dl dependency (v0.2.1) |
| tsconfig.json | TypeRoots configuration to include external type declarations |
| src/outputs/writers.ts | Unrelated enhancement to write both nested and flat markdown files for Twitter threads |
| README.md | Documentation updates explaining Glowfic support and usage examples |
| .gitignore | Pattern update to ignore all output directories while preserving src/outputs/ |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let i = 0; | ||
|
|
||
| async function worker() { | ||
| while (i < urls.length) { | ||
| const idx = i++; | ||
| const u = urls[idx]; |
Copilot
AI
Nov 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Race condition: the shared variable i is accessed without synchronization. Multiple workers increment i concurrently, which could cause the same URL to be processed twice or URLs to be skipped. Use atomic operations or assign work more safely, such as using a proper work queue or locking mechanism.
| let i = 0; | |
| async function worker() { | |
| while (i < urls.length) { | |
| const idx = i++; | |
| const u = urls[idx]; | |
| // Use a work queue to avoid race conditions | |
| async function worker() { | |
| while (true) { | |
| const u = urls.shift(); | |
| if (u === undefined) break; |
| /** | ||
| * Convert a single Glowfic thread into messages with the chosen assistant character. | ||
| */ | ||
| export function conversationFromGlowficThread( | ||
| thread: GlowThread, | ||
| assistant: AssistantMatcher, | ||
| options?: ConversationOptions, | ||
| ): ChatMessage[] { | ||
| const markdown = options?.markdown !== false; | ||
| const mergeConsecutive = options?.mergeConsecutive !== false; // default true | ||
| const trimToLastAssistant = options?.trimToLastAssistant !== false; // default true | ||
|
|
||
| type P = GlowPost; | ||
| const posts: P[] = thread.posts || []; | ||
|
|
||
| const messagesRaw: ChatMessage[] = posts | ||
| .map((p) => { | ||
| const role: Role = isAssistantPost(p, assistant) ? "assistant" : "user"; | ||
| const content = markdown | ||
| ? glowHtmlToMarkdown(p.content ?? "") | ||
| : (p.content ?? ""); | ||
| const c = (content || "").trim(); | ||
| if (!c) return null; | ||
| return { role, content: c }; | ||
| }) | ||
| .filter(Boolean) as ChatMessage[]; | ||
|
|
||
| const msgs = mergeConsecutive ? mergeSameRole(messagesRaw) : messagesRaw; | ||
|
|
||
| if (trimToLastAssistant) { | ||
| for (let i = msgs.length - 1; i >= 0; i--) { | ||
| if (msgs[i]!.role === "assistant") return msgs.slice(0, i + 1); | ||
| } | ||
| return []; // if no assistant lines, skip | ||
| } | ||
|
|
||
| return msgs; | ||
| } | ||
|
|
Copilot
AI
Nov 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function conversationFromGlowficThread appears to be unused in the codebase. The CLI and other functions use segmentedConversationsFromGlowficThread instead. Consider removing this function if it's not part of the public API, or document its intended use case if it serves a different purpose than the segmented version.
| /** | |
| * Convert a single Glowfic thread into messages with the chosen assistant character. | |
| */ | |
| export function conversationFromGlowficThread( | |
| thread: GlowThread, | |
| assistant: AssistantMatcher, | |
| options?: ConversationOptions, | |
| ): ChatMessage[] { | |
| const markdown = options?.markdown !== false; | |
| const mergeConsecutive = options?.mergeConsecutive !== false; // default true | |
| const trimToLastAssistant = options?.trimToLastAssistant !== false; // default true | |
| type P = GlowPost; | |
| const posts: P[] = thread.posts || []; | |
| const messagesRaw: ChatMessage[] = posts | |
| .map((p) => { | |
| const role: Role = isAssistantPost(p, assistant) ? "assistant" : "user"; | |
| const content = markdown | |
| ? glowHtmlToMarkdown(p.content ?? "") | |
| : (p.content ?? ""); | |
| const c = (content || "").trim(); | |
| if (!c) return null; | |
| return { role, content: c }; | |
| }) | |
| .filter(Boolean) as ChatMessage[]; | |
| const msgs = mergeConsecutive ? mergeSameRole(messagesRaw) : messagesRaw; | |
| if (trimToLastAssistant) { | |
| for (let i = msgs.length - 1; i >= 0; i--) { | |
| if (msgs[i]!.role === "assistant") return msgs.slice(0, i + 1); | |
| } | |
| return []; // if no assistant lines, skip | |
| } | |
| return msgs; | |
| } |
| mergeConsecutive?: boolean; // merge adjacent messages from same role | ||
| trimToLastAssistant?: boolean; // drop trailing user tail if assistant appears earlier |
Copilot
AI
Nov 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ConversationOptions type includes mergeConsecutive and trimToLastAssistant options, but segmentedConversationsFromGlowficThread doesn't use these options. Either implement support for these options in the segmented function or remove them from the type to avoid confusion.
| mergeConsecutive?: boolean; // merge adjacent messages from same role | |
| trimToLastAssistant?: boolean; // drop trailing user tail if assistant appears earlier |
| } | ||
| } else if (opts.assistant && opts.assistant.length) { | ||
| try { | ||
| re = new RegExp(opts.assistant, "i"); |
Copilot
AI
Nov 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a RegExp from user input without validation could cause ReDoS (Regular Expression Denial of Service) attacks if malicious patterns are provided. Consider validating the regex complexity or setting a timeout, or documenting that this is a trusted input scenario.
src/outputs/writers.ts
Outdated
| const name = sanitizeFilename(firstWords) || thread.id; | ||
| const ymd = date.replace(/-/g, ""); | ||
| const filePath = path.join(threadsDir, `${ymd}/${name}.md`); | ||
| const filePathFlat = path.join(threadsDir, `${ymd}-${name}.md`); |
Copilot
AI
Nov 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes to write both nested and flat markdown files for Twitter threads appear unrelated to the Glowfic feature being added in this PR. Consider separating this change into its own PR for better code review and version control history.
src/outputs/writers.ts
Outdated
| const filePathFlat = path.join(threadsDir, `${ymd}-${name}.md`); | ||
| const topLink = `https://twitter.com/i/web/status/${first.id}`; | ||
| const body = `${fm}\n${parts.join("\n\n")}\n\n[View on Twitter](${topLink})`; | ||
| const bodyFlat = body.replace(/\.\.\/\.\.\/images\//g, "../images/"); |
Copilot
AI
Nov 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The regex pattern for replacing image paths assumes a specific directory structure. If the structure changes, this hardcoded replacement could break. Consider using path module utilities to construct relative paths more robustly.
| const { conversationsFromGlowficUrls } = await import( | ||
| "../sources/glowfic" | ||
| ); | ||
| const convs = await conversationsFromGlowficUrls( | ||
| opts.glowfic, | ||
| { displayName: re, handle: re, author: re } as any, |
Copilot
AI
Nov 11, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using as any bypasses TypeScript's type checking. Consider creating a proper AssistantMatcher object that conforms to the expected type, or update the type definition to better match this usage pattern.
| const { conversationsFromGlowficUrls } = await import( | |
| "../sources/glowfic" | |
| ); | |
| const convs = await conversationsFromGlowficUrls( | |
| opts.glowfic, | |
| { displayName: re, handle: re, author: re } as any, | |
| const { conversationsFromGlowficUrls, AssistantMatcher } = await import( | |
| "../sources/glowfic" | |
| ); | |
| const matcher: typeof AssistantMatcher = { displayName: re, handle: re, author: re }; | |
| type AssistantMatcher = { displayName: RegExp; handle: RegExp; author: RegExp }; | |
| const matcher: AssistantMatcher = { displayName: re, handle: re, author: re }; | |
| const convs = await conversationsFromGlowficUrls( | |
| opts.glowfic, | |
| matcher, |
…output - Add --glowfic-board, --all-characters, --min-posts CLI flags - Extract unique characters from board threads with post count filtering - Segment conversations per-character as assistant - Generate HuggingFace-compatible dataset structure: - Per-character train.jsonl with character-specific system prompts - Combined train.jsonl for all characters - README.md dataset card with statistics - dataset_info.json and characters.json manifest - Fix integration tests for nested thread directories (threads/YYYYMMDD/)
Summary
Adds comprehensive Glowfic support for ingesting collaborative fiction from glowfic.com and converting it to training datasets.
New Features
Single-character export:
--glowfic <url>- Ingest threads/sections/boards--assistant <name>/--assistant-regex <pattern>- Filter by characterMulti-character board export:
--glowfic-board <url>- Export datasets for ALL characters on a board--min-posts <n>- Filter characters by minimum post count (default: 10)--all-characters- Explicit flag for multi-character modeHuggingFace Dataset Output
Generates a complete HuggingFace-compatible dataset structure:
Each conversation includes a character-specific system prompt:
{"messages": [{"role": "system", "content": "You are Carissa (carissa)."}, ...]}Example Usage
Other Changes
threads/YYYYMMDD/)Testing