Skip to content

Conversation

@deepfates
Copy link
Owner

@deepfates deepfates commented Nov 11, 2025

Summary

Adds comprehensive Glowfic support for ingesting collaborative fiction from glowfic.com and converting it to training datasets.

New Features

Single-character export:

  • --glowfic <url> - Ingest threads/sections/boards
  • --assistant <name> / --assistant-regex <pattern> - Filter by character

Multi-character board export:

  • --glowfic-board <url> - Export datasets for ALL characters on a board
  • --min-posts <n> - Filter characters by minimum post count (default: 10)
  • --all-characters - Explicit flag for multi-character mode

HuggingFace Dataset Output

Generates a complete HuggingFace-compatible dataset structure:

out/
├── train.jsonl              # Combined all-character dataset
├── README.md                # Dataset card with statistics
├── dataset_info.json        # HuggingFace metadata
├── characters.json          # Character manifest
└── characters/
    └── <character_id>/
        ├── train.jsonl      # Per-character conversations
        └── metadata.json    # Character metadata

Each conversation includes a character-specific system prompt:

{"messages": [{"role": "system", "content": "You are Carissa (carissa)."}, ...]}

Example Usage

# Single character export
splice --glowfic https://glowfic.com/posts/5506 --out ./out --format oai --assistant carissa

# Export all characters from a board with ≥20 posts
splice --glowfic-board https://glowfic.com/boards/215 --out ./out --min-posts 20

Other Changes

  • Speed up media copy with bounded concurrency
  • Nest thread files in date subdirectories (threads/YYYYMMDD/)
  • Fix integration tests for nested thread directory structure
  • Lazy-load Glowfic module to avoid undici issues on Node 18

Testing

  • ✅ Build passes
  • ✅ All integration tests pass (3/3)

…wfic, --assistant, --assistant-regex), and segmented conversations per assistant reply; write OAI JSONL per segment
…MD and correct relative image paths to satisfy tests
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds Glowfic ingestion support to the splice tool, enabling conversion of Glowfic threads, sections, and boards into training data. The implementation uses the glowfic-dl library to fetch and parse Glowfic content, then segments conversations based on a selected assistant character.

Key changes:

  • Adds comprehensive Glowfic source adapter with URL detection, fetching, normalization, and conversation generation
  • Implements CLI flags (--glowfic, --assistant, --assistant-regex) for Glowfic ingestion
  • Creates segmented conversations where each segment contains user messages followed by assistant replies

Reviewed Changes

Copilot reviewed 7 out of 11 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/sources/glowfic.ts New Glowfic source adapter with detection, fetching, normalization, and conversation generation logic
src/externals/glowfic-dl.d.ts TypeScript type declarations for the glowfic-dl external library
src/cli/splice.ts CLI argument parsing and Glowfic pipeline integration with dynamic import
src/core/types.ts New CLI options and SourceId type extension for Glowfic support
src/index.ts Export of Glowfic source adapter functions
package.json Addition of glowfic-dl dependency (v0.2.1)
tsconfig.json TypeRoots configuration to include external type declarations
src/outputs/writers.ts Unrelated enhancement to write both nested and flat markdown files for Twitter threads
README.md Documentation updates explaining Glowfic support and usage examples
.gitignore Pattern update to ignore all output directories while preserving src/outputs/

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +90 to +95
let i = 0;

async function worker() {
while (i < urls.length) {
const idx = i++;
const u = urls[idx];
Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Race condition: the shared variable i is accessed without synchronization. Multiple workers increment i concurrently, which could cause the same URL to be processed twice or URLs to be skipped. Use atomic operations or assign work more safely, such as using a proper work queue or locking mechanism.

Suggested change
let i = 0;
async function worker() {
while (i < urls.length) {
const idx = i++;
const u = urls[idx];
// Use a work queue to avoid race conditions
async function worker() {
while (true) {
const u = urls.shift();
if (u === undefined) break;

Copilot uses AI. Check for mistakes.
Comment on lines +323 to +361
/**
* Convert a single Glowfic thread into messages with the chosen assistant character.
*/
export function conversationFromGlowficThread(
thread: GlowThread,
assistant: AssistantMatcher,
options?: ConversationOptions,
): ChatMessage[] {
const markdown = options?.markdown !== false;
const mergeConsecutive = options?.mergeConsecutive !== false; // default true
const trimToLastAssistant = options?.trimToLastAssistant !== false; // default true

type P = GlowPost;
const posts: P[] = thread.posts || [];

const messagesRaw: ChatMessage[] = posts
.map((p) => {
const role: Role = isAssistantPost(p, assistant) ? "assistant" : "user";
const content = markdown
? glowHtmlToMarkdown(p.content ?? "")
: (p.content ?? "");
const c = (content || "").trim();
if (!c) return null;
return { role, content: c };
})
.filter(Boolean) as ChatMessage[];

const msgs = mergeConsecutive ? mergeSameRole(messagesRaw) : messagesRaw;

if (trimToLastAssistant) {
for (let i = msgs.length - 1; i >= 0; i--) {
if (msgs[i]!.role === "assistant") return msgs.slice(0, i + 1);
}
return []; // if no assistant lines, skip
}

return msgs;
}

Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function conversationFromGlowficThread appears to be unused in the codebase. The CLI and other functions use segmentedConversationsFromGlowficThread instead. Consider removing this function if it's not part of the public API, or document its intended use case if it serves a different purpose than the segmented version.

Suggested change
/**
* Convert a single Glowfic thread into messages with the chosen assistant character.
*/
export function conversationFromGlowficThread(
thread: GlowThread,
assistant: AssistantMatcher,
options?: ConversationOptions,
): ChatMessage[] {
const markdown = options?.markdown !== false;
const mergeConsecutive = options?.mergeConsecutive !== false; // default true
const trimToLastAssistant = options?.trimToLastAssistant !== false; // default true
type P = GlowPost;
const posts: P[] = thread.posts || [];
const messagesRaw: ChatMessage[] = posts
.map((p) => {
const role: Role = isAssistantPost(p, assistant) ? "assistant" : "user";
const content = markdown
? glowHtmlToMarkdown(p.content ?? "")
: (p.content ?? "");
const c = (content || "").trim();
if (!c) return null;
return { role, content: c };
})
.filter(Boolean) as ChatMessage[];
const msgs = mergeConsecutive ? mergeSameRole(messagesRaw) : messagesRaw;
if (trimToLastAssistant) {
for (let i = msgs.length - 1; i >= 0; i--) {
if (msgs[i]!.role === "assistant") return msgs.slice(0, i + 1);
}
return []; // if no assistant lines, skip
}
return msgs;
}

Copilot uses AI. Check for mistakes.
Comment on lines +319 to +320
mergeConsecutive?: boolean; // merge adjacent messages from same role
trimToLastAssistant?: boolean; // drop trailing user tail if assistant appears earlier
Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ConversationOptions type includes mergeConsecutive and trimToLastAssistant options, but segmentedConversationsFromGlowficThread doesn't use these options. Either implement support for these options in the segmented function or remove them from the type to avoid confusion.

Suggested change
mergeConsecutive?: boolean; // merge adjacent messages from same role
trimToLastAssistant?: boolean; // drop trailing user tail if assistant appears earlier

Copilot uses AI. Check for mistakes.
}
} else if (opts.assistant && opts.assistant.length) {
try {
re = new RegExp(opts.assistant, "i");
Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a RegExp from user input without validation could cause ReDoS (Regular Expression Denial of Service) attacks if malicious patterns are provided. Consider validating the regex complexity or setting a timeout, or documenting that this is a trusted input scenario.

Copilot uses AI. Check for mistakes.
const name = sanitizeFilename(firstWords) || thread.id;
const ymd = date.replace(/-/g, "");
const filePath = path.join(threadsDir, `${ymd}/${name}.md`);
const filePathFlat = path.join(threadsDir, `${ymd}-${name}.md`);
Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to write both nested and flat markdown files for Twitter threads appear unrelated to the Glowfic feature being added in this PR. Consider separating this change into its own PR for better code review and version control history.

Copilot uses AI. Check for mistakes.
const filePathFlat = path.join(threadsDir, `${ymd}-${name}.md`);
const topLink = `https://twitter.com/i/web/status/${first.id}`;
const body = `${fm}\n${parts.join("\n\n")}\n\n[View on Twitter](${topLink})`;
const bodyFlat = body.replace(/\.\.\/\.\.\/images\//g, "../images/");
Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The regex pattern for replacing image paths assumes a specific directory structure. If the structure changes, this hardcoded replacement could break. Consider using path module utilities to construct relative paths more robustly.

Copilot uses AI. Check for mistakes.
Comment on lines +205 to +210
const { conversationsFromGlowficUrls } = await import(
"../sources/glowfic"
);
const convs = await conversationsFromGlowficUrls(
opts.glowfic,
{ displayName: re, handle: re, author: re } as any,
Copy link

Copilot AI Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using as any bypasses TypeScript's type checking. Consider creating a proper AssistantMatcher object that conforms to the expected type, or update the type definition to better match this usage pattern.

Suggested change
const { conversationsFromGlowficUrls } = await import(
"../sources/glowfic"
);
const convs = await conversationsFromGlowficUrls(
opts.glowfic,
{ displayName: re, handle: re, author: re } as any,
const { conversationsFromGlowficUrls, AssistantMatcher } = await import(
"../sources/glowfic"
);
const matcher: typeof AssistantMatcher = { displayName: re, handle: re, author: re };
type AssistantMatcher = { displayName: RegExp; handle: RegExp; author: RegExp };
const matcher: AssistantMatcher = { displayName: re, handle: re, author: re };
const convs = await conversationsFromGlowficUrls(
opts.glowfic,
matcher,

Copilot uses AI. Check for mistakes.
…output

- Add --glowfic-board, --all-characters, --min-posts CLI flags
- Extract unique characters from board threads with post count filtering
- Segment conversations per-character as assistant
- Generate HuggingFace-compatible dataset structure:
  - Per-character train.jsonl with character-specific system prompts
  - Combined train.jsonl for all characters
  - README.md dataset card with statistics
  - dataset_info.json and characters.json manifest
- Fix integration tests for nested thread directories (threads/YYYYMMDD/)
@deepfates deepfates changed the title feat(glowfic): add Glowfic ingestion via glowfic-dl, feat(glowfic): Add Glowfic ingestion with multi-character HuggingFace dataset export Dec 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants