Faxtract is an ASP.NET 8 MVC application designed for document extraction and processing (particularly flash cards, though you can give it whatever prompt you want) using LLamaSharp.
- .NET 8 SDK
- Windows OS with CUDA 12-compatible NVIDIA GPU (Note: You can swap the LlamaSharp.Backend NuGet package for other OS/GPU configurations, or you can add them all at once to let LLamaSharp figure it out)
- Visual Studio 2022 or another compatible IDE
- Clone the repository
- Open the solution in Visual Studio 2022
- Download an LLM model compatible with llama.cpp, e.g., one of the Phi-4 quantizations from Hugging Face, and place it wherever you want
- Put its path and filename in the `appsettings.json` file (see the example below)
- Hit F5
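For example, assuming the `LLamaConfig` section described in the configuration notes below, the entry might look roughly like this (the path and filename are illustrative):

```json
{
  "LLamaConfig": {
    "ModelPath": "C:\\models\\phi-4-Q4_K_M.gguf"
  }
}
```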
- Document text extraction from HTML, PDF (not the best since it doesn't use OCR), and plain text
- Content chunking with configurable size and overlap
- Batch inference for maximum throughput, with optional continuous batching (known issue: with continuous batching the context may not clear as much as it was instructed to, which can cause early inference termination with "no kv slot" errors)
- Real-time processing updates via SignalR
- Flash card generation from text content
- Configurable LLM parameters via `appsettings.json`
- Review/edit/delete/restore/retry/download generated flash cards
The application's core processing is primarily handled through these components:
LlamaExecutor manages the LLM (Large Language Model) interaction using LLamaSharp. Key responsibilities:
- Loads the language model from the specified path
- Manages the model's context window and batched execution
- Handles the system prompt to control the model's behavior (cached as a file for efficiency)
- Processes batches of text prompts, each a chunk of file contents, in parallel (a simplified sketch follows this list)
- Generates responses based on the input text
- Reports progress through an event system
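As a rough illustration of that flow, here is a heavily simplified, single-prompt sketch using LLamaSharp's `ModelParams`/`StatelessExecutor` API. The real `LlamaExecutor` batches many prompts through one context, caches the pre-prompt state, and raises progress events; the class name and parameter values below are illustrative assumptions.

```csharp
using System.Text;
using System.Threading.Tasks;
using LLama;
using LLama.Common;

// Simplified, single-prompt sketch: the real LlamaExecutor runs many chunk
// prompts through one batched context and caches the system prompt state.
public sealed class SimpleChunkRunner
{
    private readonly LLamaWeights _weights;
    private readonly ModelParams _params;

    public SimpleChunkRunner(string modelPath, int gpuLayerCount)
    {
        _params = new ModelParams(modelPath)
        {
            GpuLayerCount = gpuLayerCount, // the app's config uses -1 to mean "all layers"
            ContextSize = 8192             // illustrative; the app sizes this from MaxTokens x WorkBatchSize
        };
        _weights = LLamaWeights.LoadFromFile(_params);
    }

    public async Task<string> RunChunkAsync(string systemPrompt, string chunkText)
    {
        var executor = new StatelessExecutor(_weights, _params);
        var inferenceParams = new InferenceParams { MaxTokens = 1024 }; // illustrative cap

        var response = new StringBuilder();
        await foreach (var piece in executor.InferAsync(systemPrompt + "\n\n" + chunkText, inferenceParams))
            response.Append(piece);
        return response.ToString();
    }
}
```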
WorkProcessor runs as a background service that orchestrates the entire processing pipeline:
- Retrieves batches of `TextChunk` objects from the `IWorkProvider`
- Submits these batches to the `LlamaExecutor` for processing
- Monitors processing progress and updates clients in real-time via SignalR
- Parses the LLM-generated responses into structured `FlashCard` objects
- Tracks processing metrics (tokens processed, processing speed)
- Handles error conditions and graceful shutdown
When a batch of text chunks is processed, WorkProcessor sends them to LlamaExecutor, which runs them through the LLM and returns the responses. These responses are then parsed into flash cards and stored for later use.
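The parsing step depends on whatever output format the configured system prompt asks for, so the following is only a sketch under the assumption of a simple `Q:`/`A:` line format (not necessarily the app's actual format), and the `FlashCard` shape here is illustrative:

```csharp
using System;
using System.Collections.Generic;

// Illustrative parser assuming a "Q: ..." / "A: ..." response format; the real
// format is whatever the configured system prompt asks the model to produce.
public sealed record FlashCard(string Question, string Answer);

public static class FlashCardParser
{
    public static List<FlashCard> Parse(string llmResponse)
    {
        var cards = new List<FlashCard>();
        string? question = null;

        foreach (var rawLine in llmResponse.Split('\n'))
        {
            var line = rawLine.Trim();
            if (line.StartsWith("Q:", StringComparison.OrdinalIgnoreCase))
            {
                question = line[2..].Trim();
            }
            else if (line.StartsWith("A:", StringComparison.OrdinalIgnoreCase) && question is not null)
            {
                cards.Add(new FlashCard(question, line[2..].Trim()));
                question = null; // wait for the next question line
            }
        }
        return cards;
    }
}
```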
TextChunker is responsible for dividing document content into manageable chunks for LLM processing:
- Breaks down large text documents into smaller, (hopefully) semantically-coherent chunks
- Uses a multi-level chunking strategy that preserves document structure (see the sketch after this list):
- First splits by paragraphs/sections (double newlines)
- Then by sentences for oversized sections
- Maintains positional information (start/end positions) for each chunk for later reference
- Uses a configurable maximum chunk size (default 700 tokens) to ensure chunks fit within model context limits; longer chunks improve contextual coherence but reduce the performance gains available from batching
- Aims for a preferred minimum chunk size (default 500 tokens) so that each chunk stays coherent when read outside the full document context
- Provides streaming support via `ChunkStreamAsync` for memory-efficient processing of large documents
- Uses a simple token estimation heuristic (roughly 4 characters per token); this could be switched to the model's actual tokenizer, or an information-entropy approach might do even better at inferring even amounts of text per chunk
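A minimal sketch of that multi-level heuristic, assuming the default limits (700-token max, 500-token preferred minimum, ~4 characters per token); the real `TextChunker` also records start/end positions and streams chunks via `ChunkStreamAsync`:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified chunking sketch: split on blank lines first, fall back to
// sentence splits for oversized sections, and estimate tokens as chars / 4.
public static class SimpleChunker
{
    private const int MaxTokens = 700;     // configurable maximum chunk size
    private const int PreferredMin = 500;  // preferred minimum chunk size

    private static int EstimateTokens(string text) => text.Length / 4; // ~4 chars per token

    public static IEnumerable<string> Chunk(string document)
    {
        var current = "";
        foreach (var section in document.Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Oversized sections are split again on rough sentence boundaries.
            var pieces = EstimateTokens(section) > MaxTokens
                ? section.Split(new[] { ". " }, StringSplitOptions.RemoveEmptyEntries).Select(s => s + ". ")
                : new[] { section };

            foreach (var piece in pieces)
            {
                // Emit the current chunk once it is big enough and the next piece would overflow it.
                if (EstimateTokens(current) >= PreferredMin &&
                    EstimateTokens(current + piece) > MaxTokens)
                {
                    yield return current;
                    current = "";
                }
                current += (current.Length == 0 ? "" : "\n\n") + piece;
            }
        }
        if (current.Length > 0)
            yield return current;
    }
}
```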
Including the text document's title, chapter title, etc. could be helpful for giving the LLM sufficient context (and that context should be shown to the student on the flash card as well), but this is not implemented.
You can adjust various parameters via the `LLamaConfig` section in `appsettings.json`; some are passed directly through to llama.cpp (an illustrative example follows the list of settings):
- `ModelPath`: Path to the GGUF model file to use for inference
- `BatchSize`: Maximum batch size for internal model operations
- `TypeK`/`TypeV`: KV cache data types (options include "GGML_TYPE_F16" and "GGML_TYPE_Q8_0", but llama.cpp crashes for most combinations other than both F16 and both Q8_0)
- `Threads`: Number of CPU threads to use for the inference work done on the CPU
- `GpuLayerCount`: Number of model layers to run on the GPU (-1 for all layers)
- `TensorBufferOverrides`: Allows you to specify via regex patterns where each layer or tensor block runs; see ggml-org/llama.cpp#11397
- `PrePromptFile`: File containing the system prompt state cached at runtime (deleted at startup so you don't have to remember to delete it when you change the settings)
- `PrePromptText`: System prompt text that instructs the model how to generate flash cards
- `ExtraContextPrefix`/`ExtraContextSuffix`: Text added before/after the 'extra context' provided by the user, to aid control and consistency across all chunks split from a given upload
- `Temperature`: Controls randomness (most recent [mid-2025] models recommend 0.6-0.8; lower is less random)
- `TopK`: Limits token selection to the top K most likely tokens
- `TopP`: Probability threshold for nucleus sampling
- `MinP`: Minimum probability threshold for token selection
- `MaxTokens`: Approximate maximum number of tokens to generate per response. The final KV cache size is MaxTokens x WorkBatchSize + 160 tokens to roughly fit the pre-prompt and chat template. As responses complete, they free up the KV cache they were using; however, due to a bug (probably in LLamaSharp or llama.cpp), the cache is only partially freed unless all the 'conversations' end and are disposed.
- `WorkBatchSize`: Number of chunks to process simultaneously. Higher values increase throughput up to a point (typically 10-60 depending on GPU or CPU), after which compute or registers become the bottleneck rather than memory bandwidth.
- `MinimumWorkBatchSize`: Minimum number of chunks required before batch processing begins. This helps optimize energy usage by taking advantage of batched inference even if you upload small bits of text one by one.
- `MaxBatchWaitTimeSeconds`: Maximum time to wait for the minimum batch size before processing anyway. This prevents indefinite waiting when there are few documents to process.
- `AllowContinuousBatching`: If true, new chunks start processing as soon as a previous one completes, rather than waiting for all chunks in a batch to finish. Improves throughput, but may cause catastrophic failure if the KV cache-freeing bug isn't fixed.
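For reference, an illustrative `LLamaConfig` section combining the settings above (the values are examples only, not recommended defaults):

```json
{
  "LLamaConfig": {
    "ModelPath": "C:\\models\\phi-4-Q4_K_M.gguf",
    "BatchSize": 512,
    "TypeK": "GGML_TYPE_F16",
    "TypeV": "GGML_TYPE_F16",
    "Threads": 8,
    "GpuLayerCount": -1,
    "PrePromptText": "You create concise flash cards from the provided text...",
    "Temperature": 0.7,
    "TopK": 40,
    "TopP": 0.95,
    "MinP": 0.05,
    "MaxTokens": 1024,
    "WorkBatchSize": 16,
    "MinimumWorkBatchSize": 4,
    "MaxBatchWaitTimeSeconds": 30,
    "AllowContinuousBatching": false
  }
}
```

With these example values, the KV cache would be sized for roughly MaxTokens x WorkBatchSize + 160 = 1024 x 16 + 160 = 16,544 tokens.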
- The number of file chunks to process in parallel is configurable via the `WorkBatchSize` setting in `appsettings.json`
- Larger batch sizes increase throughput but require more GPU memory; they become harmful rather than beneficial soon after memory usage exceeds your total VRAM, if not sooner
- KV cache can also be quantized via the appsettings to reduce memory usage
- Potential future work could involve starting on new chunks as old ones complete instead of waiting for the whole work batch, but it's complex, and I didn't implement it because the KV cache doesn't free up as much as it should when conversation forks end.
- A few techniques are used to prevent models from falling into repetition loops, such as requiring at least 5 distinct tokens among the last 20 generated tokens (sketched below) and checking for exactly duplicated non-answer lines, with a higher allowance inside `<think>` blocks
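A minimal sketch of the distinct-token check; the window size and threshold come from the description above, while the duplicate-line check and the `<think>`-block allowance are omitted:

```csharp
using System.Collections.Generic;
using System.Linq;

// Sketch: flag a likely repetition loop when the last 20 sampled token IDs
// contain fewer than 5 distinct values.
public static class RepetitionGuard
{
    public static bool LooksLikeLoop(IReadOnlyList<int> recentTokenIds,
                                     int window = 20, int minDistinct = 5)
    {
        if (recentTokenIds.Count < window)
            return false; // not enough history yet

        return recentTokenIds
            .Skip(recentTokenIds.Count - window)
            .Distinct()
            .Count() < minDistinct;
    }
}
```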
- The ideas mentioned above: using the tokenizer for more accurate control of chunk length, including document/chapter/page titles for context, and eventually allowing chunks to start processing without the whole batch having to finish first if the KV cache-freeing bug gets fixed.
- Could have a prompt in the chunk details modal to tell the model to add missing topics or the like.
- Could have stored prompts alongside the "extra context" input for things like expanding on the given topics in greater depth, changing the reading level, or restricting generation to be based only on information in the given text (e.g., to reduce hallucinations if facts have changed since the model's training data was gathered).
- Could allow retry, deletion, and download by input file in addition to deletion by chunk/flash card and downloading everything, maybe via a tree view of the files/chunks/flash cards with selection by node as an alternative to the histogram chart.
- Could also have a semantic deduplication step after generation. An embedding model might quickly find potentially duplicate question pairs, and, if necessary, the LLM could be prompted to judge whether those are really "too duplicate" (a rough similarity sketch follows this list).
- Could show better stats with separate prefill performance, moving-average inference performance, and flash cards per second (though that last metric can have much higher variance depending on the model).
- Could even try generating flash cards for a portion of a bigger chunk (using more context for potentially higher accuracy) and then use whatever portion doesn't result in any flash cards together with the next chunk, but a single file's chunks couldn't be batched in that case, and you'd probably have to use an LLM to evaluate the flash cards' coverage.
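As a rough sketch of the embedding-based screening idea mentioned above (the embedding model, vector source, and threshold are all assumptions; only the similarity math is shown):

```csharp
using System;

// Sketch: cosine similarity between two question embeddings (equal-length
// vectors assumed); pairs above a threshold would be sent to the LLM for a
// final "are these duplicates?" check.
public static class DuplicateScreen
{
    public static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB) + 1e-12);
    }

    public static bool LikelyDuplicate(float[] a, float[] b, double threshold = 0.9) =>
        CosineSimilarity(a, b) >= threshold;
}
```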
- .NET, ASP.NET, Visual Studio, and Windows are trademarks of Microsoft Corporation.
- NVIDIA and CUDA are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and/or other countries.
- LLamaSharp and llama.cpp are open-source projects and any references to them are for informational purposes only.
- Hugging Face is a trademark of Hugging Face, Inc.
This software is provided as-is, without warranty of any kind, express or implied.