Support Long-Document Question Generation with Token-Aware Semantic Chunking#562
Conversation
Walkthrough
Adds a token-aware chunking module and integrates chunked generation into the MCQ, ShortQ, and BoolQ pipelines: texts are token-counted, sentence-segmented, split into overlapping token-safe chunks, per-chunk questions are generated, then aggregated and semantically deduplicated.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User as Input Text
    participant Tokenizer as Tokenizer
    participant Chunker as TextChunker
    participant Gen as QuestionGenerator
    participant Aggreg as Aggregator
    participant Dedup as QuestionDeduplicator
    participant Out as Final Output
    User->>Tokenizer: count_tokens(text)
    alt exceeds safe threshold
        Tokenizer->>Chunker: needs_chunking -> true
        Chunker->>Chunker: split_into_sentences()
        Chunker->>Chunker: create_chunks()
        loop for each chunk
            Chunker->>Gen: send chunk_i
            Gen->>Gen: generate_questions(chunk_i)
            Gen->>Aggreg: append questions_i
        end
        Aggreg->>Dedup: merged_questions
        Dedup->>Dedup: TF-IDF vectorize & cosine_similarity
        Dedup->>Out: deduplicated_questions
    else within limit
        Tokenizer->>Gen: text
        Gen->>Gen: generate_questions(text)
        Gen->>Out: questions
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
🧹 Nitpick comments (3)
backend/Generator/chunking.py (3)
255-255: Fix implicit `Optional` type hint for PEP 484 compliance.
The parameter `chunk_sizes: List[int] = None` should use explicit `Optional` or union syntax.

Suggested fix:

```diff
-def distribute_question_count(total_questions: int, num_chunks: int, chunk_sizes: List[int] = None) -> List[int]:
+def distribute_question_count(total_questions: int, num_chunks: int, chunk_sizes: List[int] | None = None) -> List[int]:
```

Or for Python 3.8 compatibility:

```python
from typing import Optional

def distribute_question_count(total_questions: int, num_chunks: int, chunk_sizes: Optional[List[int]] = None) -> List[int]:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/chunking.py` at line 255, The parameter annotation on distribute_question_count uses a default of None but types it as List[int]; update it to an explicit optional type (e.g., chunk_sizes: Optional[List[int]] = None) and add "Optional" to the typing imports so the signature is PEP 484 compliant and works on Python 3.8; keep the return type as List[int].
81-90: Good fallback strategy; consider `logging.exception` here as well.
The period-split fallback maintains graceful degradation when NLTK fails. Same minor suggestion to use `logging.exception` for stack trace visibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/chunking.py` around lines 81 - 90, The except block that catches errors from sent_tokenize currently calls logger.error and drops the stack trace; replace the logger.error call with logger.exception so the exception stack is logged (keep the same message text "Error splitting sentences: {e}" or a similar descriptive message) while leaving the existing period-split fallback intact; locate the try/except surrounding sent_tokenize in chunking.py and update the logger call accordingly.
47-53: Consider using `logging.exception` for better error diagnostics.
When catching exceptions, `logging.exception` automatically includes the stack trace, which aids debugging. The word-count fallback is a reasonable degradation, but the imprecision (words ≠ tokens) could cause chunks to exceed model limits in edge cases.

Suggested improvement:

```diff
 except Exception as e:
-    logger.error(f"Error counting tokens: {e}")
+    logger.exception(f"Error counting tokens: {e}")
     # Fallback to word count estimation
     return len(text.split())
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/chunking.py` around lines 47 - 53, Replace the bare logger.error call in the token counting fallback with logging.exception to capture the stack trace: in the method that calls self.tokenizer.encode(...) switch logger.error(f"Error counting tokens: {e}") to logger.exception(...) (or logger.exception("Error counting tokens") depending on logger usage) so the exception and traceback are recorded; optionally log that you are falling back to the approximate word-count heuristic and keep the existing fallback return of len(text.split()) to preserve behavior while improving diagnosability.
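To make the suggested change concrete, here is a minimal, self-contained sketch of the pattern the review recommends. The `encode` callable is a hypothetical stand-in for the project's tokenizer; only the `logger.error` → `logger.exception` swap is the point of the example.

```python
import logging

logger = logging.getLogger(__name__)

def count_tokens_safely(text, encode):
    """Token counting with a word-count fallback on tokenizer failure."""
    try:
        return len(encode(text))
    except Exception:
        # logging.exception logs at ERROR level and appends the full
        # traceback automatically, which logger.error does not.
        logger.exception("Error counting tokens; falling back to word count")
        return len(text.split())
```
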
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/Generator/main.py`:
- Around line 180-206: The fallback except block returns a dict without
time_taken; capture or reference the original start_time (or set start_time =
time.time() at the start of the surrounding function if missing) and compute
time_taken = time.time() - start_time before returning; then include
"time_taken": time_taken in the returned dict alongside "statement" and
"questions" (the changes occur in the except block that uses logger.error,
tokenize_into_sentences, identify_keywords, keyword_sentence_mapping, and
generate_multiple_choice_questions).
- Around line 559-562: There are two issues: a missing closing parenthesis in
the exception return and an infinite recursion because the truncated text still
triggers needs_chunking; fix by closing the parenthesis and changing the
fallback to either call a non-chunking path or ensure the truncated text is
below the chunking threshold — e.g., compute a safe truncation using the same
needs_chunking/safe_limit logic or call a dedicated non-chunking method (avoid
calling self.generate_boolq which delegates back to
_generate_boolq_with_chunking); reference _generate_boolq_with_chunking,
self.generate_boolq, needs_chunking and safe_limit when applying the change.
---
Nitpick comments:
In `@backend/Generator/chunking.py`:
- Line 255: The parameter annotation on distribute_question_count uses a default
of None but types it as List[int]; update it to an explicit optional type (e.g.,
chunk_sizes: Optional[List[int]] = None) and add "Optional" to the typing
imports so the signature is PEP 484 compliant and works on Python 3.8; keep the
return type as List[int].
- Around line 81-90: The except block that catches errors from sent_tokenize
currently calls logger.error and drops the stack trace; replace the logger.error
call with logger.exception so the exception stack is logged (keep the same
message text "Error splitting sentences: {e}" or a similar descriptive message)
while leaving the existing period-split fallback intact; locate the try/except
surrounding sent_tokenize in chunking.py and update the logger call accordingly.
- Around line 47-53: Replace the bare logger.error call in the token counting
fallback with logging.exception to capture the stack trace: in the method that
calls self.tokenizer.encode(...) switch logger.error(f"Error counting tokens:
{e}") to logger.exception(...) (or logger.exception("Error counting tokens")
depending on logger usage) so the exception and traceback are recorded;
optionally log that you are falling back to the approximate word-count heuristic
and keep the existing fallback return of len(text.split()) to preserve behavior
while improving diagnosability.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 35afaf56-f6f7-4d5f-bc49-6d372ba90679
📒 Files selected for processing (2)
backend/Generator/chunking.py
backend/Generator/main.py
@coderabbitai review
Actionable comments posted: 3
♻️ Duplicate comments (1)
backend/Generator/main.py (1)
182-184: ⚠️ Potential issue | 🟠 Major — Fallback truncation is still not token-safe.
`text[:2000]` is character-based, so these recovery paths can still exceed the tokenizer budget that chunking was added to enforce. That reintroduces the same overflow/silent-truncation risk on the exception path.

Safer fallback:

```diff
-        truncated_text = text[:2000]
+        safe_limit = self.chunker.max_tokens - 50
+        token_ids = self.tokenizer.encode(text, add_special_tokens=False)[:safe_limit]
+        truncated_text = self.tokenizer.decode(token_ids, skip_special_tokens=True)
```

Using the first safe chunk from `self.chunker.create_chunks(text)` would also work if chunk creation itself succeeded.

Also applies to: 350-352, 565-567
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 182 - 184, The fallback truncation uses character-slicing (truncated_text = text[:2000]) which is not token-safe; update the fallback in the blocks around truncated_text/tokenize_into_sentences to obtain a token-safe chunk instead—preferably call self.chunker.create_chunks(text) and use the first returned chunk (or the first chunk's text) as the fallback, and only if chunk creation raises or returns empty, perform a tokenizer-aware truncation (truncate by token count using your tokenizer utility) before passing into tokenize_into_sentences/modified_text; apply the same change to the other occurrences mentioned (around lines 350-352 and 565-567) so all exception/recovery paths use token-safe chunking.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/Generator/chunking.py`:
- Around line 104-106: The current early-return branches can emit oversized
chunks (returning [text] when sentence splitting fails and returning a single
long sentence when a sentence > max_tokens), breaking the token-safety
guarantee; update the logic in the function that uses variables sentences,
max_tokens and logger.warning so that before returning any chunk you ensure it's
<= max_tokens: when sentences is empty (the logger.warning branch) do not return
[text] unchanged but fall back to the chunking routine that iteratively slices
the raw text into token-sized subchunks (reuse or extract the same splitting
logic used later), and likewise when encountering a single sentence whose token
length > max_tokens (the branch around lines 115-125), split that sentence into
multiple token-bounded pieces instead of returning it whole; ensure you
reference and use the same tokenization/counting helper used elsewhere so all
returned chunks respect max_tokens.
In `@backend/Generator/main.py`:
- Around line 117-119: Replace bare {} returns in backend/Generator/main.py with
stable response shapes expected by backend/server.py: instead of returning {} in
branches that handle ShortQ outputs return {"questions": []}, for BoolQ outputs
return {"Boolean_Questions": [], "Count": 0}, and for other places that expect
both keys return the combined shape (e.g., {"questions": [],
"Boolean_Questions": [], "Count": 0}) so callers indexing output["questions"] or
output["Boolean_Questions"] won't get KeyError; update the return statements
found around the if-not-chunks checks and the similar early-fallback branches
referenced in the comment (lines ~117-119, ~176-178, ~197-198, ~288-290,
~344-346, ~503-505, ~557-559) by returning the appropriate keyed empty
lists/zero counts matching the function's expected response shape.
- Around line 205-208: The return in the MCQ fallback uses an undefined variable
end_time causing UnboundLocalError; change the time field to use the correctly
set rend_time (or ensure end_time is defined earlier). Specifically, in the
fallback return that returns {"statement": modified_text, "questions":
generated_questions["questions"], "time_taken": end_time - start_time}, replace
end_time with rend_time (or initialize end_time before the try/except) so the
function returns a valid payload when chunking fails.
---
Duplicate comments:
In `@backend/Generator/main.py`:
- Around line 182-184: The fallback truncation uses character-slicing
(truncated_text = text[:2000]) which is not token-safe; update the fallback in
the blocks around truncated_text/tokenize_into_sentences to obtain a token-safe
chunk instead—preferably call self.chunker.create_chunks(text) and use the first
returned chunk (or the first chunk's text) as the fallback, and only if chunk
creation raises or returns empty, perform a tokenizer-aware truncation (truncate
by token count using your tokenizer utility) before passing into
tokenize_into_sentences/modified_text; apply the same change to the other
occurrences mentioned (around lines 350-352 and 565-567) so all
exception/recovery paths use token-safe chunking.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c350b868-4d12-4e7f-8de6-5cbfce4e50dd
📒 Files selected for processing (2)
backend/Generator/chunking.py
backend/Generator/main.py
Actionable comments posted: 3
♻️ Duplicate comments (3)
backend/Generator/main.py (3)
517-519: ⚠️ Potential issue | 🟠 Major — Preserve the BoolQ response contract on empty/failure paths.
The non-chunked BoolQ path returns `Text`, `Count`, and `Boolean_Questions`, but these branches return either `{}` or an MCQ-style payload. That makes long-document failures incompatible with the existing BoolQ caller expectations.

Suggested fix:

```diff
     if not chunks:
         logger.warning("No chunks created from text")
-        return {}
+        return {
+            "Text": text,
+            "Count": 0,
+            "Boolean_Questions": [],
+        }
@@
     else:
         logger.warning("No questions generated from any chunk")
         return {
-            "statement": text,
-            "questions": [],
-            "time_taken": time.time() - start_time,
+            "Text": text,
+            "Count": 0,
+            "Boolean_Questions": [],
         }
```

Also applies to: 573-577
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 517 - 519, The empty/chunk-failure branches currently return {} or an MCQ payload which breaks the BoolQ response contract; update the branches that handle "if not chunks" (and the similar block around lines 573-577) to return the canonical BoolQ response shape with keys Text (empty string), Count (0), and Boolean_Questions (empty list) so callers always receive the expected types, and ensure any MCQ-specific returns are only used in MCQ code paths.
298-300: ⚠️ Potential issue | 🟠 Major — Keep the ShortQ empty result shape stable.
Returning `{}` here breaks the method's own contract for chunked requests. The non-chunked path always returns `statement` and `questions`, so callers shouldn't need a special-case for long inputs.

Suggested fix:

```diff
     if not chunks:
         logger.warning("No chunks created from text")
-        return {}
+        return {
+            "statement": text,
+            "questions": [],
+        }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 298 - 300, The early return when no chunks are created currently returns an empty dict which breaks the ShortQ response contract; instead, when chunks is empty in the function handling chunked requests (the block checking "if not chunks"), return the same stable shape as the non-chunked path — include the "statement" key with an empty string and the "questions" key with an empty list (and preserve any other top-level keys the non-chunked path returns) so callers always receive a consistent response shape.
207-208: ⚠️ Potential issue | 🟠 Major — Return the normal MCQ payload on this fallback path.
This branch still returns `{}`, so any long-document MCQ request that reaches it loses the `questions` key entirely instead of getting an empty result in the usual shape.

Suggested fix:

```diff
     if len(keyword_sentence_mapping) == 0:
-        return {}
+        return {
+            "statement": modified_text,
+            "questions": [],
+            "time_taken": time.time() - start_time,
+        }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/Generator/main.py` around lines 207 - 208, The fallback branch that checks "if len(keyword_sentence_mapping) == 0" currently returns an empty dict ({}), which breaks the expected MCQ payload shape; change this to return the normal MCQ response structure with an empty questions list (e.g., the same payload shape used elsewhere, such as {'questions': []} or via the existing payload builder) so callers still receive the expected keys; update the return in the branch that references keyword_sentence_mapping accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/Generator/chunking.py`:
- Around line 143-164: The overlap restoration can make the following chunk
exceed self.max_tokens because the code always appends the new sentence; in the
block that sets current_chunk = overlap_chunk and current_token_count =
overlap_tokens (when self.overlap_tokens > 0), trim the restored overlap so that
overlap_tokens + sentence_tokens <= self.max_tokens before appending the new
sentence: repeatedly remove earliest sentences from overlap_chunk (updating
overlap_tokens via self.count_tokens and current_token_count) until the sum
fits, or if no overlap can fit, clear current_chunk/current_token_count so the
sentence starts a fresh chunk; ensure references to self.overlap_tokens,
current_chunk, current_token_count, self.count_tokens, sentence_tokens, and
self.max_tokens are used to locate and update the logic.
In `@backend/Generator/main.py`:
- Around line 364-383: Replace the character-based slice text[:2000] with a
token-aware truncation using the model tokenizer so the fallbacks respect the
model token budget: before calling tokenize_into_sentences/identify_keywords,
encode the input with the same tokenizer used by generate_normal_questions
(e.g., self.tokenizer) and trim the encoded token list to the model's max
allowable tokens, then decode back to text for tokenize_into_sentences; apply
the same token-based truncation change in the other fallback block that uses
text[:2000] (around the code invoking tokenize_into_sentences,
identify_keywords, find_sentences_with_keywords, and generate_normal_questions).
- Around line 356-360: The branch that returns {"statement": text, "questions":
[], "time_taken": time.time() - start_time} will raise NameError because
start_time is not defined in _generate_shortq_with_chunking; fix by initializing
start_time (e.g., start_time = time.time()) at the start of
_generate_shortq_with_chunking or by accepting/propagating a start_time
parameter from the caller (mirroring _generate_mcq_with_chunking), then use that
defined start_time in the return statement so time_taken computes correctly.
---
Duplicate comments:
In `@backend/Generator/main.py`:
- Around line 517-519: The empty/chunk-failure branches currently return {} or
an MCQ payload which breaks the BoolQ response contract; update the branches
that handle "if not chunks" (and the similar block around lines 573-577) to
return the canonical BoolQ response shape with keys Text (empty string), Count
(0), and Boolean_Questions (empty list) so callers always receive the expected
types, and ensure any MCQ-specific returns are only used in MCQ code paths.
- Around line 298-300: The early return when no chunks are created currently
returns an empty dict which breaks the ShortQ response contract; instead, when
chunks is empty in the function handling chunked requests (the block checking
"if not chunks"), return the same stable shape as the non-chunked path — include
the "statement" key with an empty string and the "questions" key with an empty
list (and preserve any other top-level keys the non-chunked path returns) so
callers always receive a consistent response shape.
- Around line 207-208: The fallback branch that checks "if
len(keyword_sentence_mapping) == 0" currently returns an empty dict ({}), which
breaks the expected MCQ payload shape; change this to return the normal MCQ
response structure with an empty questions list (e.g., the same payload shape
used elsewhere, such as {'questions': []} or via the existing payload builder)
so callers still receive the expected keys; update the return in the branch that
references keyword_sentence_mapping accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fb0be4a4-d7ea-4b77-bedc-78a5af04273d
📒 Files selected for processing (2)
backend/Generator/chunking.py
backend/Generator/main.py
Addressed Issues
Fixes #560
Overview
This PR introduces token-aware semantic chunking to enable question generation from long documents that exceed transformer model token limits.
Currently, EduAid uses transformer models such as T5-large and T5-base, which typically support around 512 tokens. When longer inputs are provided, the model silently truncates the text, so any content beyond the limit never contributes to the generated questions.
This enhancement introduces a chunking pipeline that splits long input text into token-safe segments before passing them to the model.
Approach
The system now preprocesses long inputs using the following pipeline:
1. Count tokens and compare against the model's safe limit.
2. Segment the text into sentences.
3. Pack sentences into overlapping, token-safe chunks.
4. Generate questions from each chunk.
5. Aggregate and semantically deduplicate the results.
This ensures that all sections of a document contribute to the generated questions.
Key Improvements
1. Token-Aware Text Chunking
A new `TextChunker` module was introduced to split input text based on token limits rather than character count. Each chunk stays within the model's safe token range.
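The gating logic can be sketched as follows. This is illustrative only: the real `TextChunker` uses the model's tokenizer, while this sketch accepts any tokenizer-like object and degrades to a word count; the 512-token default follows the model limit described above.

```python
def count_tokens(text, tokenizer=None):
    """Count tokens with the model tokenizer; fall back to word count."""
    if tokenizer is not None:
        try:
            return len(tokenizer.encode(text, add_special_tokens=False))
        except Exception:
            pass  # degrade to the approximate word count below
    return len(text.split())


def needs_chunking(text, max_tokens=512, tokenizer=None):
    """True when the input exceeds the model's safe token budget."""
    return count_tokens(text, tokenizer) > max_tokens
```
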
2. Sentence-Aware Segmentation
Chunk boundaries respect sentence structure using NLTK sentence tokenization (`sent_tokenize`), with a simple period-based split as a fallback when NLTK fails.
This preserves semantic integrity and improves question quality.
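A minimal sketch of sentence segmentation with graceful degradation, assuming NLTK (or its punkt data) may be unavailable at runtime:

```python
import re

def split_into_sentences(text):
    """Sentence segmentation with a period-split fallback."""
    try:
        from nltk.tokenize import sent_tokenize  # may be unavailable
        return sent_tokenize(text)
    except Exception:
        # Degraded mode: split after sentence-ending punctuation.
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        return [p for p in parts if p]
```
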
3. Context Overlap
Chunks include overlapping tokens to maintain context continuity.
Overlap prevents information loss at chunk boundaries: the last sentences of one chunk are carried over to the start of the next, so questions that depend on cross-boundary context can still be generated.
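The overlap mechanism can be sketched as greedy sentence packing that carries trailing sentences into the next chunk. A word count stands in for the real tokenizer here, and this sketch (like the early version reviewed above) does not split a single sentence that exceeds `max_tokens`.

```python
def create_chunks(sentences, max_tokens=512, overlap_tokens=50,
                  count=lambda s: len(s.split())):
    """Greedy sentence packing with token overlap between chunks."""
    chunks, current, current_tokens = [], [], 0
    for sent in sentences:
        n = count(sent)
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward as overlap context.
            overlap, overlap_total = [], 0
            for prev in reversed(current):
                t = count(prev)
                if overlap_total + t > overlap_tokens:
                    break
                overlap.insert(0, prev)
                overlap_total += t
            current, current_tokens = overlap, overlap_total
        current.append(sent)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```
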
4. Proportional Question Distribution
Requested questions are distributed across chunks.
For example, 10 requested questions over chunks of roughly 600, 300, and 100 tokens would yield about 6, 3, and 1 questions respectively. This ensures balanced question generation across the document.
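One way to implement this is largest-remainder rounding over chunk sizes. This sketch is illustrative and may differ in detail from the PR's `distribute_question_count` (which also takes a `num_chunks` parameter).

```python
def distribute_question_count(total_questions, chunk_sizes):
    """Split a question budget across chunks in proportion to size."""
    total_size = sum(chunk_sizes) or 1
    raw = [total_questions * s / total_size for s in chunk_sizes]
    counts = [int(r) for r in raw]
    # Hand out leftover questions to the largest fractional remainders.
    leftover = total_questions - sum(counts)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```
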
5. Semantic Deduplication
Questions generated from multiple chunks may overlap.
To address this, a QuestionDeduplicator module removes semantically similar questions.
Approach: questions are vectorized with TF-IDF and compared pairwise via cosine similarity; a question whose similarity to an earlier question exceeds a threshold is dropped. This ensures the final output contains unique questions only.
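A pure-Python sketch of this approach (the actual `QuestionDeduplicator` likely relies on scikit-learn's `TfidfVectorizer` and `cosine_similarity`; the 0.85 threshold here is an assumption):

```python
import math
from collections import Counter

def deduplicate_questions(questions, threshold=0.85):
    """Keep each question only if no earlier question is too similar."""
    docs = [Counter(q.lower().split()) for q in questions]
    n = len(docs)
    df = Counter(w for d in docs for w in d)

    def tfidf(doc):
        # The +1 keeps weights positive even for corpus-wide words.
        return {w: tf * (math.log(n / df[w]) + 1.0) for w, tf in doc.items()}

    def cosine(a, b):
        dot = sum(v * b.get(w, 0.0) for w, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    kept, kept_vecs = [], []
    for q, d in zip(questions, docs):
        v = tfidf(d)
        if all(cosine(v, kv) < threshold for kv in kept_vecs):
            kept.append(q)
            kept_vecs.append(v)
    return kept
```
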
Generators Updated
Chunking support was integrated into the MCQ, ShortQ, and BoolQ generators.
Each generator now follows this logic: if the input fits within the model's safe token limit, the original single-pass path runs unchanged; otherwise the text is chunked, questions are generated per chunk, and the results are merged and deduplicated.
This maintains full backward compatibility for short inputs.
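The dispatch logic shared by the generators can be sketched as below. `generate_fn(text)` stands in for an existing per-text generator, and a naive word-based splitter stands in for the real token-aware `TextChunker`.

```python
def generate_with_chunking(text, generate_fn, max_tokens=512):
    """Route short inputs through the original path; chunk long ones."""
    words = text.split()
    if len(words) <= max_tokens:
        return generate_fn(text)  # unchanged short-input behavior
    questions = []
    for i in range(0, len(words), max_tokens):
        chunk = " ".join(words[i:i + max_tokens])
        questions.extend(generate_fn(chunk))
    # The real pipeline semantically deduplicates before returning.
    return questions
```
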
Benefits
This enhancement lets the system generate questions from documents well beyond the ~512-token model limit, so every section of a long input can contribute to the output, while short inputs behave exactly as before.
Additional Fixes Included
This PR also includes a number of smaller related fixes surfaced during review.
Checklist
AI Usage Disclosure
AI tools used: ChatGPT (OpenAI GPT-5)
AI Usage Disclosure:
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.