Support Long-Document Question Generation with Token-Aware Semantic Chunking #562

Open

piyush06singhal wants to merge 4 commits into AOSSIE-Org:main from piyush06singhal:feature/semantic-chunking-question-generator

Conversation


@piyush06singhal piyush06singhal commented Mar 12, 2026

Addressed Issues

Fixes #560


Overview

This PR introduces token-aware semantic chunking to enable question generation from long documents that exceed transformer model token limits.

EduAid currently uses transformer models such as T5-large and T5-base, which accept roughly 512 input tokens. When a longer input is provided, the text is truncated to fit this limit, causing:

  • Loss of context
  • Reduced question quality
  • Incomplete coverage of the document

This enhancement introduces a chunking pipeline that splits long input text into token-safe segments before passing them to the model.


Approach

The system now preprocesses long inputs using the following pipeline:

Long Input Text
      │
      ▼
Token Counting
      │
      ▼
Sentence-Aware Chunking
      │
      ▼
Overlapping Context Segments
      │
      ▼
Question Generation per Chunk
      │
      ▼
Semantic Question Deduplication
      │
      ▼
Final Aggregated Question Set

This ensures that all sections of a document contribute to the generated questions.


Key Improvements

1. Token-Aware Text Chunking

A new TextChunker module was introduced to split input text based on token limits rather than character count.

Chunk 1 → 0–400 tokens
Chunk 2 → 350–750 tokens
Chunk 3 → 700–1100 tokens
Chunk 4 → 1050–1500 tokens

Each chunk stays within the model's safe token range.
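
In outline, the chunker counts tokens with the model tokenizer and falls back to a rough word-count estimate if encoding fails. A simplified sketch (not the exact implementation; the default limits shown are illustrative):

import logging

logger = logging.getLogger(__name__)

class TextChunker:
    """Simplified sketch of the token-aware chunker in backend/Generator/chunking.py."""

    def __init__(self, tokenizer, max_tokens: int = 400, overlap_tokens: int = 50):
        self.tokenizer = tokenizer
        self.max_tokens = max_tokens          # safe per-chunk budget under the ~512-token model limit
        self.overlap_tokens = overlap_tokens  # tokens shared between consecutive chunks

    def count_tokens(self, text: str) -> int:
        try:
            return len(self.tokenizer.encode(text, add_special_tokens=False))
        except Exception:
            logger.exception("Error counting tokens")
            return len(text.split())  # rough word-count fallback

    def needs_chunking(self, text: str) -> bool:
        return self.count_tokens(text) > self.max_tokens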


2. Sentence-Aware Segmentation

Chunk boundaries respect sentence structure using:

nltk.sent_tokenize()

This preserves semantic integrity and improves question quality.
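
A simplified sketch of the splitting step, including a period-based fallback for when NLTK fails (illustrative, not the exact code):

from typing import List

from nltk.tokenize import sent_tokenize  # assumes the NLTK "punkt" data is installed

def split_into_sentences(text: str) -> List[str]:
    """Split text into sentences so chunk boundaries never cut a sentence in half."""
    try:
        return sent_tokenize(text)
    except Exception:
        # Fallback: naive period split keeps the pipeline running if NLTK fails
        return [s.strip() + "." for s in text.split(".") if s.strip()]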


3. Context Overlap

Chunks include overlapping tokens to maintain context continuity.

Example:

Chunk 1 → tokens 0–400
Chunk 2 → tokens 350–750

Overlap prevents information loss at chunk boundaries.
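
Roughly, chunks are built by packing sentences until the token budget is reached, then carrying the last few sentences forward as overlap. An illustrative sketch (the real logic lives in TextChunker.create_chunks):

from typing import Callable, List

def create_chunks(sentences: List[str], count_tokens: Callable[[str], int],
                  max_tokens: int = 400, overlap_tokens: int = 50) -> List[str]:
    """Greedily pack sentences into token-bounded chunks with sentence-level overlap."""
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        sentence_tokens = count_tokens(sentence)
        if current and current_tokens + sentence_tokens > max_tokens:
            chunks.append(" ".join(current))
            # Carry the trailing sentences (up to overlap_tokens) into the next chunk
            overlap, overlap_count = [], 0
            for prev in reversed(current):
                prev_tokens = count_tokens(prev)
                if overlap_count + prev_tokens > overlap_tokens:
                    break
                overlap.insert(0, prev)
                overlap_count += prev_tokens
            current, current_tokens = overlap, overlap_count
        current.append(sentence)
        current_tokens += sentence_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks

A production version also has to trim the overlap (or split an oversized sentence) when it would push a chunk past max_tokens; the review comments below call this case out.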


4. Proportional Question Distribution

The requested number of questions is distributed across chunks in proportion to their size.

Example:

Requested questions: 10
Chunks: 4

Distribution:
Chunk 1 → 3 questions
Chunk 2 → 3 questions
Chunk 3 → 2 questions
Chunk 4 → 2 questions

This ensures balanced question generation across the document.
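
The allocation helper is distribute_question_count() in chunking.py; a simplified sketch that splits proportionally by chunk size and hands the remainder to the largest chunks (the exact strategy in the module may differ):

from typing import List, Optional

def distribute_question_count(total_questions: int, num_chunks: int,
                              chunk_sizes: Optional[List[int]] = None) -> List[int]:
    """Allocate the requested question count across chunks, proportional to chunk size."""
    if num_chunks <= 0:
        return []
    if not chunk_sizes:
        chunk_sizes = [1] * num_chunks  # fall back to an equal split
    total_size = sum(chunk_sizes)
    counts = [total_questions * size // total_size for size in chunk_sizes]
    # Hand out any remainder one question at a time, starting with the largest chunks
    remainder = total_questions - sum(counts)
    for i in sorted(range(num_chunks), key=lambda i: chunk_sizes[i], reverse=True)[:remainder]:
        counts[i] += 1
    return counts

# Example: 10 questions over 4 equal-sized chunks -> [3, 3, 2, 2]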


5. Semantic Deduplication

Questions generated from multiple chunks may overlap.
To address this, a QuestionDeduplicator module removes semantically similar questions.

Approach:

TF-IDF Vectorization
      │
      ▼
Cosine Similarity
      │
      ▼
Remove questions with similarity > 0.85

This ensures the final output contains unique questions only.
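
A simplified sketch of the deduplication pass using scikit-learn (the greedy keep-first strategy and the 0.85 threshold mirror the description above; the helper name is illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate_questions(questions, threshold=0.85):
    """Keep the first occurrence of each question; drop later ones whose
    TF-IDF cosine similarity to an already-kept question exceeds the threshold."""
    if len(questions) < 2:
        return questions
    vectors = TfidfVectorizer().fit_transform(questions)
    similarity = cosine_similarity(vectors)
    kept = []
    for i, _ in enumerate(questions):
        if all(similarity[i, j] <= threshold for j in kept):
            kept.append(i)
    return [questions[i] for i in kept]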


Generators Updated

Chunking support was integrated into:

MCQGenerator
ShortQGenerator
BoolQGenerator

Each generator now follows this logic:

if chunker.needs_chunking(text):
    use chunking pipeline
else:
    use original generation pipeline

This maintains full backward compatibility for short inputs.
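
Putting the pieces together, the chunked path looks roughly like this (a sketch only; the per-chunk generation helper and the deduplicator method name are placeholders, not the exact ones in main.py):

def _generate_mcq_with_chunking(self, text: str, num_questions: int) -> dict:
    """Sketch of the chunked MCQ flow: chunk, allocate, generate per chunk, aggregate, dedupe."""
    chunks = self.chunker.create_chunks(text)
    if not chunks:
        # Keep the response shape stable even when no chunks could be created
        return {"statement": text, "questions": []}
    chunk_sizes = [self.chunker.count_tokens(chunk) for chunk in chunks]
    per_chunk = distribute_question_count(num_questions, len(chunks), chunk_sizes)
    questions = []
    for chunk, count in zip(chunks, per_chunk):
        if count > 0:
            questions.extend(self._generate_questions_for_chunk(chunk, count))  # placeholder helper
    return {
        "statement": text,
        "questions": self.deduplicator.deduplicate(questions),  # placeholder method name
    }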


Benefits

This enhancement provides several improvements:

  • Enables processing of long documents
  • Prevents transformer token overflow
  • Improves question diversity and quality
  • Ensures complete document coverage
  • Removes duplicate questions

The system can now generate questions from:

  • lecture notes
  • research summaries
  • long articles
  • uploaded documents
  • Wikipedia content

Additional Fixes Included

This PR also includes the following improvements:

  • Fixed token counting using tokenizer-based encoding
  • Prevented question distribution overflow
  • Added safe token buffer to prevent model overflow
  • Fixed recursion risk in chunking fallback logic
  • Updated tokenizer padding usage to the modern HuggingFace API (see the sketch below)
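
For the tokenizer update, the paraphrase path now calls the tokenizer directly rather than encode_plus; roughly as follows (the model name and max_length here are assumptions for illustration):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # assumed model for illustration
text = "An example sentence to paraphrase."

# Modern HuggingFace call style: invoke the tokenizer directly with explicit
# padding and truncation so inputs never exceed the model's token limit.
inputs = tokenizer(
    text,
    padding="max_length",
    truncation=True,
    max_length=512,  # assumed limit for illustration
    return_tensors="pt",
)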

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

AI Usage Disclosure

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with the policy. I have tested the code locally and I am responsible for it.

AI tools used: ChatGPT (OpenAI GPT-5)


AI Usage Disclosure:

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.

Summary by CodeRabbit

  • New Features

    • Long-document support: token-aware document chunking for all question generators.
    • Semantic deduplication of questions to remove near-duplicates.
    • Proportional distribution of requested questions across chunks and consolidated outputs.
  • Chores

    • Added logging, improved error handling/fallbacks, and updated tokenizer usage for paraphrasing paths.


coderabbitai bot commented Mar 12, 2026

Warning

Rate limit exceeded

@piyush06singhal has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 1 minute and 52 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 10f61bc9-6c33-4c98-b679-ad0df56ba98a

📥 Commits

Reviewing files that changed from the base of the PR and between ae60f68 and b8aad4e.

📒 Files selected for processing (2)
  • backend/Generator/chunking.py
  • backend/Generator/main.py
📝 Walkthrough

Adds a token‑aware chunking module and integrates chunked generation into MCQ, ShortQ, and BoolQ pipelines: texts are token‑counted, sentence‑segmented, split into overlapping token‑safe chunks, per‑chunk questions are generated, then aggregated and semantically deduplicated.

Changes

  • Chunking Module (backend/Generator/chunking.py): New module adding TextChunker (token counting with fallback, sentence splitting, token-aware chunk creation with overlap and logging), QuestionDeduplicator (TF-IDF + cosine similarity deduplication and extraction helpers), and distribute_question_count() for proportional/equal question allocation.
  • Generator Integration (backend/Generator/main.py): MCQ/ShortQ/BoolQ generators now instantiate TextChunker and QuestionDeduplicator, and add _generate_*_with_chunking() flows: detect long text, chunk, allocate per-chunk question counts, generate per chunk, aggregate and deduplicate. Added a module logger and updated the paraphrase tokenizer call to tokenizer(..., padding="max_length", truncation=True, return_tensors="pt"). Error fallbacks preserve short-text behavior.

Sequence Diagram

sequenceDiagram
    participant User as Input Text
    participant Tokenizer as Tokenizer
    participant Chunker as TextChunker
    participant Gen as QuestionGenerator
    participant Aggreg as Aggregator
    participant Dedup as QuestionDeduplicator
    participant Out as Final Output

    User->>Tokenizer: count_tokens(text)
    alt exceeds safe threshold
        Tokenizer->>Chunker: needs_chunking -> true
        Chunker->>Chunker: split_into_sentences()
        Chunker->>Chunker: create_chunks()
        loop for each chunk
            Chunker->>Gen: send chunk_i
            Gen->>Gen: generate_questions(chunk_i)
            Gen->>Aggreg: append questions_i
        end
        Aggreg->>Dedup: merged_questions
        Dedup->>Dedup: TF‑IDF vectorize & cosine_similarity
        Dedup->>Out: deduplicated_questions
    else within limit
        Tokenizer->>Gen: text
        Gen->>Gen: generate_questions(text)
        Gen->>Out: questions
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibbled long pages, counted each token fair,
Hopped sentences into cozy chunks with care,
Questions sprouted in every patch and nook,
Twins trimmed away with a clever look,
Now every chapter gets a carrot-share ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 54.55%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Out of Scope Changes check: ❓ Inconclusive. The PR includes a tokenizer API update (encode_plus to modern HuggingFace padding), which is a necessary supporting change but falls slightly outside the explicit chunking scope described in issue #560. Resolution: clarify whether the tokenizer API update is required for the chunking feature or is a separate improvement that should be addressed independently.

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately and concisely summarizes the main objective of the PR: implementing token-aware semantic chunking for long-document question generation.
  • Linked Issues check: ✅ Passed. The PR comprehensively implements all core objectives from issue #560: token detection, sentence-aware chunking, proportional question distribution, per-chunk generation, semantic deduplication, and backward compatibility.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
backend/Generator/chunking.py (3)

255-255: Fix implicit Optional type hint for PEP 484 compliance.

The parameter chunk_sizes: List[int] = None should use explicit Optional or union syntax.

Suggested fix
-def distribute_question_count(total_questions: int, num_chunks: int, chunk_sizes: List[int] = None) -> List[int]:
+def distribute_question_count(total_questions: int, num_chunks: int, chunk_sizes: List[int] | None = None) -> List[int]:

Or for Python 3.8 compatibility:

from typing import Optional
def distribute_question_count(total_questions: int, num_chunks: int, chunk_sizes: Optional[List[int]] = None) -> List[int]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/chunking.py` at line 255, The parameter annotation on
distribute_question_count uses a default of None but types it as List[int];
update it to an explicit optional type (e.g., chunk_sizes: Optional[List[int]] =
None) and add "Optional" to the typing imports so the signature is PEP 484
compliant and works on Python 3.8; keep the return type as List[int].

81-90: Good fallback strategy; consider logging.exception here as well.

The period-split fallback maintains graceful degradation when NLTK fails. Same minor suggestion to use logging.exception for stack trace visibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/chunking.py` around lines 81 - 90, The except block that
catches errors from sent_tokenize currently calls logger.error and drops the
stack trace; replace the logger.error call with logger.exception so the
exception stack is logged (keep the same message text "Error splitting
sentences: {e}" or a similar descriptive message) while leaving the existing
period-split fallback intact; locate the try/except surrounding sent_tokenize in
chunking.py and update the logger call accordingly.

47-53: Consider using logging.exception for better error diagnostics.

When catching exceptions, logging.exception automatically includes the stack trace, which aids debugging. The word-count fallback is a reasonable degradation, but the imprecision (words ≠ tokens) could cause chunks to exceed model limits in edge cases.

Suggested improvement
         except Exception as e:
-            logger.error(f"Error counting tokens: {e}")
+            logger.exception(f"Error counting tokens: {e}")
             # Fallback to word count estimation
             return len(text.split())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/chunking.py` around lines 47 - 53, Replace the bare
logger.error call in the token counting fallback with logging.exception to
capture the stack trace: in the method that calls self.tokenizer.encode(...)
switch logger.error(f"Error counting tokens: {e}") to logger.exception(...) (or
logger.exception("Error counting tokens") depending on logger usage) so the
exception and traceback are recorded; optionally log that you are falling back
to the approximate word-count heuristic and keep the existing fallback return of
len(text.split()) to preserve behavior while improving diagnosability.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 180-206: The fallback except block returns a dict without
time_taken; capture or reference the original start_time (or set start_time =
time.time() at the start of the surrounding function if missing) and compute
time_taken = time.time() - start_time before returning; then include
"time_taken": time_taken in the returned dict alongside "statement" and
"questions" (the changes occur in the except block that uses logger.error,
tokenize_into_sentences, identify_keywords, keyword_sentence_mapping, and
generate_multiple_choice_questions).
- Around line 559-562: There are two issues: a missing closing parenthesis in
the exception return and an infinite recursion because the truncated text still
triggers needs_chunking; fix by closing the parenthesis and changing the
fallback to either call a non-chunking path or ensure the truncated text is
below the chunking threshold — e.g., compute a safe truncation using the same
needs_chunking/safe_limit logic or call a dedicated non-chunking method (avoid
calling self.generate_boolq which delegates back to
_generate_boolq_with_chunking); reference _generate_boolq_with_chunking,
self.generate_boolq, needs_chunking and safe_limit when applying the change.

---

Nitpick comments:
In `@backend/Generator/chunking.py`:
- Line 255: The parameter annotation on distribute_question_count uses a default
of None but types it as List[int]; update it to an explicit optional type (e.g.,
chunk_sizes: Optional[List[int]] = None) and add "Optional" to the typing
imports so the signature is PEP 484 compliant and works on Python 3.8; keep the
return type as List[int].
- Around line 81-90: The except block that catches errors from sent_tokenize
currently calls logger.error and drops the stack trace; replace the logger.error
call with logger.exception so the exception stack is logged (keep the same
message text "Error splitting sentences: {e}" or a similar descriptive message)
while leaving the existing period-split fallback intact; locate the try/except
surrounding sent_tokenize in chunking.py and update the logger call accordingly.
- Around line 47-53: Replace the bare logger.error call in the token counting
fallback with logging.exception to capture the stack trace: in the method that
calls self.tokenizer.encode(...) switch logger.error(f"Error counting tokens:
{e}") to logger.exception(...) (or logger.exception("Error counting tokens")
depending on logger usage) so the exception and traceback are recorded;
optionally log that you are falling back to the approximate word-count heuristic
and keep the existing fallback return of len(text.split()) to preserve behavior
while improving diagnosability.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 35afaf56-f6f7-4d5f-bc49-6d372ba90679

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and df3ff38.

📒 Files selected for processing (2)
  • backend/Generator/chunking.py
  • backend/Generator/main.py

@piyush06singhal (Author)

@coderabbitai review

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (1)
backend/Generator/main.py (1)

182-184: ⚠️ Potential issue | 🟠 Major

Fallback truncation is still not token-safe.

text[:2000] is character-based, so these recovery paths can still exceed the tokenizer budget that chunking was added to enforce. That reintroduces the same overflow/silent-truncation risk on the exception path.

Safer fallback
-            truncated_text = text[:2000]
+            safe_limit = self.chunker.max_tokens - 50
+            token_ids = self.tokenizer.encode(text, add_special_tokens=False)[:safe_limit]
+            truncated_text = self.tokenizer.decode(token_ids, skip_special_tokens=True)

Using the first safe chunk from self.chunker.create_chunks(text) would also work if chunk creation itself succeeded.

Also applies to: 350-352, 565-567

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 182 - 184, The fallback truncation
uses character-slicing (truncated_text = text[:2000]) which is not token-safe;
update the fallback in the blocks around truncated_text/tokenize_into_sentences
to obtain a token-safe chunk instead—preferably call
self.chunker.create_chunks(text) and use the first returned chunk (or the first
chunk's text) as the fallback, and only if chunk creation raises or returns
empty, perform a tokenizer-aware truncation (truncate by token count using your
tokenizer utility) before passing into tokenize_into_sentences/modified_text;
apply the same change to the other occurrences mentioned (around lines 350-352
and 565-567) so all exception/recovery paths use token-safe chunking.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/chunking.py`:
- Around line 104-106: The current early-return branches can emit oversized
chunks (returning [text] when sentence splitting fails and returning a single
long sentence when a sentence > max_tokens), breaking the token-safety
guarantee; update the logic in the function that uses variables sentences,
max_tokens and logger.warning so that before returning any chunk you ensure it's
<= max_tokens: when sentences is empty (the logger.warning branch) do not return
[text] unchanged but fall back to the chunking routine that iteratively slices
the raw text into token-sized subchunks (reuse or extract the same splitting
logic used later), and likewise when encountering a single sentence whose token
length > max_tokens (the branch around lines 115-125), split that sentence into
multiple token-bounded pieces instead of returning it whole; ensure you
reference and use the same tokenization/counting helper used elsewhere so all
returned chunks respect max_tokens.

In `@backend/Generator/main.py`:
- Around line 117-119: Replace bare {} returns in backend/Generator/main.py with
stable response shapes expected by backend/server.py: instead of returning {} in
branches that handle ShortQ outputs return {"questions": []}, for BoolQ outputs
return {"Boolean_Questions": [], "Count": 0}, and for other places that expect
both keys return the combined shape (e.g., {"questions": [],
"Boolean_Questions": [], "Count": 0}) so callers indexing output["questions"] or
output["Boolean_Questions"] won't get KeyError; update the return statements
found around the if-not-chunks checks and the similar early-fallback branches
referenced in the comment (lines ~117-119, ~176-178, ~197-198, ~288-290,
~344-346, ~503-505, ~557-559) by returning the appropriate keyed empty
lists/zero counts matching the function's expected response shape.
- Around line 205-208: The return in the MCQ fallback uses an undefined variable
end_time causing UnboundLocalError; change the time field to use the correctly
set rend_time (or ensure end_time is defined earlier). Specifically, in the
fallback return that returns {"statement": modified_text, "questions":
generated_questions["questions"], "time_taken": end_time - start_time}, replace
end_time with rend_time (or initialize end_time before the try/except) so the
function returns a valid payload when chunking fails.

---

Duplicate comments:
In `@backend/Generator/main.py`:
- Around line 182-184: The fallback truncation uses character-slicing
(truncated_text = text[:2000]) which is not token-safe; update the fallback in
the blocks around truncated_text/tokenize_into_sentences to obtain a token-safe
chunk instead—preferably call self.chunker.create_chunks(text) and use the first
returned chunk (or the first chunk's text) as the fallback, and only if chunk
creation raises or returns empty, perform a tokenizer-aware truncation (truncate
by token count using your tokenizer utility) before passing into
tokenize_into_sentences/modified_text; apply the same change to the other
occurrences mentioned (around lines 350-352 and 565-567) so all
exception/recovery paths use token-safe chunking.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c350b868-4d12-4e7f-8de6-5cbfce4e50dd

📥 Commits

Reviewing files that changed from the base of the PR and between df3ff38 and 972c378.

📒 Files selected for processing (2)
  • backend/Generator/chunking.py
  • backend/Generator/main.py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (3)
backend/Generator/main.py (3)

517-519: ⚠️ Potential issue | 🟠 Major

Preserve the BoolQ response contract on empty/failure paths.

The non-chunked BoolQ path returns Text, Count, and Boolean_Questions, but these branches return either {} or an MCQ-style payload. That makes long-document failures incompatible with the existing BoolQ caller expectations.

Suggested fix
             if not chunks:
                 logger.warning("No chunks created from text")
-                return {}
+                return {
+                    "Text": text,
+                    "Count": 0,
+                    "Boolean_Questions": [],
+                }
@@
             else:
                 logger.warning("No questions generated from any chunk")
                 return {
-                    "statement": text,
-                    "questions": [],
-                    "time_taken": time.time() - start_time,
+                    "Text": text,
+                    "Count": 0,
+                    "Boolean_Questions": [],
                 }

Also applies to: 573-577

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 517 - 519, The empty/chunk-failure
branches currently return {} or an MCQ payload which breaks the BoolQ response
contract; update the branches that handle "if not chunks" (and the similar block
around lines 573-577) to return the canonical BoolQ response shape with keys
Text (empty string), Count (0), and Boolean_Questions (empty list) so callers
always receive the expected types, and ensure any MCQ-specific returns are only
used in MCQ code paths.

298-300: ⚠️ Potential issue | 🟠 Major

Keep the ShortQ empty result shape stable.

Returning {} here breaks the method's own contract for chunked requests. The non-chunked path always returns statement and questions, so callers shouldn't need a special-case for long inputs.

Suggested fix
             if not chunks:
                 logger.warning("No chunks created from text")
-                return {}
+                return {
+                    "statement": text,
+                    "questions": [],
+                }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 298 - 300, The early return when no
chunks are created currently returns an empty dict which breaks the ShortQ
response contract; instead, when chunks is empty in the function handling
chunked requests (the block checking "if not chunks"), return the same stable
shape as the non-chunked path — include the "statement" key with an empty string
and the "questions" key with an empty list (and preserve any other top-level
keys the non-chunked path returns) so callers always receive a consistent
response shape.

207-208: ⚠️ Potential issue | 🟠 Major

Return the normal MCQ payload on this fallback path.

This branch still returns {}, so any long-document MCQ request that reaches it loses the questions key entirely instead of getting an empty result in the usual shape.

Suggested fix
             if len(keyword_sentence_mapping) == 0:
-                return {}
+                return {
+                    "statement": modified_text,
+                    "questions": [],
+                    "time_taken": time.time() - start_time,
+                }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 207 - 208, The fallback branch that
checks "if len(keyword_sentence_mapping) == 0" currently returns an empty dict
({}), which breaks the expected MCQ payload shape; change this to return the
normal MCQ response structure with an empty questions list (e.g., the same
payload shape used elsewhere, such as {'questions': []} or via the existing
payload builder) so callers still receive the expected keys; update the return
in the branch that references keyword_sentence_mapping accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/chunking.py`:
- Around line 143-164: The overlap restoration can make the following chunk
exceed self.max_tokens because the code always appends the new sentence; in the
block that sets current_chunk = overlap_chunk and current_token_count =
overlap_tokens (when self.overlap_tokens > 0), trim the restored overlap so that
overlap_tokens + sentence_tokens <= self.max_tokens before appending the new
sentence: repeatedly remove earliest sentences from overlap_chunk (updating
overlap_tokens via self.count_tokens and current_token_count) until the sum
fits, or if no overlap can fit, clear current_chunk/current_token_count so the
sentence starts a fresh chunk; ensure references to self.overlap_tokens,
current_chunk, current_token_count, self.count_tokens, sentence_tokens, and
self.max_tokens are used to locate and update the logic.

In `@backend/Generator/main.py`:
- Around line 364-383: Replace the character-based slice text[:2000] with a
token-aware truncation using the model tokenizer so the fallbacks respect the
model token budget: before calling tokenize_into_sentences/identify_keywords,
encode the input with the same tokenizer used by generate_normal_questions
(e.g., self.tokenizer) and trim the encoded token list to the model's max
allowable tokens, then decode back to text for tokenize_into_sentences; apply
the same token-based truncation change in the other fallback block that uses
text[:2000] (around the code invoking tokenize_into_sentences,
identify_keywords, find_sentences_with_keywords, and generate_normal_questions).
- Around line 356-360: The branch that returns {"statement": text, "questions":
[], "time_taken": time.time() - start_time} will raise NameError because
start_time is not defined in _generate_shortq_with_chunking; fix by initializing
start_time (e.g., start_time = time.time()) at the start of
_generate_shortq_with_chunking or by accepting/propagating a start_time
parameter from the caller (mirroring _generate_mcq_with_chunking), then use that
defined start_time in the return statement so time_taken computes correctly.

---

Duplicate comments:
In `@backend/Generator/main.py`:
- Around line 517-519: The empty/chunk-failure branches currently return {} or
an MCQ payload which breaks the BoolQ response contract; update the branches
that handle "if not chunks" (and the similar block around lines 573-577) to
return the canonical BoolQ response shape with keys Text (empty string), Count
(0), and Boolean_Questions (empty list) so callers always receive the expected
types, and ensure any MCQ-specific returns are only used in MCQ code paths.
- Around line 298-300: The early return when no chunks are created currently
returns an empty dict which breaks the ShortQ response contract; instead, when
chunks is empty in the function handling chunked requests (the block checking
"if not chunks"), return the same stable shape as the non-chunked path — include
the "statement" key with an empty string and the "questions" key with an empty
list (and preserve any other top-level keys the non-chunked path returns) so
callers always receive a consistent response shape.
- Around line 207-208: The fallback branch that checks "if
len(keyword_sentence_mapping) == 0" currently returns an empty dict ({}), which
breaks the expected MCQ payload shape; change this to return the normal MCQ
response structure with an empty questions list (e.g., the same payload shape
used elsewhere, such as {'questions': []} or via the existing payload builder)
so callers still receive the expected keys; update the return in the branch that
references keyword_sentence_mapping accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fb0be4a4-d7ea-4b77-bedc-78a5af04273d

📥 Commits

Reviewing files that changed from the base of the PR and between 972c378 and ae60f68.

📒 Files selected for processing (2)
  • backend/Generator/chunking.py
  • backend/Generator/main.py

Development

Successfully merging this pull request may close these issues.

[FEATURE] Long-Document Question Generation via Token-Aware Semantic Chunking
