
perf(mcq): improve computational resource management with batching and lazy model loading#552

Open
piyush06singhal wants to merge 3 commits into AOSSIE-Org:main from piyush06singhal:perf/mcq-resource-management

Conversation


piyush06singhal commented Mar 10, 2026

Addressed Issues:

Fixes #261


Additional Notes:

This PR improves computational resource management in the MCQ generation pipeline located in:

backend/Generator/mcq.py

The existing implementation could lead to high memory consumption and inefficient model usage when processing large input texts or running multiple models simultaneously. This PR introduces several optimizations to improve performance, stability, and scalability.


Key Improvements

1. Batch Processing

  • Implemented chunked text processing to avoid processing large inputs at once.
  • Added configurable batch size to reduce peak memory usage.
  • Aggregates results from multiple batches safely.
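In outline, the chunked processing described above might look like the following sketch. The name split_into_batches mirrors the helper mentioned later in the review summary, but the exact signatures in the PR may differ:

```python
def split_into_batches(items, batch_size):
    """Yield successive fixed-size batches from a list of items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def process_in_batches(answers, batch_size, process_batch):
    """Run process_batch on each chunk and aggregate the results safely."""
    results = []
    for batch in split_into_batches(answers, batch_size):
        results.extend(process_batch(batch))
    return results
```

Processing chunk by chunk bounds peak memory at roughly one batch's worth of tensors instead of the whole input.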

2. Lazy Model Loading

  • Introduced a model management mechanism that loads models only when required.
  • Prevents unnecessary memory allocation during application startup.
  • Ensures models are reused instead of repeatedly reloaded.
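The load-on-first-use behavior described above can be sketched with a cached property. The class and attribute names here are illustrative, not the PR's actual code:

```python
class MCQGenerator:
    """Sketch: defer heavy model loading until the model is first accessed."""

    def __init__(self):
        self._model = None  # nothing is loaded at application startup

    @property
    def model(self):
        if self._model is None:
            self._model = self._load_model()  # loaded only on first access
        return self._model  # subsequent accesses reuse the cached instance

    def _load_model(self):
        # Placeholder for an expensive load (e.g., a transformers checkpoint).
        return {"name": "t5-mcq"}
```

Instantiating the generator is now cheap; the cost is paid only by requests that actually need the model.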

3. Memory Optimization

  • Added memory cleanup mechanisms after heavy model inference.
  • Utilized gc.collect() for garbage collection.
  • Added GPU memory cleanup using torch.cuda.empty_cache() when CUDA is available.
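A minimal sketch of the cleanup step described above, assuming a device string such as "cpu" or "cuda"; the guard around the torch import is an assumption so the helper also works in environments without GPU support:

```python
import gc

def cleanup_memory(device="cpu"):
    """Release Python garbage and, when applicable, cached CUDA memory."""
    gc.collect()  # reclaim unreachable Python objects after heavy inference
    if device == "cuda":
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()  # return cached CUDA blocks to the driver
        except ImportError:
            pass  # torch not installed; nothing GPU-side to clean
```

Calling this after each batch keeps fragmentation and peak usage down between inference runs.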

4. Resource Monitoring

  • Implemented system memory monitoring to detect low-memory environments.
  • Enables adaptive behavior such as reducing batch size when resources are constrained.
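The adaptive behavior described above can be sketched as a pure function. The watermark and halving strategy are assumptions; the PR's exact thresholds are not shown here:

```python
def adjust_batch_size(default_batch_size, memory_percent,
                      high_watermark=85.0, min_batch_size=1):
    """Shrink the batch size when system memory usage crosses a watermark."""
    if memory_percent >= high_watermark:
        # Under pressure, halve the batch but never go below the minimum.
        return max(min_batch_size, default_batch_size // 2)
    return default_batch_size
```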

5. CPU / GPU Compatibility

  • Added automatic device detection using torch.cuda.is_available().
  • Ensures models run on GPU when available and fall back to CPU otherwise.
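A minimal sketch of this fallback; the ImportError guard is an assumption for environments where torch is not installed:

```python
def get_device():
    """Prefer CUDA when available, otherwise fall back to CPU."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        return "cpu"  # torch absent: CPU is the only option
```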

6. Improved Error Handling

  • Added safeguards for large input texts.
  • Prevented crashes caused by memory pressure.
  • Introduced controlled fallback mechanisms to maintain service stability.

Expected Benefits

  • Reduced memory consumption during MCQ generation
  • Improved performance for large input texts
  • Better stability in resource-constrained environments
  • More scalable processing pipeline for future feature extensions

These improvements maintain full backward compatibility with the existing MCQ generation workflow and do not modify the external API behavior.


Screenshots/Recordings:

N/A — This change focuses on backend performance optimization and resource management within the MCQ generation pipeline. No UI changes were introduced.


Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may otherwise be closed without review.

Check one of the checkboxes below:

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: ChatGPT (for architectural guidance and code review assistance)


AI Usage Disclosure:

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.

Summary by CodeRabbit

  • New Features

    • Lazy, device-aware model loading with per-model caching and on-demand initialization.
    • Memory-aware batching and dynamic batch sizing for generation jobs.
  • Bug Fixes

    • Broader error handling with guaranteed cleanup to reduce failures and memory leaks during generation.
  • Refactor

    • Centralized logging and lazy initialization of backend services/endpoints to improve startup time and observability.


coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Adds a centralized, thread-safe ModelManager for lazy, device-aware model loading/unloading and caching; implements memory-aware batching, monitoring, and cleanup in MCQ generation; converts eager global instantiation in server routes to concurrency-safe lazy getters; and adds psutil dependency.

Changes

  • Model Management & Lazy Loading (backend/Generator/main.py): Added ModelManager class and global model_manager; introduced per-model locks and device-aware load/unload/cache APIs; converted generators/evaluators to lazy properties for tokenizers/models/nlp/s2v/fdist/etc.; added centralized logging and expanded error handling.
  • Memory Management & Batching (backend/Generator/mcq.py): Added memory utilities (get_memory_usage, cleanup_memory, adjust_batch_size, split_into_batches) and resource constants; refactored generate_multiple_choice_questions and generate_normal_questions to use memory-aware batching, dynamic batch adjustment, per-batch error handling, and explicit GC / cuda.empty_cache calls; replaced prints with structured logging.
  • Server-Side Lazy Initialization (backend/server.py): Replaced eager global component instantiation with thread-safe lazy getters (get_mcq_generator, get_answer_predictor, get_boolq_generator, get_shortq_generator, get_question_generator, get_docs_service, get_file_processor, get_mediawiki, get_qa_model) and updated route handlers to call these getters.
  • Dependencies (requirements.txt): Added psutil to enable the memory monitoring used by the MCQ memory utilities.
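Since the summary names a get_memory_usage helper built on psutil, a hedged sketch might look like this; returning 100.0 on failure is an assumption that makes callers behave conservatively, in line with a later review comment:

```python
def get_memory_usage():
    """System memory usage as a percentage in [0, 100]."""
    try:
        import psutil
        return float(psutil.virtual_memory().percent)
    except Exception:
        # If monitoring fails, report high usage so adaptive logic
        # (e.g., batch-size reduction) errs on the side of caution.
        return 100.0
```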

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Server
    participant Generator
    participant ModelManager
    participant Cache as "Model Cache"
    participant Device
    participant Memory

    Client->>Server: request (e.g., generate_mcq)
    Server->>Generator: obtain generator (get_..._generator) and invoke generation
    Generator->>ModelManager: load_model(model_key, pretrained)
    alt model cached
        ModelManager->>Cache: retrieve(model_key)
        Cache-->>ModelManager: model
    else not cached
        ModelManager->>Memory: get_memory_usage()
        alt sufficient memory
            ModelManager->>Device: select_device()
            Device-->>ModelManager: device
            ModelManager->>ModelManager: instantiate model on device
            ModelManager->>Cache: store(model_key, model)
        else insufficient memory
            ModelManager->>Memory: cleanup_memory()
            Memory->>Memory: gc.collect(), cuda.empty_cache()
            ModelManager->>Device: select_device()
            ModelManager->>ModelManager: instantiate model on device
            ModelManager->>Cache: store(model_key, model)
        end
    end
    ModelManager-->>Generator: return model
    Generator->>Generator: run generation (batching, tokenize, model.generate, decode)
    Generator-->>Server: results
    Server->>Memory: cleanup_memory()
    Memory->>Memory: gc.collect(), cuda.empty_cache()
    Server-->>Client: response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 I wait until I'm called to spring,
Load my models light and cling,
Nibble batches, tidy crumbs away,
Cache my carrots for another day,
Soft hops, clean burrow — ready to play! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 64.84%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title clearly describes the main performance enhancement focus: computational resource management improvements via batching and lazy model loading, which directly aligns with the primary PR changes.
  • Linked Issues Check ✅ Passed: The PR comprehensively addresses issue #261's objectives: it implements batch processing with configurable sizes [mcq.py], adds lazy model loading via ModelManager [main.py], includes memory monitoring with get_memory_usage() [mcq.py], implements memory cleanup with gc.collect() and cuda.empty_cache() [mcq.py], provides dynamic batch sizing [mcq.py], and includes robust error handling across generators [main.py, mcq.py].
  • Out of Scope Changes Check ✅ Passed: While the primary changes focus on MCQ resource management, additional scope includes broader generator refactoring (ShortQ, BoolQ, Paraphrase, AnswerPredictor, QuestionGenerator) and server-side lazy initialization, which are supportive infrastructure enhancements aligned with the overall resource management objective.




coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/server.py (2)

236-244: ⚠️ Potential issue | 🔴 Critical

/get_boolean_answer is passing the wrong payload shape.

AnswerPredictor.predict_boolean_answer() expects the full input_question list, but this route passes one string at a time. The predictor then iterates over the characters of that string and returns a non-empty list, so the if (qa_response) check makes almost every non-empty question come back "True".

🛠️ Minimal fix
-    answer_pred = get_answer_predictor()
-    for question in input_questions:
-        qa_response = answer_pred.predict_boolean_answer(
-            {"input_text": input_text, "input_question": question}
-        )
-        if(qa_response):
-            output.append("True")
-        else:
-            output.append("False")
+    answers = get_answer_predictor().predict_boolean_answer(
+        {"input_text": input_text, "input_question": input_questions}
+    )
+    output = ["True" if answer else "False" for answer in answers]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 236 - 244, The route is calling
AnswerPredictor.predict_boolean_answer with a single question string, causing
the predictor to iterate characters; instead pass the whole input_questions list
and handle the returned per-question results. Call
answer_pred.predict_boolean_answer with payload {"input_text": input_text,
"input_question": input_questions}, assign its result (qa_response) which should
be a list of booleans, then iterate over qa_response to append "True"/"False" to
output; adjust references to input_questions, input_text,
predict_boolean_answer, qa_response and output accordingly.

480-486: ⚠️ Potential issue | 🔴 Critical

/get_boolq_hard is wired to an unsupported answer_style.

QuestionGenerator.generate_qg_inputs() only accepts "all", "sentences", and "multiple_choice". Passing "true_false" forces the exception path in QuestionGenerator.generate() and this endpoint silently returns []. Either add true/false support end-to-end or route this endpoint through BoolQGenerator.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 480 - 486, The /get_boolq_hard endpoint is
passing answer_style="true_false" to get_question_generator().generate(), but
QuestionGenerator.generate_qg_inputs() only supports "all", "sentences", and
"multiple_choice", causing the exception path and an empty response; fix by
routing this endpoint to the BoolQGenerator (e.g., instantiate/use
BoolQGenerator instead of generic QuestionGenerator) or extend
QuestionGenerator.generate_qg_inputs() and QuestionGenerator.generate() to
accept and handle "true_false" end-to-end (implement input conversion and output
formatting for true/false questions), and update the call site in backend.server
(where get_question_generator() is used) to use the appropriate generator.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 263-280: The branches that currently return an empty dict
(final_output = {}) break the expected response schema; instead ensure
final_output is initialized with the standard envelope keys (at minimum
"questions": [] for both MCQ and ShortQ) and for MCQ also include a "time_taken"
numeric value (e.g., 0 or measured duration) before any early return; update
both the "no keywords" path and the exception catch around
generate_multiple_choice_questions (and the analogous branch at lines 414-418)
to return final_output with these keys rather than {} so downstream code can
safely access output["questions"].
- Around line 693-700: The NLI model is loaded onto self.device but
predict_boolean_answer sends CPU tensors to it; fix by moving all input tensors
to the model device before inference. In predict_boolean_answer (and the other
similar block referenced), take the tokenized inputs/torch tensors produced for
the NLI call and do something like: inputs = {k: v.to(self.device) for k, v in
inputs.items()} (or call .to(self.device) on each tensor) immediately before
calling self.nli_model(**inputs), ensuring device consistency with the loaded
model.
- Around line 39-104: The load_model cache access is unsynchronized causing
duplicate concurrent loads; add synchronization: in __init__ create self._lock =
threading.Lock() and self._model_locks = {} (or defaultdict), then in load_model
first use self._lock to check if model_name in self._models and to create/get a
per-model lock (e.g., lock = self._model_locks.setdefault(model_name,
threading.Lock())); then acquire that per-model lock, re-check self._models for
model_name, perform the actual loading (the existing body that constructs the
model) only if still missing, store it into self._models, and finally release
the per-model lock so concurrent callers wait for the single load instead of
duplicating work.
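The per-model, double-checked locking described in this prompt can be sketched as follows; loader stands in for the actual model construction, and this is an illustrative outline rather than the PR's code:

```python
import threading

class ModelManager:
    """Sketch: one load per model name, even under concurrent callers."""

    def __init__(self):
        self._lock = threading.Lock()   # guards the cache and the lock table
        self._model_locks = {}          # one lock per model name
        self._models = {}

    def load_model(self, model_name, loader):
        with self._lock:
            if model_name in self._models:
                return self._models[model_name]
            lock = self._model_locks.setdefault(model_name, threading.Lock())
        with lock:
            # Re-check: another thread may have finished loading while we waited.
            if model_name not in self._models:
                model = loader()        # the expensive load happens exactly once
                with self._lock:
                    self._models[model_name] = model
        return self._models[model_name]
```

Concurrent callers for the same name block on the per-model lock and then hit the cache, so the loader runs only once while loads of different models still proceed in parallel.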

In `@backend/Generator/mcq.py`:
- Around line 148-155: The logger calls in backend/Generator/mcq.py currently
write the raw extracted answer (variable answer) into logs in the try/except
around find_similar_words(s2v_model) and at the other noted locations (lines
~361-363 and ~479-480); change these logger.info/logger.error calls to avoid
printing the full answer—log a redacted form such as a truncated snippet, a
length, an identifier, or a secure hash (e.g., "answer_length=...,
answer_hash=...") and keep the existing contextual messages; update the messages
in the block containing find_similar_words(answer, s2v_model) and the other two
similar logging sites to use the redacted/hashed info instead of the raw answer
variable.
- Around line 313-317: The batch_encode_plus calls in mcq.py can produce
arbitrarily large tensors because they don’t bound sequence length; update both
occurrences (the tokenizer.batch_encode_plus calls around the shown diff and at
the other occurrence) to pass truncation=True and a max_length (e.g.,
max_length=tokenizer.model_max_length or a project MAX_SEQ_LEN constant) and use
explicit padding (padding='max_length' or padding=True as appropriate) so each
input is truncated to the model limit before batching; also keep the existing
recovery/skip logic but operate on the truncated inputs so a single long snippet
cannot blow up memory for the whole batch.
- Around line 30-40: The batching flow drops work because answer_batches is
materialized once and adjust_batch_size() never reinserts smaller batches when
memory checks or exceptions occur; change the loop that consumes answer_batches
(and related code paths handling memory checks and exceptions) to treat batches
as requeueable work: on low memory or on OOM/encoding exceptions inside the
processing of a batch (references: answer_batches, adjust_batch_size), split the
failing batch into smaller batches using adjust_batch_size or a splitter, push
those smaller batches back onto the active queue/iterator and retry them instead
of continuing, and/or convert answer_batches into a lazy generator so subsequent
adjustments apply; ensure all exception handlers replace the discarded batch
with retried smaller units rather than silently dropping questions.

In `@backend/server.py`:
- Around line 36-108: The lazy singletons (get_mcq_generator,
get_answer_predictor, get_boolq_generator, get_shortq_generator,
get_question_generator, get_docs_service, get_file_processor, get_mediawiki,
get_qa_model) must be protected with first-use locking to avoid duplicate
initialization under concurrency: add a module-level threading.Lock (or
individual locks) and implement double-checked locking in each getter (check if
_x is None, acquire the lock, check again, then assign), ensure you import
threading and use the same lock variable (e.g., _init_lock) and the exact global
names (_mcq_gen, _answer_predictor, etc.) so initialization happens only once.

---

Outside diff comments:
In `@backend/server.py`:
- Around line 236-244: The route is calling
AnswerPredictor.predict_boolean_answer with a single question string, causing
the predictor to iterate characters; instead pass the whole input_questions list
and handle the returned per-question results. Call
answer_pred.predict_boolean_answer with payload {"input_text": input_text,
"input_question": input_questions}, assign its result (qa_response) which should
be a list of booleans, then iterate over qa_response to append "True"/"False" to
output; adjust references to input_questions, input_text,
predict_boolean_answer, qa_response and output accordingly.
- Around line 480-486: The /get_boolq_hard endpoint is passing
answer_style="true_false" to get_question_generator().generate(), but
QuestionGenerator.generate_qg_inputs() only supports "all", "sentences", and
"multiple_choice", causing the exception path and an empty response; fix by
routing this endpoint to the BoolQGenerator (e.g., instantiate/use
BoolQGenerator instead of generic QuestionGenerator) or extend
QuestionGenerator.generate_qg_inputs() and QuestionGenerator.generate() to
accept and handle "true_false" end-to-end (implement input conversion and output
formatting for true/false questions), and update the call site in backend.server
(where get_question_generator() is used) to use the appropriate generator.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 61b65e61-f4b9-46f2-aa81-e14268bf70bd

📥 Commits

Reviewing files that changed from the base of the PR and between fc3bf1a and 7d31b3e.

📒 Files selected for processing (4)
  • backend/Generator/main.py
  • backend/Generator/mcq.py
  • backend/server.py
  • requirements.txt


coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/Generator/main.py (1)

1106-1114: ⚠️ Potential issue | 🟠 Major

_prepare_qg_inputs_MC loads a fresh spaCy model on every call, defeating lazy loading.

Line 1113 calls en_core_web_sm.load() directly instead of using the ModelManager-cached spaCy model. This creates a new model instance on each call, causing memory bloat and negating the benefits of lazy loading.

Consider adding a lazy nlp property to QuestionGenerator (similar to MCQGenerator) and using it here.

♻️ Proposed fix

Add a lazy property to QuestionGenerator:

@property
def nlp(self):
    """Lazy load spacy model."""
    if not hasattr(self, '_nlp') or self._nlp is None:
        self._nlp = self.model_manager.load_model(
            'spacy_model', 
            'spacy', 
            model='en_core_web_sm'
        )
    return self._nlp

Then update _prepare_qg_inputs_MC:

     def _prepare_qg_inputs_MC(
         self, sentences: List[str]
     ) -> Tuple[List[str], List[str]]:
-        spacy_nlp = en_core_web_sm.load()
+        spacy_nlp = self.nlp
         docs = list(spacy_nlp.pipe(sentences, disable=["parser"]))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 1106 - 1114, The
_prepare_qg_inputs_MC method currently calls en_core_web_sm.load() each time
causing repeated spaCy model loads; add a lazy nlp property on QuestionGenerator
that caches the model in self._nlp using
model_manager.load_model('spacy_model','spacy', model='en_core_web_sm') (or
similar) and then replace the direct en_core_web_sm.load() call in
_prepare_qg_inputs_MC with self.nlp so the spaCy pipeline is reused across
calls.
🧹 Nitpick comments (4)
backend/server.py (1)

121-128: QA model pipeline initialization could benefit from explicit device specification.

The pipeline("question-answering") call at line 127 uses default device selection. For consistency with the rest of the codebase (which explicitly uses model_manager.get_device()), consider specifying the device explicitly to ensure GPU utilization when available.

♻️ Proposed enhancement
 def get_qa_model():
     """Lazy load QA model with thread safety."""
     global _qa_model
     if _qa_model is None:
         with _init_lock:
             if _qa_model is None:
-                _qa_model = pipeline("question-answering")
+                import torch
+                device = 0 if torch.cuda.is_available() else -1
+                _qa_model = pipeline("question-answering", device=device)
     return _qa_model
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/server.py` around lines 121 - 128, The get_qa_model() lazy
initializer currently calls pipeline("question-answering") without a device, so
it may not use the intended accelerator; update the call inside get_qa_model
(the block that sets _qa_model) to pass the explicit device obtained from
model_manager.get_device() (or its integer/torch device equivalent) so the QA
pipeline uses the same device as the rest of the codebase.
backend/Generator/mcq.py (3)

157-160: Use logger.exception for better diagnostics.

When logging inside an exception handler, logger.exception automatically includes the stack trace, which aids debugging without manually formatting the error.

♻️ Proposed fix
     except Exception as e:
         # Redact sensitive content from logs
         answer_hash = hashlib.sha256(answer.encode()).hexdigest()[:8]
-        logger.error(f"Failed to generate choices | answer_hash={answer_hash}. Error: {e}")
+        logger.exception(f"Failed to generate choices | answer_hash={answer_hash}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/mcq.py` around lines 157 - 160, Replace the logger.error
call inside the exception handler with logger.exception so the stack trace is
included automatically; keep the redaction logic (using answer and answer_hash)
unchanged and still log the hashed answer for context—update the except block
that currently computes answer_hash from answer and calls logger.error to
instead call logger.exception with the same message including answer_hash.

373-376: Consider using logger.exception in exception handlers.

Multiple exception handlers use logger.error(f"... {e}"). Using logger.exception() would automatically include the full stack trace, improving debuggability for production issues.

♻️ Proposed fix for representative locations
                     except Exception as e:
-                        logger.error(f"Error processing question for answer: {e}")
+                        logger.exception("Error processing question for answer")
                         continue
             except Exception as e:
-                logger.error(f"Unexpected error processing batch: {e}")
+                logger.exception("Unexpected error processing batch")
                 # Try to recover by cleaning memory and continuing
                 cleanup_memory(device)
                 continue

Also applies to: 402-406

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/mcq.py` around lines 373 - 376, Replace the plain error
logging in the exception handlers with logger.exception to capture stack traces:
locate the except blocks that currently call logger.error(f"Error processing
question for answer: {e}") (and the similar handler around the other occurrence
at 402-406) and change them to logger.exception("Error processing question for
answer") (or an appropriate short context message) so the full stack trace is
recorded; keep the except Exception as e structure but remove the manual
inclusion of e in the message since logger.exception will log the exception
details automatically.

24-29: MODEL_CACHE_ENABLED constant is defined but never used.

This constant suggests an intent to toggle model caching, but no code path references it. Either implement the toggle or remove the dead constant to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/mcq.py` around lines 24 - 29, The MODEL_CACHE_ENABLED
constant is defined but unused; either remove it to avoid dead code or implement
it as a guard around the module's model-caching logic—locate the code that
creates/stores cached model instances (the caching block that currently
unconditionally caches models) and wrap that behavior with if
MODEL_CACHE_ENABLED: (or check the flag before returning/adding to the cache),
ensuring the constant controls whether caching is enabled and default behavior
stays consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 135-158: unload_model and clear_all mutate CUDA state and call
gc.collect()/torch.cuda.empty_cache() without synchronization; wrap the critical
sections in self._lock to prevent concurrent mutations (use a with self._lock:
or explicit acquire/release) around model deletion/clearing, gc.collect(), and
torch.cuda.empty_cache() in both unload_model and clear_all so they cannot run
concurrently with load_model or cleanup_memory in mcq.py; ensure you reference
and use the same lock object (self._lock) used elsewhere in the class to
maintain consistent synchronization.

In `@backend/Generator/mcq.py`:
- Around line 32-42: The get_memory_usage() function currently returns 0 on any
exception from psutil.virtual_memory(), which falsely indicates no memory
pressure; change the exception handling in get_memory_usage to log the full
exception (using logger.exception or include the stack/exception) and return a
conservative high value (e.g., 100.0) instead of 0 so downstream adaptive logic
treats failures as high memory usage (alternatively re-raise if you prefer
fail-fast), keeping references to psutil.virtual_memory(), get_memory_usage, and
logger.warning/logger.exception for locating the change.

---

Outside diff comments:
In `@backend/Generator/main.py`:
- Around line 1106-1114: The _prepare_qg_inputs_MC method currently calls
en_core_web_sm.load() each time causing repeated spaCy model loads; add a lazy
nlp property on QuestionGenerator that caches the model in self._nlp using
model_manager.load_model('spacy_model','spacy', model='en_core_web_sm') (or
similar) and then replace the direct en_core_web_sm.load() call in
_prepare_qg_inputs_MC with self.nlp so the spaCy pipeline is reused across
calls.

---

Nitpick comments:
In `@backend/Generator/mcq.py`:
- Around line 157-160: Replace the logger.error call inside the exception
handler with logger.exception so the stack trace is included automatically; keep
the redaction logic (using answer and answer_hash) unchanged and still log the
hashed answer for context—update the except block that currently computes
answer_hash from answer and calls logger.error to instead call logger.exception
with the same message including answer_hash.
- Around line 373-376: Replace the plain error logging in the exception handlers
with logger.exception to capture stack traces: locate the except blocks that
currently call logger.error(f"Error processing question for answer: {e}") (and
the similar handler around the other occurrence at 402-406) and change them to
logger.exception("Error processing question for answer") (or an appropriate
short context message) so the full stack trace is recorded; keep the except
Exception as e structure but remove the manual inclusion of e in the message
since logger.exception will log the exception details automatically.
- Around line 24-29: The MODEL_CACHE_ENABLED constant is defined but unused;
either remove it to avoid dead code or implement it as a guard around the
module's model-caching logic—locate the code that creates/stores cached model
instances (the caching block that currently unconditionally caches models) and
wrap that behavior with if MODEL_CACHE_ENABLED: (or check the flag before
returning/adding to the cache), ensuring the constant controls whether caching
is enabled and default behavior stays consistent.

In `@backend/server.py`:
- Around line 121-128: The get_qa_model() lazy initializer currently calls
pipeline("question-answering") without a device, so it may not use the intended
accelerator; update the call inside get_qa_model (the block that sets _qa_model)
to pass the explicit device obtained from model_manager.get_device() (or its
integer/torch device equivalent) so the QA pipeline uses the same device as the
rest of the codebase.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a3cb1cf8-5acd-447b-9b6f-424aeda324cc

📥 Commits

Reviewing files that changed from the base of the PR and between 7d31b3e and 43d72f4.

📒 Files selected for processing (3)
  • backend/Generator/main.py
  • backend/Generator/mcq.py
  • backend/server.py


coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
backend/Generator/main.py (2)

760-762: ⚠️ Potential issue | 🟠 Major

Don't log raw question text from request payloads.

These error paths write full user-supplied questions into logs. That can leak document content or PII; log a hash/length or question index instead.

Also applies to: 801-803

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 760 - 762, Replace logging of raw
user questions in the exception handlers to avoid writing PII/content to logs:
in the except block around the prediction call (the try/except that currently
does logger.error(f"Error predicting answer for question '{ques}': {e}") and
then answers.append("")), and the similar handler later (the block at the other
occurrence), change the log to reference a non-sensitive identifier such as the
question index, the question length, or a hash (e.g., SHA256 of ques) instead of
the full ques, and keep the exception message; leave answers.append("") behavior
unchanged. Ensure you update both occurrences in main.py (the exception handling
around the prediction step and the duplicate at the later lines) so no raw
question text is ever logged.
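A sketch of the PII-safe logging pattern described above (the digest truncation and message format are illustrative choices, not the project's):

```python
import hashlib
import io
import logging

log_stream = io.StringIO()
logger = logging.getLogger("answers_demo")
logger.addHandler(logging.StreamHandler(log_stream))

def predict_answer(question):
    raise RuntimeError("inference failed")  # stand-in failure

questions = ["What is the capital of France?"]
answers = []
for idx, ques in enumerate(questions):
    try:
        answers.append(predict_answer(ques))
    except Exception as e:
        # Log the index and a hash instead of the raw question text,
        # so document content / PII never reaches the logs.
        digest = hashlib.sha256(ques.encode("utf-8")).hexdigest()[:12]
        logger.error(
            f"Error predicting answer for question #{idx} (sha256={digest}): {e}"
        )
        answers.append("")  # behavior unchanged on failure
```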

61-119: ⚠️ Potential issue | 🟠 Major

load_model() can still race with clear_all() and unload_model().

The final cache publish at Line 117 is outside the same coordination used by the clear/unload paths. An in-flight load can therefore complete after clear_all() returns and silently reinsert the model.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 61 - 119, The final cache insertion
in load_model can race with clear_all/unload_model; after loading under
model_lock, acquire the main self._lock before mutating self._models, re-check
whether the model was cleared/unloaded and only then set
self._models[model_name] (otherwise discard/cleanup the newly loaded model);
update load_model to perform the final self._models assignment inside a with
self._lock block and leave the earlier double-checks as-is to prevent
reinsertion after clear_all()/unload_model(), referencing load_model, clear_all,
unload_model, self._lock and self._model_locks for locating the change.
🧹 Nitpick comments (1)
backend/Generator/main.py (1)

305-308: Avoid forcing full cleanup on every successful MCQ/ShortQ request.

generate_multiple_choice_questions() and generate_normal_questions() already clean up inside backend/Generator/mcq.py, so these extra gc.collect() / torch.cuda.empty_cache() calls add hot-path latency and throw away allocator reuse immediately after generation.

Also applies to: 447-450

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/Generator/main.py` around lines 305 - 308, The post-generation forced
cleanup in backend/Generator/main.py (the torch.cuda.empty_cache() and
gc.collect() block) duplicates cleanup already performed inside
generate_multiple_choice_questions() and generate_normal_questions() in
backend/Generator/mcq.py and should be removed to avoid hot-path latency and
allocator churn; delete the conditional that checks self.device.type == 'cuda'
and the gc.collect() call (and replicate the same removal at the other
occurrence around the 447-450 region), or gate these calls behind a debug/config
flag so they only run when explicitly requested (keep cleanup only in mcq.py
functions and any error-handling paths).
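If the cleanup is gated rather than deleted, a small helper keeps the hot path cheap by default (the flag name is hypothetical, and the torch call is stubbed out so the sketch runs without torch installed):

```python
import gc
import os

# Hypothetical config flag; off by default so successful requests stay fast.
FORCE_CLEANUP = os.environ.get("MCQ_FORCE_CLEANUP", "0") == "1"

def maybe_cleanup(force=FORCE_CLEANUP, device_type="cpu"):
    """Run heavy allocator cleanup only when explicitly requested."""
    if not force:
        return False
    gc.collect()
    if device_type == "cuda":
        # Real code: torch.cuda.empty_cache()
        pass
    return True
```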
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/Generator/main.py`:
- Around line 939-940: The decorator for the nlp property is written as a string
(backticked) instead of a Python decorator; remove the backticks so `@property`
sits directly above def nlp(self): (i.e., replace the quoted/backticked line
with a real `@property` decorator for the nlp method) and ensure
indentation/spacing matches surrounding decorators so the module can import
cleanly.
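For reference, the corrected form is the bare decorator directly above the method (the body here is a placeholder for the real lazily-loaded pipeline):

```python
class QuestionGenerator:
    def __init__(self):
        self._nlp = None  # lazily created pipeline (placeholder)

    @property            # a real decorator, not a backticked string
    def nlp(self):
        if self._nlp is None:
            self._nlp = "spacy-pipeline"  # stand-in for spacy.load(...)
        return self._nlp
```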


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 24c83134-1b90-45b1-a05d-7af3b5395081

📥 Commits

Reviewing files that changed from the base of the PR and between 43d72f4 and 735f29a.

📒 Files selected for processing (3)
  • backend/Generator/main.py
  • backend/Generator/mcq.py
  • backend/server.py


Development

Successfully merging this pull request may close these issues.

Manage computational resources carefully

1 participant