733 vectordatabasequeryreactor output divider value for word doc chunks always 1 by Varaham · Pull Request #804 · SEMOSS/Semoss

Varaham · 2025-05-12T15:31:01Z

Description :

The VectorDatabaseQueryReactor enables users to perform similarity searches between a user input and a vector database.
The "output" JSON value in the "pixelReturn" Response JSON contains the "Divider" as the attribute which is meant to indicate the page number(s) the text chunk is on.

Current issue:

In the current output ,for .docx files, the "Divider" is equal to 1 for all text chunks irrespective of the page number at which the text exists / spanned in the document.

Workaround implementation :

Below are the steps in workaround implementation -

The docx contents are divided into individual paragraphs as separate chunks using XWPF API POI class.
When the page break is encountered, the page count is incremented and the Divider will have the page number at which the text chunk (paragraph) exists.

Exceptions :

The page break will be ignored in the below scenarios:

Whenever any image has been spanned in between the pages.
Whenever any tabular fields are spanned in between the pages
Any JSON values are spanned in between the pages.
In the above scenario, the DocReader will take the entire value as a single text chunk (paragraph) and will take the last page till the objects are spanned. The Divider will take all the page numbers with comma separated values and will be displayed in the JSON output.

Steps :

Click on 'Open Vector' and create a new Vector Catalog database with all the required fields.
In the Vector Catalog window, go to the 'Files' tab to upload a new docx file. (pic 1)
In the Files tab, click on 'Embed new Document' button (pic 2)
Select any docx file (pic 3)
Click on 'Embed' and ensure that it is embedded (pic 4)
Once the document is uploaded , listed (pic 5)
Go to the 'Q & A' tab. Type a question and click on ' Generate Answer'. See in the response JSON in the 'Networks tab'.

Response JSON.txt

…-value-for-word-doc-chunks-always-1

github-actions · 2025-05-12T15:31:10Z

@CodiumAI-Agent /describe

QodoAI-Agent · 2025-05-12T15:31:43Z

Title

733 vectordatabasequeryreactor output divider value for word doc chunks always 1

User description

Description

Changes Made

How to Test

Steps to reproduce/test the behavior
Expected outcomes

Notes

PR Type

Bug fix

Description

Fix incorrect divider value for Word doc chunks
Detect page breaks via CTBr and CTEmpty runs
Track last paragraph for accurate page logging
Remove deprecated paragraph.isPageBreak logic

Changes walkthrough 📝

Relevant files

Bug fix

DocProcessor.java `Fix page break detection logic` src/prerna/reactor/frame/gaas/processors/DocProcessor.java Added XWPFRun, CTBr, CTEmpty imports Implemented run-based page break detection Tracked last paragraph for logging Removed paragraph.isPageBreak handling	+34/-7

Need help?
Type /help how to ... in the comments thread for any questions about PR-Agent usage.
Check out the documentation for more information.

github-actions · 2025-05-12T15:31:56Z

@CodiumAI-Agent /review

ppatel9703 · 2025-05-12T15:32:34Z

Can we add in a proper description and testing steps.

QodoAI-Agent · 2025-05-12T15:32:41Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review Unused variable The variable `pageCount` is retrieved but never utilized anywhere in the method. Consider removing it to avoid confusion and keep the code clean. int pageCount = document.getProperties().getExtendedProperties().getUnderlyingProperties().getPages(); Debug print instead of logging There is a `System.out.println` call for debugging. Replace it with a proper logger call (e.g., `classLogger.debug`) and remove any temporary print statements. System.out.println(" pageCount -> " + pageCount);

github-actions · 2025-05-12T15:32:41Z

@CodiumAI-Agent /improve

ppatel9703 · 2025-05-12T15:33:11Z

src/prerna/reactor/frame/gaas/processors/DocProcessor.java

-		for (XWPFParagraph paragraph : document.getParagraphs()) {
+		XWPFParagraph lastParaOnPage = null;
+		int pageCount = document.getProperties().getExtendedProperties().getUnderlyingProperties().getPages();
+	    System.out.println(" pageCount -> " + pageCount);


Pls Remove the sysout

code committed after replacing sysout with loggers

QodoAI-Agent · 2025-05-12T15:34:01Z

PR Code Suggestions ✨

Latest suggestions up to d195d84

Category	Suggestion	Impact
General	Null-check paragraph runs Add a null check for `paragraph.getRuns()` to prevent `NullPointerException` when a paragraph has no runs. src/prerna/reactor/frame/gaas/processors/DocProcessor.java [124] -for (XWPFRun run : paragraph.getRuns()) { +List<XWPFRun> runs = paragraph.getRuns(); +if (runs != null) { + for (XWPFRun run : runs) { + // existing break detection logic + } +} Suggestion importance[1-10]: 2 __ Why: Adding a null check for `getRuns()` is unnecessary since Apache POI always returns a list, so this defensive change has minimal value.	Low

Previous suggestions

Suggestions up to commit 138af49

Category	Suggestion	Impact
Possible issue	Always count both break types Split the page‐break detection into two independent loops instead of an `if/else`, so both explicit and rendered page breaks are always counted. This avoids missing page breaks when both lists have entries. src/prerna/reactor/frame/gaas/processors/DocProcessor.java [125-139] -if (brList != null && !brList.isEmpty()) { +List<CTBr> brList = run.getCTR().getBrList(); +if (brList != null) { for (CTBr br : brList) { if (br.getType() == STBrType.PAGE) { pageNo++; } } -} else { - List<CTEmpty> lastRenderedPageBreakList = run.getCTR().getLastRenderedPageBreakList(); - if (lastRenderedPageBreakList != null) { - for (CTEmpty lastRenderedPageBreak : lastRenderedPageBreakList) { - pageNo++; - } - } +} +List<CTEmpty> lastRenderedPageBreakList = run.getCTR().getLastRenderedPageBreakList(); +if (lastRenderedPageBreakList != null) { + pageNo += lastRenderedPageBreakList.size(); } Suggestion importance[1-10]: 8 __ Why: The current `if/else` logic skips counting one break type when the other is present; separating the loops ensures both `CTBr` and `CTEmpty` page breaks are always accounted for.	Medium
	Fix undefined logger reference Use the existing logger instance (e.g., `LOGGER`) instead of the undefined `classLogger` and parameterize the message. This fixes the compile error and ensures proper logging. src/prerna/reactor/frame/gaas/processors/DocProcessor.java [145] -classLogger.info("Last paragraph on Page " + pageNo); +LOGGER.info("Last paragraph on page {}", pageNo); Suggestion importance[1-10]: 8 __ Why: Replacing the undefined `classLogger` with the existing `LOGGER` corrects the compile error and ensures the log message is emitted properly.	Medium
	Guard run iteration against null Add a null check around `paragraph.getRuns()` to prevent a potential NullPointerException when there are no runs. This guards the loop against invalid state. src/prerna/reactor/frame/gaas/processors/DocProcessor.java [124] -for (XWPFRun run : paragraph.getRuns()) { +List<XWPFRun> runs = paragraph.getRuns(); +if (runs != null) { + for (XWPFRun run : runs) { Suggestion importance[1-10]: 3 __ Why: Apache POI’s `paragraph.getRuns()` typically returns an empty list rather than `null`, so this null check is largely redundant and adds unnecessary complexity.	Low
General	Use logger instead of println Replace the stdout print with a logger call and remove or validate the unused `pageCount` variable. This ensures consistent logging and prevents stray `System.out` usage. src/prerna/reactor/frame/gaas/processors/DocProcessor.java [113] -System.out.println(" pageCount -> " + pageCount); +LOGGER.debug("Page count: {}", pageCount); Suggestion importance[1-10]: 6 __ Why: Replacing `System.out.println` with `LOGGER.debug` aligns with the existing logging approach and removes stray console output, improving consistency.	Low

Varaham · 2025-05-13T09:31:08Z

Console Output replaced with Loggers as per review comments

…-value-for-word-doc-chunks-always-1

github-actions · 2025-05-14T13:51:21Z

@CodiumAI-Agent /update_changelog

QodoAI-Agent · 2025-05-14T13:51:52Z

Changelog updates: 🔄

2025-05-14 *

Fixed

Correct page-based divider values in vector database query output for Word documents

Changed

Replace console logging with structured loggers in document processing

to commit the new content to the CHANGELOG.md file, please type:
'/update_changelog --pr_update_changelog.push_changelog_changes=true'

* fix(python): handle eval when it is a single line execution but there is string input with space (#756) * Update Dockerfile.tomcat (#757) * fix: tomcat builder setting env var * fix: updating tomcat to 9.0.104 * Update Dockerfile.ubuntu22.04 * Update Dockerfile.ubuntu22.04 * Update Dockerfile.ubuntu22.04 * feat: creating KubernetesModelScaler class (#763) * Update Dockerfile.ubuntu22.04 * feat: adding ability to attach a file to a vector db source (#736) * Added AttachSourceToVectorDbReactor for uploading pdf file to an existing csv file and modified VectorFileDownloadReactor * fix: proper return for the download and matching the reactor name * fix: error for downloading single file vs multiple; error for copyToDirectory instead of copyFile * chore: renaming so reactor matches VectorFileDownload --------- Co-authored-by: Maher Khalil <themaherkhalil@gmail.com> * Update Dockerfile.ubuntu22.04 * Update ubuntu2204.yml * Update ubuntu2204.yml * Update ubuntu2204_cuda.yml * Update Dockerfile.nvidia.cuda.12.5.1.ubuntu22.04 * Update ubuntu2204_cuda.yml * Update ubuntu2204.yml * feat: exposing tools calling through models (#764) * 587 unit test for prernadsutil (#654) * test(unit): unit tests for the prerna.util.ds package * test(unit): unit tests for the prerna.util.ds.flatfile package * test(unit): removed reflections, added paraquet tests * test(unit): unit tests for the prerna.util.ds package * test(unit): unit tests for the prerna.util.ds.flatfile package * test(unit): removed reflections, added paraquet tests * Update ubuntu2204.yml * Update ubuntu2204.yml * Update ubuntu2204.yml * fix: update pipeline docker buildx version * fix: ignore buildx * fix: adjusting pipeline for cuda * feat: switching dynamic sas to default false (#766) * fix: changes to account for version 2.0.0 of pyjarowinkler (#769) * chore: using 'Py' instead of 'py' to be consistent (#770) * feat: full ast parsing of code to return evaluation of the last expression (#771) * Python Deterministic Token Trimming for Message Truncation (#765) * feat: deterministic-token-trimming * feat: modifying logic such that system prompt is second to last message for truncation --------- Co-authored-by: Maher Khalil <themaherkhalil@gmail.com> * fix: added date added column to enginepermission table (#768) * fix: add docker-in-docker container to run on sef-hosted runner (#773) Co-authored-by: Raul Esquivel <resmas.work@gmail.com> * fix: properly passing in the parameters from kwargs/smss into model limits calculation (#774) * fix: removing legacy param from arguments (#777) * fix: Fix docker cache build issue (#778) * adding no cache * adding no cache * feat: Adding Semantic Text Splitting & Token Text Splitting (#720) * [696] - build - Add chonky semantic text splitting - Added the function for chonky semantic text splitting and integrated with existing flow. * [696] - build - Add chonky semantic text splitting - Updated the code * [696] - build - Add chonky semantic text splitting - Updated the code comments * feat: adding reactor support through java * feat: updating pyproject.toml with chonky package * feat: check for default chunking method in smss * [696] - feat - Add chonku semantic text splitting - Resolved the conflicts * [696] - feat - Add chonky semantic text splitting - Organized the code. * feat: adding chunking by tokens and setting as default * updating comments on chunking strategies --------- Co-authored-by: Weiler, Ryan <ryanweiler92@gmail.com> Co-authored-by: kunal0137 <kunal0137@gmail.com> * feat: allowing for tools message in full prompt (#780) * UPDATE ::: Add docker in docker Dockerfiler (#784) * add docker in docker Dockerfile * Update Dockerfile.dind Remove python and tomcat arguments from Dockerfile * fix: remove-paddle-ocr (#786) * [#595] test(unit): adds unit test for prerna.engine.impl.model.kserve Co-authored-by: Ryan Weiler <ryanweiler92@gmail.com> * feat: Tag semoss image (#789) * adding changes for non-release docker build * adding non-release build logic to cuda-semoss builder * updating push branches * fix: branch names on docker builds * fix: branch names on docker builds cuda * fix: adding push condition - change to pyproject toml file; adding event input vars to env vars (#790) * fix: python builder toml file change (#792) * fix: Catch errors when calling pixels from Python (#787) Co-authored-by: Weiler, Ryan <ryanweiler92@gmail.com> * Creating db links between engines and default apps (#693) * create db links between engine and default app * Rename column APPID to TOOL_APP * feat: add database_tool_app to getUserEngineList --------- Co-authored-by: Weiler, Ryan <ryanweiler92@gmail.com> * Adding sort options to the myengines reactor (#479) * added sort feature to MyEnginesReactor and genericized reactor imports * formatting * overloading method * validate sortList --------- Co-authored-by: Ryan Weiler <ryanweiler92@gmail.com> * feat: cleaning up unused imports in MyEngine reactor (#793) * feat: Create Enum projectTemplate and update CreateAppFromTemplateReactor to accept existing appID for cloning applications (#621) Co-authored-by: kunal0137 <kunal0137@gmail.com> * Update GetEngineUsageReactor.java (#417) Co-authored-by: Maher Khalil <themaherkhalil@gmail.com> Co-authored-by: Ryan Weiler <ryanweiler92@gmail.com> * Issue 596: Adds Unit Tests for prerna/engine/impl/model/responses and workers (#727) * [#596] test(unit): adds unit tests * fix: implements ai-agents suggestions --------- Co-authored-by: Jeff Vitunac <jvitunac@gmail.com> Co-authored-by: Ryan Weiler <ryanweiler92@gmail.com> * 609 implement native blob storage for azure gcp and aws (#674) * Initial commit : implementation for azure blob storage * added dependency for azure in pom.xml * update logic to fetch the metadata from list details * changed functionality from listing containers to listing files within a selected container * initial commit for google cloud storage implementation * added field contant in enum class and removed unused method * add methods to parse comma-separated local and cloud paths * add methods to parse comma-separated local and cloud paths * implementation for aws s3 bucket * normalize container prefix path * merged all: implementation for azure, aws and gcp * refactor(storage): replace manual path normalization with normalizePath from Utility class --------- Co-authored-by: pvijayaraghavareddy <pvijayaraghavareddy@WORKSPA-6QV71G7.us.deloitte.com> Co-authored-by: Parth <parthpatel3@deloitte.com> Co-authored-by: Ryan Weiler <ryanweiler92@gmail.com> * Get Node Pool Information for Remote Models (#806) * 590 unit test for prernaengineimpl (#808) * test(unit): update to filesystems hijacking for testing files * test: start of unit tests for abstract database engine * test(unit): added unit test for prerna.engine.impl * test(unit): finsihed tests for prerna.engine.impl * test(unit): adding back unused assignment --------- Co-authored-by: Ryan Weiler <ryanweiler92@gmail.com> * Creating WordCountTokenizer Class (#802) * feat: creating word count tokenizer class && falling back to word count tokenizer if tiktok fails * feat: updating comment * feat: setting default chunking method as recursive (#810) * Unit tests fixes and Unit test Class file location updates (#812) * test(unit): moved tests to correct packages * test(unit): fixed a couple of unit tests * VectorDatabaseQueryReactor: output divider value for word doc chunks always 1 (#804) * Code implementation for #733 * feat: Added code to resolve Divider page issue * Console output replaced by LOGGERs as per review comments * feat: replaced Console with Loggers --------- Co-authored-by: Varaham <katchabi50@gmail.com> Co-authored-by: Ryan Weiler <ryanweiler92@gmail.com> * GetCurrentUserReactor (#818) Adding GetCurrentUserReactor to return user info including if user is an admin. * Python User Class (#819) * fix: trimming properties read from smss; fix: logging commands before executing (#821) * Updating getNodePoolsInfo() to parse and return zk info and models active actual (#822) * feat: update get node pool information for zk info and models active actual * feat: get remote model configs * Add unit tests for package prerna\engine\impl\vector (#728) * Create ChromaVectorDatabaseEngineUnitTests.java * completed tests for ChromaVectorDatabaseEngine class * [#604] test(unit): Created ChromaVectorDatabaseEngine unit tests * [604] tests(unit) : Completed test cases for ChromaVectorDatabaseEngine; update File operations to nio operations in ChromaVectorDatabaseEngine.java * [#604] tests(unit): added unit tests for all vector database engines and util classes in the prerna\engine\impl\vector package * [604] test(unit): replaced creating file paths with string literals with java.nio Paths.resolve/Paths.get methods --------- Co-authored-by: Maher Khalil <themaherkhalil@gmail.com> Co-authored-by: Ryan Weiler <ryanweiler92@gmail.com> * feat: adding to the return of getenginemetadata (#813) * feat: adding to the return of getenginemetadata * fix: removing throws --------- Co-authored-by: Arash Afghahi <48933336+AAfghahi@users.noreply.github.com> Co-authored-by: Ryan Weiler <ryanweiler92@gmail.com> * 718 create a single reactor to search both engines and apps (#794) * feat(engineProject): Initial commit * chore: 718 create a single reactor to search both engines and apps * chore: 718 create a single reactor to search both engines and apps --------- Co-authored-by: Ryan Weiler <ryanweiler92@gmail.com> Co-authored-by: Vijayaraghavareddy <pvijayaraghavareddy@deloitte.com> * feat: update openai wrapper to handle multiple images (#832) * feat: adding user room map (#840) * feat: hiding side menu bar for non admins (#833) * Side menu changes * Review Comments fixed * Flag is renamed in Constants.java * Review Comment fixed in Utility.java * fix: cleaning up defaults and comments --------- Co-authored-by: kunal0137 <kunal0137@gmail.com> --------- Co-authored-by: Maher Khalil <themaherkhalil@gmail.com> Co-authored-by: kunal0137 <kunal0137@gmail.com> Co-authored-by: Ryan Weiler <ryanweiler92@gmail.com> Co-authored-by: ManjariYadav2310 <manjayadav@deloitte.com> Co-authored-by: dpartika <dpartika@deloitte.com> Co-authored-by: Raul Esquivel <resmas.work@gmail.com> Co-authored-by: Pasupathi Muniyappan <pasupathi.muniyappan@kanini.com> Co-authored-by: resmas-tx <131498457+resmas-tx@users.noreply.github.com> Co-authored-by: AndrewRodddd <62724891+AndrewRodddd@users.noreply.github.com> Co-authored-by: radkalyan <107957324+radkalyan@users.noreply.github.com> Co-authored-by: samarthKharote <samarth.kharote@kanini.com> Co-authored-by: Shubham Mahure <shubham.mahure@kanini.com> Co-authored-by: rithvik-doshi <81876806+rithvik-doshi@users.noreply.github.com> Co-authored-by: Mogillapalli Manoj kumar <86736340+Khumar23@users.noreply.github.com> Co-authored-by: Jeff Vitunac <jvitunac@gmail.com> Co-authored-by: pvijayaraghavareddy <pvijayaraghavareddy@WORKSPA-6QV71G7.us.deloitte.com> Co-authored-by: Parth <parthpatel3@deloitte.com> Co-authored-by: KT Space <119169984+Varaham@users.noreply.github.com> Co-authored-by: Varaham <katchabi50@gmail.com> Co-authored-by: ericgonzal8 <ericgonzalez8@deloitte.com> Co-authored-by: Arash Afghahi <48933336+AAfghahi@users.noreply.github.com> Co-authored-by: Vijayaraghavareddy <pvijayaraghavareddy@deloitte.com> Co-authored-by: ammb-123 <ammb@deloitte.com>

katchabi50 added 3 commits May 9, 2025 16:36

Code implementation for #733

f0ef20a

feat: Added code to resolve Divider page issue

c757cf2

Merge branch 'dev' into 733-vectordatabasequeryreactor-output-divider…

138af49

…-value-for-word-doc-chunks-always-1

Varaham requested a review from ppatel9703 May 12, 2025 15:31

Varaham self-assigned this May 12, 2025

Varaham linked an issue May 12, 2025 that may be closed by this pull request

[BE] VectorDatabaseQueryReactor output: "Divider" value for word doc chunks always "1" #733

Closed

ppatel9703 reviewed May 12, 2025

View reviewed changes

katchabi50 added 2 commits May 13, 2025 14:53

Console output replaced by LOGGERs as per review comments

ddd4279

feat: replaced Console with Loggers

d195d84

ppatel9703 approved these changes May 14, 2025

View reviewed changes

Merge branch 'dev' into 733-vectordatabasequeryreactor-output-divider…

bf4821b

…-value-for-word-doc-chunks-always-1

ryanweiler92 merged commit ffa6e1d into dev May 14, 2025
3 checks passed

ryanweiler92 deleted the 733-vectordatabasequeryreactor-output-divider-value-for-word-doc-chunks-always-1 branch May 14, 2025 13:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

733 vectordatabasequeryreactor output divider value for word doc chunks always 1#804

733 vectordatabasequeryreactor output divider value for word doc chunks always 1#804
ryanweiler92 merged 6 commits intodevfrom
733-vectordatabasequeryreactor-output-divider-value-for-word-doc-chunks-always-1

Varaham commented May 12, 2025 •

edited

Loading

Uh oh!

github-actions bot commented May 12, 2025

Uh oh!

QodoAI-Agent commented May 12, 2025

Uh oh!

github-actions bot commented May 12, 2025

Uh oh!

ppatel9703 commented May 12, 2025

Uh oh!

QodoAI-Agent commented May 12, 2025

Uh oh!

github-actions bot commented May 12, 2025

Uh oh!

ppatel9703 May 12, 2025

Uh oh!

Varaham May 13, 2025

Uh oh!

QodoAI-Agent commented May 12, 2025 •

edited

Loading

Uh oh!

Varaham commented May 13, 2025

Uh oh!

Uh oh!

github-actions bot commented May 14, 2025

Uh oh!

QodoAI-Agent commented May 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

Varaham commented May 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description :

Current issue:

Workaround implementation :

Exceptions :

Steps :

Uh oh!

github-actions bot commented May 12, 2025

Uh oh!

QodoAI-Agent commented May 12, 2025

Title

User description

Description

Changes Made

How to Test

Notes

PR Type

Description

Changes walkthrough 📝

Uh oh!

github-actions bot commented May 12, 2025

Uh oh!

ppatel9703 commented May 12, 2025

Uh oh!

QodoAI-Agent commented May 12, 2025

PR Reviewer Guide 🔍

Uh oh!

github-actions bot commented May 12, 2025

Uh oh!

ppatel9703 May 12, 2025

Choose a reason for hiding this comment

Uh oh!

Varaham May 13, 2025

Choose a reason for hiding this comment

Uh oh!

QodoAI-Agent commented May 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Previous suggestions

Uh oh!

Varaham commented May 13, 2025

Uh oh!

Uh oh!

github-actions bot commented May 14, 2025

Uh oh!

QodoAI-Agent commented May 14, 2025

2025-05-14 *

Fixed

Changed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Varaham commented May 12, 2025 •

edited

Loading

QodoAI-Agent commented May 12, 2025 •

edited

Loading