✨ Feature: add content extractor #20

catturtle123 · 2025-08-22T12:28:26Z

📌 Overview

add content extractor

🔍 Related Issues

Closes ✨ [Feature] add content extractor #19

✨ Changes

add content extractor
♻️ Refactor: enforce val immutability
♻️ Refactor: nest DTOs inside their parent
🐛 Fix: fix DTO error
✨ Feature: add ContentExtractor
♻️ Refactor: rename loader to extractor
✅ Test: index test add

✅ Checklist

I have followed the code style guidelines.
I have removed unnecessary comments or console logs.
I have written/updated related tests.
I have verified that all features work correctly in my local environment.

Summary by CodeRabbit

New Features
- Uploads now extract and store text content from PDFs, DOCX, and common text formats (TXT/CSV/MD/JSON/HTML).
- Indexes are created directly from stored content.
- Added user-facing error for unsupported file types.
Improvements
- Response IDs are now consistently non-null.
- Faster, batched tag assignment for uploaded files.
Bug Fixes
- More reliable data file listing/grouping.
Chores
- Removed cloud storage integration and related settings.
- Updated dependencies and test configuration.

coderabbitai · 2025-08-22T12:28:33Z

Walkthrough

Replaces S3-based file handling with in-app content extraction and storage. Adds PDF/DOCX/TXT extractors and a resolver. Shifts entities’ ids to non-null, changes DataFile to store content (LOB) instead of URL, updates services/DTOs accordingly, removes S3 and URL loader utilities/config, adjusts tests and dependencies.

Changes

Cohort / File(s)	Summary
Build and dependencies `build.gradle`	Replace Spring AI pgvector starter with `com.pgvector:pgvector`; remove AWS S3 BOM/S3 deps; add `pdfbox` and `poi-ooxml`.
DTO adjustments `.../document/dto/DataFileRequestDTO.kt`, `.../document/dto/DataFileResponseDTO.kt`, `.../index/dto/IndexResponseDTO.kt`	Nest DataFileCreateItem inside bulk request; make multiple ids non-null and DTOs immutable where applicable; remove requireNotNull calls.
Entity refactors (IDs/content) `.../document/entity/DataFile.kt`, `.../document/entity/DataFileTag.kt`, `.../document/entity/Tag.kt`, `.../index/entity/Index.kt`, `.../index/entity/DataFileIndex.kt`, `.../index/entity/ChunkEmbedding.kt`, `.../prompt/entity/FewShot.kt`, `.../prompt/entity/Prompt.kt`	Convert id fields from nullable to non-null val in bodies; DataFile replaces fileUrl with LOB content; loosen join column nullability in DataFileTag; add BaseEntity inheritance where added.
Service logic changes `.../document/service/DataFileService.kt`, `.../index/service/IndexService.kt`	Replace S3/upload + remote load with ContentExtractorResolver and in-DB content; remove rollback cleanup; adjust index creation to read DataFile.content; constructor dependency changes.
Extractor feature `.../global/util/extractor/ContentExtractor.kt`, `ContentExtractorResolver.kt`, `PdfContentExtractor.kt`, `DocxContentExtractor.kt`, `TxtContentExtractor.kt`	Add extractor interface, resolver, and implementations for PDF, DOCX, and TXT-like types.
Removed S3 and URL loader `.../global/config/S3Config.kt`, `.../global/util/s3/S3Util.kt`, `S3UtilImpl.kt`, `S3Type.kt`, `.../global/storage/FakeS3Util.kt`, `.../global/util/loader/ContentLoader.kt`, `HttpContentLoader.kt`	Remove S3 config, utilities, and HTTP content loader; delete test fake S3.
Utilities `.../global/util/converter/FileConvertUtil.kt`, `.../global/error/ErrorCode.kt`	Add streaming metrics computation; restrict content-type mapping to supported types and throw INVALID_FILE_TYPE; add new error code.
Embedding/test support `.../index/embed/FakeEmbder.kt`, `.../index/entity/enums/EmbeddingModel.kt`	Add test-profile fake embedder; add FAKE enum value.
Configuration `src/main/resources/application-local.yml`, `src/test/resources/application-test.yml`	Switch ddl-auto to create-drop; remove AWS config; add AI embedding model test config.
Tests `.../document/service/DataFileServiceTest.kt`, `.../index/service/IndexServiceTest.kt`	Update tests for content-based flow, nested DTO, error message; persist DataFile for indexing; clean up embeddings before indices; adjust API usage.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant DataFileService
  participant ExtractorResolver
  participant Extractor
  participant DataFileRepo
  participant TagRepo
  participant DataFileTagRepo

  Client->>DataFileService: upload(files, items)
  loop for each file
    DataFileService->>ExtractorResolver: extractContent(file, type)
    ExtractorResolver->>Extractor: supports(type)?
    alt supported
      ExtractorResolver->>Extractor: extract(file)
      Extractor-->>ExtractorResolver: content (String)
    else unsupported
      ExtractorResolver-->>DataFileService: throw INVALID_FILE_TYPE
    end
    DataFileService->>DataFileRepo: save(DataFile.with(content))
  end
  DataFileService->>TagRepo: find/create tags
  DataFileService->>DataFileTagRepo: saveAll(mappings)
  DataFileService-->>Client: upload response
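
The resolve-then-extract step in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: only `supports`, `extract`, `extractContent`, and `INVALID_FILE_TYPE` appear in the walkthrough, while `PlainTextExtractor`, `UnsupportedFileTypeException`, and the `ByteArray` input (standing in for a `MultipartFile`) are assumptions made to keep the example self-contained.

```kotlin
// Hypothetical shapes: the method names come from the walkthrough,
// everything else is assumed for illustration.
interface ContentExtractor {
    fun supports(contentType: String): Boolean
    fun extract(bytes: ByteArray): String
}

// Stand-in for CustomException(ErrorCode.INVALID_FILE_TYPE).
class UnsupportedFileTypeException(contentType: String) :
    RuntimeException("INVALID_FILE_TYPE: $contentType")

// Stand-in for a TXT-like extractor: decodes the bytes as UTF-8.
class PlainTextExtractor : ContentExtractor {
    private val types = setOf("text/plain", "text/markdown", "text/csv")
    override fun supports(contentType: String) = contentType.lowercase() in types
    override fun extract(bytes: ByteArray) = bytes.decodeToString()
}

class ContentExtractorResolver(private val extractors: List<ContentExtractor>) {
    // Picks the first extractor that claims the type; fails fast otherwise.
    fun extractContent(bytes: ByteArray, contentType: String): String {
        val extractor = extractors.firstOrNull { it.supports(contentType) }
            ?: throw UnsupportedFileTypeException(contentType)
        return extractor.extract(bytes)
    }
}
```

Keeping the resolver as a plain list lookup means a new format only needs one more `ContentExtractor` implementation registered in the list.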

sequenceDiagram
  participant Client
  participant IndexService
  participant DataFileRepo
  participant Embedder
  participant IndexRepo
  participant ChunkEmbeddingRepo

  Client->>IndexService: createIndex(request{dataFileIds,...})
  IndexService->>DataFileRepo: findAllById(dataFileIds)
  loop for each DataFile
    IndexService->>IndexService: chunk(DataFile.content)
    IndexService->>Embedder: embed(chunk)
    Embedder-->>IndexService: vector
    IndexService->>ChunkEmbeddingRepo: save(embedding)
  end
  IndexService->>IndexRepo: save(Index)
  IndexService-->>Client: IndexDetailResponse
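
The `chunk(DataFile.content)` step in the diagram above could look something like the sketch below. The PR's actual chunking code is not shown in this review, so `chunkWithOverlap` and its character-based windowing are assumptions; the point is only to show how a chunk size and an overlap size interact.

```kotlin
// Hypothetical helper: splits text into fixed-size character windows
// where each window repeats the last `overlap` characters of the previous one.
fun chunkWithOverlap(text: String, chunkSize: Int, overlap: Int): List<String> {
    require(chunkSize > 0) { "chunkSize must be > 0" }
    require(overlap in 0 until chunkSize) { "overlap must be in [0, chunkSize)" }

    val step = chunkSize - overlap // at least 1, so the loop always advances
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        val end = minOf(start + chunkSize, text.length)
        chunks += text.substring(start, end)
        if (end == text.length) break // last window reached the end of the text
        start += step
    }
    return chunks
}
```

With `chunkWithOverlap("abcdefghij", 4, 1)` the windows are `abcd`, `defg`, `ghij`. If `overlap` were allowed to equal `chunkSize`, `step` would be 0 and the loop would never advance, which is why the overlap is required to be strictly smaller.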

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Assessment against linked issues

Objective	Addressed	Explanation
Add DOCX extractor (#19)	✅
Add PDF extractor (#19)	✅
Add TXT extractor (#19)	✅

Assessment against linked issues: Out-of-scope changes

Code Change	Explanation
Replace S3 storage with LOB content in DataFile and remove all S3 utilities/config (src/main/kotlin/.../document/entity/DataFile.kt; .../global/util/s3/*; .../global/config/S3Config.kt)	Not required by issue #19; it specifies adding extractors, not changing storage architecture.
Make ids non-null and move from constructors across multiple entities (e.g., Index, Tag, Prompt, FewShot, ChunkEmbedding)	Structural JPA changes unrelated to adding content extractors.
Modify IndexService to read DataFile.content and remove ContentLoader (src/main/kotlin/.../index/service/IndexService.kt; delete .../global/util/loader/*)	Switching index input source is beyond extractor addition scope.
DTO API changes: nesting DataFileCreateItem and making ids non-null (document/index DTOs)	Public API reshaping not specified in the extractor feature.
Dependency switch to `com.pgvector:pgvector` and removal of AWS SDK (build.gradle)	Dependency stack changes are unrelated to extractor addition.

Possibly related PRs

✨ Feature: add file upload feature #2 — Refactors file-upload domain and S3 utilities; closely related to this PR’s transition from S3 to content extraction.
✨ Feature: add index metadata CRUD feature #11 — Alters index domain services/DTOs; overlaps with this PR’s IndexService and DTO adjustments.
✨ Feature: indexing #18 — Modifies embedding/vector-store stack; related to dependency change and embedding additions here.

Poem

In pages, bytes, and whispering text I hop,
From DOCX lanes to PDF mountaintop.
I nibble TXT—so light, so neat—
Extracting crumbs of meaning sweet.
No clouds today; I burrow deep—
Content within is mine to keep. 🐇✨

coderabbitai

Actionable comments posted: 40

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)

src/main/kotlin/simplerag/ragback/global/util/converter/FileConvertUtil.kt (2)
34-51: Streaming hash implementation looks good; minor reuse and clarity

Reuse the existing sha256Hex helper to avoid duplicate hex logic.
-    val sha256 = digest.digest().joinToString("") { "%02x".format(it) }
-    return FileMetrics(sha256, totalBytes)
+    return FileMetrics(sha256Hex(digest.digest()), totalBytes)
Efficient streaming with DigestInputStream and an 8KB buffer is appropriate.
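
For readers without the file open, the streaming approach described above can be sketched as follows. This is an approximation under stated assumptions: `computeMetrics` is a hypothetical name, and `FileMetrics` and `sha256Hex` are re-declared here only so the example is self-contained; their real definitions live in `FileConvertUtil.kt` and may differ.

```kotlin
import java.io.ByteArrayInputStream
import java.io.InputStream
import java.security.DigestInputStream
import java.security.MessageDigest

// Assumed shape, mirroring the FileMetrics(sha256, totalBytes) call in the diff.
data class FileMetrics(val sha256: String, val sizeBytes: Long)

// Assumed helper: lowercase hex encoding of a digest.
fun sha256Hex(bytes: ByteArray): String = bytes.joinToString("") { "%02x".format(it) }

// Reads the stream once in 8 KB blocks, updating the digest and byte count
// as it goes, so the whole file never has to sit in memory.
fun computeMetrics(input: InputStream): FileMetrics {
    val digest = MessageDigest.getInstance("SHA-256")
    var totalBytes = 0L
    DigestInputStream(input, digest).use { stream ->
        val buffer = ByteArray(8 * 1024)
        while (true) {
            val read = stream.read(buffer)
            if (read == -1) break
            totalBytes += read
        }
    }
    return FileMetrics(sha256Hex(digest.digest()), totalBytes)
}

fun main() {
    val metrics = computeMetrics(ByteArrayInputStream("abc".toByteArray()))
    println("${metrics.sizeBytes} bytes, sha256=${metrics.sha256}")
}
```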

20-31: Enforce extension-first allowlist to match available extractors

I verified that the codebase only provides TxtContentExtractor, PdfContentExtractor, and DocxContentExtractor—there are no extractors or tests for CSV, Markdown, or JSON formats. Therefore the FileConvertUtil.resolveContentType() implementation must be restricted to exactly those three types, and must not trust a spoofed contentType.

Please apply the following mandatory refactor in src/main/kotlin/simplerag/ragback/global/util/converter/FileConvertUtil.kt:
-private val supportedByExt = mapOf(
-    "pdf"  to "application/pdf",
-    "txt"  to "text/plain",
-    "csv"  to "text/csv",
-    "md"   to "text/markdown",
-    "json" to "application/json",
-    "docx" to "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
-)
+private val supportedByExt = mapOf(
+    "pdf"  to "application/pdf",
+    "txt"  to "text/plain",
+    "docx" to "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+)
 private val supportedContentTypes = supportedByExt.values.toSet()
 
 fun MultipartFile.resolveContentType(): String {
     // 1) Derive from filename extension only if supported
     val ext = originalFilename
         ?.substringAfterLast('.', "")
         ?.lowercase()
     supportedByExt[ext]?.let { return it }
 
     // 2) Fall back to client-provided contentType only if it’s in the allowlist
     val ct = contentType?.lowercase()
     if (ct != null && ct in supportedContentTypes) {
         return ct
     }
 
     throw CustomException(ErrorCode.INVALID_FILE_TYPE)
 }
– If CSV, MD or JSON support is truly required, please add corresponding ContentExtractor implementations and unit tests before re-adding them here.
src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (2)
14-18: Enforce embedding size invariants close to the data — prevent runtime index errors early

You’re storing FloatArray in a vector(1536) column while also persisting embeddingDim. If embedding.size doesn’t match embeddingDim (or 1536), you’ll get DB errors or silent data skew.

If you plan to support multiple models with different dimensions, avoid hardcoding vector(1536) at the column level (use vector and enforce dimension elsewhere), or segregate by table per dimension.

Add an entity-level check to fail fast.

Apply this guard:
 class ChunkEmbedding(
@@
     var embedding: FloatArray,
@@
     val embeddingDim: Int,
@@
     val index: Index,
 ) : BaseEntity() {
+    init {
+        require(embedding.isNotEmpty()) { "embedding must not be empty" }
+        require(embedding.size == embeddingDim) {
+            "embedding size (${embedding.size}) must equal embeddingDim ($embeddingDim)"
+        }
+        // If the column is vector(1536), also enforce it here to avoid DB-time failures.
+        // require(embedding.size == 1536) { "embedding must be 1536-dim to fit vector(1536) column" }
+    }
14-16: Ensure proper Hibernate mapping for the vector(1536) column

It looks like the project currently depends on the JDBC library com.pgvector:pgvector:0.1.6, but I did not find any Hibernate vector module, JPA AttributeConverter, or @JdbcTypeCode/@Array annotations in the Kotlin entity classes. Without one of these, Hibernate will treat FloatArray as a standard SQL array (float4[]) and fail to read/write the Postgres vector(1536) type. (mvnrepository.com)

Please address this in one of the following ways:
Use Hibernate 6.4+ built-in vector support
Add the Maven/Gradle dependency
org.hibernate.orm:hibernate-vector:6.4.x
In src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (lines 14–16), update the embedding property:
import org.hibernate.annotations.Array
import org.hibernate.annotations.JdbcTypeCode
import org.hibernate.type.SqlTypes

@Column(name = "embedding", columnDefinition = "vector(1536)", nullable = false)
@JdbcTypeCode(SqlTypes.VECTOR)
@Array(length = 1536)
var embedding: FloatArray
See the official pgvector-Java README for Hibernate usage. (github.com)
Or use the PGvector Java type directly

Change the property to
var embedding: PGvector

Implement or add an AttributeConverter<PGvector, PGvector> (or use a community pgvector-Hibernate integration) to marshal between PGvector and the database.
File to update:

src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (lines 14–16)
src/main/kotlin/simplerag/ragback/domain/prompt/entity/FewShot.kt (1)
13-19: Nit: Consider columnDefinition for large text fields for portability/clarity

You’re using @Lob on answer and evidence. Depending on the dialect, specifying columnDefinition = "text" (for PostgreSQL) can reduce ambiguity in schema generation. Non-blocking.

Example:
-    @Column(name = "answer", nullable = false)
+    @Column(name = "answer", nullable = false, columnDefinition = "text")
@@
-    @Column(name = "evidence", nullable = false)
+    @Column(name = "evidence", nullable = false, columnDefinition = "text")
src/main/kotlin/simplerag/ragback/domain/index/entity/Index.kt (2)
49-59: Guard domain invariants early (chunking vs overlap, trimming)

Ensure overlapSize is less than chunkingSize to avoid infinite or degenerate chunking.

You already trim snapshotName; good.

Apply:
         fun toIndex(createRequest: IndexCreateRequest): Index {
+            require(createRequest.chunkingSize > 0) { "chunkingSize must be > 0" }
+            require(createRequest.overlapSize >= 0) { "overlapSize must be >= 0" }
+            require(createRequest.overlapSize < createRequest.chunkingSize) {
+                "overlapSize (${createRequest.overlapSize}) must be less than chunkingSize (${createRequest.chunkingSize})"
+            }
             return Index(
                 snapshotName = createRequest.snapshotName.trim(),
                 overlapSize = createRequest.overlapSize,
                 chunkingSize = createRequest.chunkingSize,
62-69: Also validate on updates to avoid drifting into invalid state

Mirror the same requirements in update() to keep invariants consistent.

Apply:
     fun update(req: IndexUpdateRequest) {
-        snapshotName = req.snapshotName.trim()
-        chunkingSize = req.chunkingSize
-        overlapSize = req.overlapSize
+        require(req.chunkingSize > 0) { "chunkingSize must be > 0" }
+        require(req.overlapSize >= 0) { "overlapSize must be >= 0" }
+        require(req.overlapSize < req.chunkingSize) {
+            "overlapSize (${req.overlapSize}) must be less than chunkingSize (${req.chunkingSize})"
+        }
+        snapshotName = req.snapshotName.trim()
+        chunkingSize = req.chunkingSize
+        overlapSize = req.overlapSize
         similarityMetric = req.similarityMetric
         topK = req.topK
         reranker = req.reranker
     }
src/main/kotlin/simplerag/ragback/domain/document/dto/DataFileResponseDTO.kt (1)
33-41: Fix non-nullable id lookup in tag mapping

The DataFile.id property is declared as a non-nullable Long (val id: Long = 0), so the safe call (?.let { … }) is unnecessary; Kotlin flags it with an "unnecessary safe call" warning rather than failing to compile. Update the tag lookup to use the non-nullable id directly:

• File: src/main/kotlin/simplerag/ragback/domain/document/dto/DataFileResponseDTO.kt (within the from function)
-    val tags = file.id?.let { tagsByFileId[it] } ?: emptyList()
+    val tags = tagsByFileId[file.id] ?: emptyList()
This change simplifies the code and aligns with the non-nullable declaration of DataFile.id.
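
A small, self-contained sketch of the lookup pattern discussed above. The `DataFile` and `DataFileTag` classes here are simplified, hypothetical stand-ins rather than the PR's entities, and `tagsFor` is an illustrative name.

```kotlin
// Hypothetical, simplified stand-ins for the PR's entities.
data class DataFile(val id: Long, val title: String)
data class DataFileTag(val dataFileId: Long, val tagName: String)

// Groups tag names by file id once, then looks each file up by its
// non-nullable id; files without tags get an empty list.
fun tagsFor(files: List<DataFile>, links: List<DataFileTag>): Map<String, List<String>> {
    val tagsByFileId: Map<Long, List<String>> =
        links.groupBy({ it.dataFileId }, { it.tagName })
    return files.associate { file -> file.title to tagsByFileId[file.id].orEmpty() }
}
```

`orEmpty()` is equivalent to `?: emptyList()` here; either form works once the id itself is non-nullable.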
src/test/kotlin/simplerag/ragback/domain/document/service/DataFileServiceTest.kt (2)
58-89: Add happy-path extractor coverage for PDF/DOCX.

You’re validating TXT extraction; consider adding similar tests that upload small PDF and DOCX files and assert non-blank content and correct tag normalization. I can draft minimal in-memory fixtures if helpful.

174-187: Fix PDF MIME type in tests

The text/pdf value isn’t a valid PDF MIME type—it should be application/pdf. Updating this in your fixtures and assertions will keep the tests accurate and reduce confusion for future readers.

Affected locations:

src/test/kotlin/simplerag/ragback/domain/document/service/DataFileServiceTest.kt:183

src/test/kotlin/simplerag/ragback/domain/document/service/DataFileServiceTest.kt:206

Proposed diff:
--- a/src/test/kotlin/simplerag/ragback/domain/document/service/DataFileServiceTest.kt
@@ -181,7 +181,7 @@
                 DataFile(
                     title = "exists2",
-                    type = "text/pdf",
+                    type = "application/pdf",
                     sizeBytes = 0,
                     sha256 = sha2,
                     content = "fake://original/exists.txt",
@@ -204,7 +204,7 @@
         val dataFileDetailResponse2 = dataFiles.dataFileDetailResponseList[1]
         assertEquals(dataFileDetailResponse2.title, "exists2")
-        assertEquals(dataFileDetailResponse2.type, "text/pdf")
+        assertEquals(dataFileDetailResponse2.type, "application/pdf")
         assertEquals(dataFileDetailResponse2.sizeMB, 0.0)
         assertEquals(dataFileDetailResponse2.sha256, sha2)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d991822 and f35e708.

📒 Files selected for processing (34)

build.gradle (1 hunks)
src/main/kotlin/simplerag/ragback/domain/document/dto/DataFileRequestDTO.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/document/dto/DataFileResponseDTO.kt (2 hunks)
src/main/kotlin/simplerag/ragback/domain/document/entity/DataFile.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/document/entity/DataFileTag.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/document/entity/Tag.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/document/service/DataFileService.kt (5 hunks)
src/main/kotlin/simplerag/ragback/domain/index/dto/IndexResponseDTO.kt (2 hunks)
src/main/kotlin/simplerag/ragback/domain/index/embed/FakeEmbder.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/index/entity/DataFileIndex.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/index/entity/Index.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/index/entity/enums/EmbeddingModel.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/index/service/IndexService.kt (4 hunks)
src/main/kotlin/simplerag/ragback/domain/prompt/entity/FewShot.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/prompt/entity/Prompt.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/config/S3Config.kt (0 hunks)
src/main/kotlin/simplerag/ragback/global/error/ErrorCode.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/storage/FakeS3Util.kt (0 hunks)
src/main/kotlin/simplerag/ragback/global/util/converter/FileConvertUtil.kt (2 hunks)
src/main/kotlin/simplerag/ragback/global/util/extractor/ContentExtractor.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/util/extractor/ContentExtractorResolver.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/util/extractor/DocxContentExtractor.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/util/extractor/PdfContentExtractor.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/util/extractor/TxtContentExtractor.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/util/loader/ContentLoader.kt (0 hunks)
src/main/kotlin/simplerag/ragback/global/util/loader/HttpContentLoader.kt (0 hunks)
src/main/kotlin/simplerag/ragback/global/util/s3/S3Type.kt (0 hunks)
src/main/kotlin/simplerag/ragback/global/util/s3/S3Util.kt (0 hunks)
src/main/kotlin/simplerag/ragback/global/util/s3/S3UtilImpl.kt (0 hunks)
src/main/resources/application-local.yml (2 hunks)
src/test/kotlin/simplerag/ragback/domain/document/service/DataFileServiceTest.kt (5 hunks)
src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt (5 hunks)
src/test/resources/application-test.yml (1 hunks)

💤 Files with no reviewable changes (7)

src/main/kotlin/simplerag/ragback/global/config/S3Config.kt
src/main/kotlin/simplerag/ragback/global/util/loader/ContentLoader.kt
src/main/kotlin/simplerag/ragback/global/storage/FakeS3Util.kt
src/main/kotlin/simplerag/ragback/global/util/loader/HttpContentLoader.kt
src/main/kotlin/simplerag/ragback/global/util/s3/S3Util.kt
src/main/kotlin/simplerag/ragback/global/util/s3/S3Type.kt
src/main/kotlin/simplerag/ragback/global/util/s3/S3UtilImpl.kt

🧰 Additional context used

🧬 Code graph analysis (17)

src/main/kotlin/simplerag/ragback/global/error/ErrorCode.kt (2)

src/main/kotlin/simplerag/ragback/global/error/CustomException.kt (4)

errorCode (19-23)

errorCode (14-17)

errorCode (9-12)

errorCode (3-7)

src/main/kotlin/simplerag/ragback/global/error/GlobalExceptionHandler.kt (2)

handleFileException (72-78)

handleMissingPart (37-43)

src/main/kotlin/simplerag/ragback/domain/index/dto/IndexResponseDTO.kt (2)

src/main/kotlin/simplerag/ragback/domain/index/dto/IndexRequestDTO.kt (2)

dataFileId (11-34)

max (36-53)

src/main/kotlin/simplerag/ragback/domain/index/controller/IndexController.kt (1)

updateIndexes (41-48)

src/main/kotlin/simplerag/ragback/global/util/extractor/ContentExtractor.kt (2)

src/main/kotlin/simplerag/ragback/global/util/loader/ContentLoader.kt (2)

load (4-6)

load (5-5)

src/main/kotlin/simplerag/ragback/global/util/s3/S3Util.kt (1)

upload (5-11)

src/main/kotlin/simplerag/ragback/domain/index/entity/DataFileIndex.kt (2)

src/main/kotlin/simplerag/ragback/global/entity/BaseEntity.kt (1)

name (11-21)

src/main/kotlin/simplerag/ragback/domain/chat/entity/Model.kt (1)

name (8-29)

src/main/kotlin/simplerag/ragback/domain/index/service/IndexService.kt (1)

src/main/kotlin/simplerag/ragback/global/util/loader/ContentLoader.kt (1)

load (4-6)

src/main/kotlin/simplerag/ragback/domain/index/embed/FakeEmbder.kt (2)

src/main/kotlin/simplerag/ragback/domain/index/embed/OpenAIEmbbeder.kt (2)

openAiEmbeddingModel (6-13)

embed (11-12)

src/main/kotlin/simplerag/ragback/domain/index/embed/Embedder.kt (2)

dim (3-6)

embed (5-5)

src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (1)

src/main/kotlin/simplerag/ragback/domain/index/repository/ChunkEmbeddingRepository.kt (1)

interface ChunkEmbeddingRepository : JpaRepository<ChunkEmbedding, Long> (6-6)

src/main/kotlin/simplerag/ragback/domain/document/service/DataFileService.kt (1)

src/main/kotlin/simplerag/ragback/domain/document/controller/DataFileController.kt (1)

dataFileService (20-64)

src/main/kotlin/simplerag/ragback/domain/prompt/entity/FewShot.kt (2)

src/main/kotlin/simplerag/ragback/global/entity/BaseEntity.kt (1)

name (11-21)

src/main/kotlin/simplerag/ragback/domain/chat/entity/Model.kt (1)

name (8-29)

src/test/kotlin/simplerag/ragback/domain/document/service/DataFileServiceTest.kt (1)

src/main/kotlin/simplerag/ragback/domain/document/controller/DataFileController.kt (1)

dataFileService (20-64)

src/main/kotlin/simplerag/ragback/domain/index/entity/Index.kt (1)

src/main/kotlin/simplerag/ragback/domain/index/dto/IndexRequestDTO.kt (1)

dataFileId (11-34)

src/main/kotlin/simplerag/ragback/domain/document/entity/DataFileTag.kt (1)

src/main/kotlin/simplerag/ragback/domain/document/repository/DataFileTagRepository.kt (1)

existsByDataFileIdAndTagId (8-15)

src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt (1)

src/main/kotlin/simplerag/ragback/domain/index/dto/IndexRequestDTO.kt (1)

dataFileId (11-34)

src/main/kotlin/simplerag/ragback/domain/prompt/entity/Prompt.kt (2)

src/main/kotlin/simplerag/ragback/domain/chat/entity/Model.kt (1)

name (8-29)

src/main/kotlin/simplerag/ragback/global/entity/BaseEntity.kt (1)

name (11-21)

src/main/kotlin/simplerag/ragback/domain/index/entity/enums/EmbeddingModel.kt (2)

src/main/kotlin/simplerag/ragback/domain/index/embed/OpenAIEmbbeder.kt (1)

openAiEmbeddingModel (6-13)

src/main/kotlin/simplerag/ragback/domain/index/embed/Embedder.kt (2)

dim (3-6)

embed (5-5)

src/main/kotlin/simplerag/ragback/domain/document/dto/DataFileRequestDTO.kt (1)

src/main/kotlin/simplerag/ragback/domain/document/controller/DataFileController.kt (1)

dataFileService (20-64)

src/main/kotlin/simplerag/ragback/domain/document/entity/Tag.kt (1)

src/main/kotlin/simplerag/ragback/global/entity/BaseEntity.kt (1)

name (11-21)

🪛 YAMLlint (1.37.1)

src/test/resources/application-test.yml

[error] 14-14: no new line character at the end of file

(new-line-at-end-of-file)

src/main/resources/application-local.yml

[error] 23-23: no new line character at the end of file

(new-line-at-end-of-file)

🪛 detekt (1.23.8)

src/main/kotlin/simplerag/ragback/domain/document/service/DataFileService.kt

[warning] 56-56: The caught exception is swallowed. The original exception could be lost.

(detekt.exceptions.SwallowedException)

🔇 Additional comments (23)

src/main/kotlin/simplerag/ragback/global/error/ErrorCode.kt (1)

16-16: ✅ File type validation consistently surfaces INVALID_FILE_TYPE exception

I’ve confirmed that all content-type resolution now goes through MultipartFile.resolveContentType(), which throws CustomException(ErrorCode.INVALID_FILE_TYPE) for any unsupported extension. A search for manual extension checks or default content-type fallbacks found no other occurrences outside of:

src/main/kotlin/simplerag/ragback/global/util/converter/FileConvertUtil.kt

src/main/kotlin/simplerag/ragback/domain/document/service/DataFileService.kt

No additional call sites need updates—this change fully covers unsupported file-type handling.
src/main/kotlin/simplerag/ragback/domain/index/entity/DataFileIndex.kt (1)
19-24: Ensure JPA-generated IDs are nullable and mutable

I’ve verified there are no other Kotlin entities defining an @Id field with a default numeric value (sentinel 0). Only DataFileIndex.kt needs updating:

• File: src/main/kotlin/simplerag/ragback/domain/index/entity/DataFileIndex.kt
Apply:
 ) : BaseEntity() {
   @Id @GeneratedValue(strategy = GenerationType.IDENTITY)
   @Column(name = "data_files_indexes_id")
-  val id: Long = 0
+  var id: Long? = null
 }
With this change, JPA will correctly recognize an uninitialized (null) identifier before persistence.
src/main/resources/application-local.yml (2)
15-17: Suspicious empty key under spring.ai.model.embedding

options: is empty and then model: text-embedding-3-small sits at the same indentation level. If the binder expects spring.ai.model.embedding.model, the empty options: line is unnecessary and could be misleading.

Confirm your binder paths and consider removing the empty options: or nesting under it if intentional.

9-9: Verify create-drop is confined to the local profile

Search across src/main/resources shows only application-local.yml defines
  ddl-auto: create-drop
No other profiles or the default configuration specify spring.jpa.hibernate.ddl-auto with a destructive setting.

Please ensure that:

The local profile is strictly used in disposable, developer-only environments.

There’s no path in CI/CD or shared servers that activates the local profile.

You have safeguards (documentation, deployment scripts, environment checks) to prevent accidental use of this profile against any persistent/shared database.
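
One possible shape for the environment check mentioned in the last point, as a rough sketch only. The function name `checkDdlAuto`, the set of "disposable" profile names, and the idea of calling it at startup with the active profiles and the configured `ddl-auto` value are all assumptions; nothing like this exists in the PR.

```kotlin
// Hypothetical startup guard: refuses destructive ddl-auto modes unless
// every active profile is on an explicit allowlist of disposable profiles.
val destructiveModes = setOf("create", "create-drop")
val disposableProfiles = setOf("local", "test") // assumed profile names

fun checkDdlAuto(activeProfiles: Set<String>, ddlAuto: String) {
    val destructive = ddlAuto.lowercase() in destructiveModes
    val onlyDisposable =
        activeProfiles.isNotEmpty() && disposableProfiles.containsAll(activeProfiles)
    check(!destructive || onlyDisposable) {
        "ddl-auto=$ddlAuto is not allowed with profiles $activeProfiles"
    }
}
```

In a Spring application the two arguments would typically come from the `Environment`, but how that wiring is done is left open here.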
src/main/kotlin/simplerag/ragback/domain/index/dto/IndexResponseDTO.kt (1)

35-35: Non-null indexId is safe to keep
All uses of toIndexPreviewResponse and toIndexDetailResponse happen after the Index entity has been saved or loaded from the repository, so index.id is guaranteed to be non-null. No controller returns these DTOs before persistence, and all service methods map only fully initialized entities.

src/main/kotlin/simplerag/ragback/domain/document/dto/DataFileRequestDTO.kt (2)

7-11: Nice: Request-level @Valid on items ensures nested validation is applied

Good use of container element validation with a clear minimum-size message.

11-20: Nested type references updated — no further action required

Verified with rg -nP -g '*.kt' -C2 '\bDataFileCreateItem\b' that the only occurrences of DataFileCreateItem are its declaration inside DataFileBulkCreateRequest and the test imports/uses in DataFileServiceTest.kt.

Test code correctly imports the nested type via
import simplerag.ragback.domain.document.dto.DataFileBulkCreateRequest.DataFileCreateItem.

No remaining controllers, clients, or other modules reference a top-level DataFileCreateItem, so all downstream code has been updated for the new FQCN.
build.gradle (2)
49-49: The scripts above will reveal whether a Hibernate @Type annotation, a custom TypeContributor, or any com.pgvector‐specific configuration is present, as well as inspect database and JPA settings. Once we have these results, we can confirm if the PGvector mapping is fully wired up or if additional AttributeConverter/column mapping work is required.

60-62: Verify dependency footprint and pin extractor library versions

I wasn’t able to inspect the runtimeClasspath here due to Gradle daemon connectivity errors. Please verify locally that pulling in PDFBox and POI-OOXML doesn’t drag in unexpectedly large transitive dependencies, and pin their versions centrally for easier future CVE upgrades:

• In build.gradle (around lines 60–62), replace the direct implementation calls with centrally managed constraints:
 dependencyManagement {
   imports {
     mavenBom "org.springframework.ai:spring-ai-bom:${springAiVersion}"
   }
+  dependencies {
+    dependency "org.apache.pdfbox:pdfbox:2.0.30"
+    dependency "org.apache.poi:poi-ooxml:5.2.5"
+  }
 }
• Run locally:
./gradlew --no-daemon dependencies --configuration runtimeClasspath
# or per-dependency:
./gradlew dependencyInsight \
  --configuration runtimeClasspath \
  --dependency org.apache.pdfbox:pdfbox
./gradlew dependencyInsight \
  --configuration runtimeClasspath \
  --dependency org.apache.poi:poi-ooxml
• Ensure your extraction code only processes text so you don’t inadvertently load images or large binary streams into memory.

Let me know if any heavy transitive pulls appear so we can decide on exclusions or shading.
src/main/kotlin/simplerag/ragback/domain/document/entity/DataFile.kt (1)
25-28: Ensure lazy loading for content and verify bytecode enhancement

File: src/main/kotlin/simplerag/ragback/domain/document/entity/DataFile.kt (lines 25–28)

Apply the following diff to mark content as a lazy LOB and use an explicit Postgres text column:
-    @Column(nullable = false)
-    @Lob
-    val content: String,
+    @Lob
+    @Basic(fetch = FetchType.LAZY)
+    @Column(nullable = false, columnDefinition = "text")
+    val content: String,
Next steps:

• I did not locate any Hibernate bytecode‐enhancement settings in your configuration files or build scripts. Please verify that build-time enhancement is enabled (for example, via the org.hibernate.orm Gradle plugin or the corresponding Maven plugin) so that @Basic(fetch = LAZY) on a non-@OneToMany/@ManyToOne field takes effect.
• If you cannot enable bytecode enhancement, reconsider isolating this large content payload—either move it into a separate entity or load it via a dedicated repository method when needed.
src/main/kotlin/simplerag/ragback/domain/document/dto/DataFileResponseDTO.kt (2)

44-64: Size conversion and timestamp mapping verified

Both sizeMB conversion and lastModified mapping have been confirmed as correct—no changes required.

sizeMB uses toMegaBytes(2) to round to two decimal places as intended.

BaseEntity.updatedAt is declared as a non-nullable LocalDateTime and annotated with @LastModifiedDate alongside @EntityListeners(AuditingEntityListener::class), ensuring it’s automatically set on persist and update.

69-73: Mapping Safety Confirmed: Tag.id is Non-nullable

Tag entity defines val id: Long = 0, ensuring a non-nullable identifier. The from(tag: Tag): TagDTO mapper in DataFileResponseDTO.kt is therefore safe and will not encounter null-related issues.

• Checked in src/main/kotlin/simplerag/ragback/domain/document/entity/Tag.kt at line 18:
val id: Long = 0
• Mapping occurs in src/main/kotlin/simplerag/ragback/domain/document/dto/DataFileResponseDTO.kt lines 69–73:
fun from(tag: Tag): TagDTO = TagDTO(tag.id, tag.name)

Approving these code changes.

src/main/kotlin/simplerag/ragback/domain/index/entity/enums/EmbeddingModel.kt (1)

23-25: Comma in enum looks good; please confirm the E5 embedder outputs 768-dimensional vectors

I didn’t find any Kotlin class or function in the repo that implements “E5_BASE” embedding logic, so we need to be sure wherever E5_BASE is used, the embedding pipeline actually emits 768-length vectors.

• Verify your E5 embedding implementation (wherever you call "intfloat/e5-base-v2") produces 768-dimensional output.
• Add or update tests that assert EmbeddingModel.E5_BASE.dim == 768 and that the returned vector’s size matches.
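
The dimension check suggested above could look roughly like the sketch below. It is a self-contained illustration, not the project's code: the enum here is a hypothetical stand-in that carries only the `E5_BASE` entry, `modelId` and `requireMatchingDim` are assumed names, and a zero-filled array stands in for real embedder output.

```kotlin
// Hypothetical stand-in for the project's EmbeddingModel enum; only the
// E5_BASE entry, its model name, and the 768 dimension come from the review.
enum class EmbeddingModel(val modelId: String, val dim: Int) {
    E5_BASE("intfloat/e5-base-v2", 768)
}

// Hypothetical guard: fails fast when an embedder returns a vector whose
// length differs from the dimension declared on the enum entry.
fun requireMatchingDim(model: EmbeddingModel, vector: FloatArray): FloatArray {
    require(vector.size == model.dim) {
        "${model.name} expects ${model.dim} dimensions but got ${vector.size}"
    }
    return vector
}

fun main() {
    // A zero vector stands in for real embedder output.
    val accepted = requireMatchingDim(EmbeddingModel.E5_BASE, FloatArray(768))
    println(accepted.size)

    // A vector of the wrong length is rejected with IllegalArgumentException.
    val rejected = runCatching {
        requireMatchingDim(EmbeddingModel.E5_BASE, FloatArray(1024))
    }.isFailure
    println(rejected)
}
```

In a real test the zero vector would be replaced by the output of whatever component calls the E5 model, so a mismatch between the enum and the embedder surfaces before anything is written to the vector column.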

src/main/kotlin/simplerag/ragback/domain/index/service/IndexService.kt (4)

67-71: Consistent not-found handling.

Switch to IndexException(ErrorCode.NOT_FOUND) reads consistent with the rest of the service’s exception semantics.

78-86: Update path reads clean and safe.

Null-guard + update + mapping back to preview is standard; no issues spotted.

90-94: Delete path is correct.

Null-guard and delete are straightforward.

39-44: Approved: Sequence return type confirmed and DTO property naming consistent
LGTM; direct use of persisted content is correct.

TextChunker.chunkByCharsSeq returns a Sequence<String>, so you’re already using the most memory-friendly iteration for large files.

IndexRequestDTO defines val dataFileId: List<Long> and there are no occurrences of dataFileIds, so the DTO property name is consistent.
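
To illustrate why a `Sequence<String>` keeps memory use low, here is a minimal sketch of a lazy character-based chunker with overlap. It is an assumption-laden re-creation: the real `TextChunker.chunkByCharsSeq` may have a different signature and different edge-case behaviour, and the `size`/`overlap` parameter names are hypothetical.

```kotlin
// Hypothetical re-creation of a character-based chunker; the project's
// TextChunker.chunkByCharsSeq may differ in signature and edge cases.
fun chunkByCharsSeq(text: String, size: Int, overlap: Int): Sequence<String> {
    require(size > 0) { "size must be > 0" }
    require(overlap in 0 until size) { "overlap must be in [0, size)" }
    val step = size - overlap
    // The sequence builder produces each chunk only when the consumer asks
    // for it, so the full list of chunks is never held in memory at once.
    return sequence {
        var start = 0
        while (start < text.length) {
            val end = minOf(start + size, text.length)
            yield(text.substring(start, end))
            if (end == text.length) break
            start += step
        }
    }
}

fun main() {
    // Each chunk shares its first character with the end of the previous one.
    chunkByCharsSeq("abcdefghij", size = 4, overlap = 1).forEach(::println)
}
```

The argument checks run eagerly when the function is called, while the chunks themselves are computed one at a time during iteration.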

src/test/kotlin/simplerag/ragback/domain/document/service/DataFileServiceTest.kt (1)

82-83: Good assertion on persisted content.

Verifies we’re storing extracted content, not a URL. This guards the new model well.

src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt (1)

35-36: LGTM: injecting repositories for test setup/teardown

Wiring DataFileRepository and ChunkEmbeddingRepository directly improves test determinism and speeds up setup.

src/main/kotlin/simplerag/ragback/domain/document/service/DataFileService.kt (4)

24-24: LGTM: class-level read-only transaction default

Setting @Transactional(readOnly = true) at class level with selective write overrides on mutating methods is a solid default.

78-80: LGTM: grouping by non-null id simplifies pagination assembly

Using it.dataFile.id directly is consistent with the non-null id migration.

88-93: LGTM: explicit 404 mapping for deletion path

Resolving by id first and throwing NOT_FOUND keeps controller semantics clear.

52-53: Do not reject empty content during extraction; validate post-extraction if needed

The extractor’s responsibility is to return whatever it finds—even if that’s an empty result—reserving outright file rejection for truly unreadable or corrupted inputs. If your business logic requires that files contain non-blank content, enforce that requirement after calling extractContent, not within the extractor itself.

• Location: src/main/kotlin/simplerag/ragback/domain/document/service/DataFileService.kt, around line 52
• Change: remove or ignore blank‐content rejection here; add an after-extraction validation step in the service or pipeline if needed.
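
The separation described above can be sketched as follows. This is only an illustration under stated assumptions: the `ContentExtractor` interface shape, `txtExtractor`, `extractAndValidate`, and the `requireText` flag are all hypothetical and are not taken from the project's sources.

```kotlin
// Hypothetical extractor contract: returns whatever text it finds, including
// an empty or blank string, and is expected to throw only for unreadable input.
fun interface ContentExtractor {
    fun extract(bytes: ByteArray): String
}

// Plain-text extractor used for illustration; decodes the bytes as UTF-8.
val txtExtractor = ContentExtractor { bytes -> bytes.decodeToString() }

// Hypothetical service-level step: the "must contain text" business rule is
// applied after extraction, not inside the extractor itself.
fun extractAndValidate(
    extractor: ContentExtractor,
    bytes: ByteArray,
    requireText: Boolean
): String {
    val content = extractor.extract(bytes)
    if (requireText && content.isBlank()) {
        throw IllegalArgumentException("file has no extractable text")
    }
    return content
}

fun main() {
    val blankFile = "   ".toByteArray()
    // The extractor accepts the blank file and returns its content as-is.
    println(extractAndValidate(txtExtractor, blankFile, requireText = false).length)
    // The same file is rejected only when the caller opts into the rule.
    println(runCatching { extractAndValidate(txtExtractor, blankFile, requireText = true) }.isFailure)
}
```

Keeping the rule in the caller means the same extractor can serve uploads that demand content and pipelines that tolerate empty files.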

coderabbitai · 2025-08-22T12:42:16Z

build.gradle

    testImplementation "org.testcontainers:postgresql"
    testImplementation "org.testcontainers:junit-jupiter"
-    testImplementation 'org.springframework.ai:spring-ai-starter-vector-store-pgvector'
+    testImplementation 'com.pgvector:pgvector:0.1.6'


🧹 Nitpick (assertive)

Avoid version drift; centralize pgvector version and deduplicate test dependency.

Declare the version once and reuse it for both main and test scopes. Also, you typically don't need a separate testImplementation entry: with Gradle's Java plugin, the test classpath already includes everything declared under implementation.

Apply:

-    implementation 'com.pgvector:pgvector:0.1.6'
+    def pgvectorVersion = '0.1.6'
+    implementation "com.pgvector:pgvector:${pgvectorVersion}"
 ...
-    testImplementation 'com.pgvector:pgvector:0.1.6'
+    // If tests require explicit access, keep this; otherwise remove.
+    testImplementation "com.pgvector:pgvector:${pgvectorVersion}"

🤖 Prompt for AI Agents

In build.gradle around line 55, the pgvector version is hardcoded for testImplementation which risks version drift and duplicates the dependency; declare a single version variable (e.g., ext or extra property like pgvectorVersion = '0.1.6') and use it for the implementation dependency (implementation "com.pgvector:pgvector:${pgvectorVersion}"), then remove the testImplementation line unless tests need a separate scope—if tests require it, reference the same variable (testImplementation "com.pgvector:pgvector:${pgvectorVersion}") so the version is centralized and not duplicated.

coderabbitai · 2025-08-22T12:42:16Z

src/main/kotlin/simplerag/ragback/domain/document/dto/DataFileRequestDTO.kt

+    data class DataFileCreateItem(
+        @field:NotBlank(message = "title은 비어있을 수 없습니다")
+        @field:Size(max = 100)
+        val title: String,

-data class DataFileCreateItem(
-    @field:NotBlank(message = "title은 비어있을 수 없습니다")
-    @field:Size(max = 100)
-    val title: String,
-
-    @field:Size(max = 10, message = "태그는 최대 10개까지 가능합니다")
-    val tags: List<String> = emptyList()
-)
+        @field:Size(max = 10, message = "태그는 최대 10개까지 가능합니다")
+        val tags: List<String> = emptyList()
+    )


🧹 Nitpick (assertive)

Optional: Strengthen validation on tags to prevent blank/oversized tag values

Currently only the list size is validated. Consider enforcing per-tag constraints.

Apply:

 data class DataFileCreateItem(
     @field:NotBlank(message = "title은 비어있을 수 없습니다")
     @field:Size(max = 100)
     val title: String,

-    @field:Size(max = 10, message = "태그는 최대 10개까지 가능합니다")
-    val tags: List<String> = emptyList()
+    @field:Size(max = 10, message = "태그는 최대 10개까지 가능합니다")
+    val tags: List<@NotBlank(message = "빈 태그는 허용되지 않습니다") @Size(max = 30, message = "태그는 최대 30자까지 가능합니다") String> = emptyList()
 )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

data class DataFileCreateItem(
    @field:NotBlank(message = "title은 비어있을 수 없습니다")
    @field:Size(max = 100)
    val title: String,

    @field:Size(max = 10, message = "태그는 최대 10개까지 가능합니다")
    val tags: List<@NotBlank(message = "빈 태그는 허용되지 않습니다") @Size(max = 30, message = "태그는 최대 30자까지 가능합니다") String> = emptyList()
)

🤖 Prompt for AI Agents

In src/main/kotlin/simplerag/ragback/domain/document/dto/DataFileRequestDTO.kt around lines 12 to 19, strengthen validation on the tags list by adding per-element constraints so individual tags cannot be blank or too long. Change the tags declaration to use plain type-use annotations, List<@NotBlank @Size(max = 30) String>, rather than an @field: form, since use-site targets belong on declarations and not on type arguments. Add @field:Valid if needed, and adjust validation messages and imports accordingly so each tag is validated for non-blank content and a maximum length (choose an appropriate max such as 30).

coderabbitai · 2025-08-22T12:42:16Z

src/main/kotlin/simplerag/ragback/domain/document/dto/DataFileResponseDTO.kt

+                id = file.id,
                sha256 = file.sha256,
            )


⚠️ Potential issue

Map id defensively; entity ids can be null before persistence.

After switching entity id to nullable (recommended), avoid propagating a null or “0” id to API responses.

Apply:

-                id = file.id,
+                id = requireNotNull(file.id) { "DataFile.id must be assigned" },
                 sha256 = file.sha256,

📝 Committable suggestion

Suggested change

                id = requireNotNull(file.id) { "DataFile.id must be assigned" },
                sha256 = file.sha256,
            )

🤖 Prompt for AI Agents

In src/main/kotlin/simplerag/ragback/domain/document/dto/DataFileResponseDTO.kt around lines 21 to 23, the code maps file.id directly, but entity ids can be null before persistence. Handle the null case defensively: either make DataFileResponseDTO.id nullable (Long?) and assign id = file.id, or, if the API requires a non-null id, fail fast by checking file.id and throwing an IllegalStateException with a clear message when it is null. Update the DTO type and callers accordingly so that no null or sentinel "0" value is returned in responses.

coderabbitai · 2025-08-22T12:42:16Z

src/main/kotlin/simplerag/ragback/domain/document/entity/DataFile.kt

    @Id @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Column(name = "data_files_id")
-    val id: Long? = null,
-) : BaseEntity() {
+    val id: Long = 0
+


⚠️ Potential issue

Generated ID cannot reliably be a non-null val initialized to 0; use nullable or mutable id.

Hibernate needs to assign the identifier after insert. val id: Long = 0 risks persistence or hydration issues. The common pattern is var id: Long? = null. This also aligns with DTOs that can assert non-null post-persist.

Apply:

-    @Id @GeneratedValue(strategy = GenerationType.IDENTITY)
-    @Column(name = "data_files_id")
-    val id: Long = 0
+    @Id
+    @GeneratedValue(strategy = GenerationType.IDENTITY)
+    @Column(name = "data_files_id")
+    var id: Long? = null

Follow-up: update DTO mappers to require a non-null id at mapping time (see comments in DataFileResponseDTO.kt).

📝 Committable suggestion

Suggested change

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Column(name = "data_files_id")
    var id: Long? = null

🤖 Prompt for AI Agents

In src/main/kotlin/simplerag/ragback/domain/document/entity/DataFile.kt around lines 29-32, the entity declares a non-null val id initialized to 0 which prevents Hibernate from assigning the generated identifier correctly; change the declaration to a nullable mutable property (e.g., var id: Long? = null) so the persistence provider can set the ID after insert, and then update DTO mappers to treat the entity ID as nullable at mapping time and assert non-null only after persist (per comment in DataFileResponseDTO.kt).

coderabbitai · 2025-08-22T12:42:16Z

src/main/kotlin/simplerag/ragback/domain/document/entity/DataFile.kt

+        fun from(title: String, type: String, sizeBytes : Long, sha256 : String, content: String): DataFile {
+            return DataFile(title, type, sizeBytes, sha256, content)
        }


🧹 Nitpick (assertive)

Factory method aligns with new model; consider basic invariants.

Add lightweight checks to guard against invalid inputs, e.g., blank title/type, negative sizes.

Example:

-        fun from(title: String, type: String, sizeBytes : Long, sha256 : String, content: String): DataFile {
-            return DataFile(title, type, sizeBytes, sha256, content)
-        }
+        fun from(title: String, type: String, sizeBytes: Long, sha256: String, content: String): DataFile {
+            require(title.isNotBlank()) { "title must not be blank" }
+            require(type.isNotBlank()) { "type must not be blank" }
+            require(sizeBytes >= 0) { "sizeBytes must be >= 0" }
+            require(sha256.length == 64) { "sha256 must be 64-char hex" }
+            return DataFile(title, type, sizeBytes, sha256, content)
+        }

📝 Committable suggestion

Suggested change

        fun from(title: String, type: String, sizeBytes: Long, sha256: String, content: String): DataFile {
            require(title.isNotBlank()) { "title must not be blank" }
            require(type.isNotBlank()) { "type must not be blank" }
            require(sizeBytes >= 0) { "sizeBytes must be >= 0" }
            require(sha256.length == 64) { "sha256 must be 64-char hex" }
            return DataFile(title, type, sizeBytes, sha256, content)
        }

🤖 Prompt for AI Agents

In src/main/kotlin/simplerag/ragback/domain/document/entity/DataFile.kt around lines 34 to 36, the factory method does not validate inputs; add lightweight invariants to guard against invalid data by checking: title and type are not blank (trim before check), sizeBytes is >= 0, sha256 is not blank (optionally validate length/hex if desired), and content is non-null (allow empty if acceptable); enforce these using Kotlin require(...) or throw IllegalArgumentException with clear messages, and return the DataFile only after all checks pass.
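
For the optional length/hex validation mentioned above, a minimal sketch might look like this. The helper names `sha256Hex` and `isSha256Hex` are hypothetical and do not come from the project; only `java.security.MessageDigest` and the Kotlin standard library are used.

```kotlin
import java.security.MessageDigest

// Computes the SHA-256 digest of the given bytes as a lowercase hex string.
fun sha256Hex(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256")
        .digest(bytes)
        .joinToString("") { "%02x".format(it) }

// Hypothetical check that is stricter than `length == 64`: the whole value
// must consist of exactly 64 hexadecimal characters.
private val SHA256_HEX = Regex("[0-9a-fA-F]{64}")

fun isSha256Hex(value: String): Boolean = SHA256_HEX.matches(value)

fun main() {
    val digest = sha256Hex("sfd".toByteArray())
    println(digest.length)        // length of the hex digest
    println(isSha256Hex(digest))  // a real digest passes the check
    println(isSha256Hex("sdf"))   // the short placeholder from the test does not
}
```

A check like `isSha256Hex` could replace the plain length comparison inside the factory if rejecting non-hex input is considered worth the extra strictness.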

coderabbitai · 2025-08-22T12:42:20Z

src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt

+        val datafile = dataFileRepository.save(DataFile("a", "txt", 1000000L, "sdf", "sfd"))
+
        val indexCreateRequest =
-            IndexCreateRequest("test", 1, 0, SimilarityMetric.COSINE, 1, EmbeddingModel.TEXT_EMBEDDING_3_LARGE, true)
+            IndexCreateRequest(listOf(datafile.id), "test", 1, 0, SimilarityMetric.COSINE, 1, EmbeddingModel.TEXT_EMBEDDING_3_SMALL, true)



🧹 Nitpick (assertive)

Prefer factory constructor to honor domain invariants

Use DataFile.from(...) for clarity and to avoid bypassing domain invariants. Also, give the SHA-256 a realistic shape to prevent accidental collisions in future tests.

-        val datafile = dataFileRepository.save(DataFile("a", "txt", 1000000L, "sdf", "sfd"))
+        val datafile = dataFileRepository.save(
+            DataFile.from(
+                title = "a",
+                type = "txt",
+                sizeBytes = 1_000_000L,
+                sha256 = "00000000000000000000000000000000000000000000000000000000000000aa",
+                content = "sfd"
+            )
+        )

📝 Committable suggestion

Suggested change

        val datafile = dataFileRepository.save(
            DataFile.from(
                title = "a",
                type = "txt",
                sizeBytes = 1_000_000L,
                sha256 = "00000000000000000000000000000000000000000000000000000000000000aa",
                content = "sfd"
            )
        )

        val indexCreateRequest =
            IndexCreateRequest(
                listOf(datafile.id),
                "test",
                1,
                0,
                SimilarityMetric.COSINE,
                1,
                EmbeddingModel.TEXT_EMBEDDING_3_SMALL,
                true
            )

🤖 Prompt for AI Agents

In src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt around lines 62 to 66, the test constructs a DataFile via the primary constructor which can bypass domain invariants and uses a short dummy SHA-256; change the test to create the DataFile using DataFile.from(...) (the factory constructor) and pass a realistic 64-character hex SHA-256 string so the factory enforces invariants and reduces accidental collisions in future tests.

coderabbitai · 2025-08-22T12:42:20Z

src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt

+            IndexCreateRequest(listOf(datafile.id), "test", 1, 0, SimilarityMetric.COSINE, 1, EmbeddingModel.TEXT_EMBEDDING_3_SMALL, true)



🧹 Nitpick (assertive)

Nit: use named arguments for readability on long parameter lists

IndexCreateRequest(...) has many positional parameters. Named args reduce mistakes when signatures evolve.

Example:

val indexCreateRequest = IndexCreateRequest(
    dataFileId = listOf(datafile.id),
    snapshotName = "test",
    chunkingSize = 1,
    overlapSize = 0,
    similarityMetric = SimilarityMetric.COSINE,
    topK = 1,
    embeddingModel = EmbeddingModel.TEXT_EMBEDDING_3_SMALL,
    reranker = true
)

🤖 Prompt for AI Agents

In src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt around lines 65 to 66, the IndexCreateRequest invocation uses many positional parameters which harms readability and future-proofing; change it to use named arguments for each parameter (e.g., dataFileId = ..., snapshotName = ..., chunkingSize = ..., overlapSize = ..., similarityMetric = ..., topK = ..., embeddingModel = ..., reranker = ...) so the call is self-documenting and resilient to signature changes.

coderabbitai · 2025-08-22T12:42:20Z

src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt

        val indexCreateRequest =
-            IndexCreateRequest("test", 1, 1, SimilarityMetric.COSINE, 1, EmbeddingModel.TEXT_EMBEDDING_3_LARGE, true)
+            IndexCreateRequest(listOf(1),"test", 1, 1, SimilarityMetric.COSINE, 1, EmbeddingModel.TEXT_EMBEDDING_3_LARGE, true)



⚠️ Potential issue

Fragile test: constructing request with a non-existent DataFile id

IndexCreateRequest(listOf(1), ...) assumes a DataFile with id = 1 exists. Test order is not guaranteed; this can intermittently fail if cleanup changes or the first test doesn’t run. Persist a local DataFile and use its id.

         // given
-        val indexCreateRequest =
-            IndexCreateRequest(listOf(1),"test", 1, 1, SimilarityMetric.COSINE, 1, EmbeddingModel.TEXT_EMBEDDING_3_LARGE, true)
+        val df = dataFileRepository.save(
+            DataFile.from(
+                title = "overlap",
+                type = "txt",
+                sizeBytes = 10L,
+                sha256 = "00000000000000000000000000000000000000000000000000000000000000bb",
+                content = "x"
+            )
+        )
+        val indexCreateRequest =
+            IndexCreateRequest(listOf(df.id), "test", 1, 1, SimilarityMetric.COSINE, 1, EmbeddingModel.TEXT_EMBEDDING_3_LARGE, true)

🤖 Prompt for AI Agents

In src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt around lines 81 to 83, the test constructs IndexCreateRequest with a hardcoded DataFile id (listOf(1)) which is fragile because tests run in any order and that id may not exist; instead create and persist a local DataFile within the test (use the test repository or factory to save a DataFile) and use the returned entity's id when building IndexCreateRequest, and ensure any necessary cleanup or transactional rollback is in place so the test is isolated.

coderabbitai · 2025-08-22T12:42:20Z

src/test/resources/application-test.yml

+      ddl-auto: create-drop
+  ai:
+    openai:
+      api-key: dummy
+    model:
+      embedding:
+        options:
+        model: text-embedding-3-small


💡 Verification agent

🧩 Analysis chain

YAML indentation/keys are off for embedding model; property path should be under openai.embedding.options.model.

The current structure:

• has options: and model: at the same indentation
• uses spring.ai.model.embedding... instead of spring.ai.openai.embedding...

This will not bind to OpenAiEmbeddingModel.

Apply:

 spring:
   datasource:
     driver-class-name: org.postgresql.Driver
   jpa:
     database-platform: org.hibernate.dialect.PostgreSQLDialect
     hibernate:
       ddl-auto: create-drop
   ai:
     openai:
       api-key: dummy
-    model:
-      embedding:
-        options:
-        model: text-embedding-3-small
+      embedding:
+        options:
+          model: text-embedding-3-small

Validate property binding:

🏁 Script executed:

#!/bin/bash
# Expect this property path to exist in tests after the fix.
rg -n 'spring\.ai\.openai\.embedding\.options\.model' src/test/resources --type=yaml

Length of output: 86

🏁 Script executed:

#!/bin/bash
# Display application-test.yml with line numbers
nl -ba src/test/resources/application-test.yml || echo "application-test.yml not found"
# Show context around ai, openai, embedding, options, and model keys
rg -n -C2 '^ *ai:' src/test/resources/application-test.yml || true
rg -n -C2 '^ *openai:' src/test/resources/application-test.yml || true
rg -n -C2 '^ *embedding:' src/test/resources/application-test.yml || true
rg -n -C2 '^ *options:' src/test/resources/application-test.yml || true
rg -n -C2 '^ *model:' src/test/resources/application-test.yml || true

Length of output: 1066

Fix YAML nesting for embedding model in application-test.yml

The embedding block is currently defined under spring.ai.model instead of under spring.ai.openai, so Spring won’t bind the OpenAiEmbeddingModel. Please update src/test/resources/application-test.yml to nest embedding (and its options.model) under openai:

 spring:
   datasource:
     driver-class-name: org.postgresql.Driver
   jpa:
     database-platform: org.hibernate.dialect.PostgreSQLDialect
     hibernate:
       ddl-auto: create-drop
   ai:
     openai:
       api-key: dummy
-    model:
-      embedding:
-        options:
-        model: text-embedding-3-small
+      embedding:
+        options:
+          model: text-embedding-3-small

• File to update: src/test/resources/application-test.yml (around lines 11–14)
• After applying, verify the binding by running:

rg -n 'spring\.ai\.openai\.embedding\.options\.model' src/test/resources --type=yaml

🧰 Tools

🪛 YAMLlint (1.37.1)

[error] 14-14: no new line character at the end of file

(new-line-at-end-of-file)

🤖 Prompt for AI Agents

In src/test/resources/application-test.yml around lines 7 to 14, the embedding configuration is nested under spring.ai.model instead of spring.ai.openai; move the entire embedding block (including options and model) so it is a child of openai (spring.ai.openai.embedding.options.model) so Spring can bind OpenAiEmbeddingModel; update indentation accordingly and save the file, then verify with the provided ripgrep command to ensure spring.ai.openai.embedding.options.model is present.

coderabbitai · 2025-08-22T12:42:20Z

src/test/resources/application-test.yml

+    model:
+      embedding:
+        options:
+        model: text-embedding-3-small


🧹 Nitpick (assertive)

Add a trailing newline.

Satisfy linters and POSIX text file conventions.

-        model: text-embedding-3-small
+        model: text-embedding-3-small
+

📝 Committable suggestion

Suggested change

        model: text-embedding-3-small

🧰 Tools

🪛 YAMLlint (1.37.1)

[error] 14-14: no new line character at the end of file

(new-line-at-end-of-file)

🤖 Prompt for AI Agents

In src/test/resources/application-test.yml at line 14, the file is missing a trailing newline; add a newline character at the end of the file so the last line ("model: text-embedding-3-small") is terminated by a newline to satisfy linters and POSIX conventions.

catturtle123 added 6 commits August 22, 2025 17:56

♻️ Refactor: val immutability

a8554d9

♻️ Refactor: internalize DTO

97c34aa

🐛 Fix: fix DTO error

1fb358d

✨ Feature: add ContentExtractor

abf077d

♻️ Refactor: change loader to extractor

fe09783

✅ Test: index test add

f35e708

catturtle123 self-assigned this Aug 22, 2025

catturtle123 added the feature New feature or request label Aug 22, 2025

catturtle123 linked an issue Aug 22, 2025 that may be closed by this pull request

✨ [Feature] add content extractor #19

Closed

3 tasks

coderabbitai bot reviewed Aug 22, 2025

View reviewed changes

catturtle123 merged commit 0b38578 into develop Aug 22, 2025
1 check passed

catturtle123 deleted the feature/#10-add-content-extractor branch August 22, 2025 12:56
