✨ Feature: indexing #18

catturtle123 · 2025-08-22T04:14:27Z

📌 Overview

indexing

🔍 Related Issues

Closes ✨ [Feature] 인덱스 임베딩 저장 #10

✨ Changes

-indexing
🐛 Fix: Autowired 해결
✨ feature: indexing

📸 Screenshots / Test Results (Optional)

Attach images or videos if necessary.

✅ Checklist

I have followed the code style guidelines.
I have removed unnecessary comments or console logs.
I have written/updated related tests.
I have verified that all features work correctly in my local environment.

🗒️ Additional Notes

Add any other context or information here.

Summary by CodeRabbit

New Features
- Index creation now requires at least one selected file, ingests file content, chunks text, and generates embeddings via OpenAI to produce a fully populated index.
- Added HTTP content loader and text chunking utilities.
Refactor
- S3 URL handling standardized to a fixed public base and utilities reorganized.
Chores
- Added OpenAI embedding and vector-store configuration/dependencies (requires OPENAI_KEY).
Tests
- Improved test DI and container setup for more reliable test runs.

coderabbitai · 2025-08-22T04:19:50Z

Walkthrough

Adds Spring AI/OpenAI and pgvector starter dependencies, introduces embedding interfaces and OpenAI embedder, implements content loading and text chunking, changes ChunkEmbedding to store FloatArray with vector(1536), implements full createIndex flow producing embeddings, reorganizes utility packages, adjusts S3 URL behavior, and updates tests and local config.

Changes

Cohort / File(s)	Summary
Build and dependencies `build.gradle`	Enables `org.springframework.ai:spring-ai-starter-model-openai`, adds `org.springframework.ai:spring-ai-starter-vector-store-pgvector`, removes legacy `com.pgvector:pgvector:0.1.6`; minor quoting change.
Index embedding feature `src/main/kotlin/.../domain/index/embed/Embedder.kt`, `src/main/kotlin/.../domain/index/embed/OpenAIEmbbeder.kt`, `src/main/kotlin/.../domain/index/entity/ChunkEmbedding.kt`, `src/main/kotlin/.../domain/index/service/IndexService.kt`, `src/main/kotlin/.../domain/index/dto/IndexRequestDTO.kt`, `src/main/kotlin/.../domain/index/dto/IndexResponseDTO.kt`, `src/main/kotlin/.../global/util/TextChunker.kt`, `src/main/kotlin/.../global/util/loader/ContentLoader.kt`, `src/main/kotlin/.../global/util/loader/HttpContentLoader.kt`	Adds `Embedder` interface and `OpenAIEmbedder` component (dim=1536); changes `ChunkEmbedding.embedding` to `FloatArray` and column to `vector(1536)`; implements `IndexService.createIndex` flow: validation, load DataFiles, content loading, chunking (TextChunker), embedding per chunk, attach embeddings to index; adds `dataFileId` (non-empty) to create DTO and small DTO formatting tweak.
Utility package reorg `src/main/kotlin/.../global/util/converter/FileConvertUtil.kt`, `src/main/kotlin/.../global/util/converter/MultipartJackson2HttpMessageConverter.kt`, `src/main/kotlin/.../global/storage/FakeS3Util.kt`, `src/main/kotlin/.../domain/document/service/DataFileService.kt`, `src/main/kotlin/.../global/util/s3/S3Type.kt`, `src/main/kotlin/.../global/util/s3/S3Util.kt`	Moves utility files into `converter` and `s3` subpackages and updates imports; most files unchanged aside from package paths.
S3 behavior change `src/main/kotlin/.../global/util/s3/S3UtilImpl.kt`	Replaces dynamic S3 URL construction with a hard-coded base URL concatenated with the key in `urlFromKey`.
Configuration `src/main/resources/application-local.yml`	Adds `spring.ai.openai.api-key` (from `${OPENAI_KEY}`) and `spring.ai.model.embedding.model: text-embedding-3-small` with empty options mapping.
Tests `src/test/kotlin/.../domain/document/service/DataFileServiceTest.kt`, `src/test/kotlin/.../domain/index/service/IndexServiceTest.kt`	Switches tests to constructor-based Spring injection with `@Autowired` on parameters; replaces Testcontainers `@Container` usage with `@ServiceConnection`; updates imports for moved utilities.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant IS as IndexService
  participant DFR as DataFileRepository
  participant CLR as ContentLoader
  participant TC as TextChunker
  participant EMB as Embedder
  participant IR as IndexRepository
  participant DB as Database

  Client->>IS: createIndex(req)
  IS->>IS: validateOverlap(req)
  IS->>DFR: findAllById(req.dataFileId)
  DFR-->>IS: DataFiles
  IS->>IS: require(embedder.dim == req.embeddingModel.dim)
  IS->>IR: save(Index.from(req))
  IR-->>IS: persisted Index
  loop each DataFile
    IS->>CLR: load(file.url)
    CLR-->>IS: content
    alt content not blank
      IS->>TC: chunkByCharsSeq(content, size, overlap)
      TC-->>IS: chunks
      loop each chunk
        IS->>EMB: embed(chunk)
        EMB-->>IS: FloatArray(1536)
        IS->>DB: persist ChunkEmbedding(embedding, content, index)
      end
    end
  end
  IS-->>Client: IndexPreviewResponse

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Assessment against linked issues

Objective	Addressed	Explanation
Match embedding size during chunking/embedding and store embeddings [#10]	✅

Assessment against linked issues: Out-of-scope changes

Code Change	Explanation
Hard-coded S3 URL in `urlFromKey` (src/main/kotlin/simplerag/ragback/global/util/s3/S3UtilImpl.kt, `urlFromKey`)	Changing S3 URL construction to a fixed base URL modifies storage URL resolution unrelated to embedding size validation or storing embeddings.

Possibly related PRs

✨ Feature: erd 수정 #6 — Overlaps index/embedding domain and ChunkEmbedding mapping changes; likely related to embedding storage adjustments.
🚀 Chore: entity 수정 및 모델 수 줄이기 #14 — Directly touches ChunkEmbedding and pgvector dependency decisions; closely related to the FloatArray/vector mapping change.
✨ Feature: index 메타데이터 CRUD 기능 추가 #11 — Modifies index service and entity flows; relevant to the new createIndex embedding flow.

Poem

I nibbled the text into tidy little bites,
Overlap counted in soft, warm lights,
Vectors tucked in burrows deep,
Indices hum while rabbits sleep.
OpenAI winds and chunk-by-chunk delights. 🐇✨

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feature/#10-indexing

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 20

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (11)

src/main/kotlin/simplerag/ragback/global/util/converter/FileConvertUtil.kt (1)
18-38: Broaden content-type detection or document security expectations.

Relying on file extension is OK as a fallback, but it’s spoofable. If this mapping is used for security-sensitive branching (e.g., parsing), consider content sniffing (Apache Tika) or magic-byte checks. As a minimal improvement, cover common types encountered in RAG pipelines.

Apply this small extension of mappings:
@@
     return when (ext) {
         "png" -> "image/png"
         "jpg", "jpeg" -> "image/jpeg"
+        "webp" -> "image/webp"
+        "svg" -> "image/svg+xml"
+        "heic" -> "image/heic"
         "pdf" -> "application/pdf"
         "txt" -> "text/plain"
         "csv" -> "text/csv"
         "md" -> "text/markdown"
         "json" -> "application/json"
+        "rtf" -> "application/rtf"
         "zip" -> "application/zip"
+        "7z" -> "application/x-7z-compressed"
         "doc" -> "application/msword"
         "docx" -> "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
         "xls" -> "application/vnd.ms-excel"
         "xlsx" -> "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
         "ppt" -> "application/vnd.ms-powerpoint"
         "pptx" -> "application/vnd.openxmlformats-officedocument.presentationml.presentation"
         else -> "application/octet-stream"
     }
If you want, I can wire in Tika with a graceful fallback to this map.
src/main/kotlin/simplerag/ragback/global/util/converter/MultipartJackson2HttpMessageConverter.kt (1)
9-25: Disable canRead as well; current converter may intercept application/octet-stream reads unexpectedly.

You’ve disabled writes, but reads remain enabled by default for application/octet-stream via AbstractJackson2HttpMessageConverter. This can hijack request body resolution and regress multipart/byte[] handling.

Apply:
@@
 @Component
 class MultipartJackson2HttpMessageConverter
     (objectMapper: ObjectMapper) :
     AbstractJackson2HttpMessageConverter(objectMapper, MediaType.APPLICATION_OCTET_STREAM) {
 
+    override fun canRead(clazz: Class<*>, mediaType: MediaType?): Boolean = false
+    override fun canRead(type: Type?, contextClass: Class<*>?, mediaType: MediaType?): Boolean = false
+    override fun canRead(mediaType: MediaType?): Boolean = false
+
     override fun canWrite(clazz: Class<*>, mediaType: MediaType?): Boolean {
         return false
     }
 
     override fun canWrite(type: Type?, clazz: Class<*>, mediaType: MediaType?): Boolean {
         return false
     }
 
     override fun canWrite(mediaType: MediaType?): Boolean {
         return false
     }
 }
Optionally also annotate with @order(Ordered.LOWEST_PRECEDENCE) for belt-and-braces.
src/main/kotlin/simplerag/ragback/global/util/s3/S3UtilImpl.kt (2)
25-27: Logger initialized with the wrong class

This logs under GlobalExceptionHandler instead of S3UtilImpl, making logs harder to filter.

Apply:
-    private val log = LoggerFactory.getLogger(GlobalExceptionHandler::class.java)
+    private val log = LoggerFactory.getLogger(S3UtilImpl::class.java)
Also remove the now-unused GlobalExceptionHandler import.

100-107: keyFromUrl should decode percent-encoded paths

URLs returned by clients/CDN may be encoded; deleting with an encoded key can fail. Decode safely.

Apply:
-    override fun keyFromUrl(url: String): String? {
-        val path = try {
-            URI(url).path // e.g. "/market/menu/uuid_name.jpg"
-        } catch (_: Exception) {
-            return null
-        }
-        return path.removePrefix("/").ifBlank { null }
-    }
+    override fun keyFromUrl(url: String): String? {
+        val rawPath = try {
+            URI(url).rawPath // keep encoding, decode ourselves
+        } catch (_: Exception) {
+            return null
+        }
+        if (rawPath.isNullOrBlank()) return null
+        return java.net.URLDecoder.decode(rawPath.removePrefix("/"), java.nio.charset.StandardCharsets.UTF_8)
+            .ifBlank { null }
+    }
src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (1)
7-9: Add an index on the FK column to speed lookups by Index.

Read-paths will commonly fetch chunk embeddings per index. Add a DB index for indexes_id.

Apply:
-@Table(name = "chunk_embeddings")
+@Table(
+    name = "chunk_embeddings",
+    indexes = [jakarta.persistence.Index(name = "idx_chunk_embeddings_indexes_id", columnList = "indexes_id")]
+)
src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt (3)
55-69: Consider asserting that chunk embeddings are created when content exists.

Currently, createIndexTest only checks IDs/names. Adding an assertion on chunkEmbeddings count (or a repository count) would verify the end-to-end indexing pipeline.

I can add a fixture DataFile with a short text and assert that embeddings were persisted for it.

59-69: Embedding dimension mismatch in tests will cause failures

The IndexService.createIndex method explicitly checks that the embedder’s dim (1536) matches the requested model’s dim and throws INVALID_INPUT on mismatch. Since OpenAIEmbedder.dim = 1536 but the tests use EmbeddingModel.TEXT_EMBEDDING_3_LARGE (3072), any test invoking createIndex with the large model will now fail.

• File: src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt
Line ranges: 59–69, 76–83, 96–100, 126–133, 156–163, 178–185, 216–220, 241–245, 268–270, 291–295

Please update all occurrences of EmbeddingModel.TEXT_EMBEDDING_3_LARGE in these tests to EmbeddingModel.TEXT_EMBEDDING_3_SMALL (1536) to align with the current embedder configuration. If support for the 3072-dim model is intended, instead update OpenAIEmbedder.dim (and related wiring) to 3072.

Suggested diff for one occurrence (apply similarly across all):
-    IndexCreateRequest("test", 1, 0, SimilarityMetric.COSINE, 1, EmbeddingModel.TEXT_EMBEDDING_3_LARGE, true)
+    IndexCreateRequest("test", 1, 0, SimilarityMetric.COSINE, 1, EmbeddingModel.TEXT_EMBEDDING_3_SMALL, true)
72-84: Align overlap validation and tests with the “allow equal overlap” rule

The service’s validateOverlap currently throws on overlapSize >= chunkingSize, disallowing equal values, but the team has agreed that overlapSize may be equal to chunkingSize (i.e. only overlap > chunking should error). Please update both the implementation and the tests accordingly:

• In IndexService.kt (around line 97), change the guard to throw only when overlapSize > chunkingSize:
-    private fun validateOverlap(overlapSize: Int, chunkingSize: Int) {
-        if (overlapSize >= chunkingSize) throw IndexException(ErrorCode.OVERLAP_OVERFLOW)
-    }
+    private fun validateOverlap(overlapSize: Int, chunkingSize: Int) {
+        if (overlapSize >  chunkingSize) throw IndexException(ErrorCode.OVERLAP_OVERFLOW)
+    }
• In IndexServiceTest.kt (create‐index overlap test at lines 72–84), trigger the error only when overlap > chunking. For example:
-    val indexCreateRequest =
-        IndexCreateRequest("test", 1, 1, SimilarityMetric.COSINE, 1, EmbeddingModel.TEXT_EMBEDDING_3_LARGE, true)
+    val indexCreateRequest =
+        IndexCreateRequest("test", 1, 2, SimilarityMetric.COSINE, 1, EmbeddingModel.TEXT_EMBEDDING_3_LARGE, true)
(The error message "overlap 크기는 chunking 크기를 넘을 수 없습니다." remains appropriate, since it forbids only exceeding.)

• Apply the same update to the index‐update overlap test (lines 233–256): use an overlapSize one greater than chunkingSize to verify the exception.
src/test/kotlin/simplerag/ragback/domain/document/service/DataFileServiceTest.kt (2)
215-219: Use existing sha256Hex() util to compute the 12-char key prefix.

Reduces duplication and potential inconsistencies.
-        val hash12 = MessageDigest.getInstance("SHA-256")
-            .digest(bytes).joinToString("") { "%02x".format(it) }
-            .take(12)
+        val hash12 = sha256Hex(bytes).take(12)
282-284: Prefer assertFalse over equals(false) for boolean assertions.

Improves readability and assertion messages.
-        assertEquals(dataFiles.hasNext, false)
+        assertFalse(dataFiles.hasNext)
src/main/kotlin/simplerag/ragback/domain/index/service/IndexService.kt (1)
97-99: Adjust overlap validation to allow equality

The project’s learning indicates that overlapSize should be permitted to equal chunkingSize (i.e. enforce overlapSize ≤ chunkingSize) and that this check will live in the service layer as part of the chunking‐embedding save flow. The current implementation throws on >= and tests have been written to expect equality to fail, which conflicts with that decision.

Please apply the following mandatory updates:

In IndexService.validateOverlap, change the guard from >= to > so that equality is allowed.

Update any affected unit tests that currently assert an exception when overlapSize == chunkingSize to expect success instead.

Optionally, clarify the exception message to reflect “overlap 크기는 chunking 크기보다 작거나 같아야 합니다.” or similar.

Add a note in the upcoming chunking‐embedding save PR that this validation was introduced here.
--- a/src/main/kotlin/simplerag/ragback/domain/index/service/IndexService.kt
+++ b/src/main/kotlin/simplerag/ragback/domain/index/service/IndexService.kt
@@ -97,7 +97,7 @@ class IndexService {
     private fun validateOverlap(overlapSize: Int, chunkingSize: Int) {
-        if (overlapSize >= chunkingSize) throw IndexException(ErrorCode.OVERLAP_OVERFLOW)
+        if (overlapSize >  chunkingSize) throw IndexException(ErrorCode.OVERLAP_OVERFLOW)
     }
 }

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 39da5d8 and 4d7b409.

📒 Files selected for processing (20)

build.gradle (1 hunks)
src/main/kotlin/simplerag/ragback/domain/document/service/DataFileService.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/index/dto/IndexRequestDTO.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/index/dto/IndexResponseDTO.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/index/embed/Embedder.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/index/embed/OpenAIEmbbeder.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (1 hunks)
src/main/kotlin/simplerag/ragback/domain/index/service/IndexService.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/storage/FakeS3Util.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/util/TextChunker.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/util/converter/FileConvertUtil.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/util/converter/MultipartJackson2HttpMessageConverter.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/util/loader/ContentLoader.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/util/loader/HttpContentLoader.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/util/s3/S3Type.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/util/s3/S3Util.kt (1 hunks)
src/main/kotlin/simplerag/ragback/global/util/s3/S3UtilImpl.kt (2 hunks)
src/main/resources/application-local.yml (1 hunks)
src/test/kotlin/simplerag/ragback/domain/document/service/DataFileServiceTest.kt (1 hunks)
src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt (1 hunks)

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: catturtle123
PR: Simple-RAG/BACK#6
File: src/main/kotlin/simplerag/ragback/domain/index/entity/Index.kt:16-22
Timestamp: 2025-08-17T13:12:02.200Z
Learning: catturtle123이 Index 엔티티의 overlapSize ≤ chunkingSize 검증을 엔티티 레벨이 아닌 서비스 레이어에서 구현하기로 결정했으며, chunking embedding 저장 관련 PR에서 이 검증 로직을 추가할 예정이므로 해당 PR에서 언급해야 함 (Simple-RAG/BACK 프로젝트)

📚 Learning: 2025-08-17T07:14:26.887Z

Learnt from: catturtle123
PR: Simple-RAG/BACK#6
File: src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt:6-22
Timestamp: 2025-08-17T07:14:26.887Z
Learning: ChunkEmbedding 엔티티에서 embedding 필드는 벡터 임베딩 저장용이고, content 필드는 원본 파일/텍스트 청크 저장용으로 설계됨 (catturtle123 프로젝트)

Applied to files:

src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt

🧬 Code graph analysis (11)

src/main/kotlin/simplerag/ragback/global/util/s3/S3Util.kt (2)

src/main/kotlin/simplerag/ragback/global/util/S3UtilImpl.kt (3)

s3 (18-110)

{ "file" } (89-89)

{ null } (108-108)

src/main/kotlin/simplerag/ragback/global/util/S3Util.kt (1)

upload (5-11)

src/main/kotlin/simplerag/ragback/global/storage/FakeS3Util.kt (1)

src/main/kotlin/simplerag/ragback/global/util/S3UtilImpl.kt (1)

s3 (18-110)

src/main/kotlin/simplerag/ragback/global/util/s3/S3Type.kt (3)

src/main/kotlin/simplerag/ragback/global/util/S3Type.kt (1)

label (4-8)

src/main/kotlin/simplerag/ragback/global/util/S3UtilImpl.kt (3)

s3 (18-110)

{ null } (108-108)

{ "file" } (89-89)

src/main/kotlin/simplerag/ragback/global/util/S3Util.kt (1)

upload (5-11)

src/main/kotlin/simplerag/ragback/domain/index/embed/Embedder.kt (2)

src/main/kotlin/simplerag/ragback/domain/index/entity/enums/EmbeddingModel.kt (1)

dim (3-49)

src/main/kotlin/simplerag/ragback/global/util/FloatArrayToPgVectorStringConverter.kt (4)

{ "Embedding must not be empty; expected fixed dimension (e.g., 1536)" } (10-10)

{ "Embedding (FloatArray) must not be null" } (9-9)

{ (26-28)

{ (11-13)

src/main/kotlin/simplerag/ragback/global/util/converter/FileConvertUtil.kt (1)

src/main/kotlin/simplerag/ragback/global/util/FileConvertUtil.kt (1)

sha256 (8-11)

src/main/kotlin/simplerag/ragback/global/util/s3/S3UtilImpl.kt (1)

src/main/kotlin/simplerag/ragback/global/util/S3Util.kt (2)

upload (5-11)

deleteByUrl (8-8)

src/main/kotlin/simplerag/ragback/domain/index/dto/IndexResponseDTO.kt (2)

src/main/kotlin/simplerag/ragback/domain/index/converter/IndexConverter.kt (1)

toIndexPreviewResponse (29-34)

src/main/kotlin/simplerag/ragback/domain/index/entity/Index.kt (1)

name (10-57)

src/main/kotlin/simplerag/ragback/domain/index/embed/OpenAIEmbbeder.kt (2)

src/main/kotlin/simplerag/ragback/domain/index/entity/enums/EmbeddingModel.kt (1)

dim (3-49)

src/main/kotlin/simplerag/ragback/domain/index/entity/Index.kt (1)

name (10-57)

src/main/kotlin/simplerag/ragback/domain/index/dto/IndexRequestDTO.kt (2)

src/main/kotlin/simplerag/ragback/domain/index/entity/Index.kt (1)

name (10-57)

src/main/kotlin/simplerag/ragback/domain/document/dto/DataFileRequestDTO.kt (2)

min (7-11)

message (13-20)

src/main/kotlin/simplerag/ragback/domain/index/service/IndexService.kt (2)

src/main/kotlin/simplerag/ragback/domain/index/controller/IndexController.kt (1)

indexService (11-59)

src/main/kotlin/simplerag/ragback/domain/index/entity/Index.kt (2)

name (10-57)

update (48-55)

src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (2)

src/main/kotlin/simplerag/ragback/domain/index/repository/ChunkEmbeddingRepository.kt (1)

interface ChunkEmbeddingRepository : JpaRepository<ChunkEmbedding, Long> (6-6)

src/main/kotlin/simplerag/ragback/global/util/FloatArrayToPgVectorStringConverter.kt (4)

{ "Embedding (FloatArray) must not be null" } (9-9)

{ "Embedding must not be empty; expected fixed dimension (e.g., 1536)" } (10-10)

{ (26-28)

autoApply (6-34)

🔇 Additional comments (14)

src/main/kotlin/simplerag/ragback/global/util/converter/FileConvertUtil.kt (2)

40-57: Streaming digest implementation is solid.

Buffered + DigestInputStream with an 8KB buffer, proper use blocks, and byte counting look correct and memory-safe for large files.

1-1: Confirm package relocation and import updates

All previous imports of the moved functions have been updated to the new simplerag.ragback.global.util.converter package, and no residual references to the old package remain. Specifically:

src/main/kotlin/simplerag/ragback/domain/document/service/DataFileService.kt now imports
• computeMetricsStreaming
• resolveContentType

src/main/kotlin/simplerag/ragback/global/storage/FakeS3Util.kt now imports sha256Hex

src/test/kotlin/simplerag/ragback/domain/document/service/DataFileServiceTest.kt now imports sha256Hex

The FileMetrics type is declared and consumed entirely within FileConvertUtil.kt—there are no external references requiring its import elsewhere.

Package move is fully verified; approving changes.

src/main/kotlin/simplerag/ragback/global/util/s3/S3Type.kt (1)

1-8: All dependent imports updated and S3Type usage verified

Verified that every external reference to S3Type imports from the new package simplerag.ragback.global.util.s3.S3Type (e.g. in FakeS3Util.kt, DataFileService.kt, DataFileServiceTest.kt).

S3Util.kt and S3UtilImpl.kt reside in the same package, so they correctly reference S3Type without an import.

No lingering imports from the old package path were found.

Enum contents remain unchanged; the leading/trailing slashes in ORIGINAL_FILE("/ORIGINAL/") are handled correctly by the existing buildKey trimming logic.

src/main/kotlin/simplerag/ragback/global/util/s3/S3Util.kt (1)

1-11: Import verification complete – no drift detected
All callers (services, fakes, and tests) now correctly reference the moved interface at simplerag.ragback.global.util.s3.S3Util:

src/main/kotlin/simplerag/ragback/domain/document/service/DataFileService.kt

src/main/kotlin/simplerag/ragback/global/storage/FakeS3Util.kt

src/test/kotlin/simplerag/ragback/domain/document/service/DataFileServiceTest.kt (uses FakeS3Util)

No further action required.
src/main/kotlin/simplerag/ragback/global/util/loader/ContentLoader.kt (1)
1-6: Overlap validation is implemented in IndexService

The validateOverlap guard is present and invoked in both index creation and update operations:
In src/main/kotlin/simplerag/ragback/domain/index/service/IndexService.kt:
• Line 29–30: validateOverlap(req.overlapSize, req.chunkingSize) on createIndex
• Line 83–84: validateOverlap(indexUpdateRequest.overlapSize, indexUpdateRequest.chunkingSize) on update
• Lines 97–99:
private fun validateOverlap(overlapSize: Int, chunkingSize: Int) {
    if (overlapSize >= chunkingSize) throw IndexException(ErrorCode.OVERLAP_OVERFLOW)
}
Additionally, TextChunker.chunkByCharsSeq enforces require(overlap in 0 until size) to guarantee overlapSize < chunkingSize before chunking.
No further changes are needed here.
src/main/resources/application-local.yml (1)
11-17: Mis-scoped Spring AI properties – adjust embedding model nesting

The model setting must live under spring.ai.openai.embedding.options, not at spring.ai.model.embedding, so that Spring AI’s auto-configuration picks it up (docs.spring.io). For example, update application-local.yml as follows:
 spring:
   ai:
     openai:
       api-key: ${OPENAI_KEY}
-    model:
-      embedding:
-        options:
-        model: text-embedding-3-small
+      embedding:
+        options:
+          model: text-embedding-3-small
• (Optional) You can still explicitly enable the OpenAI embedding auto-config via
spring:
  ai:
    model:
      embedding: openai
but it defaults to openai if unset (docs.spring.io).

Runtime check:
• Verify at startup logs that the active embedder reports text-embedding-3-small.
• Ensure your vector store expects 1 536-dimensional embeddings, matching text-embedding-3-small’s default output (airbyte.com, docs.spring.io).
build.gradle (1)
29-31: Please manually verify Spring AI BOM compatibility

We weren’t able to resolve the effective spring-ai and Spring Boot starter versions due to Gradle daemon connection issues in this environment. To ensure you won’t run into dependency convergence problems at runtime:

• Confirm which Spring Boot plugin version you’re using (e.g. 3.5.x) in your build.gradle.
• Run a dependency-insight report locally to see the resolved spring-ai BOM version:
./gradlew dependencyInsight \
  --dependency org.springframework.experimental:spring-ai \
  --configuration runtimeClasspath
• Verify that the BOM version (currently 1.0.1) is supported by your Spring Boot version. If it isn’t, bump or align springAiVersion to the matching spring-ai BOM release for Spring Boot 3.5.x.
src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (2)

14-18: Dimension consistency: fixed column vs. per-record embeddingDim.

You persist embeddingDim alongside a fixed vector(1536) column. If you plan to support models with different dimensions (the issue mentions “adjust embedding size during chunking embedding”), a fixed column dimension will block this. If you intend to standardize on 1536 only, consider removing embeddingDim to avoid drift; otherwise, make the column flexible (schema per model, per-table, or a generic storage form) and validate embedding.size == embeddingDim.

Would you like me to propose a migration strategy for multi-dimension support (separate tables per model, or a generic storage with check constraints)?

11-12: Switch to text for content is appropriate.

Mapping content to text is a good fit for large chunks and avoids driver-specific @Lob handling quirks.

src/main/kotlin/simplerag/ragback/global/util/TextChunker.kt (2)

12-19: Chunking loop logic is correct and handles the tail chunk cleanly.

step = size - overlap, guarded to be ≥1; end is clamped; final chunk exits via break. Looks good.

5-6: Service-layer overlap validation aligns with chunker requirement

The existing validateOverlap implementation in IndexService.kt already throws an IndexException when overlapSize >= chunkingSize (lines 97–99), thereby rejecting overlap == chunkingSize to match the strict < size rule enforced by TextChunker.chunkByCharsSeq (lines 5–6). No further changes are needed.

src/main/kotlin/simplerag/ragback/global/storage/FakeS3Util.kt (1)

7-9: Import path refactor looks consistent with the package move.

Aligns with global.util.s3 and global.util.converter. No behavior changes.

src/main/kotlin/simplerag/ragback/domain/document/service/DataFileService.kt (1)

20-23: Imports updated to new packages; service logic unchanged.

The refactor maintains behavior and compiles against the reorganized utility packages.

src/main/kotlin/simplerag/ragback/domain/index/dto/IndexResponseDTO.kt (1)

22-22: Trailing comma addition is fine and future-proof for diffs.

No semantic changes; keeps diffs cleaner when adding fields.

build.gradle

coderabbitai · 2025-08-22T04:34:37Z

build.gradle

+    implementation 'org.springframework.ai:spring-ai-starter-vector-store-pgvector'

+    // test
    testImplementation "org.springframework.boot:spring-boot-testcontainers"
    testImplementation "org.testcontainers:postgresql"
    testImplementation "org.testcontainers:junit-jupiter"
-    testImplementation "com.pgvector:pgvector:0.1.6"
+    testImplementation 'org.springframework.ai:spring-ai-starter-vector-store-pgvector'



🧹 Nitpick (assertive)

Unnecessary duplicate dependency in tests

spring-ai-starter-vector-store-pgvector is already on the main classpath; duplicating in tests can mask classpath issues and slow resolution.

Apply:

- testImplementation 'org.springframework.ai:spring-ai-starter-vector-store-pgvector'

🤖 Prompt for AI Agents

In build.gradle around lines 50 to 57, the test classpath redundantly includes 'org.springframework.ai:spring-ai-starter-vector-store-pgvector' which is already declared on the main implementation classpath; remove the duplicate testImplementation entry to avoid masking classpath issues and speed dependency resolution, leaving only the testcontainers and junit testImplementation lines in that block.

coderabbitai · 2025-08-22T04:34:37Z

src/main/kotlin/simplerag/ragback/domain/index/dto/IndexRequestDTO.kt

+    @field:NotEmpty
+    val dataFileId: List<Long>,
+


🧹 Nitpick (assertive)

Enforce overlapSize ≤ chunkingSize in the service (per team decision)

Per retrieved learnings, the validation belongs in the service layer, not the entity. Ensure IndexService.createIndex guards this invariant and returns a 400 with a clear message when violated.

I can add a guard + unit test (valid/invalid pairs) if you want me to push a patch.

🤖 Prompt for AI Agents

In src/main/kotlin/simplerag/ragback/domain/index/dto/IndexRequestDTO.kt around lines 13 to 15, the DTO currently only declares dataFileId and the overlap/chunking invariant must be enforced in the service layer: update IndexService.createIndex to validate that request.overlapSize <= request.chunkingSize, and if the check fails return an HTTP 400 response with a clear error message (e.g., "overlapSize must be less than or equal to chunkingSize"); add unit tests for both valid and invalid pairs to assert success and the 400/error message on violation.

🧹 Nitpick (assertive)

Prefer @SiZe(min=1) for lists and align naming to plural ("dataFileIds")

Project already uses @SiZe(min=1) for lists (see DataFileBulkCreateRequest); use the same for consistency.

Name should reflect plurality.

Apply:

-import jakarta.validation.constraints.NotEmpty +import jakarta.validation.constraints.Size @@ - @field:NotEmpty - val dataFileId: List<Long>, + @field:Size(min = 1, message = "최소 하나 이상의 파일 ID가 필요합니다") + val dataFileIds: List<Long>,

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

@field:NotEmpty

val dataFileId: List<Long>,

// Replace the import

import jakarta.validation.constraints.Size

// Within IndexRequestDTO.kt

@field:Size(min = 1, message = "최소 하나 이상의 파일 ID가 필요합니다")

val dataFileIds: List<Long>,

🤖 Prompt for AI Agents

In src/main/kotlin/simplerag/ragback/domain/index/dto/IndexRequestDTO.kt around lines 13 to 15, replace the @field:NotEmpty annotation on the List property with @field:Size(min = 1) and rename the property from dataFileId to dataFileIds for plural consistency; update the import to javax.validation.constraints.Size (remove NotEmpty import), and update all usages/constructors/JSON mappings/tests that reference dataFileId to use dataFileIds so compilation and serialization remain correct.

coderabbitai · 2025-08-22T04:34:38Z

src/main/kotlin/simplerag/ragback/domain/index/embed/Embedder.kt

+interface Embedder {
+    val dim: Int
+    fun embed(text: String): FloatArray
+}


💡 Verification agent

🧩 Analysis chain

Verify implementor naming (“OpenAIEmbbeder” vs “OpenAIEmbedder”)

Typo in class/file names can confuse DI and readers.

🏁 Script executed:

#!/bin/bash # Find embedder implementations and spot typos rg -nP --type=kotlin -C2 '(class|interface)\s+OpenAIEm[bB]ed{1,2}er|Embedder' | sed -n '1,120p'

Length of output: 2492

Rename file to match class name
The file name OpenAIEmbbeder.kt contains a typo that doesn’t align with the class it declares (OpenAIEmbedder). This mismatch can confuse readers and tooling.

• src/main/kotlin/simplerag/ragback/domain/index/embed/OpenAIEmbbeder.kt → rename to → OpenAIEmbedder.kt

🤖 Prompt for AI Agents

In src/main/kotlin/simplerag/ragback/domain/index/embed/OpenAIEmbbeder.kt around lines 3 to 6 the file name contains a typo and does not match the declared class OpenAIEmbedder; rename the file to OpenAIEmbedder.kt, update any imports/usages/references across the project to the new filename, and ensure package declarations and CI/build files (if referencing the path) still resolve correctly so the class name and filename are consistent.

🧹 Nitpick (assertive)

Add batch embedding to improve throughput; keep single-call default

Batching reduces network overhead and cost. Provide a default method to avoid breaking implementors.

Apply:

interface Embedder { val dim: Int fun embed(text: String): FloatArray + fun embedAll(texts: List<String>): List<FloatArray> = texts.map(::embed) }

Optionally consider making these suspend and handling upstream backpressure later.

🤖 Prompt for AI Agents

In src/main/kotlin/simplerag/ragback/domain/index/embed/Embedder.kt around lines 3 to 6, add a default batch embedding method to the interface while keeping the existing single-item embed to avoid breaking implementors: declare a new function (e.g., fun embedBatch(texts: List<String>): List<FloatArray>) with a default implementation that iterates over texts and calls the existing embed(text: String) for each entry and returns the list of embeddings; keep the existing val dim and fun embed as-is; optionally mark both signatures as suspend later if you intend to introduce asynchronous/backpressure handling upstream.

coderabbitai · 2025-08-22T04:34:38Z

src/main/kotlin/simplerag/ragback/domain/index/embed/OpenAIEmbbeder.kt

+package simplerag.ragback.domain.index.embed
+
+import org.springframework.ai.openai.OpenAiEmbeddingModel
+import org.springframework.stereotype.Component
+
+@Component
+class OpenAIEmbedder(
+    private val openAiEmbeddingModel: OpenAiEmbeddingModel
+) : Embedder {
+    override val dim: Int = 1536
+    override fun embed(text: String): FloatArray =
+        openAiEmbeddingModel.embed(text)
+}


🛠️ Refactor suggestion

Hard-coded dim=1536 will reject 3072-dim models; derive dimension from the configured model. Also, filename typo.

If the configured OpenAI model is text-embedding-3-large (3072), current dim=1536 forces INVALID_INPUT.

Consider inferring dim from the configured modelId (or a bean property) using your EmbeddingModel enum.

Minor: file name has a typo (“Embbeder”). Consider renaming to OpenAiEmbedder.kt for consistency.

Apply this diff to make dim consistent with configuration (adjust imports if package paths differ):

package simplerag.ragback.domain.index.embed import org.springframework.ai.openai.OpenAiEmbeddingModel +import org.springframework.beans.factory.annotation.Value import org.springframework.stereotype.Component +import simplerag.ragback.domain.index.entity.enums.EmbeddingModel @Component -class OpenAIEmbedder( - private val openAiEmbeddingModel: OpenAiEmbeddingModel +class OpenAIEmbedder( + private val openAiEmbeddingModel: OpenAiEmbeddingModel, + @Value("\${spring.ai.openai.embedding.options.model:text-embedding-3-small}") + private val modelId: String ) : Embedder { - override val dim: Int = 1536 + override val dim: Int = + EmbeddingModel.findByModelId(modelId)?.dim + ?: error("Unsupported embedding modelId: $modelId") override fun embed(text: String): FloatArray = openAiEmbeddingModel.embed(text) }

If you will support multiple models per index concurrently, consider a factory to return an Embedder bound to req.embeddingModel instead of a single global bean.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

package simplerag.ragback.domain.index.embed

import org.springframework.ai.openai.OpenAiEmbeddingModel

import org.springframework.stereotype.Component

@Component

class OpenAIEmbedder(

private val openAiEmbeddingModel: OpenAiEmbeddingModel

) : Embedder {

override val dim: Int = 1536

override fun embed(text: String): FloatArray =

openAiEmbeddingModel.embed(text)

}

package simplerag.ragback.domain.index.embed

import org.springframework.ai.openai.OpenAiEmbeddingModel

import org.springframework.beans.factory.annotation.Value

import org.springframework.stereotype.Component

import simplerag.ragback.domain.index.entity.enums.EmbeddingModel

@Component

class OpenAIEmbedder(

private val openAiEmbeddingModel: OpenAiEmbeddingModel,

@Value("\${spring.ai.openai.embedding.options.model:text-embedding-3-small}")

private val modelId: String

) : Embedder {

override val dim: Int =

EmbeddingModel.findByModelId(modelId)?.dim

?: error("Unsupported embedding modelId: $modelId")

override fun embed(text: String): FloatArray =

openAiEmbeddingModel.embed(text)

}

🤖 Prompt for AI Agents

In src/main/kotlin/simplerag/ragback/domain/index/embed/OpenAIEmbbeder.kt lines 1-13, the embedder hardcodes dim=1536 which will break for 3072-dim models and the filename/class name contains a typo; change the implementation to derive the embedding dimension from the configured model (inject the configured modelId or an EmbeddingModel/enum bean), map modelId to its dimension (e.g., 1536 for text-embedding-3-small, 3072 for text-embedding-3-large) and set override val dim accordingly with a sensible fallback and error if unknown, replace the misspelled filename/class name (OpenAIEmbbeder -> OpenAiEmbedder) to match project conventions and update imports/usages; optionally consider providing a factory if multiple model types will be supported concurrently.

coderabbitai · 2025-08-22T04:34:39Z

src/main/kotlin/simplerag/ragback/global/util/s3/S3UtilImpl.kt

    override fun urlFromKey(key: String): String =
-        s3.utilities()
-            .getUrl { it.bucket(bucket).key(key) }
-            .toExternalForm()
+        "https://mukit-s3.s3.ap-northeast-2.amazonaws.com/" + key


⚠️ Potential issue

Hard-coded S3 URL will break across environments and mishandles URL encoding

Using a fixed bucket/region string prevents deploy-time changes and can yield invalid URLs for keys with spaces or special chars. Let the AWS SDK construct the URL (handles region and encoding).

Apply:

- override fun urlFromKey(key: String): String = - "https://mukit-s3.s3.ap-northeast-2.amazonaws.com/" + key + override fun urlFromKey(key: String): String = + s3.utilities().getUrl { it.bucket(bucket).key(key) }.toExternalForm()

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

override fun urlFromKey(key: String): String =

s3.utilities()

.getUrl { it.bucket(bucket).key(key) }

.toExternalForm()

"https://mukit-s3.s3.ap-northeast-2.amazonaws.com/" + key

override fun urlFromKey(key: String): String =

s3.utilities().getUrl { it.bucket(bucket).key(key) }.toExternalForm()

🤖 Prompt for AI Agents

In src/main/kotlin/simplerag/ragback/global/util/s3/S3UtilImpl.kt around lines 60-61, the code returns a hard-coded S3 URL which embeds bucket/region and does not handle URL encoding; replace this with a call into the AWS SDK so the SDK builds a correct, encoded URL and you can avoid hard-coded bucket/region. Inject or read the configured bucket/region values from your app config, then call the S3 client utilities to build the URL (e.g., use S3Utilities.getUrl/GetUrlRequest or S3Client.utilities().getUrl(...) / S3Presigner for presigned URLs) with the bucket and key; return the SDK-produced URL string instead of concatenating strings so encoding and region resolution are handled correctly.

coderabbitai · 2025-08-22T04:34:39Z

src/main/kotlin/simplerag/ragback/global/util/TextChunker.kt

+object TextChunker {
+    fun chunkByCharsSeq(raw: String, size: Int, overlap: Int): Sequence<String> = sequence {


🧹 Nitpick (assertive)

Consider an optional word-boundary mode to reduce mid-word splits.

Character windows are fine for a first pass. If quality matters, add a whitespace-aware variant (e.g., try to extend to the next whitespace within a small slack).

I can add chunkByWordsSeq(...) with a small slack range if desired.

🤖 Prompt for AI Agents

In src/main/kotlin/simplerag/ragback/global/util/TextChunker.kt around lines 3-4, add a whitespace-aware variant (either an optional mode on chunkByCharsSeq or a new chunkByWordsSeq) that avoids splitting words: when computing each window end, allow a small "slack" (configurable Int) to look forward up to slack characters for the next whitespace and extend the end to that whitespace if found; ensure the produced chunk size does not exceed size+slack, preserve the requested overlap by computing the next window start = previousStart + size - overlap (or adjust start based on the extended end while maintaining overlap semantics), handle edge cases (no whitespace found within slack, end of string) by falling back to the original character-boundary end, and keep this as a Sequence<String> generator to match existing API.

coderabbitai · 2025-08-22T04:34:39Z

src/main/kotlin/simplerag/ragback/global/util/TextChunker.kt

+    private fun normalize(s: String): String =
+        s.replace("\r\n", "\n").replace("\r", "\n")
+            .replace(Regex("[ \t]+"), " ")
+            .replace(Regex("\\n{3,}"), "\n\n")
+            .trim()


🧹 Nitpick (assertive)

Avoid recompiling regexes on every call; precompile once.

Creating Regex instances per call adds overhead for large inputs. Hoist them to private val fields.

-object TextChunker { +object TextChunker { + private val SPACES = Regex("[ \\t]+") + private val MANY_NL = Regex("\\n{3,}") @@ - private fun normalize(s: String): String = - s.replace("\r\n", "\n").replace("\r", "\n") - .replace(Regex("[ \t]+"), " ") - .replace(Regex("\\n{3,}"), "\n\n") - .trim() + private fun normalize(s: String): String = + s.replace("\r\n", "\n").replace("\r", "\n") + .replace(SPACES, " ") + .replace(MANY_NL, "\n\n") + .trim()

🤖 Prompt for AI Agents

In src/main/kotlin/simplerag/ragback/global/util/TextChunker.kt around lines 22 to 26, the normalize function currently constructs Regex("[ \t]+") and Regex("\\n{3,}") on every call which is wasteful; hoist these regexes to precompiled private vals (either as private val fields in the class or inside a companion object/top-level private vals) like private val MULTI_WHITESPACE = Regex("[ \t]+") and private val MULTIPLE_BLANK_LINES = Regex("\\n{3,}") and then replace the inline Regex(...) usages in normalize with those precompiled vals so Regex objects are reused across calls.

coderabbitai · 2025-08-22T04:34:39Z

src/test/kotlin/simplerag/ragback/domain/document/service/DataFileServiceTest.kt

 @SpringBootTest
 @ActiveProfiles("test")
 class DataFileServiceTest(
-    private val dataFileService: DataFileService,
-    private val dataFileRepository: DataFileRepository,
-    private val tagRepository: TagRepository,
-    private val dataFileTagRepository: DataFileTagRepository,
-    private val s3Util: FakeS3Util
+    @Autowired val dataFileService: DataFileService,
+    @Autowired val dataFileRepository: DataFileRepository,
+    @Autowired val tagRepository: TagRepository,
+    @Autowired val dataFileTagRepository: DataFileTagRepository,
+    @Autowired val s3Util: FakeS3Util
 ) {


🧹 Nitpick (assertive)

Constructor injection in tests: add @Testconstructor(autowireMode = ALL) for clarity and portability.

Relying on @Autowired on constructor params alone can be fragile across Boot versions. Align with IndexServiceTest by enabling constructor autowiring explicitly.

@SpringBootTest @ActiveProfiles("test") +@org.springframework.test.context.TestConstructor(autowireMode = org.springframework.test.context.TestConstructor.AutowireMode.ALL) class DataFileServiceTest( - @Autowired val dataFileService: DataFileService, - @Autowired val dataFileRepository: DataFileRepository, - @Autowired val tagRepository: TagRepository, - @Autowired val dataFileTagRepository: DataFileTagRepository, - @Autowired val s3Util: FakeS3Util + @Autowired val dataFileService: DataFileService, + @Autowired val dataFileRepository: DataFileRepository, + @Autowired val tagRepository: TagRepository, + @Autowired val dataFileTagRepository: DataFileTagRepository, + @Autowired val s3Util: FakeS3Util )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

@SpringBootTest

@ActiveProfiles("test")

class DataFileServiceTest(

private val dataFileService: DataFileService,

private val dataFileRepository: DataFileRepository,

private val tagRepository: TagRepository,

private val dataFileTagRepository: DataFileTagRepository,

private val s3Util: FakeS3Util

@Autowired val dataFileService: DataFileService,

@Autowired val dataFileRepository: DataFileRepository,

@Autowired val tagRepository: TagRepository,

@Autowired val dataFileTagRepository: DataFileTagRepository,

@Autowired val s3Util: FakeS3Util

) {

@SpringBootTest

@ActiveProfiles("test")

@org.springframework.test.context.TestConstructor(

autowireMode = org.springframework.test.context.TestConstructor.AutowireMode.ALL

)

class DataFileServiceTest(

@Autowired val dataFileService: DataFileService,

@Autowired val dataFileRepository: DataFileRepository,

@Autowired val tagRepository: TagRepository,

@Autowired val dataFileTagRepository: DataFileTagRepository,

@Autowired val s3Util: FakeS3Util

) {

// …

}

🤖 Prompt for AI Agents

In src/test/kotlin/simplerag/ragback/domain/document/service/DataFileServiceTest.kt around lines 31 to 39, the test class relies on constructor @Autowired parameters but lacks the explicit Spring Boot test constructor autowiring annotation; add @TestConstructor(autowireMode = TestConstructor.AutowireMode.ALL) above the class declaration (and import org.springframework.test.context.TestConstructor) so constructor injection is explicit and consistent with IndexServiceTest.

coderabbitai · 2025-08-22T04:34:39Z

src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt

+    @Autowired val indexService: IndexService,
+    @Autowired val indexRepository: IndexRepository,


🧹 Nitpick (assertive)

Drop redundant @Autowired on constructor params (you already enabled constructor autowiring).

Since the class is annotated with @Testconstructor(autowireMode = ALL), the @Autowired annotations on the primary constructor parameters are unnecessary noise.

Apply this diff:

-class IndexServiceTest( - @Autowired val indexService: IndexService, - @Autowired val indexRepository: IndexRepository, +class IndexServiceTest( + val indexService: IndexService, + val indexRepository: IndexRepository, )

🤖 Prompt for AI Agents

In src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt around lines 30 to 31, the primary constructor parameters are redundantly annotated with @Autowired despite @TestConstructor(autowireMode = ALL) enabling constructor autowiring; remove the @Autowired annotations from the constructor parameters (keep the parameter types and names intact) so the test uses constructor injection provided by the class-level configuration and avoid unnecessary annotations.

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

build.gradle (1)
51-55: Remove duplicate vector-store dependency from test classpath

spring-ai-starter-vector-store-pgvector is already on the main classpath (Line 49). Duplicating it in tests can mask classpath issues and slow resolution.

Apply:
-    testImplementation 'org.springframework.ai:spring-ai-starter-vector-store-pgvector'

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4d7b409 and c930fa7.

📒 Files selected for processing (1)

build.gradle (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📓 Common learnings

Learnt from: catturtle123
PR: Simple-RAG/BACK#6
File: src/main/kotlin/simplerag/ragback/domain/index/entity/Index.kt:16-22
Timestamp: 2025-08-17T13:12:02.200Z
Learning: catturtle123이 Index 엔티티의 overlapSize ≤ chunkingSize 검증을 엔티티 레벨이 아닌 서비스 레이어에서 구현하기로 결정했으며, chunking embedding 저장 관련 PR에서 이 검증 로직을 추가할 예정이므로 해당 PR에서 언급해야 함 (Simple-RAG/BACK 프로젝트)

🔇 Additional comments (1)

build.gradle (1)
39-49: Sanity-check Spring AI BOM alignment

I see you’re importing the Spring AI BOM in build.gradle (lines 65–69) via
dependencyManagement {
    imports {
        mavenBom "org.springframework.ai:spring-ai-bom:${springAiVersion}"
    }
}
and you declare the two starters without explicit versions on lines 39 and 49 (plus a testImport at line 55):

implementation 'org.springframework.ai:spring-ai-starter-model-openai'

implementation 'org.springframework.ai:spring-ai-starter-vector-store-pgvector'

testImplementation 'org.springframework.ai:spring-ai-starter-vector-store-pgvector'

Please verify the following:

• Ensure springAiVersion is defined (for example, in gradle.properties).
• Set your JAVA_HOME to a valid JDK installation, then run:
./gradlew dependencies --configuration runtimeClasspath \
  | grep 'org.springframework.ai:spring-ai-starter'
• Confirm both starters resolve to version ${springAiVersion} (i.e. you see lines like
org.springframework.ai:spring-ai-starter-model-openai : ${springAiVersion}
and
org.springframework.ai:spring-ai-starter-vector-store-pgvector : ${springAiVersion}).

If you encounter any “unmanaged version” warnings or failures, bump springAiVersion to a release that includes both starters. Happy to help draft a compatibility matrix if needed.

coderabbitai · 2025-08-22T05:23:55Z

build.gradle

    implementation 'com.fasterxml.jackson.module:jackson-module-kotlin'
    implementation 'org.jetbrains.kotlin:kotlin-reflect'
-//    implementation 'org.springframework.ai:spring-ai-starter-model-openai'
+    implementation 'org.springframework.ai:spring-ai-starter-model-openai'


💡 Verification agent

🧩 Analysis chain

OpenAI starter added — ensure tests don’t require a real API key or make network calls

Adding spring-ai-starter-model-openai will create OpenAI beans at context startup. Make sure your test profile stubs or disables outbound calls so unit tests are hermetic.

Run this to verify there’s a test profile and it isolates OpenAI config:

If you want, I can provide a minimal application-test.yml that sets a dummy key and a @TestConfiguration that supplies a no-op EmbeddingModel bean for tests.

🏁 Script executed:

#!/bin/bash # 1) Check for test profile configs fd -a 'application-*.yml' -uu | sed -n 's#^#- #p' # 2) Look for OpenAI properties in test profile(s) rg -n --glob 'src/**/resources/**' -C2 'spring\.ai\.openai\.(api-key|base-url|embedding|chat)' # 3) Grep for any tests using real OpenAI clients rg -nP --type=kt --type=java -C2 '\bOpenAI|OpenAi|openai|EmbeddingModel|ChatClient'

Length of output: 317

🏁 Script executed:

#!/bin/bash # 1) Search for tests that bootstrap the Spring context (which would load OpenAI beans) rg -n '@SpringBootTest' -C2 --glob '**/*.java' --glob '**/*.kt' # 2) Search for any @TestConfiguration classes that might override the OpenAI beans rg -n '@TestConfiguration' -C2 --glob '**/*.java' --glob '**/*.kt' # 3) Confirm whether a test-specific application-test.yml exists fd -a 'application-test*.yml' -uu | sed -n 's#^#- #p'

Length of output: 1699

🏁 Script executed:

#!/bin/bash # Display contents of the test profile configuration sed -n '1,200p' src/test/resources/application-test.yml

Length of output: 235

Configure test profile to stub or disable OpenAI beans

Your application-test.yml (src/test/resources/application-test.yml) only configures the datasource and JPA—there are no overrides for spring.ai.openai.*, and no @TestConfiguration to stub out the OpenAI clients. Because your tests use @SpringBootTest with the “test” profile, they will attempt to instantiate the real OpenAI beans at startup, leading to missing-API-key errors or unwanted network calls.

• application-test.yml (lines 1–200) shows only:

spring: datasource: driver-class-name: org.postgresql.Driver jpa: database-platform: org.hibernate.dialect.PostgreSQLDialect hibernate: ddl-auto: create-drop

(no spring.ai.openai.* props)

• No @TestConfiguration or @MockBean was found to override EmbeddingModel, ChatCompletionModel, or the OpenAI client.
• Tests in src/test/kotlin/** annotated with @SpringBootTest + @ActiveProfiles("test") will load the real starter beans and fail (or hit the network).

To fix, you must:

Add dummy OpenAI props to your test profile:

spring: ai: openai: api-key: dummy base-url: http://localhost:8080 # …existing datasource/jpa…

Stub out the OpenAI beans, for example via a @TestConfiguration class in src/test/java or src/test/kotlin:

@TestConfiguration public class OpenAIMockConfig { @Bean EmbeddingModel embeddingModel() { return input -> Collections.emptyList(); } @Bean ChatCompletionModel chatCompletionModel() { return request -> ChatCompletionResponse.builder().choices(...).build(); } }

(Alternate) Use @MockBean in your test classes:

@SpringBootTest @ActiveProfiles("test") class YourServiceTest { @MockBean EmbeddingModel embeddingModel; @MockBean ChatCompletionModel chatCompletionModel; // … }

Let me know if you’d like a complete minimal example of application-test.yml plus test‐stub config.

🤖 Prompt for AI Agents

In build.gradle around line 39, the test profile doesn't override or stub OpenAI beans so @SpringBootTest with profile "test" will instantiate real OpenAI clients; add test-time configuration to prevent real API usage by (1) adding dummy OpenAI properties to src/test/resources/application-test.yml (spring.ai.openai.api-key and base-url) and (2) providing test-only beans that stub EmbeddingModel and ChatCompletionModel via a @TestConfiguration class in src/test/java or src/test/kotlin (or alternatively annotate tests with @MockBean for those interfaces) so tests run without network calls or missing-key errors.

coderabbitai · 2025-08-22T05:23:56Z

build.gradle

+    // postgresql
    implementation 'org.postgresql:postgresql'
+    implementation 'org.springframework.ai:spring-ai-starter-vector-store-pgvector'



🧹 Nitpick (assertive)

Prefer runtimeOnly for the JDBC driver to keep the compile classpath lean

The driver isn’t needed at compile time. Switching to runtimeOnly follows Spring Boot’s guidance and reduces classpath noise.

Apply:

- implementation 'org.postgresql:postgresql' + runtimeOnly 'org.postgresql:postgresql'

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// postgresql

implementation 'org.postgresql:postgresql'

implementation 'org.springframework.ai:spring-ai-starter-vector-store-pgvector'

// postgresql

runtimeOnly 'org.postgresql:postgresql'

implementation 'org.springframework.ai:spring-ai-starter-vector-store-pgvector'

🤖 Prompt for AI Agents

In build.gradle around lines 47 to 50, the PostgreSQL JDBC driver is declared as an implementation dependency even though it is only required at runtime; change the org.postgresql:postgresql dependency to runtimeOnly 'org.postgresql:postgresql' (leave the spring-ai-starter-vector-store-pgvector dependency as implementation) so the driver is not on the compile classpath and follows Spring Boot guidance.

catturtle123 added 2 commits August 21, 2025 21:58

🐛 Fix: Autowired 해결

6dd928e

✨ feature: indexing

fcd843b

catturtle123 self-assigned this Aug 22, 2025

catturtle123 added the feature New feature or request label Aug 22, 2025

catturtle123 linked an issue Aug 22, 2025 that may be closed by this pull request

✨ [Feature] 인덱스 임베딩 저장 #10

Closed

1 task

🚀 Chore: build.gradle 수정

4d7b409

coderabbitai bot reviewed Aug 22, 2025

View reviewed changes

🚀 Chore: build.gradle 수정

c930fa7

coderabbitai bot reviewed Aug 22, 2025

View reviewed changes

catturtle123 merged commit d991822 into develop Aug 22, 2025
1 check passed

catturtle123 deleted the feature/#10-indexing branch August 22, 2025 05:58

coderabbitai bot mentioned this pull request Aug 22, 2025

✨ Feature: add content extractor #20

Merged

4 tasks

-    @field:NotEmpty
-    val dataFileId: List<Long>,
+// Replace the import
+import jakarta.validation.constraints.Size
+// Within IndexRequestDTO.kt
+    @field:Size(min = 1, message = "최소 하나 이상의 파일 ID가 필요합니다")
+    val dataFileIds: List<Long>,

		object TextChunker {
		fun chunkByCharsSeq(raw: String, size: Int, overlap: Int): Sequence<String> = sequence {

		@Autowired val indexService: IndexService,
		@Autowired val indexRepository: IndexRepository,

✨ Feature: indexing #18

✨ Feature: indexing #18

Uh oh!

Conversation

catturtle123 commented Aug 22, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📌 Overview

🔍 Related Issues

✨ Changes

📸 Screenshots / Test Results (Optional)

✅ Checklist

🗒️ Additional Notes

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Aug 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Assessment against linked issues

Assessment against linked issues: Out-of-scope changes

Possibly related PRs

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Status, Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Aug 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 22, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Aug 22, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

catturtle123 commented Aug 22, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Aug 22, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)