@catturtle123 catturtle123 commented Aug 19, 2025

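📌 PR Overview

Fix the entity and reduce the number of models

🔍 Related Issues

✨ Changes

📸 Screenshots / Verification Results (optional)

Attach images or videos if needed.

✅ Checklist

  • The code follows the code style.
  • Unnecessary comments and console logs have been removed.
  • Related tests have been written or updated.
  • All features have been confirmed to work correctly in the local environment.

🗒️ Additional Notes

Add any other relevant information here.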

Summary by CodeRabbit

  • New Features

    • Store embeddings using Postgres’ native vector type for better performance and compatibility.
    • Added KO_SBERT_V1 embedding model option.
  • Refactor

    • Replaced custom embedding conversion with direct vector storage.
    • Streamlined the embedding model list by removing several legacy options.
  • Chores

    • Updated dependencies to enable vector support and adjusted PostgreSQL driver scope.

@catturtle123 catturtle123 self-assigned this Aug 19, 2025
@catturtle123 catturtle123 added the chore other work label Aug 19, 2025
@catturtle123 catturtle123 linked an issue Aug 19, 2025 that may be closed by this pull request
10 tasks
coderabbitai bot commented Aug 19, 2025

Walkthrough

This pull request modifies the project's PostgreSQL and pgvector integration by updating dependency configurations, refactoring the ChunkEmbedding entity to use PGvector directly, and pruning the EmbeddingModel enum.

Changes

Cohort / File(s) Change Summary
Gradle Dependencies
build.gradle
- Changed PostgreSQL JDBC driver from runtimeOnly to implementation
- Added pgvector dependency com.pgvector:pgvector:0.1.6
Entity Refactoring
src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt
- Replaced FloatArray embedding with direct PGvector type
- Removed converter-based mapping
- Added @Table annotation
- Simplified embedding field declaration
Enum Modification
src/main/kotlin/simplerag/ragback/domain/index/entity/enums/EmbeddingModel.kt
- Removed multiple embedding model entries
- Introduced KO_SBERT_V1
- Reduced overall model set
Converter Removal
src/main/kotlin/simplerag/ragback/global/util/FloatArrayToPgVectorStringConverter.kt
- Completely removed the FloatArray to PGvector string converter

Sequence Diagram

sequenceDiagram
    participant Entity as ChunkEmbedding
    participant Database as PostgreSQL
    participant PGVector as PGvector

    Entity->>PGVector: Direct embedding storage
    PGVector->>Database: Store vector type
    Database-->>PGVector: Native vector support

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Assessment against Linked Issues

Objective Addressed Explanation
Update Dependencies
Gradle Configuration Changes
Entity Modifications
No Runtime Behavior Changes

Out-of-Scope Changes

There are no out-of-scope changes detected in this pull request.

Possibly Related PRs

Poem

🐰 In the realm of vectors bright,
PostgreSQL dances with pure delight,
Embeddings now sleek and refined,
A rabbit's code, precisely designed!
Hop, hop, hooray for clean design! 🌟


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (1)

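35-43: Add dimension validation on load to detect data mismatches early

Because JPA injects values directly into fields, the dimension check in updateEmbedding only runs on the setter path. If a row with _embedding.size != embeddingDim is loaded from the DB into the entity, the mismatch is discovered late. A @PostLoad hook is recommended for early validation.

Example (add anywhere in the file):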

@PostLoad
private fun validateEmbeddingOnLoad() {
    require(_embedding.size == embeddingDim) {
        "Embedding dimension mismatch on load: expected=$embeddingDim, got=${_embedding.size}"
    }
}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: ASSERTIVE
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 55f2b92 and f0fd20e.

📒 Files selected for processing (5)
  • build.gradle (1 hunks)
  • src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (2 hunks)
  • src/main/kotlin/simplerag/ragback/domain/index/entity/enums/EmbeddingModel.kt (1 hunks)
  • src/main/kotlin/simplerag/ragback/global/util/FloatArrayToPgVectorConverter.kt (1 hunks)
  • src/main/kotlin/simplerag/ragback/global/util/FloatArrayToPgVectorStringConverter.kt (0 hunks)
💤 Files with no reviewable changes (1)
  • src/main/kotlin/simplerag/ragback/global/util/FloatArrayToPgVectorStringConverter.kt
🧰 Additional context used
🧬 Code Graph Analysis (2)
src/main/kotlin/simplerag/ragback/global/util/FloatArrayToPgVectorConverter.kt (1)
src/main/kotlin/simplerag/ragback/global/util/FloatArrayToPgVectorStringConverter.kt (9)
  • autoApply (6-34)
  • convertToEntityAttribute (17-33)
  • convertToDatabaseColumn (8-15)
  • { "Embedding (FloatArray) must not be null" } (9-9)
  • { (26-28)
  • arr (25-29)
  • { (11-13)
  • { "Embedding must not be empty; expected fixed dimension (e.g., 1536)" } (10-10)
  • it (14-14)
src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (1)
src/main/kotlin/simplerag/ragback/global/util/FloatArrayToPgVectorStringConverter.kt (2)
  • autoApply (6-34)
  • { "Embedding (FloatArray) must not be null" } (9-9)
🔇 Additional comments (5)
build.gradle (1)

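47-47: Using PGobject requires a compile-time dependency, so the scope change is justified

Because FloatArrayToPgVectorConverter references org.postgresql.util.PGobject directly, promoting the postgresql driver to implementation is reasonable. The Spring Boot BOM manages the version, so pinning a version separately is unnecessary.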

src/main/kotlin/simplerag/ragback/global/util/FloatArrayToPgVectorConverter.kt (2)

9-13: Input validation is appropriate

The logic that blocks null/empty input and NaN/Infinity is clear. The runtime exception messages are also sufficiently descriptive.


14-21: The PGobject mapping approach is appropriate

Passing a PGobject with type="vector" to the pgvector column works reliably in Hibernate together with SqlTypes.OTHER. The string building also keeps unnecessary object creation to a minimum.
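As a rough illustration of the validation and string building praised above, here is a minimal sketch. It is not the PR's actual converter: the function name toVectorLiteral is hypothetical, and it only produces the text literal (the value one would place in a PGobject of type "vector"), so it runs without the PostgreSQL driver on the classpath.

```kotlin
// Hypothetical helper (not from the PR): formats a FloatArray as a pgvector
// text literal such as "[1.0,0.5,-2.0]".
fun toVectorLiteral(values: FloatArray): String {
    // Mirror the checks described in the review: no empty input, no NaN/Infinity.
    require(values.isNotEmpty()) { "Embedding must not be empty" }
    require(values.all { it.isFinite() }) { "NaN/Infinity not allowed" }

    // One StringBuilder avoids creating an intermediate String per element.
    val sb = StringBuilder(values.size * 8).append('[')
    for (i in values.indices) {
        if (i > 0) sb.append(',')
        sb.append(values[i])
    }
    return sb.append(']').toString()
}
```

In a converter, the returned string would typically become the value of the PGobject; that wiring is omitted here because it depends on the driver.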

src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (1)

18-21: Leave the vector dimension unspecified (vector) and switch the test environment to Postgres (required)

To support EmbeddingModel entries with different dimensions (768/1024/1536, etc.), the definition currently fixed at vector(3072) must be changed to vector. Because the H2-based test environment fails on this DDL, the tests also need to move to PostgreSQL Testcontainers.

Where to change:

  • src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt – line 20
    - @Column(name = "embedding", nullable = false, columnDefinition = "vector(3072)")
    + @Column(name = "embedding", nullable = false, columnDefinition = "vector")
  • build.gradle
    • Remove testRuntimeOnly("com.h2database:h2")
    • Add Testcontainers dependencies such as org.testcontainers:postgresql
  • src/test/resources/application-test.yml
    • Remove the H2 settings and replace them with a PostgreSQL Testcontainers URI or properties
⛔ Skipped due to learnings
Learnt from: catturtle123
PR: Simple-RAG/BACK#6
File: src/main/kotlin/simplerag/ragback/domain/index/entity/Index.kt:16-22
Timestamp: 2025-08-17T13:12:02.200Z
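Learning: catturtle123 decided to implement the overlapSize ≤ chunkingSize validation for the Index entity in the service layer rather than at the entity level, and plans to add this validation logic in the PR for storing chunking embeddings, so it should be raised in that PR (Simple-RAG/BACK project)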
src/main/kotlin/simplerag/ragback/domain/index/entity/enums/EmbeddingModel.kt (1)

15-17: Adding KO_SBERT_V1 is appropriate

This secures a 768-dimension option while keeping the Korean models to a minimal set. It does not appear to affect usage elsewhere in the code.

Comment on lines +18 to +23
// BGE
BGE_BASE_EN(768, "BAAI/bge-base-en-v1.5"),
BGE_LARGE_EN(1024, "BAAI/bge-large-en-v1.5"),
BGE_M3(1024, "BAAI/bge-m3"),
E5_SMALL(384, "intfloat/e5-small-v2"),
E5_BASE(768, "intfloat/e5-base-v2"),
E5_LARGE(1024, "intfloat/e5-large-v2"),

// Old word vectors
FASTTEXT_KO(300, "fasttext-ko-300d");
// E5
E5_BASE(768, "intfloat/e5-base-v2");
💡 Verification agent

🧩 Analysis chain

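Check for missing references and compile errors caused by the model reduction

Several constants were removed. Any remaining legacy references can cause compilation failures or missing branches at runtime. I suggest an automated check script.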


🏁 Script executed:

#!/bin/bash
set -euo pipefail

REMOVED=(
  TEXT_EMBEDDING_3_LARGE
  MULTI_QA_MP_NET_BASE_DOT_V1
  PARAPHRASE_MULTILINGUAL_MINILM_L12_V2
  KOR_SROBERTA
  BM_KO_SMALL
  INSTRUCTOR_BASE
  INSTRUCTOR_XL
  MISTRAL_EMBED
  BGE_SMALL_EN
  BGE_LARGE_EN
  E5_SMALL
  E5_LARGE
  FASTTEXT_KO
)

echo "▶ Searching for references to removed constants"
for sym in "${REMOVED[@]}"; do
  rg -n -C2 -P "\bEmbeddingModel\.$sym\b" || true
done

echo
echo "▶ Also checking for direct use of the model ID strings (optional)"
for sym in "${REMOVED[@]}"; do
  rg -n -C2 -F "$sym" || true
done

Length of output: 14631


Test code: references to the removed model constant (TEXT_EMBEDDING_3_LARGE) need replacing

EmbeddingModel.TEXT_EMBEDDING_3_LARGE has been deleted, so the tests below fail to compile. Replace it with a suitable remaining constant (for example, EmbeddingModel.E5_BASE or BGE_BASE_EN) and verify that the test logic still behaves as intended.

  • Files to fix:

    • src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt
      References of the form IndexCreateRequest(..., EmbeddingModel.TEXT_EMBEDDING_3_LARGE, ...) (lines 37, 54, 76, 85, 109, 139, 161, 196, 221, 247, 271, and others)
  • Example change:

    - IndexCreateRequest("test", 1, 0, SimilarityMetric.COSINE, 1, EmbeddingModel.TEXT_EMBEDDING_3_LARGE, true)
    + IndexCreateRequest("test", 1, 0, SimilarityMetric.COSINE, 1, EmbeddingModel.E5_BASE,         true)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// BGE
BGE_BASE_EN(768, "BAAI/bge-base-en-v1.5"),
BGE_LARGE_EN(1024, "BAAI/bge-large-en-v1.5"),
BGE_M3(1024, "BAAI/bge-m3"),
E5_SMALL(384, "intfloat/e5-small-v2"),
E5_BASE(768, "intfloat/e5-base-v2"),
E5_LARGE(1024, "intfloat/e5-large-v2"),
// Old word vectors
FASTTEXT_KO(300, "fasttext-ko-300d");
// E5
E5_BASE(768, "intfloat/e5-base-v2");
++ b/src/test/kotlin/simplerag/ragback/domain/index/service/IndexServiceTest.kt
@@ -37,7 +37,7 @@ class IndexServiceTest {
//
IndexCreateRequest("test", 1, 0, SimilarityMetric.COSINE, 1, EmbeddingModel.E5_BASE, true)
//
}

Comment on lines 24 to 28
override fun convertToEntityAttribute(dbData: PGobject?): FloatArray {
requireNotNull(dbData) { "db vector is null" }
val body = dbData.value?.trim()?.removePrefix("[")?.removeSuffix("]") ?: error("empty vector")
return body.split(',').map { it.trim().toFloat() }.toFloatArray()
}
🛠️ Refactor suggestion

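Strengthen handling of empty vectors and parse errors (including re-validating NaN/Infinity)

The current implementation:

  • throws NumberFormatException from toFloat() after split(',') when it encounters an empty vector string such as [].
  • gives a vague exception message on parse failure and does not re-validate NaN/Infinity after parsing.

Please add defensive parsing and clear exceptions, as follows.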

-    override fun convertToEntityAttribute(dbData: PGobject?): FloatArray {
-        requireNotNull(dbData) { "db vector is null" }
-        val body = dbData.value?.trim()?.removePrefix("[")?.removeSuffix("]") ?: error("empty vector")
-        return body.split(',').map { it.trim().toFloat() }.toFloatArray()
-    }
+    override fun convertToEntityAttribute(dbData: PGobject?): FloatArray {
+        requireNotNull(dbData) { "db vector is null" }
+        val raw = dbData.value ?: throw IllegalArgumentException("db vector has null value")
+        val body = raw.trim().removePrefix("[").removeSuffix("]").trim()
+        if (body.isEmpty()) {
+            // Defensive: avoid NumberFormatException on "[]"
+            return floatArrayOf()
+        }
+        val arr = try {
+            body.split(',')
+                .map { it.trim().toFloat() }
+                .toFloatArray()
+        } catch (e: NumberFormatException) {
+            throw IllegalArgumentException("Invalid vector literal for pgvector: '$raw'", e)
+        }
+        require(arr.all { it.isFinite() }) { "NaN/Infinity not allowed (db → entity)" }
+        return arr
+    }
🤖 Prompt for AI Agents
In
src/main/kotlin/simplerag/ragback/global/util/FloatArrayToPgVectorConverter.kt
around lines 24 to 28, replace the current naive parsing with defensive parsing:
treat "[]" or an empty trimmed body as an empty FloatArray; split only when
non-empty and skip any extra empty tokens; convert each token to Float in a
try/catch that throws a clear IllegalArgumentException indicating the offending
token and its index on parse failure; after conversion, validate each float to
ensure it is finite (not NaN or Infinity) and throw an IllegalArgumentException
indicating the index and invalid value if found; preserve existing null checks
on dbData and provide explicit messages for null/empty cases.
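The stricter variant described in the prompt above (reporting the offending token and its index) can be sketched as a standalone function. This is an illustrative assumption, not the code that was merged: parseVectorLiteral is a hypothetical name, and it takes the raw string rather than a PGobject so that it runs without the driver.

```kotlin
// Hypothetical standalone parser for a pgvector text literal such as "[1.0,2.5,-3]".
fun parseVectorLiteral(raw: String?): FloatArray {
    requireNotNull(raw) { "db vector has null value" }
    val body = raw.trim().removePrefix("[").removeSuffix("]").trim()
    // "[]" or a blank body becomes an empty array instead of a NumberFormatException.
    if (body.isEmpty()) return floatArrayOf()

    // Skip stray empty tokens, as the prompt suggests; indexes below therefore
    // refer to positions among the non-empty tokens.
    val tokens = body.split(',').map { it.trim() }.filter { it.isNotEmpty() }
    val out = FloatArray(tokens.size)
    for ((i, token) in tokens.withIndex()) {
        val value = token.toFloatOrNull()
            ?: throw IllegalArgumentException("Invalid vector element at index $i: '$token'")
        // "NaN" and "Infinity" parse successfully, so they need an explicit check.
        require(value.isFinite()) { "Non-finite vector element at index $i: $value" }
        out[i] = value
    }
    return out
}
```

Silently skipping empty tokens changes the resulting length, so a caller that also tracks an expected dimension may prefer to reject such input instead.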

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
build.gradle (2)

34-39: Duplicate dependency: jackson-module-kotlin is declared twice.

Remove one of the duplicates to avoid noise and potential resolution ambiguity.

-    implementation("com.fasterxml.jackson.module:jackson-module-kotlin")
...
-    implementation 'com.fasterxml.jackson.module:jackson-module-kotlin'
+    implementation 'com.fasterxml.jackson.module:jackson-module-kotlin'

73-77: Redundant allOpen block when using kotlin-jpa plugin.

The kotlin-jpa plugin already configures all-open for JPA. Keeping both is harmless but redundant. Consider removing the manual block for simplicity.

-allOpen {
-    annotation 'jakarta.persistence.Entity'
-    annotation 'jakarta.persistence.MappedSuperclass'
-    annotation 'jakarta.persistence.Embeddable'
-}
src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (2)

12-15: Consider lazy-loading LOB content.

Large content fields often don’t need to be fetched with every embedding read. Mark as LAZY to reduce I/O if your access patterns allow.

-    @Lob
+    @Lob
+    @Basic(fetch = FetchType.LAZY)
     val content: String,

Note: Validate that your JPA provider and usage patterns truly defer LOB loading; some providers may still fetch eagerly depending on dialect/settings.


19-21: Action Required: Enforce embeddingDim Invariant at Entity Boundary

A quick grep (rg -nP 'new\s+ChunkEmbedding\(|ChunkEmbedding\(' -g 'src/**') found no direct constructor calls outside of the class declaration itself. This means instantiation (and persistence) happens solely through JPA, so service-layer checks are your only current safeguard—but they’re easy to bypass in the future. To prevent bad data ingestion, add a self-defensive lifecycle validator on the entity:

Locations to update:

  • File: src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt
    • Add the following methods in the class body (outside existing property/constructor declarations):
 class ChunkEmbedding(
     @Column(name = "embedding", columnDefinition = "vector", nullable = false)
     val embedding: PGvector,
 
     @Column(name = "embedding_dim", nullable = false)
     val embeddingDim: Int,
 
     // … other fields …
 ) {
+    @PrePersist
+    @PreUpdate
+    fun validateEmbedding() {
+        val vecSize = embedding.toArray()?.size ?: -1
+        require(vecSize == embeddingDim) {
+            "Invalid embeddingDim: expected $embeddingDim, but got PGvector of length $vecSize"
+        }
+    }
 }

This ensures that every persist or update operation will immediately fail if the embeddingDim doesn’t match the actual vector length, guarding your database against mismatched data.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between f0fd20e and a307fcd.

📒 Files selected for processing (2)
  • build.gradle (2 hunks)
  • src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (1)
src/main/kotlin/simplerag/ragback/global/util/FloatArrayToPgVectorStringConverter.kt (6)
  • { "Embedding (FloatArray) must not be null" } (9-9)
  • autoApply (6-34)
  • { "Embedding must not be empty; expected fixed dimension (e.g., 1536)" } (10-10)
  • { (26-28)
  • convertToEntityAttribute (17-33)
  • { (11-13)
🔇 Additional comments (5)
build.gradle (2)

47-47: Switching the PostgreSQL driver to implementation is appropriate here.

Because the codebase now references PGobject-derived types (via PGvector), the driver must be on the compile classpath. Keeping it as runtimeOnly would risk compilation/linkage issues.


2-6: Kotlin plugin versions align with Spring Boot BOM

  • Verified that Spring Boot 3.5.4’s dependency-management BOM declares <kotlin.version>1.9.25</kotlin.version>, matching all Kotlin plugin versions in build.gradle.
  • No other Kotlin versions (1.x or 2.x) were found elsewhere in the codebase.
  • Future-proofing: if you upgrade to Kotlin 2.x, remember to bump every Kotlin plugin declaration in lockstep.
src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt (3)

16-18: Annotate PGvector field with JdbcTypeCode and enforce NOT NULL

We ran a search for any existing converters or @Convert annotations targeting PGvector and found none. It’s safe to add the explicit JDBC type mapping and non-null constraint.

• File: src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt
• Lines: 16–18

Suggested change:

 import com.pgvector.PGvector
+import org.hibernate.annotations.JdbcTypeCode
+import org.hibernate.type.SqlTypes

     @Column(name = "embedding", columnDefinition = "vector")
-    var embedding: PGvector,
+    @JdbcTypeCode(SqlTypes.OTHER)
+    @Column(name = "embedding", columnDefinition = "vector", nullable = false)
+    var embedding: PGvector,

This ensures Hibernate uses PreparedStatement#setObject for the vector type and enforces a NOT NULL constraint at the database level.


9-10: Action required: confirm pgvector dimension and index strategy

  • The ChunkEmbedding entity at
    src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt
    declares
    @Column(name = "embedding", columnDefinition = "vector"), which omits a fixed dimension.
  • pgvector’s docs require a fixed size (e.g., vector(768) or vector(1536)) to support IVFFlat/HNSW indexes.
  • No migrations were found under src/main/resources/db/migration (or elsewhere) that:
    • Define a fixed-dimension vector column
    • Create an IVFFlat or HNSW index
    • Install the vector extension explicitly

Please verify your database schema or migration scripts to ensure you:

  • Set a fixed vector dimension per table (for example, columnDefinition = "vector(768)")
  • Create the appropriate similarity index (IVFFlat/HNSW) once the dimension is fixed

If you plan to run large-scale similarity searches, standardizing on a single embedding size (or splitting tables by model/dimension) is crucial for efficient indexing.
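One way to picture the "split tables by model/dimension" option is a small helper that emits the DDL for a given dimension. Everything here is an assumption for illustration: the function name, the chunk_embedding_<dim> table naming, and the choice of an HNSW index with cosine distance are not from the PR. The upper bound of 2,000 reflects my understanding of pgvector's documented limit for indexing the vector type and should be checked against the pgvector version in use.

```kotlin
// Hypothetical helper: builds DDL statements for one fixed-dimension embedding table.
fun chunkTableDdl(dim: Int): List<String> {
    // Assumption: pgvector can only index `vector` columns up to 2,000 dimensions.
    require(dim in 1..2000) { "Dimension $dim is outside the assumed indexable range 1..2000" }

    // Assumed naming scheme: one table per embedding dimension.
    val table = "chunk_embedding_$dim"
    return listOf(
        "CREATE EXTENSION IF NOT EXISTS vector",
        "CREATE TABLE IF NOT EXISTS $table (id BIGSERIAL PRIMARY KEY, embedding vector($dim) NOT NULL)",
        "CREATE INDEX IF NOT EXISTS idx_${table}_hnsw ON $table USING hnsw (embedding vector_cosine_ops)"
    )
}
```

For a 768-dimension model, chunkTableDdl(768) yields a vector(768) column plus its index statement; in practice these statements would more likely live in versioned migration files than be generated at runtime.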


22-24: Naming consistency confirmed – no change needed

The Index entity is mapped to the table named "indexes", and its primary key column is explicitly annotated as @Column(name = "indexes_id"). All related entities (ChunkEmbedding, DataFileIndex, and Model) correctly use @JoinColumn(name = "indexes_id"), matching the table and column naming. No adjustments are required here.

Comment on lines +47 to 49
implementation 'org.postgresql:postgresql' // ← changed
testRuntimeOnly("com.h2database:h2")

💡 Verification agent

🧩 Analysis chain

H2 in tests will break with the pgvector type; switch to PostgreSQL Testcontainers.

The JPA columnDefinition "vector" won’t exist in H2, so schema generation and queries involving the embedding column will fail under the current test profile. Replace H2 with Testcontainers-based PostgreSQL for tests and ensure the extension is enabled.

Apply this diff to dependencies:

-    testRuntimeOnly("com.h2database:h2")
+    // Prefer PostgreSQL Testcontainers for pgvector
+    testImplementation 'org.testcontainers:postgresql'
+    testImplementation 'org.testcontainers:junit-jupiter'

And add a Flyway migration (outside this file) to enable the extension during tests and dev:

-- src/main/resources/db/migration/V1__enable_pgvector.sql
CREATE EXTENSION IF NOT EXISTS vector;

Run this to find any H2-specific configs that need removal/migration in test resources:


🏁 Script executed:

#!/bin/bash
rg -nC2 -i 'h2|jdbc:h2|spring\.datasource\..*h2|driver.*h2' --glob 'src/**/resources/**'

Length of output: 794


Replace H2 with PostgreSQL Testcontainers for Tests

Your test setup still relies on H2—even though you’ve switched your main runtime to PostgreSQL—so any “vector” columns (pgvector) won’t exist in H2 and will cause schema-generation or query failures. I found H2 config in your test resources:

  • src/test/resources/application-test.yml
    url: jdbc:h2:mem:simplerag;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;…
    driver-class-name: org.h2.Driver
    jpa.database-platform: org.hibernate.dialect.H2Dialect
    hibernate.ddl-auto: create-drop

Please update as follows:

• In build.gradle, replace the H2 test runtime with Testcontainers dependencies:

-    testRuntimeOnly("com.h2database:h2")
+    // Use PostgreSQL Testcontainers for pgvector support
+    testImplementation "org.testcontainers:postgresql"
+    testImplementation "org.testcontainers:junit-jupiter"

• In src/test/resources/application-test.yml, remove all H2 settings and configure a Testcontainers-backed PostgreSQL datasource. For example:

-  datasource:
-    url: jdbc:h2:mem:simplerag;MODE=PostgreSQL;DB_CLOSE_DELAY=-1;DB_CLOSE_ON_EXIT=FALSE
-    driver-class-name: org.h2.Driver
-    username: sa
-    password:
-  jpa:
-    database-platform: org.hibernate.dialect.H2Dialect
-    hibernate:
-      ddl-auto: create-drop
+  datasource:
+    # PostgreSQL Testcontainer will supply URL, driver, and credentials
+    driver-class-name: org.postgresql.Driver
+  jpa:
+    database-platform: org.hibernate.dialect.PostgreSQLDialect
+    hibernate:
+      ddl-auto: none

• Add a Flyway migration to enable pgvector in your test (and dev) databases:

-- src/main/resources/db/migration/V1__enable_pgvector.sql
CREATE EXTENSION IF NOT EXISTS vector;

With these changes, your tests will spin up a real PostgreSQL instance (with pgvector support) instead of H2, ensuring your schema and queries remain compatible.
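A test class wired to such a container might look like the sketch below. It rests on several assumptions: Testcontainers 1.x and Spring Boot's test support are on the test classpath, Docker is available, the class name PgvectorContextTest is hypothetical, and the pgvector/pgvector:pg16 image tag is one choice among several. It starts a single shared container and feeds its connection details to Spring through @DynamicPropertySource.

```kotlin
import org.junit.jupiter.api.Test
import org.springframework.boot.test.context.SpringBootTest
import org.springframework.test.context.ActiveProfiles
import org.springframework.test.context.DynamicPropertyRegistry
import org.springframework.test.context.DynamicPropertySource
import org.testcontainers.containers.PostgreSQLContainer
import org.testcontainers.utility.DockerImageName

@SpringBootTest
@ActiveProfiles("test")
class PgvectorContextTest { // hypothetical class name

    companion object {
        // Assumed image: a PostgreSQL build that ships the pgvector extension.
        private val postgres = PostgreSQLContainer<Nothing>(
            DockerImageName.parse("pgvector/pgvector:pg16")
                .asCompatibleSubstituteFor("postgres")
        ).apply { start() } // started once and shared by all tests in this class

        @JvmStatic
        @DynamicPropertySource
        fun datasourceProperties(registry: DynamicPropertyRegistry) {
            // Override the datasource settings with the container's values.
            registry.add("spring.datasource.url") { postgres.jdbcUrl }
            registry.add("spring.datasource.username") { postgres.username }
            registry.add("spring.datasource.password") { postgres.password }
        }
    }

    @Test
    fun contextLoads() {
        // Passes if the application context starts against the container.
    }
}
```

The extension itself still has to be created in the container's database, for example by the migration mentioned above, before any vector column can be used.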

🤖 Prompt for AI Agents
In build.gradle around lines 47-49, the project still declares H2 as the test
runtime which breaks pgvector-dependent schema and queries; replace the H2
testRuntimeOnly entry with the Testcontainers PostgreSQL and Testcontainers
JUnit/Jupiter dependencies (and the Testcontainers BOM if used) so tests run
against a real PostgreSQL container. Then update
src/test/resources/application-test.yml to remove the H2
URL/driver/dialect/ddl-auto settings and instead configure datasource values
that will be populated from Testcontainers (e.g., property placeholders or
system/env properties set by your test harness) or use Spring’s Testcontainers
support to supply the JDBC URL, username and password at test runtime. Finally
add a Flyway migration file under src/main/resources/db/migration (e.g.,
V1__enable_pgvector.sql) that runs CREATE EXTENSION IF NOT EXISTS vector; so the
pgvector extension exists in both dev and test databases.

Comment on lines +57 to +58
// pgvector
implementation("com.pgvector:pgvector:0.1.6")
🧹 Nitpick (assertive)

pgvector dependency added: ensure Hibernate type mapping and DB extension are in place.

Adding com.pgvector:pgvector is necessary, but by itself Hibernate won’t know how to bind PGvector unless you map it as OTHER (or provide a custom type/AttributeConverter). See my entity comment for @JdbcTypeCode(SqlTypes.OTHER). Also make sure a Flyway/Liquibase migration creates the vector extension.

I can add a minimal custom type or converter if you prefer that approach over @JdbcTypeCode.

🤖 Prompt for AI Agents
In build.gradle around lines 57-58, you added the pgvector dependency but
Hibernate won’t bind PGvector automatically; update the entity mapping to use
@JdbcTypeCode(SqlTypes.OTHER) or implement and register a custom Hibernate Type
/ JPA AttributeConverter for pgvector to ensure correct JDBC binding, and add a
Flyway or Liquibase migration that runs CREATE EXTENSION IF NOT EXISTS vector;
to ensure the DB has the pgvector extension before migrations/usage.

@@ -1,8 +1,8 @@
package simplerag.ragback.domain.index.entity

import com.pgvector.PGvector
💡 Verification agent

🧩 Analysis chain

Ensure the pgvector extension exists in all environments.

Add a migration to run CREATE EXTENSION IF NOT EXISTS vector and verify permissions in non-dev DBs. Without it, application startup will fail on schema validation.

I can draft the Flyway migration and an index creation script tailored to your chosen dimension.


🏁 Script executed:

#!/bin/bash
# Confirm presence of extension-enabling migration
rg -n 'CREATE EXTENSION IF NOT EXISTS vector' src/main/resources/db/migration || echo "Missing migration for pgvector extension"

Length of output: 242


Add a Flyway migration to enable the pgvector extension

We’ve confirmed that there is no existing Flyway migration for CREATE EXTENSION IF NOT EXISTS vector (the db/migration folder is missing entirely), so application startup will fail schema validation in environments where the vector extension isn’t already present.

• Create a new SQL migration under src/main/resources/db/migration/, for example
V2__enable_vector_extension.sql

-- enable PostgreSQL vector extension for embedding support
CREATE EXTENSION IF NOT EXISTS vector;

• (Optional) Immediately follow with an index creation, e.g.

-- index on embedding vector column for fast similarity searches
CREATE INDEX IF NOT EXISTS idx_chunk_embedding_vector
  ON chunks USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

• Ensure non-dev databases have sufficient privileges to run CREATE EXTENSION. Update your deployment docs or grant scripts accordingly.

Once the migration is in place, Flyway will apply it automatically before your application’s schema validation check.

🤖 Prompt for AI Agents
In src/main/kotlin/simplerag/ragback/domain/index/entity/ChunkEmbedding.kt
around line 3, there is an import for PGvector but no Flyway migration to enable
the PostgreSQL pgvector extension which causes startup schema validation
failures in environments missing the extension; add a SQL migration file under
src/main/resources/db/migration/, e.g. V2__enable_vector_extension.sql,
containing a CREATE EXTENSION IF NOT EXISTS vector; statement (optionally
followed by a CREATE INDEX IF NOT EXISTS ... for the embedding column using
ivfflat and vector_l2_ops), and update deployment/DB provisioning docs or grant
scripts so non-dev DBs have the privileges to run CREATE EXTENSION.

@catturtle123 catturtle123 merged commit 8148871 into develop Aug 21, 2025
1 check passed
@catturtle123 catturtle123 deleted the chore/#13-pgvector-fix branch August 21, 2025 08:25
@coderabbitai coderabbitai bot mentioned this pull request Aug 22, 2025
4 tasks
Labels

chore other work

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🚀 [Chore] Fix pgvector entity

2 participants