[feat] Add extraction result schema and worker extraction interface #2

@KyungminPark-steck

📖 Background

The processing job result must include LLM-based extraction results.

Ultimately, the pipeline should extract the store name, address, supporting sentence, a confidence-style certainty value, and similar fields from Instagram post/Reels captions, and then connect that to the HF endpoint call, a fallback model, DB storage, and the API response.

Before that, the API response field, the internal domain model, and the interface of the extraction step in the worker pipeline need to be settled first, so that development can continue on a structure that is testable without a real LLM endpoint.

🎯 Goal

Add the schema and domain models so that the job result response can include extraction_result.

Add a mockable extraction client interface to the worker pipeline so that the flow of the extraction step can be tested before the real HF endpoint is connected.
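A minimal sketch of what a mockable extraction client interface could look like, using only the standard library. ExtractionResult and ExtractionCertainty are named in this issue, but their fields, the certainty levels, the `ExtractionClient` protocol, its `extract` method, and `FakeExtractionClient` are assumptions made for illustration, not the actual implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Protocol


class ExtractionCertainty(str, Enum):
    # Hypothetical levels; the issue only names the type.
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class ExtractionResult:
    # Field names are assumptions based on the items listed in the background.
    store_name: Optional[str]
    address: Optional[str]
    evidence_text: Optional[str]
    certainty: ExtractionCertainty


class ExtractionClient(Protocol):
    """Anything with a matching extract() can be used by the worker."""

    def extract(self, caption: str) -> Optional[ExtractionResult]:
        ...


class FakeExtractionClient:
    """Test double that records captions and returns a canned result."""

    def __init__(self, result: Optional[ExtractionResult] = None) -> None:
        self.result = result
        self.captions: list[str] = []

    def extract(self, caption: str) -> Optional[ExtractionResult]:
        self.captions.append(caption)
        return self.result
```

Because the protocol is structural, the fake does not need to inherit from anything. A test can pass it wherever the worker expects an extraction client and then inspect `captions` to check what was sent.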

🛠️ Implementation

  • Add an extraction_result field to the API result response schema
  • Add domain models for the extraction result
    • ExtractionResult
    • ExtractionCertainty
  • Add a schema for parsing the LLM response
    • ExtractionLLMResponse
  • Add an extraction client interface to the worker processor
    • Call the extraction client when a caption is present
    • On extraction failure, do not fail the whole job; the existing crawl result can still be saved
  • Add a draft HF extraction client
    • Call structure based on endpoint URL/token
    • Extract the generated text from the HF response payload
    • Parse a JSON object from the generated text
    • Convert to the domain model after schema validation
  • Add settings
    • HF_EXTRACTION_ENDPOINT_URL
    • HF_EXTRACTION_API_TOKEN
    • HF_EXTRACTION_MODEL_NAME
    • HF_EXTRACTION_TIMEOUT_SECONDS
    • HF_EXTRACTION_MAX_NEW_TOKENS
  • Add related tests and update existing tests
    • Extraction schema validation
    • HF extraction client response parsing
    • Verify that the worker processor passes the caption to the extraction client
    • Verify that the job result response can include extraction_result
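The parsing flow of the draft HF client (generated text, then JSON object, then validation, then domain model) could be sketched roughly as below. This assumes the endpoint returns the common text-generation shape, a list of objects with a `generated_text` key. The function names, `ExtractionParseError`, the field names, and the allowed certainty values are all hypothetical. Validation is hand-written here to keep the sketch dependency-free; the real ExtractionLLMResponse schema may do this with a validation library instead.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


class ExtractionParseError(ValueError):
    """Raised when the endpoint response cannot be turned into a result."""


@dataclass(frozen=True)
class ExtractionResult:
    # Field names are assumptions; certainty is kept as a plain string here.
    store_name: Optional[str]
    address: Optional[str]
    evidence_text: Optional[str]
    certainty: str


ALLOWED_CERTAINTY = {"high", "medium", "low"}  # assumed levels


def extract_generated_text(payload: Any) -> str:
    """Accept either [{"generated_text": ...}] or {"generated_text": ...}."""
    item = payload[0] if isinstance(payload, list) and payload else payload
    if isinstance(item, dict) and isinstance(item.get("generated_text"), str):
        return item["generated_text"]
    raise ExtractionParseError("no generated_text in payload")


def parse_json_object(text: str) -> dict:
    """Return the first decodable JSON object embedded in the text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = text.find("{", start + 1)
    raise ExtractionParseError("no JSON object in generated text")


def to_extraction_result(data: dict) -> ExtractionResult:
    """Validate the parsed object and convert it to the domain model."""
    certainty = data.get("certainty")
    if not isinstance(certainty, str) or certainty not in ALLOWED_CERTAINTY:
        raise ExtractionParseError(f"invalid certainty: {certainty!r}")
    fields = {}
    for key in ("store_name", "address", "evidence_text"):
        value = data.get(key)
        if value is not None and not isinstance(value, str):
            raise ExtractionParseError(f"{key} must be a string or null")
        fields[key] = value
    return ExtractionResult(certainty=certainty, **fields)


def parse_hf_response(payload: Any) -> ExtractionResult:
    text = extract_generated_text(payload)
    return to_extraction_result(parse_json_object(text))
```

Scanning for the first decodable object tolerates models that wrap the JSON in extra prose, which is one plausible reason to keep the "generated text" and "JSON object" steps separate.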

⚠️ Considerations

  • Connecting to the real HF endpoint in production is out of scope for this issue.
  • Adding DB columns and migrations is out of scope for this issue.
  • So that a real endpoint outage does not escalate into a failure of the whole job, extraction_result must be left empty when extraction fails and the crawl result must still be savable.
  • The external API field name is unified as extraction_result with extensibility in mind.
  • .env is used for local development only and is not committed to Git.
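The failure-isolation requirement above could look something like the following sketch. `JobResult` and `process_crawl_result` are hypothetical names and the real worker processor will differ. The point is only that an exception from the extraction step is caught and logged, leaving extraction_result empty while the crawl data is kept.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


@dataclass
class JobResult:
    # Stand-in for the crawl result plus the optional extraction output.
    caption: Optional[str]
    extraction_result: Optional[Any] = None


def process_crawl_result(
    caption: Optional[str],
    extract: Callable[[str], Any],
) -> JobResult:
    """Run extraction only when a caption exists; never let it fail the job."""
    result = JobResult(caption=caption)
    if not caption:
        return result
    try:
        result.extraction_result = extract(caption)
    except Exception:
        # Endpoint outage, timeout, or parse error: keep the crawl result.
        logger.exception("extraction failed; keeping crawl result only")
    return result
```

Catching the broad `Exception` is a deliberate choice in this sketch, since the stated goal is that no extraction failure propagates to the job. A real implementation might narrow it to known client errors.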

✅ Completion criteria

  • JobResultResponse can accept the extraction_result field.
  • The worker processor can pass the caption to the extraction client.
  • The extraction client can be replaced with a mock/fake, so tests run without a real HF endpoint.
  • The draft HF extraction client handles the generated text / JSON / schema validation flow.
  • Related tests are added and the full pytest run passes.

Test command:
.\.venv\Scripts\python.exe -m pytest

    Labels

    feat: Add a new feature
