feat: Tool mode support for File and Save File#9898
Conversation
WalkthroughRenames “File” components to “Read File,” renames “Save File” to “Write File,” updates descriptions, and adds tool_mode=True to multiple Output declarations across components and starter project templates. News Aggregator also adds "txt" to file_format and updates code block and code_hash. No core logic or control flow changes. Changes
Sequence Diagram(s)Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
edwinjosechittilappilly
left a comment
There was a problem hiding this comment.
Hi Eric can we catch up on this PR?
In tool mode what would be the input to the file component?
Codecov Report✅ All modified and coverable lines are covered by tests. ❌ Your project status has failed because the head coverage (8.12%) is below the target coverage (10.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #9898 +/- ##
==========================================
- Coverage 22.87% 22.86% -0.01%
==========================================
Files 1086 1086
Lines 39739 39739
Branches 5418 5418
==========================================
- Hits 9089 9088 -1
- Misses 30495 30496 +1
Partials 155 155
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
src/backend/base/langflow/initial_setup/starter_projects/Vector Store RAG.json (4)
2766-3514: Add timeout and better failure surfacing for Docling subprocess.
subprocess.runis invoked without a timeout; a hung Docling conversion will deadlock the component and the UX. Capture and handleTimeoutExpired, and surfacestderrwhen present.Apply this diff inside
_process_docling_in_subprocess:- proc = subprocess.run( # noqa: S603 + proc = subprocess.run( # noqa: S603 [sys.executable, "-u", "-c", child_script], input=json.dumps(args).encode("utf-8"), capture_output=True, check=False, + timeout=300, # fail fast; make configurable if needed ) + # Handle timeout explicitly + # (wrap the run in try/except if you prefer structured handling)And augment error reporting:
- if not proc.stdout: + if not proc.stdout: err_msg = proc.stderr.decode("utf-8", errors="replace") or "no output from child process" return Data(data={"error": f"Docling subprocess error: {err_msg}", "file_path": file_path})
2766-3514: Avoid passing None into rollup_data.If Docling returns
None, the advanced branch appends[advanced_data]which may be[None]. Filter outNonebefore callingrollup_data.- return self.rollup_data(file_list, [advanced_data]) + rows = [advanced_data] if advanced_data is not None else [] + return self.rollup_data(file_list, rows)
3503-3514: Fix brittle markdown return (result.text[0]).
load_files()may return a DataFrame of Data, a single Data, or Message depending on path. Indexingresult.text[0]will break when the shape differs. Build the string robustly.def load_files_markdown(self) -> Message: """Load files using advanced Docling processing and export to Markdown format.""" self.markdown = True - result = self.load_files() - return Message(text=str(result.text[0])) + result = self.load_files() + try: + from lfx.schema.dataframe import DataFrame + from lfx.schema.data import Data + from lfx.schema.message import Message as LFMessage + except Exception: + DataFrame = Data = LFMessage = object # type: ignore + + if isinstance(result, DataFrame): + texts = [getattr(row, "text", "") for row in result if getattr(row, "text", "")] + return Message(text="\n\n".join(texts)) + if isinstance(result, Data): + return Message(text=getattr(result, "text", "")) + if isinstance(result, LFMessage): + return result + return Message(text=str(result))
2683-2686: Starter node metadata is inconsistent (name/description/module).
display_namestill shows “File” and the older description; the code setsdisplay_name = "Read File"and new description.metadata.modulepoints tolangflow.components.datastax.astradb.AstraDBVectorStoreComponentfor the File node, which is incorrect and will break code-loading.- "description": "Loads content from one or more files as a DataFrame.", - "display_name": "File", + "description": "Loads content from files with optional advanced document processing and export using Docling.", + "display_name": "Read File", @@ - "module": "langflow.components.datastax.astradb.AstraDBVectorStoreComponent" + "module": "lfx.components.data.file.FileComponent"Also applies to: 2703-2726
src/backend/base/langflow/initial_setup/starter_projects/News Aggregator.json (1)
1250-1275: Add openpyxl to dependencies or keep/guard Excel exportRuntime check shows openpyxl is missing ("No module named 'openpyxl'"); add "openpyxl" to the dependencies block in src/backend/base/langflow/initial_setup/starter_projects/News Aggregator.json (around lines 1250–1275) or ensure the Excel-export code path is guarded and documented.
🧹 Nitpick comments (16)
src/lfx/src/lfx/components/processing/save_file.py (1)
53-54: tool_mode on “File Path” — return a bare path for tool consumers.Tooling parses outputs better if the Message text is just the path. Keep human-friendly confirmation in
self.status.Apply:
- # Return the final file path and confirmation message + # Return the final file path only (tool-friendly); keep human-friendly confirmation in status final_path = Path.cwd() / file_path if not file_path.is_absolute() else file_path - return Message(text=f"{confirmation} at {final_path}") + self.status = Message(text=confirmation) + return Message(text=str(final_path))Alternative: add a second non-tool output “Confirmation” for the verbose message.
src/backend/base/langflow/initial_setup/starter_projects/Vector Store RAG.json (3)
2766-3514: Correctexport_formatfor structured branch.You set
EXPORT_FORMAT = "Markdown"and reuse it for structured output, which mislabels the payload.- if result.get("mode") == "markdown": + if result.get("mode") == "markdown": exported_content = str(result.get("text", "")) return Data( text=exported_content, data={"exported_content": exported_content, "export_format": self.EXPORT_FORMAT, **meta}, ) rows = list(result.get("doc", [])) - return Data(data={"doc": rows, "export_format": self.EXPORT_FORMAT, **meta}) + return Data(data={"doc": rows, "export_format": "Structured", **meta})
2766-3514: Relax path “sanitization” (not shelling out).Since you pass args as a list and data via stdin, characters like
$/&/`don’t reach a shell. The current check can block legitimate paths without improving safety.- if not isinstance(args["file_path"], str) or any(c in args["file_path"] for c in [";", "|", "&", "$", "`"]): - return Data(data={"error": "Unsafe file path detected.", "file_path": args["file_path"]}) + if not isinstance(args["file_path"], str) or not args["file_path"].strip(): + return Data(data={"error": "Invalid file path.", "file_path": args.get("file_path")})
2766-3514: Honor the “Processing Concurrency” without the deprecated flag.You still gate concurrency by
use_multithreading. If the knob is deprecated, compute fromconcurrency_multithreadingalone.- concurrency = 1 if not self.use_multithreading else max(1, self.concurrency_multithreading) + concurrency = max(1, int(self.concurrency_multithreading or 1))src/backend/base/langflow/initial_setup/starter_projects/Text Sentiment Analysis.json (2)
2474-2474: Make load_files_markdown resilient to return shape.
result.text[0]assumes list-like; will fail iftextis str/None or empty list.Apply this diff:
def load_files_markdown(self) -> Message: """Load files using advanced Docling processing and export to Markdown format.""" self.markdown = True result = self.load_files() - return Message(text=str(result.text[0])) + text = getattr(result, "text", "") + if isinstance(text, (list, tuple)): + text = text[0] if text else "" + return Message(text=str(text))
2416-2417: Align node display name with component rename (“Read File”).Starter node still shows “File”; change to “Read File” for consistency with the embedded class and PR intent.
Apply this diff:
- "display_name": "File", + "display_name": "Read File",src/backend/base/langflow/initial_setup/starter_projects/Document Q&A.json (4)
1333-1333: Make load_files_markdown resilient to return shape.Harden against
textnot being list-like.Apply this diff:
def load_files_markdown(self) -> Message: """Load files using advanced Docling processing and export to Markdown format.""" self.markdown = True result = self.load_files() - return Message(text=str(result.text[0])) + text = getattr(result, "text", "") + if isinstance(text, (list, tuple)): + text = text[0] if text else "" + return Message(text=str(text))
1274-1275: Rename node display name to “Read File”.Keep starter labels consistent with the component rename.
- "display_name": "File", + "display_name": "Read File",
750-753: Update docs note to reflect “Read File” rename.Text still says “In the File component”. Consider “In the Read File component”.
301-302: Fix prompt grammar.“What is this document is about?” → “What is this document about?”
- "value": "What is this document is about?" + "value": "What is this document about?"src/backend/base/langflow/initial_setup/starter_projects/Portfolio Website Code Generator.json (1)
1382-1540: tool_mode on Raw Content added — looks good; align node label with new component nameThe embedded FileComponent code now exposes Raw Content with tool_mode=True. However, the surrounding node still shows display_name "File" (Line 1325) while the component’s display_name is "Read File", which will look inconsistent in the UI. Please sync the node label.
Apply this JSON tweak:
- "display_name": "File", + "display_name": "Read File",src/backend/base/langflow/initial_setup/starter_projects/News Aggregator.json (3)
1236-1238: Rename mismatch: “Save File” node label vs code “Write File”The component class is now “Write File” but the node display_name is still “Save File”. Align for consistency.
- "display_name": "Save File", + "display_name": "Write File",
1315-1470: SaveToFile: great tool_mode output; drop duplicate import and harden Excel path
- Duplicate import of get_user_by_id inside _upload_file (imported in the first try and again later). Remove the first occurrence.
- to_excel requires openpyxl; add a clear error if it’s missing.
@@ - try: - from langflow.api.v2.files import upload_user_file - from langflow.services.database.models.user.crud import get_user_by_id + try: + from langflow.api.v2.files import upload_user_file @@ - elif fmt == "excel": - dataframe.to_excel(path, index=False, engine="openpyxl") + elif fmt == "excel": + try: + dataframe.to_excel(path, index=False, engine="openpyxl") + except ModuleNotFoundError as e: + raise ModuleNotFoundError("openpyxl is required for Excel export. Please install it.") from e
1315-1470: Optional: return a structured path instead of a plain messageConsider returning Data(data={"file_path": str(final_path)}) or setting Message.properties with the path so tools can consume it programmatically.
src/lfx/src/lfx/components/data/file.py (2)
215-259: All dynamic outputs now tool_mode=True — good; dedupe tabular extension checksThe logic is correct. Minor DRY: define TABULAR_EXTS once and reuse in update_build_config and update_outputs.
class FileComponent(BaseFileComponent): + TABULAR_EXTS = (".csv", ".xlsx", ".parquet") @@ - allow_advanced = file_count == 1 and not file_path.endswith((".csv", ".xlsx", ".parquet")) + allow_advanced = file_count == 1 and not file_path.endswith(self.TABULAR_EXTS) @@ - if file_path.endswith((".csv", ".xlsx", ".parquet")): + if file_path.endswith(self.TABULAR_EXTS): frontend_node["outputs"].append(
483-497: Path “sanitization” not needed for subprocess argsYou’re not using shell=True; argv is a list, so command injection isn’t a concern. This check can be dropped or replaced with a simple existence/permission check if you want stricter validation.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
src/backend/base/langflow/initial_setup/starter_projects/Document Q&A.json(1 hunks)src/backend/base/langflow/initial_setup/starter_projects/News Aggregator.json(3 hunks)src/backend/base/langflow/initial_setup/starter_projects/Portfolio Website Code Generator.json(1 hunks)src/backend/base/langflow/initial_setup/starter_projects/Text Sentiment Analysis.json(1 hunks)src/backend/base/langflow/initial_setup/starter_projects/Vector Store RAG.json(1 hunks)src/lfx/src/lfx/base/data/base_file.py(1 hunks)src/lfx/src/lfx/components/data/file.py(3 hunks)src/lfx/src/lfx/components/processing/save_file.py(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (42)
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 39/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 38/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 27/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 34/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 35/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 40/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 20/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 32/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 21/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 33/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 31/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 16/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 30/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 25/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 28/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 29/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 22/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 24/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 18/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 19/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 17/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 9/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 12/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 1/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 2/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 15/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 5/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 13/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 14/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 8/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 10/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 7/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 11/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 6/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 3/40
- GitHub Check: Run Frontend Tests / Playwright Tests - Shard 4/40
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 3
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 2
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 5
- GitHub Check: Run Backend Tests / Unit Tests - Python 3.10 - Group 1
- GitHub Check: Test Starter Templates
- GitHub Check: Optimize new Python code in this PR
🔇 Additional comments (4)
src/lfx/src/lfx/base/data/base_file.py (1)
183-184: ```shell
#!/bin/bash
set -euo pipefailecho "PWD: $(pwd)"
echo
echo "== Files containing 'class Output' =="
rg -nP --hidden -S 'class\s+Output\b' || trueecho
echo "== Full context (up to 200 lines) around each 'class Output' match =="
files=$(rg -lP --hidden -S 'class\s+Output\b' || true)
if [ -n "$files" ]; then
for f in $files; do
echo "---- $f ----"
rg -nP -U -C200 'class\s+Output\b' "$f" || true
echo
done
else
echo "(no files found)"
fiecho
echo "== All occurrences of 'tool_mode' in repo =="
rg -nP --hidden -S '\btool_mode\b' || trueecho
echo "== All 'Output(' calls under src/ (lines that do NOT contain 'tool_mode=' on the same line) =="
rg -nP --hidden -S 'Output(' src || true
echo
rg -nP --hidden -S 'Output(' src | rg -nP -v 'tool_mode\s*=' || trueecho
echo "Done."</blockquote></details> <details> <summary>src/backend/base/langflow/initial_setup/starter_projects/Vector Store RAG.json (1)</summary><blockquote> `2766-3514`: **Tool mode propagation: looks good.** All dynamically added outputs in `update_outputs` set `tool_mode=True`, matching the PR goal. Please confirm Save/Write File template JSONs also mark the “File Path” output with `tool_mode: true`. </blockquote></details> <details> <summary>src/lfx/src/lfx/components/data/file.py (2)</summary><blockquote> `31-34`: **Rename to “Read File” — LGTM** Public name/docstring updated cleanly. --- `163-164`: **Expose Raw Content in tool mode — LGTM** This makes the default output usable in tool flows. </blockquote></details> </blockquote></details> </details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
| "title_case": false, | ||
| "type": "code", | ||
| "value": "\"\"\"Enhanced file component with clearer structure and Docling isolation.\n\nNotes:\n-----\n- Functionality is preserved with minimal behavioral changes.\n- ALL Docling parsing/export runs in a separate OS process to prevent memory\n growth and native library state from impacting the main Langflow process.\n- Standard text/structured parsing continues to use existing BaseFileComponent\n utilities (and optional threading via `parallel_load_data`).\n\"\"\"\n\nfrom __future__ import annotations\n\nimport json\nimport subprocess\nimport sys\nimport textwrap\nfrom copy import deepcopy\nfrom typing import Any\n\nfrom lfx.base.data.base_file import BaseFileComponent\nfrom lfx.base.data.utils import TEXT_FILE_TYPES, parallel_load_data, parse_text_file_to_data\nfrom lfx.inputs.inputs import DropdownInput, MessageTextInput, StrInput\nfrom lfx.io import BoolInput, FileInput, IntInput, Output\nfrom lfx.schema import DataFrame # noqa: TC001\nfrom lfx.schema.data import Data\nfrom lfx.schema.message import Message\n\n\nclass FileComponent(BaseFileComponent):\n \"\"\"File component with optional Docling processing (isolated in a subprocess).\"\"\"\n\n display_name = \"File\"\n description = \"Loads content from files with optional advanced document processing and export using Docling.\"\n documentation: str = \"https://docs.langflow.org/components-data#file\"\n icon = \"file-text\"\n name = \"File\"\n\n # Docling-supported/compatible extensions; TEXT_FILE_TYPES are supported by the base loader.\n VALID_EXTENSIONS = [\n \"adoc\",\n \"asciidoc\",\n \"asc\",\n \"bmp\",\n \"csv\",\n \"dotx\",\n \"dotm\",\n \"docm\",\n \"docx\",\n \"htm\",\n \"html\",\n \"jpeg\",\n \"json\",\n \"md\",\n \"pdf\",\n \"png\",\n \"potx\",\n \"ppsx\",\n \"pptm\",\n \"potm\",\n \"ppsm\",\n \"pptx\",\n \"tiff\",\n \"txt\",\n \"xls\",\n \"xlsx\",\n \"xhtml\",\n \"xml\",\n \"webp\",\n *TEXT_FILE_TYPES,\n ]\n\n # Fixed export settings used when markdown export is requested.\n EXPORT_FORMAT = \"Markdown\"\n IMAGE_MODE = \"placeholder\"\n\n _base_inputs = deepcopy(BaseFileComponent.get_base_inputs())\n\n for input_item in _base_inputs:\n if isinstance(input_item, FileInput) and input_item.name == \"path\":\n input_item.real_time_refresh = True\n break\n\n inputs = [\n *_base_inputs,\n BoolInput(\n name=\"advanced_mode\",\n display_name=\"Advanced Parser\",\n value=False,\n real_time_refresh=True,\n info=(\n \"Enable advanced document processing and export with Docling for PDFs, images, and office documents. \"\n \"Available only for single file processing.\"\n ),\n show=False,\n ),\n DropdownInput(\n name=\"pipeline\",\n display_name=\"Pipeline\",\n info=\"Docling pipeline to use\",\n options=[\"standard\", \"vlm\"],\n value=\"standard\",\n advanced=True,\n ),\n DropdownInput(\n name=\"ocr_engine\",\n display_name=\"OCR Engine\",\n info=\"OCR engine to use. Only available when pipeline is set to 'standard'.\",\n options=[\"\", \"easyocr\"],\n value=\"\",\n show=False,\n advanced=True,\n ),\n StrInput(\n name=\"md_image_placeholder\",\n display_name=\"Image placeholder\",\n info=\"Specify the image placeholder for markdown exports.\",\n value=\"<!-- image -->\",\n advanced=True,\n show=False,\n ),\n StrInput(\n name=\"md_page_break_placeholder\",\n display_name=\"Page break placeholder\",\n info=\"Add this placeholder between pages in the markdown output.\",\n value=\"\",\n advanced=True,\n show=False,\n ),\n MessageTextInput(\n name=\"doc_key\",\n display_name=\"Doc Key\",\n info=\"The key to use for the DoclingDocument column.\",\n value=\"doc\",\n advanced=True,\n show=False,\n ),\n # Deprecated input retained for backward-compatibility.\n BoolInput(\n name=\"use_multithreading\",\n display_name=\"[Deprecated] Use Multithreading\",\n advanced=True,\n value=True,\n info=\"Set 'Processing Concurrency' greater than 1 to enable multithreading.\",\n ),\n IntInput(\n name=\"concurrency_multithreading\",\n display_name=\"Processing Concurrency\",\n advanced=True,\n info=\"When multiple files are being processed, the number of files to process concurrently.\",\n value=1,\n ),\n BoolInput(\n name=\"markdown\",\n display_name=\"Markdown Export\",\n info=\"Export processed documents to Markdown format. Only available when advanced mode is enabled.\",\n value=False,\n show=False,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\"),\n ]\n\n # ------------------------------ UI helpers --------------------------------------\n\n def _path_value(self, template: dict) -> list[str]:\n \"\"\"Return the list of currently selected file paths from the template.\"\"\"\n return template.get(\"path\", {}).get(\"file_path\", [])\n\n def update_build_config(\n self,\n build_config: dict[str, Any],\n field_value: Any,\n field_name: str | None = None,\n ) -> dict[str, Any]:\n \"\"\"Show/hide Advanced Parser and related fields based on selection context.\"\"\"\n if field_name == \"path\":\n paths = self._path_value(build_config)\n file_path = paths[0] if paths else \"\"\n file_count = len(field_value) if field_value else 0\n\n # Advanced mode only for single (non-tabular) file\n allow_advanced = file_count == 1 and not file_path.endswith((\".csv\", \".xlsx\", \".parquet\"))\n build_config[\"advanced_mode\"][\"show\"] = allow_advanced\n if not allow_advanced:\n build_config[\"advanced_mode\"][\"value\"] = False\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = False\n\n elif field_name == \"advanced_mode\":\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = bool(field_value)\n\n return build_config\n\n def update_outputs(self, frontend_node: dict[str, Any], field_name: str, field_value: Any) -> dict[str, Any]: # noqa: ARG002\n \"\"\"Dynamically show outputs based on file count/type and advanced mode.\"\"\"\n if field_name not in [\"path\", \"advanced_mode\"]:\n return frontend_node\n\n template = frontend_node.get(\"template\", {})\n paths = self._path_value(template)\n if not paths:\n return frontend_node\n\n frontend_node[\"outputs\"] = []\n if len(paths) == 1:\n file_path = paths[0] if field_name == \"path\" else frontend_node[\"template\"][\"path\"][\"file_path\"][0]\n if file_path.endswith((\".csv\", \".xlsx\", \".parquet\")):\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Structured Content\", name=\"dataframe\", method=\"load_files_structured\"),\n )\n elif file_path.endswith(\".json\"):\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Structured Content\", name=\"json\", method=\"load_files_json\"),\n )\n\n advanced_mode = frontend_node.get(\"template\", {}).get(\"advanced_mode\", {}).get(\"value\", False)\n if advanced_mode:\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Structured Output\", name=\"advanced\", method=\"load_files_advanced\"),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Markdown\", name=\"markdown\", method=\"load_files_markdown\"),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\"),\n )\n else:\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\"),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\"),\n )\n else:\n # Multiple files => DataFrame output; advanced parser disabled\n frontend_node[\"outputs\"].append(Output(display_name=\"Files\", name=\"dataframe\", method=\"load_files\"))\n\n return frontend_node\n\n # ------------------------------ Core processing ----------------------------------\n\n def _is_docling_compatible(self, file_path: str) -> bool:\n \"\"\"Lightweight extension gate for Docling-compatible types.\"\"\"\n docling_exts = (\n \".adoc\",\n \".asciidoc\",\n \".asc\",\n \".bmp\",\n \".csv\",\n \".dotx\",\n \".dotm\",\n \".docm\",\n \".docx\",\n \".htm\",\n \".html\",\n \".jpeg\",\n \".json\",\n \".md\",\n \".pdf\",\n \".png\",\n \".potx\",\n \".ppsx\",\n \".pptm\",\n \".potm\",\n \".ppsm\",\n \".pptx\",\n \".tiff\",\n \".txt\",\n \".xls\",\n \".xlsx\",\n \".xhtml\",\n \".xml\",\n \".webp\",\n )\n return file_path.lower().endswith(docling_exts)\n\n def _process_docling_in_subprocess(self, file_path: str) -> Data | None:\n \"\"\"Run Docling in a separate OS process and map the result to a Data object.\n\n We avoid multiprocessing pickling by launching `python -c \"<script>\"` and\n passing JSON config via stdin. The child prints a JSON result to stdout.\n \"\"\"\n if not file_path:\n return None\n\n args: dict[str, Any] = {\n \"file_path\": file_path,\n \"markdown\": bool(self.markdown),\n \"image_mode\": str(self.IMAGE_MODE),\n \"md_image_placeholder\": str(self.md_image_placeholder),\n \"md_page_break_placeholder\": str(self.md_page_break_placeholder),\n \"pipeline\": str(self.pipeline),\n \"ocr_engine\": str(self.ocr_engine) if getattr(self, \"ocr_engine\", \"\") else None,\n }\n\n # The child is a tiny, self-contained script to keep memory/state isolated.\n child_script = textwrap.dedent(\n r\"\"\"\n import json, sys\n\n def try_imports():\n # Strategy 1: latest layout\n try:\n from docling.datamodel.base_models import ConversionStatus, InputFormat # type: ignore\n from docling.document_converter import DocumentConverter # type: ignore\n from docling_core.types.doc import ImageRefMode # type: ignore\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"latest\"\n except Exception:\n pass\n # Strategy 2: alternative layout\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n try:\n from docling_core.types import ConversionStatus, InputFormat # type: ignore\n except Exception:\n try:\n from docling.datamodel import ConversionStatus, InputFormat # type: ignore\n except Exception:\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n try:\n from docling_core.types.doc import ImageRefMode # type: ignore\n except Exception:\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"alternative\"\n except Exception:\n pass\n # Strategy 3: basic converter only\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"basic\"\n except Exception as e:\n raise ImportError(f\"Docling imports failed: {e}\") from e\n\n def create_converter(strategy, input_format, DocumentConverter, pipeline, ocr_engine):\n if strategy == \"latest\" and pipeline == \"standard\":\n try:\n from docling.datamodel.pipeline_options import PdfPipelineOptions # type: ignore\n from docling.document_converter import PdfFormatOption # type: ignore\n pipe = PdfPipelineOptions()\n if ocr_engine:\n try:\n from docling.models.factories import get_ocr_factory # type: ignore\n pipe.do_ocr = True\n fac = get_ocr_factory(allow_external_plugins=False)\n pipe.ocr_options = fac.create_options(kind=ocr_engine)\n except Exception:\n pipe.do_ocr = False\n fmt = {}\n if hasattr(input_format, \"PDF\"):\n fmt[getattr(input_format, \"PDF\")] = PdfFormatOption(pipeline_options=pipe)\n if hasattr(input_format, \"IMAGE\"):\n fmt[getattr(input_format, \"IMAGE\")] = PdfFormatOption(pipeline_options=pipe)\n return DocumentConverter(format_options=fmt)\n except Exception:\n return DocumentConverter()\n return DocumentConverter()\n\n def export_markdown(document, ImageRefMode, image_mode, img_ph, pg_ph):\n try:\n mode = getattr(ImageRefMode, image_mode.upper(), image_mode)\n return document.export_to_markdown(\n image_mode=mode,\n image_placeholder=img_ph,\n page_break_placeholder=pg_ph,\n )\n except Exception:\n try:\n return document.export_to_text()\n except Exception:\n return str(document)\n\n def to_rows(doc_dict):\n rows = []\n for t in doc_dict.get(\"texts\", []):\n prov = t.get(\"prov\") or []\n page_no = None\n if prov and isinstance(prov, list) and isinstance(prov[0], dict):\n page_no = prov[0].get(\"page_no\")\n rows.append({\n \"page_no\": page_no,\n \"label\": t.get(\"label\"),\n \"text\": t.get(\"text\"),\n \"level\": t.get(\"level\"),\n })\n return rows\n\n def main():\n cfg = json.loads(sys.stdin.read())\n file_path = cfg[\"file_path\"]\n markdown = cfg[\"markdown\"]\n image_mode = cfg[\"image_mode\"]\n img_ph = cfg[\"md_image_placeholder\"]\n pg_ph = cfg[\"md_page_break_placeholder\"]\n pipeline = cfg[\"pipeline\"]\n ocr_engine = cfg.get(\"ocr_engine\")\n meta = {\"file_path\": file_path}\n\n try:\n ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, strategy = try_imports()\n converter = create_converter(strategy, InputFormat, DocumentConverter, pipeline, ocr_engine)\n try:\n res = converter.convert(file_path)\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling conversion error: {e}\", \"meta\": meta}))\n return\n\n ok = False\n if hasattr(res, \"status\"):\n try:\n ok = (res.status == ConversionStatus.SUCCESS) or (str(res.status).lower() == \"success\")\n except Exception:\n ok = (str(res.status).lower() == \"success\")\n if not ok and hasattr(res, \"document\"):\n ok = getattr(res, \"document\", None) is not None\n if not ok:\n print(json.dumps({\"ok\": False, \"error\": \"Docling conversion failed\", \"meta\": meta}))\n return\n\n doc = getattr(res, \"document\", None)\n if doc is None:\n print(json.dumps({\"ok\": False, \"error\": \"Docling produced no document\", \"meta\": meta}))\n return\n\n if markdown:\n text = export_markdown(doc, ImageRefMode, image_mode, img_ph, pg_ph)\n print(json.dumps({\"ok\": True, \"mode\": \"markdown\", \"text\": text, \"meta\": meta}))\n return\n\n # structured\n try:\n doc_dict = doc.export_to_dict()\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling export_to_dict failed: {e}\", \"meta\": meta}))\n return\n\n rows = to_rows(doc_dict)\n print(json.dumps({\"ok\": True, \"mode\": \"structured\", \"doc\": rows, \"meta\": meta}))\n except Exception as e:\n print(\n json.dumps({\n \"ok\": False,\n \"error\": f\"Docling processing error: {e}\",\n \"meta\": {\"file_path\": file_path},\n })\n )\n\n if __name__ == \"__main__\":\n main()\n \"\"\"\n )\n\n # Validate file_path to avoid command injection or unsafe input\n if not isinstance(args[\"file_path\"], str) or any(c in args[\"file_path\"] for c in [\";\", \"|\", \"&\", \"$\", \"`\"]):\n return Data(data={\"error\": \"Unsafe file path detected.\", \"file_path\": args[\"file_path\"]})\n\n proc = subprocess.run( # noqa: S603\n [sys.executable, \"-u\", \"-c\", child_script],\n input=json.dumps(args).encode(\"utf-8\"),\n capture_output=True,\n check=False,\n )\n\n if not proc.stdout:\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\") or \"no output from child process\"\n return Data(data={\"error\": f\"Docling subprocess error: {err_msg}\", \"file_path\": file_path})\n\n try:\n result = json.loads(proc.stdout.decode(\"utf-8\"))\n except Exception as e: # noqa: BLE001\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\")\n return Data(\n data={\"error\": f\"Invalid JSON from Docling subprocess: {e}. stderr={err_msg}\", \"file_path\": file_path},\n )\n\n if not result.get(\"ok\"):\n return Data(data={\"error\": result.get(\"error\", \"Unknown Docling error\"), **result.get(\"meta\", {})})\n\n meta = result.get(\"meta\", {})\n if result.get(\"mode\") == \"markdown\":\n exported_content = str(result.get(\"text\", \"\"))\n return Data(\n text=exported_content,\n data={\"exported_content\": exported_content, \"export_format\": self.EXPORT_FORMAT, **meta},\n )\n\n rows = list(result.get(\"doc\", []))\n return Data(data={\"doc\": rows, \"export_format\": self.EXPORT_FORMAT, **meta})\n\n def process_files(\n self,\n file_list: list[BaseFileComponent.BaseFile],\n ) -> list[BaseFileComponent.BaseFile]:\n \"\"\"Process input files.\n\n - Single file + advanced_mode => Docling in a separate process.\n - Otherwise => standard parsing in current process (optionally threaded).\n \"\"\"\n if not file_list:\n msg = \"No files to process.\"\n raise ValueError(msg)\n\n def process_file_standard(file_path: str, *, silent_errors: bool = False) -> Data | None:\n try:\n return parse_text_file_to_data(file_path, silent_errors=silent_errors)\n except FileNotFoundError as e:\n self.log(f\"File not found: {file_path}. Error: {e}\")\n if not silent_errors:\n raise\n return None\n except Exception as e:\n self.log(f\"Unexpected error processing {file_path}: {e}\")\n if not silent_errors:\n raise\n return None\n\n # Advanced path: only for a single Docling-compatible file\n if len(file_list) == 1:\n file_path = str(file_list[0].path)\n if self.advanced_mode and self._is_docling_compatible(file_path):\n advanced_data: Data | None = self._process_docling_in_subprocess(file_path)\n\n # --- UNNEST: expand each element in `doc` to its own Data row\n payload = getattr(advanced_data, \"data\", {}) or {}\n doc_rows = payload.get(\"doc\")\n if isinstance(doc_rows, list):\n rows: list[Data | None] = [\n Data(\n data={\n \"file_path\": file_path,\n **(item if isinstance(item, dict) else {\"value\": item}),\n },\n )\n for item in doc_rows\n ]\n return self.rollup_data(file_list, rows)\n\n # If not structured, keep as-is (e.g., markdown export or error dict)\n return self.rollup_data(file_list, [advanced_data])\n\n # Standard multi-file (or single non-advanced) path\n concurrency = 1 if not self.use_multithreading else max(1, self.concurrency_multithreading)\n file_paths = [str(f.path) for f in file_list]\n self.log(f\"Starting parallel processing of {len(file_paths)} files with concurrency: {concurrency}.\")\n my_data = parallel_load_data(\n file_paths,\n silent_errors=self.silent_errors,\n load_function=process_file_standard,\n max_concurrency=concurrency,\n )\n return self.rollup_data(file_list, my_data)\n\n # ------------------------------ Output helpers -----------------------------------\n\n def load_files_advanced(self) -> DataFrame:\n \"\"\"Load files using advanced Docling processing and export to an advanced format.\"\"\"\n self.markdown = False\n return self.load_files()\n\n def load_files_markdown(self) -> Message:\n \"\"\"Load files using advanced Docling processing and export to Markdown format.\"\"\"\n self.markdown = True\n result = self.load_files()\n return Message(text=str(result.text[0]))\n" | ||
| "value": "\"\"\"Enhanced file component with clearer structure and Docling isolation.\n\nNotes:\n-----\n- Functionality is preserved with minimal behavioral changes.\n- ALL Docling parsing/export runs in a separate OS process to prevent memory\n growth and native library state from impacting the main Langflow process.\n- Standard text/structured parsing continues to use existing BaseFileComponent\n utilities (and optional threading via `parallel_load_data`).\n\"\"\"\n\nfrom __future__ import annotations\n\nimport json\nimport subprocess\nimport sys\nimport textwrap\nfrom copy import deepcopy\nfrom typing import Any\n\nfrom lfx.base.data.base_file import BaseFileComponent\nfrom lfx.base.data.utils import TEXT_FILE_TYPES, parallel_load_data, parse_text_file_to_data\nfrom lfx.inputs.inputs import DropdownInput, MessageTextInput, StrInput\nfrom lfx.io import BoolInput, FileInput, IntInput, Output\nfrom lfx.schema import DataFrame # noqa: TC001\nfrom lfx.schema.data import Data\nfrom lfx.schema.message import Message\n\n\nclass FileComponent(BaseFileComponent):\n \"\"\"Read File component with optional Docling processing (isolated in a subprocess).\"\"\"\n\n display_name = \"Read File\"\n description = \"Loads content from files with optional advanced document processing and export using Docling.\"\n documentation: str = \"https://docs.langflow.org/components-data#file\"\n icon = \"file-text\"\n name = \"File\"\n\n # Docling-supported/compatible extensions; TEXT_FILE_TYPES are supported by the base loader.\n VALID_EXTENSIONS = [\n \"adoc\",\n \"asciidoc\",\n \"asc\",\n \"bmp\",\n \"csv\",\n \"dotx\",\n \"dotm\",\n \"docm\",\n \"docx\",\n \"htm\",\n \"html\",\n \"jpeg\",\n \"json\",\n \"md\",\n \"pdf\",\n \"png\",\n \"potx\",\n \"ppsx\",\n \"pptm\",\n \"potm\",\n \"ppsm\",\n \"pptx\",\n \"tiff\",\n \"txt\",\n \"xls\",\n \"xlsx\",\n \"xhtml\",\n \"xml\",\n \"webp\",\n *TEXT_FILE_TYPES,\n ]\n\n # Fixed export settings used when markdown export is requested.\n EXPORT_FORMAT = \"Markdown\"\n IMAGE_MODE = \"placeholder\"\n\n _base_inputs = deepcopy(BaseFileComponent.get_base_inputs())\n\n for input_item in _base_inputs:\n if isinstance(input_item, FileInput) and input_item.name == \"path\":\n input_item.real_time_refresh = True\n break\n\n inputs = [\n *_base_inputs,\n BoolInput(\n name=\"advanced_mode\",\n display_name=\"Advanced Parser\",\n value=False,\n real_time_refresh=True,\n info=(\n \"Enable advanced document processing and export with Docling for PDFs, images, and office documents. \"\n \"Available only for single file processing.\"\n ),\n show=False,\n ),\n DropdownInput(\n name=\"pipeline\",\n display_name=\"Pipeline\",\n info=\"Docling pipeline to use\",\n options=[\"standard\", \"vlm\"],\n value=\"standard\",\n advanced=True,\n ),\n DropdownInput(\n name=\"ocr_engine\",\n display_name=\"OCR Engine\",\n info=\"OCR engine to use. Only available when pipeline is set to 'standard'.\",\n options=[\"\", \"easyocr\"],\n value=\"\",\n show=False,\n advanced=True,\n ),\n StrInput(\n name=\"md_image_placeholder\",\n display_name=\"Image placeholder\",\n info=\"Specify the image placeholder for markdown exports.\",\n value=\"<!-- image -->\",\n advanced=True,\n show=False,\n ),\n StrInput(\n name=\"md_page_break_placeholder\",\n display_name=\"Page break placeholder\",\n info=\"Add this placeholder between pages in the markdown output.\",\n value=\"\",\n advanced=True,\n show=False,\n ),\n MessageTextInput(\n name=\"doc_key\",\n display_name=\"Doc Key\",\n info=\"The key to use for the DoclingDocument column.\",\n value=\"doc\",\n advanced=True,\n show=False,\n ),\n # Deprecated input retained for backward-compatibility.\n BoolInput(\n name=\"use_multithreading\",\n display_name=\"[Deprecated] Use Multithreading\",\n advanced=True,\n value=True,\n info=\"Set 'Processing Concurrency' greater than 1 to enable multithreading.\",\n ),\n IntInput(\n name=\"concurrency_multithreading\",\n display_name=\"Processing Concurrency\",\n advanced=True,\n info=\"When multiple files are being processed, the number of files to process concurrently.\",\n value=1,\n ),\n BoolInput(\n name=\"markdown\",\n display_name=\"Markdown Export\",\n info=\"Export processed documents to Markdown format. Only available when advanced mode is enabled.\",\n value=False,\n show=False,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\", tool_mode=True),\n ]\n\n # ------------------------------ UI helpers --------------------------------------\n\n def _path_value(self, template: dict) -> list[str]:\n \"\"\"Return the list of currently selected file paths from the template.\"\"\"\n return template.get(\"path\", {}).get(\"file_path\", [])\n\n def update_build_config(\n self,\n build_config: dict[str, Any],\n field_value: Any,\n field_name: str | None = None,\n ) -> dict[str, Any]:\n \"\"\"Show/hide Advanced Parser and related fields based on selection context.\"\"\"\n if field_name == \"path\":\n paths = self._path_value(build_config)\n file_path = paths[0] if paths else \"\"\n file_count = len(field_value) if field_value else 0\n\n # Advanced mode only for single (non-tabular) file\n allow_advanced = file_count == 1 and not file_path.endswith((\".csv\", \".xlsx\", \".parquet\"))\n build_config[\"advanced_mode\"][\"show\"] = allow_advanced\n if not allow_advanced:\n build_config[\"advanced_mode\"][\"value\"] = False\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = False\n\n elif field_name == \"advanced_mode\":\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = bool(field_value)\n\n return build_config\n\n def update_outputs(self, frontend_node: dict[str, Any], field_name: str, field_value: Any) -> dict[str, Any]: # noqa: ARG002\n \"\"\"Dynamically show outputs based on file count/type and advanced mode.\"\"\"\n if field_name not in [\"path\", \"advanced_mode\"]:\n return frontend_node\n\n template = frontend_node.get(\"template\", {})\n paths = self._path_value(template)\n if not paths:\n return frontend_node\n\n frontend_node[\"outputs\"] = []\n if len(paths) == 1:\n file_path = paths[0] if field_name == \"path\" else frontend_node[\"template\"][\"path\"][\"file_path\"][0]\n if file_path.endswith((\".csv\", \".xlsx\", \".parquet\")):\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Structured Content\",\n name=\"dataframe\",\n method=\"load_files_structured\",\n tool_mode=True,\n ),\n )\n elif file_path.endswith(\".json\"):\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Structured Content\", name=\"json\", method=\"load_files_json\", tool_mode=True),\n )\n\n advanced_mode = frontend_node.get(\"template\", {}).get(\"advanced_mode\", {}).get(\"value\", False)\n if advanced_mode:\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Structured Output\",\n name=\"advanced\",\n method=\"load_files_advanced\",\n tool_mode=True,\n ),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Markdown\", name=\"markdown\", method=\"load_files_markdown\", tool_mode=True),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\", tool_mode=True),\n )\n else:\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\", tool_mode=True),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\", tool_mode=True),\n )\n else:\n # Multiple files => DataFrame output; advanced parser disabled\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Files\",\n name=\"dataframe\",\n method=\"load_files\",\n tool_mode=True,\n )\n )\n\n return frontend_node\n\n # ------------------------------ Core processing ----------------------------------\n\n def _is_docling_compatible(self, file_path: str) -> bool:\n \"\"\"Lightweight extension gate for Docling-compatible types.\"\"\"\n docling_exts = (\n \".adoc\",\n \".asciidoc\",\n \".asc\",\n \".bmp\",\n \".csv\",\n \".dotx\",\n \".dotm\",\n \".docm\",\n \".docx\",\n \".htm\",\n \".html\",\n \".jpeg\",\n \".json\",\n \".md\",\n \".pdf\",\n \".png\",\n \".potx\",\n \".ppsx\",\n \".pptm\",\n \".potm\",\n \".ppsm\",\n \".pptx\",\n \".tiff\",\n \".txt\",\n \".xls\",\n \".xlsx\",\n \".xhtml\",\n \".xml\",\n \".webp\",\n )\n return file_path.lower().endswith(docling_exts)\n\n def _process_docling_in_subprocess(self, file_path: str) -> Data | None:\n \"\"\"Run Docling in a separate OS process and map the result to a Data object.\n\n We avoid multiprocessing pickling by launching `python -c \"<script>\"` and\n passing JSON config via stdin. The child prints a JSON result to stdout.\n \"\"\"\n if not file_path:\n return None\n\n args: dict[str, Any] = {\n \"file_path\": file_path,\n \"markdown\": bool(self.markdown),\n \"image_mode\": str(self.IMAGE_MODE),\n \"md_image_placeholder\": str(self.md_image_placeholder),\n \"md_page_break_placeholder\": str(self.md_page_break_placeholder),\n \"pipeline\": str(self.pipeline),\n \"ocr_engine\": str(self.ocr_engine) if getattr(self, \"ocr_engine\", \"\") else None,\n }\n\n # The child is a tiny, self-contained script to keep memory/state isolated.\n child_script = textwrap.dedent(\n r\"\"\"\n import json, sys\n\n def try_imports():\n # Strategy 1: latest layout\n try:\n from docling.datamodel.base_models import ConversionStatus, InputFormat # type: ignore\n from docling.document_converter import DocumentConverter # type: ignore\n from docling_core.types.doc import ImageRefMode # type: ignore\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"latest\"\n except Exception:\n pass\n # Strategy 2: alternative layout\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n try:\n from docling_core.types import ConversionStatus, InputFormat # type: ignore\n except Exception:\n try:\n from docling.datamodel import ConversionStatus, InputFormat # type: ignore\n except Exception:\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n try:\n from docling_core.types.doc import ImageRefMode # type: ignore\n except Exception:\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"alternative\"\n except Exception:\n pass\n # Strategy 3: basic converter only\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"basic\"\n except Exception as e:\n raise ImportError(f\"Docling imports failed: {e}\") from e\n\n def create_converter(strategy, input_format, DocumentConverter, pipeline, ocr_engine):\n if strategy == \"latest\" and pipeline == \"standard\":\n try:\n from docling.datamodel.pipeline_options import PdfPipelineOptions # type: ignore\n from docling.document_converter import PdfFormatOption # type: ignore\n pipe = PdfPipelineOptions()\n if ocr_engine:\n try:\n from docling.models.factories import get_ocr_factory # type: ignore\n pipe.do_ocr = True\n fac = get_ocr_factory(allow_external_plugins=False)\n pipe.ocr_options = fac.create_options(kind=ocr_engine)\n except Exception:\n pipe.do_ocr = False\n fmt = {}\n if hasattr(input_format, \"PDF\"):\n fmt[getattr(input_format, \"PDF\")] = PdfFormatOption(pipeline_options=pipe)\n if hasattr(input_format, \"IMAGE\"):\n fmt[getattr(input_format, \"IMAGE\")] = PdfFormatOption(pipeline_options=pipe)\n return DocumentConverter(format_options=fmt)\n except Exception:\n return DocumentConverter()\n return DocumentConverter()\n\n def export_markdown(document, ImageRefMode, image_mode, img_ph, pg_ph):\n try:\n mode = getattr(ImageRefMode, image_mode.upper(), image_mode)\n return document.export_to_markdown(\n image_mode=mode,\n image_placeholder=img_ph,\n page_break_placeholder=pg_ph,\n )\n except Exception:\n try:\n return document.export_to_text()\n except Exception:\n return str(document)\n\n def to_rows(doc_dict):\n rows = []\n for t in doc_dict.get(\"texts\", []):\n prov = t.get(\"prov\") or []\n page_no = None\n if prov and isinstance(prov, list) and isinstance(prov[0], dict):\n page_no = prov[0].get(\"page_no\")\n rows.append({\n \"page_no\": page_no,\n \"label\": t.get(\"label\"),\n \"text\": t.get(\"text\"),\n \"level\": t.get(\"level\"),\n })\n return rows\n\n def main():\n cfg = json.loads(sys.stdin.read())\n file_path = cfg[\"file_path\"]\n markdown = cfg[\"markdown\"]\n image_mode = cfg[\"image_mode\"]\n img_ph = cfg[\"md_image_placeholder\"]\n pg_ph = cfg[\"md_page_break_placeholder\"]\n pipeline = cfg[\"pipeline\"]\n ocr_engine = cfg.get(\"ocr_engine\")\n meta = {\"file_path\": file_path}\n\n try:\n ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, strategy = try_imports()\n converter = create_converter(strategy, InputFormat, DocumentConverter, pipeline, ocr_engine)\n try:\n res = converter.convert(file_path)\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling conversion error: {e}\", \"meta\": meta}))\n return\n\n ok = False\n if hasattr(res, \"status\"):\n try:\n ok = (res.status == ConversionStatus.SUCCESS) or (str(res.status).lower() == \"success\")\n except Exception:\n ok = (str(res.status).lower() == \"success\")\n if not ok and hasattr(res, \"document\"):\n ok = getattr(res, \"document\", None) is not None\n if not ok:\n print(json.dumps({\"ok\": False, \"error\": \"Docling conversion failed\", \"meta\": meta}))\n return\n\n doc = getattr(res, \"document\", None)\n if doc is None:\n print(json.dumps({\"ok\": False, \"error\": \"Docling produced no document\", \"meta\": meta}))\n return\n\n if markdown:\n text = export_markdown(doc, ImageRefMode, image_mode, img_ph, pg_ph)\n print(json.dumps({\"ok\": True, \"mode\": \"markdown\", \"text\": text, \"meta\": meta}))\n return\n\n # structured\n try:\n doc_dict = doc.export_to_dict()\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling export_to_dict failed: {e}\", \"meta\": meta}))\n return\n\n rows = to_rows(doc_dict)\n print(json.dumps({\"ok\": True, \"mode\": \"structured\", \"doc\": rows, \"meta\": meta}))\n except Exception as e:\n print(\n json.dumps({\n \"ok\": False,\n \"error\": f\"Docling processing error: {e}\",\n \"meta\": {\"file_path\": file_path},\n })\n )\n\n if __name__ == \"__main__\":\n main()\n \"\"\"\n )\n\n # Validate file_path to avoid command injection or unsafe input\n if not isinstance(args[\"file_path\"], str) or any(c in args[\"file_path\"] for c in [\";\", \"|\", \"&\", \"$\", \"`\"]):\n return Data(data={\"error\": \"Unsafe file path detected.\", \"file_path\": args[\"file_path\"]})\n\n proc = subprocess.run( # noqa: S603\n [sys.executable, \"-u\", \"-c\", child_script],\n input=json.dumps(args).encode(\"utf-8\"),\n capture_output=True,\n check=False,\n )\n\n if not proc.stdout:\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\") or \"no output from child process\"\n return Data(data={\"error\": f\"Docling subprocess error: {err_msg}\", \"file_path\": file_path})\n\n try:\n result = json.loads(proc.stdout.decode(\"utf-8\"))\n except Exception as e: # noqa: BLE001\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\")\n return Data(\n data={\"error\": f\"Invalid JSON from Docling subprocess: {e}. stderr={err_msg}\", \"file_path\": file_path},\n )\n\n if not result.get(\"ok\"):\n return Data(data={\"error\": result.get(\"error\", \"Unknown Docling error\"), **result.get(\"meta\", {})})\n\n meta = result.get(\"meta\", {})\n if result.get(\"mode\") == \"markdown\":\n exported_content = str(result.get(\"text\", \"\"))\n return Data(\n text=exported_content,\n data={\"exported_content\": exported_content, \"export_format\": self.EXPORT_FORMAT, **meta},\n )\n\n rows = list(result.get(\"doc\", []))\n return Data(data={\"doc\": rows, \"export_format\": self.EXPORT_FORMAT, **meta})\n\n def process_files(\n self,\n file_list: list[BaseFileComponent.BaseFile],\n ) -> list[BaseFileComponent.BaseFile]:\n \"\"\"Process input files.\n\n - Single file + advanced_mode => Docling in a separate process.\n - Otherwise => standard parsing in current process (optionally threaded).\n \"\"\"\n if not file_list:\n msg = \"No files to process.\"\n raise ValueError(msg)\n\n def process_file_standard(file_path: str, *, silent_errors: bool = False) -> Data | None:\n try:\n return parse_text_file_to_data(file_path, silent_errors=silent_errors)\n except FileNotFoundError as e:\n self.log(f\"File not found: {file_path}. Error: {e}\")\n if not silent_errors:\n raise\n return None\n except Exception as e:\n self.log(f\"Unexpected error processing {file_path}: {e}\")\n if not silent_errors:\n raise\n return None\n\n # Advanced path: only for a single Docling-compatible file\n if len(file_list) == 1:\n file_path = str(file_list[0].path)\n if self.advanced_mode and self._is_docling_compatible(file_path):\n advanced_data: Data | None = self._process_docling_in_subprocess(file_path)\n\n # --- UNNEST: expand each element in `doc` to its own Data row\n payload = getattr(advanced_data, \"data\", {}) or {}\n doc_rows = payload.get(\"doc\")\n if isinstance(doc_rows, list):\n rows: list[Data | None] = [\n Data(\n data={\n \"file_path\": file_path,\n **(item if isinstance(item, dict) else {\"value\": item}),\n },\n )\n for item in doc_rows\n ]\n return self.rollup_data(file_list, rows)\n\n # If not structured, keep as-is (e.g., markdown export or error dict)\n return self.rollup_data(file_list, [advanced_data])\n\n # Standard multi-file (or single non-advanced) path\n concurrency = 1 if not self.use_multithreading else max(1, self.concurrency_multithreading)\n file_paths = [str(f.path) for f in file_list]\n self.log(f\"Starting parallel processing of {len(file_paths)} files with concurrency: {concurrency}.\")\n my_data = parallel_load_data(\n file_paths,\n silent_errors=self.silent_errors,\n load_function=process_file_standard,\n max_concurrency=concurrency,\n )\n return self.rollup_data(file_list, my_data)\n\n # ------------------------------ Output helpers -----------------------------------\n\n def load_files_advanced(self) -> DataFrame:\n \"\"\"Load files using advanced Docling processing and export to an advanced format.\"\"\"\n self.markdown = False\n return self.load_files()\n\n def load_files_markdown(self) -> Message:\n \"\"\"Load files using advanced Docling processing and export to Markdown format.\"\"\"\n self.markdown = True\n result = self.load_files()\n return Message(text=str(result.text[0]))\n" |
There was a problem hiding this comment.
Add timeout and robust error handling to Docling subprocess call.
Same concern as the other starter: protect against hangs and surface a clear timeout error.
Apply this diff inside the embedded FileComponent code:
- proc = subprocess.run( # noqa: S603
- [sys.executable, "-u", "-c", child_script],
- input=json.dumps(args).encode("utf-8"),
- capture_output=True,
- check=False,
- )
+ try:
+ proc = subprocess.run( # noqa: S603
+ [sys.executable, "-u", "-c", child_script],
+ input=json.dumps(args).encode("utf-8"),
+ capture_output=True,
+ check=False,
+ timeout=getattr(self, "subprocess_timeout", 120),
+ )
+ except subprocess.TimeoutExpired:
+ return Data(data={"error": "Docling subprocess timed out", "file_path": file_path})📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "value": "\"\"\"Enhanced file component with clearer structure and Docling isolation.\n\nNotes:\n-----\n- Functionality is preserved with minimal behavioral changes.\n- ALL Docling parsing/export runs in a separate OS process to prevent memory\n growth and native library state from impacting the main Langflow process.\n- Standard text/structured parsing continues to use existing BaseFileComponent\n utilities (and optional threading via `parallel_load_data`).\n\"\"\"\n\nfrom __future__ import annotations\n\nimport json\nimport subprocess\nimport sys\nimport textwrap\nfrom copy import deepcopy\nfrom typing import Any\n\nfrom lfx.base.data.base_file import BaseFileComponent\nfrom lfx.base.data.utils import TEXT_FILE_TYPES, parallel_load_data, parse_text_file_to_data\nfrom lfx.inputs.inputs import DropdownInput, MessageTextInput, StrInput\nfrom lfx.io import BoolInput, FileInput, IntInput, Output\nfrom lfx.schema import DataFrame # noqa: TC001\nfrom lfx.schema.data import Data\nfrom lfx.schema.message import Message\n\n\nclass FileComponent(BaseFileComponent):\n \"\"\"Read File component with optional Docling processing (isolated in a subprocess).\"\"\"\n\n display_name = \"Read File\"\n description = \"Loads content from files with optional advanced document processing and export using Docling.\"\n documentation: str = \"https://docs.langflow.org/components-data#file\"\n icon = \"file-text\"\n name = \"File\"\n\n # Docling-supported/compatible extensions; TEXT_FILE_TYPES are supported by the base loader.\n VALID_EXTENSIONS = [\n \"adoc\",\n \"asciidoc\",\n \"asc\",\n \"bmp\",\n \"csv\",\n \"dotx\",\n \"dotm\",\n \"docm\",\n \"docx\",\n \"htm\",\n \"html\",\n \"jpeg\",\n \"json\",\n \"md\",\n \"pdf\",\n \"png\",\n \"potx\",\n \"ppsx\",\n \"pptm\",\n \"potm\",\n \"ppsm\",\n \"pptx\",\n \"tiff\",\n \"txt\",\n \"xls\",\n \"xlsx\",\n \"xhtml\",\n \"xml\",\n \"webp\",\n *TEXT_FILE_TYPES,\n ]\n\n # Fixed export settings used when markdown export is requested.\n EXPORT_FORMAT = \"Markdown\"\n IMAGE_MODE = \"placeholder\"\n\n _base_inputs = deepcopy(BaseFileComponent.get_base_inputs())\n\n for input_item in _base_inputs:\n if isinstance(input_item, FileInput) and input_item.name == \"path\":\n input_item.real_time_refresh = True\n break\n\n inputs = [\n *_base_inputs,\n BoolInput(\n name=\"advanced_mode\",\n display_name=\"Advanced Parser\",\n value=False,\n real_time_refresh=True,\n info=(\n \"Enable advanced document processing and export with Docling for PDFs, images, and office documents. \"\n \"Available only for single file processing.\"\n ),\n show=False,\n ),\n DropdownInput(\n name=\"pipeline\",\n display_name=\"Pipeline\",\n info=\"Docling pipeline to use\",\n options=[\"standard\", \"vlm\"],\n value=\"standard\",\n advanced=True,\n ),\n DropdownInput(\n name=\"ocr_engine\",\n display_name=\"OCR Engine\",\n info=\"OCR engine to use. Only available when pipeline is set to 'standard'.\",\n options=[\"\", \"easyocr\"],\n value=\"\",\n show=False,\n advanced=True,\n ),\n StrInput(\n name=\"md_image_placeholder\",\n display_name=\"Image placeholder\",\n info=\"Specify the image placeholder for markdown exports.\",\n value=\"<!-- image -->\",\n advanced=True,\n show=False,\n ),\n StrInput(\n name=\"md_page_break_placeholder\",\n display_name=\"Page break placeholder\",\n info=\"Add this placeholder between pages in the markdown output.\",\n value=\"\",\n advanced=True,\n show=False,\n ),\n MessageTextInput(\n name=\"doc_key\",\n display_name=\"Doc Key\",\n info=\"The key to use for the DoclingDocument column.\",\n value=\"doc\",\n advanced=True,\n show=False,\n ),\n # Deprecated input retained for backward-compatibility.\n BoolInput(\n name=\"use_multithreading\",\n display_name=\"[Deprecated] Use Multithreading\",\n advanced=True,\n value=True,\n info=\"Set 'Processing Concurrency' greater than 1 to enable multithreading.\",\n ),\n IntInput(\n name=\"concurrency_multithreading\",\n display_name=\"Processing Concurrency\",\n advanced=True,\n info=\"When multiple files are being processed, the number of files to process concurrently.\",\n value=1,\n ),\n BoolInput(\n name=\"markdown\",\n display_name=\"Markdown Export\",\n info=\"Export processed documents to Markdown format. Only available when advanced mode is enabled.\",\n value=False,\n show=False,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\", tool_mode=True),\n ]\n\n # ------------------------------ UI helpers --------------------------------------\n\n def _path_value(self, template: dict) -> list[str]:\n \"\"\"Return the list of currently selected file paths from the template.\"\"\"\n return template.get(\"path\", {}).get(\"file_path\", [])\n\n def update_build_config(\n self,\n build_config: dict[str, Any],\n field_value: Any,\n field_name: str | None = None,\n ) -> dict[str, Any]:\n \"\"\"Show/hide Advanced Parser and related fields based on selection context.\"\"\"\n if field_name == \"path\":\n paths = self._path_value(build_config)\n file_path = paths[0] if paths else \"\"\n file_count = len(field_value) if field_value else 0\n\n # Advanced mode only for single (non-tabular) file\n allow_advanced = file_count == 1 and not file_path.endswith((\".csv\", \".xlsx\", \".parquet\"))\n build_config[\"advanced_mode\"][\"show\"] = allow_advanced\n if not allow_advanced:\n build_config[\"advanced_mode\"][\"value\"] = False\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = False\n\n elif field_name == \"advanced_mode\":\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = bool(field_value)\n\n return build_config\n\n def update_outputs(self, frontend_node: dict[str, Any], field_name: str, field_value: Any) -> dict[str, Any]: # noqa: ARG002\n \"\"\"Dynamically show outputs based on file count/type and advanced mode.\"\"\"\n if field_name not in [\"path\", \"advanced_mode\"]:\n return frontend_node\n\n template = frontend_node.get(\"template\", {})\n paths = self._path_value(template)\n if not paths:\n return frontend_node\n\n frontend_node[\"outputs\"] = []\n if len(paths) == 1:\n file_path = paths[0] if field_name == \"path\" else frontend_node[\"template\"][\"path\"][\"file_path\"][0]\n if file_path.endswith((\".csv\", \".xlsx\", \".parquet\")):\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Structured Content\",\n name=\"dataframe\",\n method=\"load_files_structured\",\n tool_mode=True,\n ),\n )\n elif file_path.endswith(\".json\"):\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Structured Content\", name=\"json\", method=\"load_files_json\", tool_mode=True),\n )\n\n advanced_mode = frontend_node.get(\"template\", {}).get(\"advanced_mode\", {}).get(\"value\", False)\n if advanced_mode:\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Structured Output\",\n name=\"advanced\",\n method=\"load_files_advanced\",\n tool_mode=True,\n ),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Markdown\", name=\"markdown\", method=\"load_files_markdown\", tool_mode=True),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\", tool_mode=True),\n )\n else:\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\", tool_mode=True),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\", tool_mode=True),\n )\n else:\n # Multiple files => DataFrame output; advanced parser disabled\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Files\",\n name=\"dataframe\",\n method=\"load_files\",\n tool_mode=True,\n )\n )\n\n return frontend_node\n\n # ------------------------------ Core processing ----------------------------------\n\n def _is_docling_compatible(self, file_path: str) -> bool:\n \"\"\"Lightweight extension gate for Docling-compatible types.\"\"\"\n docling_exts = (\n \".adoc\",\n \".asciidoc\",\n \".asc\",\n \".bmp\",\n \".csv\",\n \".dotx\",\n \".dotm\",\n \".docm\",\n \".docx\",\n \".htm\",\n \".html\",\n \".jpeg\",\n \".json\",\n \".md\",\n \".pdf\",\n \".png\",\n \".potx\",\n \".ppsx\",\n \".pptm\",\n \".potm\",\n \".ppsm\",\n \".pptx\",\n \".tiff\",\n \".txt\",\n \".xls\",\n \".xlsx\",\n \".xhtml\",\n \".xml\",\n \".webp\",\n )\n return file_path.lower().endswith(docling_exts)\n\n def _process_docling_in_subprocess(self, file_path: str) -> Data | None:\n \"\"\"Run Docling in a separate OS process and map the result to a Data object.\n\n We avoid multiprocessing pickling by launching `python -c \"<script>\"` and\n passing JSON config via stdin. The child prints a JSON result to stdout.\n \"\"\"\n if not file_path:\n return None\n\n args: dict[str, Any] = {\n \"file_path\": file_path,\n \"markdown\": bool(self.markdown),\n \"image_mode\": str(self.IMAGE_MODE),\n \"md_image_placeholder\": str(self.md_image_placeholder),\n \"md_page_break_placeholder\": str(self.md_page_break_placeholder),\n \"pipeline\": str(self.pipeline),\n \"ocr_engine\": str(self.ocr_engine) if getattr(self, \"ocr_engine\", \"\") else None,\n }\n\n # The child is a tiny, self-contained script to keep memory/state isolated.\n child_script = textwrap.dedent(\n r\"\"\"\n import json, sys\n\n def try_imports():\n # Strategy 1: latest layout\n try:\n from docling.datamodel.base_models import ConversionStatus, InputFormat # type: ignore\n from docling.document_converter import DocumentConverter # type: ignore\n from docling_core.types.doc import ImageRefMode # type: ignore\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"latest\"\n except Exception:\n pass\n # Strategy 2: alternative layout\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n try:\n from docling_core.types import ConversionStatus, InputFormat # type: ignore\n except Exception:\n try:\n from docling.datamodel import ConversionStatus, InputFormat # type: ignore\n except Exception:\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n try:\n from docling_core.types.doc import ImageRefMode # type: ignore\n except Exception:\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"alternative\"\n except Exception:\n pass\n # Strategy 3: basic converter only\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"basic\"\n except Exception as e:\n raise ImportError(f\"Docling imports failed: {e}\") from e\n\n def create_converter(strategy, input_format, DocumentConverter, pipeline, ocr_engine):\n if strategy == \"latest\" and pipeline == \"standard\":\n try:\n from docling.datamodel.pipeline_options import PdfPipelineOptions # type: ignore\n from docling.document_converter import PdfFormatOption # type: ignore\n pipe = PdfPipelineOptions()\n if ocr_engine:\n try:\n from docling.models.factories import get_ocr_factory # type: ignore\n pipe.do_ocr = True\n fac = get_ocr_factory(allow_external_plugins=False)\n pipe.ocr_options = fac.create_options(kind=ocr_engine)\n except Exception:\n pipe.do_ocr = False\n fmt = {}\n if hasattr(input_format, \"PDF\"):\n fmt[getattr(input_format, \"PDF\")] = PdfFormatOption(pipeline_options=pipe)\n if hasattr(input_format, \"IMAGE\"):\n fmt[getattr(input_format, \"IMAGE\")] = PdfFormatOption(pipeline_options=pipe)\n return DocumentConverter(format_options=fmt)\n except Exception:\n return DocumentConverter()\n return DocumentConverter()\n\n def export_markdown(document, ImageRefMode, image_mode, img_ph, pg_ph):\n try:\n mode = getattr(ImageRefMode, image_mode.upper(), image_mode)\n return document.export_to_markdown(\n image_mode=mode,\n image_placeholder=img_ph,\n page_break_placeholder=pg_ph,\n )\n except Exception:\n try:\n return document.export_to_text()\n except Exception:\n return str(document)\n\n def to_rows(doc_dict):\n rows = []\n for t in doc_dict.get(\"texts\", []):\n prov = t.get(\"prov\") or []\n page_no = None\n if prov and isinstance(prov, list) and isinstance(prov[0], dict):\n page_no = prov[0].get(\"page_no\")\n rows.append({\n \"page_no\": page_no,\n \"label\": t.get(\"label\"),\n \"text\": t.get(\"text\"),\n \"level\": t.get(\"level\"),\n })\n return rows\n\n def main():\n cfg = json.loads(sys.stdin.read())\n file_path = cfg[\"file_path\"]\n markdown = cfg[\"markdown\"]\n image_mode = cfg[\"image_mode\"]\n img_ph = cfg[\"md_image_placeholder\"]\n pg_ph = cfg[\"md_page_break_placeholder\"]\n pipeline = cfg[\"pipeline\"]\n ocr_engine = cfg.get(\"ocr_engine\")\n meta = {\"file_path\": file_path}\n\n try:\n ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, strategy = try_imports()\n converter = create_converter(strategy, InputFormat, DocumentConverter, pipeline, ocr_engine)\n try:\n res = converter.convert(file_path)\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling conversion error: {e}\", \"meta\": meta}))\n return\n\n ok = False\n if hasattr(res, \"status\"):\n try:\n ok = (res.status == ConversionStatus.SUCCESS) or (str(res.status).lower() == \"success\")\n except Exception:\n ok = (str(res.status).lower() == \"success\")\n if not ok and hasattr(res, \"document\"):\n ok = getattr(res, \"document\", None) is not None\n if not ok:\n print(json.dumps({\"ok\": False, \"error\": \"Docling conversion failed\", \"meta\": meta}))\n return\n\n doc = getattr(res, \"document\", None)\n if doc is None:\n print(json.dumps({\"ok\": False, \"error\": \"Docling produced no document\", \"meta\": meta}))\n return\n\n if markdown:\n text = export_markdown(doc, ImageRefMode, image_mode, img_ph, pg_ph)\n print(json.dumps({\"ok\": True, \"mode\": \"markdown\", \"text\": text, \"meta\": meta}))\n return\n\n # structured\n try:\n doc_dict = doc.export_to_dict()\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling export_to_dict failed: {e}\", \"meta\": meta}))\n return\n\n rows = to_rows(doc_dict)\n print(json.dumps({\"ok\": True, \"mode\": \"structured\", \"doc\": rows, \"meta\": meta}))\n except Exception as e:\n print(\n json.dumps({\n \"ok\": False,\n \"error\": f\"Docling processing error: {e}\",\n \"meta\": {\"file_path\": file_path},\n })\n )\n\n if __name__ == \"__main__\":\n main()\n \"\"\"\n )\n\n # Validate file_path to avoid command injection or unsafe input\n if not isinstance(args[\"file_path\"], str) or any(c in args[\"file_path\"] for c in [\";\", \"|\", \"&\", \"$\", \"`\"]):\n return Data(data={\"error\": \"Unsafe file path detected.\", \"file_path\": args[\"file_path\"]})\n\n proc = subprocess.run( # noqa: S603\n [sys.executable, \"-u\", \"-c\", child_script],\n input=json.dumps(args).encode(\"utf-8\"),\n capture_output=True,\n check=False,\n )\n\n if not proc.stdout:\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\") or \"no output from child process\"\n return Data(data={\"error\": f\"Docling subprocess error: {err_msg}\", \"file_path\": file_path})\n\n try:\n result = json.loads(proc.stdout.decode(\"utf-8\"))\n except Exception as e: # noqa: BLE001\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\")\n return Data(\n data={\"error\": f\"Invalid JSON from Docling subprocess: {e}. stderr={err_msg}\", \"file_path\": file_path},\n )\n\n if not result.get(\"ok\"):\n return Data(data={\"error\": result.get(\"error\", \"Unknown Docling error\"), **result.get(\"meta\", {})})\n\n meta = result.get(\"meta\", {})\n if result.get(\"mode\") == \"markdown\":\n exported_content = str(result.get(\"text\", \"\"))\n return Data(\n text=exported_content,\n data={\"exported_content\": exported_content, \"export_format\": self.EXPORT_FORMAT, **meta},\n )\n\n rows = list(result.get(\"doc\", []))\n return Data(data={\"doc\": rows, \"export_format\": self.EXPORT_FORMAT, **meta})\n\n def process_files(\n self,\n file_list: list[BaseFileComponent.BaseFile],\n ) -> list[BaseFileComponent.BaseFile]:\n \"\"\"Process input files.\n\n - Single file + advanced_mode => Docling in a separate process.\n - Otherwise => standard parsing in current process (optionally threaded).\n \"\"\"\n if not file_list:\n msg = \"No files to process.\"\n raise ValueError(msg)\n\n def process_file_standard(file_path: str, *, silent_errors: bool = False) -> Data | None:\n try:\n return parse_text_file_to_data(file_path, silent_errors=silent_errors)\n except FileNotFoundError as e:\n self.log(f\"File not found: {file_path}. Error: {e}\")\n if not silent_errors:\n raise\n return None\n except Exception as e:\n self.log(f\"Unexpected error processing {file_path}: {e}\")\n if not silent_errors:\n raise\n return None\n\n # Advanced path: only for a single Docling-compatible file\n if len(file_list) == 1:\n file_path = str(file_list[0].path)\n if self.advanced_mode and self._is_docling_compatible(file_path):\n advanced_data: Data | None = self._process_docling_in_subprocess(file_path)\n\n # --- UNNEST: expand each element in `doc` to its own Data row\n payload = getattr(advanced_data, \"data\", {}) or {}\n doc_rows = payload.get(\"doc\")\n if isinstance(doc_rows, list):\n rows: list[Data | None] = [\n Data(\n data={\n \"file_path\": file_path,\n **(item if isinstance(item, dict) else {\"value\": item}),\n },\n )\n for item in doc_rows\n ]\n return self.rollup_data(file_list, rows)\n\n # If not structured, keep as-is (e.g., markdown export or error dict)\n return self.rollup_data(file_list, [advanced_data])\n\n # Standard multi-file (or single non-advanced) path\n concurrency = 1 if not self.use_multithreading else max(1, self.concurrency_multithreading)\n file_paths = [str(f.path) for f in file_list]\n self.log(f\"Starting parallel processing of {len(file_paths)} files with concurrency: {concurrency}.\")\n my_data = parallel_load_data(\n file_paths,\n silent_errors=self.silent_errors,\n load_function=process_file_standard,\n max_concurrency=concurrency,\n )\n return self.rollup_data(file_list, my_data)\n\n # ------------------------------ Output helpers -----------------------------------\n\n def load_files_advanced(self) -> DataFrame:\n \"\"\"Load files using advanced Docling processing and export to an advanced format.\"\"\"\n self.markdown = False\n return self.load_files()\n\n def load_files_markdown(self) -> Message:\n \"\"\"Load files using advanced Docling processing and export to Markdown format.\"\"\"\n self.markdown = True\n result = self.load_files()\n return Message(text=str(result.text[0]))\n" | |
| # Validate file_path to avoid command injection or unsafe input | |
| if not isinstance(args["file_path"], str) or any(c in args["file_path"] for c in [";", "|", "&", "$", "`"]): | |
| return Data(data={"error": "Unsafe file path detected.", "file_path": args["file_path"]}) | |
| try: | |
| proc = subprocess.run( # noqa: S603 | |
| [sys.executable, "-u", "-c", child_script], | |
| input=json.dumps(args).encode("utf-8"), | |
| capture_output=True, | |
| check=False, | |
| timeout=getattr(self, "subprocess_timeout", 120), | |
| ) | |
| except subprocess.TimeoutExpired: | |
| return Data(data={"error": "Docling subprocess timed out", "file_path": file_path}) | |
| if not proc.stdout: | |
| err_msg = proc.stderr.decode("utf-8", errors="replace") or "no output from child process" | |
| return Data(data={"error": f"Docling subprocess error: {err_msg}", "file_path": file_path}) |
🤖 Prompt for AI Agents
In src/backend/base/langflow/initial_setup/starter_projects/Document Q&A.json
around line 1333, the Docling subprocess invocation using subprocess.run has no
timeout or explicit handling for hung/long-running children; add a timeout
argument (reasonable default, e.g., 60s) to subprocess.run and wrap the call in
try/except to catch subprocess.TimeoutExpired and other exceptions; on
TimeoutExpired decode any partial output/stderr and return a Data object with a
clear timeout error and file_path, and on other exceptions return a Data object
with the subprocess error message and stderr included so the caller gets robust,
surfaced errors instead of hanging indefinitely.
Guard None from _process_docling_in_subprocess before rollup.
Avoid passing [None] to rollup_data when the child returns nothing.
Apply this diff:
if self.advanced_mode and self._is_docling_compatible(file_path):
advanced_data: Data | None = self._process_docling_in_subprocess(file_path)
+ if advanced_data is None:
+ return self.rollup_data(
+ file_list,
+ [Data(data={"error": "Docling returned no data", "file_path": file_path})],
+ )
# --- UNNEST: expand each element in `doc` to its own Data row
payload = getattr(advanced_data, "data", {}) or {}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "value": "\"\"\"Enhanced file component with clearer structure and Docling isolation.\n\nNotes:\n-----\n- Functionality is preserved with minimal behavioral changes.\n- ALL Docling parsing/export runs in a separate OS process to prevent memory\n growth and native library state from impacting the main Langflow process.\n- Standard text/structured parsing continues to use existing BaseFileComponent\n utilities (and optional threading via `parallel_load_data`).\n\"\"\"\n\nfrom __future__ import annotations\n\nimport json\nimport subprocess\nimport sys\nimport textwrap\nfrom copy import deepcopy\nfrom typing import Any\n\nfrom lfx.base.data.base_file import BaseFileComponent\nfrom lfx.base.data.utils import TEXT_FILE_TYPES, parallel_load_data, parse_text_file_to_data\nfrom lfx.inputs.inputs import DropdownInput, MessageTextInput, StrInput\nfrom lfx.io import BoolInput, FileInput, IntInput, Output\nfrom lfx.schema import DataFrame # noqa: TC001\nfrom lfx.schema.data import Data\nfrom lfx.schema.message import Message\n\n\nclass FileComponent(BaseFileComponent):\n \"\"\"Read File component with optional Docling processing (isolated in a subprocess).\"\"\"\n\n display_name = \"Read File\"\n description = \"Loads content from files with optional advanced document processing and export using Docling.\"\n documentation: str = \"https://docs.langflow.org/components-data#file\"\n icon = \"file-text\"\n name = \"File\"\n\n # Docling-supported/compatible extensions; TEXT_FILE_TYPES are supported by the base loader.\n VALID_EXTENSIONS = [\n \"adoc\",\n \"asciidoc\",\n \"asc\",\n \"bmp\",\n \"csv\",\n \"dotx\",\n \"dotm\",\n \"docm\",\n \"docx\",\n \"htm\",\n \"html\",\n \"jpeg\",\n \"json\",\n \"md\",\n \"pdf\",\n \"png\",\n \"potx\",\n \"ppsx\",\n \"pptm\",\n \"potm\",\n \"ppsm\",\n \"pptx\",\n \"tiff\",\n \"txt\",\n \"xls\",\n \"xlsx\",\n \"xhtml\",\n \"xml\",\n \"webp\",\n *TEXT_FILE_TYPES,\n ]\n\n # Fixed export settings used when markdown export is requested.\n EXPORT_FORMAT = \"Markdown\"\n IMAGE_MODE = \"placeholder\"\n\n _base_inputs = deepcopy(BaseFileComponent.get_base_inputs())\n\n for input_item in _base_inputs:\n if isinstance(input_item, FileInput) and input_item.name == \"path\":\n input_item.real_time_refresh = True\n break\n\n inputs = [\n *_base_inputs,\n BoolInput(\n name=\"advanced_mode\",\n display_name=\"Advanced Parser\",\n value=False,\n real_time_refresh=True,\n info=(\n \"Enable advanced document processing and export with Docling for PDFs, images, and office documents. \"\n \"Available only for single file processing.\"\n ),\n show=False,\n ),\n DropdownInput(\n name=\"pipeline\",\n display_name=\"Pipeline\",\n info=\"Docling pipeline to use\",\n options=[\"standard\", \"vlm\"],\n value=\"standard\",\n advanced=True,\n ),\n DropdownInput(\n name=\"ocr_engine\",\n display_name=\"OCR Engine\",\n info=\"OCR engine to use. Only available when pipeline is set to 'standard'.\",\n options=[\"\", \"easyocr\"],\n value=\"\",\n show=False,\n advanced=True,\n ),\n StrInput(\n name=\"md_image_placeholder\",\n display_name=\"Image placeholder\",\n info=\"Specify the image placeholder for markdown exports.\",\n value=\"<!-- image -->\",\n advanced=True,\n show=False,\n ),\n StrInput(\n name=\"md_page_break_placeholder\",\n display_name=\"Page break placeholder\",\n info=\"Add this placeholder between pages in the markdown output.\",\n value=\"\",\n advanced=True,\n show=False,\n ),\n MessageTextInput(\n name=\"doc_key\",\n display_name=\"Doc Key\",\n info=\"The key to use for the DoclingDocument column.\",\n value=\"doc\",\n advanced=True,\n show=False,\n ),\n # Deprecated input retained for backward-compatibility.\n BoolInput(\n name=\"use_multithreading\",\n display_name=\"[Deprecated] Use Multithreading\",\n advanced=True,\n value=True,\n info=\"Set 'Processing Concurrency' greater than 1 to enable multithreading.\",\n ),\n IntInput(\n name=\"concurrency_multithreading\",\n display_name=\"Processing Concurrency\",\n advanced=True,\n info=\"When multiple files are being processed, the number of files to process concurrently.\",\n value=1,\n ),\n BoolInput(\n name=\"markdown\",\n display_name=\"Markdown Export\",\n info=\"Export processed documents to Markdown format. Only available when advanced mode is enabled.\",\n value=False,\n show=False,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\", tool_mode=True),\n ]\n\n # ------------------------------ UI helpers --------------------------------------\n\n def _path_value(self, template: dict) -> list[str]:\n \"\"\"Return the list of currently selected file paths from the template.\"\"\"\n return template.get(\"path\", {}).get(\"file_path\", [])\n\n def update_build_config(\n self,\n build_config: dict[str, Any],\n field_value: Any,\n field_name: str | None = None,\n ) -> dict[str, Any]:\n \"\"\"Show/hide Advanced Parser and related fields based on selection context.\"\"\"\n if field_name == \"path\":\n paths = self._path_value(build_config)\n file_path = paths[0] if paths else \"\"\n file_count = len(field_value) if field_value else 0\n\n # Advanced mode only for single (non-tabular) file\n allow_advanced = file_count == 1 and not file_path.endswith((\".csv\", \".xlsx\", \".parquet\"))\n build_config[\"advanced_mode\"][\"show\"] = allow_advanced\n if not allow_advanced:\n build_config[\"advanced_mode\"][\"value\"] = False\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = False\n\n elif field_name == \"advanced_mode\":\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = bool(field_value)\n\n return build_config\n\n def update_outputs(self, frontend_node: dict[str, Any], field_name: str, field_value: Any) -> dict[str, Any]: # noqa: ARG002\n \"\"\"Dynamically show outputs based on file count/type and advanced mode.\"\"\"\n if field_name not in [\"path\", \"advanced_mode\"]:\n return frontend_node\n\n template = frontend_node.get(\"template\", {})\n paths = self._path_value(template)\n if not paths:\n return frontend_node\n\n frontend_node[\"outputs\"] = []\n if len(paths) == 1:\n file_path = paths[0] if field_name == \"path\" else frontend_node[\"template\"][\"path\"][\"file_path\"][0]\n if file_path.endswith((\".csv\", \".xlsx\", \".parquet\")):\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Structured Content\",\n name=\"dataframe\",\n method=\"load_files_structured\",\n tool_mode=True,\n ),\n )\n elif file_path.endswith(\".json\"):\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Structured Content\", name=\"json\", method=\"load_files_json\", tool_mode=True),\n )\n\n advanced_mode = frontend_node.get(\"template\", {}).get(\"advanced_mode\", {}).get(\"value\", False)\n if advanced_mode:\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Structured Output\",\n name=\"advanced\",\n method=\"load_files_advanced\",\n tool_mode=True,\n ),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Markdown\", name=\"markdown\", method=\"load_files_markdown\", tool_mode=True),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\", tool_mode=True),\n )\n else:\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\", tool_mode=True),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\", tool_mode=True),\n )\n else:\n # Multiple files => DataFrame output; advanced parser disabled\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Files\",\n name=\"dataframe\",\n method=\"load_files\",\n tool_mode=True,\n )\n )\n\n return frontend_node\n\n # ------------------------------ Core processing ----------------------------------\n\n def _is_docling_compatible(self, file_path: str) -> bool:\n \"\"\"Lightweight extension gate for Docling-compatible types.\"\"\"\n docling_exts = (\n \".adoc\",\n \".asciidoc\",\n \".asc\",\n \".bmp\",\n \".csv\",\n \".dotx\",\n \".dotm\",\n \".docm\",\n \".docx\",\n \".htm\",\n \".html\",\n \".jpeg\",\n \".json\",\n \".md\",\n \".pdf\",\n \".png\",\n \".potx\",\n \".ppsx\",\n \".pptm\",\n \".potm\",\n \".ppsm\",\n \".pptx\",\n \".tiff\",\n \".txt\",\n \".xls\",\n \".xlsx\",\n \".xhtml\",\n \".xml\",\n \".webp\",\n )\n return file_path.lower().endswith(docling_exts)\n\n def _process_docling_in_subprocess(self, file_path: str) -> Data | None:\n \"\"\"Run Docling in a separate OS process and map the result to a Data object.\n\n We avoid multiprocessing pickling by launching `python -c \"<script>\"` and\n passing JSON config via stdin. The child prints a JSON result to stdout.\n \"\"\"\n if not file_path:\n return None\n\n args: dict[str, Any] = {\n \"file_path\": file_path,\n \"markdown\": bool(self.markdown),\n \"image_mode\": str(self.IMAGE_MODE),\n \"md_image_placeholder\": str(self.md_image_placeholder),\n \"md_page_break_placeholder\": str(self.md_page_break_placeholder),\n \"pipeline\": str(self.pipeline),\n \"ocr_engine\": str(self.ocr_engine) if getattr(self, \"ocr_engine\", \"\") else None,\n }\n\n # The child is a tiny, self-contained script to keep memory/state isolated.\n child_script = textwrap.dedent(\n r\"\"\"\n import json, sys\n\n def try_imports():\n # Strategy 1: latest layout\n try:\n from docling.datamodel.base_models import ConversionStatus, InputFormat # type: ignore\n from docling.document_converter import DocumentConverter # type: ignore\n from docling_core.types.doc import ImageRefMode # type: ignore\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"latest\"\n except Exception:\n pass\n # Strategy 2: alternative layout\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n try:\n from docling_core.types import ConversionStatus, InputFormat # type: ignore\n except Exception:\n try:\n from docling.datamodel import ConversionStatus, InputFormat # type: ignore\n except Exception:\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n try:\n from docling_core.types.doc import ImageRefMode # type: ignore\n except Exception:\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"alternative\"\n except Exception:\n pass\n # Strategy 3: basic converter only\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"basic\"\n except Exception as e:\n raise ImportError(f\"Docling imports failed: {e}\") from e\n\n def create_converter(strategy, input_format, DocumentConverter, pipeline, ocr_engine):\n if strategy == \"latest\" and pipeline == \"standard\":\n try:\n from docling.datamodel.pipeline_options import PdfPipelineOptions # type: ignore\n from docling.document_converter import PdfFormatOption # type: ignore\n pipe = PdfPipelineOptions()\n if ocr_engine:\n try:\n from docling.models.factories import get_ocr_factory # type: ignore\n pipe.do_ocr = True\n fac = get_ocr_factory(allow_external_plugins=False)\n pipe.ocr_options = fac.create_options(kind=ocr_engine)\n except Exception:\n pipe.do_ocr = False\n fmt = {}\n if hasattr(input_format, \"PDF\"):\n fmt[getattr(input_format, \"PDF\")] = PdfFormatOption(pipeline_options=pipe)\n if hasattr(input_format, \"IMAGE\"):\n fmt[getattr(input_format, \"IMAGE\")] = PdfFormatOption(pipeline_options=pipe)\n return DocumentConverter(format_options=fmt)\n except Exception:\n return DocumentConverter()\n return DocumentConverter()\n\n def export_markdown(document, ImageRefMode, image_mode, img_ph, pg_ph):\n try:\n mode = getattr(ImageRefMode, image_mode.upper(), image_mode)\n return document.export_to_markdown(\n image_mode=mode,\n image_placeholder=img_ph,\n page_break_placeholder=pg_ph,\n )\n except Exception:\n try:\n return document.export_to_text()\n except Exception:\n return str(document)\n\n def to_rows(doc_dict):\n rows = []\n for t in doc_dict.get(\"texts\", []):\n prov = t.get(\"prov\") or []\n page_no = None\n if prov and isinstance(prov, list) and isinstance(prov[0], dict):\n page_no = prov[0].get(\"page_no\")\n rows.append({\n \"page_no\": page_no,\n \"label\": t.get(\"label\"),\n \"text\": t.get(\"text\"),\n \"level\": t.get(\"level\"),\n })\n return rows\n\n def main():\n cfg = json.loads(sys.stdin.read())\n file_path = cfg[\"file_path\"]\n markdown = cfg[\"markdown\"]\n image_mode = cfg[\"image_mode\"]\n img_ph = cfg[\"md_image_placeholder\"]\n pg_ph = cfg[\"md_page_break_placeholder\"]\n pipeline = cfg[\"pipeline\"]\n ocr_engine = cfg.get(\"ocr_engine\")\n meta = {\"file_path\": file_path}\n\n try:\n ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, strategy = try_imports()\n converter = create_converter(strategy, InputFormat, DocumentConverter, pipeline, ocr_engine)\n try:\n res = converter.convert(file_path)\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling conversion error: {e}\", \"meta\": meta}))\n return\n\n ok = False\n if hasattr(res, \"status\"):\n try:\n ok = (res.status == ConversionStatus.SUCCESS) or (str(res.status).lower() == \"success\")\n except Exception:\n ok = (str(res.status).lower() == \"success\")\n if not ok and hasattr(res, \"document\"):\n ok = getattr(res, \"document\", None) is not None\n if not ok:\n print(json.dumps({\"ok\": False, \"error\": \"Docling conversion failed\", \"meta\": meta}))\n return\n\n doc = getattr(res, \"document\", None)\n if doc is None:\n print(json.dumps({\"ok\": False, \"error\": \"Docling produced no document\", \"meta\": meta}))\n return\n\n if markdown:\n text = export_markdown(doc, ImageRefMode, image_mode, img_ph, pg_ph)\n print(json.dumps({\"ok\": True, \"mode\": \"markdown\", \"text\": text, \"meta\": meta}))\n return\n\n # structured\n try:\n doc_dict = doc.export_to_dict()\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling export_to_dict failed: {e}\", \"meta\": meta}))\n return\n\n rows = to_rows(doc_dict)\n print(json.dumps({\"ok\": True, \"mode\": \"structured\", \"doc\": rows, \"meta\": meta}))\n except Exception as e:\n print(\n json.dumps({\n \"ok\": False,\n \"error\": f\"Docling processing error: {e}\",\n \"meta\": {\"file_path\": file_path},\n })\n )\n\n if __name__ == \"__main__\":\n main()\n \"\"\"\n )\n\n # Validate file_path to avoid command injection or unsafe input\n if not isinstance(args[\"file_path\"], str) or any(c in args[\"file_path\"] for c in [\";\", \"|\", \"&\", \"$\", \"`\"]):\n return Data(data={\"error\": \"Unsafe file path detected.\", \"file_path\": args[\"file_path\"]})\n\n proc = subprocess.run( # noqa: S603\n [sys.executable, \"-u\", \"-c\", child_script],\n input=json.dumps(args).encode(\"utf-8\"),\n capture_output=True,\n check=False,\n )\n\n if not proc.stdout:\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\") or \"no output from child process\"\n return Data(data={\"error\": f\"Docling subprocess error: {err_msg}\", \"file_path\": file_path})\n\n try:\n result = json.loads(proc.stdout.decode(\"utf-8\"))\n except Exception as e: # noqa: BLE001\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\")\n return Data(\n data={\"error\": f\"Invalid JSON from Docling subprocess: {e}. stderr={err_msg}\", \"file_path\": file_path},\n )\n\n if not result.get(\"ok\"):\n return Data(data={\"error\": result.get(\"error\", \"Unknown Docling error\"), **result.get(\"meta\", {})})\n\n meta = result.get(\"meta\", {})\n if result.get(\"mode\") == \"markdown\":\n exported_content = str(result.get(\"text\", \"\"))\n return Data(\n text=exported_content,\n data={\"exported_content\": exported_content, \"export_format\": self.EXPORT_FORMAT, **meta},\n )\n\n rows = list(result.get(\"doc\", []))\n return Data(data={\"doc\": rows, \"export_format\": self.EXPORT_FORMAT, **meta})\n\n def process_files(\n self,\n file_list: list[BaseFileComponent.BaseFile],\n ) -> list[BaseFileComponent.BaseFile]:\n \"\"\"Process input files.\n\n - Single file + advanced_mode => Docling in a separate process.\n - Otherwise => standard parsing in current process (optionally threaded).\n \"\"\"\n if not file_list:\n msg = \"No files to process.\"\n raise ValueError(msg)\n\n def process_file_standard(file_path: str, *, silent_errors: bool = False) -> Data | None:\n try:\n return parse_text_file_to_data(file_path, silent_errors=silent_errors)\n except FileNotFoundError as e:\n self.log(f\"File not found: {file_path}. Error: {e}\")\n if not silent_errors:\n raise\n return None\n except Exception as e:\n self.log(f\"Unexpected error processing {file_path}: {e}\")\n if not silent_errors:\n raise\n return None\n\n # Advanced path: only for a single Docling-compatible file\n if len(file_list) == 1:\n file_path = str(file_list[0].path)\n if self.advanced_mode and self._is_docling_compatible(file_path):\n advanced_data: Data | None = self._process_docling_in_subprocess(file_path)\n\n # --- UNNEST: expand each element in `doc` to its own Data row\n payload = getattr(advanced_data, \"data\", {}) or {}\n doc_rows = payload.get(\"doc\")\n if isinstance(doc_rows, list):\n rows: list[Data | None] = [\n Data(\n data={\n \"file_path\": file_path,\n **(item if isinstance(item, dict) else {\"value\": item}),\n },\n )\n for item in doc_rows\n ]\n return self.rollup_data(file_list, rows)\n\n # If not structured, keep as-is (e.g., markdown export or error dict)\n return self.rollup_data(file_list, [advanced_data])\n\n # Standard multi-file (or single non-advanced) path\n concurrency = 1 if not self.use_multithreading else max(1, self.concurrency_multithreading)\n file_paths = [str(f.path) for f in file_list]\n self.log(f\"Starting parallel processing of {len(file_paths)} files with concurrency: {concurrency}.\")\n my_data = parallel_load_data(\n file_paths,\n silent_errors=self.silent_errors,\n load_function=process_file_standard,\n max_concurrency=concurrency,\n )\n return self.rollup_data(file_list, my_data)\n\n # ------------------------------ Output helpers -----------------------------------\n\n def load_files_advanced(self) -> DataFrame:\n \"\"\"Load files using advanced Docling processing and export to an advanced format.\"\"\"\n self.markdown = False\n return self.load_files()\n\n def load_files_markdown(self) -> Message:\n \"\"\"Load files using advanced Docling processing and export to Markdown format.\"\"\"\n self.markdown = True\n result = self.load_files()\n return Message(text=str(result.text[0]))\n" | |
| """Enhanced file component with clearer structure and Docling isolation. | |
| Notes: | |
| ----- | |
| - Functionality is preserved with minimal behavioral changes. | |
| - ALL Docling parsing/export runs in a separate OS process to prevent memory | |
| growth and native library state from impacting the main Langflow process. | |
| - Standard text/structured parsing continues to use existing BaseFileComponent | |
| utilities (and optional threading via `parallel_load_data`). | |
| """ | |
| from __future__ import annotations | |
| import json | |
| import subprocess | |
| import sys | |
| import textwrap | |
| from copy import deepcopy | |
| from typing import Any | |
| from lfx.base.data.base_file import BaseFileComponent | |
| from lfx.base.data.utils import TEXT_FILE_TYPES, parallel_load_data, parse_text_file_to_data | |
| from lfx.inputs.inputs import DropdownInput, MessageTextInput, StrInput | |
| from lfx.io import BoolInput, FileInput, IntInput, Output | |
| from lfx.schema import DataFrame # noqa: TC001 | |
| from lfx.schema.data import Data | |
| from lfx.schema.message import Message | |
| class FileComponent(BaseFileComponent): | |
| """Read File component with optional Docling processing (isolated in a subprocess).""" | |
| display_name = "Read File" | |
| description = "Loads content from files with optional advanced document processing and export using Docling." | |
| documentation: str = "https://docs.langflow.org/components-data#file" | |
| icon = "file-text" | |
| name = "File" | |
| # Docling-supported/compatible extensions; TEXT_FILE_TYPES are supported by the base loader. | |
| VALID_EXTENSIONS = [ | |
| "adoc", | |
| "asciidoc", | |
| "asc", | |
| "bmp", | |
| "csv", | |
| "dotx", | |
| "dotm", | |
| "docm", | |
| "docx", | |
| "htm", | |
| "html", | |
| "jpeg", | |
| "json", | |
| "md", | |
| "pdf", | |
| "png", | |
| "potx", | |
| "ppsx", | |
| "pptm", | |
| "potm", | |
| "ppsm", | |
| "pptx", | |
| "tiff", | |
| "txt", | |
| "xls", | |
| "xlsx", | |
| "xhtml", | |
| "xml", | |
| "webp", | |
| *TEXT_FILE_TYPES, | |
| ] | |
| # Fixed export settings used when markdown export is requested. | |
| EXPORT_FORMAT = "Markdown" | |
| IMAGE_MODE = "placeholder" | |
| _base_inputs = deepcopy(BaseFileComponent.get_base_inputs()) | |
| for input_item in _base_inputs: | |
| if isinstance(input_item, FileInput) and input_item.name == "path": | |
| input_item.real_time_refresh = True | |
| break | |
| inputs = [ | |
| *_base_inputs, | |
| BoolInput( | |
| name="advanced_mode", | |
| display_name="Advanced Parser", | |
| value=False, | |
| real_time_refresh=True, | |
| info=( | |
| "Enable advanced document processing and export with Docling for PDFs, images, and office documents. " | |
| "Available only for single file processing." | |
| ), | |
| show=False, | |
| ), | |
| DropdownInput( | |
| name="pipeline", | |
| display_name="Pipeline", | |
| info="Docling pipeline to use", | |
| options=["standard", "vlm"], | |
| value="standard", | |
| advanced=True, | |
| ), | |
| DropdownInput( | |
| name="ocr_engine", | |
| display_name="OCR Engine", | |
| info="OCR engine to use. Only available when pipeline is set to 'standard'.", | |
| options=["", "easyocr"], | |
| value="", | |
| show=False, | |
| advanced=True, | |
| ), | |
| StrInput( | |
| name="md_image_placeholder", | |
| display_name="Image placeholder", | |
| info="Specify the image placeholder for markdown exports.", | |
| value="<!-- image -->", | |
| advanced=True, | |
| show=False, | |
| ), | |
| StrInput( | |
| name="md_page_break_placeholder", | |
| display_name="Page break placeholder", | |
| info="Add this placeholder between pages in the markdown output.", | |
| value="", | |
| advanced=True, | |
| show=False, | |
| ), | |
| MessageTextInput( | |
| name="doc_key", | |
| display_name="Doc Key", | |
| info="The key to use for the DoclingDocument column.", | |
| value="doc", | |
| advanced=True, | |
| show=False, | |
| ), | |
| # Deprecated input retained for backward-compatibility. | |
| BoolInput( | |
| name="use_multithreading", | |
| display_name="[Deprecated] Use Multithreading", | |
| advanced=True, | |
| value=True, | |
| info="Set 'Processing Concurrency' greater than 1 to enable multithreading.", | |
| ), | |
| IntInput( | |
| name="concurrency_multithreading", | |
| display_name="Processing Concurrency", | |
| advanced=True, | |
| info="When multiple files are being processed, the number of files to process concurrently.", | |
| value=1, | |
| ), | |
| BoolInput( | |
| name="markdown", | |
| display_name="Markdown Export", | |
| info="Export processed documents to Markdown format. Only available when advanced mode is enabled.", | |
| value=False, | |
| show=False, | |
| ), | |
| ] | |
| outputs = [ | |
| Output(display_name="Raw Content", name="message", method="load_files_message", tool_mode=True), | |
| ] | |
| # ------------------------------ UI helpers -------------------------------------- | |
| def _path_value(self, template: dict) -> list[str]: | |
| """Return the list of currently selected file paths from the template.""" | |
| return template.get("path", {}).get("file_path", []) | |
| def update_build_config( | |
| self, | |
| build_config: dict[str, Any], | |
| field_value: Any, | |
| field_name: str | None = None, | |
| ) -> dict[str, Any]: | |
| """Show/hide Advanced Parser and related fields based on selection context.""" | |
| if field_name == "path": | |
| paths = self._path_value(build_config) | |
| file_path = paths[0] if paths else "" | |
| file_count = len(field_value) if field_value else 0 | |
| # Advanced mode only for single (non-tabular) file | |
| allow_advanced = file_count == 1 and not file_path.endswith((".csv", ".xlsx", ".parquet")) | |
| build_config["advanced_mode"]["show"] = allow_advanced | |
| if not allow_advanced: | |
| build_config["advanced_mode"]["value"] = False | |
| for f in ("pipeline", "ocr_engine", "doc_key", "md_image_placeholder", "md_page_break_placeholder"): | |
| if f in build_config: | |
| build_config[f]["show"] = False | |
| elif field_name == "advanced_mode": | |
| for f in ("pipeline", "ocr_engine", "doc_key", "md_image_placeholder", "md_page_break_placeholder"): | |
| if f in build_config: | |
| build_config[f]["show"] = bool(field_value) | |
| return build_config | |
| def update_outputs(self, frontend_node: dict[str, Any], field_name: str, field_value: Any) -> dict[str, Any]: # noqa: ARG002 | |
| """Dynamically show outputs based on file count/type and advanced mode.""" | |
| if field_name not in ["path", "advanced_mode"]: | |
| return frontend_node | |
| template = frontend_node.get("template", {}) | |
| paths = self._path_value(template) | |
| if not paths: | |
| return frontend_node | |
| frontend_node["outputs"] = [] | |
| if len(paths) == 1: | |
| file_path = paths[0] if field_name == "path" else frontend_node["template"]["path"]["file_path"][0] | |
| if file_path.endswith((".csv", ".xlsx", ".parquet")): | |
| frontend_node["outputs"].append( | |
| Output( | |
| display_name="Structured Content", | |
| name="dataframe", | |
| method="load_files_structured", | |
| tool_mode=True, | |
| ), | |
| ) | |
| elif file_path.endswith(".json"): | |
| frontend_node["outputs"].append( | |
| Output(display_name="Structured Content", name="json", method="load_files_json", tool_mode=True), | |
| ) | |
| advanced_mode = frontend_node.get("template", {}).get("advanced_mode", {}).get("value", False) | |
| if advanced_mode: | |
| frontend_node["outputs"].append( | |
| Output( | |
| display_name="Structured Output", | |
| name="advanced", | |
| method="load_files_advanced", | |
| tool_mode=True, | |
| ), | |
| ) | |
| frontend_node["outputs"].append( | |
| Output(display_name="Markdown", name="markdown", method="load_files_markdown", tool_mode=True), | |
| ) | |
| frontend_node["outputs"].append( | |
| Output(display_name="File Path", name="path", method="load_files_path", tool_mode=True), | |
| ) | |
| else: | |
| frontend_node["outputs"].append( | |
| Output(display_name="Raw Content", name="message", method="load_files_message", tool_mode=True), | |
| ) | |
| frontend_node["outputs"].append( | |
| Output(display_name="File Path", name="path", method="load_files_path", tool_mode=True), | |
| ) | |
| else: | |
| # Multiple files => DataFrame output; advanced parser disabled | |
| frontend_node["outputs"].append( | |
| Output( | |
| display_name="Files", | |
| name="dataframe", | |
| method="load_files", | |
| tool_mode=True, | |
| ) | |
| ) | |
| return frontend_node | |
| # ------------------------------ Core processing ---------------------------------- | |
| def _is_docling_compatible(self, file_path: str) -> bool: | |
| """Lightweight extension gate for Docling-compatible types.""" | |
| docling_exts = ( | |
| ".adoc", | |
| ".asciidoc", | |
| ".asc", | |
| ".bmp", | |
| ".csv", | |
| ".dotx", | |
| ".dotm", | |
| ".docm", | |
| ".docx", | |
| ".htm", | |
| ".html", | |
| ".jpeg", | |
| ".json", | |
| ".md", | |
| ".pdf", | |
| ".png", | |
| ".potx", | |
| ".ppsx", | |
| ".pptm", | |
| ".potm", | |
| ".ppsm", | |
| ".pptx", | |
| ".tiff", | |
| ".txt", | |
| ".xls", | |
| ".xlsx", | |
| ".xhtml", | |
| ".xml", | |
| ".webp", | |
| ) | |
| return file_path.lower().endswith(docling_exts) | |
| def _process_docling_in_subprocess(self, file_path: str) -> Data | None: | |
| """Run Docling in a separate OS process and map the result to a Data object. | |
| We avoid multiprocessing pickling by launching `python -c \"<script>\"` and | |
| passing JSON config via stdin. The child prints a JSON result to stdout. | |
| """ | |
| if not file_path: | |
| return None | |
| args: dict[str, Any] = { | |
| "file_path": file_path, | |
| "markdown": bool(self.markdown), | |
| "image_mode": str(self.IMAGE_MODE), | |
| "md_image_placeholder": str(self.md_image_placeholder), | |
| "md_page_break_placeholder": str(self.md_page_break_placeholder), | |
| "pipeline": str(self.pipeline), | |
| "ocr_engine": str(self.ocr_engine) if getattr(self, "ocr_engine", "") else None, | |
| } | |
| # The child is a tiny, self-contained script to keep memory/state isolated. | |
| child_script = textwrap.dedent( | |
| r""" | |
| import json, sys | |
| def try_imports(): | |
| # Strategy 1: latest layout | |
| try: | |
| from docling.datamodel.base_models import ConversionStatus, InputFormat # type: ignore | |
| from docling.document_converter import DocumentConverter # type: ignore | |
| from docling_core.types.doc import ImageRefMode # type: ignore | |
| return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, "latest" | |
| except Exception: | |
| pass | |
| # Strategy 2: alternative layout | |
| try: | |
| from docling.document_converter import DocumentConverter # type: ignore | |
| try: | |
| from docling_core.types import ConversionStatus, InputFormat # type: ignore | |
| except Exception: | |
| try: | |
| from docling.datamodel import ConversionStatus, InputFormat # type: ignore | |
| except Exception: | |
| class ConversionStatus: SUCCESS = "success" | |
| class InputFormat: | |
| PDF="pdf"; IMAGE="image" | |
| try: | |
| from docling_core.types.doc import ImageRefMode # type: ignore | |
| except Exception: | |
| class ImageRefMode: | |
| PLACEHOLDER="placeholder"; EMBEDDED="embedded" | |
| return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, "alternative" | |
| except Exception: | |
| pass | |
| # Strategy 3: basic converter only | |
| try: | |
| from docling.document_converter import DocumentConverter # type: ignore | |
| class ConversionStatus: SUCCESS = "success" | |
| class InputFormat: | |
| PDF="pdf"; IMAGE="image" | |
| class ImageRefMode: | |
| PLACEHOLDER="placeholder"; EMBEDDED="embedded" | |
| return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, "basic" | |
| except Exception as e: | |
| raise ImportError(f"Docling imports failed: {e}") from e | |
| def create_converter(strategy, input_format, DocumentConverter, pipeline, ocr_engine): | |
| if strategy == "latest" and pipeline == "standard": | |
| try: | |
| from docling.datamodel.pipeline_options import PdfPipelineOptions # type: ignore | |
| from docling.document_converter import PdfFormatOption # type: ignore | |
| pipe = PdfPipelineOptions() | |
| if ocr_engine: | |
| try: | |
| from docling.models.factories import get_ocr_factory # type: ignore | |
| pipe.do_ocr = True | |
| fac = get_ocr_factory(allow_external_plugins=False) | |
| pipe.ocr_options = fac.create_options(kind=ocr_engine) | |
| except Exception: | |
| pipe.do_ocr = False | |
| fmt = {} | |
| if hasattr(input_format, "PDF"): | |
| fmt[getattr(input_format, "PDF")] = PdfFormatOption(pipeline_options=pipe) | |
| if hasattr(input_format, "IMAGE"): | |
| fmt[getattr(input_format, "IMAGE")] = PdfFormatOption(pipeline_options=pipe) | |
| return DocumentConverter(format_options=fmt) | |
| except Exception: | |
| return DocumentConverter() | |
| return DocumentConverter() | |
| def export_markdown(document, ImageRefMode, image_mode, img_ph, pg_ph): | |
| try: | |
| mode = getattr(ImageRefMode, image_mode.upper(), image_mode) | |
| return document.export_to_markdown( | |
| image_mode=mode, | |
| image_placeholder=img_ph, | |
| page_break_placeholder=pg_ph, | |
| ) | |
| except Exception: | |
| try: | |
| return document.export_to_text() | |
| except Exception: | |
| return str(document) | |
| def to_rows(doc_dict): | |
| rows = [] | |
| for t in doc_dict.get("texts", []): | |
| prov = t.get("prov") or [] | |
| page_no = None | |
| if prov and isinstance(prov, list) and isinstance(prov[0], dict): | |
| page_no = prov[0].get("page_no") | |
| rows.append({ | |
| "page_no": page_no, | |
| "label": t.get("label"), | |
| "text": t.get("text"), | |
| "level": t.get("level"), | |
| }) | |
| return rows | |
| def main(): | |
| cfg = json.loads(sys.stdin.read()) | |
| file_path = cfg["file_path"] | |
| markdown = cfg["markdown"] | |
| image_mode = cfg["image_mode"] | |
| img_ph = cfg["md_image_placeholder"] | |
| pg_ph = cfg["md_page_break_placeholder"] | |
| pipeline = cfg["pipeline"] | |
| ocr_engine = cfg.get("ocr_engine") | |
| meta = {"file_path": file_path} | |
| try: | |
| ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, strategy = try_imports() | |
| converter = create_converter(strategy, InputFormat, DocumentConverter, pipeline, ocr_engine) | |
| try: | |
| res = converter.convert(file_path) | |
| except Exception as e: | |
| print(json.dumps({"ok": False, "error": f"Docling conversion error: {e}", "meta": meta})) | |
| return | |
| ok = False | |
| if hasattr(res, "status"): | |
| try: | |
| ok = (res.status == ConversionStatus.SUCCESS) or (str(res.status).lower() == "success") | |
| except Exception: | |
| ok = (str(res.status).lower() == "success") | |
| if not ok and hasattr(res, "document"): | |
| ok = getattr(res, "document", None) is not None | |
| if not ok: | |
| print(json.dumps({"ok": False, "error": "Docling conversion failed", "meta": meta})) | |
| return | |
| doc = getattr(res, "document", None) | |
| if doc is None: | |
| print(json.dumps({"ok": False, "error": "Docling produced no document", "meta": meta})) | |
| return | |
| if markdown: | |
| text = export_markdown(doc, ImageRefMode, image_mode, img_ph, pg_ph) | |
| print(json.dumps({"ok": True, "mode": "markdown", "text": text, "meta": meta})) | |
| return | |
| # structured | |
| try: | |
| doc_dict = doc.export_to_dict() | |
| except Exception as e: | |
| print(json.dumps({"ok": False, "error": f"Docling export_to_dict failed: {e}", "meta": meta})) | |
| return | |
| rows = to_rows(doc_dict) | |
| print(json.dumps({"ok": True, "mode": "structured", "doc": rows, "meta": meta})) | |
| except Exception as e: | |
| print( | |
| json.dumps({ | |
| "ok": False, | |
| "error": f"Docling processing error: {e}", | |
| "meta": {"file_path": file_path}, | |
| }) | |
| ) | |
| if __name__ == "__main__": | |
| main() | |
| """ | |
| ) | |
| # Validate file_path to avoid command injection or unsafe input | |
| if not isinstance(args["file_path"], str) or any(c in args["file_path"] for c in [";", "|", "&", "$", "`"]): | |
| return Data(data={"error": "Unsafe file path detected.", "file_path": args["file_path"]}) | |
| proc = subprocess.run( # noqa: S603 | |
| [sys.executable, "-u", "-c", child_script], | |
| input=json.dumps(args).encode("utf-8"), | |
| capture_output=True, | |
| check=False, | |
| ) | |
| if not proc.stdout: | |
| err_msg = proc.stderr.decode("utf-8", errors="replace") or "no output from child process" | |
| return Data(data={"error": f"Docling subprocess error: {err_msg}", "file_path": file_path}) | |
| try: | |
| result = json.loads(proc.stdout.decode("utf-8")) | |
| except Exception as e: # noqa: BLE001 | |
| err_msg = proc.stderr.decode("utf-8", errors="replace") | |
| return Data( | |
| data={"error": f"Invalid JSON from Docling subprocess: {e}. stderr={err_msg}", "file_path": file_path}, | |
| ) | |
| if not result.get("ok"): | |
| return Data(data={"error": result.get("error", "Unknown Docling error"), **result.get("meta", {})}) | |
| meta = result.get("meta", {}) | |
| if result.get("mode") == "markdown": | |
| exported_content = str(result.get("text", "")) | |
| return Data( | |
| text=exported_content, | |
| data={"exported_content": exported_content, "export_format": self.EXPORT_FORMAT, **meta}, | |
| ) | |
| rows = list(result.get("doc", [])) | |
| return Data(data={"doc": rows, "export_format": self.EXPORT_FORMAT, **meta}) | |
| def process_files( | |
| self, | |
| file_list: list[BaseFileComponent.BaseFile], | |
| ) -> list[BaseFileComponent.BaseFile]: | |
| """Process input files. | |
| - Single file + advanced_mode => Docling in a separate process. | |
| - Otherwise => standard parsing in current process (optionally threaded). | |
| """ | |
| if not file_list: | |
| msg = "No files to process." | |
| raise ValueError(msg) | |
| def process_file_standard(file_path: str, *, silent_errors: bool = False) -> Data | None: | |
| try: | |
| return parse_text_file_to_data(file_path, silent_errors=silent_errors) | |
| except FileNotFoundError as e: | |
| self.log(f"File not found: {file_path}. Error: {e}") | |
| if not silent_errors: | |
| raise | |
| return None | |
| except Exception as e: | |
| self.log(f"Unexpected error processing {file_path}: {e}") | |
| if not silent_errors: | |
| raise | |
| return None | |
| # Advanced path: only for a single Docling-compatible file | |
| if len(file_list) == 1: | |
| file_path = str(file_list[0].path) | |
| if self.advanced_mode and self._is_docling_compatible(file_path): | |
| advanced_data: Data | None = self._process_docling_in_subprocess(file_path) | |
| if advanced_data is None: | |
| return self.rollup_data( | |
| file_list, | |
| [Data(data={"error": "Docling returned no data", "file_path": file_path})], | |
| ) | |
| # --- UNNEST: expand each element in `doc` to its own Data row | |
| payload = getattr(advanced_data, "data", {}) or {} | |
| doc_rows = payload.get("doc") | |
| if isinstance(doc_rows, list): | |
| rows: list[Data | None] = [ | |
| Data( | |
| data={ | |
| "file_path": file_path, | |
| **(item if isinstance(item, dict) else {"value": item}), | |
| }, | |
| ) | |
| for item in doc_rows | |
| ] | |
| return self.rollup_data(file_list, rows) | |
| # If not structured, keep as-is (e.g., markdown export or error dict) | |
| return self.rollup_data(file_list, [advanced_data]) | |
| # Standard multi-file (or single non-advanced) path | |
| concurrency = 1 if not self.use_multithreading else max(1, self.concurrency_multithreading) | |
| file_paths = [str(f.path) for f in file_list] | |
| self.log(f"Starting parallel processing of {len(file_paths)} files with concurrency: {concurrency}.") | |
| my_data = parallel_load_data( | |
| file_paths, | |
| silent_errors=self.silent_errors, | |
| load_function=process_file_standard, | |
| max_concurrency=concurrency, | |
| ) | |
| return self.rollup_data(file_list, my_data) | |
| # ------------------------------ Output helpers ----------------------------------- | |
| def load_files_advanced(self) -> DataFrame: | |
| """Load files using advanced Docling processing and export to an advanced format.""" | |
| self.markdown = False | |
| return self.load_files() | |
| def load_files_markdown(self) -> Message: | |
| """Load files using advanced Docling processing and export to Markdown format.""" | |
| self.markdown = True | |
| result = self.load_files() | |
| return Message(text=str(result.text[0])) |
🤖 Prompt for AI Agents
In src/backend/base/langflow/initial_setup/starter_projects/Document Q&A.json
around line 1333, guard against advanced_data being None before calling
rollup_data: after advanced_data =
self._process_docling_in_subprocess(file_path) check if advanced_data is None
(or getattr(advanced_data, "data", None) is None) and if so create a fallback
Data object (e.g., Data(data={"error": "No data from Docling", "file_path":
file_path})) and pass that fallback into rollup_data instead of None; then
proceed with the existing logic to unnest doc rows or return rollup_data.
| "title_case": false, | ||
| "type": "code", | ||
| "value": "\"\"\"Enhanced file component with clearer structure and Docling isolation.\n\nNotes:\n-----\n- Functionality is preserved with minimal behavioral changes.\n- ALL Docling parsing/export runs in a separate OS process to prevent memory\n growth and native library state from impacting the main Langflow process.\n- Standard text/structured parsing continues to use existing BaseFileComponent\n utilities (and optional threading via `parallel_load_data`).\n\"\"\"\n\nfrom __future__ import annotations\n\nimport json\nimport subprocess\nimport sys\nimport textwrap\nfrom copy import deepcopy\nfrom typing import Any\n\nfrom lfx.base.data.base_file import BaseFileComponent\nfrom lfx.base.data.utils import TEXT_FILE_TYPES, parallel_load_data, parse_text_file_to_data\nfrom lfx.inputs.inputs import DropdownInput, MessageTextInput, StrInput\nfrom lfx.io import BoolInput, FileInput, IntInput, Output\nfrom lfx.schema import DataFrame # noqa: TC001\nfrom lfx.schema.data import Data\nfrom lfx.schema.message import Message\n\n\nclass FileComponent(BaseFileComponent):\n \"\"\"File component with optional Docling processing (isolated in a subprocess).\"\"\"\n\n display_name = \"File\"\n description = \"Loads content from files with optional advanced document processing and export using Docling.\"\n documentation: str = \"https://docs.langflow.org/components-data#file\"\n icon = \"file-text\"\n name = \"File\"\n\n # Docling-supported/compatible extensions; TEXT_FILE_TYPES are supported by the base loader.\n VALID_EXTENSIONS = [\n \"adoc\",\n \"asciidoc\",\n \"asc\",\n \"bmp\",\n \"csv\",\n \"dotx\",\n \"dotm\",\n \"docm\",\n \"docx\",\n \"htm\",\n \"html\",\n \"jpeg\",\n \"json\",\n \"md\",\n \"pdf\",\n \"png\",\n \"potx\",\n \"ppsx\",\n \"pptm\",\n \"potm\",\n \"ppsm\",\n \"pptx\",\n \"tiff\",\n \"txt\",\n \"xls\",\n \"xlsx\",\n \"xhtml\",\n \"xml\",\n \"webp\",\n *TEXT_FILE_TYPES,\n ]\n\n # Fixed export settings used when markdown export is requested.\n EXPORT_FORMAT = \"Markdown\"\n IMAGE_MODE = \"placeholder\"\n\n _base_inputs = deepcopy(BaseFileComponent.get_base_inputs())\n\n for input_item in _base_inputs:\n if isinstance(input_item, FileInput) and input_item.name == \"path\":\n input_item.real_time_refresh = True\n break\n\n inputs = [\n *_base_inputs,\n BoolInput(\n name=\"advanced_mode\",\n display_name=\"Advanced Parser\",\n value=False,\n real_time_refresh=True,\n info=(\n \"Enable advanced document processing and export with Docling for PDFs, images, and office documents. \"\n \"Available only for single file processing.\"\n ),\n show=False,\n ),\n DropdownInput(\n name=\"pipeline\",\n display_name=\"Pipeline\",\n info=\"Docling pipeline to use\",\n options=[\"standard\", \"vlm\"],\n value=\"standard\",\n advanced=True,\n ),\n DropdownInput(\n name=\"ocr_engine\",\n display_name=\"OCR Engine\",\n info=\"OCR engine to use. Only available when pipeline is set to 'standard'.\",\n options=[\"\", \"easyocr\"],\n value=\"\",\n show=False,\n advanced=True,\n ),\n StrInput(\n name=\"md_image_placeholder\",\n display_name=\"Image placeholder\",\n info=\"Specify the image placeholder for markdown exports.\",\n value=\"<!-- image -->\",\n advanced=True,\n show=False,\n ),\n StrInput(\n name=\"md_page_break_placeholder\",\n display_name=\"Page break placeholder\",\n info=\"Add this placeholder between pages in the markdown output.\",\n value=\"\",\n advanced=True,\n show=False,\n ),\n MessageTextInput(\n name=\"doc_key\",\n display_name=\"Doc Key\",\n info=\"The key to use for the DoclingDocument column.\",\n value=\"doc\",\n advanced=True,\n show=False,\n ),\n # Deprecated input retained for backward-compatibility.\n BoolInput(\n name=\"use_multithreading\",\n display_name=\"[Deprecated] Use Multithreading\",\n advanced=True,\n value=True,\n info=\"Set 'Processing Concurrency' greater than 1 to enable multithreading.\",\n ),\n IntInput(\n name=\"concurrency_multithreading\",\n display_name=\"Processing Concurrency\",\n advanced=True,\n info=\"When multiple files are being processed, the number of files to process concurrently.\",\n value=1,\n ),\n BoolInput(\n name=\"markdown\",\n display_name=\"Markdown Export\",\n info=\"Export processed documents to Markdown format. Only available when advanced mode is enabled.\",\n value=False,\n show=False,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\"),\n ]\n\n # ------------------------------ UI helpers --------------------------------------\n\n def _path_value(self, template: dict) -> list[str]:\n \"\"\"Return the list of currently selected file paths from the template.\"\"\"\n return template.get(\"path\", {}).get(\"file_path\", [])\n\n def update_build_config(\n self,\n build_config: dict[str, Any],\n field_value: Any,\n field_name: str | None = None,\n ) -> dict[str, Any]:\n \"\"\"Show/hide Advanced Parser and related fields based on selection context.\"\"\"\n if field_name == \"path\":\n paths = self._path_value(build_config)\n file_path = paths[0] if paths else \"\"\n file_count = len(field_value) if field_value else 0\n\n # Advanced mode only for single (non-tabular) file\n allow_advanced = file_count == 1 and not file_path.endswith((\".csv\", \".xlsx\", \".parquet\"))\n build_config[\"advanced_mode\"][\"show\"] = allow_advanced\n if not allow_advanced:\n build_config[\"advanced_mode\"][\"value\"] = False\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = False\n\n elif field_name == \"advanced_mode\":\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = bool(field_value)\n\n return build_config\n\n def update_outputs(self, frontend_node: dict[str, Any], field_name: str, field_value: Any) -> dict[str, Any]: # noqa: ARG002\n \"\"\"Dynamically show outputs based on file count/type and advanced mode.\"\"\"\n if field_name not in [\"path\", \"advanced_mode\"]:\n return frontend_node\n\n template = frontend_node.get(\"template\", {})\n paths = self._path_value(template)\n if not paths:\n return frontend_node\n\n frontend_node[\"outputs\"] = []\n if len(paths) == 1:\n file_path = paths[0] if field_name == \"path\" else frontend_node[\"template\"][\"path\"][\"file_path\"][0]\n if file_path.endswith((\".csv\", \".xlsx\", \".parquet\")):\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Structured Content\", name=\"dataframe\", method=\"load_files_structured\"),\n )\n elif file_path.endswith(\".json\"):\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Structured Content\", name=\"json\", method=\"load_files_json\"),\n )\n\n advanced_mode = frontend_node.get(\"template\", {}).get(\"advanced_mode\", {}).get(\"value\", False)\n if advanced_mode:\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Structured Output\", name=\"advanced\", method=\"load_files_advanced\"),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Markdown\", name=\"markdown\", method=\"load_files_markdown\"),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\"),\n )\n else:\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\"),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\"),\n )\n else:\n # Multiple files => DataFrame output; advanced parser disabled\n frontend_node[\"outputs\"].append(Output(display_name=\"Files\", name=\"dataframe\", method=\"load_files\"))\n\n return frontend_node\n\n # ------------------------------ Core processing ----------------------------------\n\n def _is_docling_compatible(self, file_path: str) -> bool:\n \"\"\"Lightweight extension gate for Docling-compatible types.\"\"\"\n docling_exts = (\n \".adoc\",\n \".asciidoc\",\n \".asc\",\n \".bmp\",\n \".csv\",\n \".dotx\",\n \".dotm\",\n \".docm\",\n \".docx\",\n \".htm\",\n \".html\",\n \".jpeg\",\n \".json\",\n \".md\",\n \".pdf\",\n \".png\",\n \".potx\",\n \".ppsx\",\n \".pptm\",\n \".potm\",\n \".ppsm\",\n \".pptx\",\n \".tiff\",\n \".txt\",\n \".xls\",\n \".xlsx\",\n \".xhtml\",\n \".xml\",\n \".webp\",\n )\n return file_path.lower().endswith(docling_exts)\n\n def _process_docling_in_subprocess(self, file_path: str) -> Data | None:\n \"\"\"Run Docling in a separate OS process and map the result to a Data object.\n\n We avoid multiprocessing pickling by launching `python -c \"<script>\"` and\n passing JSON config via stdin. The child prints a JSON result to stdout.\n \"\"\"\n if not file_path:\n return None\n\n args: dict[str, Any] = {\n \"file_path\": file_path,\n \"markdown\": bool(self.markdown),\n \"image_mode\": str(self.IMAGE_MODE),\n \"md_image_placeholder\": str(self.md_image_placeholder),\n \"md_page_break_placeholder\": str(self.md_page_break_placeholder),\n \"pipeline\": str(self.pipeline),\n \"ocr_engine\": str(self.ocr_engine) if getattr(self, \"ocr_engine\", \"\") else None,\n }\n\n # The child is a tiny, self-contained script to keep memory/state isolated.\n child_script = textwrap.dedent(\n r\"\"\"\n import json, sys\n\n def try_imports():\n # Strategy 1: latest layout\n try:\n from docling.datamodel.base_models import ConversionStatus, InputFormat # type: ignore\n from docling.document_converter import DocumentConverter # type: ignore\n from docling_core.types.doc import ImageRefMode # type: ignore\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"latest\"\n except Exception:\n pass\n # Strategy 2: alternative layout\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n try:\n from docling_core.types import ConversionStatus, InputFormat # type: ignore\n except Exception:\n try:\n from docling.datamodel import ConversionStatus, InputFormat # type: ignore\n except Exception:\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n try:\n from docling_core.types.doc import ImageRefMode # type: ignore\n except Exception:\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"alternative\"\n except Exception:\n pass\n # Strategy 3: basic converter only\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"basic\"\n except Exception as e:\n raise ImportError(f\"Docling imports failed: {e}\") from e\n\n def create_converter(strategy, input_format, DocumentConverter, pipeline, ocr_engine):\n if strategy == \"latest\" and pipeline == \"standard\":\n try:\n from docling.datamodel.pipeline_options import PdfPipelineOptions # type: ignore\n from docling.document_converter import PdfFormatOption # type: ignore\n pipe = PdfPipelineOptions()\n if ocr_engine:\n try:\n from docling.models.factories import get_ocr_factory # type: ignore\n pipe.do_ocr = True\n fac = get_ocr_factory(allow_external_plugins=False)\n pipe.ocr_options = fac.create_options(kind=ocr_engine)\n except Exception:\n pipe.do_ocr = False\n fmt = {}\n if hasattr(input_format, \"PDF\"):\n fmt[getattr(input_format, \"PDF\")] = PdfFormatOption(pipeline_options=pipe)\n if hasattr(input_format, \"IMAGE\"):\n fmt[getattr(input_format, \"IMAGE\")] = PdfFormatOption(pipeline_options=pipe)\n return DocumentConverter(format_options=fmt)\n except Exception:\n return DocumentConverter()\n return DocumentConverter()\n\n def export_markdown(document, ImageRefMode, image_mode, img_ph, pg_ph):\n try:\n mode = getattr(ImageRefMode, image_mode.upper(), image_mode)\n return document.export_to_markdown(\n image_mode=mode,\n image_placeholder=img_ph,\n page_break_placeholder=pg_ph,\n )\n except Exception:\n try:\n return document.export_to_text()\n except Exception:\n return str(document)\n\n def to_rows(doc_dict):\n rows = []\n for t in doc_dict.get(\"texts\", []):\n prov = t.get(\"prov\") or []\n page_no = None\n if prov and isinstance(prov, list) and isinstance(prov[0], dict):\n page_no = prov[0].get(\"page_no\")\n rows.append({\n \"page_no\": page_no,\n \"label\": t.get(\"label\"),\n \"text\": t.get(\"text\"),\n \"level\": t.get(\"level\"),\n })\n return rows\n\n def main():\n cfg = json.loads(sys.stdin.read())\n file_path = cfg[\"file_path\"]\n markdown = cfg[\"markdown\"]\n image_mode = cfg[\"image_mode\"]\n img_ph = cfg[\"md_image_placeholder\"]\n pg_ph = cfg[\"md_page_break_placeholder\"]\n pipeline = cfg[\"pipeline\"]\n ocr_engine = cfg.get(\"ocr_engine\")\n meta = {\"file_path\": file_path}\n\n try:\n ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, strategy = try_imports()\n converter = create_converter(strategy, InputFormat, DocumentConverter, pipeline, ocr_engine)\n try:\n res = converter.convert(file_path)\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling conversion error: {e}\", \"meta\": meta}))\n return\n\n ok = False\n if hasattr(res, \"status\"):\n try:\n ok = (res.status == ConversionStatus.SUCCESS) or (str(res.status).lower() == \"success\")\n except Exception:\n ok = (str(res.status).lower() == \"success\")\n if not ok and hasattr(res, \"document\"):\n ok = getattr(res, \"document\", None) is not None\n if not ok:\n print(json.dumps({\"ok\": False, \"error\": \"Docling conversion failed\", \"meta\": meta}))\n return\n\n doc = getattr(res, \"document\", None)\n if doc is None:\n print(json.dumps({\"ok\": False, \"error\": \"Docling produced no document\", \"meta\": meta}))\n return\n\n if markdown:\n text = export_markdown(doc, ImageRefMode, image_mode, img_ph, pg_ph)\n print(json.dumps({\"ok\": True, \"mode\": \"markdown\", \"text\": text, \"meta\": meta}))\n return\n\n # structured\n try:\n doc_dict = doc.export_to_dict()\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling export_to_dict failed: {e}\", \"meta\": meta}))\n return\n\n rows = to_rows(doc_dict)\n print(json.dumps({\"ok\": True, \"mode\": \"structured\", \"doc\": rows, \"meta\": meta}))\n except Exception as e:\n print(\n json.dumps({\n \"ok\": False,\n \"error\": f\"Docling processing error: {e}\",\n \"meta\": {\"file_path\": file_path},\n })\n )\n\n if __name__ == \"__main__\":\n main()\n \"\"\"\n )\n\n # Validate file_path to avoid command injection or unsafe input\n if not isinstance(args[\"file_path\"], str) or any(c in args[\"file_path\"] for c in [\";\", \"|\", \"&\", \"$\", \"`\"]):\n return Data(data={\"error\": \"Unsafe file path detected.\", \"file_path\": args[\"file_path\"]})\n\n proc = subprocess.run( # noqa: S603\n [sys.executable, \"-u\", \"-c\", child_script],\n input=json.dumps(args).encode(\"utf-8\"),\n capture_output=True,\n check=False,\n )\n\n if not proc.stdout:\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\") or \"no output from child process\"\n return Data(data={\"error\": f\"Docling subprocess error: {err_msg}\", \"file_path\": file_path})\n\n try:\n result = json.loads(proc.stdout.decode(\"utf-8\"))\n except Exception as e: # noqa: BLE001\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\")\n return Data(\n data={\"error\": f\"Invalid JSON from Docling subprocess: {e}. stderr={err_msg}\", \"file_path\": file_path},\n )\n\n if not result.get(\"ok\"):\n return Data(data={\"error\": result.get(\"error\", \"Unknown Docling error\"), **result.get(\"meta\", {})})\n\n meta = result.get(\"meta\", {})\n if result.get(\"mode\") == \"markdown\":\n exported_content = str(result.get(\"text\", \"\"))\n return Data(\n text=exported_content,\n data={\"exported_content\": exported_content, \"export_format\": self.EXPORT_FORMAT, **meta},\n )\n\n rows = list(result.get(\"doc\", []))\n return Data(data={\"doc\": rows, \"export_format\": self.EXPORT_FORMAT, **meta})\n\n def process_files(\n self,\n file_list: list[BaseFileComponent.BaseFile],\n ) -> list[BaseFileComponent.BaseFile]:\n \"\"\"Process input files.\n\n - Single file + advanced_mode => Docling in a separate process.\n - Otherwise => standard parsing in current process (optionally threaded).\n \"\"\"\n if not file_list:\n msg = \"No files to process.\"\n raise ValueError(msg)\n\n def process_file_standard(file_path: str, *, silent_errors: bool = False) -> Data | None:\n try:\n return parse_text_file_to_data(file_path, silent_errors=silent_errors)\n except FileNotFoundError as e:\n self.log(f\"File not found: {file_path}. Error: {e}\")\n if not silent_errors:\n raise\n return None\n except Exception as e:\n self.log(f\"Unexpected error processing {file_path}: {e}\")\n if not silent_errors:\n raise\n return None\n\n # Advanced path: only for a single Docling-compatible file\n if len(file_list) == 1:\n file_path = str(file_list[0].path)\n if self.advanced_mode and self._is_docling_compatible(file_path):\n advanced_data: Data | None = self._process_docling_in_subprocess(file_path)\n\n # --- UNNEST: expand each element in `doc` to its own Data row\n payload = getattr(advanced_data, \"data\", {}) or {}\n doc_rows = payload.get(\"doc\")\n if isinstance(doc_rows, list):\n rows: list[Data | None] = [\n Data(\n data={\n \"file_path\": file_path,\n **(item if isinstance(item, dict) else {\"value\": item}),\n },\n )\n for item in doc_rows\n ]\n return self.rollup_data(file_list, rows)\n\n # If not structured, keep as-is (e.g., markdown export or error dict)\n return self.rollup_data(file_list, [advanced_data])\n\n # Standard multi-file (or single non-advanced) path\n concurrency = 1 if not self.use_multithreading else max(1, self.concurrency_multithreading)\n file_paths = [str(f.path) for f in file_list]\n self.log(f\"Starting parallel processing of {len(file_paths)} files with concurrency: {concurrency}.\")\n my_data = parallel_load_data(\n file_paths,\n silent_errors=self.silent_errors,\n load_function=process_file_standard,\n max_concurrency=concurrency,\n )\n return self.rollup_data(file_list, my_data)\n\n # ------------------------------ Output helpers -----------------------------------\n\n def load_files_advanced(self) -> DataFrame:\n \"\"\"Load files using advanced Docling processing and export to an advanced format.\"\"\"\n self.markdown = False\n return self.load_files()\n\n def load_files_markdown(self) -> Message:\n \"\"\"Load files using advanced Docling processing and export to Markdown format.\"\"\"\n self.markdown = True\n result = self.load_files()\n return Message(text=str(result.text[0]))\n" | ||
| "value": "\"\"\"Enhanced file component with clearer structure and Docling isolation.\n\nNotes:\n-----\n- Functionality is preserved with minimal behavioral changes.\n- ALL Docling parsing/export runs in a separate OS process to prevent memory\n growth and native library state from impacting the main Langflow process.\n- Standard text/structured parsing continues to use existing BaseFileComponent\n utilities (and optional threading via `parallel_load_data`).\n\"\"\"\n\nfrom __future__ import annotations\n\nimport json\nimport subprocess\nimport sys\nimport textwrap\nfrom copy import deepcopy\nfrom typing import Any\n\nfrom lfx.base.data.base_file import BaseFileComponent\nfrom lfx.base.data.utils import TEXT_FILE_TYPES, parallel_load_data, parse_text_file_to_data\nfrom lfx.inputs.inputs import DropdownInput, MessageTextInput, StrInput\nfrom lfx.io import BoolInput, FileInput, IntInput, Output\nfrom lfx.schema import DataFrame # noqa: TC001\nfrom lfx.schema.data import Data\nfrom lfx.schema.message import Message\n\n\nclass FileComponent(BaseFileComponent):\n \"\"\"Read File component with optional Docling processing (isolated in a subprocess).\"\"\"\n\n display_name = \"Read File\"\n description = \"Loads content from files with optional advanced document processing and export using Docling.\"\n documentation: str = \"https://docs.langflow.org/components-data#file\"\n icon = \"file-text\"\n name = \"File\"\n\n # Docling-supported/compatible extensions; TEXT_FILE_TYPES are supported by the base loader.\n VALID_EXTENSIONS = [\n \"adoc\",\n \"asciidoc\",\n \"asc\",\n \"bmp\",\n \"csv\",\n \"dotx\",\n \"dotm\",\n \"docm\",\n \"docx\",\n \"htm\",\n \"html\",\n \"jpeg\",\n \"json\",\n \"md\",\n \"pdf\",\n \"png\",\n \"potx\",\n \"ppsx\",\n \"pptm\",\n \"potm\",\n \"ppsm\",\n \"pptx\",\n \"tiff\",\n \"txt\",\n \"xls\",\n \"xlsx\",\n \"xhtml\",\n \"xml\",\n \"webp\",\n *TEXT_FILE_TYPES,\n ]\n\n # Fixed export settings used when markdown export is requested.\n EXPORT_FORMAT = \"Markdown\"\n IMAGE_MODE = \"placeholder\"\n\n _base_inputs = deepcopy(BaseFileComponent.get_base_inputs())\n\n for input_item in _base_inputs:\n if isinstance(input_item, FileInput) and input_item.name == \"path\":\n input_item.real_time_refresh = True\n break\n\n inputs = [\n *_base_inputs,\n BoolInput(\n name=\"advanced_mode\",\n display_name=\"Advanced Parser\",\n value=False,\n real_time_refresh=True,\n info=(\n \"Enable advanced document processing and export with Docling for PDFs, images, and office documents. \"\n \"Available only for single file processing.\"\n ),\n show=False,\n ),\n DropdownInput(\n name=\"pipeline\",\n display_name=\"Pipeline\",\n info=\"Docling pipeline to use\",\n options=[\"standard\", \"vlm\"],\n value=\"standard\",\n advanced=True,\n ),\n DropdownInput(\n name=\"ocr_engine\",\n display_name=\"OCR Engine\",\n info=\"OCR engine to use. Only available when pipeline is set to 'standard'.\",\n options=[\"\", \"easyocr\"],\n value=\"\",\n show=False,\n advanced=True,\n ),\n StrInput(\n name=\"md_image_placeholder\",\n display_name=\"Image placeholder\",\n info=\"Specify the image placeholder for markdown exports.\",\n value=\"<!-- image -->\",\n advanced=True,\n show=False,\n ),\n StrInput(\n name=\"md_page_break_placeholder\",\n display_name=\"Page break placeholder\",\n info=\"Add this placeholder between pages in the markdown output.\",\n value=\"\",\n advanced=True,\n show=False,\n ),\n MessageTextInput(\n name=\"doc_key\",\n display_name=\"Doc Key\",\n info=\"The key to use for the DoclingDocument column.\",\n value=\"doc\",\n advanced=True,\n show=False,\n ),\n # Deprecated input retained for backward-compatibility.\n BoolInput(\n name=\"use_multithreading\",\n display_name=\"[Deprecated] Use Multithreading\",\n advanced=True,\n value=True,\n info=\"Set 'Processing Concurrency' greater than 1 to enable multithreading.\",\n ),\n IntInput(\n name=\"concurrency_multithreading\",\n display_name=\"Processing Concurrency\",\n advanced=True,\n info=\"When multiple files are being processed, the number of files to process concurrently.\",\n value=1,\n ),\n BoolInput(\n name=\"markdown\",\n display_name=\"Markdown Export\",\n info=\"Export processed documents to Markdown format. Only available when advanced mode is enabled.\",\n value=False,\n show=False,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\", tool_mode=True),\n ]\n\n # ------------------------------ UI helpers --------------------------------------\n\n def _path_value(self, template: dict) -> list[str]:\n \"\"\"Return the list of currently selected file paths from the template.\"\"\"\n return template.get(\"path\", {}).get(\"file_path\", [])\n\n def update_build_config(\n self,\n build_config: dict[str, Any],\n field_value: Any,\n field_name: str | None = None,\n ) -> dict[str, Any]:\n \"\"\"Show/hide Advanced Parser and related fields based on selection context.\"\"\"\n if field_name == \"path\":\n paths = self._path_value(build_config)\n file_path = paths[0] if paths else \"\"\n file_count = len(field_value) if field_value else 0\n\n # Advanced mode only for single (non-tabular) file\n allow_advanced = file_count == 1 and not file_path.endswith((\".csv\", \".xlsx\", \".parquet\"))\n build_config[\"advanced_mode\"][\"show\"] = allow_advanced\n if not allow_advanced:\n build_config[\"advanced_mode\"][\"value\"] = False\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = False\n\n elif field_name == \"advanced_mode\":\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = bool(field_value)\n\n return build_config\n\n def update_outputs(self, frontend_node: dict[str, Any], field_name: str, field_value: Any) -> dict[str, Any]: # noqa: ARG002\n \"\"\"Dynamically show outputs based on file count/type and advanced mode.\"\"\"\n if field_name not in [\"path\", \"advanced_mode\"]:\n return frontend_node\n\n template = frontend_node.get(\"template\", {})\n paths = self._path_value(template)\n if not paths:\n return frontend_node\n\n frontend_node[\"outputs\"] = []\n if len(paths) == 1:\n file_path = paths[0] if field_name == \"path\" else frontend_node[\"template\"][\"path\"][\"file_path\"][0]\n if file_path.endswith((\".csv\", \".xlsx\", \".parquet\")):\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Structured Content\",\n name=\"dataframe\",\n method=\"load_files_structured\",\n tool_mode=True,\n ),\n )\n elif file_path.endswith(\".json\"):\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Structured Content\", name=\"json\", method=\"load_files_json\", tool_mode=True),\n )\n\n advanced_mode = frontend_node.get(\"template\", {}).get(\"advanced_mode\", {}).get(\"value\", False)\n if advanced_mode:\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Structured Output\",\n name=\"advanced\",\n method=\"load_files_advanced\",\n tool_mode=True,\n ),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Markdown\", name=\"markdown\", method=\"load_files_markdown\", tool_mode=True),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\", tool_mode=True),\n )\n else:\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\", tool_mode=True),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\", tool_mode=True),\n )\n else:\n # Multiple files => DataFrame output; advanced parser disabled\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Files\",\n name=\"dataframe\",\n method=\"load_files\",\n tool_mode=True,\n )\n )\n\n return frontend_node\n\n # ------------------------------ Core processing ----------------------------------\n\n def _is_docling_compatible(self, file_path: str) -> bool:\n \"\"\"Lightweight extension gate for Docling-compatible types.\"\"\"\n docling_exts = (\n \".adoc\",\n \".asciidoc\",\n \".asc\",\n \".bmp\",\n \".csv\",\n \".dotx\",\n \".dotm\",\n \".docm\",\n \".docx\",\n \".htm\",\n \".html\",\n \".jpeg\",\n \".json\",\n \".md\",\n \".pdf\",\n \".png\",\n \".potx\",\n \".ppsx\",\n \".pptm\",\n \".potm\",\n \".ppsm\",\n \".pptx\",\n \".tiff\",\n \".txt\",\n \".xls\",\n \".xlsx\",\n \".xhtml\",\n \".xml\",\n \".webp\",\n )\n return file_path.lower().endswith(docling_exts)\n\n def _process_docling_in_subprocess(self, file_path: str) -> Data | None:\n \"\"\"Run Docling in a separate OS process and map the result to a Data object.\n\n We avoid multiprocessing pickling by launching `python -c \"<script>\"` and\n passing JSON config via stdin. The child prints a JSON result to stdout.\n \"\"\"\n if not file_path:\n return None\n\n args: dict[str, Any] = {\n \"file_path\": file_path,\n \"markdown\": bool(self.markdown),\n \"image_mode\": str(self.IMAGE_MODE),\n \"md_image_placeholder\": str(self.md_image_placeholder),\n \"md_page_break_placeholder\": str(self.md_page_break_placeholder),\n \"pipeline\": str(self.pipeline),\n \"ocr_engine\": str(self.ocr_engine) if getattr(self, \"ocr_engine\", \"\") else None,\n }\n\n # The child is a tiny, self-contained script to keep memory/state isolated.\n child_script = textwrap.dedent(\n r\"\"\"\n import json, sys\n\n def try_imports():\n # Strategy 1: latest layout\n try:\n from docling.datamodel.base_models import ConversionStatus, InputFormat # type: ignore\n from docling.document_converter import DocumentConverter # type: ignore\n from docling_core.types.doc import ImageRefMode # type: ignore\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"latest\"\n except Exception:\n pass\n # Strategy 2: alternative layout\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n try:\n from docling_core.types import ConversionStatus, InputFormat # type: ignore\n except Exception:\n try:\n from docling.datamodel import ConversionStatus, InputFormat # type: ignore\n except Exception:\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n try:\n from docling_core.types.doc import ImageRefMode # type: ignore\n except Exception:\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"alternative\"\n except Exception:\n pass\n # Strategy 3: basic converter only\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"basic\"\n except Exception as e:\n raise ImportError(f\"Docling imports failed: {e}\") from e\n\n def create_converter(strategy, input_format, DocumentConverter, pipeline, ocr_engine):\n if strategy == \"latest\" and pipeline == \"standard\":\n try:\n from docling.datamodel.pipeline_options import PdfPipelineOptions # type: ignore\n from docling.document_converter import PdfFormatOption # type: ignore\n pipe = PdfPipelineOptions()\n if ocr_engine:\n try:\n from docling.models.factories import get_ocr_factory # type: ignore\n pipe.do_ocr = True\n fac = get_ocr_factory(allow_external_plugins=False)\n pipe.ocr_options = fac.create_options(kind=ocr_engine)\n except Exception:\n pipe.do_ocr = False\n fmt = {}\n if hasattr(input_format, \"PDF\"):\n fmt[getattr(input_format, \"PDF\")] = PdfFormatOption(pipeline_options=pipe)\n if hasattr(input_format, \"IMAGE\"):\n fmt[getattr(input_format, \"IMAGE\")] = PdfFormatOption(pipeline_options=pipe)\n return DocumentConverter(format_options=fmt)\n except Exception:\n return DocumentConverter()\n return DocumentConverter()\n\n def export_markdown(document, ImageRefMode, image_mode, img_ph, pg_ph):\n try:\n mode = getattr(ImageRefMode, image_mode.upper(), image_mode)\n return document.export_to_markdown(\n image_mode=mode,\n image_placeholder=img_ph,\n page_break_placeholder=pg_ph,\n )\n except Exception:\n try:\n return document.export_to_text()\n except Exception:\n return str(document)\n\n def to_rows(doc_dict):\n rows = []\n for t in doc_dict.get(\"texts\", []):\n prov = t.get(\"prov\") or []\n page_no = None\n if prov and isinstance(prov, list) and isinstance(prov[0], dict):\n page_no = prov[0].get(\"page_no\")\n rows.append({\n \"page_no\": page_no,\n \"label\": t.get(\"label\"),\n \"text\": t.get(\"text\"),\n \"level\": t.get(\"level\"),\n })\n return rows\n\n def main():\n cfg = json.loads(sys.stdin.read())\n file_path = cfg[\"file_path\"]\n markdown = cfg[\"markdown\"]\n image_mode = cfg[\"image_mode\"]\n img_ph = cfg[\"md_image_placeholder\"]\n pg_ph = cfg[\"md_page_break_placeholder\"]\n pipeline = cfg[\"pipeline\"]\n ocr_engine = cfg.get(\"ocr_engine\")\n meta = {\"file_path\": file_path}\n\n try:\n ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, strategy = try_imports()\n converter = create_converter(strategy, InputFormat, DocumentConverter, pipeline, ocr_engine)\n try:\n res = converter.convert(file_path)\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling conversion error: {e}\", \"meta\": meta}))\n return\n\n ok = False\n if hasattr(res, \"status\"):\n try:\n ok = (res.status == ConversionStatus.SUCCESS) or (str(res.status).lower() == \"success\")\n except Exception:\n ok = (str(res.status).lower() == \"success\")\n if not ok and hasattr(res, \"document\"):\n ok = getattr(res, \"document\", None) is not None\n if not ok:\n print(json.dumps({\"ok\": False, \"error\": \"Docling conversion failed\", \"meta\": meta}))\n return\n\n doc = getattr(res, \"document\", None)\n if doc is None:\n print(json.dumps({\"ok\": False, \"error\": \"Docling produced no document\", \"meta\": meta}))\n return\n\n if markdown:\n text = export_markdown(doc, ImageRefMode, image_mode, img_ph, pg_ph)\n print(json.dumps({\"ok\": True, \"mode\": \"markdown\", \"text\": text, \"meta\": meta}))\n return\n\n # structured\n try:\n doc_dict = doc.export_to_dict()\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling export_to_dict failed: {e}\", \"meta\": meta}))\n return\n\n rows = to_rows(doc_dict)\n print(json.dumps({\"ok\": True, \"mode\": \"structured\", \"doc\": rows, \"meta\": meta}))\n except Exception as e:\n print(\n json.dumps({\n \"ok\": False,\n \"error\": f\"Docling processing error: {e}\",\n \"meta\": {\"file_path\": file_path},\n })\n )\n\n if __name__ == \"__main__\":\n main()\n \"\"\"\n )\n\n # Validate file_path to avoid command injection or unsafe input\n if not isinstance(args[\"file_path\"], str) or any(c in args[\"file_path\"] for c in [\";\", \"|\", \"&\", \"$\", \"`\"]):\n return Data(data={\"error\": \"Unsafe file path detected.\", \"file_path\": args[\"file_path\"]})\n\n proc = subprocess.run( # noqa: S603\n [sys.executable, \"-u\", \"-c\", child_script],\n input=json.dumps(args).encode(\"utf-8\"),\n capture_output=True,\n check=False,\n )\n\n if not proc.stdout:\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\") or \"no output from child process\"\n return Data(data={\"error\": f\"Docling subprocess error: {err_msg}\", \"file_path\": file_path})\n\n try:\n result = json.loads(proc.stdout.decode(\"utf-8\"))\n except Exception as e: # noqa: BLE001\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\")\n return Data(\n data={\"error\": f\"Invalid JSON from Docling subprocess: {e}. stderr={err_msg}\", \"file_path\": file_path},\n )\n\n if not result.get(\"ok\"):\n return Data(data={\"error\": result.get(\"error\", \"Unknown Docling error\"), **result.get(\"meta\", {})})\n\n meta = result.get(\"meta\", {})\n if result.get(\"mode\") == \"markdown\":\n exported_content = str(result.get(\"text\", \"\"))\n return Data(\n text=exported_content,\n data={\"exported_content\": exported_content, \"export_format\": self.EXPORT_FORMAT, **meta},\n )\n\n rows = list(result.get(\"doc\", []))\n return Data(data={\"doc\": rows, \"export_format\": self.EXPORT_FORMAT, **meta})\n\n def process_files(\n self,\n file_list: list[BaseFileComponent.BaseFile],\n ) -> list[BaseFileComponent.BaseFile]:\n \"\"\"Process input files.\n\n - Single file + advanced_mode => Docling in a separate process.\n - Otherwise => standard parsing in current process (optionally threaded).\n \"\"\"\n if not file_list:\n msg = \"No files to process.\"\n raise ValueError(msg)\n\n def process_file_standard(file_path: str, *, silent_errors: bool = False) -> Data | None:\n try:\n return parse_text_file_to_data(file_path, silent_errors=silent_errors)\n except FileNotFoundError as e:\n self.log(f\"File not found: {file_path}. Error: {e}\")\n if not silent_errors:\n raise\n return None\n except Exception as e:\n self.log(f\"Unexpected error processing {file_path}: {e}\")\n if not silent_errors:\n raise\n return None\n\n # Advanced path: only for a single Docling-compatible file\n if len(file_list) == 1:\n file_path = str(file_list[0].path)\n if self.advanced_mode and self._is_docling_compatible(file_path):\n advanced_data: Data | None = self._process_docling_in_subprocess(file_path)\n\n # --- UNNEST: expand each element in `doc` to its own Data row\n payload = getattr(advanced_data, \"data\", {}) or {}\n doc_rows = payload.get(\"doc\")\n if isinstance(doc_rows, list):\n rows: list[Data | None] = [\n Data(\n data={\n \"file_path\": file_path,\n **(item if isinstance(item, dict) else {\"value\": item}),\n },\n )\n for item in doc_rows\n ]\n return self.rollup_data(file_list, rows)\n\n # If not structured, keep as-is (e.g., markdown export or error dict)\n return self.rollup_data(file_list, [advanced_data])\n\n # Standard multi-file (or single non-advanced) path\n concurrency = 1 if not self.use_multithreading else max(1, self.concurrency_multithreading)\n file_paths = [str(f.path) for f in file_list]\n self.log(f\"Starting parallel processing of {len(file_paths)} files with concurrency: {concurrency}.\")\n my_data = parallel_load_data(\n file_paths,\n silent_errors=self.silent_errors,\n load_function=process_file_standard,\n max_concurrency=concurrency,\n )\n return self.rollup_data(file_list, my_data)\n\n # ------------------------------ Output helpers -----------------------------------\n\n def load_files_advanced(self) -> DataFrame:\n \"\"\"Load files using advanced Docling processing and export to an advanced format.\"\"\"\n self.markdown = False\n return self.load_files()\n\n def load_files_markdown(self) -> Message:\n \"\"\"Load files using advanced Docling processing and export to Markdown format.\"\"\"\n self.markdown = True\n result = self.load_files()\n return Message(text=str(result.text[0]))\n" |
There was a problem hiding this comment.
Guard None from _process_docling_in_subprocess before rollup.
If advanced_data is None, rollup_data(file_list, [advanced_data]) passes [None] and may break downstream.
Apply this diff:
if self.advanced_mode and self._is_docling_compatible(file_path):
advanced_data: Data | None = self._process_docling_in_subprocess(file_path)
+ if advanced_data is None:
+ return self.rollup_data(
+ file_list,
+ [Data(data={"error": "Docling returned no data", "file_path": file_path})],
+ )
# --- UNNEST: expand each element in `doc` to its own Data row
payload = getattr(advanced_data, "data", {}) or {}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "value": "\"\"\"Enhanced file component with clearer structure and Docling isolation.\n\nNotes:\n-----\n- Functionality is preserved with minimal behavioral changes.\n- ALL Docling parsing/export runs in a separate OS process to prevent memory\n growth and native library state from impacting the main Langflow process.\n- Standard text/structured parsing continues to use existing BaseFileComponent\n utilities (and optional threading via `parallel_load_data`).\n\"\"\"\n\nfrom __future__ import annotations\n\nimport json\nimport subprocess\nimport sys\nimport textwrap\nfrom copy import deepcopy\nfrom typing import Any\n\nfrom lfx.base.data.base_file import BaseFileComponent\nfrom lfx.base.data.utils import TEXT_FILE_TYPES, parallel_load_data, parse_text_file_to_data\nfrom lfx.inputs.inputs import DropdownInput, MessageTextInput, StrInput\nfrom lfx.io import BoolInput, FileInput, IntInput, Output\nfrom lfx.schema import DataFrame # noqa: TC001\nfrom lfx.schema.data import Data\nfrom lfx.schema.message import Message\n\n\nclass FileComponent(BaseFileComponent):\n \"\"\"Read File component with optional Docling processing (isolated in a subprocess).\"\"\"\n\n display_name = \"Read File\"\n description = \"Loads content from files with optional advanced document processing and export using Docling.\"\n documentation: str = \"https://docs.langflow.org/components-data#file\"\n icon = \"file-text\"\n name = \"File\"\n\n # Docling-supported/compatible extensions; TEXT_FILE_TYPES are supported by the base loader.\n VALID_EXTENSIONS = [\n \"adoc\",\n \"asciidoc\",\n \"asc\",\n \"bmp\",\n \"csv\",\n \"dotx\",\n \"dotm\",\n \"docm\",\n \"docx\",\n \"htm\",\n \"html\",\n \"jpeg\",\n \"json\",\n \"md\",\n \"pdf\",\n \"png\",\n \"potx\",\n \"ppsx\",\n \"pptm\",\n \"potm\",\n \"ppsm\",\n \"pptx\",\n \"tiff\",\n \"txt\",\n \"xls\",\n \"xlsx\",\n \"xhtml\",\n \"xml\",\n \"webp\",\n *TEXT_FILE_TYPES,\n ]\n\n # Fixed export settings used when markdown export is requested.\n EXPORT_FORMAT = \"Markdown\"\n IMAGE_MODE = \"placeholder\"\n\n _base_inputs = deepcopy(BaseFileComponent.get_base_inputs())\n\n for input_item in _base_inputs:\n if isinstance(input_item, FileInput) and input_item.name == \"path\":\n input_item.real_time_refresh = True\n break\n\n inputs = [\n *_base_inputs,\n BoolInput(\n name=\"advanced_mode\",\n display_name=\"Advanced Parser\",\n value=False,\n real_time_refresh=True,\n info=(\n \"Enable advanced document processing and export with Docling for PDFs, images, and office documents. \"\n \"Available only for single file processing.\"\n ),\n show=False,\n ),\n DropdownInput(\n name=\"pipeline\",\n display_name=\"Pipeline\",\n info=\"Docling pipeline to use\",\n options=[\"standard\", \"vlm\"],\n value=\"standard\",\n advanced=True,\n ),\n DropdownInput(\n name=\"ocr_engine\",\n display_name=\"OCR Engine\",\n info=\"OCR engine to use. Only available when pipeline is set to 'standard'.\",\n options=[\"\", \"easyocr\"],\n value=\"\",\n show=False,\n advanced=True,\n ),\n StrInput(\n name=\"md_image_placeholder\",\n display_name=\"Image placeholder\",\n info=\"Specify the image placeholder for markdown exports.\",\n value=\"<!-- image -->\",\n advanced=True,\n show=False,\n ),\n StrInput(\n name=\"md_page_break_placeholder\",\n display_name=\"Page break placeholder\",\n info=\"Add this placeholder between pages in the markdown output.\",\n value=\"\",\n advanced=True,\n show=False,\n ),\n MessageTextInput(\n name=\"doc_key\",\n display_name=\"Doc Key\",\n info=\"The key to use for the DoclingDocument column.\",\n value=\"doc\",\n advanced=True,\n show=False,\n ),\n # Deprecated input retained for backward-compatibility.\n BoolInput(\n name=\"use_multithreading\",\n display_name=\"[Deprecated] Use Multithreading\",\n advanced=True,\n value=True,\n info=\"Set 'Processing Concurrency' greater than 1 to enable multithreading.\",\n ),\n IntInput(\n name=\"concurrency_multithreading\",\n display_name=\"Processing Concurrency\",\n advanced=True,\n info=\"When multiple files are being processed, the number of files to process concurrently.\",\n value=1,\n ),\n BoolInput(\n name=\"markdown\",\n display_name=\"Markdown Export\",\n info=\"Export processed documents to Markdown format. Only available when advanced mode is enabled.\",\n value=False,\n show=False,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\", tool_mode=True),\n ]\n\n # ------------------------------ UI helpers --------------------------------------\n\n def _path_value(self, template: dict) -> list[str]:\n \"\"\"Return the list of currently selected file paths from the template.\"\"\"\n return template.get(\"path\", {}).get(\"file_path\", [])\n\n def update_build_config(\n self,\n build_config: dict[str, Any],\n field_value: Any,\n field_name: str | None = None,\n ) -> dict[str, Any]:\n \"\"\"Show/hide Advanced Parser and related fields based on selection context.\"\"\"\n if field_name == \"path\":\n paths = self._path_value(build_config)\n file_path = paths[0] if paths else \"\"\n file_count = len(field_value) if field_value else 0\n\n # Advanced mode only for single (non-tabular) file\n allow_advanced = file_count == 1 and not file_path.endswith((\".csv\", \".xlsx\", \".parquet\"))\n build_config[\"advanced_mode\"][\"show\"] = allow_advanced\n if not allow_advanced:\n build_config[\"advanced_mode\"][\"value\"] = False\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = False\n\n elif field_name == \"advanced_mode\":\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = bool(field_value)\n\n return build_config\n\n def update_outputs(self, frontend_node: dict[str, Any], field_name: str, field_value: Any) -> dict[str, Any]: # noqa: ARG002\n \"\"\"Dynamically show outputs based on file count/type and advanced mode.\"\"\"\n if field_name not in [\"path\", \"advanced_mode\"]:\n return frontend_node\n\n template = frontend_node.get(\"template\", {})\n paths = self._path_value(template)\n if not paths:\n return frontend_node\n\n frontend_node[\"outputs\"] = []\n if len(paths) == 1:\n file_path = paths[0] if field_name == \"path\" else frontend_node[\"template\"][\"path\"][\"file_path\"][0]\n if file_path.endswith((\".csv\", \".xlsx\", \".parquet\")):\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Structured Content\",\n name=\"dataframe\",\n method=\"load_files_structured\",\n tool_mode=True,\n ),\n )\n elif file_path.endswith(\".json\"):\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Structured Content\", name=\"json\", method=\"load_files_json\", tool_mode=True),\n )\n\n advanced_mode = frontend_node.get(\"template\", {}).get(\"advanced_mode\", {}).get(\"value\", False)\n if advanced_mode:\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Structured Output\",\n name=\"advanced\",\n method=\"load_files_advanced\",\n tool_mode=True,\n ),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Markdown\", name=\"markdown\", method=\"load_files_markdown\", tool_mode=True),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\", tool_mode=True),\n )\n else:\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\", tool_mode=True),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\", tool_mode=True),\n )\n else:\n # Multiple files => DataFrame output; advanced parser disabled\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Files\",\n name=\"dataframe\",\n method=\"load_files\",\n tool_mode=True,\n )\n )\n\n return frontend_node\n\n # ------------------------------ Core processing ----------------------------------\n\n def _is_docling_compatible(self, file_path: str) -> bool:\n \"\"\"Lightweight extension gate for Docling-compatible types.\"\"\"\n docling_exts = (\n \".adoc\",\n \".asciidoc\",\n \".asc\",\n \".bmp\",\n \".csv\",\n \".dotx\",\n \".dotm\",\n \".docm\",\n \".docx\",\n \".htm\",\n \".html\",\n \".jpeg\",\n \".json\",\n \".md\",\n \".pdf\",\n \".png\",\n \".potx\",\n \".ppsx\",\n \".pptm\",\n \".potm\",\n \".ppsm\",\n \".pptx\",\n \".tiff\",\n \".txt\",\n \".xls\",\n \".xlsx\",\n \".xhtml\",\n \".xml\",\n \".webp\",\n )\n return file_path.lower().endswith(docling_exts)\n\n def _process_docling_in_subprocess(self, file_path: str) -> Data | None:\n \"\"\"Run Docling in a separate OS process and map the result to a Data object.\n\n We avoid multiprocessing pickling by launching `python -c \"<script>\"` and\n passing JSON config via stdin. The child prints a JSON result to stdout.\n \"\"\"\n if not file_path:\n return None\n\n args: dict[str, Any] = {\n \"file_path\": file_path,\n \"markdown\": bool(self.markdown),\n \"image_mode\": str(self.IMAGE_MODE),\n \"md_image_placeholder\": str(self.md_image_placeholder),\n \"md_page_break_placeholder\": str(self.md_page_break_placeholder),\n \"pipeline\": str(self.pipeline),\n \"ocr_engine\": str(self.ocr_engine) if getattr(self, \"ocr_engine\", \"\") else None,\n }\n\n # The child is a tiny, self-contained script to keep memory/state isolated.\n child_script = textwrap.dedent(\n r\"\"\"\n import json, sys\n\n def try_imports():\n # Strategy 1: latest layout\n try:\n from docling.datamodel.base_models import ConversionStatus, InputFormat # type: ignore\n from docling.document_converter import DocumentConverter # type: ignore\n from docling_core.types.doc import ImageRefMode # type: ignore\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"latest\"\n except Exception:\n pass\n # Strategy 2: alternative layout\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n try:\n from docling_core.types import ConversionStatus, InputFormat # type: ignore\n except Exception:\n try:\n from docling.datamodel import ConversionStatus, InputFormat # type: ignore\n except Exception:\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n try:\n from docling_core.types.doc import ImageRefMode # type: ignore\n except Exception:\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"alternative\"\n except Exception:\n pass\n # Strategy 3: basic converter only\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"basic\"\n except Exception as e:\n raise ImportError(f\"Docling imports failed: {e}\") from e\n\n def create_converter(strategy, input_format, DocumentConverter, pipeline, ocr_engine):\n if strategy == \"latest\" and pipeline == \"standard\":\n try:\n from docling.datamodel.pipeline_options import PdfPipelineOptions # type: ignore\n from docling.document_converter import PdfFormatOption # type: ignore\n pipe = PdfPipelineOptions()\n if ocr_engine:\n try:\n from docling.models.factories import get_ocr_factory # type: ignore\n pipe.do_ocr = True\n fac = get_ocr_factory(allow_external_plugins=False)\n pipe.ocr_options = fac.create_options(kind=ocr_engine)\n except Exception:\n pipe.do_ocr = False\n fmt = {}\n if hasattr(input_format, \"PDF\"):\n fmt[getattr(input_format, \"PDF\")] = PdfFormatOption(pipeline_options=pipe)\n if hasattr(input_format, \"IMAGE\"):\n fmt[getattr(input_format, \"IMAGE\")] = PdfFormatOption(pipeline_options=pipe)\n return DocumentConverter(format_options=fmt)\n except Exception:\n return DocumentConverter()\n return DocumentConverter()\n\n def export_markdown(document, ImageRefMode, image_mode, img_ph, pg_ph):\n try:\n mode = getattr(ImageRefMode, image_mode.upper(), image_mode)\n return document.export_to_markdown(\n image_mode=mode,\n image_placeholder=img_ph,\n page_break_placeholder=pg_ph,\n )\n except Exception:\n try:\n return document.export_to_text()\n except Exception:\n return str(document)\n\n def to_rows(doc_dict):\n rows = []\n for t in doc_dict.get(\"texts\", []):\n prov = t.get(\"prov\") or []\n page_no = None\n if prov and isinstance(prov, list) and isinstance(prov[0], dict):\n page_no = prov[0].get(\"page_no\")\n rows.append({\n \"page_no\": page_no,\n \"label\": t.get(\"label\"),\n \"text\": t.get(\"text\"),\n \"level\": t.get(\"level\"),\n })\n return rows\n\n def main():\n cfg = json.loads(sys.stdin.read())\n file_path = cfg[\"file_path\"]\n markdown = cfg[\"markdown\"]\n image_mode = cfg[\"image_mode\"]\n img_ph = cfg[\"md_image_placeholder\"]\n pg_ph = cfg[\"md_page_break_placeholder\"]\n pipeline = cfg[\"pipeline\"]\n ocr_engine = cfg.get(\"ocr_engine\")\n meta = {\"file_path\": file_path}\n\n try:\n ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, strategy = try_imports()\n converter = create_converter(strategy, InputFormat, DocumentConverter, pipeline, ocr_engine)\n try:\n res = converter.convert(file_path)\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling conversion error: {e}\", \"meta\": meta}))\n return\n\n ok = False\n if hasattr(res, \"status\"):\n try:\n ok = (res.status == ConversionStatus.SUCCESS) or (str(res.status).lower() == \"success\")\n except Exception:\n ok = (str(res.status).lower() == \"success\")\n if not ok and hasattr(res, \"document\"):\n ok = getattr(res, \"document\", None) is not None\n if not ok:\n print(json.dumps({\"ok\": False, \"error\": \"Docling conversion failed\", \"meta\": meta}))\n return\n\n doc = getattr(res, \"document\", None)\n if doc is None:\n print(json.dumps({\"ok\": False, \"error\": \"Docling produced no document\", \"meta\": meta}))\n return\n\n if markdown:\n text = export_markdown(doc, ImageRefMode, image_mode, img_ph, pg_ph)\n print(json.dumps({\"ok\": True, \"mode\": \"markdown\", \"text\": text, \"meta\": meta}))\n return\n\n # structured\n try:\n doc_dict = doc.export_to_dict()\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling export_to_dict failed: {e}\", \"meta\": meta}))\n return\n\n rows = to_rows(doc_dict)\n print(json.dumps({\"ok\": True, \"mode\": \"structured\", \"doc\": rows, \"meta\": meta}))\n except Exception as e:\n print(\n json.dumps({\n \"ok\": False,\n \"error\": f\"Docling processing error: {e}\",\n \"meta\": {\"file_path\": file_path},\n })\n )\n\n if __name__ == \"__main__\":\n main()\n \"\"\"\n )\n\n # Validate file_path to avoid command injection or unsafe input\n if not isinstance(args[\"file_path\"], str) or any(c in args[\"file_path\"] for c in [\";\", \"|\", \"&\", \"$\", \"`\"]):\n return Data(data={\"error\": \"Unsafe file path detected.\", \"file_path\": args[\"file_path\"]})\n\n proc = subprocess.run( # noqa: S603\n [sys.executable, \"-u\", \"-c\", child_script],\n input=json.dumps(args).encode(\"utf-8\"),\n capture_output=True,\n check=False,\n )\n\n if not proc.stdout:\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\") or \"no output from child process\"\n return Data(data={\"error\": f\"Docling subprocess error: {err_msg}\", \"file_path\": file_path})\n\n try:\n result = json.loads(proc.stdout.decode(\"utf-8\"))\n except Exception as e: # noqa: BLE001\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\")\n return Data(\n data={\"error\": f\"Invalid JSON from Docling subprocess: {e}. stderr={err_msg}\", \"file_path\": file_path},\n )\n\n if not result.get(\"ok\"):\n return Data(data={\"error\": result.get(\"error\", \"Unknown Docling error\"), **result.get(\"meta\", {})})\n\n meta = result.get(\"meta\", {})\n if result.get(\"mode\") == \"markdown\":\n exported_content = str(result.get(\"text\", \"\"))\n return Data(\n text=exported_content,\n data={\"exported_content\": exported_content, \"export_format\": self.EXPORT_FORMAT, **meta},\n )\n\n rows = list(result.get(\"doc\", []))\n return Data(data={\"doc\": rows, \"export_format\": self.EXPORT_FORMAT, **meta})\n\n def process_files(\n self,\n file_list: list[BaseFileComponent.BaseFile],\n ) -> list[BaseFileComponent.BaseFile]:\n \"\"\"Process input files.\n\n - Single file + advanced_mode => Docling in a separate process.\n - Otherwise => standard parsing in current process (optionally threaded).\n \"\"\"\n if not file_list:\n msg = \"No files to process.\"\n raise ValueError(msg)\n\n def process_file_standard(file_path: str, *, silent_errors: bool = False) -> Data | None:\n try:\n return parse_text_file_to_data(file_path, silent_errors=silent_errors)\n except FileNotFoundError as e:\n self.log(f\"File not found: {file_path}. Error: {e}\")\n if not silent_errors:\n raise\n return None\n except Exception as e:\n self.log(f\"Unexpected error processing {file_path}: {e}\")\n if not silent_errors:\n raise\n return None\n\n # Advanced path: only for a single Docling-compatible file\n if len(file_list) == 1:\n file_path = str(file_list[0].path)\n if self.advanced_mode and self._is_docling_compatible(file_path):\n advanced_data: Data | None = self._process_docling_in_subprocess(file_path)\n\n # --- UNNEST: expand each element in `doc` to its own Data row\n payload = getattr(advanced_data, \"data\", {}) or {}\n doc_rows = payload.get(\"doc\")\n if isinstance(doc_rows, list):\n rows: list[Data | None] = [\n Data(\n data={\n \"file_path\": file_path,\n **(item if isinstance(item, dict) else {\"value\": item}),\n },\n )\n for item in doc_rows\n ]\n return self.rollup_data(file_list, rows)\n\n # If not structured, keep as-is (e.g., markdown export or error dict)\n return self.rollup_data(file_list, [advanced_data])\n\n # Standard multi-file (or single non-advanced) path\n concurrency = 1 if not self.use_multithreading else max(1, self.concurrency_multithreading)\n file_paths = [str(f.path) for f in file_list]\n self.log(f\"Starting parallel processing of {len(file_paths)} files with concurrency: {concurrency}.\")\n my_data = parallel_load_data(\n file_paths,\n silent_errors=self.silent_errors,\n load_function=process_file_standard,\n max_concurrency=concurrency,\n )\n return self.rollup_data(file_list, my_data)\n\n # ------------------------------ Output helpers -----------------------------------\n\n def load_files_advanced(self) -> DataFrame:\n \"\"\"Load files using advanced Docling processing and export to an advanced format.\"\"\"\n self.markdown = False\n return self.load_files()\n\n def load_files_markdown(self) -> Message:\n \"\"\"Load files using advanced Docling processing and export to Markdown format.\"\"\"\n self.markdown = True\n result = self.load_files()\n return Message(text=str(result.text[0]))\n" | |
| def process_files( | |
| self, | |
| file_list: list[BaseFileComponent.BaseFile], | |
| ) -> list[BaseFileComponent.BaseFile]: | |
| \"\"\"Process input files. | |
| - Single file + advanced_mode => Docling in a separate process. | |
| - Otherwise => standard parsing in current process (optionally threaded). | |
| \"\"\" | |
| if not file_list: | |
| msg = \"No files to process.\" | |
| raise ValueError(msg) | |
| def process_file_standard(file_path: str, *, silent_errors: bool = False) -> Data | None: | |
| try: | |
| return parse_text_file_to_data(file_path, silent_errors=silent_errors) | |
| except FileNotFoundError as e: | |
| self.log(f\"File not found: {file_path}. Error: {e}\") | |
| if not silent_errors: | |
| raise | |
| return None | |
| except Exception as e: | |
| self.log(f\"Unexpected error processing {file_path}: {e}\") | |
| if not silent_errors: | |
| raise | |
| return None | |
| # Advanced path: only for a single Docling-compatible file | |
| if len(file_list) == 1: | |
| file_path = str(file_list[0].path) | |
| if self.advanced_mode and self._is_docling_compatible(file_path): | |
| advanced_data: Data | None = self._process_docling_in_subprocess(file_path) | |
| if advanced_data is None: | |
| return self.rollup_data( | |
| file_list, | |
| [Data(data={\"error\": \"Docling returned no data\", \"file_path\": file_path})], | |
| ) | |
| # --- UNNEST: expand each element in `doc` to its own Data row | |
| payload = getattr(advanced_data, \"data\", {}) or {} | |
| doc_rows = payload.get(\"doc\") | |
| if isinstance(doc_rows, list): | |
| rows: list[Data | None] = [ | |
| Data( | |
| data={ | |
| \"file_path\": file_path, | |
| **(item if isinstance(item, dict) else {\"value\": item}), | |
| }, | |
| ) | |
| for item in doc_rows | |
| ] | |
| return self.rollup_data(file_list, rows) | |
| # If not structured, keep as-is (e.g., markdown export or error dict) | |
| return self.rollup_data(file_list, [advanced_data]) | |
| # Standard multi-file (or single non-advanced) path | |
| concurrency = 1 if not self.use_multithreading else max(1, self.concurrency_multithreading) | |
| file_paths = [str(f.path) for f in file_list] | |
| self.log(f\"Starting parallel processing of {len(file_paths)} files with concurrency: {concurrency}.\") | |
| my_data = parallel_load_data( | |
| file_paths, | |
| silent_errors=self.silent_errors, | |
| load_function=process_file_standard, | |
| max_concurrency=concurrency, | |
| ) | |
| return self.rollup_data(file_list, my_data) |
🤖 Prompt for AI Agents
In src/backend/base/langflow/initial_setup/starter_projects/Text Sentiment
Analysis.json around line 2474, guard against advanced_data being None before
calling rollup_data: after advanced_data =
self._process_docling_in_subprocess(file_path) check if advanced_data is None
and handle it (e.g., return self.rollup_data(file_list, []) or return an error
Data row) instead of calling self.rollup_data(file_list, [advanced_data]) to
avoid passing [None] downstream.
Add timeout and robust error handling to Docling subprocess call.
subprocess.run has no timeout; a stuck child will hang the flow. Also surface a clear timeout error.
Apply this diff inside the embedded FileComponent code:
- proc = subprocess.run( # noqa: S603
- [sys.executable, "-u", "-c", child_script],
- input=json.dumps(args).encode("utf-8"),
- capture_output=True,
- check=False,
- )
+ try:
+ proc = subprocess.run( # noqa: S603
+ [sys.executable, "-u", "-c", child_script],
+ input=json.dumps(args).encode("utf-8"),
+ capture_output=True,
+ check=False,
+ timeout=getattr(self, "subprocess_timeout", 120),
+ )
+ except subprocess.TimeoutExpired:
+ return Data(data={"error": "Docling subprocess timed out", "file_path": file_path})📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "value": "\"\"\"Enhanced file component with clearer structure and Docling isolation.\n\nNotes:\n-----\n- Functionality is preserved with minimal behavioral changes.\n- ALL Docling parsing/export runs in a separate OS process to prevent memory\n growth and native library state from impacting the main Langflow process.\n- Standard text/structured parsing continues to use existing BaseFileComponent\n utilities (and optional threading via `parallel_load_data`).\n\"\"\"\n\nfrom __future__ import annotations\n\nimport json\nimport subprocess\nimport sys\nimport textwrap\nfrom copy import deepcopy\nfrom typing import Any\n\nfrom lfx.base.data.base_file import BaseFileComponent\nfrom lfx.base.data.utils import TEXT_FILE_TYPES, parallel_load_data, parse_text_file_to_data\nfrom lfx.inputs.inputs import DropdownInput, MessageTextInput, StrInput\nfrom lfx.io import BoolInput, FileInput, IntInput, Output\nfrom lfx.schema import DataFrame # noqa: TC001\nfrom lfx.schema.data import Data\nfrom lfx.schema.message import Message\n\n\nclass FileComponent(BaseFileComponent):\n \"\"\"Read File component with optional Docling processing (isolated in a subprocess).\"\"\"\n\n display_name = \"Read File\"\n description = \"Loads content from files with optional advanced document processing and export using Docling.\"\n documentation: str = \"https://docs.langflow.org/components-data#file\"\n icon = \"file-text\"\n name = \"File\"\n\n # Docling-supported/compatible extensions; TEXT_FILE_TYPES are supported by the base loader.\n VALID_EXTENSIONS = [\n \"adoc\",\n \"asciidoc\",\n \"asc\",\n \"bmp\",\n \"csv\",\n \"dotx\",\n \"dotm\",\n \"docm\",\n \"docx\",\n \"htm\",\n \"html\",\n \"jpeg\",\n \"json\",\n \"md\",\n \"pdf\",\n \"png\",\n \"potx\",\n \"ppsx\",\n \"pptm\",\n \"potm\",\n \"ppsm\",\n \"pptx\",\n \"tiff\",\n \"txt\",\n \"xls\",\n \"xlsx\",\n \"xhtml\",\n \"xml\",\n \"webp\",\n *TEXT_FILE_TYPES,\n ]\n\n # Fixed export settings used when markdown export is requested.\n EXPORT_FORMAT = \"Markdown\"\n IMAGE_MODE = \"placeholder\"\n\n _base_inputs = deepcopy(BaseFileComponent.get_base_inputs())\n\n for input_item in _base_inputs:\n if isinstance(input_item, FileInput) and input_item.name == \"path\":\n input_item.real_time_refresh = True\n break\n\n inputs = [\n *_base_inputs,\n BoolInput(\n name=\"advanced_mode\",\n display_name=\"Advanced Parser\",\n value=False,\n real_time_refresh=True,\n info=(\n \"Enable advanced document processing and export with Docling for PDFs, images, and office documents. \"\n \"Available only for single file processing.\"\n ),\n show=False,\n ),\n DropdownInput(\n name=\"pipeline\",\n display_name=\"Pipeline\",\n info=\"Docling pipeline to use\",\n options=[\"standard\", \"vlm\"],\n value=\"standard\",\n advanced=True,\n ),\n DropdownInput(\n name=\"ocr_engine\",\n display_name=\"OCR Engine\",\n info=\"OCR engine to use. Only available when pipeline is set to 'standard'.\",\n options=[\"\", \"easyocr\"],\n value=\"\",\n show=False,\n advanced=True,\n ),\n StrInput(\n name=\"md_image_placeholder\",\n display_name=\"Image placeholder\",\n info=\"Specify the image placeholder for markdown exports.\",\n value=\"<!-- image -->\",\n advanced=True,\n show=False,\n ),\n StrInput(\n name=\"md_page_break_placeholder\",\n display_name=\"Page break placeholder\",\n info=\"Add this placeholder between pages in the markdown output.\",\n value=\"\",\n advanced=True,\n show=False,\n ),\n MessageTextInput(\n name=\"doc_key\",\n display_name=\"Doc Key\",\n info=\"The key to use for the DoclingDocument column.\",\n value=\"doc\",\n advanced=True,\n show=False,\n ),\n # Deprecated input retained for backward-compatibility.\n BoolInput(\n name=\"use_multithreading\",\n display_name=\"[Deprecated] Use Multithreading\",\n advanced=True,\n value=True,\n info=\"Set 'Processing Concurrency' greater than 1 to enable multithreading.\",\n ),\n IntInput(\n name=\"concurrency_multithreading\",\n display_name=\"Processing Concurrency\",\n advanced=True,\n info=\"When multiple files are being processed, the number of files to process concurrently.\",\n value=1,\n ),\n BoolInput(\n name=\"markdown\",\n display_name=\"Markdown Export\",\n info=\"Export processed documents to Markdown format. Only available when advanced mode is enabled.\",\n value=False,\n show=False,\n ),\n ]\n\n outputs = [\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\", tool_mode=True),\n ]\n\n # ------------------------------ UI helpers --------------------------------------\n\n def _path_value(self, template: dict) -> list[str]:\n \"\"\"Return the list of currently selected file paths from the template.\"\"\"\n return template.get(\"path\", {}).get(\"file_path\", [])\n\n def update_build_config(\n self,\n build_config: dict[str, Any],\n field_value: Any,\n field_name: str | None = None,\n ) -> dict[str, Any]:\n \"\"\"Show/hide Advanced Parser and related fields based on selection context.\"\"\"\n if field_name == \"path\":\n paths = self._path_value(build_config)\n file_path = paths[0] if paths else \"\"\n file_count = len(field_value) if field_value else 0\n\n # Advanced mode only for single (non-tabular) file\n allow_advanced = file_count == 1 and not file_path.endswith((\".csv\", \".xlsx\", \".parquet\"))\n build_config[\"advanced_mode\"][\"show\"] = allow_advanced\n if not allow_advanced:\n build_config[\"advanced_mode\"][\"value\"] = False\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = False\n\n elif field_name == \"advanced_mode\":\n for f in (\"pipeline\", \"ocr_engine\", \"doc_key\", \"md_image_placeholder\", \"md_page_break_placeholder\"):\n if f in build_config:\n build_config[f][\"show\"] = bool(field_value)\n\n return build_config\n\n def update_outputs(self, frontend_node: dict[str, Any], field_name: str, field_value: Any) -> dict[str, Any]: # noqa: ARG002\n \"\"\"Dynamically show outputs based on file count/type and advanced mode.\"\"\"\n if field_name not in [\"path\", \"advanced_mode\"]:\n return frontend_node\n\n template = frontend_node.get(\"template\", {})\n paths = self._path_value(template)\n if not paths:\n return frontend_node\n\n frontend_node[\"outputs\"] = []\n if len(paths) == 1:\n file_path = paths[0] if field_name == \"path\" else frontend_node[\"template\"][\"path\"][\"file_path\"][0]\n if file_path.endswith((\".csv\", \".xlsx\", \".parquet\")):\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Structured Content\",\n name=\"dataframe\",\n method=\"load_files_structured\",\n tool_mode=True,\n ),\n )\n elif file_path.endswith(\".json\"):\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Structured Content\", name=\"json\", method=\"load_files_json\", tool_mode=True),\n )\n\n advanced_mode = frontend_node.get(\"template\", {}).get(\"advanced_mode\", {}).get(\"value\", False)\n if advanced_mode:\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Structured Output\",\n name=\"advanced\",\n method=\"load_files_advanced\",\n tool_mode=True,\n ),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Markdown\", name=\"markdown\", method=\"load_files_markdown\", tool_mode=True),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\", tool_mode=True),\n )\n else:\n frontend_node[\"outputs\"].append(\n Output(display_name=\"Raw Content\", name=\"message\", method=\"load_files_message\", tool_mode=True),\n )\n frontend_node[\"outputs\"].append(\n Output(display_name=\"File Path\", name=\"path\", method=\"load_files_path\", tool_mode=True),\n )\n else:\n # Multiple files => DataFrame output; advanced parser disabled\n frontend_node[\"outputs\"].append(\n Output(\n display_name=\"Files\",\n name=\"dataframe\",\n method=\"load_files\",\n tool_mode=True,\n )\n )\n\n return frontend_node\n\n # ------------------------------ Core processing ----------------------------------\n\n def _is_docling_compatible(self, file_path: str) -> bool:\n \"\"\"Lightweight extension gate for Docling-compatible types.\"\"\"\n docling_exts = (\n \".adoc\",\n \".asciidoc\",\n \".asc\",\n \".bmp\",\n \".csv\",\n \".dotx\",\n \".dotm\",\n \".docm\",\n \".docx\",\n \".htm\",\n \".html\",\n \".jpeg\",\n \".json\",\n \".md\",\n \".pdf\",\n \".png\",\n \".potx\",\n \".ppsx\",\n \".pptm\",\n \".potm\",\n \".ppsm\",\n \".pptx\",\n \".tiff\",\n \".txt\",\n \".xls\",\n \".xlsx\",\n \".xhtml\",\n \".xml\",\n \".webp\",\n )\n return file_path.lower().endswith(docling_exts)\n\n def _process_docling_in_subprocess(self, file_path: str) -> Data | None:\n \"\"\"Run Docling in a separate OS process and map the result to a Data object.\n\n We avoid multiprocessing pickling by launching `python -c \"<script>\"` and\n passing JSON config via stdin. The child prints a JSON result to stdout.\n \"\"\"\n if not file_path:\n return None\n\n args: dict[str, Any] = {\n \"file_path\": file_path,\n \"markdown\": bool(self.markdown),\n \"image_mode\": str(self.IMAGE_MODE),\n \"md_image_placeholder\": str(self.md_image_placeholder),\n \"md_page_break_placeholder\": str(self.md_page_break_placeholder),\n \"pipeline\": str(self.pipeline),\n \"ocr_engine\": str(self.ocr_engine) if getattr(self, \"ocr_engine\", \"\") else None,\n }\n\n # The child is a tiny, self-contained script to keep memory/state isolated.\n child_script = textwrap.dedent(\n r\"\"\"\n import json, sys\n\n def try_imports():\n # Strategy 1: latest layout\n try:\n from docling.datamodel.base_models import ConversionStatus, InputFormat # type: ignore\n from docling.document_converter import DocumentConverter # type: ignore\n from docling_core.types.doc import ImageRefMode # type: ignore\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"latest\"\n except Exception:\n pass\n # Strategy 2: alternative layout\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n try:\n from docling_core.types import ConversionStatus, InputFormat # type: ignore\n except Exception:\n try:\n from docling.datamodel import ConversionStatus, InputFormat # type: ignore\n except Exception:\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n try:\n from docling_core.types.doc import ImageRefMode # type: ignore\n except Exception:\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"alternative\"\n except Exception:\n pass\n # Strategy 3: basic converter only\n try:\n from docling.document_converter import DocumentConverter # type: ignore\n class ConversionStatus: SUCCESS = \"success\"\n class InputFormat:\n PDF=\"pdf\"; IMAGE=\"image\"\n class ImageRefMode:\n PLACEHOLDER=\"placeholder\"; EMBEDDED=\"embedded\"\n return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, \"basic\"\n except Exception as e:\n raise ImportError(f\"Docling imports failed: {e}\") from e\n\n def create_converter(strategy, input_format, DocumentConverter, pipeline, ocr_engine):\n if strategy == \"latest\" and pipeline == \"standard\":\n try:\n from docling.datamodel.pipeline_options import PdfPipelineOptions # type: ignore\n from docling.document_converter import PdfFormatOption # type: ignore\n pipe = PdfPipelineOptions()\n if ocr_engine:\n try:\n from docling.models.factories import get_ocr_factory # type: ignore\n pipe.do_ocr = True\n fac = get_ocr_factory(allow_external_plugins=False)\n pipe.ocr_options = fac.create_options(kind=ocr_engine)\n except Exception:\n pipe.do_ocr = False\n fmt = {}\n if hasattr(input_format, \"PDF\"):\n fmt[getattr(input_format, \"PDF\")] = PdfFormatOption(pipeline_options=pipe)\n if hasattr(input_format, \"IMAGE\"):\n fmt[getattr(input_format, \"IMAGE\")] = PdfFormatOption(pipeline_options=pipe)\n return DocumentConverter(format_options=fmt)\n except Exception:\n return DocumentConverter()\n return DocumentConverter()\n\n def export_markdown(document, ImageRefMode, image_mode, img_ph, pg_ph):\n try:\n mode = getattr(ImageRefMode, image_mode.upper(), image_mode)\n return document.export_to_markdown(\n image_mode=mode,\n image_placeholder=img_ph,\n page_break_placeholder=pg_ph,\n )\n except Exception:\n try:\n return document.export_to_text()\n except Exception:\n return str(document)\n\n def to_rows(doc_dict):\n rows = []\n for t in doc_dict.get(\"texts\", []):\n prov = t.get(\"prov\") or []\n page_no = None\n if prov and isinstance(prov, list) and isinstance(prov[0], dict):\n page_no = prov[0].get(\"page_no\")\n rows.append({\n \"page_no\": page_no,\n \"label\": t.get(\"label\"),\n \"text\": t.get(\"text\"),\n \"level\": t.get(\"level\"),\n })\n return rows\n\n def main():\n cfg = json.loads(sys.stdin.read())\n file_path = cfg[\"file_path\"]\n markdown = cfg[\"markdown\"]\n image_mode = cfg[\"image_mode\"]\n img_ph = cfg[\"md_image_placeholder\"]\n pg_ph = cfg[\"md_page_break_placeholder\"]\n pipeline = cfg[\"pipeline\"]\n ocr_engine = cfg.get(\"ocr_engine\")\n meta = {\"file_path\": file_path}\n\n try:\n ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, strategy = try_imports()\n converter = create_converter(strategy, InputFormat, DocumentConverter, pipeline, ocr_engine)\n try:\n res = converter.convert(file_path)\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling conversion error: {e}\", \"meta\": meta}))\n return\n\n ok = False\n if hasattr(res, \"status\"):\n try:\n ok = (res.status == ConversionStatus.SUCCESS) or (str(res.status).lower() == \"success\")\n except Exception:\n ok = (str(res.status).lower() == \"success\")\n if not ok and hasattr(res, \"document\"):\n ok = getattr(res, \"document\", None) is not None\n if not ok:\n print(json.dumps({\"ok\": False, \"error\": \"Docling conversion failed\", \"meta\": meta}))\n return\n\n doc = getattr(res, \"document\", None)\n if doc is None:\n print(json.dumps({\"ok\": False, \"error\": \"Docling produced no document\", \"meta\": meta}))\n return\n\n if markdown:\n text = export_markdown(doc, ImageRefMode, image_mode, img_ph, pg_ph)\n print(json.dumps({\"ok\": True, \"mode\": \"markdown\", \"text\": text, \"meta\": meta}))\n return\n\n # structured\n try:\n doc_dict = doc.export_to_dict()\n except Exception as e:\n print(json.dumps({\"ok\": False, \"error\": f\"Docling export_to_dict failed: {e}\", \"meta\": meta}))\n return\n\n rows = to_rows(doc_dict)\n print(json.dumps({\"ok\": True, \"mode\": \"structured\", \"doc\": rows, \"meta\": meta}))\n except Exception as e:\n print(\n json.dumps({\n \"ok\": False,\n \"error\": f\"Docling processing error: {e}\",\n \"meta\": {\"file_path\": file_path},\n })\n )\n\n if __name__ == \"__main__\":\n main()\n \"\"\"\n )\n\n # Validate file_path to avoid command injection or unsafe input\n if not isinstance(args[\"file_path\"], str) or any(c in args[\"file_path\"] for c in [\";\", \"|\", \"&\", \"$\", \"`\"]):\n return Data(data={\"error\": \"Unsafe file path detected.\", \"file_path\": args[\"file_path\"]})\n\n proc = subprocess.run( # noqa: S603\n [sys.executable, \"-u\", \"-c\", child_script],\n input=json.dumps(args).encode(\"utf-8\"),\n capture_output=True,\n check=False,\n )\n\n if not proc.stdout:\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\") or \"no output from child process\"\n return Data(data={\"error\": f\"Docling subprocess error: {err_msg}\", \"file_path\": file_path})\n\n try:\n result = json.loads(proc.stdout.decode(\"utf-8\"))\n except Exception as e: # noqa: BLE001\n err_msg = proc.stderr.decode(\"utf-8\", errors=\"replace\")\n return Data(\n data={\"error\": f\"Invalid JSON from Docling subprocess: {e}. stderr={err_msg}\", \"file_path\": file_path},\n )\n\n if not result.get(\"ok\"):\n return Data(data={\"error\": result.get(\"error\", \"Unknown Docling error\"), **result.get(\"meta\", {})})\n\n meta = result.get(\"meta\", {})\n if result.get(\"mode\") == \"markdown\":\n exported_content = str(result.get(\"text\", \"\"))\n return Data(\n text=exported_content,\n data={\"exported_content\": exported_content, \"export_format\": self.EXPORT_FORMAT, **meta},\n )\n\n rows = list(result.get(\"doc\", []))\n return Data(data={\"doc\": rows, \"export_format\": self.EXPORT_FORMAT, **meta})\n\n def process_files(\n self,\n file_list: list[BaseFileComponent.BaseFile],\n ) -> list[BaseFileComponent.BaseFile]:\n \"\"\"Process input files.\n\n - Single file + advanced_mode => Docling in a separate process.\n - Otherwise => standard parsing in current process (optionally threaded).\n \"\"\"\n if not file_list:\n msg = \"No files to process.\"\n raise ValueError(msg)\n\n def process_file_standard(file_path: str, *, silent_errors: bool = False) -> Data | None:\n try:\n return parse_text_file_to_data(file_path, silent_errors=silent_errors)\n except FileNotFoundError as e:\n self.log(f\"File not found: {file_path}. Error: {e}\")\n if not silent_errors:\n raise\n return None\n except Exception as e:\n self.log(f\"Unexpected error processing {file_path}: {e}\")\n if not silent_errors:\n raise\n return None\n\n # Advanced path: only for a single Docling-compatible file\n if len(file_list) == 1:\n file_path = str(file_list[0].path)\n if self.advanced_mode and self._is_docling_compatible(file_path):\n advanced_data: Data | None = self._process_docling_in_subprocess(file_path)\n\n # --- UNNEST: expand each element in `doc` to its own Data row\n payload = getattr(advanced_data, \"data\", {}) or {}\n doc_rows = payload.get(\"doc\")\n if isinstance(doc_rows, list):\n rows: list[Data | None] = [\n Data(\n data={\n \"file_path\": file_path,\n **(item if isinstance(item, dict) else {\"value\": item}),\n },\n )\n for item in doc_rows\n ]\n return self.rollup_data(file_list, rows)\n\n # If not structured, keep as-is (e.g., markdown export or error dict)\n return self.rollup_data(file_list, [advanced_data])\n\n # Standard multi-file (or single non-advanced) path\n concurrency = 1 if not self.use_multithreading else max(1, self.concurrency_multithreading)\n file_paths = [str(f.path) for f in file_list]\n self.log(f\"Starting parallel processing of {len(file_paths)} files with concurrency: {concurrency}.\")\n my_data = parallel_load_data(\n file_paths,\n silent_errors=self.silent_errors,\n load_function=process_file_standard,\n max_concurrency=concurrency,\n )\n return self.rollup_data(file_list, my_data)\n\n # ------------------------------ Output helpers -----------------------------------\n\n def load_files_advanced(self) -> DataFrame:\n \"\"\"Load files using advanced Docling processing and export to an advanced format.\"\"\"\n self.markdown = False\n return self.load_files()\n\n def load_files_markdown(self) -> Message:\n \"\"\"Load files using advanced Docling processing and export to Markdown format.\"\"\"\n self.markdown = True\n result = self.load_files()\n return Message(text=str(result.text[0]))\n" | |
| def _process_docling_in_subprocess(self, file_path: str) -> Data | None: | |
| """Run Docling in a separate OS process and map the result to a Data object. | |
| We avoid multiprocessing pickling by launching `python -c "<script>"` and | |
| passing JSON config via stdin. The child prints a JSON result to stdout. | |
| """ | |
| if not file_path: | |
| return None | |
| args: dict[str, Any] = { | |
| "file_path": file_path, | |
| "markdown": bool(self.markdown), | |
| "image_mode": str(self.IMAGE_MODE), | |
| "md_image_placeholder": str(self.md_image_placeholder), | |
| "md_page_break_placeholder": str(self.md_page_break_placeholder), | |
| "pipeline": str(self.pipeline), | |
| "ocr_engine": str(self.ocr_engine) if getattr(self, "ocr_engine", "") else None, | |
| } | |
| # The child is a tiny, self-contained script to keep memory/state isolated. | |
| child_script = textwrap.dedent( | |
| r""" | |
| import json, sys | |
| def try_imports(): | |
| # Strategy 1: latest layout | |
| try: | |
| from docling.datamodel.base_models import ConversionStatus, InputFormat # type: ignore | |
| from docling.document_converter import DocumentConverter # type: ignore | |
| from docling_core.types.doc import ImageRefMode # type: ignore | |
| return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, "latest" | |
| except Exception: | |
| pass | |
| # Strategy 2: alternative layout | |
| try: | |
| from docling.document_converter import DocumentConverter # type: ignore | |
| try: | |
| from docling_core.types import ConversionStatus, InputFormat # type: ignore | |
| except Exception: | |
| try: | |
| from docling.datamodel import ConversionStatus, InputFormat # type: ignore | |
| except Exception: | |
| class ConversionStatus: SUCCESS = "success" | |
| class InputFormat: | |
| PDF="pdf"; IMAGE="image" | |
| try: | |
| from docling_core.types.doc import ImageRefMode # type: ignore | |
| except Exception: | |
| class ImageRefMode: | |
| PLACEHOLDER="placeholder"; EMBEDDED="embedded" | |
| return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, "alternative" | |
| except Exception: | |
| pass | |
| # Strategy 3: basic converter only | |
| try: | |
| from docling.document_converter import DocumentConverter # type: ignore | |
| class ConversionStatus: SUCCESS = "success" | |
| class InputFormat: | |
| PDF="pdf"; IMAGE="image" | |
| class ImageRefMode: | |
| PLACEHOLDER="placeholder"; EMBEDDED="embedded" | |
| return ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, "basic" | |
| except Exception as e: | |
| raise ImportError(f"Docling imports failed: {e}") from e | |
| def create_converter(strategy, input_format, DocumentConverter, pipeline, ocr_engine): | |
| if strategy == "latest" and pipeline == "standard": | |
| try: | |
| from docling.datamodel.pipeline_options import PdfPipelineOptions # type: ignore | |
| from docling.document_converter import PdfFormatOption # type: ignore | |
| pipe = PdfPipelineOptions() | |
| if ocr_engine: | |
| try: | |
| from docling.models.factories import get_ocr_factory # type: ignore | |
| pipe.do_ocr = True | |
| fac = get_ocr_factory(allow_external_plugins=False) | |
| pipe.ocr_options = fac.create_options(kind=ocr_engine) | |
| except Exception: | |
| pipe.do_ocr = False | |
| fmt = {} | |
| if hasattr(input_format, "PDF"): | |
| fmt[getattr(input_format, "PDF")] = PdfFormatOption(pipeline_options=pipe) | |
| if hasattr(input_format, "IMAGE"): | |
| fmt[getattr(input_format, "IMAGE")] = PdfFormatOption(pipeline_options=pipe) | |
| return DocumentConverter(format_options=fmt) | |
| except Exception: | |
| return DocumentConverter() | |
| return DocumentConverter() | |
| def export_markdown(document, ImageRefMode, image_mode, img_ph, pg_ph): | |
| try: | |
| mode = getattr(ImageRefMode, image_mode.upper(), image_mode) | |
| return document.export_to_markdown( | |
| image_mode=mode, | |
| image_placeholder=img_ph, | |
| page_break_placeholder=pg_ph, | |
| ) | |
| except Exception: | |
| try: | |
| return document.export_to_text() | |
| except Exception: | |
| return str(document) | |
| def to_rows(doc_dict): | |
| rows = [] | |
| for t in doc_dict.get("texts", []): | |
| prov = t.get("prov") or [] | |
| page_no = None | |
| if prov and isinstance(prov, list) and isinstance(prov[0], dict): | |
| page_no = prov[0].get("page_no") | |
| rows.append({ | |
| "page_no": page_no, | |
| "label": t.get("label"), | |
| "text": t.get("text"), | |
| "level": t.get("level"), | |
| }) | |
| return rows | |
| def main(): | |
| cfg = json.loads(sys.stdin.read()) | |
| file_path = cfg["file_path"] | |
| markdown = cfg["markdown"] | |
| image_mode = cfg["image_mode"] | |
| img_ph = cfg["md_image_placeholder"] | |
| pg_ph = cfg["md_page_break_placeholder"] | |
| pipeline = cfg["pipeline"] | |
| ocr_engine = cfg.get("ocr_engine") | |
| meta = {"file_path": file_path} | |
| try: | |
| ConversionStatus, InputFormat, DocumentConverter, ImageRefMode, strategy = try_imports() | |
| converter = create_converter(strategy, InputFormat, DocumentConverter, pipeline, ocr_engine) | |
| try: | |
| res = converter.convert(file_path) | |
| except Exception as e: | |
| print(json.dumps({"ok": False, "error": f"Docling conversion error: {e}", "meta": meta})) | |
| return | |
| ok = False | |
| if hasattr(res, "status"): | |
| try: | |
| ok = (res.status == ConversionStatus.SUCCESS) or (str(res.status).lower() == "success") | |
| except Exception: | |
| ok = (str(res.status).lower() == "success") | |
| if not ok and hasattr(res, "document"): | |
| ok = getattr(res, "document", None) is not None | |
| if not ok: | |
| print(json.dumps({"ok": False, "error": "Docling conversion failed", "meta": meta})) | |
| return | |
| doc = getattr(res, "document", None) | |
| if doc is None: | |
| print(json.dumps({"ok": False, "error": "Docling produced no document", "meta": meta})) | |
| return | |
| if markdown: | |
| text = export_markdown(doc, ImageRefMode, image_mode, img_ph, pg_ph) | |
| print(json.dumps({"ok": True, "mode": "markdown", "text": text, "meta": meta})) | |
| return | |
| # structured | |
| try: | |
| doc_dict = doc.export_to_dict() | |
| except Exception as e: | |
| print(json.dumps({"ok": False, "error": f"Docling export_to_dict failed: {e}", "meta": meta})) | |
| return | |
| rows = to_rows(doc_dict) | |
| print(json.dumps({"ok": True, "mode": "structured", "doc": rows, "meta": meta})) | |
| except Exception as e: | |
| print( | |
| json.dumps({ | |
| "ok": False, | |
| "error": f"Docling processing error: {e}", | |
| "meta": {"file_path": file_path}, | |
| }) | |
| ) | |
| if __name__ == "__main__": | |
| main() | |
| """ | |
| ) | |
| # Validate file_path to avoid command injection or unsafe input | |
| if not isinstance(args["file_path"], str) or any(c in args["file_path"] for c in [";", "|", "&", "$", "`"]): | |
| return Data(data={"error": "Unsafe file path detected.", "file_path": args["file_path"]}) | |
| try: | |
| proc = subprocess.run( # noqa: S603 | |
| [sys.executable, "-u", "-c", child_script], | |
| input=json.dumps(args).encode("utf-8"), | |
| capture_output=True, | |
| check=False, | |
| timeout=getattr(self, "subprocess_timeout", 120), | |
| ) | |
| except subprocess.TimeoutExpired: | |
| return Data(data={"error": "Docling subprocess timed out", "file_path": file_path}) | |
| if not proc.stdout: | |
| err_msg = proc.stderr.decode("utf-8", errors="replace") or "no output from child process" | |
| return Data(data={"error": f"Docling subprocess error: {err_msg}", "file_path": file_path}) | |
| try: | |
| result = json.loads(proc.stdout.decode("utf-8")) | |
| except Exception as e: # noqa: BLE001 | |
| err_msg = proc.stderr.decode("utf-8", errors="replace") | |
| return Data( | |
| data={"error": f"Invalid JSON from Docling subprocess: {e}. stderr={err_msg}", "file_path": file_path}, | |
| ) | |
| if not result.get("ok"): | |
| return Data(data={"error": result.get("error", "Unknown Docling error"), **result.get("meta", {})}) | |
| meta = result.get("meta", {}) | |
| if result.get("mode") == "markdown": | |
| exported_content = str(result.get("text", "")) | |
| return Data( | |
| text=exported_content, | |
| data={"exported_content": exported_content, "export_format": self.EXPORT_FORMAT, **meta}, | |
| ) | |
| rows = list(result.get("doc", [])) | |
| return Data(data={"doc": rows, "export_format": self.EXPORT_FORMAT, **meta}) |
| display_name = "Write File" | ||
| description = "Write data to a local file in the selected format." |
There was a problem hiding this comment.
🧩 Analysis chain
Rename to “Write File” — verify downstream references.
Confirm starter JSONs, docs, and UI strings no longer rely on “Save File,” and that the doc anchor still matches.
Run:
Optionally check docs site:
🏁 Script executed:
#!/bin/bash
# Find lingering references and anchors that may now be stale.
rg -nP -C2 '\bSave File\b|#save-file|/save-file' .Length of output: 15367
🌐 Web query:
Does the Langflow docs still use the anchor "components-processing#save-file" for this component, or has it changed to "write-file"?
💡 Result:
The docs still use the "save-file" anchor (components-processing#save-file). See the Processing components page (Save File section). [1]
Sources
[1] Langflow Documentation — Processing components (Save File).
Rename to "Write File" — update downstream references.
- Docs: rename the "Save File" section/anchor in docs/docs/Components/components-processing.mdx (## Save File at ~line 633) to "Write File" and update the component documentation URL in src/lfx/src/lfx/components/processing/save_file.py (currently points to ...#save-file) to ...#write-file.
- Starter project: change "display_name": "Save File" → "Write File" in src/backend/base/langflow/initial_setup/starter_projects/News Aggregator.json (around line 1237).
- Run a global search for "Save File" / "#save-file" / "/save-file" and update any remaining UI strings, links, or anchors to keep code, docs, and starter projects consistent.
🤖 Prompt for AI Agents
In src/lfx/src/lfx/components/processing/save_file.py around lines 18-19, the
component display_name should be changed from "Save File" to "Write File" and
its documentation URL fragment updated from "#save-file" to "#write-file"; also
update the docs file docs/docs/Components/components-processing.mdx (around the
Save File section at ~line 633) by renaming the "Save File" header/anchor to
"Write File" and ensure the anchor id matches, edit
src/backend/base/langflow/initial_setup/starter_projects/News Aggregator.json
(around line 1237) to change the "display_name" value to "Write File", and
finally perform a global repository search for occurrences of "Save File",
"#save-file", and "/save-file" to update UI strings, links, and anchors to the
new "Write File"/"#write-file" variants so code, docs, and starter projects
remain consistent.



This pull request updates the output definitions for several file-related components to consistently include the
tool_mode=Trueparameter. This change ensures that all relevant outputs are flagged for tool mode operation, which likely improves integration or compatibility with downstream tooling.Output configuration updates:
tool_mode=Trueto theOutputfor "Files" in the_base_outputslist ofbase_file.py.tool_mode=Trueto the "Raw Content" output in theFileComponentclass infile.py.update_outputsmethod ofFileComponentto includetool_mode=True, covering structured, raw, markdown, advanced, and path outputs for various file types and modes.tool_mode=Trueto the "File Path" output in theSaveToFileComponentinsave_file.py.Summary by CodeRabbit