Need help with personal docx files #1

@mattolson93

Hello, great poster at NeurIPS, and it was good to meet you all!

I have some custom docx files (PDFs that I converted to docx with Adobe) that I am trying to extract text from. I was able to get the Docker container up and running, and I've modified run_single_node.sh to run just the annotation step on my_docxs.tar.gz in the data folder. The script seems to execute, but nothing shows up in the failed or extracted text output. What am I doing wrong? I've pasted the whole log below. I've also tried a tar containing just a simple docx with random text, to verify that my converted files aren't causing the issue.

Lastly, a working demo that runs on a personal set of docx files would be very helpful for debugging.

Thanks,
Matt Olson

[2023-12-21 21:21:36,464]::MainProcess          ::INFO::source_tars: [PosixPath('data/paper.tar.gz'), PosixPath('data/paper2.tar.gz')]
[2023-12-21 21:21:36,471]::MainProcess          ::INFO::args: {'data_dir': 'data', 'output_dir': './data/out', 'input_files': None, 'crawl_id': 'test', 'max_docs': -1, 'soffice_executable': 'soffice', 'config': 'configs/default_config.yaml', 'job_id': None}
[2023-12-21 21:21:36,475]::MainProcess          ::INFO::results_dir: data/out
[2023-12-21 21:21:36,476]::MainProcess          ::INFO::annotations_dir: data/out/multimodal
[2023-12-21 21:21:36,476]::MainProcess          ::INFO::meta_dir: data/out/meta
[2023-12-21 21:21:36,477]::MainProcess          ::INFO::text_dir: data/out/text
[2023-12-21 21:21:36,477]::MainProcess          ::INFO::failed_dir: data/out/failed
[2023-12-21 21:21:36,477]::MainProcess          ::INFO::num_annotators: 2
[2023-12-21 21:21:36,482]::MainProcess          ::INFO::max_docs_per_process: -1
[2023-12-21 21:21:36,486]::AnnotationMonitor-2  ::INFO::Start monitoring...
[2023-12-21 21:21:41,273]::MainProcess          ::INFO::soffice(PID=104) started @ localhost:38357
[2023-12-21 21:21:41,276]::MainProcess          ::INFO::initialized.
[2023-12-21 21:21:41,277]::MainProcess          ::INFO::input_tars=[PosixPath('data/paper.tar.gz')]
[2023-12-21 21:21:45,824]::MainProcess          ::INFO::soffice(PID=178) started @ localhost:58509
[2023-12-21 21:21:45,827]::MainProcess          ::INFO::initialized.
[2023-12-21 21:21:45,827]::MainProcess          ::INFO::input_tars=[PosixPath('data/paper2.tar.gz')]
[2023-12-21 21:21:45,838]::AnnotatorProcess-4   ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0001 start processing data/paper2.tar.gz.
[2023-12-21 21:21:45,837]::AnnotatorProcess-3   ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0000 start processing data/paper.tar.gz.
[2023-12-21 21:21:45,857]::AnnotatorProcess-3   ::ERROR::(self.run) FileNotFoundError: [Errno 2] No such file or directory: '/usr/app/data/tmp/tmpyo4oefe3'
[2023-12-21 21:21:45,857]::AnnotatorProcess-3   ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0000 finished. Shutting down.
[2023-12-21 21:21:45,860]::AnnotatorProcess-3   ::INFO::shutting down soffice process with pid 104
[2023-12-21 21:21:45,944]::AnnotatorProcess-4   ::ERROR::(self.run) FileNotFoundError: [Errno 2] No such file or directory: '/usr/app/data/tmp/tmp3veibb1j'
[2023-12-21 21:21:45,945]::AnnotatorProcess-4   ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0001 finished. Shutting down.
[2023-12-21 21:21:45,947]::AnnotatorProcess-4   ::INFO::shutting down soffice process with pid 178
[2023-12-21 21:21:46,892]::MainProcess          ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0000 done.
[2023-12-21 21:21:46,971]::MainProcess          ::INFO::annotator_3bae1666-e805-4ada-8543-0242642f26eb_0001 done.
[2023-12-21 21:21:46,980]::AnnotationMonitor-2  ::INFO::AnnotationMonitor done.
[2023-12-21 21:21:46,991]::MainProcess          ::INFO::annotation done.
[2023-12-21 21:21:46,992]::MainProcess          ::INFO::total time: 0:00:10.557422
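
One thing the log suggests checking: both annotator processes stop with a FileNotFoundError on a path under /usr/app/data/tmp right after they start. Below is a minimal sanity-check sketch in Python, not taken from the repository. It assumes (unconfirmed) that the pipeline expects a tmp directory inside the data folder and that each input archive should contain .docx members. The helper names `docx_members` and `ensure_tmp_dir` are hypothetical.

```python
import tarfile
from pathlib import Path


def docx_members(tar_path):
    """Return the names of regular .docx files inside a .tar.gz archive."""
    with tarfile.open(tar_path, "r:gz") as tar:
        return [
            m.name
            for m in tar.getmembers()
            if m.isfile() and m.name.lower().endswith(".docx")
        ]


def ensure_tmp_dir(data_dir):
    """Create <data_dir>/tmp if it is missing and return its path.

    Assumption: the FileNotFoundError in the log is caused by this
    directory not existing. The log does not confirm that.
    """
    tmp_dir = Path(data_dir) / "tmp"
    tmp_dir.mkdir(parents=True, exist_ok=True)
    return tmp_dir


if __name__ == "__main__":
    data_dir = Path("data")  # same relative data folder as in the log
    print("tmp dir:", ensure_tmp_dir(data_dir))
    for tar_path in sorted(data_dir.glob("*.tar.gz")):
        print(tar_path, "->", docx_members(tar_path))
```

Run from the directory that contains the data folder (inside the container, if that is where the script runs). If an archive lists no .docx members, or the error disappears once data/tmp exists, that would narrow down where the problem is.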
