MaxTEX310/UniOne

UniOne tip

The papers related to this dataset will be submitted to ICDAR 2025. To comply with blind-review requirements, contact information will be made public on May 25, 2025.

Update

March 5, 2025: Uploaded the supplementary explanation file for the UniOne dataset.

March 5, 2025: Uploaded an example of the UniOne dataset; you can obtain it here.

April 12, 2025: Uploaded the complete dataset to Baidu Netdisk; it will be fully opened after the paper is accepted by ICDAR.

Instructions

Note: The UniOne dataset may be used only for non-commercial research purposes. Scholars or organizations who wish to use the UniOne dataset should first fill out the application form (to be added) and send it to us by email (contact information to be provided), preferably from an institutional address so that we can verify your identity quickly. With the application form, please list or attach 1-2 papers you have published in the past 5 years to show that you (or your team) have conducted research in related fields such as OCR, mathematical expression recognition, document image processing, or visual information extraction. At present, the dataset is available free of charge only to scholars in these fields. After receiving your application, we will verify your information within two weeks and send you the download link and decompression password for the dataset.

In addition, if you have other plans, you are welcome to collaborate with us, and we will provide more datasets on that basis.

License

The UniOne dataset may be used only for non-commercial research purposes, under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.

UniOne Dataset

Document parsing currently suffers from "task silos" caused by fragmented datasets. As deep learning becomes widely used in document understanding, a unified, shared dataset that lets upstream and downstream tasks be developed together has become a clear need. We have built UniOne, the first document dataset that supports parsing across both upstream and downstream tasks. It systematically integrates layout analysis, text line detection and recognition, and table recognition into a single cross-task annotation dataset:

(1) At the layout analysis level, it includes 236,790 paragraph-level annotations across 14,481 pages, covering 11 semantic categories.

(2) At the text line detection level, it builds on the layout analysis data and adds fine-grained annotations for 340,890 lines in 198,901 text paragraphs.

(3) For complex scenarios, it introduces 8,000 challenging handwritten mathematical expressions, 18,717 printed mathematical formulas, 26,849 formula texts with unified recognition annotations, and 1,169 tables extracted from papers, to fully support document content parsing.

To our knowledge, this dataset is the first to support cross-task joint modeling from macro layouts to micro elements. It overcomes the limitations of traditional single-task datasets and provides essential infrastructure for building the next generation of intelligent document parsing systems.
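For a rough sense of annotation density, the counts quoted above can be turned into simple averages. The sketch below is illustrative only: the constant and function names are hypothetical, it uses only the published totals rather than the dataset files, and the last ratio assumes the 198,901 text paragraphs are a subset of the 236,790 layout annotations, which the description suggests but does not state outright.

```python
# Counts quoted from the dataset description; nothing here reads the dataset itself.
PAGES = 14_481
LAYOUT_ANNOTATIONS = 236_790
TEXT_PARAGRAPHS = 198_901
TEXT_LINES = 340_890


def density_summary() -> dict[str, float]:
    """Derive simple averages from the published totals."""
    return {
        # Average number of paragraph-level layout annotations per page.
        "layout_annotations_per_page": LAYOUT_ANNOTATIONS / PAGES,
        # Average number of annotated lines per text paragraph.
        "text_lines_per_paragraph": TEXT_LINES / TEXT_PARAGRAPHS,
        # ASSUMPTION: text paragraphs are a subset of the layout annotations.
        "text_paragraph_share": TEXT_PARAGRAPHS / LAYOUT_ANNOTATIONS,
    }


if __name__ == "__main__":
    for name, value in density_summary().items():
        print(f"{name}: {value:.2f}")
```

These are page-level and paragraph-level averages only; the actual distribution across the 11 semantic categories would have to be read from the dataset annotations.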

fig1
