Thesis

Datasets

The datasets used in this program are the following:

Name	Description	Source
DBNL	220 books	https://lab.kb.nl/dataset/dbnl-ocr-data-set
Historical Newspaper	2000 pages of newspaper ground-truth data	https://lab.kb.nl/dataset/historical-newspapers-ocr-ground-truth
IMPACT project	2055 book pages, 1024 newspaper pages, 1179 parliamentary proceedings, 205 radio bulletins	https://lab.kb.nl/dataset/ground-truth-impact-project
17th_century	Around 30,000 17th-century articles
Bible	The 'Statenvertaling' Bible from 1637

Overview of folders

data_loader

This folder contains all code needed to import the various data sources mentioned in the table above.

ImageProcessing

This folder contains all code related to image creation: creating the building blocks, and using those building blocks to compose the images with text.

utils

Contains three classes. The delpher class extracts information from Delpher. The settings class holds application-wide parameters. The util class holds helper functions used throughout the application.
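As an illustration only, a settings class that holds application-wide parameters could be sketched as a frozen dataclass. The class name, field names and default values below (`font_dir`, `output_dir`, `image_dpi`, `ocr_language`) are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass, replace
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    """Hypothetical application-wide parameters (all names are assumptions)."""

    font_dir: Path = Path("fonts")      # where font files are looked up
    output_dir: Path = Path("output")   # where generated images are written
    image_dpi: int = 300                # resolution of generated images
    ocr_language: str = "nld"           # language code handed to the OCR step


# One shared instance that other modules can import.
SETTINGS = Settings()

# Frozen dataclasses cannot be mutated; derive a modified copy instead.
debug_settings = replace(SETTINGS, image_dpi=72)
```

Keeping the instance immutable means no module can silently change a parameter that another module relies on.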

Overview of classes in src folder

Analyze.py and AnalyzeIndividual.py

Not part of the main pipeline; used only for creating the thesis report.

DataCreator.py

Probably one of the most important classes. It creates images and runs them through an OCR application using OCR.py. This class is called from create_training_dataset.py.
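The flow of creating an image from text and then running that image through OCR can be sketched roughly as follows. This is not the code in DataCreator.py: `Sample`, `create_samples`, and the `render` and `ocr` callables are hypothetical stand-ins, and the sketch assumes the goal is to keep each source text next to the OCR output of its image.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    """One example: the source text and what OCR read back from its image."""

    ground_truth: str
    ocr_text: str


def create_samples(
    texts: Iterable[str],
    render: Callable[[str], bytes],
    ocr: Callable[[bytes], str],
) -> List[Sample]:
    """Render each text to an image, OCR the image, and keep both strings."""
    samples = []
    for text in texts:
        image = render(text)      # hypothetical: text -> image bytes
        recognised = ocr(image)   # hypothetical: image bytes -> text
        samples.append(Sample(ground_truth=text, ocr_text=recognised))
    return samples


# Stand-in callables so the sketch runs without an image library or OCR engine.
def fake_render(text: str) -> bytes:
    return text.encode("utf-8")


def fake_ocr(image: bytes) -> str:
    # Simulate an OCR confusion by turning every "s" into "f".
    return image.decode("utf-8").replace("s", "f")


demo = create_samples(["des Heeren"], fake_render, fake_ocr)
```

In a real run, `render` and `ocr` would presumably be replaced by the image creation code and by OCR.py.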

create_training_dataset.py

Used to call DataCreator.py.

finetune_trainer.py

Used to train the actual models.

OCR.py

Used to extract the text from the images.

