Gutenberg is a pipeline for training a neural network to segment and recognise frequent words in early printed books; in particular, we focus on Gutenberg's Bible.
First, we describe the creation of a dataset covering only the Book of Genesis, built using dynamic programming techniques and projection profiles with the aid of a line-by-line transcription. We then leverage this dataset to train a Mask R-CNN model in order to generalise word segmentation and detection to pages where no transcription is available.
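As a rough illustration of the projection-profile idea (a minimal sketch of the general technique, not the project's actual code; the function name and threshold are our own), text lines can be located by counting ink pixels per row and extracting runs above a threshold:

```python
import numpy as np

def line_boundaries(binary_page: np.ndarray, threshold: int = 0) -> list[tuple[int, int]]:
    """Return (top, bottom) row intervals of text lines in a binarized page.

    `binary_page` is a 2D array with ink pixels set to 1 and background to 0.
    The horizontal projection profile is the per-row ink count; maximal runs
    of rows whose count exceeds `threshold` are taken to be text lines.
    """
    profile = binary_page.sum(axis=1)        # ink pixels per row
    is_text = profile > threshold            # rows containing ink
    boundaries, start = [], None
    for row, text in enumerate(is_text):
        if text and start is None:
            start = row                      # a line begins
        elif not text and start is not None:
            boundaries.append((start, row))  # a line ends
            start = None
    if start is not None:                    # last line touches the page bottom
        boundaries.append((start, len(is_text)))
    return boundaries
```

The same profile computed column-wise yields column boundaries.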
For more information, read the paper located in the repo root.
To get a local copy up and running, follow these simple steps.
The project provides a `Pipfile` that can be managed with `pipenv`. Installing `pipenv` is strongly encouraged in order to avoid dependency and reproducibility problems.
- Install `pipenv`
  ```sh
  pip3 install pipenv
  ```
- Clone the repo
  ```sh
  git clone https://gitlab.com/turboillegali/gutenberg
  ```
- Install Python dependencies
  ```sh
  pipenv install
  ```
Every file under `src/` is executable. If you have `pipenv` installed, running a script so that the Python interpreter can find the project dependencies is as easy as `pipenv run $file`, e.g. `pipenv run src/preprocessing.py`.
Here's a brief description of each file under the `src/` directory:
- `preprocessing.py`: Image preprocessing (e.g. skew correction and cropping).
- `caput.py`: Caput detection.
- `punctuation.py`: Punctuation detection (e.g. long accents, periods, ...).
- `lines.py`: Line and column segmentation.
- `words.py`: Word segmentation (requires output from `lines.py`).
- `coco_dataset.py`: COCO-like dataset building (requires output from `words.py`; see the loading sketch after this list).
- `coco_dataset_chunks.py`: Variant of `coco_dataset.py` where, instead of whole pages, the images are split into chunks of N lines each (N = 7 by default).
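Because `coco_dataset.py` builds a COCO-like dataset, its output should be readable with standard COCO tooling. Below is a minimal loading sketch, assuming `pycocotools` is installed; the `annotations.json` path is a placeholder for wherever the script actually writes its output:

```python
from pycocotools.coco import COCO

# Placeholder path: point this at the JSON file produced by coco_dataset.py.
coco = COCO("annotations.json")

img_ids = coco.getImgIds()
first = coco.loadImgs(img_ids[0])[0]    # image metadata (file_name, width, height, ...)
ann_ids = coco.getAnnIds(imgIds=first["id"])
words = coco.loadAnns(ann_ids)          # one annotation per segmented word
print(first["file_name"], "->", len(words), "word annotations")
```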
The dataset created with the previous steps can be used with the neural network available in the `WALL_E_Net.ipynb` notebook.
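The notebook's exact setup isn't described here, but since the model is a Mask R-CNN, the usual torchvision recipe for adapting a pre-trained Mask R-CNN to a two-class problem (background + word) may help readers who want to experiment outside the notebook. Treat it as a sketch of the general technique, not a copy of `WALL_E_Net.ipynb`:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_word_maskrcnn(num_classes: int = 2):
    """Mask R-CNN with box and mask heads resized for `num_classes`."""
    # COCO-pretrained backbone; older torchvision versions use pretrained=True.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

    # Swap the box classification head for one with our class count.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # Swap the mask prediction head likewise.
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
    return model
```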