
Gutenberg is a pipeline for training a neural network to segment and recognise frequent words in early printed books, with a focus on Gutenberg’s Bible.


LorenzoAgnolucci/Gutenberg


Gutenberg


About The Project

Gutenberg is a pipeline for training a neural network to segment and recognise frequent words in early printed books, with a focus on Gutenberg’s Bible. First, we describe the creation of a dataset covering only the Book of Genesis, built using dynamic programming techniques and projection profiles with the aid of a line-by-line transcription. We then use this dataset to train a Mask R-CNN model so that word segmentation and detection generalize to pages for which no transcription is available.
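To make the idea of a projection profile concrete, here is a minimal sketch in plain Python. It is not the code used in this repository: the function names, the threshold, and the toy page are assumptions made for illustration. The sketch sums the ink pixels in each row of a binarized image and treats every run of non-empty rows as one text line.

```python
from typing import List, Tuple


def horizontal_profile(binary: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in every row of a binarized page."""
    return [sum(row) for row in binary]


def find_text_lines(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges, end exclusive, where the profile exceeds the threshold."""
    lines = []
    start = None
    for y, value in enumerate(profile):
        if value > threshold and start is None:
            start = y                      # a text line begins
        elif value <= threshold and start is not None:
            lines.append((start, y))       # the text line ended on the previous row
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile)))
    return lines


# Hypothetical 6x4 binarized page: 1 = ink, 0 = background.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
profile = horizontal_profile(page)
text_lines = find_text_lines(profile)
print(profile)      # [0, 3, 3, 0, 2, 0]
print(text_lines)   # [(1, 3), (4, 5)]
```

Applying the same idea to columns (a vertical profile) inside each detected line gives candidate gaps between words. On real scans a threshold above zero is usually needed to tolerate noise.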

For more information, read the paper located in the repo root.

Built With

Getting Started

To get a local copy up and running follow these simple steps.

Prerequisites

The project provides a Pipfile that can be managed with pipenv. Installing pipenv is strongly encouraged in order to avoid dependency and reproducibility problems.

  • pipenv
pip3 install pipenv

Installation

  1. Clone the repo
git clone https://gitlab.com/turboillegali/gutenberg
  2. Install Python dependencies
pipenv install

Usage

Every file under src/ is executable. If you have pipenv installed, run a script with pipenv run $file so that the Python interpreter can find the project dependencies.
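If you prefer to drive the scripts from Python, the pipenv run $file invocation can be wrapped as in the sketch below. The helper names are hypothetical and not part of the project, and the sketch assumes that a script needs no extra command-line arguments, which this README does not specify.

```python
import subprocess
from typing import List


def pipenv_command(script: str) -> List[str]:
    """Build the argument list equivalent to `pipenv run <script>`."""
    return ["pipenv", "run", script]


def run_script(script: str) -> None:
    """Run one script inside the pipenv environment; raises if it exits with an error."""
    subprocess.run(pipenv_command(script), check=True)


# Only the command is printed here; call run_script(...) from the repo root to execute it.
print(pipenv_command("src/preprocessing.py"))
```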

Here's a brief description of each file under the src/ directory:

  • preprocessing.py: Image preprocessing (e.g. skew correction and cropping).
  • caput.py: Caput detection
  • punctuation.py: Punctuation detection (e.g. long accents, periods, ...)
  • lines.py: Line and column segmentation
  • words.py: Word segmentation (requires output from lines.py)
  • coco_dataset.py: COCO-like dataset building (requires output from words.py)
  • coco_dataset_chunks.py: Variant of coco_dataset.py in which the images are split into chunks of N lines each (N = 7 by default) instead of being kept as whole pages.
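The chunking performed by coco_dataset_chunks.py can be pictured with the sketch below. It is an illustration under assumptions, not the script's actual code: the function name is hypothetical, and whether the real script keeps a final chunk with fewer than N lines is not stated here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(lines: Sequence[T], n: int = 7) -> List[List[T]]:
    """Group consecutive lines into chunks of at most n lines each."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [list(lines[i:i + n]) for i in range(0, len(lines), n)]


# Hypothetical page with 16 segmented lines, identified by their index.
line_ids = list(range(16))
chunks = split_into_chunks(line_ids)
print([len(chunk) for chunk in chunks])   # [7, 7, 2]
```

In this sketch the last chunk is shorter when the number of lines is not a multiple of N; dropping it instead would be an equally plausible design.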

The dataset created with the previous steps can be used with the neural network available in the WALL_E_Net.ipynb notebook.
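For orientation, a COCO-style annotation file has three top-level lists: images, annotations, and categories. The sketch below assembles such a structure for one page. It follows the public COCO layout, but the helper name, the file name, the word labels, and the boxes are all made up for illustration; the exact fields written by coco_dataset.py may differ.

```python
import json
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height in pixels


def build_coco(image_name: str, width: int, height: int,
               words: List[Tuple[str, Box]]) -> Dict:
    """Assemble a COCO-style dictionary for a single page image."""
    categories: Dict[str, int] = {}
    annotations = []
    for ann_id, (label, (x, y, w, h)) in enumerate(words, start=1):
        # Each distinct word label becomes one category, numbered from 1.
        cat_id = categories.setdefault(label, len(categories) + 1)
        annotations.append({
            "id": ann_id,
            "image_id": 1,
            "category_id": cat_id,
            "bbox": [x, y, w, h],
            "area": w * h,
            "iscrowd": 0,
            # Rectangle polygon as a stand-in for a real word mask.
            "segmentation": [[x, y, x + w, y, x + w, y + h, x, y + h]],
        })
    return {
        "images": [{"id": 1, "file_name": image_name,
                    "width": width, "height": height}],
        "annotations": annotations,
        "categories": [{"id": cid, "name": name}
                       for name, cid in categories.items()],
    }


# Hypothetical page with three word boxes, two of them the same word.
coco = build_coco("page_001.png", 800, 1200, [
    ("et", (10, 20, 30, 12)),
    ("deus", (50, 20, 45, 12)),
    ("et", (10, 40, 30, 12)),
])
print(json.dumps(coco["categories"]))
```

Mask R-CNN implementations that read COCO data generally expect the segmentation field to hold polygons or run-length masks, so a real dataset would store the actual word outline rather than the bounding rectangle used here.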

