Gutenberg is a pipeline for training a neural network to segment and recognise frequent words in early printed books; in particular, we focus on Gutenberg's Bible.
First, we describe the creation of a dataset covering only the Book of Genesis, built using dynamic programming techniques and projection profiles with the aid of a line-by-line transcription. We then leverage this dataset to train a Mask R-CNN model in order to generalise word segmentation and detection to pages where no transcription is available.
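As a rough illustration of the projection-profile idea (a minimal sketch of the general technique, not the project's actual code; the function name and threshold are our own), text lines can be located by counting ink pixels per row and extracting runs above a threshold:

```python
import numpy as np

def line_boundaries(binary_page: np.ndarray, threshold: int = 0) -> list[tuple[int, int]]:
    """Return (top, bottom) row intervals of text lines in a binarized page.

    `binary_page` is a 2D array with ink pixels set to 1 and background to 0.
    The horizontal projection profile is the per-row ink count; maximal runs
    of rows whose count exceeds `threshold` are taken to be text lines.
    """
    profile = binary_page.sum(axis=1)        # ink pixels per row
    is_text = profile > threshold            # rows containing ink
    boundaries, start = [], None
    for row, text in enumerate(is_text):
        if text and start is None:
            start = row                      # a line begins
        elif not text and start is not None:
            boundaries.append((start, row))  # a line ends
            start = None
    if start is not None:                    # last line touches the page bottom
        boundaries.append((start, len(is_text)))
    return boundaries
```

The same profile computed column-wise yields column boundaries.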
For more information, read the paper located in the repo root.
To get a local copy up and running, follow these simple steps.
The project provides a `Pipfile` that can be managed with `pipenv`. Installing `pipenv` is strongly encouraged in order to avoid dependency and reproducibility problems.
- Install `pipenv`
  ```sh
  pip3 install pipenv
  ```
- Clone the repo
  ```sh
  git clone https://gitlab.com/turboillegali/gutenberg
  ```
- Install Python dependencies
  ```sh
  pipenv install
  ```
Every file under `src/` is executable. If you have `pipenv` installed, running a script so that the Python interpreter can find the project dependencies is as easy as `pipenv run $file`, e.g. `pipenv run src/preprocessing.py`.
Here's a brief description of each file under the `src/` directory:
- `preprocessing.py`: Image preprocessing (e.g. skew correction and cropping).
- `caput.py`: Caput detection.
- `punctuation.py`: Punctuation detection (e.g. long accents, periods, ...).
- `lines.py`: Line and column segmentation.
- `words.py`: Word segmentation (requires output from `lines.py`).
- `coco_dataset.py`: COCO-like dataset building (requires output from `words.py`; see the loading sketch after this list).
- `coco_dataset_chunks.py`: Variant of `coco_dataset.py` where, instead of whole pages, the images are split into chunks of N lines each (N = 7 by default).
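Because `coco_dataset.py` builds a COCO-like dataset, its output should be readable with standard COCO tooling. Below is a minimal loading sketch, assuming `pycocotools` is installed; the `annotations.json` path is a placeholder for wherever the script actually writes its output:

```python
from pycocotools.coco import COCO

# Placeholder path: point this at the JSON file produced by coco_dataset.py.
coco = COCO("annotations.json")

img_ids = coco.getImgIds()
first = coco.loadImgs(img_ids[0])[0]    # image metadata (file_name, width, height, ...)
ann_ids = coco.getAnnIds(imgIds=first["id"])
words = coco.loadAnns(ann_ids)          # one annotation per segmented word
print(first["file_name"], "->", len(words), "word annotations")
```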
The dataset created with the previous steps can be used with the neural network available in the `WALL_E_Net.ipynb` notebook.
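The notebook's exact setup isn't described here, but since the model is a Mask R-CNN, the usual torchvision recipe for adapting a pre-trained Mask R-CNN to a two-class problem (background + word) may help readers who want to experiment outside the notebook. Treat it as a sketch of the general technique, not a copy of `WALL_E_Net.ipynb`:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_word_maskrcnn(num_classes: int = 2):
    """Mask R-CNN with box and mask heads resized for `num_classes`."""
    # COCO-pretrained backbone; older torchvision versions use pretrained=True.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

    # Swap the box classification head for one with our class count.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # Swap the mask prediction head likewise.
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
    return model
```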