Hi there 👋

StabRise - Document Processing Solutions

Our projects

PDF DataSource for Apache Spark

Source Code: https://github.com/StabRise/spark-pdf

Home page: https://stabrise.com/spark-pdf/

Quick Start Jupyter Notebook: https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb

The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.

Key features:

Read PDF documents into a Spark DataFrame
Read PDF files lazily, page by page
Handle large files of up to 10k pages
Process scanned PDF files by running OCR
No separate Tesseract OCR installation is needed; it is included in the package
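
As a rough illustration of reading PDFs into a DataFrame, here is a minimal sketch. It assumes the data source registers under the short format name "pdf" and that reader options are passed as plain string pairs; both are assumptions, so check the home page or the Quick Start notebook for the actual format name and supported options. The helper `read_pdf` is hypothetical and not part of the package.

```python
from typing import Any, Mapping, Optional

# Assumption: the data source is registered under the short name "pdf".
PDF_FORMAT = "pdf"


def read_pdf(spark: Any, path: str,
             options: Optional[Mapping[str, str]] = None) -> Any:
    """Hypothetical helper: read PDF file(s) at `path` via the Spark reader API.

    `spark` is expected to be a SparkSession (or any object exposing the same
    `read.format(...).option(...).load(...)` chain).
    """
    reader = spark.read.format(PDF_FORMAT)
    # Apply caller-supplied reader options one by one; names are not validated here.
    for key, value in (options or {}).items():
        reader = reader.option(key, value)
    return reader.load(path)
```

With a real SparkSession that has the spark-pdf package on its classpath, a call such as `read_pdf(spark, "path/to/file.pdf")` would return a DataFrame. Because Spark evaluates lazily, pages would only be processed once an action runs on that DataFrame.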

ScaleDP

ScaleDP is an open-source library for processing documents using Apache Spark.

Key features:

Load PDF documents/images
Extract text from PDF documents/images
Extract images from PDF documents
Run OCR on images/PDF documents
Run NER on text extracted from PDF documents/images
Visualize NER results

De-Identify

De-Identify is a tool for de-identifying/anonymizing data.

Supported formats

Text
Images
PDF documents
DICOM files
