Skip to content

StabRise/.github

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 

Repository files navigation

Hi there 👋

StabRise - Document Processing Solutions

Our projects

PDF DataSource for the Apache Spark

Spark Pdf


Source Code: https://github.com/StabRise/spark-pdf

Home page: https://stabrise.com/spark-pdf/

Quick Start Jupyter Notebook: https://github.com/StabRise/spark-pdf/blob/main/examples/PdfDataSource.ipynb


The project provides a custom data source for the Apache Spark that allows you to read PDF files into the Spark DataFrame.

Key features:

  • Read PDF documents to the Spark DataFrame
  • Support read PDF files lazy per page
  • Support big files, up to 10k pages
  • Support scanned PDF files (call OCR)
  • No need to install Tesseract OCR, it's included in the package

ScaleDP

ScaleDP

ScaleDP is an Open-Source Library for processing documents using Apache Spark.

Key features:

  • Load PDF documents/Images
  • Extract text from PDF documents/Images
  • Extract images from PDF documents
  • OCR Images/PDF documents
  • Run NER on text extracted from PDF documents/Images
  • Visualize NER results

De-Identify

De-Identify

De-Identify is tool for de-identification/anonymization data

Supported formats

  • text
  • images
  • pdf documents
  • DICOM files