Bigdata profiler

This is a tool to profile your incoming data, check that it adheres to a registered schema, and run custom data quality checks. At the end of each run, a human-readable report is generated automatically that can be sent to stakeholders.

Features

  • Config-driven data profiling and schema validation
  • Automatic report generation after every run
  • Integration with the Datadog monitoring system
  • Extensible and highly customizable
  • Very little boilerplate code
  • Support for versioned schema validation

Data formats currently supported

  • CSV
  • JSON
  • Parquet

The tool can easily be extended to any format that Apache Spark supports for reads.
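
For illustration only, reading each of the currently supported formats with PySpark might look like the following sketch; the paths and session setup are placeholders, not part of this project:

from pyspark.sql import SparkSession

# Placeholder session; in this project the session is normally provided by the sparkmagic/Livy kernel.
spark = SparkSession.builder.appName("format-read-example").getOrCreate()

csv_df = spark.read.option("header", "true").csv("s3a://bucket/prefix/data.csv")
json_df = spark.read.json("s3a://bucket/prefix/data.json")
parquet_df = spark.read.parquet("s3a://bucket/prefix/data.parquet")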

SQL support for custom data quality checks

Supports both ANSI SQL and HiveQL. A list of all supported SQL functions can be found here.
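
The exact evaluation logic lives in the notebook and is not reproduced here; as a rough, illustrative sketch, a check such as customQ1 from the run example below could be evaluated against its configured threshold and operator along these lines (all names here are hypothetical):

import operator

# Illustrative only: compare the first value returned by a check query against a configured threshold.
OPS = {"=": operator.eq, "<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def run_check(spark, df, query, threshold, op):
    df.createOrReplaceTempView("dataset")        # the check queries reference a view named "dataset"
    result = spark.sql(query).collect()[0][0]    # take the first cell of the query result
    return OPS[op](result, threshold)

# e.g. the duplicate-id check from the run example:
# run_check(spark, df, "select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset", 0, "=")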


Run Instructions

All one has to do is execute the Python script papermill_notebook_runner.py. The script takes the following arguments, in order:

  • Path to the notebook to be run.
  • Path to the output notebook.
  • JSON configuration (passed as a string) that will drive the notebook.

For example:
python papermill_notebook_runner.py data-validator.ipynb output/data-validator.ipynb '{"dataFormat":"json","inputDataLocation":"s3a://bucket/prefix/generated.json","appName":"cust-profile-data-validation","schemaRepoUrl":"http://schemarepohostaddress","scheRepoSubjectName":"cust-profile","schemaVersionId":"0","customQ1":"select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset","customQ1ResultThreshold":0,"customQ1Operator":"=","customQ2":"select CAST(length(phone) as Long) from dataset","customQ2ResultThreshold":17,"customQ2Operator":"=","customQ3":"select CAST(count(distinct gender) as Long) from dataset","customQ3ResultThreshold":3,"customQ3Operator":"<="}'
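
Under the hood this maps naturally onto papermill's Python API; a minimal, hypothetical sketch of such a runner (the actual script in this repo may differ) is:

import json
import sys

import papermill as pm

# Hypothetical runner sketch: execute the input notebook, writing the output notebook,
# with the JSON config supplied as papermill parameters.
if __name__ == "__main__":
    input_nb, output_nb, config_json = sys.argv[1], sys.argv[2], sys.argv[3]
    pm.execute_notebook(input_nb, output_nb, parameters=json.loads(config_json))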

Install Instructions

There are several pieces involved.

  • First, install Jupyter Notebook. Install instructions here.
  • Next, install sparkmagic. Install instructions here.
  • Configure sparkmagic with your own Apache Livy endpoints. The config file should look like this (a quick endpoint sanity check is sketched after this list).
  • Install papermill from source after adding the sparkmagic kernels. Clone the papermill project from here.
  • Update the translators file to register the sparkmagic kernels at the very end of the file:
papermill_translators.register("sparkkernel", ScalaTranslator)
papermill_translators.register("pysparkkernel", PythonTranslator)
papermill_translators.register("sparkrkernel", RTranslator)
  • Finally, install the schema repo. Install instructions here.
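
Sparkmagic talks to Spark through Apache Livy, so it can help to confirm that your Livy endpoint is reachable before pointing sparkmagic at it. A quick, optional sanity check (not part of this project; the address is a placeholder) could be:

import requests

# Illustrative only: list Livy sessions to verify the endpoint responds.
LIVY_URL = "http://livy-host:8998"
resp = requests.get(LIVY_URL + "/sessions")
resp.raise_for_status()
print("Livy is reachable; active sessions:", resp.json().get("total", 0))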

More details

Find more details in this guide.

That should be it. Enjoy profiling!
