This is a tool to profile your incoming data, check whether it adheres to a registered schema, and run custom data quality checks. At the end of a run, a human-readable report is auto-generated that can be sent to stakeholders.
Features:
- Config-driven data profiling and schema validation
- Auto-generation of a report after every run
- Integration with the Datadog monitoring system
- Extensible and highly customizable
- Very little boilerplate code
- Support for versioned schema validation
Supported input formats:
- CSV
- JSON
- Parquet
The tool can easily be extended to all the formats that Apache Spark supports for reads.
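As an illustration of why new formats are cheap to add, reading any of these in PySpark differs only by the format string. A minimal sketch (the values mirror the sample configuration shown later; the session setup here is an assumption for illustration, not the tool's actual code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cust-profile-data-validation").getOrCreate()

# dataFormat and inputDataLocation come from the JSON config passed to the runner
data_format = "json"                                   # or "csv", "parquet", ...
input_location = "s3a://bucket/prefix/generated.json"  # any Spark-readable source

# Supporting a new format is just a different format string here
df = spark.read.format(data_format).load(input_location)
```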
Both ANSI SQL and HiveQL are supported. A list of all supported SQL functions can be found here.
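Concretely, each custom check in the configuration runs against a temp view named dataset, and the query's scalar result is compared to the configured threshold using the configured operator. Continuing the read sketch above, the duplicate-key check from the sample configuration boils down to something like this (a hedged sketch, not the tool's actual implementation):

```python
# Register the input DataFrame under the name the custom queries expect
df.createOrReplaceTempView("dataset")

# customQ1 from the sample configuration: a duplicate-key check
query = "select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset"
diff = spark.sql(query).collect()[0][0]

# customQ1ResultThreshold=0 with customQ1Operator="=" means: no duplicate _id values
assert diff == 0, f"duplicate _id check failed: diff={diff}"
```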
The project ships with the following pieces:
- Data validator notebook tool
- Sample dataset
- Sample dataset schema
- Sample result report
- Runner script
All one has to do is execute the Python script papermill_notebook_runner.py. The script takes the following arguments, in order:
- Path to the notebook to be run.
- Path to the output notebook.
- JSON configuration that will drive the notebook.
```bash
python papermill_notebook_runner.py data-validator.ipynb output/data-validator.ipynb '{"dataFormat":"json","inputDataLocation":"s3a://bucket/prefix/generated.json","appName":"cust-profile-data-validation","schemaRepoUrl":"http://schemarepohostaddress","scheRepoSubjectName":"cust-profile","schemaVersionId":"0","customQ1":"select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset","customQ1ResultThreshold":0,"customQ1Operator":"=","customQ2":"select CAST(length(phone) as Long) from dataset","customQ2ResultThreshold":17,"customQ2Operator":"=","customQ3":"select CAST(count(distinct gender) as Long) from dataset","customQ3ResultThreshold":3,"customQ3Operator":"<="}'
```
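The runner can be a thin wrapper around papermill's execute_notebook API. A minimal sketch of what such a script might look like, assuming the three positional arguments above (the shipped papermill_notebook_runner.py may differ):

```python
# papermill_notebook_runner.py -- minimal sketch, not necessarily the shipped script
import json
import sys

import papermill as pm


def main() -> None:
    # Positional arguments: input notebook, output notebook, JSON configuration
    input_nb, output_nb, config_json = sys.argv[1:4]
    params = json.loads(config_json)
    # papermill injects `params` into the notebook's "parameters" cell
    pm.execute_notebook(input_nb, output_nb, parameters=params)


if __name__ == "__main__":
    main()
```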
There are several pieces involved in setting this up:
- First install Jupyter Notebook. Install instructions here.
- Next install sparkmagic. Install instructions here.
- Configure sparkmagic with your own Apache Livy endpoints. The config file should look like this.
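If the linked example is unavailable, the pieces that matter in ~/.sparkmagic/config.json are the per-kernel credential blocks that point at your Livy server. A minimal sketch (host, port, and auth values are placeholders):

```json
{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://your-livy-host:8998",
    "auth": "None"
  },
  "kernel_scala_credentials": {
    "username": "",
    "password": "",
    "url": "http://your-livy-host:8998",
    "auth": "None"
  }
}
```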
- Install papermill from source so the sparkmagic kernels can be registered. Clone the papermill project from here.
- Update the translators file (papermill/translators.py) to register the sparkmagic kernels at the very end of the file, so papermill knows how to inject parameters into notebooks running on those kernels:
papermill_translators.register("sparkkernel", ScalaTranslator)
papermill_translators.register("pysparkkernel", PythonTranslator)
papermill_translators.register("sparkrkernel", RTranslator)
- Next install schema-repo. Install instructions here.
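For a sense of how the notebook can pull the registered schema, here is a hedged sketch using the schemaRepoUrl, scheRepoSubjectName, and schemaVersionId values from the sample configuration. The {subject}/id/{version} endpoint layout is an assumption about the schema-repo REST API; adjust it to match your deployment:

```python
import requests

# Values mirror the sample configuration; the endpoint layout is an assumption
schema_repo_url = "http://schemarepohostaddress"
subject = "cust-profile"          # scheRepoSubjectName
version_id = "0"                  # schemaVersionId

resp = requests.get(f"{schema_repo_url}/{subject}/id/{version_id}")
resp.raise_for_status()
registered_schema = resp.text     # schema text to validate the dataset against
```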
Find more details in this guide.
That should be it. Enjoy profiling!