Skip to content

miniHive/schemastore-analysis

Repository files navigation

JSON-Schema-Analysis

Description

JSON-Schema-Analysis is a project to analyse real-world JSON Schema documents towards their usage of components and features allowed by the JSON Schema standard.
Therefore the JSON Schema files available at the JSON Schema Store are downloaded and analysed using some Python scripts.

Running the Code

Preliminaries

To run this code on your device, you have to download or clone this repository. You must have a version of Python 3 installed. The project is tested with Python version Python 3.7.2. You'll also need to install some packages available with pip. The used packages are:

  • Networkx - A package for operations on graphs
  • PyDot - A package to operate with DOT-Format used to visualize graphs
  • Graphviz - A package to interact with the Graphviz Framework
  • MatPlotLib - A package to create plots with the results of the analysis
  • Pandas - A package used to import and export MS Excel sheets
  • Openpyxl - A package used by Pandas to export data to MS Excel format
  • jsonschema - A package which implements a JSON Schema Validator, that is used to validate
  • xlrd - A packacke for data extraction from Excel documents the Schemas themselves

The following table shows the versions of the packages, which were used to run the project.

Package Version
NetworkX 2.2
PyDot 1.4.1
Graphviz 0.10.1
Matplotlib 3.0.2
Pandas 0.23.4
Openpyxl 2.6.1
jsonschema 3.0.0a4.dev74+g5f5b865

Installing on Windows

To install the named packages run the following code in your Windows PowerShell.

> pip install networkx
> pip install pydot
> pip install graphviz
> pip install matplotlib
> pip install pandas
> pip install openpyxl
> pip install urllib3
> pip install git+https://github.com/Julian/jsonschema.git

It is necessary to install the jsonschema package directly from GitHub to get the latest version. The version available at PyPI does not include support for Schema draft 6 and 7.

Additionally you will have to install the GraphViz software from here. Select the most recent stable release (2.38 on 05/21/2019). The easiest way is to downlowad the msi file. Openening that will start an installer. NOTE: Don't forget to include the /bin directory of your GraphViz to your PATH environment variable on windows after installation. Depending on your installation the path to add to your PATH variable could look like this: C:\Program Files (x86)\Graphviz2.38\bin. For more information on how to change your PATH environment variable please have a look at this post ..

Installing on Unix

To install the named packages run the following code in your Windows PowerShell.

$ pip install networkx
$ pip install pydot
$ pip install graphviz
$ pip install matplotlib
$ pip install pandas
$ pip install openpyxl
$ pip install urllib3
$ pip install git+https://github.com/Julian/jsonschema.git

It is necessary to install the jsonschema package directly from GitHub to get the latest version. The version available at PyPI does not include support for Schema draft 6 and 7.

Additionally you will have to install the GraphViz software from here.

Running the code

After you appropriately installed the packages named above, you can rerun the analysis provided by this project. The main file is JSON_Schema_Analysis.py. It takes some optional command line arguments. Providing -a makes the code analyse all files provided in the directory JSON. With the argument -c <arg> you can specify the amount of files to analyse. It is also possible to print all results on the CLI with the argument -v.

To analyse additional schemas, you have to put them in the directory JSON and add them in the two CSV files responsible for category matching. These are filename_spec.csv and categorisation.csv, both located in the main directory of this project. For each file, you have to specify a nickname, the real filename and a category. Insert a line with nickname,filename in filename_spec.csv and a line with nickname,category in categorisation.csv for every additional JSON Schema you want to analyse. Although JSON_Schema_Analysis specifies the four categories app, data, conf and meta, it is capable of handling other categories just by specifing them as categoryin categorisation.csv.

Results

The results are stored as an Excel sheet named AnalysisResulst.xlsx in the projects main directory and as CSV file AnalysisResulsts.csv in the same directory. Plots have to be generated seperately with the provided scripts explained at the bottom of this document. AnalysisResulsts.xlsx consists of several columns containing information about a specific JSON Schema document in each row. The first column is giving the filename of the JSON Schema document located in the JSON directory. The following columns in the same row provide the information about this JSON Schema generated by the analysis. The following table will give an explanation of the meaning of each column.

Column name Meaning
add_prop_count Number of occurences of the additionalProperties keyword.
all_of_count Number of occurences of the allOf keyword.
any_of_count Number of occurences of the anyOf keyword.
array_count Number of occurences of the array keyword.
str_count Number of occurences of the string type keyword.
enum_count Number of occurences of the enum keyword.
mult_of_count Number of occurences of the multipleOf keyword.
not_count Number of occurences of the not keyword.
number_count Number of occurences of the integer plus number type keywords.
pattern_count Number of occurences of the pattern plus patternProperty keyword.
required_count Number of occurences of the required keyword.
unique_items_count Number of occurences of the uniqueItems keyword.
value_restriction_count Sum of occurences of the min, max, minLength, maxLength, exlusiveMinimum and exlusiveMaximum keywords.
boolean_count Number of occurences of the boolean type keyword.
nulltype_count Number of occurences of the null type keyword.
object_count Number of occurences of the object type keyword.
ref_count Number of occurences of the $ref keyword.
depth_schema Depth of the tree that emerges from loading the raw JSON Schema into an schema_graph .
depth_resolvedTree Depth of the tree after resolving the references. If has_recursion is true, this is the maximum cycle length in the recursive document.
fan_in Maximum Fan-In over all nodes included in the schema_graph.
fan_out Maximum Fan-Out over all nodes included in the schema_graph.
has_recursion Boolean flag that indicates whether the JSON Schema document (i.e. the resolved graph) is recursive.
min_cycle_len Minimum cycle length of a recursive document. If has_recursion is false, this column will be 0.
width Number of leaf nodes in the schema_graph of the raw JSON Schema document.
reachability Boolean flag that indicates whether the schema contains unreachable (unused) definitons.

Project Structure

The main part of the project is located in ./JSON_Schema_Analysis/JSON_Schema_Analysis. The Python script JSON_Schema_Analysis.py contains the main function. When started, it creates several processes equal to the number of virtual CPU cores available on the current machine. These processes are described in the file Analytic_Process.py. The project uses the python multiprocessing library and an Analytic_Process inherits from the process class defined there. Analytic_Processes fetch a file and perform all necessary analytic steps. The results are stored afterwards and a new file is fetched as long as unprocessed files are available. This is implemented to avoid problems with concurrency. The Analytic_Processes build schema_graph from the JSON Schema documents. These graphs are represented by the class defined in schema_graph.py. Most computational stuff is performed there. Three types of graph nodes are defined in the project: KeyValuenNodes, ArrayNodes and ObjectNodes that all inherit from SchemaNode defined in the files with the same name. The file load_schema_from_web.py is used to download additional files in the resolving process every time an external reference is required. The schema_checker.py file performs the validity check with one validator. All type counts and some other counts are performed using the visitor pattern. All used visitors are defined in the subdirectory Visitors. The Meta_Schemas directory contains the JSON Schema Meta Schemas for each draft.

All unit tests performed can be found in the directory PyTest. There is an additional ReadMe.md that describes the structure of the tests.

The top directory of the project contains the results in AnalysisResulst.xlsx and AnalysisResulsts.csv. The contained information is equal. The file categorisation.csv contains the mapping of JSON Schema document's short names to their category. The file filename_spec.csv contains the mapping from document short names (see schemastore.org) to the actual used filenames of the stored JSON Schema documents. Both files are used by the project to determine the category of each file. The file filename_spec.csv is generated by get_schemas_from_store.py. This script downloads all JSON Schema documents from schemastore.org, generates the filenames and stores the schemas in the directory JSON. The file typeCompareBoxplot_CombinedCount.py generates the plot typeCompareBoxplot_CombinedCount.png in the directory Plots by reading the required data from AnalysisResulsts.xlsx. The three barcharts are generated by hist.py. The file countsSpecialCategoriesTotal.csv is generated by table.py. The file writer.py implements helper functions for table.py. Before table.py can be executed, writer.py has to be run at least once.

The directory JsonSchemaAnalysis contains a reference implementation of the python project which was used to validate the calculated results.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages