JSON-Schema-Analysis is a project that analyses real-world JSON Schema documents with respect to their use of the components and features allowed by the JSON Schema standard. To this end, the JSON Schema files available at the JSON Schema Store are downloaded and analysed with Python scripts.
To run this code on your device, download or clone this repository. You must have a version of Python 3 installed; the project is tested with Python 3.7.17. You also need to install the required packages with pipenv. Additionally, you have to install the GraphViz software.
Once the packages named above are installed, you can rerun the analysis provided by this project.
The main file is JSON_Schema_Analysis.py. It takes some optional command line arguments. Providing -a makes the code analyse all files in the directory JSON. With the argument -c <arg> you can specify the number of files to analyse. It is also possible to print all results on the command line with the argument -v.
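The three options above could be declared roughly as follows. This is a minimal sketch with argparse, not the project's actual parser: the function name `build_parser` and the attribute names `analyse_all`, `count` and `verbose` are assumptions made for illustration, and treating `-c` as an integer is inferred from its description.

```python
import argparse


def build_parser():
    """Build a parser for the options -a, -c <arg> and -v (illustrative only)."""
    parser = argparse.ArgumentParser(prog="JSON_Schema_Analysis.py")
    # -a: analyse every file found in the JSON directory
    parser.add_argument("-a", action="store_true", dest="analyse_all")
    # -c <arg>: analyse only the given number of files (assumed to be an integer)
    parser.add_argument("-c", type=int, default=None, dest="count")
    # -v: additionally print all results on the command line
    parser.add_argument("-v", action="store_true", dest="verbose")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.analyse_all, args.count, args.verbose)
```

With such a parser, an invocation like `python JSON_Schema_Analysis.py -c 10 -v` would request ten files and verbose output. What the real script does when neither -a nor -c is given is not documented here.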
To analyse additional schemas, you have to put them in the directory JSON and add them to the two CSV files responsible for category matching. These are filename_spec.csv and categorisation.csv, both located in the main directory of this project. For each file, you have to specify a nickname, the real filename and a category. Insert a line with nickname,filename in filename_spec.csv and a line with nickname,category in categorisation.csv for every additional JSON Schema you want to analyse.
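Registering a schema by hand means adding one line to each file. The helper below is a hypothetical convenience sketch (`register_schema` is not part of the project) and assumes both files hold plain two-column, comma-separated rows and end with a newline.

```python
import csv
from pathlib import Path


def register_schema(project_dir, nickname, filename, category):
    """Append one schema to filename_spec.csv and categorisation.csv."""
    project_dir = Path(project_dir)
    rows = {
        "filename_spec.csv": [nickname, filename],    # nickname,filename
        "categorisation.csv": [nickname, category],   # nickname,category
    }
    for csv_name, row in rows.items():
        # newline="" lets the csv module control line endings itself.
        # If an existing file lacks a trailing newline, the new row would be
        # glued to the last line, so check that before appending.
        with open(project_dir / csv_name, "a", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerow(row)
```

For example, `register_schema(".", "myschema", "myschema.json", "conf")` would add `myschema,myschema.json` and `myschema,conf` to the respective files. The schema file itself still has to be copied into the directory JSON.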
Although JSON_Schema_Analysis defines the four categories app, data, conf and meta, it can handle other categories as well: simply specify them as the category in categorisation.csv.
The results are stored as an Excel sheet named AnalysisResults.xlsx in the project's main directory and as a CSV file named AnalysisResults.csv in the same directory. Plots have to be generated separately with the provided scripts explained at the bottom of this document.
AnalysisResults.xlsx consists of several columns; each row contains information about one specific JSON Schema document.
The first column gives the filename of the JSON Schema document located in the JSON directory. The following columns in the same row provide the information about this JSON Schema generated by the analysis.
The following table explains the meaning of each column.
Column name | Meaning |
---|---|
add_prop_count | Number of occurrences of the additionalProperties keyword. |
all_of_count | Number of occurrences of the allOf keyword. |
any_of_count | Number of occurrences of the anyOf keyword. |
array_count | Number of occurrences of the array keyword. |
str_count | Number of occurrences of the string type keyword. |
enum_count | Number of occurrences of the enum keyword. |
mult_of_count | Number of occurrences of the multipleOf keyword. |
not_count | Number of occurrences of the not keyword. |
number_count | Number of occurrences of the integer plus number type keywords. |
pattern_count | Number of occurrences of the pattern plus patternProperties keywords. |
required_count | Number of occurrences of the required keyword. |
unique_items_count | Number of occurrences of the uniqueItems keyword. |
value_restriction_count | Sum of occurrences of the min, max, minLength, maxLength, exclusiveMinimum and exclusiveMaximum keywords. |
boolean_count | Number of occurrences of the boolean type keyword. |
nulltype_count | Number of occurrences of the null type keyword. |
object_count | Number of occurrences of the object type keyword. |
ref_count | Number of occurrences of the $ref keyword. |
depth_schema | Depth of the tree that emerges from loading the raw JSON Schema into a schema_graph. |
depth_resolvedTree | Depth of the tree after resolving the references. If has_recursion is true, this is the maximum cycle length in the recursive document. |
fan_in | Maximum fan-in over all nodes included in the schema_graph. |
fan_out | Maximum fan-out over all nodes included in the schema_graph. |
has_recursion | Boolean flag that indicates whether the JSON Schema document (i.e. the resolved graph) is recursive. |
min_cycle_len | Minimum cycle length of a recursive document. If has_recursion is false, this column will be 0. |
width | Number of leaf nodes in the schema_graph of the raw JSON Schema document. |
reachability | Boolean flag that indicates whether the schema contains unreachable (unused) definitions. |
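To make the structural columns more concrete, the sketch below computes depth, width and maximum fan-out directly on a parsed JSON value. It is an approximation under stated assumptions: the project computes these values on its own schema_graph, whose node model may differ, and counting depth in edges and treating empty containers as leaves are choices made here for illustration. The function name `tree_metrics` is hypothetical.

```python
def tree_metrics(node):
    """Return (depth, width, max_fan_out) of a parsed JSON value seen as a tree."""
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, list):
        children = list(node)
    else:
        children = []
    if not children:
        # Scalars and empty containers count as leaves: depth 0, width 1.
        return 0, 1, 0
    depth, width, fan_out = 0, 0, len(children)
    for child in children:
        child_depth, child_width, child_fan_out = tree_metrics(child)
        depth = max(depth, child_depth + 1)   # longest path to a leaf, in edges
        width += child_width                  # total number of leaves
        fan_out = max(fan_out, child_fan_out)
    return depth, width, fan_out


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}
print(tree_metrics(schema))  # (4, 4, 2)
```

Fan-in, recursion and cycle lengths only become meaningful once $ref edges are added to the graph, so they are not covered by this tree-only sketch.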
The Python script JSON_Schema_Analysis.py contains the main function.
When started, it creates as many processes as there are virtual CPU cores available on the current machine. These processes are described in the file Analytic_Process.py.
The project uses the Python multiprocessing library, and an Analytic_Process inherits from the Process class defined there.
Each Analytic_Process fetches a file and performs all necessary analysis steps. It then stores the results and fetches a new file as long as unprocessed files are available. This design avoids concurrency problems.
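The fetch-analyse-store loop described above can be sketched with a shared task queue. This is not the code from Analytic_Process.py: the class name `AnalyticWorker`, the placeholder `analyse` function, the `run_all` helper and the use of `None` as a stop sentinel are all assumptions for illustration.

```python
import multiprocessing


def analyse(filename):
    """Placeholder analysis; the real steps build and inspect a schema_graph."""
    return len(filename)


class AnalyticWorker(multiprocessing.Process):
    """Hypothetical stand-in for Analytic_Process."""

    def __init__(self, tasks, results):
        super().__init__()
        self.tasks = tasks
        self.results = results

    def run(self):
        while True:
            filename = self.tasks.get()   # blocks until a task or sentinel arrives
            if filename is None:          # sentinel: no unprocessed files left
                break
            self.results.put((filename, analyse(filename)))


def run_all(filenames):
    """Start one worker per virtual CPU core and collect all results."""
    filenames = list(filenames)
    tasks = multiprocessing.Queue()
    results = multiprocessing.Queue()
    workers = [AnalyticWorker(tasks, results) for _ in range(multiprocessing.cpu_count())]
    for name in filenames:
        tasks.put(name)
    for _ in workers:
        tasks.put(None)                   # one sentinel per worker
    for worker in workers:
        worker.start()
    # Drain the results before joining so no worker blocks on a full pipe.
    collected = dict(results.get() for _ in filenames)
    for worker in workers:
        worker.join()
    return collected
```

Because each file name is taken from the queue by exactly one worker, no two processes analyse the same file. `run_all` should be called from under an `if __name__ == "__main__":` guard so that it also works with the spawn start method.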
The Analytic_Processes build schema_graph objects from the JSON Schema documents. These graphs are represented by the class defined in schema_graph.py. Most of the computation is performed there.
Three types of graph nodes are defined in the project: KeyValueNodes, ArrayNodes and ObjectNodes, which all inherit from SchemaNode. Each class is defined in the file with the same name.
The file load_schema_from_web.py is used to download additional files during reference resolution whenever an external reference is required. The schema_checker.py file performs the validity check with one validator.
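Whether a download is needed depends on whether a $ref points into the same document or names another one. The sketch below shows one way to make that distinction and to follow a local JSON Pointer. It does not reproduce load_schema_from_web.py; the function names `is_external` and `resolve_pointer` are hypothetical, and percent-decoding of the fragment is left out.

```python
from urllib.parse import urldefrag


def is_external(ref):
    """Treat a $ref as external when it names a document before the '#'."""
    url, _fragment = urldefrag(ref)
    return bool(url)


def resolve_pointer(document, pointer):
    """Follow a JSON Pointer such as '/definitions/address' inside a loaded document."""
    target = document
    tokens = pointer.split("/")[1:] if pointer else []
    for token in tokens:
        # RFC 6901 unescaping: '~1' stands for '/', '~0' for '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        target = target[int(token)] if isinstance(target, list) else target[token]
    return target


doc = {"definitions": {"address": {"type": "object"}}}
_url, fragment = urldefrag("#/definitions/address")
print(is_external("#/definitions/address"), resolve_pointer(doc, fragment))
```

Only references for which `is_external` returns true would require fetching another document before the pointer part can be resolved.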
All type counts and some other counts are performed using the visitor pattern. All visitors used are defined in the subdirectory Visitors. The Meta_Schemas directory contains the JSON Schema meta-schemas for each draft.
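As an illustration of how a count can be implemented with the visitor pattern, here is a minimal sketch. The class names SchemaNode, KeyValueNode, ArrayNode and ObjectNode are taken from the project, but their constructors, the `accept` and `visit` methods and the `TypeCountVisitor` class are assumptions; the real interfaces in the Visitors subdirectory may look different.

```python
class SchemaNode:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def accept(self, visitor):
        # Let the visitor inspect this node, then descend into the children.
        visitor.visit(self)
        for child in self.children:
            child.accept(visitor)


class KeyValueNode(SchemaNode):
    def __init__(self, name, value):
        super().__init__(name)
        self.value = value


class ArrayNode(SchemaNode):
    pass


class ObjectNode(SchemaNode):
    pass


class TypeCountVisitor:
    """Counts key-value nodes of the form "type": <type_name>."""

    def __init__(self, type_name):
        self.type_name = type_name
        self.count = 0

    def visit(self, node):
        if isinstance(node, KeyValueNode) and node.name == "type" and node.value == self.type_name:
            self.count += 1


root = ObjectNode("root", [
    KeyValueNode("type", "object"),
    ObjectNode("properties", [
        ObjectNode("name", [KeyValueNode("type", "string")]),
        ObjectNode("id", [KeyValueNode("type", "string")]),
    ]),
])
visitor = TypeCountVisitor("string")
root.accept(visitor)
print(visitor.count)  # 2
```

The benefit of this design is that a new count only needs a new visitor class; the node classes and the traversal stay untouched.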
All unit tests can be found in the directory PyTest. There is an additional README that describes the structure of the tests.
The top directory of the project contains the results in AnalysisResults.xlsx and AnalysisResults.csv. Both files contain the same information.
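Since the CSV file carries the same data as the Excel sheet, it can be read with the standard library alone. The following sketch sums one column over all analysed schemas. It assumes that AnalysisResults.csv is comma-separated and starts with a header row using the column names from the table above; the function name `column_total` is hypothetical.

```python
import csv


def column_total(csv_path, column):
    """Sum one numeric column of the results file over all analysed schemas."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # int(float(...)) also accepts values written as "3.0".
        return sum(int(float(row[column])) for row in csv.DictReader(handle))
```

For example, `column_total("AnalysisResults.csv", "ref_count")` would give the total number of $ref occurrences across all documents, provided the assumptions about the file layout hold.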
The file categorisation.csv contains the mapping of each JSON Schema document's short name to its category. The file filename_spec.csv contains the mapping from document short names (see schemastore.org) to the filenames actually used for the stored JSON Schema documents. Both files are used by the project to determine the category of each file.
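Determining the category of a file amounts to joining the two CSV files on the short name. The sketch below shows that join under the assumption that both files contain two-column rows without a header line; `load_pairs` and `categories_by_filename` are hypothetical helper names, not functions from the project.

```python
import csv


def load_pairs(csv_path):
    """Read a two-column CSV into a dict (first column -> second column)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row[0]: row[1] for row in csv.reader(handle) if len(row) >= 2}


def categories_by_filename(filename_spec_path, categorisation_path):
    """Join both files on the short name to obtain filename -> category."""
    filenames = load_pairs(filename_spec_path)     # short name -> filename
    categories = load_pairs(categorisation_path)   # short name -> category
    return {
        filename: categories[short_name]
        for short_name, filename in filenames.items()
        if short_name in categories                # skip uncategorised schemas
    }
```

Schemas that appear in only one of the two files are silently skipped in this sketch; how the project itself treats such entries is not stated here.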
The file filename_spec.csv is generated by get_schemas_from_store.py. This script downloads all JSON Schema documents from schemastore.org, generates the filenames and stores the schemas in the directory JSON.
The file typeCompareBoxplot_CombinedCount.py generates the plot typeCompareBoxplot_CombinedCount.png in the directory Plots by reading the required data from AnalysisResults.xlsx.
The three bar charts are generated by hist.py. The file countsSpecialCategoriesTotal.csv is generated by table.py. The file writer.py implements helper functions for table.py. Before table.py can be executed, writer.py has to be run at least once.
The directory JsonSchemaAnalysis contains a reference implementation of the Python project that was used to validate the calculated results.