JSON-Schema-Analysis

CI pre-commit.ci status

Description

JSON-Schema-Analysis is a project that analyses how real-world JSON Schema documents use the components and features allowed by the JSON Schema standard. To this end, the JSON Schema files available at the JSON Schema Store are downloaded and analysed with a set of Python scripts.

Running the Code

Preliminaries

To run this code on your device, you have to download or clone this repository. You need a version of Python 3 installed; the project is tested with Python 3.7.17. You also need to install the required packages with pipenv. Additionally, you have to install the GraphViz software.
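
A typical setup might look like the following (a sketch: the exact dependency set is defined by the project's pipenv configuration, and GraphViz has to be installed separately through your platform's package manager or the GraphViz website):

```sh
# Clone the repository and enter it
git clone https://github.com/dataunitylab/schemastore-analysis.git
cd schemastore-analysis

# Install the Python dependencies and enter the virtual environment
pipenv install
pipenv shell
```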

Running the code

After installing the packages named above, you can rerun the analysis provided by this project. The main file is JSON_Schema_Analysis.py. It takes several optional command line arguments: -a makes the code analyse all files in the directory JSON, -c <arg> specifies the number of files to analyse, and -v prints all results on the CLI.
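
For illustration, some possible invocations are sketched below (whether the flags can be combined depends on the script's argument parsing; the count of 50 is an arbitrary example):

```sh
# Analyse all schemas in the JSON directory
python JSON_Schema_Analysis.py -a

# Analyse only the first 50 schemas
python JSON_Schema_Analysis.py -c 50

# Analyse all schemas and print all results on the CLI
python JSON_Schema_Analysis.py -a -v
```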

To analyse additional schemas, you have to put them in the directory JSON and add them to the two CSV files responsible for category matching: filename_spec.csv and categorisation.csv, both located in the main directory of this project. For each file, you have to specify a nickname, the real filename and a category. Insert a line with nickname,filename in filename_spec.csv and a line with nickname,category in categorisation.csv for every additional JSON Schema you want to analyse. Although JSON_Schema_Analysis specifies the four categories app, data, conf and meta, it can handle other categories as well; just specify them as the category in categorisation.csv.
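
For example, adding a hypothetical schema stored as my-schema.json under the made-up nickname my-schema and the category conf would require one new line in each file.

In filename_spec.csv:

```
my-schema,my-schema.json
```

In categorisation.csv:

```
my-schema,conf
```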

Results

The results are stored as an Excel sheet named AnalysisResults.xlsx in the project's main directory and as a CSV file AnalysisResults.csv in the same directory. Plots have to be generated separately with the provided scripts explained at the bottom of this document. AnalysisResults.xlsx consists of several columns and contains the information about one specific JSON Schema document in each row. The first column gives the filename of the JSON Schema document located in the JSON directory; the following columns in the same row provide the information about this JSON Schema generated by the analysis. The following table explains the meaning of each column.

| Column name | Meaning |
| --- | --- |
| add_prop_count | Number of occurrences of the additionalProperties keyword. |
| all_of_count | Number of occurrences of the allOf keyword. |
| any_of_count | Number of occurrences of the anyOf keyword. |
| array_count | Number of occurrences of the array type keyword. |
| str_count | Number of occurrences of the string type keyword. |
| enum_count | Number of occurrences of the enum keyword. |
| mult_of_count | Number of occurrences of the multipleOf keyword. |
| not_count | Number of occurrences of the not keyword. |
| number_count | Number of occurrences of the integer and number type keywords. |
| pattern_count | Number of occurrences of the pattern and patternProperties keywords. |
| required_count | Number of occurrences of the required keyword. |
| unique_items_count | Number of occurrences of the uniqueItems keyword. |
| value_restriction_count | Sum of occurrences of the min, max, minLength, maxLength, exclusiveMinimum and exclusiveMaximum keywords. |
| boolean_count | Number of occurrences of the boolean type keyword. |
| nulltype_count | Number of occurrences of the null type keyword. |
| object_count | Number of occurrences of the object type keyword. |
| ref_count | Number of occurrences of the $ref keyword. |
| depth_schema | Depth of the tree that emerges from loading the raw JSON Schema into a schema_graph. |
| depth_resolvedTree | Depth of the tree after resolving the references. If has_recursion is true, this is the maximum cycle length in the recursive document. |
| fan_in | Maximum fan-in over all nodes included in the schema_graph. |
| fan_out | Maximum fan-out over all nodes included in the schema_graph. |
| has_recursion | Boolean flag that indicates whether the JSON Schema document (i.e. the resolved graph) is recursive. |
| min_cycle_len | Minimum cycle length of a recursive document. If has_recursion is false, this column will be 0. |
| width | Number of leaf nodes in the schema_graph of the raw JSON Schema document. |
| reachability | Boolean flag that indicates whether the schema contains unreachable (unused) definitions. |
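
As a minimal sketch of how these results could be inspected afterwards, the CSV file can be loaded with pandas (pandas is assumed to be available here; it is not necessarily part of the project's dependencies, and the column names are taken from the table above):

```python
import pandas as pd

# Load the per-schema results produced by JSON_Schema_Analysis.py
results = pd.read_csv("AnalysisResults.csv")

# Schemas flagged as recursive and their cycle/depth metrics
# (adjust the comparison if the flag is stored as text rather than a boolean)
recursive = results[results["has_recursion"] == True]
print(recursive[["min_cycle_len", "depth_resolvedTree", "depth_schema"]].describe())

# Average number of $ref occurrences per schema
print(results["ref_count"].mean())
```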

Project Structure

The Python script JSON_Schema_Analysis.py contains the main function. When started, it creates a number of processes equal to the number of virtual CPU cores available on the current machine. These processes are described in the file Analytic_Process.py. The project uses the Python multiprocessing library, and an Analytic_Process inherits from the Process class defined there. Each Analytic_Process fetches a file, performs all necessary analytic steps, stores the results, and then fetches a new file as long as unprocessed files are available. This scheme avoids problems with concurrency. The Analytic_Processes build a schema_graph from each JSON Schema document. These graphs are represented by the class defined in schema_graph.py, where most of the computation is performed. Three types of graph nodes are defined in the project: KeyValueNode, ArrayNode and ObjectNode, which all inherit from SchemaNode and are defined in files of the same name. The file load_schema_from_web.py is used to download additional files in the resolving process whenever an external reference is required. The file schema_checker.py performs the validity check with one validator. All type counts and some other counts are performed using the visitor pattern; all visitors used are defined in the subdirectory Visitors. The Meta_Schemas directory contains the JSON Schema meta-schemas for each draft.
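
The project's actual node and visitor classes are more involved; the following is a deliberately simplified, hypothetical sketch of the visitor idea used for the keyword counts (class and method names here are illustrative, not the project's API):

```python
# Hypothetical, simplified sketch of visitor-based counting; the real classes
# in KeyValueNode, ArrayNode, ObjectNode and the Visitors directory differ.
class SchemaNode:
    def accept(self, visitor):
        raise NotImplementedError


class KeyValueNode(SchemaNode):
    def __init__(self, key, value):
        self.key, self.value = key, value

    def accept(self, visitor):
        visitor.visit_key_value(self)


class ArrayNode(SchemaNode):
    def __init__(self, children):
        self.children = children

    def accept(self, visitor):
        visitor.visit_array(self)
        for child in self.children:
            child.accept(visitor)


class ObjectNode(SchemaNode):
    def __init__(self, children):
        self.children = children

    def accept(self, visitor):
        visitor.visit_object(self)
        for child in self.children:
            child.accept(visitor)


class KeywordCountVisitor:
    """Counts occurrences of one JSON Schema keyword, e.g. "enum"."""

    def __init__(self, keyword):
        self.keyword = keyword
        self.count = 0

    def visit_key_value(self, node):
        if node.key == self.keyword:
            self.count += 1

    def visit_array(self, node):
        pass  # nothing to count on array nodes in this sketch

    def visit_object(self, node):
        pass  # nothing to count on object nodes in this sketch
```

In this sketch, a count is obtained by creating one visitor per keyword and letting the root node accept it; the traversal then visits every node in the graph.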

All unit tests can be found in the directory PyTest. An additional README there describes the structure of the tests.

The top directory of the project contains the results in AnalysisResults.xlsx and AnalysisResults.csv; both contain the same information. The file categorisation.csv contains the mapping from JSON Schema documents' short names to their category. The file filename_spec.csv contains the mapping from document short names (see schemastore.org) to the actual filenames of the stored JSON Schema documents. Both files are used by the project to determine the category of each file. The file filename_spec.csv is generated by get_schemas_from_store.py. This script downloads all JSON Schema documents from schemastore.org, generates the filenames and stores the schemas in the directory JSON. The file typeCompareBoxplot_CombinedCount.py generates the plot typeCompareBoxplot_CombinedCount.png in the directory Plots by reading the required data from AnalysisResults.xlsx. The three bar charts are generated by hist.py. The file countsSpecialCategoriesTotal.csv is generated by table.py. The file writer.py implements helper functions for table.py; before table.py can be executed, writer.py has to be run at least once.
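
Putting the pieces together, one possible order for regenerating everything from scratch is sketched below; this is a rough outline derived from the descriptions above, not an official build script:

```sh
# Download all schemas from schemastore.org and generate filename_spec.csv
python get_schemas_from_store.py

# Run the analysis to produce AnalysisResults.xlsx and AnalysisResults.csv
python JSON_Schema_Analysis.py -a

# writer.py has to be run at least once before table.py
python writer.py
python table.py    # produces countsSpecialCategoriesTotal.csv

# Generate the plots in the Plots directory
python hist.py
python typeCompareBoxplot_CombinedCount.py
```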

The directory JsonSchemaAnalysis contains a reference implementation that was used to validate the results calculated by the Python project.
