This repository has been archived by the owner on Aug 13, 2024. It is now read-only.

Merge branch 'release/0.2.0'
alexanderdean committed Jul 3, 2015
2 parents cf0878a + 02cfdc1 commit 1a9280b
Showing 71 changed files with 3,969 additions and 346 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -19,4 +19,4 @@ project/plugins/project/

# Vagrant
.vagrant
VERSION
VERSION
23 changes: 23 additions & 0 deletions CHANGELOG
@@ -1,3 +1,26 @@
0.2.0 (2015-07-01)
------------------
Updated vagrant push to also build and publish webui artifact (#72)
Added NS and CORE settings to Vagrantfile to improve performance (#79)
Removed bin/jarx-stub.sh from project (#71)
Added Elastic Beanstalk deployment using Docker (#70)
Wrote tests for 0.2.0 milestone (#62)
Now getting SchemaVer name from JSON Path (#67)
Added configuration options for self-describing JSON Schema (#17, #19)
Added ability to segment JSON instances based on a JSON property (#48)
Refactored annotators to produce JNothing (#65)
Added enum cardinality option to Web UI (#64)
Now auto-detecting enums with configurable cardinality tolerance (#36)
Now gracefully printing error message and exit app if invalid path given (#51)
Now detecting field contains base64 (#58)
Supported dragging a JSON string (#57)
Added support of newline-delimited JSONs (#56)
Now outputting duplicated keys in Web UI (#61)
Now identifying and warning of misspelt properties (#31)
Created a single-page UI in plain JS (#39)
Added a sbt sub-project to schema-guru which embeds schema-guru in a Spray server (#53)
Fixed incorrectly reduced integer and number (#60)

0.1.0 (2015-06-03)
------------------
Initial release
116 changes: 108 additions & 8 deletions README.md
@@ -2,39 +2,67 @@

[ ![Build Status] [travis-image] ] [travis] [ ![Release] [release-image] ] [releases] [ ![License] [license-image] ] [license]

Schema Guru is a tool (currently CLI only) allowing you to derive **[JSON Schemas] [json-schema]** from a set of JSON instances.
Schema Guru is a tool (CLI and web) allowing you to derive **[JSON Schemas] [json-schema]** from a set of JSON instances.

Unlike other tools for deriving JSON Schemas, Schema Guru allows you to derive schema from an unlimited set of instances (making schemas much more precise), and supports many more JSON Schema validation properties.

Schema Guru is used heavily in association with Snowplow's own **[Snowplow] [snowplow]** and **[Iglu] [iglu]** projects.

## User Quickstart

### CLI

Download the latest Schema Guru from Bintray:

```bash
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.1.0.zip
$ unzip schema_guru_0.1.0.zip
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.2.0.zip
$ unzip schema_guru_0.2.0.zip
```

Assuming you have a recent JVM installed:

```bash
$ ./schema-guru-0.1.0 --dir {{jsons_directory}}
$ ./schema-guru-0.2.0 --dir {{jsons_directory}}
```

You can also specify an output file for your schema:

```bash
$ ./schema-guru-0.1.0 --dir {{jsons_directory}} --output {{json_schema_file}}
$ ./schema-guru-0.2.0 --dir {{jsons_directory}} --output {{json_schema_file}}
```

Or you can analyze a single JSON instance:

```bash
$ ./schema-guru-0.1.0 --file {{json_instance}}
$ ./schema-guru-0.2.0 --file {{json_instance}}
```

You can also switch Schema Guru into ndjson mode, where it will look for newline delimited JSONs.

In this case all your files need to have the `.ndjson` extension (as the **[specification][ndjson-spec]** says); all `.json` files will be skipped.

```bash
$ ./schema-guru-0.2.0 --ndjson --dir {{ndjsons_directory}}
```
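
For readers unfamiliar with the format, here is a minimal Python sketch of what "newline delimited JSON" means: one JSON value per line. It only illustrates the file format; the `read_ndjson` helper and the sample data are hypothetical and are not part of Schema Guru, which is written in Scala.

```python
import json


def read_ndjson(text):
    """Parse newline-delimited JSON: one JSON value per non-empty line."""
    instances = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # tolerate blank lines between records
            instances.append(json.loads(line))
    return instances


# Hypothetical contents of a small .ndjson file
sample = '{"event": "Purchased an Item"}\n{"event": "Posted a Comment"}\n'
instances = read_ndjson(sample)
print(len(instances))
```

Each parsed line is then an independent JSON instance, just as if it had been stored in its own `.json` file.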

You can specify the enum cardinality tolerance for your fields. This means that *all* fields whose cardinality is found to be below the specified value will be described in the JSON Schema using the `enum` property.

```bash
$ ./schema-guru-0.2.0 --enum 5 --dir {{jsons_directory}}
```
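
The idea behind the cardinality tolerance can be sketched in a few lines of Python. This is an illustration only, assuming that "cardinality" means the number of distinct values seen for a field and that the comparison is strict, as the sentence above suggests; `detect_enum` and the sample values are hypothetical and do not reflect Schema Guru's actual implementation.

```python
def detect_enum(values, cardinality):
    """Return the sorted distinct values if there are fewer than
    `cardinality` of them, otherwise None (field is not an enum)."""
    distinct = set(values)
    if len(distinct) < cardinality:
        return sorted(distinct)
    return None


# Hypothetical values observed for two fields across many instances
shipping = ["2-day", "overnight", "2-day", "ground"]
user_ids = ["u1", "u2", "u3", "u4", "u5", "u6"]

shipping_enum = detect_enum(shipping, 5)  # 3 distinct values, below 5
user_enum = detect_enum(user_ids, 5)      # 6 distinct values, not an enum
print(shipping_enum, user_enum)
```

A lower tolerance produces fewer `enum` properties; a higher one risks treating free-form fields as enumerations when the sample of instances is small.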

### Web UI

You can access our hosted demo of the Schema Guru web UI at [schemaguru.snowplowanalytics.com] [webui-hosted]. To run it locally:

```bash
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.2.0.zip
$ unzip schema_guru_webui_0.2.0.zip
$ ./schema-guru-webui-0.2.0
```

The above will run a Spray web server containing Schema Guru on [0.0.0.0:8000] [webui-local]. The interface and port can be set with `--interface` and `--port` respectively.

## Developer Quickstart

Assuming git, **[Vagrant] [vagrant-install]** and **[VirtualBox] [virtualbox-install]** installed:
@@ -44,9 +72,18 @@ Assuming git, **[Vagrant] [vagrant-install]** and **[VirtualBox] [virtualbox-ins
host$ cd schema-guru
host$ vagrant up && vagrant ssh
guest$ cd /vagrant
guest$ sbt test
guest$ sbt assembly
guest$ sbt "project schema-guru-webui" assembly
```

You can also deploy the Schema Guru web GUI onto Elastic Beanstalk:

```
guest$ cd beanstalk && zip beanstalk.zip *
```

Now just create a new Docker app in the **[Elastic Beanstalk Console] [beanstalk-console]** and upload this zipfile.

## User Manual

### Functionality
@@ -59,14 +96,68 @@ guest$ sbt test
- date-time (according to ISO-8601)
- IPv4 and IPv6 addresses
- HTTP, HTTPS, FTP URLs
* Recognizes base64 pattern for strings
* Detects integer ranges according to Int16, Int32, Int64
* Detects misspelt properties and produces warnings
* Detects enum values with specified cardinality
* Can output **[Self-describing JSON Schema] [self-describing]**
* Can produce JSON Schemas with different names based on a given JSON Path
* Supports **[Newline Delimited JSON] [ndjson]**
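
As an illustration of the integer range detection listed above, the following Python sketch picks the smallest of Int16, Int32 and Int64 that holds every observed value. The function name and return shape are hypothetical; how Schema Guru itself represents the detected range in the resulting schema is not shown here.

```python
# Signed integer ranges, smallest first
INT_RANGES = [
    ("Int16", -2**15, 2**15 - 1),
    ("Int32", -2**31, 2**31 - 1),
    ("Int64", -2**63, 2**63 - 1),
]


def smallest_int_range(values):
    """Return (name, minimum, maximum) of the narrowest range that
    contains all values, or None if even Int64 is too small."""
    lo, hi = min(values), max(values)
    for name, low, high in INT_RANGES:
        if low <= lo and hi <= high:
            return (name, low, high)
    return None


print(smallest_int_range([1, 300]))
print(smallest_int_range([-5, 70000]))
```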

### Assumptions

* All JSONs in the directory are assumed to be of the same event type and will be merged together
* All JSONs are assumed to start with either `{ ... }` or `[ ... ]`
- If they do not they are discarded
* Schema should be as strict as possible - e.g. no `additionalProperties` are allowed currently
* When using Schema Guru to derive a schema from newline delimited JSONs, the files must have the `.ndjson` extension

### Self-describing JSON
Schema Guru allows you to produce **[Self-describing JSON Schema] [self-describing]**.
To produce one, you need to specify the vendor, the name (unless you are using segmentation, see below), and optionally the version (the default value is 0-1-0).

```bash
$ ./schema-guru-0.2.0 --dir {{jsons_directory}} --vendor {{your_company}} --name {{schema_name}} --schemaver {{version}}
```
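
To give a rough idea of the result, the Python sketch below wraps a derived schema with a `self` block. It follows the Self-describing JSON Schema layout as described in the linked post, to the best of our understanding; the `make_self_describing` helper and the vendor and name values are hypothetical, and the exact output of Schema Guru may differ.

```python
import json


def make_self_describing(schema, vendor, name, version="0-1-0"):
    """Wrap a plain JSON Schema (a dict) with self-describing metadata.
    The default version mirrors the 0-1-0 default mentioned above."""
    wrapped = {
        # Assumed meta-schema URI for self-describing JSON Schemas
        "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
        "self": {
            "vendor": vendor,
            "name": name,
            "format": "jsonschema",
            "version": version,
        },
    }
    wrapped.update(schema)  # keep the derived validation properties
    return wrapped


derived = {"type": "object", "properties": {"event": {"type": "string"}}}
result = make_self_describing(derived, "com.acme", "purchase")
print(json.dumps(result, indent=2))
```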

### Schema Segmentation

If you have a set of mixed JSONs from one vendor with slightly different structures, like:

```json
{ "version": 1,
"type": "track",
"userId": "019mr8mf4r",
"event": "Purchased an Item",
"properties": {
"revenue": "39.95",
"shippingMethod": "2-day" },
"timestamp" : "2012-12-02T00:30:08.276Z" }
```

and

```json
{ "version": 1,
"type": "track",
"userId": "019mr8mf4r",
"event": "Posted a Comment",
"properties": {
"body": "This book is gorgeous!",
"attachment": false },
"timestamp" : "2012-12-02T00:28:02.273Z" }
```

you can run Schema Guru as follows:
```bash
$ ./schema-guru-0.2.0 --dir {{mixed_jsons_directory}} --output-dir {{output_dir}} --schema-by $.event
```

This will put two (or possibly more) JSON Schemas into the output directory: Purchased_an_Item.json and Posted_a_Comment.json.
Each schema is derived only from the JSONs with the corresponding event property, with no overlap between them.
This assumes that the provided JSON Path points to a valid string.
All JSONs in which this JSON Path is absent or holds a non-string value will be merged into an unmatched.json schema in the same output directory.
Also, when producing a Self-describing JSON Schema, the schema name is taken in the same way, so the --name argument can be omitted (a name derived from the JSON Path replaces any name given with the option).
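
The grouping step can be sketched in Python as follows. This is a simplified illustration, not Schema Guru's code: it only handles a top-level key rather than a full JSON Path, and the rule of replacing spaces with underscores is inferred from the file names above.

```python
from collections import defaultdict


def segment_by_key(instances, key="event"):
    """Group JSON instances by the string value found under `key`.
    Instances without a string value there go to unmatched.json."""
    groups = defaultdict(list)
    for inst in instances:
        value = inst.get(key) if isinstance(inst, dict) else None
        if isinstance(value, str):
            filename = value.replace(" ", "_") + ".json"
        else:
            filename = "unmatched.json"
        groups[filename].append(inst)
    return dict(groups)


instances = [
    {"event": "Purchased an Item", "properties": {"revenue": "39.95"}},
    {"event": "Posted a Comment", "properties": {"attachment": False}},
    {"event": 42},       # non-string value
    {"type": "track"},   # key absent
]
groups = segment_by_key(instances)
print(sorted(groups))
```

Each group would then be passed through schema derivation separately, which is why the resulting schemas share no instances.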

### Example

@@ -162,13 +253,22 @@ limitations under the License.
[license-image]: http://img.shields.io/badge/license-Apache--2-blue.svg?style=flat
[license]: http://www.apache.org/licenses/LICENSE-2.0

[release-image]: http://img.shields.io/badge/release-0.1.0-blue.svg?style=flat
[release-image]: http://img.shields.io/badge/release-0.2.0-blue.svg?style=flat
[releases]: https://github.com/snowplow/schema-guru/releases

[json-schema]: http://json-schema.org/

[ndjson]: http://ndjson.org/
[ndjson-spec]: http://dataprotocols.org/ndjson/

[webui-local]: http://0.0.0.0:8000
[webui-hosted]: http://schemaguru.snowplowanalytics.com

[snowplow]: https://github.com/snowplow/snowplow
[iglu]: https://github.com/snowplow/iglu
[self-describing]: http://snowplowanalytics.com/blog/2014/05/15/introducing-self-describing-jsons/

[vagrant-install]: http://docs.vagrantup.com/v2/installation/index.html
[virtualbox-install]: https://www.virtualbox.org/wiki/Downloads

[beanstalk-console]: http://console.aws.amazon.com/elasticbeanstalk
6 changes: 6 additions & 0 deletions Vagrantfile
@@ -4,12 +4,18 @@ Vagrant.configure("2") do |config|
config.vm.hostname = "schema-guru"
config.ssh.forward_agent = true

# Required for NFS to work, pick any local IP
# Use NFS for shared folders for better performance
# config.vm.network :private_network, ip: '192.168.50.50' # Uncomment to use NFS
# config.vm.synced_folder '.', '/vagrant', nfs: true # Uncomment to use NFS

config.vm.provider :virtualbox do |vb|
vb.name = Dir.pwd().split("/")[-1] + "-" + Time.now.to_f.to_i.to_s
vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
vb.customize [ "guestproperty", "set", :id, "--timesync-threshold", 10000 ]
# Scala is memory-hungry
vb.memory = 5120
# vb.cpus = 4 # Uncomment to use more cores
end

config.vm.provision :shell do |sh|
File renamed without changes.
10 changes: 10 additions & 0 deletions beanstalk/Dockerfile
@@ -0,0 +1,10 @@
# Dockerfile

FROM java:7
MAINTAINER Snowplow Analytics, [email protected]
WORKDIR /
USER daemon
EXPOSE 8000

ADD start.sh /tmp/
CMD ./tmp/start.sh
11 changes: 11 additions & 0 deletions beanstalk/Dockerrun.aws.json
@@ -0,0 +1,11 @@
{
"AWSEBDockerrunVersion": "1",
"Image": {
"Name": "java:7"
},
"Ports": [
{
"ContainerPort": "8000"
}
]
}
7 changes: 7 additions & 0 deletions beanstalk/start.sh
@@ -0,0 +1,7 @@
#!/bin/sh

# Script must be executable!
cd /tmp
wget -N http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.2.0.zip
unzip -o schema_guru_webui_0.2.0.zip
java -jar schema-guru-webui-0.2.0
25 changes: 0 additions & 25 deletions bin/jarx-stub.sh

This file was deleted.

46 changes: 27 additions & 19 deletions project/BuildSettings.scala
@@ -17,12 +17,10 @@ import sbt._
import Keys._

object BuildSettings {

// Basic settings for our app
lazy val basicSettings = Seq[Setting[_]](
// Common settings for all our projects
lazy val commonSettings = Seq[Setting[_]](
organization := "com.snowplowanalytics",
version := "0.1.0",
description := "For deriving JSON Schemas from collections of JSON instances",
version := "0.2.0",
scalaVersion := "2.10.5",
crossScalaVersions := Seq("2.10.5", "2.11.6"),
scalacOptions := Seq("-deprecation", "-encoding", "utf8",
@@ -32,42 +30,52 @@ object BuildSettings {
resolvers ++= Dependencies.resolutionRepos
)

// Settings specific for Schema Guru CLI
lazy val coreSettings = Seq[Setting[_]](
description := "For deriving JSON Schemas from collections of JSON instances",

mainClass in (Compile, run) := Some("com.snowplowanalytics.schemaguru.Main")
)

// Makes our SBT app settings available from within the ETL
lazy val scalifySettings = Seq(sourceGenerators in Compile <+= (sourceManaged in Compile, version, name, organization, scalaVersion) map { (d, v, n, o, sv) =>
val file = d / "settings.scala"
IO.write(file, """package com.snowplowanalytics.schemaguru.generated
|object ProjectSettings {
| val version = "%s"
| val name = "%s"
| val organization = "%s"
| val scalaVersion = "%s"
|}
|""".stripMargin.format(v, n, o, sv))
|object ProjectSettings {
| val version = "%s"
| val name = "%s"
| val organization = "%s"
| val scalaVersion = "%s"
|}
|""".stripMargin.format(v, n, o, sv))
Seq(file)
})

// sbt-assembly settings for building a fat jar
import sbtassembly.Plugin._
import AssemblyKeys._
lazy val sbtAssemblySettings = assemblySettings ++ Seq(

lazy val sbtAssemblyCommonSettings = assemblySettings ++ Seq(
// Executable jarfile
assemblyOption in assembly ~= { _.copy(prependShellScript = Some(defaultShellScript)) },

// Name it as an executable
jarName in assembly := { s"${name.value}-${version.value}" },
jarName in assembly := { s"${name.value}-${version.value}" }
)

lazy val sbtAssemblyCoreSettings = sbtAssemblyCommonSettings ++ Seq(
// Drop these jars
excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
val excludes = Set(
"commons-beanutils-1.8.3.jar" // Clashes with commons-collections
)
cp filter { jar => excludes(jar.data.getName) }
},

// Make this executable
mainClass in assembly := Some("com.snowplowanalytics.schemaguru.SchemaGuruApp")
mainClass in assembly := Some("com.snowplowanalytics.schemaguru.Main")
)

lazy val buildSettings = basicSettings ++ scalifySettings ++ sbtAssemblySettings
lazy val coreBuildSettings =
commonSettings ++
coreSettings ++
scalifySettings ++
sbtAssemblyCoreSettings
}