Merge branch 'release/0.2.0'

snowplow-archive · Jul 3, 2015 · 1a9280b
2 parents cf0878a + 02cfdc1
commit 1a9280b
Showing 71 changed files with 3,969 additions and 346 deletions.
diff --git a/.gitignore b/.gitignore
@@ -19,4 +19,4 @@ project/plugins/project/
 
 # Vagrant
 .vagrant
-VERSION
+VERSION
diff --git a/CHANGELOG b/CHANGELOG
@@ -1,3 +1,26 @@
+0.2.0 (2015-07-01)
+------------------
+Updated vagrant push to also build and publish webui artifact (#72)
+Added NS and CORE settings to Vagrantfile to improve performance (#79)
+Removed bin/jarx-stub.sh from project (#71)
+Added Elastic Beanstalk deployment using Docker (#70)
+Wrote tests for 0.2.0 milestone (#62)
+Now getting SchemaVer name from JSON Path (#67)
+Added configuration options for self-describing JSON Schema (#17, #19)
+Added ability to segment JSON instances based on a JSON property (#48)
+Refactored annotators to produce JNothing (#65)
+Added enum cardinality option to Web UI (#64)
+Now auto-detecting enums with configurable cardinality tolerance (#36)
+Now gracefully printing error message and exiting app if invalid path given (#51)
+Now detecting whether a field contains base64 (#58)
+Supported dragging a JSON string (#57)
+Added support for newline-delimited JSONs (#56)
+Now outputting duplicated keys in Web UI (#61)
+Now identifying and warning of misspelt properties (#31)
+Created a single-page UI in plain JS (#39)
+Added an sbt sub-project to schema-guru which embeds schema-guru in a Spray server (#53)
+Fixed incorrectly reduced integer and number (#60)
+
 0.1.0 (2015-06-03)
 ------------------
 Initial release
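
Reviewer note on the enum entry (#36) above: the idea of a "cardinality tolerance" can be illustrated with a few lines of shell. This is only a sketch of the concept, not Schema Guru's actual implementation; the function name `is_enum_candidate` is hypothetical, and treating the boundary as strict ("less than") is an assumption taken from the README wording later in this commit.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of Schema Guru: decide whether the values
# observed for one field are few enough to be emitted as an `enum`.
#   $1    = cardinality tolerance (the number passed to --enum)
#   stdin = observed values, one per line
is_enum_candidate() {
  tolerance=$1
  # Count distinct values; tr strips the padding some wc builds add.
  distinct=$(sort -u | wc -l | tr -d ' ')
  # Assumption: strictly fewer distinct values than the tolerance.
  [ "$distinct" -lt "$tolerance" ]
}

# Three distinct values against a tolerance of 5
if printf '%s\n' track page track identify | is_enum_candidate 5; then
  echo "field would be described with enum"
fi
```

With a tolerance of 5, a field that only ever holds `track`, `page` and `identify` would qualify, while a free-text field with many distinct values would not.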
diff --git a/README.md b/README.md
@@ -2,39 +2,67 @@
 
 [ ![Build Status] [travis-image] ] [travis]  [ ![Release] [release-image] ] [releases] [ ![License] [license-image] ] [license]
 
-Schema Guru is a tool (currently CLI only) allowing you to derive **[JSON Schemas] [json-schema]** from a set of JSON instances.
+Schema Guru is a tool (CLI and web) allowing you to derive **[JSON Schemas] [json-schema]** from a set of JSON instances.
 
 Unlike other tools for deriving JSON Schemas, Schema Guru allows you to derive schema from an unlimited set of instances (making schemas much more precise), and supports many more JSON Schema validation properties.
 
 Schema Guru is used heavily in association with Snowplow's own **[Snowplow] [snowplow]** and **[Iglu] [iglu]** projects.
 
 ## User Quickstart
 
+### CLI
+
 Download the latest Schema Guru from Bintray:
 
 ```bash
-$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.1.0.zip
-$ unzip schema_guru_0.1.0.zip
+$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.2.0.zip
+$ unzip schema_guru_0.2.0.zip
 ```
 
 Assuming you have a recent JVM installed:
 
 ```bash
-$ ./schema-guru-0.1.0 --dir {{jsons_directory}}
+$ ./schema-guru-0.2.0 --dir {{jsons_directory}}
 ```
 
 Also you can specify output file for your schema:
 
 ```bash
-$ ./schema-guru-0.1.0 --dir {{jsons_directory}} --output {{json_schema_file}}
+$ ./schema-guru-0.2.0 --dir {{jsons_directory}} --output {{json_schema_file}}
 ```
 
 Or you can analyze a single JSON instance:
 
 ```bash
-$ ./schema-guru-0.1.0 --file {{json_instance}}
+$ ./schema-guru-0.2.0 --file {{json_instance}}
+```
+
+You can also switch Schema Guru into ndjson mode, where it will look for newline delimited JSONs.
+
+In this case all your files need to have the `.ndjson` extension (as the **[specification][ndjson-spec]** says); all `.json` files will be skipped.
+
+```bash
+$ ./schema-guru-0.2.0 --ndjson --dir {{ndjsons_directory}}
 ```
 
+You can specify the enum cardinality tolerance for your fields. This means that *all* fields whose cardinality is found to be less than the specified value will be described in the JSON Schema using the `enum` property.
+
+```bash
+$ ./schema-guru-0.2.0 --enum 5 --dir {{jsons_directory}}
+```
+
+### Web UI
+
+You can access our hosted demo of the Schema Guru web UI at [schemaguru.snowplowanalytics.com] [webui-hosted]. To run it locally:
+
+```bash
+$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.2.0.zip
+$ unzip schema_guru_webui_0.2.0.zip
+$ ./schema-guru-webui-0.2.0
+```
+
+The above will run a Spray web server containing Schema Guru on [0.0.0.0:8000] [webui-local]. Interface and port can be specified by `--interface` and `--port` respectively.
+
 ## Developer Quickstart
 
 Assuming git, **[Vagrant] [vagrant-install]** and **[VirtualBox] [virtualbox-install]** installed:
@@ -44,9 +72,18 @@ Assuming git, **[Vagrant] [vagrant-install]** and **[VirtualBox] [virtualbox-ins
  host$ cd schema-guru
  host$ vagrant up && vagrant ssh
 guest$ cd /vagrant
-guest$ sbt test
+guest$ sbt assembly
+guest$ sbt "project schema-guru-webui" assembly
 ```
 
+You can also deploy the Schema Guru web GUI onto Elastic Beanstalk:
+
+```
+guest$ cd beanstalk && zip beanstalk.zip *
+```
+
+Now just create a new Docker app in the **[Elastic Beanstalk Console] [beanstalk-console]** and upload this zipfile.
+
 ## User Manual
 
 ### Functionality
@@ -59,14 +96,68 @@ guest$ sbt test
   - date-time (according to ISO-8601)
   - IPv4 and IPv6 addresses
   - HTTP, HTTPS, FTP URLs
+* Recognizes base64 pattern for strings
 * Detects integer ranges according to Int16, Int32, Int64
+* Detects misspelt properties and produces warnings
+* Detects enum values with specified cardinality
+* Can output **[Self-describing JSON Schema] [self-describing]**
+* Can produce JSON Schemas with different names based on a given JSON Path
+* Supports **[Newline Delimited JSON] [ndjson]**
 
 ### Assumptions
 
 * All JSONs in the directory are assumed to be of the same event type and will be merged together
 * All JSONs are assumed to start with either `{ ... }` or `[ ... ]`
   - If they do not they are discarded
 * Schema should be as strict as possible - e.g. no `additionalProperties` are allowed currently
+* When using Schema Guru to derive a schema from newline delimited JSONs, the files need to have the `.ndjson` extension
+
+### Self-describing JSON
+Schema Guru allows you to produce **[Self-describing JSON Schema] [self-describing]**.
+To produce it you need to specify vendor, name (if segmentation isn't used, see below), and version (optional, default value is 0-1-0).
+
+```bash
+$ ./schema-guru-0.2.0 --dir {{jsons_directory}} --vendor {{your_company}} --name {{schema_name}} --schemaver {{version}}
+```
+
+### Schema Segmentation
+
+If you have a set of mixed JSONs from one vendor, but with slightly different structures, like:
+
+```json
+{ "version": 1,
+  "type": "track",
+  "userId": "019mr8mf4r",
+  "event": "Purchased an Item",
+  "properties": {
+    "revenue": "39.95",
+    "shippingMethod": "2-day" },
+  "timestamp" : "2012-12-02T00:30:08.276Z" }
+```
+
+and
+
+```json
+{ "version": 1,
+  "type": "track",
+  "userId": "019mr8mf4r",
+  "event": "Posted a Comment",
+  "properties": {
+    "body": "This book is gorgeous!",
+    "attachment": false },
+  "timestamp" : "2012-12-02T00:28:02.273Z" }
+```
+
+You can run it as follows:
+```bash
+$ ./schema-guru-0.2.0 --dir {{mixed_jsons_directory}} --output-dir {{output_dir}} --schema-by $.event
+```
+
+It will put two (or possibly more) JSON Schemas into the output dir: Purchased_an_Item.json and Posted_a_Comment.json.
+Each will be derived only from the JSONs with the corresponding event property, without any overlap.
+This assumes that the provided JSON Path points to a valid string.
+All JSONs where this JSON Path is absent or contains a non-string value will be merged into the unmatched.json schema in the same output dir.
+Also, when producing a Self-describing JSON Schema, the schema name is taken in the same way, so the --name argument can be omitted (the derived name replaces any name specified with that option).
 
 ### Example
 
@@ -162,13 +253,22 @@ limitations under the License.
 [license-image]: http://img.shields.io/badge/license-Apache--2-blue.svg?style=flat
 [license]: http://www.apache.org/licenses/LICENSE-2.0
 
-[release-image]: http://img.shields.io/badge/release-0.1.0-blue.svg?style=flat
+[release-image]: http://img.shields.io/badge/release-0.2.0-blue.svg?style=flat
 [releases]: https://github.com/snowplow/schema-guru/releases
 
 [json-schema]: http://json-schema.org/
 
+[ndjson]: http://ndjson.org/
+[ndjson-spec]: http://dataprotocols.org/ndjson/
+
+[webui-local]: http://0.0.0.0:8000
+[webui-hosted]: http://schemaguru.snowplowanalytics.com
+
 [snowplow]: https://github.com/snowplow/snowplow
 [iglu]: https://github.com/snowplow/iglu
+[self-describing]: http://snowplowanalytics.com/blog/2014/05/15/introducing-self-describing-jsons/
 
 [vagrant-install]: http://docs.vagrantup.com/v2/installation/index.html
 [virtualbox-install]: https://www.virtualbox.org/wiki/Downloads
+
+[beanstalk-console]: http://console.aws.amazon.com/elasticbeanstalk
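
Reviewer note on the Schema Segmentation section added to the README above: the mapping from the value found at the JSON Path to an output file name can be sketched as below. This is an illustration under stated assumptions, not the tool's code. The helper `schema_file_name` is hypothetical, the rule "spaces become underscores" is inferred only from the `Purchased_an_Item.json` example, and an empty argument stands in for "path absent or not a string".

```shell
#!/bin/sh
# Hypothetical helper, NOT part of Schema Guru: map the string found at the
# --schema-by JSON Path to the file name the README describes.
#   $1 = value at the JSON Path, or empty when absent / not a string
schema_file_name() {
  if [ -z "$1" ]; then
    # README: such JSONs are merged into unmatched.json
    printf 'unmatched.json\n'
  else
    # Assumption: spaces are replaced with underscores, case is preserved
    printf '%s.json\n' "$(printf '%s' "$1" | tr ' ' '_')"
  fi
}

schema_file_name "Purchased an Item"
schema_file_name ""
```

Any other normalization the tool may apply (for example to punctuation in event names) is not covered by this sketch.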
diff --git a/Vagrantfile b/Vagrantfile
@@ -4,12 +4,18 @@ Vagrant.configure("2") do |config|
   config.vm.hostname = "schema-guru"
   config.ssh.forward_agent = true
 
+  # Required for NFS to work, pick any local IP
+  # Use NFS for shared folders for better performance
+  # config.vm.network :private_network, ip: '192.168.50.50' # Uncomment to use NFS
+  # config.vm.synced_folder '.', '/vagrant', nfs: true # Uncomment to use NFS
+
   config.vm.provider :virtualbox do |vb|
     vb.name = Dir.pwd().split("/")[-1] + "-" + Time.now.to_f.to_i.to_s
     vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
     vb.customize [ "guestproperty", "set", :id, "--timesync-threshold", 10000 ]
     # Scala is memory-hungry
     vb.memory = 5120
+    # vb.cpus = 4 # Uncomment to use more cores
   end
 
   config.vm.provision :shell do |sh|

diff --git a/package/.gitignore → beanstalk/.gitignore b/package/.gitignore → beanstalk/.gitignore
diff --git a/beanstalk/Dockerfile b/beanstalk/Dockerfile
@@ -0,0 +1,10 @@
+# Dockerfile
+
+FROM java:7
+MAINTAINER Snowplow Analytics, [email protected]
+WORKDIR /
+USER daemon
+EXPOSE 8000
+
+ADD start.sh /tmp/  
+CMD ./tmp/start.sh  
diff --git a/beanstalk/Dockerrun.aws.json b/beanstalk/Dockerrun.aws.json
@@ -0,0 +1,11 @@
+{
+    "AWSEBDockerrunVersion": "1",
+    "Image": {
+        "Name": "java:7"
+    },
+    "Ports": [
+        {
+            "ContainerPort": "8000"
+        }
+    ]
+}
diff --git a/beanstalk/start.sh b/beanstalk/start.sh
@@ -0,0 +1,7 @@
+#!/bin/sh
+
+# Script must be executable!
+cd /tmp
+wget -N http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.2.0.zip
+unzip -o schema_guru_webui_0.2.0.zip
+java -jar schema-guru-webui-0.2.0
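
Reviewer note on start.sh above: the version string appears three times in the script, so a release bump has to touch each line. One possible way to keep it in a single place is sketched here. The variables `SCHEMA_GURU_VERSION` and `DRY_RUN` and the functions `run` and `start_webui` are hypothetical names introduced for illustration; the URL and artifact names are the ones used in the committed script. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only, not the committed start.sh.
# SCHEMA_GURU_VERSION and DRY_RUN are hypothetical environment variables.
VERSION="${SCHEMA_GURU_VERSION:-0.2.0}"
BASE_URL="http://dl.bintray.com/snowplow/snowplow-generic"
ZIP="schema_guru_webui_${VERSION}.zip"
JAR="schema-guru-webui-${VERSION}"

# Print each command; execute it only when DRY_RUN=0.
run() {
  printf '%s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Same four steps as start.sh, stopping at the first failure.
start_webui() {
  run cd /tmp &&
  run wget -N "${BASE_URL}/${ZIP}" &&
  run unzip -o "${ZIP}" &&
  run java -jar "${JAR}"
}

start_webui
```

Chaining the steps with `&&` also means a failed download does not fall through to `java -jar` on a missing file.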
diff --git a/bin/jarx-stub.sh b/bin/jarx-stub.sh
diff --git a/project/BuildSettings.scala b/project/BuildSettings.scala
@@ -17,12 +17,10 @@ import sbt._
 import Keys._
 
 object BuildSettings {
-
-  // Basic settings for our app
-  lazy val basicSettings = Seq[Setting[_]](
+  // Common settings for all our projects
+  lazy val commonSettings = Seq[Setting[_]](
     organization          :=  "com.snowplowanalytics",
-    version               :=  "0.1.0",
-    description           :=  "For deriving JSON Schemas from collections of JSON instances",
+    version               :=  "0.2.0",
     scalaVersion          :=  "2.10.5",
     crossScalaVersions    :=  Seq("2.10.5", "2.11.6"),
     scalacOptions         :=  Seq("-deprecation", "-encoding", "utf8",
@@ -32,42 +30,52 @@ object BuildSettings {
     resolvers             ++= Dependencies.resolutionRepos
   )
 
+  // Settings specific for Schema Guru CLI
+  lazy val coreSettings = Seq[Setting[_]](
+    description           :=  "For deriving JSON Schemas from collections of JSON instances",
+
+    mainClass in (Compile, run) := Some("com.snowplowanalytics.schemaguru.Main")
+  )
+
   // Makes our SBT app settings available from within the ETL
   lazy val scalifySettings = Seq(sourceGenerators in Compile <+= (sourceManaged in Compile, version, name, organization, scalaVersion) map { (d, v, n, o, sv) =>
     val file = d / "settings.scala"
     IO.write(file, """package com.snowplowanalytics.schemaguru.generated
-      |object ProjectSettings {
-      |  val version = "%s"
-      |  val name = "%s"
-      |  val organization = "%s"
-      |  val scalaVersion = "%s"
-      |}
-      |""".stripMargin.format(v, n, o, sv))
+                     |object ProjectSettings {
+                     |  val version = "%s"
+                     |  val name = "%s"
+                     |  val organization = "%s"
+                     |  val scalaVersion = "%s"
+                     |}
+                     |""".stripMargin.format(v, n, o, sv))
     Seq(file)
   })
 
   // sbt-assembly settings for building a fat jar
   import sbtassembly.Plugin._
   import AssemblyKeys._
-  lazy val sbtAssemblySettings = assemblySettings ++ Seq(
-
+  lazy val sbtAssemblyCommonSettings = assemblySettings ++ Seq(
     // Executable jarfile
     assemblyOption in assembly ~= { _.copy(prependShellScript = Some(defaultShellScript)) },
 
     // Name it as an executable
-    jarName in assembly := { s"${name.value}-${version.value}" },
+    jarName in assembly := { s"${name.value}-${version.value}" }
+  )
 
+  lazy val sbtAssemblyCoreSettings = sbtAssemblyCommonSettings ++ Seq(
     // Drop these jars
     excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
       val excludes = Set(
         "commons-beanutils-1.8.3.jar" // Clashes with commons-collections
       )
       cp filter { jar => excludes(jar.data.getName) }
     },
-
-    // Make this executable
-    mainClass in assembly := Some("com.snowplowanalytics.schemaguru.SchemaGuruApp")
+    mainClass in assembly := Some("com.snowplowanalytics.schemaguru.Main")
   )
 
-  lazy val buildSettings = basicSettings ++ scalifySettings ++ sbtAssemblySettings
+  lazy val coreBuildSettings =
+    commonSettings ++
+    coreSettings ++
+    scalifySettings ++
+    sbtAssemblyCoreSettings
 }