Merge branch 'release/0.4.0'
alexanderdean committed Nov 17, 2015
2 parents c1b4d09 + 49b58ce · commit 3de5dc2
Showing 60 changed files with 2,619 additions and 1,246 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -2,6 +2,7 @@
*.log
.idea
.DS_Store
*.pyc

# Scala worksheets
src/test/scala/*.sc
@@ -19,4 +20,4 @@ project/plugins/project/

# Vagrant
.vagrant
VERSION
VERSION
26 changes: 20 additions & 6 deletions CHANGELOG
@@ -1,5 +1,19 @@
0.3.0 (2015-07-29)
------------------
Version 0.4.0 (2015-11-17)
--------------------------
Split core, cli and subprojects (#9)
Implemented Spark job wrapping Schema Guru (#10)
JsonSchema upgraded to a Monoid (#23)
Detection and merge algorithm refactored from transforming raw JSON to use of Schema Types (#52)
Added detection of known enum sets (#66)
Fixed misapplying of base64 pattern (#76)
Fixed incorrect schema for array structures (#81)
Option --size renamed to --varchar-size (#98)
Disallowed conjunction of --with-json-paths and --split-product-types options (#99)
Schema URI comment in header of DDL replaced with COMMENT ON statement (#105)
Now deriving minLength and maxLength for string values (#107)

Version 0.3.0 (2015-07-28)
--------------------------
Swapped all occurrences of "igluutils" with "schemaddl" to reflect renaming (#97)
Fixed ordering for JSONPaths file (#96)
Updated README to reflect new 0.3.0 (#93)
@@ -11,8 +25,8 @@ Unified CLI options (#90)
Added `ddl` command which generates JSON Paths files and Redshift DDL (#84)
Moved existing functionality into `derive` command (#83)

0.2.0 (2015-07-01)
------------------
Version 0.2.0 (2015-07-01)
--------------------------
Updated vagrant push to also build and publish webui artifact (#72)
Added NS and CORE settings to Vagrantfile to improve performance (#79)
Removed bin/jarx-stub.sh from project (#71)
@@ -34,6 +48,6 @@ Created a single-page UI in plain JS (#39)
Added a sbt sub-project to schema-guru which embeds schema-guru in a Spray server (#53)
Fixed incorrectly reduced integer and number (#60)

0.1.0 (2015-06-03)
------------------
Version 0.1.0 (2015-06-03)
--------------------------
Initial release
110 changes: 89 additions & 21 deletions README.md
@@ -2,7 +2,7 @@

[ ![Build Status] [travis-image] ] [travis] [ ![Release] [release-image] ] [releases] [ ![License] [license-image] ] [license]

Schema Guru is a tool (CLI and web) allowing you to derive **[JSON Schemas] [json-schema]** from a set of JSON instances, then process and transform them into different data definition formats.
Schema Guru is a tool (CLI, Spark job and web) allowing you to derive **[JSON Schemas] [json-schema]** from a set of JSON instances, then process and transform them into different data definition formats.

Current primary features include:

@@ -18,8 +18,8 @@ Schema Guru is used heavily in association with Snowplow's own **[Snowplow] [sno
Download the latest Schema Guru from Bintray:

```bash
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.3.0.zip
$ unzip schema_guru_0.3.0.zip
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.4.0.zip
$ unzip schema_guru_0.4.0.zip
```

Assuming you have a recent JVM installed.
@@ -33,25 +33,58 @@ You can use as input either single JSON file or directory with JSON instances (i
The following command will print the JSON Schema to stdout:

```bash
$ ./schema-guru-0.3.0 schema {{input}}
$ ./schema-guru-0.4.0 schema {{input}}
```

You can also specify an output file for your schema:

```bash
$ ./schema-guru-0.3.0 schema --output {{json_schema_file}} {{input}}
$ ./schema-guru-0.4.0 schema --output {{json_schema_file}} {{input}}
```

You can also switch Schema Guru into **[NDJSON] [ndjson]** mode, where it will look for newline-delimited JSONs:

```bash
$ ./schema-guru-0.3.0 schema --ndjson {{input}}
$ ./schema-guru-0.4.0 schema --ndjson {{input}}
```

You can specify the enum cardinality tolerance for your fields. This means that *all* fields found to have fewer distinct values than the specified cardinality will be expressed in the JSON Schema using the `enum` property.

```bash
$ ./schema-guru-0.3.0 schema --enum 5 {{input}}
$ ./schema-guru-0.4.0 schema --enum 5 {{input}}
```

If you know that some particular set of values can appear, but don't want to set a large enum cardinality, you can specify a predefined enum set with the ``--enum-sets`` multioption, like this:

```bash
$ ./schema-guru-0.4.0 schema --enum-sets iso_4217 --enum-sets iso_3166-1_alpha-3 /path/to/instances
```

Currently Schema Guru includes the following built-in enum sets (written as they should appear on the CLI):

* [iso_4217] [iso-4217]
* [iso_3166-1_alpha-2] [iso-3166-1-alpha-2]
* [iso_3166-1_alpha-3] [iso-3166-1-alpha-3]
* The special `all` set, which means all built-in enum sets will be included

If you need to include a very specific enum set, you can define it yourself in a JSON file containing an array, like this:

```json
["Mozilla Firefox", "Google Chrome", "Netscape Navigator", "Internet Explorer"]
```

Then pass the path to this file instead of an enum name:

```bash
$ ./schema-guru-0.4.0 schema --enum-sets all --enum-sets /path/to/browsers.json /path/to/instances
```

Schema Guru will derive `minLength` and `maxLength` properties for strings based on the shortest and longest strings encountered.
This may be a problem if you process only a small number of instances.
To avoid an overly strict Schema, you can use the `--no-length` option.

```bash
$ ./schema-guru-0.4.0 schema --no-length /path/to/few-instances
```

#### DDL derivation
@@ -63,25 +96,25 @@ Currently we support DDL only for **[Amazon Redshift] [redshift]**, but in futur
The following command will just save Redshift (the default ``--db`` value) DDL to the current dir.

```bash
$ ./schema-guru-0.3.0 ddl {{input}}
$ ./schema-guru-0.4.0 ddl {{input}}
```

You can also specify a directory for output:

```bash
$ ./schema-guru-0.3.0 ddl --output {{ddl_dir}} {{input}}
$ ./schema-guru-0.4.0 ddl --output {{ddl_dir}} {{input}}
```

If you're not a Snowplow Platform user, don't use **[Self-describing Schema] [self-describing]**, or just don't want anything specific to it, you can produce a raw schema:

```bash
$ ./schema-guru-0.3.0 ddl --raw {{input}}
$ ./schema-guru-0.4.0 ddl --raw {{input}}
```

You may also want to get a JSONPaths file for Redshift's **[COPY] [redshift-copy]** command. This will place a ``jsonpaths`` dir alongside ``sql``:

```bash
$ ./schema-guru-0.3.0 ddl --with-json-paths {{input}}
$ ./schema-guru-0.4.0 ddl --with-json-paths {{input}}
```

The most awkward part of shifting from the dynamically-typed world to the statically-typed one is product types (or union types) in JSON Schema, like this: ``["integer", "string"]``.
@@ -90,31 +123,53 @@ Thus we provide you two options. By default product types will be transformed as
But there's another way: you can split a column with product types into separate columns, with its types as a suffix. For example, property ``model`` with type ``["string", "integer"]`` will be transformed into two columns, ``model_string`` and ``model_integer``.
This behaviour can be achieved with ``--split-product-types``, as sketched below.

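A minimal sketch of such an invocation, following the same ``{{input}}`` placeholder convention as the other examples (note that, per the changelog, ``--split-product-types`` cannot be combined with ``--with-json-paths``):

```bash
$ ./schema-guru-0.4.0 ddl --split-product-types {{input}}
```
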
Another thing everyone needs to consider is the default VARCHAR size. If there are no clues about it (like ``maxLength``), 255 will be used.
Another thing everyone needs to consider is the default VARCHAR size. If there are no clues about it (like ``maxLength``), 4096 will be used.
You can also specify this default value:

```bash
$ ./schema-guru-0.3.0 ddl --size 32 {{input}}
$ ./schema-guru-0.4.0 ddl --varchar-size 32 {{input}}
```

You can also specify a Redshift schema for your table. In non-raw mode, ``atomic`` is used as the default.

```bash
$ ./schema-guru-0.3.0 ddl --raw --schema business {{input}}
$ ./schema-guru-0.4.0 ddl --raw --schema business {{input}}
```

### Web UI

You can access our hosted demo of the Schema Guru web UI at [schemaguru.snplowanalytics.com] [webui-hosted]. To run it locally:

```bash
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.3.0.zip
$ unzip schema_guru_webui_0.3.0.zip
$ ./schema-guru-webui-0.3.0
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.4.0.zip
$ unzip schema_guru_webui_0.4.0.zip
$ ./schema-guru-webui-0.4.0
```

The above will run a Spray web server containing Schema Guru on [0.0.0.0:8000] [webui-local]. The interface and port can be specified with `--interface` and `--port` respectively:

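For example, a hypothetical invocation (the interface and port values here are placeholders, not defaults):

```bash
$ ./schema-guru-webui-0.4.0 --interface 127.0.0.1 --port 8080
```
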
### Apache Spark

Since version 0.4.0, Schema Guru ships with a Spark job for deriving JSON Schemas.
To help users get started with Schema Guru on Amazon Elastic MapReduce, we provide a [pyinvoke] [pyinvoke] ``tasks.py``.

The recommended way to start is to install all requirements and assemble the fatjar as described in [Developer Quickstart](#developer-quickstart).

Before running you need:

* An AWS CLI profile, e.g. *my-profile*
* An EC2 keypair, e.g. *my-ec2-keypair*
* At least one Amazon S3 bucket, e.g. *my-bucket*

To provision the cluster and start the job, use the `run_emr` task:

```bash
$ cd sparkjob
$ inv run_emr my-profile my-bucket/input/ my-bucket/output/ my-bucket/errors/ my-bucket/logs my-ec2-keypair
```

If you need specific options for the Spark job, you can specify them in `tasks.py`. The Spark job accepts the same options as the CLI application, but note that `--output` isn't optional, and there is a new optional `--errors-path`.

## Developer Quickstart

Assuming git, **[Vagrant] [vagrant-install]** and **[VirtualBox] [virtualbox-install]** are installed:

@@ -125,7 +180,13 @@

```bash
host$ vagrant up && vagrant ssh
guest$ cd /vagrant
guest$ sbt assembly
```

Optionally, you can also assemble the web UI and Spark job fatjars:

```bash
guest$ sbt "project schema-guru-webui" assembly
guest$ sbt "project schema-guru-sparkjob" assembly
```

You can also deploy the Schema Guru web GUI onto Elastic Beanstalk:
@@ -150,10 +211,12 @@ Now just create a new Docker app in the **[Elastic Beanstalk Console] [beanstalk
- date-time (according to ISO-8601)
- IPv4 and IPv6 addresses
- HTTP, HTTPS, FTP URLs
* Recognizes minLength and maxLength properties for strings
* Recognizes base64 pattern for strings
* Detects integer ranges according to Int16, Int32, Int64
* Detects misspelt properties and produces warnings
* Detects enum values with specified cardinality
* Detects known enum sets built-in or specified by user
* Allows you to output **[Self-describing JSON Schema] [self-describing]**
* Allows you to produce JSON Schemas with different names based on a given JSON Path
* Supports **[Newline Delimited JSON] [ndjson]**
@@ -172,7 +235,7 @@ Now just create a new Docker app in the **[Elastic Beanstalk Console] [beanstalk
* Number with ``multipleOf`` 0.01 becomes ``DECIMAL``
* Handles Self-describing JSON and can produce raw DDL
* Recognizes integer size by ``minimum`` and ``maximum`` values

* Object without ``properties``, but with ``patternProperties`` becomes ``VARCHAR(4096)``

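As a purely illustrative sketch of these mappings (assumed behaviour, not captured Schema Guru output), a JSON Schema property like:

```json
{
  "type": "number",
  "multipleOf": 0.01
}
```

should map to a ``DECIMAL`` column, while a string property carrying ``"maxLength": 32`` would become ``VARCHAR(32)`` rather than the default size.
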
### Assumptions

@@ -186,7 +249,7 @@ Now just create a new Docker app in the **[Elastic Beanstalk Console] [beanstalk
To produce it you need to specify a vendor, a name (unless segmentation is being used, see below), and a version (optional; the default value is 1-0-0).

```bash
$ ./schema-guru-0.3.0 schema --vendor {{your_company}} --name {{schema_name}} --schemaver {{version}} {{input}}
$ ./schema-guru-0.4.0 schema --vendor {{your_company}} --name {{schema_name}} --schemaver {{version}} {{input}}
```

### Schema Segmentation
@@ -219,7 +282,7 @@ and

You can run it as follows:
```bash
$ ./schema-guru-0.3.0 schema --output {{output_dir}} --schema-by $.event {{mixed_jsons_directory}}
$ ./schema-guru-0.4.0 schema --output {{output_dir}} --schema-by $.event {{mixed_jsons_directory}}
```

It will put two (or maybe more) JSON Schemas into the output dir: Purchased_an_Item.json and Posted_a_comment.json.
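For illustration, two hypothetical instances (shown in newline-delimited form; only the `event` values are implied by the file names above, the other fields are invented):

```json
{ "event": "Purchased an Item", "item_id": 42 }
{ "event": "Posted a comment", "text": "Great article!" }
```
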
@@ -322,7 +385,7 @@ limitations under the License.
[license-image]: http://img.shields.io/badge/license-Apache--2-blue.svg?style=flat
[license]: http://www.apache.org/licenses/LICENSE-2.0

[release-image]: http://img.shields.io/badge/release-0.3.0-blue.svg?style=flat
[release-image]: http://img.shields.io/badge/release-0.4.0-blue.svg?style=flat
[releases]: https://github.com/snowplow/schema-guru/releases

[json-schema]: http://json-schema.org/
Expand All @@ -343,5 +406,10 @@ limitations under the License.

[vagrant-install]: http://docs.vagrantup.com/v2/installation/index.html
[virtualbox-install]: https://www.virtualbox.org/wiki/Downloads
[pyinvoke]: http://www.pyinvoke.org/

[beanstalk-console]: http://console.aws.amazon.com/elasticbeanstalk

[iso-4217]: https://en.wikipedia.org/wiki/ISO_4217
[iso-3166-1-alpha-2]: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
[iso-3166-1-alpha-3]: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3
2 changes: 1 addition & 1 deletion project/BuildSettings.scala
@@ -20,7 +20,7 @@ object BuildSettings {
// Common settings for all our projects
lazy val commonSettings = Seq[Setting[_]](
organization := "com.snowplowanalytics",
version := "0.3.0",
version := "0.4.0",
scalaVersion := "2.10.5",
crossScalaVersions := Seq("2.10.5", "2.11.6"),
scalacOptions := Seq("-deprecation", "-encoding", "utf8",
9 changes: 5 additions & 4 deletions project/Dependencies.scala
@@ -30,16 +30,16 @@ object Dependencies {
// Scala
val argot = "1.0.3"
val scalaz7 = "7.0.6"
val algebird = "0.8.1"
val json4s = "3.2.11"
val json4s = "3.2.10" // don't upgrade to 3.2.11 https://github.com/json4s/json4s/issues/212
val jsonpath = "0.6.4"
val schemaddl = "0.2.0"
val akka = "2.3.9"
val spray = "1.3.3"
val spark = "1.3.1"
// Scala (test only)
val specs2 = "2.3.13"
val scalazSpecs2 = "0.2"
val scalaCheck = "1.12.2"
val schemaddl = "0.1.0"
}

object Libraries {
@@ -52,7 +52,6 @@ object Dependencies {
// Scala
val argot = "org.clapper" %% "argot" % V.argot
val scalaz7 = "org.scalaz" %% "scalaz-core" % V.scalaz7
val algebird = "com.twitter" %% "algebird-core" % V.algebird
val json4sJackson = "org.json4s" %% "json4s-jackson" % V.json4s
val json4sScalaz = "org.json4s" %% "json4s-scalaz" % V.json4s
val jsonpath = "io.gatling" %% "jsonpath" % V.jsonpath
@@ -61,6 +60,8 @@ object Dependencies {
val akka = "com.typesafe.akka" %% "akka-actor" % V.akka
val sprayCan = "io.spray" %% "spray-can" % V.spray
val sprayRouting = "io.spray" %% "spray-routing" % V.spray
// Spark
val sparkCore = "org.apache.spark" %% "spark-core" % V.spark % "provided"
// Scala (test only)
val specs2 = "org.specs2" %% "specs2" % V.specs2 % "test"
val scalazSpecs2 = "org.typelevel" %% "scalaz-specs2" % V.scalazSpecs2 % "test"
9 changes: 8 additions & 1 deletion project/SchemaGuruBuild.scala
@@ -20,6 +20,7 @@ object SchemaGuruBuild extends Build {
import Dependencies._
import BuildSettings._
import WebuiBuildSettings._
import SparkjobBuildSettings._

// Configure prompt to show current project.
override lazy val settings = super.settings :+ {
@@ -41,7 +42,6 @@
// Scala
Libraries.argot,
Libraries.scalaz7,
Libraries.algebird,
Libraries.json4sJackson,
Libraries.json4sScalaz,
Libraries.jsonpath,
@@ -68,4 +68,11 @@
)
)
.dependsOn(project)

lazy val sparkjob = Project("schema-guru-sparkjob", file("sparkjob"))
.settings(sparkjobBuildSettings: _*)
.settings(
libraryDependencies ++= Seq(Libraries.sparkCore)
)
.dependsOn(project)
}