Merge branch 'release/0.4.0'
alexanderdean committed Nov 17, 2015
2 parents c1b4d09 + 49b58ce · commit 3de5dc2
Showing 60 changed files with 2,619 additions and 1,246 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -2,6 +2,7 @@
*.log
.idea
.DS_Store
*.pyc

# Scala worksheets
src/test/scala/*.sc
@@ -19,4 +20,4 @@ project/plugins/project/

# Vagrant
.vagrant
VERSION
VERSION
26 changes: 20 additions & 6 deletions CHANGELOG
@@ -1,5 +1,19 @@
0.3.0 (2015-07-29)
------------------
Version 0.4.0 (2015-11-17)
--------------------------
Split core, cli and subprojects (#9)
Implemented Spark job wrapping Schema Guru (#10)
JsonSchema upgraded to a Monoid (#23)
Detection and merge algorithm refactored from transforming raw JSON to use of Schema Types (#52)
Added detection of known enum sets (#66)
Fixed misapplying of base64 pattern (#76)
Fixed incorrect schema for array structures (#81)
Option --size renamed to --varchar-size (#98)
Disallowed conjunction of --with-json-paths and --split-product-types options (#99)
Schema URI comment in header of DDL replaced with COMMENT ON statement (#105)
Now deriving minLength and maxLength for string values (#107)

Version 0.3.0 (2015-07-28)
--------------------------
Swapped all occurrences of "igluutils" with "schemaddl" to reflect renaming (#97)
Fixed ordering for JSONPaths file (#96)
Updated README to reflect new 0.3.0 (#93)
@@ -11,8 +25,8 @@ Unified CLI options (#90)
Added `ddl` command which generates JSON Paths files and Redshift DDL (#84)
Moved existing functionality into `derive` command (#83)

0.2.0 (2015-07-01)
------------------
Version 0.2.0 (2015-07-01)
--------------------------
Updated vagrant push to also build and publish webui artifact (#72)
Added NS and CORE settings to Vagrantfile to improve performance (#79)
Removed bin/jarx-stub.sh from project (#71)
@@ -34,6 +48,6 @@ Created a single-page UI in plain JS (#39)
Added a sbt sub-project to schema-guru which embeds schema-guru in a Spray server (#53)
Fixed incorrectly reduced integer and number (#60)

0.1.0 (2015-06-03)
------------------
Version 0.1.0 (2015-06-03)
--------------------------
Initial release
110 changes: 89 additions & 21 deletions README.md
@@ -2,7 +2,7 @@

[ ![Build Status] [travis-image] ] [travis] [ ![Release] [release-image] ] [releases] [ ![License] [license-image] ] [license]

Schema Guru is a tool (CLI and web) allowing you to derive **[JSON Schemas] [json-schema]** from a set of JSON instances, then process and transform them into different data definition formats.
Schema Guru is a tool (CLI, Spark job and web) allowing you to derive **[JSON Schemas] [json-schema]** from a set of JSON instances, then process and transform them into different data definition formats.

Current primary features include:

@@ -18,8 +18,8 @@ Schema Guru is used heavily in association with Snowplow's own **[Snowplow] [sno
Download the latest Schema Guru from Bintray:

```bash
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.3.0.zip
$ unzip schema_guru_0.3.0.zip
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_0.4.0.zip
$ unzip schema_guru_0.4.0.zip
```

Assuming you have a recent JVM installed.
@@ -33,25 +33,58 @@ You can use as input either single JSON file or directory with JSON instances (i
The following command will print the JSON Schema to stdout:

```bash
$ ./schema-guru-0.3.0 schema {{input}}
$ ./schema-guru-0.4.0 schema {{input}}
```

You can also specify an output file for your schema:

```bash
$ ./schema-guru-0.3.0 schema --output {{json_schema_file}} {{input}}
$ ./schema-guru-0.4.0 schema --output {{json_schema_file}} {{input}}
```

You can also switch Schema Guru into **[NDJSON] [ndjson]** mode, where it will look for newline-delimited JSONs:

```bash
$ ./schema-guru-0.3.0 schema --ndjson {{input}}
$ ./schema-guru-0.4.0 schema --ndjson {{input}}
```

You can specify the enum cardinality tolerance for your fields. This means that *all* fields found to have fewer distinct values than the specified cardinality will be expressed in the JSON Schema using the `enum` property.

```bash
$ ./schema-guru-0.3.0 schema --enum 5 {{input}}
$ ./schema-guru-0.4.0 schema --enum 5 {{input}}
```

If you know that some particular set of values can appear, but don't want to set a large enum cardinality, you can specify a predefined enum set with the ``--enum-sets`` multioption, like this:

```bash
$ ./schema-guru-0.4.0 schema --enum-sets iso_4217 --enum-sets iso_3166-1_alpha-3 /path/to/instances
```

Currently Schema Guru includes the following built-in enum sets (written as they should appear on the CLI):

* [iso_4217] [iso-4217]
* [iso_3166-1_alpha-2] [iso-3166-1-alpha-2]
* [iso_3166-1_alpha-3] [iso-3166-1-alpha-3]
* The special `all` set, which means all built-in enum sets will be included

If you need to include a very specific enum set, you can define it yourself in a JSON file containing an array, like this:

```json
["Mozilla Firefox", "Google Chrome", "Netscape Navigator", "Internet Explorer"]
```

Then pass the path to this file instead of an enum name:

```bash
$ ./schema-guru-0.4.0 schema --enum-sets all --enum-sets /path/to/browsers.json /path/to/instances
```

Schema Guru will derive `minLength` and `maxLength` properties for strings based on the shortest and longest strings encountered.
This may be a problem if you process only a small number of instances.
To avoid an overly strict Schema, you can use the `--no-length` option.

```bash
$ ./schema-guru-0.4.0 schema --no-length /path/to/few-instances
```

#### DDL derivation
@@ -63,25 +96,25 @@ Currently we support DDL only for **[Amazon Redshift] [redshift]**, but in futur
The following command will just save Redshift (the default ``--db`` value) DDL to the current dir.

```bash
$ ./schema-guru-0.3.0 ddl {{input}}
$ ./schema-guru-0.4.0 ddl {{input}}
```

You can also specify a directory for output:

```bash
$ ./schema-guru-0.3.0 ddl --output {{ddl_dir}} {{input}}
$ ./schema-guru-0.4.0 ddl --output {{ddl_dir}} {{input}}
```

If you're not a Snowplow Platform user, don't use **[Self-describing Schema] [self-describing]**, or just don't want anything specific to it, you can produce a raw schema:

```bash
$ ./schema-guru-0.3.0 ddl --raw {{input}}
$ ./schema-guru-0.4.0 ddl --raw {{input}}
```

You may also want to get a JSONPaths file for Redshift's **[COPY] [redshift-copy]** command. This will place a ``jsonpaths`` dir alongside ``sql``:

```bash
$ ./schema-guru-0.3.0 ddl --with-json-paths {{input}}
$ ./schema-guru-0.4.0 ddl --with-json-paths {{input}}
```

The most awkward part of shifting from the dynamically-typed world to the statically-typed one is product types (or union types) in JSON Schema, like this: ``["integer", "string"]``.
@@ -90,31 +123,53 @@ Thus we provide you two options. By default product types will be transformed as
But there's another way: you can split a column with product types into separate columns, with its types as a suffix. For example, property ``model`` with type ``["string", "integer"]`` will be transformed into two columns, ``model_string`` and ``model_integer``.
This behaviour can be achieved with ``--split-product-types``, as sketched below.

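A minimal sketch of such an invocation, following the same ``{{input}}`` placeholder convention as the other examples (note that, per the changelog, ``--split-product-types`` cannot be combined with ``--with-json-paths``):

```bash
$ ./schema-guru-0.4.0 ddl --split-product-types {{input}}
```
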
Another thing everyone needs to consider is the default VARCHAR size. If there are no clues about it (like ``maxLength``), 255 will be used.
Another thing everyone needs to consider is the default VARCHAR size. If there are no clues about it (like ``maxLength``), 4096 will be used.
You can also specify this default value:

```bash
$ ./schema-guru-0.3.0 ddl --size 32 {{input}}
$ ./schema-guru-0.4.0 ddl --varchar-size 32 {{input}}
```

You can also specify a Redshift schema for your table. In non-raw mode, ``atomic`` is used as the default.

```bash
$ ./schema-guru-0.3.0 ddl --raw --schema business {{input}}
$ ./schema-guru-0.4.0 ddl --raw --schema business {{input}}
```

### Web UI

You can access our hosted demo of the Schema Guru web UI at [schemaguru.snplowanalytics.com] [webui-hosted]. To run it locally:

```bash
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.3.0.zip
$ unzip schema_guru_webui_0.3.0.zip
$ ./schema-guru-webui-0.3.0
$ wget http://dl.bintray.com/snowplow/snowplow-generic/schema_guru_webui_0.4.0.zip
$ unzip schema_guru_webui_0.4.0.zip
$ ./schema-guru-webui-0.4.0
```

The above will run a Spray web server containing Schema Guru on [0.0.0.0:8000] [webui-local]. The interface and port can be specified with `--interface` and `--port` respectively:

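For example, a hypothetical invocation (the interface and port values here are placeholders, not defaults):

```bash
$ ./schema-guru-webui-0.4.0 --interface 127.0.0.1 --port 8080
```
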
### Apache Spark

Since version 0.4.0, Schema Guru ships with a Spark job for deriving JSON Schemas.
To help users get started with Schema Guru on Amazon Elastic MapReduce, we provide a [pyinvoke] [pyinvoke] ``tasks.py``.

The recommended way to start is to install all requirements and assemble the fatjar as described in [Developer Quickstart](#developer-quickstart).

Before running you need:

* An AWS CLI profile, e.g. *my-profile*
* An EC2 keypair, e.g. *my-ec2-keypair*
* At least one Amazon S3 bucket, e.g. *my-bucket*

To provision the cluster and start the job, use the `run_emr` task:

```bash
$ cd sparkjob
$ inv run_emr my-profile my-bucket/input/ my-bucket/output/ my-bucket/errors/ my-bucket/logs my-ec2-keypair
```

If you need specific options for the Spark job, you can specify them in `tasks.py`. The Spark job accepts the same options as the CLI application, but note that `--output` isn't optional, and there is a new optional `--errors-path`.

## Developer Quickstart

Assuming git, **[Vagrant] [vagrant-install]** and **[VirtualBox] [virtualbox-install]** are installed:

@@ -125,7 +180,13 @@

```bash
host$ vagrant up && vagrant ssh
guest$ cd /vagrant
guest$ sbt assembly
```

Optionally, you can also assemble the web UI and Spark job fatjars:

```bash
guest$ sbt "project schema-guru-webui" assembly
guest$ sbt "project schema-guru-sparkjob" assembly
```

You can also deploy the Schema Guru web GUI onto Elastic Beanstalk:
@@ -150,10 +211,12 @@ Now just create a new Docker app in the **[Elastic Beanstalk Console] [beanstalk
- date-time (according to ISO-8601)
- IPv4 and IPv6 addresses
- HTTP, HTTPS, FTP URLs
* Recognizes minLength and maxLength properties for strings
* Recognizes base64 pattern for strings
* Detects integer ranges according to Int16, Int32, Int64
* Detects misspelt properties and produces warnings
* Detects enum values with specified cardinality
* Detects known enum sets built-in or specified by user
* Allows you to output **[Self-describing JSON Schema] [self-describing]**
* Allows you to produce JSON Schemas with different names based on a given JSON Path
* Supports **[Newline Delimited JSON] [ndjson]**
@@ -172,7 +235,7 @@ Now just create a new Docker app in the **[Elastic Beanstalk Console] [beanstalk
* Number with ``multipleOf`` 0.01 becomes ``DECIMAL``
* Handles Self-describing JSON and can produce raw DDL
* Recognizes integer size by ``minimum`` and ``maximum`` values

* Object without ``properties``, but with ``patternProperties`` becomes ``VARCHAR(4096)``

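As a purely illustrative sketch of these mappings (assumed behaviour, not captured Schema Guru output), a JSON Schema property like:

```json
{
  "type": "number",
  "multipleOf": 0.01
}
```

should map to a ``DECIMAL`` column, while a string property carrying ``"maxLength": 32`` would become ``VARCHAR(32)`` rather than the default size.
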
### Assumptions

@@ -186,7 +249,7 @@ Now just create a new Docker app in the **[Elastic Beanstalk Console] [beanstalk
To produce it you need to specify a vendor, a name (unless segmentation is being used, see below), and a version (optional; the default value is 1-0-0).

```bash
$ ./schema-guru-0.3.0 schema --vendor {{your_company}} --name {{schema_name}} --schemaver {{version}} {{input}}
$ ./schema-guru-0.4.0 schema --vendor {{your_company}} --name {{schema_name}} --schemaver {{version}} {{input}}
```

### Schema Segmentation
@@ -219,7 +282,7 @@ and

You can run it as follows:
```bash
$ ./schema-guru-0.3.0 schema --output {{output_dir}} --schema-by $.event {{mixed_jsons_directory}}
$ ./schema-guru-0.4.0 schema --output {{output_dir}} --schema-by $.event {{mixed_jsons_directory}}
```

It will put two (or maybe more) JSON Schemas into the output dir: Purchased_an_Item.json and Posted_a_comment.json.
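For illustration, two hypothetical instances (shown in newline-delimited form; only the `event` values are implied by the file names above, the other fields are invented):

```json
{ "event": "Purchased an Item", "item_id": 42 }
{ "event": "Posted a comment", "text": "Great article!" }
```
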
@@ -322,7 +385,7 @@ limitations under the License.
[license-image]: http://img.shields.io/badge/license-Apache--2-blue.svg?style=flat
[license]: http://www.apache.org/licenses/LICENSE-2.0

[release-image]: http://img.shields.io/badge/release-0.3.0-blue.svg?style=flat
[release-image]: http://img.shields.io/badge/release-0.4.0-blue.svg?style=flat
[releases]: https://github.com/snowplow/schema-guru/releases

[json-schema]: http://json-schema.org/
Expand All @@ -343,5 +406,10 @@ limitations under the License.

[vagrant-install]: http://docs.vagrantup.com/v2/installation/index.html
[virtualbox-install]: https://www.virtualbox.org/wiki/Downloads
[pyinvoke]: http://www.pyinvoke.org/

[beanstalk-console]: http://console.aws.amazon.com/elasticbeanstalk

[iso-4217]: https://en.wikipedia.org/wiki/ISO_4217
[iso-3166-1-alpha-2]: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2
[iso-3166-1-alpha-3]: https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3
2 changes: 1 addition & 1 deletion project/BuildSettings.scala
@@ -20,7 +20,7 @@ object BuildSettings {
// Common settings for all our projects
lazy val commonSettings = Seq[Setting[_]](
organization := "com.snowplowanalytics",
version := "0.3.0",
version := "0.4.0",
scalaVersion := "2.10.5",
crossScalaVersions := Seq("2.10.5", "2.11.6"),
scalacOptions := Seq("-deprecation", "-encoding", "utf8",
9 changes: 5 additions & 4 deletions project/Dependencies.scala
@@ -30,16 +30,16 @@ object Dependencies {
// Scala
val argot = "1.0.3"
val scalaz7 = "7.0.6"
val algebird = "0.8.1"
val json4s = "3.2.11"
val json4s = "3.2.10" // don't upgrade to 3.2.11 https://github.com/json4s/json4s/issues/212
val jsonpath = "0.6.4"
val schemaddl = "0.2.0"
val akka = "2.3.9"
val spray = "1.3.3"
val spark = "1.3.1"
// Scala (test only)
val specs2 = "2.3.13"
val scalazSpecs2 = "0.2"
val scalaCheck = "1.12.2"
val schemaddl = "0.1.0"
}

object Libraries {
@@ -52,7 +52,6 @@ object Dependencies {
// Scala
val argot = "org.clapper" %% "argot" % V.argot
val scalaz7 = "org.scalaz" %% "scalaz-core" % V.scalaz7
val algebird = "com.twitter" %% "algebird-core" % V.algebird
val json4sJackson = "org.json4s" %% "json4s-jackson" % V.json4s
val json4sScalaz = "org.json4s" %% "json4s-scalaz" % V.json4s
val jsonpath = "io.gatling" %% "jsonpath" % V.jsonpath
@@ -61,6 +60,8 @@ object Dependencies {
val akka = "com.typesafe.akka" %% "akka-actor" % V.akka
val sprayCan = "io.spray" %% "spray-can" % V.spray
val sprayRouting = "io.spray" %% "spray-routing" % V.spray
// Spark
val sparkCore = "org.apache.spark" %% "spark-core" % V.spark % "provided"
// Scala (test only)
val specs2 = "org.specs2" %% "specs2" % V.specs2 % "test"
val scalazSpecs2 = "org.typelevel" %% "scalaz-specs2" % V.scalazSpecs2 % "test"
9 changes: 8 additions & 1 deletion project/SchemaGuruBuild.scala
@@ -20,6 +20,7 @@ object SchemaGuruBuild extends Build {
import Dependencies._
import BuildSettings._
import WebuiBuildSettings._
import SparkjobBuildSettings._

// Configure prompt to show current project.
override lazy val settings = super.settings :+ {
@@ -41,7 +42,6 @@
// Scala
Libraries.argot,
Libraries.scalaz7,
Libraries.algebird,
Libraries.json4sJackson,
Libraries.json4sScalaz,
Libraries.jsonpath,
@@ -68,4 +68,11 @@
)
)
.dependsOn(project)

lazy val sparkjob = Project("schema-guru-sparkjob", file("sparkjob"))
.settings(sparkjobBuildSettings: _*)
.settings(
libraryDependencies ++= Seq(Libraries.sparkCore)
)
.dependsOn(project)
}