- github-232: [BUG] Column descriptions should be propagated in UNIONs
- github-233: [BUG] Missing Hadoop dependencies for S3, Delta, etc
- github-235: Implement new `rest` hook with fine control
- github-229: A build target should not fail if Impala "COMPUTE STATS" fails
- github-236: 'copy' target should not apply output schema
- github-237: jdbcQuery relation should use fields "sql" and "file" instead of "query"
- github-239: Allow optional SQL statement for creating jdbcTable
- github-238: Implement new 'jdbcCommand' target
- github-240: [BUG] Data quality checks in documentation should not fail on NULL values
- github-241: Throw an error on duplicate entity definitions
- github-220: Upgrade Delta-Lake to 2.0 / 2.1
- github-242: Switch to Spark 3.3 as default
- github-243: Use alternative Spark MS SQL Connector for Spark 3.3
- github-244: Generate project HTML documentation with optional external CSS file
- github-226: Upgrade to Spark 3.2.2
- github-227: [BUG] Flowman should not fail with field names containing "-", "/" etc
- github-228: Padding and truncation of CHAR(n)/VARCHAR(n) should be configurable
- github-202: Add support for Spark 3.3
- github-203: [BUG] Resource dependencies for Hive should be case-insensitive
- github-204: [BUG] Detect indirect dependencies in a chain of Hive views
- github-207: [BUG] Build should not directly fail if inferring dirty status fails
- github-209: [BUG] HiveViews should not trigger cascaded refresh during CREATE phase even when nothing is changed
- github-211: Implement new hiveQuery relation
- github-210: [BUG] HiveTables should be migrated if partition columns change
- github-208: Implement JDBC hook for database based semaphores
- github-212: [BUG] Hive views should not be migrated in RELAXED mode if only comments have changed
- github-214: Update ImpalaJDBC driver to 2.6.26.1031
- github-144: Support changing primary key for JDBC relations
- github-216: [BUG] Floats should be represented as FLOAT and not REAL in MySQL/MariaDB
- github-217: Support collations for creating/migrating JDBC tables
- github-218: [BUG] Postgres dialect should be used for Postgres JDBC URLs
- github-219: [BUG] SchemaMapping should retain incoming comments
- github-215: Support COLUMN STORE INDEX for MS SQL Server
- github-182: Support column descriptions in JDBC relations (SQL Server / Azure SQL)
- github-224: Support column descriptions for MariaDB / MySQL databases
- github-223: Support column descriptions for Postgres database
- github-205: Initial support for Oracle DB via JDBC
- github-225: [BUG] Staging schema should not have comments
We take backward compatibility very seriously. But sometimes a breaking change is needed to clean up code and to enable new features. This release contains some breaking changes, which are annoying but simple to fix. In order to respect `null` as a YAML keyword with special semantics, some entities needed to be renamed, as described in the following table:
category | old kind | new kind |
---|---|---|
mapping | null | empty |
relation | null | empty |
target | null | empty |
store | null | none |
history | null | none |
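
The renaming in the table can be applied mechanically to a parsed project specification. The following is a minimal sketch and not part of Flowman itself: it assumes the specification has already been loaded into plain Python dictionaries, that entities live in top-level `mappings`, `relations` and `targets` sections keyed by entity name, and that each entity carries a `kind` field. The helper name `rename_null_kinds` is hypothetical. The `store` and `history` rows are left out here and would need the same treatment wherever they are configured.

```python
# New kind replacing the old "null" kind, per section (taken from the table).
# The section names are an assumption about the layout of the parsed spec.
NULL_KIND_RENAMES = {
    "mappings": "empty",
    "relations": "empty",
    "targets": "empty",
}


def rename_null_kinds(spec):
    """Return a copy of `spec` in which the old `null` kind is replaced.

    `spec` is assumed to be a dict as produced by a YAML loader. The input
    is not modified.
    """
    result = dict(spec)
    for section, new_kind in NULL_KIND_RENAMES.items():
        entities = spec.get(section)
        if not isinstance(entities, dict):
            continue
        renamed = {}
        for name, entity in entities.items():
            # An unquoted `null` in YAML is loaded as None, a quoted one as
            # the string "null", so both spellings are checked.
            if (
                isinstance(entity, dict)
                and "kind" in entity
                and entity["kind"] in (None, "null")
            ):
                entity = {**entity, "kind": new_kind}
            renamed[name] = entity
        result[section] = renamed
    return result


# Hypothetical example data, standing in for a loaded project file.
example = {
    "mappings": {
        "placeholder": {"kind": None},
        "facts": {"kind": "sql"},
    },
    "relations": {"sink": {"kind": "null"}},
}
migrated = rename_null_kinds(example)
print(migrated)
```

The check for `None` illustrates why the rename was needed: a YAML loader typically turns an unquoted `null` into a null value rather than a string, so the old kind could not be distinguished from a missing one.
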
- github-195: [BUG] Metric "target_records" is not reset correctly after an execution phase is finished
- github-197: [BUG] Impala REFRESH METADATA should not fail when dropping views
- github-184: Only read in *.yml / *.yaml files in module loader
- github-183: Support storing SQL in external file in `hiveView`
- github-185: Missing _SUCCESS file when writing to dynamic partitions
- github-186: Support output mode `OVERWRITE_DYNAMIC` for Delta relation
- github-149: Support creating views in JDBC with new `jdbcView` relation
- github-190: Replace logo in documentation
- github-188: Log detailed timing information when writing to JDBC relation
- github-191: Add user provided description to quality checks
- github-192: Provide example queries for JDBC metric sink
- github-175: '--jobs' parameter starts way too many parallel jobs
- github-176: start-/end-date in report should not be the same
- github-177: Implement generic SQL schema check
- github-179: Update DeltaLake dependency to 1.2.1
- github-168: Support optional filters in data quality checks
- github-169: Support sub-queries in filter conditions
- github-171: Parallelize loading of project files
- github-172: Update CDP7 profile to the latest patch level
- github-153: Use non-privileged user in Docker image
- github-174: Provide application for generating YAML schema
We take backward compatibility very seriously. But sometimes a breaking change is needed to clean up code and to enable new features. This release contains some breaking changes, which are annoying but simple to fix. In order to avoid YAML schema inconsistencies, some entities needed to be renamed, as described in the following table:
category | old kind | new kind |
---|---|---|
mapping | const | values |
mapping | empty | null |
mapping | read | relation |
mapping | readRelation | relation |
mapping | readStream | stream |
relation | const | values |
relation | empty | null |
relation | jdbc | jdbcTable, jdbcQuery |
relation | table | hiveTable |
relation | view | hiveView |
schema | embedded | inline |
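
Because one of the renames in the table is ambiguous (`jdbc` becomes either `jdbcTable` or `jdbcQuery`), a project cannot be migrated fully automatically, but outdated kinds can at least be located. The following is a rough sketch under stated assumptions, not a Flowman tool: it assumes the project has been parsed into plain Python dictionaries with top-level `mappings` and `relations` sections keyed by entity name, each entity having a `kind` field. The function name `find_outdated_kinds` is hypothetical, and the `schema` row is omitted because schemas are usually nested inside other entities.

```python
# Old kind -> candidate new kinds, per section (taken from the table).
# The section names are an assumption about the layout of the parsed spec.
RENAMES = {
    "mappings": {
        "const": ["values"],
        "empty": ["null"],
        "read": ["relation"],
        "readRelation": ["relation"],
        "readStream": ["stream"],
    },
    "relations": {
        "const": ["values"],
        "empty": ["null"],
        "jdbc": ["jdbcTable", "jdbcQuery"],
        "table": ["hiveTable"],
        "view": ["hiveView"],
    },
}


def find_outdated_kinds(spec):
    """List (section, name, old_kind, new_kinds) for every renamed kind."""
    findings = []
    for section, renames in RENAMES.items():
        entities = spec.get(section) or {}
        for name, entity in entities.items():
            kind = entity.get("kind") if isinstance(entity, dict) else None
            if kind in renames:
                findings.append((section, name, kind, renames[kind]))
    return findings


# Hypothetical example data, standing in for a loaded project file.
example = {
    "mappings": {
        "src": {"kind": "readRelation"},
        "agg": {"kind": "aggregate"},
    },
    "relations": {"db": {"kind": "jdbc"}},
}
findings = find_outdated_kinds(example)
for section, name, old, new in findings:
    print(f"{section}.{name}: kind '{old}' is now {' or '.join(new)}")
```

Reporting instead of rewriting leaves the decision between `jdbcTable` and `jdbcQuery` to the project author.
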
- github-154: Fix failing migration when PK requires change due to data type
- github-156: Recreate indexes when data type of column changes
- github-155: Project level configs are used outside job
- github-157: Fix UPSERT operations for SQL Server
- github-158: Improve non-nullability of primary key column
- github-160: Use sensible defaults for default documenter
- github-161: Improve schema caching during execution
- github-162: ExpressionColumnCheck does not work when results contain NULL values
- github-163: Implement new column length quality check
- github-148: Support staging table for all JDBC relations
- github-120: Use staging tables for UPSERT and MERGE operations in JDBC relations
- github-147: Add support for PostgreSQL
- github-151: Implement column level lineage in documentation
- github-121: Correctly apply documentation, before/after and other common attributes to templates
- github-152: Implement new 'cast' mapping
- Add new `sqlserver` relation
- Implement new documentation subsystem
- Change default build to Spark 3.2.1 and Hadoop 3.3.1
- Add new `drop` target for removing tables
- Speed up project loading by reusing Jackson mapper
- Implement new `jdbc` metric sink
- Implement schema cache in Executor to speed up documentation and similar tasks
- Add new config variables `flowman.execution.mapping.schemaCache` and `flowman.execution.relation.schemaCache`
- Add new config variable `flowman.default.target.verifyPolicy` to ignore empty tables during VERIFY phase
- Implement initial support for indexes in JDBC relations
- Fix importing projects
- flowexec now returns different exit codes depending on the processing result
- Fix wrong dependencies in Swagger plugin
- Implement basic schema inference for local CSV files
- Implement new `stack` mapping
- Improve error messages of local CSV parser
- Implement detection of dependencies introduced by schema
- Fix detection of Derby metastore to truncate comment lengths.
- Add new config variable `flowman.default.relation.input.columnMismatchPolicy` (default is `IGNORE`)
- Add new config variable `flowman.default.relation.input.typeMismatchPolicy` (default is `IGNORE`)
- Add new config variable `flowman.default.relation.output.columnMismatchPolicy` (default is `ADD_REMOVE_COLUMNS`)
- Add new config variable `flowman.default.relation.output.typeMismatchPolicy` (default is `CAST_ALWAYS`)
- Improve handling of `_SUCCESS` files for detecting (non-)dirty directories
- Implement new `merge` target
- Implement merge operation for Delta relations
- Implement merge operation for JDBC relations (only for some databases, i.e. MS SQL)
- Add new config variable `flowman.execution.target.useHistory` (default is `false`)
- Change the semantics of config variable `flowman.execution.target.forceDirty` (default is `false`)
- Add new `-d` / `--dirty` option for explicitly marking individual targets as dirty
- Add build profile for Hadoop 3.3
- Add build profile for Spark 3.2
- Allow SQL expressions as dimensions in `aggregate` mapping
- Update Hive views when the resulting schema would change
- Add new `mapping cache` command to FlowShell
- Support embedded connection definitions
- Much improved Flowman History Server
- Fix wrong metric names with TemplateTarget
- Implement more `template` types for `connection`, `schema`, `dataset`, `assertion` and `measure`
- Implement new `measure` target for creating custom metrics for measuring data quality
- Add new config option `flowman.execution.mapping.parallelism`
- Improve automatic schema migration for Hive and JDBC relations
- Improve support of `CHAR(n)` and `VARCHAR(n)` types. Those types will now be propagated to Hive with newer Spark versions
- Support writing to dynamic partitions for file relations, Hive tables, JDBC relations and Delta tables
- Fix the name of some config variables (floman.* => flowman.*)
- Added new config variables `flowman.default.relation.migrationPolicy` and `flowman.default.relation.migrationStrategy`
- Add plugin for supporting DeltaLake (https://delta.io), which provides `deltaTable` and `deltaFile` relation types
- Fix non-deterministic column order in `schema` mapping, `values` mapping and `values` relation
- Mark Hive dependencies as 'provided', which reduces the size of dist packages
- Significantly reduce size of AWS dependencies in AWS plugin
- Add new build profile for Cloudera CDP-7.1
- Improve Spark configuration of `LocalSparkSession` and `TestRunner`
- Update Spark 3.0 build profile to Spark 3.0.3
- Upgrade Impala JDBC driver from 2.6.17.1020 to 2.6.23.1028
- Upgrade MySQL JDBC driver from 8.0.20 to 8.0.25
- Upgrade MariaDB JDBC driver from 2.2.4 to 2.7.3
- Upgrade several Maven plugins to latest versions
- Add new config option `flowman.workaround.analyze_partition` to work around CDP 7.1 issues
- Fix migrating Hive views to tables and vice-versa
- Add new option "-j " to allow running multiple job instances in parallel
- Add new option "-j " to allow running multiple tests in parallel
- Add new `uniqueKey` assertion
- Add new `schema` assertion
- Update Swagger libraries for `swagger` schema
- Implement new `openapi` plugin to support OpenAPI 3.0 schemas
- Add new `readHive` mapping
- Add new `simpleReport` and `report` hook
- Implement new templates
- Bump CDH version to 6.3.4
- Fix scope of some dependencies
- Update Spark to 3.1.2
- Add new `values` relation
- New Flowman Kernel and Flowman Studio application prototypes
- New ParallelExecutor
- Fix before/after dependencies in `count` target
- Default build is now Spark 3.1 + Hadoop 3.2
- Remove build profiles for Spark 2.3 and CDH 5.15
- Add MS SQL Server plugin containing JDBC driver
- Speed up file listing for `file` relations
- Use Spark JobGroups
- Better support running Flowman on Windows with appropriate batch scripts
- Add logo to Flowman Shell
- Fix name of config option `flowman.execution.executor.class`
- Add new `groupedAggregate` mapping
- Reimplement target ordering, configurable via `flowman.execution.scheduler.class`
- Implement new assertions `columns` and `expression`
- New configuration variable `floman.default.target.rebalance`
- New configuration variable `floman.default.target.parallelism`
- Changed behaviour: The `mergeFile` target now does not assume any more that the `target` is local. If you already use `mergeFiles` with a local file, you need to prefix the target file name with `file://`.
- Add new `-t` argument for selectively building a subset of targets
- Remove example-plugin
- Add quickstart guide
- Add new "flowman-parent" BOM for projects using Flowman
- Move `com.dimajix.flowman.annotations` package to `com.dimajix.flowman.spec.annotations`
- Add new log redaction
- Integrate Scala code coverage analysis
- `assemble` will fail when trying to use non-existing columns
- Move `swagger` and `json` schema support into separate plugins
- Change default build to Spark 3.0 and Hadoop 3.2
- Update Spark to 3.0.2
- Rename class `Executor` to `Execution` - watch your plugins!
- Implement new configurable `Executor` class for executing build targets.
- Add build profile for Spark 3.1.x
- Update ScalaTest to 3.2.5 - watch your unittests for changed ScalaTest API!
- Add new `case` mapping
- Add new `--dry-run` command line option
- Add new `mock` and `null` mapping types
- Add new `mock` relation
- Add new `values` mapping
- Add new `values` dataset
- Implement new testing capabilities
- Rename `update` mapping to `upsert` mapping, which better describes its functionality
- Introduce new `VALIDATE` phase, which is executed even before `CREATE` phase
- Implement new `validate` and `verify` targets
- Implement new `deptree` command in Flowman shell
- Upgrade to Spark 2.4.7 and Spark 3.0.1
- Clean up dependencies
- Disable build of Docker image
- Update examples
- Fix dropping of partitions which could cause issues on CDH6
- Fix AWS plugin for Hadoop 3.x
- Improve setup of logging
- Shade Velocity for better interoperability with Spark 3
- Add new web hook facility in namespaces and jobs
- Existing targets will not be overwritten anymore by default. Either use the `--force` command line option, or set the configuration property `flowman.execution.target.forceDirty` to `true` for the old behaviour.
- Add new command line option `--keep-going`
- Implement new `com.dimajix.spark.io.DeferredFileCommitProtocol` which can be used by setting the Spark configuration parameter `spark.sql.sources.commitProtocolClass`
- Add new `flowshell` application
- Code improvements
- Do not implicitly set SPARK_MASTER in configuration
- Add support for CDH6
- Add support for Spark 3.0
- Improve support for Hadoop 3.x
- Refactor Maven module structure
- Implement new Scala DSL for creating projects
- Fix ordering bug in target execution
- Merge `migrate` phase into `create` phase
- Rename `input` field to `mapping` in most targets
- Lots of minor code improvements
- Fix type coercion of DecimalTypes
- Improve support for Swagger Schema
- Fix infinite loop in recursiveSql
- Add new RecursiveSqlMapping
- Refactor `describe` method of mappings
- Fix TemplateRelation to return correct partitions and fields
- Add `filter` attribute to many mappings
- Improve build dependency management with DataSets
- Update to newest Swagger V2 parser
- Workaround for bug in Swagger parser for enums
- Tidy up logging
- Remove HDFS directories when dropping Hive table partitions
- Improve migrations of HiveUnionTable
- Improve schema support in `copy` target
- Add new `earliest` mapping
- Improve Hive compatibility of SQL generator for UNION statements
- Add support for Spark 3.0 preview
- Remove HBase plugin
- Add optional `filter` to `readRelation` mapping
- Improve Hive compatibility of SQL generator for ORDER BY statements
- Fix target table search in Hive Union Table
- Improve Impala catalog support
- Add `error` output to `extractJson` mapping
- Add new `hiveUnionTable` relation
- Add new `union` schema
- Support nested columns in deduplication
- Support nested Hive VIEWs
- Support Spark 2.4.4
- Fix wrong Spark application name if configured via Spark config
- Complete overhaul of job execution. No tasks anymore
- Improve Swagger schema support
- Add configuration option for column insert position of `historize` mapping
- Add optional filter condition to `latest` mapping
- Improve generation of SQL code containing window functions
- Add new metric system
- Add Hive view generation from mappings
- Support for Hadoop 3.1 and 3.2 (without Hive)
- Add `historize` mapping
- Fix build with CDH-5.15 profile
- Implement initial REST server
- Implement initial prototype of UI
- Implement new datasets for new tasks (copy/compare/...)
- Add support for checkpoint directory
- Implement column renaming in projections
- CopyRelationTask also performs projection
- `explode` mapping supports simple data types
- Fix NPE in ShowRelationTask
- Add multiple relations to `showRelation` task
- github-33: Add new `unit` mapping
- github-34: Fix NPE in Swagger schema reader
- github-36: Add new `explode` mapping
- github-32: Improve handling of nullable structs
- github-30: Refactoring of whole specification handling
- github-30: Add new `template` mapping
- Add new `flatten` entry in assembler
- Implement new `flatten` mapping
- github-31: Fix handling of local project definition in flowexec
- Add parameters to "job run" CLI
- Fix error handling of failing "build" and "clean" tasks
- Add support for Spark 2.4.2
- Add new assemble mapping
- Add new conform mapping
- Add null format for Spark
- Add deployment
- Update to Spark 2.3.2
- Small fixes
- Small fixes
- Initial release