Fundamentals of Data Engineering.md

#oreilly

Metadata has a significant impact on the utility of data.

Not all data is accessed in the same way. Retrieval patterns vary greatly with the data being stored and queried. This brings up the notion of the "temperatures" of data: how often data is accessed determines its temperature, from frequently accessed hot data to rarely accessed cold data.
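
As a rough illustration of the temperature idea, the sketch below labels a dataset from its observed access rate. The function name and both thresholds are hypothetical, chosen only for the example; real cutoffs depend on the workload and on storage pricing.

```python
# Hypothetical thresholds in accesses per day; these are illustrative
# assumptions, not values taken from the book.
HOT_MIN_ACCESSES_PER_DAY = 100.0
WARM_MIN_ACCESSES_PER_DAY = 1.0


def data_temperature(accesses: int, days_observed: int) -> str:
    """Classify a dataset as 'hot', 'warm', or 'cold' by access frequency."""
    if days_observed <= 0:
        raise ValueError("days_observed must be positive")
    rate = accesses / days_observed  # average accesses per day
    if rate >= HOT_MIN_ACCESSES_PER_DAY:
        return "hot"
    if rate >= WARM_MIN_ACCESSES_PER_DAY:
        return "warm"
    return "cold"
```

A label like this could then drive a storage-tier decision, for example keeping hot data on fast storage and moving cold data to cheaper archival storage.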

Virtually all data we deal with is inherently streaming. Data is nearly always produced and updated continually at its source. Batch ingestion is simply a specialized and convenient way of processing this stream in large chunks.
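
One way to picture batch ingestion as "a stream in large chunks" is a generator that groups an incoming iterator into fixed-size lists. This is a minimal sketch; the `batches` helper and the size-based cutoff are assumptions for illustration (a real pipeline might cut batches by time window instead).

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size events from a stream."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(stream)
    while True:
        # Pull up to batch_size events; an empty chunk means the stream ended.
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


# Seven events ingested in batches of three; the last batch is partial.
example_batches = list(batches(range(7), 3))
```

The source is still a continuous sequence of events; only the processing boundary changes.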

You've reached the last stage of the data engineering lifecycle. Now that the data has been ingested, stored, and transformed into coherent and useful structures, it's time to get value from your data. "Getting value" from data means different things to different users. #relative-correctness #meaningfulness #value

Although self-service analytics is simple in theory, it's tough to pull off in practice. The main reason is that poor data quality, organizational silos, and a lack of adequate data skills often get in the way of allowing widespread use of analytics.

Serving data: is the data of sufficient quality to perform reliable feature engineering? Quality requirements and assessments are developed in close collaboration with teams consuming the data.

Data engineering now encompasses far more than tools and technology. The field is now moving up the value chain, incorporating traditional enterprise practices such as data management and cost optimization and newer practices like DevOps.

Data governance is a foundation for data-driven business practices and a mission-critical part of the data engineering lifecycle. When data governance is practiced well, people, processes, and technologies align to treat data as a key business driver; if data issues occur, they are promptly handled.

The core categories of data governance are discoverability, security, and accountability. Within these core categories are subcategories such as data quality, metadata, and privacy.

Metadata tools are only as good as their connectors to data systems and their ability to share metadata.

Data has a social element; each organization accumulates social capital and knowledge around processes, datasets, and pipelines. Human-oriented metadata systems focus on the social aspect of metadata.

Metadata is "data about data," and it underpins every section of the data engineering lifecycle. Metadata is exactly the data needed to make data discoverable and governable.
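
To make "discoverable and governable" concrete, here is a small sketch of a metadata catalog. The `DatasetMetadata` fields, the helper functions, and the sample entries are all hypothetical; they do not describe any particular metadata tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetMetadata:
    """Hypothetical catalog entry: data about a dataset, not the data itself."""
    name: str
    description: str
    owner: Optional[str] = None  # who is accountable for the dataset
    tags: List[str] = field(default_factory=list)


def find_by_tag(catalog: List[DatasetMetadata], tag: str) -> List[str]:
    """Discoverability: names of datasets carrying the given tag."""
    return [entry.name for entry in catalog if tag in entry.tags]


def unowned(catalog: List[DatasetMetadata]) -> List[str]:
    """Governance: names of datasets with no accountable owner."""
    return [entry.name for entry in catalog if entry.owner is None]


# Illustrative entries only.
sample_catalog = [
    DatasetMetadata("orders", "Customer orders", owner="sales-eng", tags=["sales", "pii"]),
    DatasetMetadata("clickstream", "Raw web events", tags=["web"]),
]
```

With this sample, `find_by_tag(sample_catalog, "pii")` surfaces the `orders` dataset, and `unowned(sample_catalog)` flags `clickstream` as lacking an owner.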

Managing data quality is tough if no one is accountable for the data in question.