
These notes come from the Buz Blog

1. Components of a Good Data Platform

1.1. Instrumentation and Integration

This one is fairly self-explanatory: you have to have data first before you can talk about its quality

1.2. The pipeline

Pipelines are usually either batch or streaming. The two religions argue often, but they share many of the same concepts. Pipelines collect data and put it somewhere

The best pipelines usually have the following traits (a sketch of the annotation and sensitivity points follows the list):

  • Move data reliably
  • Annotate payloads with metadata such as provenance, collected_at timestamps, fingerprints, etc.
  • Generate stats to provide the operator with feedback
  • Validate and bifurcate payloads, if you’re lucky
  • Know about and act on payload sensitivities - obfuscate, hash, tokenize, redact, redirect, etc
  • Minimize moving pieces
  • Don’t spend all the CEO’s 💸💸💸 so they can afford that house in the Bahamas
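
To make the annotation and sensitivity bullets concrete, here is a minimal Python sketch of a pipeline stage that wraps each payload in a metadata envelope and hashes sensitive fields. The envelope layout and `SENSITIVE_KEYS` are my own assumptions for illustration, not how any particular pipeline does it:

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

# Assumption: in a real pipeline the set of sensitive fields would come
# from the schema/contract; it is hard-coded here for illustration.
SENSITIVE_KEYS = {"email", "ip_address"}

def annotate(payload: dict, source: str) -> dict:
    """Wrap a raw payload with provenance, collected_at, and a fingerprint,
    hashing sensitive fields before the payload moves downstream."""
    body = {
        k: hashlib.sha256(str(v).encode()).hexdigest() if k in SENSITIVE_KEYS else v
        for k, v in payload.items()
    }
    return {
        "metadata": {
            "id": str(uuid.uuid4()),
            "provenance": source,  # where the payload came from
            "collected_at": datetime.now(timezone.utc).isoformat(),
            # A stable fingerprint of the sanitized body supports dedup downstream.
            "fingerprint": hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest(),
        },
        "payload": body,
    }

print(annotate({"email": "a@b.com", "page": "/pricing"}, source="web-sdk"))
```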

1.3. Storage and access

Storage/access systems range from a wee little Postgres database, to Snowflake, to a data lake filled with Parquet fish and the Loch Ness Trino monster

1.4. Data Discovery

As things scale, pipelines/databases/data models become increasingly tangled, and that is when discovery becomes essential

1.5. Observability, Monitoring, and Alerting

2. Design Goals of a Good Data Platform

2.1. Comply with rules

For example, the CPRA

2.2. Minimize bad data

Having bad data is bad enough; not knowing that the bad data exists is even worse
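
One way to know your bad data exists is the validate-and-bifurcate pattern from §1.2: check every payload against its contract and route failures to a dead-letter sink with a counter, rather than dropping them silently. A minimal sketch using the jsonschema library; the toy contract and sink names are assumptions:

```python
from collections import Counter
from jsonschema import validate, ValidationError

# Assumption: a toy contract; in practice this comes from a schema registry.
PAGE_VIEW_V1 = {
    "type": "object",
    "required": ["page", "user_id"],
    "properties": {
        "page": {"type": "string"},
        "user_id": {"type": "string"},
    },
}

stats = Counter()
good, dead_letter = [], []

def route(payload: dict) -> None:
    """Bifurcate payloads: valid ones continue downstream, invalid ones are
    kept (not dropped) so the bad data stays visible and debuggable."""
    try:
        validate(instance=payload, schema=PAGE_VIEW_V1)
        stats["valid"] += 1
        good.append(payload)
    except ValidationError as err:
        stats["invalid"] += 1
        dead_letter.append({"payload": payload, "error": err.message})

route({"page": "/pricing", "user_id": "u1"})
route({"page": 42})  # wrong type and missing user_id -> dead letter
print(dict(stats))   # {'valid': 1, 'invalid': 1}
```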

2.3. Maximize knowledge of what the system is doing

You need all kinds of monitoring and logs to understand what the system is currently doing

2.4. Minimize friction for all parties involved

The people who develop, use, and maintain the data platform come from all directions, some of them not very technical; try to give all of them a good experience

Want to get buy-in? Minimize friction. Want to increase adoption? Automate others’ toil. Want sustainable systems? Reduce cognitive load

3. The Contract-Powered Platform

Schemas are the nucleus of sustainable data platforms

Enforcing schemas from the start is usually rejected as unnecessary overhead, but it is an important long-term investment, and retrofitting schema enforcement later is much more painful

  • Schemas empower the “producer” <-> “consumer” relationship
  • Schemas are data discovery: Schematizing data upfront means data discovery and documentation writes itself
  • Schemas power data validation in transit
  • Schemas help stop bad instrumentation from being implemented in the first place
  • Schemas improve code quality
  • Schemas power automation: Destination tables can be automatically created and migrated as schemas evolve (see the sketch after this list)
  • Schemas as observability: Calculating namespace-level statistics and splicing them into observability tools is the natural next step. Once every payload carries a namespace/schema and is wired into a metrics system such as Datadog, the relevant metrics and tables become self-serve
  • Schemas power compliance-oriented requirements
  • Schemas are the foundation of higher-order data models: It is pretty easy to turn a schema into a dbt source so analytics engineers can reliably build upon a well-defined, trustworthy foundation
  • Schemas are the foundation of data products
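
To make the automation bullet concrete: because a schema declares every field and its type, the destination table can be derived mechanically. A minimal sketch, assuming a JSON-Schema-style contract; the type mapping and naming convention are illustrative, not tied to any particular warehouse or vendor tooling:

```python
# Assumption: TYPE_MAP and the naming scheme are illustrative only.
TYPE_MAP = {"string": "VARCHAR", "integer": "BIGINT",
            "number": "DOUBLE", "boolean": "BOOLEAN"}

def table_ddl(namespace: str, version: int, schema: dict) -> str:
    """Derive CREATE TABLE DDL from a JSON-Schema-style contract."""
    cols = ",\n  ".join(
        f"{name} {TYPE_MAP.get(spec.get('type'), 'VARCHAR')}"
        for name, spec in schema["properties"].items()
    )
    return f"CREATE TABLE IF NOT EXISTS {namespace}_v{version} (\n  {cols}\n);"

PAGE_VIEW_V1 = {
    "properties": {
        "page": {"type": "string"},
        "user_id": {"type": "string"},
        "duration_ms": {"type": "integer"},
    }
}

print(table_ddl("page_view", 1, PAGE_VIEW_V1))
```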

4. The Contract-Powered Workflow

4.1. Draft, iterate on, and deploy a schema

Non-engineer stakeholders can draft the initial version of a schema themselves, without needing anyone from the data platform team

4.2. Bring tracking libraries and systems up to parity

Whenever a new schema is added or a version is bumped, automation kicks in and (at minimum) does the following; a sketch of two of these steps comes after the list:

  • Builds and deploys new tracking SDKs for engineering teams
  • Pushes schema metadata ∆ to data discovery tools
  • Ensures infrastructure dependencies exist (Kafka topics, database tables, etc.)
  • Pushes the schema to the appropriate place for pipeline-level validation
  • Creates dbt sources for the analytics engineers
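
A minimal sketch of two of those steps (ensuring the Kafka topic and creating the dbt source), assuming Confluent's Python client and a conventional events.<namespace>.v<version> topic naming scheme; the broker address and names are assumptions:

```python
import yaml
from confluent_kafka.admin import AdminClient, NewTopic

def ensure_topic(namespace: str, version: int) -> None:
    """Create the Kafka topic for a schema version if it does not exist."""
    admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumption
    topic = f"events.{namespace}.v{version}"
    futures = admin.create_topics(
        [NewTopic(topic, num_partitions=3, replication_factor=1)]
    )
    for t, f in futures.items():
        try:
            f.result()  # raises if creation failed (e.g. topic already exists)
            print(f"created {t}")
        except Exception as err:
            print(f"{t}: {err}")

def dbt_source(namespace: str, version: int) -> str:
    """Render a dbt source entry so analytics engineers can build on the table."""
    return yaml.safe_dump(
        {
            "version": 2,
            "sources": [
                {"name": "events",
                 "tables": [{"name": f"{namespace}_v{version}"}]}
            ],
        },
        sort_keys=False,
    )

print(dbt_source("page_view", 1))
```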

4.3. Implement tracking

Whether it is the frontend, the backend, or an infrastructure tool, anything can send tracking data into the system
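
For illustration, a generated tracking SDK might reduce to one versioned, typed structure per schema, so implementers cannot send malformed events. This sketch is an assumption about what such generated code could look like, not Buz's actual output:

```python
import json
from dataclasses import asdict, dataclass

# Assumption: code like this would be generated from page_view v1,
# not hand-written; the class and helper names are illustrative.
@dataclass(frozen=True)
class PageViewV1:
    page: str
    user_id: str
    duration_ms: int = 0

    SCHEMA_REF = "page_view/1"  # ties every event back to its contract

    def to_event(self) -> str:
        return json.dumps({"schema": self.SCHEMA_REF, "data": asdict(self)})

# Any producer (frontend, backend, infra tool) emits the same shape:
print(PageViewV1(page="/pricing", user_id="u1").to_event())
```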

4.4. Deploy

With contract-powered workflows the following prereqs are taken care of before instrumentation rolls out, not after:

  • Implementers and stakeholders talk to each other using shared verbiage.
  • Versioned, language-specific data structures are generated like all other code dependencies.
  • Metadata is pushed to discovery tools.
  • The pipeline is primed to accept incoming payloads and mark them as “good” or “bad”.
  • Observability tools are ready to go for instantaneous feedback in development and production.
  • Downstream analytics/modeling entrypoints (like dbt sources) are in place and can be immediately used.

5. The Schema-Powered Future

Some other reading if you want to dive in: