Data Pipeline Design Principles


In 2020, the field of open-source Data Engineering is finally coming of age. Alongside the heavy-duty proprietary software for creating data pipelines, workflow orchestration and testing, more open-source tools (with an option to upgrade to an Enterprise tier) have made their place in the market.
Solutions range from completely self-hosted and self-managed to fully managed cloud-based ones that require very little engineering effort. Beyond the risk of lock-in, fully managed solutions also come at a high cost. Whatever the downsides, they let businesses thrive before hiring and nurturing a fully functional data engineering team.
Data Engineering teams do much more than move data from one place to another or write transforms for the ETL pipeline. Data Engineering is an umbrella term that covers data modelling, database administration, data warehouse design and implementation, ETL pipelines, data integration, database testing, CI/CD for data and other DataOps concerns.
Data Pipelines
Transforming and transporting data is one of the core responsibilities of the Data Engineer, and data pipelines sit at the centre of that responsibility. If we were to draw a Maslow's hierarchy of needs for data, data sanity and data availability would be at the bottom. They are essential, and data pipelines are what make the data available.
Having some experience working with data pipelines and having read the existing literature on the subject, I have listed the five qualities/principles that a data pipeline must have to contribute to the success of the overall data engineering effort.

1. Replayability
Irrespective of whether it's a real-time or a batch pipeline, a pipeline should be replayable from any agreed-upon point in time to load the data again in case of bugs, unavailability of data at the source or any number of other issues. Replayability rests on the immutability and idempotency of the data; this is what builds determinism into the data pipeline.
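As a rough sketch of what replayability can look like in practice, the snippet below reloads a single day's partition idempotently. The table names, columns and SQLite backend are assumptions made for illustration, not a prescription; the pattern (overwrite one immutable slice of the target per run) is the point.

```python
from datetime import date
import sqlite3

# A minimal sketch of an idempotent, replayable daily load, assuming
# hypothetical `events_raw` (source) and `events_clean` (target) tables.
# Re-running the job for any run_date overwrites that day's partition,
# so replaying from an agreed-upon point in time is just a loop of runs.

def load_partition(conn: sqlite3.Connection, run_date: date) -> None:
    day = run_date.isoformat()
    with conn:  # one transaction: delete + insert succeed or fail together
        # Idempotency: wipe the target partition before reloading it.
        conn.execute("DELETE FROM events_clean WHERE event_date = ?", (day,))
        # Immutability: raw data for a past day never changes, so the
        # reload is deterministic.
        conn.execute(
            """
            INSERT INTO events_clean (event_date, user_id, amount)
            SELECT event_date, user_id, SUM(amount)
            FROM events_raw
            WHERE event_date = ?
            GROUP BY event_date, user_id
            """,
            (day,),
        )

def replay(conn: sqlite3.Connection, dates: list[date]) -> None:
    # Replaying is nothing special: run the same idempotent load per day.
    for d in dates:
        load_partition(conn, d)
```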
2. Auditability
For real-time pipelines, we can call this observability. The idea is to have a clear view of what is running (or what ran), what failed and how it failed, so that it's easy to find action items to fix the pipeline. In a general sense, auditability is the quality of a data pipeline that lets the data engineering team see the history of events in a sane, readable manner.
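A minimal sketch of the same idea, assuming a hypothetical task decorator that emits structured start/success/failure events to a log; in a real pipeline the orchestrator or a dedicated audit table would capture the same fields.

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("pipeline.audit")

def audited(task_name: str):
    """Record when a task started, whether it succeeded, and how long it took."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            log.info("task=%s event=started", task_name)
            try:
                result = fn(*args, **kwargs)
            except Exception as exc:
                log.error("task=%s event=failed error=%r duration=%.1fs",
                          task_name, exc, time.monotonic() - start)
                raise  # fail loudly; the history still shows what happened
            log.info("task=%s event=succeeded duration=%.1fs",
                     task_name, time.monotonic() - start)
            return result
        return wrapper
    return decorator

@audited("extract_orders")
def extract_orders():
    ...  # hypothetical extract step
```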
3. Scalability
It's a no-brainer. Data is like entropy: it always increases. It is essential to make sure that, as the data grows, the pipelines stay well equipped to handle it. This often leads data engineering teams to choose between different types of scalable systems, including fully managed and serverless options.
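One small illustration of the mindset, under assumed file and column names: process data as a bounded stream instead of loading everything into memory, so the same job keeps working as volume grows. The same principle applies whether the backend is a file, a queue or a warehouse.

```python
import csv
from collections import defaultdict
from typing import Iterator

def read_rows(path: str) -> Iterator[dict]:
    # Streams one row at a time; memory use stays flat as the file grows.
    with open(path, newline="") as fh:
        yield from csv.DictReader(fh)

def daily_totals(path: str) -> dict[str, float]:
    # Aggregate incrementally over the stream (columns are hypothetical).
    totals: defaultdict[str, float] = defaultdict(float)
    for row in read_rows(path):
        totals[row["event_date"]] += float(row["amount"])
    return dict(totals)
```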
4. Reliability
Reliability means more than the pipeline itself being dependable; the data it transforms and transports must be reliable too. That requires putting enough thought and effort into understanding the engineering and business requirements, writing tests and reducing the areas prone to manual error. A good metric is the automated test coverage of the sources, the targets and the data pipeline itself.
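As an illustration of that metric, the checks below are the kind of automated tests that could run against the source and target after each load; the table and column names follow the earlier hypothetical example and are not from any specific tool.

```python
import sqlite3

def check_no_row_loss(conn: sqlite3.Connection, day: str) -> None:
    # Target is grouped per user, so it should be non-empty and never
    # larger than the raw source for the same day.
    src = conn.execute(
        "SELECT COUNT(*) FROM events_raw WHERE event_date = ?", (day,)
    ).fetchone()[0]
    tgt = conn.execute(
        "SELECT COUNT(DISTINCT user_id) FROM events_clean WHERE event_date = ?",
        (day,),
    ).fetchone()[0]
    assert tgt > 0 and src >= tgt, f"suspicious counts for {day}: {src=} {tgt=}"

def check_no_null_keys(conn: sqlite3.Connection, day: str) -> None:
    # Business keys should never be NULL after the transform.
    nulls = conn.execute(
        "SELECT COUNT(*) FROM events_clean "
        "WHERE event_date = ? AND user_id IS NULL",
        (day,),
    ).fetchone()[0]
    assert nulls == 0, f"{nulls} rows with NULL user_id on {day}"
```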
5. Security
In one of his testimonies to Congress, when asked whether the Europeans get data privacy right, Mark Zuckerberg said they usually get it right the first time. Data privacy is important. GDPR has set a standard for the world to follow, and most countries adhere to some level of data security regulation. Having different levels of security requirements across countries, states, industries, businesses and peers poses a great challenge for engineering teams.
Making sure that the data pipeline adheres to security and compliance requirements is of utmost importance, and in many cases it is legally binding. Security breaches and data leaks have brought companies down. It's worth investing in the technologies that matter.
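As one hedged example of a pipeline-level control, the sketch below pseudonymises or drops personal fields before records leave the pipeline. The field lists and salt handling are assumptions; a real deployment would pull the policy and key material from a compliance catalogue and a secrets manager, per jurisdiction.

```python
import hashlib

PII_HASH = {"email", "user_id"}    # keep joinable, but not directly identifying
PII_DROP = {"full_name", "phone"}  # not needed downstream at all

def pseudonymise(record: dict, salt: bytes) -> dict:
    out = {}
    for key, value in record.items():
        if key in PII_DROP:
            continue  # never forward fields with no downstream use
        if key in PII_HASH:
            # Salted hash keeps the field usable as a join key without
            # exposing the raw identifier.
            out[key] = hashlib.sha256(salt + str(value).encode()).hexdigest()
        else:
            out[key] = value
    return out
```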
Conclusion
These were five qualities of an ideal data pipeline. The list could be broken down into many more points, but it points in the right direction. If you follow these principles when designing a pipeline, you will spend far fewer sleepless nights fixing bugs, scaling up and dealing with data privacy issues.
