
The Importance of Version Control in Modern Data Engineering


In today's data engineering landscape, the implementation of an effective version control system is essential. Earlier this year, Orchestra made its version control capabilities available to all users, enabling them to manage changes in data pipelines efficiently. This system is critical to our mission of empowering data teams to construct data pipelines swiftly and effectively.

Orchestra relies on code, allowing users to inspect changes directly within the graphical user interface (GUI). This feature enables data teams to work collaboratively on Data and AI pipelines without disrupting production environments. Additionally, the introduction of basic role-based access controls permits managers to onboard more developers, facilitating contributions to pipeline development while maintaining the integrity of production systems.

Curious about how this can benefit your team? Tired of pipeline failures? Explore Orchestra today!

The platform offers numerous advantages akin to traditional git-based version control, such as collaboration, code review, versioning, and backups/restoration, without the burden of connecting to a git repository for minor modifications. We are excited about the potential benefits for Orchestra users and are eager to integrate fully with git providers soon. For further insights on how data engineers can implement version control and CI/CD effectively, read on.

Streamlining CI/CD with a Modular Architecture

In a modular architecture, the primary data pipeline management system focuses solely on orchestration, alerting, and monitoring. This separation means that various components needing testing, such as Python code and dependencies, are managed independently.

To illustrate this concept, let's consider a straightforward example involving Airflow.

Airflow Example

Imagine a basic data pipeline consisting of three tasks: - Task A: Retrieves data from an API and stores it in a Data Lake (Table A). - Task B: Cleans the data in the Data Lake and generates a file (Table B). - Task C: Updates a dashboard that references data in Table B.

In Airflow, these tasks are orchestrated with a defined dependency: A → B → C.
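Stripped of Airflow specifics, the shape of this pipeline can be sketched in plain Python. The task bodies below are hypothetical stand-ins; in a real deployment each would be an Airflow operator with the same A → B → C ordering:

```python
# A minimal stand-in for the three-task pipeline: A -> B -> C.
# The task bodies are placeholders; in Airflow each would be an
# operator (e.g. a PythonOperator) with the same dependency order.

def task_a(lake):
    # Retrieve data from an API and store it as Table A.
    lake["table_a"] = [{"col_a": 1}, {"col_a": 2}, {"col_a": None}]

def task_b(lake):
    # Clean Table A (drop null rows) and write Table B.
    lake["table_b"] = [r for r in lake["table_a"] if r["col_a"] is not None]

def task_c(lake):
    # "Refresh" a dashboard that reads Table B.
    return f"dashboard shows {len(lake['table_b'])} rows"

def run_pipeline():
    lake = {}
    result = None
    for task in (task_a, task_b, task_c):  # enforced order: A -> B -> C
        result = task(lake)
    return result
```

The point of the sketch is simply that Task C's output depends on Task B's, which depends on Task A's, so any schema change upstream ripples through all three.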

If a column name in the source data changes, the code in all tasks must be updated, requiring a refactor from ‘col_a’ to ‘col_x’. How should CI/CD address this?
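To make the refactor concrete, here is a hypothetical snippet from Task B's cleaning logic after the rename. Every task that references the column needs the same mechanical change, which is exactly why the rename is more disruptive than it looks:

```python
# Hypothetical cleaning step from Task B after the source rename.
# Every reference to 'col_a' must become 'col_x' -- in the ingestion
# code, the transformation, and the dashboard query alike.

NEW_COLUMN = "col_x"  # was "col_a" before the source schema changed

def clean(rows):
    # Post-refactor: select and deduplicate on the renamed column.
    seen, out = set(), []
    for row in rows:
        value = row[NEW_COLUMN]  # would raise KeyError on old-schema rows
        if value not in seen:
            seen.add(value)
            out.append({NEW_COLUMN: value})
    return out
```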

One approach is to create a staging environment in S3, mirroring the production setup. Once changes are validated in a development environment, Airflow would need to regenerate Table A, execute Task B, and then refresh Table B. If all goes well, the production instance of Airflow would execute the same steps.

While this method is robust, it is slow and costly, requiring significant resources for development, staging, and production. Maintaining an up-to-date staging environment is also resource-intensive, making this approach less than ideal.

Instead, after local changes are made, developers should target a development environment where necessary resources are zero-copy cloned. Popular solutions include Nessie for data lakes and SQL Mesh/dbt for cloud warehouse environments.
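In a warehouse like Snowflake, a zero-copy clone of a production table into a development database is a single DDL statement, since only metadata is copied. A minimal sketch of generating that statement (the database and schema names here are hypothetical):

```python
# Build a zero-copy clone statement for a development environment.
# Database/schema names are hypothetical placeholders. Snowflake's
# CLONE copies metadata only, so no data is physically duplicated.

def clone_statement(table, prod_db="PROD", dev_db="DEV", schema="PUBLIC"):
    return (
        f"CREATE TABLE {dev_db}.{schema}.{table} "
        f"CLONE {prod_db}.{schema}.{table};"
    )
```

A CI job for the development branch would run this against the warehouse before executing the modified tasks, then drop the clones when tests finish.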

New data produced in the development branch, such as Table A from Task A, should reflect the updated schema. Once Table A has been written and its tests pass, Task B should execute, followed by a test of a dummy version of the dashboard pointing at Table B.

In practice, setting this up can be quite challenging. Many organizations utilize data ingestion tools like Fivetran or Airbyte, which often lack support for staging environments. Similarly, dashboarding tools related to Task C rarely accommodate staging environments without duplicating workspaces.

The Solution: Modular Data Architecture

Using Airflow solely as an orchestration tool while employing separate applications for data movement and transformation significantly simplifies CI/CD processes.

In this architecture: - An ingestion service handles Task A. - A transformation service manages Task B. - Task C can remain within Airflow for simplicity.

The CI/CD process for the ingestion service becomes straightforward. When a column name changes, a new data source with the updated schema can be created in production. The functions responsible for data retrieval can be tested and deployed immediately if CI/CD passes, utilizing an expand/contract method.
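The expand/contract pattern is worth spelling out. During the expand phase, the ingestion writes the value under both the old and the new column name, so existing downstream readers keep working; once every consumer has migrated, the contract phase drops the old name. A hedged sketch (the function and phases are illustrative, not any particular tool's API):

```python
# Expand/contract sketch for the col_a -> col_x rename.
# Expand: write both columns so readers of the old name keep working.
# Contract: once every reader uses col_x, stop writing col_a.

def ingest_row(raw, phase="expand"):
    value = raw["col_x"]  # the source now sends col_x
    if phase == "expand":
        # Duplicate under the old name for not-yet-migrated consumers.
        return {"col_a": value, "col_x": value}
    # Contract: all readers migrated; write only the new column.
    return {"col_x": value}
```

Because the expand phase is backwards compatible, it can ship to production as soon as CI passes, without coordinating a simultaneous change across every downstream task.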

The transformation service follows a similar workflow, needing to reference the new data source. Lastly, CI/CD for Airflow requires minimal additional testing since the data ingestion and transformation tasks have already been validated. At this stage, the pull request should simply switch old task references to the new ones.

By adopting a modular architecture, CI/CD can be both robust and straightforward. Unlike traditional software engineering, CI/CD in data engineering involves cloning, testing, and removing datasets. By focusing solely on orchestration, the process becomes more manageable, leveraging established expand and contract methodologies.

Advantages of Version Control in Data

Utilizing Git offers numerous benefits, one of which is outlined in the previous section regarding CI/CD facilitation. Another key advantage is effective version control.

In this section, we will explore the advantages of version control for data engineering.

Preventing Unwanted Production Changes

A primary benefit of version control is the prevention of unintended changes reaching production. This is typically achieved through "Pull Request Review" (PR Review).

During the PR Review process, a reviewer assesses the proposed changes, allowing for approval or comments, which helps prevent unexamined alterations from being pushed into production. This process also fosters collaboration and trunk-based development.

Rollback and Risk Mitigation

Occasionally, unforeseen consequences from code changes can lead to unwanted modifications in production. In such cases, the ability to revert to previous code versions is vital. This capability mitigates risk, as managers can confidently roll back to earlier, error-free versions if necessary.

Accelerated Development Cycles

Without versioning, developers cannot work on changes incrementally; any modification is immediately reflected in production. By implementing version control, especially distinguishing between draft and published versions, data engineers can effectively "branch" their work, iterating freely without prematurely affecting production.
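The draft/published distinction can be modeled as a tiny version store. The class below is purely illustrative (it is not Orchestra's implementation), but it captures the two properties the text describes: drafts never touch production, and published history supports rollback:

```python
# Illustrative draft/published model -- not Orchestra's implementation.
# Drafts accumulate edits; only an explicit publish affects production,
# and earlier published versions remain available for rollback.

class PipelineVersions:
    def __init__(self):
        self.published = []   # production history, newest last
        self.draft = None     # work in progress, invisible to production

    def edit(self, definition):
        self.draft = definition        # iterate freely; prod untouched

    def publish(self):
        if self.draft is not None:
            self.published.append(self.draft)
            self.draft = None

    def rollback(self):
        self.published.pop()           # discard the latest version
        return self.published[-1]      # previous version is live again
```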

Distinguishing Between Version Control and Git-based Version Control

A solid version control system, like the one in Orchestra, enhances development speed, risk management through rollback capabilities, and naturally prevents unwanted changes from entering production. However, this differs from using Git specifically as a version control method.

There are pros and cons to consider when comparing Git with other version control systems.

Utilizing Everything-as-Code

Git's everything-as-code approach is both an advantage and a disadvantage. It allows large-scale changes to be executed easily, but that same flexibility invites problems if a robust developer framework (linting, auto-completion, CI/CD) is not in place.

Continuous Integration and Delivery

Platforms like GitHub facilitate automated testing through GitHub actions when code changes are pushed. This is crucial to mitigate the variety of bugs that can arise from coding.

On the other hand, systems like Orchestra or Coalesce are designed to be user-friendly, reducing the need for automated checks after publishing; automated post-publication checks are nonetheless planned for the future.

Costs

While platforms like GitHub offer 2,000 free CI/CD minutes monthly, running CI/CD actions at scale incurs costs that are often overlooked when utilizing a Git-based version control system.

Context Switching and Slower Development

Frequent context switching between user interfaces like Airflow and version control systems such as GitHub can be challenging. The need for CI/CD can also result in prolonged waiting times for developers between pull requests, even for minor adjustments. Moreover, if the entire team lacks familiarity with tools like Git, this can further slow down progress due to bottlenecks in PR reviews.

Conclusion

The feedback regarding Orchestra's version control system has been overwhelmingly positive. Users are experiencing tangible benefits, including improved development speed, enhanced risk management, and effective onboarding with appropriate safeguards.

This article has explored various aspects of version control, particularly its relationship with Git. While full Git integration is not yet generally available in Orchestra, we are excited about offering this capability to those who need it in the near future. Let's keep moving forward!

Learn More About Orchestra

Orchestra stands as a top-tier Data Pipeline Management platform. You have the software to monitor your data; what about the platform dedicated to overseeing your data pipelines?

Our documentation is available for reference, and we encourage you to explore our integrations—designed to help you launch your pipelines instantly. We also maintain a blog featuring contributions from the Orchestra team and guest writers, along with whitepapers for more comprehensive insights.
