Musings on the Pros and Cons of Apache Airflow
By Karl Gutwin, Director of Software Engineering Services at The BioTeam
December 22, 2020 | TRENDS FROM THE TRENCHES: I’m routinely asked questions about Apache Airflow, such as when to use it, how to get the most out of it, and what to watch out for. This article is an opportunity to share my answers to the questions I’m asked most often, along with my opinion on why it’s one of the most critical tools in a data scientist’s toolbox.
What Is It?
Airflow is an open-source task scheduler that allows users to programmatically author, build, monitor, and share workflows in the cloud. The crowd-sourced software has been around since 2014, when its development was spurred by the data team at Airbnb. They donated their code to the Apache Software Foundation to make it widely available. To this day, Airbnb and many other corporations still contribute code and make financial donations to support its continued development. Commercial alternatives that are easier to work with do exist, but the price is prohibitive to some groups, and commercial off-the-shelf software often doesn’t offer the flexibility and community support that open-source software does.
Most workflows are constructed out of a collection of individual tasks linked together by their dependencies. This concept can be represented as a directed acyclic graph (DAG). DAGs have a topological order, so the computation can flow only in the specified direction, without loops back to earlier steps, which would add complexity. As an example, cooking a meal is a step-by-step process, where there can be parallel steps and decisions, but where you usually cannot go backwards to an earlier step (you cannot un-bake a cake).
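To make the idea of topological order concrete, here is a minimal sketch in plain Python (it is not Airflow code). It assumes Python 3.9 or later for the standard library's graphlib module, and the task names are invented to match the cooking analogy.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Each key is a task; its value is the set of tasks that must finish first.
# The task names are made up for this illustration.
cake_steps = {
    "preheat_oven": set(),
    "mix_batter": set(),
    "bake": {"preheat_oven", "mix_batter"},
    "frost": {"bake"},
}


def execution_order(dependencies):
    """Return one valid order in which every task follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


def is_acyclic(dependencies):
    """Return True if the dependencies form a DAG (no loops back to earlier steps)."""
    try:
        TopologicalSorter(dependencies).prepare()
    except CycleError:
        return False
    return True


order = execution_order(cake_steps)
print(order)  # "bake" always appears after both of its prerequisites
```

The two independent steps (preheating and mixing) may come out in either order, which is what allows a scheduler to run them in parallel. A dependency map that loops back on itself, such as two tasks that each wait on the other, is rejected by is_acyclic.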
For data scientists, workflows are present in virtually any area where automation is possible. This includes tasks such as automating your DevOps processes, fetching data, machine learning pipelines, data transfer, monitoring cron jobs and so on. The best way to think about Airflow is as an automation platform that is oriented toward the data space. Many organizations start off with a script or manual process to copy and transfer data, then evolve to wanting that data flow automated. Long gone are the days when a data scientist had to run a script every Friday night as part of a scheduled database update.
Airflow allows users to create workflows with high granularity and track the progress as they execute. This makes it easy to scale out the management of lots of these jobs at the same time, which is good when you think about all the data scientists in a large organization who each have responsibilities for managing individual pieces of complicated data flows. Airflow enables you to have a platform that can just run all these jobs on a schedule, adding more and more jobs as you need to.
Another advantage of Airflow and similar tools is that they make it easy to do potentially large data operations. As an example, if you wanted to run a SQL query every day or every hour and have the results stored as a Parquet file in an S3 bucket, that sequence of operations can be done with out-of-the-box components within Airflow. That SQL query could potentially generate terabytes of data, but Airflow doesn't really care, because it's not necessarily looking at every single row that's coming out of that query. Its job is to coordinate the steps that move the data, not to inspect every piece of it.
Open-source software is backed by a surprising amount of terrific and free support from the community. You may get caught up with a Python dependency issue or be struggling with a cluster scale configuration issue or something else. You could spend a lot of time on Google trying to find answers, or many times you can post the issue in a community forum to find advice and fixes, especially if you’re someone who supports that community when you have the opportunity to help.
Application In The Field
I’ve had a lot of experience working with life science clients in the field. Two recent examples come top of mind. One client was building a scientific pipeline to move data from lab instruments to a scientific data storage system. Their data scientists needed to be able to write arbitrary workflows for each data source. The challenge was that each of these data sources (various lab equipment, including e-Lab notebooks) generated data in different formats, some of which were incompatible with direct loading to the downstream system. That incompatibility meant that the data weren’t accessible for analysis, and furthermore, users wouldn’t find the scientific data system useful if it didn’t have their data in it.
What they needed was a system that could handle arbitrary data transformations and work as an on-demand data pipeline. Basically, they needed a solution so that data could be automatically submitted, processed as each individual file type required, and passed to the downstream system with minimal manual intervention. Together, we implemented a simple trigger-based system using Airflow on a single server. Whenever data entered an S3 bucket, it triggered a series of Airflow workflows moving the data along its journey. All their code was centralized in a Git repository to improve code quality and long-term management, which will also make it easier to validate the system in the future if needed. Here, Airflow helped our client to ease the burden of data curation, data tagging, and implementing policies for better hygiene of data.
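The routing step in a trigger-based pipeline like this can be pictured with a small sketch. It is not the client's code: the file extensions, the workflow names, and the manual-review fallback are all hypothetical, and the function only decides which workflow should handle a newly arrived object.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the workflow that converts it.
WORKFLOW_BY_EXTENSION = {
    ".csv": "convert_plate_reader_csv",
    ".xml": "convert_instrument_xml",
    ".pdf": "extract_notebook_pdf",
}

# Hypothetical catch-all for file types that nobody has written a workflow for yet.
FALLBACK_WORKFLOW = "hold_for_manual_review"


def choose_workflow(object_key):
    """Pick a workflow name for a newly arrived object, based on its extension."""
    suffix = PurePosixPath(object_key).suffix.lower()
    return WORKFLOW_BY_EXTENSION.get(suffix, FALLBACK_WORKFLOW)


print(choose_workflow("incoming/run42/results.CSV"))
print(choose_workflow("incoming/run42/image.tiff"))
```

Keeping the mapping in one place means that supporting a new instrument is a one-line change plus a new workflow, which fits the goal of letting data scientists write a workflow per data source.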
Another client had a more traditional need to do ETL (Extract, Transform, and Load). That’s a long-standing term that refers to the movement of data from a source to a destination. People may be surprised to learn how many data scientists are using Airflow in an AWS environment and that there are numerous existing components available within that ecosystem. Airflow integrates directly with their EMR (Elastic MapReduce) system on AWS for running Spark jobs alongside arbitrary Python scripts that they’ve written to do basic periodic jobs. They’ve automated their workflows and can track the success of each job; they no longer need to remember to manually run them or intervene. Now, they can just monitor the workflows and be confident that their data are being regularly refreshed.
Four Clear Benefits
One, Airflow is open source. While it’s not for everyone, many data scientists prefer to support and work with their peers in the community versus buy commercial software. There are advantages: you can download it and start using it right away versus enduring a long procurement cycle and process to get a quote, submit a proposal, secure the budget, sign the licensing contract and all that. It’s liberating to be in control and make the selection whenever you want to.
Two, it’s very flexible. In some of the commercial tools, they are great until you go off the main path and try to do something a little creative that goes beyond what the tool was designed to do. Airflow was designed to work within an architecture that is standard for nearly every software development environment. Dynamic pipeline generation is another attractive aspect of its flexibility. You can run one big Airflow server or multiple small ones; the flexibility is there to support either approach.
Three, it’s highly scalable, both up and down. A typical Airflow deployment often runs on a single server. But you can also get it to run very well inside a Docker container, even on your laptop, to support local development of pipelines. And you can scale it up to very large deployments with dozens of nodes running tasks in parallel in a highly available clustered configuration.
Four, it runs very well in a cloud environment. There are options to run it in a cloud native, scalable fashion; it will work with Kubernetes and it will work with auto-scaling cloud clusters. It's fundamentally a Python system that just gets deployed as a couple of services. So, any environment that will run one or more Linux boxes with Python and a database for state management can run this environment, which opens a lot of options for data scientists.
Airflow’s heavy reliance on Python for both the workflow language (DAGs) and the application itself can be a strength and a weakness at times, as developing DAGs and maintaining the application require an effective knowledge of Python. Advanced features such as authentication, non-standard data sources, and parallelization require specialized infrastructure knowledge, and often must be set up manually as part of the design of an effective workflow. Airflow’s Linux-specific code base and inability to run multiple versions of Python in parallel have limited its adoption. Also, it doesn’t support quick, on-the-fly changes to workflows, so you have to be intentional about what you’re doing.
The core of Airflow is typically stable and reliable but there are glitches, particularly when you’re trying to scale. Also, the web user interface is arcane and has a steep learning curve. It’s just not as smooth and slick as we’ve come to expect given today’s standards in application user interface design, but the many tutorials and example code available from a variety of sources help address this issue.
Airflow executes Python code, and it does so in a manner that is somewhat idiosyncratic. You can't just give it an arbitrary Python script; you have to write it within Airflow's specific constructs. Each Python script that you write as a workflow needs to be managed, and it can be easy to fall into a trap of avoiding good software development practices in the process of rapidly building and running your Airflow code. For example, not having revision control, code consistency rules, style checks and so on in place can create a potential minefield, especially for new staff coming into the environment you created. In the extreme, this can result in a mess that needs extra effort to clean up.
There’s also some risk involved with long-term community support of a free set of software. While commercial software isn’t necessarily guaranteed to have long-term support either, many organizations hesitate when considering open source for business-critical functions. However, the advantages may still outweigh the disadvantages and risk associated with incorporating open-source software in a production environment.
Getting The Most Out Of Airflow
For long-term success with your Airflow instances, there are several important “best practices” to consider. First, treat your DAGs as code being developed, and use standard practices for software development for this code. Structure your deployment to use automation and revision control to manage your DAGs; by default, Airflow gives you a directory and expects you to put your code into the directory, which invites manual local modifications. Instead, establish rules for making all changes in a central Git repository such as GitLab or GitHub, which then can be automatically synchronized with your Airflow server. This synchronization can be done through a workflow that runs on your Airflow server which pulls the code from your Git repository on a set schedule; do a git pull hourly if you want. It is also a good idea to limit access to the server to prohibit users from being able to edit the files in the directory. In this way, any conflicts can be resolved in the repository versus on the server.
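One way to picture the scheduled synchronization is a small Python helper that an hourly task could call. This is a sketch under assumptions: the function names and the example path are hypothetical, and the --ff-only flag is chosen here so that the pull fails, rather than merging, if the server copy has drifted from the repository.

```python
import subprocess


def build_pull_command(dags_dir):
    """Build the git command that updates the checkout in dags_dir."""
    # "git -C <dir>" runs git as if it had been started in that directory.
    return ["git", "-C", str(dags_dir), "pull", "--ff-only"]


def sync_dags(dags_dir):
    """Run the pull and return git's output.

    Raises subprocess.CalledProcessError if git exits with a non-zero status,
    for example when local edits prevent a fast-forward.
    """
    result = subprocess.run(
        build_pull_command(dags_dir),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# The path below is a placeholder, not a path named in this article.
print(build_pull_command("/path/to/airflow/dags"))
```

A failed pull then shows up as a failed task run instead of a silent divergence between the server and the repository, which is the point of resolving conflicts in Git rather than on the server.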
Teams should also have a strategy on how to handle branching and merging. This could be something as simple as requiring everyone to always rebase the code and have a single branch. Or, it could be more complicated with feature branches, numerous tiers, code review by committee and pull requests to determine what gets merged into production. By determining this with your team up front, you can minimize clashes which can impact the pace of development.
Since it’s all Python code, it’s also a very good idea to test it. pytest is a well-known Python testing framework that can be used to help in several ways. First, in certain circumstances, it will let you do unit testing of your custom code. Second, it can be set up in a way to do a “smoke test” of the workflow scripts. Smoke testing is mostly for catching syntax errors or other deviations from standards that you may have missed. Beyond that, you can lean into even more elaborate integration testing. For example, if you have a CI (Continuous Integration) system available, you can automatically run integration tests on changes to the code as they are pushed to the repository. In this design, the CI system launches a predefined script which will create a copy of the new code, test it, flag any errors and report back with a status and error log, reducing the risk of any errors getting deployed into the server.
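A smoke test of this kind can be quite small. The sketch below assumes the workflow scripts live in a directory named dags (a hypothetical name) and only checks that each file compiles, so it catches syntax errors without needing Airflow installed. Where Airflow is available in the test environment, loading the files through its DagBag class and checking import_errors is a deeper check.

```python
from pathlib import Path


def find_syntax_errors(dag_dir):
    """Compile every .py file under dag_dir; return {path: message} for failures."""
    errors = {}
    for path in sorted(Path(dag_dir).rglob("*.py")):
        try:
            # compile() parses the source without executing it.
            compile(path.read_text(encoding="utf-8"), str(path), "exec")
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors


def test_dag_files_compile():
    # pytest collects functions named test_*; "dags" is an assumed directory name.
    assert find_syntax_errors("dags") == {}
```

Run under pytest in a CI job, a check like this turns a typo in a workflow script into a failed build rather than a broken workflow on the server.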
Next, it’s worth not just thinking big about Airflow, but also thinking small. By this, I mean consider having multiple smaller Airflow servers deployed rather than one large server or cluster. This approach can be particularly beneficial when you need to support multiple independent user groups who may have different availability requirements or package dependencies. This can also make security controls easier, as it is generally easier to restrict access at the server level rather than within the Airflow application. To make this approach work efficiently, you will need to rely on DevOps automation and templates to deploy and manage your Airflow instances, which today is considered best practice regardless of the number of deployments you decide to make.
Finally, there are commercial vendors and open-source templates that are worth considering if Airflow appears to be daunting to deploy and manage. First, Turbine is an open-source CloudFormation template for AWS which is designed to deploy a fully auto-scaling configuration with a single click deployment. Also, Astronomer offers commercial support and a SaaS deployment of Airflow, which is particularly appealing for organizations that rely heavily on Airflow and want stability and expertise from a vendor. Last, but not least, consider working with a team such as BioTeam if you are looking into bringing Airflow into your environment and want advice and expertise.
For more from BioTeam, see the Trends from Trenches newsletter: https://bioteam.net/newsletter/