Apache Airflow: A Must-Know Orchestration Tool for Data Engineers

DataGeeks
9 min read · Feb 8, 2023

Apache Airflow is an orchestration tool developed at Airbnb and later given to the open-source community. Today, it is one of the most popular orchestration frameworks among data engineers because it is dynamic (pipelines are defined in Python), scalable, interactive, and extensible.
Note: It is an orchestration framework, not a streaming or data processing framework.

Let’s start your journey with Airflow by understanding its basic concepts. Here is an article describing how to install Airflow in a couple of steps so you can get started quickly -> link

Architecture of Airflow

Webserver: The webserver is a Flask application served by Gunicorn that provides the Airflow user interface. Without the webserver, you can't access the user interface of Airflow.

Scheduler: The scheduler is the heart of Airflow. It is responsible for scheduling all tasks and submitting them to the executor to run. If the scheduler is down or not working, you won't be able to trigger any task.
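To make the scheduler/executor split concrete, here is a toy sketch in plain Python. It is not Airflow's code: ToyScheduler and ToyExecutor are made-up names, and the real scheduler also handles schedules, retries, and state stored in the metadata database. The sketch only shows the core idea that the scheduler decides what is ready to run and the executor does the running.

```python
# Toy model only: ToyExecutor and ToyScheduler are illustrative names,
# not Airflow classes.

class ToyExecutor:
    """Runs whatever the scheduler hands it and records the order."""

    def __init__(self):
        self.ran = []

    def submit(self, task_id, fn):
        fn()
        self.ran.append(task_id)


class ToyScheduler:
    """Submits a task once all of its upstream tasks have finished."""

    def __init__(self, executor):
        self.executor = executor
        self.tasks = {}      # task_id -> callable
        self.upstream = {}   # task_id -> set of task_ids it depends on

    def add_task(self, task_id, fn, upstream=()):
        self.tasks[task_id] = fn
        self.upstream[task_id] = set(upstream)

    def run(self):
        done = set()
        pending = set(self.tasks)
        while pending:
            # A task is ready when every upstream task is already done.
            ready = sorted(t for t in pending if self.upstream[t] <= done)
            if not ready:
                raise RuntimeError("cycle or missing upstream task")
            for task_id in ready:
                self.executor.submit(task_id, self.tasks[task_id])
                done.add(task_id)
                pending.discard(task_id)


executor = ToyExecutor()
scheduler = ToyScheduler(executor)
# Tasks are registered out of order on purpose.
scheduler.add_task("load", lambda: None, upstream=["transform"])
scheduler.add_task("extract", lambda: None)
scheduler.add_task("transform", lambda: None, upstream=["extract"])
scheduler.run()
print(executor.ran)  # ['extract', 'transform', 'load']
```

Notice that nothing runs until scheduler.run() is called. That mirrors the point above: with no scheduler, the executor never receives any work.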

Metadata Database: All data related to Airflow is stored in the metadata database, e.g. data related to connections, variables, DAGs, jobs, etc.
Airflow talks to the metadata database through SQLAlchemy, so you can choose a compatible database such as Postgres or MySQL.
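As a rough illustration of how the metadata database is wired up, the sketch below builds a SQLAlchemy-style connection URI and exports it as an environment variable. The helper build_sql_alchemy_conn, the credentials, and the database name are all hypothetical. The variable name AIRFLOW__DATABASE__SQL_ALCHEMY_CONN assumes Airflow 2.3 or later; older versions keep this setting under the [core] section instead, so check the configuration reference for your version.

```python
import os
from urllib.parse import quote_plus


def build_sql_alchemy_conn(user, password, host, port, database,
                           driver="postgresql+psycopg2"):
    """Return a SQLAlchemy-style URI (hypothetical helper, not an Airflow API).

    Credentials are URL-encoded so characters like '@' or '/' in a
    password do not break the URI.
    """
    return (f"{driver}://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{database}")


# Placeholder values for illustration only.
conn = build_sql_alchemy_conn("airflow_user", "p@ss/word",
                              "localhost", 5432, "airflow_db")

# Assumed setting name for Airflow 2.3+ ([database] section).
os.environ["AIRFLOW__DATABASE__SQL_ALCHEMY_CONN"] = conn
print(conn)
# postgresql+psycopg2://airflow_user:p%40ss%2Fword@localhost:5432/airflow_db
```

For MySQL, only the driver prefix would change (for example mysql+mysqldb), assuming the matching Python driver is installed.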


DataGeeks

A data couple with 15 years of combined experience in data who love to share their knowledge about data.