PREFECT could be PERFECT
In the 21st century, the concept of modern data engineering has made a name for itself.
We no longer just talk about how to ingest data from source A into destination B. We also need to worry about how to read from source XYZ, apply core business logic on the fly, and write to a destination in a form that can be picked up by another system which is completely independent of all the others.
Real-world use cases can be even worse and more complicated than this.
Prefect seems to have solved this problem by keeping the logic simple, under the banner of “Jobs as Code”.
Prefect
Prefect, in simple terms, is a workflow management system.
Prefect was built to overcome the problematic areas in modern data engineering systems while giving users the flexibility to build flows with less burden.
The core idea is to use Python to build your jobs, but with decorators.
One quick fact:
Prefect claims to already have more unit tests and greater test coverage than any other workflow engine, including the entire Airflow platform.
Tasks are functions
Prefect says the tasks you want to run are nothing more than functions.
Write your function in Python, apply the decorator, and you are done.
In more technical terms, your functions are tasks with special rules of their own: when to run, which upstream tasks they depend on, or how parameters are passed.
But the core idea still remains the same: tasks are functions.
Workflows are containers
When you define a workflow, it is basically a group of tasks. You define the core logic in the tasks, and the workflow groups them together.
But remember, workflows don't contain any business logic of their own.
Communication via state
Now you have defined the tasks as well as the workflows (flows). But how are they going to communicate with each other, or with other workflows?
This is taken care of by STATE. State describes the behaviour of a run and can be passed along; it applies to tasks as well as to workflows.
Now that you have understood the idea behind Prefect, let's look at the framework.
Prefect framework
At a high level, the Prefect framework has three major components:
- Workflow definition
- Workflow engine
- Workflow state
Workflow Definition
This is where you write your core business logic; it is where you say what exactly you want your job to do.
Along with that, you define special rules: how you want your job to run, its dependencies on other jobs, the parameters you pass in, or if/else conditions based on the data.
Workflow Engine
This is the heart of Prefect, where your jobs, flows, and tasks are executed.
You defined the special rules in the workflow definition; the workflow engine enforces them at runtime.
It’s not enough to merely “kick off” each task in sequence; a task may stop running because it succeeded, failed, skipped, paused, or even crashed! Each of these outcomes might require a different response, either from the engine itself or from subsequent tasks. The role of the engine is to introspect the workflow definition and guarantee that every task follows the rules it was assigned.
Workflow State
The workflow engine always ends up in some state.
It could be Success if all is well, or Failed based on some condition.
By imbuing every task and workflow with a strong notion of “state”, Prefect enables a rich vocabulary to describe a workflow at any moment before, during, or after its execution.
That said, Prefect is not the only option you have for a workflow management system.
There are others, such as:
- Airflow, which has been adopted by hundreds of organisations.
- Azure Logic Apps
Prefect vs airflow
Getting tightly coupled with Airflow
Airflow has its own way of representing logic and of running, executing, and scheduling it. Instead of using Airflow to solve our use case, we often end up converting our use case to fit the Airflow architecture.
This becomes a problem when you want to build complex business logic, or to move away from Airflow altogether.
In Prefect, on the other hand, tasks are simple functions with decorators, which makes it easier to adapt Prefect to your use case.
The execution-date confusion
The most common question a new data engineer has is: what is the execution date?
Airflow has a strict dependency on a specific time: the execution_date. No DAG can run without an execution date, and no DAG can run twice for the same execution date. Do you have a specific DAG that needs to run twice, with both instantiations starting at the same time? Airflow doesn't support that; there are no exceptions.
Data exchange
One of the most common requirements in data engineering is to communicate and pass data between tasks. We do not want to persist intermediate data; modern data engineering requires data to be passed on the fly, without persistence.
XComs use admin access to write executable pickles into the Airflow metadata database, which has security implications. Even in JSON form, it has immense data privacy issues. This data has no TTL or expiration, which creates performance and cost issues.
Prefect elevates dataflow to a first class operation. Tasks can receive inputs and return outputs, and Prefect manages this dependency in a transparent way. Additionally, Prefect almost never writes this data into its database; instead, the storage of results (only when required) is managed by secure result logic that users can easily configure.
Prefect backend
Docker container
Prefect runs on top of Docker containers, which makes it easier to manage and scale.
Engine
It's the core that binds all the components together.
Agent
Agents are responsible for running and managing flows.
GraphQL
GraphQL is used to query Prefect via APIs.
Postgres
A backend database that stores job history.
We can deploy Prefect locally as well as in the cloud.
It supports AWS Fargate as well.
And if you do not want the trouble of maintaining it yourself, there is Prefect Cloud, a managed platform for Prefect.
Creating Prefect Flow
- Write a task: this is basically where you define the core logic.
- Add a decorator over the task, which defines the rules.
- Write a flow, which groups and manages the tasks.
It’s as simple as that.
On top of that, you can add logging to your job.
You can define schedules for when it should run.
And you can even pass parameters to your job.
Let's take one small Python example and see how easy it is to convert it to Prefect.
Python example:
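The original snippet was embedded as an image; a hypothetical stand-in for the kind of small script being converted might look like:

```python
# A plain-Python ETL-style script: three ordinary functions
# chained together, with no orchestration at all.
def extract():
    return [1, 2, 3]

def transform(data):
    return [x + 1 for x in data]

def load(data):
    print(f"Loaded {len(data)} records")
    return len(data)

loaded = load(transform(extract()))
```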
Let's convert the same to Prefect.
Prefect:
You can be more realistic and add a few dependencies as well.
You can add a logger as well:
Demo time
Let’s install the library first
pip install "prefect[viz]"
Let’s use our example
https://gist.github.com/ajithshetty/03fd2a3c948518dae97d71f5b9ea76a5
When we run
f.visualize()
Before you start the execution, start Docker or docker-compose.
To configure Prefect for local orchestration:
prefect backend server
Then start the server:
prefect server start
Once all components are running, you can view the UI by visiting http://localhost:8080.
Please note that executing flows from the server requires at least one Prefect Agent to be running:
prefect agent local start
Finally, to register any flow with the server, call flow.register(). For more detail, please see the orchestration docs.
Let’s create a project
prefect create project "prefect_example"
and register our code with the project.
f.register(project_name="prefect_example")
Upon running the code you get the below output:
Let’s navigate to the UI:
Click on Quick Run
After successful run:
Click on Schematic to see the dependency
Click on Task to see the individual tasks:
Reference:
The demo code above can be found in my repo:
Ajith Shetty
Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.
Interested in weekly updates on big data analytics around the world? Do subscribe to my weekly newsletter: Just Enough Data