PREFECT could be PERFECT

Ajith Shetty · Published in Geek Culture · 7 min read · Sep 19, 2021

source: https://github.com/PrefectHQ/prefect

In the 21st century, the concept of modern data engineering has picked up name and fame.

We no longer talk only about how to ingest data from source A to destination B. We also need to worry about how to source from XYZ, apply core business logic on the fly, and write to a destination in a form that can be picked up by another system that is totally independent of it all.

The real-world use case could be even worse and more complicated than this.

Prefect seems to have solved this problem by keeping the logic simple, under the banner of “Jobs as Code”.

Prefect

Prefect, in simple terms, is a workflow management system.

Prefect was built to overcome the problem areas of current modern data engineering systems, and at the same time to give users the flexibility to build flows with less burden.

The core idea is to use Python to build your jobs, but with decorators.

One quick fact:

Prefect already has more unit tests and greater test coverage than any other workflow engine, including the entire Airflow platform

Tasks are functions

Prefect says the tasks you want to run are nothing more than functions.

Write your function in Python, apply the decorator, and you are done.

In slightly more technical terms, your functions are tasks that carry special rules of their own: when to run, which tasks are dependencies/upstream, or how parameters are passed.

But the core idea still remains the same: tasks are functions.
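To make this concrete, here is a minimal sketch of a Prefect 1.x task (the version used in this post); the retry settings are an assumed example of the “special rules” the decorator can carry:

from datetime import timedelta
from prefect import task

# The decorator turns a plain function into a task and attaches its rules,
# e.g. retry up to 3 times, waiting 10 seconds between attempts.
@task(max_retries=3, retry_delay=timedelta(seconds=10))
def say_hello():
    print("Hello, Prefect!")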

Workflows are containers

When you define a workflow, it is basically a group of tasks. You have defined the core logic in the tasks, and using the workflow you group them together.

But remember, a workflow doesn't have any specific logic of its own to run.
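A minimal sketch, assuming two trivial tasks: the flow only wires the tasks together, while all the logic lives in the tasks themselves.

from prefect import task, Flow

@task
def say_hello():
    print("hello")

@task
def say_goodbye():
    print("goodbye")

# The flow is just a container: it groups the tasks and records how they
# are called, but holds no business logic of its own.
with Flow("greetings") as flow:
    say_hello()
    say_goodbye()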

Communication via state

Now you have defined the tasks as well as the workflows (flows). But how are you going to communicate between the tasks, or with other workflows?

This is taken care of by STATE. State defines the behaviour of the workflow and can be passed along. It applies to tasks as well as to workflows.

Now that you have understood the idea behind Prefect, let's look at the framework.

Prefect framework

At a high level, the Prefect framework has three main components.

  • Workflow definition
  • Workflow engine
  • Workflow state

Workflow Definition

This is where you write your core business logic; in other words, where you say exactly what you want your job to do.

Along with that, you define the special rules: how you want your job to run, its dependencies on other jobs, the parameters you pass in, or if/else conditions based on the data, as the sketch below shows.
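Here is a minimal sketch of such rules in Prefect 1.x (the task names and branching logic are assumptions for illustration), combining a runtime parameter with if/else style branching:

from prefect import task, Flow, Parameter, case

@task
def is_weekend(day):
    return day in ("sat", "sun")

@task
def run_weekend_report():
    print("weekend report")

with Flow("rules") as flow:
    # A Parameter is supplied when the flow runs, not when it is defined.
    day = Parameter("day", default="mon")
    cond = is_weekend(day)
    # `case` branches the flow based on the data.
    with case(cond, True):
        run_weekend_report()

flow.run(parameters={"day": "sat"})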

Workflow Engine

This is the heart of Prefect, where we execute our jobs, flows, or tasks.

You have defined the special rules in the workflow definition, and the workflow engine executes them.

It’s not enough to merely “kick off” each task in sequence; a task may stop running because it succeeded, failed, skipped, paused, or even crashed! Each of these outcomes might require a different response, either from the engine itself or from subsequent tasks. The role of the engine is to introspect the workflow definition and guarantee that every task follows the rules it was assigned.

Workflow State

The workflow engine will end up in some state.

It could be Success if all is well, or Failure based on some condition.

By imbuing every task and workflow with a strong notion of “state”, Prefect enables a rich vocabulary to describe a workflow at any moment before, during, or after its execution.
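As a small sketch of inspecting state in Prefect 1.x (reusing the `flow` object from the earlier sketches):

# Running a flow returns its final state.
state = flow.run()

print(state.is_successful())  # True if the flow run succeeded
print(state.is_failed())      # True if it failed

# Each task ended up in its own state too; in Prefect 1.x,
# `state.result` maps task objects to their individual states.
for task_obj, task_state in state.result.items():
    print(task_obj, task_state)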

That said, Prefect is not the only option you have for a workflow management system.

We also have:

  1. Airflow, which has been adopted by hundreds of organisations.
  2. Azure Logic Apps

Prefect vs Airflow

Getting tightly coupled with Airflow

Airflow has its own way of representing logic and of running, executing, and scheduling it. In a way, instead of using Airflow to solve our use case, we end up converting our use case to fit the Airflow architecture.

This becomes a problem when you want to build complex business logic, or when you want to move away from Airflow altogether.

But in Prefect the tasks are simple functions with decorators, which makes it easier to fit Prefect to your use case.

The execution-date confusion

The most common question a new data engineer has is: what is the execution_date?

Airflow has a strict dependency on a specific time: the execution_date. No DAG can run without an execution date, and no DAG can run twice for the same execution date. Do you have a specific DAG that needs to run twice, with both instantiations starting at the same time? Airflow doesn’t support that; there are no exceptions.

Data exchange

The most common requirement in data engineering is to communicate and pass data between tasks. We do not want to persist any of this data; modern data engineering requires data to be passed on the fly, without persistence.

XComs use admin access to write executable pickles into the Airflow metadata database, which has security implications. Even in JSON form, it has immense data privacy issues. This data has no TTL or expiration, which creates performance and cost issues.

Prefect elevates dataflow to a first class operation. Tasks can receive inputs and return outputs, and Prefect manages this dependency in a transparent way. Additionally, Prefect almost never writes this data into its database; instead, the storage of results (only when required) is managed by secure result logic that users can easily configure.
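As a minimal sketch of this first-class dataflow (task names assumed), one task's return value simply becomes the next task's input, and Prefect handles the handover:

from prefect import task, Flow

@task
def get_number():
    return 42

@task
def double(x):
    return x * 2

with Flow("dataflow") as flow:
    # double() receives get_number()'s output; Prefect tracks the
    # dependency and moves the data between the tasks itself.
    double(get_number())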

Prefect backend

Docker container

Prefect runs on top of Docker containers, which makes it easier to manage and scale.

Engine

It's the core that binds all the components together.

Agent

Agents are responsible for running and managing flows.

GraphQL

Prefect uses GraphQL to query via APIs.

Postgres

A backend database that stores job history.

We can deploy Prefect locally as well as in the cloud.

It supports AWS Fargate as well.

And if you do not want the trouble of maintaining it yourself, there is Prefect Cloud, a managed platform for Prefect.

Creating Prefect Flow

  1. Write a task, which is basically where you define the core logic.
  2. Add a decorator over the task, which defines the rules.
  3. Write a flow, which groups and manages the tasks.

It’s as simple as that.

On top of that, you can add logging to your job.

You can define schedules for when it should run.

And you can even pass parameters to your job.

Let's take one small Python example, and I'll show you how easy it is to convert it to Prefect.

Python example:
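(The original gist is embedded as an image; here is a minimal stand-in for the kind of plain-Python job being discussed, with hypothetical function names.)

def extract():
    return [1, 2, 3]

def transform(data):
    return [x * 10 for x in data]

def load(data):
    print(f"Here is your data: {data}")

load(transform(extract()))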

Let's convert the same to Prefect.

Prefect:
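(Again a stand-in sketch: the same functions, now decorated as tasks and grouped into a flow, which we'll call `f` as in the demo below.)

from prefect import task, Flow

@task
def extract():
    return [1, 2, 3]

@task
def transform(data):
    return [x * 10 for x in data]

@task
def load(data):
    print(f"Here is your data: {data}")

with Flow("prefect_example") as f:
    load(transform(extract()))

f.run()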

You can be more realistic and add a few dependencies as well:
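(A sketch of a state-only dependency between tasks that exchange no data, using set_upstream; the task names are assumptions.)

from prefect import task, Flow

@task
def wait_for_files():
    print("files have arrived")

@task
def load_files():
    print("loading")

with Flow("with-deps") as f:
    w = wait_for_files()
    l = load_files()
    # load_files runs only after wait_for_files finishes successfully,
    # even though no data flows between them.
    l.set_upstream(w)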

You can add a logger as well:
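(A sketch of Prefect 1.x's context logger inside a task.)

import prefect
from prefect import task

@task
def load_files():
    # Prefect 1.x exposes a pre-configured logger through the run context.
    logger = prefect.context.get("logger")
    logger.info("Load step finished!")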

Demo time

Let’s install the library first

pip install "prefect[viz]"

Let’s use our example

https://gist.github.com/ajithshetty/03fd2a3c948518dae97d71f5b9ea76a5

When we run:

f.visualize()

Before you start the execution, start Docker (or docker-compose).

To configure Prefect for local orchestration:

prefect backend server

Once the backend is configured, start the server:

prefect server start

Once all components are running, you can view the UI by visiting http://localhost:8080.

Please note that executing flows from the server requires at least one Prefect Agent to be running:

prefect agent local start

Finally, to register any flow with the server, call flow.register(). For more detail, please see the orchestration docs.

Let’s create a project

prefect create project 'prefect_example'

and register our code with the project.

f.register(project_name="prefect_example")

Upon running the code, you get the below output:

Let’s navigate to the UI:

Click on Quick Run

After successful run:

Click on Schematic to see the dependency

Click on Task to see the individual tasks:

Reference:

The above demo code can be found in my repo:


Interested in getting weekly updates on big data analytics around the world? Subscribe to my weekly newsletter, Just Enough Data.


Ajith Shetty
Bigdata Engineer — Love for BigData, Analytics, Cloud and Infrastructure. Want to talk more? Ping me on LinkedIn: https://www.linkedin.com/in/ajshetty28/