There are hundreds of blog posts about Dagster and what it does, and the Airflow vs Dagster debate has been going on for a long time.
Here we take a step back, look at the origin of Dagster, and answer the questions below.
- How did it start?
- Why did it come into existence?
- Why do we need it?
A Quick Intro to Dagster
Dagster is developed by Elementl, a company aiming to reshape the data management ecosystem. Elementl was founded by Nick Schrock, its CEO and the creator of Dagster, a new programming model for data processing.
Before we spill the beans on Dagster, let's talk about the real-world problems in the data engineering and data science world.
Almost all the time, we hear people talking about how bad the data is and how painful it is to fix.
Data people keep raising the same complaints:
- Where the hell does this data come from?
- I have no idea what this means.
- I don't trust this data, and I can't test my model with it.
- Well, it's 2021 and this isn't sexy.
Data engineers and data scientists are often said to spend 80% of their time cleaning data and only 20% doing the actual job.
I think that lays out the problem statement, so now let's see how Dagster is going to help us.
We have two teams, Data Engineering and Data Science, sitting in silos and asking one another to do their job.
Data Science builds the model and hands it to Data Engineering, and Data Engineering productionises it.
But neither team knows what is happening on the other side.
It's the 21st century, and we need a tool that takes away the pain of building, testing and productionising, so that we only have to worry about solving the problem.
Dagster does just what it is supposed to: it brings them together.
Dagster takes care of the whole end-to-end pipeline.
We can create a collaborative pipeline across units such as Data Engineering and Data Science.
It can take away the repetitive work of cleaning your data and testing it all over again. Directly from the UI, we can validate our data and test and develop our pipeline.
Let's look at the core principles of Dagster:
1. Develop and test instantly
2. Support the current tools
3. Incremental adoption of tools support
4. Productivity gain immediately
Develop and Test Instantly
Dagster has a rich UI, and it is very verbose.
It displays the logs instantly and helps you code better. It also provides auto-completing typeaheads.
We can type and see how it reacts.
We can write and test again and again.
You can execute a piece of your pipeline right from the UI.
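To make the idea of running one piece of a pipeline a bit more concrete, here is a minimal sketch in plain Python. It deliberately does not use Dagster's API: the step names `load_rows`, `clean_rows` and `total_amount`, and the `run_pipeline` helper, are made up for illustration. The only point is that when each step is a small function, you can run and test one step by itself before running everything end to end.

```python
# Plain-Python sketch of the idea, NOT Dagster's API.
# All function names here are hypothetical.


def load_rows():
    """Pretend source: returns raw rows, some of them dirty."""
    return [
        {"user": " alice ", "amount": "10"},
        {"user": "bob", "amount": None},
        {"user": "carol", "amount": "7"},
    ]


def clean_rows(rows):
    """Drop rows with a missing amount, trim names, cast amounts to int."""
    return [
        {"user": row["user"].strip(), "amount": int(row["amount"])}
        for row in rows
        if row["amount"] is not None
    ]


def total_amount(rows):
    """Sum the cleaned amounts."""
    return sum(row["amount"] for row in rows)


def run_pipeline(source, steps):
    """Call the source, then feed each step's output into the next step."""
    value = source()
    for step in steps:
        value = step(value)
    return value


# Run a single step on hand-made input, without touching the source.
print(clean_rows([{"user": " dave ", "amount": "3"}]))

# Run the whole thing end to end.
print(run_pipeline(load_rows, [clean_rows, total_amount]))
```

Because `clean_rows` takes its input as an argument, it can be tried out alone with a tiny hand-written sample, which is the same habit the Dagster UI encourages when you execute only part of a pipeline.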
Support the Current Tools
We can create and link the dependencies with our current tech stack. It could be Spark, S3, Snowflake etc.
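One way to picture this is that the pipeline logic talks to a small interface, and the actual system behind it (a local folder, S3, Snowflake) is plugged in from outside. The sketch below is plain Python and not Dagster's API; `InMemoryStore`, `LocalFileStore` and `save_report` are hypothetical names chosen for this example.

```python
# Plain-Python sketch of swapping the system behind a pipeline step.
# NOT Dagster's API; all class and function names are hypothetical.
import json
import tempfile
from pathlib import Path


class InMemoryStore:
    """Keeps objects in a dict; handy for quick tests."""

    def __init__(self):
        self.objects = {}

    def put(self, key, text):
        self.objects[key] = text

    def get(self, key):
        return self.objects[key]


class LocalFileStore:
    """Writes objects as files under a base directory.

    A store backed by S3 or another system could expose the same
    put/get methods and be passed in the same way.
    """

    def __init__(self, base_dir):
        self.base_dir = Path(base_dir)

    def put(self, key, text):
        (self.base_dir / key).write_text(text, encoding="utf-8")

    def get(self, key):
        return (self.base_dir / key).read_text(encoding="utf-8")


def save_report(store, rows):
    """Pipeline step: it only calls put(), whatever the store is."""
    store.put("report.json", json.dumps({"row_count": len(rows)}))


rows = [{"id": 1}, {"id": 2}]

# Same step, in-memory store.
memory_store = InMemoryStore()
save_report(memory_store, rows)

# Same step, file-backed store in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    file_store = LocalFileStore(tmp)
    save_report(file_store, rows)
    file_text = file_store.get("report.json")
```

The step itself never changes; only the object handed to it does. That is the design choice that lets an existing stack stay in place while the pipeline is developed and tested against something lighter.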
Incremental adoption of tools support
One of the coolest things I have seen is the Papermill integration.
ENAAF: Execute your Notebook As A Function
We can create our Dagster pipeline and integrate it with Jupyter notebooks using Papermill.
Dagster can then run your pipeline and output a notebook,
which you can open in Jupyter and keep working on.
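The "notebook as a function" idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not Papermill's or Dagster's actual code: a notebook is reduced to a dict of code cells with string sources, the helper names `inject_parameters` and `run_notebook` are hypothetical, and the cells are run with `exec` instead of a real Jupyter kernel. What it does borrow from Papermill is the convention of a cell tagged `parameters` holding the defaults, with the caller's values placed in a new cell tagged `injected-parameters` right after it.

```python
# Simplified sketch of "Execute your Notebook As A Function".
# NOT Papermill's implementation; helper names are hypothetical.
import copy


def inject_parameters(notebook, parameters):
    """Return a copy of the notebook with a new cell that sets the
    given parameters, placed after the cell tagged 'parameters'."""
    prepared = copy.deepcopy(notebook)
    source = "\n".join(
        f"{name} = {value!r}" for name, value in parameters.items()
    )
    new_cell = {
        "cell_type": "code",
        "metadata": {"tags": ["injected-parameters"]},
        "source": source,
    }
    position = 0  # fall back to the top if no cell is tagged
    for index, cell in enumerate(prepared["cells"]):
        if "parameters" in cell.get("metadata", {}).get("tags", []):
            position = index + 1
            break
    prepared["cells"].insert(position, new_cell)
    return prepared


def run_notebook(notebook, **parameters):
    """Call the notebook like a function: inject the arguments, run
    the code cells in order, return the notebook and its variables."""
    prepared = inject_parameters(notebook, parameters)
    namespace = {}
    for cell in prepared["cells"]:
        if cell["cell_type"] == "code":
            exec(cell["source"], namespace)
    return prepared, namespace


# A tiny stand-in for a real .ipynb file.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "metadata": {"tags": ["parameters"]},
            "source": "threshold = 5",
        },
        {
            "cell_type": "code",
            "metadata": {},
            "source": "kept = [x for x in [1, 4, 8, 12] if x > threshold]",
        },
    ]
}

prepared, namespace = run_notebook(notebook, threshold=7)
print(namespace["kept"])
```

The returned `prepared` notebook still contains every cell plus the injected one, which mirrors why the output notebook is useful: you can open it and see exactly which parameter values that run used.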
Productivity gain instantly
It's fair to say that many of us depend heavily on Airflow: our DAGs are already built, and moving to Dagster is a big change.
But Dagster has a way around this: we can link Dagster with Airflow and create DAGs on the go.
The core of any deployment of Dagster is Dagit, a process that serves a user interface and responds to GraphQL queries.
Editor: To write, implement and test the logic
DAG view: Unify your view of pipelines
Console: To run Dagster locally
Dagster Python API
A core API built on top of Python.
Supports Spark with Python and Scala
Can write SQL queries
Can connect to Snowflake and other databases
Can create Airflow DAGs using the API
Can integrate with Jupyter notebooks
References, image courtesy and inspiration
Numerous talks and presentations by Nick Schrock, founder and CEO of Elementl
Want to learn more?
- Dagster: a data orchestrator for machine learning, analytics, and ETL
- Setup for the Tutorial | Dagster
- GitHub - dagster-io/dagster: A data orchestrator for machine learning, analytics, and ETL.
I hope this blog was helpful and that it helped you understand the basics of Dagster and why we need it.
BigData Engineer — Love for Bigdata, Analytics, Cloud and Infrastructure.