Let’s do it the DBT way!
Shift from ETL to ELT
Ever since the beginning of the Big Data boom, we have all been talking only about how we are going to Extract the data, Transform it and Load it.
But the problem here is that storage and compute are tightly coupled, and we pay for compute even when we do not need it.
Data warehousing is where we spend most of our dollars, to store and maintain the data.
But even after spending a huge chunk of money, are we getting a proportional amount of output?
With the help of Fivetran or Stitch we can solve the ingestion problem, which takes that whole piece away.
For data warehousing we could pick Redshift or BigQuery.
Now comes the main problem. How are we going to democratise the data?
How can we publish the data for analysts?
How can we automate the process with boilerplate code instead of analysts writing the same code over and over again?
How can we build the lineage or dependencies between the tables and queries?
Now you may say: use Python and Airflow to write the dependencies between your queries.
The pain point for the analysts here is that they would need to wait for the data engineers to write the code, build the dependencies, schedule the jobs and set up monitoring before the analysts even know the data exists.
Enter DBT (Data Build Tool), which is going to answer this particular problem in the 21st century.
The only requirement for you to start using DBT is to know SQL.
DBT gives more power to analysts so that they can work much like software engineers: using version control, testing and documentation.
What is dbt?
dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.
dbt does the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.
The role of dbt within a modern data stack is discussed in more detail here.
dbt also enables analysts to work more like software engineers, in line with the dbt Viewpoint.
Ref: https://docs.getdbt.com/docs/introduction
The Need for Transformation
- You need to remove duplicate data
- Write it in the format in which you want to read it for reporting
- Pre-aggregate the tables
- The simple answer would be that you do not trust the raw data, so you need to write your own transformations
Where does DBT fit?
DBT sits right on top of your data warehouse and pushes your queries down to it.
The core job is to convert your data from raw to transformed.
DBT Core vs DBT Cloud
DBT Core
- It’s an open-source data transformation tool
- You use the CLI as the interface
- It includes the core SQL compilation logic
- Jinja templating (https://jinja.palletsprojects.com/en/2.10.x/templates/)
DBT Cloud
- It’s fully managed
- Supports Logging and alerting
- You can orchestrate the jobs
- User authentication and SSO
Core Concepts
Everything is a select statement
Using DBT we write SQL statements and build the DAGs and dependencies.
In DBT terminology these are called models. We write models, schedule them and run them when needed.
Model
A model is a select statement. You write your SQL and give it a model name.
The name of the file is used as the model name.
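For example, a file called customers.sql (a made-up name, and the source table below is a placeholder too) becomes a model called customers:
select
    id,
    first_name,
    last_name
from testdb.raw_customers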
Materialisation
Materialisation determines how your models are stored in the warehouse.
There are four types of materialisations built within dbt. They are:
- table: the model is rebuilt as a table with a create table as statement every time you run dbt.
- view: similar to table, but here the model is saved as a view whenever you run your models.
- incremental: the model only processes the new data that has arrived since the last run and adds it to the existing table (see the sketch after this list).
- ephemeral: models are not directly built into the database. Instead, dbt will interpolate the code from this model into dependent models as a common table expression.
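A minimal sketch of an incremental model, assuming the source table has an event_time column to filter new rows on (the table and column names here are placeholders):
{{ config(materialized='incremental') }}

select *
from testdb.events
-- on incremental runs, only pick up rows newer than what is already in this model
{% if is_incremental() %}
where event_time > (select max(event_time) from {{ this }})
{% endif %}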
DBT Generates the DDL at run time
You write a dbt model:
{{ config(materialized='table') }}
select * from testdb.test_table where order_amount > 1000
It gets converted and run as:
create table new_db.new_table as (select * from testdb.test_table where order_amount > 1000);
Ref
Ref is mostly used to create dependencies between your models.
Models are basically queries. You might write 100 different models for your transformations. Using ref you can create the dependency and reference one model from another.
Example. A first model, model_a.sql:
select *
from public.raw_data
And a second model that references it:
select *
from {{ ref('model_a') }}
Ref takes care of 2 things:
Dependency: creates the dependency so that one model can reference another.
Parameterisation: interpolates the schema into your model file so that you can change your configuration for each deployment.
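At run time the ref is replaced with the fully qualified name of the referenced model in the current target. Assuming a target schema called analytics (a placeholder), the second model above compiles roughly to:
select *
from analytics.model_a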
DBT Lineage Graph
DBT creates the lineage and dependency graph, and it takes care of parallelisation wherever possible.
Source: https://docs.getdbt.com/docs/building-a-dbt-project/documentation
Easily build and test
DBT will also help you in testing your code and data.
You write your logic and rules in a YAML file. At run time this logic is compiled to SQL and validates your data.
You can write rules like:
- column values should be unique
- values should fall between x and y
- foreign key constraints
- etc.
Example YAML file:
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id
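Each rule above is compiled into a SQL query at run time and executed against your warehouse with:
dbt test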
Documentation
dbt provides a way to generate documentation for your dbt project and render it as a website. The documentation for your project includes:
- Information about your project: including model code, a DAG of your project, any tests you’ve added to a column, and more.
- Information about your data warehouse: including column data types, and table sizes. This information is generated by running queries against the information schema.
ref: https://docs.getdbt.com/docs/building-a-dbt-project/documentation
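The documentation site is generated and served locally with two commands:
dbt docs generate
dbt docs serve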
Reusability
In dbt, you can combine SQL with Jinja, a templating language.
Using Jinja turns your dbt project into a programming environment for SQL, giving you the ability to do things that aren’t normally possible in SQL. For example, with Jinja you can:
- Use control structures (e.g. if statements and for loops) in SQL
- Use environment variables in your dbt project for production deployments
- Change the way your project builds based on the current target.
ref: https://docs.getdbt.com/docs/building-a-dbt-project/jinja-macros
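As a small sketch of a for loop inside a model (the payment_methods list and the raw_payments table are made-up names), close to the example in the dbt docs:
{% set payment_methods = ['bank_transfer', 'credit_card', 'gift_card'] %}

select
    order_id,
    {% for payment_method in payment_methods %}
    sum(case when payment_method = '{{ payment_method }}' then amount end) as {{ payment_method }}_amount
    {% if not loop.last %},{% endif %}
    {% endfor %}
from raw_payments
group by 1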
DEMO TIME
Below are the commands to install it on macOS:
brew update
brew install git
brew tap dbt-labs/dbt
brew install dbt
You can use pip as well to install it:
python3 -m venv dbt-env # create the environment
source dbt-env/bin/activate # activate the environment
pip install dbt
Set up the project
dbt init test-dbt
models: where we define our SQL queries
macros: consider these as your user-defined functions
dbt_project.yml: here we define the project configuration and the name of the profile that holds the connection details
Under your home directory you will find a .dbt folder. Open the profiles.yml file:
/Users/ajith.shetty/.dbt/profiles.yml
Connection to Snowflake
I have created a free trial Snowflake account here for testing DBT.
https://www.snowflake.com/login/
Now that the account is set up, let's add a new Snowflake profile:
https://docs.getdbt.com/reference/warehouse-profiles/snowflake-profile
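A minimal sketch of what that profile could look like (the profile name snowflake matches what we will reference in dbt_project.yml below; all credential values are placeholders, see the link above for the full set of options):
snowflake:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <your_account_identifier>
      user: <your_username>
      password: <your_password>
      role: <your_role>
      database: <your_database>
      warehouse: <your_warehouse>
      schema: <your_schema>
      threads: 1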
Under your project directory, open dbt_project.yml and update:
name: which would be your project name
profile: matching the name you specified above; in our case it would be snowflake.
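The relevant lines in dbt_project.yml would then look something like this (the project name is written with underscores here as an assumption; the remaining keys generated by dbt init can stay as they are):
name: test_dbt
profile: snowflake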
Run the below command to check the configuration and connectivity:
dbt debug
Let’s execute our models.
There are 2 models under the models directory.
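As an illustration only (the actual model files live in the repo linked at the end), they could be as simple as:
-- models/my_first_dbt_model.sql, materialised as a table
{{ config(materialized='table') }}
select 1 as id

-- models/my_second_dbt_model.sql, materialised as a view
{{ config(materialized='view') }}
select * from {{ ref('my_first_dbt_model') }}
where id = 1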
dbt run
Now that the run is successful, let's validate the data in Snowflake.
Model 1 is a table
Model 2 is a view
We can see a record for both.
So we have done these 3 steps in no time.
- Initialise a project
- Create a profile
- Run our models
The above code can be found on my GitHub:
Ajith Shetty
Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.