Let’s do it DBT way!

source: https://github.com/dbt-labs/dbt

Shift from ETL to ELT

Ever since the beginning of Bigdata bubble, we have all been only talking about how we are going to Extract the Data Transform the Data and Load the Data.

But the problem here is the data and compute are tightly coupled. And we pay for when we do not need as well.

Data warehousing is something where we spend most of our Dollar to store and maintain.

But even after spending a huge chunk of money are we able to get the same amount output.

With the help of Fivetran or Stich we could answer the Ingestion problems which takes away all the problem.

For Data warehousing we could take Redshift or Bigquery.

Now comes the main problem. How are we going to Democratise the data?

How can we publish the data for analysts?

How we can we automate the process with boiler plate code instead of Analysts writing the same code over and over again

How can we build the lineage or dependency between the tables and queries.

Now you may say, use Python and Airflow for writing the dependency of your queries.

The pain out for the Analysts here is they would need to wait for the Data engineers to write the code, build the dependency, schedule the jobs and monitor before Analysts could know about the existence of the data.

Enters DBT(Data Build Tool) which is going to answer this perticular problem in the 21st century.

The only requirement for you to start using DBT is to know SQL.

Well DBT gives more power to Analysts so that they can work just similar to Software Engineers like using version control, testing and documenttions.

What is dbt?

dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.

dbt does the T in ELT (Extract, Load, Transform) processes – it doesn’t extract or load data, but it’s extremely good at transforming data that’s already loaded into your warehouse.

The role of dbt within a modern data stack is discussed in more detail here.

dbt also enables analysts to work more like software engineers, in line with the dbt Viewpoint.

Ref: https://docs.getdbt.com/docs/introduction

Need of Transformation

  1. You need to remove the duplicate the data
  2. Write it in a format, the way you want to read for reporting
  3. Pre-Aggregate the tables
  4. Simple answer would be you do not trust the data, so you need to write your own transformation

Where does DBT fit?

DBT sits right on top of your data warehousing and pushes your query to your warehouse.

The core job is to convert your data from raw to transformed.

DBT Core Vs DBT Cloud

DBT core

  1. Its an open source data transformation tool
  2. You can have CLI as an interface
  3. It includes core SQL compilation logic
  4. Jinja templating(https://jinja.palletsprojects.com/en/2.10.x/templates/)

DBT cloud

  1. Its fully managed
  2. Supports Logging and alerting
  3. You can orchestrate the jobs
  4. User authentication and SSO

Core Concepts

Everything is a Select statements

Using DBT we write SQL statements and and build the dags and dependencies.

In DBT language we call it as models. we write models and schedules them and run them when needed.

Model

Model is a select statement. You create your sql and give it a model name.

The name of the file is used as the model name

Materialisation

Materialisation basically used to store your models in the warehouse.

There are four types of materialisations built within dbt. They are:

  • table: You create tables everytime you run your dbt model as create table statement.
  • view: Similar to table, here the model will be saved as views whenever you run your models.
  • incremental: models do support incremental data since we have run it last time.
  • ephemeral: models are not directly built into the database. Instead, dbt will interpolate the code from this model into dependent models as a common table expression.

DBT Generates the DDL at run time

You write a dbt model

{{ config (materialized=table) }}

Select * from testdb.test_table where order > 1000

It gets converted and runs as

create table new_db.new_table as ( select * from testdb.test_table where order > 1000);

Ref

Ref is mostly used to create the dependency between your models.

Models are basically queries. You write 100 different models for your transformation. Using ref you can create the dependency and refer 1 model from another.

Example:

select *from public.raw_dataselect *from {{ref('model_a')}}

Ref takes care of 2 things:

Dependency: Creates the dependency and refer 1 model from another

Parameterisation: Interpolating the schema into your model file so that you can change your configuration for every deployments.

DBT Lineage Graph

creates the lineage and dependency plus it takes care of the parallelisation wherever possible

Source:https://docs.getdbt.com/docs/building-a-dbt-project/documentation#

Easily build and test

DBT will also help you in testing your code and data.

You write your logics and rules in yml file. And at run time these logics will be compiled to SQL and shall validates your data.

You can write a rules like:

  1. column values should be unique
  2. should contain between x and y
  3. Foreign key constraints
  4. etc.

Example YAML file:

version: 2models:- name: orderscolumns:- name: order_idtests:- unique- not_null- name: statustests:- accepted_values:values: ['placed', 'shipped', 'completed', 'returned']- name: customer_idtests:- relationships:to: ref('customers')field: id

Documentation

dbt provides a way to generate documentation for your dbt project and render it as a website. The documentation for your project includes:

  • Information about your project: including model code, a DAG of your project, any tests you’ve added to a column, and more.
  • Information about your data warehouse: including column data types, and table sizes. This information is generated by running queries against the information schema.

ref: https://docs.getdbt.com/docs/building-a-dbt-project/documentation

Reusability

In dbt, you can combine SQL with Jinja, a templating language.

Using Jinja turns your dbt project into a programming environment for SQL, giving you the ability to do things that aren’t normally possible in SQL. For example, with Jinja you can:

  • Use control structures (e.g. if statements and for loops) in SQL
  • Use environment variables in your dbt project for production deployments
  • Change the way your project builds based on the current target.

ref:https://docs.getdbt.com/docs/building-a-dbt-project/jinja-macros

DEMO TIME

Below command to install in MAC OS

brew update

brew install git

brew tap dbt-labs/dbt

brew install dbt

You can use pip as well to install

python3 -m venv dbt-env             # create the environmentsource dbt-env/bin/activate         # activate the environmentpip install dbt

Setup the project

dbt init test-dbt

models: where we define our SQL queries

macros: Consider this as your User Defined Functions

dbt_project.yml: Here we need to define the profile or the connection details

Under your home directory you will find a .dbt folder. Open profiles.yml file

/Users/ajith.shetty/.dbt/profiles.yml

Connection to Snowflake

I have created a trial snowflake account for free in here for testing DBT.

https://www.snowflake.com/login/

Now once the account is setup let’s add a new snowflake profile:

https://docs.getdbt.com/reference/warehouse-profiles/snowflake-profile

Under your project directory open dbt_project.yml

And update

name: which would be your project name

profile: matching with the name you have specified in the above, in our case it would be snowflake.

Run below command to check the configuration and connectivity

dbt debug

Let’s execute our models.

There are 2 models under models directory.

dbt run

Now the run is successful let’s validate the data in snowflake.

Model 1 is a table

Model 2 is a view

We can see a record for both.

So we have done these 3 steps in no time.

  1. Initialise a project
  2. create a profile
  3. Run our models

Above code can be found in my github:

Reference:

Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.

Subscribe✉️ ||More blogs📝||Linked In📊||Profile Page📚||Git Repo👓

Interested in getting the weekly updates on the big data analytics around the world, do subscribe to my: Weekly Newsletter Just Enough Data

--

--

Ajith Shetty

Ajith Shetty

Bigdata Engineer — Love for BigData, Analytics, Cloud and Infrastructure. Want to talk more? Ping me in Linked In: https://www.linkedin.com/in/ajshetty28/