SODA SQL, much needed for your DATA

source: https://github.com/sodadata/soda-sql

The years 2020 and 2021 are especially known for the demand for Data Engineering.

A big boom in the industry: almost every company has started valuing its data and getting the most out of it.

Having the data and getting predictions out of it with ML is not enough by itself.

The first thing we need to consider is how accurate your data is.

The prediction model depends on the data you feed it.

So the focus of data-driven companies has moved from predicting the future to cleaning the data.

Data testing, monitoring and profiling were so underrated, but luckily these are the talking points in 2021 and beyond.

But the question here is how well we know the data and how sure we are about its authenticity.

There is no way we can say the data we feed is always correct.

A single extra character can mess up your whole data pipeline.

We need an automated way to test, monitor and profile the data on a regular basis.

Enter, SODA SQL.

Introduction

SODA SQL is an open-source tool which takes input from the user about the data source they want to connect to, and prepares the SQL to run against the data to find anomalies.

You can define conditions to test any column, such as:

  1. Count
  2. Average
  3. Minimum length
  4. Valid format for text, numbers, dates or even IP addresses
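Conditions like these go into a scan YAML file. A minimal sketch, assuming a hypothetical student table (the table, column names and thresholds below are made-up illustrations, not from the original post):

```yaml
# tables/student.yml — hypothetical scan file.
table_name: student
metrics:
  - row_count
tests:
  - row_count > 0              # count: the table must not be empty
columns:
  name:
    metrics:
      - min_length
    tests:
      - min_length > 1         # minimum length of a text column
  email:
    valid_format: email        # built-in format check for TEXT columns
    tests:
      - invalid_percentage == 0
  marks:
    tests:
      - avg > 35               # average of a numeric column
```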

Using SODA SQL you can test, monitor and profile your data.

You can integrate SODA SQL with Airflow to schedule scans and run them at intervals.
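As a sketch of such scheduling, a minimal Airflow DAG could shell out to the soda CLI once a day. The DAG id, paths and schedule here are assumptions for illustration, not from the original post:

```python
# Hypothetical Airflow DAG that runs a SODA SQL scan daily.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="soda_sql_daily_scan",     # illustrative name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run the soda CLI exactly as you would by hand;
    # /opt/soda_sql_demo is an assumed project directory.
    scan_students = BashOperator(
        task_id="scan_students",
        bash_command="cd /opt/soda_sql_demo && soda scan warehouse.yml tables/student.yml",
    )
```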

Nuances in SODA SQL

SODA SQL has 4 main components:

  1. Sources
  2. Warehouse YML
  3. Scans
  4. Metrics

Sources

Currently SODA SQL connects to sources such as PostgreSQL, Redshift, Snowflake, BigQuery, Athena and Spark SQL (see the GitHub repo for the full, current list).

Warehouse YML

In your warehouse yml file you define the source you want to connect to.

For each source there will be one YML file, but you can use a single YML file for different schemas under the same source.
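A warehouse YML for a local Postgres source looks roughly like this. The warehouse name and database match the demo later in this post; the host, port and schema are assumptions:

```yaml
# warehouse.yml — connection details for one source.
name: soda_sql_demo
connection:
  type: postgres
  host: localhost
  port: '5432'
  username: env_var(POSTGRES_USERNAME)   # resolved from ~/.soda/env_vars.yml
  password: env_var(POSTGRES_PASSWORD)
  database: sodasql
  schema: public
```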

$ soda create studentdatasource -d sodasql -u sodasql -w soda_sql_demo 
  • -d name of the data source
  • -u the username to log in to the data source
  • -w the name of the warehouse directory

You need to define the credentials in ~/.soda/env_vars.yml:

soda_sql_demo:
  POSTGRES_USERNAME: testuser
  POSTGRES_PASSWORD: dummypassword

Scans

Scan is a command that executes the tests and extracts information about the sources you connect to.

To start with you will define a scan YAML file.

For every data source you connect to, you need to define a YAML file.

Let’s say you have a data source Redshift for which you have created a YAML file warehouse.yml.

When you run soda analyze in that folder, it forms a SQL command based on the YML file and prepares an input for the scan, which is again a YML file.

Metrics

A metric is basically a property of your data source, whereas a measurement is the value of that property.

You would use the metrics to define your tests.

Example: row_count > 0

Here row_count is the metric you want to test, and 0 is the value against which its measurement is validated.

There are different types of metrics:

  1. Dataset metrics: e.g. row_count, which executes against the whole dataset
  2. Column metrics: where you define the column against which you want to run a test, e.g. avg, distinct, etc.
  3. Valid-format metrics: valid formats apply only to columns of data type TEXT
  4. Custom metrics: where you can write your own SQL to test.
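As a sketch of a custom metric, you write your own SQL inside the scan YAML and the selected alias becomes the metric name. The table, column and query below are made-up illustrations:

```yaml
# Custom SQL metric inside a scan YAML file (names are illustrative).
table_name: student
sql_metrics:
  - sql: |
      SELECT sum(marks) as total_marks
      FROM student
      WHERE passed = true
    tests:
      - total_marks > 0   # the alias total_marks is the metric tested here
```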


SODA SQL Execution Path

The user triggers a scan command:

soda scan my_warehouse_source.yml tables/student.yml
  1. my_warehouse_source.yml has the details of the source you connect to, like Redshift or Postgres.
  2. student.yml contains the test conditions with metric details.
  3. SODA SQL prepares a SQL query based on the above 2 files.
  4. It scans the source based on my_warehouse_source.yml, tests the conditions defined in student.yml, and collects the metrics.
  5. It displays the output.

DEMO

  1. Install Postgres (locally)
  2. Set up a user, access and a dummy table with data
  3. Install sodasql and the Postgres connector
  4. Set up the required files and connections
  5. Analyze the data

https://www.postgresql.org/download/macosx/

psql postgres
sudo /usr/local/Cellar/python@3.9/3.9.7_1/bin/python3 -m pip install soda-sql-postgresql
soda create postgres -d sodasql -u sodasql -w soda_sql_demo
/Users/ajith.shetty/.soda/env_vars.yml
warehouse.yml
soda analyze

The analyze command will create template YML files where you would add your metrics and test details.

Copy the world.yml and create your own configuration, e.g. demo.yml.

soda scan warehouse.yml tables/demo.yml

Here in the output we can see the measurements on all the columns like min, max, avg, etc.

And we defined a test that the row count should be greater than 0, which succeeded since the row count is 10.


Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.


Interested in getting a weekly newsletter on big data analytics around the world? Subscribe to my weekly newsletter: Just Enough Data.

Want to talk more? Ping me on LinkedIn: https://www.linkedin.com/in/ajshetty28/