Expecting Great from Great Expectations

source:https://greatexpectations.io/

Expectation is a very strong word. In the world of data, expectations grow exponentially with your data size.

But how easy or difficult is it to meet these expectations?

Data engineers spend most of their time creating pipelines, managing them, and delivering data to the right set of people.

But can we guarantee that the data is consistent and there are no outliers?


Consider a job that runs every hour to read the temperature of a container in transit. Suppose the container went offline for a period of time and gave no output for that window, or, let's say, reported an outlier temperature due to a bug.

At this moment, can we still say that the pipeline we created is stable and tested?

We have tested our code and we have tested the pipeline, but live data cannot be predicted, and there could be N different kinds of outliers.

This is the moment where Great Expectations arrives.

Since this is a pretty long blog, here is the agenda.

History

Superconductive is the company behind Great Expectations.

It was publicly launched in 2018, and it uses popular open-source libraries for testing data pipelines.

Great Expectations

The point to highlight is Expectation.

Great Expectations is an assertion framework for your data. We define the expectations, or rules, against which we want our data to be tested.

Great Expectations is declarative and expressive, which makes it easy to define expectations and validate data against them.

We can define an expectation like the one below.
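The original code screenshot is missing here, so as an illustration, here is a minimal stdlib-only sketch of the idea (this is not the real Great Expectations API; the column name and temperature bounds are made up for this post's container example):

```python
def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Assert that every value in `column` lies within [min_value, max_value]."""
    unexpected = [r[column] for r in rows
                  if not (min_value <= r[column] <= max_value)]
    return {
        "success": not unexpected,      # the expectation holds only if nothing fell outside
        "unexpected_list": unexpected,  # the outliers, for inspection
    }

# Hourly container readings; 93.0 is the buggy outlier from the scenario above.
readings = [{"temperature": 4.0}, {"temperature": 7.5}, {"temperature": 93.0}]
result = expect_column_values_to_be_between(readings, "temperature", -20, 60)
print(result["success"])          # False
print(result["unexpected_list"])  # [93.0]
```

The real library exposes the same idea declaratively, as named expectation types with keyword parameters.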

Validation can be run against a variety of data sources, such as pandas, Spark, or relational and cloud databases like Redshift, via SQLAlchemy.

Pandas — Great for in-memory machine learning pipelines!

Spark — Good for really big data.

Postgres — Leading open source database

BigQuery — Google serverless massive-scale SQL analytics platform

Databricks — Managed Spark Analytics Platform

MySQL — Leading open source database

AWS Redshift — Cloud-based data warehouse

AWS S3 — Cloud-based blob storage

Snowflake — Cloud-based data warehouse

Apache Airflow — An open source orchestration engine

Other SQL Relational DBs — Most RDBMS are supported via SQLAlchemy

Jupyter Notebooks — The best way to build Expectations

Slack — Get automatic data quality notifications!

An expectation in general could be anything, but with respect to data:

An Expectation is an assertion about the data, against which we can produce a validation.

Validation gives us profiling, which describes what our data looks like.

With the help of Data Docs, it becomes easier to share information about the data with the team.

A Data Context puts all of these things together.

Expectations

An Expectation is basically what we expect from the data.

For example, we might expect certain columns to be present, or their values to fall within a given range.

Using an expectation configuration, we parameterise values such as the columns to be validated and the values to compare against.

We then group different expectations and their configurations together under a user-defined name. This is called an Expectation Suite.
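Conceptually, a suite is just a named collection of parameterised expectations. A plain-dict sketch (not the actual Great Expectations API; the suite name and column values are invented for illustration):

```python
# Each expectation carries a type plus a parameterised configuration (kwargs).
suite = {
    "expectation_suite_name": "container_temperature.warning",  # user-defined name
    "expectations": [
        {"expectation_type": "expect_column_to_exist",
         "kwargs": {"column": "temperature"}},
        {"expectation_type": "expect_column_values_to_be_between",
         "kwargs": {"column": "temperature", "min_value": -20, "max_value": 60}},
    ],
}
print(len(suite["expectations"]))  # 2
```

In the real library, suites are stored as JSON by the Data Context so they can be versioned and shared.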

Data Source

A Datasource is basically a connection to the environment from which the data will be fetched.

It could be Spark, SQL, an OLTP system, or cloud sources like Redshift.

Validation

We need data assets to run our expectations against.

We run the Expectation Suite, which contains the expectations and their configurations, against the data assets, and this gives us the validation results.
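The validation loop can be sketched in plain Python (a conceptual model only, not the library's API; the rule names, callables, and result shape are simplified stand-ins, though the `statistics` keys mirror the output shown later in this post):

```python
# Run each expectation in a suite against a data asset and collect
# per-expectation results plus summary statistics.
def validate(rows, suite):
    results = []
    for exp in suite:
        column = exp["column"]
        check = exp["check"]  # a callable implementing the rule
        success = all(check(r.get(column)) for r in rows)
        results.append({"expectation": exp["name"], "success": success})
    evaluated = len(results)
    successful = sum(r["success"] for r in results)
    return {
        "results": results,
        "statistics": {
            "evaluated_expectations": evaluated,
            "successful_expectations": successful,
            "success_percent": 100.0 * successful / evaluated,
        },
    }

rows = [{"temperature": 4.0}, {"temperature": 93.0}]
suite = [
    {"name": "temperature exists", "column": "temperature",
     "check": lambda v: v is not None},
    {"name": "temperature in range", "column": "temperature",
     "check": lambda v: v is not None and -20 <= v <= 60},
]
report = validate(rows, suite)
print(report["statistics"]["success_percent"])  # 50.0
```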

Profiling

An Expectation Suite is simply a set of Expectations. You can create Expectation Suites by writing out individual statements, such as the one above, or by automatically generating them based on profiler results.
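To make "generating expectations from profiler results" concrete, here is a toy sketch of the idea (again stdlib-only, not the real profiler; the generated bounds simply echo the observed min/max of a sample):

```python
# Conceptual profiling: inspect a sample of the data and auto-generate
# candidate expectations from what is observed.
def profile(rows, column):
    values = [r[column] for r in rows if r.get(column) is not None]
    return [
        {"expectation_type": "expect_column_to_exist",
         "kwargs": {"column": column}},
        {"expectation_type": "expect_column_values_to_be_between",
         "kwargs": {"column": column,
                    "min_value": min(values), "max_value": max(values)}},
    ]

sample = [{"temperature": 4.0}, {"temperature": 7.5}, {"temperature": 12.0}]
generated = profile(sample, "temperature")
print(generated[1]["kwargs"])
```

Auto-generated expectations like these are a starting point; you would normally review and tighten them by hand.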

Data Docs

Great Expectations renders Expectations to clean, human-readable documentation called Data Docs. These HTML docs contain both your Expectation Suites as well as your data validation results each time validation is run — think of it as a continuously updated data quality report.

Data Context

The Data Context manages your project configuration in a YAML file, which we can share with the team.

It stores the database connection info, plugin information, etc.

Validation Actions

Actions are Python classes with a run method that takes the result of validating a Batch against an Expectation Suite and does something with it (e.g., save validation results to disk, or send a Slack notification).

Using Validation Actions, we can trigger an email, update the Data Docs, and so on.
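In the spirit of the description above (a Python class with a `run` method that receives a validation result), here is a hedged sketch; the class name, result shape, and the printed alert are all illustrative, not the library's own:

```python
# A toy Validation Action: inspect a validation result and react to it.
# A real action might post to Slack or send an email instead of printing.
class AlertOnFailureAction:
    def run(self, validation_result):
        if not validation_result["success"]:
            message = f"Validation failed: {validation_result['name']}"
            print(message)
            return {"notified": True, "message": message}
        return {"notified": False}

action = AlertOnFailureAction()
outcome = action.run({"name": "temperature in range", "success": False})
print(outcome["notified"])  # True
```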

How does it work

source:https://greatexpectations.io/

Demo Time

Local CLI Installation

pip install great_expectations

great_expectations init

What data would you like Great Expectations to connect to?
1. Files on a filesystem (for processing with Pandas or Spark)
2. Relational database (SQL)
: 1

What are you processing your files with?
1. Pandas
2. PySpark
: 2
Great Expectations relies on the library `pyspark` to connect to your data, but the package `pyspark` containing this library is not installed.
Would you like Great Expectations to try to execute `pip install pyspark` for you? [Y/n]: Y
pip install pyspark [ — — — — — — — — — — — — — — — — — — ] 0%

Enter the path (relative or absolute) of the root directory where the data files are stored.
: /Users/ajith.shetty/data

Give your new Datasource a short name.
[data__dir]: spark-data-source

Now that we have created the data source we want to connect to and validate, let's create a suite to bind the expectations and the data source together.

great_expectations suite new

This will open up a Jupyter notebook.

Define a rule you want to test your data with.

Let's define an expectation saying we need a certain column to be present in the data we are reading. In our example, we expect a column named Road Number.
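The rule itself (a column must be present) is simple enough to sketch in plain Python; this mirrors the idea behind the library's column-existence expectation, using the Road Number column from the example and an invented absent column to show a failure:

```python
# "Column must exist" as a plain assertion over row dicts.
def expect_column_to_exist(rows, column):
    return {"success": all(column in r for r in rows)}

rows = [{"Road Number": 12, "temperature": 4.0},
        {"Road Number": 12, "temperature": 7.5}]
print(expect_column_to_exist(rows, "Road Number")["success"])  # True
print(expect_column_to_exist(rows, "Speed")["success"])        # False (hypothetical column)
```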

Run the Command.

This will open a UI which will give us the result and its status.

Now in case you want to update or add a new expectation, click on How to Edit This Suite

Copy the command and run in your command prompt.

This will again open a jupyter notebook and we shall add another rule.

Once you run the command, it will validate all the expectations you have specified and show the output.

In our case 1 expectation is successful and the other one failed.

Spark Databricks

Install the library.

Create the data source.

Create the context to bind your data with the expectations.

Let’s run the expectations.

Create a new Expectation Suite by profiling another data source.

Define the suite name and store the suite.

Now let's run the pre-created suite on the same data.

Below is the output:

The job ran for all the columns we have defined in the suite with the given rule.

In the bottom you can see:

"statistics": {
    "evaluated_expectations": 3,
    "successful_expectations": 1,
    "unsuccessful_expectations": 2,
    "success_percent": 33.33333333333333
}

Out of 3 column validations, 2 failed and 1 succeeded, so the success percentage is 33.33.

Full output:

TLDR


You may find the above code in my repo:

Ajith Shetty

Bigdata Engineer — Love for Bigdata, Analytics, Cloud and Infrastructure.


Interested in getting weekly updates on big data analytics around the world? Subscribe to my weekly newsletter: Just Enough Data.

Want to talk more? Ping me on LinkedIn: https://www.linkedin.com/in/ajshetty28/