Expecting Great from Great Expectations
Expectation is a strong word. In the world of data, expectations grow with the size of your data.
But how easy or difficult is it to meet these expectations?
Data engineers spend most of their time creating pipelines, managing them, and delivering data to the right set of people.
But can we guarantee that the data is consistent and that there are no outliers?
Consider a job that runs every hour to read the temperature of a container in transit. Suppose the container goes offline for a period of time and does not report back for that interval, or, let's say, it reports an outlier temperature due to a bug.
At that moment, can we still say that the pipeline we created is stable and tested?
We have tested our code and we have tested the pipeline, but live data cannot be predicted, and there can be countless kinds of outliers.
This is where Great Expectations arrives.
Since it's a pretty long blog, here is the agenda.
- History of Great Expectations
- Explanation and details
- Internals
- Local CLI installation
- Example: Databricks
- Example: Databricks using a pre-created suite
History
Superconductive is the company behind Great Expectations.
It was publicly launched in 2018 and builds on popular open source libraries to test your data pipelines.
Great Expectations
The point to highlight is the word Expectation.
An Expectation is an assertion about your data. We define the expectations, or rules, against which we want our data to be tested.
Great Expectations is declarative and expressive, which makes it easy to define expectations and validate data against them.
We can define expectations like the ones below; a short sketch follows the list.
- expect_column_to_exist("column1")
- expect_table_row_count_to_be_between(1, 1000)
- expect_column_values_to_be_unique("column1")
- expect_column_values_to_not_be_null("column1")
- Many more
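To make this concrete, here is a minimal sketch using the classic pandas interface of Great Expectations; the file name and column names are placeholders, not data from this post.

import great_expectations as ge

# read_csv wraps pandas.read_csv and returns a dataset that understands expectations
df = ge.read_csv("containers.csv")

df.expect_column_to_exist("column1")
df.expect_table_row_count_to_be_between(1, 1000)
df.expect_column_values_to_be_unique("column1")
df.expect_column_values_to_not_be_null("column1")

# validate() re-runs every expectation declared above and reports the results
results = df.validate()
print(results)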
Validations can be run against a variety of data sources: pandas, Spark, relational/OLTP databases, and cloud warehouses such as Redshift via SQLAlchemy.
Pandas — Great for in-memory machine learning pipelines!
Spark — Good for really big data.
Postgres — Leading open source database
BigQuery — Google serverless massive-scale SQL analytics platform
Databricks — Managed Spark Analytics Platform
MySQL — Leading open source database
AWS Redshift — Cloud-based data warehouse
AWS S3 — Cloud based blob storage
Snowflake — Cloud-based data warehouse
Apache Airflow — An open source orchestration engine
Other SQL Relational DBs — Most RDBMS are supported via SQLAlchemy
Jupyter Notebooks — The best way to build Expectations
Slack — Get automatic data quality notifications!
An expectation in general could be anything, but with respect to data:
An Expectation is an assertion about the data, which we can use to produce a validation.
This gives us profiling, which tells us what our data looks like.
With the help of Data Docs, it becomes easier to share information about the data with the team.
The Data Context puts all of these pieces together.
Expectations
Expectations describe what we expect from the data.
For example, we expect our columns
- not to be empty
- to be unique
- to be between x and y
- to match a regex
- and many more
Using an expectation configuration we parameterise values such as the columns to be validated and the values to compare against.
We then group different expectations and their configurations together and give them a user-defined name. This is called an Expectation Suite.
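As a rough sketch continuing the pandas example above, the declared expectations can be pulled out as configurations and grouped under a user-defined suite name (the suite name here is just an illustration):

# Collect the declared expectations into a named suite (classic API).
suite = df.get_expectation_suite(discard_failed_expectations=False)
suite.expectation_suite_name = "container_temperature_suite"

# Each entry is an expectation type plus its parameterised values (columns, bounds, ...).
for config in suite.expectations:
    print(config.expectation_type, config.kwargs)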
Data Source
A Datasource is a connection to the environment from which the data will be fetched.
It could be Spark, a SQL/OLTP database, or a cloud source like Redshift.
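Datasources can also be registered on the Data Context (described below) from code. The sketch below uses the classic (V2) API; the datasource names, class names, and connection string are assumptions for illustration, and the exact keyword arguments depend on your Great Expectations version.

from great_expectations.data_context import DataContext

context = DataContext()  # loads the project's great_expectations.yml

# A Spark-backed datasource and a SQLAlchemy-backed one (placeholder credentials).
context.add_datasource("spark-data-source", class_name="SparkDFDatasource")
context.add_datasource(
    "warehouse",
    class_name="SqlAlchemyDatasource",
    credentials={"url": "postgresql://user:password@host:5432/analytics"},
)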
Validation
We need data assets to run our expectations against.
We use an Expectation Suite, which contains the expectations and their configurations, to run the validation on top of the data assets; this gives us the results.
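Continuing the earlier sketch, a validation is simply the dataset run against a suite; depending on the Great Expectations version the result behaves like an object or a plain dict.

# Validate the data asset against the suite built earlier.
results = df.validate(expectation_suite=suite)
print(results.success)  # overall pass/fail for the whole suite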
Profiling
An Expectation Suite is simply a set of Expectations. You can create Expectation Suites by writing out individual statements, such as the one above, or by automatically generating them based on profiler results.
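As a hedged sketch of the second route, the built-in basic profiler can generate a starter suite from the data itself; the import path below is from the classic API and may differ between versions.

from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# Profiling returns an auto-generated suite plus the profiling results themselves.
generated_suite, profiling_results = BasicDatasetProfiler.profile(df)
print(len(generated_suite.expectations), "expectations generated from the profile")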
Data Docs
Great Expectations renders Expectations to clean, human-readable documentation called Data Docs. These HTML docs contain both your Expectation Suites as well as your data validation results each time validation is run — think of it as a continuously updated data quality report.
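With a configured Data Context (next section), the docs can be rebuilt and opened straight from code, roughly like this:

# Render the suites and latest validation results to HTML, then open the site.
context.build_data_docs()
context.open_data_docs()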
Data Context
The Data Context manages your project configuration in a YAML file, which we can share with the team.
It stores database connection info, plugin information, and so on.
Validation Actions
Actions are Python classes with a run method that takes the result of validating a Batch against an Expectation Suite and does something with it (e.g., save validation results to disk, or send a Slack notification).
Using a Validation Action we can trigger an email, update the Data Docs, and so on.
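As a purely conceptual sketch (this is not Great Expectations' actual base class, just the shape of the idea), an action boils down to a class whose run method reacts to a validation result:

# Conceptual only: an action receives a validation result and does something with it.
class EmailOnFailureAction:
    def __init__(self, send_email):
        self.send_email = send_email  # any callable that can deliver a message

    def run(self, validation_result):
        if not validation_result["success"]:
            self.send_email("Data validation failed; check the Data Docs for details.")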
How does it work
Demo Time
Local CLI Installation
pip install great_expectations
great_expectations init
What data would you like Great Expectations to connect to?
1. Files on a filesystem (for processing with Pandas or Spark)
2. Relational database (SQL)
: 1
What are you processing your files with?
1. Pandas
2. PySpark
: 2
Great Expectations relies on the library `pyspark` to connect to your data, but the package `pyspark` containing this library is not installed.
Would you like Great Expectations to try to execute `pip install pyspark` for you? [Y/n]: Y
pip install pyspark [------------------------------------] 0%
Enter the path (relative or absolute) of the root directory where the data files are stored.
: /Users/ajith.shetty/data
Give your new Datasource a short name.
[data__dir]: spark-data-source
Now that we have created the data source we will connect to for running our Great Expectations validations, let's create a suite to bind the expectations and the data source together.
great_expectations suite new
This will open up a Jupyter notebook.
Define a rule you want to test your data with.
Let's define an expectation saying that a particular column must be present in the data we are reading. In our example we expect to have a column named Road Number.
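In the scaffolded notebook this is roughly a one-liner; the batch variable name and the save call below follow the classic notebook template and may differ slightly in your version.

# The notebook exposes the loaded data as a batch; add the expectation to it.
batch.expect_column_to_exist("Road Number")

# Persist the suite so the CLI can validate it and render it in Data Docs.
batch.save_expectation_suite(discard_failed_expectations=False)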
Run the cells in the notebook.
This will open the Data Docs UI, which shows each result and its status.
Now, in case you want to update or add a new expectation, click on How to Edit This Suite.
Copy the command and run it in your command prompt.
This will again open a Jupyter notebook, where we can add another rule.
Once you run it, all the expectations you have specified are validated and the output is shown.
In our case, one expectation succeeded and the other failed.
Spark Databricks
Install the library.
Create the data source.
Create the context to bind your data with the expectations.
Let’s run the expectations.
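Putting those four steps together, a minimal sketch with the classic SparkDFDataset wrapper could look like the following; the DBFS path and column names are placeholders for whatever data you actually mount.

# On Databricks, install the library on the cluster first (e.g. %pip install great_expectations).
from great_expectations.dataset import SparkDFDataset

df = spark.read.csv("/mnt/data/containers.csv", header=True, inferSchema=True)
gdf = SparkDFDataset(df)  # wrap the Spark DataFrame so expectations can run on it

gdf.expect_column_to_exist("Road Number")
gdf.expect_column_values_to_not_be_null("Road Number")

# Run everything declared above and print the validation result.
results = gdf.validate()
print(results)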
Create a new Expectation Suite by profiling from another Data source
1. Let's read the data and create a datasource.
2. Define the suite name and store the suite.
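A hedged sketch of step 2: give the suite a name and store it so it can be reused later. The suite here comes from the expectations declared on the wrapped DataFrame; the name and DBFS path are placeholders.

import json

suite = gdf.get_expectation_suite(discard_failed_expectations=False)
suite.expectation_suite_name = "container_suite"

# Store the suite as JSON so other jobs can load and reuse it.
with open("/dbfs/great_expectations/container_suite.json", "w") as f:
    json.dump(suite.to_json_dict(), f, indent=2)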
Now let's run the validation on the same data, but using the pre-created suite.
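Roughly, that is just a validate call with the stored suite passed in (continuing the sketch above):

# Validate the same Spark data against the pre-created suite.
results = SparkDFDataset(df).validate(expectation_suite=suite)
print(results.statistics)  # evaluated / successful / unsuccessful expectation counts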
Below is the output:
The job ran for all the columns we have defined in the suite with the given rule.
In the bottom you can see:
"statistics": {
    "evaluated_expectations": 3,
    "successful_expectations": 1,
    "unsuccessful_expectations": 2,
    "success_percent": 33.33333333333333
}
Out of 3 column validations, 2 failed and 1 succeeded, so the success percent is 33.33.
Full output:
TLDR
References:
You may find the above Code in my repo:
Ajith Shetty
Bigdata Engineer — Love for Bigdata, Analytics, Cloud and Infrastructure.
Subscribe✉️ ||More blogs📝||Linked In📊||Profile Page📚||Git Repo👓
Interested in getting the weekly updates on the big data analytics around the world, do subscribe to my: Weekly Newsletter Just Enough Data