Expecting Great from Great Expectations
Expectation is a strong word. In the world of data, expectations grow with the size of your data.
But how easy or difficult is it to meet these expectations?
Data engineers spend most of their time creating pipelines, managing them, and delivering data to the right set of people.
But can we guarantee that the data is consistent and that there are no outliers?
Consider a job that runs every hour to read the temperature of a container in transit. Suppose the container goes offline for a period of time and does not report back for that interval, or, let's say, it reports an outlier temperature due to a bug.
At that moment, can we still say that the pipeline we created is stable and tested?
We have tested our code and we have tested the pipeline, but live data cannot be predicted, and there can be countless kinds of outliers.
This is where Great Expectations arrives.
Since it's a pretty long blog, here is the agenda.
- History of Great Expectations
- Explanation and details
- Internals
- Local CLI installation
- Example: Databricks
- Example: Databricks using a pre-created suite
History
Superconductive is the company behind Great Expectations.
It was publicly launched in 2018 and builds on popular open source libraries to test your data pipelines.
Great Expectations
The point to highlight is the word Expectation.
An Expectation is an assertion about your data. We define the expectations, or rules, against which we want our data to be tested.
Great Expectations is declarative and expressive, which makes it easy to define expectations and validate data against them.
We can define expectations like the ones below; a short sketch follows the list.
- expect_column_to_exist("column1")
- expect_table_row_count_to_be_between(1, 1000)
- expect_column_values_to_be_unique("column1")
- expect_column_values_to_not_be_null("column1")
- Many more
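To make this concrete, here is a minimal sketch using the classic pandas interface of Great Expectations; the file name and column names are placeholders, not data from this post.

import great_expectations as ge

# read_csv wraps pandas.read_csv and returns a dataset that understands expectations
df = ge.read_csv("containers.csv")

df.expect_column_to_exist("column1")
df.expect_table_row_count_to_be_between(1, 1000)
df.expect_column_values_to_be_unique("column1")
df.expect_column_values_to_not_be_null("column1")

# validate() re-runs every expectation declared above and reports the results
results = df.validate()
print(results)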
Validations can be run against a variety of data sources: pandas, Spark, relational/OLTP databases, and cloud warehouses such as Redshift via SQLAlchemy.
Pandas — Great for in-memory machine learning pipelines!
Spark — Good for really big data.
Postgres — Leading open source database
BigQuery — Google serverless massive-scale SQL analytics platform
Databricks — Managed Spark Analytics Platform
MySQL — Leading open source database
AWS Redshift — Cloud-based data warehouse
AWS S3 — Cloud based blob storage
Snowflake — Cloud-based data warehouse
Apache Airflow — An open source orchestration engine
Other SQL Relational DBs — Most RDBMS are supported via SQLAlchemy
Jupyter Notebooks — The best way to build Expectations
Slack — Get automatic data quality notifications!
An expectation in general could be anything, but with respect to data:
An Expectation is an assertion about the data, which we can use to produce a validation.
This gives us profiling, which tells us what our data looks like.
With the help of Data Docs, it becomes easier to share information about the data with the team.
The Data Context puts all of these pieces together.
Expectations
Expectations describe what we expect from the data.
For example, we expect our columns
- not to be empty
- to be unique
- to be between x and y
- to match a regex
- and many more
Using an expectation configuration we parameterise values such as the columns to be validated and the values to compare against.
We then group different expectations and their configurations together and give them a user-defined name. This is called an Expectation Suite.
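As a rough sketch continuing the pandas example above, the declared expectations can be pulled out as configurations and grouped under a user-defined suite name (the suite name here is just an illustration):

# Collect the declared expectations into a named suite (classic API).
suite = df.get_expectation_suite(discard_failed_expectations=False)
suite.expectation_suite_name = "container_temperature_suite"

# Each entry is an expectation type plus its parameterised values (columns, bounds, ...).
for config in suite.expectations:
    print(config.expectation_type, config.kwargs)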
Data Source
A Datasource is a connection to the environment from which the data will be fetched.
It could be Spark, a SQL/OLTP database, or a cloud source like Redshift.
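Datasources can also be registered on the Data Context (described below) from code. The sketch below uses the classic (V2) API; the datasource names, class names, and connection string are assumptions for illustration, and the exact keyword arguments depend on your Great Expectations version.

from great_expectations.data_context import DataContext

context = DataContext()  # loads the project's great_expectations.yml

# A Spark-backed datasource and a SQLAlchemy-backed one (placeholder credentials).
context.add_datasource("spark-data-source", class_name="SparkDFDatasource")
context.add_datasource(
    "warehouse",
    class_name="SqlAlchemyDatasource",
    credentials={"url": "postgresql://user:password@host:5432/analytics"},
)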
Validation
We need data assets to run our expectations against.
We use an Expectation Suite, which contains the expectations and their configurations, to run the validation on top of the data assets; this gives us the results.
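Continuing the earlier sketch, a validation is simply the dataset run against a suite; depending on the Great Expectations version the result behaves like an object or a plain dict.

# Validate the data asset against the suite built earlier.
results = df.validate(expectation_suite=suite)
print(results.success)  # overall pass/fail for the whole suite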
Profiling
An Expectation Suite is simply a set of Expectations. You can create Expectation Suites by writing out individual statements, such as the one above, or by automatically generating them based on profiler results.
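As a hedged sketch of the second route, the built-in basic profiler can generate a starter suite from the data itself; the import path below is from the classic API and may differ between versions.

from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# Profiling returns an auto-generated suite plus the profiling results themselves.
generated_suite, profiling_results = BasicDatasetProfiler.profile(df)
print(len(generated_suite.expectations), "expectations generated from the profile")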
Data Docs
Great Expectations renders Expectations to clean, human-readable documentation called Data Docs. These HTML docs contain both your Expectation Suites as well as your data validation results each time validation is run — think of it as a continuously updated data quality report.
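With a configured Data Context (next section), the docs can be rebuilt and opened straight from code, roughly like this:

# Render the suites and latest validation results to HTML, then open the site.
context.build_data_docs()
context.open_data_docs()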
Data Context
The Data Context manages your project configuration in a YAML file, which we can share with the team.
It stores database connection info, plugin information, and so on.
Validation Actions
Actions are Python classes with a run method that takes the result of validating a Batch against an Expectation Suite and does something with it (e.g., save validation results to disk, or send a Slack notification).
Using a Validation Action we can trigger an email, update the Data Docs, and so on.
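As a purely conceptual sketch (this is not Great Expectations' actual base class, just the shape of the idea), an action boils down to a class whose run method reacts to a validation result:

# Conceptual only: an action receives a validation result and does something with it.
class EmailOnFailureAction:
    def __init__(self, send_email):
        self.send_email = send_email  # any callable that can deliver a message

    def run(self, validation_result):
        if not validation_result["success"]:
            self.send_email("Data validation failed; check the Data Docs for details.")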
How does it work
Demo Time
Local CLI Installation
pip install great_expectations
great_expectations init
What data would you like Great Expectations to connect to?
1. Files on a filesystem (for processing with Pandas or Spark)
2. Relational database (SQL)
: 1
What are you processing your files with?
1. Pandas
2. PySpark
: 2
Great Expectations relies on the library `pyspark` to connect to your data, but the package `pyspark` containing this library is not installed.
Would you like Great Expectations to try to execute `pip install pyspark` for you? [Y/n]: Y
pip install pyspark [------------------------------------] 0%
Enter the path (relative or absolute) of the root directory where the data files are stored.
: /Users/ajith.shetty/data
Give your new Datasource a short name.
[data__dir]: spark-data-source
Now that we have created the data source we will connect to for running our Great Expectations validations, let's create a suite to bind the expectations and the data source together.
great_expectations suite new
This will open up a Jupyter notebook.
Define a rule you want to test your data with.
Let's define an expectation saying that a particular column must be present in the data we are reading. In our example we expect to have a column named Road Number.
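In the scaffolded notebook this is roughly a one-liner; the batch variable name and the save call below follow the classic notebook template and may differ slightly in your version.

# The notebook exposes the loaded data as a batch; add the expectation to it.
batch.expect_column_to_exist("Road Number")

# Persist the suite so the CLI can validate it and render it in Data Docs.
batch.save_expectation_suite(discard_failed_expectations=False)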
Run the cells in the notebook.
This will open the Data Docs UI, which shows each result and its status.
Now, in case you want to update or add a new expectation, click on How to Edit This Suite.
Copy the command and run it in your command prompt.
This will again open a Jupyter notebook, where we can add another rule.
Once you run it, all the expectations you have specified are validated and the output is shown.
In our case, one expectation succeeded and the other failed.
Spark Databricks
Install the library.
Create the data source.
Create the context to bind your data with the expectations.
Let’s run the expectations.
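Putting those four steps together, a minimal sketch with the classic SparkDFDataset wrapper could look like the following; the DBFS path and column names are placeholders for whatever data you actually mount.

# On Databricks, install the library on the cluster first (e.g. %pip install great_expectations).
from great_expectations.dataset import SparkDFDataset

df = spark.read.csv("/mnt/data/containers.csv", header=True, inferSchema=True)
gdf = SparkDFDataset(df)  # wrap the Spark DataFrame so expectations can run on it

gdf.expect_column_to_exist("Road Number")
gdf.expect_column_values_to_not_be_null("Road Number")

# Run everything declared above and print the validation result.
results = gdf.validate()
print(results)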
Create a new Expectation Suite by profiling from another Data source
1. Let's read the data and create a datasource.
2. Define the suite name and store the suite.
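A hedged sketch of step 2: give the suite a name and store it so it can be reused later. The suite here comes from the expectations declared on the wrapped DataFrame; the name and DBFS path are placeholders.

import json

suite = gdf.get_expectation_suite(discard_failed_expectations=False)
suite.expectation_suite_name = "container_suite"

# Store the suite as JSON so other jobs can load and reuse it.
with open("/dbfs/great_expectations/container_suite.json", "w") as f:
    json.dump(suite.to_json_dict(), f, indent=2)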
Now let's run the validation on the same data, but using the pre-created suite.
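Roughly, that is just a validate call with the stored suite passed in (continuing the sketch above):

# Validate the same Spark data against the pre-created suite.
results = SparkDFDataset(df).validate(expectation_suite=suite)
print(results.statistics)  # evaluated / successful / unsuccessful expectation counts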
Below is the output:
The job ran for all the columns we have defined in the suite with the given rule.
In the bottom you can see:
"statistics": {
    "evaluated_expectations": 3,
    "successful_expectations": 1,
    "unsuccessful_expectations": 2,
    "success_percent": 33.33333333333333
}
Out of 3 column validations, 2 failed and 1 succeeded, so the success percent is 33.33.
Full output:
TLDR
References:
You may find the above Code in my repo:
Ajith Shetty
Bigdata Engineer — Love for Bigdata, Analytics, Cloud and Infrastructure.
Subscribe✉️ ||More blogs📝||Linked In📊||Profile Page📚||Git Repo👓
Interested in getting the weekly updates on the big data analytics around the world, do subscribe to my: Weekly Newsletter Just Enough Data