That’s WhyLogs

Photo by Markus Spiske on Unsplash

Tell me about a time when everything was falling apart in your pipeline and you fixed it all just by looking at it.

Probably never.

Logs are underrated. No matter how much we depend on them, we give them very little importance.

We generally produce more logs than data, yet they all get flushed to a dump, even though we know they hold so much power: insight into the job you have been running, its performance, its results, its trends, and much more.

Logs are arguably the second most important data we have in this generation.

We need Data Logging.

The Need for Logging

The effort required to validate, monitor, and fix the data in a running pipeline is enormous.

What if we could streamline the debugging, testing, and monitoring activities?

The solution is data logging.

We can capture logs at each layer of the infrastructure stack and analyse them.

What do we need from data logging?

  1. How fresh the data is, and whether the schema has changed
  2. The count of the data
  3. Statistics for each column
  4. Drift in the data distribution
  5. Sample records for debugging

Over and above that, it should be:

  1. Lightweight
  2. Configurable
  3. Able to run in parallel with our pipeline

Enter WhyLogs.

WhyLogs: “A Data and Machine Learning Logging Standard”


WhyLogs is developed by a company called WhyLabs.

WhyLogs is an open-source project that enables infrastructure-agnostic observability for machine learning and big data systems.

It is currently built in Python and Java, both of which are common and popular in the machine learning world today.

Out of the box:

  1. It can run in Jupyter notebooks
  2. It integrates with Kafka, Spark, and Flink
  3. It offers visualisation
  4. And much more.

It provides lightweight statistical profiling of the data distribution, in a distributed manner.

How is it different from traditional analysis systems?

It decouples statistics collection from analysis. Your job can run in a container and collect the profiles, while the analysis can happen at a later stage, for example in a Jupyter notebook.
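To make the idea concrete, here is a minimal plain-Python sketch of that decoupling. This is not the actual WhyLogs API; `profile_column` is a hypothetical stand-in for the profiler. The job computes a small profile and serialises it to JSON, and the analysis happens later on the loaded copy.

```python
import json
import statistics

# --- In the pipeline job (e.g. inside a container) ---
def profile_column(values):
    """Collect a tiny statistical profile of one numeric column."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_count": len(values) - len(non_null),
        "min": min(non_null),
        "max": max(non_null),
        "median": statistics.median(non_null),
    }

profile = profile_column([3, 1, None, 7, 5])
serialized = json.dumps(profile)  # shipped off to storage, e.g. S3

# --- Later, in a notebook ---
loaded = json.loads(serialized)
print(loaded["median"])  # analysis happens long after collection
```

The key point is that the serialised profile is tiny compared to the raw data, so shipping it around is cheap.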

Features of WhyLogs

Out of the box, WhyLogs collects statistics and sketches of the data per column.

Metrics like:

  1. Counters: boolean and null values
  2. Summary statistics: min, max, median
  3. Unique values (cardinality)
  4. Top frequent records
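For intuition, here is an exact, stdlib-only sketch of those per-column metrics. Real WhyLogs computes them with approximate sketches so they stay cheap at scale; `column_metrics` below is a hypothetical illustration, not the library's API.

```python
from collections import Counter

def column_metrics(values, top_k=3):
    """Exact toy versions of the per-column metrics described above."""
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    return {
        "null_count": values.count(None),             # counter metrics
        "bool_count": sum(isinstance(v, bool) for v in non_null),
        "min": min(non_null),                         # summary statistics
        "max": max(non_null),
        "cardinality": len(freq),                     # unique values
        "top_frequent": freq.most_common(top_k),      # frequent records
    }

print(column_metrics(["a", "b", "a", None, "a", "c"]))
```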


WhyLogs can be integrated directly with:

  1. Spark
  2. Kafka
  3. S3
  4. Numpy
  5. Pandas
  6. Many more

Demo Integration with Spark

Set up the data

Create a profile over the data

Or use an aggregator over the data.
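Since the original code gists are not reproduced here, the following plain-Python sketch shows the idea behind the aggregator step, under the assumption (true for WhyLogs profiles) that profiles are mergeable: each Spark partition can be profiled independently and the results combined afterwards. `MiniProfile` is a toy class, not the WhyLogs or Spark API.

```python
class MiniProfile:
    """A toy mergeable profile: tracks count, min and max of one column."""
    def __init__(self):
        self.count, self.min, self.max = 0, None, None

    def track(self, value):
        self.count += 1
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)

    def merge(self, other):
        """Combine two non-empty profiles -- the property that lets
        profiling run per partition and be aggregated afterwards."""
        merged = MiniProfile()
        merged.count = self.count + other.count
        merged.min = min(self.min, other.min)
        merged.max = max(self.max, other.max)
        return merged

# Two "partitions" profiled independently, then aggregated:
p1, p2 = MiniProfile(), MiniProfile()
for v in (10, 4, 8):
    p1.track(v)
for v in (2, 15):
    p2.track(v)

total = p1.merge(p2)
print(total.count, total.min, total.max)  # 5 2 15
```

Because merging is associative, the aggregation can happen in any order across executors, which is exactly what makes this style of profiling a good fit for Spark.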

In the second part, we will go deeper into WhyLogs and visualise the profiles we have logged.


The code above can be found on my GitHub.

Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.

Subscribe✉️ ||More blogs📝||Linked In📊||Profile Page📚||Git Repo👓

Interested in weekly updates on big data analytics around the world? Do subscribe to my newsletter: Just Enough Data



