Photo by Markus Spiske on Unsplash

Tell me about a time when everything was falling apart in your pipeline and you fixed it all just by looking at it.

Probably never.

Logs are so underrated. No matter how much we depend on logs, we give them very little importance.

We generally produce more logs than actual data, yet it all gets flushed to a dump, even though we know logs tell us so much about the job we have been running: its performance, its results, its trends, and much more.

After the data itself, logs are the second most important asset we have today.

We need Data Logging.

The Need for Logging

The effort required to validate, monitor, and fix data in a running pipeline is enormous.

What if we could streamline the debugging, testing, and monitoring activities?

The solution is data logging.

We can capture logs at each layer of the infrastructure stack and analyse them.

What do we need from data logging?

  1. How fresh the data is, and whether there are any schema changes
  2. The record count of the data
  3. Statistics for each column
  4. Whether the data distribution has drifted
  5. Sample records for debugging
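To make these five requirements concrete, here is a minimal, library-free sketch of what one such log entry could look like. The function name, the `"price"`/`"qty"` columns, and the histogram-as-drift-fingerprint approach are all illustrative assumptions, not part of any particular tool:

```python
import statistics
from collections import Counter
from datetime import datetime, timezone

def profile_batch(records, expected_schema, received_at):
    """Build a minimal log entry for one batch of records.

    records: list of dicts, e.g. [{"price": 10.5, "qty": 3}, ...]
    expected_schema: set of expected column names
    received_at: timezone-aware datetime when the batch arrived
    """
    columns = set().union(*(r.keys() for r in records)) if records else set()
    prices = [r["price"] for r in records if r.get("price") is not None]
    return {
        # 1. freshness and schema changes
        "lag_seconds": (datetime.now(timezone.utc) - received_at).total_seconds(),
        "schema_drift": sorted(columns ^ expected_schema),
        # 2. record count
        "count": len(records),
        # 3. per-column statistics (shown here only for "price")
        "price_stats": {
            "min": min(prices),
            "max": max(prices),
            "median": statistics.median(prices),
        },
        # 4. a crude distribution fingerprint to compare across batches
        "qty_histogram": Counter(r.get("qty") for r in records),
        # 5. a few sample records for debugging
        "samples": records[:3],
    }
```

Comparing the `qty_histogram` of today's batch against yesterday's is the simplest possible drift check; real tools replace these exact computations with streaming sketches.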

Over and above that, it should be:

  1. Lightweight
  2. Configurable
  3. Able to run in parallel with our pipeline

Enter WhyLogs.

WhyLogs: “A Data and Machine Learning Logging Standard”

Background

WhyLogs is developed by a company called WhyLabs.

WhyLogs is an open-source project that enables infrastructure-agnostic observability for machine learning and big data systems.

It is currently built in Python and Java, two of the most common and popular languages in the machine learning world.

Out of the box,

  1. It can run in Jupyter notebooks
  2. It supports integration with Kafka, Spark, and Flink
  3. It provides visualisation
  4. And many more.

It provides lightweight statistical profiling of the data distribution in a distributed manner.

How is it different from traditional analysis systems?

It decouples statistics collection from analysis. Your job can run in a container and collect the logs, while the analysis can happen at a later stage, for example in a Jupyter notebook.
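The pattern itself is simple and worth sketching. Here a plain JSON file stands in for WhyLogs' binary profiles; the function names and file layout are illustrative assumptions, not the WhyLogs API:

```python
import json
import os
import tempfile

# In the pipeline job (e.g. inside a container):
# collect lightweight statistics and persist them alongside the run.
def collect_and_write(values, path):
    profile = {"count": len(values), "min": min(values), "max": max(values)}
    with open(path, "w") as f:
        json.dump(profile, f)

# Later, in a notebook: load and analyse the stored profile.
# No access to the original job or the raw data is needed.
def load_profile(path):
    with open(path) as f:
        return json.load(f)
```

Because only the small profile travels between the two stages, the analysis side never has to re-read (or even be able to reach) the raw data.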

Features of WhyLogs

Out of the box, WhyLogs collects statistics and sketches of the data on a per-column basis.

Metrics like:

  1. Counters: boolean and null values
  2. Summary statistics: min, max, median
  3. Unique values (cardinality)
  4. Most frequent values
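To see what these four metric families mean, here is a naive, exact version of each for a single column, in plain Python. Note this is only an illustration of the metrics themselves: WhyLogs computes them with constant-memory streaming sketches rather than by holding the column in memory:

```python
import statistics
from collections import Counter

def column_metrics(values):
    """Exact versions of the four metric families, for illustration only."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        # 1. counters: nulls and booleans
        "null_count": len(values) - len(non_null),
        "true_count": sum(1 for v in non_null if v is True),
        # 2. summary statistics
        "min": min(non_null),
        "max": max(non_null),
        "median": statistics.median(non_null),
        # 3. cardinality
        "cardinality": len(counts),
        # 4. most frequent values, with their counts
        "top_3": counts.most_common(3),
    }
```

The sketch-based equivalents trade a small, bounded error (e.g. in cardinality and frequent-item counts) for the ability to run over unbounded streams.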

Integration

WhyLogs can be integrated directly with:

  1. Spark
  2. Kafka
  3. S3
  4. Numpy
  5. Pandas
  6. Many more

Demo Integration with Spark

Set up the data

Create a profile over the data

Or use an aggregator over the data.

In the second part, we will go deeper into WhyLogs and visualise the logs we have profiled.

Reference:

The code above can be found in my GitHub.

Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.


Interested in getting the weekly updates on the big data analytics around the world, do subscribe to my: Weekly Newsletter Just Enough Data

Bigdata Engineer — Love for BigData, Analytics, Cloud and Infrastructure. Want to talk more? Ping me in Linked In: https://www.linkedin.com/in/ajshetty28/