Photo by Markus Spiske on Unsplash

Tell me about a time when everything was falling apart in your pipeline and you fixed it all just by looking at it.

Probably never.

Logs are so underrated. No matter how much we depend on logs, we give them very little importance.

We generally produce more logs than actual data, yet it all gets flushed to a dump, even though we know logs tell us so much about the job we have been running: its performance, its results, its trends, and much more.

After the data itself, logs are the second most important asset we have today.

We need Data Logging.

The Need for Logging

The effort required to validate, monitor, and fix data in a running pipeline is enormous.

What if we could streamline the debugging, testing, and monitoring activities?

The solution is data logging.

We can capture logs at each layer of the infrastructure stack and analyse them.

What do we need from data logging?

  1. How fresh the data is, and whether there are any schema changes
  2. The record count of the data
  3. Statistics for each column
  4. Whether the data distribution has drifted
  5. Sample records for debugging
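To make these five requirements concrete, here is a minimal, library-free sketch of what one such log entry could look like. The function name, the `"price"`/`"qty"` columns, and the histogram-as-drift-fingerprint approach are all illustrative assumptions, not part of any particular tool:

```python
import statistics
from collections import Counter
from datetime import datetime, timezone

def profile_batch(records, expected_schema, received_at):
    """Build a minimal log entry for one batch of records.

    records: list of dicts, e.g. [{"price": 10.5, "qty": 3}, ...]
    expected_schema: set of expected column names
    received_at: timezone-aware datetime when the batch arrived
    """
    columns = set().union(*(r.keys() for r in records)) if records else set()
    prices = [r["price"] for r in records if r.get("price") is not None]
    return {
        # 1. freshness and schema changes
        "lag_seconds": (datetime.now(timezone.utc) - received_at).total_seconds(),
        "schema_drift": sorted(columns ^ expected_schema),
        # 2. record count
        "count": len(records),
        # 3. per-column statistics (shown here only for "price")
        "price_stats": {
            "min": min(prices),
            "max": max(prices),
            "median": statistics.median(prices),
        },
        # 4. a crude distribution fingerprint to compare across batches
        "qty_histogram": Counter(r.get("qty") for r in records),
        # 5. a few sample records for debugging
        "samples": records[:3],
    }
```

Comparing the `qty_histogram` of today's batch against yesterday's is the simplest possible drift check; real tools replace these exact computations with streaming sketches.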

Over and above that, it should be:

  1. Lightweight
  2. Configurable
  3. Able to run in parallel with our pipeline

Enter WhyLogs.

WhyLogs: “A Data and Machine Learning Logging Standard”

Background

WhyLogs is developed by a company called WhyLabs.

WhyLogs is an open-source project that enables infrastructure-agnostic observability for machine learning and big data systems.

It is currently built in Python and Java, two of the most common and popular languages in the machine learning world.

Out of the box,

  1. It can run in Jupyter notebooks
  2. It supports integration with Kafka, Spark, and Flink
  3. It provides visualisation
  4. And many more.

It provides lightweight statistical profiling of the data distribution in a distributed manner.

How is it different from traditional analysis systems?

It decouples statistics collection from analysis. Your job can run in a container and collect the logs, while the analysis can happen at a later stage, for example in a Jupyter notebook.
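The pattern itself is simple and worth sketching. Here a plain JSON file stands in for WhyLogs' binary profiles; the function names and file layout are illustrative assumptions, not the WhyLogs API:

```python
import json
import os
import tempfile

# In the pipeline job (e.g. inside a container):
# collect lightweight statistics and persist them alongside the run.
def collect_and_write(values, path):
    profile = {"count": len(values), "min": min(values), "max": max(values)}
    with open(path, "w") as f:
        json.dump(profile, f)

# Later, in a notebook: load and analyse the stored profile.
# No access to the original job or the raw data is needed.
def load_profile(path):
    with open(path) as f:
        return json.load(f)
```

Because only the small profile travels between the two stages, the analysis side never has to re-read (or even be able to reach) the raw data.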

Features of WhyLogs

Out of the box, WhyLogs collects statistics and sketches of the data on a per-column basis.

Metrics like:

  1. Counters: boolean and null values
  2. Summary statistics: min, max, median
  3. Unique values (cardinality)
  4. Most frequent values
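To see what these four metric families mean, here is a naive, exact version of each for a single column, in plain Python. Note this is only an illustration of the metrics themselves: WhyLogs computes them with constant-memory streaming sketches rather than by holding the column in memory:

```python
import statistics
from collections import Counter

def column_metrics(values):
    """Exact versions of the four metric families, for illustration only."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        # 1. counters: nulls and booleans
        "null_count": len(values) - len(non_null),
        "true_count": sum(1 for v in non_null if v is True),
        # 2. summary statistics
        "min": min(non_null),
        "max": max(non_null),
        "median": statistics.median(non_null),
        # 3. cardinality
        "cardinality": len(counts),
        # 4. most frequent values, with their counts
        "top_3": counts.most_common(3),
    }
```

The sketch-based equivalents trade a small, bounded error (e.g. in cardinality and frequent-item counts) for the ability to run over unbounded streams.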

Integration

WhyLogs can be integrated directly with:

  1. Spark
  2. Kafka
  3. S3
  4. Numpy
  5. Pandas
  6. Many more

Demo Integration with Spark

Set up the data

Create a profile over the data

Or use an aggregator over the data.

In the second part, we will go deeper into WhyLogs and visualise the logs we have profiled.

Reference:

The code above can be found in my GitHub.

Ajith Shetty

Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.


Interested in getting the weekly updates on the big data analytics around the world, do subscribe to my: Weekly Newsletter Just Enough Data

Bigdata Engineer — Love for BigData, Analytics, Cloud and Infrastructure. Want to talk more? Ping me in Linked In: https://www.linkedin.com/in/ajshetty28/