Tell me about a time when everything in your pipeline was falling apart and you fixed it all just by looking at it.
Logs are so underrated. No matter how much we depend on logs, we give them so little importance.
We generally produce more logs than data, yet they all get flushed to a dump, even though we know how much insight they give into the jobs we run: the performance, the results, the trends and much more.
Logs are the second most important data we have today.
We need Data Logging.
The Need for Logging
The effort required to validate, monitor and fix the data in a running pipeline is enormous.
What if we could streamline the debugging, testing and monitoring activities?
The solution is data logging.
We can capture logs at each layer of the infrastructure stack and analyse them.
What do we need from data logging?
- How fresh the data is and whether the schema has changed
- The record count
- Statistics for each column
- Drift in the data distribution
- Sample records for debugging
Over and above that, it should:
- Run in parallel with our pipeline
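To make the checklist a bit more concrete, here is a minimal sketch in plain Python of what a single log entry for one batch could capture: the record count, the freshness, the schema and a few sample records. This is not the WhyLogs API. The names `log_batch` and `schema_changes`, the `ts` event-time field and the sample rows are all assumptions made up for illustration, and column statistics and drift are left out to keep it short.

```python
from datetime import datetime, timezone
import random


def log_batch(rows, event_time_key, now, sample_size=2, seed=0):
    """Summarise one batch of dict records into a small log entry (hypothetical helper)."""
    columns = sorted({key for row in rows for key in row})
    newest = max(row[event_time_key] for row in rows)
    return {
        "count": len(rows),
        # Freshness: how far the newest record lags behind "now".
        "freshness_seconds": (now - newest).total_seconds(),
        # Schema: column name -> Python type names observed in this batch.
        "schema": {
            col: sorted({type(row[col]).__name__ for row in rows if col in row})
            for col in columns
        },
        # A few raw records kept for debugging; seeded so the pick is repeatable.
        "samples": random.Random(seed).sample(rows, min(sample_size, len(rows))),
    }


def schema_changes(previous, current):
    """Columns added, removed, or whose observed types changed between two entries."""
    old, new = previous["schema"], current["schema"]
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in set(old) & set(new) if old[c] != new[c]),
    }


utc = timezone.utc
yesterday = [
    {"id": 1, "amount": 10.0, "ts": datetime(2021, 1, 1, 12, 0, tzinfo=utc)},
    {"id": 2, "amount": 30.0, "ts": datetime(2021, 1, 1, 12, 10, tzinfo=utc)},
]
today = [
    {"id": "3", "amount": 5.0, "country": "IN", "ts": datetime(2021, 1, 2, 12, 0, tzinfo=utc)},
]

first = log_batch(yesterday, "ts", now=datetime(2021, 1, 1, 12, 15, tzinfo=utc))
second = log_batch(today, "ts", now=datetime(2021, 1, 2, 12, 1, tzinfo=utc))
changes = schema_changes(first, second)
print(first["count"], first["freshness_seconds"], changes)
```

Because each entry is just a small dictionary, it can be written next to the pipeline output without touching the pipeline logic itself.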
WhyLogs: “A Data and Machine Learning Logging Standard”
WhyLogs is developed by a company called WhyLabs.
WhyLogs is an open-source project that enables infrastructure-agnostic observability for machine learning and big data technologies.
It is currently built on Python and Java, both of which are common and popular in the machine learning world.
Out of the box:
- It runs in Jupyter notebooks
- It supports integration with Kafka, Spark or Flink
- And many more
It provides lightweight statistical profiling of the data distribution in a distributed manner.
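Profiling "in a distributed manner" works when the summaries are mergeable: each partition is profiled on its own and the partial results are combined afterwards. The sketch below shows that idea in plain Python with a tiny summary of count, sum, min and max. The class `NumericSummary` and the helper `summarise` are hypothetical names for illustration, not WhyLogs classes.

```python
from dataclasses import dataclass


@dataclass
class NumericSummary:
    """Mergeable summary of a numeric column: count, sum, min and max."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def track(self, value):
        """Fold one value into the summary."""
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other):
        """Combine two summaries without looking at the raw values again."""
        return NumericSummary(
            count=self.count + other.count,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )


def summarise(values):
    summary = NumericSummary()
    for value in values:
        summary.track(value)
    return summary


# Each inner list stands in for one partition handled by a different worker.
partitions = [[1.0, 2.0, 3.0], [10.0, 20.0], [4.0]]
partials = [summarise(partition) for partition in partitions]

merged = NumericSummary()
for partial in partials:
    merged = merged.merge(partial)

# Profiling all the data in one place gives the same summary.
whole = summarise([value for partition in partitions for value in partition])
print(merged == whole, merged)
```

The merged result matches a single pass over all the data, which is why this kind of summary can be computed per partition and combined at the end.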
How is it different from traditional analysis systems?
It decouples statistics collection from analysis. Your job can run in a container and collect the logs, while the analysis happens at a later stage, for example in a Jupyter notebook.
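Here is a rough sketch of that decoupling in plain Python: the collection step writes a small JSON summary, and a separate analysis step reads the summaries back later and compares them. Everything in it is an assumption for illustration: the function names, the JSON layout, the file names and the drift threshold of 3 standard deviations are all made up, and WhyLogs uses its own profile format.

```python
import json
import statistics
import tempfile
from pathlib import Path


def collect(values):
    """Runs inside the job: reduce raw values to a small summary."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),
    }


def write_profile(profile, path):
    Path(path).write_text(json.dumps(profile))


def read_profile(path):
    return json.loads(Path(path).read_text())


def mean_shift(baseline, current):
    """Shift of the current mean in units of the baseline standard deviation.

    Assumes the baseline standard deviation is not zero.
    """
    return abs(current["mean"] - baseline["mean"]) / baseline["stdev"]


with tempfile.TemporaryDirectory() as log_dir:
    # Collection: the pipeline only writes summaries, never runs the analysis.
    write_profile(collect([10.0, 12.0, 14.0]), Path(log_dir) / "day1.json")
    write_profile(collect([20.0, 22.0, 24.0]), Path(log_dir) / "day2.json")

    # Analysis: later, for example in a notebook, only the summaries are read.
    baseline = read_profile(Path(log_dir) / "day1.json")
    current = read_profile(Path(log_dir) / "day2.json")

shift = mean_shift(baseline, current)
drifted = shift > 3.0  # hypothetical threshold
print(round(shift, 2), drifted)
```

The analysis step never needs the raw data or the job itself, only the logged summaries.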
Features of WhyLogs
Out of the box, WhyLogs collects statistics and sketches of the data for each column:
- Counters: boolean and null values
- Summary statistics: min, max, median
- Unique values, or cardinality
- Most frequent records
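To show what such a column profile could contain, here is a small plain-Python sketch that computes the same kinds of values for one column. It is not how WhyLogs does it: this version computes everything exactly and keeps all values in memory, while the sketches mentioned above are meant to summarise the data compactly. The function `profile_column` and the `ages` column are hypothetical.

```python
from collections import Counter
import statistics


def profile_column(values, top_k=2):
    """Exact profile of one column: counters, summary stats, cardinality, top values."""
    present = [v for v in values if v is not None]
    # bool is a subclass of int in Python, so exclude it from the numeric values.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    profile = {
        "counters": {
            "total": len(values),
            "null": len(values) - len(present),
            "boolean": sum(isinstance(v, bool) for v in present),
        },
        "cardinality": len(set(present)),
        "top_frequent": Counter(present).most_common(top_k),
    }
    if numeric:
        profile["summary"] = {
            "min": min(numeric),
            "max": max(numeric),
            "median": statistics.median(numeric),
        }
    return profile


ages = [31, 25, None, 31, 40, 25, 31]
age_profile = profile_column(ages)
print(age_profile)
```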
WhyLogs can be integrated directly with:
- Many more
Demo Integration with Spark
Set up the data
Create a profile over the data
Or use an aggregator over the data.
In the second part we will go deeper into WhyLogs and visualise the logs we have profiled.
The code for this demo can be found on my GitHub.
GitHub - ajithshetty/whylogs_demo
GitHub - whylabs/whylogs: Profile and monitor your ML data pipeline end-to-end , Join us in slack @…
Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.
Interested in weekly updates on big data analytics from around the world? Subscribe to my weekly newsletter, Just Enough Data.