Keep it simple, keep it Airbyte
Remember when we copied data from one place to another, ran a few queries on top of it, and called it ETL?
So old school.
We thought we needed something better, so we came up with an automation: an ETL tool that would do that job for us. Talend, for example.
Then we needed something open source and free that could work its magic on big data.
Spark does what it does best there.
And we needed an automation to run it all; a perfect example would be Airflow.
Now, how often are you asked to write a pipeline that pulls data from one source and writes it to a destination incrementally?
You end up writing the same code again and again: the transformation and incremental logic, the load to the destination, and so on.
We data engineers do this more often than anything else.
What if we could build an ETL pipeline with just a click of a button, in minutes?
It would be so amazing.
Enter Airbyte, doing what it does best.
What is Airbyte?
The data integration platform that can scale with your custom or high-volume needs
From high-volume databases to the long tail of API sources.
Source: https://airbyte.io/
It's as simple as that.
It's a platform that helps you connect your source and destination with just a click of a button.
It keeps all the complexity, like incremental loads, automation, connections and all those pre/post requisites, away from you, so that you can concentrate purely on writing business logic and not worry about creating the pipelines.
Origin
Airbyte was founded by Michel and John in January 2020. Our first idea was to help companies exchange data, starting by building a customer portal for data providers so their clients can easily evaluate and pull data however and wherever they want. We applied to YC W20 (January to March 2020) with that idea.
source: https://handbook.airbyte.io/company/our-story
Airbyte Deploy modes
Airbyte can be deployed in any of the following ways (a rough cloud-VM sketch follows the list):
- Locally
- AWS
- GCP
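A local deployment is just the three commands shown in the demo below. For a cloud deployment, the sketch below assumes a fresh Linux VM (for example an EC2 instance); package names and the docker-compose install step vary by distro, so treat it as a starting point rather than the official procedure.
# Rough sketch: deploy Airbyte on a fresh Linux VM (assumes a yum-based distro)
sudo yum install -y docker && sudo service docker start
sudo usermod -aG docker $USER        # log out and back in for the group change to apply
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
git clone https://github.com/airbytehq/airbyte.git && cd airbyte
docker-compose up -d                 # UI listens on port 8000; reach it via an SSH tunnel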
Airbyte Features
1. Multiple connectors
Airbyte comes preloaded with different source and destination connectors.
You can just set them up and start playing around.
Connectors fall into 3 categories:
Certified: robust, extensively tested
- Source: Redshift, MySQL, Oracle DB, Postgres and many more
- Destination: BigQuery, Postgres, Redshift and many more
Beta: recently released; edge cases are still a work in progress
- Source: AWS CloudTrail, BigQuery, ClickHouse and many more
- Destination: Databricks, MySQL, etc.
Alpha: Not fully tested.
- Source: S3, Amazon Seller Partner, etc.
- Destination: Azure Blob Storage, GCS, etc.
2. Automated scheduling
You not only create the link between source and destination, you also set your own schedule to run the syncs incrementally.
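Schedules are configured per connection in the UI (for example every hour or every 24 hours). For ad-hoc runs, a sync can also be triggered over the local API; the endpoint, port and payload below are my assumption based on Airbyte's configuration API, so verify them against your version.
# Hypothetical example: trigger a manual sync for a connection via the local API
curl -X POST http://localhost:8000/api/v1/connections/sync \
  -H "Content-Type: application/json" \
  -d '{"connectionId": "<your-connection-id>"}'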
3. Customise or Create your own
There are some cases where you will not find your required source or destination.
But using the Airbyte Connector Development Kit (CDK), it is easier than ever to create your own custom connector.
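As a sketch of what that looks like, the cloned airbyte repository ships a connector generator; the paths and commands below are based on the monorepo layout at the time of writing, and <your-connector-name> is a placeholder, so double-check both against the repo you cloned.
# Scaffold a new Python source with the CDK generator (interactive prompts)
cd airbyte/airbyte-integrations/connector-templates/generator
./generate.sh                        # pick a template (e.g. Python Source) and a name
cd ../../connectors/source-<your-connector-name>
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python main.py spec                  # print the connector's configuration spec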
4. Monitoring in real time
Airbyte provides real-time monitoring so you can see how your pipeline is running and validate it step by step.
5. Alerts
The most common feature we look for in an ETL tool is an alert mechanism on failure, or maybe even on success.
Airbyte supports it out of the box.
6. Debugging on the go
It is much easier to debug your ETL job at the exact place where it failed, since you can monitor the run in real time and debug at the same time (see the sketch below).
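For the local docker-compose deployment, the quickest way to follow a running sync from the host is to tail the container logs. The container names below are the docker-compose defaults at the time of writing, so check docker ps for your setup.
docker ps --format '{{.Names}}'      # list the running Airbyte containers
docker logs -f airbyte-worker        # the worker executes the sync jobs
docker logs -f airbyte-server        # API/server-side logs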
7. Configurable
Airbyte is highly configurable. You can choose the write mode as incremental or full refresh, schedule runs based on your needs, set up alerting, or just write your own connector.
Support for Change Data Capture
CDC matters more than ever, as we keep changing the source by adding or removing data and we need our destination to be aware of these changes.
This is already integrated into Airbyte.
Demo time
In this demo we will be creating a Postgres source and a Local JSON destination.
And to keep it interesting, we will take the CDC (Change Data Capture) use case.
3 lines of magic
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up
And hit http://localhost:8000
Let's create a source: Postgres
docker run --rm --name airbyte-source -e POSTGRES_PASSWORD=password -p 2000:5432 -d postgres
docker exec -it airbyte-source psql -U postgres -c "ALTER SYSTEM SET wal_level = 'logical';"
docker restart airbyte-source
docker exec -it airbyte-source psql -U postgres
CREATE TABLE world(country_id INTEGER, country_name VARCHAR(200), PRIMARY KEY(country_id));
INSERT INTO world VALUES(1,'USA');
INSERT INTO world VALUES(2,'UK');
SELECT pg_create_logical_replication_slot('slot1','pgoutput');
CREATE PUBLICATION pub1 FOR ALL TABLES;
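Before wiring the source up in the UI, a few optional sanity checks (standard Postgres catalog queries) confirm that logical replication is on and that the slot and publication exist:
docker exec -it airbyte-source psql -U postgres -c "SHOW wal_level;"
docker exec -it airbyte-source psql -U postgres -c "SELECT slot_name, plugin FROM pg_replication_slots;"
docker exec -it airbyte-source psql -U postgres -c "SELECT pubname FROM pg_publication;"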
Let's create the source from the UI.
Select Logical Replication as the replication method, since we need CDC (Change Data Capture).
Set up the destination as Local JSON for testing.
Choose the database you have just created, with the sync mode set to Append and the sync frequency set to Manual.
Trigger the job.
Once completed, you can view the data locally under /tmp/airbyte_local/json_data.
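To peek at what the Local JSON destination wrote, list the directory and dump the file. The exact file name depends on the stream (Airbyte typically writes one _airbyte_raw_<stream> JSONL file per stream, but treat that naming as an assumption).
ls /tmp/airbyte_local/json_data/
cat /tmp/airbyte_local/json_data/*.jsonl | head     # one JSON record per line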
Let’s make a change in the source data.
INSERT INTO world VALUES(3,'AUSTRALIA');
DELETE FROM world WHERE country_name='AUSTRALIA';
And run the job again.
And there it is. The change has been captured.
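You can confirm it straight from the destination file. The grep below looks for the AUSTRALIA rows; with CDC, Airbyte also emits metadata columns on each record (for example a deleted-at timestamp for deletes), though the exact column names may differ by version.
grep AUSTRALIA /tmp/airbyte_local/json_data/*.jsonl   # both the insert and the delete should appear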
So in this demo we have set up the platform locally. Then we have created a source and a destination connection in just a few clicks, and without any hassle we have captured the changed data as well.
Ajith Shetty
Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.