Keep it simple, keep it Airbyte
Remember when we copied data from one place to another, ran a few queries on top of it, and called it ETL?
So old school.
We thought we needed something better, so we came up with an automation: an ETL tool that would do that job for us. Talend, for example.
Then we needed something open source and free that could work its magic on big data.
Spark does what it does best there.
And we needed an automation to run it all; a perfect example would be Airflow.
Now, how often are you asked to write a pipeline that pulls data from one source and writes it to a destination incrementally?
You end up writing the same code again and again: the transformation and incremental logic, the load to the destination, and so on.
We data engineers do this more often than anything else.
What if we could build an ETL pipeline with just a click of a button, in minutes?
It would be so amazing.
Enter Airbyte, doing what it does best.
What is Airbyte?
The data integration platform that can scale with your custom or high-volume needs
From high-volume databases to the long tail of API sources.
Source: https://airbyte.io/
It's as simple as that.
It's a platform that helps you connect your source and destination with just a click of a button.
It keeps all the complexity, like incremental loads, automation, connections and all those pre/post requisites, away from you, so that you can concentrate purely on writing business logic and not worry about creating the pipelines.
Origin
Airbyte was founded by Michel and John in January 2020. Our first idea was to help companies exchange data, starting by building a customer portal for data providers so their clients can easily evaluate and pull data however and wherever they want. We applied to YC W20 (January to March 2020) with that idea.
source: https://handbook.airbyte.io/company/our-story
Airbyte Deploy modes
Airbyte can be deployed in any of the following ways (a rough cloud-VM sketch follows the list):
- Locally
- AWS
- GCP
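A local deployment is just the three commands shown in the demo below. For a cloud deployment, the sketch below assumes a fresh Linux VM (for example an EC2 instance); package names and the docker-compose install step vary by distro, so treat it as a starting point rather than the official procedure.
# Rough sketch: deploy Airbyte on a fresh Linux VM (assumes a yum-based distro)
sudo yum install -y docker && sudo service docker start
sudo usermod -aG docker $USER        # log out and back in for the group change to apply
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
git clone https://github.com/airbytehq/airbyte.git && cd airbyte
docker-compose up -d                 # UI listens on port 8000; reach it via an SSH tunnel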
Airbyte Features
1. Multiple connectors
Airbyte comes preloaded with different source and destination connectors.
You can just set them up and start playing around.
Connectors fall into 3 categories:
Certified: robust, extensively tested
- Source: Redshift, MySQL, Oracle DB, Postgres and many more
- Destination: BigQuery, Postgres, Redshift and many more
Beta: recently released; edge cases are still a work in progress
- Source: AWS CloudTrail, BigQuery, ClickHouse and many more
- Destination: Databricks, MySQL, etc.
Alpha: Not fully tested.
- Source: S3, Amazon Seller Partner, etc.
- Destination: Azure Blob Storage, GCS, etc.
2. Automated scheduling
You not only create the link between source and destination, you also set your own schedule to run the syncs incrementally.
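Schedules are configured per connection in the UI (for example every hour or every 24 hours). For ad-hoc runs, a sync can also be triggered over the local API; the endpoint, port and payload below are my assumption based on Airbyte's configuration API, so verify them against your version.
# Hypothetical example: trigger a manual sync for a connection via the local API
curl -X POST http://localhost:8000/api/v1/connections/sync \
  -H "Content-Type: application/json" \
  -d '{"connectionId": "<your-connection-id>"}'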
3. Customise or Create your own
There are some cases where you will not find your required source or destination.
But using the Airbyte Connector Development Kit (CDK), it is easier than ever to create your own custom connector.
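As a sketch of what that looks like, the cloned airbyte repository ships a connector generator; the paths and commands below are based on the monorepo layout at the time of writing, and <your-connector-name> is a placeholder, so double-check both against the repo you cloned.
# Scaffold a new Python source with the CDK generator (interactive prompts)
cd airbyte/airbyte-integrations/connector-templates/generator
./generate.sh                        # pick a template (e.g. Python Source) and a name
cd ../../connectors/source-<your-connector-name>
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python main.py spec                  # print the connector's configuration spec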
4. Monitoring in real time
Airbyte provides real-time monitoring so you can see how your pipeline is running and validate it step by step.
5. Alerts
The most common feature we look for in an ETL tool is an alert mechanism on failure, or maybe even on success.
Airbyte supports it out of the box.
6. Debugging on the go
It is much easier to debug your ETL job at the exact place where it failed, since you can monitor the run in real time and debug at the same time (see the sketch below).
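For the local docker-compose deployment, the quickest way to follow a running sync from the host is to tail the container logs. The container names below are the docker-compose defaults at the time of writing, so check docker ps for your setup.
docker ps --format '{{.Names}}'      # list the running Airbyte containers
docker logs -f airbyte-worker        # the worker executes the sync jobs
docker logs -f airbyte-server        # API/server-side logs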
7. Configurable
Airbyte is highly configurable. You can choose the write mode as incremental or full refresh, schedule runs based on your needs, set up alerting, or just write your own connector.
Support for Change Data Capture
CDC matters more than ever, as we keep changing the source by adding or removing data and we need our destination to be aware of these changes.
This is already integrated into Airbyte.
Demo time
In this demo we will be creating a Postgres source and a Local JSON destination.
And to keep it interesting, we will take the CDC (Change Data Capture) use case.
3 lines of magic
git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up
And hit http://localhost:8000
Let's create a source: Postgres
docker run --rm --name airbyte-source -e POSTGRES_PASSWORD=password -p 2000:5432 -d postgres
docker exec -it airbyte-source psql -U postgres -c "ALTER SYSTEM SET wal_level = 'logical';"
docker restart airbyte-source
docker exec -it airbyte-source psql -U postgres
CREATE TABLE world(country_id INTEGER, country_name VARCHAR(200), PRIMARY KEY(country_id));
INSERT INTO world VALUES(1,'USA');
INSERT INTO world VALUES(2,'UK');
SELECT pg_create_logical_replication_slot('slot1','pgoutput');
CREATE PUBLICATION pub1 FOR ALL TABLES;
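Before wiring the source up in the UI, a few optional sanity checks (standard Postgres catalog queries) confirm that logical replication is on and that the slot and publication exist:
docker exec -it airbyte-source psql -U postgres -c "SHOW wal_level;"
docker exec -it airbyte-source psql -U postgres -c "SELECT slot_name, plugin FROM pg_replication_slots;"
docker exec -it airbyte-source psql -U postgres -c "SELECT pubname FROM pg_publication;"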
Let's create the source from the UI.
Select Logical Replication as the replication method, since we need CDC (Change Data Capture).
Set up the destination as Local JSON for testing.
Choose the database you have just created, with the sync mode set to Append and the sync frequency set to Manual.
Trigger the job.
Once completed, you can view the data locally under /tmp/airbyte_local/json_data.
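To peek at what the Local JSON destination wrote, list the directory and dump the file. The exact file name depends on the stream (Airbyte typically writes one _airbyte_raw_<stream> JSONL file per stream, but treat that naming as an assumption).
ls /tmp/airbyte_local/json_data/
cat /tmp/airbyte_local/json_data/*.jsonl | head     # one JSON record per line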
Let’s make a change in the source data.
INSERT INTO world VALUES(3,'AUSTRALIA');
DELETE FROM world WHERE country_name='AUSTRALIA';
And run the job again.
And there it is. The change has been captured.
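You can confirm it straight from the destination file. The grep below looks for the AUSTRALIA rows; with CDC, Airbyte also emits metadata columns on each record (for example a deleted-at timestamp for deletes), though the exact column names may differ by version.
grep AUSTRALIA /tmp/airbyte_local/json_data/*.jsonl   # both the insert and the delete should appear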
So in this demo we have set up the platform locally. Then we have created a source and a destination connection in just a few clicks, and without any hassle we have captured the changed data as well.
Ajith Shetty
Bigdata Engineer — Bigdata, Analytics, Cloud and Infrastructure.