Airbyte ETL in real cases

Felipe Veloso
Nov 8, 2021

This is the English version of the same post (link).

I recently started working at SimpliRoute (https://www.simpliroute.com/), a company dedicated to logistics intelligence. After investigating how the organization uses and obtains its information, the need arose to migrate and gather the data into a unified source. I put forward an architectural proposal for the data solution, covering both batch and stream.

In parallel, I heard about Airbyte (https://airbyte.io/), and my immediate reaction was: “With this I will no longer have to write so much code.” Honestly, the need to write ad hoc ETL code was greatly reduced.
Few people have heard of Airbyte. It is mainly an ETL tool that reads from various data sources, applies a transformation, and then stores the data in multiple destinations.

With a simple and intuitive interface, this tool allowed me to generate multiple migrations from sources such as:

  • Google Cloud Storage (JSON)
  • RDS (Postgres)
  • MongoDB
  • S3 (JSON)

To better-known destinations such as:

  • BigQuery
  • Google Cloud Storage

So after processing about 8 TB of information and migrating millions of GPS records to generate use cases for the Machine Learning area, I will describe my conclusions about this tool point by point.

The benefits:

  • The architecture required working with Google Cloud Composer, which made my development times too long. Airbyte allowed me to quickly set up ETLs to destinations that I control with just a couple of clicks.
  • Airbyte lets me ingest the data in three ways: denormalized (Airbyte metadata plus all the migrated data in a single column), the basic normalized option (a 100% mirror migration), and custom transformations of my own, which with a little study I can load from GitLab to handle the data ingestion as I need.
  • Quick implementation for a POC, and with a little knowledge it can become something production-ready for internal use (deploy it on a VM that only has an internal IP, and access it through a network tunnel).
  • Visual results (before version 0.30.29-alpha). Being able to see how the data reading progressed used to be both a benefit and a problem for Airbyte: on large ingests (more than 300 million records) the logs became heavy to view, at a performance cost for the VM/cluster, but you were aware of what was happening.
  • Closeness to the Airbyte team, both in the help they give and in their interest in how the tool is used, plus the ease of contributing to the community by reporting the bugs I detected.
  • Airflow integration through a connector. It may sound redundant, but the simplicity Airbyte offers eliminates several Airflow steps and simplifies complete migrations.
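
To make the difference between the denormalized and the basic normalized options more concrete, here is a minimal Python sketch. It assumes the raw layout Airbyte used around the versions discussed here, with metadata columns named `_airbyte_ab_id` and `_airbyte_emitted_at` and the whole record stored as JSON in `_airbyte_data`. Those column names, the sample GPS record and the `normalize_row` helper are assumptions for illustration only; this is not Airbyte's actual normalization code.

```python
import json

# Hypothetical raw row, as it could look in the denormalized mode:
# Airbyte metadata plus the whole source record in a single JSON column.
raw_row = {
    "_airbyte_ab_id": "0001",
    "_airbyte_emitted_at": "2021-11-08T12:00:00Z",
    "_airbyte_data": json.dumps({"vehicle_id": 42, "lat": -33.45, "lon": -70.66}),
}


def normalize_row(row: dict) -> dict:
    """Expand the JSON payload into one top-level column per field."""
    payload = json.loads(row["_airbyte_data"])
    normalized = dict(payload)
    # Keep the metadata next to the expanded columns.
    normalized["_airbyte_ab_id"] = row["_airbyte_ab_id"]
    normalized["_airbyte_emitted_at"] = row["_airbyte_emitted_at"]
    return normalized


normalized_row = normalize_row(raw_row)
print(normalized_row)
```

In the denormalized mode you would query the JSON column yourself; the normalized mode gives you a table that mirrors the source, which is roughly what `normalize_row` imitates for a single record.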

Against:

  • It is a schedule-oriented tool, so it does not support events. In an event-oriented organization it may not be the best option by itself (using the connector with Airflow may change this).
  • The open version has no built-in security, and the paid versions are not yet available (I would quickly give them my money). It is therefore your duty to secure the machine yourself, which requires more knowledge when deploying to production environments.
  • Limits on migrations. I do not have a precise number, but some of the ETLs (approx. 2,500 million records) failed after 72 hours of processing, and without much explanation Airbyte retried the ETL starting from zero. I would expect a more mature version to ingest the data in micro-batches to avoid retrying these migrations completely (at least I have not seen that it can be done yet).
  • The tool is still under construction and constantly updated. A problem arose when I updated the BigQuery connector: a previous version wrote a metadata field as a string, and after the update it wrote it as a timestamp, so my subsequent ETLs failed (approx. 700 million records).
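
The micro-batch idea mentioned in the list above can be sketched in a few lines of Python. This is only an illustration of the behavior I would like to see, not something Airbyte did at the time: the `micro_batches` and `ingest` helpers, the `state` dictionary used as a checkpoint and the simulated failure are all hypothetical, and the sketch assumes the source can be re-read in the same order on a retry.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def micro_batches(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group a stream of records into lists of at most `size` items."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def ingest(records: Iterable[dict], size: int,
           load: Callable[[List[dict]], None], state: Dict[str, int]) -> None:
    """Load batch by batch, skipping batches that a previous attempt committed."""
    for index, batch in enumerate(micro_batches(records, size)):
        if index < state["committed"]:
            continue  # already loaded before the failure
        load(batch)
        state["committed"] = index + 1  # checkpoint after each successful batch


# Demo: the loader fails once, on its third call.
source = [{"id": i} for i in range(10)]
destination: List[dict] = []
state = {"committed": 0}
calls = {"n": 0}


def flaky_load(batch: List[dict]) -> None:
    calls["n"] += 1
    if calls["n"] == 3:
        raise RuntimeError("simulated failure")
    destination.extend(batch)


try:
    ingest(source, 3, flaky_load, state)
except RuntimeError:
    pass  # the first attempt dies on the third batch

ingest(source, 3, flaky_load, state)  # the retry resumes at the third batch
print(state["committed"], len(destination))  # prints: 4 10
```

With a checkpoint like this, a failure after 72 hours would cost one batch of work instead of the whole migration.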

I invite you to try this wonderful tool and to ask about its use. It makes my life as a Data Engineer easier, and Data Science teams clearly find it great for its ease of use when generating datasets from different clouds or resources.

https://docs.airbyte.io/quickstart/deploy-airbyte

https://github.com/airbytehq/airbyte


Felipe Veloso

Training to be a Dakar Pilot - ML Engineer and Data Engineer