Data Silos: ETL
Extract, Transform, Load
The process of consolidating your data from multiple different products and sources is known as Extract, Transform and Load (ETL). This should be a familiar phrase for anyone working in data since up to 90% of your time as a data analyst or engineer can be spent doing ETL. Unlike some of the terms we review here, this one is named for exactly what it involves:
- Extract. Downloading your data from software tools is easier today than it was years ago, but it still can be complex. Some tools lets you download everything at once via either their user interface or an API query, others require you to make a request and wait a few hours (or days) and even others have no easy way to get your data. This difficulty is why data collection tools (see Data Collection Tools) were created and are very popular today – they allow you to avoid this problem.
- Transform. Unfortunately, almost every product and tool you use will store its data in a different format. In order to be able to consolidate you need to transform the data from the format of the product into a common format that will allow you to combine it with others. Sometimes this is as simple as changing some terms (e.g. referring to California as “CA” instead of “CALIFORNIA”) and others it means changing the data format entirely (e.g. moving from JSON to SQL). There are many tools that can help with this process, but it is generally the hardest part.
- Load. Finally, you need to load the transformed data from all of your sources into one place. Typically this is a SQL-based system (see Data Storage Tools), but there are many different options available today.
While ETL is conceptually simple, the difficulties lie in all of the details involved in the data for each product you use. It can take a while to refine your ETL process even for a single product, and quite a long time to reliably ETL the data from all of your products. Many companies have dedicated people working on nothing but ETL to ensure they have the best data available at all times.
Even when your ETL is complete you haven’t broken down your data silos! The data from each source will be in the same place but still be separate, until you find some way to join it together. That’s what we’ll cover in the next few sections: different ways to combine your data from different sources after you have it in one place.
Tomorrow we get started with the easiest way, which is joins based on unique identifiers.
Quote of the Day: “You may say I’m a dreamer, but I’m not the only one. I hope someday you’ll join us. And the world will live as one.” ― John Lennon