Building an End-to-End Data Pipeline with Google Cloud, dbt, and Looker

In this blog post, I'll walk you through the journey of building an end-to-end data pipeline using Google Cloud Storage, BigQuery, dbt, and Looker. This project was a fascinating exploration of modern data engineering practices, leveraging powerful tools to transform raw data into actionable insights.

Problem Definition

In this challenge, my goal was to learn more about the World Bank's economic indicators through analytics. To achieve this, I needed a dataset capturing valuable economic indicators such as GDP, inflation, and the labour force participation rate.

Extract, Transform, Load (ETL) Process

The ETL process took place in Mage AI, a data orchestration tool similar to Apache Airflow. Mage lets you integrate and synchronise data from third-party sources, build real-time and batch pipelines to transform data, and load the results into various cloud platforms, all from a single tool.

1. Extracting Data from Source

The first step in any ETL process is to extract data from the source. For this project, I connected to the World Bank's DataBank to retrieve data on the World Development Indicators. I selected the countries, indicators, and time span of interest, exported a CSV file, and saved it to my GitHub repository. I then used the raw file URL to access the data.

Handling connections and extraction issues cleanly was critical, so the loader script included assertion tests to confirm the data was fetched correctly before the pipeline moved on.
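
For reference, here is a minimal sketch of what a Mage data-loader block for this step can look like. The URL, constant, and function names are placeholders of my own, not necessarily those used in the project:

    import io

    import pandas as pd
    import requests

    if 'data_loader' not in globals():
        from mage_ai.data_preparation.decorators import data_loader
    if 'test' not in globals():
        from mage_ai.data_preparation.decorators import test

    # Placeholder raw-file URL -- substitute the path to the exported CSV on GitHub.
    CSV_URL = 'https://raw.githubusercontent.com/<user>/<repo>/main/wdi_export.csv'

    @data_loader
    def load_wdi_csv(*args, **kwargs):
        """Fetch the World Development Indicators export and return it as a dataframe."""
        response = requests.get(CSV_URL)
        response.raise_for_status()
        return pd.read_csv(io.StringIO(response.text))

    @test
    def test_output(output, *args) -> None:
        # Simple assertions so extraction problems surface immediately.
        assert output is not None, 'No data was loaded'
        assert len(output) > 0, 'The loaded dataframe is empty'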

2. Transforming Data

Once the data was extracted, the next step was to clean, filter, and transform it to meet the project's requirements. This involved:

  • Renaming columns
  • Unpivoting the data from wide to long format
  • Filtering out unnecessary information
  • Cleaning the data to ensure consistency
  • Structuring the data in a usable format

Data transformation is crucial, as it directly affects the quality and reliability of the data; the snippet below sketches these steps.
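
As a rough illustration, here is a Mage transformer block that performs these steps with pandas. The column names reflect a typical World Bank DataBank export (country, series, and one column per year) and may differ from the actual file:

    import pandas as pd

    if 'transformer' not in globals():
        from mage_ai.data_preparation.decorators import transformer

    @transformer
    def transform(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
        # Assumed export layout: one row per country/indicator, one column per year.
        id_cols = ['Country Name', 'Country Code', 'Series Name', 'Series Code']
        year_cols = [c for c in df.columns if c not in id_cols]

        # Unpivot from wide (one column per year) to long (one row per year).
        long_df = df.melt(id_vars=id_cols, value_vars=year_cols,
                          var_name='year', value_name='value')

        # Rename columns to snake_case for downstream modelling.
        long_df = long_df.rename(columns={
            'Country Name': 'country_name',
            'Country Code': 'country_code',
            'Series Name': 'indicator_name',
            'Series Code': 'indicator_code',
        })

        # Clean up: pull the four-digit year out of headers like '2010 [YR2010]',
        # coerce values to numbers, and drop rows with no reported value.
        long_df['year'] = long_df['year'].str.extract(r'(\d{4})', expand=False).astype(int)
        long_df['value'] = pd.to_numeric(long_df['value'], errors='coerce')
        return long_df.dropna(subset=['value'])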

3. Loading Data into Google Cloud Storage (GCS)

After transforming the data, it was loaded into Google Cloud Storage. GCS provides a scalable and reliable solution for storing large datasets. Proper organisation and security measures were implemented to ensure data integrity and accessibility.
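
A minimal sketch of the export step, written as a Mage data-exporter block that calls the google-cloud-storage client directly; the bucket and object names are placeholders:

    import pandas as pd
    from google.cloud import storage

    if 'data_exporter' not in globals():
        from mage_ai.data_preparation.decorators import data_exporter

    # Placeholder bucket and object names.
    BUCKET_NAME = 'wdi-indicators-bucket'
    OBJECT_NAME = 'raw/wdi_indicators.csv'

    @data_exporter
    def export_to_gcs(df: pd.DataFrame, **kwargs) -> None:
        """Write the transformed dataframe to Google Cloud Storage as a CSV object."""
        client = storage.Client()  # uses the service-account credentials configured for the pipeline
        bucket = client.bucket(BUCKET_NAME)
        blob = bucket.blob(OBJECT_NAME)
        blob.upload_from_string(df.to_csv(index=False), content_type='text/csv')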

4. Creating an External Table in BigQuery

With the data stored in GCS, the next step was to create an external table in BigQuery. This allowed direct access to the data in GCS without the need to duplicate it. Configuring the external table correctly was essential for seamless data access and optimal performance.
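
In BigQuery this comes down to a single DDL statement along these lines; the project, dataset, and bucket names are placeholders, and the options match a CSV file like the one written in the previous step:

    CREATE OR REPLACE EXTERNAL TABLE `my-project.wdi_raw.external_wdi` (
      country_name    STRING,
      country_code    STRING,
      indicator_name  STRING,
      indicator_code  STRING,
      year            STRING,  -- kept as strings here; typing happens in the dbt staging layer
      value           STRING
    )
    OPTIONS (
      format = 'CSV',
      uris = ['gs://wdi-indicators-bucket/raw/wdi_indicators.csv'],
      skip_leading_rows = 1
    );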

5. Designing Dimension and Fact Tables in BigQuery

To organise the data for analytics, dimension and fact tables were designed and created in BigQuery. This step involved careful planning to ensure that the schema was robust and efficient for querying. Dimension tables store descriptive attributes, while fact tables store measurable, quantitative data.
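
As an illustration of the kind of star schema this leads to (the table and column names here are mine, not necessarily the project's):

    -- Dimension tables: descriptive attributes.
    CREATE TABLE IF NOT EXISTS `my-project.wdi_core.dim_country` (
      country_code STRING NOT NULL,
      country_name STRING
    );

    CREATE TABLE IF NOT EXISTS `my-project.wdi_core.dim_indicator` (
      indicator_code STRING NOT NULL,
      indicator_name STRING
    );

    -- Fact table: one measurable value per country, indicator and year.
    CREATE TABLE IF NOT EXISTS `my-project.wdi_core.fact_indicator_value` (
      country_code   STRING,   -- joins to dim_country
      indicator_code STRING,   -- joins to dim_indicator
      year           INT64,
      value          FLOAT64
    );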

Data Transformation with dbt (Data Build Tool)

6. Connecting dbt to Google Cloud Project

The next phase of the project involved connecting dbt to the Google Cloud Project. dbt is a powerful tool that allows for the transformation and management of data within a data warehouse. Setting up dbt required configuring connections and ensuring it could access the necessary resources.
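
If you run dbt Core locally, that connection lives in profiles.yml; dbt Cloud captures the same details (service-account key, project, dataset) through its connection settings instead. A sketch with placeholder values:

    wdi_project:                       # profile name referenced in dbt_project.yml
      target: dev
      outputs:
        dev:
          type: bigquery
          method: service-account
          keyfile: /path/to/service-account-key.json
          project: my-project          # GCP project id
          dataset: wdi_dev             # default BigQuery dataset for dbt models
          threads: 4
          location: EU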

7. Creating a Staging Folder in dbt

In the dbt models directory, a staging folder was created to manage intermediate data transformations. This folder holds the staging models, which read the raw source data and apply light cleaning and renaming before anything further is built on top.

8. Defining Source Configurations with a YAML File

A source YAML file was created to define the source configurations for the dbt project. This file declares where the raw data lives (the BigQuery project, dataset, and table), so that models can reference it consistently through dbt's source() function.
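
A sources file for this project might look roughly like the following, pointing dbt at the external table created earlier (names are placeholders):

    # models/staging/sources.yml
    version: 2

    sources:
      - name: wdi_raw
        database: my-project           # GCP project
        schema: wdi_raw                # BigQuery dataset holding the external table
        tables:
          - name: external_wdi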

9. Writing SQL Files for Data Loading

SQL models were written to select data from the source into the staging tables. Writing efficient queries here was key to ensuring data loading performance and accuracy.
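
A dbt staging model is just a select statement that dbt materialises as a view or table. A sketch, assuming the source and column names used above:

    -- models/staging/stg_wdi.sql
    select
        country_code,
        country_name,
        indicator_code,
        indicator_name,
        cast(year as int64)    as year,
        cast(value as float64) as value
    from {{ source('wdi_raw', 'external_wdi') }}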

10. Documenting Tables with a Model File

A model file was created to document each table and its respective columns. This documentation provided clear metadata for the project, aiding in understanding the data structure and facilitating collaboration.
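
In dbt this documentation lives in a YAML file alongside the models; an abridged example for the staging model sketched above:

    # models/staging/schema.yml
    version: 2

    models:
      - name: stg_wdi
        description: "World Development Indicators, one row per country, indicator and year."
        columns:
          - name: country_code
            description: "World Bank country code from the source export."
          - name: value
            description: "Reported value of the indicator for that country and year."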

11. Loading Data into Marts/Core Model Directory

Data was then moved and transformed into the marts/core model directory in dbt. This step prepared the data for analytics by further transforming it as needed.

12. Creating Dimension and Fact Tables in dbt

Dimension and fact tables were created in dbt to organise the data for analytics. This required careful design to ensure the tables supported the necessary analytics and reporting needs.
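
Each mart model is its own SQL file that builds on the staging layer via ref(). Two abridged sketches (one dimension, one fact), reusing the illustrative names from earlier:

    -- models/marts/core/dim_country.sql
    select distinct
        country_code,
        country_name
    from {{ ref('stg_wdi') }}

    -- models/marts/core/fact_indicator_value.sql (a separate file in practice)
    select
        country_code,
        indicator_code,
        year,
        value
    from {{ ref('stg_wdi') }}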

13. Adding Tests in dbt

To ensure data quality, tests were added in dbt. These tests included checks for unique values, not null constraints, and other data integrity measures. Adding tests helped maintain data quality and catch any issues early in the process.
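
Generic tests are declared in YAML next to the models; a minimal example covering the uniqueness and not-null checks mentioned above:

    # models/marts/core/schema.yml (tests section only)
    version: 2

    models:
      - name: dim_country
        columns:
          - name: country_code
            tests:
              - unique
              - not_null
      - name: fact_indicator_value
        columns:
          - name: value
            tests:
              - not_null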

Deployment and Visualisation

14. Deploying to BigQuery

Once the data models and transformations were complete, they were deployed to BigQuery. This made the data available for querying and analysis.

15. Connecting to Google Looker

Finally, the data was connected to Google Looker for analytics and visualisation. Looker is a powerful BI tool that allows for the creation of interactive dashboards and reports. Setting up Looker involved configuring connections and creating visualisations to derive insights from the data.

Conclusion

This project was a comprehensive exercise in modern data engineering. By leveraging Google Cloud Storage, BigQuery, dbt, and Looker, I was able to transform raw data into actionable insights. Each step of the process, from extraction to visualisation, played a crucial role in ensuring data quality and usability.

The experience provided valuable insights into the intricacies of data engineering, from handling data extraction issues to designing robust schemas and ensuring data quality with dbt. Connecting the data to Looker allowed for powerful visualisations that can drive informed decision-making.

Writing this walkthrough has helped me consolidate my own understanding of the process. The main lesson from this project is the importance of deciding at the outset what you want the star schema, and the analytics built on it, to accomplish. Once that question is concrete, the rest becomes much easier: you can conceptualise the model by drawing it out, then add the logic to make it a reality. From there you can identify the data sources, build the pipeline to extract the raw data, and either transform it before loading or load it directly into the data warehouse, where it can be staged, improved with transformations, and finally shaped into the star schema that delivers insights for the organisation.

GitHub