Hrishikesh Telang

Research Assistant, St. Francis Institute of Technology

telanghrishi [AT] gmail.com

AWS ETL Pipeline Manager for YouTube User Data Analysis

Scalable Data Pipeline for YouTube Insights and Advertising Success

Problem Statement

With the goal of launching a data-driven advertising campaign on YouTube, the customer needed insights to categorize videos based on comments and statistics and to identify factors affecting video popularity. Key metrics like views, likes, shares, and comments were crucial for analyzing user engagement. The dataset, sourced from Kaggle, included structured (CSV) and semi-structured (JSON) data on top trending YouTube videos and their categories. The challenge lay in building an efficient pipeline to process, clean, and analyze this data for actionable insights.


Project Objectives


Methodology and Steps

1. Data Lake Setup

2. Data Processing

3. Data Transformation

4. Visualization


Tools and Technologies


Challenges Faced

  1. Handling Nested JSON Files:
    • The nested JSON structures could not be queried directly, so a custom ETL step in AWS Lambda was implemented to flatten the JSON and convert it to Parquet.
  2. Data Cleansing:
    • Identifying and resolving issues with join keys between the JSON and CSV datasets was critical for successful integration.
  3. Optimization:
    • The data pipeline had to minimize latency and storage footprint without compromising query performance.
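
The flattening step from the first challenge can be sketched roughly as follows. This is a minimal illustration, not the project's actual Lambda code. The helper name flatten_record, the sample payload, and its field names (items, snippet, title, assignable) are assumptions chosen for illustration. The Parquet write is omitted because it depends on whichever library is packaged with the Lambda function.

```python
import json


def flatten_record(record, parent_key="", sep="_"):
    """Recursively flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Descend into nested objects and merge their flattened keys.
            flat.update(flatten_record(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Hypothetical sample payload; the real category file may differ.
raw = json.loads("""
{
  "kind": "listResponse",
  "items": [
    {"id": "1", "snippet": {"title": "Film & Animation", "assignable": true}},
    {"id": "2", "snippet": {"title": "Autos & Vehicles", "assignable": true}}
  ]
}
""")

# One flat row per entry in the nested "items" array.
rows = [flatten_record(item) for item in raw["items"]]
```

Each element of rows is a single-level dictionary (for example with keys id, snippet_title, and snippet_assignable), which maps directly onto the columns of a tabular format such as Parquet.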

Outcome


What I Learned

This project was a significant learning experience in:

  1. Data Engineering:
    • Designing and implementing a scalable data lake.
    • Building ETL pipelines to process and transform large datasets.
  2. AWS Services:
    • Leveraging S3, Glue, Lambda, Athena, and QuickSight for end-to-end data management and visualization.
  3. Data Integration:
    • Handling structured and semi-structured data to create a unified dataset.
  4. Visualization:
    • Creating interactive BI dashboards to communicate key insights effectively.
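
As an illustration of the data integration point, the sketch below joins flattened category records with video statistics after normalizing the join key. It assumes, purely for illustration, that the key mismatch is a type difference (string ids on the JSON side, integer ids on the CSV side). The column names category_id, id, and snippet_title, the helper to_key, and the sample rows are hypothetical.

```python
# Hypothetical statistics rows, as they might look after the CSV has been parsed.
stats = [
    {"video_id": "abc123", "category_id": 1, "views": 1000},
    {"video_id": "xyz789", "category_id": 2, "views": 2500},
    {"video_id": "qqq000", "category_id": 99, "views": 10},  # no matching category
]

# Hypothetical flattened category rows, where the id arrives as a string.
categories = [
    {"id": "1", "snippet_title": "Film & Animation"},
    {"id": " 2", "snippet_title": "Autos & Vehicles"},  # stray whitespace
]


def to_key(value):
    """Normalize a join key to int so that '1', ' 1' and 1 all match."""
    return int(str(value).strip())


# Lookup table keyed by the normalized category id.
category_by_id = {to_key(c["id"]): c["snippet_title"] for c in categories}

# Inner join: keep only statistics rows whose category is known.
unified = [
    {**row, "category": category_by_id[to_key(row["category_id"])]}
    for row in stats
    if to_key(row["category_id"]) in category_by_id
]
```

In a SQL engine, the equivalent fix would typically be a cast on one side of the join condition or a schema change so that both columns share a type. Which of these the project used is not stated here.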

Although the BI dashboard was the visible outcome, the main focus was on mastering ETL pipeline and data warehousing techniques, which makes this a strong portfolio project for roles in Data Engineering, Data Analytics, and Data Science.


Future Enhancements