Team Amrutha.pptx


Page 1 (0s)

Telecom Customer Data Analysis.

Page 2 (10s)

Table of Contents. Introduction. Pipeline design. Data Cleaning.

Page 3 (25s)

Introduction. Telecom companies collect extensive data from their customers and devices, which can be analyzed to gain insights into customer behavior, device performance, and revenue trends. This project focuses on processing and analyzing telecom data to provide valuable business insights. The tasks include data ingestion, transformation, and storage, followed by comprehensive analysis to understand patterns and trends within the telecom industry.

Page 4 (53s)

Targeting High-Volume Data Scalability. Partitioning and replication allow Kafka to handle high throughput and ensure data durability by distributing data and load across multiple brokers.
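A minimal sketch of this setup using kafka-python, assuming a hypothetical broker address and partition/replica counts; only the use of partitioning and replication comes from the slide.

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# 6 partitions spread the write/read load across brokers;
# replication_factor=3 keeps a copy of every partition on 3 brokers for durability.
topic = NewTopic(name="REVENUE", num_partitions=6, replication_factor=3)
admin.create_topics(new_topics=[topic])
admin.close()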

Page 5 (1m 30s)

Producer. CRM.CSV, DEV.CSV, REV.CSV. The Kafka broker stores rev.csv.

Page 6 (1m 50s)

Why ETL? Our proposed architecture uses an ETL pipeline for smooth data extraction, cleaning, validation, and loading for analysis.

Page 7 (2m 17s)

Data Cleaning. Data cleaning is crucial because it ensures the accuracy and reliability of the analysis, leading to more meaningful and actionable insights.

Page 8 (2m 28s)

Batch Extraction & Cleaning - CRM & DEVICE. The code is structured as follows: load performs the extraction of data; transforms implements the cleaning logic on the data; tests validates the data against the requirements; unload transfers the valid data to the sink; main.py orchestrates and orders all stages, as sketched below. The CRM and DEVICE files were sourced from AWS S3 and are managed and updated by an instance of the s3Manager class. Metadata columns are added to the tables before they are transferred to the database.
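A minimal sketch of how main.py might order these stages; the module and function signatures are assumptions, only the stage names (load, transforms, tests, unload) come from the slide.

import pandas as pd

from load import extract_from_s3        # hypothetical: wraps the s3Manager instance
from transforms import clean            # hypothetical cleaning logic
from tests import validate              # hypothetical requirement checks
from unload import write_to_sink        # hypothetical sink writer


def run(table_name: str) -> None:
    df = extract_from_s3(table_name)              # load: pull the CRM/DEVICE CSV from S3
    df = clean(df)                                # transforms: apply the cleaning logic
    validate(df)                                  # tests: raise if requirements fail
    df["ingested_at"] = pd.Timestamp.now(tz="UTC")  # metadata column added before unload
    write_to_sink(df, table_name)                 # unload: transfer the valid data


if __name__ == "__main__":
    for table in ("CRM", "DEVICE"):
        run(table)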

Page 9 (2m 52s)

The CRM table had a total of 13,627,494 rows. There were 3.3% NULL elements among all fields. Except for GENDER and YEAR_OF_BIRTH, all other columns had complete data.

Page 10 (3m 20s)

Transformations - GENDER. PROBLEM. NOTE: A lower bound of 0.7 was set as the threshold for classification.
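A minimal sketch of such a thresholded classification, assuming free-text GENDER values are mapped to MALE/FEMALE by string similarity; the similarity measure is an assumption, only the 0.7 lower bound comes from the slide.

from difflib import SequenceMatcher


def classify_gender(raw, threshold=0.7):
    if raw is None:
        return None
    raw = str(raw).strip().upper()
    scores = {label: SequenceMatcher(None, raw, label).ratio()
              for label in ("MALE", "FEMALE")}
    label, score = max(scores.items(), key=lambda kv: kv[1])
    # Below the lower bound the value stays unclassified (NULL).
    return label if score >= threshold else None


print(classify_gender("male"))    # MALE
print(classify_gender("F"))       # similarity below 0.7 -> None under this sketch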

Page 11 (3m 58s)

Transformations - YEAR_OF_BIRTH. SOLUTION. The set of observed YOBs had a cardinality of 124. There were 0.02% NULL values, and 9 rows had a year of birth after 2024.

Page 12 (4m 26s)

Transformations - PK. MSISDN was required to be the unique primary key of the CRM table. There were 0 NULL values in MSISDN. There were 2,264,513 duplicate MSISDNs, corresponding to 20.1% of the elements.
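A minimal pandas sketch of enforcing the key, assuming duplicates are resolved by keeping the first occurrence per MSISDN; the actual resolution policy and file path are assumptions.

import pandas as pd

crm = pd.read_csv("crm.csv")                      # hypothetical local copy of CRM.CSV

n_dupes = crm["MSISDN"].duplicated().sum()
print(f"duplicate MSISDNs: {n_dupes} ({n_dupes / len(crm):.1%})")

# Enforce MSISDN as the unique primary key.
crm = crm.drop_duplicates(subset="MSISDN", keep="first")
assert crm["MSISDN"].is_unique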

Page 13 (4m 57s)

Interactive Cleaning - DEVICE TABLE. The DEVICE table was specified to have a composite primary key consisting of the columns (MSISDN, IMEI_TAC). There were 9.6% NULL elements among all fields. There were no further “client” constraints or requirements for validation in the DEVICE table.

Page 14 (5m 15s)

Exploratory Analysis of the DEVICE Table. RESULTS. Approximately 9% of the NULL values could be imputed using the defined mappings.
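A minimal sketch, assuming the defined mappings are IMEI_TAC-to-attribute lookups built from complete rows; the column names and file path are assumptions based on fields mentioned elsewhere in the deck.

import pandas as pd

device = pd.read_csv("device.csv")                       # hypothetical local copy

for col in ("brand", "os_vendor"):
    # Build the mapping from rows where the attribute is present.
    mapping = (device.dropna(subset=[col])
                     .drop_duplicates("IMEI_TAC")
                     .set_index("IMEI_TAC")[col])
    # Impute NULLs by looking the attribute up via IMEI_TAC.
    device[col] = device[col].fillna(device["IMEI_TAC"].map(mapping))

print(device[["brand", "os_vendor"]].isna().mean())      # remaining NULL share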

Page 15 (5m 38s)

Imputation Trade-Offs. A ‘factor’ significantly enhanced the overall imputation.

Page 16 (5m 59s)

NoSQL Storage Choice Comparison. PROS: Offers high throughput with a fault-tolerant architecture and no single point of failure. Clusters may be reconfigured and resized dynamically, which provides high availability of data.

Page 17 (6m 32s)

Data pipeline. Data pipelines are essential because they automate the efficient and seamless flow of data from various sources to analytics tools, ensuring timely and accurate insights.

Page 18 (6m 43s)

Batch Producer - S3 to Kafka. Fetches CSV data from S3.
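A minimal sketch of such a batch producer, assuming boto3 and kafka-python with hypothetical bucket, key, and broker names; each CSV row is published as a JSON message to the REVENUE topic named later in the deck.

import csv
import io
import json

import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="telecom-raw-data", Key="rev.csv")   # hypothetical names
body = obj["Body"].read().decode("utf-8")
rows = csv.DictReader(io.StringIO(body))

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
for row in rows:
    producer.send("REVENUE", value=row)
producer.flush()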

Page 19 (7m 0s)

Streaming data - rev1.csv. Producer-system analysis.

Page 20 (7m 19s)

SPARK BATCH INGESTION & VALIDATION. Rev validation.

Page 21 (7m 52s)

Spark Consumers & Producers. Topics: REVENUE, REVENUE_OFFSETS.

Page 22 (8m 8s)

Kafka Topic. Spark streaming consumer workflow.
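A minimal PySpark Structured Streaming sketch of the consumer workflow, assuming a local broker, JSON messages on the REVENUE topic, and a hypothetical message schema; the console sink stands in for the real one.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("revenue-consumer").getOrCreate()

# Hypothetical schema for rev.csv rows.
schema = (StructType()
          .add("MSISDN", StringType())
          .add("WEEK_NUMBER", StringType())
          .add("REVENUE_USD", DoubleType()))

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "REVENUE")
          .load())

parsed = (stream.selectExpr("CAST(value AS STRING) AS json")
                .select(F.from_json("json", schema).alias("rec"))
                .select("rec.*"))

query = (parsed.writeStream
               .format("console")          # placeholder sink for this sketch
               .outputMode("append")
               .start())
query.awaitTermination()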

Page 23 (8m 29s)

Data Pre-processing and Testing: Customer relationship management (CRM) and Device (DEV) tables are pre-processed and tested against various integrity constraints in MongoDB before ingestion into the Spark session. Join Operation: An outer join is performed on the two tables, ensuring all records from both datasets are included and providing a comprehensive view of the data. Schema Generation and Data Storage: The final schema for analysis is generated in the Spark session before the data is pushed into Snowflake for analytical purposes, enabling efficient data retrieval and analysis.
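A minimal PySpark sketch of the join and Snowflake push, assuming the spark-snowflake connector is on the classpath; paths, credentials, and the target table name are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("crm-device-join").getOrCreate()

crm = spark.read.parquet("s3a://telecom-clean/crm/")        # hypothetical locations
device = spark.read.parquet("s3a://telecom-clean/device/")

# Full outer join keeps records that appear in only one of the two tables.
joined = crm.join(device, on="MSISDN", how="outer")

sf_options = {                                               # standard connector keys
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "TELECOM",
    "sfSchema": "RAW",
    "sfWarehouse": "COMPUTE_WH",
}

(joined.write
       .format("net.snowflake.spark.snowflake")
       .options(**sf_options)
       .option("dbtable", "CRM_DEVICE_JOINED")               # hypothetical target table
       .mode("overwrite")
       .save())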

Page 24 (8m 59s)

Data Transformations. Snowflake Ingestion. Load data into the Kafka topic.

Page 25 (9m 15s)

Data Warehouse Choice Comparison. Snowflake's architecture is designed to operate seamlessly across multiple cloud platforms (AWS, Azure, GCP).

Page 26 (9m 41s)

De-duplication. Revenue Aggregation. REVENUE. UNIFIED REVENUE.

Page 27 (10m 4s)

A Python script, orchestrated by a DAG, activates the Snowflake connector and generates analytical views within the Snowflake data warehouse from the provided queries, automating the entire process for in-depth analysis.
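A minimal sketch of such a script using snowflake-connector-python; the view definition, column names, and connection parameters are placeholders, not the provided queries.

import snowflake.connector

ANALYTICAL_VIEWS = {
    "WEEKLY_REVENUE_BY_GENDER": """
        CREATE OR REPLACE VIEW ANALYSE.WEEKLY_REVENUE_BY_GENDER AS
        SELECT WEEK_NUMBER, GENDER, SUM(REVENUE_USD) AS TOTAL_REVENUE
        FROM CLEANED.UNIFIED_REVENUE
        GROUP BY WEEK_NUMBER, GENDER
    """,
}

def create_views():
    conn = snowflake.connector.connect(
        account="<account>", user="<user>", password="<password>",
        warehouse="COMPUTE_WH", database="TELECOM",
    )
    try:
        cur = conn.cursor()
        for name, ddl in ANALYTICAL_VIEWS.items():
            cur.execute(ddl)                 # each provided query becomes a view
            print(f"created view {name}")
    finally:
        conn.close()

if __name__ == "__main__":
    create_views()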

Page 28 (10m 29s)

Snowflake-to-S3 Glacier Data Migration Orchestration.
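A minimal sketch of such a migration, assuming an existing S3-backed external stage; the stage, bucket, table, and credential names are placeholders.

import boto3
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="COMPUTE_WH", database="TELECOM", schema="CLEANED",
)
# Step 1: unload the table to the S3-backed external stage as compressed CSV.
conn.cursor().execute(
    "COPY INTO @archive_stage/unified_revenue/ FROM UNIFIED_REVENUE "
    "FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP) OVERWRITE = TRUE"
)
conn.close()

# Step 2: move the unloaded objects to the Glacier storage class.
s3 = boto3.client("s3")
bucket, prefix = "telecom-archive", "unified_revenue/"       # hypothetical names
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=bucket, Key=obj["Key"],
            CopySource={"Bucket": bucket, "Key": obj["Key"]},
            StorageClass="GLACIER",
        )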

Page 29 (10m 52s)

Future Scope - Section I. REV CLEANER CONSUMER-PRODUCER SYSTEM.

Page 30 (11m 22s)

DBT stands for Data Build Tool. It allows data analysts and engineers to transform data in their warehouse more effectively.

Page 31 (11m 32s)

Visualization. Staging. Raw Tables. Cleaned Tables. AWS S3.

Page 32 (11m 40s)

Extraction. CRM.csv, DEVICE.csv, REV.csv. Amazon S3 Integration. External Stage. Snowpipe.

Page 33 (11m 58s)

Loading. crm. Device. rev. External Stage. Data-Warehouse.

Page 34 (12m 12s)

Snowflake Structure. Raw. Schema. Cleaned. Analyse.

Page 35 (12m 21s)

DBT Structure. Models Directory: Organized into subdirectories for specific tasks, such as data_cleaning and analysis. Data Cleaning Subdirectory: Includes models for cleaning CRM, revenue, and device data. Analysis Subdirectory: Contains models for analyzing cleaned data. Tests Directory: Holds test files to validate data quality. Seeds Directory: Contains raw data files (CSV format). Macros Directory: Includes reusable SQL snippets.

Page 36 (12m 43s)

Transformation.

Page 37 (12m 48s)

Seeding the Dataset. Purpose: Enhance data cleaning by standardizing gender values. DBT Configuration: Create seed_gender_mapping.yml in the models directory; specify the schema, column names, data types, and constraints. Execution: Run the dbt seed command to load data from the CSV into the cleaned.gender_mapping table. Verification: Query the cleaned.gender_mapping table in Snowflake to confirm successful seeding.

Page 38 (13m 9s)

Cleaning. Gender cleaning. Handling MSISDN duplicates.

Page 39 (13m 25s)

Test and Validation. Valid Gender: Implemented accepted_values test for gender column. Ensures values are 'MALE', 'FEMALE', or NULL. Unique and Non-null MSISDN: Enforced not_null and unique constraints on msisdn columns. Ensures data completeness and integrity. Valid Year of Birth: Ensures YEAR_OF_BIRTH format and range are correct. Validates the year is less than the current year..

Page 40 (13m 45s)

Orchestration of DBT through Airflow. The workflow seeds the gender_mapping data with seed_dbt, transforms crm and device with run_dbt_transform, validates them with run_dbt_tests, and conducts comprehensive analysis with run_dbt_analysis, as in the DAG sketch below.
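A minimal Airflow DAG sketch of this ordering; the task ids and the data_cleaning/analysis selectors follow the deck, while the dbt project path and schedule are assumptions.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/dbt/telecom"            # hypothetical project location

with DAG(
    dag_id="dbt_telecom_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    seed_dbt = BashOperator(
        task_id="seed_dbt",
        bash_command=f"dbt seed --project-dir {DBT_DIR}",
    )
    run_dbt_transform = BashOperator(
        task_id="run_dbt_transform",
        bash_command=f"dbt run --select data_cleaning --project-dir {DBT_DIR}",
    )
    run_dbt_tests = BashOperator(
        task_id="run_dbt_tests",
        bash_command=f"dbt test --project-dir {DBT_DIR}",
    )
    run_dbt_analysis = BashOperator(
        task_id="run_dbt_analysis",
        bash_command=f"dbt run --select analysis --project-dir {DBT_DIR}",
    )

    seed_dbt >> run_dbt_transform >> run_dbt_tests >> run_dbt_analysis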

Page 42 (14m 4s)

Visualizations.

Page 43 (14m 10s)

Total number of devices & active devices: week-wise.

Page 44 (14m 18s)

Total Revenue by Male vs Female customers. Week-wise.

Page 45 (14m 27s)

Total Revenue generated by each age group. Total Revenue distribution by each age group.

Page 46 (14m 36s)

Total Revenue generated by each mobile type. Total Revenue distribution by mobile_type.

Page 47 (14m 45s)

Total Revenue distribution by mobile_type. Prepaid.

Page 48 (14m 52s)

Total Revenue generated by each brand. Total Revenue generated by each os_vendor.

Page 49 (15m 1s)

Data Normalization. Organizing data to minimize redundancy and dependency, enhancing consistency and integrity in a database.

Page 50 (15m 11s)

Removed duplicate rows. Now it’s in 1st Normal Form.