Telecom Customer Data Analysis.
Table of Contents. Introduction. Pipeline Design. Data Cleaning.
Introduction. Telecom companies collect extensive data from their customers and devices, which can be analyzed to gain insights into customer behavior, device performance, and revenue trends. This project focuses on processing and analyzing telecom data to provide valuable business insights. The tasks include data ingestion, transformation, and storage, followed by comprehensive analysis to understand patterns and trends within the telecom industry.
Targeting High-Volume Data Scalability. Partitioning and replication allow Kafka to handle high throughput and ensure data durability by distributing data and load across multiple brokers.
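The mechanism behind this can be sketched as follows: records are routed to partitions deterministically by key, so per-key ordering is preserved while load spreads across brokers. This is a minimal illustration only — Kafka's actual default partitioner uses murmur2 hashing; the MD5 stand-in and the example key are assumptions.

```python
import hashlib

def assign_partition(key: str, num_partitions: int) -> int:
    """Map a record key to a partition, illustrating how Kafka spreads
    load across brokers. (Kafka itself uses murmur2 hashing; md5 here
    is only a stand-in for the idea of deterministic key hashing.)"""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Records with the same key always land on the same partition,
# so per-key ordering is preserved while load is distributed.
p1 = assign_partition("msisdn-12345", 3)
p2 = assign_partition("msisdn-12345", 3)
assert p1 == p2 and 0 <= p1 < 3
```

Replication then stores each partition on several brokers, which is what provides the durability and fault tolerance noted above.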
Producer: CRM.CSV, DEV.CSV, REV.CSV. The Kafka broker stores rev.csv.
Why ETL? We use an ETL pipeline in our proposed architecture for smooth data extraction, cleaning, validation, and loading for analysis.
Data Cleaning. Data cleaning is crucial because it ensures the accuracy and reliability of the analysis, leading to more meaningful and actionable insights..
Batch Extraction & Cleaning - CRM & DEVICE. The code is structured as follows: load: performs extraction of the data; transforms: implements cleaning logic on the data; tests: validates data against requirements; unload: transfers valid data to the sink; main.py: orchestrates and orders all stages. Files for CRM and DEVICE data were sourced from AWS S3, managed and updated by an instance of the s3Manager class. Meta-data columns are added to the tables before transfer to the database.
The CRM table had a total of 13,627,494 rows. There were 3.3% NULL elements among all fields. Except for GENDER and YEAR_OF_BIRTH, all other columns had complete data.
Transformations - GENDER. NOTE: A lower bound of 0.7 was set as the threshold for classification.
Transformations - YEAR_OF_BIRTH. SOLUTION. The set of observed YOBs had a cardinality of 124. There were 0.02% NULL values, and 9 rows had a year of birth later than 2024.
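A simple validity check consistent with the above might look like this. The upper bound of 2024 follows from the slide (9 rows had a later year); the 1900 lower bound is an assumed sanity check, not taken from the slide.

```python
def valid_year_of_birth(yob, max_year=2024, min_year=1900):
    """Flag a YEAR_OF_BIRTH as valid. The slide reports 9 rows with a
    year of birth after 2024; the 1900 lower bound is an assumption."""
    if yob is None:
        return False  # the 0.02% NULLs are handled separately
    try:
        year = int(yob)
    except (TypeError, ValueError):
        return False
    return min_year <= year <= max_year
```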
Transformations - PK. MSISDN was required to be the unique primary key of the CRM table. There were 0 NULL values in MSISDN. There were 2,264,513 duplicate MSISDNs, corresponding to 20.1% of the elements.
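De-duplicating so MSISDN can serve as a unique key can be sketched as below. The keep-first policy is an assumption — the actual pipeline may rank duplicates by some other criterion (e.g. recency or completeness).

```python
def dedupe_on_msisdn(rows):
    """Keep the first occurrence of each MSISDN so the column can
    serve as a unique primary key. Keep-first is an assumed policy."""
    seen, unique = set(), []
    for row in rows:
        key = row["msisdn"]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

In Spark the equivalent would be a `dropDuplicates(["msisdn"])` on the DataFrame.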
Interactive Cleaning - DEVICE TABLE. The DEVICE table was specified to have a composite primary key consisting of the columns (MSISDN, IMEI_TAC). There were 9.6% NULL elements among all fields. There were no further “client” constraints or requirements for validation in the DEVICE table.
Exploratory Analysis of DEVICE Table. RESULTS. Approximately 9% NULL values could be imputed using the defined mappings..
Imputation Trade-Offs. A ‘factor’ significantly enhanced the overall imputation..
NoSQL Storage Choice Comparison. PROS: Offers high throughput with a fault-tolerant architecture and no single point of failure. Clusters can be reconfigured and resized dynamically, providing high availability of data.
Data pipeline. Data pipelines are essential because they automate the efficient and seamless flow of data from various sources to analytics tools, ensuring timely and accurate insights..
Batch Producer - S3 to Kafka. Fetches CSV data from S3.
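The producer's core step — parsing fetched CSV content and publishing each row — can be sketched as below. The S3 and Kafka wiring is shown only in comments because it needs live services; the topic name, the msisdn keying, and the client libraries (boto3, kafka-python) are assumptions, with the publish logic factored out so it works with any `send` callable.

```python
import csv
import io

def publish_csv(csv_text, topic, send):
    """Parse CSV text (as fetched from S3) and hand each row to `send`,
    which in production would wrap KafkaProducer.send. Keying on msisdn
    is an assumption so related records share a partition."""
    reader = csv.DictReader(io.StringIO(csv_text))
    count = 0
    for row in reader:
        send(topic, key=row.get("msisdn"), value=row)
        count += 1
    return count

# In production (sketch only, not executed here):
#   producer = KafkaProducer(bootstrap_servers="broker:9092")
#   body = s3.get_object(Bucket=bucket, Key="rev.csv")["Body"].read().decode()
#   publish_csv(body, "revenue", lambda t, key, value: ...)
```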
Streaming Data - rev1.csv. Producer-System Analysis.
SPARK BATCH INGESTION & VALIDATION. Rev validation.
Spark Consumers & Producers. TOPIC REVENUE TOPIC REVENUE_OFFSETS.
Kafka Topic. Spark - Streaming consumer workflow.
Data Pre-processing and Testing: Customer relationship management (CRM) and Device (DEV) tables are pre-processed and tested against various integrity constraints in MongoDB before ingestion into the Spark session. Join Operation: An outer join is performed on the two tables, ensuring all records from both datasets are included and providing a comprehensive view of the data. Schema Generation and Data Storage: The final schema for analysis is generated in the Spark session before the data is pushed into Snowflake for analytical purposes, enabling efficient data retrieval and analysis.
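The outer-join semantics described above can be illustrated in plain Python: every key from either side appears in the result, with the missing side filled with NULLs. In Spark the same step would be roughly `crm.join(device, on="msisdn", how="full_outer")` (the DataFrame names are assumptions).

```python
def full_outer_join(crm_rows, device_rows, key="msisdn"):
    """Plain-Python illustration of the full outer join that unifies
    CRM and DEVICE: every key from either side appears in the result,
    with None standing in for the missing side."""
    crm = {r[key]: r for r in crm_rows}
    dev = {r[key]: r for r in device_rows}
    joined = []
    for k in sorted(crm.keys() | dev.keys()):
        joined.append({key: k,
                       "crm": crm.get(k),
                       "device": dev.get(k)})
    return joined
```

This is why no customer or device record is lost at the join step, at the cost of NULL-padded rows where one side is absent.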
Data Transformations. Snowflake Ingestion. Load data into a Kafka topic.
Data Warehouse Choice Comparison. Snowflake's architecture is designed to operate seamlessly across multiple cloud platforms (AWS, Azure, GCP).
De-duplication. Revenue Aggregation. REVENUE. UNIFIED REVENUE.
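The de-duplication and aggregation steps that produce the unified revenue figure can be sketched as below. De-duplicating on the full (msisdn, week, amount) tuple is an assumption — the real pipeline may key on a record identifier instead.

```python
from collections import defaultdict

def unify_revenue(records):
    """De-duplicate revenue records, then aggregate amounts per MSISDN
    into a unified revenue figure. Deduping on the full tuple is an
    assumed policy for illustration."""
    deduped = {(r["msisdn"], r["week"], r["amount"]) for r in records}
    totals = defaultdict(float)
    for msisdn, _week, amount in deduped:
        totals[msisdn] += amount
    return dict(totals)
```

A Spark equivalent would be `dropDuplicates()` followed by `groupBy("msisdn").agg(sum("amount"))`.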
A Python script, orchestrated by a DAG, activates the Snowflake connector and generates analytical views within the Snowflake data warehouse based on the provided queries, automating the entire process for in-depth analysis.
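The view-generation step might be structured as below. The connector wiring is shown only in a comment; the function accepts any object with an `execute()` method (which keeps it testable), and the view names and queries are supplied by the caller — all specific names here are hypothetical.

```python
def create_views(cursor, views):
    """Issue CREATE OR REPLACE VIEW statements through a Snowflake
    connector cursor, as the DAG-triggered script would. `cursor` can
    be any object exposing execute(), so the logic is testable without
    a live warehouse."""
    created = []
    for name, query in views.items():
        cursor.execute(f"CREATE OR REPLACE VIEW {name} AS {query}")
        created.append(name)
    return created

# In production (sketch): conn = snowflake.connector.connect(...)
# create_views(conn.cursor(), {"weekly_revenue": "SELECT ..."})
```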
Snowflake-to-S3 Glacier Data Migration Orchestration.
Future Scope - Section I. REV CLEANER CONSUMER-PRODUCER SYSTEM.
DBT stands for Data Build Tool. It allows data analysts and engineers to transform data in their warehouse more effectively..
Visualization. Staging. Raw Tables. Cleaned Tables. AWS S3.
Extraction. CRM.csv DEVICE.csv REV.csv Amazon S3 Integration External Stage Snowpipe.
Loading. crm. Device. rev. External Stage. Data-Warehouse.
Snowflake Structure. Raw. Schema. Cleaned. Analyse.
DBT Structure. Models Directory: Organized into subdirectories for specific tasks, such as data_cleaning and analysis. Data Cleaning Subdirectory: Includes models for cleaning CRM, revenue, and device data. Analysis Subdirectory: Contains models for analyzing cleaned data. Tests Directory: Holds test files to validate data quality. Seeds Directory: Contains raw data files (CSV format). Macros Directory: Includes reusable SQL snippets..
Transformation.
Seeding the DataSet. Purpose: Enhance data cleaning by standardizing gender values. DBT Configuration: Create seed_gender_mapping.yml in the models directory. Specify schema, column names, data types, and constraints. Execution: Run dbt seed command to load data from CSV to the cleaned.gender_mapping table. Verification: Query the cleaned.gender_mapping table in Snowflake to confirm successful seeding..
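A minimal sketch of what the seed properties file might contain is shown below, using dbt's standard seed-properties syntax. The column names and types are assumptions — only the file name, the target schema, and the resulting cleaned.gender_mapping table come from the slide.

```yaml
version: 2

seeds:
  - name: gender_mapping
    config:
      schema: cleaned          # lands as cleaned.gender_mapping
      column_types:            # column names/types are assumptions
        raw_value: varchar
        gender: varchar
```

Running `dbt seed` then loads the CSV from the seeds directory into the cleaned.gender_mapping table, which can be verified with a query in Snowflake.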
Cleaning. Gender cleaning. Handling MSISDN duplicates.
Test and Validation. Valid Gender: Implemented accepted_values test for gender column. Ensures values are 'MALE', 'FEMALE', or NULL. Unique and Non-null MSISDN: Enforced not_null and unique constraints on msisdn columns. Ensures data completeness and integrity. Valid Year of Birth: Ensures YEAR_OF_BIRTH format and range are correct. Validates the year is less than the current year..
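The accepted_values, not_null, and unique tests named above are built-in dbt generic tests and would be declared in a schema file roughly as follows. The model name is an assumption, and the year-of-birth range check is shown with dbt_utils.accepted_range, which presumes the dbt_utils package is installed — the project may instead use a custom test.

```yaml
version: 2

models:
  - name: cleaned_crm            # model name assumed
    columns:
      - name: gender
        tests:
          - accepted_values:
              values: ['MALE', 'FEMALE']   # NULLs pass by default
      - name: msisdn
        tests:
          - not_null
          - unique
      - name: year_of_birth
        tests:
          - dbt_utils.accepted_range:      # requires dbt_utils package
              max_value: 2024
```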
Orchestration of DBT through Airflow. The workflow seeds the gender_mapping data with seed_dbt, transforms the CRM and device data with run_dbt_transform, validates it with run_dbt_tests, and conducts comprehensive analysis with run_dbt_analysis.
Visualizations.
Total number of devices & active devices: week-wise.
Total Revenue by Male vs Female customers. Week-wise.
Total Revenue generated by each age group. Total Revenue distribution by each age group.
Total Revenue generated by each mobile type. Total Revenue distribution by mobile_type.
Total Revenue distribution by mobile_type. Prepaid.
Total Revenue generated by each brand. Total Revenue generated by each os_vendor.
Data Normalization: organizing data to minimize redundancy and dependency, enhancing consistency and integrity in a database.
Removed duplicate rows; the table is now in First Normal Form (1NF).