Improving KPI troubleshooting. Sphoorthi Shetty.
Topics. Animation, Current Process, Drawbacks of the Current Process, ORA Dashboard, grafana_links script, Benefits, References.
[Audio] Hi, meet Team CSO.
[Audio] The team is anticipating increased workloads, so we need to find ways to reduce manual effort while ensuring changes are implemented smoothly.
[Audio] Sphoorthi reaches out to Arun with a few cool ideas.
[Audio] Arun thinks the ideas are great, as they would really help the team, and he tells Sphoorthi to get started on them.
[Audio] Sphoorthi comes up with a plan but needs some help since some parts are new to her. Arun suggests checking with Don, who worked on a similar project.
[Audio] Sphoorthi learns a new scripting language and gets the whole project ready in 5 months. She then collects data for a month to measure the reduction in manual effort.
[Audio] Sphoorthi gets to work on the coding, and soon enough, the project is ready to go.
[Audio] The whole team is happy after the changes are rolled out!
Current Process. Change Safety: KPI Troubleshooting (https://www.nocc.akamai.com/alertproc/view.cgi?id=9296). Example: https://gbrm.akamai.com/#/phase/view/58330 (21 hour checks had 8 KPI failures, one of which was for an ORA-related KPI metric). The current process takes ~40 minutes for each KPI failure.
Drawbacks of using ORA in the current process. The team currently fills in fields manually and checks the corresponding ORA dashboard during KPI failure troubleshooting. This manual process is prone to errors and may result in incomplete checks if additional ORA dashboards are overlooked. There is a risk of missing the ORA dashboard check since it is required only for 5 specific KPI failure sets. The manual process is time-consuming and inefficient.
Drawbacks of using Grafana in the current process. Currently, the team manually selects the breakdown and combination for each rollout when KPI failures occur. They check the configuration file to locate the correct metric, since the Lighthouse metric name differs from the Grafana metric name. The team then manually identifies the corresponding Grafana dashboard for each KPI failure. This manual workflow is error-prone, and the time it takes depends on: the number of failures; the time taken to load the dashboards; and finding the graph related to the specific KPI metric from the configuration file.
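The manual lookup described above, from a failed Lighthouse metric to its Grafana counterpart, can be pictured as a table lookup. The following is only a minimal sketch: the metric names and the dictionary standing in for the configuration file are hypothetical, since the real file format is not shown in this deck.

```python
# Hypothetical stand-in for the configuration file that maps Lighthouse
# metric names to Grafana metric names. All names here are made up.
LIGHTHOUSE_TO_GRAFANA = {
    "lh_error_rate": "grafana.errors.rate",
    "lh_cpu_load": "grafana.cpu.load",
}


def grafana_metric_for(lighthouse_metric):
    """Return the Grafana metric name for a Lighthouse metric, or None."""
    return LIGHTHOUSE_TO_GRAFANA.get(lighthouse_metric)


def resolve_failures(failed_metrics):
    """Split failed metrics into (resolved, unmapped).

    resolved: dict of Lighthouse name -> Grafana name
    unmapped: list of Lighthouse names with no entry in the mapping
    """
    resolved = {}
    unmapped = []
    for name in failed_metrics:
        target = grafana_metric_for(name)
        if target is None:
            unmapped.append(name)
        else:
            resolved[name] = target
    return resolved, unmapped
```

Returning the unmapped names separately, rather than failing on the first one, makes it visible when a KPI failure has no known Grafana graph, which is the case that is easy to miss by hand.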
ORA Dashboard. Verification using the ORA dashboard applies to 20 of 88 KPI metrics in ESSL (approximately 23%) and 20 of 114 KPI metrics in FF (approximately 18%). The current process requires setting the dashboard parameters manually, which consumes time. https://track.akamai.com/jira/browse/CSO-331 - Phase 1: The KPI failure alert was modified to include the ORA dashboard details in the alert description. This lets us go directly to the ORA dashboard from the alert details, which makes troubleshooting easier and quicker.
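As an illustration of the Phase 1 idea, the sketch below appends a pre-filled ORA dashboard link to an alert description when the failed metric is ORA-related. It is not the actual CSO-331 change: the base URL, the query parameter names, and the set of ORA-related metrics are all placeholders invented for this example.

```python
from urllib.parse import urlencode

# Placeholder URL; the real ORA dashboard address is not given in this deck.
ORA_BASE_URL = "https://ora.example.invalid/dashboard"

# Hypothetical names for the KPI metrics that require an ORA check.
ORA_RELATED_METRICS = {"ora_hits", "ora_errors"}


def ora_dashboard_link(network, metric, start, end):
    """Build a dashboard URL with its parameters already filled in."""
    query = urlencode(
        {"network": network, "metric": metric, "from": start, "to": end}
    )
    return f"{ORA_BASE_URL}?{query}"


def alert_description(network, metric, start, end):
    """Return alert text, adding the ORA link only for ORA-related metrics."""
    text = f"KPI failure: {metric} on {network}"
    if metric in ORA_RELATED_METRICS:
        text += "\nORA dashboard: " + ora_dashboard_link(network, metric, start, end)
    return text
```

Because the link is added only for ORA-related metrics, the alert itself signals whether the ORA check is required, instead of relying on the engineer to remember which failure sets need it.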
grafana_links script. https://track.akamai.com/jira/browse/CSO-357 - Phase 2. Input: the Chapi request ID, passed with -r (example: @lsg-snr7 ~]$ grafana_links -r 731710). Output: the request ID with details (network, regionset, failed metrics, etc.) and Grafana links with graphs for each metric. The script collects all data from Chapi and generates Grafana links for both GBRM and non-GBRM KPI failures, enabling CSO to start troubleshooting immediately and report the failures to the requestor. This significantly reduces the time spent on troubleshooting.
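The link-generation step can be sketched roughly as follows. This is not the grafana_links source: the Chapi lookup is replaced by a hand-built record, the dashboard URL is a placeholder, and the field and variable names are assumptions. The only detail taken from Grafana itself is that dashboard template variables are passed as var-<name> query parameters.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode

# Placeholder dashboard URL; the real one is not given in this deck.
GRAFANA_BASE_URL = "https://grafana.example.invalid/d/kpi"


@dataclass
class KpiRequest:
    """Hypothetical shape of the details fetched from Chapi for a request."""
    request_id: int
    network: str
    regionset: str
    failed_metrics: list = field(default_factory=list)


def build_links(req):
    """Return one Grafana link per failed metric, keyed by metric name."""
    links = {}
    for metric in req.failed_metrics:
        query = urlencode(
            {
                "var-network": req.network,
                "var-regionset": req.regionset,
                "var-metric": metric,
            }
        )
        links[metric] = f"{GRAFANA_BASE_URL}?{query}"
    return links


def render(req):
    """Format the request details followed by one link line per metric."""
    lines = [
        f"Request ID: {req.request_id}",
        f"Network: {req.network}",
        f"Regionset: {req.regionset}",
        "Failed metrics: " + ", ".join(req.failed_metrics),
    ]
    for metric, url in build_links(req).items():
        lines.append(f"{metric}: {url}")
    return "\n".join(lines)
```

For example, print(render(KpiRequest(731710, "essl", "rs1", ["m1", "m2"]))) would list the request details and two links; only the request ID comes from the slide, the other values are invented.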
Benefits. KPI troubleshooting time reduced by 50% (chart: KPI troubleshooting - Time Consumed - Collected data). CSO Team: can use this script for SCM (Software Configuration Management) KPI failures in both GBRM and non-GBRM rollouts. CSD Team: can use this script for SCM with ghost restarts. PRE Team: can use this script for installs.
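To put the 50% figure in context, here is a small worked calculation using the numbers from the Current Process slide (~40 minutes per KPI failure, 8 failures in the example). It assumes the reduction applies uniformly to every failure, which the collected data may or may not bear out.

```python
# "~40 minutes for each KPI failure" from the Current Process slide.
BASELINE_MINUTES_PER_FAILURE = 40
# "50% time reduced" from the Benefits slide.
REDUCTION = 0.50


def troubleshooting_minutes(failures, reduction=0.0):
    """Total troubleshooting minutes for a number of KPI failures.

    Assumes every failure takes the same baseline time and that the
    reduction applies equally to each one.
    """
    return failures * BASELINE_MINUTES_PER_FAILURE * (1.0 - reduction)


before = troubleshooting_minutes(8)            # 320.0 minutes
after = troubleshooting_minutes(8, REDUCTION)  # 160.0 minutes
```

Under these assumptions, the 8-failure example drops from about 5 hours 20 minutes to about 2 hours 40 minutes of troubleshooting.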
References. https://track.akamai.com/jira/browse/CSO-331 (Phase 1), https://track.akamai.com/jira/browse/CSO-357 (Phase 2)