PowerPoint Presentation

Published on Dec 01, 2023

Page 1 (0s)

[Virtual Presenter] Good afternoon, everyone. Today, we will discuss the CARE Assessment at Unilever conducted by HCL. We will look at the current state of maturity of the five applications running on Azure Cloud, the gaps identified, and HCL's recommendations for adhering to SRE principles.

Page 2 (22s)

[Audio] We have already examined the CARE objective, taken the environment assessment into account, and calculated the overall score. The complete report comprises the major insights and an action plan founded on them. This gives us a rounded view of the situation and lets us plan how to address the issues the evaluation uncovered. It also helps us identify strengths, weaknesses, and likely opportunities for process improvement.

Page 3 (52s)

[Audio] HCL has been engaged by Unilever to assess the technical and operational security aspects of identified applications running on the Azure Cloud. Our initial assessment identified the current state of maturity and collected evidence from Azure DevOps and ticket data from the operations team. We have reviewed the results and prepared a comprehensive executive summary with our recommendations on SRE principles. The progress review for the five applications is now complete, and the Recovery Priority has been set for each application. We are confident that the CARE Assessment will bring Unilever closer to achieving its desired cloud security goals.

Page 4 (1m 34s)

[Audio] Application visibility, availability, reliability, and resiliency are essential for customer satisfaction, loyalty, and trust. To maintain a reliable and resilient system, application and platform teams should collaborate to prevent outages, productivity loss, and customer mistrust. This will, in turn, create a positive experience for the customer, lower the churn rate, and reinforce brand loyalty. Moreover, a reliable system frees the team to spend time on innovation, which results in timely product launches.

Page 5 (2m 10s)

[Audio] The CARE Assessment Summary Phase 1 by Vivek Sharma, Gaurav Khanna, Pawan, and Yogesh records the current setup. The VNet design uses a hub-and-spoke network model, and ExpressRoute provides on-premises connectivity. Security is covered by Azure Firewall, NSG and ASG, the QualysGuard tool, IAM and PAM, QRadar, Trend Micro, Wiz, Akamai WAF, Azure Security Center, Azure Defender, and Sentinel, all to ensure safe access. For backup and restore of VMs and data, Commvault, Networker, and Azure Backup are used with different retention policies.

Page 6 (2m 57s)

[Audio] Unilever Business needs to monitor the performance of its critical applications end to end, with defined Service Level Objectives and Service Level Indicators to assure service quality. The goals are a unified operations system with integrated application and infrastructure views, a faster Mean Time To Restore, and an improved user experience. To get there, Unilever should benchmark application performance; carry out chaos engineering to identify and analyze the applications' internal, external, direct, and indirect dependencies; develop custom, hardened images; document service level agreements; enable full-stack observability aligned with business KPIs; and set up DevOps for the Crown Jewel and KFS applications. Finally, Infrastructure as Code and automation processes should be adopted and matured to reduce manual operations.
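
To make the relationship between a Service Level Indicator and a Service Level Objective concrete, here is a minimal sketch. It assumes a request-based availability SLI; the function name `evaluate_slo`, the 99.9% target, and the request counts are hypothetical and do not come from the assessment.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float               # measured fraction of good requests
    slo: float               # target fraction of good requests
    budget_remaining: float  # share of the error budget still unspent


def evaluate_slo(good_requests: int, total_requests: int, slo: float) -> SloReport:
    """Compare a request-based availability SLI with its SLO."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    sli = good_requests / total_requests
    allowed_bad = (1.0 - slo) * total_requests  # failures the SLO tolerates
    actual_bad = total_requests - good_requests
    budget_remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli=sli, slo=slo, budget_remaining=budget_remaining)


# Hypothetical numbers: 100,000 requests, 50 failures, 99.9% target.
report = evaluate_slo(good_requests=99_950, total_requests=100_000, slo=0.999)
print(f"SLI={report.sli:.4%} budget remaining={report.budget_remaining:.0%}")
```

A negative `budget_remaining` would mean the service has failed more often than the SLO allows over the measured window.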

Page 7 (3m 47s)

[Audio] We have identified the need for a more mature cloud automation tool, as the current one has certain limitations. A comprehensive infrastructure monitoring tool is also necessary, and collaboration among the Business, Infra, and Apps teams should improve. APM tuning needs to be done with appropriate tools, and an observability layer should be onboarded to provide automated RCA with root cause details. A full-stack observability dashboard should be enabled to view application performance, along with mature cloud baseline monitoring and centralized monitoring for applications. Practices for load, stress, and chaos testing, infrastructure security scanning, DR testing, and RPO/RTO monitoring need to be enabled, and regular DR failover tests should be run for TWS DB2. Finally, performance benchmarking of applications needs to be planned, and SLOs should be defined on key indicators, with a streamlined process for feedback and design suggestions from Operations to Engineering teams.

Page 8 (4m 56s)

[Audio] The CARE Assessment Summary Phase 1 revealed a few key insights. The data table suggests a need to enhance and leverage the existing DR tool capabilities to automate disaster recovery for all critical applications. This would reduce the risk of human error and allow the applications to meet their disaster recovery SLAs. Automation should be implemented in the deployment processes for applications, databases, and the underlying infrastructure, and a catalog-based provisioning and upgrade approach should be followed. This would reduce deployment cost and delays in time to market, and it would enable inter-tower collaboration. A MOM tool should be used to correlate Infra and Apps logs so that a single root cause and its impact profile can be identified for any issue, which enables faster resolution. Capacity management should be implemented to anticipate future problems and reduce unplanned incidents. Finally, a holistic blameless postmortem process needs to be introduced across apps and infra to improve the process, people, and technology culture.

Page 9 (6m 10s)

[Audio] These are the findings of our CARE assessment from 27 November 2023. Our team has observed a number of potential risks and proposed actions to mitigate them. These actions are separated by priority, with P1 being the highest. The team lacks a combined application and infrastructure view for critical applications, which could lengthen the Mean Time To Recovery for P1 and P2 issues. To mitigate this, we recommend using Power BI or a similar tool. We have also observed a lack of predictive and anomaly detection tools, and suggest configuring dashboards to leverage existing monitoring solutions. To reduce operational toil, we propose using tools such as BigFix and Ansible to automate patching of Linux and DB. Finally, we recommend establishing a streamlined process, using Infrastructure as Code toolsets, and automating service requests.
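
Mean Time To Recovery by priority can be derived directly from ticket data. The sketch below is illustrative only: the function name `mttr_hours`, the tuple layout, and the three incidents are hypothetical and are not taken from the operations team's tickets.

```python
from datetime import datetime
from statistics import mean


def mttr_hours(incidents: list[tuple[str, datetime, datetime]]) -> dict[str, float]:
    """Mean time to recovery in hours, grouped by priority label."""
    durations: dict[str, list[float]] = {}
    for priority, opened, resolved in incidents:
        hours = (resolved - opened).total_seconds() / 3600
        durations.setdefault(priority, []).append(hours)
    return {priority: mean(values) for priority, values in durations.items()}


# Hypothetical tickets: (priority, opened, resolved).
incidents = [
    ("P1", datetime(2023, 11, 1, 9, 0), datetime(2023, 11, 1, 11, 0)),
    ("P1", datetime(2023, 11, 5, 14, 0), datetime(2023, 11, 5, 18, 0)),
    ("P2", datetime(2023, 11, 7, 8, 0), datetime(2023, 11, 7, 14, 0)),
]
result = mttr_hours(incidents)
print(result)
```

Tracking this figure per priority before and after a dashboard rollout is one way to check whether the combined view actually shortens recovery.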

Page 10 (7m 19s)

[Audio] This slide of CARE Assessment Summary Phase 1 presents a table with the columns Observation, Actions, Priority, and Risk Impact, covering Monitoring. TWS capacity and performance monitoring covers jobs scheduled per unit of time, the number of users concurrently working on the DWC console, database growth, and storage capacity. Process-level monitoring covers agent, appservman, batchman, jobman, mailman, monman, and netman. Database monitoring, as per the workshop discussion, and basic OS-level monitoring for the DB2 servers are also included. Lastly, as per the workshop discussion, region outages have not been tested.

Page 11 (8m 11s)

[Audio] The slide details the Observation, Actions, Priority, and Risk Impact for Resiliency, Automation, and Process in CARE Assessment Summary Phase 1. To support disaster recovery and business continuity, it is essential to configure VM-level replication and to implement self-service portals and auto-detection features. DB backups should be scheduled to adhere to the RPO so that the SLAs are met. Existing tools or native backup tools can handle these tasks. It is also important to define a process with the Application Owners, and to update the CMDB manually, so that newly created datasets are added and protected. Doing this helps ensure access to essential application data for restoration and business continuity.
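
Whether a backup schedule adheres to the RPO can be checked by comparing the age of the newest backup with the RPO for each database. This is a minimal sketch under stated assumptions: the database names, RPO values, and timestamps are hypothetical, and in practice the last-backup times would come from the backup tool's own reporting.

```python
from datetime import datetime, timedelta


def rpo_breaches(last_backup: dict[str, datetime],
                 rpo: dict[str, timedelta],
                 now: datetime) -> list[str]:
    """Return databases whose newest backup is older than their RPO."""
    breaches = []
    for name, taken_at in last_backup.items():
        if now - taken_at > rpo[name]:
            breaches.append(name)
    return sorted(breaches)


# Hypothetical inventory; none of these names come from the assessment.
now = datetime(2023, 11, 27, 12, 0)
last_backup = {
    "orders_db": datetime(2023, 11, 27, 11, 30),
    "reporting_db": datetime(2023, 11, 26, 6, 0),
}
rpo = {
    "orders_db": timedelta(hours=1),
    "reporting_db": timedelta(hours=24),
}
late = rpo_breaches(last_backup, rpo, now)
print(late)
```

A dataset that is missing from the inventory cannot be checked at all, which is why the CMDB update step matters.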

Page 12 (9m 11s)

[Audio] In CARE Assessment Summary Phase 1 on 27 November 2023, Vivek Sharma, Gaurav Khanna, Pawan, and Yogesh pinpointed 15 parameters as essential measuring points. These include packets received/second, duplicates dropped/second, packets expired/second, and milliseconds per packet. The team also selected active queue length, conflict check queue length, discovers/second, offers/second, requests/second, informs/second, acks/second, nacks/second, and declines/second. Lastly, the team checks whether the DHCP server can shut down without errors. This data enables us to gauge and track the performance of our DHCP server.

Page 13 (10m 1s)

[Audio] The slide lists device and server monitoring parameters with their alarm severity classification, collected during CARE Assessment Summary Phase 1 on 27 November 2023. The parameters range from DNS forwarder conditional forward localhost - NSLookup, classified as major configuration compliance, to DNS WMI Validation - Configuration Check, classified as minor configuration compliance.

Page 14 (10m 31s)

[Audio] Our CARE Assessment Summary Phase 1 provides a table of alarm severity classifications for the listed monitoring parameters. This classification helps the team prioritize the tasks to be addressed and the alarms that must be acted on to ensure quality service. With the available tools and alerting systems, we can systematically monitor and assess the stability of the systems and services. This is the first step towards building a reliable, convenient, and secure service for our clients.

Page 15 (11m 5s)

[Audio] This slide contains a table with the results of the CARE assessment. It shows that the performance metrics, such as SQL recompilations/sec, buffer cache hit ratio, page life expectancy, request workers percentage, sessions percentage, deadlocks, failed connections, connections blocked by firewall, user connections, and successful and failed logins, meet their thresholds. These results show that our systems are operating effectively.

Page 16 (11m 35s)

[Audio] The CARE Assessment Summary Phase 1 of 27 November 2023 lists various metric categories with their respective thresholds. These include Resource Utilization, Disk Usage, Locks/Blocking, Lock Waits/sec, Index Health, Fragmentation, Deadlocks, Resource Pool Memory, CPU%, Disk Read IO/sec, Disk Write IO/sec, Buffer Cache Hit Ratio, Page Life Expectancy, and Checkpoint Pages/sec. Keeping these metrics within their thresholds supports the best functioning of the system.
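
A threshold check over metrics like these has to account for direction: a low Buffer Cache Hit Ratio is bad, while a high CPU% is bad. The sketch below shows one way to encode that. The limits are placeholders chosen for illustration, not the thresholds from the assessment, and the metric keys are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    limit: float
    higher_is_better: bool


# Placeholder limits for illustration only; not values from the assessment.
THRESHOLDS = {
    "buffer_cache_hit_ratio_pct": Threshold(90.0, higher_is_better=True),
    "page_life_expectancy_s": Threshold(300.0, higher_is_better=True),
    "cpu_pct": Threshold(80.0, higher_is_better=False),
    "deadlocks_per_s": Threshold(0.0, higher_is_better=False),
}


def breached(samples: dict[str, float]) -> list[str]:
    """Return the names of metrics whose sample violates its threshold."""
    violations = []
    for name, value in samples.items():
        threshold = THRESHOLDS[name]
        if threshold.higher_is_better:
            ok = value >= threshold.limit
        else:
            ok = value <= threshold.limit
        if not ok:
            violations.append(name)
    return sorted(violations)


# Hypothetical sample taken from a monitored SQL instance.
sample = {
    "buffer_cache_hit_ratio_pct": 97.5,
    "page_life_expectancy_s": 120.0,
    "cpu_pct": 65.0,
    "deadlocks_per_s": 0.0,
}
violations = breached(sample)
print(violations)
```

The same structure could carry an alarm severity per metric if the classification from the monitoring table is added as a third field.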

Page 17 (12m 11s)

[Audio] The results of the CARE assessment indicate that our scores in business aligned operations, culture, and security and vulnerability are all above average, at 49%, 51%, and 80% respectively. Despite these promising results, our grand total score of 38% is still below the desired range of 75-100%. To increase our score and strengthen our SRE framework maturity, we need to take the necessary measures. With hard work and dedication, we are confident that we can reach the desired range.
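
The distance between each reported score and the lower bound of the desired range can be worked out directly. This sketch uses only the percentages quoted in the narration. How the grand total is weighted is not stated, so the code does not try to recompute it from the area scores.

```python
TARGET_MIN = 75  # lower bound of the desired 75-100% range

# Scores as quoted in the narration for this slide.
scores = {
    "Business aligned operations": 49,
    "Culture": 51,
    "Security and vulnerability": 80,
    "Grand total": 38,
}


def gap_to_target(score: int, target: int = TARGET_MIN) -> int:
    """Percentage points still needed to reach the target (0 if already met)."""
    return max(0, target - score)


gaps = {area: gap_to_target(value) for area, value in scores.items()}
for area, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{area}: {gap} points below target")
```

On these figures, only security and vulnerability already sits inside the desired range.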

Page 18 (12m 48s)

[Audio] Our current CARE assessment evaluation is shown on this slide. As of 27th November 2023, the score for Tools is 40%, and we need to improve our scores in Applications, Infrastructure, and Processes in order to reach the full 100%. This is our next challenge.

Page 19 (13m 8s)

[Audio] The performance engineering score for this assessment was 37% out of a maximum score of 100%. Application scored 21%, Processes scored 32%, and Infrastructure scored 59%. It is evident that some improvements are required to increase the overall score.

Page 20 (13m 30s)

[Audio] The Automation CARE assessment by Vivek Sharma, Gaurav Khanna, Pawan, and Yogesh reveals a total score of 27% out of a maximum score of 100%. Processes, Tools, and Skillset score 30%, 31%, and 33% respectively, while Application shows no proficiency and Infrastructure is low at 25%. These results serve as a benchmark for improvement.

Page 21 (13m 59s)

[Audio] Capacity Management was assessed for Summary Phase 1. Processes scored the highest at 50%, followed by Application at 22%, Operations at 8%, and Infrastructure at 7%, for a Grand Total of 21%.

Page 22 (14m 14s)

[Audio] Slide 22 shows the results of our Security and Vulnerability CARE Assessment from 27 November 2023, with a maximum score of 100%. Application scored the highest at 100%, followed by Infrastructure at 91%, IdAM at 83%, Tools at 79%, and Processes at 74%. The overall total was 80%. This data is essential for our company, as it provides insight into our system security and vulnerabilities. It is the result of hard work and dedication from Vivek Sharma, Gaurav Khanna, Pawan, and Yogesh, and we are thankful to them. Knowing this information will help us provide our customers with the best service on a secure system.

Page 23 (15m 3s)

[Audio] The Action Plan consists of Self-Remediation, a self-service catalogue, golden signals for monitoring, App Insights for Database Monitoring, Full Stack Observability, and Training and Skill Development on DevOps and Automation Tools. The aim is for SRE Squad members to have better knowledge of their roles and responsibilities, to optimize capacity and performance, and to improve operations.

Page 24 (15m 28s)

Page 25 (15m 33s)

[Audio] I am proud of the work we have showcased. We have gone through CARE Assessment Summary Phase 1 of 27 November 2023 in detail and discussed the achievements, conclusions, and findings. I want to thank my team members Vivek Sharma, Gaurav Khanna, Pawan, and Yogesh for their hard work and commitment. Thank you all for your attention.