How we're using AIOps to analyse our data
One of our colleagues, Andrew Longmuir, was recently interviewed by CA Technologies on how AIOps is helping William Hill analyse large volumes of digital data, making it easier to solve hard problems. With a new services model, it's imperative to have comprehensive visibility of IT operational data across the entire digital delivery chain to speed service delivery, increase IT efficiency and deliver a superior user experience. But there's also the human factor: there is a real skills gap, and we need to rely on automation and machines, even down to driving self-writing code, since there will not be enough developers around. This blog is reproduced with CA's permission.
Why aren't traditional approaches to IT Ops monitoring working anymore?
By moving to modern technologies and the cloud we create a proliferation of objects, services and metrics: a metric explosion, if you like. Traditional approaches to monitoring will not work. You can't have a static threshold or policy defined for every scenario and managed by humans; it's impossible. As humans we are all subject to cognitive limitations; algorithms, on the other hand, are capable of processing millions of events and deriving meaning from large datasets. William Hill have been on this journey for 18 months and it's a challenging problem. Services are the new focal point; it's a different paradigm. A good example: traditionally we would light up dashboards like a Christmas tree if something went down, yet in a containerized environment components going down is perfectly normal because of their ephemeral nature. What matters is how the service is impacted. The days of the traditional event console are numbered.
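To make the static-threshold problem concrete, here is a minimal sketch (not William Hill's actual implementation) of the alternative: instead of one hand-defined limit per metric, derive a dynamic band from each metric's own recent history, such as a rolling mean plus a multiple of the rolling standard deviation. The window size and multiplier are illustrative assumptions.

```python
from statistics import mean, stdev

def adaptive_anomalies(values, window=10, k=3.0):
    """Flag points that fall outside mean +/- k*stddev of the
    preceding `window` observations -- a per-metric dynamic
    threshold, rather than one static limit defined by hand."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if abs(values[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies

# A flat series with one spike: only the spike should be flagged.
series = [100.0, 101, 99, 100, 102, 98, 100, 101, 99, 100,
          100, 101, 250, 100, 99]
print(adaptive_anomalies(series))  # index of the spike
```

The point of the sketch is that the threshold adapts per metric with no human tuning, which is what makes it feasible at "metric explosion" scale.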
What monitoring and analytics capabilities do you need to pursue an AI and machine learning initiative?
You need a strategy in place that supports this. For example, at William Hill we have a data lake that acts as a data fabric (glue) into which we push all important events. These could be operational (through CA tooling), business related, social media related, customer behaviour, security postures; the list goes on. This ensures we don't have any blind spots across our organisation and provides that true holistic view. You then need toolkits to develop your models (we use Python) that we can train on our data. Our ML models are treated like code and pushed as such: for example, does our version 2 model provide more accurate results than version 1?
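Treating models "like code" implies a promotion gate: a candidate version must beat the current one on the same held-out data before it replaces it. A hedged sketch of that idea, with toy threshold "models" and an accuracy metric standing in for whatever evaluation a real pipeline would use:

```python
def accuracy(model, examples):
    """Fraction of (features, label) pairs the model predicts correctly."""
    correct = sum(1 for x, y in examples if model(x) == y)
    return correct / len(examples)

def promote_if_better(current, candidate, holdout):
    """Keep whichever model scores higher on the shared holdout set."""
    if accuracy(candidate, holdout) > accuracy(current, holdout):
        return candidate
    return current

# Toy 'models': threshold classifiers on a single metric value.
v1 = lambda x: x > 50      # version 1: coarse threshold
v2 = lambda x: x > 80      # version 2: tighter threshold
holdout = [(30, False), (60, False), (70, False), (90, True), (95, True)]

best = promote_if_better(v1, v2, holdout)
print(accuracy(v1, holdout), accuracy(v2, holdout))
```

Evaluating both versions against one shared holdout set is what makes the "does v2 beat v1?" question answerable rather than a matter of opinion.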
How can AI and machine learning help increase automation across your toolchain?
It improves collaboration across the entire SDLC in many ways and can also reduce time to market for your application delivery teams. We have been on this journey at William Hill for a while; we call it silo busting, whether that be technology or business silos. Let's get that data source integrated and inherit its value. Otherwise it's like trying to complete a jigsaw with key pieces missing: without all the pieces you can't really formulate an effective self-healing strategy or have any confidence that self-healing decisions will be accurate.
How can AI and machine learning reduce operational complexity?
In a complex world, simplicity is very much the key to achieving more. Introducing machine learning into your organisation helps people and machines work together, find clarity in the chaos and accelerate innovation. Many operations teams still rely on legacy technology and workflows (though this is slowly changing across IT). Machine learning and artificial intelligence, coupled with smart collaboration features, will help human operators learn faster and recover quickly from outages and downtime. Understanding and learning the relationships between components, and using ML techniques such as cluster mapping, will help in filtering out noise and identifying root cause. As described in William Hill's approach, all the data we need to make decisions is in one place, which lets us build a holistic event topology to undertake these actions.
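A minimal sketch of what "cluster mapping" can mean in practice, under two stated assumptions: alerts arriving close together in time belong to one incident, and a service topology tells us which alerting component is most upstream, hence the likely root cause. The topology and alert data here are hypothetical, not William Hill's.

```python
DEPENDS_ON = {          # hypothetical topology: child -> parent it depends on
    "web": "api",
    "api": "database",
    "database": None,
}

def cluster_by_time(alerts, gap=30):
    """Split time-sorted (timestamp, service) alerts into clusters
    wherever the gap between consecutive alerts exceeds `gap` seconds."""
    clusters, current = [], []
    for ts, svc in sorted(alerts):
        if current and ts - current[-1][0] > gap:
            clusters.append(current)
            current = []
        current.append((ts, svc))
    if current:
        clusters.append(current)
    return clusters

def likely_root(cluster):
    """Walk each alert up the dependency chain to the most upstream
    service that is also alerting -- the probable root cause."""
    alerting = {svc for _, svc in cluster}
    def upmost(svc):
        parent = DEPENDS_ON.get(svc)
        return upmost(parent) if parent in alerting else svc
    return {upmost(svc) for svc in alerting}

# One cascade (web/api/database within seconds), then an isolated web alert.
alerts = [(0, "web"), (2, "api"), (3, "database"), (120, "web")]
clusters = cluster_by_time(alerts)
print([likely_root(c) for c in clusters])
```

The first cluster collapses three symptom alerts to a single database root cause; the later, isolated web alert stands alone. That collapse is the noise filtering the paragraph describes.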
How can AI and machine learning enable faster remediation to problems?
This is an easy one. It drives the automatic reduction of alert volumes and eliminates operational noise; proactively detects problems through smart correlation and alert clustering across the entire toolchain and ecosystem; and streamlines collaboration and workflow automation, codifying knowledge and automating its sharing. It really does lend itself to continuous improvement and a truly proactive approach.
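The simplest form of alert-volume reduction can be sketched in a few lines: repeats of the same alert inside a suppression window collapse into one. The window length and alert shape below are illustrative assumptions, not a description of any particular tool.

```python
def deduplicate(alerts, window=300):
    """Collapse repeats of the same (service, message) pair seen
    within `window` seconds of the first occurrence kept."""
    first_seen = {}           # (service, message) -> timestamp kept
    kept = []
    suppressed = 0
    for ts, service, message in sorted(alerts):
        key = (service, message)
        if key in first_seen and ts - first_seen[key] <= window:
            suppressed += 1
            continue
        first_seen[key] = ts
        kept.append((ts, service, message))
    return kept, suppressed

# A timeout flood plus one unrelated alert; the late repeat at t=400
# falls outside the 300s window and is kept as a fresh alert.
flood = [(0, "api", "timeout"), (5, "api", "timeout"),
         (10, "api", "timeout"), (400, "api", "timeout"),
         (12, "db", "disk full")]
kept, suppressed = deduplicate(flood)
print(len(kept), suppressed)  # 3 kept, 2 suppressed
```

Real platforms layer correlation and clustering on top of this, but even plain windowed deduplication removes a large share of operational noise.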
How can AI and machine learning help improve user experience?
This is an interesting question and something we have only just started considering following our machine learning journey. We are looking to use CA App Experience Analytics data to drive improvements using advanced machine learning. For example: do we get better transaction throughput if we move this button, change the design of this page, or modify the page journey for this app? We want to find answers to those types of questions and be able to back them up with data. These ideas are at the incubation stage for us.
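Backing up a "did moving this button help?" question usually comes down to a controlled comparison. As a hedged illustration of one standard approach (not a description of William Hill's method), a two-proportion z-test compares conversion rates before and after the change; the counts below are made up.

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic and two-sided p-value for the difference
    between two observed conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via the error function.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Variant B (button moved) converts 5.5% vs 5.0% for variant A.
z, p = two_proportion_z(500, 10000, 550, 10000)
print(round(z, 2), round(p, 3))
```

With these illustrative numbers the p-value stays above 0.05, i.e. a 0.5-point lift on 10,000 users per arm is not yet conclusive, which is exactly why "being able to back them up" matters before shipping a redesign.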
How does the organizational structure evolve with the impact of AI and machine learning?
Great question. This is something I have been through recently with my team, restructuring so we can represent the increased role responsibilities and also reflect the new technologies and capabilities we have adopted (roles such as Capacity and Machine Learning Engineer). We are already seeing this across the industry with roles such as SREs, automation engineers and service assurance engineers. I would argue that everyone's role should be about site reliability, whether or not it is their direct responsibility: server reliability engineers and network reliability engineers, for example, in essence share the same goals. Take it back to monitoring: this has always been about automation and service integrity, running things like automatic actions to clear down logs or restart services. True self-healing tackles the complex problems and fixes the broken links in the chain, which is just taking this one step further. All of these capabilities are needed, and the skillset of an SRE includes a large degree of expertise across capacity and the tooling chain to be really effective. The lines of responsibility across roles are becoming blurred. Monitoring has always been capable of this, but it takes a company like Google before anybody really stands up and takes notice.
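The "automatic actions" mentioned above follow a common shape: probe, tolerate a few failures, then run a remediation. A minimal sketch of that loop, with the check and remediation passed in as callables so the same pattern covers log clean-downs, service restarts, or anything else scripted; the fake in-memory service below is purely illustrative.

```python
def self_heal(check, remediate, max_failures=3):
    """Run `check()` up to `max_failures` times; return 'healthy' on
    the first success. If every check fails, call `remediate()` and
    report that an automatic action was taken."""
    for _ in range(max_failures):
        if check():
            return "healthy"
    remediate()
    return "remediated"

# Example: a fake service that stays down until 'restarted'.
state = {"up": False}
result = self_heal(lambda: state["up"],
                   lambda: state.update(up=True))
print(result, state["up"])  # remediated True
```

In production the remediation would of course be a real action (a service restart, a log rotation) gated by the same confidence-in-data concerns the jigsaw analogy raises earlier.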
However, I think we are in exciting times: we have a real skills gap but are being driven to do more, and monitoring is front and centre in helping achieve this. We need to rely on automation and machines, even down to driving self-writing code, since there will not be enough developers around.
Where do you recommend others start their journey?
It really is a learning process; the clue is in the name. Find an important use case to trial this on and stick with it. We started with game and bonus abuse, as this type of activity could really impact bottom-line revenue across our gaming verticals. Set up a small, dedicated and tenacious team. We had to go on a real journey, trying and throwing away ideas and solutions until we found one that worked. Choosing the right metric candidates and machine learning approach really is key, and don't be afraid to start again until you find the solution that works for your use case and business.
About Andrew: Andrew Longmuir heads up the Capacity and Monitoring Engineering team at William Hill in the UK. He is an enterprise management professional with over 20 years' experience in service, systems and network engineering across both enterprise and open-source tooling. His experience includes capacity planning and analysis, plus the architecture, scoping and implementation of monitoring toolsets and cross-platform integrations for blue-chip companies throughout the UK and Europe, with a fully referenceable track record in the industry; some implementations have been critically acclaimed, appearing in the national computing press and showcased at vendor symposiums.