State of FinOps 2024: Reducing waste

I was happy to see that waste reduction has become the top priority for organisations that have migrated to the Cloud. I was not happy to see leadership adoption lagging behind, since the two are closely related: if the leadership prioritises waste reduction, they need to lead by example and demonstrate it daily in their behaviour and decisions. In the FinOps community it is generally agreed that about 30% of Cloud spend is waste. As someone once put it: “this money could eradicate world hunger”; instead, companies happily pay their AWS, Azure and GCP invoices without even raising an eyebrow.

(c) State of FinOps 2024, FinOps Foundation

The definition of waste has not changed since the Great Cloud Migration kicked off. If you have been in IT as long as I have, you will know waste has always been there; there just were never enough people like me who recognise it and act on it. FinOps applies equally to the ‘old’ Datacenter world and to hybrid environments: just look after the money!

If you take care of the pounds, the pennies will look after themselves

Proverb

A few examples from my Datacenter days (1980-2020):
– hundreds of expired tapes being retained because they were still in the catalog: I uncataloged them so they could be reused, delaying the purchase of new tapes;
– temporary files being saved on disk (and backed up): I deleted the backups and changed the JCL accordingly;
– servers and PCs that had been ‘decommissioned’ by e-mail only, without a formal request: I submitted the formal requests and saved 400k in my first year;
– VMs not decommissioned after the project had finished or the SLA was cancelled: I had 100 servers decommissioned, delaying investment in additional capacity (Capex), and retagged another 100 to the correct cost code. Along the way I found some interesting ways of letting others pay for your infrastructure.

And a few from my Cloud Days (from 2022):
– a failing function app costing 2000 euro per month which served no purpose: it was removed;
– unattached disks which had been forgotten: I have these removed on a regular basis, saving 10k per year (see the sketch after this list);
– pointless Azure Virtual Desktop backups which would never be used: I had the backups disabled and saved 50k per year. I also discovered that expired and obsolete backups were not being purged; another 20k saved;
– 5TB of provisioned storage which contained one 3MB file: it was resized to 100GB, saving 11k per year. I know, still too large, but it was the quickest solution;
– a 4-core VM which came out of an OS upgrade as an erroneous 64-core VM: I had it right-sized back to 4 cores and saved 18k per year;
– VMs that should have been stopped were left running after a policy update. The sudden drop in cost once this was fixed suggested too many VMs had been running for 10 weeks: 52k wasted.
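Finding those unattached disks is easy to script; a minimal sketch using the diskState property of managed disks (the output columns are just a suggestion):

# List managed disks that are not attached to any VM
az disk list --query "[?diskState=='Unattached'].{name:name, resourceGroup:resourceGroup, sizeGb:diskSizeGb}" --output table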

My point is: if you look for waste, you will find it. If no-one has inspected an area for a year or two, waste will be found; the examples above demonstrate it. Regular application management should take care of this, but you know how it goes: if it works, don’t touch it. Engineer A leaves without documenting or handing over properly, engineer B takes over and does not have the courage to touch anything. After a few iterations the knowledge has evaporated.
I did find myself in these situations and spent considerable effort discovering how an application worked: decomposing the job sequences, checking the Cobol code, debugging a VTOC to find the correct DCB option codes. But that was me, a personal drive I have never lost.

Looking at just Cloud, where do I see waste now? It’s a long list, but I’ll discuss just three items. ‘Unused or wasted’ resources take many shapes and forms in the Cloud. And elsewhere.

Idle VMs come first. When do you run your Dev & Test VMs? When you develop and test. Which is not after office hours, not at weekends, not on national holidays, not during lunch or office parties. In this example there are (in my opinion) too many Dev/Test VM-hours in the weekend.

Deallocate those as standard and only start them when they are needed. I see many of these VMs running 24/7 for reasons ranging from “I might need it when I feel like it” to “I can’t be bothered”.
It’s easy to grant engineers the privilege to start and stop their own VMs, but Infrastructure-as-Code (IaC) or monitoring may get confused about the properties and status of such a VM: it will generate alerts like ‘VM down’ or redeploy the wrong size of resource.
Azure CLI can be used to start a VM when required, and there are similar commands to stop and deallocate it:
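A minimal sketch; the resource group and VM names are placeholders:

# Start a VM only when it is needed
az vm start --resource-group rg-devtest --name vm-dev-01

# 'stop' shuts down the guest OS, but the compute is still billed
az vm stop --resource-group rg-devtest --name vm-dev-01

# 'deallocate' releases the compute, so billing stops (only the disks are still charged)
az vm deallocate --resource-group rg-devtest --name vm-dev-01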

Second comes oversizing of VMs and SQL Instances: too many cores. Azure Advisor gives good clues for follow-up:

Often the sizing is based on that one peak per day between 09:30 and 10:30, resulting in oversizing for up to 23 hours per day. Why not scale up before the peak and down after? This is what I did in the 80s with JCL: allocate extra capacity at the start of the batch and release it once the batch has finished. In Azure a simple CLI command does the trick (see the sketch below, based on the Microsoft training site), but again, with IaC this is problematic (at least I am told it is).
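A minimal sketch of such a scale-up/scale-down; the names and sizes are placeholders and the right SKUs depend on your workload. Note that a resize restarts the VM:

# Scale up just before the daily peak
az vm resize --resource-group rg-prod --name vm-app-01 --size Standard_D16s_v5

# Scale back down once the peak has passed
az vm resize --resource-group rg-prod --name vm-app-01 --size Standard_D4s_v5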

For SQL Managed Instance it is slightly more complicated, but by updating the $vcores variable and deploying the code, the same objective is achieved: fewer cores, hence less cost!
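I don’t know which tooling the original snippet used, but a rough Azure CLI equivalent looks like this (names are placeholders; --capacity is the vCore count, and a scaling operation on a Managed Instance can take a while):

# The new, lower vCore count
vcores=4
az sql mi update --resource-group rg-data --name sqlmi-prod-01 --capacity $vcores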

It took me 15 minutes to find this code, so why isn’t everyone doing it? Because it does not get priority. If no-one feels the pain of a cost overrun or is incentivised to control cost, nothing will happen. It’s just a waste of company cash.

Third is licenses. Not using Azure Hybrid Benefit (AHUB) for Windows Server means not picking the low-hanging fruit. If you have Windows Server VMs running 24/7 with 8 cores or more, apply AHUB and save about 75% on the license cost compared to pay-as-you-go. For SQL Server it’s more complicated, but there are companies who can help you with this.
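For an existing Windows Server VM, enabling the benefit is a one-liner (names are placeholders; this assumes you actually own eligible licenses with Software Assurance):

# Apply Azure Hybrid Benefit to an existing Windows Server VM
az vm update --resource-group rg-prod --name vm-win-01 --license-type Windows_Server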

Since Cloud is about Opex and real cash, I recommend following the money and focusing where the cost is highest, based on two principles:
– only switch it on when you need it (also applicable to production workloads);
– right-size at all times.
I found it helpful to know which resources incur the highest cost, i.e. which meters run fastest. I collect this from CloudHealth FlexReports: cost per resource ID by Meter Category, Meter Subcategory and Meter Name, extracted into Power BI. The results were enlightening!
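You don’t need CloudHealth for the raw data; a sketch that pulls usage records with meter details via Azure CLI for import into Power BI (dates are placeholders, scope is the current subscription):

az consumption usage list \
  --start-date 2024-06-01 --end-date 2024-06-30 \
  --include-meter-details \
  --output json > usage-june.json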

I expected the largest SQL Managed Instances at the top, but it was ‘DevOps Test Licenses’. For that one I found a receptive Product Owner who prioritised a conversion to ‘Basic’ licenses as the standard, with ‘Test’ licenses only for those who need them. This saved 30k per year.
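If such a conversion has to be done for many users, it can be scripted with the azure-devops CLI extension; a sketch, assuming ‘express’ is the Basic license (the user name is a placeholder and a default organisation is assumed to be configured):

# Requires: az extension add --name azure-devops
az devops user update --user jane.doe@example.com --license-type express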

I found overprovisioned ‘Provisioned Premium LRS’ storage, which I decreased in steps of 0,5TB since the team told me they needed the IOPS for performance reasons. I think they had picked the wrong storage (…). I found the same with oversized VMs: they ‘needed’ 8 cores for the IOPS. In my opinion, again a case of picking the wrong SKU (…).
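Assuming this was a premium file share (which the ‘Provisioned … LRS’ meter suggests; managed disks cannot be shrunk), reducing the provisioned size is one command. Names and the new size are placeholders:

# Shrink the provisioned quota (in GiB) of a premium file share
az storage share-rm update --resource-group rg-storage --storage-account stprod01 --name appshare --quota 1024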

Given the way I presented the data in Power BI, it was easy to look at the cost from various angles, giving me great insight into what makes the money flow and what can be done about it. This can also be done in the Azure Portal: amortized cost per resource, grouped by Meter, and repeated for Meter (Sub)category. Please note I did stumble upon an error for amortized cost in the Azure Portal: it shows the use of Reserved Instances as 0,00. More about this in a later post; Microsoft still maintains it ‘works as designed’. Luckily CloudHealth gets this right.

I also found a meter I had never seen before. It turned out to be an error during a change: a wrong option had been activated for MySQL ‘Paid IO LRS’, and by the time this was corrected, 4133 euro had been burned.

Now the tricky part: leadership buy-in. In a classic Datacenter, the cost level is very predictable: the cost per service is known, and so is the depreciation. The cost allocation forecast is a flat line, which will hold true if allocation is done as ‘1/12th of the budget per month’. If it is done on ‘actual cost based on services used’, some might be encouraged to optimise Opex. Usually Opex savings cannot be used for Capex activities like improvement projects due to bookkeeping constraints. So not much for the leadership to worry about.

In Cloud, all infrastructure spend is Opex, and freed-up spend can be used for project infrastructure, or not spent at all. It requires a financial control process in which optimising Cloud is encouraged and incentivised, for instance by setting reduction targets per quarter throughout the year. A tried and tested mechanism is mandatory follow-up of cost issues within a predefined period: once the Cloud CCoE identifies waste, the clock starts ticking, and the leadership holds the POs to account. The same person who feels the pain if the budget is overrun. Who cannot deploy a new and exciting feature because the budget does not allow it. Who is not allowed to purchase an IaaS-based service because Cloud-native solutions have not been investigated. The leadership must sometimes play the ‘bad guy’ to show they mean business. That they take Cloud spend seriously. To lead by example. This is a culture change which takes time and might require a change in leadership and/or their behaviour.

Some argue ‘but others benefit more from Reservations than I do’. The picture below shows quite consistent consumption of RI/ASP per application. In any case, talk to your suppliers about allowing a different SKU so you can qualify for Reservations. If the application has not been designed for Cloud, find a better one or move it on-premises. Surprise: such an application will also have trouble running on a physical VM host; it prefers bare metal. Or commit and ask the Cloud CCoE to purchase a reservation just for you. Alternatively, ignore the supplier and follow company policies. And be assured: if an application runs on a D8as_v4, it will run on a D8ds_v5. I witnessed a massive life-cycle OS upgrade and in the end ONE application did not run instantly on the standard SKUs. ONE out of a few hundred! And after a bit of tinkering it complied with the standard.

As I hope to have demonstrated above, waste is everywhere. Every organisation needs a regular spring-clean of its infrastructure, processes, behaviours and people. In industries where money is tight and margins are low, this will happen sooner than in industries where money is plentiful. To control spend properly, buy-in is required from top to bottom. The leadership can easily calculate how much 30% of their Cloud invoice is (remember, this is the general waste level in Cloud) and challenge their direct reports to harvest it. Make them feel the pain. Keep up the pressure. Stop wasting the company’s money.

This should result in factory-style Cloud Cost Management:
– the CCoE and DevOps teams identify waste and opportunities for optimisation; this is a daily task which can be automated using AI/ML;
– DevOps teams analyse and prioritise the cost User Stories;
– the PO feels the accountability and plans regular User Stories for inspection and optimisation;
– the budget is reduced after deployment of each cost-saving User Story;
– executives keep the pressure on cost control.
In an organisation like this, the cost control wheel will start and keep spinning. The spend level is under control, outliers are predictable, and total cost of ownership is part of all business cases. Reservations are made on services which are actually used, not to get a discount on waste. It is not complicated, just start!

I have severely limited myself, but in closing here are some more examples of waste based on my experience:
– staff booking time on an (incorrect) Opex code instead of Capex. As a result there is less to depreciate and tax advantages are reduced
– keeping ‘gold’ support contracts for services which are not critical
– paying for too many licenses after the end-user base has shrunk
– paying pay-as-you-go licenses where the supplier has a corporate deal
– paying for a product which you don’t really use
– going along with price increases without considering an alternative (e.g. Oracle Java SE)
– not using discovery software to maintain the CMDB
– not using IAM to its full extent and sticking to manual AD-group maintenance tickets
– not having a continuous improvement attitude
– having agile ceremonies too often
– performing tasks because ‘we have always done it like this’
Think about these when you are looking at your own organisation and processes. There may be something in it for you. My present to the community, free of charge!

Useful links:
Azure Cost Optimisation Guide https://azure.microsoft.com/en-us/solutions/cost-optimization
Azure CLI https://learn.microsoft.com/en-us/cli/azure/
Oracle Java SE Universal subscription ‘how to count Employees’ https://www.oracle.com/a/ocom/docs/corporate/pricing/java-se-subscription-pricelist-5028356.pdf
FinOps Foundation https://www.finops.org/
State of FinOps https://data.finops.org/
Microsoft training https://learn.microsoft.com/en-us/
Picture borrowed from https://andrewmatveychuk.com/how-to-find-unused-or-orphan-resources-in-azure/

