Optimize: Eliminate Waste and Inefficiency

This is where the rubber meets the road. If the Inform stage shows you what’s happening with your cloud data costs and why, the Optimize stage is all about finding and fixing where you’re overspending: waste and inefficiency, decisions about variable pricing, anywhere you’re paying more than you need to.

There are hundreds of thousands of ways and places, big and small (and a lot of small that adds up to big), to optimize cloud data workloads for cost. A DataFinOps approach identifies all of these cost-optimization opportunities using the same underlying financial and performance data that was used to build the visualization dashboards.

Opportunities to optimize costs are found at different levels:

  • Job, or application, level: This is where costs are first incurred, and it happens thousands of times a week. It’s all the configuration and code details that determine how much that particular job (or collection of sub-jobs) will cost to run.

    It’s here at the job level where you’re overspending the most, where there’s the most (inadvertent) cost waste and inefficiency. The waste comes from the thousands and thousands of individual “spending decisions” that data engineers have to make on the fly: the number, size, type (and now cost) of resources to request, configuration details, code, and so on.
    Chances are, you’re wasting about 30-40% of your cloud data budget. A good chunk of that—probably two-thirds, maybe more—is down at the job level.
  • User level: A highly related but slightly different way of looking at cost optimization at the job level.
  • Platform level: The way you have all these individual jobs run on whatever platform they’re on (Databricks or Snowflake, Amazon EMR, BigQuery or Dataproc) carries a price tag too.
  • Pipeline level: Similarly, how everything works together at the pipeline level can have a significant impact on cost.
  • Cluster level: When differentiated from the “platform level” and “pipeline level,” this is really all about your cloud service provider costs. For simplicity’s sake, most DataFinOps people lump all three into a single “cluster level” category.

There are a lot of factors that go into why you might be overspending to run a particular data workload in the cloud. But it boils down to (a) how long it takes for that workload to run successfully and (b) the price tag of the individual services and machines used to run the workload.
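
As a minimal sketch of that arithmetic, the cost of a job is roughly its runtime multiplied by the hourly price of the resources it holds while running. The instance types, counts, and hourly rates below are hypothetical examples, not actual quotes from any provider:

    # Rough cost model: a job's cloud cost is roughly (a) how long it holds
    # resources multiplied by (b) the hourly price of those resources.
    # All instance types and rates below are hypothetical examples.

    HOURLY_RATE_USD = {          # assumed on-demand prices, not real quotes
        "m5.xlarge": 0.192,
        "m5.4xlarge": 0.768,
    }

    def job_cost(runtime_hours: float, instance_type: str, instance_count: int) -> float:
        """Runtime x number of machines x price per machine-hour."""
        return runtime_hours * instance_count * HOURLY_RATE_USD[instance_type]

    # An over-provisioned run: 10 large nodes held for 2 hours...
    print(job_cost(2.0, "m5.4xlarge", 10))   # ~$15.36
    # ...versus a right-sized run of the same job: 4 smaller nodes for 3 hours.
    print(job_cost(3.0, "m5.xlarge", 4))     # ~$2.30

Multiply that kind of difference by thousands of job runs a week and the job-level waste described above becomes the biggest line item in the budget.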

MegaBankCorp took a DataFinOps approach by thinking about things in terms of “per unit” cost controls. They wanted to take the financial and performance data and establish the per-unit cost today of, say, running each of Jon’s Fraud Detection workloads, then make sure each unit is being delivered as cost-effectively as possible. Control the unit costs where they’re incurred, and improvement cascades upward and outward.
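
A sketch of what that per-unit control could look like in practice: divide what a workload cost over a period by the units of work it delivered, and watch that number over time. The dollar amounts and transaction counts here are made up for illustration, and the choice of unit (runs, records scored, GB processed) is whatever the business actually cares about:

    # Hypothetical per-unit cost calculation: total spend on a workload
    # divided by the units of work it delivered (runs, records scored,
    # GB processed -- whichever unit the business cares about).

    def unit_cost(total_cost_usd: float, units_delivered: float) -> float:
        return total_cost_usd / units_delivered

    # e.g., a fraud-detection workload that cost $1,840 this week
    # and scored 46 million transactions:
    cost_per_million = unit_cost(1840.0, 46.0)   # units = millions of transactions
    print(f"${cost_per_million:.2f} per million transactions scored")   # $40.00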

But they ran into a brick wall when it came to putting FinOps principles into practice for their data environment. They finally had a good handle on how much they were spending and where, but figuring out whether that was too much or not took way more data engineering time and expertise than they had available.