Pitfalls, Land Mines, Booby Traps

Many organizations make the common (and understandable) mistake of trying to “borrow” the observability and cost-management tools that work well for DevOps software teams and hand them to DataOps teams.

Data applications are a completely different animal from web applications. They are built differently, behave differently, run in a very different kind of environment on different technologies, and serve different purposes.

For example, leading application performance management (APM) solutions such as Datadog, Dynatrace, AppDynamics, and New Relic do an outstanding job with web apps but fall short when dealing with data apps. APM tools were never designed for the specific requirements of data applications, which call for tools that:

  • Capture a different kind of telemetry data generated by the modern data stack, down to the granular level of sub-parts of jobs processing in parallel
  • Apply a different kind of analysis to that data to find, understand, and fix a totally different class of problems, with different root causes and remediations: things like degree of parallelism, load imbalance, skew, and code execution
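
To make "skew" and "load imbalance" concrete, here is a minimal sketch of the kind of check that needs task-level telemetry. It assumes you already have per-task durations (in seconds) for each stage of a parallel job. The function names, the sample numbers, and the 3x threshold are illustrative assumptions, not part of any particular tool.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Ratio of the slowest task to the median task within one stage."""
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    med = median(task_durations_s)
    slowest = max(task_durations_s)
    if med == 0:
        # Avoid dividing by zero when the typical task is instantaneous.
        return float("inf") if slowest > 0 else 1.0
    return slowest / med


def flag_skewed_stages(stages, threshold=3.0):
    """Return names of stages whose slowest task exceeds threshold x median.

    The threshold is an arbitrary illustrative cutoff, not a recommended value.
    """
    return [
        name
        for name, durations in stages.items()
        if skew_ratio(durations) > threshold
    ]


# Hypothetical sample data: four parallel tasks per stage.
stages = {
    "stage_0": [10.0, 11.0, 9.5, 10.5],  # evenly balanced
    "stage_1": [10.0, 11.0, 9.5, 95.0],  # one straggler holds up the stage
}

print(flag_skewed_stages(stages))
```

A stage is only as fast as its slowest task, so a single straggler like the one in stage_1 sets the stage's run time even though the other tasks finish quickly. An aggregate, job-level metric would not show which sub-part of the job is responsible.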

There’s no shortage of so-called cloud cost management products available (Cloudability, Apptio, CloudCheckr) to help visualize where the money is going, but they don’t really go any deeper than a bird’s-eye overview of aggregated spend.

Even the cloud provider- and platform-native point tools (Overwatch, AWS Cost Explorer, Microsoft Cost Management) still leave it to you to do most of the heavy lifting when it comes to making smarter spending decisions about how you’re running things.

Both Raj and Sammi (MegaBankCorp’s engineering team leads) have worked with Datadog for years in one capacity or another, and they see what Dynatrace does for their web application counterparts on the other side of the house. But neither APM solution captured the depth of detail needed to visualize and optimize data costs (maybe 10-15% of it, at best). Often the APM information actually muddied the picture, because they wound up trying to compare apples and oranges. The bank was using Cloudability to track cloud costs, but it couldn’t give them much deeper information on data costs in particular than what was already on their AWS bill. A lot of Databricks information was available from Overwatch, but most of the granular job- and user-specific details were buried deep in the Spark UI.