Automate Alerts and “Circuit Breaker” Actions

When a cost/usage guardrail is violated, the next step could be as benign as sending an alert or as aggressive as automatically taking corrective action.

Sammi wanted each MegaBankCorp data engineer to get an alert whenever one of their jobs exceeded a guardrail threshold. If he’s going to hold them individually accountable for their usage and cost, they have to know when they have gone “out of bounds.” The DataFinOps team set up the governance policy to trigger a Slack alert to that data engineer, telling them that their job is taking too long and will miss its SLA, or that it costs too much money and they need to find less expensive options, rewrite it to be more efficient, reschedule it, etc. Sammi and Raj are also notified via alerts (Sammi by email, Raj by Slack) of performance/cost violations, usually involving DBU usage, that put one budget or another in jeopardy of being exceeded. Michael and Nan want heads-up alerts for potential budget overruns at a higher, more aggregated level. Jon also gets an alert for any cost-governance violation involving a Fraud Detection workload.
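The routing rules described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Unravel's or MegaBankCorp's actual policy format: the Violation record, its field names, and the route_alerts function are hypothetical, and the Slack channel for Jon is an assumption, since the text does not say how he is notified. The aggregated budget alerts for Michael and Nan are left out because they apply at a different level than a single job.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    """Hypothetical record of one guardrail violation."""
    job_id: str
    owner: str            # data engineer who owns the job
    workload: str         # e.g. "Fraud Detection"
    kind: str             # "sla" or "cost"
    budget_at_risk: bool  # True if the violation puts a budget in jeopardy


def route_alerts(v: Violation) -> list[tuple[str, str]]:
    """Return (recipient, channel) pairs for one violation."""
    # The job owner is always told, via Slack.
    recipients = [(v.owner, "slack")]
    # Sammi (email) and Raj (Slack) hear about budget-threatening violations.
    if v.budget_at_risk:
        recipients += [("Sammi", "email"), ("Raj", "slack")]
    # Jon is alerted for Fraud Detection workloads (channel is assumed).
    if v.workload == "Fraud Detection":
        recipients.append(("Jon", "slack"))
    return recipients
```

Keeping the routing as a pure function that returns recipients, rather than sending messages directly, makes the policy easy to review and test before it is wired to a real notification service.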

Sometimes violations of MegaBankCorp’s cost-governance guardrails require more immediate action: automated “circuit breakers” such as killing a particular job or shutting down a Databricks cluster (say, when a fatal data-quality problem has gotten into the pipeline).

An example of automated alerts and autonomous actions can be seen in Unravel’s AutoActions & Alerts demo video and self-guided tour.

Alerts

Alerts should be flexible and customizable for different users, teams, applications/pipelines, business units, and projects. The threshold criteria for a complex ML model or fraud detection analysis are very different from those for a smaller, discrete data science workload. Best practice is to keep alerts meaningful so as to avoid alert storms and alarm fatigue.
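As a rough illustration of both ideas, the sketch below pairs per-workload thresholds with a simple cooldown that suppresses repeat alerts for the same job. Everything in it is hypothetical: the workload names, the DBU limits, and the AlertThrottle class are made up for the example and do not come from any particular product.

```python
# Hypothetical default DBU limit for workloads without an override.
DEFAULT_DBU_LIMIT = 100.0

# Hypothetical per-workload overrides: a large ML workload gets far more
# headroom than a small, discrete data science job.
DBU_LIMITS = {
    "fraud-detection-ml": 2000.0,
    "adhoc-data-science": 50.0,
}


def dbu_limit(workload):
    """Return the workload's own limit, falling back to the default."""
    return DBU_LIMITS.get(workload, DEFAULT_DBU_LIMIT)


class AlertThrottle:
    """Suppress repeat alerts for the same key within a cooldown window."""

    def __init__(self, cooldown_s):
        self.cooldown_s = cooldown_s
        self._last_sent = {}  # key -> timestamp of the last alert sent

    def should_send(self, key, now_s):
        last = self._last_sent.get(key)
        if last is not None and now_s - last < self.cooldown_s:
            return False  # still inside the cooldown window: stay quiet
        self._last_sent[key] = now_s
        return True
```

With a 10-minute cooldown, for example, a job that keeps breaching its limit produces one alert every 10 minutes instead of one per check, while a different job's first breach is still reported immediately.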

When sent up the chain of command, automated alerts help team leaders understand where the trouble spots and problem areas lie: they can identify unapproved spend, rein in rogue users, put the brakes on runaway jobs, flag potential budget overruns, and generally nip cost overruns in the bud.

Autonomous actions

Some guardrail violations might call for triggering preemptive corrective “circuit breaker” actions to terminate jobs or applications altogether, request configuration changes, etc.

Raj doesn’t have enough people on his Ops teams at MegaBankCorp to tune every application that violates some guardrail threshold. They’re already overwhelmed with trouble tickets and service requests, and they have their hands full firefighting their Cloudera environment. Still, there are quite a few times when some ungoverned application can run up the cloud data bill very quickly. Unfortunately, Raj usually doesn’t find out about these runaway jobs until after the fact, when unnecessary costs have already been incurred. For a whole class of jobs (defined by user, workload, team, priority, etc.), he can have runaway and rogue jobs killed automatically if they break loose of the guardrails. Or he can have a configuration change requested, such as a container with more memory, if insufficient memory is what is causing a job to run too long (and therefore cost too much).
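The decision Raj delegates to automation might look something like the following sketch. It is one possible policy under stated assumptions, not a description of how any tool implements it: the Guardrail and JobStatus types, their fields, and the ordering of the checks (prefer a memory fix over a kill) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    ALERT = "alert"
    REQUEST_MORE_MEMORY = "request_more_memory"
    KILL = "kill"


@dataclass(frozen=True)
class Guardrail:
    """Hypothetical limits for one class of jobs."""
    max_cost_usd: float
    max_runtime_min: float
    auto_kill: bool  # is this class eligible for automatic termination?


@dataclass(frozen=True)
class JobStatus:
    """Hypothetical snapshot of a running job."""
    cost_usd: float
    runtime_min: float
    memory_bound: bool  # True if the job is slow for lack of container memory


def decide_action(job: JobStatus, rail: Guardrail) -> Action:
    over_cost = job.cost_usd > rail.max_cost_usd
    over_time = job.runtime_min > rail.max_runtime_min
    if not (over_cost or over_time):
        return Action.NONE
    # Assumed preference: if the job is slow because it is starved of
    # memory, ask for a bigger container instead of killing it.
    if over_time and job.memory_bound:
        return Action.REQUEST_MORE_MEMORY
    # Otherwise trip the circuit breaker, but only for eligible classes.
    if rail.auto_kill:
        return Action.KILL
    return Action.ALERT
```

A job class that should never be terminated automatically sets auto_kill to False, in which case a violation falls back to an alert and a person decides what to do.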