YongJin Lee

Engineering Data, Investing in Tomorrow, Journeying Through Life.

How to create an alert for BigQuery Slowness or Bottleneck

Posted by:

|

On:

|

Why do we want to create an alert for Slow BigQuery

As a Data Engineer and someone who supports the users, it is not great when you start your morning with Slack channel messages about BigQuery or other databases being slow or hanging. I will list some steps to create an alert to detect BigQuery Slowness using Google Cloud Monitoring. 

While the users’ reports can be seen as organic alerts, I always prefer to know things proactively, especially recurring ones.

Most of the time, the cases I have seen are: 1) A User running a query that takes all the slots: cartesian joins, joins on improper conditions, etc. 2) Rarely, there is something going wrong with BigQuery itself. Then it is time to file a report to Google.

How to create a simple alert in Google Cloud Monitoring.

Let’s talk about setting up simple alerting on Google Cloud Monitoring to detect the slowness. This alerting policy will trigger when the 99th percentile of the execution tie of a BigQuery query exceeds a user-defined limit, based on Google’s documentation. I thought this would be enough as when there is slowness; it also shifts everyone’s query execution time as well.

Steps to create an alert

  1. Within the GCP Console and your desired project, go to Monitoring. (I usually use the Search bar on the top to get there). 
  2. On the left side panel, there is an “Alerting” tab. Click it! 
  3. Click the Create Policy button to create a new alerting policy. 
  4. Within Select a Metric, go to BigQuery Project > Query > Query Execution times. Click Query Execution times and hit the Apply button. 
  5. We are now on the Select a Metric tab. For the metric, here is the metric that Google recommends. However, for where I work, due to the batch processing running mostly daily (fewer hourly jobs than my previous orgs), I had to smooth the metrics out by choosing the rolling window of 6 hours (still enough to capture the previously reported slowness). Choose sum for the Rolling window function. Click the Across time series carrot down button and choose 99th percentile as the time series analysis. 
  6. Click Next 
  7. By going through the chart on the right in various time range tabs (1 hr, 6 hr, 1 day, 7 days, 30 days, etc), choose the appropriate threshold value. Too many alerts will make the alert easily ignored by your team and other users. Too little alerting will cause you not to detect the meaningful ones. Now, let’s fill out the blanks:
    Condition Type: Threshold. 
    Alert Triggers: Any time series violates. 
    Threshold position: Above Threshold, 
    Set the threshold value to what you decided.
    In Advanced Options: set up a retest window. This can be narrower than your window range for the alert.
    Then name the condition and click the `Next` button. (make sure to set up a retest window so the incident – it will help the incident get auto-resolved once it is no longer an issue).
  8. Choose how you want to get notified.  (More on how to set up PagerDuty alert in the next section). Fill out the details for the incident creation.
  9. Review Alert, then hit Create Policy!

Integrating Alerts with PagerDuty

Having an alert is great, but the way it notifies can make all the difference, especially during critical scenarios. That’s where PagerDuty comes in. If you’ve ever been on-call, you’re probably familiar with PagerDuty. It’s a robust incident management tool that can integrate with numerous platforms, including Google Cloud Monitoring, to make sure the right people get notified at the right time.

Why Integrate with PagerDuty?

  1. Immediate Attention: Every minute counts when BigQuery slows down, and it impacts your users or other applications. Getting a phone call or a loud notification can make all the difference compared to just an email.
  2. On-call Rotations: With PagerDuty, you can set up on-call rotations, ensuring that there’s always someone responsible for checking alerts.
  3. Escalation Policies: If the first person doesn’t respond, it can automatically escalate to another team member or even a manager.
  4. Centralized Incident Management: Track incidents and postmortems and analyze trends all in one place.

Steps to Integrate Google Cloud Monitoring with PagerDuty:

  1. Setup PagerDuty Service:
    • If you haven’t already, create a PagerDuty account.
    • Inside PagerDuty, go to Services and click on New Service.
    • Give it a name, like “BigQuery Alerts”, and choose an escalation policy.
    • Once created, go to Integrations tab within your service. Click “Add another integration”. Click Event API V2.
    • From your newly created Event API V2, grab your Integration Key.
  2. Integrate with Google Cloud Monitoring:
    • Return to your GCP Console and the Alerting policy you created.
    • In the ‘Notification Channels’ section, click on Add Notification Channel.
    • Choose PagerDuty Service from the list and enter the “Integration Key” from the previous step into “Service Key” section.
  3. Test The Integration:
    • Trigger a test alert to make sure the integration works. This can be done by temporarily setting your thresholds to a level that will trigger an alert or by using a test environment.
  4. Tune and Optimize:
    • Over time, keep an eye on how many alerts you’re getting. Adjust your thresholds or filter out non-essential alerts if it’s too noisy.
    • Utilize PagerDuty’s features, like setting up “quiet hours” for non-critical alerts or creating different severity levels based on the alert’s impact.

Final Thoughts:

While Slack notifications or emails might work for some alerts, when it comes to critical issues like BigQuery slowdowns, having a tool like PagerDuty ensures that your team doesn’t miss the alert. With its advanced features, you can optimize your incident management process, ensuring quick resolution times and better user experiences.

Remember, the goal is to get alerted and respond and rectify the situation as quickly as possible. With Google Cloud Monitoring and PagerDuty working in tandem, you’re setting up a robust and efficient alerting mechanism for your BigQuery environment.

How to create an alert for BigQuery Slowness or Bottleneck

That’s it! Now you and your team are set up to proactively know whenever there is some slowness to the BigQuery.

Noting I also set up the alert based on the average execution times as well. You can also set up the alert based on the slot utilization. Please feel free to play around with alert metrics and thresholds!

Please let me know if you have any questions!