YongJin Lee

Engineering Data, Investing in Tomorrow, Journeying Through Life.

Tracking Consumption History in BigQuery Tables: A Guide to Retrieving the Latest Access Date


In today’s post, I’ll share insights into tracing the consumption history of BigQuery tables. This isn’t just for the curious; it’s a necessary step in optimizing data management in BigQuery and keeping datasets relevant. By understanding table utilization, organizations can streamline storage and computation costs, deprecate unused data, and offer more up-to-date datasets.

Imagine this: a team dives into a pool of available data tables only to find out they’re analyzing stale data built on outdated business logic. While creating new data pipelines and products can be exhilarating, maintaining them is equally vital. Maintenance may not steal the limelight the way a newly developed ML model or a complex data pipeline does, but it ensures that resources are not wasted on obsolete pipelines.


BigQuery Audit Logs Sink

Google Cloud’s Audit Logs are more than just logs; they’re a detailed record of your interactions with Google Cloud services. Although INFORMATION_SCHEMA views like JOBS_BY_ORGANIZATION or JOBS_BY_PROJECT give us a glance at query activity, they fall short in showing tables accessed outside of BigQuery jobs.

Seeking a solution, I turned to the Google Community forum and discovered that Cloud Audit Logs hold the key. These logs capture both internal and external operations, including tasks like export jobs to GCS or Bigtable, as well as data accessed via platforms like Spark.

To tap into these logs, follow these steps:

  1. Access through Cloud Logging: Here’s the filter you’ll need:
protoPayload.metadata."@type"="type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata" AND logName="projects/{project-id}/logs/cloudaudit.googleapis.com%2Fdata_access"
  2. Set up a Sink (a quick sanity check on the destination table follows these steps):
gcloud logging sinks create my-example-sink \
  bigquery.googleapis.com/projects/my-project-id/datasets/auditlog_dataset \
  --log-filter='protoPayload.metadata."@type"="type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata" AND logName="projects/{project-id}/logs/cloudaudit.googleapis.com%2Fdata_access"'
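Once the sink is created, data-access logs should start landing in the destination dataset (auditlog_dataset above) in a table named along the lines of cloudaudit_googleapis_com_data_access, which is the name the queries below assume. Here is a minimal sanity-check sketch, assuming that table name and a partitioned (non-sharded) export:

-- Sanity check (a sketch): confirm that data-access log entries are arriving.
-- Assumes the partitioned export table name used in the queries below.
SELECT
  DATE(timestamp) AS log_date,
  COUNT(*) AS log_entries
FROM
  `my-project-id.auditlog_dataset.cloudaudit_googleapis_com_data_access`
WHERE
  timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY
  log_date
ORDER BY
  log_date;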

Harnessing Audit Logs to Retrieve the Latest BigQuery Table Access Date from Consumption History

Here’s a SQL script to extract the latest access date:

WITH ExtractedData AS (
  SELECT
    DATE(timestamp) AS log_date,
    -- resourceName has the form projects/{project}/datasets/{dataset}/tables/{table}
    SPLIT(protopayload_auditlog.resourceName, '/')[SAFE_OFFSET(1)] AS project_id,
    SPLIT(protopayload_auditlog.resourceName, '/')[SAFE_OFFSET(3)] AS dataset_id,
    SPLIT(protopayload_auditlog.resourceName, '/')[SAFE_OFFSET(5)] AS table_name
  FROM 
    `{project_id}.{dataset_id}.cloudaudit_googleapis_com_data_access` 
  WHERE 
    timestamp >= TIMESTAMP("{YYYY-MM-DD}") 
    AND JSON_QUERY(protopayload_auditlog.metadataJson, "$.tableDataRead") IS NOT NULL
)

SELECT
  project_id,
  dataset_id,
  table_name,
  MAX(log_date) AS last_access_date
FROM 
  ExtractedData
GROUP BY 
  project_id, dataset_id, table_name
ORDER BY 
  last_access_date ASC;
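If the goal is to find deprecation candidates, it can also help to compare this result against the tables that actually exist, so that tables with no recorded access at all surface too. Below is a rough sketch under a few assumptions: it uses the dataset-level INFORMATION_SCHEMA.TABLES view, and `{target_dataset}` is a placeholder for whichever dataset you want to audit.

-- A sketch: list every table in one dataset next to its last recorded access date.
-- Tables never read in the audited window come back with a NULL last_access_date
-- (NULLs sort first under ASC), which makes deprecation candidates easy to spot.
WITH LastAccess AS (
  SELECT
    SPLIT(protopayload_auditlog.resourceName, '/')[SAFE_OFFSET(3)] AS dataset_id,
    SPLIT(protopayload_auditlog.resourceName, '/')[SAFE_OFFSET(5)] AS table_name,
    MAX(DATE(timestamp)) AS last_access_date
  FROM
    `{project_id}.{dataset_id}.cloudaudit_googleapis_com_data_access`
  WHERE
    timestamp >= TIMESTAMP("{YYYY-MM-DD}")
    AND JSON_QUERY(protopayload_auditlog.metadataJson, "$.tableDataRead") IS NOT NULL
  GROUP BY
    dataset_id, table_name
)

SELECT
  t.table_name,
  la.last_access_date
FROM
  `{project_id}.{target_dataset}.INFORMATION_SCHEMA.TABLES` AS t
LEFT JOIN
  LastAccess AS la
ON
  la.dataset_id = t.table_schema
  AND la.table_name = t.table_name
ORDER BY
  la.last_access_date ASC;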

For tables whose names end in @YYYYMMDD or $YYYYMMDD (sharded tables or partition decorators), the version below might be more useful:

WITH ExtractedData AS (
  SELECT
    DATE(timestamp) AS log_date,
    -- Strip any $YYYYMMDD or @YYYYMMDD suffix before splitting the resource path
    SPLIT(SPLIT(SPLIT(protopayload_auditlog.resourceName, '$')[SAFE_OFFSET(0)], '@')[SAFE_OFFSET(0)], '/') AS parts
  FROM 
    `{project_id}.{dataset_id}.cloudaudit_googleapis_com_data_access` 
  WHERE 
    TIMESTAMP_TRUNC(timestamp, DAY) >= TIMESTAMP("2023-04-10") 
    AND JSON_QUERY(protopayload_auditlog.metadataJson, "$.tableDataRead") IS NOT NULL
)

SELECT
  parts[SAFE_OFFSET(1)] AS project_id,
  parts[SAFE_OFFSET(3)] AS dataset_id,
  parts[SAFE_OFFSET(5)] AS table_name,
  MAX(log_date) AS last_access_date
FROM 
  ExtractedData
GROUP BY 
  1, 2, 3
ORDER BY 
  4 ASC

However, I’d recommend refining the query further for optimal performance.
* Please verify that the `protopayload_auditlog.methodName` values match what you expect to count as consumption. You may also want to filter out certain service accounts using the `protopayload_auditlog.authenticationInfo.principalEmail` column; I had to exclude a few because some of my organization’s service accounts access tables only for maintenance tasks.
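For example, you could first check which methodName values actually show up in your logs, and then extend the WHERE clause of the queries above accordingly. The filters sketched in the comments below are assumptions to adapt to your own organization, not fixed values:

-- See which methodName values appear in your data-access logs before filtering.
SELECT
  protopayload_auditlog.methodName,
  COUNT(*) AS log_entries
FROM
  `{project_id}.{dataset_id}.cloudaudit_googleapis_com_data_access`
GROUP BY
  1
ORDER BY
  2 DESC;

-- Then extend the WHERE clause of the queries above, for example:
--   AND protopayload_auditlog.methodName IN ('{method_name_1}', '{method_name_2}')
--   AND protopayload_auditlog.authenticationInfo.principalEmail
--       NOT LIKE '%{maintenance_sa_pattern}%'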

Wrapping Up

By leveraging these steps, you can effortlessly fetch the consumption history of BigQuery tables within your GCP projects. This practice empowers data managers to treat data as dynamic assets, maintaining their relevance and value over time.