In today’s post, I’ll share insights into tracing the consumption history of BigQuery tables. This activity isn’t just for the curious; it’s a necessary step in optimizing data management within BigQuery and ensuring its relevance. By understanding table utilization, organizations can streamline storage and computation costs, deprecate unused data, and offer more up-to-date datasets.
Imagine this: a team dives into a pool of available data tables only to find out they’re analyzing stale data with outdated business logic. While creating new data pipelines and products can be exhilarating, maintaining them is equally vital. It may not always steal the limelight like a newly developed ML model or complex data pipelines, but regular maintenance ensures that resources are not wasted on obsolete pipelines.
BigQuery Audit Logs Sink
Google Cloud’s Audit Logs are more than just logs; they’re a detailed record of your interactions with Google Cloud services. Although INFORMATION_SCHEMA views like JOBS_BY_ORGANIZATION or JOBS_BY_PROJECT give us a glimpse of query activity, they fall short when tables are read from outside BigQuery jobs.
Seeking a solution, I turned to the Google Community forum and discovered that Cloud Audit Logs hold the key. These logs capture both internal and external operations, including jobs that sink data to GCS or Bigtable, as well as data accessed via platforms like Spark.
To tap into these logs, follow these steps:
- Access through Cloud Logging: Here’s the filter you’ll need:
protoPayload.metadata."@type"="type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata" AND logName="projects/{project-id}/logs/cloudaudit.googleapis.com%2Fdata_access"
- Set up a Sink:
- Manually via the Cloud Logging UI.
- For those using Terraform, set up a BigQuery sink with the google_logging_project_sink resource (see the Terraform Google provider documentation for details).
- Alternatively, use the CLI:
gcloud logging sinks create my-example-sink \
bigquery.googleapis.com/projects/my-project-id/datasets/auditlog_dataset \
--log-filter='protoPayload.metadata."@type"="type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata" AND logName="projects/{project-id}/logs/cloudaudit.googleapis.com%2Fdata_access"'
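Once the sink is active, new data-access entries start landing in the target dataset (a sink only exports log entries written after it was created, so there is no backfill). As a quick sanity check, assuming the sink writes to the auditlog_dataset dataset from the command above and that the log table is named cloudaudit_googleapis_com_data_access as in the queries below, you can confirm entries are arriving with something like:
-- Sanity check: count data-access log entries received over the last day
SELECT
DATE(timestamp) AS log_date,
COUNT(*) AS entries
FROM
`my-project-id.auditlog_dataset.cloudaudit_googleapis_com_data_access`
WHERE
timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY
log_date;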
Harnessing Audit Logs to Find the Latest BigQuery Table Access Date
Here’s a SQL script to extract the latest access date:
WITH ExtractedData AS (
SELECT
DATE(timestamp) as log_date,
-- resourceName has the form projects/{project}/datasets/{dataset}/tables/{table}
SPLIT(protopayload_auditlog.resourceName, '/')[SAFE_OFFSET(1)] AS project_id,
SPLIT(protopayload_auditlog.resourceName, '/')[SAFE_OFFSET(3)] AS dataset_id,
SPLIT(protopayload_auditlog.resourceName, '/')[SAFE_OFFSET(5)] AS table_name
FROM
`{project_id}.{dataset_id}.cloudaudit_googleapis_com_data_access`
WHERE
timestamp >= TIMESTAMP("{YYYY-MM-DD}")
AND JSON_QUERY(protopayload_auditlog.metadataJson, "$.tableDataRead") IS NOT NULL
)
SELECT
project_id,
dataset_id,
table_name,
MAX(log_date) AS last_access_date
FROM
ExtractedData
GROUP BY
project_id, dataset_id, table_name
ORDER BY
last_access_date ASC;
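To turn this into a list of deprecation candidates, one option is to materialize the result above (here into a hypothetical table I call table_last_access) and left-join it against INFORMATION_SCHEMA.TABLES, so tables with no recorded read since your cutoff surface explicitly. A minimal sketch, assuming your datasets live in the US multi-region:
-- Tables with no tableDataRead recorded since the cutoff (candidates for review).
-- Assumes the previous query's result was saved as `{project_id}.{dataset_id}.table_last_access`.
SELECT
t.table_schema AS dataset_id,
t.table_name
FROM
`{project_id}`.`region-us`.INFORMATION_SCHEMA.TABLES AS t
LEFT JOIN
`{project_id}.{dataset_id}.table_last_access` AS a
ON a.dataset_id = t.table_schema
AND a.table_name = t.table_name
WHERE
a.table_name IS NULL
ORDER BY
dataset_id, table_name;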
For tables whose names end in a @YYYYMMDD or $YYYYMMDD decorator, the version below, which strips the decorator before splitting the resource name, might be more useful (for instance, a read of .../tables/events$20230410 is attributed to the base table events):
WITH ExtractedData AS (
SELECT
DATE(timestamp) as log_date,
-- Strip any $YYYYMMDD or @YYYYMMDD decorator, then split the resource name into its path parts
SPLIT(SPLIT(SPLIT(protopayload_auditlog.resourceName, '$')[SAFE_OFFSET(0)], '@')[SAFE_OFFSET(0)], '/') AS parts
FROM
`{project_id}.{dataset_id}.cloudaudit_googleapis_com_data_access`
WHERE
TIMESTAMP_TRUNC(timestamp, DAY) >= TIMESTAMP("2023-04-10")
AND JSON_QUERY(protopayload_auditlog.metadataJson, "$.tableDataRead") IS NOT NULL
)
SELECT
parts[SAFE_OFFSET(1)] AS project_id,
parts[SAFE_OFFSET(3)] AS dataset_id,
parts[SAFE_OFFSET(5)] AS table_name,
MAX(log_date) AS last_access_date
FROM
ExtractedData
GROUP BY
1, 2, 3
ORDER BY
4 ASC;
However, I’d recommend refining the query further for performance, for example by tightening the timestamp filter so you only scan the log partitions (or date shards) you actually need.
* Please verify that the protopayload_auditlog.methodName values match what you expect to count as consumption. You may also want to filter out certain service accounts using the `protopayload_auditlog.authenticationInfo.principalEmail` column; I had to exclude a few because some of my organization’s service accounts only access tables for maintenance tasks.
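As a starting point for that verification, an exploration query like the sketch below, which reuses the tableDataRead filter from above, shows which methods and principals generate the read events so you can decide what to exclude:
-- Explore which API methods and principals generate read events,
-- e.g. to spot service accounts that only perform maintenance tasks.
SELECT
protopayload_auditlog.methodName AS method_name,
protopayload_auditlog.authenticationInfo.principalEmail AS principal_email,
COUNT(*) AS access_events
FROM
`{project_id}.{dataset_id}.cloudaudit_googleapis_com_data_access`
WHERE
JSON_QUERY(protopayload_auditlog.metadataJson, "$.tableDataRead") IS NOT NULL
GROUP BY
1, 2
ORDER BY
access_events DESC;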
Wrapping Up
By following these steps, you can fetch the consumption history of BigQuery tables across your GCP projects. This practice empowers data managers to treat data as a dynamic asset, maintaining its relevance and value over time.