The ML.DESCRIBE_DATA function

This document describes the ML.DESCRIBE_DATA function, which you can use to generate descriptive statistics for the columns in a table or subquery. For example, you might want to know statistics for a table of training or serving data that you plan to use with a machine learning (ML) model. You can use the data output by this function for such purposes as feature preprocessing or model monitoring.

Syntax

ML.DESCRIBE_DATA(
  { TABLE `project_id.dataset.table` | (query_statement) },
  STRUCT(
    [num_quantiles AS num_quantiles]
    [, num_array_length_quantiles AS num_array_length_quantiles]
    [, top_k AS top_k])
)

Arguments

ML.DESCRIBE_DATA takes the following arguments:

project_id: your project ID.
dataset: the BigQuery dataset that contains the table.
table: the name of the input table that contains the training or serving data to calculate statistics for.
query_statement: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of the query_statement clause, see GoogleSQL query syntax.
num_quantiles: an INT64 value that specifies the number of quantiles to return for numerical, ARRAY, and ARRAY> columns. This affects the number of results shown in the quantiles output column. These quantiles describe the distribution of the data in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. The num_quantiles value must be in the range [1, 100,000]. The default value is 2.
num_array_length_quantiles: an INT64 value that specifies the number of quantiles to return for ARRAY columns. This affects the number of results shown in the array_length_quantiles output column. These quantiles describe the distribution of the length of the arrays in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. The num_array_length_quantiles value must be in the range [1, 100,000]. The default value is 10.
top_k: an INT64 value that specifies the number of top values to return for categorical and ARRAY columns. This affects the number of results shown in the top_values output column. The top values are the values that are shown most frequently in the column. The top_k value must be in the range [1, 10,000]. The default value is 1.

Details

ML.DESCRIBE_DATA handles input columns as follows:

ARRAY columns are unnested before statistics are computed on them.
ARRAY>. The INT64 value is the index, and the numerical value is the value. For statistics computation, BigQuery ML treats columns of this type as ARRAY based on the value. The value of the dimension column in the output is MAX(index) + 1.
STRUCT fields are expanded, and then categorical columns are cast to STRING and numerical columns are cast to FLOAT64.
Columns of the following data types are cast to STRING and return the same statistics as STRING columns:
- BOOL
- BYTE
- DATE
- DATETIME
- TIME
- TIMESTAMP Columns of the following data types are cast to FLOAT64 and return the same statistics as FLOAT64 columns:
- INT64
- NUMERIC
- BIGNUMERIC

Output

ML.DESCRIBE_DATA returns one row for each column in the input data. ML.DESCRIBE_DATA output contains the following columns:

name: a STRING column that contains the name of the input column.
num_rows: an INT64 column that contains the total number of rows for the input column.
num_nulls: an INT64 column that returns the number of NULL values found in the column.
num_zeros: an INT64 column that contains one of the following:
- For numerical, ARRAY, and ARRAY> input columns, returns the number of 0 values found in the column.
- For categorical or ARRAY input columns, returns NULL.
min: a STRING column that contains the MIN value for the column.
max: a STRING column that contains the MAX value for the column.
mean: a FLOAT64 column that contains one of the following:
- For numerical, ARRAY, and ARRAY> input columns, returns the mean value calculated for the column.
- For categorical or ARRAY input columns, returns NULL.
stdev: a FLOAT64 column that contains one of the following:
- For numerical, ARRAY, and ARRAY> input columns, returns the standard deviation value calculated for the column.
- For categorical or ARRAY input columns, returns NULL.
median: a FLOAT64 column that contains one of the following:
- For numerical, ARRAY, and ARRAY> input columns, returns the median value calculated for the column.
- For categorical or ARRAY input columns, returns NULL.
quantiles: an ARRAY column that contains information about the quantiles in an input column, as computed by the APPROX_QUANTILES function. The quantiles column contains one of the following values:
- For numerical, ARRAY, and ARRAY> input columns, returns the quantiles computed for the column.
- For categorical or ARRAY input columns, returns NULL.
unique: an INT64 column that contains information about the number of unique values in an input column, as computed by the APPROX_COUNT_DISTINCT function. The unique column contains one of the following values:
- For numerical, ARRAY, and ARRAY> input columns, returns NULL.
- For categorical or ARRAY input columns, returns the number of unique values in the input column.
avg_string_length: a FLOAT64 column that contains one of the following:
- For numerical, ARRAY, and ARRAY> input columns, returns NULL.
- For categorical or ARRAY input columns, returns the average length of the values in the column.
num_values: an INT64 column that contains the number of array elements for ARRAY columns, and the number of values in the column for other types of columns.
top_values: a ARRAY> column that contains information about the top values and number of occurrences in an input column, as computed by the APPROX_TOP_COUNT function. The top_values column contains the following fields:
- top_values.value: a STRING field that contains one of the following values:
  - For numerical, ARRAY, and ARRAY> input columns, returns NULL.
  - For categorical or ARRAY input columns, returns one of the top values in the input column.
- top_values.count: an INT64 field that contains one of the following values:
  - For numerical, ARRAY, and ARRAY> input columns, returns NULL.
  - For categorical or ARRAY input columns, returns the number of times the related top value appears.
min_array_length: an INT64 column that contains one of the following values:
- For ARRAY input columns, returns the minimum length of an array in the column.
- For other types of input columns, returns NULL.
max_array_length: an INT64 column that contains one of the following values:
- For ARRAY input columns, returns the maximum length of an array in the column.
- For other types of input columns, returns NULL.
avg_array_length: a FLOAT64 column that contains one of the following values:
- For ARRAY input columns, returns the average length of an array in the column.
- For other types of input columns, returns NULL.
total_array_length: an INT64 column that contains one of the following values:
- For ARRAY input columns, returns the sum of the size of the arrays in the column.
- For other types of input columns, returns NULL.
array_length_quantiles: an ARRAY column that contains the information about the quantiles for the array length in an input column, as computed by the APPROX_QUANTILES function. The array_length_quantiles column contains one of the following values:
- For ARRAY input columns, returns the quantiles for the array length computed for the column.
- For other types of input columns, returns 0.
dimension: an INT64 column that contains one of the following:
- For ARRAY> input columns, returns the dimension computed for the column, which is MAX(index) + 1 for sparse input.
- For other types of input columns, returns NULL.

Example

The following example returns statistics for a table with five quantiles calculated for numeric columns and three top values returned for non-numeric columns:

SELECT *
FROM ML.DESCRIBE_DATA(
  TABLE `myproject.mydataset.mytable`,
  STRUCT(5 AS num_quantiles, 3 AS top_k)
);

Limitations

Input data for the ML.DESCRIBE_DATA function can only contain columns of the following data types:

Numeric types
STRING
BOOL
BYTE
DATE
DATETIME
TIME
TIMESTAMP
ARRAY> (a sparse tensor)
STRUCT columns that contain any of the following types:
- Numeric types
- STRING
- BOOL
- BYTE
- DATE
- DATETIME
- TIME
- TIMESTAMP
ARRAY columns that contain any of the following types:
- Numeric types
- STRING
- BOOL
- BYTE
- DATE
- DATETIME
- TIME
- TIMESTAMP