The ML.DESCRIBE_DATA function
This document describes the ML.DESCRIBE_DATA
function, which you can use to
generate descriptive statistics for the columns in a table or subquery. For
example, you might want to know statistics for a table of training or serving
data that you plan to use with a machine learning (ML) model. You can use the
data output by this function for such purposes as
feature preprocessing or model
monitoring.
Syntax
ML.DESCRIBE_DATA( { TABLE `project_id.dataset.table` | (query_statement) }, STRUCT( [num_quantiles AS num_quantiles] [, num_array_length_quantiles AS num_array_length_quantiles] [, top_k AS top_k]) )
Arguments
ML.DESCRIBE_DATA
takes the following arguments:
project_id
: your project ID.dataset
: the BigQuery dataset that contains the table.table
: the name of the input table that contains the training or serving data to calculate statistics for.query_statement
: a query that generates the training or serving data to calculate statistics for. For the supported SQL syntax of thequery_statement
clause, see GoogleSQL query syntax.num_quantiles
: anINT64
value that specifies the number of quantiles to return for numerical,ARRAY
, andARRAY
columns. This affects the number of results shown in the> quantiles
output column. These quantiles describe the distribution of the data in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. Thenum_quantiles
value must be in the range[1, 100,000]
. The default value is2
.num_array_length_quantiles
: anINT64
value that specifies the number of quantiles to return forARRAY
columns. This affects the number of results shown in thearray_length_quantiles
output column. These quantiles describe the distribution of the length of the arrays in the column. Specify a lower value for coarser-grained distribution information and a higher value for finer-grained distribution information. Thenum_array_length_quantiles
value must be in the range[1, 100,000]
. The default value is10
.top_k
: anINT64
value that specifies the number of top values to return for categorical andARRAY
columns. This affects the number of results shown in thetop_values
output column. The top values are the values that are shown most frequently in the column. Thetop_k
value must be in the range[1, 10,000]
. The default value is1
.
Details
ML.DESCRIBE_DATA
handles input columns as follows:
ARRAY
columns are unnested before statistics are computed on them.ARRAY
. The> INT64
value is the index, and the numerical value is the value. For statistics computation, BigQuery ML treats columns of this type asARRAY
based on the value. The value of the dimension column in the output isMAX(index) + 1
.STRUCT
fields are expanded, and then categorical columns are cast toSTRING
and numerical columns are cast toFLOAT64
.- Columns of the following data types are
cast to
STRING
and return the same statistics asSTRING
columns:BOOL
BYTE
DATE
DATETIME
TIME
TIMESTAMP
Columns of the following data types are cast toFLOAT64
and return the same statistics asFLOAT64
columns:INT64
NUMERIC
BIGNUMERIC
Output
ML.DESCRIBE_DATA
returns one row for each column in the input data.
ML.DESCRIBE_DATA
output contains the following columns:
name
: aSTRING
column that contains the name of the input column.num_rows
: anINT64
column that contains the total number of rows for the input column.num_nulls
: anINT64
column that returns the number ofNULL
values found in the column.num_zeros
: anINT64
column that contains one of the following:- For numerical,
ARRAY
, andARRAY
input columns, returns the number of> 0
values found in the column. - For categorical or
ARRAY
input columns, returnsNULL
.
- For numerical,
min
: aSTRING
column that contains theMIN
value for the column.max
: aSTRING
column that contains theMAX
value for the column.mean
: aFLOAT64
column that contains one of the following:- For numerical,
ARRAY
, andARRAY
input columns, returns the mean value calculated for the column.> - For categorical or
ARRAY
input columns, returnsNULL
.
- For numerical,
stdev
: aFLOAT64
column that contains one of the following:- For numerical,
ARRAY
, andARRAY
input columns, returns the standard deviation value calculated for the column.> - For categorical or
ARRAY
input columns, returnsNULL
.
- For numerical,
median
: aFLOAT64
column that contains one of the following:- For numerical,
ARRAY
, andARRAY
input columns, returns the median value calculated for the column.> - For categorical or
ARRAY
input columns, returnsNULL
.
- For numerical,
quantiles
: anARRAY
column that contains information about the quantiles in an input column, as computed by theAPPROX_QUANTILES
function. Thequantiles
column contains one of the following values:- For numerical,
ARRAY
, andARRAY
input columns, returns the quantiles computed for the column.> - For categorical or
ARRAY
input columns, returnsNULL
.
- For numerical,
unique
: anINT64
column that contains information about the number of unique values in an input column, as computed by theAPPROX_COUNT_DISTINCT
function. Theunique
column contains one of the following values:- For numerical,
ARRAY
, andARRAY
input columns, returns> NULL
. - For categorical or
ARRAY
input columns, returns the number of unique values in the input column.
- For numerical,
avg_string_length
: aFLOAT64
column that contains one of the following:- For numerical,
ARRAY
, andARRAY
input columns, returns> NULL
. - For categorical or
ARRAY
input columns, returns the average length of the values in the column.
- For numerical,
num_values
: anINT64
column that contains the number of array elements forARRAY
columns, and the number of values in the column for other types of columns.top_values
: aARRAY
column that contains information about the top values and number of occurrences in an input column, as computed by the> APPROX_TOP_COUNT
function. Thetop_values
column contains the following fields:top_values.value
: aSTRING
field that contains one of the following values:- For numerical,
ARRAY
, andARRAY
input columns, returns> NULL
. - For categorical or
ARRAY
input columns, returns one of the top values in the input column.
- For numerical,
top_values.count
: anINT64
field that contains one of the following values:- For numerical,
ARRAY
, andARRAY
input columns, returns> NULL
. - For categorical or
ARRAY
input columns, returns the number of times the related top value appears.
- For numerical,
min_array_length
: anINT64
column that contains one of the following values:- For
ARRAY
input columns, returns the minimum length of an array in the column. - For other types of input columns, returns
NULL
.
- For
max_array_length
: anINT64
column that contains one of the following values:- For
ARRAY
input columns, returns the maximum length of an array in the column. - For other types of input columns, returns
NULL
.
- For
avg_array_length
: aFLOAT64
column that contains one of the following values:- For
ARRAY
input columns, returns the average length of an array in the column. - For other types of input columns, returns
NULL
.
- For
total_array_length
: anINT64
column that contains one of the following values:- For
ARRAY
input columns, returns the sum of the size of the arrays in the column. - For other types of input columns, returns
NULL
.
- For
array_length_quantiles
: anARRAY
column that contains the information about the quantiles for the array length in an input column, as computed by theAPPROX_QUANTILES
function. Thearray_length_quantiles
column contains one of the following values:- For
ARRAY
input columns, returns the quantiles for the array length computed for the column. - For other types of input columns, returns
0
.
- For
dimension
: anINT64
column that contains one of the following:- For
ARRAY
input columns, returns the dimension computed for the column, which is> MAX(index) + 1
for sparse input. - For other types of input columns, returns
NULL
.
- For
Example
The following example returns statistics for a table with five quantiles calculated for numeric columns and three top values returned for non-numeric columns:
SELECT * FROM ML.DESCRIBE_DATA( TABLE `myproject.mydataset.mytable`, STRUCT(5 AS num_quantiles, 3 AS top_k) );
Limitations
Input data for the ML.DESCRIBE_DATA
function can only contain columns of the
following data types:
- Numeric types
STRING
BOOL
BYTE
DATE
DATETIME
TIME
TIMESTAMP
ARRAY
(a sparse tensor)> STRUCT
columns that contain any of the following types:- Numeric types
STRING
BOOL
BYTE
DATE
DATETIME
TIME
TIMESTAMP
ARRAY
columns that contain any of the following types:- Numeric types
STRING
BOOL
BYTE
DATE
DATETIME
TIME
TIMESTAMP