Schema discovery#

Schema discovery identifies and registers tables and views that are newly added to a known schema location. For example, when a logging process creates a new log file every hour, rolling over from the previous hour’s log file, schema discovery locates the newly added files so that SEP can query them.

Note

Schema discovery is available as a public preview. Contact Starburst Support with questions or feedback.

Enable schema discovery#

Schema discovery must be explicitly enabled for a catalog by adding the following catalog configuration property:

hive.experimental.schema-discovery.enabled=true

Before enabling schema discovery, review the limitations and warnings section at the end of this topic.
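
For example, a Hive catalog properties file with schema discovery enabled might look like the following. The connector name and metastore URI are placeholders for your own catalog configuration:

connector.name=hive
hive.metastore.uri=thrift://example.net:9083
hive.experimental.schema-discovery.enabled=true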

Run schema discovery#

Run schema discovery to analyze a schema in an object storage location and return the structure of any discovered tables. You can run schema discovery with either a system table or a table function.

To run schema discovery using a system table, specify the system table in the FROM clause, and the URI and schema in the WHERE clause, along with any options as shown in the following statement:

SELECT sql FROM hive.schema_discovery.discovery
WHERE uri = 'hdfs://hadoop-master:9000/user/hive/warehouse/customer_orders'
AND schema = 'starburst_demo'
AND options = 'sampleFilesPerTableModulo=1,excludePatterns=**/_SUCCESS'

To run schema discovery using a table function, you must call the schema_discovery.discovery function in the FROM clause, and specify the URI and schema along with any options as follows:

SELECT sql FROM TABLE(schema_discovery.discovery(uri => 's3://test-bucket/path/to/discoverable/table',
schema => 'demo_schema', options => 'sampleFilesPerTableModulo=1,excludePatterns=**/_SUCCESS'))
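
Both queries return generated DDL in the sql column, one row per discovered table. As a purely illustrative sketch (the table name, column names, and table properties here are invented), a row for a discovered CSV table might resemble:

CREATE TABLE demo_schema.discovered_orders (
   col0 varchar,
   col1 varchar
)
WITH (
   format = 'CSV',
   external_location = 's3://test-bucket/path/to/discoverable/table'
)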

The following sections describe the available options.

Default options#

By default, schema discovery runs with the following options and default values. Specify a different value in the options string, as shown in the previous examples, to override a default.

Option                       Default value

dateFormat                   yyyy-MM-dd
positiveInf                  Inf
generatedHeadersFormat       COL%d
nanValue                     NaN
locale                       US
lineSeparator
nullValue
complexHadoop                false
sampleLinesModulo            3
maxSampleFilesPerTable       5
skipFileExtensionsCheck      false
quote
includePatterns              **
delimiter                    ,
escape                       \
headers                      true
ignoreTrailingWhiteSpace     false
discoveryMode                NORMAL
maxSampleLines               10
excludePatterns              .*
sampleFilesPerTableModulo    3
partitionProjection          false
negativeInf                  -Inf
encoding                     UTF-8
inferJsonStringDecimal       false
supportBuckets               false
ignoreLeadingWhiteSpace      false
maxSampleTables              2147483647
hiveSkipHiddenFiles          false
timestampFormat              yyyy-MM-dd HH:mm:ss[.SSSSSSS]
comment                      #
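
For example, to override several of these defaults in one discovery call, list the options as comma-separated key=value pairs. The URI and schema below reuse the earlier example values:

SELECT sql FROM TABLE(schema_discovery.discovery(
    uri => 's3://test-bucket/path/to/discoverable/table',
    schema => 'demo_schema',
    options => 'dateFormat=MM/dd/yyyy,maxSampleFilesPerTable=10,discoveryMode=RECURSIVE_DIRECTORIES'))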

Standard options#

The following options are set at runtime, and apply to all schema discovery queries:

encoding

Character set to use when reading files.

locale

Locale for date parsing.

dateFormat

Date format pattern.

timestampFormat

Timestamp format pattern.

nanValue

Value that represents not-a-number.

positiveInf

Value that represents positive infinity.

negativeInf

Value that represents negative infinity.

includePatterns

Glob patterns for files to include in discovery.

excludePatterns

Glob patterns for files to exclude from discovery.

maxSampleLines

Maximum number of lines to sample from each file.

sampleLinesModulo

Sample every nth line; for example, a value of 3 samples every third line.

sampleFilesPerTableModulo

Sample every nth file; for example, a value of 3 samples every third file.

maxSampleTables

Maximum number of tables to discover.

maxSampleFilesPerTable

Maximum number of files to sample per table.

supportBuckets

If true, attempt to infer buckets.

forceTableFormat

Optional; force the table format. One of JSON, CSV, ORC, PARQUET, ICEBERG, DELTA_LAKE, or ERROR.

discoveryMode

Discovery mode. One of NORMAL (the default), which treats top-level directories as tables, or RECURSIVE_DIRECTORIES.

partitionProjection

If true, attempt to infer partition projection.

skipFileExtensionsCheck

Skip the file extension check when discovering the file format. Use when file formats and extensions do not match.

inferJsonStringDecimal

Try to parse JSON string values into Trino's DECIMAL type.

hiveSkipHiddenFiles

Skip files and directories whose names start with an underscore (_) or a period (.).
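
For example, a discovery call that restricts which files are sampled and forces the CSV format might combine several of these options. The URI and schema reuse the earlier example values:

SELECT sql FROM TABLE(schema_discovery.discovery(
    uri => 's3://test-bucket/path/to/discoverable/table',
    schema => 'demo_schema',
    options => 'includePatterns=**/*.csv,forceTableFormat=CSV,maxSampleLines=100,sampleLinesModulo=1'))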

CSV and text file Options#

The following options are set at runtime, and apply only to schema discovery queries on CSV and text files:

headers

If true, treat the first line as column names.

generatedHeadersFormat

Pattern used for auto-generated column names.

delimiter

Text file delimiter.

quote

Text file quote character.

escape

Text file escape character.

comment

Text file comment character.

nullValue

Text file null value.

ignoreLeadingWhiteSpace

If true, ignore leading white space.

ignoreTrailingWhiteSpace

If true, ignore trailing white space.

complexHadoop

If true, try to discover arrays, structs, and maps.

lineSeparator

Text line separator.
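
For example, to discover pipe-delimited files that have no header row, with column names auto-generated from the default COL%d pattern (the file location below is a placeholder):

SELECT sql FROM TABLE(schema_discovery.discovery(
    uri => 's3://test-bucket/path/to/headerless/table',
    schema => 'demo_schema',
    options => 'headers=false,delimiter=|,generatedHeadersFormat=COL%d'))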

Supported formats#

Schema discovery identifies objects stored in the Iceberg, Delta Lake, and Hive table formats. Schema discovery does not identify Hudi tables.

Note

For Iceberg and Delta Lake tables, schema discovery registers tables using the register_table procedure. For Hive tables, schema discovery registers tables using the table metadata.
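
For reference, register_table is the standard Trino procedure for registering an existing Iceberg or Delta Lake table with a catalog. A hypothetical manual invocation, assuming an Iceberg catalog named example_iceberg and an invented table name and location, might look like:

CALL example_iceberg.system.register_table(
    schema_name => 'demo_schema',
    table_name => 'events',
    table_location => 's3://test-bucket/path/to/iceberg/table')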

Schema discovery identifies tables and views that are stored in the following file formats:

  • JSON in Hive-JSON-SerDe format, with one JSON object per line and no line termination characters within each object.

  • CSV

  • ORC

  • PARQUET
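
For example, a JSON file in the supported format contains one complete object per line, similar to the following (the field names here are invented):

{"order_id": 1, "status": "shipped"}
{"order_id": 2, "status": "pending"}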

Schema discovery identifies tables and views that use the following compression codecs:

  • ZSTD

  • LZ4

  • SNAPPY

  • GZIP

  • DEFLATE

  • BZIP2

  • LZO

  • LZOP

Note

Schema discovery identifies compression codecs by file extension only. Ensure that compressed files end with the file extension that matches their codec.

Limitations and warnings#

  • Schema discovery only creates Hive tables.

  • Only JSON, CSV, ORC, and PARQUET file formats are supported.

  • Only ZSTD, LZ4, SNAPPY, GZIP, DEFLATE, BZIP2, LZO, and LZOP compression codecs are supported.

Warning

The public preview of this feature is not subject to role-based access limitations; it can discover tables and views in all schema locations it has access to. Even if you do not have access to a specific table, you can still use schema discovery to create a new table with the data from the restricted table if:

  • You have access to schema discovery.

  • You know the location of the restricted table.