Starburst Delta Lake connector#

The Delta Lake connector allows querying data stored in Delta Lake format, including Databricks Delta Lake. It can natively read the Delta transaction log and thus detect when external systems change data.

Note

This connector is currently available as a beta release for use with Azure Storage. Google Cloud Storage is not yet supported. Deployments using AWS or HDFS are fully supported. Work with the Starburst support team if you are planning to use this connector in production.

Note

The connector requires a valid Starburst Enterprise Presto license.

Configuration#

The connector requires a Hive metastore for table metadata and supports the same metastore configuration properties as the Hive connector. At a minimum, hive.metastore.uri must be configured.

The connector recognizes Delta tables created in the metastore by the Databricks runtime. Any non-Delta tables present in the metastore are not visible to the connector.

To configure the Delta Lake connector, create a catalog file, for example etc/catalog/delta.properties, that references the delta-lake connector. Update the hive.metastore.uri with the URI of your Hive metastore Thrift service:

connector.name=delta-lake
hive.metastore.uri=thrift://example.net:9083

If you are using AWS Glue as your Hive metastore, set hive.metastore to glue:

connector.name=delta-lake
hive.metastore=glue

The Delta Lake connector reuses certain functionality from the Hive connector, including the Thrift and Glue metastore configuration, detailed in the Hive connector documentation.

To configure access to S3, S3-compatible storage, Azure storage, and others, consult the Amazon S3 section of the Hive connector documentation or the Azure storage documentation, respectively.
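
For example, a catalog for Delta Lake data on Amazon S3 can include the S3 credential properties described in the Hive connector documentation. The following sketch assumes the connector accepts the same hive.s3.* properties as the Hive connector; the key values are placeholders:

connector.name=delta-lake
hive.metastore.uri=thrift://example.net:9083
# S3 credentials, reused from the Hive connector configuration (placeholders)
hive.s3.aws-access-key=<access-key>
hive.s3.aws-secret-key=<secret-key>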

Configuration properties#

All of the following configuration properties have reasonable, tested default values. Typical usage does not require you to configure them.

Delta Lake configuration properties#

  • delta.metadata.cache-ttl

    Frequency of checks for metadata updates (equivalent to transactions) that refresh the metadata cache.

    Default value: 5 min

  • delta.metadata.live-files.cache-size

    Amount of memory allocated for caching information about files. Must be specified as a data size value, such as 64 MB.

    Default value: calculated as 10% of the maximum memory allocated to the coordinator JVM

  • delta.max-partitions-per-writer

    Maximum number of partitions per writer.

    Default value: 100

  • delta.hide-non-delta-lake-tables

    Hide information about tables that are not managed by Delta Lake.

    Default value: false

  • delta.validate-table-locations

    Whether the connector checks that the location contains a valid transaction log entry when creating a new table.

    Default value: true
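
These properties are set in the catalog properties file together with the connector configuration. The following sketch uses arbitrary example values, not recommendations:

connector.name=delta-lake
hive.metastore.uri=thrift://example.net:9083
# check for new transactions every 10 minutes instead of the default 5 minutes
delta.metadata.cache-ttl=10m
# hide tables that are not managed by Delta Lake
delta.hide-non-delta-lake-tables=true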

The connector supports the following performance tuning catalog properties.

Warning

Performance tuning configuration properties are considered expert-level features. Altering these properties from their default values is likely to cause instability and performance degradation. We strongly suggest that you use them only to address non-trivial performance issues, and that you keep a backup of the original values if you change them.

Delta Lake performance tuning configuration properties#

  • delta.domain-compaction-threshold

    Sets the number of transactions that acts as the threshold. Once it is reached, the connector initiates compaction of the underlying files and the delta files. A higher compaction threshold means less data is read from the underlying data source, at the cost of higher memory and network consumption.

    Default value: 100

  • delta.max-outstanding-splits

    The target number of buffered splits for each table scan in a query, before the scheduler tries to pause.

    Default value: 1000

  • delta.max-splits-per-second

    Sets the maximum number of splits used per second to access the underlying storage. Reduce this number if your filesystem limits are routinely exceeded. The default is the absolute maximum value, which results in Presto maximizing the parallelization of data access. Attempting to set a higher value prevents Presto from starting.

    Default value: Integer.MAX_VALUE

  • delta.max-initial-splits

    For each query, the coordinator first assigns file sections to read at the max-initial-split-size until the max-initial-splits limit is reached. It then issues reads of the max-split-size.

    Default value: 200

  • delta.max-initial-split-size

    Sets the initial data size for a single read section assigned to a worker until max-initial-splits have been processed. You can also use the corresponding catalog session property <catalog-name>.max_initial_split_size.

    Default value: 32 MB

  • delta.max-split-size

    Sets the largest size for a single read section assigned to a worker after max-initial-splits have been processed. You must specify units. You can also use the corresponding catalog session property <catalog-name>.max_split_size.

    Default value: 64 MB
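
For example, the maximum split size can be adjusted for a single session with the catalog session property mentioned above. This sketch assumes the catalog is named delta and uses an arbitrary value:

SET SESSION delta.max_split_size = '128MB';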

The Delta Lake connector supports the following catalog session properties to configure the processing of Parquet files.

Parquet session properties#

  • <catalog-name>.parquet_max_read_block_size

    The maximum block size used when reading Parquet files.

  • <catalog-name>.parquet_writer_block_size

    The maximum block size created by the Parquet writer.

  • <catalog-name>.parquet_writer_page_size

    The maximum page size created by the Parquet writer.

  • <catalog-name>.parquet_optimized_writer_enabled

    Enables the experimental, native Parquet writer.
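
For example, assuming the catalog is named delta, the read block size can be changed for the current session as follows; the value is only an illustration:

SET SESSION delta.parquet_max_read_block_size = '16MB';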

Special columns#

In addition to the defined columns, the Delta Lake connector automatically exposes metadata in a number of hidden columns in each table. You can use these columns in your SQL statements like any other column. For example, you can select them directly or use them in conditional statements.

  • $path

    Full file system path name of the file for this row.

  • $file_modified_time

    Date and time of the last modification of the file for this row.

  • $file_size

    Size of the file for this row.
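
For example, the following query returns the hidden columns alongside the regular data; the table name is a placeholder, and the column names must be quoted because they start with $:

SELECT *, "$path", "$file_modified_time", "$file_size"
FROM delta.default.my_table;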

Creating tables#

When Delta tables exist in storage but not in the metastore, Presto can be used to register them:

CREATE TABLE delta.default.my_table (
   dummy bigint
)
WITH (
   location = '...'
)

Note that the columns listed in the DDL, such as dummy in the above example, are ignored. The table schema is read from the transaction log instead. If the schema is changed by an external system, Presto automatically uses the new schema.

The Delta Lake connector also supports creating tables using the CREATE TABLE AS syntax. A set of partition columns can optionally be provided using the partitioned_by table property. For example:

CREATE TABLE delta.default.my_partitioned_table
WITH (
   location = '...',
   partitioned_by = ARRAY['regionkey']
)
AS SELECT name, comment, regionkey FROM tpch.tiny.nation;

Performance#

The connector includes a number of performance improvements, detailed in the following sections.

Table statistics#

The Delta Lake specification defines a number of per-file statistics that can be included in the transaction log. When they are present, the connector uses them to expose table- and column-level statistics, as documented in Table Statistics. The number of distinct values for a column is the only statistic not provided by the transaction log, and therefore it cannot be used for cost-based optimizations. This results in less efficient query optimization.

The Databricks runtime automatically collects and records these statistics, while tables created by the open source Delta Lake implementation do not include them.
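
You can inspect the statistics the connector exposes for a table with the standard SHOW STATS statement; the table name below is a placeholder:

SHOW STATS FOR delta.default.my_table;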

Security#

The connector includes a number of security-related features, detailed in the following sections.

Authorization#

The connector supports standard Hive security for authorization. For configuration properties, see the ‘Authorization’ section in Hive Connector Security Configuration.
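
As a sketch, SQL standard based authorization could be enabled in the catalog properties file in the same way as for the Hive connector. This assumes the connector accepts the hive.security property; confirm this for your version with the Hive Connector Security Configuration documentation:

connector.name=delta-lake
hive.metastore.uri=thrift://example.net:9083
hive.security=sql-standard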

Limitations#

Inserting data is only possible during table creation with CREATE TABLE AS. Removing and updating data is not supported. The connector does not support DDL statements, with the exception of CREATE TABLE, as described above, and CREATE SCHEMA.
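
For example, a new schema can be created with a standard statement; the schema name is a placeholder, and any additional schema properties depend on your metastore and storage setup:

CREATE SCHEMA delta.sales;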