9.9. Starburst Delta Lake Connector#

The Delta Lake connector allows querying data stored in Delta Lake format, including Databricks Delta Lake. It can natively read the Delta transaction log and thus detect when external systems change data.

Note

This connector is currently available as a beta release for use on Azure Storage. Google Cloud Storage is not yet supported. Deployments using AWS or HDFS are fully supported. Work with the Starburst support team if you are planning to use this connector in production.

Configuration#

The connector requires a Hive metastore for table metadata and supports the same metastore configuration properties as the Hive connector. At a minimum, hive.metastore.uri must be configured.

The connector recognizes Delta tables created in the metastore by the Databricks runtime. Any non-Delta tables present in the metastore are not visible to the connector.

To configure the Delta Lake connector, create a catalog file, for example etc/catalog/delta.properties, that references the delta-lake connector. Update hive.metastore.uri with the URI of your Hive metastore Thrift service:

connector.name=delta-lake
hive.metastore.uri=thrift://example.net:9083

If you are using AWS Glue as your Hive metastore, you can simply set the metastore to glue:

connector.name=delta-lake
hive.metastore=glue

The Delta Lake connector reuses certain functionality from the Hive connector, including the metastore Thrift and Glue configuration, detailed in the Hive connector documentation.
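
For example, a catalog using the Glue metastore can carry Glue-specific settings taken from the Hive connector documentation (the region value here is only a placeholder):

connector.name=delta-lake
hive.metastore=glue
hive.metastore.glue.region=us-east-1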

To configure access to S3, S3-compatible storage, Azure storage, and other supported file systems, consult the Amazon S3 section of the Hive connector documentation or the Azure storage documentation, as appropriate.
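
As a sketch, assuming static AWS credentials, S3 access can be configured with the Hive connector's S3 properties (the key values are placeholders):

connector.name=delta-lake
hive.metastore.uri=thrift://example.net:9083
hive.s3.aws-access-key=<access-key>
hive.s3.aws-secret-key=<secret-key>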

Authorization#

The connector supports standard Hive security for authorization. For configuration properties, see the ‘Authorization’ section in Hive Security Configuration.
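
For example, SQL standard access control can be enabled with the hive.security property documented there (a sketch, assuming the property applies unchanged to this connector):

connector.name=delta-lake
hive.metastore.uri=thrift://example.net:9083
hive.security=sql-standard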

Delta Lake-specific Configuration Properties#

delta.metadata.cache-ttl
    Description: How frequently the connector checks for new Delta transactions.
    Default value: 5 min

delta.metadata.live-files.cache-size
    Description: Amount of memory allocated for caching information about Delta files.
    Default value: 10% of the maximum memory allocated to the master node JVM
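
Both properties are set in the catalog properties file. For example, to have the connector check for new Delta transactions every minute (a sketch, assuming the standard Presto duration syntax):

connector.name=delta-lake
hive.metastore.uri=thrift://example.net:9083
delta.metadata.cache-ttl=1m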

Creating Tables#

When Delta tables exist in storage, but not in the metastore, Presto can be used to register them:

CREATE TABLE delta.default.my_table (
   dummy bigint
)
WITH (
   location = '...'
)

Note that the columns listed in the DDL, such as dummy in the above example, are ignored. The table schema is instead read from the transaction log. If the schema is changed by an external system, Presto automatically uses the new schema.
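
For example, after registering the table, you can verify the schema that the connector reads from the transaction log (catalog, schema, and table names as in the example above):

DESCRIBE delta.default.my_table;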

Statistics#

The Delta Lake specification defines a number of per-file statistics that can be included in the transaction log. When they are present, the connector uses them to expose table- and column-level statistics, as documented in Table Statistics. The number of distinct values for a column is the only statistic not provided by the transaction log; it is therefore unavailable for cost-based optimizations, which can result in less efficient query plans.
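
The exposed statistics can be inspected with SHOW STATS, for example (table name as in the earlier CREATE TABLE example):

SHOW STATS FOR delta.default.my_table;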

The Databricks runtime automatically collects and records statistics, whereas tables created by the open source Delta Lake implementation lack them.

Limitations#

Inserting, removing, and updating data are not supported. The connector does not support DDL statements, with the exception of CREATE TABLE, as described above, and CREATE SCHEMA.
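
For example, creating a new schema remains supported (the schema name here is hypothetical):

CREATE SCHEMA delta.my_schema;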

Warning

The column type TIMESTAMP in Delta Lake is currently mapped to TIMESTAMP in Presto. In Delta Lake this type includes time zone information, which is currently lost. An upcoming release changes the mapping to the correct TIMESTAMP WITH TIME ZONE type in Presto.