9.9. Starburst Delta Lake Connector#
The Delta Lake connector allows querying data stored in Delta Lake format, including Databricks Delta Lake. It can natively read the Delta transaction log and thus detect when external systems change data.
This connector is currently available as beta release for usage on Azure Storage. Google Cloud Storage is not yet supported. Deployments using AWS or HDFS are fully supported. Work with the Starburst support team, if you are planning to use this connector in production.
The connector requires a Hive metastore for table metadata and supports the same
metastore configuration properties as the Hive connector. At a minimum,
hive.metastore.uri must be configured.
The connector recognizes Delta tables created in the metastore by the Databricks runtime. If non-Delta tables are present in the metastore, as well, they will not be visible to the connector.
To configure the Delta Lake connector, create a catalog file, for example
etc/catalog/delta.properties, that references the
hive.metastore.uri with the URI of your Hive metastore Thrift
If you are using AWS Glue as Hive metastore, you can simply set the metastore to
The Delta Lake connector reuses certain functionalities from the Hive connector, including the metastore thrift and glue configuration, detailed in the Hive connector documentation.
|Property name||Description||Default value|
||How frequently the connector checks for new Delta transactions.||5 min|
||Amount of memory allocated for caching information about Delta files.||10% of the maximum memory allocated to the master node JVM|
When Delta tables exist in storage, but not in the metastore, Presto can be used to register them:
CREATE TABLE delta.default.my_table ( dummy bigint ) WITH ( location = '...' )
Note that the columns listed in the DDL, such as
dummy in the above example,
are ignored. The table schema is read from the transaction log, instead. If the
schema is changed by an external system, Presto automatically uses the new
The Delta Lake specification defines a number of per file statistics that can be included in the transaction log. When they are present, the connector uses them to expose table and column level statistics, as documented in Table Statistics. Only the number of distinct values for a column is not provided by the transaction log, and can therefore not be used for cost based optimizations. This results in less efficient optimization.
The Databricks runtime automatically collects and records statistics, while they are missing for tables created by the open source Delta Lake implementation.
In addition to the defined columns, the Delta Lake connector automatically exposes metadata in a number of hidden columns in each table. You can use these columns in your SQL statements like any other column, e.g., they can be selected directly or used in conditional statements.
- Full file system path name of the file for this row.
- Date and time of the last modification of the file for this row.
- Size of the file for this row.
Inserting, removing and updating data is not supported. The connector does not
support DDL statements, with the exception of
CREATE TABLE, as described
The column type
TIMESTAMP in Delta Lake is currently mapped to
TIMESTAMP in Presto. In Delta Lake this type includes timezone
information, which is currently lost. An upcoming release changes the mapping
to the correct
TIMESTAMP WITH TIME ZONE type in Presto.