5.6. Querying Azure Storage

Hive Connector configuration

Presto for HDInsight can be configured to query Azure Standard Blob Storage and Azure Data Lake Storage Gen2 (ABFS). Azure Blobs are accessed via the Windows Azure Storage Blob (WASB) driver. This driver is built on top of the HDFS APIs and is what allows storage to be separated from the cluster, which in turn lets you scale Presto and HDInsight independently of storage. All configuration for the Azure storage driver is stored in HDInsight’s core-site.xml configuration file.

Warning

We no longer support ADLS Gen1 storage.

If you choose one of the supported Azure Storage options in HDInsight, storage is configured automatically for you. However, if you need to change the configuration later, the following property must be specified in the hive.properties Presto configuration:

hive.config.resources=<path_to_hadoop_core-site.xml>

Refer to the Custom Configuration section for how to extend the default configurations.
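
For example, assuming the Hadoop configuration on the cluster node lives under /etc/hadoop/conf (verify the actual path on your cluster), the entry in hive.properties would be:

hive.config.resources=/etc/hadoop/conf/core-site.xml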

Presto configuration for Azure Storage credentials

If you do not want to rely on Hadoop’s core-site.xml and want to have Presto configured independently with the storage credentials, you can add the following properties in the hive.properties Presto configuration.

Note

We suggest using this kind of configuration only when the Primary Storage account is the only storage account linked to the cluster. When secondary storage accounts are involved, it is recommended to configure Presto using hive.config.resources=<path_to_hadoop_core-site.xml>, provided core-site.xml contains all the storage account credentials.

Property Name Description
hive.azure.wasb-storage-account Storage account name of Azure Blob Storage.
hive.azure.wasb-access-key The decrypted access key for the Azure Blob Storage.
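
As an illustration, with a hypothetical storage account named examplestorage, the corresponding hive.properties entries would look like this (the access key value is a placeholder):

hive.azure.wasb-storage-account=examplestorage
hive.azure.wasb-access-key=<decrypted access key>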

If you choose to use ADLS Gen2, you need to add the following to your hive.properties Presto configuration:

Property Name Description
hive.azure.abfs-storage-account Storage account name of Azure Data Lake Storage Gen2.
hive.azure.abfs-access-key The decrypted access key for the Azure Data Lake Storage Gen2.
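
Analogously, for a hypothetical ADLS Gen2 account named exampledatalake, the hive.properties entries would be:

hive.azure.abfs-storage-account=exampledatalake
hive.azure.abfs-access-key=<decrypted access key>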

Accessing Azure Storage Data

URI scheme to reference data

Consistent with other FileSystem implementations within Hadoop, the Azure Standard Blob and Azure Data Lake Storage Gen2 (ABFS) drivers define their own URI scheme so that resources (directories and files) may be distinctly addressed. You can access both primary and secondary storage accounts linked to the cluster with the same URI scheme.

The structure of the ABFS URI is abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/<path>/<file_name>, and the structure of the Azure Standard Blob URI is wasb[s]://container@account_name.blob.core.windows.net/<path>/<path>/<file_name>.
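
For instance, with a hypothetical file system (or container) named data in a storage account named examplestorage, files could be addressed as:

abfss://data@examplestorage.dfs.core.windows.net/orders/2021/orders.orc
wasbs://data@examplestorage.blob.core.windows.net/orders/2021/orders.csv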

Configuring Storage Credentials

Usage of Azure Standard Blob Storage requires configuration of credentials, set in core-site.xml. The credential is stored in a property named fs.azure.account.key.<account name>.blob.core.windows.net, whose value is the storage access key in encrypted form. By default, HDInsight uses org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider to decrypt the key, and core-site.xml looks like the following.

[Image: core-site.xml entries for the encrypted Blob Storage account key]
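
The following is a representative sketch of the relevant core-site.xml entries; the account name, encrypted key, and decryption script path are placeholders, and the actual values are generated by HDInsight:

<!-- illustrative fragment only; HDInsight generates the actual values -->
<property>
  <name>fs.azure.account.keyprovider.<account name>.blob.core.windows.net</name>
  <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>
<property>
  <name>fs.azure.shellkeyprovider.script</name>
  <value><path to HDInsight decryption script></value>
</property>
<property>
  <name>fs.azure.account.key.<account name>.blob.core.windows.net</name>
  <value><encrypted access key></value>
</property>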

For Azure Data Lake Storage Gen2, the HDInsight cluster is by default configured with a “Custom” TokenServiceBasedAccessTokenProvider, as shown below.

[Image: core-site.xml OAuth token provider configuration for ADLS Gen2]
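
As a rough sketch, the default configuration uses the Hadoop ABFS driver’s custom authentication properties; the fully qualified provider class is supplied by HDInsight and may vary between cluster versions:

<!-- illustrative fragment only; HDInsight generates the actual values -->
<property>
  <name>fs.azure.account.auth.type</name>
  <value>Custom</value>
</property>
<property>
  <name>fs.azure.account.oauth.provider.type</name>
  <value><fully qualified TokenServiceBasedAccessTokenProvider class></value>
</property>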

Access to Azure storage may not work as expected with other custom key encryption or authentication mechanisms, so we do not recommend changing these Hadoop configuration properties from their default values.

Querying Azure storage from Presto

You should be able to query tables that are already configured in the Hive metastore used by Presto’s Hive connector. To access Azure Storage data that is not yet mapped in the Hive metastore, you need to provide the schema of the data, the file format, and the data location. For example, if you have ORC or Parquet files in an ABFS file_system, you need to execute a query like the following:

-- select schema in which the table will be defined, must already exist
USE hive.default;

-- create table
CREATE TABLE orders (
     orderkey bigint,
     custkey bigint,
     orderstatus varchar(1),
     totalprice double,
     orderdate date,
     orderpriority varchar(15),
     clerk varchar(15),
     shippriority integer,
     comment varchar(79)
) WITH (
     external_location = 'abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/<path>',
     format = 'ORC' -- or 'PARQUET'
);

Now you should be able to query the newly mapped table:

SELECT * FROM orders;

Writing Azure Data with Presto

Prerequisites

Before you attempt to write data to Azure Storage, make sure you have configured everything necessary to read data from the storage.

Create Write Schema

If the Hive metastore contains schemas mapped to Azure storage filesystems, you can use them to write data to Azure storage. If you do not want to use the existing schemas (or there are no appropriate schemas in the Hive metastore), you need to create a new one:

CREATE SCHEMA hive.abfs_export WITH (location = 'abfs[s]://file_system@account_name.dfs.core.windows.net/<path>');

Write Data to Azure storage

Once you have a schema pointing to a location where you want to write the data, you can issue a CREATE TABLE AS statement and select your desired file format. The data will be written to one or more files within the abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/my_table namespace. Example:

CREATE TABLE hive.abfs_export.orders_abfs
WITH (format = 'ORC')
AS SELECT * FROM tpch.sf1.orders;
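
You can verify the write by querying the new table, for example:

SELECT count(*) FROM hive.abfs_export.orders_abfs;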