10.3. Hive Connector
The Hive connector allows querying data stored in a Hive data warehouse. Hive is a combination of three components:
- Data files in varying formats that are typically stored in the Hadoop Distributed File System (HDFS) or in Amazon S3.
- Metadata about how the data files are mapped to schemas and tables. This metadata is stored in a database such as MySQL and is accessed via the Hive metastore service.
- A query language called HiveQL. This query language is executed on a distributed computing framework such as MapReduce or Tez.
Presto only uses the first two components: the data and the metadata. It does not use HiveQL or any part of Hive’s execution environment.
Supported File Types
The following file types are supported for the Hive connector:
The Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).
~/.prestoadmin/catalog/hive.properties with the following
contents to mount the
hive-hadoop2 connector as the
example.net:9083 with the correct host and port
for your Hive metastore Thrift service:
presto-admin to deploy the connector file. See Adding a Catalog.
Multiple Hive Clusters
You can have as many catalogs as you need, so if you have additional
Hive clusters, simply add another properties file to
with a different name (making sure it ends in
example, if you name the property file
will create a catalog named
sales using the configured connector.
For basic setups, Presto configures the HDFS client automatically and
does not require any configuration files. In some cases, such as when using
federated HDFS or NameNode high availability, it is necessary to specify
additional HDFS client options in order to access your HDFS cluster. To do so,
hive.config.resources property to reference your HDFS config files:
Only specify additional configuration files if necessary for your setup. We also recommend reducing the configuration files to have the minimum set of required properties, as additional properties may cause problems.
The configuration files must exist on all Presto nodes. If you are referencing existing Hadoop config files, make sure to copy them to any Presto nodes that are not running Hadoop.
When not using Kerberos with HDFS, Presto will access HDFS using the
OS user of the Presto process. For example, if Presto is running as
nobody, it will access HDFS as
nobody. You can override this
username by setting the
HADOOP_USER_NAME system property in the
Presto JVM Config, replacing
hdfs_user with the
Accessing Hadoop clusters protected with Kerberos authentication
Kerberos authentication is supported for both HDFS and the Hive metastore. However, Kerberos authentication by ticket cache is not yet supported.
The properties that apply to Hive connector security are listed in the Hive Configuration Properties table. Please see the Hive Security Configuration section for a more detailed discussion of the security options in the Hive connector.
Before running any
CREATE TABLE or
CREATE TABLE ... AS statements
for Hive tables in Presto, you need to check that the operating system user
running the Presto server has access to the Hive warehouse directory on HDFS. The Hive warehouse
directory is specified by the configuration variable
hive-site.xml, and the default value is
/user/hive/warehouse. If that
is not the case, either add the following to
jvm.config on all of the nodes:
USER is an operating system user that has proper
permissions for the Hive warehouse directory, or start the Presto server as a user with
similar permissions. The
hive user generally works as
USER, since Hive is often
started with the
hive user. If you run into HDFS permissions problems on
CREATE TABLE ... AS, remove
/tmp/presto-* on HDFS, fix the user as described
above, then restart all of the Presto servers.
Hive Configuration Properties
||The URI(s) of the Hive metastore to connect to using the
Thrift protocol. If multiple URIs are provided, the first
URI is used by default and the rest of the URIs are
fallback metastores. This property is required.
||The username Presto will use to access the Hive metastore.|
||An optional comma-separated list of HDFS
configuration files. These files must exist on the
machines running Presto. Only specify this if
absolutely necessary to access HDFS.
||The default file format used when creating new tables.||
||The compression codec to use when writing files.||
||Force splits to be scheduled on the same node as the Hadoop DataNode process serving the split data. This is useful for installations where Presto is collocated with every DataNode.||
||Should new partitions be written using the existing table format or the default Presto format?||
||Can new data be inserted into existing partitions?||
||Maximum number of partitions per writer.||100|
||Maximum number of partitions for a single table scan.||100,000|
||Hive metastore authentication type.
Possible values are
||The Kerberos principal of the Hive metastore service.|
||The Kerberos principal that Presto will use when connecting to the Hive metastore service.|
||Hive metastore client keytab location.|
||HDFS authentication type.
Possible values are
||Enable HDFS end user impersonation.||
||The Kerberos principal that Presto will use when connecting to HDFS.|
||HDFS client keytab location.|
||See Hive Security Configuration.|
||Path of config file to use when
||Enable writes to non-managed (external) Hive tables.||
||Enable creating non-managed (external) Hive tables.||
||Enable support for multiple files per bucket for Hive clustered tables. See Clustered Hive tables support||
||Enable support for clustered tables with empty partitions. See Clustered Hive tables support||
Amazon S3 Configuration
The Hive Connector can read and write tables that are stored in S3. This is accomplished by having a table or database location that uses an S3 prefix rather than an HDFS prefix.
Presto uses its own S3 filesystem for the URI prefixes
S3 Configuration Properties
||Use the EC2 metadata service to retrieve API credentials
||Default AWS access key to use.|
||Default AWS secret key to use.|
||The S3 storage endpoint server. This can be used to
connect to an S3-compatible storage system instead
of AWS. When using v4 signatures, it is recommended to
set this to the AWS region-specific endpoint
||Specify a different signer type for S3-compatible storage.
||Use path-style access for all requests to the S3-compatible storage.
This is for S3-compatible storage that doesn’t support virtual-hosted-style access.
||Local staging directory for data written to S3.
This defaults to the Java temporary directory specified
by the JVM system property
||Pin S3 requests to the same region as the EC2
instance where Presto is running (defaults to
||Use HTTPS to communicate with the S3 API (defaults to
||Use S3 server-side encryption (defaults to
||The type of key management for S3 server-side encryption.
||The KMS Key ID to use for S3 server-side encryption with KMS-managed keys. If not set, the default key is used.|
||If set, use S3 client-side encryption and use the AWS KMS to store encryption keys and use the value of this property as the KMS Key ID for newly created objects.|
||If set, use S3 client-side encryption and use the
value of this property as the fully qualified name of
a Java class which implements the AWS SDK’s
If you are running Presto on Amazon EC2 using EMR or another facility,
it is highly recommended that you set
true and use IAM Roles for EC2 to govern access to S3. If this is
the case, your EC2 instances will need to be assigned an IAM Role which
grants appropriate access to the data stored in the S3 bucket(s) you wish
to use. This is much cleaner than setting AWS access and secret keys in
hive.s3.aws-secret-key settings, and also
allows EC2 to automatically rotate credentials on a regular basis without
any additional work on your part.
Custom S3 Credentials Provider
You can configure a custom S3 credentials provider by setting the Hadoop
presto.s3.credentials-provider to be the
fully qualified class name of a custom AWS credentials provider
implementation. This class must implement the
interface and provide a two-argument constructor that takes a
java.net.URI and a Hadoop
as arguments. A custom credentials provider can be used to provide
temporary credentials from STS (using
IAM role-based credentials (using
or credentials for a specific use case (e.g., bucket/user specific credentials).
This Hadoop configuration property must be set in the Hadoop configuration
files referenced by the
hive.config.resources Hive connector property.
The following tuning properties affect the behavior of the client
used by the Presto S3 filesystem when communicating with S3.
Most of these parameters affect settings on the
object associated with the
||Maximum number of error retries, set on the S3 client.||
||Maximum number of read attempts to retry.||
||Use exponential backoff starting at 1 second up to this maximum value when communicating with S3.||
||Maximum time to retry communicating with S3.||
||TCP connect timeout.||
||TCP socket read timeout.||
||Maximum number of simultaneous open connections to S3.||
||Minimum file size before multi-part upload to S3 is used.||
||Minimum multi-part upload part size.||
S3 Data Encryption
Presto supports reading and writing encrypted data in S3 using both server-side encryption with S3 managed keys and client-side encryption using either the Amazon KMS or a software plugin to manage AES encryption keys.
With S3 server-side encryption,
(called SSE-S3 in the Amazon documentation) the S3 infrastructure takes care of all encryption and decryption
work (with the exception of SSL to the client, assuming you have
hive.s3.ssl.enabled set to
S3 also manages all the encryption keys for you. To enable this, set
With S3 client-side encryption,
S3 stores encrypted data and the encryption keys are managed outside of the S3 infrastructure. Data is encrypted
and decrypted by Presto instead of in the S3 infrastructure. In this case, encryption keys can be managed
either by using the AWS KMS or your own key management system. To use the AWS KMS for key management, set
hive.s3.kms-key-id to the UUID of a KMS key. Your AWS credentials or EC2 IAM role will need to be
granted permission to use the given key as well.
To use a custom encryption key management system, set
hive.s3.encryption-materials-provider to the
fully qualified name of a class which implements the
interface from the AWS Java SDK. This class will have to be accessible to the Hive Connector through the
classpath and must be able to communicate with your custom key management system. If this class also implements
org.apache.hadoop.conf.Configurable interface from the Hadoop Java API, then the Hadoop configuration
will be passed in after the object instance is created and before it is asked to provision or retrieve any
Hive allows the partitions in a table to have a different schema than the table. This occurs when the column types of a table are changed after partitions already exist (that use the original column types). The Hive connector supports this by allowing the same conversions as Hive:
varcharto and from
- Widening conversions for integers, such as
Any conversion failure will result in null, which is the same behavior
as Hive. For example, converting the string
'foo' to a number,
or converting the string
'1234' to a
tinyint (which has a
maximum value of
The Hive connector supports querying and manipulating Hive tables and schemas (databases). While some uncommon operations will need to be performed using Hive directly, most operations can be performed using Presto.
Create a new Hive schema named
web that will store tables in an
S3 bucket named
CREATE SCHEMA hive.web WITH (location = 's3://my-bucket/')
Create a new Hive table named
page_views in the
that is stored using the ORC file format, partitioned by date and
country, and bucketed by user into
50 buckets (note that Hive
requires the partition columns to be the last columns in the table):
CREATE TABLE hive.web.page_views ( view_time timestamp, user_id bigint, page_url varchar, ds date, country varchar ) WITH ( format = 'ORC', partitioned_by = ARRAY['ds', 'country'], bucketed_by = ARRAY['user_id'], bucket_count = 50 )
Drop a partition from the
DELETE FROM hive.web.page_views WHERE ds = DATE '2016-08-09' AND country = 'US'
SELECT * FROM hive.web.page_views
Create an external Hive table named
request_logs that points at
existing data in S3:
CREATE TABLE hive.web.request_logs ( request_time timestamp, url varchar, ip varchar, user_agent varchar ) WITH ( format = 'TEXTFILE', external_location = 's3://my-bucket/data/logs/' )
Drop the external table
request_logs. This only drops the metadata
for the table. The referenced data directory is not deleted:
DROP TABLE hive.web.request_logs
Drop a schema:
DROP SCHEMA hive.web
Clustered Hive tables support
By default Presto supports only one data file per bucket per partition for clustered tables (Hive tables declared with
CLUSTERED BY clause).
If number of files does not match number of buckets exception would be thrown.
To enable support for cases where there are more than one file per bucket, when multiple INSERTs were done to a single partition of the clustered table, you can use:
multi_file_bucketing_enabledsession property (using
SET SESSION <connector_name>.multi_file_bucketing_enabled)
Config property changes behaviour globally and session property can be used on per query basis. The default value of session property is taken from config property.
If support for multiple files per bucket is enabled Presto will group the files in partition directory. It will sort filenames lexicographically. Then it will treat part of filename up to first underscore character as bucket key. This pattern matches naming convention of files in directory when Hive is used to inject data into table.
Presto will still validate if number of file groups matches number of buckets declared for table and fail if it does not.
Similarly by default empty partitions (partitions with no files) are not allowed for clustered Hive tables. To enable support for empty paritions you can use:
empty_bucketed_partitions_enabledsession property (using
SET SESSION <connector_name>.empty_bucketed_partitions_enabled)
Hive Connector Limitations
DELETE is only supported if the
WHERE clause matches entire partitions.