5.8. Querying AWS S3

Preliminary Steps

Ensure Access to S3

Before you start querying the data on S3, you need to make sure the Presto cluster is allowed to access it. If your S3 data is publicly available, you do not need to do anything. Typically, however, the data is not publicly available, and you need to grant the Presto cluster access to it. To do so, create an instance profile and select it when creating a cluster. You can then validate that the selected instance profile is sufficient for Presto to read S3 data by opening an SSH connection to the Presto coordinator and issuing the following commands. For more information on creating instance profiles, refer to the Deploying Presto section.

# aws s3 ls s3://your-bucket/path/to/dataset/
# aws s3 cp s3://your-bucket/path/to/dataset/data-file - > /dev/null
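
If you need to create such an instance profile from scratch, the following AWS CLI sketch outlines the steps. The role and profile names here are placeholders, ec2-trust-policy.json stands for a hypothetical trust policy document allowing EC2 to assume the role, and the managed AmazonS3ReadOnlyAccess policy grants read access to all buckets, so you may prefer a narrower custom policy.

# aws iam create-role --role-name presto-s3-reader --assume-role-policy-document file://ec2-trust-policy.json
# aws iam attach-role-policy --role-name presto-s3-reader --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
# aws iam create-instance-profile --instance-profile-name presto-s3-reader
# aws iam add-role-to-instance-profile --instance-profile-name presto-s3-reader --role-name presto-s3-reader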

Hive Connector

Another requirement is that you have enabled and configured a Hive connector in Presto. The Hive Metastore should also have an instance profile granting access to S3.

Configuring Hive Connector

If you have not yet done so, you need to configure a Hive connector in Presto. The connector uses a Hive Metastore for data discovery and is not limited to data residing on HDFS. The Metastore should be viewed as a database of metadata about your data. You can use an existing Hive Metastore or start a new one.

  • Existing Hive Metastore:
To use an existing Hive Metastore instance with a Presto cluster, you need to provide the HiveMetastoreURI parameter in the form of thrift://example.net:9083, as sketched in the catalog file after this list. If the Metastore uses authentication, please refer to the Configurations section of this documentation.
  • New Hive Metastore:
You can create a new metastore for Presto via a CloudFormation template. See Deploying Presto Metastore.
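
For reference, a minimal Hive connector catalog file might look like the sketch below. The property names are standard Presto Hive connector settings, but the exact file location and whether you edit it directly depend on your deployment.

# etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083
# let the connector use the EC2 instance profile credentials for S3 access
hive.s3.use-instance-credentials=true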

Accessing S3 Data From Presto for the First Time

Accessing Data Already Mapped in the Hive Metastore

If you chose an existing Hive Metastore instance when configuring the Hive connector, chances are that your S3 data is already mapped to SQL tables in the Metastore, and you should be able to query it right away.
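
For example, assuming the catalog is named hive, you can list the mapped schemas and tables before querying:

presto> SHOW SCHEMAS FROM hive;
presto> SHOW TABLES FROM hive.default;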

Accessing Data Not Yet Mapped in the Hive Metastore

To access S3 data that is not yet mapped in the Hive Metastore, you need to provide the schema of the data, the file format, and the data location. For example, if you have ORC or Parquet files in an S3 bucket named my_bucket, you need to execute a command similar to the following.

presto> -- select schema in which the table will be defined, must already exist
presto> USE hive.default;
USE

presto> -- create table
presto> CREATE TABLE orders (
     ->     orderkey bigint,
     ->     custkey bigint,
     ->     orderstatus varchar(1),
     ->     totalprice double,
     ->     orderdate date,
     ->     orderpriority varchar(15),
     ->     clerk varchar(15),
     ->     shippriority integer,
     ->     comment varchar(79)
     -> ) WITH (
     ->     external_location = 's3://my_bucket/path/to/folder',
     ->     format = 'ORC' -- or 'PARQUET'
     -> );
CREATE TABLE

Now you should be able to query the newly mapped table:

presto> SELECT * FROM orders;
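
If the table is large, you may want to restrict the first exploratory query, for example:

presto> SELECT * FROM orders LIMIT 10;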

Statistics

It’s very easy to query data stored on S3: all you need to do is define the mapping between the S3 data and SQL tables. However, if your queries are complex and join large data sets, you may run into performance issues. This is because Presto does not know the statistical properties of the data, and such properties are the basis for the Presto Cost-Based Optimizer’s decisions.
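
You can inspect the statistics Presto has for a table with SHOW STATS; NULL estimates indicate that statistics have not been collected yet:

presto> SHOW STATS FOR orders;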

Prerequisites

Presto uses Hive for table statistics collection. You do not need to start a Hadoop cluster with Hive just to collect statistics: the Presto cluster can start one for you and automatically stop it once statistics collection is finished. For this to work, you need to assign an instance profile to the Presto cluster that has permission to create Elastic MapReduce (EMR) jobs. For more information on instance profiles, please refer to Instance Profiles.

Collecting Statistics

To gather table statistics for tables in the Hive connector, you need to open an SSH connection to the Presto coordinator and issue a command like the one below. For more information on connecting via SSH, see Accessing the Presto Cluster.

# hive-analyze --schema-pattern 'schema\_name' \
    --table-pattern 'table\_name'

The --schema-pattern and --table-pattern parameters accept a pattern similar to SQL LIKE’s: '%' matches any sequence of characters (including the empty one). Due to limitations in Hive, the '_' wildcard is currently not supported. It may become supported in the future, so if your schema or table name contains a literal underscore, it needs to be preceded with a backslash: \_.
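
For example (the schema and table names here are hypothetical):

# hive-analyze --schema-pattern 'default' --table-pattern '%'
# hive-analyze --schema-pattern 'sales\_schema' --table-pattern 'orders\_2019'

The first command collects statistics for all tables in the default schema; the second shows literal underscores escaped with a backslash.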

Note

You may want to run the hive-analyze command in a tmux or screen session, because it may take several minutes or even hours to complete, depending on the actual data size. For your convenience, both programs are included on the Presto coordinator node.
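
For example, with tmux you can start a named session, run the command inside it, detach with Ctrl-b d, and reattach later:

# tmux new -s analyze
# hive-analyze --schema-pattern 'schema\_name' --table-pattern 'table\_name'
# tmux attach -t analyze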

Writing S3 Data with Presto

Prerequisites

Before you attempt to write data to S3, make sure you have configured everything necessary to read data from S3.

Create Export Schema

If the Hive Metastore contains schemas mapped to S3 locations, you can use them to export data to S3. If you don’t want to use the existing schemas (or there are no appropriate schemas in the Hive Metastore), you need to create a new one.

presto> CREATE SCHEMA hive.s3_export WITH (location = 's3://my_bucket/some/path');
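
You can verify that the schema was created as expected:

presto> SHOW SCHEMAS FROM hive LIKE 's3_export';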

Export Data to S3

Once you have a schema pointing to the location where you want to export the data, you can issue the export using a CREATE TABLE AS statement, selecting your desired file format.

presto> CREATE TABLE hive.s3_export.my_table
WITH (format = 'ORC')
AS <source query>;

The data will be written to one or more files under the s3://my_bucket/some/path/my_table location. The number of files depends on the size of the exported data and on the possible parallelization of its source.

Example:

presto> CREATE TABLE hive.s3_export.orders_export
WITH (format = 'ORC')
AS SELECT * FROM tpch.sf1.orders;
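
Afterwards, you can verify the exported files from the coordinator, using the bucket and path from the example above:

# aws s3 ls s3://my_bucket/some/path/orders_export/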