Querying AWS S3#

Preliminary Steps#

Ensure Access to S3#

Before you start querying data on S3, you need to make sure the Presto cluster is allowed to access it. If your S3 data is publicly available, you do not need to do anything. Typically, however, the data is not publicly available, and you need to grant Presto access. To do so, create an instance profile and select it when creating a cluster. You can then validate that the selected instance profile is sufficient for Presto to read the S3 data by opening an SSH connection to the coordinator and issuing the following commands. For more information on creating instance profiles, refer to the Deploying Presto section.

# List the dataset files to verify access to the bucket
aws s3 ls s3://your-bucket/path/to/dataset/
# Stream a data file to stdout (discarded) to verify read access to the objects
aws s3 cp s3://your-bucket/path/to/dataset/data-file - > /dev/null

Hive Catalog#

Another requirement is that you have enabled and configured a Hive catalog. The Hive Metastore should also have an instance profile granting access to S3.

Configuring Hive Catalog

If you have not yet done so, you need to configure a catalog using the Hive connector in Presto. The connector uses a Hive Metastore for data discovery and is not limited to data residing on HDFS. The Metastore should be viewed as a database of metadata about your data. You can use an existing Hive Metastore or start a new one.

  • Existing Hive Metastore:

    To use an existing Hive Metastore instance with a Presto cluster, you need to set the MetastoreType parameter to External Hive Metastore Service and ExternalMetastoreHost to the IP address of your Hive Metastore Service (see the Configuring Hive Metastore section). If the Metastore uses authentication, refer to the Configurations section of this documentation along with the following link. Once the catalog is configured, you can check it from the Presto CLI, as shown below.
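
A minimal check that Presto can see the configured catalog and reach the Metastore is to list the catalog and its schemas. The catalog name hive below is an assumption; adjust it if you gave the catalog a different name.

presto> SHOW CATALOGS;
presto> SHOW SCHEMAS FROM hive;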

Accessing S3 Data From Presto for the First Time#

Accessing Data Already Mapped in the Hive Metastore#

If you chose an existing Hive Metastore instance when configuring the Hive catalog, chances are that your S3 data is already mapped to SQL tables in the Metastore. In that case, you should be able to query it.
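
To see what is already mapped, you can browse the Metastore through the Hive catalog. This is a minimal sketch, assuming the catalog is named hive and the tables of interest live in the default schema; substitute your own schema and table names.

presto> SHOW SCHEMAS FROM hive;
presto> SHOW TABLES FROM hive.default;
presto> DESCRIBE hive.default.<table_name>;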

Accessing Data Not Yet Mapped in the Hive Metastore#

To access S3 data that is not yet mapped in the Hive Metastore, you need to provide the schema of the data, the file format, and the data location. For example, if you have ORC or Parquet files in an S3 bucket named my_bucket, you need to execute a command similar to the following.

presto> -- select schema in which the table will be defined, must already exist
presto> USE hive.default;
USE

presto> -- create table
presto> CREATE TABLE orders (
     ->     orderkey bigint,
     ->     custkey bigint,
     ->     orderstatus varchar(1),
     ->     totalprice double,
     ->     orderdate date,
     ->     orderpriority varchar(15),
     ->     clerk varchar(15),
     ->     shippriority integer,
     ->     comment varchar(79)
     -> ) WITH (
     ->     external_location = 's3://my_bucket/path/to/folder',
     ->     format = 'ORC' -- or 'PARQUET'
     -> );
CREATE TABLE

Now you should be able to query the newly mapped table:

presto> SELECT * FROM orders;

Statistics#

It’s very easy to query data stored on S3, as all you need to do is define the mapping between the S3 data and SQL tables. However, if your queries are complex and join large data sets, you may run into performance issues. This is because Presto does not know the statistical properties of the data, and such properties are the basis for the cost-based optimizer’s decisions.

Collecting Statistics#

To gather table statistics for tables accessed through the Hive connector, execute the following command:

presto> ANALYZE orders;
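
After ANALYZE completes, you can inspect the statistics Presto now has for the table with SHOW STATS; the exact set of reported values depends on the connector and Presto version.

presto> SHOW STATS FOR orders;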

Writing S3 Data with Presto#

Prerequisites#

Before you attempt to write data to S3, make sure you have configured everything necessary to read data from S3.

Create Export Schema#

If the Hive Metastore contains schema(s) mapped to S3 locations, you can use them to export data to S3. If you don’t want to use existing schemas (or there are no appropriate schemas in the Hive Metastore), you need to create a new one.

presto> CREATE SCHEMA hive.s3_export WITH (location = 's3://my_bucket/some/path');
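
You can verify that the new schema is visible, for example:

presto> SHOW SCHEMAS FROM hive LIKE 's3_export';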

Export Data to S3#

Once you have a schema pointing to a location where you want to export the data, you can issue the export using a CREATE TABLE AS statement and select your desired file format.

presto> CREATE TABLE hive.s3_export.my_table
WITH (format = 'ORC')
AS <source query>;

The data is written to one or more files under the s3://my_bucket/some/path/my_table prefix. The number of files depends on the size of the data being exported and on how much the source of the data can be parallelized.

Example:

presto> CREATE TABLE hive.s3_export.orders_export
WITH (format = 'ORC')
AS SELECT * FROM tpch.sf1.orders;
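
You can then query the exported table to confirm the data landed as expected, for example:

presto> SELECT count(*) FROM hive.s3_export.orders_export;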