6.1. Querying Google Cloud Storage

Preliminary Steps

Ensure Access to GCS

If your data is publicly available, you do not need to do anything here. However, in most cases data is not publicly available, and the Presto cluster needs access to it. This is typically achieved by creating a service account that has permissions to access your data. You can do this on the Service Accounts page in the GCP Console. Once you create a service account, create a key for it and download the key in JSON format.
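
If you prefer the command line over the GCP Console, the following sketch shows the same steps using gcloud and gsutil; the service account name presto-gcs, project my-project, bucket my_bucket, and key path are all placeholders to adapt:

# Create a service account for Presto (all names below are placeholders)
gcloud iam service-accounts create presto-gcs \
    --project=my-project \
    --display-name="Presto GCS access"

# Grant the account read access to the bucket holding your data
gsutil iam ch \
    serviceAccount:presto-gcs@my-project.iam.gserviceaccount.com:roles/storage.objectViewer \
    gs://my_bucket

# Create a key for the account and download it in JSON format
gcloud iam service-accounts keys create /path/to/gcs_keyfile.json \
    --iam-account=presto-gcs@my-project.iam.gserviceaccount.com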

Configure Cloud Storage connector

Access to Cloud Storage data is possible thanks to the Hadoop Cloud Storage connector, which allows Presto (and other components built on Hadoop) to access files in GCS via the gs:// protocol.

There are two ways to configure this connector:

  • Simplified Presto-specific configuration via Hive catalog properties (hive.properties)
  • Full configuration for the entire Hadoop ecosystem

The simplified configuration method is useful if only Presto needs to access files in GCS, and is described in the Hive Connector section below. If another Hadoop-based component (e.g. the Hive metastore) also needs to access these files, you need to perform the full configuration.

Full GCS configuration:

Google Cloud Storage access is configured in core-site.xml and is used by all components that use Apache Hadoop. Here’s an example set of properties responsible for GCS access:

<property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>
        The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
    </description>
</property>
<property>
    <name>fs.gs.project.id</name>
    <value></value>
    <description>
        Required. Google Cloud Project ID with access to configured GCS buckets.
    </description>
</property>
<!--
The following property is required when Presto is NOT running in a GCE VM.
The connector will check whether it is running on a GCE VM with some level of GCS access in its service account scope, and use that service account.
-->
<property>
    <name>fs.gs.auth.service.account.json.keyfile</name>
    <value></value>
    <description>
        The path to the json keyfile for the service account.
    </description>
</property>
<!--
The following property is required only when Presto must run as a different user than the Hive metastore, and the Hive metastore uses StorageBasedAuthorization.
The connector presents files with '700' POSIX permissions by default, and that combination causes the Hive metastore to deny access to Presto.
-->
<property>
    <name>fs.gs.reported.permissions</name>
    <value>777</value>
    <description>
      Optional. Override the default '700' simulated POSIX file permissions.
    </description>
</property>

More configuration properties and their default values are shown in gcs-core-default.xml.

Hive Connector

Another requirement is that you have enabled and configured a Hive connector in Presto. The connector uses the Hive metastore for data discovery and is not limited to data residing on HDFS.

Configuring Hive Connector

  • URL to Hive metastore:

    • New Hive metastore on GCP:

      If your Presto nodes are provisioned by GCP, your Hive metastore should also be on GCP to minimize latency and costs. The simplest way to create a new Hive metastore on GCP is to create a small Cloud Dataproc cluster (1 master, 0 workers) that is accessible from your Presto cluster (a sample gcloud command is shown after this list). Once the cluster is up, follow the steps for an existing Hive metastore.

    • Existing Hive metastore:

      To use an existing Hive metastore with a Presto cluster, set the hive.metastore.uri property in your Hive catalog properties file to thrift://${METASTORE_ADDRESS}:${METASTORE_THRIFT_PORT}. If the metastore uses authentication, refer to Hive Security Configuration. A complete example catalog properties file is shown after this list.

  • GCS access:

    • Simplified configuration:

      Here are example values for all GCS configuration properties which can be set in the Hive catalog properties file:

      # JSON key file used to access Google Cloud Storage
      hive.gcs.json-key-file-path=/path/to/gcs_keyfile.json
      
      # Use client-provided OAuth token to access Google Cloud Storage
      hive.gcs.use-access-token=false
      
    • Full configuration:

      If you need more configuration options, finish the Full GCS configuration section and then add the full path to your core-site.xml file to the hive.config.resources property in the Hive catalog properties file. This allows the Presto Hive connector to pick up the GCS configuration during startup.
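
As an illustration of the Cloud Dataproc option above, a single-node cluster (1 master, 0 workers) can be created with gcloud; the cluster name and region below are placeholders:

# Create a minimal single-node Dataproc cluster to host the Hive metastore
gcloud dataproc clusters create presto-metastore \
    --region=us-central1 \
    --single-node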
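
Putting the pieces together, a complete Hive catalog properties file (e.g. etc/catalog/hive.properties) using the metastore URI and the simplified GCS configuration might look like the following sketch; the metastore address and key file path are placeholders:

connector.name=hive-hadoop2

# Thrift address of the Hive metastore
hive.metastore.uri=thrift://metastore-host:9083

# Simplified GCS configuration
hive.gcs.json-key-file-path=/path/to/gcs_keyfile.json
hive.gcs.use-access-token=false

# For the full configuration, reference core-site.xml instead:
# hive.config.resources=/path/to/core-site.xml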

Accessing GCS Data From Presto for the First Time

Accessing Data Already Mapped in the Hive metastore

If you migrate to Presto from Hive, chances are that your GCS data is already mapped to SQL tables in the metastore. In that case, you should be able to query it right away.
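
For example, assuming a hive catalog with tables defined in the default schema (the table name orders below is a placeholder), you can list and query the mapped tables directly:

presto> SHOW TABLES FROM hive.default;
presto> SELECT * FROM hive.default.orders LIMIT 10;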

Accessing Data Not Yet Mapped in the Hive metastore

To access GCS data that is not yet mapped in the Hive metastore, you need to provide the schema of the data, the file format, and the data location. For example, if you have ORC or Parquet files in a GCS bucket named my_bucket, you need to execute a command similar to the following.

presto> -- select schema in which the table will be defined, must already exist
presto> USE hive.default;
USE

presto> -- create table
presto> CREATE TABLE orders (
     ->     orderkey bigint,
     ->     custkey bigint,
     ->     orderstatus varchar(1),
     ->     totalprice double,
     ->     orderdate date,
     ->     orderpriority varchar(15),
     ->     clerk varchar(15),
     ->     shippriority integer,
     ->     comment varchar(79)
     -> ) WITH (
     ->     external_location = 'gs://my_bucket/path/to/folder',
     ->     format = 'ORC' -- or 'PARQUET'
     -> );
CREATE TABLE

Now you should be able to query the newly mapped table:

presto> SELECT * FROM orders;

Statistics

It’s very easy to query data stored on GCS: all you need to do is define the mapping between the GCS data and SQL tables. However, if your queries are complex and include joins over large data sets, you may run into performance issues. This is because Presto does not know the statistical properties of the data, and such properties are the basis for the decisions of Presto’s Cost-Based Optimizer.

Collecting Statistics

To gather table statistics for tables in the Hive connector, execute the following command:

presto> ANALYZE orders;
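
Once ANALYZE completes, you can inspect the collected statistics (row count, null fractions, numbers of distinct values, and so on) with SHOW STATS:

presto> SHOW STATS FOR orders;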

Writing GCS Data with Presto

Prerequisites

Before you attempt to write data to GCS, make sure you have configured everything necessary to read data from GCS.

Create Export Schema

If the Hive metastore contains schemas mapped to GCS locations, you can use them to export data to GCS. If you don’t want to use the existing schemas (or there are no appropriate schemas in the Hive metastore), you need to create a new one:

presto> CREATE SCHEMA hive.gcs_export WITH (location = 'gs://my_bucket/some/path');

Export Data to GCS

Once you have a schema pointing to a location where you want to export the data, you can issue the export using a CREATE TABLE AS statement and select your desired file format.

presto> CREATE TABLE hive.gcs_export.my_table
WITH (format = 'ORC')
AS <source query>;

The data will be written to one or more files under the gs://my_bucket/some/path/my_table prefix. The number of files depends on the size of the data being exported and on the possible parallelization of the data source.

Example:

presto> CREATE TABLE hive.gcs_export.orders_export
WITH (format = 'ORC')
AS SELECT * FROM tpch.sf1.orders;
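
To verify the export, you can list the files Presto wrote under the schema location, e.g. with gsutil, and query the new table back:

gsutil ls gs://my_bucket/some/path/orders_export/

presto> SELECT count(*) FROM hive.gcs_export.orders_export;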