9.25. Caching for Distributed Storage#

Querying object storage with the Hive connector is a very common use case for Presto, that often involves the transfer of large amounts of data. The objects are retrieved from HDFS, or any other supported object storage, by multiple workers and processed on these workers.

Repeated queries with different parameters, or even different queries from different users, often access, and therefore transfer, the same objects. The caching can provide significant performance benefits, by avoiding the repeated network transfers and instead accessing copies of the objects from a local cache.

Warning

Caching is currently available as beta release only. Work with the Starburst support team, if you are planning to use this feature in production.

Architecture#

Caching for distributed storage provides a read-through cache. After first retrieval from storage by any query, objects are cached in the local cache storage on the workers. Objects are cached on local storage of each worker and managed by a bookkeeper component. The cache chunks are 1MB in size and are well suited for ORC or Parquet format objects.

Configuration#

The caching feature is part of the Hive connector and can be activated in the catalog properties file:

connector.name=hive-hadoop2
hive.cache.enabled=true
hive.cache.location=/opt/hive-cache

The cache operates on the coordinator and all workers accessing the object storage. The used networking ports for the managing bookkeeper and the data transfer, by default 8898 and 8899, need to be available.

To use caching on multiple catalogs, you need to configure different caching directories and different book-keeper and data-transfer ports.

Cache configuration parameters#

Property

Description

Default

hive.cache.enabled

Toggle to enable or disable caching

false

hive.cache.location

Required directory location to use for the cache storage on each worker. Fast cache performance can be achieved with a RAM disk used as in-memory cache, or with high performance SSD disk usage. Storage should be local to on each coordinator and worker node. The directory needs to exist before Presto starts.

hive.cache.data-transfer-port

The TCP/IP port used to transfer data managed by the cache.

8898

hive.cache.bookkeeper-port

The TCP/IP port used by the bookkeeper managing the cache.

8899

hive.cache.parallel-warmup-enabled

Toggle to enable that during the first retrieval of objects, they are returned and cached in parallel. This can reduce the impact of a cold cache.

true