4.16. Presto Coordinator High Availability

Starburst Presto offers the ability to enable Presto Coordinator high availability (HA). In the event the Coordinator becomes unavailable, this allows your Presto cluster to automatically switch over to a new Coordinator and continue to accept new queries.

Configuring Presto Coordinator HA

Coordinator high availability (HA) is supported only via the Starburst CloudFormation template in AWS. This is where we tie together all the necessary components for this feature to work. In order to fully utilize this capability set the HACoordinatorsCount field of the Stack creation form (EC2 Configuration section) to a value greater than 1. Setting it to 2 or 3 should suffice most scenarios.

HA is ALWAYS enabled. However in the when HACoordinatorsCount is set to 1, there is no hot standby. In that case Starburst Presto will eventually create a new Coordinator and this may take several minutes. If HACoordinatorsCount equals 2 or more, then there are hot standby Coordinators and the fail-over switch is faster.

Coordinator IP Address

Coordinator is accessible via attached Elastic Network Interface (ENI) which has a static auto-assigned private IP address. After you launch the Starburst CloudFormation cluster stack, note the PrestoCoordinatorURL and PrestoSSH keys in the stack’s Outputs section/tab in the AWS CloudFormation console.

PrestoCoordinatorURL is the Presto Web UI and REST API endpoint address, which you use to point your Presto CLI or JDBC/ODBC drivers or access the Presto Web UI from your browser. PrestoSSH notes the SSH connection details to manually log onto the current Presto Coordinator.

Fail Over Scenario(s)

In general in the event of a failure of the current Presto Coordinator the HA mechanism will kick in and perform the following steps:

  1. Terminate the old/failed Coordinator EC2 instance
  1. Attach Elastic Network Interface to the new Coordinator
  1. Launch a new stand-by coordinator (within a couple minutes)

The core fail-over process (steps 1 and 2) should complete in under a minute, from the time when the Coordinator started failing to respond. It is a matter of seconds once the Coordinator is identified to be in a failed state, but there is some built-in time buffer so that we don’t act on a false alarm. When Elastic Network Interface is attached to the new coordinator it is almost immediately available to clients.

In real life it may happen that a Coordinator “dies” because one of the following:

  • The node becomes unresponsive (e.g. hardware issues, OS level and network issues).
  • The node disappears, might be terminated by some account admin or by AWS.
  • The Presto process may exit because of an fatal error.
  • The Presto process may become unresponsive, e.g. because of a long full garbage collection taking place.

In all those scenarios, after a short grace period, the failed Coordinator, if still exists, is terminated. Then a new Coordinator is chosen among the hot standby instances and has Coordinator ENI attached to it. Clients should re-issue the failed queries when the new Coordinator becomes accessible. A new hot standby coordinator is launched in the background to take place of the one that has just been assigned.

High-Availability Considerations

  • When Presto is deployed in a custom setup (e.g. with a bootstrap script which sets up security) make sure the Presto HTTP port (unsecured) is open and accessible from localhost. You may want to restrict access to it by binding it to localhost only or otherwise securing external access e.g. via the AWS Security Group assigned to the cluster. See HA with Presto HTTPS enabled for more information.
  • Coordinator ENI has private IP address which is accessible only within the same VPC as the Starburst Presto cluster stack. This means in order to connect to Presto’s Coordinator you need to initiate the connection from a client either on EC2 machine deployed in the same VPC or connected to the VPC via a VPN.
  • Please note that all queries that were running when the Coordinator failed, will also fail to complete. You will need to restart these queries on the new Coordinator. Similarly the SSH connections to the old Coordinator will need to be re-established after the fail-over.
  • When connecting via SSH, depending on your SSH configuration you may see login issues like REMOTE HOST IDENTIFICATION HAS CHANGED etc, due to the fact that the underlying host has changed, and the key’s fingerprint that was previously accepted has changed. You may want to not verify the host keys at all, by adding -o StrictHostKeyChecking=no to the SSH command or deleting the key from your known_hosts file and accepting the new one.

HA with Presto HTTPS enabled

The Coordinator’s health is checked by polling Presto locally via HTTP. This is why you need to have the Coordinator’s HTTP port open even if you configured Presto to use HTTPS, regardless how many Coordinators were configured (even if only one). Workers do not need to have their HTTP port open, although we recommend using HTTP for internal cluster communication (unless HTTPS is explicitly required for internal communication as well). Note that using HTTPS for internal communication may have substantial impact on overall cluster performance, because all intermediate data needs to be encrypted and decrypted. The overhead of HTTPS depends on the amount of data sent over network and actual ciphers being used. Example Presto config.properties fragment:

... other `https` related configs

Additionally you should block non-local HTTP access to the Coordinator by configuring the AWS Security Group assigned to the cluster accordingly.