6.1. Overview of Presto on Azure
Starburst Presto is available on the Azure HDInsight Platform. Azure HDInsight is a fully-managed cloud service that makes it easy, fast, and cost-effective to process massive amounts of data. Presto and HDInsight were both designed for the separation of storage and compute which allows for flexible performance and cost. You pay only for what you use by creating clusters on demand scaling them up and down. You can read more about Azure HDInsight from Microsoft.
Presto is deployed as an application on Azure HDInsight and can be configured to immediately start querying data in Azure Storage. This allows you to install Presto on an existing HDInsight cluster or create a new cluster for Presto. Additionally, Presto is able to leverage the external metastore feature on HDInsight and share metadata with other clusters such as Hive and Spark
Releases & Software
Starburst Enterprise Presto
Starburst Enterprise Presto on Azure HDInsight is always based on our latest release.
Check out our Release Notes for more detailed and up to date information.
Packaged alongside Presto is the business intelligence tool Apache Superset. Superset is a data exploration and visualization web application that enables users to process data in a variety of ways including writing SQL queries, creating new tables, creating a visualization (slice), adding that visualization to one or many dashboards and downloading a CSV. Superset’s SQL Lab IDE provides the user ways to both query and visualize data. You can explore and preview tables in Presto and effortlessly compose SQL queries to access data. From here, you can export a CSV file or immediately visualize your data in the Superset “Explore” view.
For more information on Apache Superset, please refer to our dedicated section: Using Apache Superset.
Starburst Presto is installed as an application on the Azure HDInsight Hadoop Cluster. The Cluster contains 2 HDInsight Head nodes and a variable number of HDInsight Worker nodes. The Presto Coordinator is installed on one of the two HDInsight Head Nodes and the Presto Workers are installed on HDInsight Worker Nodes. When HDInsight is either scaled up or down, Presto will scale with it.
The deployment also includes an HDInsight Edge Node. The Presto command line interface (CLI) and Apache Superset is installed on this node.
It is highly recommended to not use the Presto HDInsight cluster for anything compute intensive when Presto is in use. Presto is designed to assume full utilization of the system resources to get maximum performance. For example, do not run both Hive jobs and Presto queries simultaneously on the same cluster. However you can run them on the same cluster during different times.
Recommended Node Sizes
Generally, you should favor clusters with a higher CPU/memory ratio, as Presto is usually bound by CPU. However, machines with larger memory are favorable if one or more of the following cases hold true:
- Queries are failing because of exceeding node memory limits
- There is high query concurrency and queries are executing in the reserved memory pool
- A large number of nodes with lower memory and a higher number of CPUs is required because maximum query memory is very high
- Query skewness issues
- Presto is not bounded by CPU
|Instance Family||Category||CPU/Memory Ratio||Use Case|
|Fsv2, Fs, F||Compute||High||
|B, Dsv3, Dv3, DSv2, Dv2, Av2.||General||Moderate||
|Esv3, Ev3, M, GS, G, DSv2, Dv2||Memory||Low||
For more information refer to Microsoft’s documentation for Virtual Machines and sizes
Using Azure Powershell and Azure CLI
This guide primarily refers to using the Azure Portal when describing various actions. The same functionality is also available by using Azure Powershell and Azure CLI. Refer to the following documentation from Microsoft on how to manage Azure HDInsight clusters using these tools.