4.7. Collecting Statistics

If your queries are complex and join large data sets, you may run into performance issues. Without table statistics, Presto does not know the statistical properties of the data, and the Presto Cost-Based Optimizer bases its decisions on those properties.

Presto relies on Hive to collect table statistics. Because Presto is installed on an HDInsight cluster with Hadoop, collecting statistics takes only a few steps.

1. Connect to the Edge Node

Log in to the Edge Node as described in "Connect to the Presto Edge Node via SSH".

2. Run beeline to connect to Hive

beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http'

3. Collect Statistics

analyze table default.hivesampletable compute statistics for columns;
analyze table default.hivesampletable compute statistics;
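
Steps 2 and 3 can also be run without an interactive session. The following is a minimal sketch, not an official procedure: it reuses the JDBC URL and the two statements shown above and passes them to beeline with the -e option, once per statement. The TABLES list and the analyze_sql helper are hypothetical names introduced for illustration; replace the list with your own tables. If beeline is not on the PATH (for example, when you are not on the Edge Node), the script only prints the statements it would have run.

```shell
#!/usr/bin/env bash
# Sketch: collect column and table statistics for a list of Hive tables.
set -euo pipefail

# JDBC URL from step 2.
JDBC_URL='jdbc:hive2://headnodehost:10001/;transportMode=http'

# Hypothetical table list; replace with the tables your queries join.
TABLES=("default.hivesampletable")

# Print the two statements from step 3 for one table, one per line.
analyze_sql() {
  local table="$1"
  printf 'analyze table %s compute statistics for columns;\n' "$table"
  printf 'analyze table %s compute statistics;\n' "$table"
}

for table in "${TABLES[@]}"; do
  if command -v beeline >/dev/null 2>&1; then
    # Build one beeline call: -u is the connection URL, each -e runs one statement.
    args=(-u "$JDBC_URL")
    while IFS= read -r stmt; do
      args+=(-e "$stmt")
    done < <(analyze_sql "$table")
    beeline "${args[@]}"
  else
    # beeline is not available here: show what would be run.
    echo "beeline not found; statements for ${table}:"
    analyze_sql "$table"
  fi
done
```

Statistics are not refreshed automatically by this sketch, so it would need to be rerun after the underlying data changes significantly.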

It is recommended that you use the Custom Metastore so that the statistics persist even when the Presto HDInsight cluster is recreated.