Configuring Starburst Enterprise with CFT#

Starburst Enterprise platform (SEP) has an extensive set of configuration switches that allow it to be tuned for certain specific requirements. Default values are chosen for the best “out of the box” experience. However, if you need to fine-tune SEP behavior, you can do so when using Starburst’s CloudFormation template.

Default configuration#

The following configuration changes are applied automatically for you:

Java heap maximum memory (-Xmx) is set appropriately for the selected EC2 instance type
JVM’s JIT caches are set to 512 MiB
Java is configured to use the G1 garbage collector, which is the recommended garbage collector when running SEP
If Hive Metastore is configured (refer to Configuring the Hive Metastore Service with CFT), the hive catalog is configured with connector configuration left at default values.
A query audit event listener is configured in etc/event-listener-audit-log.properties. If you have configured another event listener, add the property event-listener.config-files in the config properties file, and ensure both files are in the comma-separated list.
The query.max-memory property is set to 1PB. This setting overrides the low default value.

Note

All configuration changes generated by the CFT are stored in the etc directory of the SEP installation directory. Because the installation directory itself is mounted as a RAM disk, files generated by the CFT configuration are also stored in memory only.

No secrets in any files, such as usernames or passwords in catalog files, are actually stored on disk at any time, and the files cannot be accessed from outside the running EC2 instances.

Custom configuration#

When using Starburst’s CloudFormation template, configuration packages for the coordinator, workers and catalogs are used to customize SEP. These configuration packages are used to append or override the default SEP configuration.

The CloudFormation template provides the AdditionalCoordinatorConfigurationURI and AdditionalWorkersConfigurationURI parameters used to specify the locations of the configuration packages for the coordinator and workers respectively. See the following sections for how to create, upload, and use configuration packages for SEP.

Note

All configuration changes made to your SEP cluster must be performed via the CloudFormation Template. If you manually change the configurations on the instances running SEP, the changes are not persisted.

Creating a configuration package#

A configuration package is a ZIP file with the structure shown below. All files are optional except for the top-level etc/ directory entry.

etc/
  config.properties
  jvm.config
  catalog/
    hive.properties
    <catalog-name>.properties

Warning

You must use this exact directory structure or SEP is unable to start correctly.

Startup script nodes#
Node name	Description
`etc/config.properties`	This global configuration file is optional. Refer to the properties reference documentation for details.
`etc/jvm.config`	This Java Virtual Machine configuration file is optional. Certain options, including `-Xmx` and garbage collection algorithm selection are set by default.
`etc/catalog/hive.properties`	If the configuration package contains this file and the Hive Metastore is not configured when launching Starburst’s CloudFormation template (refer to Configuring the Hive Metastore Service with CFT), the file must contain the following two properties: `connector.name=hive` and `hive.metastore.uri=thrift://example.net:9083`. If the `MetastoreType` parameter is set to something other than `None`, the `hive.properties` file is already created for you and you do not need to provide these properties. However, you can still provide a `hive.properties` file that includes properties you wish to append to the configuration. Refer to the Hive connector and Hive security documentation for options that can be set here. Also refer to Auxiliary files in this table for instructions on how to configure properties that refer to additional files.
`etc/catalog/<catalog-name>.properties`	When such a file is placed in the configuration package, a catalog called `<catalog-name>` is created. The file must contain the property `connector.name=<connector_name>`, where `<connector_name>` is the name of the connector. Refer to the connector documentation for a list of supported connectors. If the chosen connector has mandatory configuration parameters, they must be set in the `<catalog-name>.properties` file. There can be more than one such file in the `etc/catalog/` folder of the configuration package. This allows you to define multiple catalogs. Refer to Auxiliary files in this table for instructions on how to configure properties that refer to additional files.
Auxiliary files	If a configuration property in any of the configuration files accepts a path to an additional file (e.g., Hive’s `security.config-file`), add the file to the configuration package and refer to it using a path that is relative to the configuration package top-level directory. For example, if you are configuring the Hive connector to use `hive.security=file`, you must also set `security.config-file` (see the Hive security documentation for the meaning and structure of the file). To do so, add `etc/catalog/hive-security.json` to the configuration package and refer to it using a relative path by setting the two properties `hive.security=file` and `security.config-file=etc/catalog/hive-security.json`.
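The package layout described above can be assembled with any ZIP tool. The following is a minimal Python sketch, assuming only the standard library. The helper name `build_config_package` and the property values are hypothetical and chosen for illustration. It writes an explicit top-level `etc/` directory entry and rejects any file outside that directory.

```python
import io
import zipfile


def build_config_package(files: dict[str, str]) -> bytes:
    """Return the bytes of a ZIP archive that holds `files` under etc/.

    `files` maps archive paths (for example "etc/config.properties")
    to the text content of each file.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The top-level etc/ directory entry is the only mandatory entry.
        archive.writestr("etc/", "")
        for path, content in sorted(files.items()):
            if not path.startswith("etc/"):
                raise ValueError(f"{path!r} is outside the etc/ directory")
            archive.writestr(path, content)
    return buffer.getvalue()


# Illustrative content only; adjust the properties to your environment.
package = build_config_package({
    "etc/config.properties": "query.max-memory=50GB\n",
    "etc/catalog/hive.properties": (
        "connector.name=hive\n"
        "hive.metastore.uri=thrift://example.net:9083\n"
    ),
})
```

The resulting bytes can be written to a local `.zip` file and uploaded to S3 with the tool of your choice.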

Uploading a configuration package to S3#

To use a configuration package ZIP when launching Starburst’s CloudFormation template, it must first be uploaded to S3 to a location of your choice.

Warning

If the configuration package contains sensitive information such as passwords, AWS access keys or Kerberos keytab files, make sure to use an S3 location that is not publicly accessible.

Using a configuration package#

When launching Starburst’s CloudFormation template, you can use the AdditionalCoordinatorConfigurationURI and AdditionalWorkersConfigurationURI parameters to refer to the configuration package that should be applied on top of the default configuration created by the template. The URI should be of the form s3://my_bucket/path/to/configuration/package.zip. You can use a single configuration package for both the SEP coordinator and the workers, or use a different package for each. You can also provide a configuration package for only the coordinator or only the workers.

If you upload to a location that is not publicly accessible, you must use the IamInstanceProfile parameter when launching the cluster, and the selected instance profile must allow read access to the selected S3 location.

Updating a configuration package#

Instead of deleting a CloudFormation stack and creating a new one, you can use the AWS stack update feature to update the SEP configuration package. You must first create a new configuration package with the necessary changes, and then upload it to S3 as described in the previous sections. Then, when updating the CloudFormation stack, enter the new S3 location as the value of the AdditionalCoordinatorConfigurationURI and AdditionalWorkersConfigurationURI parameters. When CloudFormation applies the update, it uses the new configuration package to configure SEP.

AWS CloudFormation does not update the CloudFormation stack if the parameter values have not changed. Therefore, you must create a new configuration package ZIP file with a different name. We recommend including a version number in the file name to avoid any confusion when updating your configuration.

For example, if the original configuration package was located at s3://my_bucket/path/to/configuration/package-1.0.zip, then create a new configuration package with a location such as: s3://my_bucket/path/to/configuration/package-2.0.zip. Even if you change the contents of s3://my_bucket/path/to/configuration/package-1.0.zip and keep the name, CloudFormation is not able to update the configuration.
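One way to guarantee that the parameter value changes on every update is to derive the new package location from the previous one. The sketch below assumes file names end in `-<major>.<minor>.zip`, as in the example above. The function name `next_package_uri` is hypothetical.

```python
import re


def next_package_uri(current_uri: str) -> str:
    """Return the URI with the major version increased by one.

    Assumes the URI ends in "-<major>.<minor>.zip", for example
    "s3://my_bucket/path/to/configuration/package-1.0.zip".
    """
    match = re.search(r"-(\d+)\.(\d+)\.zip$", current_uri)
    if match is None:
        raise ValueError(f"no version suffix found in {current_uri!r}")
    major = int(match.group(1)) + 1
    # Keep everything before the version suffix and reset the minor version.
    return f"{current_uri[:match.start()]}-{major}.0.zip"


new_uri = next_package_uri(
    "s3://my_bucket/path/to/configuration/package-1.0.zip"
)
# new_uri == "s3://my_bucket/path/to/configuration/package-2.0.zip"
```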

Interactions between default and custom configurations#

It is important to note that default values are overridden only for keys where a customization exists. If no customizations are made, the default value remains. However, in the case of jvm.config, additional configuration entries are appended to the default configuration.
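The two behaviors can be illustrated with a short Python sketch. It models the semantics described above and is not the template's actual implementation. The function names, the `http-server.http.port` entry, and the JVM flags are illustrative assumptions.

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # comments."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


def merge_properties(default_text: str, custom_text: str) -> dict[str, str]:
    """Custom keys override defaults; untouched defaults are kept."""
    merged = parse_properties(default_text)
    merged.update(parse_properties(custom_text))
    return merged


def merge_jvm_config(default_text: str, custom_text: str) -> str:
    """Custom jvm.config entries are appended after the default entries."""
    lines = default_text.splitlines() + custom_text.splitlines()
    return "\n".join(lines) + "\n"


merged = merge_properties(
    "query.max-memory=1PB\nhttp-server.http.port=8080\n",  # defaults
    "query.max-memory=100GB\n",                            # customization
)
# merged == {"query.max-memory": "100GB", "http-server.http.port": "8080"}

jvm = merge_jvm_config("-Xmx100G\n-XX:+UseG1GC\n", "-Dexample.flag=true\n")
# jvm == "-Xmx100G\n-XX:+UseG1GC\n-Dexample.flag=true\n"
```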

CFT configuration parameters#

The CFT includes numerous configuration parameters that are grouped in different sections. All listed parameters have a description in the AWS console.

Network configuration#

Network Configuration Parameters#
Parameter key	Description	Example
`VPC`	Virtual Private Cloud ID	vpc-4bd6ca11
`Subnet`	Subnet to use for SEP nodes (must belong to the selected VPC)	subnet-123abc2b
`SelectedSubnetAutoAssignsPublicIp`	Set to `no` if the selected subnet does not provide public IPs. In this case, VPC endpoints are created for the SEP stack. Creating VPC endpoints also creates an `EndpointSecurityGroup`. There is no option to use an existing security group for the endpoints.	yes
`SecurityGroups`	Additional security groups for SEP nodes (e.g., allowing SSH access). You must select at least one.	sg-12e34aeb

EC2 configuration#

The EC2 configuration details the infrastructure used for your SEP cluster.

Choose a CoordinatorInstanceType and WorkerInstanceType suitable for your workload. The r4.4xlarge instance types are chosen by default and work well for most workloads. See our CFT deployment guide for information about what instance types may be best for you.

EC2 Configuration Parameters#
Parameter key	Description	Default	Example
`CoordinatorInstanceType`	EC2 instance type of the coordinator.	r4.xlarge	r5.12xlarge
`WorkerInstanceType`	EC2 instance type of the workers.	r4.xlarge	m5.4xlarge
`KeyName`	Name of an EC2 KeyPair to enable SSH access to the instance. See SSH keys for more details.		john.smith
`WorkersCount`	Number of dedicated worker nodes (apart from coordinator) to instantiate. Worker nodes are added to an AWS AutoScaling Group. See Auto scaling for more details.		10
`HACoordinatorsCount`	Number of coordinator nodes to instantiate. If there is more than one, the coordinator offers HA capabilities. This number represents one active coordinator plus the number of optional hot-standby coordinators. For example, if you specify 3, there is 1 active coordinator and 2 standby coordinators that are available if the active one fails. See Coordinator high availability for more details.	1	3
`WorkerMountVolume`	Mount an additional EBS volume on each worker at `/data`. This is required when using caching for distributed storage. Make sure that the `/data` directory is configured in your Hive catalog properties.	no	yes
`WorkerVolumeType`	Type of the additional EBS volume mounted on the workers.	io1	gp2
`WorkerVolumeSize`	Size of the additional EBS volume mounted on the workers, in GiB. Use at least 10 GiB with the io1 volume type. Value must be in the range of 4 to 16384.	4	100
`WorkerVolumeIOPS`	The number of possible I/O operations per second for the additional volume. Used only with the io1 volume type. Each 5000 I/O ops require at least 100 GiB storage size on the volume. Value must be in the range of 100 to 20000.	100	2000
`KeepCoordinatorNode`	(Debug only) Keep coordinator node running after the coordinator service fails.	no	yes
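The volume constraints in the table can be checked before launching the stack. The sketch below is a hypothetical helper, not part of the template. It reads "each 5000 I/O ops require at least 100 GiB" as a proportional limit of 50 IOPS per GiB, which is an assumption, and it treats the 10 GiB guidance for io1 as a hard check.

```python
def validate_worker_volume(volume_type: str, size_gib: int, iops: int) -> list[str]:
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    if not 4 <= size_gib <= 16384:
        problems.append("WorkerVolumeSize must be in the range 4 to 16384")
    if volume_type == "io1":
        # WorkerVolumeIOPS is only used with the io1 volume type.
        if size_gib < 10:
            problems.append("use at least 10 GiB with the io1 volume type")
        if not 100 <= iops <= 20000:
            problems.append("WorkerVolumeIOPS must be in the range 100 to 20000")
        # Assumed reading: 5000 IOPS per 100 GiB, i.e. 50 IOPS per GiB.
        required_gib = (iops + 49) // 50  # integer ceiling of iops / 50
        if size_gib < required_gib:
            problems.append(
                f"{iops} IOPS require at least {required_gib} GiB of storage"
            )
    return problems


ok = validate_worker_volume("io1", 100, 2000)
too_small = validate_worker_volume("io1", 20, 2000)
# ok == []
# too_small == ["2000 IOPS require at least 40 GiB of storage"]
```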

SEP configuration#

The SEP configuration parameters allow you to configure all SEP-specific aspects of your coordinators and workers in the cluster.

SEP Configuration Parameters#
Parameter key	Description
`AdditionalCoordinatorConfigurationURI`	(Optional) URI of S3 zip file with additional configuration for the coordinator. This zip file must contain the required directory structure. Example `s3://my_bucket/starburst-additional-coordinator-configuration-1.0.zip`.
`AdditionalWorkersConfigurationURI`	(Optional) URI of S3 zip file with additional configuration for the workers. This zip file must contain the required directory structure. Example `s3://my_bucket/starburst-additional-workers-configuration-1.0.zip`.
`BootstrapScriptURI`	(Optional) URI of a shell script stored on S3 to execute on all nodes. The script runs after SEP is configured, but before it is started. For example, a bash script can be used to create directories, install additional software, deploy UDFs, or deploy other plugins. When the script is executed, a string argument value of `coordinator` or `worker` is passed in. Check for this argument value in your script to perform certain actions based on the node type. Example `s3://my_bucket/starburst-bootstrap-1.0.sh`.
`StarburstHttpPort`	Port to use for SEP coordinator and therefore the Starburst Enterprise web UI as well as JDBC and other client connections. Example `8080`.
`LicenseURI`	URI of the SEP license in S3. This is only needed when deploying the CFT (using a privately shared SEP AMI) without subscribing to the AWS Marketplace. Example `s3://my_bucket/starburstdata.license`.

Hive connector options#

The Hive connector is required if you plan to access data in HDFS or S3. It requires a Hive Metastore so SEP knows where data lives. Refer to the dedicated documentation Configuring the Hive Metastore Service with CFT to determine your configuration.

Hive Connector Options#
Parameter key	Description
`MetastoreType`	Determines what metastore is used by the Hive connector. Defaults to `None`, which means that no Hive connector is provisioned. Example `AWS Glue Data catalog`.
`ExternalMetastoreHost`	When external Metastore is used (see `MetastoreType` parameter), this points to the host of the Metastore. Example `metastore.example.com`.
`ExternalMetastorePort`	When external Metastore is used (see `MetastoreType` parameter), this points to the Metastore service port number. When set to `0` (the default value), the default port for each metastore type is used: `3306` for `External MySQL RDBMS`, `5432` for `External PostgreSQL RDBMS`, and `9083` for `External Hive Metastore Service`. Cannot be empty when `MetastoreType` is set to any of `External MySQL RDBMS`, `External PostgreSQL RDBMS`, or `External Hive Metastore Service`. Example `9083`.
`ExternalRdbmsMetastoreUserName`	When external Metastore is used (see `MetastoreType` parameter), this determines the JDBC connection user name. Cannot be empty when `MetastoreType` is set to either `External MySQL RDBMS` or `External PostgreSQL RDBMS`. Example `database_user_name`.
`ExternalRdbmsMetastorePassword`	When external Metastore is used (see `MetastoreType` parameter), this determines the JDBC connection password. Cannot be empty when `MetastoreType` is set to either `External MySQL RDBMS` or `External PostgreSQL RDBMS`. Example `jdbc_user_p@55vv0rd`.
`ExternalRdbmsMetastoreDatabaseName`	When external Metastore is used (see `MetastoreType` parameter), this determines the name of the database used for the JDBC connection. Cannot be empty when `MetastoreType` is set to either `External MySQL RDBMS` or `External PostgreSQL RDBMS`. Example `hivemetastore`.
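The port fallback for `ExternalMetastorePort` can be expressed as a small lookup. This Python sketch only mirrors the defaults listed in the table. The function name `effective_metastore_port` is hypothetical and the code is not taken from the template.

```python
# Default ports per MetastoreType, as listed for ExternalMetastorePort.
DEFAULT_METASTORE_PORTS = {
    "External MySQL RDBMS": 3306,
    "External PostgreSQL RDBMS": 5432,
    "External Hive Metastore Service": 9083,
}


def effective_metastore_port(metastore_type: str, configured_port: int = 0) -> int:
    """Return the port to use; 0 selects the default for the metastore type."""
    if metastore_type not in DEFAULT_METASTORE_PORTS:
        raise ValueError(f"{metastore_type!r} is not an external metastore type")
    if configured_port == 0:
        return DEFAULT_METASTORE_PORTS[metastore_type]
    return configured_port


default_port = effective_metastore_port("External PostgreSQL RDBMS")
custom_port = effective_metastore_port("External Hive Metastore Service", 9084)
# default_port == 5432
# custom_port == 9084
```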

Ranger and LDAP user synchronization#

The following parameters are related to the global access control with Apache Ranger and the related synchronization of Ranger with an LDAP backend for user and group information.

Ranger-related Configuration Parameters#
Parameter key	Description
`EnableRanger`	When enabled, Apache Ranger for global access control is added. Defaults to no. Note that all other settings in this section are ignored if Ranger is disabled. Example `yes`.
`RangerAdminPassword`	Administrator password for Ranger. At least 8 characters, including lowercase, uppercase and digit, are required. When reusing an existing external database for Ranger in your CFT stack, you must provide the same password as the initial one, to ensure access remains functional.
`RangerBackendType`	Type of database backend used for Apache Ranger. The default `External PostgreSQL RDBMS` is recommended for production usage. `Built-in PostgreSQL RDBMS` is ephemeral and only suitable for demo purposes.
`ExternalRdbmsRangerHost`	Hostname of the external PostgreSQL RDBMS server.
`ExternalRdbmsRangerPort`	Port of the external PostgreSQL RDBMS server. Defaults to 5432.
`ExternalRdbmsRangerDatabaseName`	Name of the database on the external PostgreSQL RDBMS server to use as Ranger database backend. The database must already exist. Defaults to `ranger`.
`ExternalRdbmsRangerUserName`	Name of the database user that Ranger uses to manage the database on the external PostgreSQL RDBMS. The user must exist, have full permissions to the database, and have CREATEROLE permissions granted. An additional user `ranger` is created for non-admin database access. If you specify `ranger`, that single user is used for all operations. Defaults to `rangeradmin`.
`ExternalRdbmsRangerPassword`	Password for the database user.
`RangerConfigFile`	URL to an optional additional Ranger config file in an S3 bucket. A template is available to download. Modify the template and upload it to an S3 bucket. The config file is required for using Solr Audit with Ranger and other customizations. Example: `s3://my-bucket/my-config_file.properties`
`RangerBootstrapScript`	URL to an optional bootstrap script in an S3 bucket. The script is run before Ranger starts. For example, a bootstrap script can be used to provide truststore files. Example: `s3://my-bucket/ranger-bootstrap.sh`
`EnableRangerUserSync`	When enabled, Apache Ranger synchronizes users from an external LDAP directory. Requires Ranger to be enabled. Disabled by default. The `RangerUserSyncConfigFile` setting is ignored if Ranger user sync is disabled.
`RangerUserSyncConfigFile`	URL to the Ranger user synchronization configuration file in an S3 bucket. A user sync template is available to download. Create a modified copy of the template and upload it to an S3 bucket. Required if Ranger user sync is enabled. Example: `s3://my-bucket/my-config_file.properties`
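The `RangerAdminPassword` rule (at least 8 characters, including lowercase, uppercase, and digit) can be checked locally before launching the stack. The following is a minimal sketch. The function name is hypothetical, and the template may apply additional checks that are not described here.

```python
def ranger_password_problems(password: str) -> list[str]:
    """Return the documented password rules that `password` violates."""
    problems = []
    if len(password) < 8:
        problems.append("shorter than 8 characters")
    if not any(char.islower() for char in password):
        problems.append("no lowercase letter")
    if not any(char.isupper() for char in password):
        problems.append("no uppercase letter")
    if not any(char.isdigit() for char in password):
        problems.append("no digit")
    return problems


valid = ranger_password_problems("Str0ngPass")
invalid = ranger_password_problems("short")
# valid == []
# invalid == ["shorter than 8 characters", "no uppercase letter", "no digit"]
```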

Advanced AWS S3 configuration#

The advanced AWS S3 configuration parameters only affect the configuration of provisioned Hive catalogs in order to:

configure custom access credentials for AWS S3
access a third-party S3-compatible storage system

In both of these cases, you must set all three of the parameters listed in the following table:

Advanced S3 Configuration Parameters#
Parameter key	Description	Example
`S3Endpoint`	URI to AWS S3-compatible endpoint. Your choice of endpoint affects your ability to write to buckets. Specifying https://s3.us-east-2.amazonaws.com allows you to write to any bucket in that region, whereas specifying https://mybucket.s3-us-west-2.amazonaws.com restricts the metastore to reading and writing from a single bucket.	https://s3.us-east-2.amazonaws.com
`S3AccessKey`	Access key to AWS S3-compatible storage	AKIAIOSFODNN7EXAMPLE
`S3SecretKey`	Access secret to AWS S3-compatible storage	wJarXUI/PiYEXAMPLEKEY

Warning

Failure to set the S3Endpoint results in an empty value for both S3AccessKey and S3SecretKey in the hive-site.xml file generated for the CFT deployment, resulting in Access Denied exceptions at runtime.

Monitoring#

Monitoring Parameters#
Parameter key	Description	Example
`EnableCloudWatchMetrics`	Enable integration with CloudWatch metrics. When enabled, OS and SEP metrics are reported for each cluster node and a CloudWatch Dashboard with cluster overview is created. Additional CloudWatch fees are charged. Refer to Configuring Starburst Enterprise with CloudWatch in CFT for more details.	no

IAM instance#

IAM instance parameters#
Parameter key	Description	Example
`IamInstanceProfile`	Optional name of an IAM instance profile to attach to SEP nodes. See Instance profiles for more detail. If you do not specify the `IamInstanceProfile` parameter, the CloudFormation template creates the necessary IAM role privileges.	my-ec2-instance-profile

Other parameters#

Other Parameters#
Parameter key	Description	Example
`LaunchSuperset`	When enabled, Superset is deployed and started on an EC2 instance	yes