Analytics Platform installation¶
This guide covers the installation of the Analytics Platform (AP) software. The AP backend server is composed of the following services.
- API gateway
- Identity
- Data pipeline
The key and port of each service are described below. The key refers to the name used in configuration directories and files. The port is the default port on which the service listens for incoming requests.
Name | Key | Port |
---|---|---|
API gateway | bao-api-gateway | 8085 |
Identity | bao-identity | 8086 |
Data pipeline | bao-data-pipeline | 8084 |
User¶
Create an operating system user for running the AP services. This guide uses bao-admin as the username, though any valid username can be used. The user has no password. For security reasons, avoiding password-based login and instead using SSH key-based login is strongly recommended.
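A minimal sketch for creating the user, assuming a standard Linux system with useradd available:
sudo useradd --create-home --shell /bin/bash bao-admin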
For security reasons, the AP services should not run as a privileged user. It may, however, be practical to allow sudo without a password:
sudo usermod -aG sudo bao-admin
echo "bao-admin ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/50-ap-users
Create the SSH directory and add the authorized keys file. Add the public keys of the users who should have access.
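A sketch of the directory and file setup, assuming the OpenSSH defaults of ~/.ssh and authorized_keys and a home directory of /home/bao-admin:
sudo -u bao-admin mkdir -p /home/bao-admin/.ssh
sudo -u bao-admin touch /home/bao-admin/.ssh/authorized_keys
sudo -u bao-admin chmod 700 /home/bao-admin/.ssh
sudo -u bao-admin chmod 600 /home/bao-admin/.ssh/authorized_keys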
SSH¶
Carefully confirm that public key authentication to the server succeeds, i.e. that login works without specifying a password.
Disable password-based authentication for enhanced security. Create an SSH daemon configuration file and add the following properties.
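A sketch, assuming an OpenSSH server which reads drop-in files from /etc/ssh/sshd_config.d (the file name 60-ap.conf is an arbitrary choice):
# /etc/ssh/sshd_config.d/60-ap.conf
PubkeyAuthentication yes
PasswordAuthentication no
KbdInteractiveAuthentication no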
Restart the SSH daemon for the changes to take effect.
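The SSH service name varies by distribution, for example ssh on Debian and Ubuntu and sshd on RHEL-based systems:
sudo systemctl restart sshd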
JAR files¶
Each service is available as an executable JAR file.
The JAR files should be installed at the following locations.
JAR file | File location |
---|---|
bao-api-gateway.jar | /var/lib/bao-api-gateway/bao-api-gateway.jar |
bao-identity.jar | /var/lib/bao-identity/bao-identity.jar |
bao-data-pipeline.jar | /var/lib/bao-data-pipeline/bao-data-pipeline.jar |
Create the directories manually and make bao-admin the owner.
sudo mkdir /var/lib/bao-api-gateway
sudo mkdir /var/lib/bao-identity
sudo mkdir /var/lib/bao-data-pipeline
sudo chown bao-admin:bao-admin /var/lib/bao-api-gateway
sudo chown bao-admin:bao-admin /var/lib/bao-identity
sudo chown bao-admin:bao-admin /var/lib/bao-data-pipeline
Place the JAR files in the respective directories and make bao-admin the owner.
sudo cp bao-api-gateway.jar /var/lib/bao-api-gateway
sudo cp bao-identity.jar /var/lib/bao-identity
sudo cp bao-data-pipeline.jar /var/lib/bao-data-pipeline
sudo chown bao-admin:bao-admin /var/lib/bao-api-gateway/bao-api-gateway.jar
sudo chown bao-admin:bao-admin /var/lib/bao-identity/bao-identity.jar
sudo chown bao-admin:bao-admin /var/lib/bao-data-pipeline/bao-data-pipeline.jar
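The systemd service files below start the JAR files directly via ExecStart, which requires the executable bit to be set. If the JAR files are not already executable, a sketch:
sudo chmod +x /var/lib/bao-api-gateway/bao-api-gateway.jar
sudo chmod +x /var/lib/bao-identity/bao-identity.jar
sudo chmod +x /var/lib/bao-data-pipeline/bao-data-pipeline.jar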
Systemd¶
The systemd service manager is used to manage the service processes. Each service has a corresponding systemd service file and a configuration file.
The systemd service files are specified below. The memory allocations should be adjusted to the available server resources. The systemd service files should be located in the /etc/systemd/system directory.
Systemd file | File location |
---|---|
bao-api-gateway.service | /etc/systemd/system/bao-api-gateway.service |
bao-identity.service | /etc/systemd/system/bao-identity.service |
bao-data-pipeline.service | /etc/systemd/system/bao-data-pipeline.service |
The bao-api-gateway.service systemd service file.
[Unit]
Description = AP API Gateway
[Service]
Environment = "JAVA_OPTS=-Xms256M -Xmx512M"
ExecStart = /var/lib/bao-api-gateway/bao-api-gateway.jar
User = bao-admin
[Install]
WantedBy = multi-user.target
The bao-identity.service systemd service file.
[Unit]
Description = AP Identity
[Service]
Environment = "JAVA_OPTS=-Xms1024M -Xmx2048M"
ExecStart = /var/lib/bao-identity/bao-identity.jar
User = bao-admin
[Install]
WantedBy = multi-user.target
The bao-data-pipeline.service systemd service file.
[Unit]
Description = AP Data Pipeline
[Service]
Environment = "JAVA_OPTS=-Xms1024M -Xmx2048M"
ExecStart = /var/lib/bao-data-pipeline/bao-data-pipeline.jar
User = bao-admin
[Install]
WantedBy = multi-user.target
To enable the services on boot, invoke the following commands.
sudo systemctl enable bao-api-gateway
sudo systemctl enable bao-identity
sudo systemctl enable bao-data-pipeline
To start the services using systemd, after the JAR files and configuration files are installed, invoke the following commands. The daemon-reload command makes systemd pick up new or changed service files.
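sudo systemctl daemon-reload
sudo systemctl start bao-api-gateway
sudo systemctl start bao-identity
sudo systemctl start bao-data-pipeline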
To stop a service using systemd, invoke the following command, using the identity service as an example.
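sudo systemctl stop bao-identity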
PostgreSQL¶
The AP identity and data pipeline services use PostgreSQL for persistence. Note that the PostgreSQL database contains metadata for data pipelines, views and more, while the analytical data itself is stored in a data warehouse such as ClickHouse. The names given to the databases and users can be adjusted as preferred; the following names are suggestions.
Database name | Database user | Encoding |
---|---|---|
baoidentity | baoidentity | UTF-8 |
baodatapipeline | baodatapipeline | UTF-8 |
Users¶
Create the required users. Switch to the postgres operating system user and connect to PostgreSQL with the psql CLI.
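A sketch, assuming a standard PostgreSQL installation where the postgres operating system user can connect via peer authentication:
sudo su - postgres
psql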
Create users for the identity and data pipeline services. Replace mypassword1 and mypassword2 with strong passwords, and take note of them securely.
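A sketch using the suggested usernames; mypassword1 and mypassword2 are placeholders:
CREATE USER baoidentity WITH PASSWORD 'mypassword1';
CREATE USER baodatapipeline WITH PASSWORD 'mypassword2';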
Databases¶
Create databases for the identity and data pipeline services. Set encoding to UTF-8.
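A sketch using the suggested names, making each service user the owner of its database; the explicit UTF8 encoding assumes a compatible template database:
CREATE DATABASE baoidentity OWNER baoidentity ENCODING 'UTF8';
CREATE DATABASE baodatapipeline OWNER baodatapipeline ENCODING 'UTF8';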
Exit the CLI with Ctrl+D and then return to the bao-admin user with exit.
Configuration¶
Each service has a corresponding configuration file.
Config file | File location |
---|---|
bao-api-gateway.conf | /opt/bao-api-gateway/bao-api-gateway.conf |
bao-identity.conf | /opt/bao-identity/bao-identity.conf |
bao-data-pipeline.conf | /opt/bao-data-pipeline/bao-data-pipeline.conf |
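If the configuration directories under /opt do not exist yet, a sketch for creating them, mirroring the /var/lib setup above:
sudo mkdir /opt/bao-api-gateway
sudo mkdir /opt/bao-identity
sudo mkdir /opt/bao-data-pipeline
sudo chown bao-admin:bao-admin /opt/bao-api-gateway
sudo chown bao-admin:bao-admin /opt/bao-identity
sudo chown bao-admin:bao-admin /opt/bao-data-pipeline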
API gateway¶
Create the bao-api-gateway.conf configuration file for the API gateway service and restrict its permissions with chmod 600.
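A sketch for creating the empty file with restrictive permissions before adding the content below:
sudo -u bao-admin touch /opt/bao-api-gateway/bao-api-gateway.conf
sudo -u bao-admin chmod 600 /opt/bao-api-gateway/bao-api-gateway.conf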
# ----------------------------------------------------------
# Service to URI mapping
# ----------------------------------------------------------
# Identity service URI
service.identity = http://localhost:8086/
# Data pipeline service URI
service.datapipeline = http://localhost:8084/
# ----------------------------------------------------------
# CORS
# ----------------------------------------------------------
# Allowed origins for CORS
cors.allowed_origins = https://localhost:3000, \
https://localhost:9000
Identity¶
Create the bao-identity.conf configuration file for the identity service. Adjust usernames and passwords to your environment.
# ----------------------------------------------------------
# Database connection
# ----------------------------------------------------------
# JDBC connection URL
connection.url = jdbc:postgresql://127.0.0.1/baoidentity
# JDBC connection username
connection.username = baoidentity
# JDBC connection password (confidential)
connection.password = xxxx
# ----------------------------------------------------------
# Redis
# ----------------------------------------------------------
# Redis hostname / IP address
redis.hostname = 127.0.0.1
# Redis port, optional, default: 6379
redis.port = 6379
# Redis password, optional
redis.password =
# ----------------------------------------------------------
# Apache Pulsar
# ----------------------------------------------------------
# Pulsar hostname / IP address
pulsar.service_url = pulsar://127.0.0.1:6650
# Pulsar TLS authentication plugin, optional, TLS only
# pulsar.tls.auth.plugin =
# Pulsar TLS certificate path, optional, TLS only
# pulsar.tls.trusts.certs.file.path =
# Pulsar TLS certificate file, optional, TLS only
# pulsar.tls.cert.file =
# Pulsar TLS key file, optional, TLS only
# pulsar.tls.key.file =
# ----------------------------------------------------------
# System
# ----------------------------------------------------------
# System hostname / base URL
system.base_url = https://analytics.mydomain.org
# System application title
system.application_title = Analytics Platform
# Log email invitation URLs, disable in prod, debugging only
system.user_invite.logging = off
# Name of issuer for MFA entries
system.mfa_issuer = Analytics Platform
# ----------------------------------------------------------
# Email
# ----------------------------------------------------------
# From address for outgoing emails
email.from.address = noreply@mydomain.org
# ----------------------------------------------------------
# SMTP
# ----------------------------------------------------------
# SMTP hostname or IP address
smtp.host = 127.0.0.1
# SMTP port, default: 587
smtp.port = 587
# SMTP TLS
smtp.tls = true
# SMTP username
smtp.user = myuser
# SMTP password
smtp.password = xxxx
Data pipeline¶
Create the bao-data-pipeline.conf configuration file for the data pipeline service. Adjust usernames and passwords to your environment.
# ----------------------------------------------------------
# Database connection
# ----------------------------------------------------------
# JDBC connection URL
connection.url = jdbc:postgresql://127.0.0.1/baodatapipeline
# JDBC connection username
connection.username = baodatapipeline
# JDBC connection password (confidential)
connection.password = xxxx
# ----------------------------------------------------------
# Redis
# ----------------------------------------------------------
# Redis hostname / IP address
redis.hostname = 127.0.0.1
# Redis port, optional, default: 6379
redis.port = 6379
# Redis password, optional
redis.password =
# ----------------------------------------------------------
# Apache Pulsar
# ----------------------------------------------------------
# Pulsar hostname / IP address
pulsar.service_url = pulsar://127.0.0.1:6650
# Pulsar TLS authentication plugin, optional, TLS only
# pulsar.tls.auth.plugin =
# Pulsar TLS certificate path, optional, TLS only
# pulsar.tls.trusts.certs.file.path =
# Pulsar TLS certificate file, optional, TLS only
# pulsar.tls.cert.file =
# Pulsar TLS key file, optional, TLS only
# pulsar.tls.key.file =
# ----------------------------------------------------------
# System
# ----------------------------------------------------------
# System hostname / base URL
system.base_url = https://analytics.mydomain.org
# Retain temporary data files (debugging only)
system.retain_temp_files = off
# Sample size for dataset column type detection, default: 5k
system.max_sample_size = 500000
# Email address to send alert messages on error
system.error.alert_email = alerts@mydomain.org
# ----------------------------------------------------------
# Blobstore (local filesystem only)
# ----------------------------------------------------------
# Root directory for local file system blob storage
blobstore.root_dir = /var/lib/bao-data-pipeline/data
# ----------------------------------------------------------------------
# OpenAI [Optional]
# ----------------------------------------------------------------------
# OpenAI API key
openai.api_key =
# OpenAI model, can be 'default', 'gpt-4o-mini', 'gpt-4o', 'o3-mini'
openai.model = default
# ----------------------------------------------------------------------
# Google [Optional]
# ----------------------------------------------------------------------
# API key
google.gemini.api_key =
Encryption¶
The data pipeline service encrypts all secrets at the database level and requires an encryption key to be provided.
Note
Store the encryption key in a secure manner!
The encryption key should be stored in a secure and confidential way. If the key is lost, the encrypted database content cannot be recovered. If the key is exposed, an attacker could use the key to decrypt the database secrets.
The Tink Java library is used for encryption. An encryption key can be generated using Tinkey, the Tink CLI.
The encryption key file name is bao-data-pipeline-key.json and the content is in JSON format.
Download Tinkey from the following URL.
Uncompress the tarball in a suitable location. Generate the key with the following command.
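A sketch, assuming the AES256_GCM key template to match the AesGcmKey type shown in the example file below:
./tinkey create-keyset --key-template AES256_GCM --out-format json --out bao-data-pipeline-key.json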
Create and store the encryption key file in the data pipeline configuration directory.
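A sketch, placing the key file in the configuration directory with restrictive permissions:
sudo cp bao-data-pipeline-key.json /opt/bao-data-pipeline/bao-data-pipeline-key.json
sudo chown bao-admin:bao-admin /opt/bao-data-pipeline/bao-data-pipeline-key.json
sudo chmod 600 /opt/bao-data-pipeline/bao-data-pipeline-key.json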
Example encryption key file.
{
"primaryKeyId": 0000000000,
"key": [
{
"keyData": {
"typeUrl": "type.googleapis.com/google.crypto.tink.AesGcmKey",
"value": "{secret}",
"keyMaterialType": "SYMMETRIC"
},
"status": "ENABLED",
"keyId": 0000000000,
"outputPrefixType": "TINK"
}
]
}
Data cache¶
When AP ingests data from various data sources, it caches data in the form of data files, which are temporarily stored on the filesystem of the server where AP is deployed. Depending on the data sources, significant storage capacity is required. However, data is deleted when a data load process completes, meaning the data volume will not grow over time.
The data cache directory name is data-pipeline, and it is located below the configuration directory.
Create the directory manually.
CACHE_DIR="/opt/bao-data-pipeline/data-pipeline"
sudo mkdir $CACHE_DIR
sudo chown bao-admin:bao-admin $CACHE_DIR
sudo chmod 755 $CACHE_DIR
Data storage¶
Note
This section applies only to on-premise server data storage environments.
When deploying AP in on-premise server environments, take care to provision a storage device (disk or SSD) with appropriate capacity. 500 GB is a reasonable starting point. Separate storage devices may be provisioned for the AP software and for the data storage.
The configuration property blobstore.root_dir in bao-data-pipeline.conf defines the root directory for data storage on the local filesystem. It allows for storing data on a dedicated storage device (disk or SSD). The default location is /var/lib/bao-data-pipeline/data. Create the data directory manually.
DATA_DIR="/var/lib/bao-data-pipeline/data"
sudo mkdir $DATA_DIR
sudo chown bao-admin:bao-admin $DATA_DIR
sudo chmod 755 $DATA_DIR
In the following configuration section, the blob store container name is specified per client (tenant). In an on-premise environment, create a directory below the root data directory to represent the container, using the specified container name. This guide uses bao-ap-client-main as the container name for the default client, though any container name can be used. Create the directory manually. The data and client directories should be located on a storage medium with appropriate capacity.
CLIENT_DIR="/var/lib/bao-data-pipeline/data/bao-ap-client-main"
sudo mkdir $CLIENT_DIR
sudo chown bao-admin:bao-admin $CLIENT_DIR
sudo chmod 755 $CLIENT_DIR
The data storage location can be defined with the blobstore.root_dir property in the bao-data-pipeline.conf configuration file.
Read me¶
The following content is convenient to maintain in a readme.md file.
# Analytics Platform
## Redis
redis-cli -h 127.0.0.1
## Apache Pulsar
sudo systemctl status apache-pulsar
sudo systemctl restart apache-pulsar
sudo journalctl -n 500 -f -u apache-pulsar
## Nginx
sudo systemctl status nginx
sudo systemctl restart nginx
sudo tail -f /var/log/nginx/access.log
## Apache Superset
sudo systemctl status apache-superset
sudo systemctl restart apache-superset
sudo journalctl -n 500 -f -u apache-superset
## AP service status
sudo systemctl status bao-api-gateway
sudo systemctl status bao-identity
sudo systemctl status bao-data-pipeline
## AP service restart
sudo systemctl restart bao-api-gateway
sudo systemctl restart bao-identity
sudo systemctl restart bao-data-pipeline
## AP service logging
sudo journalctl -n 500 -f -u bao-api-gateway -u bao-identity -u bao-data-pipeline -o cat
sudo journalctl -n 500 -f -u bao-data-pipeline
Debug¶
To adjust the log level for the Java services, append the following parameter to the ExecStart property in the appropriate systemd service file. The com.bao part of the parameter value refers to the package of the classes to which the logging level applies.
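A sketch, assuming the services accept Spring Boot style logging arguments; the DEBUG level and the identity service are example choices:
ExecStart = /var/lib/bao-identity/bao-identity.jar --logging.level.com.bao=DEBUG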