Sysadmin¶
Welcome to the installation guide for Analytics Platform (AP from now on). This document is intended for system administrators who will be setting up and maintaining the environment required to run AP.
AP requires a Linux operating system, and an Ubuntu LTS release is the recommended distribution. This guide assumes Ubuntu Linux as the operating system and the availability of the systemd system and service manager.
AP supports a variety of public cloud providers, data storage providers and data warehouses. It can be deployed in public cloud environments and in Linux-based, on-premise server environments.
Data platform¶
AP uses a data storage provider to ingest and store raw data files from multiple sources in their native format. The following combinations of infrastructure, data storage and data warehouse are supported.
Infrastructure | Data storage | Data warehouse |
---|---|---|
AWS | Amazon S3 | ClickHouse |
AWS | Amazon S3 | Amazon Redshift |
Azure | Azure Blob Storage | SQL Database |
Azure | Azure Blob Storage | Synapse |
On-prem | Local filesystem | ClickHouse |
On-prem | Local filesystem | PostgreSQL |
On-prem | Local filesystem | SQL Server |
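When planning a deployment, it can help to check a proposed combination against the table above. The following is a minimal sketch that transcribes the table into a lookup structure. The names `SUPPORTED`, `is_supported` and `warehouses_for` are illustrative only and are not part of AP.

```python
# Supported combinations, transcribed from the table above.
# Structure: infrastructure -> data storage -> set of data warehouses.
SUPPORTED = {
    "AWS": {"Amazon S3": {"ClickHouse", "Amazon Redshift"}},
    "Azure": {"Azure Blob Storage": {"SQL Database", "Synapse"}},
    "On-prem": {"Local filesystem": {"ClickHouse", "PostgreSQL", "SQL Server"}},
}


def is_supported(infrastructure: str, storage: str, warehouse: str) -> bool:
    """Return True if the combination appears in the support table."""
    return warehouse in SUPPORTED.get(infrastructure, {}).get(storage, set())


def warehouses_for(infrastructure: str) -> list:
    """List the data warehouses available for an infrastructure, sorted by name."""
    found = set()
    for warehouses in SUPPORTED.get(infrastructure, {}).values():
        found |= warehouses
    return sorted(found)
```

For example, `is_supported("On-prem", "Local filesystem", "ClickHouse")` is true, while `is_supported("AWS", "Amazon S3", "PostgreSQL")` is false because that pairing is not listed.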
Data storage¶
AP supports three providers for data storage:
- Amazon S3
- Azure Blob Storage
- Local filesystem
Amazon S3 and Azure Blob Storage are scalable, highly durable and cost-effective public cloud storage services that allow users to store and retrieve any amount of data from anywhere on the web. These services integrate well with the vast ecosystem of data services in the AWS and Azure public clouds respectively.
Local filesystem refers to using a regular server with attached disk storage. This approach leverages the file system of the server, and data files are stored in regular directories. As high-speed reading is not a priority, HDDs (Hard Disk Drives) are a cost-effective and feasible option, as opposed to faster but more expensive SSDs (Solid State Drives).
Data warehouses¶
AP supports the following data warehouses:
- ClickHouse
- Amazon Redshift
- Azure SQL Database
- Azure Synapse
- PostgreSQL
- Microsoft SQL Server
In on-premise environments, ClickHouse is the preferred data warehouse due to its open source license, well-documented server installation, and high-performance data ingestion and querying.
Middleware¶
Below is a summary of the necessary middleware components that your system needs to ensure optimal performance and compatibility.
- OpenJDK 17: A robust and widely-used open-source implementation of the Java Platform which provides the runtime environment necessary for running Java applications. The AP backend services are written in Java 17.
- PostgreSQL: Version 14 or later. A powerful, open-source relational database management system that offers advanced features such as complex queries, foreign keys, triggers, and up-to-date compliance with SQL standards. The AP backend services use PostgreSQL databases for persistence of data.
- nginx: A high-performance, open-source HTTP server and reverse proxy that is essential for handling web traffic, load balancing, and serving static content efficiently.
- Redis: An in-memory, open-source key-value store that provides lightning-fast data retrieval, making it ideal for caching and supporting real-time analytics, session management, and message brokering.
- Apache Pulsar: An open-source distributed messaging and streaming platform that enables reliable, scalable, and low-latency data streaming and message queueing, suitable for event-driven applications.
- ClickHouse: A high-performance, open-source columnar database management system designed for online analytical processing (OLAP) and real-time data analytics at scale. AP uses ClickHouse as a data warehouse for analytical data processing and querying.
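Before installing AP, an administrator may want to confirm that the middleware listed above is present and recent enough. The sketch below shows one possible preflight check. It only covers the two minimum versions stated in the list (Java 17 and PostgreSQL 14). The executable names are assumptions based on common Ubuntu packages and may differ on your system, and Apache Pulsar is left out because it is usually started from its own installation directory.

```python
import re
import shutil

# Minimum major versions stated in the middleware list above.
MINIMUM_MAJOR = {"java": 17, "postgresql": 14}

# Assumed executable names (typical for Ubuntu packages, not taken from AP docs).
ASSUMED_EXECUTABLES = ["java", "psql", "nginx", "redis-server", "clickhouse-server"]


def major_version(version_output):
    """Return the first integer found in a version banner, or None."""
    match = re.search(r"\d+", version_output)
    return int(match.group(0)) if match else None


def meets_minimum(component, version_output):
    """Check a version banner against the minimum major version for a component."""
    major = major_version(version_output)
    return major is not None and major >= MINIMUM_MAJOR[component]


def missing_executables(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]
```

The banner strings would typically come from running `java -version` (which prints to standard error) and `psql --version`. Calling `missing_executables(ASSUMED_EXECUTABLES)` then reports which of the assumed executables are not on the `PATH`.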
AP is based on several independent services.
- API Gateway: The API gateway is responsible for routing API requests to the appropriate backend service. It manages authentication and user sessions.
- Identity: The identity service is responsible for security, authentication, authorization, and for user and client management.
- Data pipeline: The data pipeline service is the main component of AP and is responsible for the data catalog, data pipelines, views, destinations, workflows and data quality checks.
- Web UI: The UI is composed of two web apps written in React and JavaScript: the analytics platform web app and the user management web app.
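To illustrate the routing role of the API gateway described above, here is a minimal sketch of prefix-based routing. It is not the AP gateway implementation, which is a Java service and also handles authentication and sessions. The path prefixes and backend addresses are hypothetical and chosen only for the example.

```python
from typing import Optional

# Hypothetical path prefixes and backend addresses (not from the AP documentation).
ROUTES = {
    "/api/identity": "http://localhost:8081",
    "/api/pipeline": "http://localhost:8082",
}


def resolve_backend(path: str, routes: dict = ROUTES) -> Optional[str]:
    """Return the backend for the longest matching path prefix, or None."""
    matches = [
        prefix for prefix in routes
        if path == prefix or path.startswith(prefix + "/")
    ]
    if not matches:
        return None
    # Prefer the most specific (longest) prefix when several match.
    return routes[max(matches, key=len)]
```

With these assumed routes, a request for `/api/identity/users` resolves to the identity backend, and a path that matches no prefix returns `None`, which a gateway would typically answer with a 404 response.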
AP is deployed as executable JAR files, managed by the systemd system and service manager. A Docker image is planned but not currently available.
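As an illustration of running a JAR file under systemd, the sketch below renders a basic unit file. The service description, user name and JAR path are placeholders rather than actual AP values, and `/usr/bin/java` is assumed to be the Java executable location. `SuccessExitStatus=143` is a common setting for Java services, since the JVM exits with status 143 when stopped by SIGTERM.

```python
# Template for a simple systemd service unit that runs an executable JAR.
UNIT_TEMPLATE = """\
[Unit]
Description={description}
After=network.target

[Service]
User={user}
ExecStart=/usr/bin/java -jar {jar_path}
SuccessExitStatus=143
Restart=on-failure

[Install]
WantedBy=multi-user.target
"""


def render_unit(description: str, user: str, jar_path: str) -> str:
    """Fill in the unit template with placeholder values for one service."""
    return UNIT_TEMPLATE.format(
        description=description, user=user, jar_path=jar_path
    )
```

The rendered text would be saved under `/etc/systemd/system/` with a `.service` suffix, followed by `systemctl daemon-reload` and `systemctl enable --now` with the chosen unit name.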
Software architecture¶
AP is multi-tenant, web-based software. Multi-tenancy is an application architecture where a single instance of the software serves multiple "tenants", also known as clients or organizations. Each tenant's data and configuration are isolated, ensuring security and privacy, but all tenants share the same underlying infrastructure and codebase. This approach allows for efficient use of resources, as the software instance can be maintained and updated centrally while still catering to the unique needs of different tenants. For an on-premise installation used by a single organization, a single tenant can be configured. Alternatively, individual tenants can be configured for development, testing and production. The high-level architecture of AP is shown in the diagram below.
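The isolation idea behind multi-tenancy can be sketched in a few lines. This is a conceptual illustration only and does not reflect how AP stores tenant data. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Tenant:
    """Per-tenant configuration and data, kept separate from other tenants."""
    name: str
    settings: dict = field(default_factory=dict)
    records: list = field(default_factory=list)


class TenantRegistry:
    """A single shared instance that serves many isolated tenants."""

    def __init__(self):
        self._tenants = {}

    def create(self, tenant_id, name, **settings):
        """Register a new tenant; tenant identifiers must be unique."""
        if tenant_id in self._tenants:
            raise ValueError(f"tenant {tenant_id!r} already exists")
        self._tenants[tenant_id] = Tenant(name=name, settings=dict(settings))
        return self._tenants[tenant_id]

    def store(self, tenant_id, record):
        """Add a record to one tenant's data only."""
        self._tenants[tenant_id].records.append(record)

    def records_for(self, tenant_id):
        """Return a copy of the records belonging to one tenant."""
        return list(self._tenants[tenant_id].records)
```

A single-organization installation could, for instance, create `dev`, `test` and `prod` tenants in one registry. Records stored for `dev` are never returned by `records_for("prod")`, even though both tenants share the same running instance.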
Network architecture¶
The AP network architecture for on-prem hosting environments is shown in the diagram below. It depicts a typical example with a DHIS2 instance as the data source, the AP multi-tenant service, and tenant-specific data storage and data warehouse.
Tech stack¶
The AP software is built using a client-server architecture, where the client (front-end) communicates with the server (backend) over a REST HTTP API.
- Database: The transactional database for metadata storage is PostgreSQL.
- Backend: Backend services are written in Java using OpenJDK 17. Major frameworks are Spring Boot, Hibernate and Apache Commons. Testcontainers and JUnit are used for unit and integration testing.
- Front-end: The front-end web apps are written in JavaScript with the React framework and the Ant Design UI library.