Use cases¶

Overview¶

This page describes common use-cases for data analytics within AP.

Natural language queries and analytics¶

AP provides the following main capabilities for natural language analytics, also referred to as conversational analytics.

Convo web app: Convo is a web app for DHIS2 for conversational analytics which allows users to ask questions about their data using natural text. Convo understands natural text questions and uses DHIS2 API calls to retrieve the relevant information. See the "Convo" documenation page.
Data browser: The data browser in AP allows users to ask questions about their data using natural text. The text-to-SQL engine converts questions in SQL queries which retrieve the relevant data.
Script editor: The script editor supports Python and R scripting and allows users to declare what type of data analytics and data science they want to achieve in natural text. The text-to-script engine converts the natural text input into Python and R scripts.

Geospatial visualization and maps¶

AP does not provide a native geospatial visualization and maps (GIS) component. The following systems are recommended to be used with AP for geospatial analysis:

DHIS2 Maps: The DHIS2 platform featues a web-based and powerful thematic mapping and geospatial visualization tool. DHIS2 is integrated with AP through the DHIS2 data pipeline.
ArcGIS: ArcGIS from Esri is a comprehensive geographic information system (GIS) platform that enables users to create, manage, analyze, and map spatial data to gain actionable insights. ArcGIS is integrated through the DHIS2 to ArcGIS Connector app.

Audit trail¶

AP includes audit records when loading data from source systems with data pipelines, assuming that the source system stores audit information. The main entities in AP, including workflows, data pipelines, data quality check groups, materialized views and destinations, create and store a detailed change log for every task. For example, every time a workflow runs, either scheduled or ad-hoc, a change log with detailed logs are stored. The changelog includes the time the task started and completed. For ad-hoc runs, the changelog includes information about which user started the task. This ensures auditability of the system, both for source system data changes and for operations within AP including data loading and transformation.

Project-based configuration model¶

AP supports stand-alone deployment in on-premise server environments. AP is designed with a multi-tenant architecture. These features allows for setting AP in way where multiple projects or teams within the same organization can work independently and isolated from each other. To achieve this, separate clients for each team are created. This allows each team to create data pipelines, views, scripts and workflows, as well as a dedicated data warehouse, independently of each other.

Unified and integrated data layer¶

AP allows for importing data from data sources into a central data warehouse through the data pipeline system. Each data pipeline creates and loads data into several data tables in a dedicated data warehouse schema. SQL views and Python and R scripts are used to join the data tables, either by speciying SQL and script code directly, or from natural text, where AP converts the natural text input to SQL and script code with the use LLM/AI technology.

High-performance query execution on large datasets¶

AP utilizes ClickHouse as the recommended data warehouse. ClickHouse is an open-source, column-oriented database management system specifically designed for high-performance Online Analytical Processing (OLAP). It allows users to generate analytical reports in real-time using SQL queries on trillions of rows and petabytes of data with sub-second latency. ClickHouse is designed for lightning fast queries through a range of features.

Column-oriented data storage: Reduces disk reads by only reading columns specified in queries.
Vectorized query execution: Processes data in large blocks.
Massive parallelism: Utilizes many CPUs and compute nodes for a single query.
High data compression: Reduces disk usage and speeds up data reading.
Specialized storage engines: Allows for high ingestion rates and fast data range scans.
Data skipping indexes: Allows for skipping over ranges of blocks while reading data.

Data capture¶

AP does not provide a built-in data capture app or module. Instead, AP provides no-code, easily configured connectors (data pipelines) for a variety for data capture solutions. The following software systems are integrated and recommended for data capture.

DHIS2: A highly flexible configurable information system with support for data capture of aggregate data, survey and event data, and individual data. Supports desktop web clients and Android mobile clients. Data is ingested into AP using the DHIS2 data pipeline.
ODK: Flexible and easy-to-configure software for survey data collection through an Android mobile app. Data is ingested into AP using the ODK data pipeline.
CSV: Data can be collected in CSV and Excel files. Data is ingested into AP using the CSV file upload data pipeline.

Data governance¶

AP supports the key aspects of data governance through its extensive feature set.

Data quality: The data quality checks, SQL views, scripting environment, workflows and notification system in AP are useful for ensuring that data is accurate, complete, and consistent.
Data security and privacy: The multi-dimensional sharing and access control system, secure-by-default approach and compliance are useful for managing the data security aspect.
Role based access control: User groups in AP allows for grouping users by their role within the organization, allowing access and permission to be granted at the role level.
Metadata management: The data catalog in AP allows for creating a detailed inventory of data sources and datasets through its rich metadata model.
Auditing: All main entities in AP including data pipelines, data quality checks, views and workflows provide changelogs with detailed logs for every task, ensuring all activity is audited.
Data lifecycle management: In addition to data loading, AP supports data archiving and data deletion, allowing for managing the lifecycle of datasets and data records.
Data retention and versioning: AP retains and stores the raw data files for every data load operation, allowing for comparing the current data with historical versions.

Role based access control (RBAC)¶

Role-Based Access Control (RBAC) is a security model that restricts system access and permissions based on roles within an organization, rather than individual user identities. It simplifies administration by granting permissions to roles, such as data owner, data steward, data analyst and data viewer, and not directly to individual users. AP supports user groups, users can be assigned to groups depending on their role in the organizations, after which user groups can be granted access to datasets, workflows, views and other entites and objects.

Data archiving and removal¶

AP supports archiving of data. A "frozen" data pipeline can be used to ingest data from a source system and then prohobit the data pipeline to subsequently be refreshed or deleted. This allows for archival of historical datasets, and means the historical data can be deleted from the source system, and instead be accessed with a data exploration or BI tool in AP. Data can be permanently removed based on policies and time thresholds. This is achieved through SQL statements with filters and workflows. Automated data lifecycle management can be enforced through scheduled workflows running on a fixed interval. The data archiving and removal operations are audited through the workflow changelog system.

Permanent removal of data¶

AP allows permanent and complete removal of the data associated with a data source (data pipeline). When a data pipeline is removed, associated data is by default deleted in its entirety, including data files stored in the data lake, data tables in the warehouse and change log records. Note that AP stores data files for every run of data pipelines, and that data files for all runs are deleted as part of deleting a data pipeline.

ETL and ELT¶

AP follows the ELT approach for data integration. The ELT (Extract, Load, Transform) approach is a modern data integration methodology designed to leverage the high-performance processing capabilities of modern cloud data warehouses. Unlike traditional ETL frameworks, ELT shifts the transformation layer to the end of the pipeline, allowing for more agile and scalable data management.

Extract: Raw data is ingested from disparate source systems, such as databases and APIs. AP features a range of no-code data connectors for this purpose.
Load: Data is moved directly into the destination storage in its raw format. In AP this is a cloud-native or on-premise data lake and data warehouse like ClickHouse.
Transform: Once the data is stored in the data warehouse, it is processed, cleaned, modeled and transformed using the AP and data warehouse native compute power. This includes SQL for scalable data processing, and integrated Python and R scripts for complex algorithmic logic and machine learning.

The ELT process has several advantages version the traditional ETL approach.

Scalability: Utilizes the elastic, parallel-processing power of cloud environments to handle massive datasets without dedicated staging hardware.
Scalability: By loading raw data before transformation, organizations maintain a single source of truth that can be re-processed as business requirements evolve.
Low latency: Minimizes the time between extraction and availability, as data does not need to wait for complex transformation cycles before being stored.

New connector development¶

AP features two main approaches for development on new connectors (data pipelines).

Framework: AP has a connector development framework which avoids boiler-plate code and makes connector development more efficient. The AP team leverages agent skills and agentic coding for rapid development of new connectors.
Python scripting: The integrated Python scripting environment is suitable for rapid development of light-weight and customized API connectors for local and bespoke software applications. Python scripts can be used for data loading and be integrated in AP workflows together with built-in connectors.

API access¶

AP is designed with the API first principle in mind. The data model and all platform operations are exposed in the API. This means that clients and third-party tools, such as CLIs, HTTP clients, web apps and BI tools, an utilize the comprehensive API to communicate, integrate and extend the platform. AP provides complete API documentation based on the OpenAPI (Swagger) standard. The API first design allows for extending the platform to meet specific needs and requiremenents.

OWASP Top 10 standard¶

AP maintains a rigorous security posture which embeds Static Application Security Testing (SAST) directly into the CI/CD pipeline, ensuring every build is automatically vetted against the OWASP Top 10 security risks. The managed hosting environment reinforces this by deploying a Web Application Firewall (WAF) that leverages the latest OWASP Core Rule Set to stop malicious traffic and abnormal patterns at the doorstep. Finally, the platform’s architecture provides native protection against high-risk vulnerabilities like cross-site request forgery (CSRF), cross-site scripting (XSS), and session hijacking, strictly adhering to industry-standard web security guidelines for comprehensive application-level defense.

Outlier detection¶

AP supports data outlier detection using advanced statistical methods including the Modified Z-score and Inter-quartile range (IQR) algorithms. These checks are implemented with SQL queries to achieve high performance. For more complex data scenarios such as seasonal diseases, the integrated Python scripting environment is used to identity outliers using advanced time series decomposition, unsupervised machine learning and statistical profiling.

Azure deployment¶

The Microsoft Azure cloud is a popular environment for deploying AP. AP integrates well with the Azure cloud stack and services.

Deployment¶

Self-hosted model: AP can be provisioned within a user managed Azure Account leveraging Azure Virtual Machines for compute and Azure Database for PostgreSQL for metadata storage.
Hybrid model: The multi-tenant managed service establishes secure connections to Azure SQL Database and other cloud services residing within the user managed Azure environment.

Connectivity¶

Database: AP connects to and can load data from Azure SQL Database (SQL Server).
Object storage: AP connects to and can load data files from Azure Blob Storage.
Data warehousing: AP utilizes Azure SQL Database and Azure Synapse as primary data warehouses and destinations for data loading.
LLM: AP utilizes Azure OpenAI for LLM content generation, such as text-to-SQL conversion.
Business intelligence: AP is purpose-built for MS Power BI integration, allowing for streamlined data modelling and visualization.