---
date: Generated on 2026-04-16
title: User Guide
---

# Overview

Welcome to the user guide for Analytics Platform (AP from now on). AP
lets you ingest and merge data from multiple, varied data sources in
real time into a scalable data warehouse. AP is designed with
easy-to-use connectors (data pipelines) for platforms, systems and
tools commonly used by governments and the development organizations
supporting them.

Publicly available global datasets, covering population and
demographics, health, nutrition, agriculture and food security,
geology, and economics, are made available through AP to enhance
your programmatic data, enabling data triangulation within and across
sectors to generate better insights.

Data in the warehouse are available for advanced analytics, machine
learning and predictive analytics, and widespread sharing using popular
third-party business intelligence (BI) tools.

AP offers a user-friendly interface and seamless flow, from data
ingestion to visualization, so that organizations can reduce staff time
spent on curating, managing, and manipulating data, and instead focus on
generating actionable insights from their data to inform programmatic
decision making.

![Superset dashboard](../assets/images/user/superbi_dashboard.png)

## Key platform features

- **Data ingestion:** The platform provides connectors (data pipelines)
  to systems, databases and tools commonly used in the international
  development sector.
- **Data transformation:** Data can be transformed and enriched using
  SQL views, Python scripts and R scripts upon ingestion. Furthermore,
  data sets can be parsed and joined to create unique data views for
  enhanced analysis.
- **Natural text queries:** Users can ask data questions in natural text
  and have AP convert them to SQL queries for information retrieval,
  allowing non-technical users to analyze and retrieve data.
- **Data warehousing:** Data is organized and stored in a scalable
  cloud-based warehouse. AP integrates with ClickHouse, PostgreSQL, SQL
  Server, Amazon Redshift, Azure SQL Database and Azure Synapse.
- **Import of public data sets:** The platform offers easy import of
  publicly accessible data sets. A range of datasets exist within the
  library, including from the UN, WHO Global Health Observatory and
  World Bank.
- **User management:** Users and user groups can be managed, and
  fine-grained access control provided through a multi-dimensional
  security model.
- **Monitoring, logging and alerts:** The platform provides monitoring,
  logging and alerts on failures so that issues can be immediately
  detected and corrected.
- **Analytics and BI tool integration:** The platform supports most
  leading analytics and business intelligence tools, including Power BI,
  Tableau, and Superset, to create customized visualizations and
  dashboards.
- **Embedded visualization:** Data visualization and dashboards can be
  embedded in popular systems including DHIS2 using the Super BI web
  app, allowing visualizations to be consumed with existing user
  accounts.
- **Security:** Data is encrypted during transit and at rest in the data
  warehouse. AP offers firewall management for BI tool connections.

## ETL vs ELT

Until recently, expensive data storage and underpowered data warehouses
meant that accessing data involved building and maintaining fragile ETL
(Extract, Transform, Load) pipelines that pre-aggregated and filtered
data down to a consumable size. This meant you had to decide up front
which data elements and fields were to be ingested. Technological
advances now make the life of data analysts easier: practically free
cloud data storage and far more powerful, modern, columnar cloud data
warehouses make fragile ETL pipelines a relic of the past. Modern data
architecture is ELT (Extract, Load, Transform): extract and load the raw
data into the destination, then transform and model it after load. ELT
has many benefits, including increased versatility and usability.

## Data pipelines

AP offers turn-key data pipelines to popular information systems,
databases and public cloud blob stores. The data pipelines are designed
to just work, automatically adapting to changes in the source system,
such as new data fields becoming available and changes to existing
fields. The primary value is that you can define a data pipeline and
forget it, allowing the platform to keep it up to date. Data will be
loaded through full refreshes or incremental updates. If a data pipeline
fails, for example because its authentication is no longer valid, AP
provides alerts so that you can take timely action.

Data pipelines use a combination of API calls and database connections,
depending on the nature and capability of the data source. AP offers
fast data synchronization, ensuring you have data that are correct and
up-to-date.

AP offers strong security. The platform encrypts secrets such as
passwords and API keys before they are stored, using strong algorithms
and encryption keys. Communication with data sources and data
warehouses is encrypted using TLS/SSL.

## Data flow

AP allows for ingesting data from a variety of data sources, including
systems, databases and files, using data pipelines. The data is stored
in the data storage area and loaded into the data warehouse of the
platform. This makes the data available for analytics using a variety
of tools, including BI tools such as Power BI and Tableau, data
exploration tools like Apache Superset, and the Super BI web app for
DHIS2. Data can be aggregated and loaded back to DHIS2 using
destinations. A high-level diagram of a typical data flow is shown
below.

![AP high-level data
flow](../assets/images/user/ap_high_level_data_flow.png)

## Integrated data repository

AP follows the *ELT* (extract, load and transform) approach for data
loading and integration. Data pipelines are responsible for retrieving
data of interest from source systems and loading it into the platform.
From there, data can be mapped, transformed and aggregated using views
in order to produce data analytics and insights. This approach reduces
the challenges of complex, fragile and slow *ETL* (extract, transform
and load) jobs, where you have to decide up front which datasets and
fields to ingest. The diagram below illustrates a typical integration
scenario.

![AP integrated data
repository](../assets/images/user/ap_integrated_data_repository.png)

## Bring your own analytics tools

AP integrates and streamlines your data and makes it easy to consume
from a variety of BI and data visualization tools.

- **Apache Superset:** Superset is integrated as the default data
  exploration tool, providing comprehensive and flexible data
  visualizations.
- **BI tools:** Users can easily connect both the cloud and desktop
  versions of popular BI tools such as Power BI and Tableau.
- **Super BI for DHIS2:** Dashboards can be embedded within DHIS2 with
  the Super BI web app, even without loading the data into DHIS2.

![Superset ANC
dashboard](../assets/images/user/superset_dashboard_anc.png)

## Spaces

The AP front-end is composed of two main spaces.

- Analytics Platform
- Users

You can navigate to each space by clicking on the app menu in the header
bar, followed by **Analytics Platform** or **Users**. Which spaces are
visible depends on the permissions of the user.

![App menu](../assets/images/user/app_menu.png)

# Data integration

## Overview

The Analytics Platform (AP) provides a comprehensive, modern framework
for data integration, designed to simplify the journey from fragmented
data sources to actionable insights. By moving away from traditional,
rigid data architectures, AP enables organizations to ingest, transform,
and analyze data at scale with minimal technical overhead.

## The modern ELT philosophy

Traditional data management relied heavily on *ETL* (Extract, Transform,
Load). This legacy approach required data to be pre-aggregated and
filtered before reaching the data warehouse, often resulting in fragile,
complex pipelines where even minor changes in the source system could
cause a complete failure.

AP adopts the *ELT* (Extract, Load, Transform) approach, leveraging the
high-performance processing power of modern columnar data warehouses
like ClickHouse for data ingestion and transformation. With the ELT
approach, raw data is loaded into the data storage and data warehouse in
its native format, where it is processed and transformed with SQL views,
Python, and R scripts. These are standardized languages that empower
data engineers and scientists to perform scalable data manipulation,
complex statistical modeling, and advanced analytics directly within the
platform.
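
To make the pattern concrete, the sketch below shows ELT in Python using
the open-source `clickhouse-connect` driver. The host, credentials, file
name and table names are illustrative assumptions, not part of AP.

``` python
import csv

import clickhouse_connect

# Connect to the warehouse (hypothetical host and credentials).
client = clickhouse_connect.get_client(
    host="warehouse.example.org", username="analyst", password="secret"
)

# Extract + Load: ship the raw rows as-is; no up-front filtering or
# aggregation decides which fields are kept. Assumes the raw table
# already exists with matching columns.
with open("facility_visits.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    client.insert("raw_facility_visits", list(reader), column_names=header)

# Transform: model the data after load, inside the warehouse, where the
# logic can be changed and re-run at any time without re-ingesting.
monthly = client.query_df("""
    SELECT facility, toStartOfMonth(visit_date) AS month,
           sum(visits) AS total_visits
    FROM raw_facility_visits
    GROUP BY facility, month
""")
```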

## Benefits of ELT vs ETL

The ELT approach offers many benefits compared to ETL.

- **Flexibility:** By loading raw data first, organizations maintain a
  "single source of truth" that can be re-processed as business
  requirements evolve, rather than being locked into initial
  pre-aggregations.
- **Scalability:** ELT utilizes the parallel-processing capabilities of
  cloud and on-premises environments to handle massive datasets up to
  trillions of rows without requiring dedicated staging hardware.
- **Reliability:** Pipelines are less prone to failure because they are
  designed to automatically adapt to changes in source systems, such as
  the addition of new columns and tables.
- **Familiarity:** Data engineers can apply the knowledge they already
  have of the SQL, Python and R languages, without being constrained by
  rigid drag-and-drop user interfaces.

![Integrated data
repository](../assets/images/user/ap_elt_transformation_data_repository.png)

## No-code data connectors

AP provides a library of no-code data connectors that eliminate the need
for manual loading or custom ETL scripts. These connectors pull data
from a wide variety of sources directly into the platform's scalable
storage area.

AP provides a range of built-in no-code data pipelines.

- **Applications:** Integration with popular systems such as DHIS2,
  CommCare, Kobo Toolbox, ODK and iHRIS.
- **Databases:** Connections to popular databases such as PostgreSQL,
  MySQL, MS SQL Server and Oracle.
- **Cloud storage:** Integration with Amazon S3, Azure Blob Storage and
  Google Cloud Storage.
- **Data files:** Upload of data files in open formats such as CSV and
  Parquet.

These pipelines are pull-based and can be scheduled as part of workflows
to run at regular intervals, or configured for near real-time
(continuous) loading, ensuring that analytical data is never more than a
few minutes behind the source system.

## Data transformation and enrichment

Once data is loaded into the warehouse, AP offers powerful tools to
join, merge, and enrich datasets. This transformation layer is critical
for creating a unified view across disparate systems.

### SQL views

Views act as virtual tables that allow users to manage data without
moving it physically.

- **Logical Views:** Execute SQL queries in real-time for frequently
  changing data.
- **Materialized Views:** Store query results physically on disk to
  provide lightning-fast read access for large datasets.
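
AP manages views through its own interface; as a rough analogy, the
sketch below shows the two view types in PostgreSQL-style SQL, issued
from Python with the `psycopg2` driver. All connection details and
table names are illustrative assumptions.

``` python
import psycopg2

# Hypothetical warehouse connection.
conn = psycopg2.connect(host="warehouse.example.org", dbname="ap",
                        user="analyst", password="secret")
cur = conn.cursor()

# Logical view: nothing is stored; the query runs at read time, so
# results always reflect the latest rows in the underlying table.
cur.execute("""
    CREATE VIEW monthly_visits AS
    SELECT facility, date_trunc('month', visit_date) AS month,
           sum(visits) AS visits
    FROM raw_facility_visits
    GROUP BY facility, month
""")

# Materialized view: results are stored physically on disk for fast
# reads on large datasets, and are refreshed explicitly.
cur.execute("""
    CREATE MATERIALIZED VIEW monthly_visits_mat AS
    SELECT facility, date_trunc('month', visit_date) AS month,
           sum(visits) AS visits
    FROM raw_facility_visits
    GROUP BY facility, month
""")
cur.execute("REFRESH MATERIALIZED VIEW monthly_visits_mat")
conn.commit()
```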

### Scripting with Python and R

For advanced analytics, such as predictive modeling, machine learning,
and statistical computations, AP provides a web-based editor for Python
and R. Data scientists can retrieve data directly from the warehouse,
perform complex transformations, and write the results back into new
tables.
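
A minimal sketch of that round trip, assuming a hypothetical
`clickhouse-connect` connection and illustrative table names:

``` python
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="warehouse.example.org", username="analyst", password="secret"
)

# Retrieve data from the warehouse into a pandas DataFrame.
df = client.query_df(
    "SELECT facility, month, total_visits FROM monthly_visits"
)

# Transform: flag outliers using a per-facility z-score.
grouped = df.groupby("facility")["total_visits"]
mean, std = grouped.transform("mean"), grouped.transform("std")
df["z_score"] = (df["total_visits"] - mean) / std
df["outlier"] = df["z_score"].abs() > 3

# Write the results back into a new warehouse table.
client.command("""
    CREATE TABLE IF NOT EXISTS monthly_visit_outliers
    (facility String, month Date, total_visits Float64,
     z_score Float64, outlier Bool)
    ENGINE = MergeTree ORDER BY (facility, month)
""")
client.insert_df("monthly_visit_outliers", df)
```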

## AI-powered transformation

A key differentiator of the BAO Analytics Platform is the integration of
AI and Large Language Models (LLMs) to bridge the gap between technical
and non-technical users.

- **Natural text to SQL:** Within the Data Browser, users can describe
  the data they need in plain language. AP automatically translates
  these questions into complex SQL queries, enabling non-technical staff
  to perform advanced data retrieval.
- **Natural text to script:** The script editor allows users to generate
  Python or R code by describing the desired outcome, such as "perform
  outlier detection using Z-score", and having the AI-powered editor
  write the script automatically.
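
As a purely hypothetical illustration of the first capability (the SQL
that AP actually generates depends on your schema and data), here is a
plain-language question and the kind of query it could translate to:

``` python
# A question a user might type into the Data Browser.
question = "Which five facilities had the most visits in 2024?"

# The kind of SQL a text-to-SQL translation could produce, assuming a
# monthly_visits table like the one in the earlier examples.
generated_sql = """
    SELECT facility, sum(total_visits) AS visits_2024
    FROM monthly_visits
    WHERE month >= '2024-01-01' AND month < '2025-01-01'
    GROUP BY facility
    ORDER BY visits_2024 DESC
    LIMIT 5
"""
```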

## Data lineage

Data lineage refers to the process of tracking data's entire lifecycle,
documenting its specific origin, the transformations it undergoes, and
its destination to ensure transparency, quality, and trust. AP provides
full data lineage through its approach to data loading, transformation
and integration.

- **Connectors:** Datasets and data sources are ingested using no-code
  data pipelines (connectors) in a repeatable and automated way.
- **Visibility:** AP data loading is organized by *workflows*,
  consisting of multiple *steps*, each with multiple *jobs*. Job types
  exist for data pipelines, data quality checks, materialized views,
  scripts, and destinations. Comprehensive metadata including
  description and tags, in addition to the query and script code itself,
  are available for SQL views and scripts. Workflows can be configured
  and viewed through the AP user interface, and the data model and
  operations are exposed and accessible in the API. As a result, users
  have full visibility into the data journey and the data loading and
  transformation process.
- **Change log:** Workflows, as well as data pipelines, materialized
  views, data quality checks and destinations, provide a *change log*
  where key information about each run is stored. The information
  includes a unique job identifier, start time, count of affected rows,
  duration and status. In addition, detailed logs for each run are
  available, including information about which tables were created and
  the row count for each table. This provides complete insight into the
  history of workflow activity.

## Best practices for data integration

The AP architecture is built upon several core data engineering
principles to ensure security and integrity.

- **Unified Data Catalog:** A centralized, searchable inventory of all
  data assets ensures teams can quickly find and access the right
  information.
- **Automated Workflows:** Complex, multi-step processes, from ingestion
  to transformation to quality checks, are orchestrated through a
  flexible workflow builder to ensure data consistency and freshness.
- **Security by Design:** All secrets (API keys, passwords) are
  encrypted using the Google Tink library, and data is protected both at
  rest and in transit via TLS/SSL.
- **Auditability:** Detailed change logs are maintained for every task,
  providing a transparent audit trail for data loading and
  transformation activities.

# Data catalog

## Overview

The data catalog in Analytics Platform (AP) is a comprehensive inventory
system that organizes and manages information about your data assets.
The data catalog is integral for users to understand the types, sources,
and characteristics of data integrated within the platform.

The data catalog serves as a central repository where all your data
assets are systematically cataloged. In AP, the data catalog provides
metadata, management features, and search capabilities, enabling users
to quickly locate and understand data across various sources. It details
the data origin, format, and the relationships between different
datasets, making it easier to navigate and manage large volumes of
information within the organization.

The primary utility of the AP data catalog lies in its ability to
provide a central inventory of datasets and sources, which simplifies
data governance and enhances the efficiency of data management
practices. It ensures that users have access to reliable and up-to-date
data descriptions, fostering better decision-making and streamlining
data utilization across projects. By centralizing data knowledge, the
data catalog reduces redundancy and improves data quality.

The terms dataset, data source and data pipeline are used
interchangeably in this guide.

![Data catalog](../assets/images/user/data_catalog.png)

## Audience

The data catalog is designed for use by various stakeholders within an
organization. Data engineers and data integration specialists benefit
from it by gaining insights into available data sources and how they can
be best utilized and integrated. Analysts and data scientists use the
catalog to find relevant datasets for their analytical work, ensuring
that they are working with the most appropriate and up-to-date data.
Additionally, business users and decision-makers rely on the catalog to
verify that the data they base their strategic decisions on is accurate
and comprehensive.

## View data catalog

- Click **Data catalog** from the left-side menu to open the data
  catalog and view datasets.
- Use the *All sources* drop-down at the top of the page to filter
  datasets by source type.
- Use the *All schemas* drop-down at the top of the page to filter
  datasets by schema.
- Click the name of a dataset to view more information.

## Connect data

To connect data sources and bring datasets into the data catalog, click
**Connect data** from the top-right corner. This will open the data
pipeline dialog. Consult the *Data pipelines* page to learn more about
connecting data sources.

## View dataset

The dataset overview screen provides comprehensive information about the
dataset.

### Details

The *Details* tab displays metadata such as owner and URL, and the
username of the user who created and last modified the dataset.

### Tables

The *Tables* tab shows a list of all tables for the data source. Data
pipelines can generate one or many tables. As an example, a DHIS2 data
pipeline will typically generate a large number of tables, including
metadata, data, enrollment and event tables. The tables list allows you
to get an overview of which tables exist. Clicking on a table will
display the data structure, meaning the list of columns for the table.

![Dataset tables](../assets/images/user/dataset_tables.png)

### Data structure

The *Data structure* tab shows the structure of the selected table as a
list of columns. For each column, the data type, the number of distinct
values, the number of null (blank) values, and the min and max values
are displayed. Min and max values only apply to numeric data fields.

![Dataset data
structure](../assets/images/user/dataset_data_structure.png)

### Data preview

The *Data preview* tab displays the first 50 rows of the table. This is
useful to get an overview of what type of data exists in the table.

![Dataset data preview](../assets/images/user/dataset_data_preview.png)

### Change log

The *Change log* tab displays an overview of data load *tasks* for the
data pipeline. A task represents a single data pipeline run. For each
task, the following information is available.

- **Start time:** The time at which the task started.
- **Data load strategy:** The strategy for data loading. The strategy
  can be *Full replace*, which means completely loading entire data
  tables, or *Incremental append*, which means loading data records
  which were created, updated or deleted since the last task. The
  *Incremental append* strategy is only relevant for data pipelines for
  which data is continuously updated.
- **Duration:** The duration of the task.
- **Rows:** The number of data records which were loaded by the task.
- **Status:** The status of the task. It can be *Successful*, which
  means the task completed successfully, *Failed*, which means the task
  completed with an error, or *Pending*, which means the task is
  currently in progress.

![Dataset change log](../assets/images/user/dataset_change_log.png)

### Task log

You can click on a task row to view logs for the task. The logs provide
detailed information about the data load process, including the tables
which were created and loaded, the count of data records, the runtime
for each table data load, and more.

![Dataset task log](../assets/images/user/dataset_task_log.png)

## Edit dataset

A dataset can be edited after it has been created.

1.  Open the context menu by clicking the icon in the top-right corner.
2.  Click **Edit dataset**.
3.  Edit values in the relevant sections.
4.  Click **Save** at the bottom of the section.
5.  Close the dialog by clicking the close icon in the top-left corner.

## Share

Access to a dataset can be controlled by setting the appropriate sharing
permissions. A dataset can be shared with everyone in the organization,
referred to as *public access*, with user groups and with users. Users
can be given view access or edit access. Edit access implies view
access. Refer to the sharing page for sharing and access control
documentation.

## Download dataset

The data files used to load data into the data warehouse are available
for download in CSV format.

1.  Open the context menu by clicking the icon in the top-right corner.
2.  Click **Download data** in the context menu. This will open a dialog
    that displays the available data files for download.
3.  Click the download icon next to a file to download it.
4.  Click the link icon next to a file to copy the link / URL to the
    file.

Note that the downloadable data files are in compressed CSV format. The
files are compressed with the gzip tool. Tools for decompression exist
for all operating systems. For MS Windows, 7-Zip is a free alternative.
For Mac and Linux, use a terminal with the `gunzip` command,
e.g. `gunzip data.csv.gz`.
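
Alternatively, most data tools can read gzip-compressed CSV directly,
without a separate decompression step. For example, with Python's
standard library (file name illustrative):

``` python
import csv
import gzip

# Stream the downloaded file without decompressing it on disk.
with gzip.open("data.csv.gz", mode="rt", newline="") as f:
    for row in csv.DictReader(f):
        print(row)
```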

![Dataset context menu](../assets/images/user/dataset_context_menu.png)

## Download metadata

Metadata for the dataset is available for download in JSON format.

1.  Open the context menu by clicking the icon in the top-right corner.
2.  Click **Download metadata**.

## Refresh data

Data for the data pipeline can be manually refreshed. This will load
data from the data source into the platform and data warehouse. Note
that data pipelines will typically be scheduled to refresh
automatically. This can be set in the create and update data pipeline
screens.

1.  Open the context menu by clicking the icon in the top-right corner.
2.  Click **Refresh data**.

## Test connection

After setting up a data pipeline, it is useful to be able to test that
the connection is valid.

1.  Open the context menu by clicking the icon in the top-right corner.
2.  Click **Test connection**.

## Purge data files

Every time data for a data pipeline is loaded into the platform, the
data files used to stage and load data are retained. If the underlying
data has changed, it may be necessary, for data protection and
compliance reasons, to purge the data files from previous data load
processes in order to have a fresh start.

1.  Open the context menu by clicking the icon in the top-right corner.
2.  Click **Purge data files**.

## Remove data pipeline

A data pipeline, including data files and data warehouse tables, can be
removed when no longer needed.

1.  Open the context menu by clicking the icon in the top-right corner.
2.  Click **Remove**.

# Data pipelines

## Overview

Analytics Platform (AP) offers *data pipelines* for ingesting data from
a variety of data sources into the platform. A data pipeline is a
mechanism for moving data from data sources into AP. Within AP, the
data is ingested into a data catalog, data store and data warehouse. A
data source can be an application, a blob (file) store or a data file.
Data pipelines are *pull* based, meaning data will be pulled from the
data source before being loaded into the platform. Specific data
pipeline types are capable of loading data in near real-time.

The following data pipelines are supported.

### Applications

- **BHIMA:** An open-source hospital and logistics information management
  system used for electronic medical records (EMR), inventory and
  commodity tracking and billing in low-resource settings.
- **CommCare:** A mobile-based platform for managing front-line health
  programs, providing case management, data collection, and real-time
  decision support for community health workers.
- **DHIS2:** A flexible, open-source, web-based platform for
  collecting, analyzing and visualizing health data, widely used for
  managing and monitoring health programs, particularly in low-resource
  settings.
- **FHIR:** Fast Healthcare Interoperability Resources (FHIR) is a
  healthcare data standard and framework with an API for representing
  and exchanging electronic health records (EHR).
- **Google Sheets:** A cloud-based spreadsheet application within Google
  Drive that enables users to create, edit, and format spreadsheets
  online while collaborating in real time with others.
- **iHRIS:** An open-source human resources information system designed
  for managing health workforce data, allowing organizations to track
  employee information.
- **Kobo Toolbox:** An open-source data collection and management tool
  that enables field researchers and humanitarian organizations to
  design surveys and collect data offline or online using mobile
  devices.
- **ODK:** A suite of open-source tools used for mobile data collection,
  especially in challenging environments, with features like offline
  data capture and GPS integration.
- **Ona:** An open-source data collection and analysis platform designed
  for mobile and web-based surveys, commonly used for monitoring and
  evaluating projects in health, agriculture, and development sectors.
- **Talamus:** A hybrid cloud-based platform for patients and
  health-care providers that facilitates high-quality health care,
  providing hospitals, labs, pharmacies and imaging centers with a
  digital platform.

### Blob stores

- **Amazon S3:** A scalable and durable object storage service in the
  AWS cloud.
- **Azure Blob Storage:** A scalable and durable object storage in the
  Microsoft Azure cloud.

### Databases

- **SQL Server:** A relational database management system with robust
  data management capabilities developed by Microsoft.
- **MySQL:** An open-source and flexible relational database management
  system widely used for web applications and large datasets.
- **Oracle RDBMS:** A powerful, enterprise-grade relational database
  management system designed for complex transactional processing.
- **PostgreSQL:** A sophisticated, widely adopted and open-source
  relational database management system with a rich ecosystem of
  extensions.
- **Amazon Redshift:** A managed and scalable data warehouse designed
  for fast querying of large-scale datasets in the AWS cloud.

### File formats

- **CSV file upload:** A simple, text-based file format used to store
  tabular data, where each line represents a data record and each field
  is separated by a comma.
- **Parquet file upload:** A columnar storage file format designed for
  high-performance data querying and storage optimization in analytics
  workloads.

### Data portals

- **Global Health Observatory:** The WHO's primary data portal for
  health-related statistics, providing a central repository of over
  1,000 indicators used to monitor and analyze global health situations
  and trends.
- **World Bank Open Data:** The World Bank's portal for free and open
  access to global development data, including economic and demographic
  indicators.

### Schedule data refresh

Data pipelines can be scheduled to automatically refresh data from the
data source. The refresh schedule is set in the add and update data
pipeline pages. There are two types of scheduling.

- **Regular interval:** Next to the **Refresh schedule** label, the
  preferred interval for refreshing the data source can be selected
  from the drop-down field. Data will be regularly refreshed from the
  data source at the selected interval. Data in AP will be *fully
  replaced* for every refresh, meaning existing data will be entirely
  replaced with data from the source system.
- **Continuously updated:** Next to the **Refresh schedule** label, the
  **Continuously updated** checkbox can be selected. Data will be
  continuously refreshed from the data source. Data in AP will be
  *incrementally appended* for every refresh with a one-minute delay in
  between. In addition, data in AP will be *fully replaced* once per
  night. For the incremental appends, new, updated and removed data
  records in the source system are first loaded into a staging table in
  AP, then the corresponding data records are removed, and finally the
  new and updated data records are loaded into AP (see the sketch after
  this list). Continuous update scheduling is currently only supported
  for the DHIS2 data pipeline.
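
Conceptually, the incremental append is a staged merge. The sketch below
is a simplification, not AP's actual implementation, with illustrative
table and column names, again using the `clickhouse-connect` driver:

``` python
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="warehouse.example.org", username="loader", password="secret"
)

# Step 1 (not shown): new, updated and removed source records have been
# pulled from the source system into the staging_events table.

# Step 2: remove the corresponding records from the main table.
client.command("""
    DELETE FROM events
    WHERE event_id IN (SELECT event_id FROM staging_events)
""")

# Step 3: load the new and updated records; records removed in the
# source were deleted in step 2 and are simply not re-inserted.
client.command("""
    INSERT INTO events
    SELECT * EXCEPT (is_deleted) FROM staging_events
    WHERE NOT is_deleted
""")
```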

### Authentication

Data pipelines need to authenticate to the data source for secure
exchange of data records. Data pipelines typically use the following
types of authentication.

- **API token:** This method uses a token that is sent in the HTTP
  header of requests to authenticate with APIs. The token acts as a
  secure key to access the API, ensuring that only authorized users or
  services can interact with the data source. It is often used for
  RESTful APIs and provides a straightforward way to handle
  authentication without exposing user credentials.

- **API username/password:** A simple authentication scheme built into
  the HTTP protocol, also known as *basic auth*. It involves sending a
  username and password with each HTTP request. These credentials are
  typically *base64* encoded for transmission but are easily decoded,
  making this method less secure unless used in conjunction with TLS
  encryption to protect the credentials in transit.

- **Database username/password:** This method involves connecting to a
  database using a username and password. The credentials are used to
  establish a JDBC or similar database connection, ensuring that only
  authenticated users can execute queries and access data. This
  traditional form of database access control is widely used due to its
  simplicity and direct support in most database systems.
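
From a client's perspective, the two API-based methods above look
roughly like the sketch below. The header scheme and token format vary
by system, so treat the details as illustrative:

``` python
import requests

# API token: the token is sent in a request header. The exact header
# name and scheme (e.g. "Bearer" or a custom header) depend on the
# data source.
response = requests.get(
    "https://example.org/api/records",
    headers={"Authorization": "Bearer my-secret-token"},
)

# API username/password (basic auth): requests base64-encodes the
# credentials into an Authorization header. Always combine with TLS,
# since base64 is trivially decoded.
response = requests.get(
    "https://example.org/api/records",
    auth=("jane.doe", "secret"),
)
```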

### General metadata

All data pipeline types have a common section titled **General
settings** with various metadata fields. This section allows you to
store extensive metadata for each data pipeline (dataset). Maintaining
descriptive metadata for each dataset supports a comprehensive data
catalog, giving users an overview of which datasets exist for your
organization. The metadata fields include name, description, owner,
URL, tags, reference, link to source and link to terms of use. Tags are
entered as free text. Tags which already exist will be suggested as you
type in a tag name.

### Data source connections

AP pulls data from your data sources using a set of fixed IP addresses.
To ensure that AP can connect to your data sources, you must allowlist
these IP addresses in your firewall. This typically only applies to
databases, where a connection is made directly with the database. It
typically does not apply to HTTP API-based applications which are
already accessible on the Internet. It also only applies to the managed
AP offering from BAO Systems, not to other deployment models.

  AP environment   Region    IP address
  ---------------- --------- ---------------
  Production       US East   3.93.131.28
  Test             US East   54.173.36.156

## Manage data pipelines

The following section covers how to view, create, update and remove data
pipelines.

![Data catalog](../assets/images/user/data_catalog.png)

### View data pipeline

1.  Click **Data catalog** in the left-side menu to view all data
    pipelines.
2.  Click the name of a data pipeline to view more information.

### Create data pipeline

The starting point for creating a new data pipeline is the data catalog.
The data catalog displays the existing data pipelines, also referred to
as datasets.

1.  Click **Connect data** from the top-right corner.
2.  Choose the data source for which you want to create a data pipeline.

![Data pipeline types](../assets/images/user/data_pipeline_types.png)

**General settings**

In the *General settings* section, enter the following information. This
section is present for all data pipeline types.

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Name                                              The name of the data
                                                    pipeline (required)

  Refresh schedule                                  The interval for when
                                                    to refresh data from
                                                    the data source
                                                    (required)

  Description                                       A description of the
                                                    data pipeline

  Owner                                             The owner of the
                                                    source data or system

  URL                                               A URL to the source
                                                    data or system

  Tags                                              Free text tags which
                                                    categorize the
                                                    source data

  Disable pipeline                                  Whether to disable
                                                    loading of source
                                                    data

  Reference                                         A reference text for
                                                    the data source

  Link to source                                    A URL referring to
                                                    information about the
                                                    data source

  Link to terms of use                              A URL referring to
                                                    terms of use for the
                                                    data source
  -----------------------------------------------------------------------

The following section describes steps for creating each type of data
pipeline.

**Data warehouse target**

In the *Data warehouse* section, enter the following information. This
section appears last of all sections, and is present for all data
pipeline types.

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Table schema                                      The data warehouse
                                                    schema in which to
                                                    create tables

  Table name                                        The table name; for
                                                    multi-table data
                                                    pipelines, the base
                                                    name for all tables
  -----------------------------------------------------------------------

### BHIMA

  Topic            Value
  ---------------- -----------------------
  Connection       Web API
  Authentication   API username/password
  Data model       Project

**Connection**

  Field      Description
  ---------- ------------------------------------
  URL        The URL for the BHIMA instance
  Username   Username for the BHIMA account
  Password   Password for the BHIMA account
  Project    Project for which to exchange data

**Settings**

*Stock usage*

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Depot                                             The depot to load
                                                    data for

  Inventory                                         The inventory to load
                                                    data for

  Avg consumption algorithm                         The algorithm to use
                                                    for average
                                                    consumption
                                                    calculation

  Monthly interval                                  The monthly interval
  -----------------------------------------------------------------------

*Stock satisfaction rate*

  Field        Description
  ------------ -----------------------------------
  Start date   Start date for satisfaction rates
  End date     End date for satisfaction rates
  Depots       Depots

### CommCare

  Topic            Value
  ---------------- ----------------------
  Connection       Web API
  Authentication   API username/token
  Data model       Application and form

**Settings**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Domain                                            The domain (project)
                                                    to load data for

  Application                                       The application to
                                                    load data for

  Hash column names                                 Whether to hash
                                                    column names to
                                                    ensure uniqueness
  -----------------------------------------------------------------------

### DHIS2

  --------------------------------------------------------------------------------
  Topic                                                       Value
  ----------------------------------------------------------- --------------------
  Connection                                                  Web API and database
                                                              connection

  Authentication                                              API token, API
                                                              username/password,
                                                              database
                                                              username/password

  Data model                                                  Aggregate data,
                                                              program, event and
                                                              enrollment
  --------------------------------------------------------------------------------

**Web API**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Base URL to web API                               Base URL to web API
                                                    for DHIS2 instance,
                                                    do not include `/api`

  Username                                          Username for DHIS2
                                                    user account

  Password                                          Password for DHIS2
                                                    user account
  -----------------------------------------------------------------------

**Database**

Providing a database connection URL and credentials will drastically
improve performance, and is required to load enrollment and event data.
If a database connection cannot be provided, the database section can be
skipped, and the data pipeline will work with an API connection only.
For the API connection, only metadata, data set completeness and
aggregate data are supported.

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Hostname                                          The hostname to the
                                                    PostgreSQL DHIS2
                                                    database server, do
                                                    not include a
                                                    protocol prefix

  Port                                              Port number to the
                                                    database server,
                                                    default is 5432

  SSL                                               Whether to enable SSL
                                                    encryption for the
                                                    database connection

  Trust server certificate                          Whether to trust the
                                                    server SSL
                                                    certificate for the
                                                    database connection

  Database name                                     The name of the
                                                    database

  Database username                                 The username of the
                                                    database user

  Database password                                 The password of the
                                                    database user
  -----------------------------------------------------------------------

**Data types**

The data types section provides selections for the data types to load.

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Aggregate data                                    Include aggregate
                                                    data values and
                                                    complete data set
                                                    registrations

  Program                                           Include events and
                                                    enrollments, use the
                                                    drop-down to specify
                                                    which programs to
                                                    include, or leave the
                                                    drop-down blank to
                                                    include all current
                                                    and future programs

  Metadata                                          Include metadata
  -----------------------------------------------------------------------

**Data filters**

The data filters section provides filters for the data to load. All
filters are optional.

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Data element groups                               The data element
                                                    groups to include

  Data elements                                     The data elements to
                                                    include

  Organisation units                                The organisation
                                                    units to include

  Data sets                                         The data sets to
                                                    include

  Period last time unit                             The last time periods
                                                    of the specified unit
                                                    to load

  Include soft deleted data                         Whether to include
                                                    soft deleted data
                                                    records

  Skip wide aggregate data table                    Whether to skip the
                                                    wide aggregate data
                                                    table

  Include zero data values                          Whether to include
                                                    zero data values

  Include narrow event table for programs           Whether to include
                                                    narrow event table
                                                    for programs
  -----------------------------------------------------------------------

### FHIR

The FHIR data pipeline allows for retrieving information, typically
related to electronic health records.

Learn more about FHIR in general at [fhir.org](https://fhir.org) and
FHIR development at [build.fhir.org](https://build.fhir.org).

  Topic            Value
  ---------------- -----------------------
  Connection       Web API
  Authentication   API username/password

The following FHIR resources are supported.

  --------------------------------------------------------------------------------------------------------
  Resource        Documentation
  --------------- ----------------------------------------------------------------------------------------
  Code system     [build.fhir.org/codesystem.html](https://build.fhir.org/codesystem.html)

  Condition       [build.fhir.org/condition.html](https://build.fhir.org/condition.html)

  Encounter       [build.fhir.org/encounter.html](https://build.fhir.org/encounter.html)

  Location        [build.fhir.org/location.html](https://build.fhir.org/location.html)

  Medication      [build.fhir.org/medication.html](https://build.fhir.org/medication.html)

  Observation     [build.fhir.org/observation.html](https://build.fhir.org/observation.html)

  Organization    [build.fhir.org/organization.html](https://build.fhir.org/organization.html)

  Patient         [build.fhir.org/patient.html](https://build.fhir.org/patient.html)

  Person          [build.fhir.org/person.html](https://build.fhir.org/person.html)

  Practitioner    [build.fhir.org/practitioner.html](https://build.fhir.org/practitioner.html)

  Questionnaire   [build.fhir.org/questionnaire.html](https://build.fhir.org/questionnaire.html)

  Questionnaire   [build.fhir.org/questionnaireresponse.html](https://build.fhir.org/questionnaireresponse.html)
  response        

  Value set       [build.fhir.org/valueset.html](https://build.fhir.org/valueset.html)
  --------------------------------------------------------------------------------------------------------

**Settings**

  Field            Description
  ---------------- -------------------------------
  Questionnaires   The questionnaires to include

### Global Health Observatory

The Global Health Observatory (GHO) data pipeline allows for retrieving
indicators and datasets from the GHO public data portal.

  ------------------------------------------------------------------------
  Topic                                                       Value
  ----------------------------------------------------------- ------------
  Connection                                                  Web API

  Authentication                                              None,
                                                              publicly
                                                              available

  Data model                                                  Indicator,
                                                              dimension,
                                                              dimension
                                                              value and
                                                              data
  ------------------------------------------------------------------------

**Settings**

  Field        Description
  ------------ ---------------------------------
  Indicators   The indicators to load data for

### World Bank Open Data

The World Bank Open Data data pipeline allows for retrieving indicators
and datasets from the World Bank Open Data portal.

  Topic            Value
  ---------------- --------------------------
  Connection       Web API
  Authentication   None, publicly available
  Data model       Indicator and data

**Settings**

  Field        Description
  ------------ ---------------------------------
  Indicators   The indicators to load data for

### Google Sheets

The Google Sheets data pipeline allows for retrieving data from a Google
Sheet and loading it into the AP.

To connect to a Google Sheet, the first step is to share the sheet with
the AP service account email. The service account email address is:
`google-sheets@dharma-prod.iam.gserviceaccount.com`.

To share the file in Google Drive:

1.  Right-click the Google Sheet document and click **Share** \>
    **Share**.
2.  In the **Add people, groups and calendar events** field, enter the
    service account email address:
    `google-sheets@dharma-prod.iam.gserviceaccount.com`.
3.  Set the permission to **Viewer**.
4.  Click **Send**.

After sharing the sheet, get the URL to the sheet in the share dialog by
clicking **Copy link**.

  Topic            Value
  ---------------- ----------------------------------------
  Connection       Web API
  Authentication   File shared with service account email
  Data model       Sheet

**Connection**

  Field   Description
  ------- ------------------------------------------------
  URL     URL to sheet shared with service account email

**Settings**

  Field    Description
  -------- -----------------------
  Schema   Sheets in JSON format

The schema is a JSON array with one object per sheet. Each object has
the following properties:

  -----------------------------------------------------------------------
  Property                          Description
  --------------------------------- -------------------------------------
  sheetName                         The name of the sheet in Google
                                    Sheets. Ensure this matches the exact
                                    sheet name in the document

  ranges                            An array of cell ranges in the sheet
  -----------------------------------------------------------------------

The ranges array has one object per range. Each object has the following
properties:

  -----------------------------------------------------------------------
  Property                          Description
  --------------------------------- -------------------------------------
  header                            The cell range to use as column
                                    headers in the data warehouse table,
                                    e.g. `A1:J2`. The `header` can span
                                    multiple rows but this will be
                                    flattened into a single column header
                                    in the data warehouse

  data                              The cell range to use as rows in the
                                    data warehouse table, e.g. `A3:J100`
  -----------------------------------------------------------------------

!!! tip "Note"

    The header and data ranges must have the same number of columns.

**Example schema**

``` json
[
  {
    "sheetName": "ANC Data January 2025",
    "ranges": [
      {
        "header": "A1:J2",
        "data": "A3:J100"
      },
      {
        "header": "A120:P121",
        "data": "A122:P200"
      }
    ]
  },
  {
    "sheetName": "TB Data January 2025",
    "ranges": [
      {
        "header": "A1:D1",
        "data": "A2:D100"
      }
    ]
  }
]
```

### iHRIS

  Topic            Value
  ---------------- ----------------------------
  Connection       Database (JDBC)
  Authentication   Database username/password
  Data model       Form

**MySQL database**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Hostname                                          The hostname for the
                                                    iHRIS database

  Port                                              The port for the
                                                    iHRIS database, often
                                                    3306

  SSL                                               Whether to enable SSL
                                                    encryption for the
                                                    database connection

  Trust server certificate                          Whether to trust the
                                                    server SSL
                                                    certificate for the
                                                    database connection

  Database name                                     The name of the
                                                    database

  Database username                                 The username of the
                                                    database user

  Database password                                 The password of the
                                                    database user
  -----------------------------------------------------------------------

**Settings**

  Field                    Description
  ------------------------ ---------------------------------
  Forms                    The forms to load data for
  References               The references to load data for
  Include record history   Whether to load record history

### Kobo Toolbox

  Topic            Value
  ---------------- -----------
  Connection       Web API
  Authentication   API token
  Data model       Survey

**Connection**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  URL                                               URL to Kobo instance

  Auth token                                        Authentication token
                                                    for Kobo user account
  -----------------------------------------------------------------------

**Settings**

  Field    Description
  -------- -----------------------------
  Survey   The survey to load data for

### ODK

  Topic            Value
  ---------------- -----------------------
  Connection       Web API
  Authentication   API username/password
  Data model       Project and form

**Connection**

  Field      Description
  ---------- -------------------------------
  URL        URL to ODK instance
  Username   Username for ODK user account
  Password   Password for ODK user account

**Settings**

  Field     Description
  --------- ------------------------------
  Project   The project to load data for
  Form      The form to load data for

### Ona

  Topic            Value
  ---------------- -----------
  Connection       Web API
  Authentication   API token
  Data model       Form

**Connection**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  URL                                               URL to Ona instance

  Auth token                                        Authentication token
                                                    for Ona user account
  -----------------------------------------------------------------------

**Settings**

  Field   Description
  ------- ---------------------------
  Form    The form to load data for

### Talamus

  Topic            Value
  ---------------- -----------
  Connection       Web API
  Authentication   API token
  Data model       Facility

**Connection**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  URL                                               URL to the Talamus
                                                    instance

  Auth token                                        Authentication token
                                                    for Talamus user
                                                    account
  -----------------------------------------------------------------------

**Settings**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Facilities                                        The facilities to
                                                    load data for

  Start date                                        The start date of the
                                                    time range to load
                                                    data for

  End date                                          The end date of the
                                                    time range to load
                                                    data for
  -----------------------------------------------------------------------

### Amazon S3

Amazon S3 refers to files as objects.

  Topic            Value
  ---------------- -----------------------
  Connection       Web API
  Authentication   Access key/secret key
  Data model       Bucket and object

**Source**

  Field        Description
  ------------ --------------------------------
  Bucket       The bucket name
  Object key   The key for the object to load
  Access key   The IAM access key
  Secret key   The IAM secret key

### Azure Blob Storage

Azure Blob Storage refers to files as blobs.

  Topic            Value
  ---------------- --------------------
  Connection       Web API
  Authentication   Connection string
  Data model       Container and blob

**Source**

  Field               Description
  ------------------- -----------------------------------------
  Container name      The container name
  Blob path           The path to the blob to load
  Connection string   The connection string for the container

### SQL Server

  Topic            Value
  ---------------- ----------------------------
  Connection       Database (JDBC)
  Authentication   Database username/password
  Data model       Table

**SQL Server**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Hostname                                          The hostname for the
                                                    database

  Port                                              The port for the
                                                    database, often 1433

  SSL                                               Whether to enable SSL
                                                    encryption for the
                                                    database connection

  Trust server certificate                          Whether to trust the
                                                    server SSL
                                                    certificate for the
                                                    database connection

  Database name                                     The name of the
                                                    database

  Database username                                 The username of the
                                                    database user

  Database password                                 The password of the
                                                    database user
  -----------------------------------------------------------------------

**Data source**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  SQL query                                         The SQL query for
                                                    retrieving data to
                                                    load

  Tables                                            The database tables
                                                    to load
  -----------------------------------------------------------------------

### MySQL

  Topic            Value
  ---------------- ----------------------------
  Connection       Database (JDBC)
  Authentication   Database username/password
  Data model       Table

**MySQL**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Hostname                                          The hostname for the
                                                    database

  Port                                              The port for the
                                                    database, often 3306

  SSL                                               Whether to enable SSL
                                                    encryption for the
                                                    database connection

  Trust server certificate                          Whether to trust the
                                                    server SSL
                                                    certificate for the
                                                    database connection

  Database name                                     The name of the
                                                    database

  Database username                                 The username of the
                                                    database user

  Database password                                 The password of the
                                                    database user
  -----------------------------------------------------------------------

**Data source**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  SQL query                                         The SQL query for
                                                    retrieving data to
                                                    load

  Tables                                            The database tables
                                                    to load
  -----------------------------------------------------------------------

### Oracle RDBMS

  Topic            Value
  ---------------- ----------------------------
  Connection       Database (JDBC)
  Authentication   Database username/password
  Data model       Table

**Oracle RDBMS**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Hostname                                          The hostname for the
                                                    database

  Port                                              The port for the
                                                    database, often 1521

  SSL                                               Whether to enable SSL
                                                    encryption for the
                                                    database connection

  Trust server certificate                          Whether to trust the
                                                    server SSL
                                                    certificate for the
                                                    database connection

  Database name                                     The name of the
                                                    database

  Database username                                 The username of the
                                                    database user

  Database password                                 The password of the
                                                    database user
  -----------------------------------------------------------------------

**Data source**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  SQL query                                         The SQL query for
                                                    retrieving data to
                                                    load

  Tables                                            The database tables
                                                    to load
  -----------------------------------------------------------------------

### PostgreSQL

  Topic            Value
  ---------------- ----------------------------
  Connection       Database (psql)
  Authentication   Database username/password
  Data model       Table

**PostgreSQL**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Hostname                                          The hostname for the
                                                    database

  Port                                              The port for the
                                                    database, often 5432

  SSL                                               Whether to enable SSL
                                                    encryption for the
                                                    database connection

  Trust server certificate                          Whether to trust the
                                                    server SSL
                                                    certificate for the
                                                    database connection

  Database name                                     The name of the
                                                    database

  Database username                                 The username of the
                                                    database user

  Database password                                 The password of the
                                                    database user
  -----------------------------------------------------------------------

**Data source**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  SQL query                                         The SQL query for
                                                    retrieving data to
                                                    load

  Tables                                            The database tables
                                                    to load
  -----------------------------------------------------------------------

### Amazon Redshift

  Topic            Value
  ---------------- -------------------------------------
  Connection       Database (JDBC)
  Authentication   Database username/password, IAM ARN
  Data model       Table

**Amazon Redshift**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Hostname                                          The hostname for the
                                                    database

  Port                                              The port for the
                                                    database, often 5439

  SSL                                               Whether to enable SSL
                                                    encryption for the
                                                    database connection

  Trust server certificate                          Whether to trust the
                                                    server SSL
                                                    certificate for the
                                                    database connection

  Database name                                     The name of the
                                                    database

  Database username                                 The username of the
                                                    database user

  Database password                                 The password of the
                                                    database user
  -----------------------------------------------------------------------

**Data source**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  SQL query                                         The SQL query for
                                                    retrieving data to
                                                    load

  Tables                                            The database tables
                                                    to load
  -----------------------------------------------------------------------

### CSV file upload

  Topic            Value
  ---------------- -------------
  Connection       File upload
  Authentication   \-
  Data model       Table

**Settings**

  Field       Description
  ----------- ------------------------------------
  CSV files   One or more CSV data files to load
  Delimiter   The CSV file delimiter

**File format requirements**

- If uploading multiple files, the schema (columns) must be the same for
  all files
- The first row should be the header defining the column names
- Column names must be unique within the file
- Column names should preferably contain only letters and digits and
  start with a letter
- The filename should preferably contain only letters and digits and
  start with a letter
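
A minimal example of a CSV file that meets these requirements (the
contents are purely illustrative):

``` csv
region,year,population
north,2023,120000
south,2023,98000
```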

### Parquet file upload

  Topic            Value
  ---------------- -------------
  Connection       File upload
  Authentication   \-
  Data model       Table

**Settings**

  Field          Description
  -------------- ---------------------------
  Parquet file   Parquet data file to load

## Edit data pipeline

1.  Find and click the data pipeline to edit in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Edit**.
4.  Edit values in the relevant sections.
5.  Click **Save** at the bottom of the section.
6.  Close the dialog by clicking the close icon in the top-left corner.

## Remove data pipeline

1.  Find and click the data pipeline to remove in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Remove**.

## Manage access for data pipeline

1.  Find and click the data pipeline in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Share**.
4.  Grant appropriate access levels to users and user groups.
5.  Click **Save**.

## Technical overview

This section explains technical aspects of the data pipeline system.

### Regular data loading

The steps for data loading are described below.

1.  Connect to the data source system, typically over an encrypted HTTP
    API connection, or an encrypted database connection using the JDBC
    or ODBC protocol.
2.  Retrieve and write table data into buffer data files in the AP
    storage area. Data files use the pipe character as delimiter for
    text values, and are compressed with gzip using the DEFLATE
    algorithm.
3.  Upload the data files into the client-specific blob store. Depending
    on the client-specific hosting environment, this is either an
    on-premise file system or public cloud blob store.
4.  Create a staging table, which is a regular table with a `_staging`
    suffix.
5.  Load data from the data files into the staging table.
6.  If the data warehouse supports indexing, add indexes for appropriate
    tables and columns.
7.  Promote the staging table to become the main table by dropping the
    existing main table and renaming the staging table in a single
    atomic operation, as sketched below.
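
As a minimal sketch, steps 4 to 7 amount to the following, assuming a
PostgreSQL-style warehouse and a table named `data_datavalue` (both
illustrative; the exact DDL depends on the data warehouse):

``` sql
-- Create the staging table with the same structure as the main table.
create table data_datavalue_staging (like data_datavalue);

-- ... buffer data files are loaded into data_datavalue_staging ...

-- Promote the staging table in a single atomic transaction.
begin;
drop table data_datavalue;
alter table data_datavalue_staging rename to data_datavalue;
commit;
```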

### Real-time data loading

AP supports near real-time data loading for specific data pipeline
types. This behavior is referred to as continuous data loading. To
enable continuous data loading for a data pipeline, select the
"Continuously updated" checkbox under "Refresh schedule" in the general
settings section when creating or updating a data pipeline or workflow.
Continuous data loading is supported for specific data pipeline types
only, and the implementation depends partly on the structure of each
data pipeline source system.

The steps for continuous data loading are described below. The steps
partly overlap with the steps for regular data loading described above.
The incremental data loading tasks run with a fixed delay of one minute.
The incremental data load process typically takes between 1 and 3
minutes. This means that data becomes available in AP between 2 and 4
minutes after data is captured in the source system.

1.  Store a timestamp for the beginning of the last successful data load
    task.
2.  Connect to the data source system.
3.  Retrieve data which were created, updated or deleted since the
    beginning of the last successful data load task. This is done with a
    SQL filter particular to the source system. Write data into buffer
    data files.
4.  Upload the data files into the client-specific blob store.
5.  Create a staging table.
6.  Load data from the data files into the staging table.
7.  Delete data records from the main table which match data records in
    the staging table using a unique record identifier.
8.  Append data records from the staging table to the main table.

This process ensures that data records which were created or updated
since the last data load task are added or updated, and that data
records which were deleted since the last data load task are removed
from the AP-managed data warehouse.
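
A minimal sketch of the delete-then-append merge in steps 7 and 8,
assuming the unique record identifier is a column named `id` (the table
and column names are illustrative):

``` sql
-- Remove main table records which were re-captured in the staging
-- table, then append the fresh versions.
begin;
delete from data_datavalue
where id in (select id from data_datavalue_staging);

insert into data_datavalue
select * from data_datavalue_staging;
commit;
```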

### Data retention and versioning

The data pipeline system in AP retains and stores the raw data files for
every data load operation. Each data load is associated with a job
identifier. The raw data files are kept in the data storage area in a
directory named after the job identifier. This allows data analysts to
compare the current data with historical versions and perform historical
trend analysis, and system administrators to perform root-cause
debugging and conduct point-in-time data re-processing.

# DHIS2 data pipeline

## Overview

Table and column names in AP are similar, but not identical, to those in
DHIS2, and not all tables are ingested. The general structure, however,
remains the same, with one table for aggregate data, separate tables for
each program and separate tables for metadata, such as categories, data
elements and org units.

## Table naming

Each data pipeline specifies a schema and a data warehouse base table
name. The base table name becomes the prefix for all tables created by
the pipeline. For example, for a data pipeline with base name `data`,
the `datavalue` table will be named `data_datavalue`.

There are two versions of the aggregate data tables that are created by
default.

  -----------------------------------------------------------------------
  Version           Table name                Description
  ----------------- ------------------------- ---------------------------
  Narrow            datavalue                 For most queries. There are
                                              fewer columns to manage,
                                              however it will require
                                              joining with other metadata
                                              tables to get all the
                                              information you need.

  Wide              analytics                 Ad-hoc analysis. Most of
                                              the metadata tables are
                                              joined in this table,
                                              however it requires
                                              familiarity with the column
                                              structure.
  -----------------------------------------------------------------------
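
As a hedged illustration of the trade-off, the same summary could be
written against either version as follows; all table and column names
are hypothetical.

``` sql
-- Narrow table: fewer columns, but metadata must be joined in.
select de.name, count(*) as records
from data_datavalue dv
join data_dataelement de on de.id = dv.data_element_id
group by de.name;

-- Wide table: metadata is already joined in, at the cost of a larger
-- column structure to be familiar with.
select data_element_name, count(*) as records
from data_analytics
group by data_element_name;
```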

## Table structure

This section describes the tables which are created and loaded by the
DHIS2 data pipeline.

### Data tables

The DHIS2 data pipeline creates and loads tables for aggregate data,
complete data set registrations, program events, program enrollments,
tracked entities and relationships.

  -----------------------------------------------------------------------
  Table name                       Description
  -------------------------------- --------------------------------------
  Audit                            Audit table

  Data                             Wide aggregate data table

  Data value                       Narrow aggregate data table

  Complete data set registration   Completed datasets

  \[*Program name*\] - Event       Tracker/event data table

  \[*Program name*\] - Event audit Tracker/event audit table

  Enrollment                       Tracker enrollment data table

  Tracked entity attribute value   Tracked entity attribute value table

  Tracked entity instance          Tracked entity instance data table

  Relationship item                Relationship data table
  -----------------------------------------------------------------------

### Metadata tables

Metadata tables are split into their component parts. The wide data
value table includes most of the key metadata information; however, it
might be necessary to build a star schema by joining metadata tables.
Additional metadata tables suffixed with 'structure' are automatically
created by AP to reduce the number of table joins required.

  Data elements
  ----------------------------------
  Data element
  Data element group
  Data element group members
  Data element structure
  Data element group set
  Data element group set members
  Data element group set structure

  Categories
  ---------------------------------------------------
  Category
  Category structure
  Category combination
  Category combinations and categories
  Category combinations and option combinations
  Category option
  Category option and organisation units
  Categories and category options
  Category option combination
  Category option combination structure
  Category option combinations and category options
  Category option group
  Category option group members
  Category option group set
  Category option group set members

  Organisation units
  ---------------------------------------
  Organisation unit
  Organisation unit structure
  Organisation unit group
  Organisation unit group members
  Organisation unit group set
  Organisation unit group set members
  Organisation unit group set structure

  Time periods
  ---------------------------
  Period
  Period structure
  Relative monthly period
  Relative quarterly period
  Relative yearly period

  Data sets
  ------------------
  Data set
  Data set element

  Programs
  ---------------
  Program
  Program stage

  Tracked entities
  -----------------------------
  Tracked entity attribute
  Tracked entity type
  Relationships
  Relationship structure
  Relationship constraint
  Relationship type
  Relationship type structure

  Indicators
  -----------------------------
  Indicator
  Indicator group
  Indicator group members
  Indicator group set
  Indicator group set members
  Program indicator
  Dimension

# Views

## Overview

Views are named SQL queries that act as virtual tables in a database
system, displaying the results of the underlying query each time they
are accessed. In AP, views enable users to manage and analyze data more
efficiently, serving as an abstraction layer that packages complex SQL
queries into reusable and manageable components.

SQL views are versatile tools in data management, offering the following
capabilities (a combined example follows the list):

- **Join:** Views can seamlessly integrate data from multiple tables
  through SQL *JOIN* operations. This capability is invaluable when you
  need to combine related datasets for comprehensive analyses.
- **Filter:** Views can filter data to focus on specific records, making
  it easier to work with subsets of data pertinent to particular
  analyses or reports.
- **Aggregate:** With views, you can perform aggregation queries, such
  as *SUM*, *AVG*, *MAX*, and *COUNT*, to summarize data. This is
  particularly useful for generating high-level reports from detailed
  data records.
- **Enrich:** Views can enhance data by incorporating calculated columns
  or by formatting existing data in a way that is more suitable for user
  requirements or specific analyses.
- **Project:** Views can select and include only specific columns and
  rename them using the AS keyword to make them more human-readable.
- **Calculate:** Views can create calculated "virtual" columns that
  don't exist in the physical tables, represented by SQL expressions.
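
The sketch below combines several of these capabilities in a single
view; all table and column names are hypothetical.

``` sql
create view monthly_org_unit_totals as
select
  ou.name as org_unit,                           -- project and rename
  pe.year || '-' || pe.month as period,          -- calculated column
  sum(dv.value) as total_value                   -- aggregate
from data_datavalue dv
join data_orgunit ou on ou.id = dv.org_unit_id   -- join
join data_period pe on pe.id = dv.period_id
where dv.value is not null                       -- filter
group by ou.name, pe.year, pe.month;
```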

## Types of views

AP provides two primary types of SQL views.

- **Logical views:** These are the standard types of views that do not
  store data physically. They are essentially SQL queries which execute
  every time the view is accessed. Logical views are ideal for real-time
  data analysis and scenarios where data changes frequently.
- **Materialized views:** Unlike logical views, materialized views store
  the query result as a physical table on the disk. This type of view is
  particularly useful for datasets that do not change frequently but
  require fast read access. Materialized views improve performance by
  storing the computed result, reducing the load on compute resources
  during each query execution. Note that in AP, materialized views are
  created as regular tables. A benefit of tables over "native"
  materialized views is that tables which are part of the SQL query of
  the view can be dropped without invalidating it (see the sketch after
  this list).
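
A minimal sketch of the difference between the two types, using a
hypothetical source table:

``` sql
-- Logical view: the query runs every time the view is read.
create view recent_values as
select * from data_datavalue where year >= 2024;

-- Materialized view: in AP this is a regular table which stores the
-- query result and is refreshed by re-running the query.
create table recent_values_materialized as
select * from data_datavalue where year >= 2024;
```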

## Manage views

The following section covers how to view, create, update and remove
views.

### View views

1.  Click **Views** in the left side menu to list all views.
2.  Click the name of a view to see more information.

![View overview](../assets/images/user/view_overview.png)

### Create view

1.  Click the **Create new** button in the top-right corner.

2.  Enter the following information.

      -----------------------------------------------------------------------
      Field                                             Description
      ------------------------------------------------- ---------------------
      Name                                              The name of the view

      Description                                       A description of the
                                                        view

      Tags                                              Free text tags which
                                                        categorize the view

      Schema                                            The schema in which
                                                        to store the view

      View name                                         The name of the data
                                                        warehouse view,
                                                        meaning the name as
                                                        it will appear in the
                                                        data warehouse

      SQL query                                         The SQL query used to
                                                        retrieve data for the
                                                        view
      -----------------------------------------------------------------------

![Create view](../assets/images/user/view_create.png)

### Edit view

1.  Find and click the view to edit in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Edit**.
4.  Edit values in the relevant sections.
5.  Click **Save** at the bottom of the section.
6.  Close the dialog by clicking the close icon in the top-left corner.

### Edit SQL query

1.  Find and click the view to edit in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Edit SQL query**.
4.  In the SQL editor, edit the SQL query.
5.  Click **Save**.

### Remove view

1.  Find and click the view to remove in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Remove**.

### Refresh data

Materialized views can be refreshed manually. A refresh task will start
in the background when the refresh is triggered. Use the change log to
view the task progress.

1.  Find and click the materialized view in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Refresh data**.

## Manage access for view

1.  Find and click the view in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Share**.
4.  Grant appropriate access levels to users and user groups.
5.  Click **Save**.

## View change log

The *Change log* tab displays an overview of *tasks* for the
materialized view. A task represents a single materialized view refresh.
Note that change logs are available only for materialized views and not
for logical views. For each task, the following information is
available.

- **Start time:** The time at which the task started.
- **Duration:** The duration of the task.
- **Rows:** The number of data records which were loaded by the task.
- **Status:** The status of the task, can be *Successful*, which means
  the task completed successfully, *Failed*, which means the task
  completed with an error, and *Pending*, which means the task is
  currently in progress.

![Materialized view change
log](../assets/images/user/materialized_view_change_log.png)

### Task log

You can click on a task row to view logs for the task. The logs provide
detailed information about the materialized view refresh.

# Scripts

## Overview

The integrated scripting support in AP allows data analysts and data
scientists to create data transformations and analytics. AP supports the
R and Python scripting languages.

In AP, data pipelines automate the ingestion of data from multiple
sources into the central data store. Scripts have access to the entire
catalog of datasets and tables, allowing for integrated and
sophisticated data transformations.

AP provides a web-based editor for scripting, with syntax highlighting,
code completion and a console area which displays script output. This
makes it straightforward to write scripts and access data.

Scripts can be included as jobs in workflows in combination with data
pipelines, SQL transformations and destinations, allowing for flexible
and orchestrated data movement and transformation.

## Script flow

The scripting editor provides a function for acquiring a connection to
the AP data warehouse. The function is named `connect_datawarehouse` and
is available for all supported scripting languages, making it easy to
read data from, and write data to, the data warehouse. For scripts, a
typical flow is described below.

- Retrieve data from the data warehouse with a SQL query, using the data
  warehouse connection provided by the `connect_datawarehouse` function
  (see the sketch after this list).
- The SQL query will specify aggregation and filters to retrieve the
  relevant data, and ensure that an appropriate set of data records is
  pulled into the script environment.
- Perform data computation such as machine learning, forecasting and
  data modelling.
- Load the resulting data into a data frame.
- Write the data frame to a table in the data warehouse.
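
The retrieval query issued through the connection might look like the
sketch below; the table and column names are hypothetical.

``` sql
-- Aggregate and filter in the warehouse so that only the relevant
-- records are pulled into the script environment.
select org_unit_id, period_id, sum(value) as total_value
from data_datavalue
where period_id between '202401' and '202412'
group by org_unit_id, period_id;
```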

This section covers the supported scripting languages.

## R scripting

The R scripting language is supported and is ideal for statistical
computing and graphics.

![Script R editor](../assets/images/user/script_r_editor.png)

## Python scripting

The Python scripting language is supported and is ideal for machine
learning, data modelling and statistics.

![Script Python editor](../assets/images/user/script_python_editor.png)

## Natural text script generation

The script editor allows you to specify a natural text description of
the outcome you want to achieve, in a conversational style, and have the
AI-powered editor generate a script automatically. For example, you can
ask the platform to perform outlier detection using the modified Z-score
statistical algorithm, build a lightweight API data connector, or
provide forecasting for a particular data table.

### Generate script

1.  In the script editor area, click the right side panel.
2.  Specify the natural text description in the input area.
3.  Click the generate icon or press **Enter**.
4.  In the script output area, click the **Copy** icon to copy the text.
5.  Alternatively, click the **Insert** icon to insert the script
    directly into the script area.
6.  Click the **Close** icon or click anywhere outside the right side
    panel to close it.

![Natural text script
generation](../assets/images/user/script_natural_text_generation.png)

## Manage scripts

The following section covers how to view, create, update and remove
scripts.

### View scripts

1.  Click **Scripts** in the left side menu to list all scripts.
2.  Click the name of a script to see more information.

![Script overview](../assets/images/user/script_overview.png)

### Create script

1.  Click the **Create new** button in the top-right corner.

2.  Enter the following information.

      -----------------------------------------------------------------------
      Field                                             Description
      ------------------------------------------------- ---------------------
      Name                                              The name of the
                                                        script

      Language                                          The script language

      Refresh schedule                                  The interval for when
                                                        to run the script
                                                        (required)

      Description                                       A description of the
                                                        script

      Tags                                              Free text tags which
                                                        categorize the
                                                        script

      Script                                            The script code
      -----------------------------------------------------------------------

![Script view](../assets/images/user/script_create.png)

### Edit script

1.  Find and click the script to edit in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Edit**.
4.  Edit values in the relevant sections.
5.  Click **Save** at the bottom of the section.
6.  Close the dialog by clicking the close icon in the top-left corner.

### Edit script code

1.  Find and click the script to edit in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Edit script**.
4.  In the script editor, edit the script code.
5.  Click **Save**.

### Remove script

1.  Find and click the script to remove in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Remove**.

## Manage access for script

1.  Find and click the script in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Share**.
4.  Grant appropriate access levels to users and user groups.
5.  Click **Save**.

## Using variables in scripts

Scripts support the use of variables, allowing you to reference dynamic
values within your script code. To reference a variable, use the
following syntax.

`${VAR_NAME}`

For example, if you have a variable named `START_DATE`, you can use it
in your script as `${START_DATE}`. When the script is executed,
`${START_DATE}` will be replaced with the actual value of the variable.
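
For example, a SQL query issued from a script could use the variable as
a filter; the table and column names below are hypothetical.

``` sql
-- ${START_DATE} is replaced with the variable value at execution time.
select *
from data_datavalue
where period_start >= '${START_DATE}';
```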

## Packages

Packages commonly used for data analysis and data science are available
in the AP scripting environment.

### Python

The available Python packages are listed below.

  -----------------------------------------------------------------------
  Package                                  Description
  ---------------------------------------- ------------------------------
  numpy                                    Scientific computing including
                                           arrays, matrices and math
                                           functions.

  scipy                                    Scientific and technical
                                           computing, built on numpy.

  pandas                                   Data manipulation and analysis
                                           including data frames.

  scikit-learn                             Classical machine learning and
                                           predictive data analysis.

  nltk                                     Natural language toolkit for
                                           working with human language
                                           data.

  joblib                                   Pipeline persistence and
                                           parallel computing.

  tqdm                                     Extensible progress bar for
                                           long running processes.

  tabulate                                 Print data in pretty, readable
                                           tables.

  prophet                                  Forecasts for univariate time
                                           series datasets.

  python-dateutil                          Handling of time-series data
                                           and time zone conversions.
  -----------------------------------------------------------------------

# Destinations

## Overview

AP offers *destinations* for loading data from the platform back into
operational systems. This is valuable for enriching the destination
system with integrated and harmonized data from a variety of sources.

The following destinations are supported.

### Applications

- **DHIS2:** A flexible, open-source, web-based platform for
  collecting, analyzing and visualizing health data, widely used for
  managing and monitoring health programs, particularly in low-resource
  settings.

## Manage destinations

The following section covers how to view, create, update and remove
destinations.

![Destination overview](../assets/images/user/destination_overview.png)

### View destination

1.  Click **Destinations** in the left side menu to list all
    destinations.
2.  Click the name of a destination to see more information.

### Create destination

1.  Click **Create new** in the top-right corner.
2.  Choose the type of destination.

![Create destination](../assets/images/user/destination_create.png)

**General settings**

In the *General settings* section, enter the following information. This
section is present for all destination types.

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Name                                              The name of the
                                                    destination
                                                    (required)

  Refresh schedule                                  The interval for when
                                                    to refresh data from
                                                    the data source
                                                    (required)

  Description                                       A description of the
                                                    destination

  URL                                               A URL to the source
                                                    data or system

  Disable destination                               Whether to disable
                                                    loading of
                                                    destination data

  Reference                                         A reference text for
                                                    the data source

  Link to source                                    A URL referring to
                                                    information about the
                                                    data source

  Link to terms of use                              A URL referring to
                                                    terms of use for the
                                                    data source
  -----------------------------------------------------------------------

**Source**

In the *Source* section, select the view to use to retrieve destination
data.

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  View                                              The view to use to
                                                    retrieve destination
                                                    data
  -----------------------------------------------------------------------

**DHIS2 Web API**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Base URL to web API                               Base URL to web API
                                                    for DHIS2 instance,
                                                    do not include `/api`

  Username                                          Username for DHIS2
                                                    user account

  Password                                          Password for DHIS2
                                                    user account
  -----------------------------------------------------------------------

**Import options**

  -----------------------------------------------------------------------
  Field                                             Description
  ------------------------------------------------- ---------------------
  Data element ID scheme                            The data element ID
                                                    scheme to use for
                                                    data import

  Org unit ID scheme                                The organisation unit
                                                    ID scheme to use for
                                                    data import

  Cat opt combo ID scheme                           The category option
                                                    combo ID scheme to
                                                    use for data import

  General ID scheme                                 The general ID scheme
                                                    to use for data
                                                    import

  Dry run                                           Whether to make a dry
                                                    run import without
                                                    saving data in the
                                                    destination

  Skip audit                                        Whether to skip
                                                    generating audit
                                                    records during data
                                                    import in the
                                                    destination
  -----------------------------------------------------------------------

**Destination view**

The following columns are supported for SQL views for DHIS2
destinations.

  Field                    Column name                 Required
  ------------------------ --------------------------- ----------
  Data element             data_element_id             Yes
  Period                   period_id                   Yes
  Org unit                 org_unit_id                 Yes
  Category option combo    category_option_combo_id    Yes
  Attribute option combo   attribute_option_combo_id   Yes
  Value                    value                       Yes
  Stored by                stored_by                   No
  Comment                  comment                     No
  Follow-up                follow_up                   No

The column name matching is permissive and tolerates variations.
Matching is case-insensitive, and allows names with or without
underscores and the `_id` suffix. Using data element as an example, the
following column names are valid:

  Column name variation
  -----------------------
  data_element_id
  dataelementid
  data_element
  DataElementID
  DataElement

An example of a SQL query to use in a view serving as the source for a
DHIS2 destination:

``` sql
select
  dv."DataElementID",
  dv."PeriodID",
  dv."OrgUnitID",
  dv."CatOptComboID",
  dv."AttOptComboID",
  dv."Value",
  dv."Deleted"
from
  dhis2.data_datavalue dv;
```

### Edit destination

1.  Find and click the destination to edit in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Edit**.
4.  Edit values in the relevant sections.
5.  Click **Save** at the bottom of the section.
6.  Close the dialog by clicking the close icon in the top-left corner.

### Remove destination

1.  Find and click the destination to remove in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Remove**.

### Trigger destination

Destinations can be triggered manually. A destination task will start in
the background when the destination is triggered. Use the change log to
view the task progress.

1.  Find and click the destination in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Trigger**.

## Manage access for destination

1.  Find and click the destination in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Share**.
4.  Grant appropriate access levels to users and user groups.
5.  Click **Save**.

## View change log

The *Change log* tab displays an overview of *tasks* for the
destination. A task represents a single destination run. For each task,
the following information is available.

- **Start time:** The time at which the task started.
- **Duration:** The duration of the task.
- **Rows:** The number of data records which were imported by the task.
- **Status:** The status of the task, can be *Successful*, which means
  the task completed successfully, *Failed*, which means the task
  completed with an error, and *Pending*, which means the task is
  currently in progress.

![Destination change
log](../assets/images/user/destination_change_log.png)

### Task log

You can click on a task row to view logs for the task. The logs provide
detailed information about the data import.

# Workflows

## Overview

The Analytics Platform (AP) offers a comprehensive workflow management
system designed to orchestrate complex data pipelines efficiently.
Workflows in AP automate the movement, transformation, and integration
of data across various sources and systems into a unified data store,
enabling integrated data analysis through your chosen visualization
tools. This section explains the architecture of workflows within AP and
their critical role in streamlining data operations.

## Benefits

Workflows and job orchestration in AP bring several key benefits.

- **Efficiency:** Automating data tasks reduces manual efforts and
  speeds up the data transformation processes.
- **Consistency:** Scheduled workflows ensure that data handling is
  performed consistently without gaps or overlaps, leading to reliable
  data integrity.
- **Quality:** Integrated data quality checks verify that data meets
  defined quality standards, improving trust and reliability in
  downstream analytics.
- **Scalability:** As organizational needs grow, workflows can be scaled
  to handle new data sources, increasing data volumes and complexity
  without compromising performance.

## Workflow model

- **Workflow:** A workflow is a structured sequence of operations
  designed to automate processes for data loading, transformation and
  integration. Each workflow consists of one or more steps.
- **Step:** A step encapsulates work to be done. Each step contains one
  or more jobs.
- **Job:** A job represents a specific task such as data extraction,
  transformation, or loading.

![Workflow diagram](../assets/images/user/ap_workflow_overview.png)

### Scheduling

A workflow can be scheduled to run at specific intervals, which
automates the recurring tasks and ensures data freshness without manual
intervention. This scheduling capability is crucial for maintaining
up-to-date data views and operational readiness in dynamic business
environments. Workflows can also be set to update continuously. This
applies to jobs which support continuous updates, such as DHIS2 data
pipelines.

### Steps

Steps can be configured to run jobs either in sequence or in parallel
within a step.

- **Parallel:** Parallel job execution means that jobs within the step
  will be executed in parallel. This will reduce the total runtime of a
  workflow, thereby enhancing the efficiency of data processing tasks.
  This is suitable when the jobs have no dependencies on each other.
- **Serial:** Serial job execution means that jobs within each step will
  be executed serially, meaning one after the other. This is required
  when jobs have dependencies on each other, for example when the output
  of one view is required as input for another view.

### Jobs

Jobs within a workflow are categorized into five types:

- **Data pipelines:** Extracts and moves data from source systems into
  the data store.
- **Data quality checks:** Validates data against defined quality rules
  to ensure it is suitable for analysis. Notifications may be sent to
  alert relevant users.
- **Views:** Handles joining and integration of datasets with SQL
  statements to prepare them for analysis. Views can be logical and
  materialized.
- **Scripts:** Handles advanced data transformation, modelling and
  forecasting using the R and Python scripting languages.
- **Destinations:** Manages loading of processed data back into
  operational systems or other destinations. This is also referred to as
  "reverse ETL".

![Workflow job types](../assets/images/user/ap_workflow_job_types.png)

## Manage workflows

The following section covers how to create, update and remove workflows.

![Workflow overview](../assets/images/user/workflow_overview.png)

### View workflow

1.  Click **Workflows** in the left-side menu.
2.  Click the name of a workflow to see more information.

### Create workflow

1.  Click the **Create new** button in the top-right corner.

2.  In the *General settings* section, enter the following information.

      -----------------------------------------------------------------------
      Field                                             Description
      ------------------------------------------------- ---------------------
      Name                                              The name of the
                                                        workflow

      Refresh schedule                                  The interval for when
                                                        to refresh data from
                                                        the data source
                                                        (required)

      Description                                       A description of the
                                                        workflow

      Disable workflow                                  Whether to disable
                                                        the workflow
      -----------------------------------------------------------------------

### Create step in workflow

1.  In the *Steps* section, click **+ Add step** to add one or more
    steps.

2.  In the *Add step* section, enter the following information.

      -----------------------------------------------------------------------
      Field                                             Description
      ------------------------------------------------- ---------------------
      Execution mode                                    Whether to execute
                                                        jobs in the step in
                                                        serial or in parallel

      Disable step                                      Whether to disable
                                                        the step
      -----------------------------------------------------------------------

3.  Click **+ Add job** to add one or more jobs.

4.  In the section for the new job, select the following information.

      -----------------------------------------------------------------------
      Field                                             Description
      ------------------------------------------------- ---------------------
      Type                                              The type of job

      Source/Target                                     For data pipelines,
                                                        the data source type;
                                                        for destinations, the
                                                        target type

      Name                                              The data pipeline,
                                                        view or destination
      -----------------------------------------------------------------------

5.  Click the check icon to save the job.

6.  Repeat from step 3 to create additional jobs.

7.  Click **Save** to save the step.

8.  Repeat to create additional steps.

### Edit workflow

1.  Find and click the workflow to edit in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Edit**.
4.  Edit values in the relevant sections.
5.  Click **Save** at the bottom of the section.
6.  Close the dialog by clicking the close icon in the top-left corner.

### Remove workflow

1.  Find and click the workflow to remove in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Remove**.

### Refresh data

Workflows can be triggered manually. A workflow task will start in the
background when the workflow is triggered. Use the change log to view
the task progress.

1.  Find and click the workflow in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Refresh data**.

## Manage access for workflow

1.  Find and click the workflow in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Share**.
4.  Grant appropriate access levels to users and user groups.
5.  Click **Save**.

## View change log

The *Change log* tab displays an overview of *tasks* for the workflow. A
task represents a single workflow run. For each task, the following
information is available.

- **Start time:** The time at which the task started.
- **Data load strategy:** The strategy for data loading. The strategy
  can be *Full replace*, which means completely loading entire data
  tables, or *Incremental append*, which means loading data records
  which were created, updated or deleted since the last task. The
  *Incremental append* strategy is only relevant for data pipelines for
  which data is continuously updated.
- **Duration:** The duration of the task.
- **Rows:** The number of data records which were loaded by the task.
- **Status:** The status of the task, can be *Successful*, which means
  the task completed successfully, *Failed*, which means the task
  completed with an error, and *Pending*, which means the task is
  currently in progress.

### Task log

You can click on a task row to view logs for the task. The logs provide
detailed information about the jobs in each workflow step.

# Data browser

## Overview

The data browser is a core component of AP, designed to empower users to
engage directly with their data through interactive querying across all
available datasets. This allows for exploring data in near real-time,
enabling users to derive insights and make informed decisions rapidly.

In AP, data pipelines automate the ingestion of data from multiple
sources into the central data store. These pipelines are configured to
handle diverse data formats and sources, such as applications, databases
and data files, ensuring that the data is up-to-date and readily
accessible. Once the data is in the platform, users can create queries
which span all data sources and datasets, allowing for integrated data
exploration and analytics.

The main query language of the data browser is SQL, or Structured Query
Language. SQL is a standardized language for managing and manipulating
databases and data warehouses. SQL provides a powerful means to execute
queries on data in a *declarative* style. It allows users to specify
exactly what data they need from a database without requiring detailed
knowledge of how the database is structured or stored. Typical
operations are selecting specific data, aggregating data across
dimensions, filtering data on particular values and joining tables
together. SQL is widely known and used among data professionals, making
it a common language for data exploration and analysis.

![Data browser
overview](../assets/images/user/data_browser_overview.png)

### Navigating data warehouse schema

The schema navigator is placed on the left-side panel of the data
browser. It outlines the entire schema of the data available to the
user. The schema is displayed as a hierarchy, where the first level
represents table *schemas*, the second level represents *tables* and the
third level represents table *columns*.

  Level   Description
  ------- -------------
  1       Schemas
  2       Tables
  3       Columns

Expanding an item in the hierarchy will reveal items at the next level.
The column *data type* is displayed next to each column name. If a
column has a *foreign key* to another table and column, a key icon is
displayed to the left of the column name. Hovering over the key icon
will display the table and column which the foreign key refers to.

![Data browser schema
navigator](../assets/images/user/data_browser_schema_navigator.png)

To access table operations, hover and click the menu icon to the right
of a table name. The following operations are available.

- **Copy table name:** Copies the name of the table to the clipboard.
- **Copy view data:** Copies a SQL query for reading table data to the
  clipboard.
- **View data:** Inserts a SQL query for reading table data into the
  query area and runs the query.

To access column operations, hover and click the menu icon to the right
of a column name. The following operations are available.

- **Copy column name:** Copies the name of the column to the clipboard.

![Data browser schema navigator
operations](../assets/images/user/data_browser_schema_navigator_operations.png)

### Query editor

The query editor is placed at the center of the data browser. This is
the area where the query can be specified. There are two types of
queries: **SQL** and **Natural text**. The type of query to work with
can be selected at the top bar.

### SQL queries

Users can write their SQL queries directly into the query editor area.
The editor supports auto-completion of SQL statements to make writing
more efficient. To activate auto-complete, press **Ctrl + Space**
(Windows/Linux) or **Command + Space** (macOS), with the cursor at the
relevant position of the query.

![Data browser auto
completion](../assets/images/user/data_browser_auto_completion.png)

After writing a SQL query, click **Run**.

![Data browser SQL
query](../assets/images/user/data_browser_sql_query.png)

SQL queries will vary depending on the schema. A simple example of a SQL
query that summarizes data values and groups by data item, quarterly
time periods and countries:

``` sql
select
  d."DataItem", 
  d."PT Quarterly", 
  d."OU Country", 
  sum(d."Value")
from 
  demo.demo d
group by
  d."DataItem", 
  d."PT Quarterly", 
  d."OU Country";
```
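
The example above covers aggregation and grouping. Filtering and sorting
follow the same pattern; below is a sketch against the same demo table,
restricting results to a single country. The country value is
illustrative.

``` sql
-- Summarize values for one country and sort by the total.
select
  d."DataItem",
  d."PT Quarterly",
  sum(d."Value") as "TotalValue"
from
  demo.demo d
where
  d."OU Country" = 'Kenya'
group by
  d."DataItem",
  d."PT Quarterly"
order by
  "TotalValue" desc;
```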

### Formatting queries

To format a SQL query to make it more readable, click **Format** on the
top bar.

### Viewing results

The query response will appear in the result area. The query result is
displayed as a table, with the name and the data type of each column
displayed on the header row. By default, the first 200 rows of the
result are displayed. The number of rows to display can be changed from
the bottom bar drop-down to 500 or 1000.

### Natural text queries

Users who are not proficient in SQL can write queries in natural
language text. Click **Natural text** in the top bar to switch to
natural text queries. Select one or more schemas from the schema
selector at the top bar to narrow down the part of the schema to
retrieve data from.

With text queries, a user can ask simple questions about metadata, for
example:

``` text
Tell me about the ID, names and code of all data elements.
```

The result of metadata queries can be used to ask more sophisticated
data questions, for example:

``` text
Give me the sum of data values for data items related to TB_ART and TB_PREV by 
quarter and OU Region level. Include the data item. Order by data value descending. 
```

![Data browser natural text
query](../assets/images/user/data_browser_natural_text_query.png)
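
To illustrate what happens behind the scenes, the data question above
might translate into SQL along the following lines. This is a sketch
only; the actual generated query depends on your schema, and the column
names are assumed here.

``` sql
-- One possible SQL translation of the natural text query above.
select
  d."DataItem",
  d."PT Quarterly",
  d."OU Region",
  sum(d."Value") as "DataValue"
from
  demo.demo d
where
  d."DataItem" like 'TB_ART%'
  or d."DataItem" like 'TB_PREV%'
group by
  d."DataItem",
  d."PT Quarterly",
  d."OU Region"
order by
  "DataValue" desc;
```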

### Explaining SQL query

The data browser can provide explanations for SQL queries in plain
language. When opening a SQL query from a view, click the right-side
panel. The panel will open and display an explanation of the SQL query
in plain text.

![Data browser query
explanation](../assets/images/user/data_browser_query_explanation.png)

### Downloading query result

After a query has run successfully, the result of the query can be
downloaded to a data file in CSV format. The user is provided with two
options.

**Download preview**

Downloads the rows which are visible in the result area. The number of
rows can be changed from the bottom bar drop-down. The format is
uncompressed CSV.

**Download full dataset**

Downloads the entire set of rows produced by the query. This download
option will *stream* results to the web browser. The format is
[Gzip](https://www.gzip.org/) compressed CSV. Note that downloading
extremely large datasets is not recommended.

Tools for decompressing Gzip files are pre-installed on macOS and Linux.
The [7-Zip](https://www.7-zip.org/) tool is recommended for Microsoft
Windows.

### SQL reference

SQL is a standard query language defined by
[ANSI](https://www.ansi.org/) which ensures interoperability across data
warehouses supported by AP. Numerous courses and guides exist online for
learning purposes.

However, every data warehouse provides a range of specific features and
functions. Users writing SQL queries can learn about data warehouse
specific functions by consulting the respective SQL reference guides
listed below. You can see the type of data warehouse in the label on the
right side of the bottom bar, and open the respective SQL guide by
clicking the **SQL reference** link next to it, or from the table below.

  ----------------------------------------------------------------------------------------------------------------------------------------
  Data warehouse                  SQL reference guide
  ------------------------------- --------------------------------------------------------------------------------------------------------
  PostgreSQL                      [www.postgresql.org](https://www.postgresql.org/docs/current/sql.html)

  Amazon Redshift                 [docs.aws.amazon.com](https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_SQLCommandRef.html)

  ClickHouse                      [clickhouse.com](https://clickhouse.com/docs/en/sql-reference)

  SQL Server                      [learn.microsoft.com](https://learn.microsoft.com/en-us/sql/t-sql/language-reference)

  Synapse                         [learn.microsoft.com](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/overview-features)
  ----------------------------------------------------------------------------------------------------------------------------------------

### Text-to-SQL

The following diagram describes the text-to-SQL solution at a high
level.

![Text-to-SQL solution
architecture](../assets/images/sysadmin/ap_text_to_sql_solution_architecture.png)

# Data quality checks

## Overview

Ensuring high data quality is crucial for any organization that relies
on data for decision-making, analysis, and strategic planning.
High-quality data can significantly enhance accuracy in reporting,
consistency in analytics, and reliability in automated decisions.
Conversely, poor data quality can lead to misguided decisions based on
inaccurate, incomplete, or outdated information.

AP provides data quality checks to ensure the integrity and accuracy of
your data. These checks allow users to define specific criteria that
data must meet before it is considered valid for analysis and reporting.
This functionality includes:

- **Outlier detection:** Identify data points that deviate significantly
  from the norm. Outliers may indicate data entry errors or unusual
  events that could skew analysis results.
- **Relationship:** Ensure that relationships between data items make
  sense. For example, the number of positive tests should not exceed the
  total number of tests performed.
- **Data completeness:** Verify that all required data fields are
  populated and that data spans the required time frames or categories.
- **Consistency:** Compare data across time, category and sources to
  ensure data has a consistent format, is free from duplicates and uses
  the same coding system.

Data quality checks in AP are based on SQL queries which define and
enforce these rules. By writing a SQL query, users can precisely specify
the conditions under which data is considered valid. The SQL query
result set will reveal conditions which are in violation of the check.
When a SQL query identifies data that violates a quality check, AP can
trigger alerts or even prevent the integration of flawed data into your
reports and analyses.
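
As an example, the relationship check described above, where the number
of positive tests should not exceed the total number of tests, could be
expressed as a query which returns the violating rows. The table and
column names here are hypothetical.

``` sql
-- Rows returned by this query represent violations of the check.
select
  t."OU Country",
  t."PT Quarterly",
  t."PositiveTests",
  t."TotalTests"
from
  demo.testing t
where
  t."PositiveTests" > t."TotalTests";
```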

## Manage data quality checks

The following section covers how to view, create, update and remove data
quality checks.

![Data quality check
overview](../assets/images/user/data_quality_check_overview.png)

### Create data quality check

1.  Click **Create new** from the top-right corner.

2.  Enter the following information.

      -----------------------------------------------------------------------
      Field                                             Description
      ------------------------------------------------- ---------------------
      Name                                              The name of the check
                                                        (required)

      Short name                                        The short name of the
                                                        check

      Code                                              The code of the check

      Description                                       A description of the
                                                        check

      Result message                                    The message to
                                                        display in check
                                                        results and
                                                        notifications

      Labels                                            One or many labels in
                                                        the format
                                                        `key:value`

      SQL query                                         A SQL query which
                                                        specifies the
                                                        conditions under
                                                        which data is
                                                        considered valid
      -----------------------------------------------------------------------

3.  Click **Create**.

![Create data quality
check](../assets/images/user/create_data_quality_check.png)

### Edit data quality check

1.  Find and click the data quality check to update in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Edit**.
4.  Update the relevant fields.
5.  Click **Save**.

### Edit SQL query

1.  Find and click the data quality check to edit in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Edit the SQL query**.
4.  In the SQL editor, edit the SQL query.
5.  Click **Save**.

### Remove data quality check

1.  Find and click the data quality check to remove in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Remove**.

## Manage access for data quality check

1.  Find and click the data quality check in the list.
2.  Open the context menu by clicking the icon in the top-right corner.
3.  Click **Share**.
4.  Grant appropriate access levels to users and user groups.
5.  Click **Save**.

## Manage data quality check groups

The following section covers how to view, create, update and remove data
quality check groups.

![Data quality check group
overview](../assets/images/user/data_quality_check_group_overview.png)

### Create data quality check group

1.  Click **Create new** from the top-right corner.

2.  Enter the following information.

      -----------------------------------------------------------------------
      Field                                             Description
      ------------------------------------------------- ---------------------
      Name                                              The name of the group
                                                        (required)

      Short name                                        The short name of the
                                                        group (required)

      Description                                       A description of the
                                                        group

      Result message                                    The message to
                                                        display in check
                                                        results and
                                                        notifications

      Labels                                            One or many labels in
                                                        the format
                                                        `key:value`

      Data quality checks                               The data quality
                                                        checks to be included
                                                        in the group

      Notification recipients                           The recipients of
                                                        notifications
                                                        specified as users
                                                        and user groups
      -----------------------------------------------------------------------

3.  Click **Create**.

![Create data quality check
group](../assets/images/user/create_data_quality_check_group.png)

### Edit data quality check group

1.  Find and click the data quality check group to update in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Edit**.
4.  Update the relevant fields.
5.  Click **Save**.

### Remove data quality check group

1.  Find and click the data quality check group to remove in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Remove**.

### Run data quality group checks

Data quality check groups can be triggered manually. A data quality
check task will start in the background when the group is triggered. Use
the change log to view the task progress.

1.  Find and click the data quality check group to run in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Run checks**.

## Notifications

Ensuring that relevant people are notified of data quality issues in a
timely manner is an essential part of data governance and data quality
management.

Notifications in AP are delivered as email messages to the email address
of the users.

The main components of the user notification solution are described
below.

- **Data quality checks:** Specifies data quality conditions and SQL
  queries which reveal data quality issues.
- **Data quality check groups:** Groups data quality checks and
  specifies the recipients of notifications for data quality violations.
- **Workflows:** Integrates and schedules data quality check groups in
  workflows.
- **Email/SMTP:** Sends email notifications with summaries of data
  quality violations.

![Data quality notification
flow](../assets/images/user/data-quality-notification-flow.png)

## View change log

The *Change log* tab displays an overview of *tasks* for the data
quality check group. A task represents a single data quality check group
run. For each task, the following information is available.

- **Start time:** The time at which the task started.
- **Duration:** The duration of the task.
- **Rows:** The number of data quality check violations which were
  identified by the task.
- **Status:** The status of the task, which can be *Successful*, meaning
  the task completed successfully, *Failed*, meaning the task completed
  with an error, or *Pending*, meaning the task is currently in
  progress.

![Data quality check group change
log](../assets/images/user/data_quality_check_group_change_log.png)

### Task log

You can click on a task row to view its logs. The logs provide detailed
information for the data quality check group run.

## Manage notifications

1.  **Create data quality checks:** Checks with at least one violation
    will be included in notification messages. The *name* of the data
    quality check will render as the title, and the *notification
    message* will render above the violation summary table for each
    check. Provide an informative message, including instructions for
    how to investigate potential data quality issues. The rows of the
    result set returned by the SQL query are presented in a table for
    each check.
2.  **Create data quality check groups:** Include data quality checks
    which are logically related in groups. The *name* and *description*
    of the group will render as the title and subtitle of the
    notification message.
3.  **Create a workflow:** Data quality checks and notifications are
    integrated in workflows, specifically in a workflow step, as a job
    of type *Data quality check group*. It is recommended to include the
    data quality step and job after steps which load data with data
    pipelines, so that the data quality checks are performed on
    up-to-date data. Workflows can be scheduled to run automatically at
    specific intervals, or be run ad-hoc from the context menu of the
    workflow overview screen.
4.  **Configure email/SMTP:** Ensure an SMTP server is available and
    configured for the AP installation. AP will send notifications to
    the email addresses of users specified in data quality check groups.
    The notification message is customizable, and a title, description
    and summary table are included for each data quality check.

### Notification messages

A notification message includes the following information.

- *Message title:* Data quality check group name.
- *Message subtitle:* Data quality check group description.
- *Checks:* Data quality checks in the group are rendered sequentially.
- *Check title:* Data quality check name.
- *Check description:* Data quality check notification message.
- *Check summary:* Rows returned by the SQL query defined in the data
  quality check are rendered as a table.

![Notification email
message](../assets/images/user/notification_email_message.png)

# Schemas

## Overview

A schema refers to a namespace within the platform and data warehouse. It
defines the structure of the data and represents how tables and views
are organized. This setup is similar to having folders within a single
file system: each folder provides clarity and structure, and different
folders can contain files with the same names.

Schemas help in segregating database objects according to their use,
type, access level, or any other criteria that suits the business. This
allows for a cleaner and more organized data structure, making it easier
for users to locate and manage their data. Since objects are contained
within schemas, users can avoid naming conflicts in a shared database
environment.

When creating data pipelines and views, a schema must be selected in the
*Data warehouse target* section. For multi-table data pipeline types,
like DHIS2, it is advisable to create a schema per DHIS2 instance for
improved organization of tables.
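
For example, two DHIS2 instances could be loaded into separate schemas,
so that identically named tables do not conflict and can be queried side
by side. The schema and table names here are hypothetical.

``` sql
-- The same table name can exist in both schemas without conflict.
select count(*) from dhis2_hq."DataElement";
select count(*) from dhis2_regional."DataElement";
```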

![Schema list](../assets/images/user/schema_list.png)

## Manage schemas

The following section covers how to view, create, update and remove
schemas.

### View schema

1.  Click **Schemas** in the left side menu to list all schemas.
2.  Click the name of a schema to view more information.

### Create schema

1.  Click **Create new** from the top-right corner.

2.  Enter the following information.

      -----------------------------------------------------------------------
      Field                                             Description
      ------------------------------------------------- ---------------------
      Name                                              The name of the
                                                        schema (required)

      Description                                       A description of the
                                                        schema

      Tags                                              One or many tags
                                                        which describe the
                                                        schema
      -----------------------------------------------------------------------

3.  Click **Create**.

### Edit schema

1.  Find and click the schema to edit in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Edit**.
4.  Update the relevant fields.
5.  Click **Save**.

### Remove schema

1.  Find and click the schema to remove in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Remove**.

## Permanent schemas

In AP, one or more schemas are defined as *permanent*. These schemas are
built into the data warehouse and cannot be renamed or removed. It is
advisable not to overuse the permanent schemas, and instead create more
specific schemas for each type of tables and views. The table below
describes the permanent schemas for each data warehouse type.

  Data warehouse         Name           UID
  ---------------------- -------------- -------------
  ClickHouse             default        J5bHYonzwDY
  Amazon Redshift        public         TPUfm314K8k
  Google BigQuery        baoanalytics   Pm7xpuFfueX
  Microsoft SQL Server   dbo            B7zjADK2Jin
  PostgreSQL             public         SRndd67ndLP
  Azure Synapse          dbo            Qxkm9zeMGPl
  Azure Synapse          guest          P91CTQou2sN
  Azure Synapse          sys            ZMX22Oo4UK1

# Variables

## Overview

Variables in AP are named key-value pairs that you can define once
and reuse across your data pipelines, integrations, and configurations.
They are especially useful for managing values that change between
environments, like development, staging, and production, or that are
used repeatedly in different workflows, such as database credentials and
API tokens.

The variables page in AP provides a central location where you can
create, update, and view all available variables in your workspace. Each
variable has a name and a corresponding value, and once defined, it can
be referenced anywhere in your configuration using the
`${VARIABLE_NAME}` format. For example, referencing `${DB_USERNAME}` in
a connection or script will automatically substitute the value of the
`DB_USERNAME` variable at runtime.
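
As a sketch, a SQL script could reference a variable which is
substituted before execution. The `COUNTRY_CODE` variable, view name and
table here are hypothetical.

``` sql
-- '${COUNTRY_CODE}' is replaced with the value of the COUNTRY_CODE
-- variable at runtime (hypothetical example).
create view demo.country_summary as
select
  d."DataItem",
  sum(d."Value") as "DataValue"
from
  demo.demo d
where
  d."OU Country" = '${COUNTRY_CODE}'
group by
  d."DataItem";
```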

Variables can be either plain or secure. Secure variables are always
hidden from logs and user interfaces to protect sensitive information.
To mark a variable as secure, check the secure checkbox when creating or
editing it. This is recommended for variables that contain secrets like
passwords, API keys or access tokens. Plain variables, on the other
hand, are visible in the interface and logs and are suitable for
non-sensitive values such as configuration flags or environment labels.
Both secure and plain values are encrypted at rest.

Using variables in AP brings several benefits. It reduces duplication by
allowing you to define a value once and reuse it everywhere. It also
enhances security by minimizing the risk of exposing sensitive values in
logs or user interfaces. Most importantly, it makes your workflows more
portable and easier to maintain: if a value changes, you only need to
update it in one place.

![Variable list](../assets/images/user/variable_list.png)

## Manage variables

The following section covers how to view, create, update and remove
variables.

### View variables

1.  Click **Variables** to list all variables.

### Create variable

1.  Enter the following information in the input fields at the top of
    the page.

      Field     Description
      --------- ----------------------------------
      Name      Variable name
      Value     Variable value
      Secured   Whether to always hide the value

2.  Click **Create**.

### Update variable

1.  Find the variable to edit in the list.
2.  Click the **Update** icon next to the variable.
3.  Edit the variable name or value.
4.  Click the **Done** icon next to the variable.

### Remove variable

1.  Find the variable to remove in the list.
2.  Click the **Remove** icon next to the variable.

## Example

1.  Click **Variables**.
2.  Create a variable with name `DHIS2_HQ_USERNAME` with the name of a
    DHIS2 user account.
3.  Click **Data catalog**.
4.  Create or edit a DHIS2 data pipeline.
5.  Click **Web API**.
6.  In the **Username** field, enter `${DHIS2_HQ_USERNAME}`.
7.  Click **Save**.

# Firewall rules

## Overview

!!! tip "Note"

    Firewall rules are only supported when using MS SQL Server and Azure
    Synapse as the data warehouse.

A firewall is a network security system that controls and restricts
incoming and outgoing network traffic based on predefined security
rules. In the context of the Analytics Platform (AP), firewall rules
play a key role in protecting your data warehouse by ensuring that only
trusted sources can establish direct connections.

By default, direct access to the data warehouse is disabled for all
external sources. This is a security-first approach to ensure that your
data is protected against unauthorized access, potential breaches, and
misuse. All data processing, ingestion, and analysis tasks are performed
within the AP environment unless explicit access has been granted.

In cases where you need to connect to the data warehouse from external
desktop applications, such as Power BI, Tableau, or other Business
Intelligence (BI) tools, you can define firewall IP rules to allow
specific IP addresses to connect. This enables secure and controlled
access from your local environment, allowing you to build dashboards,
run custom queries, or analyze data directly from your preferred tools.
Always ensure that only trusted IP addresses are added to minimize
security risks.

![Firewall rule list](../assets/images/user/firewall_rule_list.png)

## Manage firewall rules

The following section covers how to view, create, and remove firewall
rules.

### View firewall rule

1.  Click **Firewall rules** in the left side menu.

### Create firewall rule

1.  Enter the following information in the input fields at the top of
    the page.

      Field     Description
      --------- --------------------------------------
      Name      Rule name
      Address   IP address to allow connections from

2.  Click **Add rule**.

### Remove firewall rule

1.  Find the rule to remove in the list.
2.  Click the **Remove** icon next to the rule.

# Sharing and access control

## Overview

AP provides an access control model referred to as *sharing*. The
sharing model works at the *object* level, where an object is a specific
instance of one of the various entities in AP. The following entities
support sharing in AP.

- Data pipelines
- Views
- Scripts
- Data quality checks
- Destinations
- Workflows

The sharing model controls which users and user groups can view and edit
specific objects in AP.

AP provides user groups, which allow for grouping of users and can be
granted access to objects. The sharing model is compatible with the
*RBAC* security model, where user groups represent roles within the
organization, and can be granted access at the role level.

![Sharing overview](../assets/images/user/sharing_overview.png)

### Who has access

The first dimension is *access* and defines *who* has access to an
object. The following three access levels exist.

  Level        Description
  ------------ --------------------------------------------------
  Public       All authenticated users within the organization
  User group   Users which are members of a specific user group
  User         Specific users

Here, *public* refers to all authenticated users within the client
organization, and **not** anyone on the Internet.

### What actions are allowed

The second dimension is *permission* and defines what *actions* a user
is allowed to perform on an object. The following three permission
levels exist.

  Level      Description
  ---------- ---------------------------
  Can view   Read permission
  Can edit   Read and write permission
  None       No permission

Here, *read* is the ability to view information about an object, while
*write* is the ability to create, update and delete an object.

The combination of *who* has access and what *actions* those users are
allowed to perform on specific objects defines the sharing model in AP.

## Managing sharing

The following section covers how to set and update sharing for an
object.

![Sharing user](../assets/images/user/sharing_user.png)

### Open sharing dialog

1.  In the list of objects (e.g. data pipelines), click the name of the
    object to view more information.
2.  Click the context menu in the top-right corner.
3.  Click **Share**.

### Set who has access

1.  Enter the name of the user group or user in the search input field.
2.  Check the checkbox next to the user group or user to share the
    object with.
3.  Click anywhere outside the search dialog to close it.

### Set what actions are allowed

1.  Next to the **Public** label, select *Can view*, *Can edit* or
    *None* from the drop-down. To remove public access altogether,
    select *Restricted* from the **Public** drop-down.
2.  Next to each user group and user, select *Can view*, *Can edit* or
    *Remove access* from the drop-down.
3.  Click **Save** to store the sharing settings.

### Data warehouse sharing

When creating a user in AP, a corresponding data warehouse user account
is automatically created with the same username. This user account can
be leveraged for direct connections to the data warehouse, e.g. from
desktop BI and data analysis tools. The data warehouse user account
inherits the sharing access from the owning AP user account. User access
for data pipelines, datasets and views in the AP data catalog is
replicated for the associated tables and views in the data warehouse.

# Settings

## Overview

The settings page allows for specifying desirable platform behavior.
Settings work at the organization level, also called client or tenant.

![Setting overview](../assets/images/user/setting_overview.png)

## Settings

The following settings are available.

  -----------------------------------------------------------------------
  Setting                                            Description
  -------------------------------------------------- --------------------
  Email alert                                        Who should receive
                                                     an email
                                                     notification if a
                                                     pipeline or
                                                     materialized view
                                                     fails to refresh due
                                                     to an error. Enter
                                                     value as one or many
                                                     email addresses,
                                                     separated by comma.

  Access level                                       The default sharing
                                                     level for new
                                                     objects, such as
                                                     data pipelines,
                                                     views and
                                                     destinations, to be
                                                     either public or
                                                     private.

  Data pipeline error handling                       Whether staging
                                                     tables should be
                                                     cleaned up, meaning
                                                     removed, or retained
                                                     when a data pipeline
                                                     data load operation
                                                     fails. Staging
                                                     tables are temporary
                                                     database tables
                                                     which are created as
                                                     part of the data
                                                     loading process.
                                                     Retaining staging
                                                     tables can be
                                                     helpful for
                                                     troubleshooting. As
                                                     standard practice,
                                                     cleaning up staging
                                                     tables is advisable.

  Retain temporary files                             Whether temporary
                                                     data files generated
                                                     during data loading
                                                     processes in the
                                                     platform backend
                                                     should be retained
                                                     to facilitate
                                                     debugging and
                                                     troubleshooting.
                                                     Note that enabling
                                                     this property should
                                                     only be done by a
                                                     system administrator
                                                     for short periods of
                                                     time.
  -----------------------------------------------------------------------

After specifying one or many settings, click **Save** to apply the
changes, or click **Discard** to discard them.

# Connection information

## Overview

The connection information page provides you with information about the
data warehouse integrated in AP. This information is useful when
connecting desktop applications, like Power BI, Tableau and other BI
tools, or cloud services, like AWS and Azure.

When connecting directly to the AP data warehouse, it is typically
required to open a port in the firewall. Make sure to follow best
security practices when allowing direct connections.

If your AP user account has permission for accessing the data warehouse,
this means that a corresponding user account exists in the data
warehouse. You can authenticate to the data warehouse using the same
password as you use for logging in to AP.

![Connection
information](../assets/images/user/connection_information.png)

## Information

The connection information page offers the following information.

  Field      Description
  ---------- ----------------------------------
  Provider   Data warehouse management system
  Hostname   Hostname of data warehouse
  Port       Port of data warehouse
  Database   Database name
  Username   Database username
  Password   Use the AP user account password

# Data warehouses

## Overview

AP supports the following data warehouses.

- ClickHouse
- Amazon Redshift
- Azure SQL Database
- Azure Synapse
- Google BigQuery
- Microsoft SQL Server
- PostgreSQL

Notes:

- ClickHouse is the default data warehouse for AP.
- Azure SQL Database is a managed database service in the Azure cloud
  based on Microsoft SQL Server.
- Azure Synapse is a cloud data warehouse which largely adheres to
  Microsoft SQL Server data types and SQL syntax.

## ClickHouse

ClickHouse is an open-source columnar database management system
optimized for online analytical processing. It enables fast data
insertion and real-time query performance, making it well-suited for
handling large volumes of data.

  Topic            Value
  ---------------- ---------
  Default port     8123
  Default schema   default

### Data type mapping

  AP                        ClickHouse
  ------------------------- -------------
  Small int                 Int16
  Integer                   Int32
  Big int                   Int64
  Numeric                   Decimal
  Real                      Float32
  Double                    Float64
  Boolean                   Bool
  Char                      String
  NChar                     String
  Varchar                   String
  NVarchar                  String
  Text                      String
  NText                     String
  Date                      String
  Timestamp                 DateTime64
  Timestamp with timezone   DateTime64
  Time                      DateTime64
  Time with timezone        DateTime64
  Geometry                  String
  JSON                      String
  Binary                    FixedString

## Microsoft SQL Server

Microsoft SQL Server is a relational database management system
developed by Microsoft, designed to support a wide range of data
applications, including transaction processing, business intelligence,
and analytics.

  Topic            Value
  ---------------- -------
  Default port     1433
  Default schema   dbo

### Data type mapping

  AP                        SQL Server
  ------------------------- ----------------
  Small int                 smallint
  Integer                   int
  Big int                   bigint
  Numeric                   numeric
  Real                      real
  Double                    float
  Boolean                   varchar
  Char                      char
  NChar                     nchar
  Varchar                   varchar
  NVarchar                  nvarchar
  Text                      varchar
  NText                     nvarchar
  Date                      date
  Timestamp                 datetime2
  Timestamp with timezone   datetimeoffset
  Time                      time
  Time with timezone        time
  Geometry                  varbinary
  JSON                      nvarchar
  Binary                    varbinary

## PostgreSQL

PostgreSQL is a powerful, open-source object-relational database system
known for its robustness, scalability, and support for advanced SQL
compliance. It offers a wide range of features, including complex
queries, foreign keys, triggers, views and transactional integrity.

  Topic            Value
  ---------------- --------
  Default port     5432
  Default schema   public

### Data type mapping

  AP                        PostgreSQL
  ------------------------- ------------------
  Small int                 smallint
  Integer                   integer
  Big int                   bigint
  Numeric                   numeric
  Real                      real
  Double                    double precision
  Boolean                   boolean
  Char                      char
  NChar                     char
  Varchar                   varchar
  NVarchar                  varchar
  Text                      varchar
  NText                     varchar
  Date                      date
  Timestamp                 timestamp
  Timestamp with timezone   timestamptz
  Time                      time
  Time with timezone        timetz
  Geometry                  geometry
  JSON                      json
  Binary                    bytea

# Super BI

## Overview

AP and the *Super BI* web app allow for embedded data visualizations and
business intelligence (BI) with Apache Superset integrated within DHIS2.

## Apache Superset

*Apache Superset* is an open-source data exploration and visualization
platform designed to be intuitive and highly accessible for business
intelligence purposes. Superset is integrated in AP in the following
ways.

- **Single Sign-On:** SSO provides a seamless user experience as users
  can sign in once and later navigate between AP and Superset without
  having to log in again.
- **Datasets for views:** When a user creates a view in AP, a
  corresponding dataset is automatically created in Superset. The
  dataset can be used as basis for Superset charts and dashboards.
- **Embedded dashboards:** Superset dashboards can be embedded in DHIS2
  with the Super BI web app, allowing for exploration of data stored in
  AP from within DHIS2.

The following is an overview of the data analytics model in Apache
Superset.

- **Dataset:** A data table, which can be *physical*, meaning based on a
  data
  warehouse table, or *virtual*, meaning based on a SQL query.
- **Chart:** A visualization, such as a column chart, bar chart, line
  chart, bubble chart, box plot, tree map, table or pivot table.
- **Dashboard:** A collection of visualizations which are organized and
  arranged to provide a comprehensive view of your data at a glance.

For more information about Apache Superset, consult the [official
documentation](https://superset.apache.org/docs/intro/).

![Superset dashboard](../assets/images/user/superset_dashboard.png)

## Manage embedded dashboards

The following section covers how to view, create, share and remove
embedded dashboards.

### Create embedded dashboard

The following describes the high-level flow for embedding a dashboard
with Super BI.

1.  Create a dashboard in Apache Superset.
2.  Enable embedding for the dashboard and take note of the embed ID.
3.  Create a dashboard in Super BI and use the embed ID to embed the
    Superset dashboard.

The following describes the steps in more detail.

**Create Superset dashboard**

1.  In Apache Superset, click **Dashboards**.
2.  Click the **+ Dashboard** button from the top-right corner, which
    will open the dashboard screen.
3.  Drag charts from the right-side bar.
4.  Click **Save**.

**Enable embedding for Superset dashboard**

1.  Click the three-dot context menu in the top-right corner.
2.  Click **Embed dashboard**, which will open the embed dialog.
3.  Click **Enable embedding**, which will reveal an embed ID.
4.  Copy and store the embed ID.

![Superset dashboard embed
ID](../assets/images/user/superset_dashboard_embed_id.png)

**Create Super BI dashboard**

1.  In the Super BI web app for DHIS2, click the **+** button in the
    top-left corner.

2.  Enter the following information.

      -----------------------------------------------------------------------
      Field                                                  Description
      ------------------------------------------------------ ----------------
      Name                                                   The name of the
                                                             dashboard
                                                             (required)

      Superset embed ID                                      The Superset
                                                             embed ID
                                                             previously
                                                             retrieved from
                                                             Superset
                                                             (required)

      Restrict data by org unit hierarchy                    Whether to
                                                             restrict data
                                                             queries by the
                                                             user data
                                                             analysis org
                                                             unit hierarchy
      -----------------------------------------------------------------------

3.  Click **Save**.

![Super BI create
dashboard](../assets/images/user/superbi_create_dashboard.png)

### Edit Super BI dashboard

1.  Click the name of the dashboard to edit in the top bar.
2.  Click **Edit** next to the dashboard name.
3.  Edit relevant values.
4.  Click **Save**.

![Super BI edit
dashboard](../assets/images/user/superbi_edit_dashboard.png)

### View Super BI dashboard

To view a Super BI dashboard, click the name of the dashboard in the top
bar.

![Super BI dashboard for
immunization](../assets/images/user/superbi_dashboard.png)

### Share Super BI dashboard

Super BI dashboards support the regular DHIS2 sharing and access control
model. This means that dashboards can be shared publicly, with user
groups and with users. View and edit permissions can be granted for each
subject.

1.  Select the dashboard from the top bar.
2.  Click **Share** from the dashboard bar.
3.  Share the dashboard with the appropriate subjects and permissions.
4.  Click **Close**.

### Remove Super BI dashboard

1.  Select the dashboard from the top bar.
2.  Click **Delete**.
3.  In the confirmation dialog, click **Delete** again.

## Org unit hierarchy access control

Super BI allows for restricting access to data by org unit hierarchy.
This is similar to the DHIS2 org unit hierarchy access control solution.

### Restrict data

To restrict data queries by org unit hierarchy for a specific dashboard,
select the checkbox **Restrict data by org unit hierarchy** in the
create or edit dashboard dialog.

Access control is based on the *data output and analysis org units*
associated with the DHIS2 user account. Access includes the
sub-hierarchy of the selected org units. This means that by associating
a user account in DHIS2 with one or many data output and analysis org
units, access to data in Super BI will be restricted to those org units
and org units in their respective sub-hierarchies.

Access control by org unit hierarchy requires that a column
`OrgUnitHierarchyPath` is present in all datasets, including tables and
views, underlying the specific dashboard. This column is available in
the data tables generated by AP, such as the aggregated data value
analytics table, the program enrollment analytics tables and the program
event analytics tables, as well as the org unit and org unit structure
metadata tables.

### Access control

Access control in Super BI is based on *row-level security* through
Superset inserting a SQL filter clause in all data queries for datasets
used in the specific dashboard. The filter is of the form
`"OrgUnitHierarchyPath" like '{user-org-unit-path}%'`. The user org unit
path value is a concatenation of the identifiers of the hierarchy
ancestors of the org unit, starting with the root org unit. For example,
for an org
unit with identifier `qjboFI0irVu` at level 4, the path value is
`/ImspTQPwCqd/TEQlaapDQoK/vn9KJsLyP5f/qjboFI0irVu`, where the first
identifier `ImspTQPwCqd` represents the root org unit, and the last
identifier `qjboFI0irVu` represents the org unit itself.

By inserting a SQL `like` filter on `OrgUnitHierarchyPath` with the
paths of the data output org units of the current DHIS2 user, the
returned data records will be effectively limited to the org units of
the sub-hierarchies of these org units. The filter is inserted into data
queries by Superset at runtime.
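
As an illustration, for a user whose data output org unit is
`qjboFI0irVu`, a dashboard query against a hypothetical dataset would
effectively run with the filter appended:

``` sql
-- Illustrative only: the effective query after Superset inserts the
-- row-level security filter at runtime.
select *
from demo.demo
where "OrgUnitHierarchyPath" like
  '/ImspTQPwCqd/TEQlaapDQoK/vn9KJsLyP5f/qjboFI0irVu%';
```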

## Solution

The Super BI web app and solution in AP provide several benefits with
regard to data storage, access, visualization and analytics.

- The comprehensive data visualization capabilities of Apache Superset
  can be utilized within DHIS2.
- Users can utilize their existing DHIS2 accounts, removing the need to
  introduce a new set of user accounts.
- Users will see and use a regular DHIS2 web app, minimizing the need
  for training.
- Access to the Super BI web app and dashboards is controlled with
  regular DHIS2 user roles.
- Data can be queried and processed directly in the high-performance AP
  data warehouse.
- Data can be analyzed without having to be loaded into the DHIS2
  database.
- The DHIS2 data pipeline in AP provides near real-time access to new
  and updated data.

## Architecture

The following diagram describes the DHIS2 / Superset / AP architecture.

![DHIS2 Superset
architecture](../assets/images/user/ap_dhis2_superset_architecture.png)

## Examples

This section provides examples of Super BI dashboards.

![Super BI dashboard for
ANC](../assets/images/user/superbi_dashboard_anc.png)

![Super BI dashboard for
TB](../assets/images/user/superbi_dashboard_tb.png)

# Apache Superset

Apache Superset is a modern, open-source data exploration and
visualization platform designed to be intuitive and accessible for users
of all technical skill levels. Superset empowers individuals and teams
to explore and analyze their data through an easy-to-use web interface.
It offers an intuitive chart builder, allowing users to create a wide
variety of visualizations by simply dragging and dropping fields.
Superset offers a rich library of over 40 built-in visualization types,
ranging from basic charts to sophisticated geospatial visualizations.

AP allows users to connect any modern BI tool. Apache Superset is
specifically provided with AP, providing an integrated user experience.
The integration includes several levels.

- **Single Sign-On:** AP and Superset participate in Single Sign-On
  (SSO), allowing users to log in to AP and then access Apache Superset
  without logging in again.
- **Datasets:** When creating a view in AP, a dataset in Superset is
  automatically created, making it easy and efficient to produce
  visualizations.
- **Embedded dashboards:** Superset dashboards can be integrated in
  operational systems like DHIS2 using the Super BI web app for DHIS2.

![Superset dashboard for
immunization](../assets/images/user/superset_dashboard_immunization.png)

More coming soon!

# Users

## Overview

AP provides user and user group management.

Consult the *Sharing* section regarding object level access control.

## Permissions

The authorization model in AP is based on granting user accounts
individual permissions to perform actions in the platform.

Most permissions follow a *View* and *Manage* model.

- *View* means the ability to view information about objects of a
  specific entity.
- *Manage* means the ability to create new objects, edit existing
  objects and remove objects of a specific entity. The *Manage*
  permission includes the *View* permission, in other words, if a user
  is granted *Manage*, the user is implicitly granted *View* permission.

The following permissions are supported.

### Admin

  -----------------------------------------------------------------------
  Permission                                           Description
  ---------------------------------------------------- ------------------
  Super Admin                                          Perform all
                                                       actions in the
                                                       system
                                                       (super-user)

  -----------------------------------------------------------------------

### Analytics Platform

  -----------------------------------------------------------------------
  Permission                                           Description
  ---------------------------------------------------- ------------------
  Access to Analytics Platform                         Access to AP; a
                                                       corresponding user
                                                       account is created
                                                       in the AP data
                                                       warehouse

  View data for all data pipelines                     Whether the data
                                                       warehouse user
                                                       account can view
                                                       all data tables

  Data pipelines                                       View or manage
                                                       data pipelines

  Schemas                                              View or manage
                                                       schemas

  Variables                                            View or manage
                                                       variables

  Settings                                             Manage settings

  Views                                                View or manage
                                                       views

  Data quality checks                                  View or manage
                                                       data quality
                                                       checks

  Data quality check groups                            View or manage
                                                       data quality check
                                                       groups

  Firewall rules                                       Manage firewall
                                                       rules

  Workflows                                            View or manage
                                                       workflows

  Destinations                                         View or manage
                                                       destinations
  -----------------------------------------------------------------------

### Users

  Permission    Description
  ------------- --------------------
  Users         Manage users
  User groups   Manage user groups

## Data warehouse users

When granting the *Access to Analytics Platform* permission to an AP
user, a corresponding data warehouse user account is automatically
created with the same username. This user account can be leveraged for
direct connections to the data warehouse, e.g. from desktop BI and data
analysis tools. The data warehouse user account inherits the sharing
access from the owning AP user account. User access for data pipelines,
datasets and views in the AP data catalog is replicated for the
associated tables and views in the data warehouse. In other words, when
read access to a data pipeline or dataset is granted or denied for the
AP user, the same level of access will be granted or denied for the
underlying tables and views in the data warehouse for the corresponding
data warehouse user account.

## Managing users

The following section covers how to view, create, update and remove
users.

### View user

1.  Click **Users** in the left-side menu to list all users.
2.  Click the name of a user to view more information.

### Create user invitation

Users in AP are primarily created by emailing an invitation to create a
user account to the relevant person. This allows the person to set their
own password, avoiding the need to share passwords through a separate
channel.

1.  Click **Add new user**.

2.  Enter the following information.

      -----------------------------------------------------------------------
      Field                                             Description
      ------------------------------------------------- ---------------------
      Name                                              The full name of the
                                                        user (required)

      Username                                          The username of the
                                                        user (required and
                                                        unique)

      Email                                             The email address of
                                                        the user (required)

      Start page                                        The space to use as
                                                        start page when the
                                                        user logs in

      Enable SSO                                        Whether to enable
                                                        Single Sign-On for
                                                        the user account

      Permissions                                       Select the
                                                        permissions to grant
                                                        to the user
      -----------------------------------------------------------------------

3.  Click **Send invitation**.

### Edit user

1.  Find and click the user to edit in the list.
2.  Find the section which contains the information to edit.
3.  Click the edit icon in the top-right corner of the section.
4.  Update the relevant fields.
5.  Click **Save**.

### Reset password

1.  Find and click the user for which to reset the password in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Reset password**.

### Disable user

1.  Find and click the user to disable in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Disable**.

### Remove user

1.  Find and click the user to remove in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Remove**.

## Managing groups

The following section covers how to view, create, update and remove user
groups.

### View group

1.  Click **Groups** in the left-side menu to list all user groups.
2.  Click the name of a group to view more information.

### Create group

1.  Click **Add new group**.

2.  Enter the following information.

      -----------------------------------------------------------------------
      Field                                             Description
      ------------------------------------------------- ---------------------
      Name                                              The name of the group
                                                        (required and unique)

      Code                                              The code of the group
                                                        (required)

      Description                                       A description of the
                                                        group
      -----------------------------------------------------------------------

3.  Click **Add new group**.

### Edit group information

1.  Find and click the group to edit in the list.
2.  Click the edit icon in the top-right corner of the group information
    section.
3.  Update the relevant fields.
4.  Click **Save**.

### Add and remove user group members

1.  Find and click the group to edit in the list.
2.  Click the edit icon in the top-right corner of the users section.
3.  Enter the search criteria for the user to add or remove as a member
    in the search input field.
4.  Select or unselect the checkboxes next to the names of the users to
    add or remove.
5.  Click **Done**.

### Remove group

1.  Find and click the group to remove in the list.
2.  Click the context menu in the top-right corner.
3.  Click **Remove**.

# Convo

## Overview

AP and the Convo app for DHIS2 enable conversational analytics and
visualization of DHIS2 data. The web app offers a simple and clean
interface for asking questions about your DHIS2 data in natural text.

The app will create meaningful data visualizations based on the user
question and the data response; a simplified sketch of this selection
logic follows the list below.

- For time series data, line charts will be created.
- For comparison of administrative and organizational units, column
  charts will be created.
- For categorical data, pie and sunburst charts will be created.
- For calculated data using percentages, gauge charts will be created.
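
The actual selection heuristics in Convo are not published; the Python
sketch below only illustrates the kind of mapping the list above
describes, and the shape checks are assumptions.

```python
# Illustrative only: a simplified stand-in for Convo's chart selection.
# The real heuristics are not published; these checks are assumptions.
def select_chart_type(has_time_series: bool, compares_org_units: bool,
                      is_categorical: bool, is_percentage: bool) -> str:
    if has_time_series:
        return "line"
    if compares_org_units:
        return "column"
    if is_categorical:
        return "pie"        # or "sunburst" for nested categories
    if is_percentage:
        return "gauge"
    return "table"          # fall back to a plain data table

# A time series question yields a line chart.
print(select_chart_type(True, False, False, False))  # -> line
```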

![Convo line chart](../assets/images/user/convo_line_chart.png)

The app will provide a range of output types based on the user input.

- Charts
- Data table
- Interpretation

The Convo app offers a new take on data visualization. Creating a large
number of visualizations and dashboards up front is tedious and often
presents a challenge for users in finding relevant information. BI and
data visualization tools are often complex and difficult to use for
users without extensive training. With Convo, data visualization and
tables are created instantly and on demand based on the user query.

![Convo column and sunburst
chart](../assets/images/user/convo_column_chart.png)

## Ask questions and get answers

To get answers, simply ask questions in natural text. Include relevant
data items in the question.

To download a chart as a PNG image, click the download icon in the
top-right corner of the visualization.

To view a chart in full-screen mode, click the expand icon in the
top-right corner of the visualization.

To download a data table as a CSV file, click the download icon in the
top-right corner of the data table.

![Convo gauge chart and data
interpretation](../assets/images/user/convo_gauge_chart.png)

## Data retrieval logic

The Convo app retrieves data using the following logic.

- Analyses the user's natural text question and identifies DHIS2
  dimensional metadata, such as indicators, time periods and org units.
- Sorts data based on time period.
- Analyses the structure of the retrieved data and identifies relevant
  chart types.
- Retrieves data for relevant chart types using relevant dimensions and
  filters.
- Renders relevant charts.
- Creates a textual interpretation of the data.

The app supports the following dimensional metadata; several of these
appear in the sketch after the list.

- Data elements
- Indicators
- Program indicators
- Org units
- Org unit levels
- Fixed time periods
- Relative time periods
- Org unit group sets
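
As a concrete illustration of the retrieval step, the sketch below shows
the kind of DHIS2 analytics API request a question could translate into.
The server URL, credentials and indicator UID are placeholders, and the
code assumes the Python `requests` library; it is illustrative, not the
app's actual implementation.

```python
import requests

# Placeholder server, credentials and indicator UID. The dimension
# parameters follow the standard DHIS2 analytics API syntax.
BASE_URL = "https://dhis2.example.org"
params = {
    "dimension": [
        "dx:Uvn6LCg7dVU",     # indicator (placeholder UID)
        "pe:LAST_12_MONTHS",  # relative time period
        "ou:LEVEL-2",         # org unit level
    ],
}

response = requests.get(f"{BASE_URL}/api/analytics.json",
                        params=params, auth=("admin", "district"))
response.raise_for_status()
data = response.json()

# Each row holds the dimension items in order, followed by the value.
print(data["rows"][:3])
```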

# Use cases

## Overview

This page describes common use-cases for data analytics with Analytics
Platform.

## Natural language queries and analytics

AP provides the following main capabilities for natural language
analytics, also referred to as conversational analytics.

- **Convo web app:** Convo is a web app for DHIS2 for conversational
  analytics which allows users to ask questions about their data using
  natural text. Convo understands natural text questions and uses DHIS2
  API calls to retrieve the relevant information. See the "Convo"
  documentation page.
- **Data browser:** The data browser in AP allows users to ask questions
  about their data using natural text. The text-to-SQL engine converts
  questions into SQL queries that retrieve the relevant data (see the
  illustrative sketch after this list).
- **Script editor:** The script editor supports Python and R scripting
  and allows users to declare what type of data analytics and data
  science they want to achieve in natural text. The text-to-script
  engine converts the natural text input into Python and R scripts.
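
As a purely illustrative example of what the text-to-SQL conversion
could produce, the snippet below pairs a natural text question with a
query of roughly the expected shape; the schema, table and column names
are invented for the sketch.

```python
# Illustrative only: invented schema, table and column names.
question = "What were the total malaria cases per region in 2024?"

# A query of roughly the shape a text-to-SQL engine might generate.
generated_sql = """
SELECT region, SUM(cases) AS total_cases
FROM health.malaria_cases
WHERE year = 2024
GROUP BY region
ORDER BY total_cases DESC;
"""

print(f"Q: {question}\nSQL: {generated_sql}")
```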

## Geospatial visualization and maps

AP does not provide a native geospatial visualization and maps (GIS)
component. The following systems are recommended to be used with AP for
geospatial analysis:

- **DHIS2 Maps:** The DHIS2 platform features a powerful, web-based
  thematic mapping and geospatial visualization tool. DHIS2 is
  integrated with AP through the DHIS2 data pipeline.
- **ArcGIS**: ArcGIS from Esri is a comprehensive geographic information
  system (GIS) platform that enables users to create, manage, analyze,
  and map spatial data to gain actionable insights. ArcGIS is integrated
  through the DHIS2 to ArcGIS Connector app.

## Audit trail

AP includes audit records when loading data from source systems with
data pipelines, assuming that the source system stores audit
information. The main entities in AP, including workflows, data
pipelines, data quality check groups, materialized views and
destinations, create and store a detailed change log for every task. For
example, every time a workflow runs, whether scheduled or ad hoc, a
change log entry with detailed logs is stored. The change log includes
the time the task started and completed. For ad-hoc runs, the change log
includes information about which user started the task. This ensures
auditability of the system, both for source system data changes and for
operations within AP including data loading and transformation.

## Project-based configuration model

AP supports stand-alone deployment in on-premise server environments. AP
is designed with a multi-tenant architecture. These features allow for
setting up AP in a way where multiple projects or teams within the same
organization can work independently and in isolation from each other. To
achieve this, separate clients for each team are created. This allows
each team to create data pipelines, views, scripts and workflows, as
well as a dedicated data warehouse, independently of each other.

## Unified and integrated data layer

AP allows for importing data from data sources into a central data
warehouse through the data pipeline system. Each data pipeline creates
and loads data into several data tables in a dedicated data warehouse
schema. SQL views and Python and R scripts are used to join the data
tables, either by specifying SQL and script code directly, or from
natural text, where AP converts the natural text input to SQL and script
code using LLM/AI technology.
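
For example, a view joining tables from two pipelines might be created
as in the sketch below. It assumes ClickHouse as the warehouse and the
open-source `clickhouse-connect` Python driver; the schema, table and
column names are hypothetical.

```python
import clickhouse_connect

# Placeholder host and credentials.
client = clickhouse_connect.get_client(host="warehouse.example.org",
                                       username="etl", password="********")

# Hypothetical schemas and tables: join facility data from a DHIS2
# pipeline with population figures from a public-dataset pipeline.
client.command("""
    CREATE OR REPLACE VIEW analytics.district_coverage AS
    SELECT f.district,
           sum(f.anc_visits) AS anc_visits,
           any(p.population) AS population
    FROM dhis2.facility_data AS f
    INNER JOIN worldbank.population AS p ON f.district = p.district
    GROUP BY f.district
""")
```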

## High-performance query execution on large datasets

AP utilizes ClickHouse as the recommended data warehouse. ClickHouse is
an open-source, column-oriented database management system specifically
designed for high-performance Online Analytical Processing (OLAP). It
allows users to generate analytical reports in real-time using SQL
queries on trillions of rows and petabytes of data with sub-second
latency. ClickHouse is designed for lightning-fast queries through a
range of features, a few of which are illustrated in the sketch after
the list.

- **Column-oriented data storage:** Reduces disk reads by only reading
  columns specified in queries.
- **Vectorized query execution:** Processes data in large blocks.
- **Massive parallelism:** Utilizes many CPUs and compute nodes for a
  single query.
- **High data compression:** Reduces disk usage and speeds up data
  reading.
- **Specialized storage engines:** Allows for high ingestion rates and
  fast data range scans.
- **Data skipping indexes:** Allows for skipping over ranges of blocks
  while reading data.
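
A minimal sketch of how some of these features surface in practice,
again assuming the `clickhouse-connect` driver; the database, table and
column names are invented.

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", username="default")
client.command("CREATE DATABASE IF NOT EXISTS demo")

# MergeTree is ClickHouse's main storage engine family; ORDER BY defines
# the sort key used for fast range scans and data skipping.
client.command("""
    CREATE TABLE IF NOT EXISTS demo.events (
        event_date Date,
        org_unit   String,
        value      Float64
    )
    ENGINE = MergeTree
    ORDER BY (event_date, org_unit)
""")

# Column-oriented storage: this query reads only the columns it
# references, however wide the table is.
rows = client.query("""
    SELECT org_unit, sum(value)
    FROM demo.events
    WHERE event_date >= '2024-01-01'
    GROUP BY org_unit
""").result_rows
print(rows)
```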

## Data capture

AP does not provide a built-in data capture app or module. Instead, AP
provides no-code, easily configured connectors (data pipelines) for a
variety of data capture solutions. The following software systems are
integrated and recommended for data capture.

- **DHIS2:** A highly flexible and configurable information system with
  support for data capture of aggregate data, survey and event data, and
  individual data. Supports desktop web clients and Android mobile
  clients. Data is ingested into AP using the DHIS2 data pipeline.
- **ODK:** Flexible and easy-to-configure software for survey data
  collection through an Android mobile app. Data is ingested into AP
  using the ODK data pipeline.
- **CSV:** Data can be collected in CSV and Excel files. Data is
  ingested into AP using the CSV file upload data pipeline; a simplified
  stand-in for this kind of load follows below.
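
The sketch below approximates what the CSV upload pipeline does, using
`pandas` and the `clickhouse-connect` driver; the file, schema and table
names are placeholders, and the actual pipeline implementation may
differ.

```python
import clickhouse_connect
import pandas as pd

# Placeholder file, host, credentials and destination table.
df = pd.read_csv("survey_results.csv")

client = clickhouse_connect.get_client(host="warehouse.example.org",
                                       username="etl", password="********")

# insert_df writes the DataFrame into an existing warehouse table.
client.insert_df("surveys.results", df)
```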

## Data governance

AP supports the key aspects of data governance through its extensive
feature set.

- **Data quality:** The data quality checks, SQL views, scripting
  environment, workflows and notification system in AP are useful for
  ensuring that data is accurate, complete, and consistent.
- **Data security and privacy:** The multi-dimensional sharing and
  access control system, the secure-by-default approach and compliance
  features are useful for managing the data security aspect.
- **Role based access control:** User groups in AP allow for grouping
  users by their role within the organization, enabling access and
  permission to be granted at the role level.
- **Metadata management:** The data catalog in AP allows for creating a
  detailed inventory of data sources and datasets through its rich
  metadata model.
- **Auditing:** All main entities in AP including data pipelines, data
  quality checks, views and workflows provide changelogs with detailed
  logs for every task, ensuring all activity is audited.
- **Data lifecycle management:** In addition to data loading, AP
  supports data archiving and data deletion, allowing for managing the
  lifecycle of datasets and data records.
- **Data retention and versioning:** AP retains and stores the raw data
  files for every data load operation, allowing for comparing the
  current data with historical versions.

## Role based access control (RBAC)

Role-Based Access Control (RBAC) is a security model that restricts
system access and permissions based on roles within an organization,
rather than individual user identities. It simplifies administration by
granting permissions to roles, such as *data owner*, *data steward*,
*data analyst* and *data viewer*, and not directly to individual users.
AP supports RBAC through user groups: users can be assigned to groups
depending on their role in the organization, after which user groups can
be granted access to datasets, workflows, views and other entities and
objects.

## Data archiving and removal

AP supports archiving of data. A "frozen" data pipeline can be used to
ingest data from a source system and then prohibit the pipeline from
subsequently being refreshed or deleted. This allows for archival of
historical datasets, and means the historical data can be deleted from
the source system, and instead be accessed with a data exploration or BI
tool in AP. Data can be permanently removed based on policies and time
thresholds. This is achieved through SQL statements with filters and
workflows. Automated data lifecycle management can be enforced through
scheduled workflows running on a fixed interval. The data archiving and
removal operations are audited through the workflow changelog system.
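
A sketch of what such a policy-driven removal could look like, assuming
a ClickHouse warehouse (which supports lightweight `DELETE` on MergeTree
tables), a hypothetical table and a ten-year retention threshold:

```python
import clickhouse_connect

# Placeholder host, credentials and table; ten years is an example
# retention threshold. In AP, a statement like this would run inside a
# scheduled workflow, so each execution is audited via the change log.
client = clickhouse_connect.get_client(host="warehouse.example.org",
                                       username="etl", password="********")
client.command("""
    DELETE FROM dhis2.facility_data
    WHERE event_date < now() - INTERVAL 10 YEAR
""")
```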

## Permanent removal of data

AP allows permanent and complete removal of the data associated with a
data source (data pipeline). When a data pipeline is removed, associated
data is by default deleted in its entirety, including data files stored
in the data lake, data tables in the warehouse and change log records.
Note that AP stores data files for every run of data pipelines, and that
data files for all runs are deleted as part of deleting a data pipeline.

## ETL and ELT

AP follows the *ELT* approach for data integration. The ELT (Extract,
Load, Transform) approach is a modern data integration methodology
designed to leverage the high-performance processing capabilities of
modern cloud data warehouses. Unlike traditional ETL frameworks, ELT
shifts the transformation layer to the end of the pipeline, allowing for
more agile and scalable data management.

- **Extract:** Raw data is ingested from disparate source systems, such
  as databases and APIs. AP features a range of no-code data connectors
  for this purpose.
- **Load:** Data is moved directly into the destination storage in its
  raw format. In AP this is a cloud-native or on-premise data lake and
  data warehouse like ClickHouse.
- **Transform:** Once the data is stored in the data warehouse, it is
  processed, cleaned, modeled and transformed using the AP and data
  warehouse native compute power. This includes SQL for scalable data
  processing, and integrated Python and R scripts for complex
  algorithmic logic and machine learning.

The ELT process has several advantages over the traditional ETL
approach.

- **Scalability:** Utilizes the elastic, parallel-processing power of
  cloud environments to handle massive datasets without dedicated
  staging hardware.
- **Flexibility:** By loading raw data before transformation,
  organizations maintain a single source of truth that can be
  re-processed as business requirements evolve.
- **Low latency:** Minimizes the time between extraction and
  availability, as data does not need to wait for complex transformation
  cycles before being stored.

## New connector development

AP features two main approaches for developing new connectors (data
pipelines).

- **Framework:** AP has a *connector development framework* which avoids
  boilerplate code and makes connector development more efficient. The
  AP team leverages [agent skills](https://agentskills.io/) and *agentic
  coding* for rapid development of new connectors.
- **Python scripting:** The integrated Python scripting environment is
  suitable for rapid development of light-weight and customized API
  connectors for local and bespoke software applications. Python scripts
  can be used for data loading and be integrated in AP workflows
  together with built-in connectors.

## API access

AP is designed with the *API first* principle in mind. The data model
and all platform operations are exposed in the API. This means that
clients and third-party tools, such as CLIs, HTTP clients, web apps and
BI tools, can utilize the comprehensive API to communicate with,
integrate and extend the platform. AP provides complete API
documentation based on the *OpenAPI* (Swagger) standard. The API-first
design allows for extending the platform to meet specific needs and
requirements.
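
As an illustration of API-first access, the sketch below calls a
hypothetical REST endpoint with a bearer token. The endpoint path, token
scheme and response shape are assumptions for the sketch, not documented
AP routes; the actual routes are described in the OpenAPI specification.

```python
import requests

# Hypothetical base URL, endpoint and token: consult the platform's
# OpenAPI (Swagger) documentation for the actual routes.
BASE_URL = "https://ap.example.org/api"
headers = {"Authorization": "Bearer <api-token>"}

response = requests.get(f"{BASE_URL}/pipelines", headers=headers)
response.raise_for_status()
for pipeline in response.json():
    print(pipeline)
```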

## OWASP Top 10 standard

AP maintains a rigorous security posture which embeds Static Application
Security Testing (SAST) directly into the CI/CD pipeline, ensuring every
build is automatically vetted against the OWASP Top 10 security risks.
The managed hosting environment reinforces this by deploying a Web
Application Firewall (WAF) that leverages the latest OWASP Core Rule Set
to block malicious traffic and abnormal request patterns at the network
edge.
Finally, the platform's architecture provides native protection against
high-risk vulnerabilities like cross-site request forgery (CSRF),
cross-site scripting (XSS), and session hijacking, strictly adhering to
industry-standard web security guidelines for comprehensive
application-level defense.

## Outlier detection

AP supports data outlier detection using advanced statistical methods
including the *Modified Z-score* and *Inter-quartile range* (IQR)
algorithms. These checks are implemented with SQL queries to achieve
high performance. For more complex data scenarios such as seasonal
diseases, the integrated Python scripting environment is used to
identify outliers using advanced time series decomposition, unsupervised
machine learning and statistical profiling.
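
A minimal, self-contained sketch of the Modified Z-score method follows.
The 0.6745 constant scales the median absolute deviation (MAD) to be
comparable with the standard deviation, and 3.5 is the conventional
outlier threshold; the sample values are invented. AP's own checks run
as SQL, so this Python version is illustrative only.

```python
import statistics

def modified_z_scores(values: list[float]) -> list[float]:
    """Modified Z-score: robust to outliers because it uses the median
    and the median absolute deviation (MAD) rather than mean/stddev."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return [0.0] * len(values)  # no spread: nothing can be flagged
    return [0.6745 * (v - median) / mad for v in values]

values = [52, 56, 53, 57, 55, 54, 310]  # 310 is an injected outlier
outliers = [v for v, z in zip(values, modified_z_scores(values))
            if abs(z) > 3.5]            # 3.5 is the conventional cutoff
print(outliers)  # [310]
```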

# Terminology

  --------------------------------------------------------------------------
  Term                                              Description
  ------------------------------------------------- ------------------------
  AP                                                Analytics Platform (AP)
                                                    is a software platform
                                                    for data integration and
                                                    advanced analytics.

  Apache Superset                                   Apache Superset is an
                                                    open-source data
                                                    exploration and
                                                    visualization platform
                                                    designed to be intuitive
                                                    and highly accessible
                                                    for business
                                                    intelligence purposes.

  BI                                                Business Intelligence
                                                    (BI) refers to the
                                                    technologies,
                                                    applications and
                                                    practices used to
                                                    collect, analyze,
                                                    integrate, and present
                                                    business information.

  Data pipeline                                     A series of processing
                                                    steps to move data from
                                                    a source system into the
                                                    AP. The AP implements
                                                    "ELT" (extract, load,
                                                    and transform)
                                                    pipelines.

  Data catalog                                      An inventory of the data
                                                    warehouse's datasets
                                                    (tables) as well as a
                                                    dataset's metadata such
                                                    as table name,
                                                    description, and data
                                                    types.

  Destination                                       Allows users to make
                                                    data available to
                                                    downstream/destination
                                                    systems.

  Logical view                                      A SQL query which
                                                    provides the
                                                    instructions for
                                                    creating a virtual
                                                    table, where the table
                                                    is not stored in the
                                                    database.

  Materialized view                                 A SQL view that is
                                                    stored in the database
                                                    as a table.

  RBAC                                              Role based access
                                                    control.

  Schema                                            A schema provides a
                                                    mechanism to organize
                                                    objects such as tables
                                                    and views.

  SQL                                               Structured Query
                                                    Language, a standardized
                                                    programming language
                                                    that is used to manage
                                                    relational databases and
                                                    perform various
                                                    operations on the data
                                                    in them.

  Variable                                          Text-based placeholders
                                                    that are proxies for
                                                    secrets, such as
                                                    passwords and API
                                                    tokens.
  --------------------------------------------------------------------------
