Data pipeline overview¶

Overview¶

BAO Analytics Platform (AP) offers data pipelines for ingesting data from a variety of data sources into the platform. A data pipeline is a mechanism for moving data from data sources into the AP. Within the AP, the data is ingested into a data catalog, data store and data warehouse. A data source can be an application, a blob (file) store and a data file. Data pipelines are pull based, meaning data will be pulled from the data source before being loaded in the platform. Specific data pipeline types are capable of loading data in near real-time.

The following data pipelines are supported.

Applications¶

BHIMA An open-source hospital and logistics information management system used for electronic medical records (EMR), inventory and commodity tracking and billing in low-resource settings.
CommCare: A mobile-based platform for managing front-line health programs, providing case management, data collection, and real-time decision support for community health workers.
DHIS2: A flexible, an open-source, web-based platform for collecting, analyzing and visualizing health data, widely used for managing and monitoring health programs, particularly in low-resource settings.
FHIR: Fast Healthcare Interoperability Resources (FHIR) is a healthcare data standard and framework with an API for representing and exchanging electronic health records (EHR).
Google Sheets: A cloud-based spreadsheet application within Google Drive that enables users to create, edit, and format spreadsheets online while collaborating in real time with others.
iHRIS: An open-source human resources information system designed for managing health workforce data, allowing organizations to track employee information.
Kobo Toolbox: An open-source data collection and management tool that enables field researchers and humanitarian organizations to design surveys and collect data offline or online using mobile devices.
ODK: A suite of open-source tools used for mobile data collection, especially in challenging environments, with features like offline data capture and GPS integration.
Ona: An open-source data collection and analysis platform designed for mobile and web-based surveys, commonly used for monitoring and evaluating projects in health, agriculture, and development sectors.
Talamus: A hybrid cloud-based software for patients and health-care providers to facilitate high quality health-care, providing hospitals, labs, pharmacies and imaging centers with a digital platform.

Blob stores¶

Amazon S3: A scalable and durable object storage service in the AWS cloud.
Azure Blob Storage: A scalable and durable object storage in the Microsoft Azure cloud.

Databases¶

SQL Server: A relational database management system with robust data management capabilities developed by Microsoft.
MySQL: An open-source and flexible relational database management system widely used for web applications and large datasets.
Oracle RDBMS: A powerful, enterprise-grade relational database management system designed for complex transactional processing.
PostgreSQL: A sophisticated, widely adopted and open-source relational database management system with a rich ecosystem of extensions.
Amazon Redshift: A managed and scalable data warehouse designed for fast querying of large-scale datasets in the AWS cloud.

File formats¶

CSV file upload: A simple, text-based file format used to store tabular data, where each line represents a data record and each field is separated by a comma.
Parquet file upload: A columnar storage file format designed for high-performance data querying and storage optimization in analytics workloads.

Public data portals¶

Global Health Observatory: The WHO primary data portal for health-related statistics, providing a central repository of over 1,000 indicators used to monitor and analyze global health situations and trends.
UNICEF Data: A comprehensive repository which provides globally comparable statistics and cross-country datasetsfocused on the health, education, protection, and well-being of children and women worldwide.
World Bank Open Data: A data portal which provides global access to thousands of development indicators, economic statistics, and country-level profiles to foster transparency and research.

Schedule data refresh¶

Data pipelines can be scheduled to automatically refresh data from the data source. The refresh schedule is set in the add and update data pipeline pages. There are two types of scheduling.

Regular interval: Next to the Refresh schedule label, the preferred interval for when to refresh the data source can be selected from the drop-down field. Data will be regularly refreshed from the data source at the selected interval. Data in AP will be fully replaced for every refresh, meaning, existing data will be entirely replaced with data from the source system.
Continuously updated: Next to the Refresh schedule label, the Continuously updated checkbox can be selected. Data will be continuously refreshed from the data source. Data in AP will be incrementally appended for every refresh with a one minute delay in between. In addition, data in AP will be fully replaced once per night. For the incremental appends, new, updated and removed data records in the source system are first loaded in a staging table in AP, then corresponding data records are removed, then finally the new and updated data records are loaded in AP. Continuous update scheduling is currently only supported for the DHIS2 data pipeline.

Authentication¶

Data pipelines need to authenticate to the data source for secure exchange of data records. Data pipelines typically use the following types of authentication.

API token: This method uses a token that is sent in the HTTP header of requests to authenticate with APIs. The token acts as a secure key to access the API, ensuring that only authorized users or services can interact with the data source. It is often used for RESTful APIs and provides a straightforward way to handle authentication without exposing user credentials.
API username/password: A simple authentication scheme built into the HTTP protocol, also known as basic auth. It involves sending a username and password with each HTTP request. These credentials are typically base64 encoded for transmission but are easily decoded, making this method less secure unless used in conjunction with TLS encryption to protect the credentials in transit.
Database username/password: This method involves connecting to a database using a username and password. The credentials are used to establish a JDBC or similar database connection, ensuring that only authenticated users can execute queries and access data. This traditional form of database access control is widely used due to its simplicity and direct support in most database systems.

General metadata¶

All data pipeline types have a common section titled General settings with various metadata fields. This section allows you to store extensive metadata for each data pipeline (dataset). Maintaining descriptive metadata for each dataset allows for a comprehensive data catalog, allowing users to get an overview of which datasets exists for your organization. The metadata fields include name, description, owner, URL, tags, reference, link to source and link to terms of use. Tags are entered as free text. Tags which already exist will be suggested as you type in a tag name.

Data source connections¶

AP pulls data from your data sources using a set of fixed IP addresses. To ensure that AP can connect to your data sources, you must allow list these IP addresses in your firewall. This typically only applies to databases, where a connection is made directly with the database. It typically does not apply to HTTP API-based applications which are already accessible on the Internet. It also only applies for the managed AP offering from BAO Systems, not for other deployment models.

AP environment	Region	IP address
Production	US East	3.93.131.28
Test	US East	54.173.36.156

Technical overview¶

This section explains technical aspects of the data pipeline system.

Regular data loading¶

The steps for data loading are described below.

Connect to the data source system, typically an encrypted HTTP API connection using the TCP/IP protocol, or an encrypted database connection using the JDBC or ODCB protocol.
Retrieve and write table data into buffer data files in the AP storage area. Data files use the pipe character as delimiter for text values, and are compressed with gzip using the DEFLATE algorithm.
Upload the data files into the client-specific blob store. Depending on the client-specific hosting environment, this is either an on-premise file system or public cloud blob store.
Create a staging table, which is a regular table with a _staging suffix.
Load data from the data files into the staging table.
If the data warehouse supports indexing, add indexes for appropriate tables and columns.
Drop the existing main table, and promote the staging table to become the main table. This is done by dropping the main table and renaming the staging table in an atomic operation.

Real-time data loading¶

AP supports near real-time data loading for specific data pipeline types. This behavior is referred to as continuous data loading. To enable continuous data loading for a data pipeline, select the "Continuously updated" checkbox under "Refresh schedule" in the general settings section when creating or updating a data pipeline or workflow. Continuous data loading is supported for specific data pipeline types only, and the implementation depends partly on the structure of each data pipeline source system.

The steps for continuous data loading are described below. The steps are partly overlapping with the steps for regular data loading described above. The incremental data loading tasks run with a fixed delay of one minute. The incremental data load process is typically taking between 1 and 3 minutes. This means that data becomes available in AP between 2 to 4 minutes after data is captured in the source system.

Store a timestamp for the beginning of the last successful data load task.
Connect to the data source system.
Retrieve data which were created, updated or deleted since the beginning of the last successful data load task. This is done with a SQL filter particular to the source system. Write data into buffer data files.
Upload the data files into the client-specific blob store.
Create a staging table.
Load data from the data files into the staging table.
Delete data records from the main table which match data records in the staging table using a unique record identifier.
Append data records from the staging table to the main table.

This proccess ensures that data records which were created or updated since the last data load task are effectively added or updated, and that data records which were deleted since the last data load task are deleted in the AP managed data warehouse.

Data retention and versioning¶

The data pipeline system in AP retains and stores the raw data files for every data load operation. Each data load is associated with a job identifier. The raw data files are kept in the data storage area in a directory named after the job identifier. This allows data analysts to compare the current data with historical versions and perform historical trend analysis, and system administrators to perform root-cause debugging and conduct point-in-time data re-processing.