Data Pipeline & Persistence Design

Overview

This document captures the design decisions made for the data ingestion pipeline, the persistence strategy, and the boundary between the synchronisation and analysis subsystems.

Note: The analysis subsystem and the ingestion layer will be documented in detail in subsequent documents.

System Decomposition

The system is divided into two separate subsystems with distinct responsibilities:

Subsystem         Responsibility
Synchronisation   Fetch raw data, parse it, and persist it into the DB
Analysis          Execute queries, render charts, compose dashboards and reports

The synchronisation subsystem produces tables in the DB, while the analysis subsystem consumes them.

Synchronisation Subsystem

Design Principle

Each new data type (e.g. finance, health) requires a new adapter. The adapter's role is to persist raw input from a specified data source to a specific table in the DB. Note that the adapter does not handle fetching data from the data source; that is the responsibility of a separate layer.

The alternative, allowing runtime-defined transformation logic such as user-supplied SQL expressions evaluated during ingestion, would offer more flexibility when integrating new data types, but it introduces an unacceptable security surface.
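
One way to picture this principle is a fixed, code-defined registry of adapters keyed by data type. The sketch below is illustrative only: the names ADAPTERS, adapt, finance_adapter and health_adapter, the table names, and the raw field names are all assumptions, and persistence is left out so the focus stays on the lookup. Each adapter simply returns its target table and the row it would write.

```python
from typing import Any, Callable, Dict, Mapping, Tuple

Raw = Mapping[str, Any]
Row = Tuple[str, Tuple[Any, ...]]  # (target table, row values)


def finance_adapter(raw: Raw) -> Row:
    # Hypothetical mapping for a "finance" data type.
    return ("finance_transactions", (str(raw["id"]), int(raw["amount_cents"])))


def health_adapter(raw: Raw) -> Row:
    # Hypothetical mapping for a "health" data type.
    return ("health_measurements", (str(raw["id"]), float(raw["value"])))


# The set of transformations is fixed in code; nothing is evaluated
# from user input at ingestion time.
ADAPTERS: Dict[str, Callable[[Raw], Row]] = {
    "finance": finance_adapter,
    "health": health_adapter,
}


def adapt(data_type: str, raw: Raw) -> Row:
    """Look up the adapter for data_type and apply it to one raw item."""
    try:
        adapter = ADAPTERS[data_type]
    except KeyError:
        raise ValueError(f"no adapter registered for data type {data_type!r}") from None
    return adapter(raw)
```

Under these assumptions, supporting a new data type means adding a function and a registry entry in a reviewed code change, which is the cost the design accepts in exchange for a smaller security surface.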

Adapter Contract

Each adapter takes a raw input, transforms it, and persists the result to the DB. An adapter is therefore composed of two parts:

  • Mapper: maps the raw input into a specified domain record
  • Repository: handles the interaction between the application and the DB
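
The two-part contract above could be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the Transaction record, the class names, the finance_transactions table, the raw field names, and the use of SQLite are all assumptions made for the example.

```python
import sqlite3
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Transaction:
    """Hypothetical domain record for a 'finance' data type."""
    tx_id: str
    amount_cents: int
    currency: str


class TransactionMapper:
    """Mapper: turns one raw input item into a domain record."""

    def to_record(self, raw: Mapping[str, Any]) -> Transaction:
        return Transaction(
            tx_id=str(raw["id"]),
            amount_cents=round(float(raw["amount"]) * 100),
            currency=str(raw["currency"]).upper(),
        )


class TransactionRepository:
    """Repository: owns all interaction with the DB table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS finance_transactions ("
            "tx_id TEXT PRIMARY KEY, "
            "amount_cents INTEGER NOT NULL, "
            "currency TEXT NOT NULL)"
        )

    def save(self, record: Transaction) -> None:
        # Parameterised statement; re-syncing the same tx_id overwrites the row.
        self._conn.execute(
            "INSERT OR REPLACE INTO finance_transactions VALUES (?, ?, ?)",
            (record.tx_id, record.amount_cents, record.currency),
        )
        self._conn.commit()


class FinanceAdapter:
    """Adapter: composes a mapper and a repository."""

    def __init__(self, mapper: TransactionMapper, repository: TransactionRepository) -> None:
        self._mapper = mapper
        self._repository = repository

    def ingest(self, raw: Mapping[str, Any]) -> Transaction:
        record = self._mapper.to_record(raw)
        self._repository.save(record)
        return record
```

Keeping the mapper free of DB access means it can be tested with plain dictionaries, while the repository is the only place that knows the table layout.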

Analysis Subsystem

Design Principles

The analysis subsystem is modelled on three principles: SQL-first, query-centric, and decoupled from ingestion concerns. Users interact with DB tables directly through the UI by writing queries, defining transformation views, configuring charts, and composing dashboards.
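
As a rough illustration of the SQL-first, query-centric flow, the sketch below defines a transformation view over a synchronised table and queries it. The table finance_transactions, its columns, the view monthly_spend, the helper functions, and the choice of SQLite are assumptions made for this example; the real schema and engine are not specified here.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory DB with a stand-in for a synchronised table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE finance_transactions ("
        "tx_id TEXT PRIMARY KEY, amount_cents INTEGER, "
        "currency TEXT, booked_on TEXT)"
    )
    conn.executemany(
        "INSERT INTO finance_transactions VALUES (?, ?, ?, ?)",
        [
            ("t1", 1250, "EUR", "2024-01-05"),
            ("t2", 750, "EUR", "2024-01-20"),
            ("t3", 300, "EUR", "2024-02-02"),
        ],
    )
    # Transformation view: defined on the analysis side, in plain SQL,
    # without touching the ingestion code.
    conn.execute(
        "CREATE VIEW monthly_spend AS "
        "SELECT substr(booked_on, 1, 7) AS month, currency, "
        "SUM(amount_cents) AS total_cents "
        "FROM finance_transactions "
        "GROUP BY substr(booked_on, 1, 7), currency"
    )
    return conn


def monthly_totals(conn: sqlite3.Connection, currency: str) -> List[Tuple[str, int]]:
    """Query that a chart could be configured on top of."""
    return conn.execute(
        "SELECT month, total_cents FROM monthly_spend "
        "WHERE currency = ? ORDER BY month",
        (currency,),
    ).fetchall()
```

With the sample rows above, monthly_totals(build_demo_db(), "EUR") returns [("2024-01", 2000), ("2024-02", 300)]. The view depends only on the table's schema, which is the sense in which analysis stays decoupled from ingestion.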