October 15, 2021 · By Markus Thielen

Anatomy of a Compliant Standard Data Management System

Version 1.0, October 2021

Markus Thielen -
René Herzer -

Abstract

This document describes basebox, a prototype of a data management system that combines storage (databases) with universal APIs (based on compiled GraphQL requests) and other functionality in a regulatory compliant way to form a standard data platform for data sensitive sectors.

Introduction

During the past few years, we have accompanied various digital health startups on their way from idea to prototype or product. Along the way, we learned that virtually all digital health projects require the same set of basic data management functionality. Having watched teams implement the same functionality over and over again, often making the same mistakes and spending a lot of time and money, we came up with the idea for a data management system that covers these requirements once and for all.

When beginning a new project, nobody really starts from scratch. For instance, you would never even think of creating a new database system; you would choose an existing one. The same applies to many other things like programming languages, frameworks, etc. So there is a varying level of standard functionality that everybody relies on. With basebox, we want to raise this level to a point where one can start working on project-specific things right away.

Last, but not least, we believe that existing projects can benefit from switching to basebox, too.

Scope

Assuming client applications, data collection devices, etc. are at the top of a project's technology stack, basebox will provide all functionality including and below the HTTPS/GraphQL layer.

On top of basebox, projects implement their specific frontends using about any technology they like, for example:

Browser based dashboard apps (using Vue, React, Angular, plain JavaScript...)
Smartphone apps (iOS, Android, ...)
Smart watches
Patient monitoring devices
Cameras
...

The interface to basebox uses GraphQL on top of HTTPS, both of which are simple yet almost infinitely flexible industry standards. This simplicity is important when connecting small devices, e.g. microcontroller-based designs that lack the capacity for complex frameworks.

Design Goals

basebox' design goals are strictly based on the demands for a digital health data management system. While this may seem obvious, some demands are easily overlooked or at least underrated.

Demands

We learned that, when project teams design and choose components for their technology stack, they usually do so using a common set of criteria, for example (in order of assumed precedence):

Ease of use / familiarity
Flexibility
Performance
Scalability
Security
Data Retention Policies
Data Analysis and Reporting
Regulatory Compliance

In our experience, the last four points do not receive the attention they deserve. Even when they are considered early enough, they tend to tie up a lot of project resources; if neglected, they can cause projects to fail, even after completion.

Ease of Use

Ease of use is the most obvious criterion: It pays off very quickly and saves time and money. To meet this demand, basebox features:

A setup routine based on just a GraphQL schema description and a small number of control/configuration files
A universal API that compiles and handles all schema-compliant GraphQL requests on the fly (i.e. maps database objects to GraphQL entities and vice versa)
Possibility to add custom functionality to the API (aka remote procedure calls)
Boilerplate functionality (user management, authentication, export, ...)
Comprehensive, high quality developer documentation

Performance

We are designing and building basebox from scratch with performance in mind. An important use case is the integration of patient monitoring devices, which can produce large amounts of data, especially during initial trials, before algorithms have been developed that process or aggregate the data on the devices themselves.

One of the most important design decisions affecting performance is the use of Rust as the programming language.

Rust is a compiled language (like C or C++), so it runs as native machine code that does not have to be translated at run time (as is the case for interpreted languages like PHP or JavaScript). It manages memory tightly and, unlike Java, does not require a garbage collector.

Another important factor for basebox' performance is the GraphQL query compiler. This unique feature reduces the number of database hits per query to about 1. Since in virtually all applications database I/O is the most time consuming part of serving an API request, we expect a huge performance benefit.

And last but not least, we are using one of the fastest HTTP servers, Actix Web.

Scalability

Knowing that for most projects the number one performance bottleneck is I/O, we designed basebox to support multiple databases, each running independently of the others. Based on configurable rules, basebox decides and keeps track of which patient's or device's data is stored in which database.

This not only brings the scalability factor close to 1, but also allows new databases to be added as needed when new patients or devices join.
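The routing idea can be sketched as follows. This is only an illustration under stated assumptions, not basebox' actual implementation: the `Catalog` class, the "fewest owners" placement rule, and the database names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical catalog: remembers which database holds which owner's data."""
    databases: list[str]
    assignments: dict[str, str] = field(default_factory=dict)

    def database_for(self, owner_id: str) -> str:
        # Known owners always resolve to the database they were first assigned to.
        if owner_id in self.assignments:
            return self.assignments[owner_id]
        # Assumed rule: place new owners on the database with the fewest owners.
        counts = {db: 0 for db in self.databases}
        for db in self.assignments.values():
            counts[db] += 1
        target = min(self.databases, key=lambda db: counts[db])
        self.assignments[owner_id] = target
        return target

    def add_database(self, name: str) -> None:
        # New databases can be added at any time; existing assignments stay put.
        self.databases.append(name)


catalog = Catalog(databases=["db-a", "db-b"])
first = catalog.database_for("patient-1")
second = catalog.database_for("patient-2")
catalog.add_database("db-c")
third = catalog.database_for("patient-3")
```

Because existing assignments never move in this sketch, adding a database requires no data migration; only newly joining patients or devices land on it.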

Security

Preventing data breaches is an all-important goal for basebox. Some statistics:

80% increase in people being affected by health data breaches between 2017 and 2019
$7.13 million is the average cost of a data breach in the healthcare industry: https://www.ibm.com/downloads/cas/RZAX14GX
52% of breaches are caused by malicious attacks
525 recorded medical/healthcare data leaks in the U.S. in 2019

Data breaches in the health sector are devastating for affected patients. Some examples:

Hospitals in Ireland, New Zealand and Scripps Health in San Diego are reeling from digital extortion attacks.
Hack of psychotherapy records in Finland affects thousands

Because basebox is written in Rust, a language designed for memory safety, it already stands on solid ground.

On top of that, basebox features a security concept based on Keycloak. Using Keycloak's role model, basebox allows defining which role may access which entity (database table) in which mode (read/write).
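A minimal sketch of such a role/entity/mode check is shown below. The permission table, the role names, and the `is_allowed` helper are hypothetical; the list of roles is assumed to have been extracted from the authenticated user's token beforehand.

```python
# Hypothetical permission table: role -> entity (table) -> allowed modes.
PERMISSIONS = {
    "therapist": {"patients": {"read"}, "visits": {"read", "write"}},
    "admin": {"patients": {"read", "write"}, "visits": {"read", "write"}},
}


def is_allowed(roles, entity, mode):
    """Return True if any of the user's roles grants `mode` on `entity`."""
    return any(
        mode in PERMISSIONS.get(role, {}).get(entity, set())
        for role in roles
    )
```

Unknown roles and unknown entities fall through to an empty set, so access is denied by default.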

Some users will still have broad access (e.g. doctors or therapists). Rate throttles help limit, to configurable levels, the speed at which a hijacked account of this kind can be used to extract data (e.g. a therapist can only access a given number of patient records within a given time window).
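One common way to implement such a throttle is a sliding time window per user. The sketch below illustrates the idea only; the class name, its parameters, and the injectable clock are assumptions and do not describe basebox' actual mechanism.

```python
import time
from collections import defaultdict, deque


class RecordAccessThrottle:
    """Allow at most `limit` record accesses per user within `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self._accesses = defaultdict(deque)  # user id -> recent access times

    def allow(self, user_id):
        now = self.clock()
        recent = self._accesses[user_id]
        # Drop timestamps that have fallen out of the time window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # limit reached: deny this access
        recent.append(now)
        return True
```

With `limit=100` and `window_seconds=3600`, for example, a hijacked therapist account could read at most 100 patient records per hour.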

Data Retention Policies

Most database-based projects start without any concept of data retention policies and simply start collecting data. However, apart from resource issues like disk space and performance, there are legal reasons to define how long records should stay in the database and when to delete them.

Without any concept of data retention policies, projects often run into serious problems further down the road when excess data must be removed, e.g. for legal reasons, and removing the data takes much longer than expected and impacts the performance of the live system.

basebox will be designed with data retention policies in mind and will remove excess data as soon as it is no longer needed. Where applicable, concepts for quickly removing large amounts of data will be considered.
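To illustrate one way of removing expired data without stalling a live system, the sketch below deletes old rows in small batches. It uses SQLite from the Python standard library purely for demonstration; the `purge_expired` function, the `measurements` table, and the `created_at` column (a Unix timestamp) are hypothetical.

```python
import sqlite3


def purge_expired(conn, table, max_age_days, now, batch_size=1000):
    """Delete rows older than `max_age_days` in batches; return rows removed.

    `table` must come from trusted configuration, never from user input.
    """
    cutoff = now - max_age_days * 86400
    removed = 0
    while True:
        # Small batches keep each transaction short, limiting impact on live traffic.
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE created_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            return removed
        removed += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, created_at INTEGER)")
now = 1_700_000_000
ages_in_days = (1, 10, 40, 400)
conn.executemany(
    "INSERT INTO measurements (created_at) VALUES (?)",
    [(now - age * 86400,) for age in ages_in_days],
)
removed = purge_expired(conn, "measurements", max_age_days=30, now=now, batch_size=1)
```

Here the two rows older than 30 days are removed, one per batch, while the two recent rows remain.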

Regulatory Compliance

basebox itself cannot be certified to be HIPAA or ISO 13485 compliant, as it is not a medical product (merely a part of one). Instead, basebox is an off-the-shelf (OTS) component, i.e. SOUP (software of unknown provenance) in terms of IEC 62304.

basebox' compliance goal is that no certification issues are raised for basebox components. To achieve this, basebox is developed on the basis of a quality management system that meets ISO 13485 requirements. Over time, regulators will get to know basebox, further reducing certification efforts.

Features

MVP Version

Easy setup routine based on a GraphQL schema definition and a control file
Universal GraphQL API featuring a GraphQL compiler that serves all schema compliant requests
Keycloak-based authentication
Standard user and role management via Keycloak
Data Analysis and Report Module optimized for high volume data (simplified for MVP)
Database support for PostgreSQL (other databases, including NoSQL databases, will be added later)

Easy Setup Routine

The setup routine is the first impression developers get of basebox, so we designed this process to be as smooth as possible. basebox' setup is controlled by the project's database schema description in GraphQL format and a control file. By default, the control file can be almost empty.

The default routine will create various docker machines:

Broker - HTTP API server
Database Server
Keycloak Server

Universal GraphQL API

When creating REST or GraphQL APIs, each API entrypoint, or, in GraphQL terms, query or mutation resolver, must be implemented, maintained and tested explicitly and separately. For REST APIs, this means that client applications are often forced to call multiple endpoints to get the data they require for the task at hand. Depending on the server implementation, this often leads to information returned to the client and thus database I/O that is unnecessary to fulfill the client's initial requirement. GraphQL improves on that by allowing the client to build requests containing only the required information (or "entities", in GraphQL terms). However, regular GraphQL servers resolve queried entities separately, which leads to at least one database hit for each entity.

To mitigate both problems, basebox features a unique GraphQL request compiler that translates GraphQL requests into SQL (or NoSQL) statements. This approach has the following benefits:

Drastically reduces implementation efforts by automatically serving all schema compliant requests
Significantly speeds up requests by merging entity queries into single database requests (JOINs for SQL), thus sparing unnecessary database round trips.
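The core idea of such a compiler can be sketched in a few lines. The toy below is not basebox' compiler: it skips GraphQL parsing and takes the selection as a nested Python dict, and the `RELATIONS` and `COLUMNS` mappings (including the `insurance_id` and `last_visit_id` foreign keys) are hypothetical schema assumptions.

```python
# Hypothetical schema mapping: nested GraphQL field -> (table, join condition).
RELATIONS = {
    "insurance": ("insurances", "insurances.id = patients.insurance_id"),
    "lastVisit": ("visits", "visits.id = patients.last_visit_id"),
}
# Hypothetical mapping from GraphQL field names to column names.
COLUMNS = {
    "id": "id",
    "firstName": "first_name",
    "lastName": "last_name",
    "name": "name",
    "dateTime": "date_time",
}


def compile_selection(root_table, selection):
    """Translate a nested selection (dict) into one SQL statement using JOINs."""
    columns, joins = [], []
    for field_name, sub in selection.items():
        if sub is None:
            # Scalar field on the root table.
            columns.append(f"{root_table}.{COLUMNS[field_name]}")
        else:
            # Nested entity: becomes a JOIN instead of a separate query.
            table, condition = RELATIONS[field_name]
            joins.append(f"LEFT JOIN {table} ON {condition}")
            columns.extend(f"{table}.{COLUMNS[f]}" for f in sub)
    return f"SELECT {', '.join(columns)} FROM {root_table} " + " ".join(joins)


sql = compile_selection("patients", {
    "id": None,
    "firstName": None,
    "lastName": None,
    "lastVisit": {"dateTime": None},
    "insurance": {"name": None},
})
```

The resulting `sql` string is a single statement with two JOINs, so the whole nested selection can be answered with one database round trip.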

GraphQL Compiler Performance Benefits

When serving API requests, database I/O is by far the most expensive (time consuming) part, so we can expect a significant performance gain by

avoiding unnecessary database hits altogether
JOINing queries whenever possible

Both goals are achieved with basebox' GraphQL compiler.

Example

To illustrate the benefits, we assume the following example scenario:

The client application wants to display a list of patients with name, gender, insurance and last visit:

Patient	Gender	Insurance	Last Visit
Martha Mayer	female	Health Inc	31-12-2022 14:57
Peter Hollister	male	Vita Inc	14-12-2022 15:10
...

The data for this information is spread across multiple database tables - which is a common case:

Fields	Table
First Name, Last Name, Gender	`patients`
Insurance Company Name	`insurances`
Last Visit	`visits`

Legacy REST

[Figure: request flow for a legacy REST API]

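For REST, the number of client requests often matches the number of database hits: one request fetches the patient list, and two more per patient fetch the insurance and the last visit. The total number is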

n * 2 + 1

where n is the number of patients. So for 50 patients, to display the patient table, 101 queries and database hits are necessary.

Regular GraphQL

In GraphQL, you can write a query that requests all needed data at once, e.g.:

```graphql
query {
  allPatients {
    id
    dateOfBirth
    firstName
    lastName
    lastVisit {
      dateTime
    }
    insurance {
      name
    }
  }
}
```

This would result in the following flow:

[Figure: request flow for a regular GraphQL server]

For GraphQL, it would be just one request, but still n * 2 + 1 database hits, where n is the number of patients. So for 50 patients, 101 database hits are still necessary.
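The count can be reproduced with a small simulation of resolver-style execution. The sketch uses SQLite from the Python standard library; the schema, the `CountingDb` wrapper, and the sample data are hypothetical stand-ins for the example scenario.

```python
import sqlite3


class CountingDb:
    """Thin wrapper that counts every statement as one database hit."""

    def __init__(self, conn):
        self.conn = conn
        self.hits = 0

    def query(self, sql, params=()):
        self.hits += 1
        return self.conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE insurances (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE patients (id INTEGER PRIMARY KEY, first_name TEXT,
                           last_name TEXT, insurance_id INTEGER);
    CREATE TABLE visits (id INTEGER PRIMARY KEY, patient_id INTEGER,
                         date_time TEXT);
""")
conn.execute("INSERT INTO insurances VALUES (1, 'Health Inc')")
for pid in range(1, 51):  # 50 patients with one visit each
    conn.execute("INSERT INTO patients VALUES (?, 'First', 'Last', 1)", (pid,))
    conn.execute(
        "INSERT INTO visits (patient_id, date_time) VALUES (?, '2022-12-31 14:57')",
        (pid,),
    )

db = CountingDb(conn)
rows = []
# One query for the patient list, then two more queries per patient.
for pid, first, last, insurance_id in db.query(
    "SELECT id, first_name, last_name, insurance_id FROM patients"
):
    insurance = db.query("SELECT name FROM insurances WHERE id = ?", (insurance_id,))
    last_visit = db.query(
        "SELECT MAX(date_time) FROM visits WHERE patient_id = ?", (pid,)
    )
    rows.append((first, last, insurance[0][0], last_visit[0][0]))
```

After the loop, `db.hits` equals 50 * 2 + 1 = 101, matching the formula.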

basebox

For basebox, the GraphQL query would be the same, but the flow differs significantly:

[Figure: request flow for basebox]

So for basebox with its GraphQL compiler, there is 1 request and 1 database hit.
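As an illustration of how one statement can serve the whole patient table, the sketch below answers the example with a single JOIN query. It uses SQLite from the Python standard library; the schema and sample data are hypothetical, and the SQL is not necessarily what basebox' compiler generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE insurances (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE patients (id INTEGER PRIMARY KEY, first_name TEXT,
                           last_name TEXT, insurance_id INTEGER);
    CREATE TABLE visits (id INTEGER PRIMARY KEY, patient_id INTEGER,
                         date_time TEXT);
""")
conn.execute("INSERT INTO insurances VALUES (1, 'Health Inc')")
for pid in range(1, 51):  # 50 patients with two visits each
    conn.execute("INSERT INTO patients VALUES (?, 'First', 'Last', 1)", (pid,))
    for stamp in ("2022-12-14 15:10", "2022-12-31 14:57"):
        conn.execute(
            "INSERT INTO visits (patient_id, date_time) VALUES (?, ?)",
            (pid, stamp),
        )

hits = 1  # the single statement below is the only database hit
rows = conn.execute("""
    SELECT p.first_name, p.last_name, i.name, MAX(v.date_time)
    FROM patients p
    LEFT JOIN insurances i ON i.id = p.insurance_id
    LEFT JOIN visits v ON v.patient_id = p.id
    GROUP BY p.id
""").fetchall()
```

The query returns one row per patient, including the insurance name and the most recent visit, regardless of how many patients are listed.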

Custom Business Logic

On top of that, custom queries or mutations can be implemented via customer-developed loadable modules on the server side. These modules can implement arbitrary functionality that may not even be related to database objects, much like remote procedure calls that use GraphQL for schema definition and communication.

Keycloak Authentication

The basebox authentication system is based on Keycloak, a widely trusted open source authentication system built on open standards.

Data Analysis and Report Module

basebox was designed with patient monitoring data in mind. Since monitoring devices can produce lots of data, basebox features a dedicated data analysis and export module that is optimized for large data sets.

During initial project phases, raw data is typically collected, which not only causes a lot of data to accumulate but also requires analyses to run repeatedly over the entire data set until it is clear how the data should be processed. Thus, an analysis tool needs to perform well and allow collecting and merging data from multiple databases in parallel, while providing an interface to powerful data analysis tools. basebox' report module does that by collecting data from all databases in parallel, based on GraphQL queries, merging the data as needed and calling configurable Python functions, which can then use the rich set of tools available for Python (NumPy, various AI frameworks, etc.).
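The collect, merge, and analyze pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the report module itself: it substitutes plain SQL on throwaway SQLite files for GraphQL queries, and the function names, the `samples` table, and the analysis function are hypothetical.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fetch(path, sql):
    """Run the query against one database; each worker opens its own connection."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def run_report(db_paths, sql, analyze):
    """Query all databases in parallel, merge the rows, and call `analyze`."""
    with ThreadPoolExecutor(max_workers=len(db_paths)) as pool:
        chunks = list(pool.map(lambda path: fetch(path, sql), db_paths))
    merged = [row for chunk in chunks for row in chunk]
    return analyze(merged)


def mean_heart_rate(rows):
    """Example of a configurable analysis function (could use NumPy instead)."""
    return mean(value for (value,) in rows)


# Demo: two throwaway databases standing in for separate data stores.
tmp_dir = tempfile.mkdtemp()
paths = []
for name, values in (("a.db", [60, 70]), ("b.db", [80, 90])):
    path = os.path.join(tmp_dir, name)
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE samples (heart_rate INTEGER)")
    conn.executemany("INSERT INTO samples VALUES (?)", [(v,) for v in values])
    conn.commit()
    conn.close()
    paths.append(path)

result = run_report(paths, "SELECT heart_rate FROM samples", mean_heart_rate)
```

Passing the analysis step as a plain function keeps it swappable: the same collection and merge code can feed any Python routine while the processing approach is still being worked out.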

Once it's clear how the raw data needs to be processed, the corresponding algorithm can either be moved to the devices collecting the data or a trigger-based analysis tool can be configured to do so directly on one of the databases, the latter not being part of the MVP.

Beyond MVP

The following features are planned for later versions:

Additional Database Support that integrates NoSQL and various relational databases
ePROM Module for creating and delivering evaluable patient questionnaires
FHIR Support
Client data transport libraries with reliable offline support for at least iOS and Android

basebox' Anatomy

The following graphic illustrates basebox' architecture:

[Figure: basebox Overview]

basebox (drawn in double lines) consists of the following main components:

Broker - basebox' integral component, containing an HTTPS GraphQL server, GraphQL compiler, ...
Web Server - hosts generic HTML based components, such as the generic user administration
Catalog Database - contains user records and the database mapping table (catalog)
SQL/NoSQL databases
Keycloak - central authentication

Various examples of what can be connected to basebox:

Web browser based applications
Smartphones with connected BLE sensors
Autonomous sensors

This list is, of course, by no means exhaustive.

basebox Broker

basebox' central module is the so-called Broker, providing the following functionality:

API Server - accepts GraphQL requests via HTTPS
GraphQL compiler - parses GraphQL and generates database requests
Database Interface
Response generator - converts database response data to JSON

[Figure: basebox Broker]
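As an illustration of the last step in this pipeline, the sketch below converts flat database rows into the nested JSON shape a GraphQL client expects (a top-level `data` object, as defined by the GraphQL specification). The `rows_to_json` function and the column order are hypothetical; the actual response generator is part of the Broker.

```python
import json


def rows_to_json(rows):
    """Convert flat result rows into a nested GraphQL-style JSON response."""
    patients = [
        {
            "firstName": first,
            "lastName": last,
            "insurance": {"name": insurance},
            "lastVisit": {"dateTime": visited},
        }
        for first, last, insurance, visited in rows
    ]
    return json.dumps({"data": {"allPatients": patients}})


body = rows_to_json([("Martha", "Mayer", "Health Inc", "2022-12-31 14:57")])
```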

Conclusion

We believe that basebox can be a huge benefit for data sensitive sectors, specifically the Health Tech and Gov Tech industries.

We are looking forward to your feedback. Please drop us a note at or one of the authors' email addresses listed at the top of this document.
