October 15, 2021 · By Markus Thielen
Version 1.0, October 2021
Markus Thielen -
René Herzer -
This document describes basebox, a prototype of a data management system that combines storage (databases) with universal APIs (based on compiled GraphQL requests) and other functionality in a regulatory-compliant way to form a standard data platform for data-sensitive sectors.
During the past few years, we have accompanied various digital health startups on their way from idea to prototype and/or product. In doing so, we have learned that virtually all digital health projects require the same set of basic data management functionality. Seeing them implement the same functionality over and over again, possibly making the same mistakes and spending a lot of time and money, we came up with the idea for a data management system that covers these requirements once and for all.
When beginning a new project, nobody really starts from scratch. For instance, you would never even think of creating a new database system; you would choose an existing one. The same applies to many other things like programming languages, frameworks, etc. So there is a varying level of standard functionality that everybody relies on. With basebox, we want to raise this level to a point where one can start working on project-specific things right away.
Last, but not least, we believe that existing projects can benefit from switching to basebox, too.
Assuming client applications, data collection devices, etc. are at the top of a project's technology stack, basebox will provide all functionality including and below the HTTPS/GraphQL layer.
On top of basebox, projects implement their specific frontends using about any technology they like, for example:
The interface to basebox uses GraphQL on top of HTTPS, both of which are simple yet almost infinitely flexible industry standards. The simplicity is important when connecting small devices, e.g. microcontroller-based designs that don't have the capacity for complex frameworks.
basebox' design goals are strictly based on the demands for a digital health data management system. While this may seem obvious, some demands are easily overlooked or at least underrated.
We learned that, when project teams design and choose components for their technology stack, they usually do so using a common set of criteria, for example (in order of assumed precedence):
In our experience, the last four points do not receive the attention they deserve. They tend to tie up a lot of project resources, assuming they are even considered early enough, and can cause projects to fail, even after completion, if neglected.
Ease of use is the most obvious criterion: It pays off very quickly and saves time and money. To meet this demand, basebox features:
We design and build basebox from scratch with performance in mind. An important use case of basebox is the integration of patient monitoring devices, which can produce lots of data, especially during initial trials and before algorithms can be developed that process or aggregate data on the devices.
One of the most important design decisions affecting performance is the use of Rust as the programming language.
Rust is a compiled language (like C or C++), so it runs as native machine code that does not have to be translated at run time (as is the case for interpreted languages like PHP or JavaScript). It has very tight memory management and, unlike Java, does not require a garbage collector.
Another important factor for basebox' performance is the GraphQL query compiler. This unique feature reduces the number of database hits per query to about 1. Since in virtually all applications database I/O is the most time consuming part of serving an API request, we expect a huge performance benefit.
And last but not least, we are using one of the fastest HTTP servers, Actix Web.
Knowing that for most projects, the number one performance bottleneck is I/O, we designed basebox to support multiple databases, each running independently from each other. Basebox decides and keeps track of which patient's or device's data is stored on which database based on configurable rules.
This not only improves the scalability factor to almost 1 but also allows adding new databases as needed as new patients or devices join.
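The rule-based placement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not basebox' actual implementation: the `Rule` and `Router` classes, the database names and the hash-based fallback are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """Hypothetical routing rule: the first rule whose predicate matches wins."""
    matches: Callable[[dict], bool]
    database: str


class Router:
    """Decides, and remembers, which database holds an owner's data."""

    def __init__(self, rules: List[Rule], fallback: List[str]):
        self.rules = rules
        self.fallback = fallback
        self.assignments: Dict[str, str] = {}  # owner id -> database name

    def database_for(self, owner: dict) -> str:
        owner_id = owner["id"]
        # An owner that was placed once stays on the same database.
        if owner_id in self.assignments:
            return self.assignments[owner_id]
        for rule in self.rules:
            if rule.matches(owner):
                db = rule.database
                break
        else:
            # No rule matched: spread owners over the fallback databases
            # using a stable hash of the owner id.
            digest = hashlib.sha256(owner_id.encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % len(self.fallback)
            db = self.fallback[index]
        self.assignments[owner_id] = db
        return db


router = Router(
    rules=[Rule(lambda o: o.get("kind") == "device", "db-devices")],
    fallback=["db-a", "db-b"],
)
device_db = router.database_for({"id": "dev-1", "kind": "device"})
patient_db = router.database_for({"id": "patient-1", "kind": "patient"})
```

Because assignments are recorded, appending a new database to `fallback` only affects owners that join afterwards; existing data does not have to move.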
Preventing data breaches is an all-important goal for basebox. Some statistics:
Data breaches in the health sector are devastating for affected patients. Some examples:
Because basebox is written in Rust, a language whose compiler enforces memory safety, it already stands on solid ground.
On top of that, basebox features a security concept based on Keycloak. Using Keycloak's role model, basebox allows the definition which role may access which entity (database table) in what mode (read/write).
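A role/entity/mode check of this kind could look like the following sketch. The permission table, role names and `is_allowed` helper are assumptions made for illustration; in a real deployment the roles would come from a verified Keycloak token, and the table from basebox' configuration.

```python
from enum import Flag, auto
from typing import Dict, Iterable


class Mode(Flag):
    READ = auto()
    WRITE = auto()


# Hypothetical permission table: role -> entity (database table) -> allowed modes.
PERMISSIONS: Dict[str, Dict[str, Mode]] = {
    "therapist": {"patients": Mode.READ, "visits": Mode.READ | Mode.WRITE},
    "admin": {"patients": Mode.READ | Mode.WRITE, "visits": Mode.READ | Mode.WRITE},
}


def is_allowed(roles: Iterable[str], entity: str, mode: Mode) -> bool:
    """Return True if any of the user's roles grants `mode` on `entity`."""
    for role in roles:
        granted = PERMISSIONS.get(role, {}).get(entity)
        if granted is not None and mode in granted:
            return True
    # Unknown roles and unlisted entities are denied by default.
    return False
```

Denying by default keeps a missing table entry from silently opening access.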
Some users will still have broad access (e.g. doctors or therapists); rate throttles will help limit, to configurable levels, the speed at which a hijacked account of this kind can be used to extract data (e.g. a therapist can only access a given number of patient records during a given time window).
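One way to express such a throttle is a sliding-window counter per user. The sketch below is an assumption about how this could be done, not a description of basebox' code; the class name and parameters are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class RecordThrottle:
    """Allow at most `max_records` record accesses per user within `window_seconds`."""

    def __init__(self, max_records: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_records = max_records
        self.window_seconds = window_seconds
        self._clock = clock
        # user id -> timestamps of the accesses still inside the window
        self._seen: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self._clock()
        timestamps = self._seen[user_id]
        # Forget accesses that have dropped out of the time window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_records:
            return False
        timestamps.append(now)
        return True
```

For example, `RecordThrottle(max_records=100, window_seconds=3600)` would cap a therapist account at 100 patient records per hour. The injectable `clock` exists only to make the behaviour testable.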
Most database-based projects start without any concept of data retention policies and simply start collecting data. However, apart from resource issues like disk space and performance, there are legal reasons to define how long records should stay in the database and when to delete them.
Without any concept of data retention policies, projects often run into serious problems further down the road when excess data must be removed, e.g. for legal reasons, and removing the data takes much longer than expected and impacts the performance of the live system.
basebox will be designed with data retention policies in mind and will remove excess data as soon as it is no longer needed. Where applicable, concepts for quickly removing large amounts of data will be considered.
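As an illustration of removing expired data without stalling the live system, the following sketch deletes in small batches so that each transaction stays short. It uses SQLite from the Python standard library purely for demonstration; the function name, table layout and batch size are assumptions, and batching is only one of several possible concepts.

```python
import sqlite3


def purge_expired(conn: sqlite3.Connection, table: str, ts_column: str,
                  cutoff: int, batch_size: int = 1000) -> int:
    """Delete rows whose timestamp is older than `cutoff`, one batch at a time.

    `table` and `ts_column` are interpolated into the SQL text, so they must
    come from trusted configuration, never from user input.
    Returns the total number of rows removed.
    """
    total = 0
    while True:
        cursor = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN ("
            f"SELECT rowid FROM {table} WHERE {ts_column} < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        # Committing per batch keeps locks short-lived.
        conn.commit()
        if cursor.rowcount == 0:
            break
        total += cursor.rowcount
    return total
```

A scheduler would call this periodically with a cutoff derived from the retention period configured for each table.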
basebox itself cannot be certified as HIPAA or ISO 13485 compliant, as it is not a medical product (merely a part of one). Instead, basebox is an off-the-shelf (OTS) component, or SOUP (software of unknown provenance) in terms of IEC 62304.
basebox' compliance goal is that no certification issues are raised for basebox components. To achieve this, basebox is developed on the basis of a quality management system that meets ISO 13485 requirements. Over time, regulators will get to know basebox, further reducing certification efforts.
The setup routine is the first usage impression developers get from basebox, so we designed this process to be as smooth as possible. basebox' setup is controlled by the project's database schema description in GraphQL format and a control file. By default, the control file can be almost empty.
The default routine will create various docker machines:
When creating REST or GraphQL APIs, each API entrypoint, or, in GraphQL terms, query or mutation resolver, must be implemented, maintained and tested explicitly and separately. For REST APIs, this means that client applications are often forced to call multiple endpoints to get the data they require for the task at hand. Depending on the server implementation, this often leads to information returned to the client and thus database I/O that is unnecessary to fulfill the client's initial requirement. GraphQL improves on that by allowing the client to build requests containing only the required information (or "entities", in GraphQL terms). However, regular GraphQL servers resolve queried entities separately, which leads to at least one database hit for each entity.
To mitigate both problems, basebox features a unique GraphQL request compiler that translates GraphQL requests into SQL (or NoSQL) statements. This approach has the following benefits:
When serving API requests, database I/O is by far the most expensive (time consuming) part, so we can expect a significant performance gain by
Both goals are achieved with basebox' GraphQL compiler.
To illustrate the benefits, we assume the following example scenario:
The client application wants to display a list of patients with name, gender, insurance and last visit:
Patient | Gender | Insurance | Last Visit |
---|---|---|---|
Martha Mayer | female | Health Inc | 31-12-2022 14:57 |
Peter Hollister | male | Vita Inc | 14-12-2022 15:10 |
... |
The data for this information is spread across multiple database tables - which is a common case:
Fields | Table |
---|---|
First Name, Last Name, Gender | patients |
Insurance Company Name | insurances |
Last Visit | visits |
Legacy REST
For REST, the number of queries often matches the number of database hits. The total number is
n * 2 + 1
where n is the number of patients. So for 50 patients, to display the patient table, 101 queries and database hits are necessary.
Regular GraphQL
In GraphQL, you can write a query that requests all needed data at once, e.g.:
    query {
      allPatients {
        id
        dateOfBirth
        firstName
        lastName
        lastVisit {
          dateTime
        }
        insurance {
          name
        }
      }
    }
Which would result in the following flow:
For GraphQL, it would be just one request, but also n * 2 + 1 database hits, where n is the number of patients. So for 50 patients, still 101 database hits are necessary.
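The hit count can be reproduced with a small simulation. The sketch below uses an in-memory SQLite database and a naive resolver that fetches each nested entity separately; the table layout, column names and the `CountingDb` wrapper are assumptions for illustration only.

```python
import sqlite3


def make_demo_db(n_patients: int) -> sqlite3.Connection:
    """Create an in-memory database with an assumed patients/insurances/visits layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE insurances (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE patients (
            id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
            insurance_id INTEGER REFERENCES insurances(id));
        CREATE TABLE visits (
            id INTEGER PRIMARY KEY,
            patient_id INTEGER REFERENCES patients(id), date_time TEXT);
        INSERT INTO insurances VALUES (1, 'Health Inc');
        """
    )
    for pid in range(1, n_patients + 1):
        conn.execute("INSERT INTO patients VALUES (?, ?, ?, 1)",
                     (pid, f"First{pid}", f"Last{pid}"))
        conn.execute("INSERT INTO visits (patient_id, date_time) VALUES (?, ?)",
                     (pid, "2022-12-14T15:10"))
    return conn


class CountingDb:
    """Thin wrapper that counts every statement sent to the database."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        self.hits = 0

    def query(self, sql: str, params: tuple = ()) -> list:
        self.hits += 1
        return self.conn.execute(sql, params).fetchall()


def resolve_all_patients(db: CountingDb) -> list:
    """Naive per-entity resolution: one hit for the list, two more per patient."""
    result = []
    patients = db.query("SELECT id, first_name, last_name, insurance_id FROM patients")
    for pid, first, last, insurance_id in patients:
        (insurance,) = db.query("SELECT name FROM insurances WHERE id = ?",
                                (insurance_id,))[0]
        (last_visit,) = db.query("SELECT MAX(date_time) FROM visits WHERE patient_id = ?",
                                 (pid,))[0]
        result.append({"name": f"{first} {last}", "insurance": insurance,
                       "lastVisit": last_visit})
    return result


db = CountingDb(make_demo_db(50))
rows = resolve_all_patients(db)
print(len(rows), db.hits)  # 50 rows, 2 * 50 + 1 = 101 database hits
```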
basebox
For basebox, the GraphQL query would be the same, but the flow differs significantly:
So for basebox with its GraphQL compiler, there is 1 request and 1 database hit.
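To make the idea of compiling a request into one statement concrete, here is a deliberately tiny sketch. It does not parse GraphQL text and is not basebox' compiler: the selection is given as a nested Python structure, and the `SCHEMA` mapping, table names and column names are all hypothetical.

```python
from typing import Dict, List, Union

Selection = List[Union[str, Dict[str, List[str]]]]

# Hypothetical mapping from GraphQL field names to SQL columns and joins.
SCHEMA = {
    "table": "patients",
    "alias": "p",
    "fields": {
        "id": "p.id",
        "dateOfBirth": "p.date_of_birth",
        "firstName": "p.first_name",
        "lastName": "p.last_name",
    },
    "relations": {
        "insurance": {
            "join": "LEFT JOIN insurances i ON i.id = p.insurance_id",
            "fields": {"name": "i.name"},
        },
        "lastVisit": {
            "join": ("LEFT JOIN (SELECT patient_id, MAX(date_time) AS date_time "
                     "FROM visits GROUP BY patient_id) v ON v.patient_id = p.id"),
            "fields": {"dateTime": "v.date_time"},
        },
    },
}


def compile_selection(selection: Selection) -> str:
    """Translate a nested field selection into a single SQL SELECT with joins."""
    columns: List[str] = []
    joins: List[str] = []
    for item in selection:
        if isinstance(item, str):
            # Plain field on the root entity.
            columns.append(f'{SCHEMA["fields"][item]} AS "{item}"')
        else:
            # Nested entity: add its join once, then its selected fields.
            for relation_name, sub_fields in item.items():
                relation = SCHEMA["relations"][relation_name]
                joins.append(relation["join"])
                for sub_field in sub_fields:
                    column = relation["fields"][sub_field]
                    columns.append(f'{column} AS "{relation_name}.{sub_field}"')
    head = f'SELECT {", ".join(columns)} FROM {SCHEMA["table"]} {SCHEMA["alias"]}'
    return " ".join([head] + joins)


# The allPatients query from above, written as a nested selection.
QUERY_SELECTION: Selection = [
    "id", "dateOfBirth", "firstName", "lastName",
    {"lastVisit": ["dateTime"]},
    {"insurance": ["name"]},
]
sql = compile_selection(QUERY_SELECTION)
print(sql)
```

Executing the resulting statement returns every row of the patient table in one database round trip, regardless of the number of patients.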
On top of that, custom queries or mutations can be implemented via customer-developed loadable modules on the server side to implement arbitrary functionality that may not even be related to database objects, much like remote procedure calls using GraphQL for schema definition and communication.
The basebox authentication system is based on Keycloak, which is a widely trusted, open source authentication system based on open standards.
basebox was designed with patient monitoring data in mind. Since monitoring devices can produce lots of data, basebox features a dedicated data analysis and export module that is optimized for large data sets.
During initial project phases, raw data is typically collected, which not only causes a lot of data to accumulate but also requires analysis to run repeatedly over the entire data set until it's clear how the data should be processed. Thus, an analysis tool needs to be performant and allow collecting and merging data from multiple databases in parallel while providing an interface to powerful data analysis tools. basebox' report module does that by collecting data from all DBs in parallel, based on GraphQL queries, merging the data as needed and calling configurable Python functions, which can then use the rich set of tools available for Python (numpy, various AI frameworks, etc.).
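The collect, merge and analyze flow could be sketched like this. The `run_report` function and the callables standing in for per-database GraphQL queries are assumptions for illustration; the analysis function uses only the standard library here, but could just as well use numpy.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def run_report(sources: Sequence[Callable[[], List[Row]]],
               analyze: Callable[[List[Row]], object]) -> object:
    """Fetch rows from all sources in parallel, merge them, and analyze the result."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        # pool.map preserves the order of `sources` in its results.
        partial_results = list(pool.map(lambda fetch: fetch(), sources))
    merged = [row for rows in partial_results for row in rows]
    return analyze(merged)


# Stand-ins for the same GraphQL query sent to two different databases.
def fetch_from_db_a() -> List[Row]:
    return [{"patient": "p1", "heart_rate": 60}, {"patient": "p1", "heart_rate": 70}]


def fetch_from_db_b() -> List[Row]:
    return [{"patient": "p2", "heart_rate": 80}]


def mean_heart_rate(rows: List[Row]) -> float:
    """Example of a configurable analysis function."""
    return mean(row["heart_rate"] for row in rows)


report = run_report([fetch_from_db_a, fetch_from_db_b], mean_heart_rate)
```

Threads are a reasonable fit in this sketch because the work is dominated by waiting on database I/O rather than by computation.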
Once it's clear how the raw data needs to be processed, the corresponding algorithm can either be moved to the devices collecting the data or a trigger-based analysis tool can be configured to do so directly on one of the databases, the latter not being part of the MVP.
The following features are planned for later versions:
The following graphic illustrates basebox' architecture:
basebox (drawn in double lines) consists of the following main components:
Various examples of what can be connected to basebox:
This list is, of course, by no means exhaustive.
basebox' central module is the so-called Broker, providing the following functionality:
We believe that basebox can be a huge benefit for data-sensitive sectors, specifically the Health Tech and Gov Tech industries.
We are looking forward to your feedback. Please drop us a note at or one of the authors' email addresses listed at the top of this document.