Why Databox?

Databox results from many years of research into personal data and their ecosystems. This short note lays out the primary motivations and the thinking behind Databox without delving into the technical detail. As background, I recommend watching the “What is Databox” video on YouTube to obtain a high-level view of the Databox approach. Fundamentally, the forces that motivate Databox arise from the EU General Data Protection Regulation, the advent of the Internet of Things, and the need to balance consumer concerns such as privacy and accountability with commercial desire to exploit new opportunities provided by the widespread generation, collection and analysis of data.

The EU General Data Protection Regulation (see Wired GDPR article for a recent review) has been on the agenda since Horizon Digital Economy Research began over seven years ago. While we have seen amendments to details, the structure has remained robust for a great many years, which is unsurprising given the GDPR enacts elements of the European Convention on Human Rights with regards to personal information processing. Simultaneously we have seen the growth of movements seeking access to data in machine readable forms for reuse for new applications, where such data was previously either jealously guarded or simply viewed as not worth the effort. This equally applies to the open data movement seeking to open government data (whether for transparency or new business opportunities) and personal data (where access by the public is viewed as empowering them as consumers). In the UK, the latter has been the subject of the midata campaign, most recently delivering the Open Banking APIs. It is only natural that these converge in the “right to portability” enshrined in GDPR.

One simple purpose of online data access is to promote switching between providers. In the current cloud computing era, the prevalent technical implementation is a cloud-hosted web service through which a consumer shares their data obtained from one provider, either directly with a competitor, or more usually with a comparison site. However, computationally, as each consumer’s data for the purported purposes is essentially processed independently, each consumer could perform the calculation themselves if provided with easy to use software on commonly used platforms such as PCs and smartphones – and in the future, Databox…

This example brings to the fore a core and increasing concern for consumers, one of privacy. In sharing our data, the rights obtained by the service provider, and data controller in Data Protection terms, are often very general and permit reuse (targeted advertising being merely the tip of the iceberg with which many consumers have come to accept) and resale. We must be cognisant that such reuse and resale may in fact be a business requirement as the actual web service itself may not be profitable in its own right – although given the cost of data protection compliance, a provider of software rather than a service might become profitable simply by removing those compliance costs. This also raises the interesting question with regards to the GDPR “Privacy by Design” (PbD) requirement - see Art. 25 GDPR Data protection by design and by default (*): if a function can be provided without sharing data (i.e. data is kept strictly private rather than confidentially shared), should sharing approaches simply be illegal? Databox addresses this by providing an environment in which consumers can execute software supplied by providers on their data, in private.

The issue of resale of data also highlights the need for accountability. Normal citizens should be able to be aware of how their data is used and what consequences arise. This right is embedded within the GDPR but, as with all matters of accountability, without the ability of third parties to audit the systems, the valid concerns raised by lobbying groups and a steady stream of breaches continue to undermine public confidence. Databox addresses this through three interlinked measures: (i) the requirement for a machine-readable manifest that stipulates the data to be accessed, the processing to be performed, and any data to be exported; (ii) conversion of the manifest to a service level agreement through direct interaction with the consumer, and which Databox subsequently enforces as the processing is carried out; and (iii) mandatory logging by the Databox platform of all application activities, including data access and export, for inspection by concerned consumers perhaps on receipt of notification.

So far, the discussion has centred on existing applications, but a key desire of Databox is to open new opportunities for applications that perform data fusion across multiple consumer data sources. Data protection is but one of a series of regulatory requirements that companies must adhere to. For example, in the UK financial processing must be performed in accordance with rules issued by the Financial Conduct Authority (FCA), while much medical data falls within the requirements of the NHS Information Governance framework. A service aiming to process “all” a consumer’s data by providing a service involving data sharing will find themselves in a complex, possibly even conflicting, compliance pickle. In Databox, a company with a great idea, e.g., to fuse consumer purchasing, medical and activity information, simply supplies the software to the consumer for them to execute, avoiding on-going compliance overheads.

The general class of solution for these issues are often referred to as Personal Information Management (PIM) platforms. Most have proposed cloud based solutions, including personal virtual machines to host PIMs. Even though the current Databox implementation is based on Docker containers and could run on any platform, Databox eschews the cloud approach. Under GDPR, the clear separation between “data controller” and “data processor” is no longer present and so cloud providers may well find themselves in difficult compliance situations where they may previously have viewed themselves merely as data processors. Even worse, in the UK they will likely be viewed as Communications Service Providers for the purposes of the IP and DE bills, leading to further significant compliance costs.

An emerging class of data is that of the Internet of Things (IoT). For Databox the primary context for considering IoT is the domestic environment, encompassing equipment installed in the home and the ubiquitous personal mobile devices such as smartphones and tablets that form the control and management interface to such infrastructures. In this domestic market, the ecosystem is still emergent – high profile security failures and data breaches have rightly raised concerns about building what will be national infrastructure, in a completely ad-hoc manner. The role of Databox in the IoT context includes all the foregoing benefits (privacy, accountability and new opportunities), but also adds resilience. Reliance on continuous Internet connectivity and service availability for basic in-home operation of heating, lighting and security is not tenable. Even more-so when considering that in home IoT is advocated to enable elderly to continue to live independent lives in their own homes. This requires the in-home infrastructure to be able to supply functionality for significant periods of time without reliance on cloud infrastructure – we propose that the integration point in the home is the Databox.

Finally, with regards to domestic IoT, Databox provides scalability. Many in-home sensors can produce substantial data rates that need local processing to ensure they do not consume the home owner’s upstream broadband service – consider as examples face recognition used as part of a home intrusion detection, or “condition monitoring” of home appliances using high frequency sampling of the electricity supply. In-home processing in a Databox avoids the need for high value servers in data centres, along with associated bandwidth and storage costs, by pushing the processing to the edge. This approach also removes the central services that hold significant personal data and are control points for possibly millions of devices – such services are honey pots for hackers, whether they are simply mischievous, engaged in direct action campaigning on some topic, or with criminal intent.

So, is it necessary the Databox be a separate computing appliance in the home? As noted, the current Databox implementation is based on Docker containers, so in the home the software could run on a PC, home router, set top box, fridge, etc. Indeed, running Databox as a peer-to-peer application on several different devices in the home network would improve resilience and enable defence against compromise of individual systems. However, given the envisaged usage of Databox, not only for IoT but as the location of secure private computation, we do envisage that at least one Databox instance will be available in the home at all times – and a specialist appliance seems likely to be the most secure, cost effective and energy efficient solution.

w: https://databoxproject.uk
e: info@databoxproject.uk
t: @databoxproject

(*) You happy @tnhh?

Written on November 28, 2017