The UK Biobank hopes to manage and mine a quarter of a billion data points.
By Vicki Glaser
May 19, 2009 | Andy Harris, Information Services Director of UK Biobank, was entrusted with a daunting task when he joined the huge data repository project, which was first conceptualized a decade ago: Design an IT system with the capability to manage more than a quarter of a billion data points, track over 20 million blood samples, and, perhaps most challenging, analyze and mine all that data without knowing how the information might be used 5, 10, or 20 years down the road.
And one more thing: all this information is highly personal and must be collected, stored, and retrieved in a secure, limited-access system that preserves and protects the anonymity of the registrants.
Harris viewed this challenge as the Rubik’s Cube of system design: get all the pieces of the puzzle in place and the reward would be a boundless resource for future clinical trials. That resource is a repository of personal demographic information, linked to medical history, physiological parameters, and blood samples, which together could form the basis for a better understanding of disease predisposition and the development of screening tests, targeted diagnostics, more accurate prognostic measures, and new therapeutic strategies.
From its inception, the driving force behind the creation of the UK Biobank was “a strategic need to establish large blood-based prospective epidemiological studies with prolonged and detailed follow-up of cause-specific morbidity and mortality,” states the Website. This vast database has a set goal of 500,000 registrants, with approximately 500 data points for each individual, for a total of 250 million data points to be collected, stored, retrieved, and analyzed (see “Jewels in the Biobanking Crown,” Bio•IT World, May 2007).
A massive challenge is to ensure that at every step, the pivotal considerations are privacy, security, and ethics. These overriding principles guide every programming detail, keystroke, and decision about access to a cache of highly sensitive information. Ensuring an unalterable link between an individual and his or her evolving data file, while not revealing the person’s identity, is a challenge.
Guaranteeing the integrity and security of the data as it is collected at temporary Assessment Centers throughout the country and transferred via the Internet between geographically dispersed processing and archival sites is a headache. And determining the best way to store this data to optimize its value as a resource for future medical research, when it is unclear how the information will ultimately be used, is an IT team’s nightmare.
“We have to be open-minded at every stage, while making sure we are meeting all of the ethical and security requirements,” says Harris, who has led the software development and IT management team for more than four years. The team consists of six people who built the IT infrastructure from scratch, with design decisions and system modifications implemented as the concept evolved. Many of the day-to-day data management tasks are outsourced to contract service providers.
The clinic IT system used by UK Biobank was developed at the Clinical Trials Services Unit of Oxford University by the Core Programming Team. The UK Biobank chose to build patient records using the Oracle application Healthcare Transaction Base (HTB), designed to be HL7-compliant. The IT team developed Oracle- and Java-based software to convert the data upstream of HTB. Once data are in the repository, they are Oracle-dependent. In addition to HTB, Biobank uses two main commercial products: Thermo Nautilus LIMS and the Microsoft Exchange BackOffice function, which facilitates communication among the staff.
Protect the Innocent
Harris’s team deliberately did not design the system architecture as an interactive environment, intending to limit access to data stored in the core systems. The group developed an IT system specifically to manage assessment center visits to register new participants. Each center typically has 30 to 40 touch-screen desktop workstations linked through a local area network (LAN) to an on-site server. Only the center’s server and the administrator’s unit have Internet access.
When new registrants arrive clutching their invitation letters, they are given a USB stick containing an encrypted key that translates the name, address, and identifier printed on their letter to their assessment-specific identifier. “Linkage of name to identifier is a crucial part of the infrastructure,” says Harris.
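The article does not describe how the key on the USB stick maps a letter to an assessment-specific identifier. As a rough illustration of the general idea only, the sketch below derives a stable pseudonymous identifier from an invitation identifier using a keyed hash (HMAC-SHA256). The function name, the key, and the identifier format are all assumptions made for this example, not details of the UK Biobank system.

```python
import hashlib
import hmac


def assessment_identifier(secret_key: bytes, invitation_id: str, length: int = 16) -> str:
    """Derive a stable pseudonymous identifier from an invitation identifier.

    Hypothetical helper: the real Biobank mapping is not described in the article.
    """
    digest = hmac.new(secret_key, invitation_id.encode("utf-8"), hashlib.sha256)
    # Truncate the hex digest so the identifier is short enough to handle.
    return digest.hexdigest()[:length]


# Illustrative values only; a real key would be generated and stored securely.
key = b"example-key-not-for-production"
first = assessment_identifier(key, "INV-0001")
again = assessment_identifier(key, "INV-0001")
other = assessment_identifier(key, "INV-0002")
```

The same invitation identifier always yields the same assessment identifier, while anyone without the key cannot reproduce or predict the mapping from the letter alone.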
A nurse inserts the USB stick into a workstation, enters the identifier, and completes a detailed consent process that includes a digital signature. Participants then answer some 300 questions on a touch-screen module before an interactive session with a nurse, who enters information on medical history, medications, and physical measurements (blood pressure, height, weight, bone density, lung function, etc.). After obtaining six vacutainers of blood and one of urine, the nurse gives participants a copy of their signed consent form and the basic results of their tests.
“We now have sensitive data in the assessment center computers,” says Harris. Those data are encrypted locally using the Blowfish algorithm. Every 20 minutes, new information is transmitted through encrypted tunnels over the Internet to the UK Biobank central server in Oxford. With each successful transmission, the operator receives a confirmation flag and the data packet is automatically cleared from the Center’s workstation.
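The transmit-then-clear behavior described here resembles a store-and-forward queue: a record leaves the local machine only after the receiving server confirms it. The sketch below is a minimal illustration of that pattern, not the Biobank's software. The class name `Outbox`, its methods, and the `send` callback are hypothetical, and the payloads are assumed to have been encrypted already (the Blowfish step is not shown).

```python
from typing import Callable, Dict, List


class Outbox:
    """Holds already-encrypted records until the central server confirms receipt."""

    def __init__(self) -> None:
        self._pending: Dict[int, bytes] = {}
        self._next_id = 0

    def add(self, encrypted_record: bytes) -> int:
        """Queue a record locally and return its local identifier."""
        record_id = self._next_id
        self._pending[record_id] = encrypted_record
        self._next_id += 1
        return record_id

    def flush(self, send: Callable[[bytes], bool]) -> List[int]:
        """Try to send every pending record; clear only the confirmed ones.

        `send` is a stand-in for the real transport and returns True on
        confirmation. Unconfirmed records stay queued for the next cycle.
        """
        confirmed = []
        for record_id, payload in list(self._pending.items()):
            if send(payload):
                del self._pending[record_id]
                confirmed.append(record_id)
        return confirmed

    def pending_count(self) -> int:
        return len(self._pending)
```

In a setup like the one described, `flush` would be called on a timer (every 20 minutes in the article), so a failed transmission simply leaves the record in place for the next attempt.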
Dedicated software also guides the processing of the blood samples. The vacutainers are bar-coded with color-coded caps corresponding to how each will be processed and stored. Once a blood sample is registered, the software initiates a timed process, with a visual progress bar and periodic reminders to guide operators through the processing steps. The LIMS tracks the samples, records how and where they are stored, and links the associated test results. The samples are shipped overnight to the Biobank center in Manchester.
Every night, all the data transferred that day to the central server are re-encrypted and loaded onto a secure web server at the Biobank center. At this point, “The data has finally arrived on the edges of the repository,” says Harris. The information is also archived on tape at two sites, including Oxford. In the future, follow-up data from primary care records, hospital visits, and test results maintained by the UK’s National Health Service (NHS) will be appended to the Biobank’s core data repository and linked to pre-existing data. The Biobank also hopes to gather data on death certification and cancer registration.
Although the UK Biobank may include some free text entries to establish the exact nature of a medical decision, it will primarily rely on capturing coded information. For every item entered, the IT team had to develop terminologies that map a code back to the words entered. For example, there is a terminology set for diagnoses borrowed from the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) codes.
Merging free text with codes presents dilemmas. For example, one participant might enter “Holland” as his birthplace, another “The Netherlands.” Different entries cannot share a single code, but there is reluctance to alter the source data because, as Harris points out, an accurate audit trail “is critical to clinical trials.” The Biobank solution is to create a new “version 2” record, which translates “Holland” to “The Netherlands” and assigns the appropriate code.
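One way to picture the “version 2” approach is an append-only history in which the original entry is never modified and the coded, normalized value is added as a new version. The sketch below assumes a tiny synonym table and a single illustrative country code; the names `RecordVersion`, `normalise`, `CANONICAL_NAMES`, and `COUNTRY_CODES` are invented for this example and do not come from the Biobank system.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical lookup tables, for illustration only.
CANONICAL_NAMES = {"holland": "The Netherlands", "the netherlands": "The Netherlands"}
COUNTRY_CODES = {"The Netherlands": "NL"}


@dataclass(frozen=True)
class RecordVersion:
    version: int
    text: str
    code: Optional[str]


def normalise(history: List[RecordVersion]) -> List[RecordVersion]:
    """Return a history with a coded version appended when one is needed.

    Existing versions are never edited, so the source entry stays auditable.
    """
    latest = history[-1]
    canonical = CANONICAL_NAMES.get(latest.text.strip().lower())
    if canonical is None:
        # Unknown entry: leave the history unchanged.
        return list(history)
    code = COUNTRY_CODES.get(canonical)
    if canonical == latest.text and code == latest.code:
        # Already normalized and coded: nothing to add.
        return list(history)
    return history + [RecordVersion(latest.version + 1, canonical, code)]
```

Starting from a single version containing “Holland,” `normalise` returns a two-entry history: the untouched original, followed by a version 2 record holding “The Netherlands” and its code.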
The data stored in the central repository reside in XML format, because of a desire to be compatible with the NHS switch to Health Level 7 (HL7), a framework for the integration and exchange of electronic health information. This standard provides a common language for communicating clinical data. “UK Biobank chose to use it rather than develop our own standards,” says Harris. Although he describes HL7 as being “verbose and extensive,” it offers the advantage of being system independent.
When a participant’s consent materials and baseline information arrive on the periphery of the repository, they are checked for proper encryption and completion. Next, the system creates an internal message in HL7 to register the person in the repository. This serves as a “person record” and creates a portal for entry of clinical data, though no information (except for the consent) is stored there yet.
Harris’s group has been developing an extraction tool that can retrieve data and format it into an output that would make sense to an epidemiologist trying to link participant parameters to disease susceptibility, diagnoses, and outcomes.
“We are seeing remarkable results,” says Harris. “We can turn around 125 million data points in about 5 minutes. It’s not real-time, but it’s heading in that direction.” The retrieval system is still in development, Harris adds, and “the ethical and security issues have to be wrapped around it,” to ensure that the data are linked to the correct person and that only approved people can access the data.
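For a sense of scale, the quoted figure can be converted to a rate. The short calculation below uses only the numbers in the article (125 million data points in about 5 minutes, and a target of 250 million data points); the extrapolation to the full repository assumes throughput scales linearly, which the article does not claim.

```python
# Figures quoted in the article.
data_points = 125_000_000
elapsed_seconds = 5 * 60

# Average retrieval rate: roughly 417,000 data points per second.
points_per_second = data_points / elapsed_seconds

# Assumption: linear scaling to the full 250 million data points,
# which would take about 10 minutes at the same rate.
full_repository = 250_000_000
minutes_for_full = full_repository / points_per_second / 60
```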
- Completed registry will contain information and blood samples for 500,000 Britons between the ages of 40 and 69 years at time of entry.
- Infrastructure started in 2003-04; data collection began in April 2007; more than 250,000 people already enrolled.
- Some 10% of individuals who receive an invitation letter respond.
- Each Assessment Center operates for 6-12 months before moving to another city (rolling data collection).
- Biobank registers about 3,500 new participants/week; about 14 staff can process more than 100 people/day; baseline assessment takes about 90 minutes.
- UK Biobank maintains operations in Oxford, Manchester, and Cardiff.
This article also appeared in the May-June 2009 issue of Bio-IT World Magazine.