Almost exactly two years ago I started my first job after graduating from University. I
Dr Yang Zhang’s lab does genome structure sequence modelling and analysis. Over the past year and a half they have been involved in generating new knowledge of the proteins encoded by the genome of SARS-CoV-2, the virus that causes COVID-19. The Zhang lab, along with numerous other organisations, has been using what we call ‘Open Source Principles’ to improve, to outsource, and generally to speed up critical processes that typically take a lot longer. Here are a few examples I found when talking to Harish Pillay; I then expand on why and how these same principles should be applied in a new UK government initiative.
Prof Zhang Yongzhen and the first SARS-CoV-2 genome
On the 3rd of January 2020, while some folks were recovering from New Year’s Eve, samples of the pneumonia virus from Wuhan were received by a gene sequencing lab in Shanghai. After 40 hours, on the 5th of January, the sequencing was completed using high performance computing and top-of-the-line equipment. The researchers analysed the sequence and recognised that the pathogen was similar to the SARS (severe acute respiratory syndrome) family of viruses. This raised a lot of flags because some SARS viruses are deadly to humans and there were reports of fatalities.
With this information the researchers notified authorities. But they were told not to publish their research because similar things were going on in other cities. Presumably they were waiting for another city to publish research and thus come under less scrutiny. On the 8th of January the severity of the pathogen was obvious but despite having the data that would lead to a solution, the researchers were still told not to publish their findings.
By the 11th of January other researchers, through association with the lab or friends there, got wind of it. The researchers in Shanghai were encouraged by others around the world to ignore the mandate and publish their findings anyway. They shared the sequence with one of these encouragers, a researcher at the University of Sydney, who then published it for them on virological.org, a discussion site that holds heaps of virus-related information discovered over the years, for the world to see and do with as they see fit.
Some Linux people out there will see the parallels here to kernel.org: this is essentially the upstream of the coronavirus genomic details. The entire sequence is available and is about 20,000 lines long. But it is like the source code of the virus: unless you know the field, you really need expertise to parse it.
So then the sprint began. With access to their work, labs all over the world were able to study the sequence and begin working on a vaccine.
From the 3rd to the 11th of January they went from discovery to people working on a vaccine, which was then available 12 months later. This is significant when you compare it to previous, similar cases, where a single lab or team of researchers would go through trial and error to reach the same result. So even with the resistance from the government, this ‘sprint’ of work, with contributions from around the world, greatly accelerated the progress to a vaccine.
This is a phenomenal achievement: publishing the sequence so that people could act and build upon it openly gave researchers a huge head start. Unfortunately this kind of work still cannot be done without people taking significant risks to move in the right direction. Prof Zhang Yongzhen in Shanghai, now one of Time’s 100 most influential people of 2020, and Professor Ed Holmes exhibited the principles of open source and had a global impact.
They used the principles of open source to build the trust and confidence to share with everyone in the world, on the premise ‘Release early and release often.’ The sooner you start, the sooner you can fix it.
Open Trace Protocol
Sometime in March 2020, Singapore was reeling with infections, and given their experience with a SARS virus in 2003 the Ministry of Health swung into action to handle the outbreak. The biggest lesson they had learnt was that contact tracing is critical for mitigating and managing the spread of viruses.
GovTech were experimenting with and exploring technology for contact tracing. The team designed a protocol to answer the crucial epidemiological question ‘who was exposed?’ To answer that question you need to understand two things: the duration of exposure per person, and how close those who were exposed were. Then you can make a proper, conservative estimate of how many people were exposed and do contact tracing effectively.
In their case, the research showed that exposure of 10-20 minutes within a radius of 0-6 metres is enough to say you were exposed to the virus. Armed with this information they developed the ‘BlueTrace protocol’, a communication protocol designed to do this kind of contact tracing while preserving individual privacy, and to do it across borders. On the 9th of April an app called TraceTogether, building on BlueTrace, was released, along with an open source version called OpenTrace.
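To make the two inputs concrete, here is a minimal sketch of the exposure rule described above: add up the time spent near each contact and flag anyone who crosses a duration threshold. This is not the BlueTrace protocol itself. The `Encounter` type, the `exposed_contacts` function and the two thresholds are my own assumptions for illustration, with the numbers taken loosely from the figures quoted above.

```python
from dataclasses import dataclass

# Assumed thresholds, loosely based on the figures quoted in the text.
# A real deployment would set these with epidemiologists.
MIN_MINUTES = 10.0
MAX_METRES = 6.0


@dataclass(frozen=True)
class Encounter:
    contact_id: str  # hypothetical anonymised ID of the other device
    minutes: float   # how long the two devices were near each other
    metres: float    # estimated distance during the encounter


def exposed_contacts(encounters):
    """Return the sorted IDs whose total close-range time meets the threshold."""
    totals = {}
    for e in encounters:
        # Only encounters within the assumed radius count towards exposure.
        if e.metres <= MAX_METRES:
            totals[e.contact_id] = totals.get(e.contact_id, 0.0) + e.minutes
    return sorted(cid for cid, mins in totals.items() if mins >= MIN_MINUTES)


log = [
    Encounter("a", 6, 1.5),    # two short meetings with "a" add up to 11 minutes
    Encounter("a", 5, 2.0),
    Encounter("b", 30, 12.0),  # long, but too far away
    Encounter("c", 3, 1.0),    # close, but too brief
]
print(exposed_contacts(log))  # ['a']
```

Note that nothing in this sketch needs to know where the encounter happened, only how long it lasted and how close it was.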
This open sourcing achieved trust and confidence because it meant anyone could look and see exactly what code was running in the application. The application was intended to be installed on phones, and phones obviously have lots of functionality beyond the Bluetooth the app needed. So before the open sourcing, companies tried to reverse engineer the app to figure out if their data was at risk. Open sourcing the app removed this barrier and built trust. A community developed around the app, it became trusted and robust, and organisations started to implement it.
Unfortunately, without understanding the concepts and principles of open source, companies jumped on the code and repurposed it for their own implementations, as intended, but did not then contribute back to the source. The work these organisations did and the testing they conducted to improve and strengthen the protocol didn’t make it back upstream. And so numerous companies worked on the exact same problems and were none the wiser. Things became more fragmented and less useful, with teams implementing the same code without knowing they didn’t need to.
The Goldacre review
I’m writing this article now because at the AI conference/festival CogX I caught a discussion by Matt Hancock (Secretary of State for Health and Social Care) about the role of AI in the pandemic. The conversation, while surface level, ran in the right direction of openness and privacy. The question that prompted this article, though, was one Hancock deferred to a review he has asked Dr Ben Goldacre, Director of the DataLab at the University of Oxford, to conduct. Paraphrased, the question was:
‘What will be done with the data and the technologies the NHS collects and develops as a result of this pandemic, and what restrictions or lines will be drawn up around what is NHS property and what can be used commercially?’
The review has 12 terms of reference, to which I propose answers:
1. How do we facilitate access to NHS data by researchers, commissioners, and innovators, while preserving patient privacy?
In my second example, the OpenTrace protocol didn’t collect users’ GPS data because it didn’t matter for contact tracing. In the same way, research on this scale of data doesn’t require patient identification. Privacy can be maintained by removing or at least encrypting identifying patient data.
Similarly, I can imagine a tool akin to GitHub that allows for collaboration, licensing and proper access rights to store and preserve private or open data. Then, when patients release their data, the ‘repo’ can be ‘open sourced’. As you’ll see in the Open Patient video, when you’re suffering you want to make sure everyone who needs it has your information.
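As a rough illustration of ‘removing or at least encrypting’ identifying data, the sketch below drops direct identifiers from a record and replaces the patient ID with a keyed hash, so records for the same patient can still be linked without exposing who they are. The field names, the `pseudonymise` function and the key are hypothetical, and this is only one possible approach. On its own it does not guarantee anonymity, since combinations of the remaining fields can still identify someone.

```python
import hashlib
import hmac

# Hypothetical field names; a real dataset would have its own schema.
DIRECT_IDENTIFIERS = {"name", "address", "phone"}


def pseudonymise(record, secret_key, id_field="patient_id"):
    """Return a copy without direct identifiers and with a keyed-hash ID."""
    # Keep every field that is not a direct identifier.
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # HMAC-SHA256 gives the same pseudonym for the same ID and key,
    # but the ID cannot be recomputed without the secret key.
    digest = hmac.new(
        secret_key, str(record[id_field]).encode("utf-8"), hashlib.sha256
    )
    cleaned[id_field] = digest.hexdigest()
    return cleaned


record = {"patient_id": 12345, "name": "Jane Doe", "condition": "asthma"}
safe = pseudonymise(record, secret_key=b"demo-key-keep-this-private")
print(sorted(safe))  # ['condition', 'patient_id']
```

The secret key would stay with the data holder, so researchers see consistent pseudonyms but cannot reverse them.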
2. What types of technical platforms, trusted research environments, and data flows are the most efficient, and safe, for which common analytic tasks?
This is not my area of expertise but I could ask a myriad of people in data science who I would be confident would give me a good answer. The issue here is not what types of platforms but how they are run and implemented in a way that is conducive to open principles.
3. How do we overcome the technical and cultural barriers to achieving this goal, and how can they be rapidly overcome?
Transparency. Full transparency into the research, the technical implementations and the cultural challenges allows anyone who is interested in learning about and overcoming these barriers to get involved. If this is accompanied by a simple way for interested people to feed back and contribute, it is no longer a barrier but an opportunity.
4. Where (with appropriate sensitivity) have current approaches been successful, and where have they struggled?
See above examples.
5. How do we avoid unhelpful monopolies being asserted over data access for analysis?
Open source licensing with assurances that derivative or downstream projects contribute back to the source. If that data is open sourced, the same way the Linux kernel is open source, those with the expertise to properly learn from and develop that data will do so, and anyone trying to monopolise it in a closed way is at a disadvantage because the open project would have significantly more hands on deck.
6. What are the right responsibilities and expectations on open and transparent sharing of data and code for arm’s length bodies, clinicians, researchers, research funders, electronic health records and other software vendors, providers of medical services, and innovators? And how do we ensure these are met?
The responsibility is to contribute. Whatever these data holders do with the data, there needs to be a way for them to feed back their learnings and results so everyone else can benefit. Then, assuming privacy follows my other points, that’s it.
7. How can we best incentivise and resource practically useful data science by the public and private sectors? What roles must the state perform, and which are best delivered through a mixed economy? How can we ensure true delivery is rewarded?
The state should make the initial commitment and fund a team of data scientists, researchers and open source maintainers to bootstrap the effort. The effort should get a very transparent amount of funding from the government, which the initial team uses however it sees fit. Then, when various projects grow out of this initial place, each should have a clear and simple way to contribute back to the source and to collect donations from contributors, or from people without the skills to help but with money to contribute. Finally, for significant projects that break ground and make a difference, the state should award further funding to the splinter projects to reward their development and contributions.
8. How significantly do the issues of data quality, completeness, and harmonisation across the system affect the range of research uses of the data available from health and social care? Given the current quality issues, what research is the UK optimally placed to support now, and what changes would be needed to optimise our position in the next 3 years?
I am unaware of this issue, but again I could ask numerous people and get the right answer. At least this is the right question, even if it uses the ‘optimise’ buzzword all over the place. It asks ‘What can we do?’, but that needs to be followed by ‘How can we do it?’
9. If data is made available for secondary research, for example to a company developing new treatments, then how can we prove to patients that privacy is preserved, beyond simple reassurance?
Open it up. If the secondary researchers have the data, so should anyone else have access to it. Before opening it in this way there needs to be removal or encryption of identifiable data.
10. How can data curation best be delivered, cost effectively, to meet these researchers’ needs? We will ensure alignment with Science Research and Evidence (SRE) research priorities and Office for Life Sciences (OLS) (including the data curation programme bid).
See open source software tools.
11. What can we take from the successes and best practice in data science, commercial, and open source software development communities?
Everything I have just mentioned and more. Please reach out for more.
12. How do we help the NHS to analyse and use data routinely to improve quality, safety and efficiency?
Open source it and have, likely, hundreds of people working on improving quality, safety, efficiency and clarity, and doing analysis out of their own goodwill.
I care very much about both the NHS and its people, patients and health care professionals, and about furthering the use of open source principles. If, by some chance, Dr Goldacre or Matt Hancock sees this, I would very much like to help in this effort.
If you’re wondering why both of my examples are based in Singapore, it’s because I took them straight from Harish Pillay over on YouTube. I wanted to write it up here to bring a little more attention and so I could write my thoughts about it all here too. I believe his presentation was spurred from the ‘Open Patient’ short film Red Hat produced. An excellent example of open source principles and volunteer contributors.
And if you know of any more stories of open source principles being used in the pandemic, I’d love to hear them. Please tweet me @rhys_the_davies on Twitter. Maybe we can compile an inspirational list of examples.