Antibody Testing: What We Know From Anonymised Test Data and How We Handle It

12th November 2020

As a not-for-profit organisation, Testing For All (TFA) has been providing affordable test services to counter COVID-19 since September. By the end of October, nearly ten thousand tests had been carried out. The test data gives us valuable first-hand insight into the presence of COVID-19 antibodies among the UK population, and helps inform public health authorities’ responses to the pandemic.

What does the data tell us?

As mentioned in our September article Understanding your Roche Anti-SARS-CoV-2 test result, the test results first and foremost give us an overall picture of the percentage of participants who tested positive – a positive result means the participant has very likely had an immune response resulting in the development of antibodies to the COVID-19 coronavirus.

From the charts we can see that, in September 2020, about 13.5% of all valid tests came back positive, and this figure rose slightly to 14% in October. It is worth mentioning that our data is heavily skewed towards people who had, or thought they had, been exposed to the virus; this percentage is therefore not an accurate representation of the general population. And, fortunately or unfortunately, this rate of increase does not resemble a ‘herd immunity’ trajectory, even among customers who voluntarily sought antibody tests.

What is interesting to see is perceived exposure versus the actual test outcome. As part of our test kit activation process, we ask participants whether they have, or think they have, been exposed to the virus. As the chart shows, about two-thirds of participants who had or thought they had been exposed tested negative for antibodies; yet among those who did test positive, nearly 90% had reported exposure. In other words, thinking you have been exposed is not a reliable indicator of whether you have contracted the virus; and, of course, if you don’t think you have been exposed, it is unlikely you have developed antibodies.

While we don’t have enough data for people aged 80+, we don’t see a strong correlation between test results and age in the other groups either. Although the 20-39 age group has the highest percentage of positive cases (16.34%), this is only about 2.5 percentage points higher than the overall figure, and many factors could contribute to it – for example, the likelihood of exposure, or of ordering a test online. So this data alone cannot be interpreted as “young people are more likely to develop antibodies”.

We also use the data to improve our services. For example, we identified that “insufficient blood sample” is the top reason our customers fail to complete the test, so we are seeking to improve the product design in our next iteration. Through monitoring our service standard, we are also delighted to find that the time it takes from order creation to result delivery has shortened: in October, the median turnaround time from order to result was 7 days, which includes kit delivery, the customer taking the sample, and sending it back to the lab. We’ve also noticed that the vast majority of our customers view their results as soon as they are notified – which gives us a sense of purpose: what we do is meaningful to our customers.

How do we anonymise the data?

It is important to mention that all the data used for such analysis is handled anonymously at TFA. We employ a variety of data anonymisation techniques to ensure the data is no longer personally identifiable while keeping it meaningful.

Removing data that is not meaningful for the purpose

Data that is not meaningful for our analysis is removed. For example, fields like names and addresses are unlikely to provide any insight, so they are removed straight away. This technique is also called ‘attribute suppression’. After this step, the majority of highly sensitive data has already been removed.
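To illustrate, here is a minimal sketch of attribute suppression in Python – the field names and example record are hypothetical, not our actual schema:

```python
# Hypothetical directly identifying fields with no analytical value.
SUPPRESSED_FIELDS = {"name", "address", "email", "phone"}

def suppress_attributes(record: dict) -> dict:
    """Return a copy of the record with directly identifying fields removed."""
    return {k: v for k, v in record.items() if k not in SUPPRESSED_FIELDS}

record = {"name": "Jane Doe", "postcode": "SW1A 1AA", "result": "positive"}
print(suppress_attributes(record))  # {'postcode': 'SW1A 1AA', 'result': 'positive'}
```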

Data masking

Sometimes a data field is useful for analysis, but keeping the full value could make individuals identifiable – postcodes, for example. For data like this, we perform data masking to keep only part of the value – in this case, the first part of a postcode.
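A sketch of this kind of masking, assuming UK postcodes written with a space between the outward and inward codes:

```python
def mask_postcode(postcode: str) -> str:
    """Keep only the outward code (the first part) of a UK postcode."""
    return postcode.strip().split()[0]

print(mask_postcode("SW1A 1AA"))  # SW1A
```

The outward code still locates a participant only to a district of thousands of households, which is granular enough for regional analysis.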

Data segmentation

Data segmentation is used to reduce the granularity of the data. In our case, the age of the participant at the time of the observation is meaningful, but the date of birth probably is not. We go one step further with age information – because age can, in extreme cases, be combined with other pieces of information, such as postcodes, to identify a person – so we apply a technique called “bucketing” to group ages into several bands: 0-19, 20-39, 40-59, 60-79 and 80+.
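Bucketing into the five bands above can be sketched like this (a simplified illustration, not our production code):

```python
def age_bucket(age: int) -> str:
    """Map an exact age into one of five 20-year bands, capped at 80+."""
    if age >= 80:
        return "80+"
    lower = (age // 20) * 20          # round down to the band's lower bound
    return f"{lower}-{lower + 19}"

print(age_bucket(35))  # 20-39
```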

Hashing

Hashing is useful when we need to join the data from different systems together – we don’t want the data to be personally identifiable, but at the same time, we want to keep a unique identifier to distinguish individuals – which is as tricky as it sounds. ‘Keyed cryptographic hashing’ is a primary technique for this purpose recommended by the European Data Protection Supervisor (EDPS) and the UK’s Information Commissioner’s Office (ICO). This technique is used for data like ID fields.

A hash function is a ‘one-way’ cryptographic algorithm. A hash is like a fingerprint of a piece of data: it is unique to that data, but it does not contain the full information, so it cannot be used to reconstruct the original. ‘Keyed’ means the hash is protected by a (usually very long) secret. Combined, even if an attacker knows the secret and the hashing algorithm, the only way they can uncover the original information is through a brute-force attack – trying each possible combination of the original text one by one.
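A minimal sketch of keyed cryptographic hashing using HMAC-SHA256 from Python’s standard library (the secret and identifier here are placeholders – in practice the key would be long, random, and stored in a secrets manager):

```python
import hashlib
import hmac

SECRET_KEY = b"a-long-randomly-generated-secret"  # placeholder, illustrative only

def pseudonymise(participant_id: str) -> str:
    """Produce a keyed hash (HMAC-SHA256) of an identifier.

    The same input always yields the same token, so records from
    different systems can still be joined on the hashed value.
    """
    return hmac.new(SECRET_KEY, participant_id.encode(), hashlib.sha256).hexdigest()
```

Because the output is deterministic for a given key, two datasets hashed with the same key can be joined on the token without either dataset ever containing the raw identifier.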

Although anonymised data is not subject to the GDPR, the handling of such data is still crucially important. The anonymised data is stored separately from the operational data; we use enterprise-grade products like Google’s BigQuery to ensure the data is encrypted in transit and at rest; only authorised individuals have access to the analytics tools; and only aggregated reports are disclosed to a wider audience. These practices are in line with Anonymisation: Code of Practice, published by the ICO.

We hope each of the small steps we are taking conveys the message that we take our customers’ privacy seriously, and that the outcome of this analysis gives you a better understanding of the current state of antibody testing.