As some of you might know, I recently made a (fairly large) career change. I’m now a Research Fellow in Health Data Analytics at The University of Leeds, UK. Since I’m coming from a maths/HPC background and have almost no prior knowledge of healthcare I’ve had a bit of catching up to do! Overall I’m aiming to apply machine learning and large-scale data analysis to large datasets arising from health and socio-economic domains, where my skills in linear algebra and HPC will undoubtedly become useful.
In this post I aim to summarise what I’ve learned in my first few weeks on the job. This is partly to archive things for myself but should also be interesting reading for anyone who wants to get involved with health data. I’ll touch upon
- the legal issues arising from using private data,
- interesting talks and papers that I’ve found, and,
- some recurring opinions on the future challenges in the field.
The most daunting aspect so far has (easily!) been the legal framework required for working with health data (in the UK at least). There are a myriad of guidelines, reports, and rulings to be aware of, and many of the legal clauses also have exemptions for research. The basis of the legal framework is the Data Protection Act, though there are plenty of subtleties in deciding who the data controllers and/or processors might be for a given scenario. Further guidelines targeted more specifically at health include the “Caldicott” report and the NHS Information Governance Toolkit system. There is some online training on the Data Protection Act given by the Medical Research Council here. It seems that this will only become more complicated as of 25th May 2018 when the stricter EU General Data Protection Regulation (GDPR) comes into effect.
A more accessible part of the legislation (at least for me) was the set of requirements on anonymisation of datasets, both for distribution to data processors and when publishing results. Typical techniques include replacing identifiers with salted hashes (the resulting values are often called pseudonyms) and generalising fields to achieve k-anonymity (e.g. removing the last few digits of a postcode so that a wider geographical area is used and each record is indistinguishable from at least k-1 others). These techniques are aimed at reducing the risk of re-identifying a patient given your dataset and others that might be freely available. There is, of course, a trade-off to consider between security and the utility of the data in analyses.
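To make these two ideas concrete for myself, here is a minimal Python sketch. Everything in it is an assumption for illustration: the records, the field names (`nhs_number`, `postcode`, `age_band`), the postcodes and the salt are all made up, and the helper functions are my own rather than anything taken from an official toolkit. It replaces an identifier with a salted SHA-256 hash, truncates postcodes to their outward part, and then checks the size of the smallest group of records sharing the same quasi-identifiers (the "k" in k-anonymity).

```python
import hashlib
from collections import Counter


def pseudonymise(identifier: str, salt: bytes) -> str:
    """Return the SHA-256 hex digest of salt + identifier (a salted hash)."""
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


def generalise_postcode(postcode: str) -> str:
    """Keep only the part of the postcode before the space, e.g. 'LS1 1AA' -> 'LS1'."""
    return postcode.strip().upper().split()[0]


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())


# Entirely made-up records; none of these correspond to real patients.
records = [
    {"nhs_number": "0000000001", "postcode": "LS1 1AA", "age_band": "30-39"},
    {"nhs_number": "0000000002", "postcode": "LS1 2BB", "age_band": "30-39"},
    {"nhs_number": "0000000003", "postcode": "LS6 3CC", "age_band": "60-69"},
    {"nhs_number": "0000000004", "postcode": "LS6 4DD", "age_band": "60-69"},
]

# Hypothetical salt: in practice it must be kept secret and stored away from the data.
SALT = b"example-salt-not-for-real-use"

anonymised = [
    {
        "pseudonym": pseudonymise(r["nhs_number"], SALT),
        "postcode": generalise_postcode(r["postcode"]),
        "age_band": r["age_band"],
    }
    for r in records
]

k_before = smallest_group(records, ["postcode", "age_band"])
k_after = smallest_group(anonymised, ["postcode", "age_band"])
print(f"k before generalisation: {k_before}, after: {k_after}")
```

With full postcodes every record is unique on (postcode, age band), whereas after truncation each record shares its values with one other. That is exactly the trade-off mentioned above: the coarser postcode is safer but tells you less about where someone lives.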
As an academic I am interested in publishing articles. When moving to a new field it is rather difficult to decide which journals are worth your time and which are junk, and common metrics such as impact factors do not necessarily translate well between fields. For instance, in medicine it appears that the impact factor of a journal can easily be 10 times greater than those in maths.
Here is a list of some journals recommended by my new colleagues.
When searching through the journals above for papers combining machine learning with healthcare in interesting ways, two review papers were very helpful in gaining a picture of the current research situation.
NIPS – Machine Learning for Health
The machine learning conference NIPS has a session dedicated to machine learning for health data. Whilst the talks for the 2015 sessions seem to be unavailable now, the 2016 talks can still be found here. All the main talks were very interesting and raised a number of (almost philosophical) questions about the field. Some of the most fundamental problems include accounting for bias and incomplete data. For example, when using routinely collected data rather than data from a trial, one never truly knows whether an individual does or does not have a disease unless their physician tests them for it. Therefore, if we are modelling the probability of an individual having a certain disease, our regression model is biased, since the people who were diagnosed with the disease already had physicians who suspected they might have it! There may be many individuals who had the disease but were not diagnosed, which can happen for a variety of reasons including frailty, life expectancy, costs, and simply not wanting to be tested (which is not uncommon for cancer and Parkinson’s etc.).
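A toy simulation helped me convince myself of how strong this effect can be. The sketch below is purely illustrative and every number in it is an assumption of mine (a 10% true prevalence, the chances of being symptomatic, the chances of a physician ordering a test, and a test that is never wrong); none of it comes from the talks. It compares the true prevalence with what routinely collected records would show.

```python
import random


def simulate(n=100_000, seed=0):
    """Simulate diagnoses that are only recorded when a physician orders a test."""
    rng = random.Random(seed)
    diseased = tested = recorded = 0
    for _ in range(n):
        # Assumed true prevalence of 10%.
        has_disease = rng.random() < 0.10
        # Assumed: diseased people are far more likely to show symptoms.
        symptomatic = rng.random() < (0.70 if has_disease else 0.10)
        # Assumed: physicians mostly test the people who show symptoms.
        is_tested = rng.random() < (0.80 if symptomatic else 0.05)
        diseased += has_disease
        tested += is_tested
        # A diagnosis only reaches the records if the person was tested
        # (the test itself is assumed to be perfect).
        recorded += has_disease and is_tested
    return {
        "true_prevalence": diseased / n,
        "recorded_prevalence": recorded / n,
        "prevalence_among_tested": recorded / tested,
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the recorded prevalence underestimates the truth (undiagnosed cases are invisible), while the prevalence among those tested overestimates it (physicians test the people they already suspect). A model fitted naively to either view inherits that bias.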
Throughout the literature and talks that I’ve ingested thus far, there have been a number of recurring themes about the challenges that researchers in this area will face over the next decade. These include, but are by no means limited to
- the retrieval of similar patient information (via distance metric learning),
- treatment recommendations (taking into account drug interactions etc.), and,
- the dubious quality of the data (e.g. missing or wrong) and its increasing size (wearable trackers).
It’s certainly clear that there is plenty of work to do in this exciting field and, hopefully, I can find my niche somewhere along the way.