Health data quality – a two-edged sword

In Brian Ahier’s “Data Not Drugs” post on the O’Reilly Radar blog, a couple of phrases leapt out at me:

“The more accurate information available to researchers, care providers, and consumers, the better decisions we can all make and the more options we will have for successful outcomes” and Goetz’s comment “…let the data do the work…”.

Hmm, “the data” – it is a two-edged sword, with a potential for enormous good and the potential for… well, bad things can happen with poor quality data.

Our health data is scattered – temporally and geographically; paper and electronic; on scraps in drawers and folders in filing cabinets; EHRs, EMRs, PHRs and research database silos; and distributed amongst a huge range of different providers, starting with our usual family physician and specialists but then extending to those forgotten emergency rooms; late night home visits; physiotherapists, dentists, naturopaths, iridologists and the myriad of others you have consulted over a lifetime. Our health information is probably more fragmented than most realise – all that potentially valuable data not easily available to us…

In my opinion, the consumer/patient/individual/client – whatever term you want – is a natural holder of the records from all these sources as they are the one common factor, the subject of the information. No matter how well your primary care doctor has kept their records on an individual for the last 20 years, they will not have as rich and broad a record as one that might be gathered from all sources by the patient, including their own data entry. A key opportunity in the evolution of the personal eHealth revolution is to support patients in gathering and managing that health information optimally, and to support them in taking a more pro-active role in their healthcare.

Quality of data is always an issue wherever it is gathered, and sources of health information should be clearly expressed. Consumers should not be singled out, suggesting that consumer-entered data is of lesser value or importance.  Clinician data entry and maintenance in any professional EHR or EMR face exactly the same types of issues.

Ensuring data quality in one record by one source (either provider or consumer) is not a trivial task. Ideally, the health data in an electronic record should be accurate, up-to-date and complete; but unfortunately the real world is far from ideal.  And high data quality needs us to have a very clear understanding of the meaning, context and intent of the data – unambiguous and, ideally, standardised computable definitions of data which can form the basis for future safe decision-making.

Take a relatively common example – what could the term ‘diagnosis’ with a value of ‘Type II diabetes’ mean in your health record?

  • Does it mean that you have been diagnosed with diabetes? This is the usual assumption, but not always correct.
  • Does it mean that one of your differential diagnoses is diabetes or that at one point in your life it was considered as a diagnosis and then subsequently discounted?
  • Does it mean that your grandmother or a child has diabetes? Is it actually about you or part of your family history?
  • Does it mean that a diagnosis of diabetes was excluded, actually meaning that at a certain time you were found not to have diabetes? Negation is often handled very ambiguously in electronic health records.
  • Has it been entered as a billing code on discharge from hospital, but has somehow been integrated into your list of diagnoses?  ePatientDave experienced this with Google Health in 2009.
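To make the ambiguity concrete, here is a minimal sketch of how a record entry could carry the missing context explicitly. Everything in it is a hypothetical illustration, not any real EHR standard: the `Status`, `Subject` and `Origin` categories, the `DiagnosisEntry` fields, and the rule that a billing code should not count as a current problem are all assumptions chosen to mirror the questions above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Status(Enum):
    """Hypothetical clinical status of the statement."""
    CONFIRMED = "confirmed"
    DIFFERENTIAL = "differential"  # still under consideration
    DISCOUNTED = "discounted"      # once considered, later ruled out
    EXCLUDED = "excluded"          # explicitly found NOT to be present


class Subject(Enum):
    """Who the statement is actually about."""
    PATIENT = "patient"
    FAMILY_MEMBER = "family member"


class Origin(Enum):
    """Where the statement came from (assumed categories)."""
    CLINICAL_NOTE = "clinical note"
    BILLING_CODE = "billing code"
    PATIENT_ENTERED = "patient entered"


@dataclass(frozen=True)
class DiagnosisEntry:
    condition: str
    status: Status
    subject: Subject
    origin: Origin
    recorded_on: date

    def counts_as_current_problem(self) -> bool:
        # Assumption: only a confirmed diagnosis about the patient, not
        # derived from a billing code, belongs on the problem list.
        return (
            self.status is Status.CONFIRMED
            and self.subject is Subject.PATIENT
            and self.origin is not Origin.BILLING_CODE
        )


# Five entries that a naive system would all display as
# diagnosis = 'Type II diabetes'. The dates are invented.
entries = [
    DiagnosisEntry("Type II diabetes", Status.CONFIRMED, Subject.PATIENT,
                   Origin.CLINICAL_NOTE, date(2009, 5, 1)),
    DiagnosisEntry("Type II diabetes", Status.DIFFERENTIAL, Subject.PATIENT,
                   Origin.CLINICAL_NOTE, date(2008, 11, 3)),
    DiagnosisEntry("Type II diabetes", Status.CONFIRMED, Subject.FAMILY_MEMBER,
                   Origin.PATIENT_ENTERED, date(2007, 2, 14)),
    DiagnosisEntry("Type II diabetes", Status.EXCLUDED, Subject.PATIENT,
                   Origin.CLINICAL_NOTE, date(2006, 8, 20)),
    DiagnosisEntry("Type II diabetes", Status.CONFIRMED, Subject.PATIENT,
                   Origin.BILLING_CODE, date(2009, 6, 30)),
]

current_problems = [e for e in entries if e.counts_as_current_problem()]
```

With the context recorded, only the first entry survives as a current problem; the other four stay in the record but can no longer be mistaken for one.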

If we then aggregate data from multiple sources, the problem of ensuring data quality is amplified manyfold. Examples include: out-of-date data expressed as ‘current’; conflicts in the data; and only having partial data. Opening EHR/EMR data to consumer scrutiny and/or custodianship will certainly help improve the quality of clinicians’ records. And while consumer-entered data may also provide insights into tracking personal parameters that no clinician could ever hope to collect, it may also introduce misunderstandings through the lack of a standard vocabulary and ambiguities through use of euphemisms for conditions such as ‘lumbago’ and ‘nerves’.
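The three aggregation problems named above (stale data, conflicts and partial data) can at least be flagged mechanically. The following is a rough sketch only: the `SourcedFact` structure, the one-year staleness threshold, the list of expected items and the sample data are all invented for illustration, and a real check would need clinical judgement about what counts as out of date.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class SourcedFact:
    source: str          # which provider or record the fact came from
    item: str            # what the fact is about, e.g. a medication
    value: str           # what that source says about it
    last_reviewed: date  # when that source last confirmed it


def find_conflicts(facts: Iterable[SourcedFact]) -> List[str]:
    """Items for which different sources report different values."""
    values_by_item = defaultdict(set)
    for fact in facts:
        values_by_item[fact.item].add(fact.value)
    return sorted(item for item, values in values_by_item.items()
                  if len(values) > 1)


def find_stale(facts: Iterable[SourcedFact], today: date,
               max_age_days: int = 365) -> List[SourcedFact]:
    """Facts not reviewed within max_age_days (an arbitrary threshold)."""
    return [f for f in facts if (today - f.last_reviewed).days > max_age_days]


def find_missing(facts: Iterable[SourcedFact],
                 expected_items: Iterable[str]) -> List[str]:
    """Expected items that no source has recorded at all."""
    present = {f.item for f in facts}
    return sorted(set(expected_items) - present)


# Invented sample data from three hypothetical sources.
facts = [
    SourcedFact("family physician", "metformin", "current", date(2010, 3, 1)),
    SourcedFact("hospital discharge", "metformin", "ceased", date(2010, 6, 15)),
    SourcedFact("personal health record", "smoking status", "non-smoker",
                date(2007, 1, 10)),
]
today = date(2010, 9, 1)

conflicts = find_conflicts(facts)
stale = find_stale(facts, today)
missing = find_missing(facts, ["metformin", "smoking status", "allergies"])
```

Here the two sources disagree about metformin, the smoking status has not been reviewed for over three years, and nobody has recorded allergies at all. None of these checks says which source is right; they only show where a human needs to look before the pooled record is trusted.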

I have been observing the rise and rise of the ‘pro-active health consumer’ since before 2000 – the transformation of the paternalistic clinician/patient relationship to a more equal partnership.  My first interest was triggered by reading about the ground-breaking work by Kate Lorig at Stanford University on supported self management by patients with chronic disease. Ten years later, the healthcare landscape is vastly different but the combination of quality health data, models such as Jen McCabe’s microchoices and Goetz’s decision trees, personalised medicine, evolving social networks, personal health records and clinician/consumer decision support gives huge potential to influence long-term health outcomes.

Finding the PatientsLikeMe (PLM) website for the first time in 2008 was an exciting discovery for me – I still regard it as one of the first significant Health2.0 applications.  It is potentially a powerful agent for change – meeting consumer needs for a personalised and visual tool providing dynamic feedback about our health information; not just a traditional passive record. It also links the consumer into a community of patients undergoing a similar health journey, sharing stories via data, and providing insights into how others are coping and being treated – perhaps being the trigger for the consumer to engage with the medical system in a more informed way. These are real strengths.

However, I also have concerns regarding the wisdom and validity of pooling PLM data for research purposes, especially after viewing the recent TED talk by PLM founder Jamie Heywood where he demonstrates how patterns in others’ data could be the basis for personal decision-making. How many people have entered ‘dummy’ data into their PLM account to test it out? It is certainly what I do on most websites until I have some understanding of the implications of entering real and personal data. How accurate is the real data that has been entered? How much is well-meaning, but just plain wrong? How much is critical but missing? This is where I suspect that PLM is exceeding its remit, and questions should be asked about the quality of the pooled PLM community data. Comparing a single individual against the data profiles of other community patients with no idea of the quality or completeness of the collective data is not the basis for sound and safe decision-making. Yet in the TED talk Jamie Heywood tells us that PLM can “predict what will happen” to an individual based on the accumulated profiles of all the other ‘equivalent’ patients. That is a grand claim – one that needs qualifying by those with expertise greater than mine and, in my opinion, should be approached very cautiously by consumers.


There is no doubt that we can do great things with high quality data, we can cause harm with bad quality data, and the majority of our data is probably somewhere in between. If we leverage and re-use existing data such as a list of diagnoses or a family history, and especially if it is derived from multiple sources, we need to understand the semantic strength and limitations of the data – and to draw conclusions only where it is valid. The eHealth community has a responsibility to improve and standardise data quality for both clinicians and consumers – both data entry and maintenance of existing data – and ensure that the data is ‘fit for purpose’. In addition, we, as informed clinicians and consumers, need to understand our data better, make assessments about the data completeness and quality, and factor this into our decision-making process to ensure that we make wise and safe health choices.

4 thoughts on “Health data quality – a two-edged sword”

  1. Thanks for writing this Heather. You have made some excellent points. Data quality is absolutely key. I would venture to say that there are times when no data is better than bad data…

  3. Great article Heather. Whilst, of course, totally signed up to the potential of semantically interoperable records, I think we will have to live with siloed ‘controlled’ records, e.g. PCEHR, GPEHR, SCEHR (secondary care), so that data quality and governance can be maintained.
