How To Use Imperfect Data

George Box, the statistician, once quipped that “All models are wrong, but some are useful.”  The same could be said of data: all data is imperfect, but some data can be useful.  When working with data there is a tendency to either completely ignore the data as irrelevant, or embrace it as the ultimate source of truth.  All data, or at least all of the data I’ve encountered, is neither.  There is one thing I’ve learned about data and that is this: How you use data is just as important as whether you use it.  So, how can one use data knowing it is, most likely, imperfect?  Let me explain one way to approach it.

First, accept it for what it is.  And I mean warts and all.  It is flawed.  Get over it.  You will remain forever paralyzed if you cannot accept this precondition for working with data.  The question is not whether the data is imperfect; the more important thing to know is in what ways it is imperfect.  Understanding the limitations of your data allows you to focus on how best to use it.  If a part of your data contains cost information and your cost accounting system is from the 19th century, then you now know this data element will probably not be very helpful.  Move on to the data that is helpful.

Second, just about all data can provide directional information.  If you are hoping your data will point you to the exact address, but all it is telling you is to go north, then that is probably all the information you will beat out of it.  As Ronald Coase, the economist, once said, “If you torture the data enough, nature will always confess.”  Unfortunately, there is a tendency to want to analyze it too much when we cannot find the specific answer we want.  Accept what it tells you and no more.

Third, when you have multiple sources of data, triangulate to get closer to the truth.  A few years ago, we were reviewing a single source of data on physician outcomes.  The findings were inconsistent: higher complication rates than expected, but shorter hospital lengths of stay and lower costs than expected.  Patients with complications typically stay longer and cost more, not less.  Two additional sources of data allowed us to determine that the issue was not clinical quality, but documentation and coding.  As imperfect as those sources were, we were able to triangulate and identify the real issue.

Fourth, be clear about what level the data is describing.  If the data is only describing an issue at the department level, then don’t assume the issue is across the entire system.  Limit your assessment to only what the data is describing, and not to what you think it might be describing.  This becomes important when you implement a solution to the issue.  If the issue is only at the department level and you implement a systemwide solution, then you waste resources.  If the issue is at the system level and you only implement a department-level solution, then the real problem never gets addressed.

Finally, keep in mind that any data set has only so many answers.  Understanding how to use imperfect data depends on the questions we ask of it.  Or as W. Edwards Deming once said, “If you do not know how to ask the right question, you discover nothing.”  Data is imperfect, but the right question can maximize what it can tell us without having to resort to torture.

