Tuesday, February 28, 2012

The flu and likelihood ratios

An interesting study was just published in the Annals of Internal Medicine. It was a meta-analysis of rapid influenza diagnostic tests (RIDT) and their characteristics. Since we have been spending so much time talking about test characteristics, this study provides a nice opportunity to discuss another way of looking at the values of positive and negative tests. This is going to be a fairly short post, since I simply want to discuss these additional tools.

In the current study the investigators evaluated how well these RIDTs predicted the disease. As we have discussed in multiple places on this blog (here, here and here, to name a few), it matters whether the test is used for screening or diagnosis. In this case, the testing was done in symptomatic populations, so for diagnostic reasons. The authors report that there was quite a bit of heterogeneity in their findings, but the ultimate result is reported as positive (34.5) and negative (0.38) likelihood ratios. What are they and how do we interpret them?

A positive likelihood ratio, or LR+, is the ratio between sensitivity and 1-specificity (LR+=[sensitivity]/[1-specificity]). Sensitivity is the proportion of patients with the disease who are identified as having the disease, or the true positive (TP) rate, and specificity is the proportion of persons without the disease who are identified as not having the disease, or the true negative (TN) rate. The complement of specificity, 1-specificity, is the false positive (FP) rate. So, the LR+ equates to the TP rate divided by the FP rate: it tells us how much more likely a positive result is in someone with the disease than in someone without it. In the current study it is about 34. If we start from even pre-test odds (1 to 1), this means that the odds are 34 to 1 that a positive test indicates the presence of the disease. Another way of putting it is that of the 35 total positive test results, 34 (97.1%) represent true disease. At that pre-test probability, this is essentially equivalent to the positive predictive value (PPV).

Now, let's examine the negative likelihood ratio, or LR-. This is defined as the ratio between the complement of sensitivity (1-sensitivity) and specificity (LR-=[1-sensitivity]/[specificity]). 1-sensitivity is the proportion that are false negative (FN), while the specificity is the proportion of persons without the disease who are identified as such. In the study this LR- was 0.38. Starting again from even pre-test odds, this means that the odds of disease after a negative result are still 0.38 to 1, or not so great. In other words, out of roughly every 4 negative tests, about 3 are truly negative, while 1 is a false negative, giving us the negative predictive value (NPV) of about 72% (1/1.38=0.72).
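To make the arithmetic behind both likelihood ratios explicit, here is a minimal Python sketch that turns a pre-test probability into a post-test probability. The function names and the 50% pre-test probability are assumptions chosen for illustration; only the LR values of 34.5 and 0.38 come from the study.

```python
def to_odds(probability):
    """Convert a probability (0..1) into odds."""
    return probability / (1 - probability)


def to_probability(odds):
    """Convert odds back into a probability (0..1)."""
    return odds / (1 + odds)


def post_test_probability(pretest_probability, likelihood_ratio):
    """Post-test odds = pre-test odds x likelihood ratio."""
    return to_probability(to_odds(pretest_probability) * likelihood_ratio)


# Assumption for illustration: even pre-test odds (50% pre-test probability).
pretest = 0.5

# Probability of flu after a positive test (LR+ = 34.5 from the study).
p_flu_after_positive = post_test_probability(pretest, 34.5)

# Probability of flu after a negative test (LR- = 0.38 from the study).
p_flu_after_negative = post_test_probability(pretest, 0.38)

print(f"Flu given a positive test:    {p_flu_after_positive:.1%}")      # 97.2%
print(f"No flu given a negative test: {1 - p_flu_after_negative:.1%}")  # 72.5%
```

Changing `pretest` shows how strongly both answers depend on how likely the flu was before the test was run.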

So, there you have it. The clinical take-away, as the authors noted, is that these RIDTs are good at ruling in the flu, but not at ruling it out. In other words, the problem here is the opposite of what we discussed in all those previous posts: it is the rate of false negatives. And this makes sense, given that the pre-test probability is reasonably enriched in populations with symptoms, in addition to the relatively poor sensitivities of these technologies.


Monday, February 27, 2012

Some thoughts on denominators

I decided to re-post this piece dating back from May 2009 in the wake of the recent reports about bird flu. As it turns out, this flu may not be quite as deadly as we had previously thought. And, yes, this revelation is all about the denominator.

Let's face it: denominators keep numbers (and people reporting them) honest. Imagine if I said that there were 3,352 cases of a never-before-seen strain of flu in the US. To be sure, 3,352 cases is a large enough number to send us rushing to buy a respirator mask! But what if I put it slightly differently and said that out of the population of roughly 300,000,000 individuals, 3,352 have contracted this strain of flu? I think this makes things a little different, since it means that the risk of contracting this flu to date is about 1 in 100,000, a fairly low number as risks go. Now, I am going to give you another number -- 86. This represents the number of the novel H1N1 flu-related deaths in Mexico reported on April 25, 2009, by the health minister of Mexico, and at that time this flu had been thought to have sickened 1,400 people. This gives us the risk of death with the flu of roughly 6%, a very high risk indeed! Well, that was then. Now that we have all steadied our pulses, and the health authorities have gone back and done some testing, as of yesterday Mexico had confirmed 2,059 cases with 56 fatalities, equating to a 2.7% risk of dying with the disease. Still a high number, to be sure, but lower than what was thought before.

In the US, we have had 3 fatalities among 3,352 cases reported as of yesterday, yielding the risk of death from H1N1 in this country of about 1 in 1,000. But, of course, the denominator of 3,352 persons represents only those who sought medical attention and got tested, so probably it is an underestimate of the true burden of this strain of flu, and necessarily also an over-estimate of its attendant mortality. Now, apply this to the situation in Mexico, and it's likely that the risk of death from H1N1 is also lower than what we have observed precisely due to the under-estimation of the denominator.
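For those who like to see the division written out, here is a small Python sketch of the risks quoted above. The counts are the ones cited in this post; the `undercount_factor` of 10 is purely hypothetical, there only to show what an under-counted denominator does to the apparent risk of death.

```python
def risk(events, population):
    """Risk = numerator (events) divided by denominator (people at risk)."""
    return events / population


# Figures quoted in the post.
us_attack_risk = risk(3_352, 300_000_000)   # cases / US population
mexico_early_cfr = risk(86, 1_400)          # deaths / suspected cases, April 25, 2009
mexico_later_cfr = risk(56, 2_059)          # deaths / confirmed cases
us_cfr = risk(3, 3_352)                     # deaths / reported US cases

print(f"US risk of infection: about 1 in {1 / us_attack_risk:,.0f}")
print(f"Mexico, early estimate: {mexico_early_cfr:.1%}")   # 6.1%
print(f"Mexico, later estimate: {mexico_later_cfr:.1%}")   # 2.7%
print(f"US risk of death: about 1 in {1 / us_cfr:,.0f}")

# Hypothetical assumption: only 1 in 10 infected people was ever tested.
undercount_factor = 10
us_cfr_adjusted = risk(3, 3_352 * undercount_factor)
print(f"US risk of death if cases are 10x under-counted: "
      f"about 1 in {1 / us_cfr_adjusted:,.0f}")
```

The numerator never changes in the last step; only the denominator grows, and the apparent risk of death shrinks by the same factor.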

So how could we get a true estimate of the numbers of people afflicted with the H1N1 influenza? Well, we could screen absolutely everyone (or more likely a large and representative group of individuals). Then what? Do we treat them all with anti-virals? Do we observe them? Since the Centers for Disease Control and Prevention recommends testing only severe cases and treating only persons at a high risk for complications, universal testing does not seem like a practical approach. So, the bottom line is that we are not likely ever to get at the correct denominator for the risk of dying with this disease, and any number that we get is likely to be an over-estimate of the true risk.

So, what are the lessons here? First, don't let anyone get away with only giving you the numerator, as that is not even a half of the story. Second, even when the denominator appears known, be skeptical -- does it really represent the entire pool of cases that are at risk for the event that the numerator describes? The likely answer will most of the time be "no". Clearly, it is the denominator that is the key to being an educated consumer of health information.

Wednesday, February 22, 2012

Endometriosis and cancer: When a "breakthrough" may not be all that

There is an interesting new study that was just published online in The Lancet Oncology. It is a pooled analysis of a bunch of case-control studies to explore the association between endometriosis and certain types of ovarian cancer. Here is the abstract:


Endometriosis is a risk factor for epithelial ovarian cancer; however, whether this risk extends to all invasive histological subtypes or borderline tumours is not clear. We undertook an international collaborative study to assess the association between endometriosis and histological subtypes of ovarian cancer.


Data from 13 ovarian cancer case-control studies, which were part of the Ovarian Cancer Association Consortium, were pooled and logistic regression analyses were undertaken to assess the association between self-reported endometriosis and risk of ovarian cancer. Analyses of invasive cases were done with respect to histological subtypes, grade, and stage, and analyses of borderline tumours by histological subtype. Age, ethnic origin, study site, parity, and duration of oral contraceptive use were included in all analytical models.


13 226 controls and 7911 women with invasive ovarian cancer were included in this analysis. 818 and 738, respectively, reported a history of endometriosis. 1907 women with borderline ovarian cancer were also included in the analysis, and 168 of these reported a history of endometriosis. Self-reported endometriosis was associated with a significantly increased risk of clear-cell (136 [20·2%] of 674 cases vs 818 [6·2%] of 13 226 controls, odds ratio 3·05, 95% CI 2·43–3·84, p<0·0001), low-grade serous (31 [9·2%] of 336 cases, 2·11, 1·39–3·20, p<0·0001), and endometrioid invasive ovarian cancers (169 [13·9%] of 1220 cases, 2·04, 1·67–2·48, p<0·0001). No association was noted between endometriosis and risk of mucinous (31 [6·0%] of 516 cases, 1·02, 0·69–1·50, p=0·93) or high-grade serous invasive ovarian cancer (261 [7·1%] of 3659 cases, 1·13, 0·97–1·32, p=0·13), or borderline tumours of either subtype (serous 103 [9·0%] of 1140 cases, 1·20, 0·95–1·52, p=0·12, and mucinous 65 [8·5%] of 767 cases, 1·12, 0·84–1·48, p=0·45).


Clinicians should be aware of the increased risk of specific subtypes of ovarian cancer in women with endometriosis. Future efforts should focus on understanding the mechanisms that might lead to malignant transformation of endometriosis so as to help identify subsets of women at increased risk of ovarian cancer.


Ovarian Cancer Research Fund, National Institutes of Health, California Cancer Research Program, California Department of Health Services, Lon V Smith Foundation, European Community's Seventh Framework Programme, German Federal Ministry of Education and Research of Germany, Programme of Clinical Biomedical Research, German Cancer Research Centre, Eve Appeal, Oak Foundation, UK National Institute of Health Research, National Health and Medical Research Council of Australia, US Army Medical Research and Materiel Command, Cancer Council Tasmania, Cancer Foundation of Western Australia, Mermaid 1, Danish Cancer Society, and Roswell Park Alliance Foundation.
Alas, I do not have access to the full article (paywall), so cannot go through it in a detailed way. Nevertheless, we can try to put these findings in perspective. So, briefly, the investigators put together data from many case-control studies and discovered that the risk of some, though not all, ovarian cancers was 2-3 times higher in the presence of endometriosis than in its absence, and concluded that clinicians should be aware of this increase in risk. Fair? But you know I am going to deconstruct it, right? Here we go.

By now we all understand what a case-control study is, right? It is a study where cases (those patients with the disease of interest) are compared to controls (subjects who are in all ways the same as the cases with the exception that they do not harbor the disease in question). So the study identifies subjects with an outcome, and follows them backward to the exposure that is of interest vis-a-vis this outcome. These studies are notoriously difficult to do well, particularly when it comes to the choice of a control, and very few do it well in my experience. I cannot comment on the current conglomeration of 13 of them, so will not venture a guess on whether some or how many of them may lead us astray. Though these studies are difficult, for various reasons they are the way to go when examining an uncommon outcome. So the choice of the design is legit.

Now, let's examine the risk for misclassification for both the disease and the exposure. I think you will agree that a case of ovarian cancer is difficult to misclassify, so I will not pick on this as a potentially major threat to the validity. But what about endometriosis? This is a chameleonic condition that is probably way under-recognized. Its symptoms and signs are varied and, unless the studies required a look "inside," which I sincerely doubt they did -- note, the abstract states that this exposure was self-reported -- there is a very real and grave threat to validity here. Ever heard of recall bias? If such exists, then more women with cancer are likely to report symptoms that may be indicative of endometriosis than those without cancer. So, the observed increase in the risk of ovarian CA in the presence of endometriosis may be due to just that -- a recall bias.

These limitations notwithstanding, the authors felt that the study was a breakthrough (emphasis mine):
"This breakthrough could lead to better identification of women at increased risk of ovarian cancer and could provide a basis for increased cancer surveillance of the relevant population, allowing better individualization of prevention and early detection approaches such as risk-reduction surgery and screening,” lead author Celeste Leigh Pearce, at the University of Southern California, Los Angeles, said in a journal news release. 
What does Dr. Pearce mean by "increased cancer surveillance?" Is she talking about screening women with endometriosis because of this possibly heightened risk? And if so, is this really wise? Let's simulate some of these numbers.

Let us suppose that endometriosis does indeed increase the risk of ovarian cancer 2-3-fold. This means that the incidence now goes from 13 per 100,000 women up to 39 per 100,000. Let us now also assume that there is a test that is 99% sensitive (able to identify ovarian CA when it is present) and 99% specific (able to demonstrate that no ovarian CA is present when it is not present). Recall that at the population incidence of ovarian CA, the USPSTF does not recommend screening due to a very high risk of a false positive. The question is does this 3-fold elevation in risk change the positive predictive value of screening substantially enough for it now to be recommended? I think I know the answer, but let's go through the exercise anyway, just to be explicit.
[Table: test results by disease status, with columns "Disease present" and "Disease absent"; cell values not preserved]

So, the corresponding positive predictive value is... drum roll, please... 3.7%. This means that out of 100 women who have tested positive, fully 96 have a false positive result and are now likely to be subjected to invasive procedures. If we imagine that a test can have near-perfect specificity of 99.99% (no test that I know of can come close to this in any consistent way), still 20% of all positive results are false positives. So, is this indeed food for screening thought? I really don't think so, particularly given that the current risk calculation is likely a gross over-estimate.
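Here is a minimal Python sketch of the exercise above, for anyone who wants to check the numbers. The function name `positive_predictive_value` is mine; the inputs (13 and 39 cases per 100,000, a hypothetical test with 99% sensitivity and 99% or 99.99% specificity) are the assumptions stated in this post, not properties of any real test.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of positive tests that are true positives."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


baseline = 13 / 100_000   # population incidence used in the post
elevated = 39 / 100_000   # assumed 3-fold increase with endometriosis

ppv_baseline = positive_predictive_value(0.99, 0.99, baseline)
ppv_elevated = positive_predictive_value(0.99, 0.99, elevated)
ppv_near_perfect = positive_predictive_value(0.99, 0.9999, elevated)

print(f"PPV at 13 per 100,000: {ppv_baseline:.1%}")
print(f"PPV at 39 per 100,000: {ppv_elevated:.1%}")                           # 3.7%
print(f"PPV at 39 per 100,000, 99.99% specificity: {ppv_near_perfect:.1%}")  # 79.4%
```

Tripling the incidence roughly triples the PPV, but three times a very small number is still a very small number.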

So, there you have it. I don't think I am engaging in hyperbole when I say that "breakthrough" is very likely an overstatement.                  

Tuesday, February 21, 2012

The taxonomic laziness of "anti"

Sometimes I hear my kids arguing somewhere in the house, and I can almost predict what is coming next: my daughter will come to me and complain that her brother is being mean, and her brother will follow shortly and say that no, she is the one being mean. This is when I have to take a deep breath and ask what it is exactly they mean by "mean." What I am talking about here is an inference that they have made based on some behavior that they disliked in the other. Furthermore, this inference has morphed into a sledgehammer term "mean" that is now coloring their entire discussion. Deconstruction is always the order of the day.    

I observed a similar phenomenon, this time among my professional peers, when I published my paper a few years ago on how there is no evidence that VAP bundles do anything to prevent VAP: even some people whom I consider to possess equanimity decided that I must be "anti-quality." Are you serious? How can anyone be anti-quality? I am merely against spinning our wheels just so that we look like we are doing something. But this is not what this post is about.

Earlier today I came upon this remarkably intelligent post from Jack Stilgoe, a blogger who is new to me. The post is aptly titled "Anti-anti-science," and takes the nuanced position that it is way too lazy and reductionist to call any dissenters with anything that remotely resembles a scientific idea "anti-science" (my words, not his). Indeed, Jack states,
My over-riding impression is that ‘anti-science’ is a term that is imaginary and unhelpful. It describes almost nobody and it gets us nowhere. Climate deniers are not anti-science, they are anti- a political view that considers environmental protection as important. Creationists, too, have moral objections to the implications of an evolutionary worldview (John Evans is very good on this). In both cases, these groups use science arguments as their vehicle because they are more sophisticated sociologists of science than the scientists themselves. Where scientists see their evidence as a solid stage on which the public drama of policy can take place, creationists, denialists, anti-vaccinationists and others see a precariously balanced house of cards. Yes, they are stupid and wrong, but calling them ‘anti-science’ doesn’t help. Hitting these people over the head with bigger and bigger science hammers will not win the argument, it will simply confirm their suspicions.
And I do completely agree -- denying the evidence for evolution is simply blind, but to lump this faction together with people who oppose GMO foods is just lazy and destructive to a civil discourse.

The author then goes on to make a poignant confession that
One reason the term ‘anti-science’ raises my hackles is that I think the big beasts of science who use it might be talking about a group that includes me. We social scientists and policy folk have been known to ask difficult questions of science that have been interpreted as attacks.
And then he says that the term "anti-science" represents a "privatization of the idea of progress that is dangerous for science and society." The implication is that those calling everyone else "anti-science" have some sort of a monopoly on scientific ideas and discourse. What is crystal clear is that scientific ideas belong to everyone and should not be subject to "us vs. them" schoolyard brawls. Science demands precision in its taxonomies. So to start calling people who argue out of ignorance anything other than ignorant, or those who argue out of political expediency anything other than politically expedient is imprecise and, well, unscientific.

But there is an even more important and subtle reason why such taxonomic laziness is pernicious. This reason is that anyone who disagrees with us can easily become "anti-science," an insulting waste basket connoting ignorance. Why is this so bad? At one level, an overt insult thrown at an individual or a group generally raises their hackles, and, for the most part, eliminates any chance for a civilized discussion of ideas. In this way the groups become even more polarized and entrenched without any way to get to any common ground. At another level, such labels give rise to a deeply flawed impression that science is precise. Indeed, those who fling these labels tend to fall back on the old refrain of "look, I am the first one to admit that scientific knowledge is fungible." However, in the instances of current confrontations, they somehow cannot imagine that our current knowledge is incomplete and that further questions are not only legitimate, but are indeed a scientific imperative.

So, what is the take-home? Simple: respect dissent and use correct terminology. In other words, debate from a place of thirst for knowledge. I realize that it is tiresome to have to argue that creationism is not science, that evolution is more than just a theory, that vaccines have saved millions of lives. But to write off these arguments as beneath us and to throw insults at each other only works in Washington, and look how well that has been going. But even more importantly, if we are going to answer such deeply pragmatic scientific questions of our time as "what are the full implications of GM salmon," we cannot shroud ourselves in the "sacredness" of science. Science can only stay beautiful and true if it steers clear of being dogmatic. It is time to take this discourse out of the boxing ring of childish insults and back into the civic society where it belongs.

Monday, February 20, 2012

Tinkering with health is not a laboratory job

Do you love Brussels sprouts? I do. And broccoli, chard and kale, too. Why do I ask?

Well, last week my friend Kent Bottles did a blog post on what the future of medicine may look like according to two of our prominent medical futurists, David Agus and Eric Topol. It left me scratching my head, so I went to find Agus on the interwebz, and came upon his 2011 TEDMED talk, which can be found here. The Brussels sprouts were just the beginning of nearly 20 minutes of bewilderment. I will get to the meat (ahem) of it momentarily, but I just could not get past his assertion at around 3:48, where he states that people in their 90s do not take up healthcare resources -- no mechanical ventilation, no weeks in the ICU -- and that they "die with dignity from whatever process ails them at that point." Really? In what country?

Here is the reality -- you can consult the Dartmouth Atlas for a lot of this info, but many other sources exist as well. There are approximately 2.5 million deaths in the US annually. Fully 1/3 of them occur in the hospital, more than 1/2 of which involve ICU care. And incidentally, not to get all cost-conscious or anything, but 80% of all of the associated costs were due to ICU care. But wait, you say, this is not necessarily people in their 90s, right? OK, let's take it down a layer.

Among the 1/3 of all the annual deaths that occur in the hospital, nearly 3/4 are among the Medicare population, or those who are 65 years old or older. Furthermore, according to none other than the Dartmouth Atlas, up to 1/4 of all Medicare enrollees spend 1 week in the ICU in the last 6 months of life. OK, then. So, where are the data that old age is associated with low medical costs? Not here, that's for sure.

After this dubious beginning, Dr. Agus states the undeniable: humans are complex systems, and we need to think of them as such. Additionally, he advocates skepticism because much of what is done in medicine is not based on "true" data. OK, I can certainly go along with that. Then he goes astray. Here is how.

At around 10:30 he gets into technological solutions. You may be surprised that I do not fundamentally disagree with technology as the answer to disease. No, I disagree that technology is the answer to health -- this is where Brussels sprouts come in. At about 11:00 he starts to talk about aspirin and all the fantastic health benefits that are associated with it -- here is a screen shot of his slide:
And he suggests that aspirin should be mandatory, and that society should not have to pay for these diseases that develop due to what? Aspirin deficiency? Now, as veteran readers of my blog, do you see something funny about this slide? Is there something missing? Yes, you are right, where are the references for these statements? No, I did not cut off the bottom -- there are no studies referenced. One other critical piece of information is missing: whenever data on benefit are presented, data on risks must also be presented. Where are they? So, yes, be very skeptical. One more picky point: he brings up Michael Dell at 11:30 or so, telling the story of how his employees who smoked had to pay higher insurance premiums. By extension, Dr. Agus contends that we should charge higher premiums to employees who do not comply with aspirin. Unfortunately to get into the full controversy about the role of aspirin in these conditions is way beyond the scope of this post. But do let me give you a taste of what a balanced discussion of aspirin as a prevention for heart disease looks like -- here the Mayo Clinic web site is exemplary in providing a well-informed approach. Around 12:00 Agus builds the same argument for statins. And then he knocks down vitamins and supplements and suggests that people who take them should be penalized with higher premiums. OK, in my humble opinion this branch of inquiry has always been a fool's errand in a society that is fairly well fed, but higher premiums? Come on. I will not belabor this. And finally he sprinkles his comments with a few words on the microbiome. 

So what does Dr. Agus seem to say overall? My impression is that he thinks that we should tinker with maintaining our health by looking to manufactured drugs, such as aspirin and statins, as well as whatever we learn from the microbiome (more drugs?). What is missing here is the discussion of the risks vs. the benefits of such tinkering in healthy people. What is also missing is data to back up some of his fundamental assertions (see above). 

So, final words? Technology is not the enemy. If used correctly it can help us understand and cure disease. Tinkering with the healthy human is the job of evolution, not the laboratory. The potential "unintended consequences" of such tinkering are too colossal to ignore.  



Saturday, February 18, 2012

The implications of a blood test for depression

So, a couple of days ago we spent considerable (virtual) ink on discussing the risk of a false positive result in the setting of screening for rare events. Today something else has caught my eye: a blood test for depression. The Atlantic reported this much-retweeted piece on the same day that I was droning on about lung cancer screening. So what is this about, and does it bear any similarity to what we discussed here?

Let us examine the lede of the Atlantic article:
New research shows that blood screenings can accurately spot multiple telltale biomarkers in patients with classic symptoms of depression.
The writer uses the word "screening" while talking about patients with "classic symptoms of depression." This choice of language is problematic. In clinical medicine the term "screening" generally refers to a population without any signs or symptoms of disease. Think of breast cancer and prostate cancer screening. When symptoms or signs are present, the testing becomes diagnostic, not screening, in its purpose. This difference is actually critical to appreciate in the context of what we talked about in the lung cancer screening post. The presence of signs that make a disease suspect presumably increases the pre-test probability of that disease. This means that the prevalence of this disease is higher in the population that has these particular signs/symptoms than in the overall population without them. And recall that it is this very pre-test probability that drives the predictive value of a positive test. Namely, the higher the pre-test probability of the disease, the more credence we can put in a positive test result.

OK, so let's move on to the data. I went to the primary source, but all I could get to was the abstract (paywall and all), so bear in mind that I do not have all of the data. I am reproducing the abstract here for your convenience:
Despite decades of intensive research, the development of a diagnostic test for major depressive disorder (MDD) had proven to be a formidable and elusive task, with all individual marker-based approaches yielding insufficient sensitivity and specificity for clinical use. In the present work, we examined the diagnostic performance of a multi-assay, serum-based test in two independent samples of patients with MDD. Serum levels of nine biomarkers (alpha1 antitrypsin, apolipoprotein CIII, brain-derived neurotrophic factor, cortisol, epidermal growth factor, myeloperoxidase, prolactin, resistin and soluble tumor necrosis factor alpha receptor type II) in peripheral blood were measured in two samples of MDD patients, and one of the non-depressed control subjects. Biomarkers measured were agreed upon a priori, and were selected on the basis of previous exploratory analyses in separate patient/control samples. Individual assay values were combined mathematically to yield an MDDScore. A ‘positive’ test, (consistent with the presence of MDD) was defined as an MDDScore of 50 or greater. For the Pilot Study, 36 MDD patients were recruited along with 43 non-depressed subjects. In this sample, the test demonstrated a sensitivity and specificity of 91.7% and 81.3%, respectively, in differentiating between the two groups. The Replication Study involved 34 MDD subjects, and yielded nearly identical sensitivity and specificity (91.1% and 81%, respectively). The results of the present study suggest that this test can differentiate MDD subjects from non-depressed controls with adequate sensitivity and specificity. Further research is needed to confirm the performance of the test across various age and ethnic groups, and in different clinical settings.
So, what did they really do? Well, let us go through the info applying the PICO framework. The population (P) is people with a major depressive disorder as diagnosed by clinical criteria. The intervention (I) is the new multi-assay serum test for 9 biomarkers associated with depression. The comparator (C) is the clinical diagnosis of MDD, and the outcome (O) is the concordance of the serum test and the clinical diagnosis. OK so far?

Bear in mind that there were actually two studies, and here is how they played out. For the first study the researchers recruited 36 patients with MDD (disease present) and 43 subjects without MDD (disease absent). Given the sensitivity of 91.7% and specificity of 81.3%, here are the results:

[Table: test results by disease status, with columns "Disease present" and "Disease absent"; cell values not preserved]

Based on these numbers, the positive predictive value is 80.4% and the negative predictive value is 92.1%. What does this mean? This means that in a population with a 45.6% (36/79) prevalence of MDD, only 20% of all positive tests will be false positives, or identifying the disease when it is absent. Conversely, of all negative tests, 8% will be false negative, or missing the disease when it is present. And for the second study, where 34 MDD patients were involved, frankly not enough information is given in the abstract to say anything about it -- I do not have the denominator (the total pool of subjects including those with and without MDD), and therefore cannot say anything about the PPV or NPV.
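The predictive values for the pilot study can be reproduced with a few lines of Python. This is only a sketch: the function name is mine, and the cell counts are derived from the sensitivity, specificity and group sizes reported in the abstract, so they come out as fractions of a person rather than the whole numbers the investigators would have observed.

```python
def predictive_values(sensitivity, specificity, n_diseased, n_healthy):
    """Return (PPV, NPV) from test characteristics and group sizes."""
    true_pos = sensitivity * n_diseased
    false_neg = n_diseased - true_pos
    true_neg = specificity * n_healthy
    false_pos = n_healthy - true_neg
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Pilot study figures from the abstract: 36 MDD patients, 43 non-depressed subjects.
ppv, npv = predictive_values(0.917, 0.813, n_diseased=36, n_healthy=43)
prevalence = 36 / (36 + 43)

print(f"Prevalence in the study sample: {prevalence:.1%}")  # 45.6%
print(f"PPV: {ppv:.1%}")                                    # 80.4%
print(f"NPV: {npv:.1%}")                                    # 92.1%
```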

So what does all of this mean? Well, there are 3 take-home points:
1. When a test is used to diagnose rather than to screen for a disease, you are dealing with a population that has a higher pre-test probability of the disease. So, when the pre-test probability is close to 50%, even a test with suboptimal sensitivity and specificity can be fairly accurate.
2. Your test is only as good as the "gold standard" against which it is being tested. In this case we are talking about a clinical diagnosis of a major depression. The assumption here is that this gold standard test is perfect already. In the absence of anything else to compare it to, it really is: 100% sensitive, 100% specific and quick. How can you improve upon that? And if this is the case, then why do we need a serum test that will give us false results a good part of the time? One argument for this is given here:
...one of the paper’s co-authors said at the very least establishing a physiological link to depression will hopefully get patients to look at their depression as a treatable condition rather than something that’s wrong with their minds. 
But I guess I am not sure that this is really a valid reason for developing a test. It is much like looking for a biological mechanism for homosexuality for the purpose of proving that it is OK to be gay. I already know that it is OK, and find its biological origins of mere intellectual curiosity with little practical consequences. Perhaps we just need to change our minds about it, that is all.
3. Finally, would this test be used to screen people for depression? In medicine there is a temptation to go after "the answer" even when the question is rather oblique. In other words, will the testing in the wild of clinical practice really be limited to those with suspected MDD, or is it likely to metastasize into others, those with milder presentations or even those whom the clinician just finds annoying? If it is the latter (and I can almost guarantee that), then we are in deep doo-doo as far as false positive rates are concerned. If you think that we have had an epidemic of depression up until now, just you wait.
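To put a rough number on the third point, here is a hedged Python sketch of what happens to the positive predictive value when the same test is applied to groups with a lower pre-test probability. The sensitivity and specificity are the pilot study's; the 10% and 5% prevalence figures are hypothetical values I picked to stand in for broader, less selected populations.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of positive tests that are true positives."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


sensitivity, specificity = 0.917, 0.813  # pilot study values from the abstract

scenarios = {
    "study sample (36 of 79)": 36 / 79,
    "hypothetical 10% prevalence": 0.10,
    "hypothetical 5% prevalence": 0.05,
}

ppv_by_scenario = {
    label: positive_predictive_value(sensitivity, specificity, prevalence)
    for label, prevalence in scenarios.items()
}

for label, value in ppv_by_scenario.items():
    print(f"{label}: PPV {value:.1%}, false positives {1 - value:.1%} of all positives")
```

Under these assumptions, most positive results in the lower-prevalence groups would be false positives, which is the worry spelled out in point 3.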

My final word for the day is "caution." I want to be very clear that asking scientific questions is never a bad idea, and that the answers do not always have to bring practical or applied value. I just want to inject some caution into the breathless discussion of screening for everything and our dogged search for "hard" evidence.  

Thursday, February 16, 2012

In medicine, beware of what seems too good to be true

Update 2/17/12:
A reader brought to my attention (thanks!) a very slight inaccuracy in the first table below, which I have corrected. I did the calculations in Excel, which, as you may know, likes to round numbers. 

File this under "misleading." Here is the story:
What's the Latest Development? 
A California start up has developed a breath test that can diagnose lung cancer with a 83 percent accuracy and distinguish between different types of the disease. The procedures which currently exist to test for lung cancer, which is the leading cause of cancer deaths worldwide, result in too many false positives, meaning unnecessary biopsies and radiation imaging. The new devices works by drawing breath "through a series of filters to dry it out and remove bacteria, then [carries it] over an array of sensors."  
What's the Big Idea? 
The company is now testing a version of the machine 1,000 times more accurate than its latest model, which could increase the accuracy of diagnoses to 90 percent, the level likely needed to take the device to market. Because the machine is not specific to a particular group of chemicals, the breath tester could, in principle, test for any disease that has a metabolic breath signature, for example, tuberculosis. "A breath signature could give a snapshot of overall health," says the company's founder, Paul Rhodes. 
Am I just being a luddite by not getting, well, breathless about this? I'll just lay out my argument, and you can be the judge.

There is no doubt that lung cancer is a devastating disease, and we have not done a great job reducing its burden or the associated mortality. However, there are several issues with what is implied above, and some of the assumptions are unclear. First, what does "accuracy" mean? In the world of epidemiology it refers to how well the test identifies true positives and true negatives. If that is in fact what the story means, then 83% may not be bad; we'll regroup on that point at the end of this post. This brings me to my second point: what is the gold standard that the test is being measured against? In other words, what is it that has the 100% accuracy in lung cancer detection? Is it a chest X-ray, a CT scan, a biopsy, what?

The SEER database, the most rigorous source of cancer statistics in the US, classifies tissue diagnosis as the highest evidence of cancer. However, in some cases a clinical diagnosis is acceptable. The inference of cancer when no tissue is examined is possible when weighing patient risk factors and the behavior of the tumor. So, you see where I am going here? The gold standard is tissue or tumor behavior in a specific patient. Is that what this technology is being measured against? We need to know. And here is another consideration. What if the tissue provides a cancer diagnosis, but the cancer is not likely to become a problem, like in the prostate cancer story, for example?

But all of these issues are but a prelude to what is the real problem with a technology like the one described: the predictive value of a positive test. The story even alludes to this, pointing the finger at other current-day technologies and their rates of false positivity, and away from itself. Yet, in fact, this is the crux of the matter for all diagnostics. Let me show you what I mean.

The incidence of lung cancer in the US is on the order of 60 cases per 100,000 population. Now, let us give this test a huge break and say that it yields (consistently) 99% sensitivity (identifies patients with cancer when cancer is really present) and 99% specificity (identifies patients without cancer when they really do not have cancer). What will this look like numerically given the incidence above if we test 100,000 people?

                 Cancer present    Cancer absent
Test +           59                999
Test -           1                 98,941

If we add up all the "wrong" test results, the false negative (n=1) and the false positives (n=999), we arrive at a 1% "inaccuracy" rate, or 99% accuracy. But what is hiding behind this 99% accuracy is the fact that of all those people with a positive test only a handful, a paltry 6%, actually have cancer. And what does this mean to the other 94%? Additional testing, a lot of it invasive. And what does this testing mean for the healthcare system? You connect the dots.
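For anyone who wants to check the arithmetic, here is a minimal sketch of the calculation. It uses only the numbers stated above (100,000 people tested, 60 cases per 100,000, 99% sensitivity and specificity); the function and variable names are my own.

```python
def two_by_two(population, cases_per_100k, sensitivity, specificity):
    """Expected cell counts of a 2x2 table (test result vs. disease status)."""
    diseased = population * cases_per_100k / 100_000
    healthy = population - diseased
    tp = sensitivity * diseased   # cancer present, test positive
    fn = diseased - tp            # cancer present, test negative
    tn = specificity * healthy    # cancer absent, test negative
    fp = healthy - tn             # cancer absent, test positive
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


POPULATION = 100_000
cells = two_by_two(POPULATION, cases_per_100k=60, sensitivity=0.99, specificity=0.99)

accuracy = (cells["TP"] + cells["TN"]) / POPULATION
ppv = cells["TP"] / (cells["TP"] + cells["FP"])

for name, count in cells.items():
    print(f"{name}: {count:,.1f}")
print(f"accuracy: {accuracy:.1%}")
print(f"positive predictive value: {ppv:.1%}")
```

The expected counts are fractional (for example, 59.4 true positives), which is why the whole numbers quoted in the text are rounded. The same function can be reused with any other incidence, sensitivity or specificity.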

Let's explore a slightly different scenario. Let us assume that there is a population of patients whose risk for developing lung cancer is 10 times higher than the population average. Let us say that their incidence is 600 cases per 100,000 population. Let us perform the same calculation assigning this same bionic accuracy to the test:

                 Cancer present    Cancer absent
Test +           594               994
Test -           6                 98,406
The accuracy remains at 99%, but the value of the positive test rises to 37%. Still, 63% of all people testing positive for cancer will go on to unnecessary testing. And imagine the numbers when we try to screen millions of people, rather than just 100,000.

Let us do just one final calculation. Let us reflect the data back to the test in question, where the article claims that the accuracy of the next version of the technology will be 90%. Assuming a high risk population (600 cases per 100,000 population), what does a positive result mean?

                 Cancer present    Cancer absent
Test +           540               9,940
Test -           60                89,460
From this table, the accuracy is indeed 90%, concealing the very low value of a positive test of 5%! This means that of the people testing positive for lung cancer with this technology, 95% will be false positives! What is most startling is that to arrive at the same mediocre 37% value for a positive test that we saw above in this population, we would need a population where cancer incidence is a whopping 6,000 per 100,000, or 6%!
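The 6% figure can be checked by turning the calculation around and solving for the incidence that yields a given positive predictive value. This is a sketch under one assumption that the article does not spell out: that "90% accuracy" means 90% sensitivity and 90% specificity. The function name is my own.

```python
def prevalence_for_ppv(target_ppv, sensitivity, specificity):
    """Disease prevalence at which a positive test reaches the target PPV.

    Obtained by solving PPV = s*p / (s*p + f*(1 - p)) for p,
    where s is sensitivity and f = 1 - specificity.
    """
    fp_rate = 1 - specificity
    return (target_ppv * fp_rate) / (
        sensitivity * (1 - target_ppv) + target_ppv * fp_rate
    )


# Assumption: 90% accuracy is read as 90% sensitivity and 90% specificity.
needed = prevalence_for_ppv(target_ppv=0.37, sensitivity=0.90, specificity=0.90)
print(f"incidence needed: about {needed * 100_000:,.0f} per 100,000")
```

Under that assumption the answer comes out at a little over 6,000 per 100,000, in line with the 6% quoted above.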

I do not want to belabor this issue any further. Screening for disease that is not yet a clinical problem is fraught with many problems, and manufacturers need to be aware of these logical pitfalls. What I have shown you here is that even when the "accuracy" of a test is exquisitely (almost impossibly) high, it is the pre-test probability of the disease, that is, the patient's risk for it, that is the overwhelming driver of false positives. Therefore, I give you this conclusion: beware of tests that sound too good to be true -- most of the time they are.

h/t to @gingerly_onward for the story link  

Medicine: The art of applied science

I read this NPR article this morning and had to do a post in response. The gist is that the military is turning to what we might call the less conventional (for us in the West) medical modalities to deal with the injuries sustained by the current crop of vets. Instead of getting them hooked on pain meds for life (we saw plenty of this in the VAs in the '80s and '90s among Vietnam vets), they are turning to stuff like massage and acupuncture. And, predictably, it is stirring up controversy.

The story that is told is of a Sgt. Rick Remalia who fractured his back and pelvis in Afghanistan:
Remalia broke his back, hip and pelvis during a rollover caused by a pair of rocket-propelled grenades in Afghanistan. He still walks with a cane and suffers from mild traumatic brain injury. Pain is an everyday occurrence, which is where the needles come in.
And lately he has been receiving acupuncture treatments, with this result:
"I've had a lot of treatment, and this is the first treatment that I've had where I've been like, OK, wow, I've actually seen a really big difference," he says.
And incidentally, he gets these treatments from a military physician, who, herself a skeptic, admits to perceiving a personal benefit from her own exposure to it:
"I actually had a demonstration of acupuncture on me, and I'm not a spring chicken," she says, "and it didn't make me 16 again, but it certainly did make me feel better than I had, so I figured, hey ... let's give it a shot with our soldiers here."
So, all good so far, right? Well, Harriet Hall is quoted in the same article, and to her this falls right into what she likes to call "quack-ademic" medicine. She says,
"The military has led the way on trauma care and things like that, but the idea that putting needles in somebody's ear is going to substitute for things like morphine is just ridiculous," Hall says.
Now, as you know, I have had some debates with the SBM crowd in the past, and as it turns out, we agree on the science more than we disagree. However, I am thinking that this argument is not about science, but about politics.

I am well aware that a group of anecdotes does not amount to science. And I am also well aware that what we are hearing here are anecdotes. But here is the thing: when your kid tells you that she likes chocolate ice cream better than vanilla, do you ask for evidence that chocolate is better than vanilla at the population level? No, that's absurd! OK, you say, but this is a strawman: nobody is going for a claim of superiority of chocolate ice cream over vanilla. That is true, but is this about the science or about being able to make a claim? If my kid likes chocolate, why not let her have that when ice cream is on the menu? If acupuncture seems to provide some relief to Sgt. Remalia, why not let him have that relief? After all, whose opinion about what works counts in this individual example, ours or the patient's? And if the ethics of using a placebo are the concern, there is nothing wrong with letting him know that in large clinical trials the evidence is equivocal, which means that it may work for some and not for others. In fact, this might be a good disclaimer to make before commencing any treatment, whether or not it has earned the right to make claims.

Another argument is that there is no way that insurance (or our taxes) should pay for this unproven treatment. Still about science? Do any of you want to stand up and tell Sgt. Remalia, who fought for our freedom, that we will not pay for the only thing that seems to help him, that is pretty cheap and safe and that has very few, if any, long-term adverse effects, in stark contrast to pain killers? Yes, I understand that this is not science, but is there no room for humanism in the practice of medicine? After all we have throaty debates as to whether or not it is ethical to deny a $100,000 payment for a treatment that, on average, prolongs life by 2 weeks. Surely, denying Sgt. Remalia access to this relief would diminish our humanity. And what about the costs of treating addiction to pain killers?

So, here are my points:
1. I completely agree that the fact that acupuncture "works" for Sgt. Remalia does not mean that "acupuncture works" in the scientific sense. It may or may not work; furthermore, our current models of the universe do not allow us to have an adequate mechanistic explanation. But that is not the point -- it works for this young man whose life will never be the same because he signed up to defend his country. To this extent his "claim" has all kinds of internal validity.
2. Making claims is subject to legal and regulatory frameworks that have very little to do with science. I have done much blogging on clinical vs. statistical considerations in clinical research that feeds regulatory approvals and hence claims, and I remain of the opinion that a lot of the acceptable claims are specious. I know, I know, this is a "tu quoque" argument, but if we are talking about the goose and the gander, well...
3. Whether or not a treatment should be paid for is more prone to political than evidence-based decisions. Given that most medicines work in a minority of patients, and none comes without adverse events, the extent of which remains largely unknown because of our failure to build real regulatory systems to quantify them, we are spending a lot of dollars on stuff that does not work at the individual level.

Medicine has to be part science and part art; in fact the art is in how and when to apply the science. That latter portion must be about humanism.