Monday, January 31, 2011

Reviewing medical literature, part 5: Inter-group differences and hypothesis testing

Happy almost February to everyone. It is time to resume our series. First I am grateful for the vigorous response to my survey about interest in a webinar to cover some of this stuff. Over the next few months one of my projects will be to develop and execute one. I will keep you posted. In the meantime, if anyone has thoughts or suggestions on the logistics, etc., please, reach out to me.

OK, let's talk about group comparisons and hypothesis testing. The scientific method that we generally practice demands that we articulate a hypothesis before conducting the study that will test it. The hypothesis is generally advanced as the so-called "null hypothesis" (or H0), wherein we express our skepticism that there is a difference between groups or an association between the exposure and outcome. By starting out with this negative formulation, we set the stage for "disproving" the null hypothesis, or demonstrating that the data support the "alternative hypothesis" (HA, or the presence of the said association or difference). This is where all the measures of association that we have discussed previously come in, and most particularly the p value. The definition of the p value once again is "the probability that the found inter-group difference, or one that is greater than what was found, would have been found under the condition of no true difference." Following through on this reasoning, we can appreciate that the H0 can never be "proven." That is, the only thing that can be said statistically when no difference is found between groups is that we did not disprove the null hypothesis. This may be because there truly is no difference between the groups being compared (that is, the null hypothesis approximates reality) or because we did not find the difference that in fact exists. The latter is referred to as the Type II error, and can be present for various reasons, the most common of which is a sample size that is too small to detect a statistically significant difference.
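
To make the p value definition concrete, here is a minimal sketch of an exact permutation test in Python. It is one simple way to compute "the probability of a difference at least this large under no true difference," not the method of any particular paper. The function name and the outcome scores are invented for illustration.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    as_extreme = 0
    relabelings = 0
    # Under the null hypothesis the group labels are arbitrary, so every way
    # of assigning n_a of the pooled values to "group A" is equally likely.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            as_extreme += 1
        relabelings += 1
    return as_extreme / relabelings


# Hypothetical outcome scores, invented for illustration only.
treated = [1, 2, 3, 4]
control = [5, 6, 7, 8]
print(round(permutation_p_value(treated, control), 4))  # 2 of 70 relabelings -> 0.0286
```

With only four patients per group, even this perfectly separated result cannot produce a p value below 2/70, which is a small-scale picture of why tiny samples invite Type II error.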

This is a good place to digress and talk a little about the distinction between "absence of evidence" and "evidence of absence." The distinction, though ostensibly semantic, is quite important. While "evidence of absence" implies that studies to look for associations have been done, done well, published, and have consistently shown the lack of association between a given exposure and outcome or a difference between two groups, "absence of evidence" means that we have just not done a good job looking for this association or difference. Absence of evidence does not absolve the exposure from causing the outcome, yet so often it is confused with the definitive evidence of absence of an effect. Nowhere is this more apparent than in the history of the tobacco debate, which is the poster child for this obfuscation. And we continue to rely on this confusion in other environmental debates, such as chemical exposures and cell phone radiation. One of the most common reasons for finding no association when one exists, or the type II error, is, as I have already mentioned, a sample size that is too small to detect the difference. For this reason, in a published study that fails to show a difference between groups it is critical to ensure that the investigators performed the power calculation. This maneuver, usually found in the Methods section of the paper, lets us know whether the sample size is adequate to detect a difference if one exists, thus limiting the probability of type II error. The trouble is that, as we know, there is a phenomenon called "publication bias." This refers to the scientific journals' reluctance to publish negative results. And while it may be appropriate to reject studies prone to type II error due to poor design (although even these studies may be useful in the setting of a meta-analysis, where pooling of data overcomes small sample sizes), true negative results must be made public. But this is a little off topic.
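
For readers curious what a power calculation looks like, here is a rough sketch in Python. It uses the common normal-approximation formula for comparing two proportions, which is one of several formulas in use. The event rates, the 5% two-sided alpha and the 80% power are assumptions chosen for illustration, and `sample_size_per_group` is a made-up name.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients needed per group to detect event rates p1 vs p2.

    Two-sided test, normal approximation, equal group sizes (all assumptions).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # guards against type I error
    z_beta = NormalDist().inv_cdf(power)           # guards against type II error
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical event rates, invented for illustration only.
print(sample_size_per_group(0.20, 0.15))  # a 5-point difference
print(sample_size_per_group(0.20, 0.19))  # a 1-point difference
```

Under these assumptions the first comparison needs roughly 900 patients per group and the second roughly 25,000. The smaller the difference we hope to detect, the larger the study has to be.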

I will ask you to indulge me in one other digression. I am sure that in addition to "statistical significance" (this is simplistically represented by the p value), you have heard of "clinical significance." This is an important distinction, since even a finding that is statistically significant may have no clinical significance whatsoever. Take for example a therapy that cuts the risk of a non-fatal heart attack by 0.05 percentage points in a certain population. This means that in a population at a 10% risk for a heart attack in one year, the intervention will bring this risk on average to 9.95%. And though we can argue whether or not this is an important difference, at the population level, this does not seem all that clinically important. So, if I have the vested interest and the resources to run the massive trial that will give me this minute statistical significance, I can do that and then say without blushing that my treatment works. Yet, statistical significance always needs to be examined in the clinical context. This is why it is not enough to read the headlines that tout new treatments. The corollary to this is that the lack of statistical significance does not equate to the lack of clinical significance. Given what I just said above about type II error, if the difference appears significant clinically (e.g., reducing the incidence of fatal heart attacks from 10% to 5%), but does not reach statistical significance, the result should not be discarded as negative, but examined as to the probability of the type II error. This is also where Bayesian thinking must come into play, but I do not want to get into this now, as we have covered these issues in previous posts on this blog.
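
The arithmetic behind the two examples in this paragraph (10% falling to 9.95%, and 10% falling to 5%) can be sketched in a few lines of Python. The function name is made up. The number needed to treat (NNT) is simply the reciprocal of the absolute risk reduction.

```python
def risk_summary(control_risk, treated_risk):
    """Return (absolute risk reduction, relative risk reduction, number needed to treat)."""
    arr = control_risk - treated_risk  # absolute risk reduction
    rrr = arr / control_risk           # relative risk reduction
    nnt = 1 / arr                      # patients treated to prevent one event
    return arr, rrr, nnt


for label, control, treated in [("tiny effect", 0.10, 0.0995),
                                ("large effect", 0.10, 0.05)]:
    arr, rrr, nnt = risk_summary(control, treated)
    print(f"{label}: ARR = {arr:.2%}, RRR = {rrr:.1%}, NNT = {round(nnt)}")
```

In the first case about 2,000 people would need to be treated for a year to prevent one heart attack; in the second, 20. A large enough trial can make either result statistically significant, which is why the absolute numbers deserve a look.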

OK, back to hypothesis testing. There are several rules to be aware of when reading how the investigators tested their hypotheses, as different types of variables require different methods. A categorical variable (one characterized by categories, like gender, race, death, etc.) can be compared using the chi-square test if there is an abundance of events, or Fisher's exact test when values are scant. A normally distributed continuous variable (e.g., age is a continuum that is frequently distributed normally) can be tested using Student's t-test, while one that has a skewed distribution (e.g., hospital length of stay, costs) requires a "non-parametric" test: the Mann-Whitney U-test (also known as the Wilcoxon rank-sum test) when two groups are compared, or the Kruskal-Wallis test when there are more than two. You do not need to know any more than this: the test for the hypothesis depends on the variable's type and distribution. And recognizing some of the situations and test names may be helpful to you in evaluating the validity of a study.
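
As an illustration of the categorical case only, here is a minimal Python sketch of the chi-square and Fisher's exact tests for a 2x2 table, written with the standard library. The counts are hypothetical and the function names are made up. The chi-square version omits the continuity correction, and real analyses would normally rely on a statistics package.

```python
from math import comb, erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x events in row 1 with the margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Hypothetical counts: rows are groups, columns are (died, survived).
chi2, p_chi2 = chi_square_2x2(20, 80, 10, 90)  # plenty of events
p_fisher = fisher_exact_2x2(3, 1, 1, 3)        # scant counts
print(f"chi-square = {chi2:.2f}, p = {p_chi2:.3f}")
print(f"Fisher's exact p = {p_fisher:.3f}")
```

The choice between the two follows the rule of thumb in the text: the chi-square approximation is reasonable when every cell has a fair number of events, while Fisher's test enumerates the possible tables exactly and so remains valid when counts are small.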

One final frequent computation you may encounter is survival analysis. This is often depicted as a Kaplan-Meier curve, and does not have to be limited to examining survival. This is a time-to-event analysis, regardless of what the event is. In studies of cancer therapies we frequently talk about median disease-free survival between groups, and this can be depicted by the K-M analysis. To test the difference between times to event, we employ the log-rank test.
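
For the curious, the Kaplan-Meier estimate itself is simple enough to sketch in Python. The follow-up times below are invented, the function names are made up, and the log-rank test is not implemented here. At each time an event occurs, the running survival probability is multiplied by the fraction of patients still at risk who did not have the event; censored patients simply leave the at-risk pool.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time an event occurred.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events_here = leaving = 0
        # Group everyone whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            events_here += data[i][1]
            leaving += 1
            i += 1
        if events_here:
            survival *= 1 - events_here / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored patients both leave the risk set
    return curve


def median_time(curve):
    """First time at which estimated survival drops to 50% or below (None if never)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical follow-up in months; 1 = event observed, 0 = censored.
months = [2, 3, 3, 5, 8, 8, 9, 12]
observed = [1, 1, 0, 1, 0, 1, 1, 0]
curve = kaplan_meier(months, observed)
for t, s in curve:
    print(f"month {t}: estimated survival {s:.3f}")
print("median time to event:", median_time(curve))
```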

Well, this is a fairly complete primer for most common hypothesis testing situations. In the next post we will talk a little more about measures of association and their precision, types I and II errors, as well as measures of risk alteration.

Friday, January 28, 2011

Soliciting contributions: "Healthcare professional as e-patient" series

I am contemplating a series of posts arising from my own recent experience as an e-patient to help the broader e-patient community navigate the stormy medical waters with a bit more comfort. I am looking for other healthcare professionals who have had their own experiences as an e-patient that may be instructive for non-healthcare professionals as patients to contribute to the series. Namely, I am most interested in helping people establish better communication lines and channels with their healthcare providers. I am not looking for a comprehensive description of every aspect of your encounter, but rather one specific point that may be particularly instructive. If you have more than one to share, that is great too, we can do that as well. No bitching or moaning, just lessons that we can all learn from.

Issues I would like to touch upon range from how to bring out risk-benefit balance to how to feel OK about confronting your physician with dissenting information to how best to communicate (not everyone is good on e-mail, for example), given our individual styles and time constraints.

I think contributions from healthcare professionals may be very valuable, as we can see both sides of the coin, so to speak.

Would love feedback from both healthcare professionals and e-patients on what would be valuable. If you are interested in contributing, please, either let me know in the comments section or e-mail me. If you have an idea for a post, please, be very specific about what your theme is, as I will make decisions based on how relevant it is to what I am envisioning.

This is just a thought at this point, but it seems like there may be something to it. Looking forward to your ideas.

Thursday, January 27, 2011

The price of marginal thinking in healthcare policy

I find it fascinating how our brains have this propensity to latch on to what is at the margins at the expense of seeing the bulk of what sits in the center. This peripheral-only vision is in part responsible for our obscene healthcare expenditures and underwhelming results.

I have blogged ad nauseam about the drivers of early mortality in the US. In one post I reproduced a pie chart from the Rand Corporation, wherein they show explicitly that a mere 10% of all premature deaths in the US can be attributed to being unable to access medical care. The other 90% is split nearly evenly between behavioral, social-environmental and genetic factors, of which 60%, the non-genetic drivers, can be modified. Yet instead of investing the bulk of our resources in this big bucket of behavioral-environmental-social modification, we put 97% of all healthcare dollars towards medical interventions. This investment can at best produce marginal improvements in premature deaths, since the biggest causes of the effect in question are being all but ignored.

A couple of other striking examples of this marginal magical thinking have surfaced in a few recent stories covered with gusto in the press. One of the bigger ones is the obesity epidemic (oh, yes, you bet it was intended), and its causes. This New York Times piece with its magnetic headline "Central Heating May Be Making Us Fat" entertains the possibility that because of the more liberal use of heat in our homes we are no longer engaging our brown fat, which is a furnace for burning calories. And this is all well and good and fascinating, in a rounding out sort of a way. And it is just as interesting to hear that lack of sleep may be contributing to our expanding waistlines. But it is also baffling that we are still expending these enormous amounts of energy (OK, this one was not intended) on finding the silver bullet, when the target is not a supernatural being, but a super-sized expectation. Is it really that mysterious that we are fatter now than we were 20 years ago, when our current portion sizes are 70% bigger and we spend our days worshipping at the temple of the screen, in all its manifestations? While I am all for learning as much as we can, what we need right now is immediate action to abrogate this escalating epidemic, and I think we can all agree that the way to do it is not through lowering house temperatures. Plenty of behavioral research is available to inform our strategies to get people to eat less and move more. Let's start translating it into practice rather than latch on to one marginal magical idea after another.

And finally, I have to touch upon lung cancer, of course. The current fodder for this was provided by the Washington Post with this story about the growing advocacy among lung cancer patients for early detection. You may recall that recently I did several posts on the heels of the large NCI-sponsored study National Lung Screening Trial (NLST) whose purpose was to understand whether early detection of lung cancer in heavy smokers may improve lung cancer survival. I do not wish to go into all of the specifics of this study and my interpretation of the results -- you can find my thoughts on this study in particular and on screening in general here. What I do want to reiterate is that 85% of all lung cancer is caused by a single exposure: smoking. And guess what? The same behavioral strategies that can help people stop overeating can be deployed towards smoking cessation. Yet, instead of spending 85% of all expenditures on smoking cessation efforts, we prefer to allocate it to early detection. My point is that we need both, but the balance has to be informed by pragmatism, not the marginal magical thinking.

And so it goes that the Pareto principle is bleeding into our healthcare policy decisions -- this is the steep price of the marginal magical thinking. What will it take to get the blinders off and face up to the idea that some intervention points are just more impactful than others? Marginal panaceas will improve our lives, but only at the margins. And without being addressed, the big elephants in the room are likely to stampede us.

SoMe in medicine: It's about communication, stupid!

My generation of doctors was almost proud of its paternalistic overbearing know-it-all archetype, with the my-way-or-the-highway attitude to patient care. Even today there are inter-specialty fights in medicine that demonstrate these entrenched and seemingly fundamental, albeit willfully exaggerated, differences of opinion and clinical approach. It used to be, and still is to an extent, a badge of honor for an internist to disagree with a surgeon, for a pulmonologist to recommend a course of action diametrically opposed to that suggested by the infectious diseases specialist, and for everyone to disparage neurologists (apologies to my neuro friends). The extent of the discussion with patients as modeled by some of my senior colleagues was to say "You have this, and I am giving you this prescription, and see you in 2 months." And even today, I have observed the best of doctors still respond to a cogent "why?" question from a patient with a "because this is how we do it" answer.

My peers' lack of communication skills is the stuff of urban legends. Yet here we are at what seems like a pivotal moment for so many aspects of medicine -- science, healthcare system, communication technologies -- where effectively communicating outside the profession is a make-or-break proposition. Along these lines, in this BBC documentary Sir Paul Nurse, the head of the Royal Society, examines the societal forces that are coalescing to bring "Science Under Attack." The unifying message that comes out of his inquiry is that other less informed parties with political agendas are co-opting the discussion. Yet there is a distinct lack of the antidote of countervailing communication by scientists in terms that are understandable to the lay public. Nurse's battle cry is that scientists need to do a better job communicating their craft themselves, and not just to each other.

In some ways the prevailing elitism of medicine in the 20th century set the stage for the backlash we are experiencing today. The erosion of trust in the profession, commodification and consequent devaluation of medicine, while multifactorial at their root, could no doubt have been mitigated with better communication. Yet, great communicators rarely choose medicine as the path.

And this brings me to the contentious topic of the role of social media in medicine. For many of the early adopters, the question is no longer "should we", but "how best to." But my sense is that physicians engaging in social media are still a minority. I am not even sure what proportion of MDs are amenable to communication via e-mail with their patients, though these data may be out there. So, for what seems to me to be the majority of MDs who are not sold on e-mail, Twitter, Facebook, blogging or Quora, the value must not be that obvious. This makes me wonder if there are certain unifying characteristics of these docs, one being lack of perceived value of communication outside the profession across all media, including in-person contact.

I am friends with many docs on Twitter and in the blogosphere. The vast majority of them have shown themselves to be patient-centric, knowledgeable and collaborative, the kind of people I would not hesitate to send a loved one to. Yet, this is a skewed sample born out of a selection bias. These are the people who are interested and confident in their ability to communicate outside medicine. These are the people to whom medicine is a humanistic pursuit, where communities of patients and doctors strengthen the discussion of how to transform our system and the patient encounter. My guess is, and this is purely unscientific, that many of those who are skeptical of social media are also skeptical of communication itself, or just do not see the value of it in the equation of providing good patient care within the crushing time constraints of today's healthcare.

So my point is this: before social media tools can be expected to diffuse broadly into the medical community, the value of all communication needs to become clear to physicians in general. At this moment of increasing societal skepticism of science and of usefulness and integrity of the medical profession, against the backdrop of healthcare changes and increasingly unfiltered media noise, willingness and skills to communicate clearly may be as useful to today's doctors as a stethoscope. Once communication becomes the backbone of all medicine, tweets and blog posts are sure to start flowing freely from the fingers of physicians everywhere. And that will be good for the patients, the science and the healthcare system.

Wednesday, January 26, 2011

Webinar survey results

Last week I posted a survey link to gauge interest in and potential content for a webinar on how to review medical literature critically. I had a great response, and wanted to share the data with you.

The web page got 302 hits, resulting in 82 survey responses. This is a 27% rate of response, which certainly sets the results up to be biased and non-generalizable. But what the heck? I was looking to hear from people with some interest in this, not all-comers. So, here are the questions and the aggregated answers.

Q1: I am thinking about creating a webinar based on some of the posts I have done on how to review medical literature. Would this be of interest to you?
R1: 82 people responded, of whom 81 (99%) answered "yes".

Q2: Are you a healthcare professional/researcher, an e-patient, or just an innocent bystander?
R2: 82 responses: 60 (73%) healthcare professionals/researchers, 5 (6%) e-patients, 17 (21%) innocent bystanders.

Q3: Why do you feel the need to understand how to review medical literature?
R3: This was a free text field, and I got 73 responses. Of these, many had to do with gaining a better understanding of the subject in order to help others (patients, clients, trainees) learn how to read and understand medical literature.

Q4: This question was only for those who responded "yes" to being a healthcare professional/researcher: Do you engage in journal peer review as a reviewer?
R4: Of 60 responses, only 7 (12%) were "yes".

Q5: Similar to Q4, this question was for only those who responded "yes" to Q4: Have you had formal training on how to be an effective peer reviewer?
R5: All 7 responded, of whom only 2 (29%) had formal training through a journal or a professional society. The remaining 5 (71%) have gained pertinent knowledge through reading about it. None of the responders got any reviewing courses during their medical training. Although the sample size is small, the responses are revealing and go along with my experience.

Q6: This question was targeted to only those responders who identified themselves as e-patients: How technical do you want the webinar information to get?
R6: All 5 e-patients answered this question, of whom 2 were comfortable with some degree of technicality, while the remaining 3 were comfortable with a greater degree of it.

Q7: This question was for all responders who expressed interest in having a webinar: Would you want one session or multiple sessions?
R7: Of the 80 responders, 21 (26%) felt that 1 session would suffice, 40 (50%) would be amenable to up to 3 sessions, and 11 (14%) would do up to 5 sessions. The remaining 8 (10%) of the responders chose "other", where their replies ranged from "no clue" to "as many as you see fit" to "let's start an ongoing discussion."

Q8: This was for those who would prefer a single session: How long should the session be?
R8: Of the 20 responses, 10 (50%) indicated 1 hour, while the majority of the rest indicated 2 hours.

Q9: If you are a part of an institution, do you think this would be of interest to your institution?
R9: 70 people responded, with 39 (56%) saying "yes" and 31 (44%) saying "no".

Q10: This was for those responding "yes" to Q9: What type of an institution are you a part of?
R10: All 39 people responded, and there was a range of institutions from medical schools to hospitals to government organizations to academic libraries. What was interesting here was that none of the "yes" responses to Q9 came from anyone in Biopharma or a professional organization or a patient advocacy organization. This I found surprising.

Overall, I am very pleased with the response. I am grateful to Janice McCallum (@janicemccallum on Twitter) for spreading the word to a listserv of medical librarians. It certainly looks like there is enough interest in a webinar, and now I have to figure out how to execute one. If anyone has ideas, please, let me know in comments here or via e-mail.

Thanks again to all who took the time to respond!

Tuesday, January 25, 2011

Mirror neurons and the need for slow medicine

How long does it take for a silence to become uncomfortable? 5 seconds? 20 seconds? A minute? Students of education are taught to give a child roughly 20 seconds to answer a question posed to him. How long do teachers actually give? About 5 seconds, if that. Now sit there and count out 20 Mississippis and see what an astonishingly long time it seems. Why, what if a web page takes that long to load on your browser? This becomes a major technological tragedy for most of us. The point is that 20 seconds is a longer time than we appreciate.

Now, let's talk about empathy. Yes, empathy. This seeming non-sequitur has a solid connection. How do we like to experience empathy? Silent attentive listening is a great example of empathic engagement. When we talk with our friends about emotionally charged topics, we do not want them to respond with "yeah, yeah", and move on rapidly to the next topic, do we? So, empathy takes time and engagement. And when 20 seconds of silence seems like a long time, imagine it in a doctor's office, following a hard revelation or an emotional response by the patient. Can you? Are you counting the Mississippis?

Well, it is no wonder that doctors miss opportunities to express empathy to their patients. In a study from Canada, where oncologists were recorded during patient encounters, these doctors seized fewer than 1 in 4 opportunities to respond to their patients with empathy; the other 3 chances they squandered on discussing clinical information. And this is a pity, as is rightfully acknowledged by the investigator quoted in the article. His conjecture for why docs miss these opportunities to be empathic has to do with their apparently erroneous idea that it takes too much time, and his guidance is the following:
Showing empathy does not mean a doctor has to feel what his or her patient is feeling, Buckman says. Rather, it means acknowledging patients’ fears and other emotions.
“It is perfectly OK for the doctor to remain detached, but it is not OK to talk detached,” he says. “Acknowledging what a patient is feeling is not the same as feeling it yourself.”
Well, I have to respectfully disagree. Here is the meaning of the word "empathy" from the trusted Merriam-Webster dictionary:
: the action of understanding, being aware of, being sensitive to, and vicariously experiencing the feelings, thoughts, and experience of another of either the past or present without having the feelings, thoughts, and experience fully communicated in an objectively explicit manner; also : the capacity for this
And in fact, looking to brain science to guide us on how we are wired to accomplish this, we realize that by definition empathy implies non-detachment, and, in fact, involves feeling what the other is feeling. Empathy is mediated by the so-called mirror neurons, residing in the cingulate gyrus of the brain. The great neurobiologist VS Ramachandran thinks that the discovery of these neurons is to the study of human behavior what the discovery of DNA was to biology. It has been said that mirror neurons help "dissolve the 'self vs. other' barrier." It is these neurons that make us feel others' pain, literally and figuratively. So, putting ourselves in the other person's shoes and "feeling what the patient is feeling" is truly the sine qua non of empathy.

So, if the docs' intuition is correct, and empathy does mean non-detachment and time (after all 20 seconds represents 3% of a 10-minute appointment), how does the medical profession go about relishing and leveraging the other 3 opportunities for empathy instead of throwing them away? I agree with the point of the article that medical students should be taught empathic communication. At the same time, we learn by example, and if harried mentors continue to skirt these issues in the office because they are running two hours behind schedule already, the students will get the point loud and clear. The bigger issue is the incredible shrinking appointment, which is not only likely driving up healthcare costs and the frequency and intensity of testing, with its attendant adverse events, but is eroding the opportunity for a meaningful therapeutic relationship. After all, if the doctor herself provides a therapeutic benefit, is this not of utmost importance?

In short, this is another argument for slow medicine, an argument that should not be weakened by the detachment reasoning. My guess is that it is our biologic imperative as humans to exercise our mirror neurons avidly and often, and being forced to blunt their firing may be yet another path to demoralization. And is the medical profession not already demoralized enough?  

Sunday, January 23, 2011

Top 5 this week

#5: Do private ICU rooms really reduce HAIs?
#4: Data mining: It's about research efficiency.
#3: To guideline or not to guideline, that is the ques...
#2: Reviewing medical literature, part 1: The study qu...

#1: A webinar survey -- Please, take this brief survey to help me gauge
interest in and content for a possible webinar on how to read and review
medical literature.

Thanks for visiting and reading!

Friday, January 21, 2011

A webinar survey

Hi, folks,

I am conducting a survey to see how much interest there may be in a webinar on reviewing medical literature. This should take no more than 10 minutes of your time and would be enormously helpful to me to (a) gauge interest and (b) create appropriate content.

Thank you so much for doing this!
To get to the survey, click on this url:

Thursday, January 20, 2011

To guideline or not to guideline, that is the question in... pneumonia?

Addendum 1/20/11, 1:27 PM
I want to add something to this, since I have been reflecting on the data more. It turns out that about 3/4 of all patients had an organism isolated that was felt to be causative of their pneumonia. Among these patients, over 80% in each group received empiric treatment that covered the pathogen. This means that 4 out of 5 patients in both groups received appropriate antibiotic coverage. What the authors only skimmed over is de-escalation. De-escalation is the guideline-recommended strategy of narrowing the spectrum of treatment, once culture results become available, to only those antibiotics that cover what has grown out. So, if, say, a patient is being empirically treated for Pseudomonas aeruginosa with double coverage, and the culture grows out MRSA and no Pseudomonas, the two anti-pseudomonal drugs should be stopped immediately. The investigators state that they did apply a de-escalation protocol, and that by day 3 50% and by day 5 75% were essentially de-escalated. The fact that they state this in the Discussion section makes me think that this was inserted in response to a reviewer. It is a pity that they did not include de-escalation in their stratified analysis, as it may be at least somewhat explanatory for the findings.

I always felt that there was something intangible and intuitive about my assessments of the critically ill for whom I cared. I could not always explain why I thought one particular patient was more ill than the next, but there was that little something that I must have noticed out of the corner of my eye, and if I tried too hard to focus on it, it would disappear like a puff of smoke. Yet, docs make these pre-conscious assessments all the time. And though these hints drive treatment choices, they are distinctly difficult to quantify scientifically.

A new paper that was just published in The Lancet Infectious Diseases online is a great illustration of what happens when our analyses fail to account for these intuitions. The phenomenon is referred to as "confounding by indication", and it is the perennial plague of observational clinical research. Just to summarize, the study was an observational study of guideline implementation for the treatment of healthcare-associated pneumonia among ICU patients. The central recommendation concerned the choice of empiric antibiotics. The initial choice of antibiotics, even before the definitive results of cultures are available, is based on the clinician's best guess at what organism(s) may be causing the pneumonia. Among these severely ill patients, the risk of having a bug that is resistant to many antibiotics is higher than for patients who come from the community with pneumonia, and this propensity drives the recommendation for a broader antibiotic coverage for these cases. It has been shown by us and many others that missing this initial opportunity to cover the bug(s) adequately subjects patients to a doubling or even trebling of the risk of death, regardless of whether the coverage is broadened later to include the culprit organism(s).

Back to the study. The four academic medical centers that participated in it enrolled 303 eligible patients, of whom 129 were treated with antibiotic combinations that comported with the guideline recommendations (guideline compliant treatment) and 174 received other combinations that did not fit the guideline recommendations (guideline non-compliant). To their surprise, the investigators discovered that 28-day survival was actually higher in the non-compliant group than in the compliant one. And even after doing a great job of adjusting for many potential factors that made the groups different, this paradoxical disparity persisted, with an overall near-doubling in the hazard of death at 28 days in the compliant as opposed to the noncompliant group. Now, this is a fine how-do-you-do! So, does this mean that the guideline is actually killing people by advocating broader coverage? Well, not so fast.

First, I have to acknowledge that I may be engaging in rescue bias right now. Having said this, taking biological plausibility into account, the findings are very likely explained by confounding by indication. Namely, the docs who choose, say, dual rather than single therapy against gram-negative bacteria may be pre-consciously incorporating some intangible patient data into their choices, data that are not well represented by either laboratory values or disease severity scoring systems. I know this is a bit "soft" and maybe even "touchy-feely", but ask any doc, and s/he will confirm this phenomenon.

On the other hand, to be fair and balanced, I do have to agree that there may be other explanations. These include the possibility that our guideline recommendations, never really prospectively validated, may be wrong. Perhaps there is something about the untoward effects of these broad spectrum regimens that is at play. Maybe it is as simple as the "no free lunch" principle, and that even in the situation of covering appropriately broadly, introducing additional drugs increases not only their benefits, but also the risks associated with them. Finally, I have to acknowledge the possibility that we just have no clue what any of this means because our understanding of how antibiotics work in the setting of these types of pneumonia is flawed.

Now, let's put all of this in the context of our multiple discussions about data and knowledge on this web site. Several factors suggest that my initial explanation is correct. The bulk of the evidence points to the fact that skimpy early coverage increases the risk of death. Also, over a century of understanding and the durability of the germ theory imply that antibiotics are important in treating serious bacterial infections. So, the pre-test probability of the validity of the finding in the paper is pretty low. This is not to say that the study should not inject caution and self-examination into how we treat severe pneumonia; it absolutely should! This is also a place where we definitely need well designed interventional studies to confirm (or debunk) what we think we know to be true. In the meantime, as we often intone on this blog, let us not throw the baby out with the bath water.

Disclosure: I have done a lot of work in this area, so I have a potential intellectual COI with the study. Also, at least some of my research has been funded by the manufacturers of some of the antibiotics included in the guidelines.

Wednesday, January 19, 2011

Data mining: It's about research efficiency.

I have taken a little break from my reviewing literature series -- work has superseded all other pursuits for a little while. But I did want to do a brief post today, since this JAMA Commentary really intrigued me.

First thing that interested me was the authors. Now, I know who Benjamin Djulbegovic is -- you have to live under a rock as an outcomes researcher not to have heard of him. But who is Mia Djulbegovic? It is an unusual enough surname to make me think that she is somehow related to Benjamin. So, I queried the mighty Google, and it spat out 1,700 hits like nothing. But only one was useful in helping me identify this person, and that was a link to her paper in BMJ from 2010 on prostate cancer screening. On this paper (her only one listed on Medline so far), she is the first author, and her credentials are listed as "student", more specifically in the Department of Urology at the University of Florida College of Medicine in Gainesville, FL. The penultimate author on the paper is none other than Benjamin Djulbegovic, at the University of South Florida in Tampa, FL. So, I am surmising from this circumstantial evidence that Mia is Benjamin's kid who is either a college or a medical student. Why does this matter? Well, there seem to be so few papers in high impact journals that are authored by people without an advanced degree, let alone in the first position, that I am in awe of this young woman, now with two major journals to her name -- BMJ and JAMA. This is evidence that parental mentorship counts for a lot (assuming that I am correct about their relationship). But regardless, kudos to her!

Secondly, the title of the essay really grabbed me: what is the "principle of question propagation", and what does it have to do with comparative effectiveness research (CER) and data mining? Well, basically, the principle of question propagation is something we talk about here a lot: questions beget questions, and the further you go down any rabbit hole, the more detailed and smaller the questions become. This is the beauty and richness of science as well as what I have referred to as "unidirectional skepticism" of science, meaning that a lot of the time, building on existing concepts, we just continue down the same direction in a particular research pursuit. This is why Max Planck was right when he said
A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it.
So, yes, we build upon previous work, and continue our journey down a single rabbit hole our entire career. Though of course there are countless rabbit holes all being explored at the same time. It is really more of a fractal-like situation than a single linear progression. What is clear, as the authors of the Commentary point out, is that this results in the ever-escalating theoretical complexity of scientific concepts. What does this have to do with anything? This, the authors state, argues for continued use of theory driven hypothesis testing, given that medical knowledge will forever be incomplete. And this brings them to data mining.

Here is where I get a little confused and annoyed. They caution the powers that be against consigning all clinical research to data mining, at the expense of more rigorous studies to pursue hypothesis testing. They argue that mining data that already exist is limiting precisely because it is constrained by the scope of our current knowledge, and that we cannot use these data to generate new associations and new treatment paradigms. They further state that emerging knowledge will require updating these data sets with new data points, and this, according to the authors
...creates a paradox, which is particularly evident when searching for treatment effects in subgroups—one of the purported goals of the IT CER initiative. As new research generates new evidence of the importance for tailoring treatments to a given subpopulation of patients, the existing databases will need to be updated, in turn undermining the original purpose to discover new relationships via existing records.
Come agin? And then they say that "consequently, the data mining approach can never result in credible discoveries that will obviate the need for new data collection". Mmhm, and so? Is this the punch line? Well, OK, they also say that because of all this we will still need to do hypothesis testing research. Is this not self-evident?

I don't know about you, but I have never thought that retrospective data mining would be the only answer to our research needs. Rather, the way to view this type of research is as an opportunistic pursuit of information from massive repositories of existing data. We can look for details that are unavailable in the interventional literature, zoom in on the potentially important bits, and use this information to inform more focused (and therefore pragmatically more realistic) interventional studies.

Don't get me wrong, I am happy that the Djulbegovics published this Commentary. It is really designed more as an appeal to policy makers, who, in their perennial search for one-size-fits-all panaceas, may misinterpret our zeal for data mining as the singular answer to all our questions. No indeed, hypothesis testing will continue. But using these vast repositories of data should make us smarter and more efficient at asking the right questions and designing the appropriate studies to answer them. And then generate further questions. And then answer those. And then... Well, you get the picture.

Friday, January 14, 2011

Reviewing medical literature, part 4: Statistical analyses -- measures of central tendency

Well, we have come to the part of the series you have all been waiting for: discussion of statistics. What, you are not as excited about it as I am? Statistics are not your favorite part of the study? I am frankly shocked! But seriously, I think this is the part that most people, both lay public and professionals, find off-putting. But fear not, for we will deconstruct it all in simple terms here. Or obfuscate further, one or the other.

So, let's begin with a motto of mine: If you have good results, you do not need fancy statistics. This goes along with the ideas in math and science that truth and computational beauty go hand in hand. So, if you see something very fancy that you have never heard of, be on guard for less than important results. This, of course, is just a rule of thumb, and, as such, will have exceptions.

The general questions I like to ask about statistics are 1). Are the analyses appropriate to the study question(s), and 2). Are the analyses optimal to the study question(s). The first thing to establish is the integrity and completeness of the data. If the authors enrolled 365 subjects but were only able to analyze 200 of them, this is suspicious. So, you should be able to discern how complete the dataset was, and how many analyzable cases there were. A simple litmus test is that if more than 15% of the enrolled cases did not have complete data for analysis or dropped out of the study for other reasons, the study becomes suspect for a selection bias. The greater the proportion of dropouts, the greater the suspicion.
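To make the litmus test concrete, here is a minimal sketch in Python. The function name and the way the 15% cutoff is coded are my own illustrative choices; the numbers are the 365 enrolled and 200 analyzed subjects from the example above.

```python
def missing_fraction(enrolled: int, analyzed: int) -> float:
    """Fraction of enrolled subjects who did not make it into the analysis."""
    if enrolled <= 0 or not 0 <= analyzed <= enrolled:
        raise ValueError("need 0 <= analyzed <= enrolled and enrolled > 0")
    return (enrolled - analyzed) / enrolled


# Rule-of-thumb cutoff from the text; a heuristic, not a formal standard.
SUSPICION_THRESHOLD = 0.15

fraction_lost = missing_fraction(enrolled=365, analyzed=200)
is_suspect = fraction_lost > SUSPICION_THRESHOLD

print(f"Lost to analysis: {fraction_lost:.1%}")
print("Suspect for selection bias" if is_suspect else "Reasonably complete")
```

In this example nearly half of the enrolled subjects are missing from the analysis, which is far beyond the cutoff.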

Once you have established that the set is fairly complete, move on to the actual analyses. Here, first things first: the authors need to describe their study group(s); hence, descriptive statistics. Usually this includes so-called "baseline characteristics", consisting of demographics (age, gender, race), comorbidities (heart failure, lung disease, etc.), and some measure of the primary condition in question (e.g., pneumonia severity index [PSI] in a study of patients with pneumonia). Other relevant characteristics may be reported as well, and this is dependent on the study question. As you can imagine, categorical variables (once again, these are variables that have categories, like gender or death) are expressed as proportions or percentages, while continuous ones (those that are on a continuum, like age) are represented by their measures of central tendency.

It is important to understand the latter well. There are three major measures of central tendency: mean, median and mode. The mean is the sum of all individual values of a particular variable divided by the number of values. So, mean age among a group of 10 subjects would be calculated by adding all 10 individual ages and then dividing by 10. The median is the value that occurs in the middle of a distribution. So, if there are 25 subjects with ages ranging from 5 to 65, the median value is the one that occurs in subject number 13 when subjects are arranged in ascending or descending order by age. The mode, a measure used least frequently in clinical studies, signifies, somewhat paradoxically, the value in a distribution that occurs most frequently.
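For those who like to see the arithmetic, here is a small sketch of the three measures using Python's standard `statistics` module. The nine ages are made up for illustration only.

```python
import statistics

# Hypothetical ages of nine subjects, already sorted for readability.
ages = [5, 12, 18, 23, 23, 31, 40, 53, 65]

mean_age = statistics.mean(ages)      # sum of the values divided by their count
median_age = statistics.median(ages)  # middle value: the 5th of 9 sorted values
mode_age = statistics.mode(ages)      # the most frequently occurring value

print(f"mean={mean_age}, median={median_age}, mode={mode_age}")
```

With an odd number of subjects the median is simply the middle one; with an even number, `statistics.median` averages the two middle values.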

So, let's focus on the mean and the median. The mean is a good representation of the central value in a normal distribution. Also referred to as a bell curve (yes, because of its shape), or a Gaussian distribution, in this type of a distribution there are roughly equal numbers of points to the left and to the right of the mean value. It looks like this:
[Figure: a normal distribution, or bell curve]
For a distribution like the one above it hardly matters which central value is reported, the mean or the median, as they are the same or very similar to one another. Alas, most descriptors of human physiology are not normally distributed, but are more likely to be skewed. Skewed means that there is a tail at one end of the curve or the other:
[Figure: skewed distributions; the right panel shows a curve with a long tail to the right]
For example, in my world of health economics, many values for such variables as length of stay and costs spread out to the right of the center, similar to the blue curve in the right panel of the above figure. In this type of a distribution the mean and the median values are not the same, and they tell you different things. While the median gives you an idea of where the bulk of the distribution, tightly clustered at the end opposite the tail, is centered, the mean gets pulled toward the tail by the relatively few extreme values. For a distribution similar to the one in the right panel, the mean will therefore overestimate the typical value.
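Here is a small numeric sketch of what a long right tail does to the two measures. The ten lengths of stay are invented for illustration: eight short stays and two very long ones.

```python
import statistics

# Hypothetical hospital lengths of stay in days: most are short, two are very long.
los_days = [2, 3, 3, 4, 4, 5, 5, 6, 30, 48]

mean_los = statistics.mean(los_days)      # pulled toward the two long stays
median_los = statistics.median(los_days)  # stays with the bulk of the patients

print(f"mean LOS = {mean_los} days, median LOS = {median_los} days")
```

Eight of the ten patients stayed 6 days or fewer, yet the mean lands well above that, while the median sits squarely among the short stays. This is why skewed outcomes such as length of stay and costs are usually better summarized by the median.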

To round out the discussion of central values, we need to say a few words about scatter around these values. Because they represent a population and not a single individual, measures of central tendency will have some variation around them that is specific to the population. For a mean value, this variation is usually represented by standard deviation (SD), though sometimes you will see a 95% confidence interval as the measure of the scatter. Variation around the median is usually expressed as the range of values falling into the central one-half of all the values in the distribution, discarding the 25% at each end, or the interquartile range (IQR 25, 75) around the median. These values represent the stability and precision of our estimates and are important to look for in studies.
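As a rough sketch of the two kinds of scatter, again with made-up lengths of stay, the standard `statistics` module can produce both the SD and the quartiles. Note that there are several conventions for computing quartiles; the `"inclusive"` method used here is just one of them, so other software may give slightly different cut points.

```python
import statistics

# Hypothetical lengths of stay in days, with a long right tail.
los_days = [2, 3, 3, 4, 4, 5, 5, 6, 30, 48]

# Scatter around the mean: sample standard deviation.
sd = statistics.stdev(los_days)

# Scatter around the median: the 25th and 75th percentiles bracket the
# central half of the values (the interquartile range).
q25, q50, q75 = statistics.quantiles(los_days, n=4, method="inclusive")

print(f"mean (SD): {statistics.mean(los_days)} ({sd:.1f})")
print(f"median (IQR 25, 75): {q50} ({q25}, {q75})")
```

In skewed data like these, the SD is inflated by the two long stays, while the IQR still describes where the central half of the patients fall.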

We'll end this discussion here for the moment. In the next post we will tackle inter-group differences and  hypothesis testing.      

Thursday, January 13, 2011

Reviewing medical literature part 3 continued: threats to validity

As promised, today we talk about confounding and interaction.

A confounder is a factor related to both the exposure and the outcome. Take for example the relationship between alcohol and head and neck cancer. While we know that heavy alcohol consumption is associated with a heightened risk of head and neck cancer, we also know that people who consume a lot of alcohol are also more likely to be smokers, and smoking in turn raises the risk of H&N CA. So, in this case smoking is a confounder of the relationship between alcohol consumption and the development of H&N CA. It is virtually impossible to get rid of all confounding completely in any study design, save for possibly in a well designed RCT, where randomization presumably assures equal distribution of all characteristics; and even there you need an element of luck. In observational studies our only hope to deal with confounding is through statistical manipulation we call "adjustment", as it is virtually impossible to chase it away any other way. And in the end we still sigh and admit to the possibility of residual confounding. Nevertheless, going through the exercise is still necessary in order to get closer to the true association of the main exposure and the outcome of interest.

There are multiple ways of dealing with the confounding conundrum. The techniques used are matching, stratification, regression modeling, propensity scoring and instrumental variables. By far the most commonly used method is regression modeling. This is a rather complex computation that requires much forethought (in other words, "Professional driver on a closed circuit; don't try this at home"). The frustrating part is that, just because the investigators did the regression, does not mean that they did it right. Yet word limits for journal articles often preclude authors from giving enough detail on what they did. At the very least they should tell you what kind of a regression they ran and how they chose the terms that went into it. Regression modeling relies on all kinds of assumptions about the data, and it is my personal belief, though I have no solid evidence to prove it, that these assumptions are not always met.

And here are the specific commonly encountered types of regressions and when each should be used:
1. Linear regression. This is a computation used for outcomes that are continuous variables (i.e., variables represented by a continuum of numbers, like age, for example). This technique's main assumption is that the exposure and outcome are related to each other in a linear fashion. The resulting beta coefficient is the slope of this relationship if it is graphed.
2. Logistic regression. This is done when the outcome variable is categorical (i.e., one of two or more categories, like gender, for example, or death). The result of a logistic regression is an adjusted odds ratio (OR). It is interpreted as an increase or a decrease in the odds of the outcome occurring due to the presence of the main exposure. Thus, an OR of 0.66 means that there is a 34% reduction in the odds (used interchangeably with risk, though this is not quite accurate) of the outcome due to the presence of the exposure. Conversely, an OR of 1.34 means the opposite, or a 34% increase in the odds of the outcome if the exposure is present.
3. Cox proportional hazards. This is a common type of a model developed for a time to event, also known as "survival analysis" (even if not done for survival per se as the outcome). The resulting value is a hazard ratio (HR). For example, if we are talking about a healthcare-associated infection's impact on the risk of remaining in the hospital longer, an HR of, say, 1.8 means that an HAI increases the risk of being in the hospital by 80% at any time during the hospitalization. To me this tends to be the most problematic technique in terms of assumptions, as it requires that the ratio of the hazards between the groups being compared stays constant throughout the time frame of the analysis, and how often does this hold true? For this reason the investigators should be explicit about whether or not they tested for the assumption of proportional hazards and whether this was met.
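To illustrate how an odds ratio is read, and why odds and risk are not quite the same thing, here is a minimal sketch with an invented 2x2 table. These are crude (unadjusted) ratios computed by hand, not the adjusted OR that a logistic regression would produce; the function names and counts are hypothetical.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Crude odds ratio from a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    return (a * d) / (b * c)


def risk_ratio(a: int, b: int, c: int, d: int) -> float:
    """Crude risk ratio from the same 2x2 table."""
    return (a / (a + b)) / (c / (c + d))


def percent_change(ratio: float) -> float:
    """Express a ratio as a percent increase (+) or decrease (-) from 1."""
    return (ratio - 1.0) * 100.0


# Hypothetical table with a common outcome: 40 of 100 exposed and
# 50 of 100 unexposed subjects have the outcome.
table = (40, 60, 50, 50)

or_value = odds_ratio(*table)
rr_value = risk_ratio(*table)

print(f"OR = {or_value:.2f} ({percent_change(or_value):+.0f}% odds)")
print(f"RR = {rr_value:.2f} ({percent_change(rr_value):+.0f}% risk)")
```

In this made-up table the odds fall by about a third while the risk falls by only a fifth. The two measures approximate one another only when the outcome is rare, which is why reading an OR as a change in risk can exaggerate the effect.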

Let's now touch upon the other techniques that help us to unravel confounding. Matching is just that: it is a process of matching subjects with the primary exposure to those without in a cohort study or subjects with the outcome to those without in a case-control study, based on certain characteristics, such as age, gender, comorbidities, disease severity, etc.; you get the picture. By its nature, matching reduces the amount of analyzable data, and thus reduces the power of the study. So, it is most efficiently applied in a case-control setting, where it actually improves the efficiency of enrollment.

Stratification is the next technique. The word "stratum" means "layer", and stratification refers to describing what happens to the layers of the population of interest with and without the confounding characteristic. In the above example of smoking confounding the alcohol and H&N CA relationship, stratifying the analyses by smoking (comparing the H&N CA rates among drinkers and non-drinkers in the smoking group separately from the non-smoking group) can divorce the impact of the main exposure from that of the confounder on the outcome. This method has some distinct intuitive appeal, though its cognitive effectiveness and efficiency dwindle the more strata we need to examine.
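A toy example may help here. The counts below are entirely made up; they are chosen so that drinkers are mostly smokers and smokers have a much higher cancer risk, which is the confounding structure described above.

```python
from typing import NamedTuple


class Stratum(NamedTuple):
    exposed_cases: int     # drinkers who developed H&N CA
    exposed_total: int     # all drinkers
    unexposed_cases: int   # non-drinkers who developed H&N CA
    unexposed_total: int   # all non-drinkers


def risk_ratio(s: Stratum) -> float:
    """Risk in drinkers divided by risk in non-drinkers."""
    return (s.exposed_cases / s.exposed_total) / (
        s.unexposed_cases / s.unexposed_total
    )


# Hypothetical cohort, split by the confounder (smoking).
strata = {
    "smokers": Stratum(120, 800, 20, 200),
    "non-smokers": Stratum(6, 200, 16, 800),
}

# Crude analysis: ignore smoking and add the strata together column by column.
crude = Stratum(*(sum(column) for column in zip(*strata.values())))

print(f"crude RR = {risk_ratio(crude):.2f}")
for name, stratum in strata.items():
    print(f"RR among {name} = {risk_ratio(stratum):.2f}")
```

In this invented cohort the crude risk ratio for drinking is 3.5, but within each smoking layer it is only 1.5. The difference is the part of the crude association that really belongs to smoking.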

Propensity scoring is gaining popularity as an adjustment method in the medical literature. A propensity score is essentially a number, usually derived from a regression analysis, giving the propensity of each subject for a particular exposure. So, in terms of smoking, we can create a propensity score based on other common characteristics that predict smoking. Interestingly, some of these characteristics will be present also in people who are not smokers, yielding a similar propensity score in the absence of this exposure. Matching smokers to non-smokers based on the propensity score and examining their respective outcomes allows us to understand the independent impact of smoking on, say, the development of coronary artery disease. As in regression modeling, the devil is in the details. Some studies have indicated that most papers that employ propensity scoring as the adjustment method do not do this correctly. So, again, questions need to be asked and details of the technique elicited. There is just no shortcut to statistics.

Finally, a couple of words about instrumental variables. This method comes to us from econometrics. An instrumental variable is one that is related to the exposure but not to the outcome, other than through the exposure itself. One of the most famous uses of this method was published by a fellow you may have heard of, Mark McClellan, where he looked at the proximity to a cardiac intervention center as the instrumental variable in the outcomes of acute coronary events. Essentially, he argued, the randomness of whether or not you are close to a center randomizes you to the type of treatment you get. Incidentally, in this study he showed that invasive interventions were responsible for a very small fraction of the long-term outcomes of heart attacks. I have not seen this method used that much in the literature I read or review, but am intrigued by its potential.

And now, to finish out this post, let's talk about interaction. "Interaction" is a term mostly used by statisticians to describe what epidemiologists call "effect modification" or "effect heterogeneity". It is just what the name implies: there may be certain secondary exposures that either potentiate or diminish the impact of the main exposure of interest on the outcome. Take the triad of smoking, asbestos and lung cancer. We know that the risk of lung cancer among smokers who are also exposed to asbestos is far higher than among those who have not been exposed to asbestos. Thus, asbestos modifies the effect of smoking on lung cancer. So, to analyze those smokers exposed to asbestos together with those who were not will result in an inaccurate measure of the association of smoking with lung cancer. More importantly, it will fail to recognize this very important potentiator of tobacco's carcinogenic activity. To deal with this, we need to be aware of the potentially interacting exposures, and either stratify our analyses based on the effect modifier or work the interaction term (usually constructed as a product of the two exposures, in our case smoking and asbestos) into the regression modeling. In my experience as a peer reviewer, interactions are rarely explored adequately. In fact, I am not even sure that some investigators understand the importance of recognizing this phenomenon. Yet, the entire idea of heterogeneous treatment effect (HTE) and our pathetic lack of understanding of its impact on our current bleak therapeutic landscape, is the result of this very lack of awareness. The future of medicine truly hinges on understanding interaction. Literally. Seriously. OK, at least in part.
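Here is a sketch of what effect modification looks like in numbers. The lung cancer rates are invented purely for illustration and are not real epidemiologic estimates. The last step relies on an assumption worth stating: in a model that is linear on the log scale and includes both exposures plus their product, the coefficient on the product term equals the log of the ratio of the two stratum-specific risk ratios.

```python
import math

# Hypothetical lung cancer cases per 1,000 people, keyed by (smoker, asbestos).
cases_per_1000 = {
    (0, 0): 1,    # neither exposure
    (1, 0): 10,   # smoking only
    (0, 1): 2,    # asbestos only
    (1, 1): 60,   # both exposures
}

# Effect of smoking within each asbestos stratum.
rr_smoking_without_asbestos = cases_per_1000[(1, 0)] / cases_per_1000[(0, 0)]
rr_smoking_with_asbestos = cases_per_1000[(1, 1)] / cases_per_1000[(0, 1)]

# How much asbestos multiplies the effect of smoking. A value of 1 would
# mean no effect modification on the ratio scale.
interaction_ratio = rr_smoking_with_asbestos / rr_smoking_without_asbestos

# Assumed correspondence: the product-term coefficient in a log-scale model.
product_term_coefficient = math.log(interaction_ratio)

print(f"RR of smoking, no asbestos: {rr_smoking_without_asbestos:.0f}")
print(f"RR of smoking, asbestos:    {rr_smoking_with_asbestos:.0f}")
print(f"interaction ratio:          {interaction_ratio:.0f}")
print(f"product-term coefficient:   {product_term_coefficient:.2f}")
```

Lumping the two strata together would yield a single risk ratio for smoking somewhere between the two, which describes neither group well. That is the inaccuracy the paragraph above warns about.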

In the next installment(s) of the series we will start tackling study analyses. Thanks for sticking with me.        

Wednesday, January 12, 2011

Reviewing medical literature, part 3: Threats to validity

You have heard this a thousand times: no study is perfect. But what does this mean? In order to be explicit about why a certain study is not perfect, we need to be able to name the flaws. And let's face it: some studies are so flawed that there is no reason to bother with them, either as a reviewer or as an end-user of the information. But again, we need to identify these nails before we can hammer them into a study's coffin. It is the authors' responsibility to include a Limitations paragraph somewhere in the Discussion section, in which they lay out all of the threats to validity and offer educated guesses as to the importance of these threats and how they may be impacting the findings. I personally will not accept a paper that does not present a coherent Limitations paragraph. However, reviewers are not always as, shall we say, hard-assed about this as I am, and that is when the reader is on her own. Let us be clear: even if the Limitations paragraph is included, the authors do not always do a complete job (and this probably includes me, as I do not always think of all the possible limitations of my work). So, as in everything, caveat emptor! Let us start to become educated consumers.

There are four major threats to validity that fit into two broad categories. They are:
A. Internal validity
  1. Bias
  2. Confounding/interaction
  3. Mismeasurement or misclassification
B. External validity
  4. Generalizability
Internal validity refers to whether the study is examining what it purports to be examining, while external validity, synonymous with generalizability, gives us an idea about how broadly the results are applicable. Let us define and delve into each threat more deeply.

Bias is defined as "any systematic error in the design, conduct or analysis of a study that results in a mistaken estimate of an exposure's effect on the risk of disease" (the reference for this is Schlesselman JJ, as cited in Gordis L, Epidemiology, 3rd edition, page 238). I think of bias as something that artificially makes the exposure and the outcome either occur together or apart more frequently than they should. For example, the INTERPHONE study has been criticized for its biased design, in that it defined exposure as at least one cellular phone call every week. Now enrolling such light users can really result in such a small exposure as not to be able to detect any increase in adverse events. This is an example of a selection bias, by far the most common form that bias takes. Another example of a frequent bias is encountered in retrospective case-control studies where people are asked to recall distant exposures. Take for example middle-aged women with breast cancer who are asked to recall their diets when they were in college. Now, ask the same of similar women without breast cancer. What you are likely to get is the effect, absent in women without cancer, of seeking an explanation for the cancer that expresses itself in a bias in what women with cancer recall eating in their youth. So, a bias in the design can make the association seem either stronger or weaker than it is in reality.

I want to skip over confounding and interaction at the moment, as these threats deserve a post of their own, which is forthcoming. Suffice it to say here that a confounder is a factor related to both the exposure and the outcome. An interaction is also referred to as effect modification or effect heterogeneity. This means that there may be population characteristics that alter the response to the exposure of interest. Confounders and effect modifiers are probably the trickiest concepts to grasp. So, stay tuned for a discussion of those.

For now, let us move on to measurement error and misclassification. Measurement error, resulting in misclassification, can happen at any step of the way: it can be in the primary exposure, a confounder, or the outcome of interest. I run into this problem all the time in my research. Since I rely on administrative coding for a lot of the data that I use, I am virtually certain that the codes routinely misclassify some of the exposures and confounders that I deal with. Take Clostridium difficile as an example. There is an ICD-9 code to identify it in administrative databases. However, we know from multiple studies that it is not all that sensitive or all that specific; it is merely good enough, particularly for making observations over time. But even for laboratory values there is a certain potential for measurement error, though we seem to think that lab results are sacred and immune to mistakes. And need I say more about other types of medical testing? Anyhow, the possibility of error and misclassification is ubiquitous. What needs to be determined by the investigator and the reader alike is the probability of that error. If the probability is high, one needs to understand whether it is a systematic error (for example, a coder always more likely than not to include C. diff as a diagnosis) or a random one (a coder is just as likely to include as not to include a C. diff diagnosis). And while a systematic error may result in either a stronger or a weaker association between the exposure and the outcome, a random, or non-differential, misclassification will virtually always reduce the strength of this association.
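The last point can be checked with a little arithmetic. In the sketch below, all counts, the sensitivity and the specificity are hypothetical; the key assumption is that the exposure is misclassified with the same error rates in people with and without the outcome, which is what makes the error non-differential. The calculation uses expected counts, so there is no random noise involved.

```python
def misclassify(exposed: float, unexposed: float,
                sensitivity: float, specificity: float) -> tuple[float, float]:
    """Expected observed (exposed, unexposed) counts after imperfect coding."""
    observed_exposed = sensitivity * exposed + (1 - specificity) * unexposed
    observed_unexposed = (1 - sensitivity) * exposed + specificity * unexposed
    return observed_exposed, observed_unexposed


def odds_ratio(case_exp: float, case_unexp: float,
               ctrl_exp: float, ctrl_unexp: float) -> float:
    return (case_exp * ctrl_unexp) / (case_unexp * ctrl_exp)


# Hypothetical true exposure counts among people with and without the outcome.
cases_true = (400, 600)
controls_true = (200, 800)

# The same imperfect coding is applied to both groups (non-differential).
SENSITIVITY, SPECIFICITY = 0.80, 0.90
cases_obs = misclassify(*cases_true, SENSITIVITY, SPECIFICITY)
controls_obs = misclassify(*controls_true, SENSITIVITY, SPECIFICITY)

true_or = odds_ratio(*cases_true, *controls_true)
observed_or = odds_ratio(*cases_obs, *controls_obs)

print(f"true OR = {true_or:.2f}, observed OR = {observed_or:.2f}")
```

The observed odds ratio is pulled from the true value toward 1, that is, toward no association, even though the coding was equally sloppy in both groups.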

And finally, generalizability is a concept that helps the reader understand what population the results may be applicable to. In other words, will the data be applied strictly to the population represented in the study? If so, is it because there are biological reasons to think that the results would be different in a different population? And if so, is it simply the magnitude of the association that can be expected to be different or is it possible that even the direction could change? In other words, could something found to be beneficial in one population be either less beneficial or even more harmful in another? The last question is the reason that we perseverate on this idea of generalizability. Typically, a regulatory RCT is much less likely to give us adequate generalizability than a well designed cohort study, for example.

Well, these are the threats to validity in a nutshell. In the next post we will explore much more fully the concepts of confounding and interaction and how to deal with them either at the study design or study analysis stage.            

Tuesday, January 11, 2011

Do private ICU rooms really reduce HAIs?

We have known for quite some time now that the patient's environment in a hospital matters to his/her outcomes. The concept of biophilia was applied by Roger Ulrich back in the 1980s to surgical patients in a series of experiments. Famously, this work showed that looking out your hospital room's window on a bunch of trees is associated with better and less eventful post-operative recovery than staring at a brick wall, for example. We have also known for some time that some of the hospital-associated delirium can be mitigated by having the patient dwell in a room with a window and be exposed to the diurnal light changes.

Another, perhaps even more tangible outcome that can be modified by hospital design is the spread of hospital-acquired infections. This week a paper in the Archives of Internal Medicine from the group in Quebec, who brought us detailed reports of the devastating multihospital hypervirulent Clostridium difficile outbreak in the last decade, generally confirms the effectiveness of private ICU rooms in containing the spread of HAIs. There are some interesting details to point out.

For example, the intervention hospital appears to have had a higher proportion of medical patients than the control institution. Why is this important? Well, medical patients generally experience more chronic and therefore longer stays in the ICU. This gives them a greater opportunity for exposure to HAIs than their surgical counterparts. On the other hand, we know that VAP, for example, an infection very likely to be caused by one of the resistant organisms listed in Table 2 of the paper, happens much more frequently in trauma ICUs than medical ICUs.

Second, the unadjusted ICU length of stay shows some interesting results, depicted in the graph below:
So, while at the intervention hospital the raw ICU LOS has remained stable, at the comparator institution it has been slowly creeping up. Of course, the investigators adjusted for all kinds of factors that may influence this outcome, and showed that there may be a (marginal) reduction in the ICU LOS in association with the switch to private rooms. The authors note that the adjusted average ICU LOS fell by 10%, though under similar circumstances in other similar investigations there is a 95% chance that this would fall somewhere between 0% and 19% reduction. So, under the best of circumstances, if we get a 20% reduction in the 5-day ICU LOS, this translates to about 1 day. And given that transfer timing is more likely to be driven by the availability of ward beds than by the patient's clinical readiness, I question whether this is truly a staggering reduction. Additionally, if you read on, you will realize that there is very little reason to believe that this maximal reduction in ICU LOS is likely to be achieved by an average institution. In fact, even the 10% seen on average in this investigation may be a bar that is too high in other less well organized ICUs.

It is important to remember a couple of things: 1). In some circumstances there is unlikely to be any reduction in the ICU LOS; 2). Since LOS is not a normally distributed function, the mean value overestimates the true measure of central tendency in this outcome (this is due to the typically long right tail present in this distribution); and 3). This investigation, though not strictly speaking experimental, was done at 2 academic institutions with highly organized infrastructure and what looks like closed model ICUs (a dedicated specialized team of critical care professionals caring for all ICU patients). For this reason, a similar intervention at a less stringently streamlined institution is unlikely to produce the same magnitude of results.
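A small sketch can show how a long right tail pulls the mean away from the median. The ten lengths of stay below are invented for illustration (nine short stays and one very long one) and have nothing to do with the data in the paper.

```python
from statistics import mean, median

# Hypothetical ICU lengths of stay in days, invented for illustration:
# most patients leave within a week, one patient stays more than a month.
icu_los_days = [2, 2, 3, 3, 3, 4, 4, 5, 6, 38]

los_mean = mean(icu_los_days)
los_median = median(icu_los_days)
stays_above_mean = sum(1 for days in icu_los_days if days > los_mean)

print(f"mean LOS:   {los_mean:.1f} days")
print(f"median LOS: {los_median:.1f} days")
print(f"patients staying longer than the mean: {stays_above_mean} of {len(icu_los_days)}")
```

In this made-up sample a single long stay is enough to put the mean above what nine of the ten patients actually experienced, which is why the median is usually the more honest summary for LOS.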

But the mere fact that the rates of exogenous transmission of pathogenic organisms were reduced is itself encouraging. At the same time, by focusing on carriage rates and not just clinical infections, the authors may be overstating the clinical significance of the observed reduction. Additionally, one of the issues that does not appear to have been addressed explicitly has to do with the availability of sinks: In the intervention unit there was a plethora of sinks, missing in the pre- period and also not available in the comparator hospital. Is it possible then that simply putting in more sinks would accomplish the same for a lot less money?

And this brings me to my next issue with the paper -- cost effectiveness. Now, according to the AHA annual survey of US hospitals, the average age of the physical plant is on the order of 10 years. Given the rapid pace of change in medicine, this may well signal a time for capital investments in plant improvements. And surely from the patient's and family's perspective, private rooms are preferable. However, one must ask the pesky question of the return on such an investment in this era of much needed fiscal restraint in medicine. If the same outcomes of reducing the spread of infectious organisms can be achieved with merely adding sinks, this may be a less drastic and more immediately feasible intervention well worth considering.                

Monday, January 10, 2011

Reviewing medical literature, part 2b: Study design continued

To synthesize what we have addressed so far with regard to reading medical literature critically:
1. Always identify the question addressed by the study first. The question will inform the study design.
2. Two broad categories of studies are observational and interventional.
3. Some observational designs, such as cross-sectional and ecological, are adequate only for hypothesis generation and NOT for hypothesis testing.
4. Hypothesis testing does not require an interventional study, but can be done in an appropriately designed observational study.

In the last post, where we addressed at length both cross-sectional and ecologic studies, we introduced the following scheme to help us navigate study designs:
Let's now round out our discussion of the observational studies and move on to the interventional ones.

Case-control studies are done when the outcome of interest is rare. These are typically retrospective studies, taking advantage of already existing data. By virtue of this they are quite cost-effective. Cases are defined by the presence of a particular outcome (e.g., bronchiectasis), and controls have to come from a similar underlying population. The exposures (e.g., chronic lung infection) are identified backwards, if you will. In all honesty, case-control studies are very tricky to design well, analyze well and interpret well. Furthermore, it has been my experience that many authors frequently confuse case-control with cohort designs. I cannot tell you how many times as a peer-reviewer I have had to point out to the authors that they have erroneously pegged their study as a case-control when in reality it was a cohort study. And in the interest of full disclosure, once, many years ago, an editor pointed out a similar error to me in one of my papers. The hallmark of case-control is that the selection criteria are the end of the line, or the presence of a particular outcome, and all other data are collected backwards from this point.

Cohort studies, on the other hand, are characterized by defining exposure(s) and examining outcomes occurring after these exposures. Similar to the case-control design, retrospective cohort studies are opportunistic in that they look at already collected data (e.g., administrative records, electronic medical records, microbiology data). So, although retrospective here means that we are using data collected in the past, the direction of the events of interest is forward. This is why they are named cohort studies, to evoke a vision of Caesar's army advancing on their enemy.

Some of the well known examples of prospective cohort studies are the Framingham Study, the Nurses' Health Study, and many others. These are bulky and enormously expensive undertakings, going on over decades, addressing myriad hypotheses. But the returns can be pretty impressive -- just look at how much we have learned about coronary disease, its risk factors and modifiers from the Framingham cohort!

Although these observational designs have been used to study therapeutic interventions and their consequences, the HRT story is a vivid illustration of the potential pitfalls of these designs to answer such questions. Case-control and cohort studies are better left for answering questions about such risks as occupational, behavioral and environmental exposures. Caution is to be exercised when testing hypotheses about the outcomes of treatment -- these hypotheses are best generated in observational studies, but tested in interventional trials.

Which brings us to interventional designs, the most commonly encountered of which is a randomized controlled trial (RCT). I do not want to belabor this, as RCT has garnered its (un)fair share of attention. Suffice it to say that matters of efficacy (does a particular intervention work statistically better than the placebo) are best addressed with an RCT. One of the distinct shortcomings of this design is its narrow focus on very controlled events, frequently accompanied by examining surrogate (e.g., blood pressure control) rather than meaningful clinical (e.g., death from stroke) outcomes. This feature makes the results quite dubious when translated to the real world. In fact, it is well appreciated that we are prone to see much less spectacular results in everyday practice. What happens in the real world is termed "effectiveness", and, though ideally also addressed via an RCT, is, pragmatically speaking, less amenable to this design. You may see mention of pragmatic clinical trials of effectiveness, but again they are pragmatic in name only, being impossibly labor- and resource-intensive.

Just a few words about before-and-after studies, as this is the design pervasive in quality literature. You may recall the Keystone project in Michigan, which put checklists and Peter Pronovost on the map. The most publicized portion of the project was aimed at eradication of central line-associated blood stream infections (CLABSI) (you will find a detailed description in this reference, Pronovost et al. N Engl J Med 2006;355:2725-32). The exposure was a comprehensive evidence-based intervention bundle geared ultimately at building a "culture of safety" in the ICU. The authors call this a cohort design, but the deliberate nature of the intervention arguably puts it into an interventional trial category. Regardless of what we call it, the "before" refers to measurement of CLABSI rates prior to the intervention, while the "after", of course, is following it. There are many issues with this type of design, ranging from confounding to the Hawthorne effect, and I hope to address these in later posts. For now, just be aware that this is a design that you will encounter a lot if you read quality and safety literature.

I will not say much about the cross-over design, as it is fairly self-explanatory and is relatively infrequently used. Suffice it to say that subjects can serve as their own controls in that they get to experience both the experimental treatment and the comparator in tandem. This is also fraught with many methodologic issues, which we will be touching upon in future posts.

The broad category of "Other" in the above schema is basically a wastebasket for me to put designs that are not amenable to being categorized as observational or interventional. Cost effectiveness studies frequently fall into this category, as do decision and Markov models.

Let's stop here for now. In the next post we will start to address threats to study validity. I welcome your questions and comments -- they will help me to optimize this series' usefulness.                

Friday, January 7, 2011

Reviewing medical literature, part 2a: Study design

It is true that the study question should inform the study design. I am sure you are aware of the broadest categorization of study design -- observational vs. interventional. When I read a study, after identifying the research question I go through a simple 4-step exercise:
1. I look for what the authors say their study design is. This should be pretty easily accessible early in the Methods section of the paper, though that is not always the case. If it is available,
2. I mentally judge whether or not it is feasible to derive an answer to the posed question using the current study design. For example, I spend a lot of time thinking about issues of therapeutic effectiveness and cost-effectiveness, and a randomized controlled trial exploring efficacy of a therapy cannot adequately answer the effectiveness questions.
If the design of the study appears appropriate,
3. I structure my reading of the paper in such a way as to verify that the stated design is, in fact, the actual design. If it is, then I move on to evaluate other components of the paper. If it is not what the authors say,
4. I assign my own understanding to the actual design at hand and go through the same mental list as above with the current understanding in mind.

Here is a scheme that I often use to categorize study designs:
As already mentioned, the first broad division is between observational studies and interventional trials. An anecdote from my course this past semester illustrates that this is not always a straightforward distinction to make. In my class we were looking at this sub-study of the Women's Health Initiative (WHI), that pesky undertaking that sank the post-menopausal hormone replacement enterprise. The data for the study were derived from the 3 randomized controlled trials (RCT) of HRT, diet and calcium and vitamin D, as well as from the observational component of the WHI. So, is it observational or interventional? The answer to this is confusing to the point of pulling the wool over even experienced clinicians' eyes, as became obvious in my class. To answer the question, we need to go back to definitions of "interventional" and "observational". To qualify as interventional, a study needs to have the intervention be a deliberate part of the study design. A common example of this type of study is the randomized controlled trial, the sine qua non of the drug evaluation and approval process. Here the drug is administered as a part of the study, not as a background of regular treatment. In contradistinction, an observational study is just that: an opportunistic observation of what is happening to a group of people under ordinary circumstances. Here no specific treatment is predetermined by the study design. Given that the above study looked at multivitamin supplementation as the main exposure, despite its utilization of the data from RCTs, the study was observational. So, the moral of this tale is to be vigilant and examine the design carefully and thoroughly.

We often hear that observational designs are well suited to hypothesis generation only. Well, this is both true and false. Some studies actually can test hypotheses, while others are relegated to generation only. For example, cross-sectional and ecological studies are well suited to generating hypotheses to be tested by another design. To take a recent controversy as an example, the debunked link between vaccinations and autism initially gained steam from the observation that as the vaccination rates were rising, so was the incidence of autism. The type of a study that shows two events changing at the group/population level either in the same or in the opposite direction is called "ecologic". Similar types of studies gave rise to the vitamin D and cancer association hypothesis, showing geographic variation in cancer rates based on the availability of sun exposure. But, as demonstrated well by the vaccine-autism debacle, running with the links from ecological studies is dangerous, as they are prone to a so-called "ecological fallacy". It occurs when, despite the finding in groups of a linked change of the two factors under investigation, there is absolutely no connection between them at the individual level. So, don't let anyone tell you that they tested an hypothesis in an ecological study!
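Here is a toy illustration of the ecological fallacy, using counts invented purely for this purpose (three hypothetical regions, none of it from a real study). Across regions, exposure prevalence and outcome rate rise together perfectly, yet within every region exposed and unexposed individuals carry exactly the same risk.

```python
from math import sqrt

# Hypothetical counts, invented for illustration only.
# Per region: (exposed cases, exposed total, unexposed cases, unexposed total).
regions = {
    "Region A": (20, 200, 80, 800),
    "Region B": (100, 500, 100, 500),
    "Region C": (240, 800, 60, 200),
}


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


exposure_prevalence = []  # group-level exposure: share of each region exposed
outcome_rate = []         # group-level outcome: share of each region with the outcome
risk_ratios = {}          # individual-level comparison within each region

for name, (exp_cases, exp_n, unexp_cases, unexp_n) in regions.items():
    total = exp_n + unexp_n
    exposure_prevalence.append(exp_n / total)
    outcome_rate.append((exp_cases + unexp_cases) / total)
    risk_ratios[name] = (exp_cases / exp_n) / (unexp_cases / unexp_n)

group_level_r = pearson(exposure_prevalence, outcome_rate)

print(f"correlation across regions: {group_level_r:.2f}")
for name, rr in risk_ratios.items():
    print(f"{name}: risk ratio, exposed vs. unexposed = {rr:.2f}")
```

The group-level picture looks like a strong link, while the within-region risk ratios show no association at all among individuals. Something else that differs between the regions is driving both numbers.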

Similarly in cross-sectional studies an hypothesis cannot be tested, and, therefore, causation cannot be "proven". This is due to the fundamental property of "a snapshot in time" that defines a cross sectional study. Since all events (with few minor exceptions) happen at the same time, it is not possible to assign causation to the exposure-outcome couplet. These studies can merely help us think of further questions to test.

So, to connect the design back to the question, if a study purports to "explore a link between exposure X and outcome Y", either an ecologic or a cross-sectional design is OK. On the other hand, if you see one of these designs used to "test the hypothesis that exposure X causes outcome Y", run the other way screaming.

We will stop here for now, and in the next post will continue our discussion of study designs. Not sure yet if we can finish it in one more post, or if it will require multiple postings. Start praying to the goddess of conciseness now!


Reviewing medical literature, part 1: The study question

Let's start at the beginning. Why do we do research and write papers? No, not just to get famous, tenured or funded. The fundamental task of science is to answer questions. The big questions of all time get broken down into infinitesimally small chunks that can be answered with experimental or observational scientific methods. These answers integrated together provide the model for life as we understand it.

Clearly, the question is the most important part of the equation, and this is why in my semester-long graduate epidemiology course on the evaluative sciences we spend fully the first four to five weeks talking about how to develop a valid and answerable question. The cornerstone of this validity is its importance. Hence, the first question that we pose is: Is the study question important?

This is a bit of a loaded question, though. Important to whom? How is "important" defined? This is somewhat subjective, yet needs to be scrutinized nevertheless. In the context of an individual patient, the question may become: Is the study question important to me? So, importance is dependent on perspective. Nevertheless, there are questions upon whose importance we can all agree. For example, the importance of the question of whether our current fast-food life style promotes obesity and diabetes is hard to dispute.

Regardless of how we feel about the importance of the question, we must first identify the said research question. At least some of the time you will be able to find it in the primary paper, buried in the last paragraph of the Introduction section. Most of the questions we ask relate to etiologic relationships ("etiology" is medicalese for "causation"). Now, you have heard many times that an observational study cannot answer a causal question. Yet, why do we bother with the time, energy and money needed to run observational studies? Without getting too much into the weeds, philosophers of science tell us that no single study design can give us unequivocal evidence of causality. We can merely come close to it. What does this mean in practical terms? It means that, although most observational studies are still interested in causality rather than a mere association, we have to be more circumspect in how we interpret the results from such studies than from interventional ones. But I am jumping ahead.

Once we have identified and established the importance of the question, we need to evaluate its quality. A question of high quality is 1). clear, 2). specific, and 3). answerable. The question that I posed above regarding fast food and obesity possesses none of these characteristics. It is too broad and open to interpretation. If I were really posing a question in this vein, I would choose a single well defined exposure (consuming 3 cans of soda per day) influencing a single outcome (10% body weight gain) over a specific period of time (over 30 weeks). While this is a much narrower question than the one I proposed above, it is only by answering bundles of such narrow questions and putting the information together that we can arrive at the big picture.

A general principle that I like to teach to my students is the PICO or PECOT model (I did not come up with it, but am its avid user). In PICO, P=population, I=intervention or exposure, C=comparator, and O=outcome. The PECOT model is an adaptation of the PICO for observations over time, resulting in P=population, E=exposure, C=comparator, O=outcome, T=time. These models can help not only to pose the question, but also to unravel the often mysterious and far from transparent intent of the investigators.
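For readers who think in checklists, the PECOT elements can be sketched as a small record that flags whatever a question leaves unstated. The class and method names here are made up for illustration. The example fills in only the exposure, outcome and time from the soda question earlier in this post, and deliberately leaves the population and comparator empty because that example did not spell them out.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class PecotQuestion:
    """One study question broken into its PECOT elements (hypothetical helper)."""
    population: Optional[str]
    exposure: Optional[str]
    comparator: Optional[str]
    outcome: Optional[str]
    time: Optional[str]

    def missing_elements(self) -> List[str]:
        """Names of the PECOT elements that have not been specified."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Exposure, outcome and time come from the soda example in this post;
# population and comparator were not stated there, so they stay empty.
soda_question = PecotQuestion(
    population=None,
    exposure="consuming 3 cans of soda per day",
    comparator=None,
    outcome="10% body weight gain",
    time="over 30 weeks",
)

print("Still to be specified:", soda_question.missing_elements())
```

Running the same check against the question you extract from a paper is a quick way to spot which element the investigators left vague.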

Once you have identified the question and dealt with its importance, you are ready to move on to the next step: evaluating the study design as it relates to the question at hand. We will discuss this in the next post.

Series launch: Critical review of medical literature

Today I am launching a series of posts on how to read medical literature critically. The series should provide a solid foundation for this task and dove-tail nicely with some of the more dense methods themes that occur on this blog. Who should read the series? Everyone. Although the current model of dissemination of medical information relies on a layer of translators (journalists and clinicians), it is my belief that every educated patient must at the very least understand how these interpreters of medical knowledge (should) examine it to arrive at the information imparted to the public. At the same time, both journalists and clinicians may benefit from this refresher. Finally, my own pet project is to get to a better place with peer reviews -- you know how variable the quality of those can be from my previous posts. So, I particularly encourage new peer reviewers for clinical journals to read this series.  

First, a conflict of interest statement. What comes first -- the chicken or the egg? What comes first -- expertise in something or a company hiring you to develop a product? Well, in my case I would like to think that it was the expertise that came first and that Pfizer asked me to develop this content based on what I know, not on the fact that they funded the effort. At any rate, this is my disclaimer: I developed this presentation about three years ago with (modest) funding from Pfizer, and they had it on a web site intended for physician access. Does this mere fact invalidate what I have to say? I don't think so, but you be the judge.

Roughly, the series will examine how to evaluate the following components of any study:
1. Study question
2. Study design
3. Study analyses
4. Study results
5. Study reporting
6. Study conclusions
I am not trying to give you a comprehensive course on how all of this is done, but merely make the reader aware of what entails a critical review of a paper.

Look for the first installment of the series shortly.

Thursday, January 6, 2011

National Healthcare Expenditures, 2009 (In pictures)

Well, it's that time of the year again: CMS has given us the accounting of our National Healthcare Expenditures (NHE) in a paper published in Health Affairs. I am sure you have already heard that the spending only went up by 4% this year over last, an all-time low.

At the same time, we have achieved the highest ever NHE as a proportion of the GDP (17.6%) and as expenditures per capita ($8,086). But the GDP proportion is a somewhat deceptive number on the one hand, as the GDP has suffered a substantial drop from its 2008 value of $14.4 trillion to $14.1 trillion in 2009. On the other hand, this implies that healthcare is eating into the rest of our expenditures on life. At the same time the per capita expenditures have continued their relentless rise.
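A quick back-of-envelope check shows how much of the rise in the GDP share comes from the shrinking denominator. It uses only the rounded figures quoted above, so the derived total NHE and the "flat GDP" share are approximations for illustration, not numbers reported in the Health Affairs paper.

```python
# Rounded figures as quoted in this post; results are back-of-envelope only.
GDP_2008_TRILLIONS = 14.4
GDP_2009_TRILLIONS = 14.1
NHE_SHARE_OF_GDP_2009 = 0.176
NHE_PER_CAPITA_2009 = 8086  # dollars

# Total NHE implied by the quoted share and the 2009 GDP.
nhe_2009_trillions = NHE_SHARE_OF_GDP_2009 * GDP_2009_TRILLIONS

# Hypothetical: the share the same spending would represent had GDP not dropped.
share_if_gdp_flat = nhe_2009_trillions / GDP_2008_TRILLIONS

# Population implied by the total and per capita figures (a consistency check).
implied_population_millions = nhe_2009_trillions * 1e12 / NHE_PER_CAPITA_2009 / 1e6

print(f"implied total NHE, 2009: ${nhe_2009_trillions:.2f} trillion")
print(f"share of GDP had GDP stayed at its 2008 level: {share_if_gdp_flat:.1%}")
print(f"implied population: {implied_population_millions:.0f} million")
```

The gap between the quoted share and the hypothetical flat-GDP share is the part of the increase attributable to the drop in GDP rather than to spending growth.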

Let us look at the components of the NHE individually and see what they can tell us.

As usual, the bulk of the expenditures went to personal health care (85%). Public health got a measly 3% of the total NHE, and this continues to be one of our gravest misappropriations. You may recall that about a year ago I did a post where I cited some startling statistics about some broad categories of causes of premature death in the US. Access to medical care accounted for a measly 10% of those, and the rest were attributable to behavior, genetics, environment and social factors. So, while, by inference, fixing medicine may impact 10% of these premature deaths, in reality 97% of the entire NHE goes to medicine rather than to potentially more impactful public health interventions. And the real travesty is that, despite these astronomical expenditures, we are still losing 1,000 lives per day to our broken healthcare system.

Looking a bit more closely at the "personal health" category, we see that, just as in years past, hospital costs and professional services comprise the bulk of this spending.

The "professional services" category, 81% of which is physician and other clinical services, is a bit murky. Yet, without too many leaps of faith we can say that if this expenditure buys us better preventive care, it may be a cost-effective area. At the same time we know that we can make this area a lot more efficient by streamlining and realigning incentives to promote better health rather than more care. Hospital expenditures, on the other hand, are a juggernaut that without a doubt requires containing. It is very likely that exchanging our inflated personal healthcare budgets for well placed public health funding, along with reimbursement reform and improved end-of-life decisions, could substantially alter this category of spending.

One final data point that interested me was the breakdown of what are considered investments in the healthcare system. This broadly includes government-funded research and allocations for structures and equipment. Now, I am not sure what "structures and equipment" means, so, if any of my readers know, please, enlighten me. I do know what "research" means, however, and am rather disappointed about this breakdown. What I do not understand is, given that structures and equipment should have some kind of a half-life and not be replaced annually, how it is that this budget also grows consistently year-over-year at a steady rate? Would love to get more details on this.

To be sure, the total research expenditure of $45 billion is nothing to sneeze at. The big question is, however, are we spending it on the right research. I am not at all sure that the answer is yes, given that we still struggle with the same issues at the bedside that we have been struggling with for over a decade. But more on this later.