The Everydata Interview: Evan Starr

For this week's interview, we are keeping it local. Evan Starr is an assistant professor in the Management and Organization Department at the Smith School of Business at the University of Maryland. Evan's work is of real interest to me as he is breaking new ground in an area that I study, labor economics. In his own words...

How do you use data in your work?

As an applied microeconomist, data is the most fundamental element of my work. In much of my work on the use and implications of covenants not to compete (or simply "noncompetes"), there was no existing data. Thus my first goal was to collect information on both the use of noncompetes and outcomes that might be associated with noncompetes. It turns out it is harder than one might think to answer even a simple question like "what proportion of employees in the U.S. sign noncompetes?" Individuals might not know if they have signed, and even those who report signing may recall incorrectly. Furthermore, it is difficult to gauge the extent to which a sample of respondents is representative of the 150 million U.S. labor force participants. Once I had collected the data, I pursued two primary objectives. First, I simply described the use of noncompetes in the labor force and how it varies by worker, firm, and regional characteristics. Second, I used the data to help disentangle equally plausible competing theories about the impacts of noncompetes, which had been previously untested. This second step involves thinking about the mechanisms linking noncompetes to outcomes and then testing the extent to which the data can isolate these mechanisms.

What is the most common mistake you see in terms of people misrepresenting or misinterpreting data?

The most common mistakes I see in terms of misrepresenting or misinterpreting data are making generalizations from a sample that is not reflective of the population of interest, and making causal claims without proper consideration of other variables that might fully explain the proposed causal relationship.

To illustrate the first point regarding sample selection, suppose you were an HR manager and your boss asked you to improve retention. You respond naturally by surveying all the current employees about their satisfaction in the workplace, using the information from the survey to generate new methods to enhance retention. The fundamental problem with this approach is that the sample on which these predictions are based is not representative of all the employees who have worked at the firm. In particular, the data represent only individuals who have chosen to stay at the firm, completely missing the group of individuals who have left the firm, who are the most important part of your study. As a result, the suggestions made from the data are unlikely to be helpful for improving retention, since they do not contain any information on employees who actually left. The lesson is that it is crucially important to recognize whether the sample underlying the data is reflective of the population we are interested in studying.
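The HR example can be sketched as a small simulation. Everything in it is a made-up assumption for illustration (the uniform satisfaction scores, the rule linking satisfaction to leaving, and the name `simulate_survey`); it is not drawn from any real survey. It only shows how a survey of current employees can overstate satisfaction among everyone who has ever worked at the firm.

```python
import random

def simulate_survey(n_employees=10_000, seed=0):
    """Hypothetical firm: compare everyone ever employed with those who stayed."""
    rng = random.Random(seed)
    everyone, stayers = [], []
    for _ in range(n_employees):
        # Assumption: satisfaction is uniform on a 0-10 scale.
        satisfaction = rng.uniform(0, 10)
        # Assumption: the chance of leaving falls linearly with satisfaction.
        p_leave = 1 - satisfaction / 10
        everyone.append(satisfaction)
        if rng.random() >= p_leave:  # this employee stayed and can be surveyed
            stayers.append(satisfaction)
    mean_all = sum(everyone) / len(everyone)
    mean_stayers = sum(stayers) / len(stayers)
    return mean_all, mean_stayers

mean_all, mean_stayers = simulate_survey()
print(f"mean satisfaction, everyone ever employed: {mean_all:.2f}")
print(f"mean satisfaction, current employees only: {mean_stayers:.2f}")
```

Under these assumptions the current-employee average comes out well above the average for everyone, because the least satisfied people have already left and never answer the survey.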

The second mistake I see made all the time is making causal claims about the relationship between X and Y, when a third factor could fully explain the relationship between X and Y. For example, in my work there is a robust correlation between signing a noncompete agreement and staying longer with your employer. Does this imply that noncompetes cause employees to stay longer? While the notion that noncompetes lock people into their jobs makes one think that such a causal effect is likely, there are many other variables that may cause firms to use noncompetes and simultaneously make workers stay longer. For example, high-status firms such as Google and Apple are likely to be targets for poaching. As a result, Google and Apple are likely to use noncompetes to protect themselves. Because of their status, however, employees are highly likely to want to stay working with Google and Apple. Hence simply observing the fact that noncompetes are associated with longer tenures does not imply a causal relationship, since firm status may be the true driver of this relationship. In order to tease out whether firm status is important, one would want to compare the difference in tenure related to signing a noncompete (for observationally equivalent workers) across firms of the same status.

Economists call this phenomenon omitted variable bias, and it rears its head in almost every setting. In health, for example, studies often attribute poor health outcomes to one behavior, such as drinking, smoking, or not brushing your teeth. But this attribution is often misguided because it is difficult to account for the fact that those who drink, smoke, or don't brush their teeth are also likely to participate in other behaviors that might lead to poor health outcomes. Hence it is very challenging to isolate the link between an outcome and one behavior when many other behaviors may explain the link as well.
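The noncompete example can be illustrated with a toy simulation. All of the numbers here are assumptions chosen for illustration, not estimates from the research described in this interview: high-status firms are assumed to use noncompetes more often, tenure is assumed to depend only on firm status, and the true effect of a noncompete on tenure is set to zero.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def simulate_workers(n=20_000, seed=0):
    """Return hypothetical (high_status, noncompete, tenure_years) records."""
    rng = random.Random(seed)
    workers = []
    for _ in range(n):
        high_status = rng.random() < 0.5
        # Assumption: high-status firms use noncompetes far more often.
        noncompete = rng.random() < (0.8 if high_status else 0.2)
        # Assumption: tenure depends on firm status only (true noncompete effect = 0).
        tenure = rng.gauss(8.0 if high_status else 4.0, 1.0)
        workers.append((high_status, noncompete, tenure))
    return workers

def tenure_gap(workers):
    """Mean tenure of signers minus mean tenure of non-signers."""
    signed = [t for _, nc, t in workers if nc]
    unsigned = [t for _, nc, t in workers if not nc]
    return mean(signed) - mean(unsigned)

workers = simulate_workers()
naive_gap = tenure_gap(workers)
gap_high = tenure_gap([w for w in workers if w[0]])
gap_low = tenure_gap([w for w in workers if not w[0]])

print(f"naive gap, ignoring firm status: {naive_gap:.2f} years")
print(f"gap within high-status firms:    {gap_high:.2f} years")
print(f"gap within low-status firms:     {gap_low:.2f} years")
```

In this sketch the naive comparison shows signers staying noticeably longer, while the gap within each status group is close to zero. That is the pattern one would expect if firm status, rather than the noncompete itself, were driving the correlation.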

How do you think the media affects the ways in which people consume data?

Few media outlets discuss the nitty gritty details of data collection and analysis, yet at the same time they present the "facts" and opine on them. Due to issues like sample selection and omitted variable bias, however, I think the media, perhaps unintentionally, frequently mislead consumers.

For example, John Bohannon, a journalist with a PhD in molecular biology, fooled the media by promoting a study in which he claimed to show that people on a low-carb diet lost weight 10 percent faster if they ate a chocolate bar every day. He writes, "We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded."

If he ran the experiment, why then are the results meaningless? Bohannon writes, "Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. That study design is a recipe for false positives."
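A back-of-the-envelope calculation shows why 18 measurements invite false positives. This sketch assumes each measurement is tested independently at the conventional 0.05 significance level and that no true effect exists; real measurements such as weight and cholesterol are correlated, so the figure is a rough guide rather than the study's actual error rate. The function name is hypothetical.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests is 'significant' by luck alone."""
    return 1 - (1 - alpha) ** n_tests

p = prob_at_least_one_false_positive(18)
print(f"chance of at least one false positive across 18 tests: {p:.2f}")
```

Under these assumptions the chance is about 0.60, so a study measuring 18 outcomes is more likely than not to turn up at least one "significant" result even when nothing real is going on.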

By not digging into the analysis and instead focusing on the headline they knew would get people's attention, the media was complicit in misleading millions of media consumers regarding the benefits of chocolate (don't we wish it were true!). If the media reported such stories more accurately -- for example, by reporting the number of people studied, the number of outcomes, etc. -- then consumers would have more information from which to judge the validity of the claims.

Why should people care about understanding data? What are the consequences?

Understanding data is crucial for all citizens, and the consequences of misreading it can be severe. In the case of Abraham Wald, a Jewish mathematician who was responsible for fixing damaged Allied warplanes in WWII, interpreting the data correctly meant life or death for many pilots. Planes that returned from battle for servicing exhibited significant damage via bullet holes in the wings, nose, and tail. Since the new protective armor was heavy, Wald had to be judicious in where he chose to allocate the extra protection. To the surprise of many, Wald chose the cockpit. He argued that being shot in the wing, tail, or nose allowed the plane to return home safely, but that none of the planes that returned had been shot in the cockpit. He argued that since he only had a chance to fix planes that returned (as opposed to all the planes, as in the HR manager's problem above), getting shot in the cockpit must mean sure destruction. Many claim that his choice saved the lives of numerous pilots, though proving this would be very difficult.

Most of us won't be making life-and-death decisions like Abraham Wald's, but we do make everyday decisions, and this year we will elect a new President and lawmakers. The policies of lawmakers leave empirical trails unassailable by political spin, and exacting minds carefully examining the data, aware of omitted factors and selection into the underlying samples, can often distinguish between valuable signals and the noise proliferated by the media and the candidates themselves.

What is one thing people could do to become a better consumer of data in their everyday lives?

Be critical of numbers, and especially proposed causal relationships. Where do numbers come from? Which population do they reflect? Are there any other variables that might confound the proposed causal relationship between X and Y?

 

Please note that guest interviews are informational only, and do not necessarily represent the views of John H. Johnson, PhD.