(In)Convenience Samples

SHORT: Convenience samples are a bad idea because they put us at great risk of making flawed decisions based on flawed data. They do nothing to counter self-selection bias in our respondents, and they take away our ability to make full use of statistical tools like margin of error and significance testing. The only thing convenience sampling can definitively tell us is whether a behavior or opinion is present; it cannot speak to absence. Convenience sampling can also sometimes tell us a) whether something is likely very rare or very common (when we get extreme rates such as 90% or 10%) and b) a rough, qualitative sense of how things rank (when one is far more common than another, e.g. 60% vs. 20%). Even these takeaways require scrutiny, as respondent bias or other flaws in sampling methods can have a tremendous impact on results.


" the real danger of collecting data in this manner is that we are always going to be tempted to use it "

As a researcher, I have no love of convenience sampling; but as a consumer of research, I sometimes have to accept it. This post discusses the core limitations of convenience sampling and the most I feel I can safely take away from the results of such surveys.

But first, a little philosophizing about truth. What is 'truth'? Some say that truth is whatever you make it, and there is some validity to that line of thinking, especially when diving into the realm of human experience. However, in the field of research, 'truth' is a little bit more concrete. Some aspect of the world exists in some state, and we want to measure it. The extent to which we do a good job measuring that state is the extent to which we come close to estimating (social science for 'guessing') what the 'TRUTH' actually is.

Enter the Gold Standard for TRUTH-Seeking in the Social Sciences: Randomization. As you may already know, we social scientists tend to love randomization in all its forms. Random assignment for control trials, random samples for surveys (sometimes with multiple layers of randomization), and even randomization of response options or questions within a survey instrument. We love it because it does lots of good things for us (that will be another post) and helps us to get closer to 'TRUTH'.

But, as you may also know, randomized methods can be expensive and not randomizing, well, is just so much easier! Thus the appeal of the (in)convenient 'convenience sample'. Convenience samples come up when we want to collect information from a group of folks about some aspect of their lives without going to the trouble of trying to randomly engage participants. We primarily see it used in survey research and that is the context in which I will be discussing it today. I will also be using a specific example to illustrate my concerns, which you can view here: https://stateofevaluation.org/

The State of Evaluation is a fabulous concept for research. They sent a survey to thousands of non-profits across the country, asked them about their evaluation practices, and have done this 3 times in 6 years so that they can track changes over time. All good stuff. However, the problem I discovered after reading the report was this: "We sent an invitation to participate in the survey to one unique email address for each of the 37,440 organizations in the dataset" (pg. 22). This is a convenience sample. While they did receive 1,125 responses, the number of responses has no bearing on the quality of the data they collected because there was no randomization, and the convenience sample renders all of their numbers (and there are a lot of them) completely meaningless.

Why is this? What's the trouble here? The chief problem, and the core reason why we do randomization, is self-selection bias. Always in the world there will be some people who want to take your survey more than others. When we draw a non-random, convenience sample, we maximize the effect of this bias on our results.

You might imagine, in this example, that organizations who do evaluation in some form, or who value it more, will be more likely to respond than organizations for which evaluation is not on the radar or which are hostile to it. As a consequence, organizations that are pro-evaluation will respond more frequently (and may give more extreme responses) than organizations that are neutral toward evaluation. Likewise, organizations that are hostile to evaluation may respond less frequently than neutral ones. The likely end result is that the final estimate of what proportion of organizations conduct any given set of practices is biased in favor of evaluation. Put another way, the rates at which the various 'good' behaviors measured in this survey are practiced will be inflated, painting a rosier picture than reality.

" self-selection bias: always in the world there will be some people who want to take your survey more than others "

However, the bias may not be even across all of the different practices. Some practices may end up being more strongly biased because they are also strongly correlated with those organizations that are pro-evaluation. As a consequence, not only is the data potentially painting a rosier picture than what ‘TRUTH’ is, but that picture is also likely distorted in terms of how frequently different things occur in relation to each other. And here is the real kicker: we have no way of knowing how badly the data was affected by respondent biases, which direction the biases actually went in, or how distorted the ‘image’ actually is.

Randomization does not solve all of these problems; the potential for bias and distortion is still there. However, what randomization does give us is a statistical toolkit for measuring differences in things and making reliable guesses as to what 'TRUTH' actually is, along with pretty strong reassurance that we did not introduce significant bias. Two of the tools randomization gives us are significance testing and margin of error. With randomization, we can better answer the question, "Are these two things occurring at different rates?" (significance testing). We can also say how reliable our estimate actually is: "It's most likely between 78% and 86%" (margin of error).

The margin of error (built on the standard error) lets us see how precise our guess is and roughly where 'TRUTH' lies, with a stated degree of confidence (e.g. 95% confident). Significance testing helps us see whether two numbers are likely different or still too close to safely claim a difference.
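To make those two tools concrete, here is a minimal sketch in Python. The sample size of 350 is a made-up number, chosen only so the interval roughly matches the 78%-to-86% example above; nothing here comes from the report itself.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """95% margin of error for a sample proportion (normal approximation)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided z-test for the difference between two sample proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF (via math.erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical random sample: 350 respondents, 82% answered "yes"
moe = margin_of_error(0.82, 350)
print(f"82% +/- {moe:.1%}")  # prints "82% +/- 4.0%", i.e. roughly 78% to 86%
```

Both formulas assume a simple random sample; that assumption is exactly what a convenience sample breaks, which is why these numbers cannot honestly be computed from one.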

So does the State of Evaluation report do anything that its convenience sample actually forbids it from doing? Unfortunately, yes. A lot. First, it claims that the rates observed in the dataset translate to the non-profit sector as a whole. Without randomization, they have no basis for this assumption, since they made no effort to counter respondents' tendency toward bias. Second, they claim that evaluation usage rates went up from 2012 to 2016. However, because of their convenience sample, they don't actually have the luxury of error bars and cannot test for a significant difference between 2012 and 2016. It is plausible that the 'TRUTH' is that the actual rate for both years is 89% and that, due to chance, they got 90% and 92% in 2012 and 2016, respectively.†
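Had the samples been random, that chance explanation could be tested directly. A sketch, hypothetically assuming about 1,125 responses in each wave (1,125 is the 2016 response count; the 2012 count is not given in this post):

```python
import math

# Hypothetical: ~1,125 responses in each wave
n1 = n2 = 1125
p1, p2 = 0.90, 0.92  # reported rates for 2012 and 2016

pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
print(f"z = {z:.2f}")  # prints "z = 1.66", below the 1.96 cutoff for 95% confidence
```

Even under the most charitable (and false) assumption that these were random samples, a 90%-to-92% shift at these sample sizes would not be statistically significant at the conventional 95% level.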

In fact, every point at which they list a specific number or rate in their report is a step across the boundary of what they can actually do while still being honest and true to the limitations of their research methods.

But, being a researcher who needs data on the relationship non-profits have with evaluation, and theirs being the only current data on the subject, I have no choice but to use their report. So how do I use it, then? Convenience samples can answer a few questions. First, they can confirm the presence of a behavior or opinion (but not the absence of one!).

" we have no way of knowing how badly the data was affected by respondent biases "

Second, they can give a very rough, qualitative ranking of behaviors or opinions: what is most common and what is least? This only applies, however, to things that occur at very different rates. For example, we might comfortably conclude from convenience data that something that occurred 66% of the time in the sample really is more frequent than something that occurred 33% of the time. Make that 45% and 55%, however, and all bets are off.

Similarly, we can get a vague assessment of how common or rare something is. For example, according to the SoE report, 92% of respondents reported doing some form of evaluation, so evaluation is likely very common. Likewise, they report that "only 12% of organizations spent 5% or more of their organization budget on evaluation," so we can reasonably conclude that that behavior is rare, even if the number "12%" itself has no value.

When should your organization use convenience samples? When your question is one of the three below. However, the real danger of collecting data in this manner is that we are always going to be tempted to use it. When we start writing the report, we will want to put numbers in it. If we do, people who do not know better are going to be quoting those numbers as fact and law.

You yourself may be pulled into thinking "roughly 1/4 of my constituents feel this way" because that was the number that came back in the convenience sample. This may lead you and your board to conclude that something is a problem when, in fact, only 10% of your constituents actually felt that way. Alternatively, you may decide something is not important when actually 'TRUTH' was closer to 50% for that issue. In either case, you will never know which it is unless you do a random sample.

A more likely scenario is that you are trying to assess people's preferences and one group strongly over-responds compared to another group. As a consequence, you get the notion that one preference is far more common than the other (e.g. 60% vs 30%) when, in fact, 'TRUTH' was more like 50% and 40%. Certainly, the one preference was still greater than the other; but not by nearly as much as the convenience sample suggested. If you start making decisions based on the convenience sample data, however, you risk upsetting a far greater number of your constituents than you thought!
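The arithmetic behind that scenario is easy to sketch. All of the population sizes and response rates below are invented for illustration; the point is only that differential eagerness to respond can turn a true 50%-vs-40% split into an observed 60%-vs-30% one.

```python
# Hypothetical population of 10,000 constituents; 'TRUTH' is 50% / 40% / 10%
population = {"prefer_A": 5000, "prefer_B": 4000, "neither": 1000}

# Self-selection: the pro-A camp responds far more eagerly (made-up rates)
response_rate = {"prefer_A": 0.30, "prefer_B": 0.1875, "neither": 0.25}

responses = {g: int(n * response_rate[g]) for g, n in population.items()}
total = sum(responses.values())

for g, n in responses.items():
    print(f"{g}: {n / total:.0%}")  # observed shares: 60% / 30% / 10%
```

Note that nothing random happened here at all: the distortion is pure arithmetic, which is why no amount of additional convenience responses fixes it.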

1. Does anybody, even just one person, do, think, or feel something? e.g. 5% of respondents said they need a vegetarian meal option, so we know at least some folks need a vegetarian option. But just because nobody said they needed a vegan option does not mean that there are no vegans in the program population.

2. Is something obviously really common or obviously really rare? e.g. 90% or more is likely very common, 10% or less is likely very rare, but we don't know the actual rates or even how biased our rates may be.

3. Is one thing clearly more common than another? e.g. 60% of people preferred apples while 20% of people preferred bananas in the convenience sample -> most likely more people prefer apples to bananas, but we don't know how many more.

†In truth, their sample sizes were too small to claim a 2% change was a significant improvement. To distinguish a 2% difference between two samples with 95% confidence, each sample would need around 3,100 responses, and that assumes everything else is perfect.
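For the curious, the footnote's figure can be reproduced approximately with the standard two-proportion sample-size formula. The 80% power level is my own assumption (the footnote does not state one); with it, the formula lands near 3,200 per group, in the same ballpark as the figure above.

```python
import math

def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Sample size per group to detect p1 vs. p2 with a two-proportion test
    at 95% confidence and ~80% power (z_beta = 0.84)."""
    p_bar = (p1 + p2) / 2
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

print(n_per_group(0.90, 0.92))  # roughly 3,200 responses per wave
```

The required n blows up as the detectable difference shrinks (it scales with 1 over the squared difference), which is why a 2% change demands samples several times larger than the 1,125 responses actually collected.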