Are students biased against female professors because they are, well, female?
Many academics hold as near gospel the claim that student evaluations are systematically gender biased. However, more recent evidence suggests we should pump the brakes a bit on that conclusion.
As it turns out, convincingly answering whether student evaluations are gender biased is tricky, for two main reasons.
First, students usually aren’t randomly assigned to instructors. Different types of students may select into classes based on an instructor’s gender.
Second, instructors of different genders may teach differently. As a result, differences in student evaluations could reflect teaching style rather than gender per se.
One might address the first problem by randomly assigning students to classes. Studies such as Boring (2017) and Mengel et al. (2019) do exactly this. But these studies do not fully address the second problem: they cannot cleanly separate instructor gender from instructor behavior.
To deal with both issues simultaneously, a widely cited 2015 paper by MacNell, Driscoll, and Hunt used an online course as a laboratory.
In their study, two assistant instructors (one male, one female) each taught two discussion sections: one under their real identity and one under the other instructor’s gendered name and identity. As a result, students evaluated the same instructor under different perceived genders.
Because the course was fully online, all interaction occurred via discussion boards and email. This design feature is crucial, as it strips away in-person cues—such as appearance, voice, body language, and physical presence—that are tightly bundled with gender in face-to-face classroom settings.
MacNell et al. found that students rated the male identity higher than the female identity. In other words, perceived gender appeared to drive student evaluations.
Given how consequential student evaluations are for promotion and tenure, the paper received enormous attention. It has been cited almost 1,300 times, was widely covered in the media, and was discussed at length on social media.
In short, the MacNell et al. study helped cement a now-common belief in academia: that student evaluations are inherently gender biased.
But the MacNell et al. study has important limitations. Most notably, the experiment involved only 43 students. With samples this small, estimates are noisy and vulnerable to Type S (sign) and Type M (magnitude) errors: an estimate that happens to reach statistical significance is likely to exaggerate the true effect, and it may even point in the wrong direction.
Despite this, the result was widely generalized.
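To make the Type S and Type M concern concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration, not a number from MacNell et al.: a true effect of 0.2 standard deviations, ratings with a standard deviation of 1, a simple two-group comparison with about 21 students per condition (roughly 43 split in two), and a hypothetical large study with 2,000 per condition. The function name is my own.

```python
import math
import random

def simulate_sign_and_magnitude_errors(true_effect, sd, n_per_group,
                                       n_sims=20000, seed=0):
    """Estimate power, Type S rate, and Type M ratio by simulation.

    Assumes a two-group comparison of mean ratings with equal group sizes,
    a normal approximation for the estimated difference, and a nonzero
    true effect. All inputs are hypothetical.
    """
    rng = random.Random(seed)
    se = sd * math.sqrt(2.0 / n_per_group)  # std. error of the difference in means
    threshold = 1.96 * se                   # two-sided 5% significance cutoff
    significant = []
    for _ in range(n_sims):
        estimate = rng.gauss(true_effect, se)  # one hypothetical study's estimate
        if abs(estimate) > threshold:
            significant.append(estimate)
    if not significant:
        return {"power": 0.0, "type_s": float("nan"), "type_m": float("nan")}
    wrong_sign = sum(1 for e in significant if e * true_effect < 0)
    mean_abs = sum(abs(e) for e in significant) / len(significant)
    return {
        "power": len(significant) / n_sims,        # share of studies reaching significance
        "type_s": wrong_sign / len(significant),   # share of significant results with wrong sign
        "type_m": mean_abs / abs(true_effect),     # average exaggeration among significant results
    }

# Assumed scenarios: same true effect, very different sample sizes.
small = simulate_sign_and_magnitude_errors(0.2, 1.0, 21)
large = simulate_sign_and_magnitude_errors(0.2, 1.0, 2000)
print("small study:", small)
print("large study:", large)
```

Under these assumptions, the significance cutoff in the small study (1.96 times the standard error, about 0.60) is already three times the assumed true effect of 0.2. So any small-sample estimate that reaches significance must overstate the effect at least threefold. In the large study, the significant estimates sit close to the true value.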
Fast forward to this year. A team of economists led by Andersson conducted a much larger, tightly controlled double-blind field experiment in a university setting. Crucially, neither students nor instructors knew that gender was being experimentally manipulated.
In contrast to MacNell et al., they find no evidence of bias against female instructors in student evaluations. Their study is much better powered, and the resulting confidence intervals are tight enough to rule out effect sizes similar to those reported in earlier influential work.
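The logic of "ruling out" an effect size can be sketched in a few lines. This is an illustration under assumed numbers, not a reproduction of the new study: I assume a point estimate of zero, ratings with a standard deviation of 1, a benchmark effect of 0.5 standard deviations standing in for an earlier finding, and hypothetical sample sizes of 21 and 1,000 per condition. The helper names are mine.

```python
import math

def ci_for_difference(estimate, sd, n_per_group, z=1.96):
    """95% confidence interval for a difference in means (normal approximation)."""
    se = sd * math.sqrt(2.0 / n_per_group)  # std. error with equal group sizes
    return (estimate - z * se, estimate + z * se)

def rules_out(ci, effect):
    """True if the given effect size lies outside the confidence interval."""
    low, high = ci
    return effect < low or effect > high

benchmark = 0.5  # hypothetical effect size from earlier work, in SD units

# Same assumed point estimate (zero), different hypothetical sample sizes.
small_ci = ci_for_difference(0.0, 1.0, 21)
large_ci = ci_for_difference(0.0, 1.0, 1000)

print("small study CI:", small_ci, "rules out benchmark:", rules_out(small_ci, benchmark))
print("large study CI:", large_ci, "rules out benchmark:", rules_out(large_ci, benchmark))
```

The point of the sketch is that a null result means little on its own. With about 21 students per condition, an estimate of zero is still compatible with the benchmark effect. With 1,000 per condition, the interval is narrow enough to exclude it.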
The takeaway, in my view, is not that gender bias cannot exist in student evaluations. Rather, it is that the magnitude and generality of such bias were likely overstated based on early, underpowered evidence. As with many findings that become academic "common knowledge," stronger designs with larger samples paint a more nuanced picture.
Student evaluations may be gender biased in some contexts—but not in all.
What do you make of these new findings?