Justin Esarey (Wake Forest) & Natalie Valdes (Wake Forest), Unbiased, Reliable, and Valid Student Evaluations Can Still Be Unfair:
Scholarly debate about student evaluations of teaching (SETs) often focuses on whether SETs are valid, reliable, and unbiased. In this article, we assume the most optimistic conditions for SETs that are supported by the empirical literature. Specifically, we assume that SETs are moderately correlated with teaching quality (student learning and instructional best practices), highly reliable, and do not systematically discriminate on any instructionally irrelevant basis. We use computational simulation to show that, under these ideal circumstances, even careful and judicious use of SETs to assess faculty can produce an unacceptably high error rate: (a) a large difference in SET scores fails to reliably identify the best teacher in a pairwise comparison, and (b) more than a quarter of faculty with evaluations at or below the 20th percentile are above the median in instructional quality. These problems are attributable to imprecision in the relationship between SETs and instructor quality, which persists even when the two are moderately correlated. Our simulation indicates that evaluating instruction using multiple imperfect measures, including but not limited to SETs, can produce a fairer and more useful result than using SETs alone.
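The core of the argument can be illustrated with a minimal simulation sketch. The version below is not the authors' code: it assumes an illustrative correlation of r = 0.4 between true teaching quality and SET scores (a stand-in for "moderately correlated"; the paper's exact parameters are not reproduced here) and checks what fraction of instructors in the bottom quintile of SET scores are nonetheless above the median in true quality.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000   # simulated instructors
r = 0.4       # assumed SET-quality correlation (illustrative, not the paper's figure)

# True instructional quality and SET scores as correlated standard normals.
quality = rng.standard_normal(n)
set_score = r * quality + np.sqrt(1 - r**2) * rng.standard_normal(n)

# Instructors whose SETs fall at or below the 20th percentile...
low_set = set_score <= np.quantile(set_score, 0.20)

# ...but whose true quality is above the median.
misclassified = np.mean(quality[low_set] > np.median(quality))
print(f"bottom-quintile SETs but above-median quality: {misclassified:.1%}")
```

Under these assumptions the misclassification rate comes out near the "more than a quarter" figure the abstract reports, which is simply a consequence of how much scatter a moderate correlation leaves between the two variables.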
Inside Higher Ed, Even ‘Valid’ Student Evaluations Are ‘Unfair’:
Student evaluations of teaching reflect students’ biases and are otherwise unreliable. So goes much of the criticism of these evaluations, or SETs. Increasingly, research backs up both of those concerns.
On the other side of the debate, SET proponents acknowledge that these evaluations are imperfect indicators of teaching quality. Still, they argue that well-designed SETs tell us something valuable about students’ learning experiences with a given professor.
A new study — which one expert called a possible “game-changer” — seeks to cut through the noise by assuming the best of SETs, at least to the extent supported by the existing literature. Its analysis assumes that the scores students give instructors are moderately correlated with student learning and the use of pedagogical best practices. It assumes that SETs are highly reliable, meaning that professors consistently get the same ratings. And it assumes that SETs do not systematically discriminate against instructors on the basis of instructionally irrelevant criteria such as gender, class size, or the type of course being taught.
And even when stacking the deck for SETs, the study finds that these evaluations are deeply flawed measures of teaching quality.