Gender Gaps - A Cautionary Tale

James Day, University of British Columbia

In research, we strive to be our own most rigorous critics. At the FFPERPS conference, I presented a cautionary tale about matching claims to evidence, drawn from work my colleagues and I have done at the University of British Columbia (UBC) investigating possible gender differences in student learning in our first-year physics laboratory course.

The talk presented three main messages. First, while all lab students learn, a gender gap is present both at the beginning and end of instruction. Second, valid and interpretable results are elusive. And third, effect sizes are a robust quantitative measure. The second and third messages are readily found in existing peer-reviewed literature, and are addressed more explicitly below. FFPERPS was an opportunity for the community to reflect on and discuss these easily overlooked but critical points.

The talk began with the common assertion that male students outperform females on most physics concept inventories. We wondered whether such a gender gap existed with the relatively new Concise Data Processing Assessment (CDPA),1 developed at UBC, and whether gendered actions in the teaching lab might influence, or be influenced by, such a gap. The CDPA is a ten-question, multiple-choice diagnostic that probes student abilities related to the nature of measurement and uncertainty, and to handling data. To estimate the gap, and its predictors and correlates, we collected student responses before and after instruction. We also observed how students in mixed-gender groups spent their time in the lab.

Analysis of CDPA responses allowed us to make some claims. There is a gender gap on the CDPA, and it persists from the pre-test to the post-test. Furthermore, this gap is as big as, if not bigger than, gaps reported for other instruments. Our observations revealed compelling differences in how students divide their time in lab. In mixed-gender pairs, male students tend to monopolize the computer, while female and male students tend to devote equal time to the equipment, and female students spend more time on other activities, such as writing or speaking to peers. We found no correlation between computer use, when students are presumably working with their data, and post-test performance on the CDPA.

But research is never done as cleanly as it is presented. We stumbled, blundered, and made gaffes in our analysis, a process made explicit during the talk. A key point of this confessional description involved the assumptions that underlie common statistical tests. In the physics education literature, explicit discussion of such assumptions is often missing. This could be due to a selection effect, in which manuscripts with data that do satisfy the assumptions are the ones published. But it could also be that researchers in some cases leave the assumptions unevaluated. Misapplying statistical techniques can lead to both type I and type II errors, and to over- or underestimation of inferential measures and effect sizes. Indeed, “the applied researcher who routinely adopts a traditional procedure without giving thought to its associated assumptions may unwittingly be filling the literature with non-replicable results.”2
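To make this concrete, here is a minimal sketch in Python using SciPy (not the pipeline from our study; the function name and significance threshold are ours, for illustration) of the kind of check that is easy to skip: testing the normality and equal-variance assumptions behind an independent-samples t-test before running the comparison.

    from scipy import stats

    def check_t_test_assumptions(group_a, group_b, alpha=0.05):
        # Shapiro-Wilk tests the null hypothesis that each sample comes from a
        # normal distribution; Levene's test the null of equal variances.
        normal_a = stats.shapiro(group_a).pvalue > alpha
        normal_b = stats.shapiro(group_b).pvalue > alpha
        equal_var = stats.levene(group_a, group_b).pvalue > alpha
        return {"normality": normal_a and normal_b, "equal_variance": equal_var}

    # If the equal-variance check fails, Welch's t-test is a common fallback:
    #     stats.ttest_ind(group_a, group_b, equal_var=False)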

We were certainly guilty of inattention to underlying assumptions in the early stages of our own data analysis. Ironically, such neglect is consistent with documented evidence of a broad lack of knowledge about these assumptions, about the robustness of common techniques to their violation, and about how or whether the assumptions should be checked.3 Our initial identification of a gender gap in the CDPA data left us wondering what impact, if any, our lab curriculum was having on student performance. How was the gender gap changing over time? To investigate, we rather blindly followed the well-worn path of examining measures of gain, somewhat arbitrarily deciding in advance that we would use one particular measure. Fortunately, we also decided to do a quick comparison with a second measure as a sanity check. The inconsistency between the two undermined our understanding of these gain measures, and we began to look at a wide variety of alternatives. In total, we examined five separate measures of gain: the average normalized change4 <c>; the average absolute gain normalized by the total test score <gabs>; the course average (Hake’s) normalized gain5 <g>; the absolute gain normalized by twice the average of the pre- and post-test <g2av>; and the relative change, which is the absolute gain normalized by the pre-test score <grel>. The differences between these metrics thus lie entirely in the denominator, in how each is normalized. We found that male students’ scores showed higher apparent learning gains than female students’ only when normalized change was used! With Hake’s normalized gain, or any of the three other reasonable metrics, the statistical significance vanished and the effect size approached zero (perhaps even changing sign!).
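For readers who want the five definitions side by side, the sketch below computes each of them from matched pre- and post-test scores. It is an illustration under our reading of the cited definitions, not the analysis code from the study; the function and variable names are ours.

    import numpy as np

    def gain_measures(pre, post, max_score=100.0):
        """Compute the five gain measures discussed in the text from matched
        per-student scores (same units as max_score)."""
        pre = np.asarray(pre, dtype=float)
        post = np.asarray(post, dtype=float)
        gain = post - pre

        # <c>: normalized change, piecewise per student (our reading of Marx and
        # Cummings); students scoring zero or full marks on both tests are dropped.
        c = np.full(pre.shape, np.nan)
        up, down = post > pre, post < pre
        c[up] = gain[up] / (max_score - pre[up])
        c[down] = gain[down] / pre[down]
        c[(post == pre) & (pre != 0) & (pre != max_score)] = 0.0

        # The remaining four measures are built from class averages.
        pre_avg, post_avg = pre.mean(), post.mean()
        abs_gain = post_avg - pre_avg
        return {
            "<c>": np.nanmean(c),
            "<gabs>": abs_gain / max_score,            # normalized by total test score
            "<g>": abs_gain / (max_score - pre_avg),   # Hake's course-average normalized gain
            "<g2av>": abs_gain / (pre_avg + post_avg), # normalized by twice the mean of pre and post
            "<grel>": abs_gain / pre_avg,              # relative change
        }

Note that in this sketch <c> is averaged per student while the other four measures are built from class averages; the exact averaging conventions matter and are spelled out in the cited references.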

We concluded that none of these gain measures was appropriate as an estimate of learning for our situation. Although female students clearly were starting and ending at lower levels of achievement, we had no clear picture of whether the amount of learning was comparable, a question that others have encountered and wrestled with before.6 Gain scores must be treated with great care; it may be that simply avoiding them altogether is the best path. When different measures applied to the same raw data lead to different narratives about what is happening, we may be better served by reframing the research question. Asking whether one gender has learned more than another is fraught with tacit premises. Instead, we can ask whether there is a gender difference in post-test scores after accounting for (some of) the differences with which female and male students begin the course. With that, the talk argued for the use of analysis of covariance (ANCOVA) as a step in the right direction.
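In that spirit, a minimal ANCOVA sketch in Python with statsmodels might look like the following. This is an illustration, not our published analysis; the data file and the column names "pre", "post", and "gender" are placeholders.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Does gender predict post-test score once pre-test score is controlled for?
    df = pd.read_csv("cdpa_scores.csv")  # hypothetical file of matched scores
    model = smf.ols("post ~ pre + C(gender)", data=df).fit()
    print(model.summary())

    # The coefficient on the gender term estimates the adjusted post-test
    # difference between genders at equal pre-test scores. ANCOVA has its own
    # assumptions, e.g. homogeneity of regression slopes, which can be probed
    # by testing a pre-by-gender interaction:
    interaction = smf.ols("post ~ pre * C(gender)", data=df).fit()
    print(interaction.summary())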

In addition to urging researchers to explicitly check the assumptions associated with their statistical methods, we call for improved reporting and contextualization of effect sizes.7,8 In our own case, we decided to make no claim that female students are learning less than their male peers in our lab program. The interested reader can now find this work in peer-reviewed form.9
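As one concrete example of an effect size that is easy to report alongside a p value, here is a sketch of Cohen's d with a pooled standard deviation (our illustration, not a prescription; estimators such as Hedges' g additionally correct for small-sample bias).

    import numpy as np

    def cohens_d(group_a, group_b):
        # Standardized difference between two group means, using the pooled
        # sample standard deviation.
        a, b = np.asarray(group_a, float), np.asarray(group_b, float)
        pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
                     / (len(a) + len(b) - 2)
        return (a.mean() - b.mean()) / np.sqrt(pooled_var)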

James Day is a Research Associate at the Quantum Matter Institute and the Department of Physics and Astronomy at the University of British Columbia. He was a plenary speaker at FFPER: Puget Sound 2016.

Endnotes

1. J. Day and D. Bonn, “Development of the concise data processing assessment,” Physical Review Special Topics - Physics Education Research, 7(1), 010114-1:14 (2011).

2. H. J. Keselman, C. J. Huberty, L. M. Lix, S. Olejnik, R. A. Cribbie, B. Donahue, R. K. Kowalchuk, L. L. Lowman, M. D. Petoskey, J. C. Keselman, and J. R. Levin, “Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA, and ANCOVA analyses,” Review of Educational Research, 68(3), 350-386 (1998).

3. R. Hoekstra, H. Kiers, and A. Johnson, “Are assumptions of well-known statistical techniques checked, and why (not)?,” Frontiers in Psychology, 3, 137 (2012).

4. J. D. Marx and K. Cummings, “Normalized change,” American Journal of Physics, 75, 87-91 (2007).

5. R. R. Hake, “Interactive-engagement versus traditional methods: A six-thousand-student survey of mechanics test data for introductory physics courses,” American Journal of Physics, 66, 64-74 (1998).

6. S. D. Willoughby and A. Metz, “Exploring gender differences with different gain calculations in astronomy and biology,” American Journal of Physics, 77(7), 651-657 (2009).

7. G. M. Sullivan and R. Feinn, “Using effect size - or why the P value is not enough,” Journal of Graduate Medical Education, 4(3), 279-282 (2012).

8. J. M. Maher, J. C. Markey, and D. Ebert-May, “The other half of the story: effect size analysis in quantitative research,” CBE-Life Sciences Education, 12(3), 345-351 (2013).

9. J. Day, J. B. Stang, N. G. Holmes, D. Kumar, and D. A. Bonn, “Gender gaps and gendered action in a first-year physics laboratory,” Physical Review Physics Education Research, 12(2), 020104-1:14 (2016).


Disclaimer – The articles and opinion pieces found in this issue of the APS Forum on Education Newsletter are not peer refereed and represent solely the views of the authors and not necessarily the views of the APS.