Improvement in experimental design skills for the least-prepared undergraduate students: a real effect or regression to the mean? A response to Furrow (2019)


Recently, we showed that guided inquiry in laboratory courses results in significant gains in scientific reasoning and experimental design skills for the least-prepared undergraduate students (Blumer and Beck, 2019).  In his Letter to the Editor, Furrow (2019) suggests that our results, especially the gains in experimental design skills for the least-prepared students, could be explained by the statistical phenomenon of regression to the mean rather than a real effect of guided inquiry in laboratory courses.  Furrow also suggests several approaches for avoiding regression to the mean in pre-test/post-test studies, as well as ways of generating null models of regression to the mean.  In our paper, we acknowledge the importance of considering regression to the mean in pre-test/post-test studies (Blumer and Beck, 2019, pp. 9-10, 12), so we do not disagree with Furrow's contention that regression to the mean could occur in such studies.  However, the nature of the assessment instrument and the actual data must be examined when evaluating the possibility of regression to the mean.  In our study, we used the Experimental Design Ability Test (EDAT) (Sirum and Humburg, 2011), an open-ended assessment scored with a hierarchical rubric, to assess students' experimental design skills.  Below, we describe why regression to the mean is unlikely to explain changes in EDAT scores.  In the Supplemental Materials, we give our perspective on Furrow's suggestions for controlling regression to the mean and describe aspects of the EDAT that were not considered in the null models suggested by Furrow (2019), which make those models inappropriate for the EDAT and similar assessments.  We note that the main arguments below for why the changes in EDAT scores in our study are likely real effects, and not artifacts of regression to the mean, were presented in the Discussion of our original paper (pp. 9-10) but were not evaluated by Furrow (2019).

As described above, the EDAT presents students with an open-ended prompt, and student responses were scored with a 10-point rubric (see Table 2 in Blumer and Beck, 2019).  We consider the rubric to be hierarchical in that some items on the rubric require that students first receive points for other items.  For example, identifying the independent and dependent variables (items 2 and 3) presupposes that students recognize that an experiment can be conducted (item 1).  Similarly, the realization that many variables should be held constant in the experiment (item 7) presupposes that students state that one variable should be held constant in the experiment (item 5).  In addition, the items are ordered by assumed increasing difficulty for students (Sirum and Humburg, 2011).  In short, at least some of the items in the scoring rubric are not independent of one another.
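As a rough illustration of this non-independence, the following R sketch simulates scores under a simplified hierarchical rubric.  The per-item probabilities and the specific prerequisites encoded here are hypothetical, chosen only to show how a hierarchy shifts the score distribution away from one built from independent items:

    # Minimal sketch of scoring under a hierarchical rubric.
    # Probabilities and prerequisite structure are illustrative only,
    # not estimates from our data.
    simulate_edat_score <- function(p = rep(0.5, 10)) {
      earned <- rbinom(10, size = 1, prob = p)  # draw each item independently
      # Enforce the hierarchy: a dependent item cannot be earned
      # without its prerequisite item.
      if (earned[1] == 0) earned[2:3] <- 0  # items 2-3 presuppose item 1
      if (earned[5] == 0) earned[7]   <- 0  # item 7 presupposes item 5
      sum(earned)
    }

    set.seed(1)
    scores <- replicate(10000, simulate_edat_score())
    mean(scores)   # falls below 5, the mean of 10 independent coin-flip items
    table(scores)  # the full distribution also differs from a binomial

Even this toy hierarchy pulls the mean and shape of the score distribution away from what a model of 10 independent items would predict, a point we return to in the Supplemental Materials.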

As noted by Furrow (2019), regression to the mean occurs due to within-individual random changes between repeated measurements on the same individual, in our case, pre-test and post-test.  Such random changes could result in students with the lowest pre-test scores scoring higher on the post-test (positive learning gains) and students with the highest pre-test scores scoring lower on the post-test (negative learning gains).  Clearly, if multiple-choice assessments were used and students guessed the answers, regression to the mean would be likely.  However, within-individual random changes between pre-test and post-test are much less likely for open-ended assessments scored with a rubric: students are unlikely to add (or subtract) a rubric item by random guessing.  Furthermore, we would expect random changes between pre-test and post-test to be spread across the rubric items.  Instead, in our study, the least-prepared students showed greater gains in Basic Understanding of experimental design than in Advanced Understanding of experimental design (see Figures 4 and 5 in Blumer and Beck, 2019), as we would expect if most students are learning the essential aspects of experimental design and fewer students are learning the more complex concepts related to experimental design.  In addition, students in the top quartile showed significant decreases for Advanced Understanding, but not Basic Understanding, of experimental design, which is more suggestive of a lack of motivation on the post-test by the top-quartile students than of regression to the mean (see Blumer and Beck, 2019, pp. 10-11, for more discussion of a motivation effect).  Given that we assessed students' experimental design skills using the EDAT and that gains occurred in this predictable fashion, we maintain that the gains for the least-prepared students are a real effect of guided inquiry in laboratory courses.
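As a simple illustration of the multiple-choice case, the following R sketch simulates pure guessing on a hypothetical 20-item, four-option multiple-choice test.  There is no learning at all in the simulation, yet the bottom quartile shows apparent gains and the top quartile apparent losses; all parameters are illustrative:

    # Regression to the mean from pure guessing on a hypothetical
    # 20-item, four-option multiple-choice test. Pre-test and post-test
    # are independent draws at the same chance level (no learning).
    set.seed(2)
    n_students <- 1000
    pre  <- rbinom(n_students, size = 20, prob = 0.25)
    post <- rbinom(n_students, size = 20, prob = 0.25)

    bottom <- pre <= quantile(pre, 0.25)  # lowest pre-test scorers
    top    <- pre >= quantile(pre, 0.75)  # highest pre-test scorers

    mean(post[bottom] - pre[bottom])  # positive "gain" from noise alone
    mean(post[top] - pre[top])        # negative "gain" from noise alone

No comparable guessing mechanism operates when students must construct an open-ended response that is then scored against a rubric.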

In short, we agree with Furrow (2019) that it is essential for researchers to consider the possibility of regression to the mean when evaluating pre-test/post-test changes for students grouped by pre-test quartile (see Blumer and Beck, 2019, p. 12).  However, we suggest that it is equally essential for researchers to consider the nature of the assessment instrument they are using and whether regression to the mean is likely for that instrument.  The data from our study suggest that regression to the mean is unlikely to explain the changes in experimental design skills as measured with the EDAT and that the least-prepared students do indeed benefit from guided-inquiry laboratory courses.

References

Blumer, L. S., & Beck, C. W. (2019). Laboratory courses with guided-inquiry modules improve scientific reasoning and experimental design skills for the least-prepared undergraduate students. CBE—Life Sciences Education, 18(1), ar2. doi: 10.1187/cbe.18-08-0152

Furrow, R. E. (2019). Regression to the mean in pre–post testing: Using simulations and permutations to develop null expectations. CBE—Life Sciences Education, 18(2), le2. doi: 10.1187/cbe.19-02-0034

Sirum, K., & Humburg, J. (2011). The Experimental Design Ability Test (EDAT). Bioscene: Journal of College Biology Teaching, 37(1), 8–16.

Supplemental Materials

Controlling for Regression to the Mean

In addition to describing why we think that regression to the mean is unlikely to explain our results, we would like to address the approaches for preventing regression to the mean suggested by Furrow (2019).  First, Furrow suggests randomizing students into control and intervention groups, or matching students in these groups based on prior preparation.  In many cases, such randomization or matching is logistically untenable in educational studies.  In addition, and more relevant to the results of our study, “if outcomes of an intervention depend on the starting point of a student, we might be missing the differential effects of the intervention by controlling for differences among students” (Blumer and Beck, 2019, p. 2, emphasis in the original).  Second, Furrow suggests dividing students into groups based on some other measure of preparation, such as previous grade point average.  While other measures can be used to group students, whether these measures are accurate predictors of prior preparation related to the student skills of interest is unclear, especially for cross-institutional studies such as ours.  We agree that independent measures of prior preparation can be used to categorize students, but suggest that care must be taken in selecting those measures to ensure that they align with what is being assessed.

Null Models and Permutation Tests for Regression to the Mean

Furrow (2019) suggests some useful approaches for generating null models for comparison to actual data if regression to the mean is suspected.  While we do not examine all of these approaches here, we do suggest that consideration of the assessment instrument is important in generating null models.  Furrow (2019) developed a null model for the EDAT (see S2 and S3), in which the 10-point scale of the EDAT rubric was modeled as “a binomial distribution with 10 trials.”  This approach assumes that the 10 items on the EDAT rubric are independent of one another, such that the trials are independent.  However, as discussed above, the items on the rubric for the EDAT are not independent.  As a result, a more complex null model that incorporates the non-independence of rubric items would be necessary.

Finally, Furrow (2019) suggests permutation tests as a way to test for differences between pre-test and post-test scores.  As noted by Furrow (2019), the underlying assumption of standard permutation tests is that values are “exchangeable” (Edgington and Onghena, 2007); in other words, all permutations are equally likely.  Such an assumption is likely true for some assessments.  However, if we expect a priori that some permutations are more likely than others by chance alone, this underlying assumption is violated.  For example, if a student is more likely to omit a rubric item from their answer to the EDAT prompt by chance than to include one by chance, then the likelihood of an EDAT score changing from 4 to 3 by chance is greater than the likelihood of it changing from 3 to 4.  To conduct a permutation test for an assessment such as the EDAT, we would first need to develop a probability distribution of the permutations that we would expect under the null.  While all permutations are clearly not equally likely for the EDAT, it is unclear what that null probability distribution would be.
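As a rough illustration, the following R sketch encodes one such asymmetric noise model.  The probabilities of losing and gaining a rubric item by chance (0.15 and 0.05) and the per-item baseline (0.40) are hypothetical values chosen only to show how the asymmetry breaks exchangeability:

    # Sketch of an asymmetric noise model under which pre/post values
    # are not exchangeable. Illustrative assumptions: each of 10 rubric
    # items is earned on the pre-test with probability 0.40; between
    # tests, an earned item is lost by chance with probability 0.15,
    # and an unearned item is gained by chance with probability 0.05.
    set.seed(3)
    n_students <- 10000
    pre_items <- matrix(rbinom(n_students * 10, 1, 0.40), ncol = 10)
    lost      <- matrix(rbinom(n_students * 10, 1, 0.15), ncol = 10)
    gained    <- matrix(rbinom(n_students * 10, 1, 0.05), ncol = 10)
    post_items <- ifelse(pre_items == 1, 1 - lost, gained)

    pre  <- rowSums(pre_items)
    post <- rowSums(post_items)

    # A one-point decrease is more likely by chance than a one-point
    # increase, so swapping pre/post labels within a student is not a
    # fair null.
    mean(post - pre == -1)
    mean(post - pre == +1)

Under any model of this kind, the distribution of chance score changes is skewed, and a standard permutation test that treats pre-test and post-test values as interchangeable would be built on a violated assumption.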

As a side note, the way in which individuals are assigned to quartiles differs depending on the algorithm used (e.g., see https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html). The ntile() function that was used in the code provided by Furrow (2019) divides observations into groups that have approximately the same number of observations (https://www.rdocumentation.org/packages/BurStMisc/versions/1.1/topics/ntile).  As a consequence, individuals with the same pre-test score on the EDAT can be placed into different quartiles.  The same would be true for any assessment in which student scores are from a limited range of discrete values.
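The following R snippet demonstrates this behavior with a hypothetical set of pre-test scores, using the ntile() function from the dplyr package:

    # ntile() balances group sizes rather than respecting ties, so
    # tied pre-test scores can be split across quartiles.
    library(dplyr)

    pre <- c(2, 2, 2, 3, 3, 4, 4, 5)  # hypothetical EDAT pre-test scores
    ntile(pre, 4)
    # [1] 1 1 2 2 3 3 4 4
    # The three students who scored 2 are split between quartiles 1 and 2.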

References

Blumer, L. S., & Beck, C. W. (2019). Laboratory courses with guided-inquiry modules improve scientific reasoning and experimental design skills for the least-prepared undergraduate students. CBE—Life Sciences Education, 18(1), ar2. doi: 10.1187/cbe.18-08-0152

Edgington, E., & Onghena, P. (2007). Randomization tests. New York: Chapman and Hall/CRC. doi: 10.1201/9781420011814

Furrow, R. E. (2019). Regression to the mean in pre–post testing: Using simulations and permutations to develop null expectations. CBE—Life Sciences Education, 18(2), le2. doi: 10.1187/cbe.19-02-0034