Is Everything Old New Again? Performance Assessment in Science

Back in the mid-1990s, when I was an informal science educator working for a small science museum in Connecticut, our institution became involved in a pilot project to test hands-on "performance assessments" in elementary classrooms in our part of the state. A performance assessment measures how well students apply their knowledge, skills, and abilities to authentic problems. This project was a tiny part of a much larger effort to improve standardized testing by going beyond multiple-choice, fact-based items and creating measures that could gauge students' ability to actually do science.
These efforts were happening in concert with science education reform overall, which at the time was rooted in a vision articulated by Project 2061's Science for All Americans (1990), the associated Benchmarks for Science Literacy (1993), and eventually the 1996 National Science Education Standards (NRC, 1996). The big difference between the approach taken in these documents and existing school science was the notion that science literacy was an essential outcome of education for all citizens, not just the purview of an elite few. The other major change was a new emphasis on "inquiry" or "process skills". Scientific inquiry refers to the "diverse ways in which scientists study the natural world and propose explanations based on the evidence derived from their work" (NRC, 1996, p. 23). Looking back on it now, I see that, prior to the 1996 standards, K-12 science instruction and assessment in the U.S. was an exclusive enterprise, focused almost solely on students' ability to recall facts and explain (or maybe just state) abstract theories in English.
The performance assessments we helped pilot nearly 30 years ago were certainly different from ordinary large-scale tests. First of all, they were active: students were provided with a set of materials (marbles, measuring tapes, eye droppers, magnets, etc.) and asked to use the materials to answer a series of science questions. The role of teachers was more active too. Instead of the usual job of reading directions and enforcing test protocols, teachers, armed with clipboards and rubrics (and after being trained), were charged with evaluating and reporting their students' performance. This allowed teachers to understand and make use of the results immediately, rather than waiting months for performance data to arrive, based on questions they could not view due to test security.
A few things became apparent as children in our community began taking the assessments: 1) many students had little or no exposure to the underlying content and skills being assessed, even though the concepts were part of the curriculum, and 2) some students struggled to use the most basic science tools effectively, even though using these tools should have been part of instruction in both math and science.
I don't have copies of the assessments we piloted back then; however, the examples in this RAND Corporation report from 1993 seem very familiar to me. One thing that stands out is how disconnected these activities are from any real-world context. With the exception of "Al" and "Frank" wondering about forces and motion, the science being investigated in this assessment does not seem to extend beyond the tabletop.
In the end, these early attempts to incorporate materials-based performance assessments into statewide science testing fell by the wayside for various reasons. One reason was cost: the effort of organizing and distributing the materials to schools across the state was determined to be unsustainable, as was the cost associated with managing all of the individual teacher evaluations. Another was reliability: early concerns over inconsistency in ratings provided by teachers resulted in a loss of confidence in these types of assessments, even though substantial improvements in consistency were made later on.
While these considerations may have posed some barriers to the implementation of performance assessment at a large scale, the real reason work on this type of "authentic assessment" (Wiggins, 1990) did not continue had more to do with stepped-up enforcement of accountability requirements by the U.S. Department of Education during the late 1990s and the enactment of the No Child Left Behind Act in January 2002 (Pecheone et al., 2010). These two actions by the Federal government meant that every child had to be tested every year in reading and math, and that the turnaround time for results had to be shortened to accommodate a parental choice option added to the law. This added expense to assessment budgets nationwide, which led many states to rely almost exclusively on relatively inexpensive and easily scored multiple-choice tests.
Fast forward to today, and performance assessment is back in fashion in science education, except this time the emphasis is on formative performance assessment. In formative assessment, the goal is to monitor students' learning and provide feedback, not to document what has been learned at the end of instruction or to evaluate programs. The reemergence of performance assessment as a focal method in science education is directly linked to the integrated approach taken in the latest (Next Generation) Science Standards (NGSS) (NRC, 2013). It is no longer deemed sufficient for students to demonstrate that they know the content and can separately show mastery of certain "inquiry" skills. Instead, they must be able to demonstrate, in an integrated way, that they can use science and engineering practices (process/inquiry skills) to apply their knowledge of crosscutting concepts (e.g., cause and effect, patterns, systems) and draw upon their understanding of specific disciplinary core ideas (the content) in the context of real-world phenomena or problems. Assessment of student understanding within this "multi-dimensional" approach does not lend itself well to traditional methods like multiple-choice items or isolated short-answer questions, hence the renewed interest in performance assessment.
Has performance assessment's time come for measuring students' science learning? Maybe, but as with any complex endeavor, the devil is in the details. It is possible to craft a performance assessment that looks good - maybe it has an interesting context and asks students to respond to questions that require them to employ different science practices (model and explain, for example) - but doesn't really align with the goals of instruction. Those "hands-on" performance assessments we administered in the 1990s behaved like this for some students: they had fun interacting with the materials but did not have the understanding needed to fully engage with the assessment because the concepts were never taught. This is why proponents of "backward design" (Wiggins & McTighe, 2005) advise starting with the desired results (student learning goals) and then determining acceptable evidence (the basis of assessment) before planning any learning experiences or assessments. This approach also protects against the "favored activity effect," where a teacher starts with their favorite lesson or lessons and then tries to bend the lesson (or the plan) to meet the prescribed learning targets (been there, done that!).
Another place where a performance assessment may fall short is if it doesn't sufficiently test a student's ability to apply what they have learned to a new situation. All too often, the assessments we give our students are so closely aligned with instruction that they become recall exercises rather than measures of understanding. This ability to apply learning to a different situation is called "transfer". According to Wiggins & McTighe (2005, p. 40), transfer is "an essential ... because teachers can only help students learn a relatively small number of ideas ... so we need to help [students] transfer their inherently limited learning to many other settings, issues, and problems." One way to find out if students can transfer their learning is to challenge them with performance assessments that leave to them the work of figuring out which acquired knowledge and skills to apply, and when.
These two criteria - alignment and transfer - could apply to a performance assessment regardless of whether it is intended to be used formatively or summatively. A formative performance assessment would need to possess these qualities and also provide both the teacher and the student with useful (actionable) information about the student's learning. This added need calls for a mechanism by which information flows back to the student (e.g., via the teacher, through self- or peer assessment, or by automated response), and it is supported by a planning process rooted in backward design. If instruction is designed with the end goals of learning in mind and evidence of learning is defined from the start, any performance assessment created using this framework should also provide useful information to the teacher.
Thinking back on the efforts of those educators to improve large-scale testing through hands-on performance assessment in the 1980s and 1990s, I can't help wondering if this might be a good time to try again. One big barrier to successfully implementing the program on a large scale involved the logistics of distributing and maintaining the materials needed for students to engage in the process of science. Today, at least some of what assessment designers were trying to accomplish with all those parts and pieces could be handled through technology. It's not exactly the same experience, but there are plenty of examples of animations, simulations, and engaging video that could be used to provide students with the context, information, and interactivity they need to demonstrate what they know and can do in science - and that's just using technology widely available today. Who knows what will be possible in the future?
Here are some examples of high-quality performance assessments:
Stanford NGSS Assessment Project
Next Generation Science Assessment
Wisconsin DPI Assessment Examples
References and Resources:
National Research Council. (2013). Next Generation Science Standards: For States, By States. Washington, DC: The National Academies Press. https://doi.org/10.17226/18290.
Pecheone, R., Kahl, S., Hamma, J., & Jaquith, A. (2010). Through a Looking Glass: Lessons Learned and Future Directions for Performance Assessment. Stanford, CA: Stanford University, Stanford Center for Opportunity Policy in Education.
Stecher, B. M., & Klein, S. P. (1996). Performance Assessments in Science: Hands-On Tasks and Scoring Guides. Santa Monica, CA: RAND Corporation. https://www.rand.org/pubs/monograph_reports/MR660.html
Wei, R.C., Pecheone, R.L., & Wilczak, K.L. (2014). Performance Assessment 2.0: Lessons from Large-Scale Policy & Practice. Stanford, CA: Stanford Center for Assessment, Learning, and Equity.
Wiggins, G. (1990). The case for authentic assessment. Practical Assessment, Research & Evaluation, 2(2).
Wiggins, G., & McTighe, J. (2005). Understanding by design (2nd ed.). Alexandria, VA: Association for Supervision and Curriculum Development (ASCD).
Image by Monstera from Pexels