Automated test grading has moved way past Scantron bubble sheets
Oct 12, 2020

But human teachers should still double-check discrepancies.

Every Monday this fall, we’re talking about technology and education, because many students, caregivers and teachers are getting a crash course in ed tech. Even before the pandemic, one way technology has been creeping into students’ lives is through grading. And we’re not just talking about those multiple choice bubble sheets that’ve been around for decades.

The Educational Testing Service, which creates statewide assessments for K-12 students, along with higher ed tests like the GRE, has been using artificial intelligence to grade essays since 1999. But can AI really tell good writing from bad? I spoke with Andreas Oranje, vice president of assessment and learning technology development for ETS. He says systems are trained to look for things like style, grammar and how arguments are built. The following is an edited transcript of our conversation.

Andreas Oranje (Courtesy ETS)

Andreas Oranje: The system is trained to extract a lot of features of good and bad writing. And so what that system does then is it aggregates all that data from an essay after processing it, then it produces some kind of professional score. The scores of the computer are being compared to a human grader. And then if there’s a big discrepancy, usually a third grader comes in to resolve the discrepancy.

Amy Scott: So, it’s actually a check on the human reviewer?

Oranje: It is the way we use it. It is a check on the human reviewer.

Scott: Do you see this ever, the AI, being good enough to replace teachers in their regular grading that they do, which I know is one of many teachers’ least favorite activities?

Oranje: I really believe in that joining forces of computers and humans. Having systems that help teachers, for example, spot [something] like, “Hey, a lot of your students are struggling with this topic. Maybe in your next lesson, you want to take a little bit more time expanding on that.” Or maybe the grading, especially the more low-key grading with fairly straightforward answers at the lower levels. That’s something that a computer can do really well and very reliably. And so helping a teacher so that the teacher can spend more time on personalizing their instruction and giving individual attention, I think is a great way to go about this.

Scott: How often do you see AI catching examples of poor writing that a human has missed, and that second expert having to come in?

Oranje: It depends by exam. So for a practice exam, it may be different than for a GRE exam. But we have pretty strict rules around that, where even more than a 1.5-point difference is already more than we want to tolerate and you’re talking now about maybe a 6-point scale. So it depends on the exam, it can be 5%, it can be 10%. But I think the important point is that whenever there’s a discrepancy, we’re going to check into it, we’re never letting something go without a check.
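The discrepancy check Oranje describes can be sketched in a few lines of Python. This is only an illustration, not ETS's actual system: the 1.5-point threshold comes from the interview, while the names `ScoredEssay`, `needs_second_reader` and `discrepancy_rate`, and the sample scores, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScoredEssay:
    """One essay with a human score and a machine score (hypothetical structure)."""
    essay_id: str
    human_score: float
    machine_score: float


def needs_second_reader(essay: ScoredEssay, max_gap: float = 1.5) -> bool:
    """Flag the essay when the two scores differ by more than max_gap points."""
    return abs(essay.human_score - essay.machine_score) > max_gap


def discrepancy_rate(essays: list[ScoredEssay], max_gap: float = 1.5) -> float:
    """Share of essays that would be routed to another human reader."""
    if not essays:
        return 0.0
    flagged = sum(needs_second_reader(e, max_gap) for e in essays)
    return flagged / len(essays)


# Made-up scores on a 6-point scale, for illustration only.
sample = [
    ScoredEssay("a", human_score=4.0, machine_score=4.5),
    ScoredEssay("b", human_score=5.0, machine_score=3.0),  # gap of 2.0: flagged
    ScoredEssay("c", human_score=3.0, machine_score=4.5),  # gap of exactly 1.5: not flagged
    ScoredEssay("d", human_score=2.0, machine_score=2.0),
]
flagged_ids = [e.essay_id for e in sample if needs_second_reader(e)]
rate = discrepancy_rate(sample)
```

In this sketch the machine score never overrides the human one. It only decides which essays get another look, which matches the "check on the human reviewer" role described above.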

Scott: We’ve seen stories of students gaming the system, just writing a bunch of keywords that they know the system is going to be looking for, and getting a perfect score. What’s the usefulness of that kind of grading if it can easily be gamed?

Oranje: If a system is that easily gameable, it’s obviously not a very good system, and it needs to be revisited. Now, you have to take into account what the purpose was. So, if the purpose of a system is to help a student in a low-stakes environment learn, or to help inform a teacher, the needs of the system are obviously very different than if you make big life decisions about admitting someone or other very impactful decisions based on it. It depends on how you use it.

Scott: Clearly there are some school systems that are using AI in this way. What role do you see the companies that make the technology having in preventing it from being used in a way that you think is not actually that helpful?

Oranje: We have done a bunch of presentations over the past couple of years to set up some key questions that we believe teachers should ask. Is the data that is being used in these systems appropriate for my students? Are these models related to what my students need? Can I intervene in the system? All these kinds of questions that teachers need to ask. And I think it’s really important that as we move further and further in this automated world, that we really tool people and arm people with those questions and be able to make good judgments for whether these tools are good for the learning that happens in the classrooms or not.

Scott: You’ve been with the company for a really long time working on this technology. Do you see a point when your work is done, it all works as well as human grading?

Oranje: I don’t think that that point will ever come, and not because the systems aren’t progressing really rapidly — because they’re getting better and better — but the reason why I don’t think that will ever happen is because as soon as we get these systems really well done to do what we know now, we as a society want way more things and we are always wanting more things than we wanted before. And so our creativity and our desire for new things will always outpace what we can model at any point in time.

Related links: More insight from Amy Scott

Monica Chin reported a piece last month for The Verge about the virtual learning app Edgenuity. It was being used to grade assignments in a seventh-grade history class in Los Angeles. One 12-year-old student received a 50/100 on his first assignment. Then, his mom, a history teacher herself, caught on that the grading software was looking for certain keywords and would mark an answer wrong if those words weren’t there. So, she and her son decided to play along.

For a question about the Byzantine Empire, for example, the student wrote a couple sentences and then just a list of relevant words: wealth, caravan, ship, India. That was apparently all it took — he started getting perfect grades. As Andreas Oranje pointed out, even if AI can help, there are still some things where a human should always be in the mix.
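To see why a grader like that is easy to game, consider a minimal keyword-matching scorer. This is an assumption about how such a grader might work, not Edgenuity's actual algorithm; the function `keyword_score`, the keyword list and the sample answers are hypothetical.

```python
import re


def keyword_score(answer: str, keywords: list[str], max_points: int = 100) -> int:
    """Award points in proportion to how many expected keywords appear in the answer."""
    if not keywords:
        return 0
    # Lowercase the answer and split it into a set of words.
    words = set(re.findall(r"[a-z']+", answer.lower()))
    hits = sum(1 for k in keywords if k.lower() in words)
    return round(max_points * hits / len(keywords))


# Hypothetical keyword list for a question about the Byzantine Empire.
expected = ["wealth", "caravan", "ship", "India"]

word_list_answer = "wealth, caravan, ship, India"
thoughtful_answer = "The Byzantine Empire grew rich by taxing trade between Europe and Asia."

gamed = keyword_score(word_list_answer, expected)
honest = keyword_score(thoughtful_answer, expected)
```

Here the bare word list earns full marks while a reasonable sentence that happens to avoid the expected words earns nothing, because the scorer checks vocabulary and never looks at the argument.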

The team

Molly Wood Host
Michael Lipkin Senior Producer
Stephanie Hughes Producer
Daniel Shin
Jesus Alvarado Assistant Producer
