Generative AI can harm learning?
Results from a randomised controlled trial on exam performance
A brilliant paper titled Generative AI Can Harm Learning was published by Hamsa Bastani, Osbert Bastani, Alp Sungu, Haosen Ge, Özge Kabakcı, and Rei Mariman, researchers from the University of Pennsylvania and Budapest British International School. As a founder building AI solutions in education, I was immediately drawn to the title. After reading the paper, I am more convinced than ever of the importance of developing AI specifically for education, which is what we are doing at Bloom AI. This article provides an overview of the study along with some of my own commentary and takeaways.
Overview of the experimental design
The researchers performed a randomised controlled trial (RCT) to evaluate the impact of GPT-4-based tutors on 9th-, 10th- and 11th-grade mathematics in a Turkish high school. The students went through four 90-minute sessions, each with three parts: a teacher review, an assisted practice period, and an unassisted 30-minute exam. The intervention affected only the second part. There were three treatment arms:
Control: Students worked through the practice problems with access to course books and notes, with no devices.
GPT Base: Students interacted with GPT-4 through a chat interface with a minimal prompt. We can think of this as similar to ChatGPT.
GPT Tutor: Students used the same chat interface with GPT-4, but each practice problem had a detailed, problem-specific prompt. Each prompt included the solution to the practice problem and teacher input on common student mistakes, which reduces the error rate of the tutor, along with instructions not to give out the answer directly. We can think of this as an early version of a specialised AI solution for education (a rough sketch of what such a prompt might look like follows below).
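To make this concrete, here is a minimal, purely illustrative sketch in Python of how such a problem-specific prompt could be assembled from the three ingredients the paper describes (the solution, common student mistakes, and a no-direct-answers instruction). The function name, wording, and example problem are my own assumptions, not the authors' actual prompts.

```python
# Illustrative sketch only: assembling a per-problem "GPT Tutor"-style system
# prompt from a solution, common student mistakes, and a no-direct-answers
# instruction. The structure follows the paper's description; the wording is
# hypothetical.

def build_tutor_prompt(problem: str, solution: str, common_mistakes: list[str]) -> str:
    """Assemble a problem-specific tutoring prompt for a chat-based model."""
    mistakes = "\n".join(f"- {m}" for m in common_mistakes)
    return (
        "You are a maths tutor helping a high-school student.\n\n"
        f"Practice problem:\n{problem}\n\n"
        f"Correct solution (for your reference only, never to be revealed):\n{solution}\n\n"
        f"Mistakes students commonly make on this problem:\n{mistakes}\n\n"
        "Guide the student with hints and questions, point out where their "
        "reasoning goes wrong, and do not give out the final answer directly."
    )

if __name__ == "__main__":
    print(build_tutor_prompt(
        problem="Solve for x: 2x + 6 = 14",
        solution="Subtract 6 from both sides to get 2x = 8, then divide by 2: x = 4.",
        common_mistakes=[
            "Adding 6 to both sides instead of subtracting",
            "Forgetting to divide both sides by 2",
        ],
    ))
```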
The primary regression is:
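Schematically, based on the variable descriptions below (the exact specification, including controls and fixed effects, is in Section 3 of the paper), it looks something like:

Y_i,c = β0 + β1 · GPT Base_c + β2 · GPT Tutor_c + γ · X_i,c + ε_i,c

where Y_i,c is the normalised score of student i in class c, X_i,c is a vector of controls, and ε_i,c is the error term.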
I won’t explain each variable in detail (see Section 3 of the paper), but I’ll highlight that GPT Base_c and GPT Tutor_c are binary variables corresponding to a class c, equal to 1 if class c is assigned the corresponding treatment and 0 otherwise. In simple terms, the coefficients of interest are β1 and β2. For example, a positive β1 would indicate that access to GPT Base improves the outcome measure (either on the practice problems or on the unassisted exam).
The authors were clearly very thoughtful about experimental design. Some choices included:
Incentive compatibility: They ensured that performance on the practice problems and the unassisted exam both contributed to the final grade.
Independent graders: They hired independent graders to reduce teacher-student bias.
Accurate and relevant course material: They employed a mathematics teacher to help develop the material and ensured it was in line with the syllabus prescribed by the Turkish Ministry of Education.
Reliable internet connection: They purchased 52 laptops and four portable Wi-Fi dongles to guarantee a reliable internet connection across the classes.
Enforcing no other tools: The teacher and IT research assistants actively monitored the classroom so that students could not use other websites or applications.
At the end of each session, they surveyed the students on their experiences and preferences.
Commentary on the results
The first main result is that both the GPT Base and GPT Tutor groups perform significantly better than the control group on the assisted practice problems. While the control group had a mean normalised score of 0.28 (out of 1), the GPT Base-assisted group had a mean score of 0.42 (48% higher than control) and the GPT Tutor-assisted group had a mean score of 0.65 (127% higher than control). Both of these results were significant at the 1% level. This is not a surprise. There are many studies that demonstrate that human-AI collaboration yields performance improvements.[1][2][3]
The second main result is that student performance in the unassisted exam for the GPT Base cohort degraded by 17% compared to the control group. Notably, these negative effects were mitigated by the safeguards in the GPT Tutor group. There was no significant difference in exam performance between the control group and the GPT Tutor group. This result is certainly thought-provoking and runs counter to some of the results we’re seeing at Bloom AI. Below, I suggest some further exploration avenues.
Both of the main results are robust when the regression is performed with different specifications. These include performing the analysis at the problem level, omitting non-compliers (sessions that did not use the assigned treatment due to exogenous circumstances), including the non-random honors classes, and controlling for self-reported survey variables such as gender and prior exposure to ChatGPT. See Appendix B.2 for more detail.
The main mechanism by which GPT Base impedes learning is students using it as a crutch. The authors note that there are two potential mechanisms by which GPT Base could adversely affect students: (1) errors made by GPT Base, and (2) students using GPT Base as a crutch. If the mechanism were (1), one would expect that the higher GPT Base's error rate on a particular problem, the worse the student's performance on that problem. However, the authors do not find a significant effect of the error rate on exam performance. GPT Tutor avoids both of these mechanisms because (1) the prompt includes the correct solution, reducing errors, and (2) it is instructed not to give the student the answer.
Students who used GPT Tutor were more engaged, spent more time, and had more non-superficial conversations than those who used GPT Base. By Session 4, students using GPT Tutor were, on average, asking twice as many questions per problem compared to students using GPT Base. Students spent 13% more time with GPT Tutor than with GPT Base. The researchers also classified conversations as superficial (for example, asking directly for the answer) or non-superficial (for example, giving an attempted answer), and found that a substantially larger portion of conversations in the GPT Tutor arm were non-superficial.
Students in the GPT Tutor arm perceived the practice sessions as more valuable for learning than those in the other arms, despite no noticeable exam performance improvement compared to the control group. The students in the GPT Base group perceived their learning to be the same as the control group, despite performing worse in the unassisted exam. The authors point out that students potentially overvalue the benefits of both GPT Base and GPT Tutor. This is consistent with the well-established finding that perceived learning and actual learning are separate constructs. At the same time, perceived learning can have other potential benefits, such as improved student experience and motivation.[4]
Other insights from the Appendix:
Perceived exam performance was significantly higher in the GPT Tutor arm compared to the control arm.
The GPT Tutor arm took about a minute longer than the control arm to finish the exam.
Access to GPT Tutor and GPT Base reduced grade dispersion, i.e., access to generative AI can narrow the "skill gap" by helping weaker students more.
GPT Base and GPT Tutor had no effect on student absenteeism.
Key takeaways
Blindly implementing off-the-shelf AI tools may harm learning. This study gives some evidence to confirm many teachers’ suspicions that students can become over-reliant on AI. The GPT Base group’s experience is akin to students having access to off-the-shelf AI tools without any safeguards or instructions.
We need AI specialised for education. The negative effects seen in the GPT Base cohort did not appear in the GPT Tutor cohort. But the amount of work that went into GPT Tutor was non-trivial. The authors had to hire a teacher to put together bespoke prompts for each of the 57 practice problems, each with detailed instructions to ensure that direct answers were not given, plus solutions and common student mistakes to reduce hallucinations. Each GPT Tutor prompt was 500+ words, compared to the short ~50-word prompt for GPT Base. The result was an experience where students engaged more, asked better questions, and had higher perceived learning.
Give students guidance on how to use generative AI. Tools like ChatGPT and Microsoft Copilot are designed to be assistants and as helpful as possible, but this can work against the goal of learning. Simply changing the prompt is not enough to break them free of these ingrained principles. Without proper instruction or guidance, students will be tempted to use these unspecialised tools to obtain answers directly. The study also showed that when GPT Base got a practice problem wrong, the student was more likely to get it wrong too. This shows that students also need to be taught to critically assess AI outputs.
AI is like the calculator… but it’s also not. The authors point out that this potential trade-off between short-term performance and long-term learning is also exhibited by the calculator. The widespread use of the calculator has most likely deteriorated our arithmetic skills. (Although it is worth noting that I struggled to find any studies demonstrating this over a long time horizon.) But they also point out two important differences: first, the capabilities of generative AI are substantially broader than a calculator's, and second, generative AI is unreliable and can provide incorrect responses. This combination means that “substantial work is required to enable generative AI to positively enhance rather than diminish education.”
Further areas of exploration
Measure learning more holistically. In this study, learning was measured as the result of an unassisted exam within the same 90-minute session. In reality, differences in learning may take longer to manifest. Students often learn concepts over multiple sessions, repeated practice, and exposure to different problem types. For example, one could have regular quizzes over a period of weeks for the same type of content.
Take into account task switching and changes in cognitive load. The GPT Base and GPT Tutor groups had to switch from listening to a lecture, to working on a laptop, to sitting an exam (most likely on paper, though the study does not specify). The control group experienced less of a shift in cognitive context going from textbook to exam. It is possible that the greater cognitive load due to more substantial task switching contributed to a decline in the unassisted exam results.
Repeat the study with different subjects. Mathematics happens to be one of the subjects that GPT models are worst at; GPT-4’s worst performance on an academic exam was AP Calculus BC. Consistent with this, GPT Base only gave a correct answer 51% of the time, and most of these errors were logical errors.
Cut the results by demographic, behavioural or other background data. I would love to understand whether the results change by group. The authors did include these variables as controls (Appendix B.2), and they also looked for heterogeneous treatment effects by students’ previous GPA, access to private tutoring, and hours spent studying. They found that weaker students (those with a lower GPA) and students who received private tutoring benefited more from GPT Base on the assisted problems. However, they did not publish any findings relating to gender or year-level differences.
[1] Fabrizio Dell'Acqua, Edward McFowland, Ethan R. Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim R. Lakhani. "Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality." Harvard Business School Technology & Operations Mgt. Unit Working Paper 24-013 (2023).
[2] Shakked Noy and Whitney Zhang. "Experimental evidence on the productivity effects of generative artificial intelligence." Science 381, no. 6654 (2023): 187-192.
[3] Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond. "Generative AI at Work." NBER Working Paper No. 31161, National Bureau of Economic Research, April 2023.
[4] Traci Sitzmann, Katherine Ely, Kenneth G. Brown, and Kristina N. Bauer. "Self-assessment of knowledge: A cognitive learning or affective measure?" Academy of Management Learning & Education 9, no. 2 (2010): 169-191.
Fundamentally, generative AI is still just a technology, and technology is a double-edged sword. What matters is how it is used.
In the future, teaching will increasingly take on the character of human-computer collaboration. Teachers in the classroom will act more as guides, leading students to learn how to learn and how to ask questions, and students' learning will no longer be confined to the classroom, moving instead towards a more open, diverse, and interactive space of human-machine interaction.