2026-06-11

DeepMind's Sierra Leone RCT: AI Tutoring's Real Effect Depends on Who It Helps, Not What It Teaches

1,763 students, eight weeks, +0.258 standard deviations. A rare causal result for AI in education. But the students who gained most were already the strongest, and whether the result transfers to other settings is the question builders should ask.

ai-education rct deepmind


Summary

DeepMind released results from a pre-registered randomized controlled trial run in Sierra Leone: eight weeks, 12 schools, 1,763 junior secondary students, measuring the effect of Guided Learning in Gemini on math scores. The result was +0.258 standard deviations over the control group, which DeepMind translates into roughly 1.2 to 1.7 years of typical learning progress in eight weeks. This is a rare piece of methodologically serious causal evidence for AI in education, not another product blog post.

What matters is not the headline number but the two conditions attached to it. First, the students who gained the most were the ones who entered with the strongest math skills, which DeepMind flags in its own “path forward” section as the “achievement gap” problem. Second, the gain was achieved in a setting with a very low math baseline and stretched teacher resources; the same intervention dropped into better-resourced schools with a higher baseline would likely show a smaller effect. The real question about AI tutoring is not what it teaches but who it helps. For an edtech builder, this trial offers less a verdict that “AI can teach” and more a map of which populations and which classroom designs make it work.

What happened

The trial was run by DeepMind with Fab AI and supported by the Sierra Leone Ministry of Education, with additional support from Google.org, the Gates Foundation, EducAid, Laterite, and Oxford MeasurEd. The site was 12 schools in Port Loko District, the sample 1,763 junior secondary students, the duration eight weeks, the intervention Guided Learning in Gemini, the outcome math scores. The key feature is that it was pre-registered: what to measure and how was fixed before the data came in, which closes the door on cherry-picking outcomes afterward and is the main reason it is far more credible than most edtech self-reports.

The quantitative result: +0.258 SD over control, translated to about 1.2 to 1.7 years of typical progress, and about 1.8 to 2.5 years in classrooms where teachers worked Gemini into roughly half their lessons to hit a 12-hour usage target. On engagement, 69% of students met or exceeded usage targets, far above the roughly 5% typical of voluntary edtech, the so-called five percent problem.

DeepMind also released process data to address the worry that AI becomes a shortcut for copying answers. Across more than 113,000 interactions, 91.4% were used to build conceptual understanding rather than seek solutions; 76% of Gemini’s messages posed scaffolding (Socratic) questions, and only 2% gave a direct solution. Over time, students’ queries shifted from 68% skill-building in week one to 90% by the final week, while solution-seeking questions dropped from 25% to 10%. In focus groups, teachers reported that preparing lessons with the tool gave them new ways to explain familiar topics like fractions, with their role shifting from “lecturers” to “facilitators.” DeepMind also released a teacher training guide built with Fab AI and a playbook for running these RCTs.

Why it matters

It moves AI in education from anecdotes and demos to citable causal evidence. The field’s biggest problem has been that almost every positive claim came from a vendor’s own retention curve or an uncontrolled pilot, with no way to separate “the AI actually taught” from “students willing to use AI were already more motivated.” A pre-registered RCT with a control group cuts that confound. On method alone, this is one of the few pieces of evidence in AI education worth taking seriously.

But how you read the effect size determines what you conclude. +0.258 SD is fairly large for an education intervention, yet once it is converted into “years,” you have to watch two amplifiers. The first is baseline: Sierra Leone’s junior secondary students start low on math, and the lower the start, the larger the visible SD gain from any effective intervention, because there is more room to improve and the scale is more sensitive at the bottom. The second is the comparison itself: in a setting with a higher baseline, where the control group is also improving fast, the same intervention typically yields a smaller SD gain, and any given SD gain converts into fewer “years” because a typical year of progress is itself larger. None of this makes the result fake. It means the “1.2 to 1.7 years” conversion is highly context-dependent and should not be treated as a constant Gemini reproduces everywhere.
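The context dependence of the “years” conversion can be made concrete with a little arithmetic. The sketch below assumes the conversion is a simple division of the SD gain by a typical year's progress measured in SD units; the source does not state DeepMind's exact method, so treat that as an assumption. The function names and the 0.40 SD-per-year comparison figure are hypothetical, chosen only for illustration.

```python
def sd_to_years(effect_sd: float, annual_growth_sd: float) -> float:
    """Convert an effect size (in SD) into 'years of typical progress'.

    Assumes a linear conversion: effect divided by a typical year's gain.
    """
    if annual_growth_sd <= 0:
        raise ValueError("annual_growth_sd must be positive")
    return effect_sd / annual_growth_sd


def implied_annual_growth(effect_sd: float, years: float) -> float:
    """Back out the yearly gain (in SD) that a given 'years' figure implies."""
    if years <= 0:
        raise ValueError("years must be positive")
    return effect_sd / years


EFFECT_SD = 0.258  # headline result reported in the article

# Yearly progress implied by the reported 1.2 to 1.7 year range.
slow = implied_annual_growth(EFFECT_SD, 1.7)
fast = implied_annual_growth(EFFECT_SD, 1.2)
print(f"implied typical progress: {slow:.3f} to {fast:.3f} SD per year")

# Hypothetical setting where students typically gain 0.40 SD per year:
# the same SD effect converts into far less than one year.
print(f"same effect elsewhere: {sd_to_years(EFFECT_SD, 0.40):.2f} years")
```

Under that assumption, the reported range implies a typical year worth roughly 0.15 to 0.22 SD in this setting. Where a typical year is worth more, the identical +0.258 SD buys fewer “years,” which is why the conversion should not travel with the product.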

The “achievement gap” DeepMind flags is the report’s most honest and most memorable finding: most students benefited, but the strongest students benefited most. That puts a counterintuitive fact on the table. In the form this trial took, AI tutoring widened an existing gap rather than closing it. For a technology so often pitched as bringing quality education to everyone, that is a tension worth confronting directly. It also answers the opening question: AI tutoring’s real effect depends on who it helps. When it helps students who already know how to learn and can work with scaffolding questions, the gain is largest, while the students who most need help may sit lower on the benefit curve.

Builder impact

First, do not cite +0.258 SD as your product’s expected outcome. It was earned in a specific population and format: low baseline, teacher-led, embedded in formal lessons. What to copy is not the number but the method: pre-register, use a control group, report effect sizes rather than retention. If your growth deck shows only DAU and satisfaction, any investor or district that has seen this report will ask you for a control group, and that is the new bar for credibility.
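For builders who have never reported an effect size, a minimal sketch of the basic calculation follows. It computes a standardized mean difference (Cohen's d with a pooled standard deviation) between a treatment and a control group. This is an illustration only, not the estimator used in the trial: it ignores clustering by school, covariate adjustment, and anything else a pre-registered analysis plan would specify. The function name and the toy scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treatment: list[float], control: list[float]) -> float:
    """Cohen's d: difference in group means divided by the pooled SD."""
    n_t, n_c = len(treatment), len(control)
    if n_t < 2 or n_c < 2:
        raise ValueError("each group needs at least two scores")
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    if pooled_var == 0:
        raise ValueError("scores have no spread; effect size is undefined")
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Toy endline scores, invented purely to exercise the function.
treated_scores = [2, 4, 6]
control_scores = [1, 3, 5]
print(standardized_mean_difference(treated_scores, control_scores))
```

The point of reporting in SD units is comparability: a district can set the number beside other interventions, which DAU and satisfaction scores do not allow.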

Second, treat the teacher as part of the product, not the thing being replaced. That 69% engagement almost certainly came not from model charm but from teachers designing the curriculum, setting objectives, and running discussions, so students were not left alone with an app. DeepMind’s own framing is that AI “augments teachers’ reach” rather than replaces them. If your product puts a student alone in front of a chat box, you are reproducing exactly the isolated setting this trial did not measure, and the five percent problem will likely find you.

Third, treat the achievement gap as a core product problem, not a PR footnote. If your target customers are developing regions or under-resourced schools, beware a tool that makes the strong stronger: it can look great on aggregate metrics while the disadvantaged students you most want to serve fall behind. DeepMind lists “deliver the strongest gains for the students who need it most” as the next need to solve, which means even they have not solved it. Whoever makes guided learning produce the largest gains for low-starting students will have solved this market’s real problem, rather than added another retention curve.
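One way to keep the achievement gap visible in your own metrics is to split the treatment effect by where students started. The sketch below is a descriptive check under stated assumptions, not the analysis DeepMind ran: it sorts students by baseline score, cuts them into equal-sized groups, and reports the treatment-minus-control endline gap in each group, scaled by the control group's SD. The `Student` record, the function name, and the sample data are all hypothetical, and a real analysis would need a pre-specified interaction test and enough students per group to say anything.

```python
from statistics import mean, stdev
from typing import NamedTuple, Optional


class Student(NamedTuple):
    baseline: float  # score before the intervention
    endline: float   # score after the intervention
    treated: bool    # True if assigned to the tutoring arm


def effect_by_baseline_group(
    students: list[Student], n_groups: int = 3
) -> list[Optional[float]]:
    """Treated-minus-control endline gap per baseline group, in control-SD units."""
    control_sd = stdev([s.endline for s in students if not s.treated])
    if control_sd == 0:
        raise ValueError("control scores have no spread")
    ranked = sorted(students, key=lambda s: s.baseline)
    n = len(ranked)
    effects: list[Optional[float]] = []
    for g in range(n_groups):
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        treated = [s.endline for s in group if s.treated]
        control = [s.endline for s in group if not s.treated]
        if not treated or not control:
            effects.append(None)  # cannot compare without both arms
        else:
            effects.append((mean(treated) - mean(control)) / control_sd)
    return effects


# Invented data: the treated bump grows with baseline (1, 2, 3 points).
sample = []
for baseline, bump in [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 3)]:
    sample.append(Student(baseline, baseline * 10, treated=False))
    sample.append(Student(baseline, baseline * 10 + bump, treated=True))

print(effect_by_baseline_group(sample))  # lowest baseline group first
```

If the last entry is consistently the largest, the aggregate effect is hiding a tool that helps the strongest students most, which is the pattern this paragraph warns about.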

What to ignore

Ignore the reading that “AI is now proven to teach,” which treats a single trial as a universal conclusion. This RCT shows that one specific guided-learning product produced measurable math progress over eight weeks in a low-baseline, teacher-led setting in Sierra Leone. It does not prove, and does not claim, that the gain transfers intact to better-resourced schools in the US, Europe, or China. Treating a single result as a law is the most common misuse of this kind of study.

Ignore the urge to use the “years” conversion as a hard metric. The 1.2 to 1.7 years is an interpretive projection of the SD gain onto typical learning speed, highly dependent on Sierra Leone’s low baseline and this trial’s measurement scale. It is useful for grasping how big the effect is, not for copy-pasting into “use Gemini and learn an extra year and a half” marketing.

Do not swing to the other extreme and dismiss it because it is a single-country, single-vendor trial. The pre-registration and control group are real, and the analysis of 113,000-plus interactions supplies mechanism evidence that there is genuine pedagogy in the gain, not pure low-baseline windfall. The honest stance is to grant it internal validity (causation holds within this population) while noting that external validity (whether it transfers) is not yet established. DeepMind says it is running more pre-registered RCTs across countries, and until those land, treat this one as a strong existence proof rather than an established universal rule.

FAQ

Is +0.258 standard deviations a big effect for an education intervention?

It is fairly large by education standards. DeepMind translates it into roughly 1.2 to 1.7 years of typical learning progress in eight weeks, and classrooms where teachers used Gemini in about half their lessons (a 12-hour target) reached roughly 1.8 to 2.5 years. But the absolute size of an effect depends heavily on baseline and measurement. Sierra Leone's students start low on math, and the same intervention usually produces a smaller SD gain in settings with a higher baseline.

Why didn't this trial fall into the five percent problem?

Voluntary edtech typically sees only about 5% of students actually use it, the long-standing five percent problem. Here 69% met or exceeded usage targets. The likely reason is not the model but the format: it was teacher-led and embedded in formal lessons. Teachers designed the curriculum, set objectives, and ran discussions, so students were not left alone with an app. Crediting the model misreads the result.

Are the gains just because Sierra Leone students started so low that anything would help?

A low baseline does inflate visible gains, and that has to be discounted when reading the number. But the trial also shows mechanism: across 113,000-plus interactions, 91.4% were used to build conceptual understanding rather than seek answers, Gemini posed scaffolding questions in 76% of messages and gave direct solutions in only 2%, and skill-building queries rose from 68% in week one to 90% by the last week. That points to real pedagogy in the gain, not pure low-baseline windfall, though the trial does not separate how much each contributes.

Sources

Measuring the impact of learning with AI in Sierra Leone and beyond / official