Scaling Feedback Quality: Lessons from 50,000 Assessed Submissions
By Dr. Priya Ramachandran
After 50,000 submissions processed across 34 institutions in our first full year of operation, we have enough data to say something substantive about where AI-assisted feedback creates genuine educational value — and where it doesn't.
This post covers the findings we think are most actionable for assessment coordinators and teaching teams considering a shift to AI-assisted marking workflows.
Feedback Volume Increased Substantially
The most consistent finding across institutions was that AI assistance dramatically increased the volume of feedback students received. Before Graddle, the median feedback length for a 1,500-word undergraduate essay at partner institutions was 67 words. After one semester of AI-assisted marking, the median rose to 203 words — a threefold increase.
This is partly a function of AI making it easier for markers to generate text, but it also reflects a subtler effect: when the AI has already drafted a first pass at criterion-specific feedback, markers are more willing to expand and personalise it than they would be starting from a blank text box at the end of a long marking session.
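For teams that want to replicate the volume metric on their own corpus, here is a minimal sketch. The word counts are illustrative stand-ins (in practice they would come from stored comment text), and nothing here reflects Graddle's internal pipeline.

```python
# Minimal sketch: median feedback length before and after AI-assisted
# marking. Counts are illustrative; real inputs would be
# len(comment.split()) computed per stored feedback comment.
from statistics import median

pre_ai_counts = [52, 60, 67, 71, 88]        # words per essay's feedback, pre-adoption
post_ai_counts = [181, 195, 203, 224, 260]  # words per essay's feedback, post-adoption

before, after = median(pre_ai_counts), median(post_ai_counts)
print(f"{before} -> {after} words ({after / before:.1f}x)")  # 67 -> 203, ~3.0x
```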
Feedback Specificity Drives Improvement
Increased volume is only valuable if the content is good. We tracked student revision quality across two assignment iterations — an initial submission and a resubmission — for a subset of 8,400 students who received either generic feedback (e.g. "argument could be clearer") or specific, criterion-referenced feedback (e.g. "the argument in paragraph 3 conflates correlation with causation — see the lecture 4 discussion of Simpson's Paradox for the distinction").
Students who received specific feedback improved their resubmission score by an average of 9.4 percentage points. Those who received generic feedback improved by 3.1 points. The difference was consistent across disciplines, year levels, and prior GPA bands — suggesting specificity is the active ingredient, not just a proxy for other factors.
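For assessment teams who want to run the same comparison on their own exported records, here is a rough sketch of the analysis shape. The column names (feedback_type, initial_score, resubmission_score) and the values are hypothetical illustrations, not our schema or cohort data.

```python
# Hypothetical sketch: mean resubmission improvement by feedback condition.
import pandas as pd

records = pd.DataFrame({
    "student_id":         [1, 2, 3, 4],
    "feedback_type":      ["specific", "generic", "specific", "generic"],
    "initial_score":      [61.0, 58.0, 70.0, 66.0],
    "resubmission_score": [71.0, 60.5, 78.8, 69.7],
})

# Improvement in percentage points per student, averaged by condition;
# the full analysis also stratified by discipline, year level, and GPA band.
records["improvement"] = records["resubmission_score"] - records["initial_score"]
print(records.groupby("feedback_type")["improvement"].mean())
```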
Speed of Return Matters More Than Expected
A secondary analysis looked at the relationship between feedback turnaround time and resubmission improvement rates. Students whose work was returned within 48 hours of the deadline improved at roughly double the rate of students whose work was returned after two weeks, holding feedback quality constant.
The mechanism is likely motivational rather than purely cognitive: students are more invested in acting on feedback when the assignment is still contextually fresh. AI-assisted marking enables faster turnaround at scale, which means faster feedback cycles, and those cycles compound over a semester in ways that manual marking at volume simply cannot match.
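To make "holding feedback quality constant" concrete, here is a hedged sketch of the kind of regression involved: improvement regressed on turnaround time with a feedback-quality covariate. The column names, the quality scale, and the values are assumptions for illustration, not our estimation code.

```python
# Illustrative sketch: isolate the turnaround effect by including a
# feedback-quality covariate in an ordinary least squares model.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "turnaround_hours": [24, 36, 48, 120, 240, 336],
    "quality_score":    [4.1, 3.8, 4.0, 3.9, 4.2, 3.7],  # assumed 1-5 rubric-aligned rating
    "improvement":      [9.8, 8.9, 8.1, 5.2, 3.9, 3.0],  # percentage points on resubmission
})

X = sm.add_constant(df[["turnaround_hours", "quality_score"]])
fit = sm.OLS(df["improvement"], X).fit()
# A negative turnaround_hours coefficient means slower return, smaller gains,
# after adjusting for how good the feedback itself was.
print(fit.params)
```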
Discipline Variation Is Real
AI assistance worked best in disciplines with structured, criterion-specific rubrics — law, economics, nursing clinical documentation, and most STEM fields. It worked least well in creative writing, fine arts criticism, and highly contextual historical analysis, where the quality of a response often depends on tacit domain knowledge that is difficult to encode in a rubric.
This isn't a surprising finding, but it's worth stating explicitly: AI-assisted marking is not a universal solution. It is a very strong fit for a large proportion of higher education assessment, and a poor fit for a meaningful minority. Knowing which category your courses fall into is the first step in making an honest adoption decision.
What We're Building Toward
The data pattern that's most shaping our product roadmap is the compound effect of fast, specific feedback across multiple assessment points in a single semester. Students who received AI-assisted feedback on three or more assignments in the same unit showed markedly steeper improvement trajectories than students who received it on only one or two. This suggests that the feedback loop, not any single feedback event, is the key variable.
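One way to quantify "steeper trajectories" is a least-squares slope through each group's scores across successive assessment points. The sketch below uses hypothetical score sequences and group labels rather than cohort data.

```python
# Hypothetical sketch: improvement trajectory as the slope of a fitted
# line through (assessment index, score) for each feedback-exposure group.
import numpy as np

scores = {
    "three_plus_assignments": [58, 64, 71, 77],  # AI-assisted feedback on 3+ assignments
    "one_or_two_assignments": [59, 61, 63, 64],
}

for group, seq in scores.items():
    slope = np.polyfit(range(len(seq)), seq, 1)[0]  # points gained per assessment
    print(f"{group}: {slope:+.1f} points per assessment")
```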
Our focus for the coming year is building the tools that help institutions design cohesive assessment sequences, not just better individual marking events. That means cross-assignment analytics that surface longitudinal student development patterns, rubric recommendation tools that encourage specificity, and tighter integration with LMS grade books so the feedback pipeline has no friction points.
We'll publish a full methodology note alongside this data in our forthcoming impact report. If you're interested in the statistical analysis or want to contribute your institution's data to the research cohort, reach out at research@graddle.app.