Exercise Recommendation Benchmark

UniER: A Unified Benchmark for Item-level and Path-level Exercise Recommendation

UniER evaluates personalized exercise recommendation under one framework by aligning item-level recommendation and path-level learning path generation with Weighted Cognitive Gain.


9 public datasets

18 representative methods

4 generation families

2 WCG task settings

Overview

UniER studies personalized exercise recommendation across two paradigms: ILER, which focuses on immediate item-level recommendation, and PLER, which constructs coherent learning paths. The benchmark addresses the lack of unified evaluation for fair cross-paradigm comparison.

Evaluation Gap

ILER and PLER share educational goals but have historically relied on incompatible metrics, datasets, and experimental settings, which makes comparison across methods and across paradigms difficult.

UniER Contribution

UniER introduces Weighted Cognitive Gain and a unified experimental setting to compare ILER and PLER methods across effectiveness, generalizability, robustness, and efficiency.

Figure: Overview of the UniER benchmark framework.

RQ1

How do ILER and PLER perform under the unified WCG metric?

RQ2

How does performance degrade under data sparsity and cold-start?

RQ3

How robust are representative models against underlying label noise?

RQ4

What is the trade-off between computational cost and pedagogical performance?

Paradigms

UniER brings two exercise recommendation paradigms into a shared output space: every evaluated method is represented as an ordered learning path under a cognitive budget of N steps.

Item-Level Exercise Recommendation

ILER recommends a Top-K set of exercises for a student's immediate cognitive needs. End-to-end methods directly map unmastered knowledge concepts to exercises, while two-stage methods separate candidate retrieval from diversified re-ranking.

Path-Level Exercise Recommendation

PLER generates an ordered exercise sequence toward pedagogical goals. Full-path methods generate the whole path at once, while step-by-step methods use sequential decisions, often with reinforcement learning.

Unified Output

To compare the paradigms fairly, UniER projects ILER outputs into sequential learning paths by sorting the recommended exercises in descending order of predicted recommendation scores. PLER outputs are already ordered paths.
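
The projection described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the function names, the tie-breaking rule (by exercise id), and truncating to the budget N are assumptions made here for the example.

```python
from typing import Dict, List


def iler_to_path(scores: Dict[str, float], budget: int) -> List[str]:
    """Turn item-level scores into an ordered path of at most `budget` exercises."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Sort by descending predicted score; ties are broken by exercise id
    # (an assumption, chosen only to make the output deterministic).
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [exercise_id for exercise_id, _ in ranked[:budget]]


def pler_to_path(path: List[str], budget: int) -> List[str]:
    """PLER output is already ordered, so only the budget is applied."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    return list(path[:budget])


# Hypothetical scores for four exercises.
scores = {"e1": 0.2, "e2": 0.9, "e3": 0.5, "e4": 0.9}
print(iler_to_path(scores, budget=3))  # ['e2', 'e4', 'e3']
```

Once both paradigms produce a list of exercise ids of length at most N, the same path-level metric can be applied to either one.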

Datasets

UniER evaluates methods on nine widely used public educational datasets spanning different domains, scales, and knowledge concept (KC) structures.

Dataset	#Interactions	#Students	#Exercises	#KC	Description
ASSISTments2009	346,860	4,217	26,688	123	Knowledge-tracing data from the ASSISTments online tutoring platform.
ASSISTments2012	6,123,270	46,674	179,999	265	ASSISTments data from a different school year than the 2009 release.
ASSISTments2017	942,816	686	3,162	102	Subset from the ASSISTments Data Mining Competition.
Algebra2005	813,661	575	1,084	112	KDD Cup 2010 algebra logs with knowledge concept annotations.
Bridge2006	3,686,871	1,146	19,258	493	KDD Cup 2010 bridge-to-algebra logs with long-term interaction volume.
NIPS34	1,382,727	4,918	948	57	NeurIPS 2020 Education Challenge diagnostic math questions with skill labels.
Junyi	25,925,992	247,606	722	41	Student problem-solving logs, exercise metadata, and annotated exercise relations.
EdNet	95,293,926	784,309	13,169	188	Large-scale Santa AI tutoring data with diverse student activities over more than two years.
XES3G5M	5,549,635	18,066	7,652	865	Real-world K-12 online math learning data with auxiliary metadata and hierarchical KCs.
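
One way to read the table with data sparsity in mind is the average number of interactions per student. The sketch below computes it from the figures listed above; the choice of this ratio as a sparsity indicator is an assumption of this example, not a statistic reported by UniER.

```python
# (interactions, students) copied from the dataset table above.
DATASETS = {
    "ASSISTments2009": (346_860, 4_217),
    "ASSISTments2012": (6_123_270, 46_674),
    "ASSISTments2017": (942_816, 686),
    "Algebra2005": (813_661, 575),
    "Bridge2006": (3_686_871, 1_146),
    "NIPS34": (1_382_727, 4_918),
    "Junyi": (25_925_992, 247_606),
    "EdNet": (95_293_926, 784_309),
    "XES3G5M": (5_549_635, 18_066),
}

# Average interactions per student for each dataset.
per_student = {
    name: interactions / students
    for name, (interactions, students) in DATASETS.items()
}

# Print datasets from fewest to most interactions per student.
for name, value in sorted(per_student.items(), key=lambda kv: kv[1]):
    print(f"{name:<16} {value:10.1f}")
```

Datasets at the top of the printed list have short per-student histories on average, while those at the bottom have long ones.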

Evaluation

Weighted Cognitive Gain is the primary metric in UniER. It measures expected weighted improvement in knowledge mastery before and after executing a generated learning path.

Weighted Cognitive Gain

WCG uses a task-specific target weight distribution over knowledge concepts. The normalized weight vector lets the same metric express different pedagogical goals.

WCG(T_unified | w) = ∑_{c ∈ C} w_c · [ M_KT^(c)(H_{s,t} ⊕ Sim(T_unified)) − M_KT^(c)(H_{s,t}) ]

Here, C is the set of knowledge concepts, H_{s,t} is the student's interaction history, Sim(T_unified) is the simulated response sequence after executing the path, M_KT^(c) estimates mastery of concept c, and w_c controls the task-specific importance of that concept.
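
The formula can be sketched in Python as follows. Everything beyond the formula itself is an assumption for illustration: the interaction encoding, the `toy_kt_model` (a simple per-concept accuracy, standing in for a trained knowledge tracing model), and the exercise-to-concept map are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Interaction = Tuple[str, int]  # (exercise_id, 1 if correct else 0); hypothetical encoding
Mastery = Dict[str, float]     # concept -> estimated mastery in [0, 1]

# Hypothetical exercise-to-concept map used only by the toy model below.
EXERCISE_TO_KC = {"e1": "fractions", "e2": "fractions", "e3": "decimals"}


def toy_kt_model(interactions: List[Interaction]) -> Mastery:
    """Stand-in for M_KT: per-concept accuracy, 0.0 for unseen concepts."""
    totals = {kc: [0, 0] for kc in set(EXERCISE_TO_KC.values())}
    for exercise_id, correct in interactions:
        kc = EXERCISE_TO_KC[exercise_id]
        totals[kc][0] += correct
        totals[kc][1] += 1
    return {kc: (hits / n if n else 0.0) for kc, (hits, n) in totals.items()}


def weighted_cognitive_gain(
    kt_model: Callable[[List[Interaction]], Mastery],
    history: List[Interaction],
    simulated: List[Interaction],
    weights: Dict[str, float],
) -> float:
    """Weighted mastery change between H and H extended by Sim(T)."""
    before = kt_model(history)
    after = kt_model(history + simulated)  # list concatenation plays the role of ⊕
    return sum(w * (after[c] - before[c]) for c, w in weights.items())


history = [("e1", 0), ("e3", 1)]
simulated = [("e2", 1), ("e1", 1)]  # simulated responses along the path
weights = {"fractions": 0.75, "decimals": 0.25}
gain = weighted_cognitive_gain(toy_kt_model, history, simulated, weights)
print(round(gain, 3))  # 0.5
```

In this toy run only the fractions concept improves (from 0 to 2/3), so the gain is 0.75 × 2/3 = 0.5.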

Two WCG Settings

TGA evaluates targeted goal achievement by weighting a specific target concept set.

GPP evaluates global proficiency promotion by weighting all concepts the student has not yet mastered.
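
The two settings differ only in how the weight vector w is built. The sketch below is one possible construction under stated assumptions: weights are uniform over the selected concepts, and "not yet mastered" is taken to mean estimated mastery below a threshold of 0.5. Neither choice is confirmed by the text above.

```python
from typing import Dict, Iterable


def tga_weights(concepts: Iterable[str], targets: Iterable[str]) -> Dict[str, float]:
    """TGA: put all weight, uniformly, on the target concept set."""
    concepts = list(concepts)
    target_set = set(targets) & set(concepts)
    if not target_set:
        raise ValueError("at least one target concept must be a known concept")
    share = 1.0 / len(target_set)
    return {c: (share if c in target_set else 0.0) for c in concepts}


def gpp_weights(mastery: Dict[str, float], threshold: float = 0.5) -> Dict[str, float]:
    """GPP: put all weight, uniformly, on concepts below the mastery threshold."""
    unmastered = [c for c, m in mastery.items() if m < threshold]
    if not unmastered:
        # Nothing left to improve; return all-zero weights (an assumption).
        return {c: 0.0 for c in mastery}
    share = 1.0 / len(unmastered)
    return {c: (share if m < threshold else 0.0) for c, m in mastery.items()}


mastery = {"a": 0.2, "b": 0.8, "c": 0.4}  # hypothetical mastery estimates
print(tga_weights(mastery.keys(), targets=["b"]))
print(gpp_weights(mastery))
```

Either dictionary can then be used as the weight vector w when computing WCG.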

Additional Metrics Considered in UniER

NDCG: Position-aware ranking quality.

F1-score: Balance of precision and coverage for cognitive deficits.

Recall: Coverage of unmastered knowledge concepts.

MRR: First occurrence position of a weak KC.

Novelty: Ability to recommend less familiar exercises or KCs.

Proximity: Difficulty match between exercises and student level.

Volatility: Difficulty fluctuation between consecutive exercises.

Accuracy: Appropriateness of recommended exercise difficulty.

MAP: Ranking accuracy across weak KCs.

Diversity: Dissimilarity across recommended exercises.
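
To make three of these metrics concrete, here is a small Python sketch. The one-line descriptions above do not fix exact formulas, so the definitions below (reciprocal rank of the first exercise touching a weak KC, fraction of weak KCs covered, mean absolute difficulty change between consecutive exercises) are assumptions and may differ from the ones used in UniER.

```python
from typing import Dict, List, Set


def mrr_weak_kc(path: List[str], exercise_kcs: Dict[str, Set[str]], weak: Set[str]) -> float:
    """Reciprocal rank of the first exercise that covers any weak KC."""
    for rank, exercise_id in enumerate(path, start=1):
        if exercise_kcs[exercise_id] & weak:
            return 1.0 / rank
    return 0.0


def weak_kc_recall(path: List[str], exercise_kcs: Dict[str, Set[str]], weak: Set[str]) -> float:
    """Fraction of weak KCs covered by at least one exercise on the path."""
    if not weak:
        return 0.0
    covered = set().union(*(exercise_kcs[e] for e in path)) & weak
    return len(covered) / len(weak)


def volatility(path: List[str], difficulty: Dict[str, float]) -> float:
    """Mean absolute difficulty change between consecutive exercises."""
    if len(path) < 2:
        return 0.0
    steps = [abs(difficulty[b] - difficulty[a]) for a, b in zip(path, path[1:])]
    return sum(steps) / len(steps)


# Hypothetical path, KC annotations, and difficulty estimates.
path = ["e1", "e2", "e3"]
exercise_kcs = {"e1": {"a"}, "e2": {"b"}, "e3": {"b", "c"}}
weak = {"b", "c", "d"}
difficulty = {"e1": 0.2, "e2": 0.6, "e3": 0.5}

print(mrr_weak_kc(path, exercise_kcs, weak))
print(weak_kc_recall(path, exercise_kcs, weak))
print(volatility(path, difficulty))
```

In this example the first weak KC appears at position 2, two of the three weak KCs are covered, and the difficulty moves by 0.4 and then 0.1 between steps.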

Methods

UniER benchmarks 18 representative methods across four generation families.

End-to-end ILER

Direct recommendation from diagnosed student state or unmastered concepts.


Two-stage ILER

Candidate retrieval followed by filtering, re-ranking, or diversity optimization.


Full-path PLER

Encoder-decoder generation of the entire learning path in a single step.


Step-by-step PLER

Sequential learning path generation through dynamic decisions and feedback.


Contributors

Mentors

Quanlong Guan

JNU Professor

Shirui Pan

Griffith University Professor

Liangda Fang

JNU Associate Professor

Jiapu Wang

NJUST Associate Professor

Yixin Liu

Griffith University Lecturer

Graduate Students

Xinghe Cheng

PhD Student @ JNU (09/2023-)

Guiyong Zhuang

MS Student @ JNU (09/2025-)

Yusheng Xie

MS Student @ JNU (09/2025-)

Undergraduate Students

Danni Wang

Undergrad Student @ JNU (09/2024-)

Longhui Jiang

Undergrad Student @ JNU (09/2023-)
