SteLLA: A Structured Grading System Using LLMs with RAG

A structured automated grading and feedback system that applies LLMs with Retrieval Augmented Generation (RAG) techniques.

Challenge to Address

Large Language Models (LLMs) have shown remarkable general capabilities across many applications. However, making them reliable tools for specific tasks, such as automated short answer grading (ASAG), remains a challenge.

Proposed Model

We propose SteLLA (Structured Grading System Using LLMs with RAG), in which (a) a Retrieval Augmented Generation (RAG) approach empowers LLMs on the ASAG task by extracting structured information from highly relevant and reliable external knowledge, namely the instructor-provided reference answer and rubric, and (b) an LLM performs a structured, question-answering-based evaluation of student answers to produce analytical grades and feedback. Experiments on a dataset collected from a college-level Biology course show that the proposed system achieves substantial agreement with the human grader while providing breakdown grades and feedback on all the knowledge points examined in each problem. A systematic analysis of the feedback generated by GPT-4 provides insights into the use of LLMs in ASAG systems.
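The two-step pipeline above can be sketched in Python. This is only an illustrative approximation, not the authors' implementation: the function names, the rubric format, and the keyword-overlap "judge" (a stand-in for the actual LLM call) are all assumptions.

```python
# Hedged sketch of a SteLLA-style structured grading pipeline.
# All names, prompt shapes, and the toy judge are illustrative assumptions.

def extract_knowledge_points(reference_answer, rubric):
    """Step (a): derive structured question/expected-answer pairs from the
    instructor-provided reference answer and rubric. SteLLA does this with
    RAG over reliable external knowledge; this stub simply treats each
    rubric line as one knowledge point."""
    points = []
    for line in rubric.strip().splitlines():
        criterion = line.strip("- ").strip()
        points.append({
            "question": f"Does the answer address: {criterion}?",
            "expected": criterion,
        })
    return points


def grade_student_answer(student_answer, knowledge_points, judge):
    """Step (b): ask a judge (an LLM in SteLLA) each structured question
    about the student answer, collecting per-point grades and feedback."""
    results = []
    for kp in knowledge_points:
        verdict = judge(kp["question"], kp["expected"], student_answer)
        results.append({**kp, **verdict})
    total = sum(r["score"] for r in results)
    return {"points": results, "total": total, "max": len(results)}


def keyword_judge(question, expected, answer):
    """Toy keyword-overlap judge standing in for the LLM evaluation."""
    hit = any(w.lower() in answer.lower() for w in expected.split())
    return {"score": 1 if hit else 0,
            "feedback": "covered" if hit else f"missing: {expected}"}


rubric = """- photosynthesis converts light energy
- chlorophyll absorbs light"""

kps = extract_knowledge_points("(reference answer text)", rubric)
report = grade_student_answer("Chlorophyll is a pigment.", kps, keyword_judge)
# report holds a per-knowledge-point breakdown plus an overall score,
# mirroring the analytical grades and feedback described above.
```

In the real system the judge would be an LLM prompted with the question, the expected content retrieved via RAG, and the student answer; the per-point structure is what allows breakdown grades rather than a single holistic score.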