The “AI Assessment Scale” is a five-point scale created by Leon Furze to help educators clarify the appropriate level of generative AI (GenAI) use in their assessments. Its author breaks it down as follows:
No AI - Brainstorming and Ideas - Outlining and Notes - Feedback and Editing - Full AI
While it has many benefits and was an important contribution to our collective reflection on how to help teachers and students integrate GenAI effectively and appropriately in an educational context, it also has limitations that ultimately call for a revised AI Assessment Scale. Here, I propose two alternatives: an “Intensity” scale and a “Competency” scale.
Leon Furze was kind enough to review a draft of this post and share that he is currently working with two co-authors on a refined version of his original scale for both K-12 and tertiary contexts that aligns with some of the ideas below. — An article to watch out for!
Limitations of the Original AI Assessment Scale
A first limitation of the Original AI Assessment Scale is that it is not explanatory. While Levels 1 (“No AI”) and 5 (“Full AI”) are logical, no particular reason is given for the choice of the three intermediate levels. Where does this breakdown come from? What justifies it? This lack of explanation means that using the scale does not really promote understanding. By contrast, an improved scale would stem from a general principle and thus give teachers and students not only a document, but a conceptual rule they can apply in different contexts.
Such a general principle would also help teachers decide which “level” of GenAI use is appropriate for a particular assessment, and help students understand why it comes with such instructions. To be fair, the AI Assessment Scale does come with recommendations. For instance, Leon Furze indicates that Level 2, “Brainstorming and Ideas”, is suitable “for assessments where students need to demonstrate their writing skills…”. This attests, once again, to the quality of this instrument, which, I believe, assumes the required general principle, but only implicitly.
Because it is only implicit, this general principle is not the organizing factor of the scale, which is its second limitation: the fact that it follows a chronological, rather than a logical order. What might justify the 5 levels of the scale is that they are steps in the writing process: brainstorming, outlining, drafting, finalizing. However, such a sequence is not additive. Allowing students to use AI to turn ideas into an outline (Level 3) does not mean that they are allowed to use it to generate these ideas at the previous step (Level 2). In that sense, “level” is probably not the right word to use here. Rather, these are different types of use of GenAI; or different steps at which it can be used. As a matter of fact, Leon Furze seems to agree when he writes:
“It might also be appropriate to break a task down into different elements, and apply different levels of the Scale to different parts of the assessment”.
The problem is, if the scale is organized by parts of an assessment (brainstorming, outlining, etc.), then “apply[ing] different levels of the Scale to different parts of the assessment” is tautological, or redundant. It is the same thing as breaking down the process for the students and telling them whether they can use AI, or not, for each step. While this is useful, this is not a scale, but a binary yes/no, which does not help students (or teachers) understand what the proper use of GenAI is, and why.
For example, when would “Feedback and Editing” (Level 4, or rather step or type 4) ever not be allowed? The only cases I can think of are instances where this would amount to AI giving the student new ideas (Level 2), proposing a better organization (Level 3), or doing part of the writing for them (Level 5). In these scenarios, however, the issue is not the stage at which AI is used, but the extent to which it is used at that stage. As I explained above, the “levels” of the AI Assessment Scale are not additive, which means that it is not a scale of the amount of cognitive work offloaded onto AI. It dictates “when” AI can be used at different and independent steps of a generative process, but not “how much”.
As a side note, it could be useful to add a Level 1.5 between 1 and 2, or a parallel level running alongside all the others, for the “meta” use students can make of AI to manage their time, plan their work, practice skills, manage their state of mind, etc. Likewise, Level 2 should probably include the use of AI for research purposes, whether identifying or analyzing sources.
A New, Revised AI Assessment Scale
To address these limitations and build on the benefits already provided by the AI Assessment Scale, I would propose a new, revised version.
As explained above, an improved scale would:
Stem from an explicit general principle, making it explanatory;
Follow a logical rather than a merely chronological order;
And indicate “how much” AI can be used, not just “when”.
The fundamental idea is that “human beings use means to achieve ends, transforming inputs into outputs in accordance with certain values”, and that AI is one such technological means — although one that differs from others due to its capacity to imitate and automate our ability to transform inputs into outputs based on their meaning.
From there, the logical consequences are that AI can make us more efficient, effective, and innovative; and that it can also contradict our values, undermine our end-goals, marginalize our other means to reach them (such as our own skills), and call for the development of new competencies.
How does this translate into an AI Assessment Scale?
First, it helps us create an “intensity scale” of the extent to which GenAI can be used. Such a scale can be logically derived from the definition of both AI technologies and human action as follows:
Level 0 = Not using AI (treating AI as a Stranger)
Level 1 = Using AI to generate inputs that you turn into an output (treating AI as a Mentor)
Level 2 = Using AI to generate an output based on your inputs (treating AI as an Assistant)
Level 3 = Using AI to generate an output based on your instructions (treating AI as an Executant)
Unlike the original AI Assessment Scale, this order of intensity can be applied to any step of the writing process (or any other generative process). Saying “you can use AI at Level 3 to create your outline” was tautological, because Level 3 was about creating outlines. The only other option available was Level 1 (not being allowed to use AI). More than a scale, this was a yes/no 1-point rubric. Such is not the case here. Students can be allowed to create an outline:
Without AI (Level 0);
By turning AI-generated inputs into an outline themselves (Level 1);
By having AI generate the outline from their own notes and ideas (Level 2);
Or by having AI generate it from their instructions alone (Level 3).
How, then, do we decide which level of AI use is appropriate for a given assessment? The same general principle provides an answer:
Since AI use should not contradict our values, undermine our end-goals, or marginalize other means necessary to reach them, such as our own skills;
And since we value academic and intellectual honesty, and aim to help students develop and demonstrate specific understandings and skills;
The rational conclusion is that:
AI can be used to facilitate and enhance learning and the demonstration of learning;
But AI should not be used to bypass learning and pretend to possess the targeted understandings and skills.
Which means that AI can be used as long as it does not perform for the student the specific cognitive operations that they are assessed on.
An AI Competency Scale
While this Revised AI Assessment Scale has many benefits, an even better instrument would not merely prescribe “how much” GenAI can be used, but describe “how well” it should be used.
To promote the effective and appropriate use of GenAI, and to ensure that both stem from an understanding of its potential and limitations, the best approach is not to use a chronological scale (“when” to use it, at what stage in the generative process - the original AI Assessment Scale), or even an intensity scale (“how much” to use it, to what extent at each stage - the Revised Scale), but a competency scale.
Based on the philosophy outlined above, such an AI literacy scale would look something like this:
Level 0: Using AI without advanced input (instructions designed to elicit a relevant, accurate, and high-quality output), and without critical evaluation and refinement of this output
Level 1: Using AI with advanced input, but without critical evaluation or refinement of the output
Level 2: Using AI with critical evaluation and refinement of the output, but without advanced input
Level 3: Using AI with both advanced input and critical evaluation and refinement of the output
Students should, of course, always use GenAI in a way that has them think deeply before and after using it: providing it with advanced instructions, and critically evaluating and revising its products. This was not reflected in the original AI Assessment Scale. Saying that “AI can be used to outline entire responses or convert notes into organised ideas” (the descriptor for Level 3, “Outlining and Notes”) does not indicate that students should properly prompt AI and treat its outputs as starting points. The Revised Scale does refer to the student providing input to the AI and/or treating what it generates as an intermediary product, but its different levels become rather irrelevant once it is understood that students should always provide high-quality inputs (instructions) and turn whatever help they receive from AI into the final output themselves.
If students use GenAI this way (or rather, know that they are expected to and, just as importantly, know how to do so), then the meaning of an AI Assessment Scale changes. It no longer stipulates when or how much students can use AI in an assessment, but rather assesses how well they are able to use it to demonstrate (or, in the case of a formative assessment, acquire) the intended learning outcomes. Arguably, all assessments should simply come with the understanding that students can use AI “when” (at whatever step in the generative process) and “how much” they want, as long as they use it well: that is, as long as the targeted thinking (cognitive operations) is not delegated to the machine, but performed before and after its use (thinking “outside the bot”) in the students’ initial input (advanced instructions) and final output (critical evaluation and revision of AI-generated intermediary products).
Note that this GenAI competency scale only touches on one aspect of AI literacy. Students should not only be able to use generative AI effectively, but also safely and ethically. And their knowledge, understanding, skills, and dispositions should transfer to other AI technologies.
It does serve a specific purpose, though, and I believe that it does so better than the original AI Assessment Scale. I am excited to see the revised version Leon Furze is currently working on, and hope that these and other contributions will help schools navigate the disruptions (positive and negative) created by AI and integrate this technology in a way that harnesses its full benefits and mitigates its inherent risks.