Claims of AI achievements need proper assessment
Image courtesy of mikemacmarketing/www.vpnsrus.com
Google recently released a new iteration of its AI large language model PaLM2.
To demonstrate its capabilities, developers had it 'sit' advanced language assessments in German, Italian, Spanish, French, Chinese and Japanese.
It passed all at ‘mastery’ level.
That's an impressive result, on the surface. But the fine detail raised my concerns about the way AI systems' performances are being assessed and portrayed.
Did PaLM2 actually pass a language test?
The technical report shows PaLM2 passed part of a recognised test using a changed scoring system, marked by non-experts and awarded a sometimes dubious percent score.
To explain my reasoning, let’s drill down into PaLM2’s performance using the PLIDA Italian language proficiency test it sat as an example.
Firstly, PLIDA assesses Reading, Listening, Writing and Speaking.
PaLM2 cannot be assessed on Speaking, and the transcript of the Listening component was used as a Reading test.
That means it did not sit the whole test, which casts doubt over the 'overall' assessment.
How was PaLM2's language assessment marked?
Now we come to the scoring system.
PLIDA scores each of its components out of 30 while PaLM2’s Writing was scored out of 5.
The only question asked of the markers was something like ‘does this read like a native speaker wrote it?’
Imagine you’re a marker and think PaLM2 did ‘OK’. Does 3 out of 5 seem right? How about 16 out of 30 using PLIDA’s system?
In the first case, with little room for nuanced marking, PaLM2 passes with 60%.
For the second, marked with more subtlety, 16 out of 30 is 53% and a fail.
This is why assessment scoring systems matter.
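The granularity gap between the two scales can be sketched in a few lines of Python. The ‘OK’ judgements here are the hypothetical ones from the example above, not scores from the actual report:

```python
# A marker who thinks the work is merely 'OK' must still pick a whole-number
# score. On a 5-point scale the nearest options straddle the pass mark;
# a 30-point scale allows finer judgements below it.
def as_percent(score, maximum):
    """Convert a raw score to a percentage, rounded to one decimal place."""
    return round(100 * score / maximum, 1)

# Hypothetical 'OK' judgements on each scale
print(as_percent(3, 5))    # 60.0 -- a pass on the coarse 5-point scale
print(as_percent(16, 30))  # 53.3 -- a fail on PLIDA's finer 30-point scale

# The step size between adjacent scores shows the granularity gap
print(as_percent(1, 5))    # 20.0 percentage points per step
print(as_percent(1, 30))   # 3.3 percentage points per step
```

A single step on the 5-point scale moves the result by 20 percentage points, so the scale simply cannot express a borderline fail in the low 50s.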
Who marked PaLM2's work?
Then there’s the humans marking it.
‘Non-expert’ native speakers assessed the written test. They were not described as assessment experts qualified to rate others’ performances.
Professional markers have to make sense of marking concepts such as ‘adequacy of the register to the situational context.’ This is what they are trained to do.
It is not clear the non-experts looking at PaLM2's responses were asked, or able, to do this.
Scoring PaLM2 does not always add up
Finally, we’ll move on to the scores I described as ‘sometimes dubious.’
Three people marked the Writing out of 5 and the average was used to calculate a percentage.
For Chinese and French, PaLM2 scored an impressive 82% and 85% respectively.
However, these percentages are unachievable from an average of three whole-number scores out of 5. Even allowing for rounding, the achievable values jump from 80% to 87%, ruling out any score in between.
These are small errors, but how many others may there be?
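The claim is easy to verify by brute force. Assuming each of the three markers gives a whole-number score from 0 to 5 (which is how such rubrics are normally scored), every achievable percentage is a multiple of 100/15:

```python
from itertools import product

# Enumerate every percentage reachable by averaging three whole-number
# scores out of 5, then converting that average to a rounded percentage.
achievable = sorted({
    round(100 * sum(scores) / 3 / 5)
    for scores in product(range(6), repeat=3)
})

print(achievable)
# The values step in increments of 100/15 (about 6.7 points), so the
# upper end of the list runs ... 73, 80, 87, 93, 100.

# The reported Chinese and French figures fall in the 80-87 gap
print(82 in achievable, 85 in achievable)  # False False
```

Neither 82% nor 85% appears in the achievable set, which is what makes those reported figures puzzling.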
Was there a lack of assessment know-how applied to PaLM2?
I may be accused of being unreasonably picky. Sure, these are unofficial results to demonstrate PaLM2’s capabilities, but the way Google casually rebuilt a well-thought-out, rigorous assessment reveals a worrying issue: the lack of assessment literacy within the AI community.
Quite often, developers have no understanding of assessment and may not realise it is a standalone discipline.
Whenever they train, assess and evaluate their systems, they do so not as experts in assessment, but as experts in AI development.
This matters because these systems are increasingly required to perform tasks alongside or in place of people.
Imagine an AI model as a student being trained to do something – you’d want the training delivered by an expert in the subject and pedagogy.
Assessment, likewise, should also be carried out by an expert.
AI should be evaluated with the same rigour applied to humans on those same tasks.
There shouldn’t be free passes simply because no comparable assessments for AI exist.
If PaLM2 can act like a human, should it be assessed as one?
The more we conceptualise AI in human terms, the greater the need to bridge machine learning with human learning using assessment experts.
There’s room for optimism, however. OpenAI had qualified markers look at written assessments its GPT-4 model created.
I hope collaboration like this between assessment experts and AI professionals continues to grow and furthers research in the field.
While I have shown why I think PaLM2 didn’t actually pass language tests, knowing whether it could pass a proper, real-world language test could have important repercussions for how we assess people.