In Part One, I suggested that coursebook-driven ELT is a prime example of the commodification of education. Here, in Part Two, I focus on the Common European Framework of Reference (CEFR) and high-stakes tests. The global adoption of coursebook-driven ELT is illustrated by the increasing use of the CEFR, which informs not just coursebooks but also the high-stakes tests which loom large in the background. I rely mostly on the work of Glenn Fulcher and on Jordan & Long (2022).
18.1. As Fulcher (2010) argues, citing Bonnet (2007), the CEFR is increasingly being used to promote a move towards “a common educational policy in language learning, teaching and assessment, both at the EU level and beyond”. The rapid spread of the CEFR across Europe and other parts of the world is due to the ease with which it can be used in standards-based assessment. As a policy tool for harmonization, the CEFR is manipulated by “juggernaut-like centralizing institutions”, which are using it to define required levels of achievement for school pupils as well as adult language learners worldwide.
The indiscriminate exportation of the CEFR for use in standards-based education and assessment in non-European contexts, such as Hong Kong and Taiwan, shows that it is increasingly being used as an instrument of power (Davies 2008: 438).
18.2. Fulcher (2008: 170) nails the problem of the CEFR. It requires a few seconds’ close reading, if you’ll forgive me, to appreciate its full import.
It is a short step for policy makers, from ‘the standard required for level X’ to ‘level X is the standard required for….’ (emphasis added).
Fulcher (ibid.) comments: “This illegitimate leap of reasoning is politically attractive, but hardly ever made explicit or supported by research. For this step to take place, a framework has to undergo a process of reification, a process defined as “the propensity to convert an abstract concept into a hard entity” (Gould 1996: 27)”.
18.3. The CEFR scale descriptors are based entirely on intuitive teacher judgments rather than on samples of performance. The scales have no empirical basis or any basis in theory, or in SLA research. They’re “Frankenstein scales”, as Fulcher calls them. We can’t reasonably expect the CEFR scale to relate to any specific communicative context, or even to provide a measure of any particular communicative language ability. To quote Fulcher (2010) again:
Most importantly, we cannot make the assumption that abilities do develop in the way implied by the hierarchical structure of the scales. The scaling methodology assumes that all descriptors define a statistically unidimensional scale, but it has long been known that the assumed linearity of such scales does not equate to how learners actually acquire language or communicative abilities (Fulcher 1996b, Hulstijn 2007, Meisel 1980). Statistical and psychological unidimensionality are not equivalent, as we have long been aware (Henning 1992). The pedagogic notion of “climbing the CEFR ladder” is therefore naïve in the extreme (Westhoff 2007: 678). Finally, post-hoc attempts to produce benchmark samples showing typical performance at levels inevitably fall prey to the same critique as similar ACTFL studies in the 1980s, that the system states purely analytic truths: “things are true by definition only” (Lantolf and Frawley 1985: 339), and these definitions are both circular and reductive (Fulcher 2008: 170-171). The reification of the CEFR is therefore not theoretically justified.
19.1. Current English language testing uses the CEFR scale in three types of test: first, placement tests, which assign students to a CEFR level, from A1 to C2, where an appropriate course of English, guided by an appropriate coursebook, awaits them; second, progress tests, which are used to decide if students are ready or not for their next course of English; and third, high-stakes-decision proficiency tests (a multi-billion-dollar commercial activity in its own right), which are used purportedly to determine students’ current proficiency level.
The key place of testing in the ELT industry is demonstrated not just by exam preparation materials, which are a lucrative part of publishing companies’ business, but also by the fact that most courses of English provided by schools and institutes at all three educational levels start and finish with a test.
Perhaps the best illustration of how language testing forms part of the ELT “hydra” is the Pearson Global Scale of English (GSE), which allows for much more finely grained measurement than that attempted in the CEFR. In the Pearson scale, there are 2,000 can-do descriptors called “Learning Objectives”; over 450 “Grammar Objectives”; 39,000 “Vocabulary items”; and 80,000 “Collocations”, all tagged to nine different levels of proficiency (Pearson, 2019). Pearson’s GSE comprises four distinct parts, which together create what they proudly describe as “an overall English learning ecosystem” (Pearson, 2019, p.2.). The parts are:
• The scale itself – a granular, precise scale of proficiency aligned to the CEFR.
• GSE Learning Objectives – over 1,800 “can-do” statements that provide context for teachers and learners across reading, writing, speaking and listening.
• Course Materials – digital and printed materials, most importantly, series of General English coursebooks.
• Assessments – Placement, Progress and Pearson Test of English Academic tests.
As Jordan & Long (2022) comment:
Pearson say that while their GSE “reinforces” the CEFR as a tool for standards-based assessment, it goes much further, providing the definitive, all-inclusive package for learning English, including placement, progress and proficiency tests, syllabi and materials for each of the nine levels, and a complete range of teacher training and development materials. In this way the language learning process is finally and definitively reified: the abstract concepts of “granular descriptors” are converted into real entities, and it is assumed that learners move unidimensionally along a line from 10 to 90, making steady, linear progress along a list of can-do statements laid out in an easy-to-difficult sequence, leading inexorably, triumphantly, to the ability to use the L2 successfully for whatever communicative purpose you care to mention. It is the marketing division’s dream, and it shows just how far the commodification of ELT has already come.
19.3. The power of high stakes tests is exemplified by the work of the Cambridge Assessment Group. It has three major exam boards: Cambridge Assessment English, Cambridge Assessment International Education, and Oxford Cambridge and RSA Examinations. (Note that all these companies are owned by the University of Cambridge and are registered as charities, exempt from taxes!) The group are responsible for the Cambridge B2 (formerly the First Certificate Exam) and Cambridge C1 (formerly the Cambridge Advanced Exam), and also, along with their partners, for the IELTS exams, used globally as a university entrance test (the Academic module), an entrance test to many professions and job opportunities, and as a test for those wishing to migrate to an English-speaking country (the General English module).
In 2018, the Cambridge Assessment Group designed and delivered assessments to more than 8 million learners in over 170 countries, employed nearly 3,000 people in more than 40 locations around the world and generated revenue of over £382 million (tax free). More than 25,000 organizations accept Cambridge English exams as proof of English language ability, including top US and Canadian institutions, all universities in Australia, New Zealand and in the UK, immigration authorities across the English-speaking world, and multinational companies including Adidas, BP, Ernst & Young, Hewlett-Packard, Johnson & Johnson, and Microsoft. The Cambridge English exams can be taken at over 2,800 authorized exam centers, and there are 50,000 preparation centers worldwide where candidates can prepare for the exams. The impact of the Cambridge Assessment Group’s tests on millions of individual lives can be life-changing, and the scale of their activities means that they have global political, social, economic, and ethical consequences, suggesting to many that an independent body is needed to regulate them.
19.4. As indicated above, “proficiency” in the high-stakes tests is an epiphenomenon – a secondary effect or by-product of the thing itself. Overall “proficiency” is divided into levels on a proficiency rating scale, determined by groups of people who write proficiency level descriptors and decide that there are X levels on the particular scale they develop. In fact, only zero and near-native proficiency levels are truly measurable. We know this from the results of countless empirical SLA studies that have tried to identify the advanced learner, which has required the ability to distinguish near-native speakers from true native speakers. Results of these studies consistently show such distinctions are possible provided measures are sufficiently sensitive (Hyltenstam, 2016), and they demonstrate that any other distinctions along proficiency scales are unreliable.
19.5. Beyond the proficiency scale descriptors, there are numerous problems in the tests that elicit the language samples on which scores and ratings are based. For example, proficiency tests typically employ speaking prompts and reading texts which purport to have been “leveled,” i.e., judged to aim at the level concerned. This is nonsense. Apart from highly specialized material, all prompts and all texts can be responded to or read at some level; the amount of information conveyed or understood will simply vary as a function of language ability. Moreover, proficiency scales offer little in the way of diagnostic information which could indicate to teachers and learners what they would need to do to improve their scores and ratings.
19.6. There is little evidence that proficiency ratings are predictive of success in any language use domain. Even if a test taker can succeed in the testing context, there is no way to tell whether this means the person will succeed outside that context, for example in using language for professional purposes.
19.7. The administration and management of high stakes tests raises the issue of discrimination based on economic inequality. The test fees are high and vary significantly – in the IELTS tests, fees vary from the equivalent of approximately US$150 in Egypt to double that in China, a difference explained more by Chinese students’ desire to study abroad than by any international differences in administration or management costs. Such are the expenses involved in taking these tests that they evidently discriminate against those with lower economic means and make it impossible for some people to take the test multiple times in order to achieve the required score. W.S. Pearson (2019) also points out that the owners of IELTS produce and promote commercial IELTS preparation content, which takes the form of printed and on-line materials and teacher-led courses. These make further financial demands on the test-takers, and while some free online preparation materials are made available on the IELTS website, full access to the materials costs approximately US$52, and is free only for candidates who do the test or a preparation course with the British Council. Likewise, details of the criteria used to assess the IELTS writing test are only freely available to British Council candidates; all other candidates are charged approximately US$55 for this important information. Finally, it should be noted that it is common, for those who can afford it, to take the IELTS multiple times in an attempt to improve their scores, and that the score obtained in an IELTS test is only valid for two years.
19.8. The simplicity and efficiency with which high-stakes test scores can be processed strengthens the perception that the scores are used blindly by the gatekeepers of university entrance. If an overseas student does not achieve the required score, their application for admission to the university is normally turned down. Even more questionable is the use of the test by employers to assess prospective employees’ ability to function in the workplace, despite the fact that, in most cases, none of the test tasks closely corresponds with what an employee is expected to do in the job. Worst of all, band scores in the test are used by some national governments as benchmarks for migration: it is quite simply immoral to use a score on these tests to deny a person’s application for immigration.
Those who seek to study at universities abroad or to work for a number of large multinational companies, or to migrate, are forced to engage with these tests on the terms set by the test owners, conferring on the owners considerable global power and influence; and they suffer dire consequences if they fail to achieve the required mark in tests which, in a great many cases, are not fit for purpose.
I’ll give a full list of references at the end of the thesis.
These are valid points. Is there anything you would suggest as an alternative to ensure that the hundreds of thousands of students who study abroad each year have the required level of English needed to fully participate in the courses they are paying for?
I teach at a university, helping students achieve the required levels so they can study abroad. So it would be interesting to know any alternative ways of assessing students.
The short answer is: Talk to them. A 30-minute interview should do it. The long answer involves a radical change in assessment, away from tests that test knowledge about the language towards assessment procedures that find out what the testees can do with the L2.
Thanks for that. Interesting read as ever. What would be the contents of the 30-minute interview? What specifically would you be looking for?
I’d be interested to know what kind of test procedures you would recommend. We are in the process of rewriting our tests, so any ideas would be welcome.
Also, any thoughts on the impact Chat GPT will have on ESL?
Experienced teachers who engage someone for whom English is an additional language in conversation for half an hour will get a good idea of the strengths and weaknesses of their communicative competence. This can help in placement, diagnostic and proficiency evaluation.
Assessment must specify the test’s intended use. Most high-stakes English proficiency tests are unfit for purpose because they’re used to make decisions as diverse as university entrance and immigration permits. Ridiculous! It’s a business racket. They’re designed to be taken by millions, they use the absurd CEFR to put people in a “band” from A1 to C2, and they don’t include a personalised, extended interview.
Glenn Fulcher is, IMHO, the best authority on language assessment. He insists that test validity rests on fitness for purpose. For me, following Glenn and Mike Long, the best tests are task-based, criterion-referenced performance tests. See Long, 2015, Chapter 11.
As for ChatGPT, I’m intrigued, but from what I’ve read so far (not much) I’m not very impressed. I think there are plans afoot, involving Sam Gravell, Scott Thornbury and others, to have an on-line “conference” about it. Watch out for news on Twitter.