[I'm focusing here on IELTS, largely because of its importance in Australia and its familiarity to potential readers of this blog. However, the ideas in this post apply to psychometric tests more generally, including, for example, the Pearson Test of English.]
Consider this claim from the IELTS website (https://www.ielts.org/what-is-ielts/ielts-introduction):
The International English Language Testing System (IELTS) measures the language proficiency of people who want to study or work where English is used as a language of communication.
Does IELTS (or any other language test, for that matter) actually ‘measure’ anything at all?
What is 'measurement'?
To try to answer this question, I'm going to quote extensively from the work of Ludwik Finkelstein, particularly his 2005 article in the journal Measurement titled 'Problems of measurement in soft systems'. By 'soft systems', Finkelstein means "systems involving human action, perception, feeling, decisions and the like" and this "embraces much of the psychological, social and economic domains" (p. 269). In his work, Finkelstein does refer at times specifically to educational assessment as a soft system; language assessment also meets Finkelstein's definition of 'soft system'.
Measurement can be defined in the wide sense as a process of empirical, objective assignment of symbols to attributes [e.g. 'language proficiency'] of objects and events of the real world, in such a way as to represent them, or to describe them.
Description or representation means, that when a symbol, or measure, is assigned by measurement to the property of an object or event, and other symbols are assigned by the same process to other manifestations of the property, then the relations between the symbols or measures, imply and are implied by empirical relations between the corresponding property manifestations.
The term objective process in the definition of measurement means that the symbol assigned to a property manifestation by measurement must within the limits of uncertainty, be independent of the observer.
To clarify how this applies to IELTS, we could rewrite it in the following way.
- An IELTS band score is assigned by the IELTS process to Person A's language proficiency.
- Another IELTS band score is assigned by the same IELTS process to Person B's language proficiency.
- The relations between the two IELTS band scores imply and are implied by the empirical relations between Person A and Person B's language proficiency.
- To do this objectively, this process must be dependent on the empirical relations between Person A and Person B's language proficiency and not on the observation of the IELTS examiner.
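The four conditions above can be sketched in code. This is only a minimal illustration under assumptions of my own: the function names, the people "A" and "B", the band scores and the `empirical_rank` table are all hypothetical, and nothing here comes from Finkelstein or from IELTS.

```python
from itertools import combinations


def represents(scores, at_least_as_proficient):
    """True if the order of the scores implies, and is implied by,
    the empirical order of proficiency for every pair of people."""
    for a in scores:
        for b in scores:
            if (scores[a] >= scores[b]) != at_least_as_proficient(a, b):
                return False
    return True


def observer_independent(scores_by_examiner, tolerance=0.0):
    """True if every pair of examiners assigns each person the same
    score, within the stated limit of uncertainty (`tolerance`)."""
    for first, second in combinations(scores_by_examiner.values(), 2):
        if any(abs(first[p] - second[p]) > tolerance for p in first):
            return False
    return True


# Hypothetical empirical relation: a higher rank means more proficient.
# The sketch simply assumes this relation is known independently of the test.
empirical_rank = {"A": 2, "B": 1}


def at_least_as_proficient(a, b):
    return empirical_rank[a] >= empirical_rank[b]


faithful = {"A": 7.0, "B": 6.0}   # score order mirrors the empirical order
collapsed = {"A": 6.0, "B": 6.0}  # scores hide a real empirical difference

examiners = {
    "examiner_1": {"A": 7.0, "B": 6.0},
    "examiner_2": {"A": 6.5, "B": 6.0},
}

print(represents(faithful, at_least_as_proficient))
print(represents(collapsed, at_least_as_proficient))
print(observer_independent(examiners, tolerance=0.5))
print(observer_independent(examiners, tolerance=0.0))
```

Note that `represents` can only be evaluated if the empirical relation between Person A and Person B's proficiency is available from somewhere other than the test itself. Whether that is possible for language proficiency is exactly the question the rest of this post takes up.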
Problems with measurement in soft systems
Defined in this way, measurement
in systems involving human actors, is part of the application of the scientific method to their study and is based on the philosophical perspective of positivism, operationalism and their developments. The application of the scientific method in these domains is the dominant approach to their study and can claim considerable achievement. Those who use it, claim that social studies conform to the canons of scientific method and have the same status as natural science.
This approach, however, encounters philosophical criticism. It is argued that in social systems the observer and analyst [or examiner] are not objective, but operate on the basis of ideologically motivated theories. The objects of observations are humans. They have their beliefs, desires and methods of reasoning and may not be amenable to description by simple models. The understanding of their behaviour must be based on empathy and the experience of life.
There are, according to Finkelstein, further "philosophical challenges to the application of measurement to systems involving human actors" (p. 269).
Experimental processes are, in the theory of measurement, the basis of the formation of the concept of a quality, or of a scale of measurement. They rely on the observation of a single quality, or a limited number of related qualities.
Such experiments are frequently not possible, or conceivable, in soft systems.
In many systems, such as economies or societies [or 'language proficiency'] it is not possible to disturb the system for the purposes of an experiment. It is generally not possible to disintegrate the system in order to perform observations, or experiments, on its individual components. It is frequently not possible to ensure that repeated observations on a system are made under the same conditions.
Empirical knowledge of soft systems is thus commonly obtained by observation of the whole, functioning system without disturbing the system. This is commonly the case in economics, sociology and in psychological empirical evidence.
The testing of whole functioning systems presents problems. In general, only a limited set of variables is accessible to observation, other[s] must be estimated from a model of the system. The models are based on theory, and for the purposes of measurement the variables and parameters of the model are required to correspond to quantities of theoretical interest. This basis in theory weakens the empirical nature of the measurements. It is generally difficult to establish model validity. It is also difficult to estimate internal variables and system parameters in systems of even moderate complexity. (p. 270, emphasis added)
In the case of language assessment, we can develop a theory about what 'language proficiency' is and how we could measure it, but, as Finkelstein points out regarding soft systems more generally, the validity would be "difficult to establish". This includes the 'predictive validity' of a language test, i.e., the validity of using an IELTS score to predict performance in higher education. Finkelstein (p. 270) argues that:
The predictive function of models is to represent the behaviour of systems under conditions in which they have not been observed. Models of soft systems have, generally limited predictive validity.
In addition, there is the problem of self-awareness:
Systems that include human actors present, as far as their observation is concerned, a problem of self-awareness. By this is meant, that if the system is observed, and the fact of the observation and its results are known, the human actors tend to alter their behaviour. The measurement thus significantly distorts the observation. (p. 272).
And this in turn distorts any measurement which might take place.
Obviously, none of this has stopped people from developing language tests such as IELTS, but they have done so without these fundamental problems being resolved; understanding this is essential, in my view, to understanding why IELTS and other such tests have the perverse social impacts that we encounter so frequently.
'Measurement' in education
In a 2003 paper, Finkelstein considers "assignments of numbers, or other symbols, to properties in such a way as to describe them, but which are not measurements." (p. 47)
An important case of the descriptive assignment of numbers, the measurement status of which is problematic, arises in educational testing. Marks in examinations may be objective [which, in the case of IELTS, would be easier to claim for the Reading and Listening modules than for Speaking and Writing], and are based on an empirical process, but it is problematic what they measure, other than performance in a particular test. It is doubtful whether, when marks are treated as measures on a ratio scale, they are not in fact measures on an ordinal scale. This affects the meaningfulness of statistics on marks, such as the calculations of averages and the like. The conflation of marks, such as the calculation of weighted sums of marks, contains an element of subjectivity in the conflation scheme, which probably disqualifies such conflated marks from being considered measurements. (p. 47, emphasis added)
Therefore, an IELTS score may be at best "a description of the subjective judgement of the decision-maker" and certainly "not a measure of any objectively defined characteristic of the object evaluated" (p. 47).
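Finkelstein's two points about marks (that averages are not meaningful on an ordinal scale, and that weighted sums depend on a subjectively chosen weighting scheme) can be illustrated with a small numerical sketch. All of the numbers, group names, candidates and weights below are invented for illustration; they are not real test data and the weighting schemes are not a description of how any test provider actually combines scores.

```python
from statistics import mean, median

# Hypothetical marks for two groups of three test takers.
group_x = [5, 5, 9]
group_y = [6, 6, 6]

# An order-preserving relabelling of the same marks. If marks are only
# ordinal, these labels carry exactly the same information as the originals.
relabel = {5: 1, 6: 2, 9: 3}
relabelled_x = [relabel[m] for m in group_x]
relabelled_y = [relabel[m] for m in group_y]

# The comparison of means flips under the relabelling ...
print(mean(group_x) > mean(group_y))            # group X "better" on average
print(mean(relabelled_x) > mean(relabelled_y))  # now group Y "better"

# ... while the comparison of medians does not.
print(median(group_x) < median(group_y))
print(median(relabelled_x) < median(relabelled_y))


def weighted_score(components, weights):
    """Combine component marks into one number using the given weights."""
    return sum(components[skill] * weights[skill] for skill in weights)


# Two hypothetical candidates with different component profiles.
candidate_p = {"listening": 8, "reading": 8, "writing": 5, "speaking": 5}
candidate_q = {"listening": 6, "reading": 6, "writing": 7, "speaking": 7.5}

# Two possible weighting schemes, both assumptions made for this sketch.
equal = {"listening": 0.25, "reading": 0.25, "writing": 0.25, "speaking": 0.25}
receptive_heavy = {"listening": 0.35, "reading": 0.35, "writing": 0.15, "speaking": 0.15}

# Which candidate comes out ahead depends on the scheme chosen.
print(weighted_score(candidate_q, equal) > weighted_score(candidate_p, equal))
print(weighted_score(candidate_p, receptive_heavy) > weighted_score(candidate_q, receptive_heavy))
```

In this sketch the ranking of the two groups by mean reverses when the marks are relabelled without changing their order, and the ranking of the two candidates reverses when the weights change. Neither reversal could happen if the numbers were measures on a ratio scale combined by a rule fixed by the empirical facts rather than by the decision-maker.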
Validity and measurement
To sum up the issues regarding the meaning of 'measurement' and what it means for claims about the validity of tests such as IELTS and conclude this post, I'd like to turn to the words of Michael D. Maraun (2012):
Perplexity and angst over the issue of validity arises from its role as a back door to measurement, the psychologist's wanting to have his or her cake and eat it too; cleave to the belief that his or her tests measure, while demurring in the use of the full-blooded measurement language that has in the past, led to his or her censure.
Despite its lengthy history, the volumes that have been written about it, the study of test validity is neither evolving, nor approaching resolution of a core set of defining problems, certainly none such bearing on the topic of measurement. It continues, rather, to spin its wheels, as evidenced by the lack of "precision, consistency, and clarity" associated with so many of its central offerings.
The explanation for this state of affairs is that the back door did not lead to measurement at all but was, instead, an illusion, a projection of a primitive metaphysical picture the commitment to which distorts understanding of measurement through a fundamental conflation of its conceptual and empirical components and leads to incoherence. Because incoherence is irresolvable, perplexity forever attends the consideration of test validity, to be encountered afresh by scholars every few years. (p. 80)
In short, psychometric tests such as IELTS do not actually measure anything, and yet language testers continue their efforts to convince stakeholders that they do, without resolving fundamental problems with the application of the measurement construct to soft, social systems. The incoherence this creates means that test stakeholders continue to struggle to interpret test scores, and the use of these scores (e.g., for immigration purposes or to determine entry into higher education) continues to have perverse social impacts.
Update: What is to be done?
After I shared this post in the #AusELT community earlier today, a reader encouraged me to propose some solutions. Here's my five-point plan.
1. Encourage education professionals and test score users to engage more deeply with and understand better assessment-related issues.
2. Test developers should stop claiming to measure anything without at least explaining what they mean by 'measure' and attempting to demonstrate that the test does actually measure something.
3. Test developers should cease using and reporting statistics which presuppose that measurement has taken place unless they can demonstrate that measurement has in fact taken place (otherwise, the reporting of such statistics is misleading and valuable for marketing purposes only).
4. Test developers and promoters should comply more closely with the ILTA Code of Ethics, starting, of course, with Principle 1, which requires language testers to "communicate the information they produce to all relevant stakeholders in as meaningful a way as possible". See also point #2 above.
5. As a longer-term project, test developers should conscientiously seek opportunities to move away from the psychometric assessment paradigm towards the hermeneutic assessment paradigm proposed by Pamela Moss. This will cost more money, but will be necessary if point #4 is to be achieved.