Calibration – A Significant Activity and Contributory Factor to Quality and Fairness in Testing Language Skills

Duong Thi Thuy Uyen, M.A.
Lecturer, School of Foreign Languages for Economics – UEH

In language testing, the standards of Quality and Fairness are of great importance because they guarantee the accurate, reliable and fair measurement of learners’ language ability. One of the elements that helps to meet these standards is calibration. Therefore, when tests are prepared and administered, calibration should be conducted to ensure accuracy, reliability and fairness in assessing learners’ language proficiency levels.

INTRODUCTION

‘Don’t call the Nobel Committee just yet: We forgot to calibrate the instruments before the experiment…’

The School of Foreign Languages for Economics (SFLE), University of Economics, Ho Chi Minh City (UEH), was founded in September 2014. In the 2015 academic year, UEH had its first intake of eighty-four students majoring in English for Economics. While preparing this Bachelor of Arts program, the teaching staff at the SFLE worked tirelessly to design a curriculum that meets the requirements of both the Ministry of Education and society. So far, students’ feedback on the courses and the teaching staff has been extremely positive, a rewarding result of the teachers’ efforts. The main concern now is how to deliver tests consistently and effectively while meeting the standards of Quality and Fairness in language testing. To achieve this, calibration, an activity conducted regularly in the preparation of international tests as well as tests at prestigious universities, should be organized before Speaking and Writing tests at the SFLE so that teacher-raters can assess students’ performance and language proficiency levels accurately, reliably and fairly.

LITERATURE REVIEW

2.1 The standards of Quality and Fairness

The standards of Quality and Fairness, known as the ETS Standards for Quality and Fairness, were developed and are periodically revised by Educational Testing Service (ETS). They aim to support the design, development, and delivery of technically sound, fair, accessible, and useful products and services, and to help evaluate those products and services. They are “a model for organizations throughout the world that seek to implement measurement standards aligned with changes in technology and advances in measurement and education.” (Walt M., ETS Standards for Quality and Fairness, 2015:1)

There are thirteen standards, each subdivided into more detailed ones. These individual standards provide detailed guidelines, standard operating procedures, work rules, checklists, and so on. Standard 10.1 – Developing Procedures for Human Scoring – proposes: “If the scoring involves human judgment, develop clear, complete, and understandable procedures and criteria for scoring, and train the raters to apply them consistently and correctly.” (ETS Standards for Quality and Fairness, 2015:43)

It also clarifies that raters need training to ensure that they can apply the scoring rubric consistently and correctly. Benchmarks (responses typical of different levels of performance at each score point) should be included in training sessions, and an array of examples should be used to give the raters a clear understanding of the intended scoring standards. In addition, expert raters should be available to assist when a rater has difficulty rating a response. (ETS Standards for Quality and Fairness, 2015:43)

The second standard in this set – Standard 10.2 – Monitoring Accuracy and Reliability of Scoring – points out that: “If the rating process requires judgment, the accuracy of ratings can best be judged by comparing them with ratings assigned to the same responses by expert raters.” (ETS Standards for Quality and Fairness, 2015:43)

As far as the standard of Fairness is concerned, the Manual for Language Test Development and Examining, produced by the Association of Language Testers in Europe (ALTE) on behalf of the Language Policy Division, Council of Europe, emphasizes that three aspects of fairness should be acknowledged. They are “fairness as lack of bias, fairness as equitable treatment in the testing process and fairness as equality in outcomes of testing.” (Manual for Language Test Development and Examining, 2011:17)

This manual also gives instructions on maintaining good quality control. To meet the Standards for Quality, it states that, besides considering related elements such as editing, piloting, pretesting and trialing, the marking process must be managed to avoid any threat to reliability and accuracy. Raters or examiners therefore have to be trained so that they can rate consistently and accurately, especially when “a single ‘correct answer’ cannot be clearly prescribed by the exam provider before rating” (Manual for Language Test Development and Examining, 2011:41). Before the rater training sessions, a rating scale must be constructed. This is “a set of descriptors which describes performances at different levels, showing which mark or grade each performance level should receive.” (Manual for Language Test Development and Examining, 2011:41). In the training sessions, raters first rate the samples independently and then discuss them openly with reference to the rating scale.

2.2 Calibration

An online dictionary (http://www.vocabulary.com/dictionary/calibrate) defines the word calibrate as making precise measurements. In the glossary compiled by ETS, calibration is defined as follows:

“In the scoring of a constructed-response test, “calibration” refers to the process of checking to make sure that each scorer is applying the scoring standards correctly.”

(from https://www.ets.org/understanding_testing/glossary/ )

In the process of developing and delivering tests, calibration is a small but important step that ensures fair and reliable rating. Calibrations are conducted in scorer or rater training sessions. According to the Guidelines for the Assessment of English Language Learners, also published by ETS, scorer training should include a review of how to interpret responses and the scoring rubric. Assessment developers should select exemplar responses at various score points and use them in training so that raters can “recognize English Language Learner (ELL) characteristics and score ELL responses fairly without introducing bias”. The guidelines also state that “Recalibrating scorers at the beginning of each scoring session should confirm scorers’ abilities to resume accurate scoring.” (Guidelines for the Assessment of English Language Learners, 2009: 29)

METHODOLOGY

To learn about the process of training raters to ensure accuracy and fairness, informal interviews were conducted with two managers. The first was Ms. Ton Nu Ngoc Tuong, the Country Test Manager of International Development Program (IDP) Viet Nam, which organizes IELTS (International English Language Testing System) tests. The second was Dr. Nguyen Thi Cam Le, the main class English teacher for the English Proficiency Program (EPP) of the English Language Institute, School of Linguistics and Applied Language Studies, Victoria University of Wellington, New Zealand. Dr. Le is also the Representative and Manager for the EPP of Victoria University of Wellington at its Ho Chi Minh City campus, where she administers the English Proficiency Test (EPT), which was developed and standardized by the same institute.

Interviews were also conducted with three IELTS examiners at IDP Viet Nam who receive regular retraining from the British Council and IDP: IELTS Australia through the Professional Support Network.

RESULTS

4.1 Steps in calibration and differences in training sessions

The two managers, Ms. Tuong and Dr. Le, gave almost the same information on the process of training raters and on calibrations for Speaking and Writing tests. In general, the steps in a calibration session are the same at both organizations. First, the raters are given band descriptors to study, and the trainer gives some explanation if necessary. Then sample answers are presented for the raters to mark, usually five for each kind of test: video clips of students’ recorded answers for the Speaking test, or students’ graph descriptions and essays for the Writing test. After that, each rater explains how they arrived at their scores. An open discussion among the raters follows, and the raters then compare their ratings with those assigned to the same responses by expert raters. Finally, another discussion is held to reach a final consensus.

Figure 1. Flow chart of steps in a calibration for a speaking test
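The comparison step, in which raters check their ratings against those of expert raters, can be illustrated with a small calculation. The following is only a minimal sketch in Python, not a procedure used by IDP or the English Language Institute: the function name, the one-band tolerance and the five sample scores are all hypothetical and chosen purely for illustration.

```python
from statistics import mean


def agreement_with_experts(rater_scores, expert_scores, tolerance=1):
    """Summarize how closely one rater's band scores match expert scores.

    Hypothetical helper: 'tolerance' is an assumed threshold (in bands)
    for counting a score as "adjacent" to the expert score.
    """
    if len(rater_scores) != len(expert_scores):
        raise ValueError("Both score lists must cover the same samples.")
    if not rater_scores:
        raise ValueError("At least one scored sample is required.")

    # Absolute difference between rater and expert on each sample
    diffs = [abs(r - e) for r, e in zip(rater_scores, expert_scores)]

    return {
        # Share of samples where the rater gave exactly the expert score
        "exact": sum(d == 0 for d in diffs) / len(diffs),
        # Share of samples within 'tolerance' bands of the expert score
        "adjacent": sum(d <= tolerance for d in diffs) / len(diffs),
        # Average distance from the expert score, in bands
        "mean_abs_diff": mean(diffs),
    }


# Invented scores for five sample answers (illustrative data only)
expert = [4, 5, 3, 6, 5]
rater = [4, 5, 4, 6, 3]

result = agreement_with_experts(rater, expert)
print(result)
```

A trainer could use a summary of this kind to decide which samples deserve the most time in the final discussion, for example those on which a rater differs from the expert score by more than one band.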

There is a small difference in the frequency of the training sessions at the two organizations. IDP Viet Nam requires IELTS examiners to undergo training and retraining in standardization sessions led by examiner trainers; these were held every two years in the past and are now held every year. Meanwhile, the raters for the Speaking and Writing tests at the English Language Institute, School of Linguistics and Applied Language Studies, attend calibrations led by the Dean of the School before each English proficiency test, which is delivered three times a year.

As far as the band descriptors are concerned, Dr. Le and Ms. Tuong gave only general information about them for copyright reasons. However, public versions of the assessment criteria for Speaking and Writing in IELTS tests are available on the Internet at www.ielts.org (see Appendixes). The band descriptors for Speaking and Writing tests in IELTS have four criteria, while those in the EPT have five. (These criteria are equally weighted.) Another difference is that IELTS uses a nine-band scale while the EPT uses a six-band scale.

Table 1. Differences between the IELTS test and the EPT

	IELTS test	EPT test
1. Number of criterion areas for Speaking and Writing tests	04	05
2. The range of band scale	01 – 09	01 – 06
3. Scoring system	Whole or half band scores	Whole band scores
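To make the idea of equally weighted criteria and whole or half band scores concrete, the sketch below averages criterion scores and rounds the result to the reporting step. This is an assumption-laden illustration in Python, not the official scoring rule of either test: the function name, the round-half-up rule and the example scores are hypothetical, since neither organization disclosed how the final score is derived.

```python
import math


def overall_band(criterion_scores, step=1.0):
    """Average equally weighted criterion scores and round to 'step'.

    Assumptions (not from IELTS or the EPT): the overall score is the
    simple mean of the criteria, rounded half up to the nearest step
    (0.5 for half bands, 1.0 for whole bands).
    """
    if not criterion_scores:
        raise ValueError("At least one criterion score is required.")
    average = sum(criterion_scores) / len(criterion_scores)
    # Round half up to the nearest multiple of 'step'
    return math.floor(average / step + 0.5) * step


# Four criteria, half-band reporting (IELTS-like layout, invented scores)
four_criteria_score = overall_band([6, 7, 6, 7], step=0.5)

# Five criteria, whole-band reporting (EPT-like layout, invented scores)
five_criteria_score = overall_band([4, 5, 4, 4, 5], step=1.0)

print(four_criteria_score, five_criteria_score)
```

Under these assumptions the first example averages 6.5 and stays at 6.5, while the second averages 4.4 and is reported as 4.0. This shows how a whole-band system hides small differences between candidates that a half-band system would preserve.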

4.2 Difficulties in preparing samples for calibrations

According to Dr. Le, preparing samples for calibration is not a simple task, and two major problems need to be solved for Speaking tests. Firstly, for ethical reasons, the students’ consent is needed to make video recordings and to use them afterwards in calibrations. Secondly, making these video clips is time-consuming and costly. In addition, because the expert raters are available at different times, it is quite difficult to arrange sessions for them to score the samples before the raters’ calibrations take place.

Another issue comes from the teacher-raters themselves: they sometimes cannot agree on a student’s score even after referring to the band descriptors. Fortunately, this rarely happens.

4.3 Benefits of calibrations

The interviews with the two managers and three examiners revealed that, from both the managers’ and the raters’ points of view, calibration is a significant activity and a contributory factor to quality and fairness in language testing. The reason is that, after being trained and retrained through calibrations, raters can assess consistently and rate accurately because they are “appropriately qualified and have the relevant professional experience” (Ensuring Quality and Fairness in International Language Testing, 2013:12). When interviewed, the three examiners at IDP Viet Nam, Ho Chi Minh City, all said that after calibrations it was much easier to assess candidates’ proficiency levels because they knew which grades or scores each performance should receive.

In addition, the writer herself, as a co-teacher for the EPP, has been trained in calibrations two or three times a year since 2008 (depending on the number of courses she has been involved in). She shares the opinion that, after participating in calibrations, a rater can be more confident, make quick and definitive decisions, and maintain a high level of accuracy in scoring. It is the detailed band descriptors that help raters avoid indecision when rating, for example being in two minds about whether to give a candidate a four or a five.

RECOMMENDATIONS

Given the benefits of conducting calibrations before delivering Speaking and Writing tests, it is recommended that calibrations be organized at the SFLE before Speaking tests are delivered to the students majoring in English for Economics, because they help teacher-examiners measure learners’ language ability accurately, reliably and fairly.

To do this, detailed performance descriptors that describe spoken performance at each of the 10 bands should be developed first. These descriptors will help teacher-examiners understand clearly the level of performance required to attain a particular band score in each of the criterion areas. After that, teacher-examiners will undergo practice and training to ensure that they can apply the descriptors in a valid and reliable manner.

Some might argue that the tests delivered at the SFLE are not international tests such as IELTS, TOEIC, or TOEFL, and that there might not be enough resources to design band descriptors, make video recordings and conduct calibrations because the SFLE is not a Language Institute. However, it cannot be denied that ensuring quality and fairness in language testing is one of the teacher’s primary responsibilities, and if it is not carried out now, then when can it be? Big things cannot be achieved until small things have been started.

CONCLUSION

To sum up, to achieve the main goal set by the President of UEH, making UEH one of the best international universities, each School in the university has to improve itself in every aspect, starting with even seemingly simple things. For the SFLE, it is high time that calibrations were applied to standardize its testing system step by step.

REFERENCES

http://www.vocabulary.com/dictionary/calibrate
https://www.ets.org/understanding_testing/glossary/
http://www.ielts.org/researchers/score_processing_and_reporting.aspx#speaking
http://www.ielts.org/researchers/score_processing_and_reporting.aspx?utm_source=GuideForAgents&utm_medium=Print&utm_campaign=criteria
ALTE (2011) Manual for Language Test Development and Examining, Council of Europe
Mary J. P., John W. Y., Maria M., Teresa C. K., Alyssa B., and Mitchell G. (2009) Guidelines for the Assessment of English Language Learners, Educational Testing Service
(2013) Ensuring Quality and Fairness in International Language Testing, www.ielts.org
(2015) ETS Standards for Quality and Fairness, Educational Testing Service