Test development and validation

Test development - Validation - What is computer-adaptive testing?

For a more theoretical explanation of issues connected with computer-based testing, please read the following:

Test development

Explicit work to link Cambridge examinations to a single measurement scale began in the early 1990s, when Cambridge began to use a latent-trait item banking approach to test construction. The development of the Common Scale has made it possible to improve the quality of examinations by ensuring the constant difficulty of component papers, and has greatly aided the interpretation of qualifications for end users of examination results.

The Common Scale is also an essential prerequisite for computer-adaptive testing, as the difficulty of the items in a CAT (computer-adaptive test) covers all levels. Data for constructing the Common Scale has come from pretesting, live administrations, and from other special research projects.

An important advantage of the QPT is that it reports test results as a band on the ALTE 5-level scale. This makes the result potentially much more useful to end users. The band descriptors represent an outcome of early validation work with Can Do statements. The ALTE Can Do Project is ongoing work that refines these statements as part of constructing a European cross-language framework for language proficiency.

Cambridge ESOL now devotes a significant amount of effort to developing and maintaining the bank of test materials available for use in CATs. This material comes largely from previous paper and pen examinations, but is also specially commissioned for CAT where necessary. Considerable research has gone into how big the pool for each item type should be, and how items are chosen for use in each test event, so that there is an appropriate mix of item types available at the right level for each candidate.

The test development process involves several key quality control stages, whether it is to produce traditional paper-based tests or tests delivered by computer. For any test development project, there is always consideration of the test construct and of the way in which test scores will be calculated and then used to make decisions about test takers. This process results in a test specification document which provides key information such as the test purpose, the intended candidates, the overall test structure, the range of item types, the test construct and score reporting issues.

Professional item writers are commissioned to produce material that meets the standards set out in the test specifications document. The material, such as a Reading passage, is first assessed at a pre-editing meeting by a Cambridge ESOL examination manager and the Chair of the item-writing team. They decide if the material is suitable for the level and purpose of the test.

Consideration is also given to the suitability of the topic matter. Material can be rejected at this stage or returned to the item writer for reworking.

If it is acceptable, the material is returned to the item writers to produce the items which are then submitted for an editing meeting. The editing meeting can decide to reject or return material for revision but at this stage a large percentage is passed for pretesting.

The pretesting stage involves assembling new tasks into pretests, which are sent out for trialling with representative groups of students. By trying out new material and reviewing the results of analysis and the feedback from test centres, we are able to improve the overall quality of the bank of material used to build live examinations.

All of the test items in the Quick Placement Test (QPT) have been through this quality control procedure; however, additional steps have been taken to assess the overall reliability of the QPT and the relationship between its scores and those derived from the paper and pen versions.

Validation

To date, the test has been validated in 20 countries by more than 5,000 students.

  • Phase 1

The first phase of trialling involved students from a variety of countries taking the electronic QPT and one of the two paper and pen tests. They also completed a questionnaire indicating how comfortable they were using a computer. Teachers provided detailed feedback on the look and feel of the paper and pen tests and of some items in the electronic version of the QPT, and on how accurately the tests identified the current level of their students.

As a result, the paper and pen tests were changed and the QPT database was modified to include more lower-level items, with a particular increase in lexico-grammatical items at that level.

  • Phase 2

With the format of the tests confirmed, the second phase of activity concentrated on determining score equivalence between the QPT and the paper and pen versions, and also between two successive administrations of the QPT. The aim was to assess how consistently students obtained the same score and what degree of error was associated with these scores.

'Error' refers to the fact that in any measuring process there will be some inconsistency. If you were to weigh yourself five times in the same day, you would notice that the recorded weight varied slightly. In testing terms, there is a notion of True Score Theory, which states that if a candidate took a test an infinite number of times, the average score would be the true score and the range of scores around that average score would indicate the error range for that test. By investigating the reliability of the test scores as well as the tests themselves, we have produced a test which is both reliable and practical.
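The idea of a true score and an error range can be illustrated with a small simulation. The candidate's "true" score of 60 and the error figure of 4 below are invented for illustration, not QPT data: repeated measurements scatter around the true score, their average converges on it, and their spread estimates the error range.

```python
import random
import statistics

random.seed(42)

TRUE_SCORE = 60.0   # the hypothetical candidate's "true" score
ERROR_SD = 4.0      # assumed measurement error (standard deviation)

# Simulate one candidate sitting the same test many times.
observed = [random.gauss(TRUE_SCORE, ERROR_SD) for _ in range(10000)]

mean_score = statistics.mean(observed)
error_range = statistics.stdev(observed)

# The average of many observed scores converges on the true score,
# and their spread estimates the measurement error of the test.
print(round(mean_score), round(error_range))   # values close to 60 and 4
```

In practice a candidate sits the test only once, which is why test developers estimate the error statistically from many candidates rather than from repeated sittings.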

In 2003, work began on a new edition of the QPT designed to simplify the installation process and to make the management of counts more user-friendly. Further research was then conducted on the paper-based and computer-based modes of delivery of the examination as part of Cambridge ESOL's commitment to continually monitoring standards. This research confirmed the reliability of the tests and showed that the two paper-based versions are of the same difficulty and are equivalent. No evidence was found of one version or mode of the examination appearing more difficult than another at any level of candidate ability. While the computer-based test cannot be considered equivalent to the paper-based tests, as it contains a Listening section absent from the paper-based versions, candidates who sat both modes of delivery tended to achieve the same ALTE band in both.

What is computer-adaptive testing?

In a computer-adaptive test (CAT) the test-taker responds to questions presented by a computer. As the test proceeds the computer estimates the ability of the test taker and chooses items which are of the right difficulty for that level. Because of this, each response contributes a maximum amount of information. CAT tests can thus be shorter than equivalent paper and pen tests, without sacrificing reliability.
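The select-estimate loop described above can be sketched as follows. The step-halving update rule and the seven-item pool are illustrative assumptions for this sketch, not the algorithm actually used in the QPT:

```python
# A minimal sketch of an adaptive item-selection loop: pick the item
# closest to the current ability estimate, then nudge the estimate up
# or down according to the response, with a shrinking step size.

def run_adaptive_test(item_bank, answers_correctly, n_items=5):
    """item_bank: list of item difficulties on the measurement scale."""
    pool = sorted(item_bank)
    ability = 0.0   # start from a mid-scale estimate
    step = 2.0
    for _ in range(n_items):
        # Choose the unused item whose difficulty is closest to the
        # current ability estimate.
        item = min(pool, key=lambda d: abs(d - ability))
        pool.remove(item)
        # Move the estimate up after a correct answer, down otherwise,
        # halving the step so the estimate settles.
        if answers_correctly(item):
            ability += step
        else:
            ability -= step
        step /= 2
    return ability

# A strong candidate: answers correctly whenever the item difficulty
# is below 1.5 on the scale.
estimate = run_adaptive_test(
    item_bank=[-3, -2, -1, 0, 1, 2, 3],
    answers_correctly=lambda difficulty: difficulty < 1.5,
)
print(estimate)   # settles near the candidate's level of about 1.5
```

Even after only five items, the estimate homes in on the level at which the candidate starts to fail, which is why an adaptive test can be shorter than a fixed paper for the same reliability.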

Thus, to do computer-adaptive testing, it is first necessary to build up sets of items covering a wide range of levels, where the difficulty of each item is precisely known. This approach to test construction is called item banking, and it involves the use of a particular branch of statistics known as item response theory.
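One widely used latent-trait model in item banking is the Rasch model, in which the probability of a correct response depends only on the gap between the test-taker's ability and the item's difficulty, both expressed on the same measurement scale. A minimal sketch (the specific numbers are illustrative, not QPT values):

```python
import math

def p_correct(ability, difficulty):
    """Rasch model: probability that a test-taker of the given ability
    answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# An item matched to the test-taker's level gives a 50% chance of
# success, which is where a response carries the most information.
print(round(p_correct(0.0, 0.0), 2))   # 0.5
# A much harder item: success becomes unlikely.
print(round(p_correct(0.0, 2.0), 2))   # 0.12
```

This is one reason an adaptive test targets items near the test-taker's estimated level: responses to items far above or below that level are nearly predictable and so add little information.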

As an example of how adaptive testing works, consider two learners at rather different levels taking an adaptive test. Firstly, because the difficulty of the questions adapts to the level of the test-taker, the two learners will get a roughly similar number of items right and wrong - both might finish with a score of about 60% correct. But clearly, a score of 60% on a set of easier questions demonstrates less ability than a score of 60% on a set of harder questions. To estimate ability, the adaptive test must take account not only of the test-taker's score but also of the difficulty of each question attempted. From this it estimates the test-taker's ability on an underlying measurement scale.
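This kind of estimation can be sketched as a maximum-likelihood search under a Rasch-style model. The model choice, the grid search, and the difficulties and response patterns below are assumptions made for illustration: two candidates each answer 60% of their items correctly, but on item sets of different difficulty, and so receive different ability estimates.

```python
import math

def p_correct(ability, difficulty):
    # Rasch model probability of a correct response.
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(responses):
    """responses: list of (item_difficulty, answered_correctly).
    Grid-search maximum-likelihood estimate of ability."""
    def log_likelihood(theta):
        total = 0.0
        for difficulty, correct in responses:
            p = p_correct(theta, difficulty)
            total += math.log(p if correct else 1.0 - p)
        return total
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=log_likelihood)

# Both candidates answer 3 of 5 items correctly (60%), but on item
# sets of very different difficulty.
easy_set = [(-2.0, True), (-1.5, True), (-1.0, True), (-0.5, False), (0.0, False)]
hard_set = [(1.0, True), (1.5, True), (2.0, True), (2.5, False), (3.0, False)]

# The same raw score on harder items yields a much higher estimate.
print(estimate_ability(easy_set) < estimate_ability(hard_set))   # True
```

The raw percentage is the same in both cases; only by weighing each response against the known difficulty of its item does the estimate separate the two candidates.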

Thus it is possible to give different test-takers completely different sets of test items, and yet estimate their ability on the same scale.

Send mail to Mari.Haapaniemi@trantor.fi with questions or comments about this web site.
Copyright © 2002 Trantor Ky
Last modified: 04/04/08