Testing: Reliability and Validity

A student sits down for his college entrance exam.  His hands are clammy; his throat is dry.  He has not been able to eat a thing because his stomach has been tied in knots since he woke up.  The proctor announces that the test has begun.  The student stares at the first question.  Ten minutes later, he is still on question one, frozen by performance anxiety.  Unfortunately, the test results for this student will not accurately reflect his abilities.

Much controversy surrounds the use of high-stakes tests.  Many critics argue that test scores do not reflect true ability because results are based on a single performance.  That said, tests are here to stay.  With so many candidates seeking college admission and professional licenses, institutions have found that test scores have considerable value in determining who has acquired essential knowledge in a field and who has not.

Two criteria for determining the quality of a high-stakes test are reliability and validity.  Reliability is the precision of the test: the degree to which it provides consistent information.  For example, suppose Maria takes a test on Monday and is administered the same test the following day.  Reliability describes how similar her scores are across the two test sessions.  Scores from a reliable test will correlate very highly.  A test that produces unreliable scores is not a solid measurement instrument because it does not provide consistent information from one session to the next.  Taking a test on two consecutive days is the test-retest method for measuring reliability.  The interval between the two sessions should be short (usually a day) so that the test taker’s ability remains constant; otherwise, it will not be clear whether a difference in scores is the result of an unreliable test or of a change in the person’s ability to answer the questions.  Another way to measure a test’s reliability is to correlate the results from half of the test questions with the results from the other half, a method called split-half reliability.  Typically, researchers report Pearson’s product-moment correlation coefficient as the measure of reliability.
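The correlations described above are straightforward to compute.  The sketch below shows Pearson’s product-moment coefficient applied to both methods; the helper function and all score data are hypothetical illustrations, not output from a real test administration:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Test-retest: hypothetical total scores for six test takers
# who took the same test on two consecutive days.
monday  = [78, 85, 62, 90, 71, 83]
tuesday = [80, 84, 65, 92, 70, 85]
print(f"test-retest reliability: r = {pearson_r(monday, tuesday):.2f}")

# Split-half: each person's score on the odd-numbered items
# correlated with their score on the even-numbered items.
odd_half  = [38, 43, 30, 46, 35, 42]
even_half = [40, 42, 32, 44, 36, 41]
print(f"split-half reliability:  r = {pearson_r(odd_half, even_half):.2f}")
```

Both coefficients here come out close to 1.0, which is what a reliable test should produce: people who score high on one administration (or one half of the items) also score high on the other.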

Another criterion for evaluating high-stakes tests is validity.  Validity is the degree to which a test measures what it was designed to measure.  A valid test for obtaining a driver’s license, for example, should assess the candidate’s knowledge of traffic rules as well as the person’s ability to maintain control of a vehicle in different driving situations.  A written exam cannot assess the candidate’s ability to handle a vehicle when parallel parking, for example, and therefore a written test by itself would not be a valid test of overall driving ability.  Test developers often redesign tests so that they more closely match the skills they are trying to assess.  For example, the SAT involves more writing than it did in the past because writing is an essential skill for success in college, and the Test of English as a Foreign Language (TOEFL) now includes a speaking component.  Quantitative evaluations of test validity involve correlating test results with results from another established assessment that measures the desired ability.  A high correlation suggests that the test is measuring the same underlying trait.

Finally, the quality of a test also depends on the methods used to develop it.  No item should introduce bias for or against any subgroup of test takers, and test developers should document their methods and ensure that items are vetted and reviewed by experts.

Even though high-stakes tests present a hurdle for many candidates, there are established standards for ensuring that assessments are of the highest quality.  These standards are documented in the Standards for Educational and Psychological Testing, published jointly by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education.