Understanding Test Quality: Concepts of Reliability and Validity
Test reliability and validity are two technical properties of a test that indicate its quality and usefulness. They are the two most important features of a test, and you should examine them when evaluating the suitability of a test for your use. This chapter provides a simplified explanation of these two complex ideas. These explanations will help you understand the reliability and validity information reported in test manuals and reviews, and use that information to evaluate the suitability of a test for your use.
Chapter Highlights
1. What makes a good test?
2. Test reliability
3. Interpretation of reliability information from test manuals and reviews
4. Types of reliability estimates
5. Standard error of measurement
6. Test validity
7. Methods for conducting validation studies
8. Using validity evidence from outside studies
9. How to interpret validity information from test manuals and independent reviews
Principles of Assessment Discussed
- Use only reliable assessment instruments and procedures.
- Use only assessment procedures and instruments that have been demonstrated to be valid for the specific purpose for which they are being used.
- Use assessment tools that are appropriate for the target population.
1. What makes a good test?
An employment test is considered "good" if the following can be said about
it:
- The test measures what it claims to measure consistently or reliably. This
means that if a person were to take the test again, the person would get a
similar test score.
- The test measures what it claims to measure. For example, a test of mental
ability does in fact measure mental ability, and not some other characteristic.
- The test is job-relevant. In other words, the test measures one or more
characteristics that are important to the job.
- By using the test, more effective employment decisions can be made about
individuals. For example, an arithmetic test may help you to select qualified
workers for a job that requires knowledge of arithmetic operations.
The degree to which a test has these qualities is indicated by two technical properties: reliability and validity.
2. Test reliability
Reliability refers to how dependably or consistently a test measures
a characteristic. If a person takes the test again, will he or she get a similar
test score, or a much different score? A test that yields similar scores for a
person who repeats the test is said to measure a characteristic reliably.
How do we account for an individual who does not get exactly the same test
score every time he or she takes the test? Some possible reasons are the
following:
- Test taker's temporary psychological or physical state. Test
performance can be influenced by a person's psychological or physical state at
the time of testing. For example, differing levels of anxiety, fatigue, or
motivation may affect the applicant's test results.
- Environmental factors. Differences in the testing environment, such
as room temperature, lighting, noise, or even the test administrator, can
influence an individual's test performance.
- Test form. Many tests have more than one version or form. Items
differ on each form, but each form is supposed to measure the same thing.
Different forms of a test are known as parallel forms or alternate
forms. These forms are designed to have similar measurement
characteristics, but they contain different items. Because the forms are not
exactly the same, a test taker might do better on one form than on another.
- Multiple raters. In certain tests, scoring is determined by a rater's
judgments of the test taker's performance or responses. Differences in training,
experience, and frame of reference among raters can produce different test
scores for the test taker.
These factors are sources of chance or random measurement error in the
assessment process. If there were no random errors of measurement, the
individual would get the same test score, the individual's "true" score, each
time. The degree to which test scores are unaffected by measurement errors is an indication of the reliability of the test.
Reliable assessment tools produce dependable, repeatable, and consistent
information about people. In order to meaningfully interpret test scores and
make useful employment or career-related decisions, you need reliable tools.
This brings us to the next principle of assessment.
Principle of Assessment
Use only reliable assessment instruments and procedures. In other words, use only assessment tools that provide dependable and consistent information.
3. Interpretation of reliability information from test manuals and reviews
Test manuals and independent reviews of tests provide information on test
reliability. The following discussion will help you interpret the reliability
information about any test.
The reliability of a test is indicated by the reliability coefficient. It is denoted by the letter "r," and is expressed as a number ranging between 0 and 1.00, with r = 0 indicating no reliability and r = 1.00 indicating perfect reliability. Do not expect to find a test with perfect reliability. Generally, you will see the reliability of a test reported as a decimal, for example, r = .80 or r = .93. The larger the reliability coefficient, the more repeatable or reliable the test scores. Table 1 serves as a general guideline for interpreting test reliability. However, do not select or reject a test solely on the basis of the size of its reliability coefficient. To evaluate a test's reliability, you should consider the type of test, the type of reliability estimate reported, and the context in which the test will be used.
Table 1. General Guidelines for Interpreting Reliability Coefficients

| Reliability coefficient value | Interpretation |
|-------------------------------|----------------|
| .90 and up | excellent |
| .80 - .89 | good |
| .70 - .79 | adequate |
| below .70 | may have limited applicability |
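To make the bands in Table 1 concrete, here is a minimal Python sketch; the function name and the cutoffs simply mirror Table 1 and are illustrative rather than any standard API:

```python
def interpret_reliability(r: float) -> str:
    """Map a reliability coefficient to the Table 1 guideline bands."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("reliability coefficients range from 0 to 1.00")
    if r >= 0.90:
        return "excellent"
    if r >= 0.80:
        return "good"
    if r >= 0.70:
        return "adequate"
    return "may have limited applicability"

print(interpret_reliability(0.93))  # excellent
print(interpret_reliability(0.68))  # may have limited applicability
```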
4. Types of reliability estimates
There are several types of reliability estimates, each influenced by
different sources of measurement error. Test developers have the responsibility
of reporting the reliability estimates that are relevant for a particular test.
Before deciding to use a test, read the test manual and any independent reviews
to determine if its reliability is acceptable. The acceptable level of
reliability will differ depending on the type of test and the reliability
estimate used.
The discussion in Table 2 should help you develop some familiarity with the
different kinds of reliability estimates reported in test manuals and reviews.
Table 2. Types of Reliability Estimates
Test-retest reliability indicates the repeatability of test
scores with the passage of time. This estimate also reflects the stability of
the characteristic or construct being measured by the test.
Some constructs are more stable than others. For example, an individual's
reading ability is more stable over a particular period of time than that
individual's anxiety level. Therefore, you would expect a higher test-retest
reliability coefficient on a reading test than you would on a test that measures
anxiety. For constructs that are expected to vary over time, an acceptable
test-retest reliability coefficient may be lower than is suggested in Table 1.
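As a hedged illustration of how such a coefficient is computed, the test-retest estimate is commonly the Pearson correlation between scores from two administrations of the same test. The data below are invented; the same calculation, applied to scores on two different forms, yields an alternate-form estimate:

```python
import numpy as np

# Invented example: the same ten people tested twice, a few weeks apart.
time1 = np.array([85, 78, 92, 70, 88, 74, 95, 81, 67, 90])
time2 = np.array([83, 80, 90, 72, 85, 76, 94, 79, 70, 91])

# Test-retest reliability: Pearson correlation between the two administrations.
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability: r = {r_test_retest:.2f}")
```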
Alternate or parallel form reliability indicates how
consistent test scores are likely to be if a person takes two or more forms of a
test.
A high parallel form reliability coefficient indicates that the different forms of the test are very similar, which means that it makes virtually no difference which version of the test a person takes. On the other hand, a low parallel form reliability coefficient suggests that the different forms are probably not comparable; they may be measuring different things and therefore cannot be used interchangeably.
Inter-rater reliability indicates how consistent test scores
are likely to be if the test is scored by two or more raters.
On some tests, raters evaluate responses to questions and determine the
score. Differences in judgments among raters are likely to produce variations in
test scores. A high inter-rater reliability coefficient indicates that the
judgment process is stable and the resulting scores are reliable.
Inter-rater reliability coefficients are typically lower than other types of
reliability estimates. However, it is possible to obtain higher levels of
inter-rater reliabilities if raters are appropriately trained.
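For two raters assigning categorical ratings, one common index is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below uses invented ratings; for numeric scores or more than two raters, other indices such as intraclass correlations may be more appropriate:

```python
import numpy as np

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters using the same set of categories."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    categories = np.union1d(a, b)
    p_observed = np.mean(a == b)  # raw proportion of agreement
    # Chance agreement: product of each rater's marginal proportions, summed.
    p_chance = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Invented ratings of ten test takers on a 1-3 scale.
rater_a = [3, 2, 3, 1, 2, 2, 3, 1, 2, 3]
rater_b = [3, 2, 2, 1, 2, 3, 3, 1, 2, 3]
print(f"Cohen's kappa = {cohens_kappa(rater_a, rater_b):.2f}")  # ~0.69
```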
Internal consistency reliability indicates the extent to which
items on a test measure the same thing.
A high internal consistency reliability coefficient for a test indicates that
the items on the test are very similar to each other in content (homogeneous).
It is important to note that the length of a test can affect internal
consistency reliability. For example, a very lengthy test can spuriously inflate
the reliability coefficient.
Tests that measure multiple characteristics are usually divided into distinct
components. Manuals for such tests typically report a separate internal
consistency reliability coefficient for each component in addition to one for
the whole test.
Test manuals and reviews report several kinds of internal consistency
reliability estimates. Each type of estimate is appropriate under certain
circumstances. The test manual should explain why a particular estimate is
reported.
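The most widely reported internal consistency estimate is Cronbach's alpha. As a minimal sketch with invented data, alpha can be computed from a people-by-items matrix of item scores:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (people x items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Invented data: six people answering four items scored 0-5.
scores = [[4, 5, 3, 4],
          [2, 3, 2, 3],
          [5, 5, 4, 5],
          [1, 2, 1, 2],
          [3, 4, 3, 3],
          [4, 4, 5, 4]]
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```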
5. Standard error of measurement
Test manuals report a statistic called the standard error of measurement (SEM). It gives the margin of error you should expect in an individual test score because of the imperfect reliability of the test. The SEM indicates, with a given level of confidence, the range within which a person's "true" score is likely to lie. For example, an SEM of 2 indicates that a test taker's "true" score probably lies within 2 points in either direction of the score he or she receives on the test. This means that if an individual receives a 91 on the test, there is a good chance that the person's "true" score lies somewhere between 89 and 93.
The SEM is a useful measure of the accuracy of individual test scores. The
smaller the SEM, the more accurate the measurements.
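A common classical-test-theory formula, shown here as a hedged sketch, derives the SEM from the test's standard deviation and reliability: SEM = SD * sqrt(1 - r). The illustrative values below (SD = 10, r = .96) reproduce the SEM of 2 from the example above:

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability), from classical test theory."""
    return sd * math.sqrt(1 - reliability)

sem = standard_error_of_measurement(sd=10, reliability=0.96)
observed = 91
print(f"SEM = {sem:.1f}")
print(f"Observed score {observed}: 'true' score likely between "
      f"{observed - sem:.0f} and {observed + sem:.0f}")
```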
When evaluating the reliability coefficients of a test, it is important to
review the explanations provided in the manual for the following:
- Types of reliability used. The manual should indicate why a certain
type of reliability coefficient was reported. The manual should also discuss
sources of random measurement error that are relevant for the test.
- How reliability studies were conducted. The manual should indicate
the conditions under which the data were obtained, such as the length of time
that passed between administrations of a test in a test-retest reliability
study. In general, reliabilities tend to drop as the time between test
administrations increases.
- The characteristics of the sample group. The manual should indicate
the important characteristics of the group used in gathering reliability
information, such as education level, occupation, etc. This will allow you to
compare the characteristics of the people you want to test with the sample
group. If they are sufficiently similar, then the reported reliability estimates
will probably hold true for your population as well.
For more information on reliability, consult the APA Standards, the SIOP
Principles, or any major textbook on psychometrics or employment testing.
Appendix A lists some possible sources.
6. Test validity
Validity is the most important issue in selecting a test. Validity refers to what characteristic the test measures and how well the test measures that characteristic.
- Validity tells you if the characteristic being measured by a test is related
to job qualifications and requirements.
- Validity gives meaning to the test scores. Validity evidence indicates that there is a link between test performance and job performance. It
can tell you what you may conclude or predict about someone from his or her
score on the test. If a test has been demonstrated to be a valid predictor of
performance on a specific job, you can conclude that persons scoring high on the
test are more likely to perform well on the job than persons who score low on
the test, all else being equal.
- Validity also describes the degree to which you can make specific
conclusions or predictions about people based on their test scores. In other
words, it indicates the usefulness of the test.
It is important to understand the differences between reliability and validity. Validity will tell you how good a test is for a particular
situation; reliability will tell you how trustworthy a score on that test will
be. You cannot draw valid conclusions from a test score unless you are sure that
the test is reliable. Even when a test is reliable, it may not be valid. You
should be careful that any test you select is both reliable and valid for your
situation.
A test's validity is established in reference to a specific purpose; the test
may not be valid for different purposes. For example, the test you use to make
valid predictions about someone's technical proficiency on the job may not be
valid for predicting his or her leadership skills or absenteeism rate. This
leads to the next principle of assessment.
Principle of Assessment
Use only assessment procedures and instruments that have been demonstrated to be valid for the specific purpose for which they are being used.
Similarly, a test's validity is established in reference to specific groups.
These groups are called the reference groups. The test may not be valid for
different groups. For example, a test designed to predict the performance of
managers in situations requiring problem solving may not allow you to make valid
or meaningful predictions about the performance of clerical employees. If, for
example, the kind of problem-solving ability required for the two positions is
different, or the reading level of the test is not suitable for clerical
applicants, the test results may be valid for managers, but not for clerical
employees.
Test developers have the responsibility of describing the reference groups
used to develop the test. The manual should describe the groups for whom the
test is valid, and the interpretation of scores for individuals belonging to
each of these groups. You must determine if the test can be used appropriately
with the particular type of people you want to test. This group of people is
called your target population or target group.
Principle of Assessment
Use assessment tools that are appropriate for the target population.
Your target group and the reference group do not have to match on all factors; they must be sufficiently similar so that the test will yield
meaningful scores for your group. For example, a writing ability test developed
for use with college seniors may be appropriate for measuring the writing
ability of white-collar professionals or managers, even though these groups do
not have identical characteristics. In determining the appropriateness of a test
for your target groups, consider factors such as occupation, reading level,
cultural differences, and language barriers.
Recall that the Uniform Guidelines require assessment tools to have
adequate supporting evidence for the conclusions you reach with them in the
event adverse impact occurs. A valid personnel tool is one that measures an
important characteristic of the job you are interested in. Use of valid tools
will, on average, enable you to make better employment-related decisions. Both
from business-efficiency and legal viewpoints, it is essential to only use tests
that are valid for your intended use.
In order to be certain an employment test is useful and valid, evidence must
be collected relating the test to a job. The process of establishing the job
relatedness of a test is called
validation.
7. Methods for conducting validation studies
The Uniform Guidelines discuss the following three methods of conducting validation studies. The Guidelines describe conditions under which each type of validation strategy is appropriate. They do not express a preference for any one strategy to demonstrate the job-relatedness of a test.
- Criterion-related validation requires demonstration of a
correlation or other statistical relationship between test performance and job
performance. In other words, individuals who score high on the test tend to
perform better on the job than those who score low on the test. If the criterion
is obtained at the same time the test is given, it is called concurrent
validity; if the criterion is obtained at a later time, it is called predictive
validity.
- Content-related validation requires a demonstration that the
content of the test represents important job-related behaviors. In other words,
test items should be relevant to and measure directly important requirements and
qualifications for the job.
- Construct-related validation requires a demonstration that the
test measures the construct or characteristic it claims to measure, and that
this characteristic is important to successful performance on the job.
The three methods of validity (criterion-related, content, and construct) should be used to provide validation support depending on the situation. These three general methods often overlap, and, depending on the
situation, one or more may be appropriate. French (1990) offers situational
examples of when each method of validity may be applied.
First, as an example of criterion-related validity, take the position of
millwright. Employees' scores (predictors) on a test designed to measure
mechanical skill could be correlated with their performance in servicing
machines (criterion) in the mill. If the correlation is high, it can be said
that the test has a high degree of validation support, and its use as a
selection tool would be appropriate.
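As a hedged sketch of the millwright example, the validity coefficient is the correlation between the predictor (test scores) and the criterion (job performance). All numbers below are invented, and real-world validity coefficients are typically far lower than in this toy example:

```python
import numpy as np

# Invented data: mechanical-skill test scores (predictor) and supervisor
# ratings of machine-servicing performance (criterion) for eight millwrights.
test_scores = np.array([62, 75, 58, 88, 70, 93, 66, 80])
performance = np.array([3.1, 3.8, 2.9, 4.4, 3.5, 4.6, 3.2, 4.0])

# Criterion-related validity: correlation of predictor with criterion.
validity = np.corrcoef(test_scores, performance)[0, 1]
print(f"validity coefficient: r = {validity:.2f}")
```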
Second, the content validation method may be used when you want to determine
if there is a relationship between behaviors measured by a test and behaviors
involved in the job. For example, a typing test would provide strong validation support
for a secretarial position, assuming much typing is required each day. If,
however, the job required only minimal typing, then the same test would have
little content validity. Content validity does not apply to tests measuring
learning ability or general problem-solving skills (French, 1990).
Finally, the third method is construct validity. This method often pertains
to tests that may measure abstract traits of an applicant. For example,
construct validity may be used when a bank desires to test its applicants for
"numerical aptitude." In this case, an aptitude is not an observable behavior,
but a concept created to explain possible future behaviors. To demonstrate that
the test possesses construct validation support, ". . . the bank would need to
show (1) that the test did indeed measure the desired trait and (2) that this
trait corresponded to success on the job" (French, 1990, p. 260).
Professionally developed tests should come with reports on validity evidence,
including detailed explanations of how validation studies were conducted. If you
develop your own tests or procedures, you will need to conduct your own
validation studies. As the test user, you have the ultimate responsibility for
making sure that validity evidence exists for the conclusions you reach using
the tests. This applies to all tests and procedures you use, whether they have
been bought off-the-shelf, developed externally, or developed in-house.
Validity evidence is especially critical for tests that have adverse impact.
When a test has adverse impact, the
Uniform Guidelines require that
validity evidence for that specific employment decision be provided.
The particular job for which a test is selected should be very similar to the
job for which the test was originally developed. Determining the degree of
similarity will require a
job analysis. Job analysis is a systematic
process used to identify the tasks, duties, responsibilities and working
conditions associated with a job and the knowledge, skills, abilities, and other
characteristics required to perform that job.
Job analysis information may be gathered by direct observation of people
currently in the job, interviews with experienced supervisors and job
incumbents, questionnaires, personnel and equipment records, and work manuals.
In order to meet the requirements of the
Uniform Guidelines, it is
advisable that the job analysis be conducted by a qualified professional, for
example, an industrial and organizational psychologist or other professional
well trained in job analysis techniques. Job analysis information is central in
deciding what to test for and which tests to use.
8. Using validity evidence from outside studies
Conducting your own validation study is expensive, and, in many cases, you
may not have enough employees in a relevant job category to make it feasible to
conduct a study. Therefore, you may find it advantageous to use professionally
developed assessment tools and procedures for which documentation on validity
already exists. However, care must be taken to make sure that validity evidence
obtained for an "outside" test study can be suitably "transported" to your
particular situation.
The Uniform Guidelines, the Standards, and the SIOP Principles state that evidence of transportability is required. Consider the following when using outside tests:
- Validity evidence. The validation procedures used in the
studies must be consistent with accepted standards.
- Job similarity. A job analysis should be performed to verify
that your job and the original job are substantially similar in terms of ability
requirements and work behavior.
- Fairness evidence. Reports of test fairness from outside
studies must be considered for each protected group that is part of your labor
market. Where this information is not available for an otherwise qualified test,
an internal study of test fairness should be conducted, if feasible.
- Other significant variables. These include the type of
performance measures and standards used, the essential work activities
performed, the similarity of your target group to the reference samples, as well
as all other situational factors that might affect the applicability of the
outside test for your use.
To ensure that the outside test you purchase or obtain meets professional and
legal standards, you should consult with testing professionals. See
Chapter 5 for information on locating consultants.
9. How to interpret validity information from test manuals and independent
reviews
To determine if a particular test is valid for your intended use, consult
the test manual and available independent reviews. (Chapter 5 offers sources for test reviews.) The information below can help you interpret the validity evidence reported in these publications.
- In evaluating validity information, it is important to determine whether the
test can be used in the specific way you intended, and whether your target group
is similar to the test reference group.
Test manuals and reviews should describe:
- Available validation evidence supporting use of the test for specific
purposes. The manual should include a thorough description of the procedures
used in the validation studies and the results of those studies.
- The possible valid uses of the test. The purposes for which the test can
legitimately be used should be described, as well as the performance criteria
that can validly be predicted.
- The sample group(s) on which the test was developed. For example, was the
test developed on a sample of high school graduates, managers, or clerical
workers? What was the racial, ethnic, age, and gender mix of the sample?
- The group(s) for which the test may be used.
- The criterion-related validity of a test is measured by the
validity coefficient. It is reported as a number between 0 and 1.00 that
indicates the magnitude of the relationship, "r," between the test and a measure
of job performance (criterion). The larger the validity coefficient, the more
confidence you can have in predictions made from the test scores. However, a
single test can never fully predict job performance because success on the job
depends on so many varied factors. Therefore, validity coefficients, unlike
reliability coefficients, rarely exceed r = .40.
Table 3. General Guidelines for Interpreting Validity Coefficients

| Validity coefficient value | Interpretation |
|----------------------------|----------------|
| above .35 | very beneficial |
| .21 - .35 | likely to be useful |
| .11 - .20 | depends on circumstances |
| below .11 | unlikely to be useful |
As a general rule, the higher the validity coefficient, the more beneficial it is to use the test. Validity coefficients of r = .21 to r = .35 are typical for a single test. Validities for selection systems that use multiple tests will probably be higher, because different tools are being used to measure or predict different aspects of performance, whereas a single test measures or predicts fewer aspects of total performance. Table 3 serves as a general guideline for interpreting test validity for a single test. Evaluating test validity is a sophisticated task, and you might require the services of a testing expert. In addition to the magnitude of the validity coefficient, you should also consider at a minimum the following factors (a brief simulation sketch follows the list):
- level of adverse impact associated with your assessment tool
- selection ratio (the number of applicants versus the number of openings)
- cost of a hiring error
- cost of the selection tool
- probability of hiring a qualified applicant based on chance alone
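Before turning to the scenarios, here is a hedged Monte Carlo sketch of how two of these factors, validity and the selection ratio, interact. It treats the selection ratio as the fraction of applicants hired, simulates applicants whose test scores correlate r with later job performance, hires the top scorers, and reports the share of hires whose performance turns out to be above average; all numbers are invented:

```python
import numpy as np

def success_rate(validity, selection_ratio, n=100_000, seed=0):
    """Share of top-scoring hires whose simulated performance is above average."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, validity], [validity, 1.0]]
    sample = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    test, performance = sample[:, 0], sample[:, 1]
    cutoff = np.quantile(test, 1 - selection_ratio)  # hire the top scorers
    hired = test >= cutoff
    return np.mean(performance[hired] > 0)           # base rate is 50%

# Hiring the top 20% of applicants at several validity levels.
for r in (0.15, 0.25, 0.40):
    print(f"validity r = {r:.2f}: "
          f"{success_rate(r, selection_ratio=0.2):.0%} of hires successful")
```

Even modest validities improve on the 50% base rate, and the improvement grows as validity rises, which is why the scenarios below weigh these factors together.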
Here are three scenarios illustrating why you should consider these factors,
individually and in combination with one another, when evaluating validity
coefficients:
- Scenario One: You are in the process of hiring applicants where you
have a high selection ratio and are filling positions that do not require a
great deal of skill. In this situation, you might be willing to accept a
selection tool that has validity considered "likely to be useful" or even
"depends on circumstances" because you need to fill the positions, you do not
have many applicants to choose from, and the level of skill required is not that
high.
Now, let's change the situation.
- Scenario Two: You are recruiting for jobs that require a high level
of accuracy, and a mistake made by a worker could be dangerous and costly. With
these additional factors, a slightly lower validity coefficient would probably
not be acceptable to you because hiring an unqualified worker would be too much
of a risk. In this case you would probably want to use a selection tool that
reported validities considered to be "very beneficial" because a hiring error
would be too costly to your company.
Here is another scenario that shows why you need to consider multiple factors
when evaluating the validity of assessment tools.
- Scenario Three: A company you are working for is considering using
a very costly selection system that results in fairly high levels of adverse
impact. You decide to implement the selection tool because the assessment tools
you found with lower adverse impact had substantially lower validity, were just
as costly, and making mistakes in hiring decisions would be too much of a risk
for your company. Your company decided to implement the assessment given the
difficulty in hiring for the particular positions, the "very beneficial"
validity of the assessment, and your failed attempts to find alternative
instruments with less adverse impact. However, your company will continue
efforts to find ways of reducing the adverse impact of the system.
Again, these examples demonstrate the complexity of evaluating the validity
of assessments. Multiple factors need to be considered in most situations. You
might want to seek the assistance of a testing expert (for example, an
industrial/organizational psychologist) to evaluate the appropriateness of
particular assessments for your employment situation.
When properly applied, the use of valid and reliable assessment instruments
will help you make better decisions. Additionally, by using a variety of
assessment tools as part of an assessment program, you can more fully assess the
skills and capabilities of people, while reducing the effects of errors
associated with any one tool on your decision making.