MBA and HR solution 4 all: April 2012

Tuesday, 3 April 2012

Self-Assessments

This technique involves applicants generating self-ratings on relevant performance dimensions. Over time, self-assessments can be useful to clarify job performance expectations between employees and supervisors (Bassett & Meyer, 1968; Campbell & Lee, 1988), but initial discrepancies between self- and supervisor ratings in the understanding of job requirements and performance dimensions can cause problems in a performance appraisal system (e.g., Ash, 1980).
Problems with this approach:
1. Self-ratings show greater leniency, less variability, more bias, and less agreement with the judgments of others (Ash, 1980; Harris & Schaubroeck, 1988; Johns, Nilsen & Campbell, 1993; Thornton, 1980; van Vliet, Kletke, & Chakraborty, 1994; Williams & Levy, 1992).
2. The predictive validity of this technique is questionable (Mabe & West, 1982). The predictors related to self-assessments and supervisor's ratings may show a lack of congruence (e.g., self-efficacy related to self-ratings) (Lane & Herriot, 1990).
3. Research suggests that applicants may not honestly respond to this type of technique (Love & Hughes, 1994).
4. Self assessment scores tend to be inflated (Gupta & Beehr, 1982; Ash, 1980).
5. Evidence suggests there is low face validity and perceived fairness associated with using this technique to promote law enforcement personnel.
6. The evidence suggests low accuracy compared to objective measures (George & Smith, 1990; DeNisi & Shaw, 1977).
7. Self-assessments may not correspond to ratings from other sources (e.g., peers) due to a lack of congruence on which specific job dimensions are to be assessed and the relative importance of specific job dimensions (Zalesny & Kirsch, 1989; Zammuto, London, & Rowland, 1982).
8. Congruency in ratings between supervisors and employees may be affected by the decision of supervisors to agree with the self-assessments of employees to avoid potential employee relation conflicts (Farh, Werbel, & Bedeian, 1988).

Future Autobiographies

A candidate is asked to write a future autobiography stating what he/she would be doing in five years. The autobiographies are then scored by two judges for differentiation, demand, and agency. Agency is defined as the extent to which a person sees himself/herself as the prime agent in determining the course of his/her future life. Demand is defined as the extent to which an individual portrays his/her life as a long-term, continuing effort on his/her part. Differentiation is defined as the extent to which an individual has created a complex, detailed mapping of his/her future (Tullar & Barrett, 1976).
Problems with this technique:
1. This test does not measure any of the KSA's that were identified through the job analysis.
2. There is no evidence that this method would reduce adverse impact.

Physical Abilities Tests

Physical Abilities Tests: These tests typically assess applicants on some physical requirement such as lifting strength, rope climbing, or obstacle course completion.

Advantages

can identify individuals who are physically unable to perform the essential functions of a job without risking injury to themselves or others
can result in decreased costs related to disability/medical claims, insurance, and workers' compensation
decreased absenteeism

Disadvantages

costly to administer
requirements must be shown to be job related through a thorough job analysis
may have age based disparate impact against older applicants

Tips

Fitness for the job. Rejection of an applicant for failing a physical abilities test must be based on a determination of the individual's fitness for the job, not on a general determination about the applicant's disabilities.
Liability. Although a physician may administer the physical abilities test, it is the employer who decides whether to hire; therefore, liability for violations of Title VII or the ADA rests with the employer.
Job Analysis. Identify stresses that occur on the job.

Work Sample Tests

Work Sample Tests: Designed to have high content validity through a close relationship with the job. Work Sample tests are based on the premise that the best predictor of future behavior is observed behavior under similar situations. These tests require the examinee to perform tasks that are similar to those that are performed on the job.

Advantages

high reliability
high content validity since work samples are a sample of the actual work performed on the job
low adverse impact
because of their relationship to the job, these tests are typically viewed more favorably by examinees than aptitude or personality tests
difficult for applicants to fake job proficiency which helps to increase the relationship between score on the test and performance on the job
Work Sample tests use equipment that is the same or substantially similar to the actual equipment used on the job

Disadvantages

costly to administer; often can only be administered to one applicant at a time
although useful for jobs where tasks and duties can be completed in a short period of time, these tests have less ability to predict performance on jobs where tasks may take days or weeks to complete
less able to measure aptitudes of an applicant thus restricting the test to measuring ability to perform the work sample and not more difficult tasks that may be encountered on the job

Tips

Job Analysis. Critical for identifying the content of the job from which samples will be developed. The Critical Incident Technique would be useful for identifying job duties/tasks that, if sampled on the test, would result in high predictive validity (criterion-related validity).
High Content Validity. The test should be constructed with the intent of making it highly content valid. Content validity is built into the test.
Equipment. If specific equipment is used by incumbents on the job, try to incorporate all or some of that equipment into the test. Of course, the safety of the applicant should take precedence over the use of dangerous or unfamiliar tools or machines.

Types of Work Sample Tests

Work-Sample Tests of Trainability These are tests through a period of instruction when the applicant is expected to learn tasks involved in a work sample. The work-sample tests of trainability are suitable for untrained applicants with no previous job experience. The predictive validity of this technique is low relative to other techniques and there is evidence the validity of the instrument may attenuate over time.
Simulation of an Event These tests present the candidate with a picture of an incident along with quotations from those involved. The candidates then respond to a series of questions in which they write down the decisions they would make. The test is scored by subject matter experts.
Low Fidelity Simulations These tests present applicants with descriptions of work situations and five alternative responses for each situation. Applicants choose the responses they would most likely and least likely make in each situation.
Work-samples Applicants perform observable, job-related behaviors as predictors of criterion performance. It is not feasible to adapt certain work behaviors for testing. Work samples often are not conducive to group administration and, therefore, were dropped from consideration because of concerns regarding test security.

Validating Work Sample Tests

Content Validity The most direct relationship between the test and job would be shown through content validation. The tasks and duties performed on the test would be compared to the tasks and duties performed on the job. The test should encompass significant (in quantity or in importance) tasks/duties of the job.
Criterion Validity To measure this validity, you must first determine what criteria will be used. Two common forms of criteria are:
- Supervisory ratings of the incumbent's job performance. The disadvantage of using supervisory ratings as criteria is that they typically lack sufficient reliability to be used for statistical analysis. The reliability of these measures is attenuated by rater errors such as 'halo' or 'leniency'. These ratings also tend to lack the variability necessary to show a correlation between predictor and criterion.
- Production measures such as quantity or quality of work. Production measures are not available for some jobs.
The predictor measures used with work sample tests include:
- Number of work samples completed (using a time limit)
- Time to complete work samples (using a limit on the number of work samples to be completed on the test)
- Number and type of errors
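
As a rough illustration of the criterion validity check described above, the sketch below correlates one predictor measure (work samples completed within a time limit) with one production criterion. The data, the variable names, and the choice of a plain Pearson correlation are assumptions made for the example; they are not taken from any particular validation study.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Mirrors the point about ratings that lack variability:
        # with no spread in either measure, no correlation can be shown.
        raise ValueError("correlation is undefined when a measure has no variability")
    return cov / sqrt(var_x * var_y)

# Hypothetical data for six incumbents (illustration only).
samples_completed = [4, 6, 5, 8, 7, 3]      # predictor: work samples finished in the time limit
units_per_week = [40, 55, 50, 70, 60, 35]   # criterion: a production measure

validity = pearson_r(samples_completed, units_per_week)
print(f"criterion-related validity estimate: r = {validity:.2f}")
```

A real study would use far more than six people and would also consider the reliability of the criterion, which this sketch ignores.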

Cognitive Ability Measures

Cognitive Abilities Tests: Paper and pencil or individualized assessment measures of an individual's general mental ability or intelligence.

General Intelligence Tests
Aptitude Tests
- Mechanical Aptitude
- Clerical Aptitude
- Spatial Aptitude

Advantages

highly reliable
verbal reasoning and numerical tests have shown high validity for a wide range of jobs
the validity rises with increasing complexity of the job
combinations of aptitude tests have higher validities than individual tests alone
may be administered in group settings where many applicants can be tested at the same time
scoring of the tests may be completed by computer scanning equipment
lower cost than personality tests

Disadvantages

non-minorities typically score one standard deviation above minorities which may result in adverse impact depending on how the scores are used in the selection process
differences between males and females in abilities (e.g., knowledge of mathematics) may negatively impact the scores of female applicants

Tips

Avoid pure intelligence tests. Intelligence tests may require special administrative procedures and increased costs associated with administering, scoring, and interpreting the results. Aptitude tests are generally more suited for the employment area.
Job Analysis. Before any test is administered, you should conduct a job analysis to identify the job requirements and duties. Tests should be chosen to measure aptitudes and abilities related to the job.
Adverse Impact. Try to avoid tests that have demonstrated adverse impact. If a test is shown to have adverse impact, then the use of the test should be validated in accordance with the Uniform Guidelines on Employee Selection Procedures.
Follow the Instructions. Most tests include instructions for proper test administration and scoring.
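
One common way to screen for adverse impact is the four-fifths rule of thumb from the Uniform Guidelines: a group whose selection rate is less than four-fifths (80%) of the rate of the group with the highest rate is generally treated as showing evidence of adverse impact. The rule is not spelled out in the text above, so the sketch below is a supplement; the function names and the applicant counts are hypothetical.

```python
def selection_rate(selected, applicants):
    """Proportion of applicants in a group who were selected."""
    if applicants <= 0:
        raise ValueError("applicants must be positive")
    return selected / applicants

def impact_ratios(groups):
    """Ratio of each group's selection rate to the highest group rate.

    groups maps a group name to a (selected, applicants) pair.
    """
    rates = {name: selection_rate(s, a) for name, (s, a) in groups.items()}
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no applicants were selected in any group")
    return {name: rate / highest for name, rate in rates.items()}

def flag_adverse_impact(groups, threshold=0.8):
    """Names of groups whose impact ratio falls below the threshold."""
    return [name for name, ratio in impact_ratios(groups).items() if ratio < threshold]

# Hypothetical counts (illustration only): (selected, applicants)
counts = {"group_a": (48, 80), "group_b": (12, 40)}
print(impact_ratios(counts))
print(flag_adverse_impact(counts))
```

The ratio is only a screening device; small samples can produce misleading ratios, so a flagged result calls for closer review rather than a conclusion.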

Summary of Cognitive Ability Tests

Examples of Cognitive Ability Tests

Employee Aptitude Survey A battery of employment tests designed to meet the practical requirements of a personnel office. Consists of 10 cognitive, perceptual, and psychomotor ability tests. Nine of the 10 tests have 5-minute time limits. The remaining test requires two to ten minutes of testing time. Is a tool for personnel selection and a useful diagnostic tool for vocational guidance and career counseling. For situations in which it is desirable to retest an individual on an alternate form, special retest norms are provided for interpreting retest scores.
- Test 1--Verbal Comprehension. Each item consists of one word in capital letters followed by four words in small letters. The respondent is to choose the word in small letters that means about the same as the word in capital letters. Scoring is the number right minus 1/3 the number wrong.
- Test 2--Numerical Ability. A battery of three tests (integers, decimal fractions, and common fractions), each timed separately. Designed to measure skill in the four basic operations of addition, subtraction, multiplication, and division.
- Test 3--Visual Pursuit. Designed to measure the ability to make rapid scanning movements of the eyes without being distracted by other irrelevant visual stimulation. Involves the visual tracing of lines through an entangled network.
- Test 4--Visual Speed And Accuracy. The test consists of two columns of numbers; the respondent decides whether the number in the first column is exactly the same as the number in the second.
- Test 5--Space Visualization. Designed to measure the ability to visualize forms in space and to manipulate these forms or objects mentally. The test taker is shown a group of numbered, piled blocks and must determine, for a specifically numbered block, how many other blocks touch it.
- Test 6--Numerical Reasoning. Designed to measure the ability to analyze logical relationships and to see the underlying principles of such relationships. This is also known as the process of inductive reasoning--making generalizations from specific instances. The test taker is given a series of numbers and determines what the next number will be. Scoring is the number right minus 1/4 the number wrong.
- Test 7--Verbal Reasoning, Revised. Designed to measure the ability to analyze verbally stated facts and to make valid judgments on the basis of the logical implications of such facts; and thus, the ability to analyze available information in order to make practical decisions. Scoring is the number of right answers minus 1/2 the wrong answers.
- Test 8--Word Fluency. Designed to measure the ability to express oneself rapidly, easily and with flexibility. Word fluency involves the speed and freedom of word usage as opposed to understanding verbal meanings. People who measure high in this ability are particularly good at expressing themselves and in finding the right word at the right time. The test taker is given a letter of the alphabet and asked to write as many words as possible that begin with that letter.
- Test 9--Manual Speed And Accuracy. Designed to measure the ability to make rapid and precise movements with the hands and fingers. Also measures, according to the authors, the temperamental willingness to perform highly repetitive, routine, and monotonous work. The test taker is to put a pencil dot in as many circles as he or she can in five minutes, without letting the dots touch the sides of the small circles.
- Test 10--Symbolic Reasoning. Designed to measure the ability to think and reason abstractly, using symbols rather than words or numbers; to manipulate abstract symbols mentally; and to make judgments and decisions which are logical and valid. Each problem contains a statement and a conclusion and uses certain symbols such as the equal sign and mathematical symbols for greater than and smaller than, etc. The test taker determines whether the conclusion is definitely true, definitely false, or impossible to determine on the basis of the statement. Scoring is the number of right answers minus 1/2 the wrong answers.
Progressive Matrices, Advanced Sets I and II. A nonverbal test designed for use as an aid in assessing mental ability. Requires the examinee to solve problems presented in abstract figures and designs. Scores are said to correlate well with comprehensive intelligence tests. Set II provides a means of assessing all the analytical and integral operations involved in the higher thought processes and differentiates between people of superior intellectual ability.
Kaufman Brief Intelligence Test. Brief individually administered measure of verbal and nonverbal intelligence for people aged 4-90. Developed specifically for screening purposes and for those situations where it would be difficult to do a more in-depth assessment. Norms are provided for all ages. Composed of two subtests, vocabulary and matrices. Vocabulary measures verbal, school-related skills by assessing word knowledge and verbal concept formation. Matrices measures nonverbal skills and ability to solve new problems. Items in matrices subtest involve pictures and designs.
Short-term Memory Tests A form of cognitive ability test exemplified by short-term memory tasks, such as forward digit span and serial rote learning, that do not require mental manipulation of inputs in order to provide an output. Short-term memory tests lack face validity in predicting job performance.
Information Processing Tests Selection tests that have the same information processing requirements that occur on the job. In other words, the tests are tailored for each particular job. There is some evidence that adverse impact is reduced.
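
Several of the Employee Aptitude Survey tests listed above are scored as the number right minus a fraction of the number wrong. A minimal sketch of that arithmetic follows. The penalty fractions are the ones quoted in the list; the dictionary keys, the function name, and the treatment of unanswered items as neither right nor wrong are assumptions for illustration.

```python
from fractions import Fraction

# Fraction of each wrong answer subtracted from the number right,
# as quoted in the test descriptions above. Key names are hypothetical.
WRONG_ANSWER_PENALTY = {
    "verbal_comprehension": Fraction(1, 3),
    "numerical_reasoning": Fraction(1, 4),
    "verbal_reasoning": Fraction(1, 2),
    "symbolic_reasoning": Fraction(1, 2),
}

def corrected_score(test, right, wrong):
    """Number right minus the test's penalty fraction times number wrong."""
    if right < 0 or wrong < 0:
        raise ValueError("counts cannot be negative")
    return right - WRONG_ANSWER_PENALTY[test] * wrong

# Hypothetical answer counts (illustration only).
print(corrected_score("verbal_comprehension", right=24, wrong=6))
print(corrected_score("symbolic_reasoning", right=20, wrong=5))
```

Using exact fractions avoids rounding surprises when a penalty such as 1/3 is applied; the result can be converted with float() if a decimal is needed.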

Biographical Inventories

Biographical Data in Selection: Techniques for scoring application forms or biographical questionnaires to be used for selection of applicants.

Advantages

useful for jobs where a large number of employees are performing the same or similar job
useful for jobs where there are a large number of applicants relative to the number of openings

Disadvantages

Tips

Summary of Biographical Data Selection Procedures

Types of Biographical Data Selection Procedures

Background Information/Application Blanks Paper-and-pencil questionnaires, interviews, and communications with past employers in order to assess an individual's behavioral reliability, integrity, and personal adjustment. In order to implement this technique a validation study would have to be conducted.
Empirically-keyed Biodata Applicants are presented with a list of questions pertaining to such things as one's economic stability, work ethic orientation, and educational achievement. Applicants' scores are determined by weighting each item according to the item's empirically derived relationship to the criterion of interest. This technique requires a validation study to be carried out in order to obtain the empirically derived weights for the biodata.
Rationally-keyed Biodata Applicants are presented with a list of questions pertaining to such things as one's economic stability, work ethic orientation and educational achievement. Applicants' scores are determined by weighting each item according to the item's rationally derived relationship to the criterion of interest. Research indicates the predictive validity of this technique may be lower than other available techniques with no evidence for reduced adverse impact against minorities.
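
To make the idea of empirically derived item weights more concrete, here is a small sketch. It assumes yes/no biodata items and a validation sample already split into higher and lower performers, and it weights each item by the difference in endorsement rates between the two groups. That weighting scheme, the item names, and the data are all assumptions chosen for illustration; an actual validation study may derive weights differently.

```python
def empirical_weights(responses, is_high_performer):
    """Weight each item by (endorsement rate among high performers)
    minus (endorsement rate among low performers).

    responses: list of dicts mapping item name -> 0 or 1
    is_high_performer: list of bools aligned with responses
    """
    high = [r for r, h in zip(responses, is_high_performer) if h]
    low = [r for r, h in zip(responses, is_high_performer) if not h]
    if not high or not low:
        raise ValueError("validation sample needs both high and low performers")
    return {
        item: sum(r[item] for r in high) / len(high)
        - sum(r[item] for r in low) / len(low)
        for item in responses[0]
    }

def biodata_score(applicant, weights):
    """Weighted sum of an applicant's item responses."""
    return sum(weights[item] * applicant[item] for item in weights)

# Hypothetical validation sample (illustration only).
validation_sample = [
    {"held_prior_job_2yrs": 1, "completed_degree": 1},
    {"held_prior_job_2yrs": 1, "completed_degree": 0},
    {"held_prior_job_2yrs": 0, "completed_degree": 1},
    {"held_prior_job_2yrs": 0, "completed_degree": 0},
]
high_flags = [True, True, False, False]

weights = empirical_weights(validation_sample, high_flags)
print(weights)
print(biodata_score({"held_prior_job_2yrs": 1, "completed_degree": 1}, weights))
```

In this toy sample the degree item receives a weight of zero because both groups endorse it equally, which shows how an empirical key can discount an item that looks relevant on its face.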

Personality Tests

Personality Tests: A selection procedure that measures the personality characteristics of applicants that are related to future job performance. Personality tests typically measure one or more of five personality dimensions: extroversion, emotional stability, agreeableness, conscientiousness, and openness to experience.

Advantages

can result in lower turnover if applicants are selected for traits that are highly correlated with employees who have high longevity within the organization
can reveal more information about applicant's abilities and interests
can identify interpersonal traits that may be needed for certain jobs

Disadvantages

difficult to measure personality traits that may not be well defined
applicant's training and experience may have greater impact on job performance than applicant's personality
responses by applicant may be altered by applicant's desire to respond in a way they feel would result in their selection
lack of diversity if all selected applicants have same personality traits
cost may be prohibitive for both the test and interpretation of results
lack of evidence to support validity of use of personality tests

Tips

Select traits carefully. An employer that selects applicants with a high degree of 'assertiveness', 'independence', and 'self-confidence' may end up excluding females significantly more than males, which would result in adverse impact.
Select tests carefully. Any tests should have been analyzed for (high) reliability and (low) adverse impact.
Not used exclusively. Personality tests should not be the sole instrument used for selecting applicants. Rather, they should be used in conjunction with other procedures as one element of the selection process.

Summary of Personality Tests

Since there is not a correct answer to personality tests, the scoring of the procedure could be questioned.
Recent litigation has suggested that some items for these types of tests may be too intrusive (Soroka v. Dayton Hudson, 1991).
This technique lacks face validity. In other words, it would be difficult to show how individual questions on certain personality measures are job related even if the overall personality scale is a valid predictor of job performance.
Hooke and Krauss (1971) administered three tests to sergeant candidates: the Minnesota Multiphasic Personality Inventory, the Allport-Vernon-Lindzey Study of Values, and the Gough Adjective Check List. These tests did not differentiate candidates rated as good sergeant material from those rated as poorer candidates. The researchers concluded that the groups may have been so similar that these tests were not sensitive enough to differentiate them.

Types of Personality Tests

Personal Attribute Inventory. An interpersonal assessment instrument which consists of 50 positive and 50 negative adjectives from Gough's Adjective Check List. The subject is to select the 30 that are most descriptive of the target group or person in question. This instrument was specifically designed to tap affective reactions and may be used either to assess attitudes toward others or as a self-concept scale.
Personality Adjective Checklist A comprehensive, objective measure of eight personality styles (which are closely aligned with DSM-III-R Axis II constructs). These eight personality styles are: introversive, inhibited, cooperative, sociable, confident, forceful, respectful, and sensitive. This instrument is designed for use with nonpsychiatric patients and normal adults who read minimally at the eighth grade level. Test reports are computer-generated and are intended for use by qualified professionals only. Interpretive statements are based on empirical data and theoretical inference. They are considered probabilistic in nature and cannot be considered definitive.
Cross-Cultural Adaptability Inventory A self-scoring training instrument, built on a six-point rating scale, designed to provide feedback to individuals about their potential for cross-cultural effectiveness. It is most effective when used as part of a training program. It can also be used as a team-building tool for culturally diverse work groups and as a counseling tool for people in the process of cross-cultural adjustment. The inventory contains 50 items, distributed among 4 subscales: emotional resilience, flexibility/openness, perceptual acuity, and personal autonomy.
California Psychological Inventory Multipurpose questionnaire designed to assess normal personality characteristics important in everyday life that individuals make use of to understand, classify, and predict their own behaviors and that of others. In this revision, two new scales, empathy and independence, have been added; semantic changes were made in 29 items; and 18 items were eliminated. The inventory is applicable for use in a variety of settings, including business and industry, schools and colleges, clinics and counseling agencies, and for cross cultural and other research. May be used to advise employees/applicants about their vocational plans.

Sample Questions of Personality Tests

The following items are similar to items found on personality tests:

		Never	Seldom	Sometimes	Often	Always
1.	I enjoy reading books of fiction.
2.	I am more conservative than risk taking.
3.	Sometimes I get very nervous.
4.	I more often introduce myself to strangers than strangers introduce themselves to me.
5.	I consider myself more of a doer than a thinker.
6.	I like to set goals before beginning a project.
7.	I like to follow schedules.
8.	I think it is OK to bend the rules to complete a task on time.
9.	I enjoy long weekends.

Interviews

Interviews: A selection procedure designed to predict future job performance on the basis of applicants' oral responses to oral inquiries.

Advantages

useful for determining if the applicant has requisite communicative or social skills which may be necessary for the job
interviewer can obtain supplementary information
used to appraise candidates' verbal fluency
can assess the applicant's job knowledge
can be used for selection among equally qualified applicants
enables the supervisor and/or co-workers to determine if there is compatibility between the applicant and the employees
allows the applicant to ask questions that may reveal additional information useful for making a selection decision
the interview may be modified as needed to gather important information

Disadvantages

subjective evaluations are made
decisions tend to be made within the first few minutes of the interview with the remainder of the interview used to validate or justify the original decision
interviewers form stereotypes concerning the characteristics required for success on the job
research has shown disproportionate rates of selection between minority and non-minority members using interviews
negative information seems to be given more weight
not much evidence of validity of the selection procedure
not as reliable as tests

Tips

Minimize stereotypes. To minimize the influence of racial and sex stereotypes in the interview process, provide interviewers with a job description and specification of the requirements for the position. Interviewers with little information about the job may be more likely to make stereotypical judgements about the suitability of candidates than are interviewers with detailed information about the job.
Job Related. Try to make the interview questions job related. If the questions are not related to the job, then the validity of the interview procedure may be lower.
Train Interviewers. Improve the interpersonal skills of the interviewer and the interviewer's ability to make decisions without influence from non-job related information. Interviewers should be trained to:

avoid asking questions unrelated to the job
avoid making quick decisions about an applicant
avoid stereotyping applicants
avoid giving too much weight to a few characteristics
try to put the applicant at ease during the interview
communicate clearly with the applicant
maintain consistency in the questions asked

Summary of Interviews

In general, interviews have the following weaknesses:

validity of the interview is relatively low
reliability of the interview is also low
stereotyping by interviewers, in general, may lead to adverse impact against minorities
the subjective nature of this procedure may allow bias such as favoritism and politics to enter into the selection process
this procedure is not standardized
not useful when large numbers of applicants must be evaluated and/or selected

Types of Interviews

Unstructured Interview Involves a procedure where different questions may be asked of different applicants.
Situational Interview Candidates are interviewed about what actions they would take in various job-related situations. The job-related situations are usually identified using the critical incidents job analysis technique. The interviews are then scored using a scoring guide constructed by job experts.
Behavior Description Interviews Candidates are asked what actions they have taken in prior job situations that are similar to situations they may encounter on the job. The interviews are then scored using a scoring guide constructed by job experts.
Comprehensive Structured Interviews Candidates are asked questions pertaining to how they would handle job-related situations, job knowledge, worker requirements, and how the candidate would perform various job simulations. Interviews tapping job knowledge offer a way to assess a candidate's current level of knowledge related to relevant implicit dimensions of job performance (i.e., "tacit knowledge" or "practical intelligence" related to a specific job position)
Structured Behavioral Interview This technique involves asking all interviewees standardized questions about how they handled past situations that were similar to situations they may encounter on the job. The interviewer may also ask discretionary probing questions for details of the situations, the interviewee's behavior in the situation and the outcome. The interviewee's responses are then scored with behaviorally anchored rating scales.
Oral Interview Boards This technique entails the job candidate giving oral responses to job-related questions asked by a panel of interviewers. Each member of the panel then rates each interviewee on such dimensions as work history, motivation, creative thinking, and presentation. The scoring procedure for oral interview boards has typically been subjective; thus, it would be subject to the personal biases of those individuals sitting on the board. This technique may not be feasible for jobs in which there are a large number of applicants that must be interviewed.

Using, Scoring, and Interpreting Assessment Instruments

This chapter describes some of the most common assessment instrument scoring procedures. It also discusses how to properly interpret results, and how to use them effectively. Other issues regarding the proper use of assessment tools are also discussed.
Chapter Highlights
1. Assessment instrument scoring procedures
2. Test interpretation methods: norm and criterion-referenced tests
3. Interpreting test results
4. Processing test results to make employment decisions: rank-ordering and cut-off scores
5. Combining information from many assessment tools
6. Minimizing adverse impact

Principle of Assessment: Ensure that scores are interpreted properly.

1. Assessment instrument scoring procedures

Test publishers may offer one or more ways to score the tests you purchase. Available options may range from hand scoring by your staff to machine scanning and scoring done by the publisher. All options have their advantages and disadvantages. When you select the tests for use, investigate the available scoring options. Your staff's time, turnaround time for test results, and cost may all play a part in your purchasing decision.

Hand scoring. The answer sheet is scored by counting the number of correct responses, often with the aid of a stencil. These scores may then have to be converted from the raw score count to a form that is more meaningful, such as a percentile or standard score. Staff must be trained on proper hand scoring procedures and raw score conversion. This method is more prone to error than machine scoring. To improve accuracy, scoring should be double checked. Hand scoring a test may take more time and effort, but it may also be the least expensive method when there are only a small number of tests to score.
Computer-based scoring. Tests can be scored using a computer and test scoring software purchased from the test publisher. When the test is administered in a paper-and-pencil format, raw scores and identification information must be key-entered by staff following the completion of the test session. Converted scores and interpretive reports can be printed immediately. When the test is administered on the computer, scores are most often generated automatically upon completion of the test; there is no need to key-enter raw scores or identifying information. This is one of the major advantages of computer-based testing.
Optical scanning. Machine scorable answer sheets are now readily available for many multiple choice tests. They are quickly scanned and scored by an optical mark reader. You may be able to score these answer sheets in-house or send them to the test publisher for scoring.
- On-site. You will need a personal computer system (computer, monitor, and printer), an optical reader, and special test scoring software from the publisher. Some scanning programs not only generate test scores but also provide employers with individual or group interpretive reports. Scanning systems can be costly, and the staff must learn to operate the scanner and the computer program that does the test scoring and reporting. However, using a scanner is much more efficient than hand scoring, or key-entering raw scores when testing volume is heavy.
- Mail-in and fax scoring. In many cases the completed machine-scannable answer sheets can be mailed or faxed to the test publisher. The publisher scores the answer sheets and returns the scores and test reports to the employer. Test publishers generally charge a fee for each test scored and for each report generated. For mail-in service, there is a delay of several days between mailing answer sheets and receipt of the test results from the service. Overnight mail by private or public carrier will shorten the wait but will add to the cost. Some publishers offer a scoring service by fax machine. This will considerably shorten the turn-around time, but greater care must be taken to protect the confidentiality of the results.

2. Test interpretation methods: norm and criterion-referenced tests

Employment tests are used to make inferences about people's characteristics, capabilities, and likely future performance on the job. What does the test score mean? Is the applicant qualified? To help answer these questions, consider what the test is designed to accomplish. Does the test compare one person's score to those obtained by others in the occupation, or does it measure the absolute level of skill an individual has obtained? These two methods are described below.

Norm-referenced test interpretation. In norm-referenced test interpretation, the scores that the applicant receives are compared with the test performance of a particular reference group. In this case the reference group is the norm group. The norm group generally consists of large representative samples of individuals from specific populations, such as high school students, clerical workers, or electricians. It is their average test performance and the distribution of their scores that set the standard and become the test norms of the group. The test manual will usually provide detailed descriptions of the norm groups and the test norms. To ensure valid scores and meaningful interpretation of norm-referenced tests, make sure that your target group is similar to the norm group. Compare the educational level, the occupational, language and cultural backgrounds, and other demographic characteristics of the individuals making up the two groups to determine their similarity.
For example, consider an accounting knowledge test that was standardized on the scores obtained by employed accountants with at least 5 years of experience. This would be an appropriate test if you are interested in hiring experienced accountants. However, this test would be inappropriate if you are looking for an accounting clerk. You should look for a test normed on accounting clerks or a closely related occupation.
Criterion-referenced test interpretation. In criterion-referenced tests, the test score indicates the amount of skill or knowledge the test taker possesses in a particular subject or content area. The test score is not used to indicate how well the person does compared to others; it relates solely to the test taker's degree of competence in the specific area assessed. Criterion-referenced assessment is generally associated with educational and achievement testing, licensing, and certification. A particular test score is generally chosen as the minimum acceptable level of competence. How is a level of competence chosen? The test publisher may develop a mechanism that converts test scores into proficiency standards, or the company may use its own experience to relate test scores to competence standards.
For example, suppose your company needs clerical staff with word processing proficiency. The test publisher may provide you with a conversion table relating word processing skill to various levels of proficiency, or your own experience with current clerical employees can help you to determine the passing score. You may decide that a minimum of 35 words per minute with no more than two errors per 100 words is sufficient for a job with occasional word processing duties. If you have a job with high production demands, you may wish to set the minimum at 75 words per minute with no more than 1 error per 100 words.
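The word processing example can be written as a simple pass/fail check. The following is a minimal sketch only: the two standards use the figures from the example above, while the names `TypingStandard`, `OCCASIONAL_USE`, `HIGH_PRODUCTION`, and `meets_standard` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypingStandard:
    """A criterion-referenced standard: a fixed level of competence."""
    min_wpm: float
    max_errors_per_100_words: float


# Figures taken from the example in the text; the names are illustrative.
OCCASIONAL_USE = TypingStandard(min_wpm=35, max_errors_per_100_words=2)
HIGH_PRODUCTION = TypingStandard(min_wpm=75, max_errors_per_100_words=1)


def meets_standard(wpm, errors, words_typed, standard):
    """Return True if the test taker meets the standard.

    The result depends only on the test taker's own performance,
    not on how other people scored.
    """
    if words_typed <= 0:
        raise ValueError("words_typed must be positive")
    errors_per_100_words = errors / words_typed * 100
    return (wpm >= standard.min_wpm
            and errors_per_100_words <= standard.max_errors_per_100_words)


# A test taker who typed 200 words at 40 wpm with 3 errors (1.5 per 100 words)
print(meets_standard(40, 3, 200, OCCASIONAL_USE))   # True
print(meets_standard(40, 3, 200, HIGH_PRODUCTION))  # False
```

Note that the same performance passes one standard and fails the other. The score itself does not change; only the level of competence required by the job does.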

It is important to ensure that all inferences you make on the basis of test results are well founded. Only use tests for which sufficient information is available to guide and support score interpretation. Read the test manual for instructions on how to properly interpret the test results. This leads to the next principle of assessment.

Principle of Assessment: Ensure that scores are interpreted properly.

3. Interpreting test results

Test results are usually presented in terms of numerical scores, such as raw scores, standard scores, and percentile scores. In order to interpret test scores properly, you need to understand the scoring system used.

Types of scores
- Raw scores. These refer to the unadjusted scores on the test. Usually the raw score represents the number of items answered correctly, as in mental ability or achievement tests. Some types of assessment tools, such as work value inventories and personality inventories, have no "right" or "wrong" answers. In such cases, the raw score may represent the number of positive responses for a particular trait. Raw scores do not provide much useful information. Consider a test taker who gets 25 out of 50 questions correct on a math test. It's hard to know whether "25" is a good score or a poor score. When you compare the results to all the other individuals who took the same test, you may discover that this was the highest score on the test. In general, for norm-referenced tests, it is important to see where a particular score lies within the context of the scores of other people. Adjusting or converting raw scores into standard scores or percentiles will provide you with this kind of information. For criterion-referenced tests, it is important to see what a particular score indicates about proficiency or competence.
- Standard scores. Standard scores are converted raw scores. They indicate where a person's score lies in comparison to a reference group. For example, if the test manual indicates that the average or mean score for the group on a test is 50, then an individual who gets a higher score is above average, and an individual who gets a lower score is below average. Standard scores are discussed in more detail below in the section on standard score distributions.
- Percentile score. A percentile score is another type of converted score. An individual's raw score is converted to a number indicating the percent of people in the reference group who scored at or below the test taker's score. For example, a score at the 70th percentile means that the individual's score is the same as or higher than the scores of 70% of those who took the test. The 50th percentile is known as the median and represents the middle score of the distribution.
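To see why a raw score means little without a reference group, consider the math test example above. The following is a minimal sketch, assuming a small made-up reference group and the "same as or higher than" reading of a percentile score; the function name `percentile_score` and the sample scores are hypothetical. Published tests use the conversion tables in the test manual rather than a calculation like this.

```python
def percentile_score(raw_score, reference_scores):
    """Percent of the reference group scoring at or below raw_score."""
    if not reference_scores:
        raise ValueError("reference group must not be empty")
    at_or_below = sum(1 for score in reference_scores if score <= raw_score)
    return 100.0 * at_or_below / len(reference_scores)


# Made-up raw scores (items correct out of 50) for a reference group of ten.
reference_group = [10, 12, 15, 18, 20, 21, 22, 23, 24, 25]

# A raw score of 25 out of 50 looks mediocre in isolation,
# but it is the highest score in this group.
print(percentile_score(25, reference_group))  # 100.0
print(percentile_score(20, reference_group))  # 50.0
```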
Score distribution
- Normal curve. A great many human characteristics, such as height, weight, math ability, and typing skill, are distributed in the population at large in a typical pattern. This pattern of distribution is known as the normal curve and has a symmetrical bell-shaped appearance. The curve is illustrated in Figure 2. As you can see, a large number of individual cases cluster in the middle of the curve. The farther from the middle or average you go, the fewer the cases. In general, distributions of test scores follow the same normal curve pattern. Most individuals get scores in the middle range. As the extremes are approached, fewer and fewer cases exist, indicating that progressively fewer individuals get low scores (left of center) and high scores (right of center).
- Standard score distribution. There are two characteristics of a standard score distribution that are reported in test manuals. One is the mean, a measure of central tendency; the other is the standard deviation, a measure of the variability of the distribution.
  - Mean. The most commonly used measure of central tendency is the mean or arithmetic average score. Test developers generally assign an arbitrary number to represent the mean standard score when they convert from raw scores to standard scores. Look at Figure 2. Test A and Test B are two tests with different standard score means. Notice that Test A has a mean of 100 and Test B has a mean of 50. If an individual got a score of 50 on Test A, that person did very poorly. However, a score of 50 on Test B would be an average score.
  - Standard deviation. The standard deviation is the most commonly used measure of variability. It is used to describe the distribution of scores around the mean. Figure 2 shows the percent of cases 1, 2, and 3 standard deviations (sd) above the mean and 1, 2, and 3 standard deviations below the mean. As you can see, 34% of the cases lie between the mean and +1 sd, and 34% of the cases lie between the mean and -1 sd. Thus, approximately 68% of the cases lie between -1 and +1 standard deviations. Notice that for Test A, the standard deviation is 20, and 68% of the test takers score between 80 and 120. For Test B the standard deviation is 10, and 68% of the test takers score between 40 and 60.
- Percentile distribution. The bottom horizontal line below the curve in Figure 2 is labeled "Percentiles." It represents the distribution of scores in percentile units. Notice that the median is in the same position as the mean on the normal curve. By knowing the percentile score of an individual, you already know how that individual compares with others in the group. An individual at the 98th percentile scored the same or better than 98% of the individuals in the group. This is equivalent to getting a standard score of 140 on Test A or 70 on Test B.
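The relationships among standard scores, standard deviations, and percentiles can be checked with a short calculation. The sketch below assumes that scores on Test A and Test B follow the normal curve exactly, using the means and standard deviations given above; the helper names `percentile_of` and `convert` are hypothetical. Real score distributions are only approximately normal, so the test manual's tables take precedence.

```python
from statistics import NormalDist

# Means and standard deviations as described for Figure 2.
TEST_A = NormalDist(mu=100, sigma=20)
TEST_B = NormalDist(mu=50, sigma=10)


def percentile_of(standard_score, test):
    """Percent of the group expected to score at or below standard_score."""
    return 100 * test.cdf(standard_score)


def convert(standard_score, source, target):
    """Express a score from one test on the scale of another.

    Both scores sit the same number of standard deviations from the mean.
    """
    sd_from_mean = (standard_score - source.mean) / source.stdev
    return target.mean + sd_from_mean * target.stdev


# Share of test takers within one standard deviation of the mean on Test A
print(round(100 * (TEST_A.cdf(120) - TEST_A.cdf(80))))  # 68

# A score of 140 on Test A is 2 sd above the mean
print(round(percentile_of(140, TEST_A)))  # 98
print(convert(140, TEST_A, TEST_B))       # 70.0

# The same number, 50, means very different things on the two tests
print(round(percentile_of(50, TEST_A), 1))  # 0.6
print(round(percentile_of(50, TEST_B), 1))  # 50.0
```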

4. Processing test results to make employment decisions: rank-ordering and cut-off scores

The rank-ordering of test results, the use of cut-off scores, or some combination of the two is commonly used to assess the qualifications of people and to make employment-related decisions about them. These are described below.
Rank-ordering is a process of arranging candidates on a list from highest score to lowest score based on their test results. In rank-order selection, candidates are chosen on a top-down basis.
A cut-off score is the minimum score that a candidate must have to qualify for a position. Employers generally set the cut-off score at a level which they determine is directly related to job success. Candidates who score below this cut-off generally are not considered for selection. Test publishers typically recommend that employers base their selection of a cut-off score on the norms of the test.
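Rank-ordering and cut-off scores can be combined in a single selection step. The following is a minimal sketch; the function names, candidate names, scores, and the cut-off of 80 are all made up for illustration, and tied scores are simply kept in their original order.

```python
def rank_order(scores):
    """Arrange candidates from highest score to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def select_top_down(scores, openings, cut_off=None):
    """Choose candidates top-down, optionally applying a cut-off score first."""
    ranked = rank_order(scores)
    if cut_off is not None:
        # Candidates below the cut-off are not considered for selection.
        ranked = [(name, score) for name, score in ranked if score >= cut_off]
    return [name for name, _ in ranked[:openings]]


# Hypothetical candidates and test scores
scores = {"Ana": 82, "Ben": 67, "Cho": 91, "Dee": 74}

print(select_top_down(scores, openings=3))              # ['Cho', 'Ana', 'Dee']
print(select_top_down(scores, openings=3, cut_off=80))  # ['Cho', 'Ana']
```

With a cut-off in place, a position may go unfilled even though candidates remain on the list, as in the second call above.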
5. Combining information from many assessment tools

Many assessment programs use a variety of tests and procedures in their assessment of candidates. In general, you can use a "multiple hurdles" approach or a "total assessment" approach, or a combination of the two, in using the assessment information obtained.

Multiple hurdles approach. In this approach, test takers must pass each test or procedure (usually by scoring above a cut-off score) to continue within the assessment process. The multiple hurdles approach is appropriate and necessary in certain situations, such as requiring test takers to pass a series of tests for licensing or certification, or requiring all workers in a nuclear power plant to pass a safety test. It may also be used to reduce the total cost of assessment by administering less costly screening devices to everyone, but having only those who do well take the more expensive tests or other assessment tools.
Total assessment approach. In this approach, test takers are administered every test and procedure in the assessment program. The information gathered is used in a flexible or counterbalanced manner. This allows a high score on one test to be counterbalanced with a low score on another. For example, an applicant who performs poorly on a written test, but shows great enthusiasm for learning and is a very hard worker, may still be an attractive hire. A key decision in using the total assessment approach is determining the relative weights to assign to each assessment instrument in the program.
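The multiple hurdles approach can be sketched as a sequence of pass/fail checks in which a candidate stops at the first failed stage. The stage names, cut-off scores, and the function `multiple_hurdles` below are hypothetical and serve only to illustrate the logic.

```python
def multiple_hurdles(candidate_scores, hurdles):
    """Apply each hurdle in order; stop at the first one the candidate fails.

    hurdles is an ordered list of (stage_name, cut_off_score) pairs.
    Returns (passed_all, name_of_failed_stage_or_None).
    """
    for stage, cut_off in hurdles:
        if candidate_scores[stage] < cut_off:
            # No compensation is possible: later high scores do not count.
            return False, stage
    return True, None


# Hypothetical program: an inexpensive screening test first,
# then a more costly work sample for those who pass.
hurdles = [("screening_test", 60), ("work_sample", 70)]

print(multiple_hurdles({"screening_test": 75, "work_sample": 80}, hurdles))
# (True, None)
print(multiple_hurdles({"screening_test": 50, "work_sample": 95}, hurdles))
# (False, 'screening_test')
```

In practice the second candidate would never take the work sample, which is how this approach reduces the total cost of assessment. Under the total assessment approach, by contrast, the high work sample score could offset the low screening score.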

Figure 3 is a simple example of how assessment results from several tests and procedures can be combined to generate a weighted composite score.

Assessment instrument	Assessment score (0-100)	Assigned weight	Weighted total
Interview	80	8	640
Mechanical ability test	60	10	600
H.S. course work	90	5	450
		Total Score: 1,690

An employer is hiring entry-level machinists. The assessment instruments consist of a structured interview, a mechanical ability test, and high school course work. After consultation with relevant staff and experts, a weight of 8 is assigned for the interview, 10 for the test, and 5 for course work. A sample score sheet for one candidate, Candidate A, is shown above. As you can see, although Candidate A scored lowest on the mechanical ability test, the weights of all of the assessment instruments as a composite allowed him/her to continue on as a candidate for the machinist job rather than being eliminated from consideration as a result of the one low score.
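The arithmetic behind Candidate A's score sheet can be reproduced in a few lines. This is a minimal sketch using the scores and weights from the example; the function name `weighted_composite` and the dictionary keys are hypothetical labels.

```python
def weighted_composite(scores, weights):
    """Sum of each assessment score multiplied by its assigned weight."""
    if scores.keys() != weights.keys():
        raise ValueError("every instrument needs both a score and a weight")
    return sum(scores[name] * weights[name] for name in scores)


# Weights assigned after consultation with relevant staff and experts
weights = {"interview": 8, "mechanical_ability_test": 10, "hs_course_work": 5}

# Candidate A's assessment scores (0-100)
candidate_a = {"interview": 80, "mechanical_ability_test": 60, "hs_course_work": 90}

print(weighted_composite(candidate_a, weights))  # 1690

# Highest possible total: a score of 100 on every instrument
print(100 * sum(weights.values()))  # 2300
```

Because the composite is a sum, the comparatively low mechanical ability score (600 weighted points) is offset by the interview and course work rather than eliminating the candidate outright.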
Figure 3. Score-sheet for entry level machinist job: Candidate A.

6. Minimizing adverse impact

A well-designed assessment program will improve your ability to make effective employment decisions. However, some of the best predictors of job performance may exhibit adverse impact. As a test user, there are several good testing practices to follow to minimize adverse impact in conducting personnel assessment and to ensure that, if adverse impact does occur, it is not a result of deficiencies in your assessment tools.

Be clear about what needs to be measured, and for what purpose. Use only assessment tools that are job-related and valid, and only use them in the way they were designed to be used.
Use assessment tools that are appropriate for the target population.
Do not use assessment tools that are biased or unfair to any group of people.
Consider whether there are alternative assessment methods that have less adverse impact.
Consider whether there is another way to use the test that either reduces or is free of adverse impact.
Consider whether use of a test with adverse impact is necessary. Does the test improve the quality of selections to such an extent that the magnitude of adverse impact is justified by business necessity?
If you determine that it is necessary to use a test that may result in adverse impact, it is recommended that it be used as only one part of a comprehensive assessment process. That is, apply the whole-person approach to your personnel assessment program. This approach will allow you to improve your assessment of the individual and reduce the effect of differences in average scores between groups on a single test.
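One common way to check whether a test may be producing adverse impact is to compare selection rates between groups. The sketch below applies the "four-fifths" rule of thumb from the Uniform Guidelines on Employee Selection Procedures, which is not described in this section and is used here as an assumption; the group labels, counts, and function name `impact_ratios` are hypothetical. A ratio below 0.8 is only a signal to look more closely, not a legal conclusion.

```python
def impact_ratios(groups):
    """Compare each group's selection rate with the highest group's rate.

    groups maps a group label to (number_selected, number_of_applicants).
    """
    rates = {}
    for label, (selected, applicants) in groups.items():
        if applicants <= 0:
            raise ValueError("each group needs at least one applicant")
        rates[label] = selected / applicants
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no one was selected in any group")
    return {label: rate / highest for label, rate in rates.items()}


# Hypothetical results for one test
groups = {"group_x": (48, 80), "group_y": (12, 40)}

FOUR_FIFTHS = 0.8  # rule-of-thumb threshold (assumption, see lead-in)
for label, ratio in impact_ratios(groups).items():
    flag = "review for adverse impact" if ratio < FOUR_FIFTHS else "ok"
    print(f"{label}: ratio {ratio:.2f} ({flag})")
```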

Issues and Concerns with Assessment

It is important to remember that an assessment instrument, like any tool, is most effective when used properly and can be very counterproductive when used inappropriately. In previous chapters you have read about the advantages of using tests and procedures as part of your personnel assessment program. You have also read about the limitations of tests in providing a consistently accurate and complete picture of an individual's employment-related qualifications and potential. This chapter highlights some important issues and concerns surrounding these limitations. Careful attention to these issues and concerns will help you produce a fair and effective assessment program.
Chapter Highlights
1. Deciding whether to test or not to test
2. Viewing tests as threats and invasions of privacy
3. Fallibility of test scores
4. Appeals process and retesting
5. Qualifications of assessment staff
6. Misuse or overuse of tests
7. Ensuring both efficiency and diversity
8. Ethnic, linguistic, and cultural differences and biases
9. Testing people with disabilities

1. Deciding whether to test or not to test

How successful is your current assessment program? Is it in need of improvement? The decision to use a test is an important one. You need to carefully consider several technical, administrative, and practical matters. Sometimes a more vigorous employee training program will help to improve individual and organizational performance without expanding your current selection procedures. Sometimes a careful review of each candidate's educational background and work history will help you to select better workers, and sometimes using additional tests will be beneficial.
Consider how much additional time and effort will be involved in expanding your assessment program. As in every business decision, you will want to determine whether the potential benefits outweigh the expenditure of time and effort. Be sure to factor in all the costs, such as purchase of tests and staff time, and balance these against all the benefits, including potential increases in productivity.
In summary, before expanding your assessment program, it is important to have a clear picture of your organization's needs, the benefits you can expect, and the costs you will incur.
2. Viewing tests as threats and invasions of privacy

Many people are intimidated at the mere thought of taking a test. Some may fear that testing will expose their weaknesses, and some may fear that tests will not measure what they really can do on the job. Also, some people may view certain tests as an invasion of privacy. This is especially true of personality tests, honesty tests, medical tests, and tests that screen for drug use. Fear or mistrust of tests can lower the scores of some otherwise qualified candidates. To reduce these feelings, it is important to take the time to explain a few things about the testing program before administering a test. Any explanation should, at a minimum, cover the following topics:
- why the test is being administered
- fairness of the test
- confidentiality of test results
- how the test results will be used in the assessment process.

3. Fallibility of test scores

All assessment tools and procedures are subject to measurement errors. This means that a test neither measures a characteristic with perfect accuracy for all people, nor fully accounts for their job performance. Thus, there will always be some errors in employment decisions made based on assessment results. This is true of all assessment procedures, regardless of how objective or standardized they might be. It is, therefore, important not to rely entirely on any one assessment instrument in making employment decisions. Using a variety of assessment tools will help you obtain a fuller and more accurate picture of an individual. Consider such information as an evaluation of a person's education, work experience and other job-relevant factors in addition to standardized test results.
4. Appeals process and retesting

Every test taker should have a fair chance to demonstrate his or her best performance on an assessment procedure. However, at times this might not occur. If the results may not be valid for an individual, consider retesting or using alternative assessment procedures before screening the individual out. There are external circumstances or conditions that could invalidate the test results. These may include the test taker's state of mind or health at the time of the test, the conditions under which the test is given, and his or her familiarity with particular questions on the test. To give some specific examples, a person who has a child at home with the measles may not be able to concentrate on taking a vocabulary test. Someone sitting next to a noisy air conditioner may also not be able to concentrate on the test questions. On another day, under different circumstances, these individuals might obtain a different score.
If you believe that the test was not valid for an individual, you should consider a retest. If other versions of the test are not available, consider alternative means of assessment. Check the test manual for advice from the publisher regarding retesting. It is advisable to develop a policy on handling complaints regarding testing and appeals for retesting, so that these concerns can be resolved fairly and consistently.
5. Qualifications of assessment staff

Test results may not be accurate if the tests have not been administered and scored properly, or if the results are not interpreted appropriately. The usefulness of test results depends on proper administration, scoring and interpretation. Qualified individuals must be chosen to administer and score tests and interpret test results. These individuals must be trained appropriately. Test manuals will usually specify the qualifications and training needed to administer and score the tests and interpret results.
6. Misuse or overuse of tests

A single test cannot be expected to be valid in all situations and for all groups of people. A test generally is developed to measure specific characteristics and to predict specific performance criteria for a particular group. For example, a test with items designed to select salespersons may not be valid for identifying good sales managers. In addition, test results usually provide specific information that is valid for a specific amount of time. Therefore, it is unlikely to be appropriate to consider an employee for a promotion based on his or her test scores on a proficiency test taken 5 years earlier.
The test manual and independent reviews of the test remain your best guides on administering, scoring, and interpreting the test.
7. Ensuring both efficiency and diversity

Use of reliable and valid assessment tools can result in improved performance of your workforce. However, when designing an assessment system, it is also important to consider how to ensure a diverse workforce that can help your organization be successful in today's diverse marketplace. To encourage diversity in your organization, consider how different types of people perform on different types of tests. Some research has indicated that older workers and members of a variety of racial and ethnic groups do not do as well on certain types of tests as members of other groups. For example, older people and women tend to do less well on physical ability and endurance tests. Members of some ethnic and racial groups, on average, may do less well on ability tests. Older people tend not to score as high as younger people on timed tests. Even though these groups perform less well on certain tests, they may still perform on the job successfully. Thus by using certain types of assessments, or relying heavily on one type of test, you may limit the diversity of your workforce and miss out on some very productive potential employees (e.g., if you used only physical ability tests, you may unnecessarily exclude older workers). You might also be violating federal, state, and local equal employment opportunity laws. To help ensure both efficiency and diversity in your workforce, apply the whole-person approach to assessment. Use a variety of assessment tools to obtain a comprehensive picture of the skills and capabilities of applicants and employees. This approach to assessment will help you make sure you don't miss out on some very qualified individuals who could enhance your organization's success.
8. Ethnic, linguistic, and cultural differences and biases

The American workforce is made up of a diverse array of ethnic and cultural groups, including many persons for whom English is not the primary language. Some of these individuals may experience difficulty on standardized tests due to cultural differences or lack of mastery of the English language. Depending on the nature of the job for which they are applying, this could mean that their test scores will not accurately predict their true job potential. Before selecting new tests, consider the composition of your potential candidate population. Are the tests appropriate for all of them? The test manuals may provide assistance in determining this. If you need further clarification, contact the test publisher.
There may be cases where appropriate standardized tests are not available for certain groups. You may have to rely on other assessment techniques, such as interviews and evaluations of education and work experience, to make your employment decisions.
9. Testing people with disabilities

Many people with disabilities are productive workers. The ADA protects qualified individuals with disabilities from discrimination in all aspects of employment, including personnel assessment. Your staff should be trained to evaluate requests for reasonable accommodation and provide these accommodations if they are necessary and would not cause "undue hardship." These situations must be handled with professionalism and sensitivity. Properly handled, this can be accomplished without compromising the integrity of the assessment process. Accommodation may involve ensuring physical accessibility to the test site, modifying test equipment or tests, or providing other forms of assistance. Giving extra time for certain kinds of tests to test takers with dyslexia or other learning disabilities and administering a braille version of a test for the blind may be examples of reasonable accommodation. See Chapters 2 and 6 for further discussions on testing people with disabilities.

How to Select Tests: Standards for Evaluating Tests

Previous chapters described a number of types of personnel tests and procedures and use of assessment tools to identify good workers and improve organizational performance. Technical and legal issues that have to be considered in using tests were also discussed. In this chapter, information and procedures for evaluating tests will be presented.
Chapter Highlights
1. Sources of information about tests
2. Standards for evaluating a test: information to consider to determine suitability of a test for your use
3. Checklist for evaluating a test.

Principle of Assessment: Use assessment instruments for which understandable and comprehensive documentation is available.

1. Sources of information about tests

Many assessment instruments are available for use in employment contexts. Sources that can help you determine which tests are appropriate for your situation are described below.

Test manual. A test manual should provide clear and complete information about how the test was developed; its recommended uses and possible misuses; and evidence of reliability, validity, and fairness. The manual also should contain full instructions for test administration, scoring, and interpretation. In summary, a test manual should provide sufficient administrative and technical information to allow you to make an informed judgment as to whether the test is suitable for your use. You can order specimen test sets and test manuals from most test publishers. Test publishers and distributors vary in the amount and quality of information they provide in test manuals. The quality and comprehensiveness of the manual often reflect the adequacy of the research base behind the test. Do not mistake catalogs or pamphlets provided by test publishers and distributors for test manuals. Catalogs and pamphlets are marketing tools aimed at selling products. To get a balanced picture of the test, it is important to consult independently published critical test reviews in addition to test manuals.
Mental Measurements Yearbook (MMY). The MMY is a major source of information about assessment tools. It consists of a continuing series of volumes. Each volume contains reviews of tests that are new or significantly revised since the publication of the previous volume. New volumes do not replace old ones; rather, they supplement them.

The MMY series covers nearly all commercially available psychological, educational, and vocational tests published for use with English-speaking people. There is a detailed review of each test by an expert in the field. A brief description of the test covering areas such as purpose, scoring, prices, and publisher is also provided.
The MMY is published by the Buros Institute of Mental Measurements. The Buros Institute also makes test reviews available through a computer database. This database is updated monthly via an on-line computer service. This service is administered by the Bibliographic Retrieval Services (BRS).

Tests in Print (TIP). TIP is another Buros Institute publication. It is published every few years and lists virtually every test published in English that is available for purchase at that time. It includes the same basic information about a test that is included in the MMY, but it does not contain reviews. This publication is a good starting place for determining what tests are currently available.
Test Critiques. This publication provides practical and straightforward test reviews. It consists of several volumes, published over a period of years. Each volume reviews a different selection of tests. The subject index at the back of the most recent volume directs the reader to the correct volume for each test review.
Professional consultants. There are many employment testing experts who can help you evaluate and select tests for your intended use. They can help you design personnel assessment programs that are effective and comply with relevant laws.

If you are considering hiring a consultant, it is important to evaluate his or her qualifications and experience beforehand. Professionals working in this field generally have a Ph.D. in industrial/organizational psychology or a related field. Look for an individual with hands-on experience in the areas in which you need assistance. Consultants may be found in psychology or business departments at universities and colleges. Others serve as full-time consultants, either working independently, or as members of consulting organizations. Typically, professional consultants will hold memberships in APA, SIOP, or other professional organizations.
Reference libraries should contain the publications discussed above as well as others that will provide information about personnel tests and procedures. The Standards for Educational and Psychological Testing and the Principles for the Validation and Use of Personnel Selection Procedures can also help you evaluate a test in terms of its development and use. In addition, these publications indicate the kinds of information a good test manual should contain. Carefully evaluate the quality and the suitability of a test before deciding to use it. Avoid using tests for which only unclear or incomplete documentation is available, and tests that you are unable to thoroughly evaluate. This is the next principle of assessment.

Principle of Assessment: Use assessment instruments for which understandable and comprehensive documentation is available.

2. Standards for evaluating a test: information to consider to determine suitability of a test for your use

The following basic descriptive and technical information should be evaluated before you select a test for your use. In order to evaluate a test, you should obtain a copy of the test and test manual. Consult independent reviews of the test for professional opinions on the technical adequacy of the test and the suitability of the test for your purposes.

General information
- Test description. As a starting point, obtain a full description of the test. You will need specific identifying information to order your specimen set and to look up independent reviews. The description of the test is the starting point for evaluating whether the test is suitable for your needs.
  - Name of test. Make sure you have the accurate name of the test. (There are tests with similar names, and you want to look up reviews of the correct instrument.)
  - Publication date. What is the date of publication? Is it the latest version? If the test is old, it is possible that the test content and norms for scoring and interpretation have become outdated.
  - Publisher. Who is the test publisher? Sometimes test copyrights are transferred from one publisher to another. You may need to call the publisher for information or for determining the suitability of the test for your needs. Is the publisher cooperative in this regard? Does the publisher have staff available to assist you?
  - Authors. Who developed the test? Try to determine the background of the authors. Typically, test developers hold a doctorate in industrial/organizational psychology, psychometrics, or a related field and are associated with professional organizations such as APA. Another desirable qualification is proven expertise in test research and construction.
  - Forms. Is there more than one version of the test? Are they interchangeable? Are forms available for use with special groups, such as non-English speakers or persons with limited reading skills?
  - Format. Is the test available in paper-and-pencil and/or computer format? Is it meant to be administered to one person at a time, or can it be administered in a group setting?
  - Administration time. How long does it take to administer?
- Costs. What are the costs to administer and score the test? This may vary depending on the version used, and whether scoring is by hand, computer, or by the test publisher.
- Staff requirements. What training and background do staff need to administer, score, and interpret the test? Do you have suitable staff available now or do you need to train and/or hire staff?
Purpose, nature, and applicability of the test
- Test purpose. What aspects of job performance do you need to measure? What characteristics does the test measure? Does the manual contain a coherent description of these characteristics? Is there a match between what the developer says the test measures and what you intend to measure? The test you select for your assessment should relate directly to one or more important aspects of the job. A job analysis will help you identify the tasks involved in the job, and the knowledge, skills, abilities, and other characteristics required for successful performance.
- Similarity of reference group to target group. The test manual will describe the characteristics of the reference group that was used to develop the test. How similar are your test takers, the target group, to the reference group? Consider such factors as age, gender, racial and ethnic composition, education, occupation, and cultural background. Do any factors suggest that the test may not be appropriate for your group? In general, the closer your group matches the characteristics of the reference group, the more confidence you will have that the test will yield meaningful scores for your group.
- Similarity of norm group to target group. In some cases, the test manual will refer to a norm group. A norm group is the sample of the relevant population on whom the scoring procedures and score interpretation guidelines are based. In such cases, the norm group is the same as the reference group. If your target group differs from the norm group in important ways, then the test cannot be meaningfully used in your situation.
Technical information
- Test reliability. Examine the test manual to determine whether the test has an acceptable level of reliability before deciding to use it. A good test manual should provide detailed information on the types of reliabilities reported, how reliability studies were conducted, and the size and nature of the sample used to develop the reliability coefficients. Independent reviews also should be consulted.
- Test validity. Determine whether the test may be validly used in the way you intended. Check the validity coefficients in the relevant validity studies. Usually the higher the validity coefficient, the more useful the test will be in predicting job success. A good test manual will contain clear and complete information on the valid uses of the test, including how validation studies were conducted, and the size and characteristics of the validation samples. Independent test reviews will let you know whether the sample size was sufficient, whether statistical procedures were appropriate, and whether the test meets professional standards.
- Test fairness. Select tests developed to be as fair as possible to test takers of different racial, ethnic, gender, and age groups. Read the manual and independent reviews of the test to evaluate its fairness to these groups. To secure acceptance by all test takers, the test should also appear to be fair. The test items should not reflect racial, cultural, or gender stereotypes, or overemphasize one culture over another. The rules for test administration and scoring should be clear and uniform. Does the manual indicate any modifications that are possible and may be needed to test individuals with disabilities?
- Potential for adverse impact. The manual and independent reviews should help you to evaluate whether the test you are considering has the potential for causing adverse impact. As discussed earlier, mental and physical ability tests have the potential for causing substantial adverse impact. However, they can be an important part of your assessment program. If these tests are used in combination with other employment tests and procedures, you will be able to obtain a better picture of an individual's job potential and reduce the effect of average score differences between groups on one test.
Practical evaluation
- Test tryout. It is often useful to try the test in your own organizational setting by asking employees of your organization to take the test and by taking the test yourself. Do not compute test scores for these employees unless you take steps to ensure that results are anonymous. By trying the test out, you will gain a better appreciation of the administration procedures, including the suitability of the administration manual, test booklet, answer sheets and scoring procedures, the actual time needed, and the adequacy of the planned staffing arrangements. The reactions of your employees to the test may give you additional insight into the effect the test will have on candidates.
- Cost-effectiveness. Are there less costly tests or assessment procedures that can help you achieve your assessment goals? If possible, weigh the potential gain in job performance against the cost of using the test. Some test publishers and test reviews include an expectancy chart or table that you can consult to predict the expected level of performance of an individual based on his or her test score. However, make sure your target group is comparable to the reference group on which the expectancy chart was developed.
- Independent reviews. Is the information provided by the test manual consistent with independent reviews of the test? If there is more than one review, do they agree or disagree with each other? Information from independent reviews will prove most useful in evaluating a test.
- Overall practical evaluation. This involves evaluating the overall suitability of the test for your specific circumstances. Does the test appear easy to use or is it unsettling? Does it appear fair and appropriate for your target groups? How clear are instructions for administration, scoring, and interpretation? Are special equipment or facilities needed? Is the staff qualified to administer the test and interpret results or would extensive training be required?

3. Checklist for evaluating a test

It is helpful to have an organized method for choosing the right test for your needs. A checklist can help you in this process. Your checklist should summarize the kinds of information discussed above. For example, is the test valid for your intended purpose? Is it reliable and fair? Is it cost-effective? Is the instrument likely to be viewed as fair and valid by the test takers? Also consider the ease or difficulty of administration, scoring, and interpretation given available resources. A sample checklist that you may find useful appears below. Completing a checklist for each test you are considering will help you compare them more easily.

CHECKLIST FOR EVALUATING A TEST

Characteristic to be measured by test (skill, ability, personality trait)
Job/training characteristic to be assessed
Candidate population (education, or experience level, other background)
Test Characteristics
- Test name:
- Version:
- Type: (paper-and-pencil, computer)
- Alternate forms available:
- Scoring method: (hand-scored, machine-scored)
Technical considerations
- Reliability: r =
- Validity: r =
- Reference/norm group:
- Test fairness evidence
- Adverse impact evidence
- Applicability (indicate any special group)
Administration considerations
- Administration time:
- Materials needed:
- Costs (include start-up, operational, and scoring costs):
- Facilities needed:
Staffing requirements
Training requirements
Other considerations (consider clarity, comprehensiveness, utility)
Test manual
Supporting documents from the publisher
Publisher assistance
Independent reviews
Overall evaluation

Understanding Test Quality: Concepts of Reliability and Validity

Test reliability and validity are two technical properties of a test that indicate the quality and usefulness of the test. These are the two most important features of a test. You should examine these features when evaluating the suitability of the test for your use. This chapter provides a simplified explanation of these two complex ideas. These explanations will help you to understand reliability and validity information reported in test manuals and reviews and use that information to evaluate the suitability of a test for your use.
Chapter Highlights
1. What makes a good test?
2. Test reliability
3. Interpretation of reliability information from test manuals and reviews
4. Types of reliability estimates
5. Standard error of measurement
6. Test validity
7. Methods for conducting validation studies
8. Using validity evidence from outside studies
9. How to interpret validity information from test manuals and independent reviews.

Principles of Assessment Discussed
Use only reliable assessment instruments and procedures.
Use only assessment procedures and instruments that have been demonstrated to be valid for the specific purpose for which they are being used.
Use assessment tools that are appropriate for the target population.

1. What makes a good test?

An employment test is considered "good" if the following can be said about it:

- The test measures what it claims to measure consistently or reliably. This means that if a person were to take the test again, the person would get a similar test score.
- The test measures what it claims to measure. For example, a test of mental ability does in fact measure mental ability, and not some other characteristic.
- The test is job-relevant. In other words, the test measures one or more characteristics that are important to the job.
- By using the test, more effective employment decisions can be made about individuals. For example, an arithmetic test may help you to select qualified workers for a job that requires knowledge of arithmetic operations.

The degree to which a test has these qualities is indicated by two technical properties: reliability and validity.
2. Test reliability

Reliability refers to how dependably or consistently a test measures a characteristic. If a person takes the test again, will he or she get a similar test score, or a much different score? A test that yields similar scores for a person who repeats the test is said to measure a characteristic reliably. How do we account for an individual who does not get exactly the same test score every time he or she takes the test? Some possible reasons are the following:

- Test taker's temporary psychological or physical state. Test performance can be influenced by a person's psychological or physical state at the time of testing. For example, differing levels of anxiety, fatigue, or motivation may affect the applicant's test results.
- Environmental factors. Differences in the testing environment, such as room temperature, lighting, noise, or even the test administrator, can influence an individual's test performance.
- Test form. Many tests have more than one version or form. Items differ on each form, but each form is supposed to measure the same thing. Different forms of a test are known as parallel forms or alternate forms. These forms are designed to have similar measurement characteristics, but they contain different items. Because the forms are not exactly the same, a test taker might do better on one form than on another.
- Multiple raters. In certain tests, scoring is determined by a rater's judgments of the test taker's performance or responses. Differences in training, experience, and frame of reference among raters can produce different test scores for the test taker.

These factors are sources of chance or random measurement error in the assessment process. If there were no random errors of measurement, the individual would get the same test score, the individual's "true" score, each time. The degree to which test scores are unaffected by measurement errors is an indication of the reliability of the test.
Reliable assessment tools produce dependable, repeatable, and consistent information about people. In order to meaningfully interpret test scores and make useful employment or career-related decisions, you need reliable tools. This brings us to the next principle of assessment.

Principle of Assessment
Use only reliable assessment instruments and procedures. In other words, use only assessment tools that provide dependable and consistent information.

3. Interpretation of reliability information from test manuals and reviews

Test manuals and independent reviews of tests provide information on test reliability. The following discussion will help you interpret the reliability information about any test.

The reliability of a test is indicated by the reliability coefficient. It is denoted by the letter "r," and is expressed as a number ranging between 0 and 1.00, with r = 0 indicating no reliability, and r = 1.00 indicating perfect reliability. Do not expect to find a test with perfect reliability. Generally, you will see the reliability of a test as a decimal, for example, r = .80 or r = .93. The larger the reliability coefficient, the more repeatable or reliable the test scores. Table 1 serves as a general guideline for interpreting test reliability. However, do not select or reject a test solely based on the size of its reliability coefficient. To evaluate a test's reliability, you should consider the type of test, the type of reliability estimate reported, and the context in which the test will be used.

Table 1. General Guidelines for Interpreting Reliability Coefficients
Reliability coefficient value	Interpretation
.90 and up	excellent
.80 - .89	good
.70 - .79	adequate
below .70	may have limited applicability
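
As a rough illustration, the Table 1 bands can be written as a small lookup. This is only a sketch: the function name is hypothetical, and rounding the coefficient to two decimals before comparing is an assumption made here to match the two-decimal bands in the table. As the text notes, the label is a general guideline and should not be the sole basis for selecting or rejecting a test.

```python
def interpret_reliability(r: float) -> str:
    """Return the Table 1 guideline label for a reliability coefficient.

    Hypothetical helper; rounds to two decimals (an assumption) so that
    values line up with the two-decimal bands printed in Table 1.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("a reliability coefficient lies between 0 and 1.00")
    r = round(r, 2)
    if r >= 0.90:
        return "excellent"
    if r >= 0.80:
        return "good"
    if r >= 0.70:
        return "adequate"
    return "may have limited applicability"


if __name__ == "__main__":
    for value in (0.93, 0.80, 0.75, 0.62):
        print(value, "->", interpret_reliability(value))
```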

4. Types of reliability estimates

There are several types of reliability estimates, each influenced by different sources of measurement error. Test developers have the responsibility of reporting the reliability estimates that are relevant for a particular test. Before deciding to use a test, read the test manual and any independent reviews to determine if its reliability is acceptable. The acceptable level of reliability will differ depending on the type of test and the reliability estimate used. The discussion in Table 2 should help you develop some familiarity with the different kinds of reliability estimates reported in test manuals and reviews.
Table 2. Types of Reliability Estimates

Test-retest reliability indicates the repeatability of test scores with the passage of time. This estimate also reflects the stability of the characteristic or construct being measured by the test. Some constructs are more stable than others. For example, an individual's reading ability is more stable over a particular period of time than that individual's anxiety level. Therefore, you would expect a higher test-retest reliability coefficient on a reading test than you would on a test that measures anxiety. For constructs that are expected to vary over time, an acceptable test-retest reliability coefficient may be lower than is suggested in Table 1.

Alternate or parallel form reliability indicates how consistent test scores are likely to be if a person takes two or more forms of a test. A high parallel form reliability coefficient indicates that the different forms of the test are very similar which means that it makes virtually no difference which version of the test a person takes. On the other hand, a low parallel form reliability coefficient suggests that the different forms are probably not comparable; they may be measuring different things and therefore cannot be used interchangeably.

Inter-rater reliability indicates how consistent test scores are likely to be if the test is scored by two or more raters. On some tests, raters evaluate responses to questions and determine the score. Differences in judgments among raters are likely to produce variations in test scores. A high inter-rater reliability coefficient indicates that the judgment process is stable and the resulting scores are reliable.
Inter-rater reliability coefficients are typically lower than other types of reliability estimates. However, it is possible to obtain higher levels of inter-rater reliabilities if raters are appropriately trained.

Internal consistency reliability indicates the extent to which items on a test measure the same thing. A high internal consistency reliability coefficient for a test indicates that the items on the test are very similar to each other in content (homogeneous). It is important to note that the length of a test can affect internal consistency reliability. For example, a very lengthy test can spuriously inflate the reliability coefficient.
Tests that measure multiple characteristics are usually divided into distinct components. Manuals for such tests typically report a separate internal consistency reliability coefficient for each component in addition to one for the whole test.
Test manuals and reviews report several kinds of internal consistency reliability estimates. Each type of estimate is appropriate under certain circumstances. The test manual should explain why a particular estimate is reported.
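
To make two of the estimates in Table 2 more concrete, the sketch below computes a test-retest coefficient as the Pearson correlation between two administrations, and an internal consistency coefficient using Cronbach's alpha. Both choices are assumptions for illustration: the text does not name a specific statistic, alpha is only one of the several internal consistency estimates a manual might report, and all scores shown are made-up data.

```python
from math import sqrt
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = sqrt(sum((b - my) ** 2 for b in y))
    return cross / (spread_x * spread_y)


def cronbach_alpha(responses):
    """Cronbach's alpha; `responses` holds one list of item scores per person."""
    k = len(responses[0])                       # number of items
    items = list(zip(*responses))               # regroup scores by item
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(person) for person in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


if __name__ == "__main__":
    # Hypothetical scores for five people tested twice (test-retest).
    first_administration = [82, 75, 91, 68, 88]
    second_administration = [80, 78, 93, 65, 85]
    print("test-retest r:", round(pearson_r(first_administration,
                                            second_administration), 2))

    # Hypothetical item-level scores: four people, three items each.
    item_scores = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 2, 3]]
    print("Cronbach's alpha:", round(cronbach_alpha(item_scores), 2))
```

An inter-rater or alternate-form coefficient could be estimated with the same correlation function by pairing two raters' scores, or scores on two forms, for the same people.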

5. Standard error of measurement

Test manuals report a statistic called the standard error of measurement (SEM). It gives the margin of error that you should expect in an individual test score because of imperfect reliability of the test. The SEM represents the degree of confidence that a person's "true" score lies within a particular range of scores. For example, an SEM of "2" indicates that a test taker's "true" score probably lies within 2 points in either direction of the score he or she receives on the test. This means that if an individual receives a 91 on the test, there is a good chance that the person's "true" score lies somewhere between 89 and 93. The SEM is a useful measure of the accuracy of individual test scores. The smaller the SEM, the more accurate the measurements.
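
The text does not give a formula for the SEM. A common one from classical test theory, assumed here, is SEM = SD x sqrt(1 - r), where SD is the standard deviation of test scores and r is the reliability coefficient. The sketch below uses hypothetical values (SD = 10, r = .96) chosen so that the result reproduces the SEM of 2 and the 89 to 93 band from the example above; the function names are also hypothetical.

```python
from math import sqrt


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM under the classical test theory formula SD * sqrt(1 - r)."""
    if sd < 0 or not 0.0 <= reliability <= 1.0:
        raise ValueError("sd must be non-negative and reliability within 0..1")
    return sd * sqrt(1.0 - reliability)


def score_band(observed: float, sem: float, n_sem: float = 1.0):
    """Range of scores within n_sem standard errors of the observed score."""
    return observed - n_sem * sem, observed + n_sem * sem


# Hypothetical test: score SD of 10 points and reliability of .96.
sem = standard_error_of_measurement(sd=10.0, reliability=0.96)
low, high = score_band(91, sem)
print(f"SEM = {sem:.1f}; band = {low:.0f} to {high:.0f}")
# prints: SEM = 2.0; band = 89 to 93
```

The formula also shows why a smaller SEM goes with higher reliability: as r approaches 1.00, the SEM shrinks toward zero.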
When evaluating the reliability coefficients of a test, it is important to review the explanations provided in the manual for the following:

- Types of reliability used. The manual should indicate why a certain type of reliability coefficient was reported. The manual should also discuss sources of random measurement error that are relevant for the test.
- How reliability studies were conducted. The manual should indicate the conditions under which the data were obtained, such as the length of time that passed between administrations of a test in a test-retest reliability study. In general, reliabilities tend to drop as the time between test administrations increases.
- The characteristics of the sample group. The manual should indicate the important characteristics of the group used in gathering reliability information, such as education level, occupation, etc. This will allow you to compare the characteristics of the people you want to test with the sample group. If they are sufficiently similar, then the reported reliability estimates will probably hold true for your population as well.

For more information on reliability, consult the APA Standards, the SIOP Principles, or any major textbook on psychometrics or employment testing. Appendix A lists some possible sources.
6. Test validity

Validity is the most important issue in selecting a test. Validity refers to what characteristic the test measures and how well the test measures that characteristic.

- Validity tells you if the characteristic being measured by a test is related to job qualifications and requirements.
- Validity gives meaning to the test scores. Validity evidence indicates that there is linkage between test performance and job performance. It can tell you what you may conclude or predict about someone from his or her score on the test. If a test has been demonstrated to be a valid predictor of performance on a specific job, you can conclude that persons scoring high on the test are more likely to perform well on the job than persons who score low on the test, all else being equal.
- Validity also describes the degree to which you can make specific conclusions or predictions about people based on their test scores. In other words, it indicates the usefulness of the test.

It is important to understand the differences between reliability and validity. Validity will tell you how good a test is for a particular situation; reliability will tell you how trustworthy a score on that test will be. You cannot draw valid conclusions from a test score unless you are sure that the test is reliable. Even when a test is reliable, it may not be valid. You should be careful that any test you select is both reliable and valid for your situation.
A test's validity is established in reference to a specific purpose; the test may not be valid for different purposes. For example, the test you use to make valid predictions about someone's technical proficiency on the job may not be valid for predicting his or her leadership skills or absenteeism rate. This leads to the next principle of assessment.

Principle of Assessment
Use only assessment procedures and instruments that have been demonstrated to be valid for the specific purpose for which they are being used.

Similarly, a test's validity is established in reference to specific groups. These groups are called the reference groups. The test may not be valid for different groups. For example, a test designed to predict the performance of managers in situations requiring problem solving may not allow you to make valid or meaningful predictions about the performance of clerical employees. If, for example, the kind of problem-solving ability required for the two positions is different, or the reading level of the test is not suitable for clerical applicants, the test results may be valid for managers, but not for clerical employees.
Test developers have the responsibility of describing the reference groups used to develop the test. The manual should describe the groups for whom the test is valid, and the interpretation of scores for individuals belonging to each of these groups. You must determine if the test can be used appropriately with the particular type of people you want to test. This group of people is called your target population or target group.

Principle of Assessment
Use assessment tools that are appropriate for the target population.

Your target group and the reference group do not have to match on all factors, but they must be sufficiently similar so that the test will yield meaningful scores for your group. For example, a writing ability test developed for use with college seniors may be appropriate for measuring the writing ability of white-collar professionals or managers, even though these groups do not have identical characteristics. In determining the appropriateness of a test for your target groups, consider factors such as occupation, reading level, cultural differences, and language barriers.
Recall that the Uniform Guidelines require assessment tools to have adequate supporting evidence for the conclusions you reach with them in the event adverse impact occurs. A valid personnel tool is one that measures an important characteristic of the job you are interested in. Use of valid tools will, on average, enable you to make better employment-related decisions. Both from business-efficiency and legal viewpoints, it is essential to only use tests that are valid for your intended use.
In order to be certain an employment test is useful and valid, evidence must be collected relating the test to a job. The process of establishing the job relatedness of a test is called validation.
7. Methods for conducting validation studies

The Uniform Guidelines discuss the following three methods of conducting validation studies. The Guidelines describe conditions under which each type of validation strategy is appropriate. They do not express a preference for any one strategy to demonstrate the job-relatedness of a test.

- Criterion-related validation requires demonstration of a correlation or other statistical relationship between test performance and job performance. In other words, individuals who score high on the test tend to perform better on the job than those who score low on the test. If the criterion is obtained at the same time the test is given, it is called concurrent validity; if the criterion is obtained at a later time, it is called predictive validity.
- Content-related validation requires a demonstration that the content of the test represents important job-related behaviors. In other words, test items should be relevant to and measure directly important requirements and qualifications for the job.
- Construct-related validation requires a demonstration that the test measures the construct or characteristic it claims to measure, and that this characteristic is important to successful performance on the job.

The three methods of validity (criterion-related, content, and construct) should be used to provide validation support depending on the situation. These three general methods often overlap, and, depending on the situation, one or more may be appropriate. French (1990) offers situational examples of when each method of validity may be applied.
First, as an example of criterion-related validity, take the position of millwright. Employees' scores (predictors) on a test designed to measure mechanical skill could be correlated with their performance in servicing machines (criterion) in the mill. If the correlation is high, it can be said that the test has a high degree of validation support, and its use as a selection tool would be appropriate.
Second, the content validation method may be used when you want to determine if there is a relationship between behaviors measured by a test and behaviors involved in the job. For example, a typing test would have high validation support for a secretarial position, assuming much typing is required each day. If, however, the job required only minimal typing, then the same test would have little content validity. Content validity does not apply to tests measuring learning ability or general problem-solving skills (French, 1990).
Finally, the third method is construct validity. This method often pertains to tests that may measure abstract traits of an applicant. For example, construct validity may be used when a bank desires to test its applicants for "numerical aptitude." In this case, an aptitude is not an observable behavior, but a concept created to explain possible future behaviors. To demonstrate that the test possesses construct validation support, ". . . the bank would need to show (1) that the test did indeed measure the desired trait and (2) that this trait corresponded to success on the job" (French, 1990, p. 260).
Professionally developed tests should come with reports on validity evidence, including detailed explanations of how validation studies were conducted. If you develop your own tests or procedures, you will need to conduct your own validation studies. As the test user, you have the ultimate responsibility for making sure that validity evidence exists for the conclusions you reach using the tests. This applies to all tests and procedures you use, whether they have been bought off-the-shelf, developed externally, or developed in-house.
Validity evidence is especially critical for tests that have adverse impact. When a test has adverse impact, the Uniform Guidelines require that validity evidence for that specific employment decision be provided.
The particular job for which a test is selected should be very similar to the job for which the test was originally developed. Determining the degree of similarity will require a job analysis. Job analysis is a systematic process used to identify the tasks, duties, responsibilities and working conditions associated with a job and the knowledge, skills, abilities, and other characteristics required to perform that job.
Job analysis information may be gathered by direct observation of people currently in the job, interviews with experienced supervisors and job incumbents, questionnaires, personnel and equipment records, and work manuals. In order to meet the requirements of the Uniform Guidelines, it is advisable that the job analysis be conducted by a qualified professional, for example, an industrial and organizational psychologist or other professional well trained in job analysis techniques. Job analysis information is central in deciding what to test for and which tests to use.
8. Using validity evidence from outside studies

Conducting your own validation study is expensive, and, in many cases, you may not have enough employees in a relevant job category to make it feasible to conduct a study. Therefore, you may find it advantageous to use professionally developed assessment tools and procedures for which documentation on validity already exists. However, care must be taken to make sure that validity evidence obtained for an "outside" test study can be suitably "transported" to your particular situation. The Uniform Guidelines, the Standards, and the SIOP Principles state that evidence of transportability is required. Consider the following when using outside tests:

- Validity evidence. The validation procedures used in the studies must be consistent with accepted standards.
- Job similarity. A job analysis should be performed to verify that your job and the original job are substantially similar in terms of ability requirements and work behavior.
- Fairness evidence. Reports of test fairness from outside studies must be considered for each protected group that is part of your labor market. Where this information is not available for an otherwise qualified test, an internal study of test fairness should be conducted, if feasible.
- Other significant variables. These include the type of performance measures and standards used, the essential work activities performed, the similarity of your target group to the reference samples, as well as all other situational factors that might affect the applicability of the outside test for your use.

To ensure that the outside test you purchase or obtain meets professional and legal standards, you should consult with testing professionals. See Chapter 5 for information on locating consultants.
9. How to interpret validity information from test manuals and independent reviews

To determine if a particular test is valid for your intended use, consult the test manual and available independent reviews. (Chapter 5 offers sources for test reviews.) The information below can help you interpret the validity evidence reported in these publications.

In evaluating validity information, it is important to determine whether the test can be used in the specific way you intended, and whether your target group is similar to the test reference group. Test manuals and reviews should describe
- Available validation evidence supporting use of the test for specific purposes. The manual should include a thorough description of the procedures used in the validation studies and the results of those studies.
- The possible valid uses of the test. The purposes for which the test can legitimately be used should be described, as well as the performance criteria that can validly be predicted.
- The sample group(s) on which the test was developed. For example, was the test developed on a sample of high school graduates, managers, or clerical workers? What was the racial, ethnic, age, and gender mix of the sample?
- The group(s) for which the test may be used.

The criterion-related validity of a test is measured by the validity coefficient. It is reported as a number between 0 and 1.00 that indicates the magnitude of the relationship, "r," between the test and a measure of job performance (criterion). The larger the validity coefficient, the more confidence you can have in predictions made from the test scores. However, a single test can never fully predict job performance because success on the job depends on so many varied factors. Therefore, validity coefficients, unlike reliability coefficients, rarely exceed r = .40.
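As an illustration of how such a coefficient is obtained, here is a minimal sketch that computes the Pearson correlation between test scores and a job performance criterion. The function name `validity_coefficient` and the sample numbers are hypothetical, chosen only for this example; a real validation study involves far larger samples and professional review.

```python
from math import sqrt


def validity_coefficient(test_scores, performance_ratings):
    """Pearson correlation r between test scores and a performance criterion."""
    n = len(test_scores)
    if n != len(performance_ratings) or n < 2:
        raise ValueError("need two equal-length samples with at least two pairs")

    mean_x = sum(test_scores) / n
    mean_y = sum(performance_ratings) / n

    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, performance_ratings))
    var_x = sum((x - mean_x) ** 2 for x in test_scores)
    var_y = sum((y - mean_y) ** 2 for y in performance_ratings)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: test scores and supervisor ratings for six employees.
scores = [52, 61, 47, 70, 58, 65]
ratings = [3.1, 3.4, 3.3, 3.9, 2.8, 3.6]
print(round(validity_coefficient(scores, ratings), 2))
```

A Pearson correlation can be negative in principle; the guidance above concerns the magnitude of the relationship.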

Table 3. General Guidelines for Interpreting Validity Coefficients
Validity coefficient value	Interpretation
above .35	very beneficial
.21 - .35	likely to be useful
.11 - .20	depends on circumstances
below .11	unlikely to be useful
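The bands in Table 3 can be expressed as a small lookup, sketched below. The function name `interpret_validity` is hypothetical. Because the table is stated to two decimal places, the sketch assumes the coefficient is rounded to two decimals before it is placed in a band; that rounding step is an assumption of this example, not part of the table.

```python
def interpret_validity(r):
    """Map a validity coefficient (0 to 1.00) to its Table 3 interpretation."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("validity coefficient must be between 0 and 1.00")

    # Assumption: round to two decimals so values such as .205 fall in a band.
    r = round(r, 2)

    if r > 0.35:
        return "very beneficial"
    if r >= 0.21:
        return "likely to be useful"
    if r >= 0.11:
        return "depends on circumstances"
    return "unlikely to be useful"


for value in (0.42, 0.30, 0.15, 0.05):
    print(value, interpret_validity(value))
```

The label is only a starting point; the interpretation still depends on the situation in which the test is used.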

As a general rule, the higher the validity coefficient, the more beneficial it is to use the test. Validity coefficients of r = .21 to r = .35 are typical for a single test. Validities for selection systems that use multiple tests will probably be higher because you are using different tools to measure or predict different aspects of performance, whereas a single test is likely to measure or predict fewer aspects of total performance. Table 3 serves as a general guideline for interpreting test validity for a single test. Evaluating test validity is a sophisticated task, and you might require the services of a testing expert. In addition to the magnitude of the validity coefficient, you should also consider, at a minimum, factors such as the selection ratio, the cost of a hiring error, the cost of the selection tool, and the level of adverse impact associated with the tool.

Here are three scenarios illustrating why you should consider these factors, individually and in combination with one another, when evaluating validity coefficients:

Scenario One. You are in the process of hiring applicants where you have a high selection ratio and are filling positions that do not require a great deal of skill. In this situation, you might be willing to accept a selection tool that has validity considered "likely to be useful" or even "depends on circumstances" because you need to fill the positions, you do not have many applicants to choose from, and the level of skill required is not that high. Now, let's change the situation.
Scenario Two. You are recruiting for jobs that require a high level of accuracy, and a mistake made by a worker could be dangerous and costly. With these additional factors, a slightly lower validity coefficient would probably not be acceptable to you because hiring an unqualified worker would be too much of a risk. In this case you would probably want to use a selection tool that reported validities considered to be "very beneficial" because a hiring error would be too costly to your company. Here is another scenario that shows why you need to consider multiple factors when evaluating the validity of assessment tools.
Scenario Three. A company you are working for is considering using a very costly selection system that results in fairly high levels of adverse impact. You decide to implement the selection tool because the assessment tools you found with lower adverse impact had substantially lower validity, were just as costly, and making mistakes in hiring decisions would be too much of a risk for your company. Your company decided to implement the assessment given the difficulty in hiring for the particular positions, the "very beneficial" validity of the assessment and your failed attempts to find alternative instruments with less adverse impact. However, your company will continue efforts to find ways of reducing the adverse impact of the system.

Again, these examples demonstrate the complexity of evaluating the validity of assessments. Multiple factors need to be considered in most situations. You might want to seek the assistance of a testing expert (for example, an industrial/organizational psychologist) to evaluate the appropriateness of particular assessments for your employment situation.

When properly applied, the use of valid and reliable assessment instruments will help you make better decisions. Additionally, by using a variety of assessment tools as part of an assessment program, you can more fully assess the skills and capabilities of people, while reducing the effects of errors associated with any one tool on your decision making.