3 Things to Look for When Assessing Assessments

There are hundreds, if not thousands, of assessments available to organizations. With so many options, it is easy to make a mistake and use an assessment that is not only poor but may also cause harm in the workplace. To help you choose a good assessment, there are only three things you need to consider: reliability, validity, and appropriateness.

Reliability

Any assessment, be it a measure of employee engagement, job performance, or personality type, must be reliable. Briefly, reliability is consistency of measurement. If you get on a scale and it says you weigh 135, and then you get on it three minutes later and it says you weigh 170, it’s probably not a very good scale. The same thing applies to assessments. If you take an assessment on Tuesday and it says you are an introvert, and then you take that same assessment the following Friday and it says you are an extravert, it probably isn’t a good assessment.

Although there are several types of reliability, the two most common are test-retest and internal consistency (or alpha). Test-retest reliability is pretty straightforward. Like the example above, it is the degree of consistency you could expect if you gave an assessment on, say, a Tuesday, and then gave the same test to the same people on the following Friday. The more consistent people's scores are across the two administrations, the higher your test-retest reliability.
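To make this concrete, here is a minimal sketch in Python of how test-retest reliability is typically computed: as the correlation between the same people's scores on the two administrations. The scores below are made up for illustration.

```python
import numpy as np

# Hypothetical total scores for the same five employees on two
# administrations of the same assessment (Tuesday and Friday).
tuesday = np.array([42, 35, 50, 28, 45])
friday  = np.array([40, 37, 49, 30, 44])

# Test-retest reliability is simply the Pearson correlation
# between the two sets of scores.
test_retest = np.corrcoef(tuesday, friday)[0, 1]
print(f"Test-retest reliability: {test_retest:.2f}")  # close to 1.0 = consistent
```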

The second type of reliability is internal consistency, which is also known as Cronbach's alpha, coefficient alpha, or simply alpha. As you can imagine from the example above, it is a bit of a hassle to get employees to take the same test twice. So, internal consistency is a formula that estimates, from a single administration, what the reliability would be. The logic is to have employees take the test once, divide the items into two halves, and calculate the correlation between the two halves. Averaging that result over all possible split-half versions of the test gives you internal consistency, or alpha.
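In practice, nobody works out all those split halves by hand; alpha is computed with a standard formula based on the item variances and the variance of the total score. Here is a minimal sketch, again using hypothetical responses:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 6 employees answering a 4-item scale (1-5).
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])
print(f"alpha = {cronbach_alpha(responses):.2f}")
```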

Although both estimates of reliability are good, be aware that, all things being equal, test-retest reliability tends to be a bit lower than internal consistency estimates (alphas). There may be several reasons for this, but certainly one is that internal consistency estimates are determined in a single test administration, whereas test-retest estimates require two. To the extent that things like mood, life events, or even physical health differ between the two administrations, the test-retest value will be pulled down. So, all types of reliability are not created equal: consider which type was calculated when evaluating the assessment under scrutiny, and expect a slightly lower estimate when it was calculated using the test-retest process.
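A quick simulation makes the point. In this sketch (the noise level is an assumption, not real data), each person has a stable true score, but day-specific influences such as mood are added on each occasion; the retest correlation drops well below 1.0 even though the test itself never changes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Stable true scores plus day-specific noise (mood, health, etc.).
true_scores = rng.normal(0, 1, n)
day_noise = 0.6  # assumed strength of occasion-specific influences
tuesday = true_scores + rng.normal(0, day_noise, n)
friday  = true_scores + rng.normal(0, day_noise, n)

# The retest correlation is pulled below 1.0 by whatever changed
# between the two occasions, not just by flaws in the test itself.
print(f"test-retest r = {np.corrcoef(tuesday, friday)[0, 1]:.2f}")
```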

Now, what level of reliability is acceptable? It is generally agreed in organizational science that a good level of reliability is at least .70; anything lower and the assessment may not be reliable enough to be useful. However, internal consistency formulas are "biased" in that they tend to yield higher values as the number of items increases. Said another way, obtaining an alpha of .75 for a scale of three items is much more impressive than obtaining an alpha of .75 for a scale of 300 items. Unfortunately, there is no "correction formula" for offsetting this bias, but you should know that a higher alpha is more likely with a larger number of items. Hence, you should consider not only the value of the alpha, but also the number of items behind it.
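You can see the bias directly with the standardized form of alpha, which expresses alpha in terms of the number of items and the average inter-item correlation. The correlation value below is an assumption chosen only to illustrate the effect.

```python
# Illustration of how alpha rises with the number of items alone.
# Standardized alpha: alpha = (k * r) / (1 + (k - 1) * r), where k is
# the number of items and r is the average inter-item correlation.
avg_inter_item_r = 0.15  # a deliberately modest correlation
for k in (3, 10, 50, 300):
    alpha = (k * avg_inter_item_r) / (1 + (k - 1) * avg_inter_item_r)
    print(f"{k:>3} items -> alpha = {alpha:.2f}")
# Even with weakly related items, alpha climbs toward 1.0 as items are
# added, which is why an alpha of .75 from 3 items is far more
# impressive than the same alpha from 300 items.
```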

This upward bias in internal consistency estimates with more items directly conflicts with the goal that assessments be as brief as possible. It is widely known that shorter assessments tend to produce higher completion rates, so there is a constant struggle to achieve an accurate assessment with the fewest possible items. Because of that struggle, some assessments may have internal consistency estimates below .70. You may therefore be willing to sacrifice a little reliability if the assessment is shorter, but I would recommend insisting on internal consistency values of at least .60.

At the other end of the spectrum, internal consistency values above .90 are somewhat rare, and anything in the mid- to high .90s should cause you concern, because those levels are practically unheard of and fall into the if-it's-too-good-to-be-true-it-probably-is category. In other words, a reliability of .95 or above may not be genuine.

Validity

Reliability, then, is the first issue to consider when evaluating an assessment. If an assessment is not reliable, there is no way it can be valid or appropriate. A phrase often used in the organizational sciences is that "reliability is a necessary, but not sufficient, condition for validity." If you discover your assessment has low reliability, you may as well stop using it, for it cannot be valid and thus cannot be useful to you. However, if your assessment is reliable, that alone does not mean it is valid.

Thus, the second thing to consider when assessing an assessment is its validity. Like reliability, there are several types of validity, but I will focus on the three most relevant to assessing assessments. The first is construct validity. Put simply, construct validity is how well the assessment measures what it claims to measure. For instance, if an assessment claims to measure job satisfaction, how do you know that it really measures job satisfaction? Things like job satisfaction, employee engagement, personality, and many other job-related concepts such as commitment are abstract, meaning they are not tangible: you cannot touch them, feel them, or see them.

To demonstrate construct validity, scale developers typically collect “evidence of construct validity” by showing that their measure is related to other measures of the same concept. Thus, scale developers may try to demonstrate that their assessment of job satisfaction is related to other measures of job satisfaction. If there is a high correlation between the newly developed scale of job satisfaction and other, established measures of job satisfaction, one may conclude that the newly developed scale is indeed an adequate measure of job satisfaction.

Scale developers may also demonstrate "evidence of construct validity" by showing that their scale is related to other variables it should be related to. For instance, we would expect job satisfaction to be related to lower absenteeism, lower turnover, and higher job performance. If we can show that scores on our newly created job satisfaction scale are indeed related to lower absenteeism and turnover and to higher job performance ratings, we have additional evidence of construct validity. In this way, establishing construct validity is like a lawyer trying to prove someone's innocence or guilt: the more evidence they can collect that supports their argument, the better their case. Likewise, the more evidence a scale developer can collect supporting the assertion that their measure assesses an abstract concept, the more evidence they will have of construct validity.
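Both kinds of evidence boil down to checking correlations. Here is a minimal sketch of what that looks like, using entirely hypothetical data for a new job satisfaction scale:

```python
import numpy as np

# Hypothetical data for 6 employees: scores on the new job satisfaction
# scale, an established satisfaction measure, and days absent last year.
new_scale   = np.array([4.2, 2.1, 4.8, 3.0, 1.5, 3.9])
established = np.array([4.0, 2.5, 4.6, 3.2, 1.8, 3.7])
days_absent = np.array([1,   8,   0,   5,   12,  2  ])

# Convergent evidence: the new scale should correlate highly with
# established measures of the same concept.
print(f"r(new, established) = {np.corrcoef(new_scale, established)[0, 1]:.2f}")

# Criterion evidence: satisfaction should relate negatively to absenteeism.
print(f"r(new, absenteeism) = {np.corrcoef(new_scale, days_absent)[0, 1]:.2f}")
```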

Another type of validity is external validity: the extent to which an assessment generalizes to a wider population. Unfortunately, external validity is often ignored until it is too late. For example, let's say an assessment has a lot of evidence of construct validity. Let's also say that the assessment contains some fairly complex content that requires college-level reading comprehension. Now, let's say we give that fairly complex assessment to children in the 7th grade. Yes, the assessment has been shown to measure what it purports to measure, and yes, it has been shown to be related to similar measures and to expected outcomes, but will it be useful when given to a group of 7th graders? Probably not. Said more succinctly, an assessment validated on populations or in situations that deviate from yours may not generalize; that is, it may not be externally valid.

There are at least two common situations where external validity may be particularly relevant to those who use organizational assessments. One is when an assessment developed primarily in one setting (e.g., white-collar employees) is applied to a different setting (e.g., blue-collar employees). Another, likely to become more common as globalization increases, is using an assessment developed in one language (e.g., English) but applying it (after translation) to employees speaking another language (e.g., Chinese). Even setting aside cultural differences, there are likely to be differences in the validity, or usefulness, of an assessment developed in English when it is given to Chinese employees. Hence, when you are assessing assessments, make sure that the assessment was developed using populations similar to the one you will be assessing. This will help ensure your assessment has external validity and thus is appropriate for your situation.

The third type of validity is face validity. Basically, face validity means that, on the surface (or the face), an assessment looks like it measures what it purports to measure; in other words, its claims appear logical. This thinking is akin to the old adage, "If it walks like a duck and sounds like a duck, it probably is a duck." That may be fine in the physical universe, but when assessing abstract concepts, that sort of thinking is dangerous and oftentimes harmful, because transparent questions (the very thing that contributes to face validity) may actually end up measuring something else.

Let's say there is an integrity assessment that asks about attitudes toward theft, sabotage, and greed. Sounds like a good assessment, right? Well, maybe. Suppose this assessment is part of a selection process. If asked about attitudes toward theft, sabotage, and greed, some applicants may not answer truthfully but instead respond in the manner that makes them look best to their potential employer. In other words, they fake their answers so that they will be evaluated positively. In this instance, an assessment that looks good on the surface, or face, may in fact be of little use in identifying employees with a great deal of integrity. Indeed, such an assessment may actually identify the applicants who are best at intentional deception, which is quite the opposite of integrity.

Appropriateness

Finally, the third thing you should look for when assessing assessments is the purpose, or appropriateness, of the assessment for its intended use. You could have a very valid assessment of baseball knowledge, but if you tried to use it to hire a Vice President of Marketing, it would not be valid for that purpose. This issue is common with personality assessments because they are often misapplied. For example, the Myers-Briggs Type Indicator is a very popular personality assessment that was developed partly on the work of Carl Jung (a famous psychologist), but also on the developers' own unsupported philosophy of personality. Most importantly, the Myers-Briggs was not developed to construct teams or to determine whether certain employees should work together, yet many organizations use it for those exact purposes. In fact, one could argue that a team composed of diverse profiles may actually be more creative and productive than a team whose members all share the same profile. The bottom line is that even if the assessment you are using has reliability and validity, you must make sure that its business application is appropriate for your situation.

Another example of an often misapplied, and therefore inappropriate, use of a personality assessment is the Minnesota Multiphasic Personality Inventory (MMPI). Although the MMPI has ample evidence of reliability and validity, it is just as important to know that it was developed on, and is appropriate for, abnormal populations. Because it has that evidence of reliability and validity, many organizations may be tempted to use the MMPI in their selection assessments. Yet, unless the organization is deliberately attracting applicants with abnormal tendencies, the MMPI would not be an appropriate selection assessment; a reliable and valid assessment developed on normal populations would be more appropriate.

Yet it is not only personality assessments that may be used inappropriately. We recently developed a reliable and valid assessment of team civility. This assessment measures the degree to which team members report perceptions of respect within several sources of workgroup functioning, such as coworkers, supervisors, and the overall organization. Although we have received several requests from individuals wanting to see how civil they are across situations, the assessment is not appropriate for that use because the questions ask respondents to reference their workgroup, not their daily interactions with others in all walks of life. Hence, our team civility assessment, although very reliable and valid, is not appropriate for determining an individual's civility across a multitude of situations.

The bottom line in these examples is that even though an assessment may have ample evidence of reliability and validity (the first two issues discussed above), the use of that assessment must also be appropriate for the given situation.

Assessing Assessments

When assessing assessments, you should be able to find information on all of the above in a Test Manual. A Test Manual includes the background and development of an assessment as well as evidence of reliability and validity. It may also describe the intended use (or uses) of the assessment, but if it does not, then you must decide whether it is appropriate for your intended use.

Truth be told, most Test Manuals will not use the terms construct and external validity. Instead, you can evaluate the extent to which your use of an assessment is generalizable (or externally valid) by comparing your intended use to how the assessment was developed. This information is usually found in the Background/Development section of a Test Manual and, to a lesser extent, in the description of the participants used to validate the assessment. Information on construct validity will be found in the Validity section, as most Test Manuals simply use the term "validity" to mean construct validity. To determine the appropriateness of the assessment, pay particular attention to its background and development: if the assessment was developed to assess mental instability and you want to use it for creating work teams, it is probably not appropriate.

Finally, and this is very important, if your assessment publisher does not have a Test Manual readily available for you to review, then I would advise you to avoid that assessment and seriously consider others.
