The National Teaching & Learning Forum
 Volume 14 Number 2


DEVELOPER'S DIARY:
Tests as Anchors that Wobble: Understanding Imperfect Correlations in Educational Measurements: Educating in Fractal Patterns XI
Edward B. Nuhfer, Center for Teaching and Learning, Idaho State University

This Diary has a more applied emphasis than previous ones. It seeks to explain why correlations found in educational research are usually lower than those found in the physical sciences and why tests are imperfect measures of knowledge, and it encourages faculty members to use the statistical capabilities of Excel® to look at the reliability of their own tests.

What instructional practices are most effective in producing learning? How well are student ratings of professors tied to students' learning? Educational research seeking to provide answers to such questions involves comparisons of test scores with varied practices or student ratings. The answers are usually summarized in terms of a single number called a correlation coefficient, conventionally symbolized as "r." The numerical value of the coefficient expresses the strength of the numerical relationship between two variables. If you managed to get a doctorate without calculating a correlation coefficient and doing a "least-squares line fit," then congratulate yourself; many of us were not so fortunate! For the fortunate, let's start with some conceptual rudiments. In Figure 1, a perfect correlation is shown in "A," where r = 1.0. "A" reveals a plot of two variables in which all points fit perfectly as a "best fit" along a line. In contrast, the relationship in "B" shows absolutely no linearity between two variables. The coefficient calculated from these points is zero. The more a two dimensional plot of scattered points is elongate and approaches a line (see "C" and "D" in Figure 1), the better the correlation. Correlation coefficients can run from r = -1 (perfect inverse correlation) to r = 1 (perfect positive correlation).
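For readers who would like to see the arithmetic behind "r," here is a minimal Python sketch of the standard Pearson correlation coefficient. The function name pearson_r and the three small data sets are invented for illustration and are not the data behind Figure 1; they simply mimic a perfect positive correlation, a perfect inverse correlation, and a pattern with no linear trend.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable has no spread")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, chosen only to illustrate the range of r.
x = [1, 2, 3, 4, 5]
perfect_positive = pearson_r(x, [2, 4, 6, 8, 10])  # points on a rising line
perfect_inverse = pearson_r(x, [10, 8, 6, 4, 2])   # points on a falling line
no_linear_trend = pearson_r(x, [4, 1, 0, 1, 4])    # symmetric, no overall slope

print(perfect_positive, perfect_inverse, no_linear_trend)
```

The function raises an error when one variable has no spread at all, because the coefficient is undefined in that case.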

When is a numerical relationship good enough to be useful? Cashin (1988) recognized in his discussions regarding student evaluations: "Correlations between .20 and .49 are practically useful. Correlations between .50 and .70 are very useful but they are rare when studying complex phenomena." Researchers who work with simpler physical relationships are accustomed to higher values of r, such as 0.9 or greater. They seem disappointed when educational measurements fail to produce similar results. Some presume that tests are analogous to stable, calibrated lab instruments and that test scores are rock-solid "measures of actual knowledge."

The fractal nature of brain neurology, synaptic connections, and thus, learning, helps to explain why no single tool, including tests, can be the stable "measure of knowledge" presumed or hoped for. Three-dimensional, interconnected, interfolded, branching neural networks produced through learning are so complex as to ensure that the best measures of learning are mere samples of the actual knowledge stored within them. The fractal nature of brain neurology and its resultant variability of expression in learning practically guarantees that no single measure of student learning will prove good enough to characterize the actual knowledge present. That's why legitimate assessments require multiple measures of learning. We see characteristics of fractals in the "multiple measures" language of more recent philosophies of faculty evaluation and of assessment.

For illustration, think of neural networks that contain particular knowledge (such as content of a course or a unit of a course) as a fractal surface, like a large area of Earth's surface with its natural rough topography. A test is like taking some measurements across this surface. If two tests each derive a good representation of the surface, then they should correlate very highly with one another. A problem, though, lies in the inherent roughness of any fractal surface, which means the likelihood of two different sets of measures correlating perfectly with one another is almost zero. In itself, this reveals why two tests given over the same material won't correlate perfectly with each other.

Add to this the fact that a test of a class samples not one brain surface, but many, and one can recognize why writing a good test is a challenge. Individual test questions trigger responses from students to supply information or to use information to engage in higher level thinking such as synthesis or evaluation. Different learners perceive knowledge differently and their brains retrieve it a bit differently. If information is retrieved differently, an individual test question may trigger a response in some students and not in others, even though all may have the knowledge. In teaching, we know that coming at material in as many ways as possible accommodates the varied learning styles inherent in different students' neural wiring. If such differences have credibility, then good test design must take this into account, just as good instructional design does. A "good" test will efficiently trigger responses from as many as possible of the people who have the knowledge.

We presume tests are our anchor...

For faculty who conceptualize teaching as simply pouring in knowledge, it is easy to believe that tests reveal how well students have learned by their ability to pour out the same knowledge on a test. Such faculty have great faith in tests as a sole and solid tool to evaluate individual student learning, so the idea that tests might be less than perfect measures will cause them apprehension.

Assessment, in contrast to individual grading, looks less at individuals, and more at units such as a class, a course, or a curriculum as a whole. While assessment approaches differ from the approaches to the evaluation of individuals, the two can be mutually informative when considered as multiple measures. For example, assessment encourages us to examine our tests in the context of our classes as a whole, rather than simply as a tool for grading students. When we do, we find some valuable insights.

One is the concept of test reliability. The best any test can be is a good sampling of what information is stored in students' minds. From the score on the test, we extrapolate that test score/grade to express the student's command of that whole knowledge. If our extrapolation is perfectly correct, then an equally good test over the same material should reveal a perfect correlation with the first. Of course it doesn't, for reasons considered above: a single sampling is not identical to another sampling of a complex fractal form.

We can use correlation to see the reliability of our own tests. To do a correlation, we need data from at least two samplings. As one approach, we could give all of our students two different tests on the same material, and see how consistently our tests measured the same students' knowledge. However, giving two tests is a lot of work, so we can instead use a standard alternative method called "split halves" (see Jacobs and Chase, 1992, Developing and Using Tests Effectively..., Jossey-Bass, pp. 32-50) to discover the general reliability our tests provide. This involves splitting a single test into two tests, for example by using the odd-numbered items as one test and the even-numbered items as another. If the test were "perfectly reliable," each half would give the same result per student and a plot as in "A" would result. Another way is to look at our past semester's grade sheet and treat our entire course grading as a single test. Thus, if we gave ten quizzes or four tests, we could split our quizzes/tests randomly and do a similar check.
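The odd/even split described above can be sketched in a few lines of Python. Everything here is hypothetical: the helper names, the five students, and the six right-or-wrong items are invented purely to show the mechanics, and a class this small is far too few data points for a meaningful coefficient.

```python
import math

def pearson_r(xs, ys):
    """Pearson r for two equal-length lists of numbers."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def split_half_scores(item_scores):
    """Return (odd_totals, even_totals), one total per student.

    item_scores holds one list per student with the points earned on
    each item, in test order. Item 1 counts as odd, item 2 as even.
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    return odd_totals, even_totals

# Hypothetical class: 5 students, 6 items, each scored 1 (right) or 0 (wrong).
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]

odd, even = split_half_scores(scores)
r = pearson_r(odd, even)
print("odd-item totals: ", odd)
print("even-item totals:", even)
print(f"split-half r = {r:.2f}")
```

The same function works on a grade sheet if each inner list holds one student's quiz scores rather than item scores.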

But anchors wobble more than we think...

"C" results from a split half analysis on ten quizzes. The r = 0.7 shown in C is, as we shall see, a reasonable result, but the plot reveals plenty of internal imperfection. Some tests will be better representative samples than others. With study and tuning, tests can be optimized, but a faculty member doesn't have the luxury of tuning a test until fit for marketing. It's usually created without pre-testing and intended for one-time use. In routine test design, one hopes for an r-value greater than 0.6—ideally 0.7 or greater. The next Diary will summarize ways to write more reliable tests, but this column cautions us to examine the tests we now use.

And wobble can shake our perceptions.

When we think of our exams as rock solid "measures of actual knowledge" we assume (wrongly) that they have the reliability of the perfect correlation (r = 1) shown in "A." If we are trying to do a correlation between something such as test grades and another measure (such as a knowledge survey—see NTLF V13N2, pp 8-11), we cannot expect correlations any better than the internal reliability or "wobble" of the worst measurement tools involved. (In fact, the knowledge survey in D shows higher internal reliability than the tests in C.) Faculty often see correlations such as r = 0.47 between student ratings and test performance (Cohen, 1981), or r = 0.56 between test performance and the teachers' degree of preparing and organizing their courses (Feldman, 1998). Faculty with unexamined faith in test reliability are prone to judge these "low correlations" erroneously and presume that they result from fogginess of student ratings or the lack of real importance of course organization rather than some of the problems lying in tests. When we get an r-value such as 0.47 between student ratings and test performance, part of the imprecision comes from imprecision in ratings and part of the imprecision comes from the tests themselves. Neither can be perfect samples of what we are trying to measure, but the tendency is to blame the imprecision on a measure other than a test. The fact is that two routine class tests may not correlate much better than they do with other good measures. Given the "wobble" associated with test reliability, correlations of r = 0.6 between another measure and exam scores may be nearly as good as we can expect, given the nature of tests.
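One standard way to put a number on this ceiling comes from classical test theory, which this column does not otherwise use: the observed correlation between two measures equals the correlation between the underlying "true" quantities multiplied by the square root of the product of the two reliabilities. The sketch below applies that relationship. The reliability of 0.7 echoes the value hoped for in routine test design, while the 0.8 reliability and the 0.8 underlying correlation are assumptions chosen only for illustration.

```python
import math

def attenuated_r(true_r, reliability_x, reliability_y):
    """Observed correlation expected between two imperfect measures.

    Classical attenuation relationship:
    observed r = true r * sqrt(reliability_x * reliability_y)
    """
    return true_r * math.sqrt(reliability_x * reliability_y)

# Assumed values: a test with reliability 0.7 and a second measure
# (for example, student ratings) with reliability 0.8.
ceiling = attenuated_r(1.0, 0.7, 0.8)   # best case: perfectly related traits
observed = attenuated_r(0.8, 0.7, 0.8)  # traits that truly correlate at 0.8

print(f"highest observable r: {ceiling:.2f}")
print(f"observed r when the true r is 0.8: {observed:.2f}")
```

Under these assumptions, even a strong underlying relationship shows up as an observed correlation of roughly 0.6, which is in line with the caution above.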


Figure 1. Scatter plots with associated correlation coefficients. A is a
perfect correlation; r = 1.0. Such plots can also arise from two identical sets
of data. B is zero correlation. C is from test data in a freshman course that
yields r = 0.71. The scatter is apparent but the pattern is elongate. D is from
data from a knowledge survey of nearly 300 freshmen. The scatter pattern
is very elongate and yields an r-value of 0.95.

For brevity, I admittedly simplified the nature of correlation for this column. As seen in Figure 1, correlations work best when there is a range of scatter of both sets of data. Without such a range, some absurdities can result. Consider, for instance, a situation in which tests reveal that students learned little and student evaluations confirm as much. Data like these are likely to condense into such little scatter that a line-fit may show a low correlation even though there is high agreement in the actual situation. Consider the alternative situation in which nearly every student got an "A" and all students agreed that the learning that took place was high; a low correlation may still result even though there is high agreement.
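The effect of a narrow range can be seen in a small simulation. This is a sketch with invented numbers, not data from any class: each simulated student has a "knowledge" level, and two measures each record that level plus independent noise. Correlating the two measures for everyone, and then only for the students who scored 85 or higher on the first measure, shows how the coefficient drops when the scatter is squeezed into a small range, even though the two measures agree just as well as before.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson r for two equal-length lists of numbers."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(14)  # fixed seed so repeated runs give the same numbers

# Hypothetical students: a knowledge level plus independent noise
# on each of two measures of that knowledge.
pairs = []
for _ in range(2000):
    knowledge = rng.gauss(70, 10)
    pairs.append((knowledge + rng.gauss(0, 5), knowledge + rng.gauss(0, 5)))

# Correlation across the whole simulated class.
r_all = pearson_r([a for a, _ in pairs], [b for _, b in pairs])

# Same two measures, restricted to students at the top of the first one.
top = [(a, b) for a, b in pairs if a >= 85]
r_top = pearson_r([a for a, _ in top], [b for _, b in top])

print(f"all students:     n = {len(pairs)}, r = {r_all:.2f}")
print(f"top scorers only: n = {len(top)}, r = {r_top:.2f}")
```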

Use Excel® to Check Your Own Tests

If you've never used your own class test data to make such a split-halves check, try it on your own tests. The Excel package usually supplied to professors by institutions as part of Microsoft Office can calculate r-values easily. (Prior to such tools, this was a laborious task.) Suppose you have given a test to ten students. You have split the test into two halves, the even items and the odd items, and graded each half. You now have two grades for each student from the same test. (Alternatively, you could have given two tests and would like to see how reliably the two compare. If any student missed a test or took a makeup that differed from the test you are analyzing, remove such students from the data set. You'll want clean data from only the test or tests you are examining. In any event, you now have a data pair for each student.)

Type the data into two columns of an Excel spreadsheet as shown in Table 1. Each row represents a student's data. Click on the Tools menu. You may see "Data Analysis" as an option in the pull-down menu. If not, click on "Add-Ins," select the "Analysis ToolPak," and click "OK." "Data Analysis" will then appear as an option under Tools. Select "Correlation" and click "OK." Because we have labels in the first row, check the box "Labels in first row." We want to correlate our data arranged in two columns, so click on "Columns." To keep life easy, select "New Worksheet Ply" for outcomes. For the input range, click on the upper left cell (the one with "Odds" in it), type a colon (:), then click on the lower right cell. The input range always runs from the upper left to the lower right of the data set. If you want to check, say, consistency in the results of five quizzes, then you can have five columns in your data set. As soon as you click OK, your correlation coefficient(s) should appear, and will look like Table 2. The data in Table 1 yield the r-value (0.528) shown in Table 2.

Use the data here for a practice run with Excel®, then put in your own test data. Finally, the more data points (that is, the more students involved and the longer the test), the more meaningful the correlation.


Table 1. Data typed into Excel spreadsheet columns.


Table 2. Correlation coefficient for data from two columns in Table 1.

The coefficient from split halves (or from two tests over the same material) should have an r-value greater than 0.7, although 0.6 is acceptable; 0.5 or less shows that we need to brush up on writing better tests (the topic of the next Diary). The fact that we don't often get r-values greater than 0.7 shows that tests are not as solid as we think because they are not perfect samples of knowledge. We (personally, as individuals) should use the Excel tool to check just how reliable our tests are, rather than presume that they are good "measures of actual knowledge." Consideration from the viewpoint of fractals underscores the need for multiple measures in learning and teaching issues; no single measuring tool will likely ever prove solid enough to rely upon by itself.

References

  Cashin, W. E. 1988. "Student ratings of teaching: a summary of the research." Kansas State Univ. Center for Faculty Evaluation and Development, Idea Paper n. 20.

  Cohen, P. A. 1981. "Student Ratings of Instruction and Student Achievement: A Meta-analysis of Multisection Validity Studies." Review of Educ. Res. 51: 281-309.

  Feldman, K. A. 1998. "Identifying Exemplary Teachers and Teaching: Evidence from Student Ratings." In Teaching and Learning in the College Classroom 2nd ed., K. A. Feldman and M. B. Paulsen, eds., Needham Heights, MA: Simon & Schuster, 391-414.

Contact:

Ed Nuhfer
Director
Center for Teaching and Learning
Museum Building 434, Campus Box 8010
Idaho State University
Pocatello, ID 83209-8010
Telephone: (208) 282-4703
FAX (208) 282-5361
E-mail: nuhfed@isu.edu

 

 

© Copyright 1996-2005. Published by James Rhem & Associates, Inc.
(ISSN 1057-2880)
All rights reserved worldwide.