Volume 14 Number 2
DEVELOPER'S DIARY:
Tests as Anchors that Wobble: Understanding Imperfect Correlations in Educational Measurements: Educating in Fractal Patterns XI
Edward B. Nuhfer, Center for Teaching and Learning, Idaho State University
This Diary has a more applied emphasis than previous ones. It seeks to explain why
correlations found in educational research are usually lower than those found in
the physical sciences, and why tests are imperfect measures of knowledge, and it
encourages faculty members to use the statistical capabilities of Excel® to
look at their own tests' reliability.
What instructional practices are most effective in producing
learning? How well are student ratings of professors tied to students' learning?
Educational research seeking to provide answers to such questions involves
comparisons of test scores with varied practices or student ratings. The answers
are usually summarized in terms of a single number called a correlation
coefficient, conventionally symbolized as "r." The numerical value of the
coefficient expresses the strength of the numerical relationship between two
variables. If you managed to get a doctorate without calculating a correlation
coefficient and doing a "least-squares line fit," then congratulate yourself;
many of us were not so fortunate! For the fortunate, let's start with some
conceptual rudiments. In Figure 1, a perfect correlation is shown in "A," where
r = 1.0. "A" reveals a plot of two variables in which all points fit perfectly
as a "best fit" along a line. In contrast, the relationship in "B" shows
absolutely no linearity between two variables. The coefficient calculated from
these points is zero. The more a two-dimensional plot of scattered points is
elongate and approaches a line (see "C" and "D" in Figure 1), the stronger the
correlation. Correlation coefficients can run from r = -1 (perfect inverse
correlation) to r = 1 (perfect positive correlation).
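For readers who were spared the calculation, the arithmetic behind r can be sketched in a few lines of Python. This is a minimal illustration of the standard Pearson formula, not a tool from this column; the function name pearson_r and the sample numbers are invented for the example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviations of every point from the mean of its own variable.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Numerator: how strongly the two sets of deviations move together.
    co_moment = sum(a * b for a, b in zip(dx, dy))
    # Denominator: rescales by the spread of each variable so r stays in [-1, 1].
    spread = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if spread == 0:
        raise ValueError("r is undefined when one variable does not vary")
    return co_moment / spread

# Hypothetical scores, invented only to exercise the function.
first = [1, 2, 3, 4, 5]
print(pearson_r(first, [2, 4, 6, 8, 10]))    # points on a rising line, as in plot "A"
print(pearson_r(first, [10, 8, 6, 4, 2]))    # points on a falling line
print(pearson_r(first, [2, -1, -2, -1, 2]))  # no linear trend at all
```

The three calls correspond to the perfect positive, perfect inverse, and zero-correlation cases described above.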
When is a numerical relationship good enough to be useful?
Cashin (1988) recognized in his discussions regarding student evaluations:
"Correlations between .20 and .49 are practically useful. Correlations between
.50 and .70 are very useful but they are rare when studying complex phenomena."
Researchers accustomed to working with simpler physical relationships are
accustomed to higher values of r such as 0.9 or greater. They seem disappointed
when educational measurements fail to produce similar results. Some presume that
tests are analogous to stable, calibrated lab instruments and that test scores
are rock-solid "measures of actual knowledge."
The fractal nature of brain neurology, synaptic
connections, and thus, learning, helps to explain why no single tool, including
tests, can be the stable "measure of knowledge" presumed or hoped for.
Three-dimensional, interconnected, interfolded, branching neural networks
produced through learning are so complex as to ensure that the best measures of
learning are mere samples of the actual knowledge stored within these. The
fractal nature of brain neurology and its resultant variability of expression in
learning practically guarantees that no single measure of student learning will
prove good enough to characterize the actual knowledge present. That's why
legitimate assessments require multiple measures of learning. We see
characteristics of fractals in the "multiple measures" language of more recent
philosophies of faculty evaluation and of assessment.
For illustration, think of neural networks that contain
particular knowledge (such as content of a course or a unit of a course) as a
fractal surface—like a large area of Earth's surface with its natural rough
topography. A test is like taking some measurements across this surface. If two
tests separately derive knowledge that is a good representation of the surface,
then they should both correlate very highly with one another. A problem, though,
lies in the inherent roughness of any fractal surface, which means the
likelihood of two different sets of measures correlating perfectly with one
another is almost zero. In itself, this reveals why two tests given over the
same material won't correlate perfectly with each other.
Add to this the fact that a test of a class samples not one
brain surface, but many, and one can recognize why writing a good test is a
challenge. Individual test questions trigger responses from students to supply
information or to use information to engage in higher level thinking such as
synthesis or evaluation. Different learners perceive knowledge differently and
their brains retrieve it a bit differently. If information is retrieved
differently, an individual test question may trigger a response in some
students and not in others, even though all may have the knowledge. In teaching,
we know that to come at material from as many ways as possible accommodates the
varied learning styles inherent in different students' neural wiring. If such
differences have credibility, then good test design must take this into account,
just as good instructional design does. A "good" test will efficiently trigger
responses from as many as possible of the students who have the knowledge.
We presume tests are our anchor...
For faculty who conceptualize teaching as simply pouring in
knowledge, it is easy to believe that tests reveal how well students have
learned by their ability to pour out the same knowledge on a test. Such faculty
have great faith in tests as a sole and solid tool to evaluate individual
student learning, so the idea that tests might be less than perfect measures
will cause them apprehension.
Assessment, in contrast to individual grading, looks less
at individuals, and more at units such as a class, a course, or a curriculum as
a whole. While assessment approaches differ from the approaches to the
evaluation of individuals, the two can be mutually informative when considered
as multiple measures. For example, assessment encourages us to examine our tests
in the context of our classes as a whole, rather than simply as a tool for
grading students. When we do, we find some valuable insights.
One is the concept of test reliability. The best any test
can be is a good sampling of what information is stored in students' minds. From
the score on the test, we extrapolate that test score/grade to express the
student's command of that whole knowledge. If our extrapolation is perfectly
correct, then an equally good test over the same material should reveal a
perfect correlation with the first. Of course it doesn't, for reasons considered
above: a single sampling is not identical to another sampling of a complex
fractal form.
We can use correlation to see the reliability of our own
tests. To do a correlation, we need data from at least two samplings. As one
approach, we could give all of our students two different tests on the same
material, and see how consistently our tests measured the same students'
knowledge. However, giving two tests is a lot of work, and we can use a standard
alternate method called "split halves" (See Jacobs and Chase, 1992, Developing
and Using Tests Effectively..., Jossey-Bass, pp. 32-50) to discover the general
reliability our tests provide. This involves splitting a single test
into two half-tests, for example by using the odd-numbered items as one and the
even-numbered items as the other. If "perfectly reliable," each half should give the same result per
student and a plot as in "A" will result. Another way is to look at our past
semester's grade sheet and treat our entire course grading as a single test.
Thus, if we gave ten quizzes or four tests, we could split our quizzes/tests
randomly and do a similar check.
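The odd/even split can be sketched in Python as follows. This is only an illustration under stated assumptions: the item-level scores are invented, and the names split_half_scores and pearson_r are hypothetical helpers, not part of any package mentioned in this column.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r for two equal-length lists (undefined if either list is constant)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def split_half_scores(item_scores):
    """Return (odd_totals, even_totals), one pair of half-test totals per student.

    item_scores: one row per student, one column per test item (points earned).
    Items are numbered from 1, so "odd" items sit at list indexes 0, 2, 4, ...
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]
    even_totals = [sum(row[1::2]) for row in item_scores]
    return odd_totals, even_totals

# Hypothetical item-level results: 6 students, 8 items, 1 = correct, 0 = incorrect.
item_scores = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
]
odds, evens = split_half_scores(item_scores)
print("odd-item totals: ", odds)
print("even-item totals:", evens)
print("split-half r:", round(pearson_r(odds, evens), 2))
```

The same function works for the grade-sheet variant: treat each quiz as one "item" column and each row as a student.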
But anchors wobble more than we think...
Plot "C" in Figure 1 results from a split-halves analysis of ten quizzes. The
r = 0.7 shown in C is, as we shall see, a reasonable result, but the plot
reveals plenty of internal imperfection. Some tests will be better
representative samples than others. With study and tuning, tests can be
optimized, but a faculty member doesn't have the luxury of tuning a test until
fit for marketing. It's usually created without pre-testing and intended for
one-time use. In routine test design, one hopes for an r-value greater than
0.6—ideally 0.7 or greater. The next Diary will summarize ways to write more
reliable tests, but this column cautions us to examine the tests we now use.
And wobble can shake our perceptions.
When we think of our exams as rock solid "measures of
actual knowledge" we assume (wrongly) that they have the reliability of the
perfect correlation (r = 1) shown in "A." If we are trying to do a correlation
between something such as test grades and another measure (such as a knowledge
survey—see NTLF V13N2, pp 8-11), we cannot expect correlations any better than
the internal reliability or "wobble" of the worst measurement tools involved.
(In fact, the knowledge survey in D shows higher internal reliability than the
tests in C.) Faculty often see correlations such as r = 0.47 between student
ratings and test performance (Cohen, 1981), or r = 0.56 between test performance
and the teachers' degree of preparing and organizing their courses (Feldman,
1998). Faculty with unexamined faith in test reliability are prone to misjudge
these "low correlations," presuming that they result from fogginess
of student ratings or from a lack of real importance of course organization rather
than partly from problems in the tests themselves. When we get an r-value such as 0.47
between student ratings and test performance, part of the imprecision comes from
imprecision in ratings and part of the imprecision comes from the tests
themselves. Neither can be perfect samples of what we are trying to measure, but
the tendency is to blame the imprecision on a measure other than a test. The
fact is that two routine class tests may not correlate much better than they do
with other good measures. Given the "wobble" associated with test reliability,
correlations of r = 0.6 between another measure and exam scores may be nearly as
good as we can expect, given the nature of tests.
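Classical test theory offers one standard way to put a number on how much "wobble" limits a correlation. It is brought in here as an outside assumption, not something this column derives: if each measure has a known reliability, the observed correlation between them is the true correlation shrunk by the square root of the product of the two reliabilities. The sketch below uses the hypothetical helper name attenuation_ceiling and, as a further simplification, treats the split-half r-values quoted in Figure 1 as if they were reliabilities.

```python
from math import sqrt

def attenuation_ceiling(reliability_x, reliability_y):
    """Largest observed r expected between two measures, given each one's reliability.

    Classical test theory assumption (not from this column):
        observed r = true r * sqrt(reliability_x * reliability_y)
    so with a perfect underlying relationship (true r = 1) the observed value
    tops out at sqrt(reliability_x * reliability_y).
    """
    for value in (reliability_x, reliability_y):
        if not 0.0 <= value <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return sqrt(reliability_x * reliability_y)

# Values quoted in Figure 1: tests in "C" (0.71), knowledge survey in "D" (0.95).
print(round(attenuation_ceiling(0.71, 0.95), 2))
# Two routine tests that each reach only 0.6.
print(round(attenuation_ceiling(0.60, 0.60), 2))
```

Under these assumptions, even a perfect underlying relationship would show up as an r well short of 1.0, which is consistent with the caution above about blaming the other measure.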

Figure 1. Scatter plots with associated correlation
coefficients. A is a
perfect correlation; r = 1.0. Such plots can also arise from two identical sets
of data. B is zero correlation. C is from test data in a freshman course that
yields r = 0.71. The scatter is apparent but the pattern is elongate. D is from
data from a knowledge survey of nearly 300 freshmen. The scatter pattern
is very elongate and yields an r-value of 0.95.
For brevity, I admittedly simplified the nature of
correlation for this column. As seen in Figure 1, correlations work best when
both sets of data span a reasonable range of values. Without such a range, some
absurdities can result. Consider, for instance, a situation in which tests
reveal that students learned little and student evaluations confirm as much.
Data like these are likely to condense into such little scatter that a line-fit
may show a low correlation even though there is high agreement in the actual
situation. Consider the alternative situation in which nearly every student got
an "A" and all students agreed that the learning that took place was high; a low
correlation may still result even though there is high agreement.
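A small simulation can make this range effect visible. Everything in it is an assumption made for illustration: the class of 2000 simulated students, the noise levels, and the seed are invented, and pearson_r is a hypothetical helper. Two measures are built from the same underlying "true level," then the correlation is computed once for everyone and once for only the top tenth of test scores.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r for two equal-length lists (undefined if either list is constant)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(14)  # fixed seed so the run is repeatable

# Hypothetical class: each student has a "true" level; each measure adds its own noise.
true_level = [rng.gauss(0, 1) for _ in range(2000)]
test_score = [t + rng.gauss(0, 0.5) for t in true_level]
rating = [t + rng.gauss(0, 0.5) for t in true_level]

r_full = pearson_r(test_score, rating)

# Keep only students whose test score falls in the top tenth: a narrow range.
cutoff = sorted(test_score)[len(test_score) * 9 // 10]
narrow = [(s, q) for s, q in zip(test_score, rating) if s >= cutoff]
r_narrow = pearson_r([s for s, _ in narrow], [q for _, q in narrow])

print("all students:   r =", round(r_full, 2))
print("top tenth only: r =", round(r_narrow, 2))
```

The two measures agree exactly as well in the narrow group as in the full class, since they were generated the same way; only the range of scores has shrunk, and the coefficient drops with it.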
Use Excel® to Check Your Own Tests
If you've never used your own class test data to make such
a split-halves check, try this yourself for your own tests. The Excel package
usually supplied to professors by institutions as part of Microsoft Office can
calculate r-values easily. (Prior to such tools, this was a laborious task.)
Suppose you have given a test to ten students. You have split the test into two
halves (even-numbered items and odd-numbered items) and graded each. You now have two grades for each
student from the same test. (Alternatively, you could have given two tests, and
you'd like to see how reliably two tests compare. If any student missed a test
or took a makeup that differed from the first test you are analyzing, remove
such students from the database. You'll want clean data from only the test or
tests you are examining. In any event, you now have a data pair for each
student.) Type the data into two columns of an Excel spreadsheet as shown in
Table 1. Each row represents a student's data. Click on the Tools Menu. You may
see "Data Analysis" as an option in the pull-down menu. If not, click on
"Add-Ins" and select the "Analysis ToolPak." Click "OK." "Data Analysis" will
then appear as an option under Tools. Select "Correlation" and click "OK."
Because we have labels in the first row, check the box "Labels in first row." We
want to correlate our data arranged in two columns, so click on "Columns." To
keep life easy, select "New Worksheet Ply" for outcomes. For the input range,
click on the upper left cell (the one with "Odds" in it), type a colon (:), then
click on the lower right cell. The input range is always upper left to lower
right of the data set. If you want to check, say, consistency in the results of
five quizzes, then you can have five columns in your data set. As soon as you
click OK, your correlation coefficient(s) should appear, and will look like
Table 2. The data in Table 1 yield the r-value (0.528) shown in Table 2. Use the
data here for a practice run with Excel®, then put in your own test data.
Finally, the more data points (that is, the more students there are involved and
the longer the test), the more meaningful the correlation.

Table 1. Data typed into Excel spreadsheet columns.

Table 2. Correlation coefficient for data from two columns in
Table 1.
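For readers who prefer a script to the menu steps, the same kind of output can be sketched in Python. The scores below are hypothetical and are NOT the values from Table 1; correlation_table and pearson_r are invented helper names. Within Excel itself, the worksheet function CORREL applied to the two data ranges should return the same coefficient as the Data Analysis tool for a pair of columns.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson r for two equal-length lists (undefined if either list is constant)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_table(columns):
    """Pairwise r for every pair of named columns, keyed by (name_a, name_b).

    Only the lower triangle is returned, including each column against itself,
    which mirrors the layout of a spreadsheet correlation table.
    """
    names = list(columns)
    return {
        (a, b): pearson_r(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[: i + 1]
    }

# Hypothetical half-test scores for ten students, one list entry per student.
columns = {
    "Odds":  [18, 15, 20, 12, 17, 14, 19, 16, 13, 11],
    "Evens": [16, 17, 19, 13, 14, 15, 18, 12, 14, 12],
}
table = correlation_table(columns)
for (a, b), r in table.items():
    print(f"{a:>6} vs {b:<6} r = {r:.3f}")
```

To check consistency across, say, five quizzes, add one named list per quiz to the columns dictionary; every pairing is then reported.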
The coefficient from split halves (or from two tests over the same
material) should have an r-value greater than 0.7, although 0.6 is acceptable;
0.5 or less shows that we need to brush up on how to do a better job of
writing tests (the topic of the next Diary). The fact that we don't often get r-values
greater than 0.7 shows that tests are not as solid as we think because they are
not perfect samples of knowledge. We (personally as individuals) should use the
Excel tool to check just how reliable our tests are, rather than presume that
they are good "measures of actual knowledge." Consideration from the viewpoint
of fractals underscores the need for multiple measures in learning and teaching
issues; no single measuring tool will likely ever prove solid enough to rely
upon by itself.
References
• Cashin, W. E. 1988. "Student Ratings of Teaching: A Summary of the Research." Kansas State University Center for Faculty Evaluation and Development, Idea Paper No. 20.
• Cohen, P. A. 1981. "Student Ratings of Instruction and Student Achievement: A Meta-analysis of Multisection Validity Studies." Review of Educational Research 51: 281-309.
• Feldman, K. A. 1998. "Identifying Exemplary Teachers and Teaching: Evidence from Student Ratings." In Teaching and Learning in the College Classroom, 2nd ed., K. A. Feldman and M. B. Paulsen, eds., Needham Heights, MA: Simon & Schuster, 391-414.
Contact:
Ed Nuhfer
Director
Center for Teaching and Learning
Museum Building 434, Campus Box 8010
Idaho State University
Pocatello, ID 83209-8010
Telephone: (208) 282-4703
FAX (208) 282-5361
E-mail: nuhfed@isu.edu
  
© Copyright 1996-2005. Published by James Rhem & Associates, Inc.
(ISSN 1057-2880) All rights reserved worldwide.