From john_damron@pop.mindlink.net
Mon Oct 14 20:56:48 1996
POLITICS OF THE CLASSROOM (revised)
Pt. 1/SimpleText Version


INSTRUCTOR PERSONALITY AND THE POLITICS OF THE CLASSROOM


By John C Damron, PhD

e-mail: john_damron@mindlink.bc.ca

Copyright John C Damron, Douglas College, 700 Royal Ave, New Westminster, British Columbia, Canada. This document can be freely redistributed in whole or in part, provided that this copyright notice is included intact, and that no material profit is generated from such a transaction.

The author can be contacted in the Social Sciences Department, Douglas College, P.O. Box 2503, New Westminster, British Columbia, Canada, V3L 5B2 (telephone (604) 527-5312).


TABLE OF CONTENTS

MAIN TEXT (Quotations indicated by quote)

APPENDIX A: THE VALIDITY AND ACCURACY OF STUDENT RATINGS

APPENDIX B: MORE ON INSTRUCTOR PERSONALITY AND STUDENT RATINGS

APPENDIX C: THE VALIDITY OF STUDENT RATINGS INTERPRETATIONS

APPENDIX D: A CROSS-DISCIPLINE RATING BIAS

APPENDIX E: THE POLITICS OF VALIDITY -- DISCRIMINANT AND CONVERGENT

ENDNOTES (Numerically denoted in text as [x])

REFERENCES

INTRODUCTION

From their inception earlier this century, student instructional rating questionnaires have been touted as a cheap and convenient means of evaluating the teaching of college and university faculty. They were eagerly embraced by college administrators in the 1960s because they offered a ready vehicle for assessing faculty hired to teach the droves of students entering post-secondary institutes. Their promise, technical appearance and utter simplicity have ensured the popular use of student instructional ratings for well over thirty years now.

Nevertheless, student ratings were, and continue to be, controversial because it is not clear that they do indeed assay teaching effectiveness. For example, some critics argue that student ratings are unduly affected "by the personal style of the instructor rather than [the] instructor's ability to convey instructional material" (Abrami, Leventhal & Perry, 1982). Ultimately, such concerns gave rise to a host of studies that explored the relationship of instructor personality to student ratings and, in some instances, to student achievement.

PROCESSES AND PRODUCTS


Before examining the contribution of instructor personality to student instructional ratings, let us first take note of two fundamental dimensions of teaching - instructional processes and instructional products. Instructional processes comprise the mix of routines, techniques and strategies employed by instructors to promote learning in their students. Among others, these may include informative lectures, classroom and laboratory demonstrations, field trips, debates, and instructional videos. Instructional products, in contrast, consist of the substantive outcomes promoted by instructional processes (Abrami, d'Apollonia & Cohen, 1990). Although both are essential elements of teaching, instructional products are utterly fundamental. It is the products of teaching that students come to possess. And it is the products of teaching that students take with them when they leave places of learning in pursuit of further challenges. Whether called student learning, mastery, or cognitive growth, it is the capacity to prompt such changes that distinguishes effective from ineffective teaching. Thus, one can say that instructional products establish whether and to what extent classroom practices are, indeed, effective instructional processes (e.g., Abrami, d'Apollonia & Cohen, 1990). The reader is advised to bear the process-product distinction clearly in mind while reading the pages that follow.

STUDENT RATINGS AND INSTRUCTOR PERSONALITY

Studies of instructor personality have considered two distinctly different views of personality (Murray, Rushton & Paunonen, 1990; Feldman, 1986) and a broad feature of personality, instructor expressiveness (e.g., Abrami, Leventhal & Perry, 1982). Research bearing on the latter has usually been conducted within the framework of the "Dr. Fox" (or educational seduction) paradigm, in which instructional processes and their assessment are examined in a laboratory setting.

Investigations of the relationship of instructor personality to student ratings have assessed personality either by asking instructors to complete an established personality inventory or by having students or colleagues record their perceptions of instructors' personalities on questionnaires provided for this purpose (Feldman, 1986). In a comprehensive review of the literature on instructor personality and student instructional ratings, Feldman (1986) found mixed evidence that students' ratings were systematically related to instructors' personalities. Although some studies found little or no relationship between the personality characteristics of instructors and their student ratings, others reported broad and robust relationships. These disparate outcomes are attributable to the procedures used to assess instructor personality (Feldman, 1986). When personality is assessed with an established personality inventory completed by the instructor, there is virtually no correlation between instructor personality and student ratings. However, when the assessment is based on the perceptions of students or colleagues, the overall relationship of instructor personality to student ratings is substantial, with positive correlations ranging from moderate to high (Feldman, 1986).

The effect of perceived instructor personality on student instructional ratings can be substantial. Murray (1975), for example, found that 67% of between-teacher variance in student ratings was attributable to, and predictable from, peer ratings of instructor liberalism, lightheartedness, extraversion, exhibitionism, and other perceived attributes. Sherman & Blackburn (1975) and Tomasco (1980) reported that student ratings were predictable from student perceptions of, among other characteristics, instructor amicability, pragmatism, nurturance, changeability, and exhibitionism. Similarly, Rushton, Murray & Paunonen (1983) and Murray, Rushton & Paunonen (1990) found that 40 to 70% of between-instructor variance in student ratings was attributable to peer assessments of instructors' personality attributes. For the most part, student perceptions of instructor personality are similar to those formed by departmental colleagues (Feldman, 1986). Research of this sort suggests that a considerable portion of the differences in instructors' student ratings is based on perceived personality characteristics rather than instructional effectiveness.

A fascinating extension of these findings was turned up in the aforementioned study by Murray, Rushton & Paunonen (1990). They found that the mix of perceived personality characteristics that yields high student ratings varied markedly for different course types. For instance, highly rated instructors of large introductory courses were perceived as liberal, neurotic and extraverted, while high-scoring instructors of smaller discussion-oriented classes were perceived as gregarious, adaptable and supportive. Findings such as these prompted the authors to conclude that instructors "tend to be differently suited to different types of courses" and that "the compatibility of teachers to courses appears to be determined in part by personality characteristics."

INSTRUCTOR EXPRESSIVENESS


As noted above, researchers have also examined the effects of a more restricted range of personality characteristics on student ratings and student achievement. Most can be broadly summarized as "instructor expressiveness." Although details differ from study to study, the overall results suggest that vacuous but animated, charismatic, and amusing lectures yield significantly higher student ratings than substantive but less animated lectures.

A meta-analysis of a dozen of these studies revealed that "instructor expressiveness had a substantial impact on student ratings but a small impact on student achievement" (Abrami, Leventhal & Perry, 1982). Summary and global ratings, which are frequently used to make tenure and promotion decisions, were particularly elevated by instructor expressiveness. The analysis also found that lecture content had a sizable influence on student achievement but only a negligible impact on student ratings. Findings such as these prompted the architects of the educational seduction paradigm to conclude that student instructional ratings "should not be used to make decisions about faculty promotion and tenure because charismatic and enthusiastic faculty can receive favorable student ratings regardless of how well they know their subject matter or. . .how much their students learn" (Abrami, Leventhal & Perry, 1982, p. 447; Ware & Williams, 1975; 1980).[1;2]

WHAT DOES IT MEAN?


What is one to make of these findings? Some researchers (e.g., Murray, 1975; Tomasco, 1980; Murray, Rushton & Paunonen, 1990) suggest that the high positive correlation between perceived instructor personality and student instructional ratings is evidence that instructors are effective because of the persona they inject into their classroom work. Thus, in a sense, instructor personality is conceived of as an instructional process or an antecedent thereof. However, this view begs a fundamental question. Namely, are student perceptions of instructor effectiveness (as revealed by student ratings) equivalent to more objective assays of instructional effectiveness, e.g., examinations and other systematic measures of instructional products? In fact, they are not. As noted above, although the correlation between instructor expressiveness and student ratings is sizable, the correlation between expressiveness and student achievement is negligible (Abrami, Leventhal & Perry, 1982). More generally, while the correlation between student ratings and perceived instructor personality is considerable, their relationship to measures of student achievement is markedly smaller (Dowell & Neal, 1982, 1983; McCallum, 1984) [3]. Thus, we surmise that expressiveness and other perceived instructor characteristics share a minor portion of variance with measures of student achievement. At the same time, given their substantial contribution to student ratings and the weak correlation between student ratings and student achievement, it is clear that perceived instructor traits share much more variance with student ratings than with student achievement (see Appendix B for more information).

Perceived instructor personality is conceptually related to a phenomenon known to social psychologists as implicit personality theory (Brewer & Crano, 1994, pp. 128-139). As suggested by the name, implicit personality theories are generic presumptions people make in order to infer the personality attributes of other people. These inferences are usually made quickly and on the basis of scanty evidence. Indeed, they are often inordinately influenced by the first impressions formed of other people (e.g., Widmeyer & Loy, 1988). Thus, as Brewer & Crano (1994, p. 143) note, we tend to "make snap judgments about the personalities of other individuals that cannot possibly be. . .accurate" and indeed, "[there is] plenty of evidence of biases and distortions in the person perception process." Considering the ease with which impressions are formed and the typically sizable role of preconceptions, stereotypes and social context in their formation, perceived instructor personality may reveal as much about students as it does about instructors, a point made by Leventhal, Abrami & Perry (1976) in a closely related context.

The idea that student perceptions of instructors are affected by preconceptions and stereotypes is supported by research showing that male and female university faculty are evaluated differently by their students (Goodwin & Stevens, 1993). For example, Kaschak (1978) found that male students tended to rate male professors higher than their female counterparts but female students rated male and female professors equally. A similar study by Basow & Silberg (1987) found that male and female students rated female professors significantly lower than male professors, even when professors were matched for type of course, years of teaching experience, and tenure status. When considered together with the results of closely related studies (e.g., Lombardo & Tocci, 1979; Ferber & Huber, 1975) such findings prompt the conclusion that "less favorable ratings of women are most likely to occur when women are seen as not fitting gender stereotypes" (Basow & Silberg, 1987, p. 312), a conclusion reinforced by Berry's (1989) observation that "women faculty members are evaluated less favorably, especially when they step out of traditionally 'feminine' areas of knowledge."

Students are also apt to confuse situationally dictated and role-driven behavior with instructor personality, a point vividly illustrated below by Brewer and Crano.

 
Since people usually behave consistently when they are occupying a particular role, we have a strong tendency to think of the person as having the traits that go with that role. As teachers, we experience this all the time. Most of our students know us only in our role as lecturer, where our behavior is constrained to be somewhat formal, controlling, and in very large classes unsympathetic to individual student needs. By the end of the academic term, these students often believe that we are uptight, unsympathetic personalities. Many would not recognize us by the way we behave at home or at a party (1994, p. 198).


Students are prompted to attribute instructor demeanor to personality rather than roles and situations because the role is enacted by the instructor, who is also the focus of attention. Situational factors are not as salient (e.g., Brewer & Crano, 1994, p. 199) [4]. Thus, it is not altogether surprising that, as noted above, instructors of small, discussion oriented classes are often perceived as gregarious and supportive while instructors of large introductory university courses are seen as extraverted but also somewhat "neurotic" (e.g., exposed, apprehensive, controlled and aloof). Neither is it surprising that instructors become known in their academic communities for these attributes. For the most part, the circumstances of their teaching promote them.

THE POLITICS OF THE CLASSROOM


Three interrelated observations seem warranted at this juncture. First, to the extent that use of student instructional rating questionnaires is predicated on the assumption that they measure instructional effectiveness, the foregoing findings are clearly problematic. When considered together with validity research yielding only marginal and unstable relationships between student ratings and instructional outcomes (e.g., Palmer, Carliner & Romer, 1975; Dowell & Neal, 1982, 1983; Abrami, d'Apollonia & Cohen, 1990), it seems likely that most of the factors contributing to student instructional ratings are unrelated to instructors' ability to promote student learning (e.g., Small, Hollenbeck, & Haley, 1982; Ware & Williams, 1980; Chandler, 1978; see Appendix A). Indeed, given the considerable contribution of perceived instructor personality, student ratings may be more closely linked to student stereotypes of teachers than to instructional effectiveness (Erikson, 1983).

Second, the research summarized above creates knotty problems for conscientious instructors whose continued employment depends on receiving high student ratings. Since expressiveness and other perceived personality characteristics contribute greatly to high student ratings, instructors may choose to commit substantial time and psychological energy to projecting these qualities in class. However, while these inflate instructional ratings, they may add relatively little to student achievement (see Appendix B).

Alternatively, instructors may dedicate themselves to preparing substantive, conceptually challenging lectures because these contribute substantially to student achievement. However, sober, substantive and carefully given lectures may limit instructors' ability to evince exhibitionism, lightheartedness, amicability, and the like, even if they are prompted to do so. And, since lecture content contributes much less to student instructional ratings, the price instructors pay for this strategy is lower student ratings and, possibly, loss of promotions, salary increments, or employment.

Or, instructors may attempt a compromise by allocating classroom time to both strategies. But this would result in declines in either student ratings or student achievement compared to teaching keyed to one or the other strategy. Moreover, sizable doses of persona may draw students' attention away from intellectually challenging lecture materials (Ware & Williams, 1977, 1980). And clearly, since student ratings may bear on future employment and student achievement is essentially invisible to convenors and administrators, the safest strategy is to simply maximize student ratings.[5]

Third, an upshot of the research cited above is that student ratings are questionable assays of teaching effectiveness. However, in one important sense, this observation misses the fundamental point of student instructional ratings. Student ratings are often required of new teaching faculty (and encouraged for tenured faculty) because they make it more likely that instruction will uphold the "open-humanistic-excellence" rhetoric embraced by college administrations and sold to education ministries (see endnote 5).[6] As a means of classroom surveillance, administrative use of student ratings ensures that teaching will seldom rise to a level of substance that precipitates student unhappiness. For the most part, this practice is embraced independently of course objectives and standards, the capabilities and commitments of students, the validity of the evaluation instrument, and the consequences of shallow teaching for long-term student success. When combined with open-door admissions and exceedingly liberal grading policies, the net result can be little more than the trappings of student success.[7] Rather than unfettered excellence in post-secondary education, the overarching institutional agenda revealed by such practices is classroom marketability, elevated enrollments, and very high consumer satisfaction. Unfortunately, while such a stratagem may produce contented students, it essentially forsakes responsibility for educational leadership for the methods, goals and standards of the commodity marketplace (McMurtry, 1991). Although endemic in private-for-profit training schools (e.g., Trend College, CompuCollege, and the like) this surely is not a state of affairs one expects to find in an institute of higher learning.

Consumer-market models of education have weighty and widespread implications for college and university instructors, particularly in light of the research on perceived instructor personality discussed in the pages above. Within the logic of such models, instructors themselves become commodities whose market value is established substantially by student (i.e., customer) ratings.[8] McMurtry (1992) makes a similar observation.

 
The consumer-market model of education has become increasingly dominant, and now threatens to undermine the educational process itself by conceiving of education as a "product" to be "delivered" to students, whose quality can be assessed by the satisfaction of student consumers . . . We are used to the political game of "making people feel good" substituting for reality, and the underlying public relations strategy of manipulating perceptions rather than addressing actual problems. We deplore this pattern in politics and the media. But the same tidal movement. . .has taken over our understanding of education. We find it in the tirelessly repeated idea that universities and schools "must deal with public perceptions" and "sell themselves to the public," and we find its correlative in microcosm in the idea that good teachers are those who score highly on in class opinion polls of their students. At some point, educators will need to move past the surface play of perceptions to the actual learning achievements of the students they have instructed (p. 4).


Whether embraced explicitly or tacitly, consumer-market models of education raise a host of overarching issues that are in need of prompt and thorough examination.[9] However, one point seems obvious from the very start. There are limits on how much of themselves instructors can be required to market in the interest of student ratings. And there are substantial limits on how much instructors should contribute to practices whose aims are centered more on consumer satisfaction than student achievement.

EDITORIAL CONCLUSION


Ostensibly, student ratings are used to evaluate teaching because administrators believe that they assess instruction with an accuracy and certitude great enough to support personnel decisions and pedagogical development. If student ratings did not yield such assays they would serve no tangible educational purpose. Yet it is precisely these qualities that are called into question by the research discussed in these pages. If the cogency of these assumptions is an empirical matter, the research reviewed herein would suggest that neither assumption is true. Indeed, it would suggest that student instructional ratings are rooted substantially in factors unrelated to teaching and learning, not the least of which is perceived instructor personality.


Unfortunately, troublesome findings such as these have had little if any influence on college teaching evaluation practices. The reasons are fourfold. First, routine teaching evaluation is neither influenced nor guided by the professional research literature. Indeed, teaching evaluation is not widely understood as a practice that must be conducted with validated instruments and professional expertise. Deans, chairmen and convenors who wouldn't think of interpreting other psychometric profiles are often not reluctant to interpret student ratings data and ground rather weighty decisions in their interpretations. Second, widespread use of student ratings creates the appearance of teaching evaluation whether or not the evaluation instrument has been validated. Everyone (including students, boards of governors and education ministers) can see that the institution is dedicated to "excellence in teaching." This salutary effect would be lost if validation testing proved the instrument to be invalid. Third, even invalid evaluation instruments are capable of assessing student satisfaction with instructors and their courses, a factor that college administrators may value as much as or more than instructional effectiveness. Fourth, college administrators embrace student instructional ratings because they serve their managerial interests. Compared to qualitative evaluation methods, student ratings render teaching and teaching evaluation calculable and comparable across instructors and disciplines. Moreover, even vacuous teaching evaluation allows administrators to police faculty and induce them to comply with administrative agendas. Few other administrative prerogatives offer such control. Unfortunately, when acted on conscientiously by instructors, the data yielded by invalid evaluation instruments are likely to prompt classroom changes that diminish instructional effectiveness (see endnote 3). Presumably, this compromise is acceptable to those who place greater value on consumer satisfaction than effective teaching and student achievement.

Although much of the foregoing constitutes a vigorous critique of student instructional ratings, it overshadows an obvious but noteworthy final point. Ultimately, it is not incumbent upon instructors to prove that student ratings are invalid measures of instructional effectiveness. The case for student ratings must be made by those who purport to evaluate teaching for the same reasons that those presuming to assess intelligence, aptitude, and mental health must demonstrate the validity of their procedures. However, as suggested in the pages above, such a case will not be easily made and indeed, may not be possible. Yet clearly, without convincing empirical evidence to the contrary, it is not obvious that student instructional rating questionnaires are accurately characterized as teaching effectiveness metrics. But what then are they?

Student ratings and the instructional changes implied by them are little more than prescriptions for professional demeanor. Although demonstrably linked to the social perceptions of students, these prescriptions are largely unrelated to the substantive products of instruction (see Appendix A). Moreover, when asserted dogmatically and enforced by threats to promotions, raises or continued employment, student ratings become a potent means of manipulating the behavior of college and university teachers. As such, they expose teaching faculty to arbitrary regimentation and thus constitute a considerable threat to academic freedom. For these and other reasons noted in the pages above, it is not surprising that instructors often chafe against mandatory use of student instructional ratings.


INSTRUCTOR PERSONALITY AND THE POLITICS OF THE CLASSROOM (revised) Pt. 2

APPENDIX A: THE VALIDITY AND ACCURACY OF STUDENT RATINGS


As noted earlier in this paper, although instructional processes and products are essential components of teaching, instructional products are the more fundamental because they embody the effects of instructional processes on student achievement. As such, they provide criteria for establishing whether and to what extent classroom practices are viable instructional processes. High positive correlations between student ratings and instructional products and low correlations with extraneous factors form the basis of student rating validity.

According to Dowell & Neal (1982), the student ratings research literature is extensive, inconsistent, and of strikingly variable quality. It is thus difficult to interpret and summarize concisely. However, as outlined below by Abrami, d'Apollonia & Cohen (1990), several methodological considerations characterize all coherent rating validity research.

 
The first consideration for a validation design is that it must allow one to assess the degree to which the ratings of students reflect the impact of teachers on student learning and other outcomes. In particular, the design must control for plausible rival explanations of the effects of instructors on student learning (and student ratings of instruction).

 

Generally, these plausible rival explanations center around the effects of "biasing" characteristics - mainly student characteristics (e.g., ability), but also course and setting effects (e.g., class size) and extraneous instructor characteristics (e.g., grading standards). Thus, the validation design should control for plausible threats to internal validity.

 

The second requirement is that the design allows for generalization across students, instructors, courses, and other setting characteristics, various rating instruments, and different products of effective instruction. For example, we wish to conclude that ratings predict teacher impacts in a variety of courses for a variety of learning types and other effectiveness measures. Thus, the validation design should control for plausible threats to external validity. The strongest validation design will control for plausible threats to both internal and external validity (p. 220).


After circumscribing a similar set of controls, Dowell & Neal (1982, p. 51) note that such methodological requisites are essential considerations in any critical examination of validity studies "because these elements are inappropriately implemented in all but a few studies." Bearing the foregoing in mind, the validity of student ratings will be summarized here in light of the results of four meta-analyses of existing validity studies and Abrami, d'Apollonia & Cohen's (1990) analysis of variability in multisection validation findings.

In a meta-analysis of 41 early multisection validity studies, Cohen (1981) found that student achievement explained 18.5% and 22% of (respectively) overall instructor and course rating variance, leaving most between-teacher rating variance unaccounted for. It is appropriate to interpret these results cautiously, however, because many of the primary studies in Cohen's analysis used neither random assignment to course sections nor statistical procedures to control for initial differences in student ability (Cohen, 1981). In a better controlled subsequent meta-analysis, Cohen (1983) found that student achievement accounted for 14.4% of overall instructor rating variance.

Other analyses have turned up somewhat lower estimates of student rating validity. In a meta-analysis of 14 multisection validity studies, McCallum (1984) found that student achievement explained 10.1% and 6.4% of (respectively) overall instructor and course rating variance. And, in a quantitative analysis of six validity studies chosen for their exceptional control of student presage variables, Dowell & Neal (1982) found that student achievement accounted for only 3.9% of between-teacher student rating variance.[10] Indeed, in their primary study with the most extensive controls for student ability (Palmer, Carliner, & Romer, 1978), student achievement accounted for only 1.44% of between-instructor ratings variance. Both findings suggest that the link between student instructional ratings and achievement is mediated by student ability (Dowell & Neal, 1982).
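The percentages quoted above can be easier to compare when restated as correlation coefficients. The following is a minimal sketch, not part of the original analyses: it assumes that each "percent of variance explained" figure is a squared Pearson correlation (r squared times 100), so that the magnitude of the implied validity coefficient is its square root. The function and variable names are hypothetical, and the sign of each coefficient cannot be recovered from the percentage alone.

```python
import math

# Percent of rating variance accounted for by student achievement,
# as quoted in this appendix. ASSUMPTION: each figure is r^2 * 100.
variance_explained_pct = {
    "Cohen (1981), instructor rating": 18.5,
    "Cohen (1981), course rating": 22.0,
    "Cohen (1983), instructor rating": 14.4,
    "McCallum (1984), instructor rating": 10.1,
    "McCallum (1984), course rating": 6.4,
    "Dowell & Neal (1982)": 3.9,
    "Palmer, Carliner & Romer": 1.44,
}


def implied_correlation(pct):
    """Magnitude of the correlation implied by a percent of variance explained."""
    return math.sqrt(pct / 100.0)


def unexplained_pct(pct):
    """Percent of rating variance left unaccounted for by achievement."""
    return 100.0 - pct


for label, pct in variance_explained_pct.items():
    r = implied_correlation(pct)
    rest = unexplained_pct(pct)
    print(f"{label}: |r| = {r:.2f}, unexplained = {rest:.1f}%")
```

Under this assumption, the 1.44% figure corresponds to a correlation of magnitude 0.12, and even the largest estimate (22%) leaves 78% of rating variance unaccounted for by student achievement.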

Of equal significance, Dowell & Neal (1982) found evidence that validity coefficients are strongly affected by situational factors. They note that the research literature

 
can be seen as yielding unimpressive estimates of the validity of student ratings. The literature does not support claims that the validity of student ratings is a consistent quantity across situations. Rather, the evidence suggests that the validity of student ratings is modest at best and quite variable (p. 59).

 

The variability in obtained validity coefficients even in studies with reasonable methodological requirements. . .lead us to suspect that the validity of student ratings is influenced by situational factors to such an extent that a meaningful, generalizable estimate of their validity does not exist. In general. . .no meaningful estimate of the validity of student ratings can be provided with confidence that is generalizable enough to be useful.

 

In short, [student] ratings have a long way to go to gain respectability. . . (p. 60).


Abrami, d'Apollonia & Cohen (1990) draw a similar conclusion about variability in validity outcomes across validity studies and over rating dimensions.

 
Whereas the average validity coefficient for global ratings is moderately positive, the results of these studies appear inconsistent both from study to study and across rating dimensions. . . Not surprisingly, opinions about rating validity have differed markedly....
The study features that explain variability in validity outcomes do not generalize across global and specific aspects of teaching. Characteristics that predict the relationship between student perceptions of teaching and instructor effects on learning vary with the aspect of teaching being investigated. . . Unfortunately, the precise nature of this pattern of effects could not be elaborated here (p. 230).

 

[This] finding . . .underscores the importance of studying student ratings of instruction in the local institutional context to determine reliability, validity and utility more accurately (p. 230). In addition, users of ratings are better advised to exercise caution when interpreting rating results and when generalizing across conditions, especially in making promotion and tenure decisions using multidimensional rating forms (p. 224).


In a closely related observation, Dowell & Neal (1982) conclude that

 
the literature does not support a position that ratings are sufficiently valid for policy uses or clearly invalid across all validity studies... When reasonable methodological requirements are imposed on the studies to be interpreted, the literature indicates statistically significant but very modest validity coefficients.

 

This has important implications for the use of student ratings in decision making about reappointment and advancement. Because student characteristics are an important potential moderator of validity, local validation of the rating instrument should be conducted and evidence obtained that validity is reasonable before an instrument is used for policy decisions. Even then, the numerous biases and potential moderators documented in the literature suggest that different norms and different estimates of validity may be appropriate for different departments, course levels, and course types (pp. 60-61).


As noted on the first page of this appendix, validation designs must incorporate control procedures that protect against threats to internal and external validity. Failure to properly implement such controls corrupts the integrity and generality of the data yielded by the design and renders them essentially uninterpretable (Dowell & Neal, 1982).

Cohen (1981) acknowledges that his meta-analysis cannot address concerns about the internal validity of validation studies. He notes that "it is difficult to determine. . .the extent to which achievement differences among sections can be attributed to differences among teachers" (p. 305). Dowell & Neal (1982) take a step further in concluding that most validation studies are of questionable internal validity (p. 51; p. 60). They also suggest that many or most validation studies lack external validity, a conclusion that Cohen (1981) resists, at least with respect to his own study (p. 305). However, Abrami, d'Apollonia & Cohen's (1990, pp. 222-224) examination of Cohen's validity outcome data suggests that his validity coefficients cannot be generalized to other students, instructors and institutional settings [Note: Cohen is the same person in both references].

Thus, the research above suggests that the validity of student instructional ratings is problematic for three fundamental reasons. First, validation studies that do not properly control for biasing factors (e.g., student characteristics) yield internally invalid and uninterpretable estimates of rating validity. Second, when appropriate controls are implemented, resulting validity estimates account for only a small fragment of between-instructor rating variance. The proportion of variance accounted for appears to be inversely related to the scope of the controls. Third, even among well-designed validity studies, validity coefficients tend to be highly variable and mediated by situational factors to such a degree that coherent context-independent estimates of validity are not possible (Dowell & Neal, 1982; 1983; Abrami, d'Apollonia & Cohen, 1990). The latter two problems have weighty implications for the accuracy and developmental utility of student ratings.

As noted above, meta-analyses of published validity studies (e.g., Dowell & Neal, 1982; Cohen, 1983; McCallum, 1984) indicate that student achievement explains between 4% and 14% of student ratings variance, leaving 86-96% of variance unaccounted for and attributable to factors other than teaching effectiveness (e.g., perceived instructor personality and the like).[11; 12] Such validity levels are characterized by near maximum standard errors of estimate (.9798 to .9250), exceedingly wide confidence intervals, and (consequently) a very high incidence of specious and adventitious differences between and among instructors' effectiveness ratings (e.g., Ferguson, 1981, pp. 130-132; Howell, 1992, pp. 237-240).[13]
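The standard errors of estimate quoted above can be checked with a short calculation. The sketch below assumes standardized scores (a standard deviation of 1), in which case the standard error of estimate is the square root of 1 minus the squared validity coefficient. The function name is illustrative only, and the two coefficients used (r = .20 and r = .38) are assumptions chosen because their squares, .04 and .1444, correspond roughly to the 4% and 14% figures.

```python
import math

def standard_error_of_estimate(r, sd=1.0):
    """Standard error of estimate when a criterion with standard
    deviation sd is predicted from a measure correlating r with it
    (large-sample form, no small-sample correction)."""
    return sd * math.sqrt(1.0 - r ** 2)

# Assumed validity coefficients: r = .20 (about 4% of variance)
# and r = .38 (about 14% of variance).
for r in (0.20, 0.38):
    see = standard_error_of_estimate(r)
    print(f"r = {r:.2f}  r^2 = {r * r:.4f}  SEE = {see:.4f}")
```

Under these assumptions the values come out at .9798 and .9250. Even at the higher validity level, the error of prediction is only about 7.5% smaller than it would be with no predictor at all.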

As shown below, in comparing the results of their (1982) quantitative review of multisection validity studies with those of a critic (Cohen, 1983), Dowell & Neal (1983) come to a similar conclusion regarding the questionable accuracy of student ratings.

 
In validating a measure, we often accept statistically significant correlations of any magnitude between the measure and a criterion as evidence that the measure is tapping the desired dimension. . . It is [thus] clear that in one sense [student ratings] are "valid"; that is, they are statistically associated at some magnitude with the criterion. . . However, to view this validity as evidence for the accuracy of student ratings is a leap in inference unwarranted by the results of both reviews.

 

In our review, we developed confidence intervals for the true teaching ability of a typical instructor using the empirically derived validity estimate and plausible assumptions about the parameters of student ratings (a 1-5 scale, mean = 3, standard deviation = 1). . . The 95% confidence interval ranged from +/- 1.8 scale units or 3.6 out of a possible 4 (i.e., 1-5). To illustrate the importance of the difference between our validity estimate and that of our critic, we [computed] the same interval using our critic's validity coefficient. This interval ranges from +/- 1.68 scale units or 3.4 out of a possible 4 units. Our confidence interval covers 90% of the range of possible values, and our critic's covers 84% of the range of possible values [Also see endnotes 13 and 14].

 

These findings [based on a total of 19 validity studies] force the conclusion that the results of both reviews substantiate two points. First, ratings are statistically correlated with learning, but, second, ratings are very inaccurate predictors of one important criterion of teaching effectiveness, student learning. (pp. 460-461).
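The coverage figures in the passage quoted above follow from simple arithmetic on the interval half-widths. The sketch below reproduces only that arithmetic; it takes the half-widths (1.8 and 1.68 scale units) as given and makes no claim about how Dowell & Neal derived them. The function name and the labels are illustrative assumptions.

```python
def interval_coverage(half_width, scale_min=1.0, scale_max=5.0):
    """Fraction of the rating scale's range spanned by a symmetric
    interval of +/- half_width, capped at the full range."""
    scale_range = scale_max - scale_min
    return min(2.0 * half_width, scale_range) / scale_range

# Half-widths as reported in the quoted passage (1-5 scale).
for label, half_width in (("Dowell and Neal", 1.80), ("critic", 1.68)):
    share = interval_coverage(half_width)
    print(f"{label}: +/- {half_width:.2f} covers {share:.0%} of the scale")
```

On a scale with only four units of range, an interval of this width leaves almost no room to distinguish one instructor from another.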


Dowell & Neal (1983) conclude their discussion with the following observation about the managerial appeal of student ratings.

 
The attraction of student ratings to higher education officials lies in the seductive way they seem to reduce a complex human activity, teaching, to simple numbers. Administrative overreliance upon ratings may encourage faculty to "perform to criterion," to teach in such a way as to accrue good ratings, even though this may not result in optimal educational practice. Ratings cannot substitute adequately for more informed judgment about teaching effectiveness, which might be obtainable from reviewing course materials, visiting classes, and discussing teaching practices with faculty. Unfortunately, these detailed qualitative methods are currently regarded as too expensive and difficult to use in most institutions.

 

Evidence presented in [Cohen's 1983] review reinforces our earlier conclusion that student ratings are _inaccurate_ indicators of student learning and therefore are best regarded as indices of "consumer satisfaction" rather than teaching effectiveness. . . (p. 462; see endnote 2 for related comments).


This observation is virtually identical to a thesis of the present paper.

According to educator Wilbert McKeachie (1987, p. 4), "for personnel purposes, faculty and administrators rightly have great concerns about the reliability and validity of evaluation data." He goes on to suggest that these concerns are not as urgent when student ratings are used for instructional development. However, this is true only in the narrow sense that use of student ratings for instructional development does not usually affect one's employment status. The fact remains that student ratings may still play a considerable role in guiding and substantiating instructional change. At the very least, this function is served poorly by marginally valid or invalid teaching evaluation instruments.

APPENDIX B: MORE ON INSTRUCTOR PERSONALITY AND STUDENT RATINGS


It was noted earlier in this paper that laboratory studies show that instructor expressiveness has a substantial impact on student instructional ratings but a smaller effect on student achievement. Indeed, a meta-analysis of such findings (Abrami, Leventhal & Perry, 1982) found that instructor expressiveness accounted for 29% of student rating variance but only 4.3% of student achievement variance. In contrast, lecture content accounted for 16% of student achievement variance but only 4.6% of student ratings variance. Thus, an upshot of educational seduction studies is that student ratings are oversensitive to the expressive style of instructors and substantially insensitive to instructors' ability to promote student learning (Abrami, Leventhal & Perry, 1982). As illustrated below by the authors of the foregoing meta-analysis, these differences are not particularly subtle.

 
On average, how large were the expressiveness and content effects in terms of mean differences? Imagine that summary and global ratings were collected on a five-point scale with the ratings distributed to yield a standard within-group standard deviation of 1.00. The difference between the ratings of high- and low-expressive instructors would be approximately 1.20 scale units, or more than a standard deviation. The difference between the ratings of high and low content lectures would be approximately .45 scalepoints, or less than one-half standard deviation. Now imagine that students also completed a 100 item multiple choice quiz with scores distributed to yield a within-group standard deviation of 15.00. The difference between the quiz results for high and low expressiveness instructors would be approximately 6.75 items or less than one-half standard deviation. The difference between the quiz results for high and low content lectures would be almost one standard deviation (pp. 455-456).


In related studies of student and instructor characteristics, student ratings, and student achievement, Abrami, Perry & Leventhal (1982) obtained results similar to those reported above. Moreover, Abrami, Perry & Leventhal (1982) found that in both laboratory and field studies (see Abrami & Mizener, 1985)

 
teacher effects on ratings were significantly greater than teacher effects on achievement. . . Taken together these studies have implications for the field use of TRFs [teacher rating forms] where ratings are used to estimate a teacher's impact on student learning. Extremely high or low TRF scores do not suggest equally extreme learning impacts. Large differences between instructors on ratings suggest smaller differences in their effect on student achievement. For summative purposes, our data suggest that ratings are useful only to make gross distinctions between instructors. . . Finer discriminations would appear to identify illusory impacts on student learning (p. 124).


Unlike educational seduction research, studies of the relationship of perceived instructor personality to student achievement have been somewhat rare (e.g., Feldman, 1986). However, in a field (real classroom) study by Murray (1978), instructional effectiveness was assessed by students' ratings and by objective measures of student achievement. Murray found that peer ratings of instructor personality characteristics were more highly correlated with student ratings than with student achievement. He also found that the perceived personality traits associated with student ratings overlapped little with the particular attributes associated with student achievement. Thus, while instructor liberalism, exhibitionism, extraversion and personal warmth contributed sizably to student ratings, they were unrelated to objective measures of student achievement.

 
Peer ratings of personality showed significant correlations with both ratings and achievement measures, but the correlations with teacher ratings were considerably higher than those with student achievement. Thirteen of 20 personality traits correlated significantly with student ratings of teaching, whereas only 5 traits correlated significantly with student final exam performance. More interestingly, personality traits associated with high or low student ratings overlapped very little with those related to student achievement. The four personality traits correlating highest with teacher ratings were Liberalism (.68), Leadership (.66), Exhibitionism (.65), and Extraversion (.60). The four personality traits correlating highest with final exam performance were Authoritarianism (.42), Independence (-.41), Leadership (.37), and Compulsiveness (.38). Only two personality traits (Leadership and Orderliness) correlated significantly and in the same direction with both measures of teaching, whereas two other traits (Independence and Authoritarianism) correlated significantly but in opposite directions with teacher ratings and student achievement. These results indicate that the same teacher characteristics that lead to high levels of student morale [i.e., student ratings] may have no effect, or in some cases a negative effect, upon student learning of course content. . . (pp. 2-3; also see Murray's Table 3, p. 9).


Multiple correlations between peer ratings of instructor personality and, respectively, student ratings and student achievement, revealed that perceived instructor personality traits explained 2.6 times more variance in student ratings than student achievement (R-squared = .5476 and .2116, respectively). On the basis of the foregoing, Murray concluded that

 
the teacher who receives high ratings from students but is below average in terms of student achievement appears to be a highly expressive extraverted type who is friendly. . . The teacher who engenders high levels of student achievement but is not highly rated by his students presents quite a different personality profile. Here we get the picture of a tough taskmaster who pays little or no attention to students' personal needs. . . The teacher who excels in both student ratings and student achievement is apparently able to draw a delicate balance between being strict and demanding on the one hand, and friendly and expressive on the other. . . (p. 5).
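The variance comparison that precedes Murray's conclusion can be verified directly from the two squared multiple correlations reported there. The sketch below is simple arithmetic on those two figures; the variable names are illustrative.

```python
import math

# Squared multiple correlations as reported in the text.
r2_ratings = 0.5476      # personality traits vs. student ratings
r2_achievement = 0.2116  # personality traits vs. student achievement

ratio = r2_ratings / r2_achievement
print(f"ratio of explained variance: {ratio:.2f}")
print(f"multiple R, ratings:         {math.sqrt(r2_ratings):.2f}")
print(f"multiple R, achievement:     {math.sqrt(r2_achievement):.2f}")
```

The ratio rounds to the 2.6 given in the text, and the squared values correspond to multiple correlations of .74 and .46.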



In a frequently cited multisection validity study, Sullivan & Skanes (1974) obtained similar results using a methodology different from Murray's. They found a sizable subgroup of instructors who facilitated high achievement in their students yet received low ratings from them, and a second subgroup whose members prompted low student achievement but nonetheless received high ratings.

In institutional environments in which instructors' student ratings are visible and student achievement is essentially invisible, inequities are inevitable. Instructors who are skilled at the art of impression management are likely to receive high student ratings whether or not their students have adequately mastered course materials. In contrast, instructors with effective pedagogical skills who cannot or will not manage students' impressions will receive substantially poorer ratings, especially if they fail to exude liberalism, exhibitionism and other key personality attributes. These findings, which would seem to be consistent with those of educational seduction researchers, call into question the validity of student ratings as measures of instructional effectiveness.[15]



APPENDIX C: THE VALIDITY OF STUDENT RATINGS INTERPRETATIONS


Although recent critiques of student ratings have quite properly focused on rating instrument validity, another form of validity -- the validity of ratings interpretations -- has also been discussed. The point here is that student rating data can be valid or invalid, as can interpretations of ratings data. This point is well taken. Even if a sufficiently valid rating questionnaire existed, there are no guarantees that interpretations of ratings data will be valid (or reasonable, coherent or fair). The edited excerpts below on ratings interpretation validity are drawn from a paper written by Jennifer Franklin and Michael Theall (1990). These authors oversee faculty development offices at Northeastern University and the University of Alabama respectively.
_______________________________

Conversations with faculty and administrators...led increasingly to concerns about what users [e.g., chairmen; deans] were doing with the information we were providing. We saw that some departmental administrators, who routinely use ratings to make decisions about personnel, evaluation policy, and resource allocation, were not familiar enough with important ratings issues to make well informed decisions...

We received many requests from faculty for assistance in interpreting reports, and we discovered that our clients would not or could not use many of the instructions for interpretation that we had provided. Clearly stated disclaimers regarding the limitations of ratings data in particular circumstances appeared to have little effect on the inclination of some clients to use invalid or inadequate data...

Our research findings, as well as anecdotal reports from many of our colleagues, suggest that many of those who routinely use ratings are liable to be seriously uninformed about critical issues. For example, among faculty respondents who reported using ratings for personnel decisions involving other faculty, nearly half were unable to identify likely sources of bias in ratings results, recognize standards for proper samples, or interpret commonly used descriptive statistics...

A great deal of scholarly attention has been paid to the validity and reliability of student ratings as a measure of instructional quality. Considerably less has been given to actual practice... Utilization of ratings is one of the least often studied or discussed issues in the realm of ratings phenomenon. There are far fewer reported observations of ratings users in action in personnel decision making or of the ways in which teaching improvement consultants use ratings in interactions with their faculty clients...

Even given the inherently less than perfect nature of ratings data and the analytical inclinations of academics, the problem of unskilled users, making decisions based on invalid interpretations of ambiguous or frankly bad data, deserves attention. According to Thompson (1988, p. 217), "Bayes' Theorem shows that anything close to an accurate interpretation of the results of imperfect predictors is very elusive at the intuitive level. Indeed, empirical studies have shown that persons unfamiliar with conditional probability are quite poor at doing so (that is, interpreting ratings results) unless the situation is quite simple." It seems likely that the combination of less than perfect data with less than perfect users could quickly yield completely unacceptable practices, unless safeguards were in place to insure that users knew how to recognize problems of validity and reliability, understood the inherent limitations of rating data and knew valid procedures for using ratings data in the contexts of summative and formative evaluation.

Whether the practices of those who operate rating systems or use ratings can stand close inspection has become open to question. It is hard to ignore the mounting anecdotal evidence of abuse. Our findings, and the evidence that ratings use is on the increase, taken together, suggest that ratings malpractice, causing harm to individual careers and undermining institutional goals, deserves our attention ...(pp. 78-80).

_________________________________
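Thompson's point about imperfect predictors, cited in the excerpt above, can be made concrete with a small calculation. The sketch below applies Bayes' theorem to a hypothetical case. All three input probabilities are invented for illustration and are not estimates drawn from the ratings literature; the function name is likewise an assumption.

```python
def posterior_effective(base_rate, p_high_given_effective,
                        p_high_given_ineffective):
    """P(instructor is effective | high rating) by Bayes' theorem."""
    p_high = (p_high_given_effective * base_rate
              + p_high_given_ineffective * (1.0 - base_rate))
    return p_high_given_effective * base_rate / p_high

# Hypothetical inputs, chosen only for illustration.
posterior = posterior_effective(
    base_rate=0.50,                 # half of instructors are effective
    p_high_given_effective=0.60,    # effective instructors rated high
    p_high_given_ineffective=0.40,  # ineffective instructors rated high
)
print(f"P(effective | high rating) = {posterior:.2f}")
```

With a weak predictor of this kind, a high rating raises the probability that an instructor is effective only from .50 to .60, which is a much smaller shift than intuition tends to suggest.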

The mechanics and style of interpreting ratings appear to vary dramatically across the domains of ratings use, particularly with respect to the role of quantitative information. It is our impression that many teaching consultants employ subjective, experientially based methods of dealing with information, while administrative decision makers may strive to construct empirically based (or "empirical looking") formulas...

There are some fundamental concepts for using numbers in decision making. To the degree that these concepts are ignored, interpretations of data become, at best, projective tests reflecting what the user (e.g., a chairperson or dean) already knows, believes, or perceives in the data. Treating tables of numbers like inkblots ('ratings by Rorschach') will cause decisions to be subjective and liable to error or even litigation...

Ratings are particularly subject to sampling problems, such as not having enough courses on which to base a comparison between two instructors and not involving enough students in rating each course section. Moreover, the fact that classes with fewer than thirty students are statistically small samples means that special statistical methods are required for some purposes.... Substantially different models for analysis are also required for various uses of the data. Given such problems, there are many opportunities for error in dealing with numbers. Three types of errors come to mind immediately.

The first involves interpretation of severely flawed data, with no recognition of the limitations imposed by problems in data collection, sampling, or analysis. This error can be compared to a Type I error in research -- wrongly rejecting the null hypothesis -- because it involves incorrectly interpreting the data and coming to an unwarranted conclusion. In this case, misinterpretation of statistics could lead to a decision favoring one instructor over another, when in fact the two instructors are not significantly different.

The second type of error occurs when, given adequate data, there is a failure to distinguish significant differences from insignificant differences. This error can be compared to a Type II error -- failure to reject the null hypothesis -- because the user does not realize that there is enough evidence to warrant a decision. In this case, failure to use data from available reports (assuming the reports to be complete, valid, reliable, and appropriate) may be prejudicial to an instructor whose performance has been outstanding but who, as a result of the error, is not appropriately rewarded or worse, is penalized.

The third type of error occurs when, given significant differences, there is a failure to account for or correctly identify the sources of differences. This error combines the other two types and is caused by misunderstanding of the influences of relevant and irrelevant variables. In this case, a personal predisposition toward teaching style... may lead a user to attribute negative meanings to good ratings, or to misinterpret the results of an item as negative evidence when the item is actually irrelevant and there is no quantitative justification for such a decision.

Any of these errors can render an interpretation entirely invalid...

How can we conceptualize the problem of ensuring that users do not make decisions or take actions that are based on invalid interpretations of data? In the following example, invalid interpretations are seen to result from either invalid or unreliable data or from lack of skill, knowledge, or necessary information on the part of the user. The strategy is to make sure that users either have or have access to sufficient skills or information to form valid hypotheses. Valid, reliable hypotheses are those interpretations of ratings that knowledgeable, skilled users, with adequate information concerning the present data, would be likely to produce or concur with.

Let us...state our goal in the following way: "The user will make decisions that are based on valid, reliable hypotheses about the meaning of data." In this case, the user should receive or construct working hypotheses that do the following things:

Take into account problems in measurement, sampling, or data collection and include any appropriate warnings or disclaimers regarding the suitability of the data for interpretation and use.

Do not attempt to account for differences between any results when they are statistically not significant (probably <.05).

Disregard any significant differences that are merely artifacts (for example, small differences observed in huge samples, which can technically be significant but are unimportant).

Account for any practically important, significant differences between results in terms of known, likely sources of systematic bias in ratings or reliably observed correlations, as well as in terms of relevant praxiological constructs about teaching or instruction.

The user should also refrain from constructing or acting on hypotheses that do not meet these conditions... (pp. 87-89)...

The validity of inferences or interpretations should concern those who design and operate ratings systems as much as validity and reliability of instruments used to obtain the data... How use occurs ought to be a very important issue, one for which those who develop ratings systems ought to be held accountable... (pp. 80-81).
_______________________________
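The working hypotheses listed in the excerpt above distinguish between differences that are statistically significant and differences that are practically important. The sketch below illustrates one way of checking both for two instructors with only a handful of course sections each. It is an illustration under stated assumptions, not a procedure prescribed by Franklin and Theall: the ratings are hypothetical, an exact permutation test is used here because it makes no large-sample assumption, and Cohen's d is used as a conventional index of practical size.

```python
from itertools import combinations
from statistics import mean, stdev

def exact_permutation_p(a, b):
    """Two-sided exact permutation p-value for a difference in means.
    Enumerates every way of splitting the pooled values into groups
    of the original sizes, so it suits small samples only."""
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    total = sum(pooled)
    n_a, n = len(a), len(pooled)
    hits = count = 0
    for idx in combinations(range(n), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / (n - n_a))
        hits += diff >= observed - 1e-12  # tolerance for float rounding
        count += 1
    return hits / count

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * stdev(a) ** 2
                  + (n_b - 1) * stdev(b) ** 2) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

# Hypothetical section-mean ratings (1-5 scale) for two instructors.
instructor_a = [3.9, 4.1, 3.6, 4.3, 3.8, 4.0]
instructor_b = [3.7, 3.5, 4.0, 3.6, 3.9, 3.4]

p = exact_permutation_p(instructor_a, instructor_b)
d = cohens_d(instructor_a, instructor_b)
print(f"mean difference = {mean(instructor_a) - mean(instructor_b):.2f}")
print(f"permutation p = {p:.3f}, Cohen's d = {d:.2f}")
```

A user following the excerpt's guidance would decline to explain the difference unless the p-value fell below the chosen threshold, and would also ask whether the standardized difference is large enough to matter before treating one instructor as better than the other.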

APPENDIX D: A CROSS-DISCIPLINE RATINGS BIAS


The following is an edited excerpt from an article on student ratings written by William Cashin (1990), a prominent evaluation expert at Kansas State University. The paper is titled "Students Do Rate Different Academic Fields Differently." Cashin examined very large data bases of students' ratings obtained with either the Educational Testing Service's Student Instructional Ratings questionnaire (SIR) or Kansas State University's IDEA questionnaire. Both are widely used in the USA. In a nutshell, he found sizable differences in how students rate teaching across various academic disciplines.
__________________________________

If you ask a college teacher whether students rate different academic fields differently, he or she will most probably say yes. If you ask why, you are not likely to be given much justification beyond the conviction that different fields are different. Nevertheless, there is increasing evidence that the conventional wisdom is correct. Students do rate academic fields differently. What is not clear is why...

The high group tends to consist of the arts and humanities. This trend is not universal, however; English language and literature and history both fall into the medium-low group. The low groups tend to consist mostly of business, economics, computer science, math, physical sciences, and engineering. The biological and social sciences and health and other professions tend to fall somewhere in the middle.

If we look at "Course Effectiveness" and "Instructor Effectiveness" combined, we see that the fine and applied arts and music fall into the high group for both measures. If we consider fields that are high on one measure and medium-high on the other, art, communications, foreign languages and literature, home economics, secretarial studies, and speech also fall toward the high end. This is very much a humanities cluster, with the exception of home economics and secretarial studies.

Several fields fall into the low group for both course effectiveness and instructor effectiveness: business and management, computer and information sciences, data processing technologies, economics, engineering, physical sciences, and physics. To the fields that were low on one measure and medium low on the other we must add accounting, chemistry, mathematical sciences, and philosophy. This is very much a math-science technical cluster, with the exception of philosophy and, perhaps, business and management.

The primary implication [of these findings] is that...we need to decide what to do about this phenomenon when we interpret student-ratings data. Administrators can no longer look at data from a variety of fields and unquestioningly compare numbers directly. Instructors cannot look at two courses they are teaching and necessarily assume that, if their ratings for the two courses are the same, that they taught both courses equally well.

The real problem arises from our not knowing why the different fields are rated differently. This finding is not due just to variations in student motivation (for example, required versus elective courses) or class size. In one unpublished analysis of IDEA data it was found, even after researchers controlled for students' motivation and class size, that differences in academic fields explained an additional 10 percent or more of the variance for some IDEA course objectives. In another study of a sample of IDEA data, 14-18 percent of the remaining variance was explained after controlling for differences among institutions, in number of courses for each field, in student motivation, and in class size.

There are several possible explanations for differences in the ratings of different academic fields. One is that the more quantitative courses tend to receive lower ratings. The low fields tend to be math, science, engineering, and quantitative business courses (for example, accounting and economics). A possible explanation for these differences is that students' quantitative skills are more poorly developed than their verbal skills. This would make quantitative courses more difficult to teach. Moreover, quantitative courses may receive lower ratings because students have lower expectations of success and lower actual rates of success. We have evidence that higher student ratings are related to...students' satisfaction and that, as grades decrease, students more frequently attribute their poor performance to factors external to themselves.

Another explanation of different ratings for different fields is that the more sequential courses, where success depends heavily on the mastery of material from a previous course, tend to receive lower ratings. This holds true for most math and science courses and for many professional courses, but it also holds true of foreign language courses, which tend to receive low ratings. Sequential courses may receive lower ratings because today's students are not studying as much as students have in previous decades and so do not have as solid a foundation for the courses that come later in a sequence...

Yet another explanation is that students in different majors rate courses differently, because of differences in attitude, in academic skill and goals, in motivation, in learning styles, or in models of effective teaching. Although students majoring in any given field are likely to vary in many ways, it is quite possible that, taken as a group, they have certain characteristics that are related to how they rate courses and instructors... (pp. 113-119).
----------------------------------------

 

* Lawrence Aleamoni, an evaluation expert at the University of Arizona, has made an observation of a similar sort regarding rating biases against required courses and student biases associated with various course levels (e.g., freshman, sophomore, and the like). He reports that

the variables that distinguish a required course from an elective, and that identify courses by level (freshman, sophomore, and so on) do seem to generate significant differences in student ratings. For example, the higher the proportion of students taking the class as a requirement, the lower the overall rating. [Moreover], freshmen tend to rate their teachers significantly lower than do sophomores, sophomores tend to rate them significantly lower than do juniors, and so on.

APPENDIX E: THE POLITICS OF VALIDITY -- DISCRIMINANT AND CONVERGENT


There are literally hundreds of student rating questionnaires in use in North America, many of which are "home grown" or mass produced by optical scanner manufacturers. Although a few questionnaires have undergone convergent validity testing (e.g., the SIR, IDEA and MSSIR), most have not, despite the finding that validity coefficients can be negative (see endnote 14). Moreover, with very rare exceptions, none have undergone discriminant validation. Thus, although weighty decisions about faculty careers are often made on the basis of student ratings, the psychometric integrity of the rating instrument is essentially unknown but treated as coherent and valid. It is difficult to imagine a more cavalier state of affairs.


ENDNOTES

1. The research cited here on perceived instructor personality (e.g., Feldman, 1986) and instructor expressiveness (e.g., Abrami, Leventhal, & Perry, 1982) complement each other in methodologically significant ways. The latter research was conducted in a laboratory setting where study variables are carefully manipulated and measured in a context free of extraneous influences. Studies of this sort are said to have high internal validity because they accurately identify relationships between or among variables. The former research was done in a more natural but less controlled setting where contaminating variables may influence the outcome of the study. Such studies tend to have high external validity, meaning that their results are readily generalizable to other settings. Conclusions rooted in both sorts of study are preferred.

2. Closely related concerns about administrative uses of student ratings have been expressed, respectively, by educators Paul Rosenfeld and S.C. Erikson.

 
The use of teaching evaluations for administrative purposes -- promotions, tenure, raises, etc. -- has been, and continues to be, controversial. The [questionnaires] often have not undergone the standard validity tests required of any good assessment device . . . Thus the use of teacher evaluation forms for administrative purposes becomes a somewhat political issue in academia, and instructors sometimes end up structuring their teaching to get "good" evaluations so they can keep their jobs, get raises, and be promoted (Rosenfeld, 1987).

 

Steps should be taken to eliminate the questionable practice of using the results of student ratings for purposes of administrative assessment. If increased pay and promotion are possible consequences, we have an open invitation for the teacher to teach for the evaluation. A more constructive use of student ratings is to allow the teacher to select items which ask questions of direct interest to the immediate purpose of improving how he or she manages the classroom hour (Erikson, 1983).


3. The ratings yielded by virtually all student instructional rating questionnaires bear a marginal and unstable or unknown relationship to the very quality effective teaching must promote: student learning. College instructors are expected, nevertheless, to submit themselves to teaching evaluation programmes, permit their professional performances to be judged by them, and, perhaps, change their instructional techniques in light of feedback from them. Although many instructors indeed do these things, it is difficult to imagine a practice more harmful to a community that is ostensibly committed to instructional effectiveness. While such programmes create the appearance of coherent teaching evaluation, they provide little or no basis for accurately assessing instructional products or promoting the development of effective pedagogical techniques. And ironically, because student instructional ratings are poorly correlated with instructional products, the changes instructors make to their teaching routines to elevate student ratings are more likely to compromise than improve teaching effectiveness.

This can be a rather disconcerting realization for enthusiastic proponents of student ratings, who may prefer to believe that evaluation feedback is inherently veridical. But on what grounds is this belief sustainable? What is the relationship of student feedback to classroom proficiency? On what basis can instructors know that the classroom changes implied by student rating feedback are, in fact, improvements? And if feedback bears only a meager relationship to student learning, to what of pedagogical worth is it related? These questions are neither frivolous, unreasonable, nor esoteric. They are utterly fundamental and in need of unambiguous answers.

4. This tendency is known widely amongst social psychologists as the "fundamental attribution error," according to which people regularly attribute role or situationally driven behavior to the personality of the actor.

5. As noted below by a seasoned B.C. community college observer, predicaments such as these are mostly of political rather than pedagogical origin.

 
Instructors are clearly in a double-bind: the institution wants something close to an "open-door" enrolment policy, yet emphasizes the importance of excellence and success. For university transfer students this can result in serious misrepresentation of what the college offers and a betrayal in terms of what it delivers. For instructors of these courses the alternatives are to uphold standards and risk poor evaluations from under-prepared or under-motivated students, or to dilute standards to meet students' emotional needs and ensure satisfactory evaluations.

 

[This] contradiction has developed because senior administration and the college board have always adopted a rhetoric aimed at pleasing the ministry, the community, and a variety of special interest groups without considering the pedagogical consequences of such a mixed bag of political promises. Divisions and departments have unwittingly absorbed and repeated this rhetoric in order to preserve or increase funding and enrollments. I think we all perpetuate parts or all of the "excellence-open-second chance humanistic" discourse that results in this instructional bind....

 

It is quite clear that a more enlightened concept of student evaluation is needed, as is more institutional support for the instructor's position (name withheld by request, 1992).


6. In most instances this socializing strategy succeeds. New instructors are usually on probation for two years or longer and are vulnerable to dismissal. Most are prepared to do what they must to comply with the expectations of those empowered to affect their employment. And most, therefore, are ready to do what is necessary to assure that the student ratings they receive are acceptably high. In an atmosphere in which good teaching is equated with high student ratings, it makes sense to weave into one's classroom performance virtually anything that elevates such ratings. For the most part, this is simple survival. However, in light of the research reported herein, it is not likely to give rise to effective teaching processes (also see endnote 3).

7. In a day and age dominated by vulgar consumerism, college degree mills and media hype, the differences between education and the trappings of education can easily go unnoticed. They go unnoticed because both endow students with the conspicuous signs of education -- course credits, grades, transcripts, and degrees, diplomas or certificates. The mere trappings of education are now pervasive in North America. Despite their grades and diplomas, graduates of the public school systems often read, write and think at levels that belie their thirteen years of public education. As indicated in the observation below, this is not a particularly new problem.

 
In our studies at the University of Texas, we have determined that today's high school graduate leaves the public school experience with a "B" average over four years of high school. But this same graduate reads below the eighth grade level. All indications are that high school graduates have lost between one and two grade levels in basic reading over the past decade. That this finding is self-evident is attested to by the colleges and universities across the nation that are swamped by thousands of would be college freshman who are not prepared with the basic academic skills they need to pursue freshman level academic work... (p. 17).

 

When I was dean at a community college 17 years ago, it was common knowledge that a student's high school grade point average was the best single indicator and predictor of his success in college. This is no longer true! In fact, today's high school graduate who reads below the eighth grade level will likely possess a B or B+ average... Grade inflation is so rampant in American public schools that grades earned no longer reflect academic skills attained (p. 23).


This observation was made in 1981 by John Roueche, professor and director of the Community College Leadership Program, University of Texas at Austin. Twelve years later, public school students fared no better.

 
Students today are leaving high school no better prepared than they were in the mid 1960s. In fact, evidence indicates that despite higher grade point averages in high school, students' skills and competencies are at the lowest levels in American history. For example, the National Center for Education Statistics reports that in tests of mathematical progress, more than one-third of all students tested performed at levels below the lowest identified level, and more than 80% are believed to be functioning below their appropriate grade level in math (Roueche & Roueche, 1993, p. 19).


Statistics such as these make it all the more necessary for colleges to adopt strategies that are demonstrably effective. Yet, as Roueche & Roueche (1993, p. 20) note below, college administrations across North America have not been particularly eager to do so.

 
Even still more unfortunate than the curious infrequency with which successful, time honored teaching and learning strategies are implemented is the continuing inattention to careful evaluation of our services to...students. Long time accusations continue about the failure of community colleges to look critically at their outcomes, and there is little evidence to the contrary. When evaluation is conducted, it is often based on naive or shallow criteria. For example, some colleges define success as the number of students who complete courses, not as the number who complete the even more important follow up courses.


8. This "market value" can sizably affect instructor employability, tenure, promotions, raises and awards. Probationary and nonregular faculty are, of course, particularly vulnerable.

9. As noted below by McMurtry (1991), the incursion of commodity market influences into academic and educational matters is now considerable.

 
It is important to remind ourselves just how far the basic research and teaching process has already been subordinated to corporate control. . . Academic awards and grants are now so universally trade-named by multinational corporations that few remain which are not dependent for their survival on business or business foundation sponsorship, a sponsorship which has historically controlled research agenda. Textbook production and distribution in public educational institutions is now almost entirely under the control of large corporations, with over two-thirds of the school text market in North America controlled by just eight multinationals by 1981. Academic teachers, under Chapter 15 of the 1989 Canada-United States Free Trade Agreement, are now explicitly conceptualized by the Governments of the United States and Canada as "business persons" who function as "providers of goods and services" subject to the requirements of free trade across borders.

 

Even the language of educational purpose has undergone a sea-shift of transformation into business terminology and the going discourse of the corporate culture -- 'resource units' for what used to be subject disciplines and their professors; educational 'consumers' for what used to be students and learners; 'uniform standards' for what used to be the search for quality, depth and originality; 'program packages' for what used to be curriculum, 'products' for what used to be graduates; 'buying' ideas for what used to be the search for truth. It is difficult to avoid the conclusion that the educational process has been so pervasively subordinated to the aims and practices of business that its agents can no longer comprehend their vocation in any other terms.

 

Educators since Plato have traditionally resisted the reduction of education to the demands of appetite and social conditioning. But what is clearly lacking today from even those more reflective educators who intuit that the integrity of education is being undermined. . .is any clear understanding of the oppositions in principle between them. Without these bearings of analytic understanding, educators are quite unable to comprehend, let alone resist, the. . . assimilation of the goals and methods of education to those of commodity production and sale (pp. 210-211; see McMurtry (1991) for an analysis of the contrary principles of education and the commodity market).


10. This validity coefficient, which was based on coefficients computed by primary researchers, rose to 6.8% when a primary study with a negative validity coefficient was excluded.

11. Cohen's (1981) results were not used to calculate this range because they were influenced by primary studies in which students were not randomly assigned to course sections and student ability controls were not used, thus yielding dubious evidence of validity (Abrami, Leventhal, & Perry, 1982, p. 459). Inclusion of Cohen's global validity coefficients would result in an upper limit of 22% and a slightly smaller standard error of estimate.

12. Of course, large systematic sources of ratings variance such as perceived instructor personality (40-70% of rating variance) essentially drive student ratings when perceptions of instructor personality are not controlled for.

13. In this context the confidence interval is the range of rating values that contains an instructor's true teaching effectiveness score in 95 out of 100 samples (i.e., assessments). In the research cited on page 12 by Dowell & Neal, the confidence interval very nearly contains all possible rating values. Thus, virtually any rating score could be an instructor's true score.

14. Dowell & Neal (1983) note that similar variability characterizes mean validity estimates, an observation echoed by McCallum (1984, p. 151) and Abrami, d'Apollonia & Cohen (1990, e.g., pp. 222-223).

 
The data aggregation technique applied in both reviews uses a theoretically derived standard error of estimate for the weighted mean validity coefficient. This theoretical standard error is dependent only upon the sample size and assumes that samples are from a single or equivalent population, are randomly constituted, and are independent. Since validity coefficients are estimates of a population parameter (i.e., the "true" coefficient), the standard deviation of the distribution of obtained coefficients should approximate the theoretically derived standard error of estimate if the assumptions of the method are met by the data.

 

Using a theoretically derived standard error of estimate, our critic reported a 95 percent confidence interval for the true validity coefficient. This interval was from 0.31 to 0.45. If the assumptions are met by the data, we would expect to find 95 percent or about thirty one of the obtained validity coefficients to fall within this interval. In fact, only eight of the thirty-three coefficients fell within this interval. The substantial discrepancy between the theoretically expected and obtained distributions strongly suggest that the assumptions of the method are not met by the data. This criticism applies equally to the method used in our review and that used by our critic and indicates that conclusions based upon the mean, aggregated coefficient may be misleading. In our review, we emphasized this variability and suggested that it may be derived from local characteristics of individual validity studies. Whatever the source of the variability, it underscores the importance of local instrument validation.

 

In choosing to emphasize the statistical validity of ratings, our critic has glossed over what might be the most important characteristic of the validity of ratings: it is an unstable quantity... (Dowell & Neal, 1983, pp. 461-462).
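
The arithmetic behind the passage quoted above can be checked with a short sketch. It takes the quoted reasoning at face value (each of the 33 coefficients is treated as having a 95 percent chance of falling inside the interval) and adds one assumption that is not in the original: the coefficients are modelled as independent binomial trials. The function names are illustrative only.

```python
from math import comb

# Figures taken from the Dowell & Neal (1983) quotation above.
N_COEFFICIENTS = 33    # validity coefficients in the review
COVERAGE = 0.95        # nominal coverage of the reported interval
OBSERVED_INSIDE = 8    # coefficients that actually fell inside 0.31-0.45


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_most(k, n, p):
    """Probability of k or fewer successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


# Expected count inside the interval: 0.95 * 33, "about thirty one".
expected_inside = COVERAGE * N_COEFFICIENTS

# Assumption (not in the source): independent binomial trials.
# How likely is it to see eight or fewer inside if the method's
# assumptions really held?
tail = prob_at_most(OBSERVED_INSIDE, N_COEFFICIENTS, COVERAGE)

print(f"expected inside the interval: {expected_inside:.2f}")
print(f"observed inside the interval: {OBSERVED_INSIDE}")
print(f"P(8 or fewer inside | assumptions hold) = {tail:.3g}")
```

Under these assumptions the probability of so few coefficients landing inside the interval is vanishingly small, which is the sense in which the quoted authors say the method's assumptions "are not met by the data."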


15. Some social psychological studies also shed light on the relationship of perceived instructor personality to student ratings. For example, in a study of the effects of instructor "warmth" on student perceptions of instructor personality and teaching ability, Widmeyer & Loy (1988) informed a 270-student physical education class that their regular professor would be temporarily unavailable. In his place would be a guest lecturer, "Dr. Jim Wilson," who would deliver a lecture on the compatibility of sport and education (Dr. Wilson was actually senior author W.N. Widmeyer). Students were then given a written biographical sketch of Dr. Wilson. For half of the students, the biographical sketch disclosed that Dr. Wilson was considered to be "a rather warm person - industrious, critical, practical, and determined" (the "warm" condition). The remaining students received the same sketch with the exception that Dr. Wilson was characterized as a rather cold person (the "cold" condition). All students were informed that their impressions of Dr. Wilson would be sought after the lecture. Dr. Wilson was then ushered into class where he delivered a 40-minute "neutral and informative lecture" to all students. He then departed the lecture theatre. A teaching assistant distributed evaluation materials and instructed students to assess Dr. Wilson's personality and teaching ability.

Statistical analyses yielded significant differences (p < .001) between assessments performed in the warm and cold conditions. With respect to personality, students who were told that Dr. Wilson was cold assessed him as less pleasant, less sociable, less good-natured and less humorous than students in the warm condition. With respect to teaching ability, students in the cold condition assessed Dr. Wilson as less knowledgeable, less considerate, less interesting and less intelligent than students in the warm condition. Students in the cold condition were also less likely than their counterparts to surmise that Dr. Wilson would "go far" in his teaching career. These results are consistent with early findings of person perception researchers (e.g., Asch, 1946; Kelly, 1950).

Widmeyer & Loy (1988) concluded that student assessments of teaching abilities

 
were related to warmth, whereas certain personality characteristics were not. By being perceived as a warm individual, a teacher can influence students' ratings not only of his or her personality, but also of his or her teaching abilities. This has considerable implications within education. In view of the major role that teacher evaluations play in universities in regard to tenure and promotion, the findings suggest that if instructors want to "get ahead," they should present themselves as "warm" (p. 120).


[In related research, Murray (1978) also found that warmth was significantly correlated (0.47) with student ratings. However, Murray found that warmth was virtually unrelated (-0.01) to student achievement.]

---------------------------------
* An earlier version of this paper appeared in the June 1994 issue of FACULTY MATTERS (No. 5, pages 9-12) and the September, 1994 issue of UPDATE (the newsletter of the Okanagan University College Faculty Association). The author would like to express his gratitude to Dr. Bruce Landon, Dr. Stephen Mainprize, Dr. Ray Koopman, Dr. Bruce Alexander, Dr. John McMurtry, Mr. Ross Powell, Mr. Bill Main, Ms. Jean Cockburn, Ms. Pam Burry, and Ms. Roslyn Dixon for their helpful and encouraging comments on earlier drafts of this paper. The author can be contacted in the Social Sciences Department, Douglas College, P.O. Box 2503, New Westminster, British Columbia, Canada V3L 5B2 (Telephone (604) 527-5312).
----------------------------------

December 27, 1994

Revised June 17, 1995
Revised December 14, 1995
Revised July 21, 1996
Revised October 8, 1996

Word Count

Main Text: 3400
Total: 12,911
------------

REFERENCES

Abrami, P.C., d'Apollonia, S., & Cohen, P.A. (1990). Validity of student ratings of instruction: What we know and what we do not. Journal of Educational Psychology, 82, 219-231.

Abrami, P.C., Leventhal, L., & Perry, R.P. (1982). Educational seduction. Review of Educational Research, 52, 446-464.

Abrami, P.C., Perry, R.P. & Leventhal, L. (1982). The relationship between student personality characteristics, teacher ratings, and student achievement. Journal of Educational Psychology, 74, 111-125.

Abrami, P.C. & Mizener, D.A. (1985). Student/instructor attitude similarity, student ratings, and course performance. Journal of Educational Psychology, 77, 693-702.

Aleamoni, L. (1989). Typical faculty concerns about evaluation of teaching. In L.M. Aleamoni (Ed.), Techniques for evaluating and improving instruction. Jossey-Bass, Inc. San Francisco.

Asch, S.E. (1946). Forming impressions of personality. Journal of Abnormal and Social Psychology, 41, 258-290.

Basow, S.A., & Silberg, N.T. (1987). Student evaluations of college professors: Are female and male professors rated differently? Journal of Educational Psychology, 79, 308-314.

Berry, E. (1989). Taking women professors seriously. Paper presented at the annual meeting of the American Psychological Association, San Francisco, April 1989.

Brewer, M.B., & Crano, W.D. (1994). Social Psychology. St. Paul, MN: West Publishing Co.

Cashin, William. (1990). Students Do Rate Different Academic Fields Differently. In Theall, M. & Franklin J. (Eds), Student Ratings Of Instruction: Issues For Improving Practice. Jossey-Bass, Inc. San Francisco, 1990.

Chandler, T.A. (1978). The questionable status of student evaluations of teaching. Teaching of Psychology, 5, 150-152.

Cohen, P.A. (1981). Student ratings of instruction and student achievement: A meta analysis of multisection validity studies. Review of Educational Research, 51, 281-309.

Cohen, P.A. (1983). Comment on "a selective review of the validity of student ratings of teaching." Journal of Higher Education, 54, 448-458.

Dowell, D.A. & Neal, J.A. (1982). A selective review of the validity of student ratings of teaching. Journal of Higher Education, 53, 51-62.

Dowell, D.A., & Neal, J.A. (1983). The validity and accuracy of student ratings of instruction: A reply to Peter A. Cohen. Journal of Higher Education, 54, 459-463.

Erikson, S.C. (1983). Private measures of good teaching. Teaching of Psychology, 10, 133-136.

Feldman, K.A. (1986). The perceived instructional effectiveness of college teachers as related to their personality and attitudinal characteristics: A review and synthesis. Research in Higher Education, 24, 139-213.

Ferber, M.A., & Huber, J.A. (1975). Sex of student and instructor: A study of student bias. American Journal of Sociology, 80, 949-963.

Ferguson, G. (1981). Statistical analysis in psychology and education. (5th ed.). New York: McGraw-Hill.

Franklin, J., & Theall, M. (1990). Communicating student ratings to decision makers: Design for good practice. In Theall, M. & Franklin J. (Eds), Student Ratings of Instruction: Issues For Improving Practice, Jossey-Bass, Inc. San Francisco.

Goodwin, L.D., & Stevens, E.A. (1993). The influence of gender on university faculty member's perceptions of "good" teaching. Journal of Higher Education, 64, 166-185.

Howell, D.C. (1992). Statistical methods for psychology (2nd ed.). Belmont, Cal: Duxbury Press.

Kaschak, E. (1978). Sex bias in student evaluations of college instructors. Psychology of Women Quarterly, 2, 235-43.

Kelly, H.H. (1950). The warm-cold variable in first impressions of persons. Journal of Personality, 18, 431-439.

Leventhal, L., Abrami, P.C., & Perry, R.P. (1976). Do teacher rating forms reveal as much about students as about teachers? Journal of Educational Psychology, 68, 441-445.

Lombardo, J. & Tocci, M.E. (1979). Attribution of positive and negative characteristics of instructors as a function of attractiveness and sex of instructor and sex of subject. Perceptual and Motor Skills, 48, 491-494.

McCallum, L.W. (1984). A meta-analysis of course evaluation data and its use in the tenure decision. Research in Higher Education, 21, 150-158.

McKeachie, W. (1987). Can evaluating instruction improve teaching? In L.M. Aleamoni (Ed.), Techniques for evaluating and improving instruction. San Francisco: Jossey-Bass, Inc.

McMurtry, J. (1991). Education and the market model. Journal of Philosophy of Education, 25, 209-217.

McMurtry, J. (1992). Evaluating teaching by evaluating learning. Unpublished manuscript. University of Guelph, Ontario, Canada.

Murray, H.A. (1975). Predicting student ratings of college teaching from peer ratings of personality type. Teaching of Psychology, 2, 66-70.

Murray, H.A. (1978). Teacher ratings, student achievement, and teacher personality traits. Paper read at the annual meeting of the Canadian Psychological Association.

Murray, M.J., Rushton, J.P. & Paunonen, S.V. (1990). Teacher personality and student instructional ratings in six types of university courses. Journal of Educational Psychology, 82, 250-261.

Palmer, J.G., Carliner, J. & Romer, T. (1978). Leniency, learning, and evaluations. Journal of Educational Psychology, 70, 855-863.

Rosenfeld, P. (1987). Instructor's Manual to Accompany Scarr and Vander Zadens' Understanding Psychology (5th ed.), New York: Random House.

Roueche, J.E. (1981). Transfer and attrition points of view: Don't close the door. Community and Junior College Journal. December/January.

Roueche, S.D., & Roueche, J.E. (1993). Making good on the promise: The view from between a rock and a hard place. American Association of Community Colleges Journal. April/May.

Rushton, J.P., Murray, H.G., & Paunonen, S.V. (1983). Personality, research creativity, and teaching effectiveness in university professors. Scientometrics, 5, 93-116.

Sherman, B.R. & Blackburn, R.T. (1975). Personal characteristics and teaching effectiveness of college faculty. Journal of Educational Psychology, 67, 124-131.

Small, A.C., Hollenbeck, A.R., & Haley, L. (1982). The effect of emotional state on student ratings of instruction. Teaching of Psychology, 9, 205-211.

Sullivan, A.M. & Skanes, G.R. (1974). Validity of student evaluation of teaching and the characteristics of successful instructors. Journal of Educational Psychology, 66, 584-590.

Tomasco, A.T. (1980). Student perceptions of instructional and personality characteristics of faculty: A canonical analysis. Teaching of Psychology, 7, 79-82.

Thompson, G.E. (1988). Difficulties in interpreting course evaluations: Some Bayesian insights. Research in Higher Education, 28, 217-222.

Ware, J.E., & Williams, R.G. (1975). The Dr. Fox effect: A study of lecturer effectiveness and ratings of instruction. Journal of Medical Education, 50, 148-156.

Ware, J.E., & Williams, R.G. (1977). An extended visit with Dr. Fox: Validity of student ratings of instruction after repeated exposure to a lecture. American Educational Research Journal, 14, 449-457.

Ware, J.E., & Williams, R.G. (1980). A reanalysis of the Dr. Fox experiments. Instructional Evaluation, 4, 15-18.

Widmeyer, W.N., & Loy, J.W. (1988). When you're hot you're hot! Warm-cold effects in first impressions of persons and teaching effectiveness. Journal of Educational Psychology, 80, 118-121.

John C Damron, PhD
DOUGLAS COLLEGE
P.O. Box 2503
New Westminster, British Columbia
Canada V3L 5B2 FAX: 527-5095
e-mail: john_damron@mindlink.bc.ca