
Do the responses on a teaching evaluation questionnaire match what actually occurred in the evaluated teaching?

Malcolm G Eley and Erica J Stecher
PDC, Monash University
In work reported at HERDSA '94 and '95, student teaching evaluation questionnaires composed of behavioural observation form questions were found to yield better reliability and discriminative power than those composed of more traditional agree/disagree Likert questions. This paper presents the findings from initial work to assess the validity of this behavioural observation form. Representative samples of the teaching of volunteer academics were video-recorded. At the conclusion of each volunteer's teaching, his or her students completed a behavioural observation form questionnaire on that teaching. The video recordings were independently scored for the occurrence of the teaching events that were the focus of the questionnaires. That the observations from the video-recorded samples essentially agreed with the students' questionnaire responses was taken as a measure of the basic validity of the behavioural observation questionnaire form. Also, in a separate validity assessment exercise, questionnaire responses were compared to those from another standard instrument (Marsh's ASEEQ) that was administered in parallel.

Introduction

Whilst student questionnaires should always be seen as but one source of information on the quality of an individual's teaching, they are nonetheless a common source. It is important therefore to ensure that the question formats used in such questionnaires be as methodologically sound as possible. There has been considerable discussion in the literature over at least 30 years (eg. Smith & Kendall, 1963) on the properties of a variety of question forms. The question forms investigated have included the full range from commonly used summated scale types (eg. a descriptive statement with Very good to Very poor) and simple trait rating scales (eg. a label like "cooperativeness" with Very cooperative to Very uncooperative) to quite detailed behaviourally anchored rating scales (eg. description of some performance aspect with a scale on which points are defined by unique and specific example performance descriptions) (eg. Kinicki, Bannister, Hom & DeNisi, 1985).

In recent years the present authors have compared the psychometric properties of what is termed the behavioural observation question form (Borman, 1986; Latham & Wexley, 1977) to those of the more common agree/disagree Likert form (Eley & Stecher, 1994; 1995; 1996; Stecher & Eley, 1994). The behavioural observation form presents a statement descriptive of some teaching or learning event that the student could have observed or experienced, and requires a response that indicates the recalled frequency with which that event occurred in the evaluated teaching. The findings have essentially been that use of the behavioural observation form showed measurable advantages over the agree/disagree form in inter-rater reliability amongst students and in the capability to distinguish amongst levels of teaching quality.

The present paper reports initial work focussed on assessing the validity of the behavioural observation form. For any teaching evaluation questionnaire to be valid would require as a minimum that the rating responses evoked by it correlate positively with the overall quality of the rated teaching. But with the behavioural observation form particularly, this requirement might perhaps be defined more tightly. The model underlying this form is that the responding student reports the recalled frequencies of occurrences of described events. This raises a very fundamental validity question: whether these recalled frequencies correlate with the actual frequencies.

Attempting to assess the match, or otherwise, between teaching evaluation questionnaire responses and the actual teaching that occurred is not a novel exercise. For instance, Murray (1983) had trained observers use a Teacher Behaviors Inventory to rate three-month samples of actual teaching. He found that lecturers whose aggregated questionnaire ratings over a three-year period were consistently high versus medium versus low exhibited systematically and significantly different teaching performances. Teachers who received higher ratings from students did actually teach differently from those who received lower ratings.

In a general sense, studies like Murray's certainly attest to teaching questionnaires being valid indicators of teaching performance. However, the measures used are still indirect. The observational ratings made by Murray were subsequent to the teaching performances that gave rise to the questionnaire responses. The observations and the questionnaire responses were not compared relative to the same sample of teaching. Murray's findings could be interpreted dishearteningly as showing that the poor teaching evidenced by the questionnaires proved to be resistant to change.

In the present work, observational ratings of actual teaching performances were compared to questionnaire responses on the same teaching. Classes sampled from a teacher's teaching in a semester were video-recorded. At the end of each teacher's teaching, the students completed a behavioural observation form questionnaire on that teaching. The relativities between the frequency responses to the questionnaire questions were then compared to the relativities between parallel observations derived from the recorded samples.

Subjects

Four academic staff, in Business Law, Music, Statistics, and Accounting, each volunteered to have their teaching in one of their undergraduate subjects observed. Class sizes were 38, 54, 64, and 223 students respectively.

Procedure

During second semester 1995, representative samplings of the classes taught by each of the volunteers were visited, and the teaching was video-recorded. Each volunteer knew that their classes could at any time be visited, but they were given no forewarning of any particular visit. At the completion of each volunteer's teaching commitment in a subject, the class was administered a 31-question questionnaire on that teaching. The questions were all of behavioural observation form (eg. "The lecturer described what students were expected to learn from each lecture"; respond as to whether true for All or almost all, Most, About half, Only some, or Very few or none of the lectures). At the same time, each volunteer also completed the same questionnaire, but as a self-rating exercise. As part of other work unrelated to the present paper, each volunteer subsequently received a variety of forms of feedback on their teaching, including personal consultation, and each was given copies of their video-recorded teaching samples.

After the semester's end, the 7 to 10 hours of lecturing recorded for each volunteer were viewed and scored by two independent observers, each having considerable experience in social research. To determine the observers' scoring criteria, the 31 questionnaire questions were classified as to whether they could be sensibly responded to based on recorded material alone. Some questions related to events outside class (eg. availability, workload), to personal covert reactions within students (eg. prior knowledge structures, evoked interest levels), or to longer term relativities (eg. overall topic sequences), and thus could not be judged by the observers. However, 12 questions related to the organisation of presented material, the use of illustration and example, attention highlighting, the offering of learning suggestions, presentation pace, clarity and interpretability of explanation, enthusiasm, and responsiveness to questions, all of which could be judged directly from the recorded teaching samples. These 12 were the basis for the observers' scoring.

Three questions were scored by the observers making overall qualitative ratings for each recorded lecture. Four were scored by yes-no effectiveness judgments made on each relevant instance that occurred during a lecture. Five were scored by simple yes-no occurrence judgments made for each five-minute segment during a lecture. Preliminary between-observer scoring comparisons, and discussion of criteria ambiguities, established acceptable initial levels of scorer reliability.

Results and discussion

For each of the four volunteers, one observer's scores for the 12 questions across all scored lectures were correlated with those of the other observer. The resulting inter-rater reliabilities were 0.71, 0.89, 0.93, and 0.98. Such high overall levels of reliability indicate that the observer scores can be taken as reasonable measures of what had actually occurred during each recorded lecture. Single combined observer scores were therefore calculated for each of the 12 scored questions by linearly pooling parallel pairs. These combined scores were those used in subsequent analyses reported here.
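The reliability check and pooling step just described can be sketched roughly as follows. This is only an illustration: the observer scores are hypothetical, the reliability is computed as a plain Pearson correlation, and "linearly pooling" is assumed here to mean a simple average of each parallel pair, which the paper does not spell out.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pool_observers(scores_a, scores_b):
    """Combine parallel observer scores by averaging (an assumed pooling rule)."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


# Hypothetical scores from the two observers for one volunteer.
observer_a = [0.80, 0.55, 0.30, 0.90, 0.65, 0.40]
observer_b = [0.75, 0.60, 0.35, 0.85, 0.70, 0.30]

reliability = pearson(observer_a, observer_b)       # inter-rater reliability
combined = pool_observers(observer_a, observer_b)   # one score per item
print(f"inter-rater r = {reliability:.2f}")
```

Under this reading, the combined list is what would be carried forward into the comparisons with student responses.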

The fundamental validity question of interest in the present analyses is whether the students' questionnaire responses reflected what had actually occurred in the evaluated teaching. The behavioural observation question form requires that students estimate the relative frequencies with which described events were observed or experienced. The students' responses here then should be "recalled estimates" of what the observer scores sampled. For each class there should be a positive relationship between the students' median responses and the combined observer scores on the 12 scored questions. Simple correlations were calculated, and for the four volunteers these were 0.43, 0.62, 0.68, and 0.89. Given that each correlation was calculated on only 12 pairs of measures, and that the observer scores were based on samples rather than all the teaching, these correlations can be taken as supportive. Indeed there was an exact parallel between correlation size and proportion of teaching sampled, with the 0.89 correlation resulting from a 78% sampling of the teaching and the 0.43 resulting from a 38% sampling.

To get a better appreciation of the match between the observer scores and the student responses, the former were transformed to be comparable to the 5-4-3-2-1 points weightings used for the questionnaire responses. Differences were then calculated between the observer score and the median student response for each of the 12 scored questions, for each volunteer. Treating the 48 differences as a single pool, better than 60% had the observer and student measures within the equivalent of a single response point of one another. So in simple absolute terms, the students' responses were close to the observers' scores. However, there was a bias in the distribution of these differences. At the "good" end of the response scale, the differences between observer score and student median were small, for all volunteers. But at the "poor" end, these differences were relatively larger, with observer score minimums consistently lower than student median minimums for all volunteers. This was borne out by a correlation of 0.84 between observer score and difference (defined as observer score minus student median response), calculated over all 48 differences.
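As a rough illustration of this comparison, the sketch below takes student medians per question, rescales observer scores to the five-point weighting, and counts how many differences fall within one response point. The data and question labels are hypothetical. The paper does not state the transformation used, so a linear mapping of an observed proportion onto the 1 to 5 range is assumed purely for illustration.

```python
from statistics import median


def to_five_point(proportion):
    """Map a proportion in [0, 1] linearly onto the 1..5 weighting (assumed)."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return 1.0 + 4.0 * proportion


# Hypothetical combined observer scores (as proportions) for five questions.
observer_proportions = {"q1": 0.95, "q2": 0.80, "q3": 0.60, "q4": 0.30, "q5": 0.05}

# Hypothetical student responses on the 5-4-3-2-1 weighting.
student_responses = {
    "q1": [5, 5, 4, 5],
    "q2": [4, 4, 5, 3],
    "q3": [4, 3, 4, 4],
    "q4": [3, 3, 2, 4],
    "q5": [3, 2, 3, 3],
}

# Difference defined as observer score minus student median, as in the text.
differences = {
    q: to_five_point(observer_proportions[q]) - median(student_responses[q])
    for q in observer_proportions
}

# Share of questions where the two measures are within one response point.
within_one = sum(abs(d) <= 1.0 for d in differences.values()) / len(differences)
print(differences, within_one)
```

In this made-up example the largest negative difference sits at the lowest observer score, which is the same direction of bias as reported above for the "poor" end of the scale.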

The picture that seems to emerge then is that student responses on these behavioural observation form questions do seem to match with what actually occurred in the teaching, that the empirical match found seems to be closer when the sample of teaching observed is proportionately larger, but that student responses might be less accurate when the teaching element evaluated is relatively poorer. This last finding fits with other research. Although in general terms behavioural observation questions have been found to be better discriminators of poor teaching than other more summated question forms (Eley & Stecher, 1996), they seem nonetheless still to be sensitive to general impression biassing or levelling (Murphy, Martin & Garcia, 1982).

As a further check on validity, correlations were also calculated for each volunteer teacher between student response medians and the volunteer's own self-ratings, and between the observer scores and those self-ratings. These correlations were 0.56, 0.65, 0.71, and 0.80 for students versus self, and 0.26, 0.53, 0.73, and 0.79 for observers versus self. Clearly, these generally corroborate the students and observers match discussed above. If behavioural observation form questions do evoke recalled frequencies of occurrences rather than judgments, then teachers, students, and observers should match; they would simply be independent observers of the same events.

A supplementary validity test

A more common test of the validity of an instrument is whether the measures that it generates agree with those generated by other related and established instruments. In a separate exercise this was also done here. Seven lecturers volunteered to have their students concurrently evaluate their teaching using both the 31-question behavioural observation form questionnaire already described and the Australian Students' Evaluation of Educational Quality questionnaire (ASEEQ; eg. Marsh & Roche, 1993). Class sizes ranged from 20 to 85 students.

A comparison of the present questionnaire with ASEEQ revealed 15 questions from each that could reasonably comprise parallel pairs. The present analyses were based on these pairs. A single sample of students was generated by randomly selecting 10 from each of the seven volunteered classes. The purpose in this was to have a sample of student responses that reflected a heterogeneous range of teaching quality rather than some common sample of teaching. The aim in the analyses was to compare the two questionnaires, and not to evaluate any particular example of teaching. This aim is statistically better served by a sample of responses in which there is variability rather than consistency amongst students.
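The sampling step can be sketched as an equal-allocation random draw from each class. The class sizes other than the stated 20 and 85, the student identifiers, and the helper name `sample_per_class` are all hypothetical.

```python
import random


def sample_per_class(students_by_class, per_class, seed=0):
    """Draw the same number of students at random, without replacement, from each class."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    pooled = []
    for class_id in sorted(students_by_class):
        pooled.extend(rng.sample(students_by_class[class_id], per_class))
    return pooled


# Seven hypothetical classes with sizes spanning the reported 20 to 85 range.
class_sizes = [20, 32, 41, 50, 63, 77, 85]
classes = {
    f"class_{i}": [f"c{i}_s{j}" for j in range(size)]
    for i, size in enumerate(class_sizes, start=1)
}

sample = sample_per_class(classes, per_class=10)
print(len(sample))
```

Drawing a fixed 10 per class, rather than sampling in proportion to class size, keeps any one lecturer's teaching from dominating the pooled responses.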

Correlations were calculated using the 70 paired responses relating to each question pair (see Table 1). In general terms there seemed a positive relationship between responses to apparently matching questions. None of the question pairs yielded a negative correlation. The magnitude of each correlation would be constrained by a number of factors. First is the obvious constraint that while the pairs were chosen on similarity, none were exact duplicates. It is thus possible that the students did not treat them exactly equivalently when responding. Second, the degree to which the teaching reflected in the responses was of similar rather than varied quality would apply a mathematical constraint on the value that any of the correlations might potentially show. Correlations between variables in which the range of variability is truncated tend to be attenuated. Third, the two questionnaires are of different form. The present questionnaire asks students to recall the estimated frequencies of occurrence of described events. ASEEQ asks students to make agree/disagree judgments of described teaching attributes. As already noted in this paper, there are psychometric differences between behavioural observation questions and such agree/disagree forms. The interpretation taken then is that the present correlations constitute a conservative test of the present questionnaire, and that it is reasonable to conclude that it samples the same general domain of phenomena as does the established ASEEQ.
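The second constraint, restriction of range, can be illustrated with a small simulation. This is not an analysis of the present data: the population correlation of 0.6, the normal distributions, and the cut that keeps only above-average values are all assumptions chosen to make the effect visible.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)   # fixed seed so the sketch is repeatable
true_r = 0.6             # assumed population correlation
n = 5000

# Simulate two standard-normal measures that correlate at about true_r.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [true_r * xi + sqrt(1.0 - true_r ** 2) * rng.gauss(0.0, 1.0) for xi in x]

full_r = pearson(x, y)

# Truncate the range: keep only cases in the upper half of x.
kept = [(xi, yi) for xi, yi in zip(x, y) if xi > 0.0]
restricted_r = pearson([xi for xi, _ in kept], [yi for _, yi in kept])

print(f"full range r = {full_r:.2f}, restricted range r = {restricted_r:.2f}")
```

The correlation in the truncated subsample comes out noticeably lower than in the full sample, even though the underlying relationship between the two measures is unchanged.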

Table 1: Correlations based on paired questions from the behavioural observation questionnaire and the ASEEQ questionnaire.

.19      .21      .26*     .31**    .32**
.32**    .34**    .36**    .36**    .44***
.44***   .54***   .57***   .60***   .65***

* two-tailed significance p<.05; ** p<.01; *** p<.001
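The significance flags in Table 1 can be checked approximately against the sample size of 70 paired responses. The sketch below uses Fisher's z transformation with a normal approximation, which is an assumption made here for convenience; the paper does not say which test produced the reported levels.

```python
from math import atanh, sqrt
from statistics import NormalDist


def two_tailed_p(r, n):
    """Approximate two-tailed p-value for a zero-correlation null via Fisher's z."""
    z = atanh(r) * sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def stars(p):
    """Translate a p-value into the asterisk convention used in Table 1."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# The 15 correlations from Table 1, each based on 70 paired responses.
table_1 = [.19, .21, .26, .31, .32, .32, .34, .36, .36, .44, .44, .54, .57, .60, .65]
n_pairs = 70

flags = [stars(two_tailed_p(r, n_pairs)) for r in table_1]
for r, flag in zip(table_1, flags):
    print(f"{r:.2f}{flag}")
```

With 70 pairs, this approximation gives the same asterisk levels as those printed in the table.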

References

Borman, W.C. (1986). Behavior-based rating scales. In R.A. Beck (Ed.), Performance assessment: Methods and applications. Baltimore: Johns Hopkins University Press. Pp. 100-120.

Eley, M.G. & Stecher, E.J. (1994). Comparison of an observationally- versus an attitudinally-based response scale in teaching evaluation questionnaires: II. Variation across time and teaching quality. Research and Development in Higher Education, 17. Proceedings of the 20th Annual Conference of the Higher Education Research and Development Society of Australasia. Canberra, ACT, 6-10 July. Pp. 196-202.

Eley, M.G. & Stecher, E.J. (1995). The comparative effectiveness of two response scale formats in teaching evaluation questionnaires. Research and Development in Higher Education, 18. Proceedings of the 21st Annual Conference of the Higher Education Research and Development Society of Australasia. Rockhampton, Queensland, 4-8 July. Pp. 278-283.

Eley, M.G. & Stecher, E.J. (1996). A comparison of two response scale formats used in teaching evaluation questionnaires. Paper submitted for publication, April.

Kinicki, A.J., Bannister, B.D., Hom, P.W. & DeNisi, A.S. (1985). Behaviorally anchored rating scales vs. summated rating scales: Psychometric properties and susceptibility to rating bias. Educational and Psychological Measurement, 45, 534-549.

Latham, G.P. & Wexley, K.N. (1977). Behavioral observation scales for performance appraisal purposes. Personnel Psychology, 30, 255-268.

Marsh, H.W. & Roche, L. (1993). The use of students' evaluations and an individually structured intervention to enhance university teaching effectiveness. American Educational Research Journal, 30, 217-251.

Murphy, K.R., Martin, C. & Garcia, M. (1982). Do behavioral observation scales measure observation? Journal of Applied Psychology, 67, 562-567.

Murray, H.G. (1983). Low-inference classroom teaching behaviors and student ratings of college teaching effectiveness. Journal of Educational Psychology, 75, 138-149.

Smith, P.C. & Kendall, L.M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scales. Journal of Applied Psychology, 47, 149-155.

Stecher, E.J. & Eley, M.G. (1994). Comparison of an observationally- versus an attitudinally-based response scale in teaching evaluation questionnaires: I. Variation relative to a common teaching sample. Research and Development in Higher Education, 17. Proceedings of the 20th Annual Conference of the Higher Education Research and Development Society of Australasia. Canberra, ACT, 6-10 July. Pp. 210-217.

Authors: M G Eley, Teaching & Learning Group, PDC, Monash University
E J Stecher, Teaching Evaluation Unit, PDC, Monash University

Please cite as: Eley, M. G. and Stecher, E. J. (1996). Do the responses on a teaching evaluation questionnaire match what actually occurred in the evaluated teaching? Different Approaches: Theory and Practice in Higher Education. Proceedings HERDSA Conference 1996. Perth, Western Australia, 8-12 July. http://www.herdsa.org.au/confs/1996/eley.html


Created 30 Oct 2001. Last revision: 24 May 2002.
© Higher Education Research and Development Society of Australasia Inc