Understanding TCE Results

Overview

Understanding student ratings requires an understanding of statistical concepts related to sampling, significance, and precision, as well as an understanding of the characteristics of ratings as a measure of teaching performance. Because student ratings statistics do not have the precision typical of statistics in the sciences, it is always important to interpret them in the context of individual and unit patterns. OIRE offers workshops and consultation on interpreting these statistics and using TCE results appropriately, as well as on other aspects of evaluating teaching.

The primary units of analysis in TCE reports are individual student responses within individual sections. Many reports also show summaries of results for the same questions from sets of similar courses. This section describes the statistics used in the reports and offers suggestions for their interpretation, along with information about general characteristics of student ratings.

OIRE recommends a three-step procedure for reviewing TCE reports:

Step 1: Check the sample

Step 2: Review individual results

Step 3: Review comparison statistics

Checking the Sample

Before using ratings, it is important to know how representative the available data are. Standards for samples depend on how the ratings will be used: they should be most stringent when ratings are used in performance review.

Sample Quality within Sections

For responses in a section to be meaningful for decision-making purposes, they must be representative of the entire class. Information about the sample is printed at the top of each TCE report: number enrolled, number responding, and percent responding. Use Table 1 to decide whether enough students responded for the sample to be meaningful.

The higher the proportion of respondents to those enrolled, the more representative the results. In general, sections with a response rate below 50% should not be used for performance appraisal. The smaller the class, the higher the percentage of responses needed to ensure that the sample is representative. If the non-response rate seems high, there may be a systematic reason for student absence that might bias results. For example, if ratings are administered the day of a review session when attendance is optional, students for whom instruction has been most effective may be excluded.

If only a small fraction of students respond, the responses can only be considered the opinions of those few students – even though it may be tempting to generalize if they are positive.

Table 1: Guidelines for Judging Samples Within Sections
Class size     Recommended response %
5-20           at least 80%, more recommended
20-30          at least 75%, more recommended
30-50          at least 66%, 75% or more recommended
50-100         at least 60%, 75% or more recommended
100 or more    more than 50%, 75% or more recommended
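
The Table 1 check can be sketched in a few lines of Python. This is only an illustration, not an OIRE tool: the function names are hypothetical, and because the class-size ranges in Table 1 share endpoints (20, 30, 50, 100), the sketch assumes that a class size on a shared endpoint falls in the larger-class row.

```python
# Minimum response rates transcribed from Table 1 as (low, high, minimum).
# Assumption: a class size on a shared endpoint (20, 30, 50) is assigned to
# the larger-class row, so each range is treated as low <= size < high.
TABLE_1 = [
    (5, 20, 0.80),
    (20, 30, 0.75),
    (30, 50, 0.66),
    (50, 100, 0.60),
]


def minimum_response_rate(enrolled):
    """Return the Table 1 minimum response rate, or None if no row applies."""
    if enrolled < 5:
        return None  # Table 1 gives no guideline for classes this small
    if enrolled >= 100:
        return 0.50  # "more than 50%": strictness is handled by the caller
    for low, high, minimum in TABLE_1:
        if low <= enrolled < high:
            return minimum
    return None


def sample_is_adequate(enrolled, responding):
    """Return True if the section meets the Table 1 minimum response rate."""
    minimum = minimum_response_rate(enrolled)
    if minimum is None:
        return False
    rate = responding / enrolled
    if enrolled >= 100:
        return rate > minimum  # Table 1 asks for more than 50% here
    return rate >= minimum
```

For example, sample_is_adequate(25, 19) is True (76% of a class of 25), while sample_is_adequate(150, 75) is False because exactly 50% is not "more than 50%." The sketch checks only the minimum; the higher recommended rates in Table 1 still apply.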

While the results from a single administration of a TCE questionnaire, particularly a long questionnaire, can provide useful information, such results apply to the course as one event in time only. Averaged results from comparable courses taken over several evaluations (each with an adequate sample of responses) are more likely to fairly represent teaching ability. A minimum of five courses is recommended. It is also important to ensure that the courses selected are representative. If an instructor’s teaching load is half graduate courses and half undergraduate courses, the sample presented for review should be about half graduate and half undergraduate courses. Most importantly, no single score or set of scores from a single section should be used to judge teaching performance in a performance appraisal.

Sample Quality of Comparison Groups

Questions to ask about comparison groups include:

1) Are the courses in the comparison group reasonably comparable in content, size, and instructional methods?

2) Are there enough courses in the comparison group?

3) Were a substantial number of courses that met the selection criteria for the comparison group not included because their instructors did not participate or because insufficient student response, lack of documented student monitoring, or other errors invalidated the data?

4) How many different instructors taught courses included in the comparison group?

Reviewing the Section Results

Frequencies and Percent of Valid Responses

For each question, the distribution of student responses across the possible response choices is given in frequency of responses per option and percent of valid responses per option. Interpreting the data is largely common sense – how many students "said" what, in terms of the available response options for each question. Usually, students are in fairly good agreement in their ratings and scores cluster around two or three adjacent options.
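
As a rough illustration of how frequencies and percent of valid responses relate, the following Python sketch tallies one question. The option labels "A" through "E" and the use of None for a blank answer are assumptions made for the example, not a description of how TCE data are stored.

```python
from collections import Counter


def response_distribution(responses, options=("A", "B", "C", "D", "E")):
    """Return {option: (frequency, percent of valid responses)}.

    Anything not listed in `options` (for example a blank, shown here as
    None) is treated as invalid and left out of the percent base.
    """
    valid = [r for r in responses if r in options]
    counts = Counter(valid)
    total = len(valid)
    return {
        option: (
            counts[option],
            100.0 * counts[option] / total if total else 0.0,
        )
        for option in options
    }


# Hypothetical section: 22 forms returned, 2 left this question blank.
responses = ["A"] * 12 + ["B"] * 6 + ["C"] * 2 + [None] * 2
distribution = response_distribution(responses)
# distribution["A"] == (12, 60.0): 12 of the 20 valid responses chose "A"
```

Here the percentages are based on the 20 valid responses rather than the 22 forms returned, which is why they sum to 100.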

For positively-stated questions concerning effective teaching, it is desirable for responses to cluster in the first two options, "almost always" and "usually." If a substantial percentage of students respond "sometimes," "rarely," or "almost never," the question points to an area of teaching skill that likely needs attention. Responses should cluster similarly for questions with response scales worded "very useful" to "nearly useless."

For questions with normatively worded response options such as "among the best" to "among the worst," more caution is needed, as the basis for comparison is unknown. For example, if a student has taken only exceptionally well-taught courses, a moderately well-taught course might seem poor by comparison.

Means, Medians, and Standard Deviations

Means and medians are measures of central tendency, showing the "middle" of a set of scores. The standard deviation (SD) is a measure of how variable scores are, i.e. how spread out they are around that "middle." Means and SDs appear on all reports in both section data and comparison data. Medians appear only in comparison data. Means, medians, and standard deviations are in the same units as the original sample.

The mean for a question is the arithmetic average of student responses. For most TCE questions, means can range from 1 to 5. Most questions are reverse scaled: that is, the most positive option, "A," is scored as 5 points. The "Key" on each question tells how individual questions were scored.

The SD gives an approximate measure of agreement or disagreement among raters. Perfect agreement would yield an SD of 0. In a typical class, about two thirds of ratings fall within one rating point above or below the mean and the SD is 1.0 or less. If the SD for a question scaled with 5 points is higher than 1.2, the mean is not a good measure of student response.

High SDs occur when opinion in a class is strongly divided between very high and very low ratings, or when opinion is dispersed across the entire response scale. Because students and teachers vary, it is possible for a teacher to be "among the best" for some and "among the worst" for others. In such cases, the mean does not represent a "typical" student opinion in any meaningful sense. Consultation to explore the source(s) of consistently high SDs is available from OIRE.
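
The following Python sketch shows how a mean and SD might be computed for one reverse-scaled question and how the 1.2 guideline could be applied. It assumes a five-option question scored A = 5 through E = 1 and uses the sample standard deviation; the reports may use a different SD formula, and the function name is hypothetical.

```python
import statistics

# Reverse scaling as described above: the most positive option, "A", scores 5.
SCORES = {"A": 5, "B": 4, "C": 3, "D": 2, "E": 1}

# Guideline from the text: above this SD the mean is not a good summary.
SD_GUIDELINE = 1.2


def summarize_question(responses):
    """Return (mean, sd, mean_is_unreliable) for one question.

    Assumption: sd is the sample standard deviation (n - 1 denominator),
    so at least two valid responses are required.
    """
    scores = [SCORES[r] for r in responses if r in SCORES]
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return mean, sd, sd > SD_GUIDELINE


# A class in general agreement: ratings cluster in two adjacent options.
agree = summarize_question(["A"] * 12 + ["B"] * 8)

# A strongly divided class: half "A", half "E".
divided = summarize_question(["A"] * 10 + ["E"] * 10)
```

In the first class the mean is 4.6 with an SD of about 0.5, so the mean summarizes opinion well. In the divided class the mean is 3.0 with an SD of about 2.05, and no student actually gave a rating near 3, which is the situation the paragraph above describes.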

Confidence Intervals

Most OIRE reports show a 95% confidence interval (CI) in parentheses to the right of the section means and comparison group means. While the SD gives an approximate measure of the amount of disagreement among students, the 95% CI shows the impact of the disagreement on the precision of the mean as a way of summarizing responses.

The 95% CI is similar to the "margin of error," a familiar feature of opinion polls which assigns a value, plus or minus, within which the "true" score occurs once all sources of error and disagreement are taken into account. There is a 95% chance that the true score for a question occurs somewhere in the interval between the two values.
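
The text does not state the formula used for the CI on the reports. As a sketch only, the common normal approximation (mean plus or minus 1.96 standard errors, where the standard error is SD divided by the square root of the number of responses) illustrates how disagreement and sample size affect precision. The function name is hypothetical.

```python
import math


def confidence_interval_95(mean, sd, n):
    """Return (lower, upper) for a normal-approximation 95% CI of a mean.

    Assumption: half-width = 1.96 * sd / sqrt(n). This is a textbook
    approximation, not necessarily the calculation used on TCE reports.
    """
    if n < 2:
        raise ValueError("need at least two responses")
    half_width = 1.96 * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Same mean and SD, different numbers of respondents.
small_class = confidence_interval_95(4.6, 0.5, 25)   # half-width 0.196
large_class = confidence_interval_95(4.6, 0.5, 100)  # half-width 0.098
```

With 25 respondents the interval runs from roughly 4.40 to 4.80; with 100 respondents it narrows to roughly 4.50 to 4.70. A larger SD widens the interval in the same way that a smaller sample does.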

Reviewing the Comparison Statistics

For spring 2000 and subsequent reports, comparison group statistics appear on the final page along with one or more graphics showing how results for the section compare with results for the comparison group. This page is titled “TCE Comparison Report.” For reports issued prior to spring 2000, statistics for the comparison group appear on the Short Report in the column labeled "Comparison Group" (between the section statistics and the columns showing T scores and Percentile Rank Groups (%Rank)).

Descriptive statistics for comparison groups include the number of sections in the comparison group, the grand mean and its 95% CI, and the median of section means for each question. A comparison group mean is the grand mean of a set of section means, not the mean of student responses pooled across the sections. Similarly, the comparison group SD is the standard deviation of the section means, not of the individual student responses. The median is the halfway point: half of all the section means in the comparison group fall above the median and the other half below.
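
The difference between a grand mean of section means and a pooled mean can be seen in a small Python sketch. The data are invented, the function name is hypothetical, and the sample standard deviation is an assumption; the point is only that each section counts once regardless of its size.

```python
import statistics


def comparison_group_stats(sections):
    """Summarize a comparison group from per-section lists of scores.

    Each section contributes one value, its own mean, so large sections
    do not outweigh small ones.
    """
    section_means = [statistics.mean(scores) for scores in sections]
    return {
        "n_sections": len(section_means),
        "grand_mean": statistics.mean(section_means),
        # Assumption: sample SD (n - 1 denominator) of the section means.
        "sd_of_section_means": statistics.stdev(section_means),
        "median_of_section_means": statistics.median(section_means),
    }


# Three hypothetical sections with section means of 4.5, 3.0, and 4.0.
sections = [
    [5, 5, 4, 4],
    [3, 3, 3, 3, 3, 3, 3, 3],
    [4, 4],
]
stats = comparison_group_stats(sections)

# For contrast only: the pooled mean weights every student equally.
pooled_mean = statistics.mean(score for scores in sections for score in scores)
```

In this example the grand mean of section means is about 3.83, while the pooled mean of all 14 responses is about 3.57, because the largest section has the lowest ratings. The median of the section means is 4.0.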

Systematic Variation in Ratings

Although properly administered student ratings are quite dependable, research shows that there are predictable sources of systematic variation and bias which should be considered when comparing scores. To address potential concern about three factors known to cause systematic variation in ratings (disciplinary differences, course level and course size), we have based our comparison groups on these variables. As our database grows, other factors may be taken into account. However, research shows that taken together, all the sources of variation listed typically account for less than 5% of variation in overall instructor ratings.

Factors Likely to Cause Systematic Variation in Ratings

1. Disciplinary Differences

Significant differences between ratings of courses in different disciplines are well documented. For example, courses in the humanities and fine arts tend to be rated more highly than those in physical and applied sciences. For this reason, most sources agree that ratings should not be compared across disciplines. (If cross-disciplinary comparisons of faculty are necessary, faculty standings within their own comparison groups can be compared.) Unless faculty have recommended combining similar subject areas, our reports always restrict comparisons to the subject area defined by the course subject code, e.g., ANTH, MUSI, POL, etc.

2. Course Level

Lower division students tend to give the lowest ratings; graduate students tend to give the highest ratings.

3. Class Size

Small classes (fewer than 20 students) tend to receive the highest ratings, whereas large classes (40-100) tend to receive the lowest ratings. Classes of more than 100 students tend to receive intermediate ratings, which suggests that students may have different criteria for evaluating them.

4. Course Status

Students tend to give electives and courses in their majors slightly higher ratings than courses taken to fulfill a college or general education requirement.

5. Semester or Summer Session

Summer Session ratings, on average, are significantly higher than fall or spring ratings for comparable courses at UA. Thus, unless otherwise noted, comparison groups do not include Summer Session data.

6. Course Content

Differences in ratings are occasionally associated with course content. For example, courses with quantitative content may receive slightly lower ratings than other courses at the same level in the same subject area. Similarly, courses that challenge strongly held beliefs may receive lower ratings from some students.

7. Years of Teaching Experience

Instructors with less than one year of experience tend to receive the poorest ratings. Teachers with between three and twelve years of experience tend to receive the best ratings, while those with more than twelve years tend to receive intermediate ratings.

8. Improper Administration of Questionnaires

Student ratings can be biased by failure to adhere to instructions for administering the questionnaire, such as failure of the instructor to leave the room during administration, failure to preserve student anonymity, administration of the evaluation during finals, and use of prejudicial introductory remarks. (The TCE monitoring system is a strategy to minimize such problems.)

Factors That Have Little Influence on Ratings

1. Scheduling Factors

Time of day and other scheduling factors appear to have little or no influence on ratings. However, systematic differences in who attends classes at particular times could theoretically have some impact on ratings.

2. Students’ Academic Ability

Academic ability, as measured by grade point average, has little relationship to student ratings. Evidently, poor students are just as appreciative of good teaching as good students, while good students are just as critical of poor teaching as less able students. However, when there is great variety in students’ prior learning and abilities in a course, the instructor may end up concentrating on one group of students to the exclusion of others. In such a situation, the actual quality of teaching varies within the class and will probably be reflected in the ratings.

3. Gender

Researchers looking for correlations between ratings and gender have found significant variation, but in both directions. That is, some studies show female faculty receiving higher ratings while others show male faculty receiving higher ratings. In either case, the differences are typically trivial, accounting for less than 2% of the variation in ratings. Female students tend to give slightly higher ratings than male students and some studies have found correlations based on whether student and teacher gender are the same. At UA, female instructors tend to receive higher ratings in most subject areas. If you suspect a systematic pattern of gender bias in ratings for a particular course, please contact OIRE.

4. Perceived Difficulty, Workload, and Expected Grades

The relationship between grades and ratings is complex. The preponderance of research evidence shows a very small positive correlation between ratings and expected grades. There is also some evidence that students will tend to give lower ratings when they expect grades lower than they usually get in other courses.

A meta-analysis (Cohen, 1981) explored the relationship between overall instructor ratings and student achievement as measured by scores on an independently-graded final exam in multiple sections of the same class taught by different instructors. Cohen found that students who received high scores on the final tended to rate their instructors highly (regardless of the instructor), suggesting that successful students tend to credit their instructors for their success.

Centra, J.A. (1975) Colleagues as raters of classroom instruction. Journal of Higher Education, 46: 327-337.

Cohen, P.A. (1981) Student ratings of instruction and achievement: a meta-analysis of multisection validity studies. Review of Educational Research, 51: 281-309.

 
All contents copyright © 2008 Arizona Board of Regents