David Steele answers questions about the Hillsborough Gates evaluations
As the Hillsborough County School District prepared to deliver its first comprehensive teacher assessments under the Gates-funded Empowering Effective Teachers program, Gradebook interviewed David Steele, Gates project director for the district. In what has become a state and national model, Hillsborough is replacing the old single-source evaluation with one broken down into three components: the principal’s observations, a peer evaluator’s report, and a value-added component that measures student improvement and other data. Teachers already have the “written” assessments that give them up to 60 points. The data-driven portion is worth up to 40 points. They’ve seen their students’ raw scores. But the teachers’ scores, and how they stack up against their coworkers, had yet to be revealed when Steele spoke with reporter Marlene Sokol on Sept. 2.
Q. So this is the moment we’ve all been waiting for, right? You’re getting ready to add performance data in and calculate the scores?
A. That’s right. In fact, we were just on the phone with the University of Wisconsin people this morning. We talk with them every Friday.
Q. You work with the University of Wisconsin?
A. Right. Previous to the time of the grant, we had a bonus program called MAP; it’s the state’s merit award program. But we always did the calculations ourselves. When we got the grant, we now had the means to afford some of the things we would have liked to have afforded. And the way we do the calculations, the value-added, is just much more complicated and inclusive than what we were able to do in-house. We wanted somebody who had done it numerous times. You want it to be as accurate, as fair and as consistent as you can.
Q. And that statistical component is student performance data based on FCAT and on improvement?
A. Right. It’s based on student growth, but compared with the growth of students like them. One of the concerns you get from teachers is, “I have a lot of Level 1 students in my class.” That’s okay because each student’s expected growth is compared to students like them: whether they’re a special education student, or an English language learner, or a highly mobile student who maybe moved schools three times, or some of the other things we take into account. Are they too old for their grade level? Or too young for their grade level? We put all those things in to get the growth, and that’s something we’ve never been able to do before. Before, it was basically where were you on the pre-test and where were you on the post-test.
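To make the idea of comparing a student's growth with that of "students like them" concrete, here is a deliberately simplified Python sketch. It is not the district's or the University of Wisconsin's value-added model, which Steele describes as far more complex. The student records, the grouping keys and the function name `relative_growth` are all hypothetical, and the "expected" growth is assumed to be nothing more than the average gain within each comparison group.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (student, comparison-group key, pre-test, post-test).
# The group key stands in for characteristics such as English-language-learner
# status or age relative to grade level.
students = [
    ("A", ("ELL", "on-age"), 40, 52),
    ("B", ("ELL", "on-age"), 45, 51),
    ("C", ("non-ELL", "on-age"), 70, 78),
    ("D", ("non-ELL", "on-age"), 60, 72),
]

def relative_growth(records):
    """Return each student's gain minus the average gain of their group."""
    gains_by_group = defaultdict(list)
    for _, group, pre, post in records:
        gains_by_group[group].append(post - pre)
    # Assumption: expected growth is the plain group average.
    expected = {group: mean(gains) for group, gains in gains_by_group.items()}
    return {
        name: (post - pre) - expected[group]
        for name, group, pre, post in records
    }

result = relative_growth(students)
for name, value in result.items():
    print(name, value)
```

In this toy version, a student who gains more than similar students gets a positive number regardless of where they started, which is the point Steele makes about classes with many Level 1 students.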
Q. So the science of figuring out how you factor these things in, I guess that is a lot of complex statistics.
A. It’s one of the difficulties. You want to be as transparent as you possibly can be, but you have this line between transparency and accuracy. And unfortunately, if you wanted to take a student score and figure out how that converted to a teacher score, you would need a Ph.D. in mathematics to be able to do it.
Q. So when a teacher tells me he is afraid a student might Christmas-tree the FCAT just to get back at him, it’s not that easy, is it?
A. No, and it’s not just FCAT either. There are many more tests that are in the model. One of the things we do that no other district yet in the country does that I know of is, we include every single teacher. So if you’re the third grade art teacher, your kids have had a third grade art test. In high school we’re kind of ahead of the curve because we’ve always had exams. Since the eighties. That’s why we have many more data points than just FCAT.
Q. So teachers get results in a couple of weeks? Will they get it in email? Paper form?
A. They get what is called a MAP report because of the merit award program. They’ll get their score because it’s 40 percent of their evaluation. Every teacher will get a score between zero and 40, but they will also get a student by student rundown of comparatively how did that student do.
Q. They can really look and say, “I helped Jimmy but I didn’t help the other one”?
A. Yes. We emphasize validating their rosters, almost to death. When they first came back to school, they got their roster that had the pre-test and the post-test information on it so they had the opportunity to look at it and say, “Hey, none of my science kids have their pre-tests showing.”
The report that they’ll get in September, they’ve already seen the scores. We want to make it as easy to read as possible. We’re thinking of doing it maybe the way they do movie ratings. Like, “this is a five-star kid for you” and “this is a two-star kid,” so now they can see the relative gain of each student.
Q. So they’ve already seen the raw numbers, seen the lists of kids in order to correct those glitches. If they are very strong in statistics, they may or may not be able to anticipate the results. Have they already seen the numerical scores from the principal’s evaluation and the peer review?
A. They’ve already seen that and they’ve already seen how relatively they compare to other teachers. We made them almost like a grade distribution that you would see. The score on that part would go from 0 to 60, so we showed them this many people made between a 50 and a 52, this many people made between a 52 and a 54, so they could see where they stood.
It used to be that a teacher could go from 0 to 144. That was our old evaluation system. What we found is, like 3,100 people, it was around that, made a 144 last year. So when it came time to reward them, through the bonus program, there was really no differentiation. In our new evaluation there is this new level that we call “exemplary.” So what that did was, it took those 3,000 people and kind of scattered them out so now we can differentiate. We actually had only two who made 60 points on their written.
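The frequency table Steele describes, with written scores from 0 to 60 counted in two-point bands, can be sketched in a few lines of Python. This is only an illustration: the scores below are invented, the function name `frequency_table` is hypothetical, and the choice to place a boundary score such as 52 in the upper band (52-54) is an assumption, since the interview does not say how ties on a boundary were handled.

```python
from collections import Counter

def frequency_table(scores, bin_width=2, max_score=60):
    """Count scores in fixed-width bands covering 0..max_score."""
    counts = Counter()
    for score in scores:
        # Assumption: a boundary score goes to the upper band, except the
        # maximum score, which is clamped into the last band (58-60).
        start = min(int(score // bin_width) * bin_width, max_score - bin_width)
        counts[start] += 1
    return [
        (start, start + bin_width, counts[start])
        for start in range(0, max_score, bin_width)
    ]

# Hypothetical written-evaluation scores, not district data.
written_scores = [41.5, 50, 51, 52, 53.9, 60]

table = frequency_table(written_scores)
for low, high, count in table:
    if count:
        print(f"{low}-{high}: {count}")
```

A teacher reading such a table can see how many colleagues fall in each band, which is how the district let teachers judge where a score like 40 out of 60 actually stands.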
Q. Those 3,100 are what percent of the teachers?
A. It’s about 30 percent. There’s 11,000 or 12,000 teachers.
But one of the difficulties for teachers is, this score numerically is 40 of the 60 points. So these people, who are used to getting a perfect score, now have to understand that in the new system, what used to be perfect is the same group of people who are between 40 and 60. That was why we published that frequency table for them. Because when you tell somebody that’s always had a 144 that you now have 40 out of 60, the first thing they think is, “you marked me down.” No, we didn’t mark you down. We just have a system that’s designed to capture that top end.
Q. So there’s a higher bar.
A. And it won’t be this year where we actually convert them to levels. One of the important things we decided was, we want at least two years of scores like this before we try to draw the line. We want to make sure we are as accurate as possible. We’re not under any race to do it.
Q. It’s three years of data before pay is affected?
Q. And it’s two years before you divide them into levels?
Q. And I would think this is also a period of time when you’re refining every step of the way. Because one of the things we’re interested in, and you probably are too, is when you look globally at all of this and you look at elementary schools versus high schools, science teachers versus English teachers, are there going to be some uneven areas?
A. That’s what we’re going to look at when we get the scores. I want to see that distribution for each individual group to see how it compares.
Q. So if, for example, everybody was doing wonderfully in second grade but terribly in middle school, you would have to look at maybe refining the instrument?
A. That’s what you’ve got to do. To a certain extent there’s some judgment involved. When groups don’t look exactly alike, then you say, “would we expect them to look exactly alike, or do we think there is a problem?” That’s when we’ll have to delve into that. For example, high school math teachers and high school English teachers. Should there really be much of an evaluation difference between those two groups? My experience as a principal is, probably no.
And we’ll break it down, which I think is what you’re getting at. How did they do compared to other groups on the student growth part, but then how did they do compared to other groups on this written part of the evaluation? We’ll look for differences there. Then the other thing is, how good is the correlation between the written evaluation and the value added? We would not expect it to be perfect because if there was a perfect correlation, you wouldn’t need to use both parts.
In general, the teachers with the highest written evaluation should be about the same as the group with the highest scores.
We’ve already looked at that and found on the four-point scale, about 75 percent of the time, the principal and the peer marked exactly the same thing. But less than one percent of the time were they more than one apart. Which is good. It means the peers are looking at things and the principals are looking at things and they’re consistent.
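The consistency check Steele mentions, how often the principal and the peer gave exactly the same mark and how often they were more than one point apart, amounts to two simple rates. The sketch below shows one way to compute them in Python. The ratings are made up for illustration, the 1-to-4 coding of the four-point scale is an assumption, and `agreement_rates` is a hypothetical helper, not the district's procedure.

```python
def agreement_rates(principal, peer):
    """Return (share of exact matches, share more than one point apart)."""
    if len(principal) != len(peer) or not principal:
        raise ValueError("need two non-empty rating lists of equal length")
    diffs = [abs(a - b) for a, b in zip(principal, peer)]
    total = len(diffs)
    exact = sum(d == 0 for d in diffs) / total
    far_apart = sum(d > 1 for d in diffs) / total
    return exact, far_apart

# Hypothetical paired ratings on an assumed 1-to-4 scale.
principal = [3, 4, 2, 3, 1, 4, 3, 2]
peer = [3, 4, 3, 3, 1, 2, 3, 2]

exact, far_apart = agreement_rates(principal, peer)
print(f"exact agreement: {exact:.1%}")
print(f"more than one apart: {far_apart:.1%}")
```

Exact agreement is a blunt measure because some matches happen by chance, so an analyst might also report a chance-corrected statistic; the interview itself cites only the two raw rates.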
Q. We’re looking forward to comparing this year’s data with those of previous years, to the extent that we can.
A. There’s more of a spread, which we’re trying to do through evaluation. To reward your highest performers, you need to be able to separate all of those who previously were getting the same score. There’s a stratification here that we were missing before that we’re not missing now.
And you’ve got to be able to give meaningful feedback too. That’s one of the things I like about the system. Every time you’re observed now, we’re very strict with our people that you need to sit down with the teacher and you need to have that 30- to 45-minute conference about what went on in the classroom. Actually, part of the teachers’ evaluation is how they reflect on their own teaching.
We realized that, as important as it is, we had never had anything in our system that evaluated someone on how well they are actually processing what happened. Whenever our observers, whether it is the principal or the peer, do that post-conference, one of the first things they ask is, “what would you do differently if you were teaching that lesson?”