For-profit standardized testing industry can't be trusted

Published May 19, 2012

The scores for the writing portion of this year's FCAT plummeted so precipitously that the abilities of Florida's student writers aren't even being called into question. The validity of the scoring statistics is. While I don't want to say "I told you so" about the dubiousness of those statistics, I did tell you so: my 2009 book detailed all the ways the numbers produced by the for-profit standardized testing industry cannot be trusted.

Take the stats produced at Pearson scoring centers around the country, where I worked for the better part of 15 years. On the first essay-scoring project I worked on, I had to pass a qualifying exam to stay on the job. When I failed that qualifying exam (twice), I was unceremoniously fired. So were half of the original hundred scorers, who had also failed. Of course, when Pearson realized the next morning that it no longer had enough scorers to complete the project on time, it simply lowered the "passing" grade on the qualifying test and put us flunkies right back on the job.

Yes, those of us considered unable to score student essays 12 hours before were welcomed back into the scoring center with open arms, deemed qualified after all.

Nor was such duplicity an aberration in my experience. For a decade and a half I saw every sort of corporate chicanery and statistical tomfoolery. The test-scoring industry seemed focused on meeting deadlines, completing projects and getting scores put on tests; only after that did anyone seem to give a thought to whether those scores were meaningful.

I regularly saw unqualified people (myself included, apparently) keep their jobs scoring student responses even though they were no good at it, either because the passing grades on the qualifying exams had been dropped so low that anyone could meet them, or because the correct answers to those exams had been handed out before the tests were even taken.

I regularly saw statistics get doctored to make group reliability numbers (agreement between the scorers) look better than they really were, since high reliability stats were needed to convince customers of how standardized the job being done was and how "valid" the work really was. I regularly saw distribution numbers fixed to make score results look however a client might have wanted.

Once I attended a range-finding meeting with other test-scoring experts and English professors from around the country, the bunch of us trying to figure out how to score writing samples for a national test. After our group of experienced test scorers and esteemed writing teachers had hammered out a consensus on the writing rubric and the writing samples we'd been reviewing, we were told we were scoring "wrong." Our scoring, we were told, wasn't matching the predictions of the omniscient psychometricians (statisticians/testing gurus), and we had to match those predictions even though the psychometricians had never actually seen the student responses.

When I read in the New York Times the next year that student writing scores had ended up exactly in the middle of the psychometricians' predictions, I can't say I was surprised: We had made sure they did.

And that's the thing: In my experience, the for-profit test-scoring industry could produce results on demand. There was no statistic that couldn't be doctored, no number that couldn't be fudged, no figure that couldn't be bent to our collective will. Once, when a state Department of Education (it wasn't Florida's) didn't like the distribution of essay scores we'd been producing over the first two weeks of a project, we simply followed its instruction to give more upper-level scores. "More 3's!" became our battle cry on that project, even though randomly giving more 3's was fundamentally unfair to all the students whose essays had been scored differently in the days before.

In the end, I guess I'm saying you probably needn't worry too much about this year's falling FCAT scores, because they're only numbers. If you want a different number next year, just ask; surely Pearson will make more.

Todd Farley is the author of "Making the Grades: My Misadventures in the Standardized Testing Industry." His opinions have been published in the New York Times, Washington Post and Education Week.