Standardized tests are a great way to do a cross-group comparison of skill sets.
The first time you give them.
After the first time, you are no longer primarily measuring the skill set. You are primarily measuring the test prep. Motivated test takers will always beat the test, and roughly by the second round it becomes easier to learn the test than the material. Insofar as only a few people do test-prep, you might still have a decent test. However, once any reasonable number of folks participate in test-prep, as with all the IT certifications, No Child Left Behind, etc., what you're really measuring is mostly (~50%) how much time the student spent on test-prep.
If you are trying to compare between groups that have identical levels of test-prep, then again, the tests become useful, but only with potentially high sunk costs. So what does the SAT/GRE measure? Mostly, it measures a combination of parent and student motivation/discipline if you're willing to do test prep. If you're not willing to do test-prep, or you opt not to, then it measures IQ pretty well.
Now, what ought colleges to be looking for? Ought they to be chasing folks who are disciplined test-preppers? If they should, then the SAT should be weighted highly. If they shouldn't, then perhaps not, though it is a much better predictor of college success than high school grades. The interesting question to me is what happens when one group reliably uses test prep at levels far above what other groups do. What is the sane thing to do when looking across groups? What is the right thing to do? Are they related?
I think those are the interesting questions that need to be addressed around Murray & Unz. I think their claims need a broader audience, but they're highly non-controversial among the HBD crowd. "Is this news?"
5 comments:
Is it possible that standardized tests just aren't good enough yet? If I'm worried that my important code feature is going to be broken because another developer relies on our incomplete test suite to check his commits, I could give up on testing and try to teach every other developer (and myself) to take a more holistic view of the more subtle implications of every change... or I could just add more subtests to our suite until it adequately covers the feature I care about.
Even if this way the tests grow to the point where you can't always afford the time to evaluate them all, choosing random subsets of a more complete test is probably still better at the individual level (and much better at the group level) than choosing a hand-picked subset.
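The random-subset idea can be sketched in a few lines. This is only an illustration under made-up assumptions: the function names `hand_picked_coverage` and `random_subset_coverage` and the suite of 100 numbered subtests are hypothetical, not taken from any real test framework. It compares how much of the full suite ever gets exercised when every run uses the same fixed subset versus a fresh random subset.

```python
import random


def hand_picked_coverage(all_tests, picked):
    """Fraction of the suite ever exercised when every run uses the same fixed subset."""
    return len(set(picked) & set(all_tests)) / len(all_tests)


def random_subset_coverage(all_tests, k, runs, seed=0):
    """Fraction of the suite exercised at least once across `runs` random subsets of size k."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    tests = list(all_tests)
    seen = set()
    for _ in range(runs):
        # sample without replacement: each run evaluates k distinct subtests
        seen.update(rng.sample(tests, k))
    return len(seen) / len(tests)


if __name__ == "__main__":
    suite = [f"subtest_{i}" for i in range(100)]  # hypothetical suite
    print("fixed subset: ", hand_picked_coverage(suite, suite[:10]))
    print("random subset:", random_subset_coverage(suite, k=10, runs=200))
```

The fixed subset never covers more than the ten subtests that were chosen, so anything outside it can be broken (or gamed) indefinitely. With random subsets each individual run costs the same, but over many runs nearly every subtest gets hit, and nobody knows in advance which ones will be checked.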
Roy
I am speaking of all known one-day, computer-scored tests. Not all possible tests. But realize that the value to the folks who want to game the system is higher than the value to the folks who want a working system. It's almost the same problem as financial regulation in an ideal world.
You should expect most short tests to have that problem most of the time because of asymmetric incentives. The cost of a good test is too high compared to the cost of a decent test that is gameable.
The effectiveness of test prep should theoretically be limited if the test is highly g-loaded. This is borne out by a National Association for College Admission Counseling report which found that gains due to paid coaching on the SAT were about 20 points in math and 10 points in verbal. Your comments would be true enough for an achievement test rather than an intelligence test. But teaching the test is not necessarily bad. It should ensure that some minimum level of material is learned.
There are two ways to get g-loading on a test. The first way is to test something totally novel to the population that you're testing---for instance, Sudoku in the US around 1940 or so would be a grand IQ test of this variety---the ordering of performance will approximate the ordering of g pretty decently. Most of your so-called culture-neutral tests USED to be this kind of test. But because the culture has evolved in their direction, partly for gaming said tests, they're becoming closer to the 2nd kind of test. Which is:
Test something that nearly everyone in the tested population is very familiar with, in a format that is not novel to them. Tests like the SAT are of this format. A test like this is in practice less gameable than the 1st type, because everyone should already be in the region of diminishing returns on gaming that kind of test---how many such tests do kids take these days anyway?
Tests of that form don't have quite as high a correlation with g, but they're still huge correlations, especially for anyone used to 'social sciences levels' of correlation. Even groups with an incentive to inflate their claims (test-prep suppliers), usually can't get more than 25-50 points on average, and given that a lot of people who do the test prep are doing so for a retake, there's the fact that they perceive that they could have done better/had a bad day/etc to consider---some of that improvement might just be regression to the mean.
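The regression-to-the-mean point can be illustrated with a small simulation. This is a sketch under assumptions that are entirely made up for illustration: each score is true ability plus independent day-to-day noise, students retake only if their first score falls below a cutoff, and there is no coaching effect at all. The function name `simulate_retake_gain` and every parameter value are hypothetical.

```python
import random
import statistics


def simulate_retake_gain(n=20000, mean=500.0, ability_sd=100.0,
                         noise_sd=30.0, cutoff=450.0, seed=0):
    """Average (second - first) score among retakers, with zero coaching effect.

    Assumed model: score = stable ability + independent noise on each sitting.
    Assumed retake rule: a student retakes only if the first score is below `cutoff`.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    gains = []
    for _ in range(n):
        ability = rng.gauss(mean, ability_sd)
        first = ability + rng.gauss(0.0, noise_sd)
        if first < cutoff:
            # same ability, fresh noise: nothing was learned between sittings
            second = ability + rng.gauss(0.0, noise_sd)
            gains.append(second - first)
    return statistics.mean(gains)


if __name__ == "__main__":
    print("average retake gain with no coaching:", simulate_retake_gain())
```

Under this model the retakers show a positive average gain even though nothing changed between sittings, simply because people who scored low the first time were disproportionately the ones who had a bad day. Any measured coaching gain among retakers would need that baseline subtracted before crediting the prep.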
Where test prep is tremendously valuable is when the test is a kind you've got zero familiarity with---for instance, not many second sigma kids have much experience with a test that is timed and hard enough that the fact that it is timed is a real limiting factor---for instance, the old ASVAB computation section, where they tell you they don't expect you to finish. Second sigma types and above rarely learn testing triage unless they do academic competitions since their normal experience is they've got power and to spare.
And yes, most of the Flynn effect is on tests of the 1st type---sometimes called 'fluid g' as opposed to 'crystallized g' as in the 2nd type. A moment of reflection would demonstrate that a sudoku puzzle is a much, much worse IQ test now than it would have been if sprung by surprise on kids in the 1940s. Fluid g tests are, well, fluid.
the value to the folks who want to game the system is higher than the value to the folks who want a working system
Obvious as soon as you mention it. This belongs in the OP.