Re: On Overclocking - READ THIS!

andrew meggs (insect@antennahead.com)
Tue, 6 May 1997 19:21:39 -0400


>If you are frustrated, chill out.

I'm not "flying off the handle" here. I detected this problem on Friday,
but didn't say anything until now because I hadn't made absolutely,
unequivocally sure of it.

> You have pronounced thousands of machines "Unfit for DESCHALL
>Consumption" based on your experience with a bad PowerPC chip. Have you
>isolated the problem?

Yes. See below.

> Reduced the clock speed?

Yes. It works then. It also works with the fast clock chip in place if the
machine is still cold and hasn't been running for more than a few hours.
This particular machine is one of my own computers, and I can say that it
has never exhibited any ill effects due to overclocking or overheating
until this particular instruction sequence. It has, in fact, been the most
reliable and stable machine in my organization.

> Substituted another chip?

I can't do that because the chip is surface-mounted, but the same code has
been tried on other machines, and it ran correctly there.

>Have you entertained the possibility that your code is buggy?

There are some bugs still in the code, but we're talking about a sequence of
a very few instructions:

r3 = r3 xor r4
r5 = r5 and r6
r7 = r7 and r8

produces a different result in r3 from:

r5 = r5 and r6
r3 = r3 xor r4
r7 = r7 and r8

r3, r4, r5, r6, r7, and r8 are all independent registers. There are no
branch targets within the block.
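
To make the claim concrete, here is a small Python model of the two
orderings. It is only a sketch: the register names follow the ones above,
but the function names and the 32-bit register width are my assumptions for
illustration. Because no instruction reads a register that another one
writes, a correctly functioning processor must produce identical results
for both orderings, which is what the model shows.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit general-purpose registers


def order_a(regs):
    """First ordering: xor, then the two ands."""
    r = dict(regs)  # work on a copy so the caller's values are untouched
    r["r3"] = (r["r3"] ^ r["r4"]) & MASK32
    r["r5"] = (r["r5"] & r["r6"]) & MASK32
    r["r7"] = (r["r7"] & r["r8"]) & MASK32
    return r


def order_b(regs):
    """Second ordering: the first and is moved ahead of the xor."""
    r = dict(regs)
    r["r5"] = (r["r5"] & r["r6"]) & MASK32
    r["r3"] = (r["r3"] ^ r["r4"]) & MASK32
    r["r7"] = (r["r7"] & r["r8"]) & MASK32
    return r


# Arbitrary sample values, chosen only for the demonstration.
sample = {
    "r3": 0xF0F0F0F0, "r4": 0x0FF00FF0,
    "r5": 0xFFFF0000, "r6": 0x00FFFF00,
    "r7": 0x12345678, "r8": 0x0000FFFF,
}

# On correct hardware the two orderings cannot differ.
assert order_a(sample) == order_b(sample)
```

Any difference in r3 between the two orderings on real hardware therefore
cannot come from the program logic itself.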

After isolating the problem in this amount of detail, I could reach no
conclusion other than the one I did -- that this was a timing or synchronization
problem as a result of overclocking. The catch is that every other
diagnostic available said the processor was perfect. I had created the one
instruction sequence that set it off. So I reordered the sequence and it
worked. But how do we know that the new sequence isn't the one that sets
off someone else's overclocked processor? How do we know that an
instruction sequence in the current, "safe" compiler-generated client isn't
the one that sets off yet another person's overclocked CPU?

The section of the code to detect the correct key and report it gets
thoroughly tested before being released, but not to a great extent on your
particular machine once you download and run the client. Your client could
easily appear to be working fine churning away keys because the correct key
was never encountered and that code never ran, or even worse the key was
encountered and that code failed because it was exactly what your
overclocked CPU couldn't handle.

We're talking about a one-in-a-million chance. In most situations, one
off-color pixel in Doom or one crash more or less in Internet Explorer
isn't going to even be noticed. But there's only one key out of
72,057,594,037,927,936 that needs to be processed incorrectly for all of
DESChall to fail. The fact that it happened on this machine proves that
such a thing can occur.
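
For reference, the exact size of the key space follows from the 56-bit DES
key length. A quick check in Python (the variable names are mine):

```python
KEY_BITS = 56              # effective DES key length in bits
keyspace = 2 ** KEY_BITS   # number of distinct keys to search

print(f"{keyspace:,}")     # prints 72,057,594,037,927,936
```

Exactly one of those keys is the right one, so a single mis-executed test
on that one key is enough to lose it.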

> Regarding errors on Intel and other platforms: other chips are
>engineered in different ways by different companies and will encounter
>different effects from overclocking.

Other PowerPC's are engineered in the same way by the same company and made
on the same fab, yet exhibit different effects from overclocking. In the
end, all of those effects for all chips reduce to three words: operations
performed incorrectly. If you go outside any chip's tolerances the chip
will produce the wrong answer in some way on some calculations. What this
incident shows is that those tolerances for one particular sequence of
instructions might be lower than for the rest of the chip, and an
overclocked chip will appear to work for, literally, years until it tries
to execute that one sequence.

How much are we willing to gamble that the machine that gets the winning
keyblock doesn't find its one bad sequence in the code to detect the
correct key? It happened once, so it can happen again. We got lucky because
it happened to one of the DESChall developers running a controlled test.
If it happens again to someone who's just letting a client run and trusting
it to do the right thing, or if it's already happened, we'll never know
until Solnet claims the prize on a keyblock we thought didn't contain the
key.

Bottom line -- I think the client needs more stringent self tests. Correct
code can and will be executed incorrectly on overclocked CPU's even when
they appear to be fine, and until there's a client that can catch this in
every case I think it would be best to only run the client at your rated
clock speed.
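
As a rough illustration of what a more stringent self test could look like,
here is a known-answer test sketched in Python. Everything in it is
hypothetical: toy_mix is a stand-in built from xor and rotate operations,
not DES and not the client's actual key-test routine, and the function
names and repeat count are my own. The idea is only that reference answers
are computed once on a machine trusted to be correct, and the client then
recomputes them repeatedly (ideally while hot and under load) and refuses
to run if any answer differs.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit arithmetic


def toy_mix(x, k):
    """Hypothetical stand-in for the key-test routine (NOT DES)."""
    x &= MASK32
    for _ in range(16):
        x = (x ^ k) & MASK32                    # mix in the key
        x = ((x << 3) | (x >> 29)) & MASK32     # rotate left by 3 bits
    return x


def build_known_answers(inputs):
    """Compute reference answers once, on a machine trusted to be correct."""
    return [(x, k, toy_mix(x, k)) for x, k in inputs]


def self_test(known_answers, repeats=1000):
    """Recompute every case `repeats` times; False on the first mismatch."""
    for _ in range(repeats):
        for x, k, expected in known_answers:
            if toy_mix(x, k) != expected:
                return False
    return True


# Example: two arbitrary test vectors, checked 1000 times each.
known = build_known_answers([(1, 2), (0xDEADBEEF, 0x12345678)])
assert self_test(known)
```

A test like this only helps if it exercises the same native instruction
sequences the client uses to test and report keys, so in a real client it
would have to live in that code path rather than in a separate script.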

____________________________________________________________________________
Andrew Meggs, content provider Antennahead Industries, Inc.
<mailto:insect@antennahead.com> <http://www.antennahead.com>