Re: your mail

Ronald Van Iwaarden (rrt0136@ibm.net)
Tue, 15 Apr 1997 22:09:34 -0400


In article <199704152207.QAA06959@dopey.verser.frii.com>, you wrote:
>> From: rrt0136@ibm.net (Ronald Van Iwaarden)
>
>> I am curious as to whether a generic
>> optimized version might work better on the Cyrix platform.
>
>Ron and other Cyrix users:
>
>The "486" optimization is essentially a "generic" assembly version.
>I can almost guarantee that the non-assembly code generated right out
>of the C compiler would perform 1/3 to 1/2 less well.

Sounds fine.

>The "P5" version is optimized to make maximum use of the dual
>instruction pipes. This code is so highly honed that it is executing
>about 375 machine instructions in 200 clock cycles.

The Cyrix also has a dual-pipeline structure, so it should be capable of
performing as well as an Intel Pentium. It sounds like the internal
architecture is a bit different, and that is what is causing the problems.
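
As a rough sanity check on the figures quoted above, here is a small Python
sketch (an illustration only, not part of the keycrunching client). It
assumes a dual-pipe CPU can issue at most two instructions per clock; the
names PIPES, ipc and utilization are made up for this example.

```python
# Figures quoted above: about 375 machine instructions in 200 clock cycles.
instructions = 375
cycles = 200

# Assumption: a dual-pipe CPU issues at most two instructions per clock.
PIPES = 2

ipc = instructions / cycles   # instructions per clock
utilization = ipc / PIPES     # fraction of the dual-issue ceiling

print(f"instructions per clock: {ipc}")
print(f"fraction of dual-issue ceiling: {utilization}")
```

That works out to 1.875 instructions per clock, which is close to the
two-per-clock ceiling, so code scheduled this tightly leaves little slack
for a CPU that pairs instructions under different rules.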

>The keycrunching code contains no floating point operations.

That is what I thought.

>> This is terrible performance for the Cyrix which usually
>> performs at the same speed as an Intel P150 unless intense FP is used.
>
>I don't know the architecture of the various Cyrix chips. If they
>have a single instruction queue, you should expect performance no
>better than about half of a Pentium of the same speed.

That would explain exactly what is going on, since my P150+ acts like a P75
or P60.
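
The "about half" estimate can be sketched in Python as below. This is a
back-of-the-envelope model only: it assumes a single-queue CPU retires at
most one instruction per clock, and it treats the P-rating as if it were an
equivalent Pentium clock speed, which is a simplification. The function
name single_issue_fraction is hypothetical.

```python
def single_issue_fraction(instructions, dual_issue_cycles):
    """Best-case speed of a single-issue CPU relative to the dual-issue run.

    Assumption: the single-issue CPU needs at least one clock per instruction.
    """
    single_issue_cycles = instructions
    return dual_issue_cycles / single_issue_cycles

# Figures quoted earlier in the thread: 375 instructions in 200 clocks.
fraction = single_issue_fraction(375, 200)

# Treating "P150" as a nominal 150 MHz Pentium equivalent (an assumption).
equivalent_rating = 150 * fraction

print(f"relative speed: {fraction:.2f}")
print(f"behaves roughly like a P{equivalent_rating:.0f}")
```

Under those assumptions the result is a little over half speed, which is in
the same neighborhood as the P75 behavior I am seeing.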

>If it is highly subject to "stalls", or if they are especially costly,
>you should expect even less.

The 6x86 has a number of pipeline stages so this should reduce the chance of
stalls...
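
To make the effect of stalls on a dual-pipe design concrete, here is a toy
Python model. It is a deliberately simplified sketch: it is not the
Pentium's actual pairing rules and not the 6x86's behavior. It assumes
in-order issue, one clock per instruction, and that two adjacent
instructions can share a clock only when the second neither reads nor
writes the first one's destination register. The function and register
names are hypothetical.

```python
def count_cycles(program, dual_issue=True):
    """Count clocks for a list of (dest, sources) instructions.

    Toy rule: adjacent instructions pair only if the second one does not
    read or overwrite the destination of the first.
    """
    cycles = 0
    i = 0
    while i < len(program):
        if dual_issue and i + 1 < len(program):
            dest, _ = program[i]
            next_dest, next_sources = program[i + 1]
            if dest not in next_sources and dest != next_dest:
                i += 2          # both instructions issue this clock
                cycles += 1
                continue
        i += 1                  # only one instruction issues this clock
        cycles += 1
    return cycles

# Four independent instructions: every adjacent pair can share a clock.
independent = [("a", ("x",)), ("b", ("y",)), ("c", ("x",)), ("d", ("y",))]

# A dependency chain: each instruction needs the previous result.
dependent = [("a", ("x",)), ("b", ("a",)), ("c", ("b",)), ("d", ("c",))]

for name, prog in (("independent", independent), ("dependent", dependent)):
    print(name, count_cycles(prog), "clocks dual-issue,",
          count_cycles(prog, dual_issue=False), "clocks single-issue")
```

In this model the dependency chain gets no benefit at all from the second
pipe, which is why code scheduled for one CPU's pairing rules can lose most
of its advantage on another.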

Here is some generic info on the 6x86:

Superscalar architecture
    Provides two pipelines to execute multiple instructions in parallel
    for faster processing and higher performance.

Superpipelining
    Increases the number of pipeline stages to avoid execution stalls and
    keep information flowing faster for higher frequency scalability.

Register Renaming
    Provides temporary data storage for instant data availability without
    waiting for the CPU to access the on-chip cache or main system memory.

Data Dependency Removal
    Provides instruction results to both pipelines simultaneously so that
    neither pipeline is stalled.

Multi-Branch Prediction
    Boosts processor performance by predicting with high accuracy the next
    instructions needed.

Speculative Execution
    Allows the pipelines to continuously execute instructions following a
    branch without stalling the pipelines.

Out-of-Order Completion
    Lets the faster instruction exit the pipeline out of order, saving
    processing time without disrupting program flow.

80-bit Floating Point Unit (FPU)
    Provides high performance by speculatively executing FPU and integer
    instructions in parallel.

16-KByte Unified Write-Back Cache
    Stores the most recently used data and instructions for single-cycle,
    on-chip access.

I don't know if you are interested in hand-coding something for the Cyrix,
but I don't think it would be terribly difficult given your work on the
Intel Pentium. Many of the ideas are probably similar; the Cyrix just
needs some slightly different tweaking. You can get more info on the 6x86
at http://www.cyrix.com/process/prodinfo/6x86/6x86.htm, which has
some PDF files on the hardware design.

I know that I, for one, wouldn't mind seeing a 2x or better speed increase
in a future Cyrix client!

--Ron
o Ronald Van Iwaarden | Work to live;
/\ Hope College | Live to bike;
_`\ `_<=== Holland MI 49423 | Bike to work!
__(_)/_(_)___.-._ voice : (616)355-7120 | http://www.cs.hope.edu/~rvaniwaa/