# Cayenne Spices Up Cyrix's 6x86MX

New Design Improves FP, MMX Performance, Adds Parallel FP Instructions



by Linley Gwennap

Addressing weaknesses in the current 6x86MX, Cyrix will deploy a modified CPU design, code-named Cayenne, in 2H98. According to Cyrix VP Robert Maher, who spoke at the recent Microprocessor Forum, the key addition is a dual-issue pipelined FP/MMX unit that should match and in some cases exceed the performance of Pentium II on many FP and multimedia applications. Cyrix also has aggressive plans to move its chips into 0.25-micron technology next year, attempting to match the performance of a 400-MHz Pentium II.

Cayenne's FP/MMX unit will support new MMX-like instructions to handle parallel single-precision floating-point operations. These Cyrix-defined instructions, which the company calls MMXFP, are intended mainly to accelerate 3D geometry and lighting calculations, providing up to a $5\times$ performance increase on some algorithms. AMD (see MPR 10/27/97, p. 19) and Centaur have developed similar but incompatible instructions, and we expect Intel to deploy its own set of incompatible instructions, called MMX2, in early 1999.

In most other respects, Cayenne is identical to the current 6x86MX design, known as the M2 (see MPR 10/28/96, p. 23). As Figure 1 shows, it retains the same dual-pipeline integer core with limited out-of-order execution, a 64K unified on-chip cache, and a large two-level TLB. Cayenne does add two "Appendix H" enhancements: 4M pages and virtual-mode extensions. The large pages make the TLB more efficient when handling large structures (such as the operating system or the frame buffer). The virtual-mode extensions improve performance on legacy DOS programs that make heavy use of virtual 8086 mode.

#### Pipelining Speeds FP Code

The 6x86MX can issue only one FP instruction per cycle and execute at most one every two cycles, because its FPU is not fully pipelined. These restrictions prevent it from matching the performance of Pentium II or even Pentium/MMX on many floating-point benchmarks (see MPR 9/15/97, p. 18).

Cyrix VP of Engineering Robert Maher describes how Cayenne will improve 3D performance.

As Figure 1 shows, Cayenne can issue two FP instructions per cycle into an eight-entry queue. Instructions are dispatched from the queue in order and can be paired only if one of the instructions is FXCH. Because FXCH simply exchanges one FP register with the top of stack, it is handled via register renaming in Cayenne. Floating-point operations such as add or multiply are directed to the FPU itself.
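To see why an exchange can be handled by renaming rather than by moving data, consider the toy model below. It is only a sketch: the article says FXCH is handled via register renaming but gives no implementation details, so the mapping table and the class name `X87RenameMap` are assumptions made for illustration.

```python
class X87RenameMap:
    """Toy model of an x87 register stack with a rename map (hypothetical)."""

    def __init__(self, size=8):
        self.phys = [0.0] * size       # physical register contents
        self.map = list(range(size))   # map[i] = physical register holding ST(i)

    def read(self, i):
        return self.phys[self.map[i]]

    def write(self, i, value):
        self.phys[self.map[i]] = value

    def fxch(self, i):
        # Exchange ST(0) and ST(i) by swapping two map entries.
        # No register contents are copied, so no FPU pass is needed.
        self.map[0], self.map[i] = self.map[i], self.map[0]
```

Because `fxch` touches only the map, the arithmetic unit stays free for the add or multiply that is paired with the exchange.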

The new FP unit is pipelined for most basic operations, including add, subtract, load, store, and single-precision multiply. As Table 1 shows, the exception is double-precision multiply, which has a three-cycle dispatch rate because it requires two passes through the multiplier. Thus, the Cyrix chip should fare well on most PC applications, including 3D games, but will not do as well as Pentium II on some CAD programs and other scientific applications that perform lots of double-precision calculations.

The new FPU compares well with Pentium II's, which has the same throughput on most operations but slightly better latencies. Cayenne has an advantage on the critical single-precision multiply instruction, which is not fully pipelined on Pentium II. The Intel chip, however, has a key benefit in its ability to pair FP loads with FP arithmetic; Cayenne cannot do this.

Other factors make it difficult to compare the two processors. On many FP applications, Pentium II will gain from its greater instruction-reordering capability and better L2-cache bandwidth. For programs with smaller data sets, however, Cyrix's 64K on-chip cache will provide better performance than Pentium II's 32K of cache. For compute-bound applications, Pentium II benefits from its higher clock speed at a given PR level. Benchmark results will be required to evaluate Cyrix's ability to match Pentium II's performance on standard FP code, but the new FPU is clearly a significant improvement over Cyrix's current one.

#### MMX Unit Pairs Instructions

The 6x86MX has a single-issue MMX unit that was created quickly to give Cyrix chips MMX compatibility. Cayenne's MMX unit reflects a higher-performance design that matches many of the capabilities of Pentium II.

Like FP instructions, MMX instructions can be issued at a rate of two per cycle into the FP/MMX queue. The MMX unit has a broad array of resources, including two MMX ALUs, that allows it to execute two instructions per cycle from the queue, provided the pair contains no more than one shift, one multiply, and one load/store instruction. These pairing restrictions are essentially the same as for Pentium/MMX or Pentium II.
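The pairing rule can be expressed as a simple resource check. The sketch below is an assumption-laden illustration: the class names (`"alu"`, `"shift"`, `"multiply"`, `"loadstore"`) and the function `can_pair` are hypothetical, and the model ignores data dependences and any other conditions the real hardware may impose.

```python
from collections import Counter

# Instruction classes limited to one per pair, per the rule stated above.
# The class labels are hypothetical names chosen for this sketch.
LIMITED_CLASSES = ("shift", "multiply", "loadstore")

def can_pair(first, second):
    """Return True if two MMX instruction classes may execute in one cycle."""
    counts = Counter((first, second))
    return all(counts[cls] <= 1 for cls in LIMITED_CLASSES)
```

Under this model, two ALU operations pair freely and a multiply pairs with a load, but two multiplies (or two shifts, or two memory operations) must issue in separate cycles.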

| Operation            | Cayenne Throughput | Cayenne Latency | Pentium II Throughput | Pentium II Latency |
|----------------------|--------------------|-----------------|-----------------------|--------------------|
| FP Add               | 1 cycle            | 4 cycles        | 1 cycle               | 3 cycles           |
| FP Load/Store        | 1 cycle            | 4 cycles        | 1 cycle               | 3 cycles           |
| FP Multiply (SP)     | 1 cycle            | 4 cycles        | 2 cycles              | 5 cycles           |
| FP Multiply (DP)     | 3 cycles           | 6 cycles        | 2 cycles              | 5 cycles           |
| MMX Add              | 1 cycle            | 1 cycle         | 1 cycle               | 1 cycle            |
| MMX Load/Store       | 1 cycle            | 1 cycle         | 1 cycle               | 1 cycle            |
| MMX Multiply         | 1 cycle            | 2 cycles        | 1 cycle               | 2 cycles           |
| Dual FP Add\*        | 1 cycle            | 3 cycles        | 2 cycles              | 4 cycles           |
| Dual FP Multiply\*   | 1 cycle            | 3 cycles        | 4 cycles              | 7 cycles           |
| Dual FP Reciprocal\* | 3 cycles           | 5 cycles        | 36 cycles             | 36 cycles          |
| Dual FP Recip Sqrt\* | 3 cycles           | 5 cycles        | 84 cycles             | 84 cycles          |

Table 1. Cayenne's FP and MMX throughput is similar to that of Pentium II, but latencies are slightly longer in some cases. \*Cayenne uses new MMXFP instructions; Pentium II uses two standard FP instructions to achieve the same result. (Source: vendors)
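As a rough reading aid, the throughput columns for the dual FP rows of Table 1 can be turned into per-clock ratios. This is a back-of-the-envelope sketch: the dictionary simply copies the table values, the function name is hypothetical, and the ratios ignore clock-speed differences, latency, and memory effects.

```python
# Throughput in cycles per operation, copied from Table 1:
# (Cayenne with MMXFP, Pentium II with two standard FP instructions)
DUAL_FP_THROUGHPUT = {
    "Dual FP Add": (1, 2),
    "Dual FP Multiply": (1, 4),
    "Dual FP Reciprocal": (3, 36),
    "Dual FP Recip Sqrt": (3, 84),
}

def per_clock_ratio(op):
    """Pentium II cycles divided by Cayenne cycles for one dual operation."""
    cayenne_cycles, pentium2_cycles = DUAL_FP_THROUGHPUT[op]
    return pentium2_cycles / cayenne_cycles
```

On these figures the per-clock advantage ranges from 2 on dual adds to 28 on dual reciprocal square roots, which is a peak-rate comparison rather than an application-level speedup.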

All MMX instructions are fully pipelined. As Table 1 shows, all execute in a single cycle except for multiplies, which have a two-cycle latency. Like the 6x86MX, Cayenne implements a single-cycle context switch from FP to MMX modes. Although Pentium II has improved the context switch time from Pentium/MMX, it still requires seven cycles. Few applications switch modes frequently, however.

#### MMX Additions Aid 3D Graphics

While the improved FPU will aid all FP software, programmers willing to use Cyrix's proprietary MMXFP instructions will see an even greater boost in performance. These instructions operate on a new data type that crams two single-precision FP values into a 64-bit MMX register, as Figure 2(a) shows. In this way, Cayenne can complete four floating-point operations per cycle using its dual MMX units, delivering 1 GFLOPS of peak performance at 250 MHz.
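The peak figure follows directly from the numbers in the paragraph above: two execution units, two single-precision values per register, and a 250-MHz clock. The helper below is just that arithmetic; the function and parameter names are illustrative.

```python
def peak_flops(units, values_per_register, clock_hz):
    """Peak floating-point operations per second, assuming every unit
    completes one packed operation on every cycle."""
    return units * values_per_register * clock_hz

# Two MMX units x two packed single-precision values x 250 MHz
cayenne_peak = peak_flops(2, 2, 250_000_000)
```

This is an upper bound that assumes both units are busy with packed operations on every cycle.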

Some accommodations are made in this mode. All IEEE 754 denorms and exceptions are processed properly, but instead of supporting the full IEEE standard, only one rounding mode (chop) is supported, and gradual underflows are not supported. Applications that require full compliance must use the standard FP instructions. For 3D geometry, the calculations are tolerant of minor inaccuracies, so these simplifications are not a problem.
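To make "chop" concrete, the sketch below converts a value to single precision by truncating toward zero instead of rounding to nearest. It illustrates the rounding mode only and does not model the missing gradual-underflow support; the function name is hypothetical, and finite values too large for single precision are not handled.

```python
import struct

def chop_to_single(x):
    """Convert a Python float to single precision, truncating toward zero."""
    # struct's "f" format rounds to nearest; grab the resulting bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    nearest = struct.unpack("<f", struct.pack("<I", bits))[0]
    if abs(nearest) > abs(x):
        # Rounding moved away from zero. The sign bit is separate from the
        # magnitude, so subtracting 1 steps one unit toward zero.
        bits -= 1
        nearest = struct.unpack("<f", struct.pack("<I", bits))[0]
    return nearest
```

For example, 0.1 rounds up slightly under round-to-nearest but comes out just below 0.1 when chopped, an error of at most one unit in the last place.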

MMXFP includes several instructions to operate on the paired FP data type, such as add, subtract, multiply, convert, and compare. As Table 1 shows, Cayenne also has low-latency reciprocal and reciprocal-square-root $(1/\sqrt{x})$ instructions. These operations complete in just 5 cycles, versus at least 36 cycles (for two FP divides) on Pentium II. Lighting calculations for 3D images make frequent use of these operations.
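Vector normalization is a typical case where a reciprocal square root helps in lighting code. The sketch below is an illustration in scalar Python, not Cyrix's instruction semantics: `rsqrt` stands in for the hardware operation, and both function names are assumptions.

```python
import math

def rsqrt(x):
    """Stand-in for a hardware reciprocal-square-root operation."""
    return 1.0 / math.sqrt(x)

def normalize(vector):
    """Scale a 3D vector to unit length using one rsqrt and three multiplies."""
    x, y, z = vector
    inv_length = rsqrt(x * x + y * y + z * z)
    return (x * inv_length, y * inv_length, z * inv_length)
```

Written this way, the normalization needs no divide at all, which is why a fast reciprocal square root matters when many surface normals must be processed per frame.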

Figure 1. Cyrix's Cayenne processor is nearly identical to the 6x86MX (M2) except for the new FP/MMX unit and minor changes to the MMU.

Going all out, Cyrix added scatter/gather instructions and a motion-estimation instruction to MMXFP. The scatter/gather instructions, as Figure 2(b) shows, rearrange packed 32-bit data. Since MMX can already load or store the low 32 bits of a register to memory or to any half of an MMX register, Cyrix simply added corresponding instructions to load or store the high 32 bits. These instructions are useful for pairing data points at the vertices of triangles, generating more opportunities for parallel computation. They can also be used for matrix operations.
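The effect of the high-half load and store can be sketched by treating a 64-bit MMX register as an integer. All four helper names are hypothetical, and the assumption that a high-half load leaves the low half untouched is ours; the article does not specify it.

```python
MASK32 = 0xFFFFFFFF

def load_low(word):
    """Load a 32-bit word into the low half; the high half is zeroed."""
    return word & MASK32

def load_high(reg, word):
    """Assumed behavior: replace the high half, keep the low half."""
    return (reg & MASK32) | ((word & MASK32) << 32)

def store_low(reg):
    return reg & MASK32

def store_high(reg):
    return (reg >> 32) & MASK32

# Gather two 32-bit values from unrelated locations into one register.
packed = load_high(load_low(0x11111111), 0x22222222)
```

With both halves individually addressable, two values that are not adjacent in memory can be brought together for a single packed operation and scattered back afterward.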

The motion-estimation instruction is similar to Sun's PDIST instruction. It operates on two sets of packed 8-bit pixels, subtracting them, taking the absolute value of the results, and calculating the sum of the differences. This operation calculates the error term during motion estimation, a key component of many video-compression algorithms. This instruction allows Cayenne to perform video conferencing using real-time MPEG-1 or H.324 compression.
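The operation described above is commonly called a sum of absolute differences. A minimal reference version follows, assuming eight 8-bit pixels per 64-bit register; the function name is illustrative, and the instruction's actual operand format is not given in the article.

```python
def sum_abs_diff(pixels_a, pixels_b):
    """Sum of absolute differences between two equal-length pixel blocks."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("pixel blocks must have the same length")
    return sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))

# Eight 8-bit pixels each, as would fit in one 64-bit register.
reference = bytes([10, 20, 30, 40, 50, 60, 70, 80])
candidate = bytes([12, 18, 30, 45, 50, 50, 70, 81])
error = sum_abs_diff(reference, candidate)
```

A motion-estimation search evaluates this error for many candidate blocks and keeps the smallest, so collapsing the subtract, absolute-value, and accumulate steps into one instruction shortens the inner loop.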

Figure 2. (a) Cyrix's MMXFP combines two single-precision FP operands in one MMX register. (b) The scatter/gather instructions can arbitrarily reorder 32-bit data within the 64-bit MMX registers.

According to Cyrix's Maher, with these new optimizations, a 250-MHz Cayenne processor can perform geometry and lighting calculations at a peak rate of 10 million meshed triangles per second. This rate is at least twice what a 266-MHz Pentium II can deliver today using standard FP instructions. Although Cayenne won't be able to achieve this number in a real system due to software and memory overhead, it could still deliver a significant performance advantage on 3D games compared with Pentium II. The trick will be gaining the required software support for its proprietary extensions (see MPR 10/27/97, p. 35).

All told, the MMXFP instructions consume 15 unused opcodes. If Intel were to use any of these opcodes for a future x86 extension (e.g., MMX2), implementing MMXFP in future Cyrix processors would be difficult, since Cyrix would have to support the new Intel instructions as well. Once Intel brings out MMX2, Cyrix will support the new instructions and may phase out the older MMXFP instructions.

## Price & Availability

The first processors using the Cayenne core are expected in 2Q98; Cyrix did not disclose pricing. For more information on the 6x86MX product line, access the Web at www.cyrix.com/process/prodinfo/prodin-p.htm.

#### Socket 7 and Beyond

Like AMD, Cyrix is working with the remaining non-Intel chip-set vendors to improve the infrastructure surrounding Socket 7. Although Intel doesn't plan to develop any new Socket 7 chip sets, other vendors will continue to add new features to their chip sets. According to Maher, forthcoming Socket 7 products will support AGP, FireWire, SDRAM, and ultimately SDRAM II. These chip sets will also support a 100-MHz bus (see sidebar, MPR 10/27/97, p. 20).

Cyrix is counting on rapid process shrinks to boost the clock speed of its processors and thus their overall performance. By increasing the bus speed as well, the company expects application performance to scale along with the CPU clock speed.



Figure 3. Cyrix's roadmap includes faster 6x86MX parts, followed by parts based on the Cayenne core. Cyrix's next-generation CPU, code-named Jalapeno, is due in 1999. (Source: MDR estimates)

Cyrix is currently shipping a PR233 product (nominally equivalent to a Pentium II-233 on integer code) using a 188-MHz core and a 75-MHz bus. This part is built in IBM's 0.35-micron CMOS-5X process. In 1Q98, a 10% linear shrink will boost performance to PR266 using a 208-MHz core and an 83-MHz bus. As Figure 3 shows, this part will be followed by a PR300 version that uses a 250-MHz core built in CMOS-6S2, a hybrid process with 0.25-micron transistors but wider metal layers.

The first Cayenne parts should appear around 3Q98. Built in CMOS-6X, a true 0.25-micron process with five metal layers, they will measure less than 70 mm<sup>2</sup>, compared with 197 mm<sup>2</sup> for the 0.35-micron 6x86MX. The new features have little impact on die size, raising the transistor count by just 250,000 to 6.8 million total. The MDR Cost Model estimates Cayenne will cost just \$45 to build.

Cayenne will be Cyrix's first design to use IBM's C4 packaging technology, eliminating the pad ring and contributing to the small die size. If National's fab plans (see MPR 8/25/97, p. 1) are on schedule, National should be able to build Cayenne as well.

Maher would not comment on whether Cayenne will use the standard Socket 7 interface but agreed that even the 100-MHz bus will have problems keeping up with the performance of the new core, which is expected to match a 400-MHz Pentium II. Cayenne might add a separate L2 cache bus while maintaining compatibility with Socket 7 chip sets, add an on-chip L2 cache, or move to Slot 1.

Maher reported the next-generation processor design, previously known as the M3 and now called Jalapeno, is progressing well. He expects the faster chip to debut sometime in 1999 but provided no details on its design. If Cayenne adopts a new system interface, we expect Jalapeno will maintain compatibility with Cayenne's interface.

#### Cayenne Takes On Pentium II

Over the next year, the 6x86MX must move from competing with Pentium/MMX to competing with Pentium II. The improvements in the Cayenne core are necessary to address weaknesses in the current design, bringing FP and MMX performance to Pentium II standards and beyond, if the new MMXFP instructions are used. Equally important to Cyrix's ability to compete are the rapid process shrinks, bringing the Cyrix product line to process parity with Intel's 0.25-micron Pentium II by 2H98.

If Cyrix delivers on this plan, it will have a chip that offers nearly the same performance as Pentium II while carrying a 25% lower manufacturing cost, due to its tiny die. This enviable position would allow the company to continue to underprice Intel at the heart of its product line, as it does today, while actually making a reasonable profit. The small die, combined with National's production capacity, could rapidly boost Cyrix's processor output. Several potential pitfalls remain, but adroit footwork could allow Cyrix to give AMD a run for its market share.