Smart Memories

Trends in VLSI technology scaling demand that future computing devices be narrowly focused to achieve high performance and high efficiency, yet also target the high volumes and low costs of widely applicable general-purpose designs. To address these conflicting requirements, we propose a modular re-configurable architecture called Smart Memories, targeted at computing needs in the 0.1 µm technology generation. A Smart Memories chip is made up of many processing tiles, each containing local memory, local interconnect, and a processor core. For efficient computation under a wide class of possible applications, the memories, the wires, and the computational model can all be altered to match the applications. To show the applicability of this design, two very different machines at opposite ends of the architectural spectrum, the Imagine stream processor and the Hydra speculative multiprocessor, are mapped onto the Smart Memories computing substrate. Simulations of the mappings show that the Smart Memories architecture can successfully map these architectures with only modest performance degradation.

The continued scaling of integrated circuit fabrication technology will dramatically affect the architecture of future computing systems. Scaling will make computation cheaper, smaller, and lower power, thus enabling more sophisticated computation in a growing number of embedded applications. However, the scaling of process technologies makes the construction of custom solutions increasingly difficult due to the increasing complexity of the desired devices. While designer productivity has improved over time, and technologies like system-on-a-chip help to manage complexity, each generation of complex machines is more expensive to design than the previous one. High non-recurring fabrication costs (e.g. mask generation) and long chip manufacturing delays mean that designs must be all the more carefully validated, further increasing the design costs. Thus, these large complex chips are only cost-effective if they can be sold in large volumes. This need for a large market runs counter to the drive for efficient, narrowly focused, custom hardware solutions.

At the highest level, a Smart Memories chip is a modular computer. It contains an array of processor tiles and on-die DRAM memories connected by a packet-based, dynamically routed network (Figure 1). The network also connects to high-speed links on the pins of the chip to allow for the construction of multi-chip systems. Most of the initial hardware design work in the Smart Memories project has been on the processor tile design and evaluation, so this paper focuses on these aspects.
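The chip-level organization described above can be summarized in a small structural model: a grid of processor and DRAM tiles joined by a dynamically routed on-chip mesh. This is a minimal illustrative sketch; the class names, the grid layout, and the dimension-ordered routing metric are assumptions for exposition, not the actual Smart Memories network design.

```python
# Illustrative model of a Smart Memories-style chip: processor and DRAM
# tiles on a packet-routed mesh. All names/layout choices are assumptions.
from dataclasses import dataclass

@dataclass
class Tile:
    kind: str        # "proc" or "dram"
    x: int
    y: int

def build_chip(rows, cols, dram_cols):
    """Lay out a rows x cols grid; the last dram_cols columns hold DRAM tiles."""
    return [Tile("dram" if x >= cols - dram_cols else "proc", x, y)
            for y in range(rows) for x in range(cols)]

def mesh_hops(a, b):
    """Hop count for a dimension-ordered route on the on-chip mesh."""
    return abs(a.x - b.x) + abs(a.y - b.y)

chip = build_chip(rows=8, cols=8, dram_cols=2)
procs = [t for t in chip if t.kind == "proc"]
print(len(procs))                    # 48 processor tiles in this layout
print(mesh_hops(chip[0], chip[-1]))  # 14 hops corner to corner
```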

The organization of a processor tile is a compromise between VLSI wire constraints and computational efficiency. Our initial goal was to make each processor tile small enough that the delay of a repeated wire around the semi-perimeter of the tile would be less than a clock cycle. This leads to a tile edge of around 2.5 mm in a 0.1 µm technology. A tile of this size can contain a processor equivalent to a MIPS R5000, a 64-bit, 2-issue, in-order machine with 64KB of on-die cache. Alternatively, this area can contain 2-4MB of embedded DRAM, depending on the assumed cell size. A 400 mm² die would then hold about 64 processor tiles, or a smaller number of processor tiles plus some DRAM tiles. Since large-scale computations may require more computation power than what is contained in a single processing tile, we cluster four processor tiles together into a “quad” and provide a low-overhead, intra-quad, interconnection network. Grouping the tiles into quads also makes the global interconnection network more efficient by reducing the number of global network interfaces and thus the number of hops between processors. Our goal in the tile design is to create a set of components that will span as wide an application set as possible. In current architectures, computational elements are somewhat standardized; today, most processors have multiple segmented functional units to increase efficiency when working on limited precision numbers. Since much work has already been done on optimizing the mix of functional units for a wide application class, our effort focuses on creating the flexibility needed to efficiently support different computational models: a flexible memory system, flexible interconnection between the processing node and the memory, and flexible instruction decode.
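The capacity figures above follow from simple area arithmetic, which can be checked directly. The constant names and the one-line area model below are illustrative assumptions; only the numbers (2.5 mm tile edge, 400 mm² die, 4 tiles per quad) come from the text.

```python
# Back-of-the-envelope die capacity from the figures in the text.
TILE_EDGE_MM = 2.5      # tile edge in a 0.1 um process
DIE_AREA_MM2 = 400.0    # assumed die size
TILES_PER_QUAD = 4      # processor tiles grouped into a "quad"

tile_area = TILE_EDGE_MM ** 2                    # 6.25 mm^2 per tile
tiles_per_die = int(DIE_AREA_MM2 // tile_area)   # about 64 tiles
quads_per_die = tiles_per_die // TILES_PER_QUAD  # about 16 quads

print(tiles_per_die, quads_per_die)  # 64 16
```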

Continued technology scaling causes a dilemma -- while computation gets cheaper, the design of computing devices becomes more expensive, so new computing devices must have large markets to be successful. Smart Memories addresses this issue by extending the notion of a program. In conventional computing systems the memories and the interconnect between the processors and memories are fixed, and what the programmer modifies is the code that runs on the processor. While this model is completely general, for many applications it is not very efficient. In Smart Memories, the user can program the wires and the memory, as well as the processors. This allows the user to configure the computing substrate to better match the structure of the applications, which greatly increases the efficiency of the resulting solution.
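The idea of "programming the memory" can be sketched as retargeting a tile's memory mats per application: the same storage serves as cache for one computational model and as a software-managed buffer for another. The class, mode names, and mat count below are hypothetical and do not reflect the actual Smart Memories configuration interface.

```python
# Sketch of per-application memory configuration (names are assumptions).
from enum import Enum

class MatMode(Enum):
    CACHE = "cache"              # tags + data, conventional cache behavior
    SCRATCHPAD = "scratchpad"    # software-managed buffer for streaming
    SPECULATIVE = "speculative"  # per-line state supporting speculation

class Tile:
    def __init__(self, num_mats=16):
        self.mats = [MatMode.CACHE] * num_mats  # default: all-cache tile

    def configure(self, modes):
        """Retarget every memory mat for a new computational model."""
        assert len(modes) == len(self.mats)
        self.mats = list(modes)

# A stream-style setup: mostly scratchpad mats for high-bandwidth access.
tile = Tile()
tile.configure([MatMode.SCRATCHPAD] * 12 + [MatMode.CACHE] * 4)
print(sum(m is MatMode.SCRATCHPAD for m in tile.mats))  # 12
```

The design point this illustrates is that reconfiguration here is coarse-grained: whole memory mats change role, rather than gates being rewired as in an FPGA.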

Our initial tile architecture shows the potential of this approach. Using the same resources normally found in a superscalar processor, we were able to arrange those resources into two very different types of compute engines. One is optimized for stream-based applications, i.e. very regular applications with large amounts of data parallelism; in this machine organization, the tile provided very high memory bandwidth and high computational throughput. The other is optimized for applications with small amounts of parallelism and irregular memory access patterns; here, the programmability of the memory was used to create the specialized memory structures needed to support speculation. However, this flexibility comes at a cost.


The overheads of the coarse-grain configuration that Smart Memories uses, although modest, are not negligible, and as the mapping studies show, a machine optimized for a specific application will always be faster than a general machine configured for that task. Yet the results are promising, since the overheads and the resulting difference in performance are not large. So if an application or set of applications needs more than one computing or memory model, our re-configurable architecture can exceed the efficiency and performance of existing separate solutions.

