Feature
Optimize memory-system design for multimedia applications
The convergence of video and communications in inexpensive unified-memory architectures has made DRAM the most important and the highest-performance target in any system.
By David Lautzenheiser and Agha Hussain, Silistix -- EDN, 9/4/2008
DRAM efficiency has become a severe challenge for video-processing-SOC (system-on-chip) designers. This evolution is the result of many factors. Continued advances in process technology have enabled higher integration levels. Dominant consumer-pricing pressures have replaced higher-margin communications infrastructure and high-end computing as the market drivers. In consumer equipment, the adoption of bandwidth-hungry standards, such as HD (high-definition) video, has created enormous bandwidth requirements. At the same time, however, low-cost chip-to-chip bandwidth generally lags behind the on-chip speed increases that process technology and architectural advances enable.
All of these pressures eventually focus on the SOC-to-DRAM interface. With the convergence of video and communications in inexpensive unified-memory architectures, DRAM has become the most important and the highest-performance target in any system.
As SOCs have integrated more functions on a single chip, the additional cost along with the loss of digital-logic performance associated with integrating DRAM on that chip has forced consumer-multimedia-device manufacturers to employ DRAM as one or more separate chips. As this trend continues, it highlights the cost of the DRAM as a major component of overall system cost. To minimize system cost for a given performance level, efficient use of DRAM becomes critical. If, through intelligent SOC design, a system can use slower and therefore lower-cost DRAM or fewer DRAM devices, then that system can significantly increase its performance-to-cost figure of merit. Consequently, in recent years, DRAM-access efficiency and utilization in consumer digital devices have pushed past 50% to more than 80%.
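To make the cost argument concrete, the following minimal sketch computes the peak DRAM bandwidth a system would have to provide to sustain a given amount of useful traffic at a given efficiency. The 3.2-GB/s workload and the function name `required_peak_bandwidth` are illustrative assumptions, not figures or names from any particular design.

```python
def required_peak_bandwidth(sustained_gbps: float, efficiency: float) -> float:
    """Peak DRAM bandwidth needed to sustain `sustained_gbps` of useful traffic.

    `efficiency` is the fraction of peak bandwidth that carries useful data
    (0 < efficiency <= 1). All numbers used below are hypothetical.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return sustained_gbps / efficiency


# Hypothetical workload: 3.2 GB/s of useful multimedia traffic.
need = 3.2
peak_at_50 = required_peak_bandwidth(need, 0.50)
peak_at_80 = required_peak_bandwidth(need, 0.80)
print(f"peak needed at 50% efficiency: {peak_at_50:.1f} GB/s")
print(f"peak needed at 80% efficiency: {peak_at_80:.1f} GB/s")
```

Under these assumed numbers, raising efficiency from 50% to 80% lowers the required peak bandwidth from 6.4 to 4.0 GB/s, the kind of margin that can allow a slower DRAM grade or fewer devices.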
Maximizing DRAM efficiency
You can increase DRAM efficiency, but only by considering a number of complex interactions in the overall system design. Key among these are the communication and information flow among the various system functions that use DRAM, as well as their interactions with the controller logic that manages DRAM operation.
Figure 1 shows how the key processing functions that supply and manipulate DRAM data communicate with the DRAM controller. In traditional SOCs, this communications network is usually a hierarchy of clock-based, processor-oriented buses. In more modern SOC architectures, the interconnect is a separate system for managing traffic, using architectures such as a synchronous-star crossbar, clock-based NOC (network on chip) or a clockless, asynchronous NOC.
High DRAM efficiency is important for multimedia processing, particularly for devices that process HD data streams. A number of factors affect the efficiency of this processing. Though these factors vary from system to system, you need to carefully consider them when developing the architecture of a multimedia system.
The DRAM controller needs maximum visibility of the IP (intellectual-property) core making a request. The controller cannot always infer the needed information from the request: The system must explicitly communicate the information to the controller, allowing the controller to determine the importance of the new request relative to other requests already under way.
In older bus-based systems, differentiation among potential requesters was not part of the request. The controller had to infer it by some other means. For example, in AHB (advanced-high-performance-bus)-based systems, multiple AHBs allow the controller to properly weigh one request against another. These multilayer systems can use various QOS (quality-of-service) algorithms at the DRAM controller to extract the highest efficiency from the DRAM channel and still satisfy the needs of the system.
In more modern protocols, such as OCP (Open Core Protocol) or AXI (Advanced Extensible Interface), various fields in the request packet identify the requester and, potentially, the specific task from the requester, using OCP tags or AXI identifiers. Tags let the on-chip communications network and its targets reorder responses to nonconflicting memory addresses within a single thread while still ensuring that the system respects write ordering. Tagged transactions are useful for advanced CPU architectures. To maximize DRAM efficiency, the interconnect between requester and DRAM controller must convey all possible information about the request.
Modern DRAM controllers have command queues that hold outstanding requests from the multiple initiators a system might contain. To ensure that the controller has maximum flexibility in selecting command sequences and other important information, the controller queue should always contain multiple requests from which to make the best scheduling decision. For the on-chip network, there should be no barriers to delivering all possible requests as soon as possible, again with the maximum information about the requests.
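The benefit of a well-stocked command queue can be illustrated with a small sketch. The article does not name a scheduling policy; the code below assumes a simple row-hit-first rule (serve the oldest request that targets an already-open row, otherwise the oldest request overall) and counts row activations as a rough proxy for lost efficiency. The `Request` fields and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Request:
    arrival: int       # order in which the request reached the queue
    initiator_id: int  # who issued it (for example an OCP tag or AXI ID)
    bank: int
    row: int


def pick_next(queue: List[Request], open_rows: Dict[int, int]) -> Optional[Request]:
    """Assumed policy: oldest row hit first, else oldest request overall."""
    if not queue:
        return None
    hits = [r for r in queue if open_rows.get(r.bank) == r.row]
    candidates = hits if hits else queue
    return min(candidates, key=lambda r: r.arrival)


def service_all(queue: List[Request]) -> Tuple[List[int], int]:
    """Drain the queue; return the service order and the row-activation count."""
    pending = list(queue)
    open_rows: Dict[int, int] = {}
    order: List[int] = []
    activations = 0
    while pending:
        req = pick_next(pending, open_rows)
        if open_rows.get(req.bank) != req.row:
            activations += 1              # a different row must be opened
            open_rows[req.bank] = req.row
        pending.remove(req)
        order.append(req.arrival)
    return order, activations


def in_order_activations(queue: List[Request]) -> int:
    """Baseline: a controller that sees only one request at a time."""
    open_rows: Dict[int, int] = {}
    count = 0
    for req in sorted(queue, key=lambda r: r.arrival):
        if open_rows.get(req.bank) != req.row:
            count += 1
            open_rows[req.bank] = req.row
    return count


# Two initiators alternately touching rows 5 and 9 of the same bank.
queue = [
    Request(0, 1, 0, 5),
    Request(1, 2, 0, 9),
    Request(2, 1, 0, 5),
    Request(3, 2, 0, 9),
]
order, activations = service_all(queue)
fifo = in_order_activations(queue)
print(order, activations, fifo)
```

With all four requests visible, the assumed policy groups the accesses by row and needs two activations; served strictly in arrival order, the same traffic needs four. A real controller weighs many more factors (bank state, read/write turnaround, QOS), so this is only a sketch of why queue visibility matters.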
One way to manage system traffic is to add traffic-shaping and traffic-ordering logic to the interconnect in front of the DRAM controller to assist the controller in DRAM management. However, as only the DRAM controller knows the state of the DRAM devices and pending requests from other system functions, request scheduling should be the job of the DRAM subsystem. Any attempt to reorder or tailor requests in any way can confuse the DRAM controller. A request that arrives early and that the system does not recognize as a priority request wastes storage. If the system recognizes it as a priority request, then it may disturb the scheduling and reduce DRAM efficiency. The network should be invisible to the DRAM and deliver requests as quickly as possible.
The best approach is to deliver the raw state of the system without overwhelming the DRAM controller, so there is a single point of resolution between the state of the DRAM devices and the needs of the system. Any other approach may introduce an opportunity for losing information or for parts of the system to make the wrong decisions, both of which reduce DRAM efficiency.
In some systems, the interconnect may modify the nature of the request such that when it arrives at the DRAM controller, it is late for an opportune scheduling window, or it may lose or subvert some aspect of the request. For example, in some time-multiplexed systems, the nature of the allocation algorithms causes a large transfer to break into a series of smaller transfers. At any point in servicing one of the smaller requests, the DRAM controller could have an opportunity to manipulate the DRAM controls to achieve additional efficiency if it knew that another, similar request was coming. Without that insight, the DRAM controller might close a bank or allow an intervening read to reverse the bus state, reducing efficiency.
For many DRAM controllers, the ability to reorder memory-access requests, even from a single requester, is critical for efficient operation. For maximum efficiency, the SOC architecture must incorporate some mechanism at the target that allows for this reordering across initiators, even if the requesting IP core does not support it.
Addressing DRAM-efficiency issues
One way to deal with the need to fully inform the DRAM controller is to synthesize, either through a formal manual process or using an automated tool, an interconnect structure based on the requirements of the individual requester blocks. In this instance, think of synthesis in the general sense as a translation from one level of abstraction to another and not in the specific sense as producing a gate-level netlist from an RTL (register-transfer-level) description. Such an automated tool may synthesize self-timed interconnect networks from a high-level, architectural description with several attributes that help optimize DRAM operation.
Interconnect synthesized in this way can inherently carry the identification of the requesting IP core, even if the protocol at the requesting end does not explicitly support such identification. For example, an AHB initiator, which does not have an explicit identification capability, may receive an SID (source identification) during interconnect synthesis. The interconnect carries this SID from the AHB initiator to the DRAM controller, and the controller can then use the SID in its algorithms to maximize efficiency. For protocols that have explicit identification fields, such as OCP or AXI, the system may carry the ID information intact to the DRAM controller. If the system architect wishes to include additional system-level priority or similar information about the requester, the interconnect-synthesis process may provide optional fields for this information. This situation is more desirable than dealing with buses that require the chip designer to translate all information into a form that the bus protocol can understand.
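As a rough illustration of the idea, the sketch below models an initiator adapter that stamps every request with an SID fixed when the interconnect is built, passes through a native ID when the protocol supplies one, and attaches an optional priority field. The `Packet` layout, the `InitiatorAdapter` class, and the field names are assumptions made for this example; they do not describe the internal format of any real tool.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Packet:
    sid: int                  # source ID assigned when the interconnect is built
    native_id: Optional[int]  # OCP tag or AXI ID, if the protocol has one
    address: int
    extra: Dict[str, int] = field(default_factory=dict)  # optional system-level info


class InitiatorAdapter:
    """Hypothetical adapter sitting between one initiator and the interconnect."""

    def __init__(self, sid: int, priority: Optional[int] = None) -> None:
        self.sid = sid
        self.priority = priority

    def wrap(self, address: int, native_id: Optional[int] = None) -> Packet:
        extra = {} if self.priority is None else {"priority": self.priority}
        return Packet(self.sid, native_id, address, extra)


# An AHB-style initiator has no ID of its own, so only the SID identifies it.
ahb_adapter = InitiatorAdapter(sid=3, priority=1)
ahb_packet = ahb_adapter.wrap(0x8000_0000)

# An AXI-style initiator supplies its own ID, which is carried intact.
axi_adapter = InitiatorAdapter(sid=4)
axi_packet = axi_adapter.wrap(0x1000, native_id=7)

print(ahb_packet)
print(axi_packet)
```

In both cases the DRAM controller receives enough information to tell the requesters apart, whether or not the initiator's own protocol could express it.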
If the synthesis creates self-timed communications networks, the bandwidth of these structures is generally much greater than that of the DRAM controller. In addition, because of the nature of the aggregation that might occur in the path to DRAM, such a system can deliver all possible incoming requests bound for DRAM at maximum wire speed. This method ensures that any request arrives at the DRAM controller as soon as possible and is not, for example, caught in the interconnect waiting for a clock edge.
In general, requests to the interconnect should be completely contiguous. The system should deliver a long, unmodified request from an initiator to the DRAM controller in whatever form it began. The interconnect should not modify any of the information.
In some cases, the initiator, rather than the interconnect, should modify requests. For example, the system designer may choose to incorporate burst chopping during a write request to avoid blocking or eliminate the need for large buffers to absorb a large data burst at the DRAM controller. Such a modification to the request stream needs to happen at the initiator to match the burst sizes the target DRAM can handle. In this approach, however, the interconnect should not be a factor in the burst-chopping decision and should carry whatever request the initiator sends.
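Burst chopping at the initiator can be sketched as a simple split of one long burst into consecutive bursts no larger than what the target accepts. The function name, the byte-based units, and the 64-byte limit are assumptions for illustration; the sketch also ignores address-alignment rules that a real protocol may impose.

```python
from typing import List, Tuple


def chop_burst(start_addr: int, length_bytes: int, max_burst_bytes: int) -> List[Tuple[int, int]]:
    """Split one burst into consecutive (address, length) bursts.

    Every piece except possibly the last is exactly `max_burst_bytes` long.
    Alignment constraints are deliberately not modeled.
    """
    if length_bytes <= 0 or max_burst_bytes <= 0:
        raise ValueError("lengths must be positive")
    pieces = []
    offset = 0
    while offset < length_bytes:
        size = min(max_burst_bytes, length_bytes - offset)
        pieces.append((start_addr + offset, size))
        offset += size
    return pieces


# Hypothetical 160-byte write chopped for a target that accepts 64-byte bursts.
chopped = chop_burst(0x1000, 160, 64)
for addr, size in chopped:
    print(hex(addr), size)
```

The interconnect then simply carries each of the resulting bursts as issued; it plays no part in deciding where the cuts fall.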
Along with the communications network, the synthesis process should also generate adapters—logic that services the needs of endpoints—with optional reordering capability. This approach allows a mixed system, such as one with both AHB and OCP initiators, to fully support reordering, allowing the DRAM controller to operate as efficiently as possible. For an endpoint, such as one that uses AHB, a reordering adapter manages the order of requests, making them appear to the endpoint as always in order, even if the DRAM controller, for efficient operation, chooses to reorder particular requests.
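A reordering adapter of this kind behaves much like a small reorder buffer. The following sketch, with hypothetical class and method names, assigns each request a sequence number on issue, accepts completions in any order, and releases responses to the in-order endpoint only when every earlier response has arrived.

```python
from typing import Dict, List


class ReorderingAdapter:
    """Hypothetical adapter that hides out-of-order completion from an endpoint."""

    def __init__(self) -> None:
        self._next_issue = 0    # sequence number for the next request
        self._next_deliver = 0  # oldest sequence number not yet delivered
        self._done: Dict[int, str] = {}

    def issue(self) -> int:
        """Register a new request and return its sequence number."""
        seq = self._next_issue
        self._next_issue += 1
        return seq

    def complete(self, seq: int, data: str) -> List[str]:
        """Record a response; return the responses now deliverable in issue order."""
        self._done[seq] = data
        ready = []
        while self._next_deliver in self._done:
            ready.append(self._done.pop(self._next_deliver))
            self._next_deliver += 1
        return ready


adapter = ReorderingAdapter()
s0, s1, s2 = adapter.issue(), adapter.issue(), adapter.issue()

# The DRAM controller happens to finish the third request first.
first = adapter.complete(s2, "C")   # held back: earlier responses are missing
second = adapter.complete(s0, "A")  # the oldest response can go out
third = adapter.complete(s1, "B")   # releases "B" and the held "C"
print(first, second, third)
```

The endpoint sees A, B, C in the order it asked for them, while the controller remains free to service the requests in whatever order suits the DRAM.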
Table 1 summarizes these DRAM-efficiency issues and how interconnect synthesized by a formal process can address them.
A 2-D block transfer
An example of a function that can improve DRAM efficiency is a 2-D block transfer. This transfer is a special form of an SRMD (single-request-multiple-data) command that passes across an interface from an initiator, through the interconnect, to a receiving target. With respect to OCP, the only publicly available protocol that currently supports this capability, the 2-D block transfer is a special burst type, MBurstSeq=BLCK.
In addition to the starting address and length (in the case of 2-D, the length of each line), this 2-D transfer request transmits a height (the number of lines in the block) and a stride (the offset from the beginning of one line to the beginning of the next) at the same time. Figure 2 shows the structure of a 2-D block transfer as the OCP 2.2 protocol defines it.
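The four parameters fully determine which memory locations the block covers. The sketch below computes the start address and length of each line from the starting address, line length, height, and stride. It uses byte units and a hypothetical function name for simplicity; the protocol's own units and signal encodings are not modeled here.

```python
from typing import List, Tuple


def block_line_addresses(start: int, length: int, height: int, stride: int) -> List[Tuple[int, int]]:
    """Return (address, length) for each line of a 2-D block.

    start  : address of the first line
    length : length of each line
    height : number of lines in the block
    stride : offset from the beginning of one line to the beginning of the next
    """
    return [(start + line * stride, length) for line in range(height)]


# Hypothetical 64-byte-wide, 4-line block in a frame buffer with 2048-byte rows.
lines = block_line_addresses(start=0x1000, length=64, height=4, stride=2048)
for addr, size in lines:
    print(hex(addr), size)
```

A single request therefore stands in for four separate line accesses in this example, which is the information the receiving hardware needs to step from line to line and to detect the end of the burst.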
At the receiving end—typically the DRAM controller but possibly a video display in a write-only video-processing system—special hardware stores the additional information for the 2-D burst request, manages the incrementing movement from one line to the next based on the stride, and determines the end of the burst based on the height.
Figure 3 shows two implementations of a 2-D block transfer that a self-timed interconnect fabric of the type synthesized by Silistix tools can support. Each implementation shows a single path, from the initiator to the DRAM controller; in practice, however, the initiator also needs to access other paths in the system, including, possibly, other endpoints.
In Figure 3a, the key system flow is between an OCP 2.2-capable initiator that can issue an MBurstSeq=BLCK command and a similarly OCP 2.2-compliant DRAM controller that can directly accept the MBurstSeq=BLCK command. Thus, the DRAM controller has the registers and logic to store and manipulate the burst for its stride and height. In this implementation, the adapters at each of the critical endpoints would need to be OCP 2.2-compatible.
When it receives the MBurstSeq=BLCK burst type, the initiator adapter packs that request into the internal format that the synthesized interconnect uses and delivers it intact with the additional fields to the target adapter. The target adapter unpacks the information from the internal format and supplies it across the OCP 2.2 interface exactly as the initiator adapter received it. As the DRAM controller supplies data, the target adapter associates that data with that request and delivers it back to service the command.
In Figure 3b, the same OCP 2.2-capable initiator makes the request, but there is no OCP 2.2-capable DRAM controller. The DRAM controller in this case might have an AXI interface or an older OCP-compliant interface that does not support the MBurstSeq=BLCK burst type. The controller might also be a customer-specific core that uses a native interface instead of a standard protocol. In any of these cases, the operation of the implemented circuitry is the same.
When the initiator adapter receives an OCP 2.2 MBurstSeq=BLCK burst type, the adapter determines that the target is not OCP 2.2-capable and stores the stride and height information in additional hardware. The system architect would specify this hardware during interconnect synthesis, based on a desire to support the OCP 2.2 MBurstSeq=BLCK burst type for the initiator despite not having an OCP 2.2-compliant DRAM controller. The initiator adapter would decompose the 2-D block transfer into the proper number of conventional burst transfers and begin issuing them across the interconnect to the DRAM controller. If the DRAM controller uses a nonstandard interface, the interconnect can alert the DRAM controller that multiple burst requests of this type are inbound. The system could also unroll 2-D bursts at the target and automatically deliver them to the endpoint DRAM controller, which has the same effect. Depending on the algorithms the controller uses, this information alone may improve the efficiency of servicing an MBurstSeq=BLCK type of burst.
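The decomposition step can be sketched as a generator that turns one 2-D block request into a sequence of conventional bursts, each carrying a count of how many related bursts are still to come. The `remaining` field stands in for the kind of advance notice described above; it, the `BurstRequest` class, and the function name are hypothetical and do not correspond to any standard signal.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class BurstRequest:
    address: int
    length: int
    remaining: int  # hypothetical hint: bursts from the same block still inbound


def unroll_block(start: int, length: int, height: int, stride: int) -> Iterator[BurstRequest]:
    """Decompose a 2-D block transfer into one conventional burst per line."""
    for line in range(height):
        yield BurstRequest(
            address=start + line * stride,
            length=length,
            remaining=height - line - 1,
        )


# Hypothetical 3-line block, 32 bytes per line, 1024 bytes between line starts.
bursts = list(unroll_block(start=0x4000, length=32, height=3, stride=1024))
for burst in bursts:
    print(hex(burst.address), burst.length, burst.remaining)
```

A controller that sees a nonzero `remaining` value could, for instance, choose to keep resources available for the bursts that follow, although whether that helps depends on the controller's own algorithms.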
For systems in which the OCP 2.2-capable initiator needs to issue 2-D block transfers to both OCP 2.2-compatible and non-OCP 2.2-compatible endpoints, the same options apply. The initiator adapter should be able to identify, from the nature of the request and its knowledge of the endpoint types, which endpoints can directly accept the 2-D block command and which ones must use the optional hardware. This approach involves slightly more hardware overhead to service both types with OCP 2.2 MBurstSeq=BLCK burst types but allows efficient management of mixed systems.
Design teams must consider various trade-offs when evaluating the two implementations. Table 2 shows these trade-offs relative to the two 2-D block-transfer implementations in Figure 3.
For either option, it is important to model the traffic interactions relative to the DRAM controller in a more abstract form. The Silistix tools can generate an OSCI- or CoWare-compatible SystemC model of the synthesized interconnect and a timed or an untimed version of the model for early system verification.
Using 2-D block transfers improves both DRAM and network efficiency; it improves network efficiency by aiding traffic flow on the network. These types of transfers improve the efficiency of both synchronous and asynchronous, self-timed networks but are particularly beneficial for the asynchronous variety. It is possible, by examining the kinds of transfers the blocks in an SOC require, to synthesize a network that can enable a DRAM controller to most efficiently mediate between the needs of the blocks and the behavior of DRAM chips. This can happen nearly independently of the individual characteristics of the blocks and the controller, as long as the architecture respects the separation between the processing and the transport functions.
Author Information
David Lautzenheiser is the vice president of marketing at Silistix. Before joining Silistix, he was in private practice, assisting innovative small companies with company and product-strategy issues and planning and executing company and product launches. Previously, Lautzenheiser successfully launched new companies and products as vice president of marketing at both Sonics and LightSpeed Semiconductor. Lautzenheiser began his marketing career at Xilinx, where he led the introduction of the first FPGAs. He holds a bachelor's degree in electrical engineering from Washington University (St Louis).
Agha Hussain recently joined Silistix as the company's chief system architect. Previously, he was an application architect at Sonics, working in the digital-media and wireless areas for applications such as DVD, DTV, cell phones, and WiMax. In addition, he was a co-founder and vice president of hardware platforms for Network Utilities and worked in a variety of roles at Integrated Device Technology. His interests and experience include synthesis, timing and performance analysis, chip interconnect, DDR memory, and silicon-bus protocols. Hussain has a bachelor's degree from Nardirshaw Edulji Dinshaw University of Engineering and Technology (Karachi, Pakistan) and a master's degree in engineering from the University of Southern California (Los Angeles).














