MP SoCs: Where are the Tools?
By Ann Steffora Mutschler -- Electronic News, 8/25/2006
Electronic News sat down with Simon Davidmann, president and CEO of Imperas Inc.; Chris Rowen, founder, president and CEO of Tensilica Inc.; Ian Mackintosh, president of the OCP/IP Association; Jeff Jussel, VP marketing and Americas general manager at Celoxica Ltd.; and Tom Grebinski, founder and chairman of Oasis Tooling, to talk about multi-processor systems-on-chip (MP SoCs), including how good current tools are for creating them, as well as what the industry needs to move forward. What follows are excerpts of the discussion.
Electronic News: There has been an increasing amount of discussion surrounding multi-processor systems-on-chip. Do we know how to build them?
Davidmann: Absolutely, we do know how to do it; but today it is very hard, very complex, very manual, very time-consuming, very error-prone, very inefficient and very hit-and-miss. But yes, we can do it – the same way we built chips out with schematic entry tools. It’s OK when they are small, but when they start getting complex and you start getting other effects, I don’t believe there’s much technology out there that can help. And the added complexity of MP SoC is that it’s all about software and actually software is still in the dark ages when it comes to embedded tools. So, yes, we can, but we are a long way from making it easy.
Mackintosh: There’s a danger here of straying into the ESL domain. If we are talking about multi-processor architectures and multi-processor SoCs, I did a little survey on the implementation of those. What I’m hearing is that we are already doing them – they already exist. I also asked what tools are needed and it turns out that the same tools are needed [as in the ESL space], but the challenges are different. We’re not doing them as well as we might like, which could mean more tools or more integration. I particularly asked what the challenges were [in building MP SoCs] from both the technical perspective and from the point of view of the programmer doing the SoC – it’s a hardware implementation issue. From the technical viewpoint, I heard that it is the same old physical and logical integration; performance validation; memory architectures, which are a huge problem; software development on the individual processors; and domain management. From a programmer’s perspective it is almost one issue: unification in terms of having a single program for diverse and dispersed development teams, having a methodology to attack that problem, and unification in terms of the SoC operating as a single system. The real issue and the real challenge is the spin issue because people are looking for variants on central-themed products. Therein lies the need for new tools because being able to spin things the way people are doing it today is not easy.
Davidmann: On the tool side of things, if you are doing a DSP and a RISC, then you can pretty much do that with the tools you get from ARM and the DSP vendor. The tools are built the old way, so you get two debuggers, for example, and you have to single step them and it’s pretty complex. If you put six [processors] in there, you find yourself spending weeks trying to get the stuff to work. If you put 50 down there, the old tools are not going to be any use at all. We’re just getting by with band-aids on the tools today. In next to no time, when people are putting 20, 40 or 60 cores down, you can’t use each debugger and have 60 or 70 windows open, and start and stop simulators. There’s got to be retooling. There’s got to be a new generation of tools and technologies. People are doing multi-processor SoCs, but they are very low-end, like two or three. There are companies that have done 400 and had to build their own tools because the tools that are out there are not adequate at all.
Rowen: I think the important experience in building these very large scale MPs, not just in debug, but in a number of dimensions -- of performance analysis, in modeling of interconnect -- is you have to think a little bit more application specific. An application-specific processor, for example, is just a starting point and an enabler for thinking about application-specific interconnect, application-specific modeling, and application-specific debug. I would love to believe that there is some universal solution that is exactly right for the thousand-processor system but in fact, the way that architects want to look at MP debug and interconnect is actually through a lens which is very much about their application. As a result, it’s partly about the infrastructure you create and partly about the environment you create that allows them to do the many-core debug that makes sense for their application, because if you ask, ‘What’s universal about debug?’ you end up with the 400 windows, unless you have some higher-level model of what’s going on in the computation. That application focus percolates through to all of the higher-order views that you provide, whether it’s debug or modeling or interconnect or performance analysis or energy estimation. That’s why it is quite important to think about the software development infrastructure not being just a question of how did I develop the application but also to be one that asks how the software development infrastructure was developed at the next level.
Electronic News: What are the different pieces of the infrastructure that we need, what’s available today and how soon are we going to make it easier on the architect?
Mackintosh: OCP/IP is working on a number of areas right now because of this very problem. We started workgroups last year in three areas: cache coherence, a unified debug socket specification and network-on-chip [NOC] benchmarking.
Rowen: I mentioned the importance of software development environments and the fact that many people want to have many common elements, but there’s something about their application that is different. It turns out that the Eclipse software development environment, which is an open-source environment that is now the most widely adopted embedded software integrated development environment, is highly modular and extensible and is a very attractive foundation for people who are thinking about software in the context of their particular application. So, quite a wide range of the major vendors of RTOSes and software development tools across a wide range of architectures support it. IBM is probably the biggest sponsor of its development. But it probably is the best candidate from a software view of how you go application-specific. Another thing that is very important is MP system modeling. [Tensilica] is working with a number of parties like CoWare, Summit and Mentor that put quite a focus on the ESL system modeling perspective. Also, in a complementary way, there is a big issue in MP APIs. How do you think about partitioning an application, how do you think about programming at this new level which has to be closely tied in with the interconnect, with the modeling – all of those things are fairly closely tied together. Lastly, the other thing of particular importance, given the realities of, if not the end of Moore’s Law, then the left turn of Moore’s Law, is that energy has become a really critical issue from a system architecture/system design perspective, so MP energy analysis is very important. Investment is needed in these areas, and in a number of important cases, investment is taking place.
Electronic News: How far are we away from seeing some of this come to reality?
Rowen: It moves out into the light step by step. There are important pieces of it which are out there today. Most of our customers do MP design. The center of mass is around four, five or six processors. But there are people who are way out there doing hundreds of processors per chip and hundreds of thousands of processors per system. That’s the fringe, but it’s the right answer for an increasing range of problems and if it’s the right answer, people will find a way.
Jussel: Too often we have a modular escalation of, ‘you put two [processors] on [a chip], well, I put a thousand on,’ without any regard to overcoming Amdahl’s Law or getting any type of performance improvement of my system.
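For readers unfamiliar with the Amdahl's Law that Jussel cites, the following is a minimal sketch of its standard textbook form, not anything the panelists presented. The function name `amdahl_speedup` and the 95 percent parallel fraction are illustrative assumptions; the processor counts echo figures mentioned in the discussion.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Ideal speedup of a fixed workload under Amdahl's Law.

    parallel_fraction: share of the work that can run in parallel (0.0 to 1.0).
    processors: number of processors sharing the parallel part (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not sped up; the parallel part is divided evenly.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Assumed workload: 95 percent parallelizable (hypothetical figure).
    for n in (2, 6, 50, 1000):
        print(f"{n:>5} processors: {amdahl_speedup(0.95, n):.2f}x")
```

Under that assumed 95 percent figure, the speedup can never reach 20x however many processors are added, which is the sense in which piling more cores onto a chip does not by itself improve system performance.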
Grebinski: We are running eight vector processors using the new [IBM] Cell Broadband Engine and we’re using them concurrently and getting tremendous results from those processors. We have to manage the DMA and the SIMD operations very carefully and it takes a lot of programming work and effort to do that. The core of the issue with multi-core processors is how good you are at programming and managing those processors.
Davidmann: It’s hard to program, but it is programmable. A lot of MP chips have failed in the past because only a few people could program them and so they never got adopted. That is one of the challenges of the Cell processor – can it get widespread adoption? Can Sony actually get the PlayStation 3 out? Can the games work? The challenge for MP SoC design is not just getting the silicon to work, it’s being able to make that silicon programmable, which is an architectural choice you have to make right up front. Just spraying down 40 processors doesn’t help much – you have to worry how you get the data in and out. It’s clearly very complex.
Jussel: I don’t think that tool exists yet for doing that analysis.
Rowen: In high-volume applications, that is the dominant mode. Moore’s Law says you are going to integrate a lot of different subsystems and in some cases those subsystems will be somewhat MP. There are a number of things that will be MP just by virtue of the fact that as you step from 130 to 90 to 65[nm], you have the opportunity to put them together, and there are compelling cost and product reasons why you want it all together on one chip, but it didn’t really change the system partitioning. At the other extreme, we’ve been working on an application that is way out at the radical fringe that gives a picture of what it may be like for truly large-scale MP: In the world of high-performance computing [supercomputing] there are a handful of problems that probably can benefit from and must move to peta-scale computing; these are truly the grand challenge problems, things like climate modeling. We’ve been working with some of the leading experts in climate modeling algorithms at Lawrence Berkeley National Laboratory and we have started to work on a speculative idea, which requires both a breakthrough in the climate modeling structure itself and a breakthrough in the energy efficiency of the machine. But it appears there might be a solution which is around about four million processors in one system. Fortunately not four million on one chip. It is about an order of magnitude more efficient than what you could do by putting together some number of Opterons or some number of Cell because of the intrinsic characteristic down to the computational level, but it turns out it requires you to reconfigure the algorithms in order to be able to get that degree of parallelization. It requires you to rethink the memory systems very dramatically because if you’re going to scale to that level, you need a huge amount of bandwidth – more than what anybody has ever put into a computer system before.
It requires you to think in terms of the interconnect and communication in a way that lowers the overhead fairly dramatically. And you have to make these extremely energy-efficient chips to be able to do it. But then it pushes the limits on how you debug -- that’s something near and dear to these climatologists and algorithm experts and it pushes the boundaries on what kind of programming support you put in there to be able to exploit the key characteristics: lots of processors and huge amounts of memory bandwidth. I’m not suggesting that embedded systems or that your cell phone is going to have four million processors any time soon, but if we can solve the four million processor system, we can probably solve the 400- or 40- and certainly the four-processor system fairly easily.
Jussel: Four million is a big number, but it may be less impressive if you define what you mean by processor. For example, I could consider a small block of an FPGA a processor, and I could program an FPGA for four million processors all running together. That level of parallelization still gives me the 300x performance improvement and it could be considered multi-processing. But again, it is application-specific. When do you go to a bigger processor with four million of them, and when do you go with 200 Opterons? It really depends. What’s really missing at the higher level is something that helps people make that decision.
Rowen: It’s very clear what the answer is about what is a processor. It is what the climatologists consider a processor.
Electronic News: In that case, would they care as long as their application was running to their specification?
Rowen: They care enormously. They have existing code that they want to run.
Jussel: All they want is for it to run faster. They don’t care if it is a processor, they don’t care if it is a chip, they don’t care what it is.
Rowen: So long as they have the ability to dynamically compile and load these hundred-thousand line programs and have them run fast.
Jussel: That’s exactly where we end up in the same space because we take that C code and map it to an FPGA as a coprocessor. So it looks like millions of little processors running together. It isn’t as impressive, nobody’s going to write any big papers on it.
Rowen: But if you can achieve the energy characteristics and the programmability characteristics, I don’t think the climatologists care that much as long as it walks like a processor and it talks like a processor – it’s a processor as far as they are concerned.













