What’s the Heterogeneous Point? The HSA Foundation provides an answer


// May 30th, 2013 // MultimediaProcessors
Our industry is littered with over-used terms whose meaning becomes ever more jaded as more people use them. We’re in danger of “heterogeneous” being another of them. But I sincerely hope we can live with that, because heterogeneous systems are going to be with us for a very long time to come. One of the first steps towards this is the recent announcement of the ratification of the HSAIL language specification from the HSA Foundation.
For as long as I’ve been involved in semiconductor processors – well over 30 years now – the desire to remove the bottlenecks of sequential processing has been insatiable. It hasn’t been satisfied until now for one simple reason: any new platform that tackled it, no matter how clever, was always too “niche” and low volume for the mainstream software community to consider adopting in any meaningful way. And as we all know, the software community dwarfs the hardware community and, more importantly, implements the code that brings our hardware brilliance to life for real end users. Software developers need high platform volumes for their software to be profitable. Hence, nothing ended up happening – until now.
At last we can see a way forward, thanks to the mass market adoption of traditional sequential CPUs combined with high-performance parallel processor-based GPUs in billions of mobile phones, tablets and other mass market products. At last, the software industry can move on (profitably) from the limitations of sequential processing into a world where processing scales linearly with silicon nodes, where processing efficiency per mW and per mm² leaps to new heights, and where the sheer breadth of processing power at any one time from low-end to high-end is measured in orders of magnitude – all at mobile power consumption levels.

Heterogeneous processing – the killer combination of processors

This CPU + GPU combination is a true heterogeneous processor: multiple datapath architectures, each very different in ISA and capabilities, but working together under the control of a single application. But how do we program these new beasts? Well, we’re doing it today: we use graphics APIs at higher levels of abstraction to talk to the GPU – and it works extremely well. Not surprising really – haven’t we actually been writing these heterogeneous applications for decades already, thanks to our games consoles and GPU-enriched PCs?
HSA Foundation infographic
The HSA Foundation was formed as an open industry standards body to unify the computing industry around a common approach*
But now we want to do more. We want to use all that processing power in the GPU not just for graphics, but for other things like image processing, database searching, fluid dynamics – all sorts of things requiring processing horsepower way beyond what the best mobile CPUs can hope to deliver. How do we write applications like that? Do we need new languages?
No – we just need new abstractions and APIs such as Khronos’ OpenCL to help us. But we need more: since these are performance-driven applications, often with demanding real-time constraints such as user interactivity measured in milliseconds, we need to ensure every part of our heterogeneous processor is being used effectively.

Enter the HSA Foundation

That’s where technologies such as HSA come in. The HSA Foundation was created by industry leaders including AMD, Imagination, MediaTek, Qualcomm and Samsung, to ensure that applications can manage their execution not only on the CPUs and GPUs in a system, but also the infrastructure connecting them. By creating an open standard around how to connect CPUs to GPUs and other processors, we break the dependence of such advanced apps on any particular chip or CPU or GPU architecture. This enables the silicon industry to innovate by allowing multiple vendors to create competing solutions – fuelling the innovation the semiconductor industry is so famous for.
Since such heterogeneous SoCs (Systems on Chip) are expected to often be in mobile platforms with limited memory bandwidth and tight power budgets, how we schedule low-level tasks and assign them to various parts of a GPU or one of the CPU cores in the system is critical to an application utilising the full capabilities of any heterogeneous SoC. The APIs associated with HSA (and other forms of heterogeneous processing) will be key to enabling this.
Applications will also need to get ever more sophisticated in their “discovery” phase during startup, where they explore the system they are running on to find out what processing resources they have, what bandwidths are available, and much more.
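As a rough illustration of such a discovery phase, here is a minimal Python sketch. Everything in it is hypothetical: the `ComputeDevice` fields, the hard-coded `discover_devices()` stand-in and the selection rule in `pick_device()` are assumptions made for illustration, not part of any HSA or OpenCL API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeDevice:
    """One processing resource found during discovery (hypothetical fields)."""
    name: str
    kind: str              # e.g. "cpu" or "gpu"
    parallel_lanes: int    # how many work-items it can run side by side
    bandwidth_gbps: float  # memory bandwidth available to this device


def discover_devices():
    """Stand-in for a real platform query; returns hard-coded example data."""
    return [
        ComputeDevice("cpu0", "cpu", parallel_lanes=4, bandwidth_gbps=6.0),
        ComputeDevice("gpu0", "gpu", parallel_lanes=64, bandwidth_gbps=12.0),
    ]


def pick_device(devices, work_items, min_bandwidth_gbps=0.0):
    """Pick the narrowest device that covers the job, else the widest one."""
    eligible = [d for d in devices if d.bandwidth_gbps >= min_bandwidth_gbps]
    if not eligible:
        raise ValueError("no device meets the bandwidth requirement")
    # A job with few work-items cannot fill a wide parallel device, so prefer
    # the narrowest device that still covers it; otherwise take the widest.
    covering = [d for d in eligible if d.parallel_lanes >= work_items]
    if covering:
        return min(covering, key=lambda d: d.parallel_lanes)
    return max(eligible, key=lambda d: d.parallel_lanes)
```

With this made-up data, `pick_device(discover_devices(), work_items=2)` selects `cpu0`, while a job of 1000 work-items goes to `gpu0`. A real application would replace the hard-coded list with whatever query mechanism its platform provides.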
Heterogeneous applications targeting tomorrow’s SoCs have a lot to cope with – problems HSA and other heterogeneous approaches will help to solve. But by solving them in an open, standards-based way, we’ll end up with a software industry that is highly motivated to deliver apps using these open standards. These apps will not only achieve functionality we’ve never dreamt of; they’ll also adapt in ever more ingenious ways to whatever resources they have available to them.
It’s another exciting decade ahead for the world of computing! Have a look at the HSA Foundation website and their recent blog articles to find out more, and make sure you follow us on Twitter (@ImaginationPR, @GPUCompute and @HSAFoundation) to stay updated on the developments behind this exciting partnership.
 
* Image courtesy of AMD, all rights reserved.

Tony King-Smith

Tony joined Imagination Technologies in 2006 and is the company’s EVP of Marketing, responsible for all segment and technology marketing, communications, OEM relationships and ecosystems. He has extensive experience in product and segment marketing including many blue chip corporate relationships. Prior to Imagination, Tony held senior engineering and marketing positions with Panasonic, Hitachi (now Renesas), LSI Logic and INMOS.
 
http://withimagination.imgtec.com/index.php/multimedia/whats-the-heterogeneous-point-the-hsa-foundation-provides-an-answer

First delivery from Heterogeneous Systems Architecture Foundation

 
Yesterday the Heterogeneous Systems Architecture (HSA) Foundation released Version 0.95 of its Programmer’s Reference Manual. This release is the first yield from the Foundation and is the product of an entire year’s collaboration between leading companies throughout the entire length of the heterogeneous computing value chain, from silicon to IP to ISV.


Of course I have written here before about the founding of the HSA Foundation, its values and ARM’s early commitment to join it in defining an appropriate set of standards for the industry. Since that post, the HSA Foundation has striven to generate the defining standards of heterogeneous computing – so this Manual is the result of much hard work all-round. I am therefore naturally very pleased to see this release announced by the Foundation. This Reference Manual will now be the building block upon which ARM’s critical ecosystem of software developers, from tools to middleware, can design their products and, through this, develop the field of HSA, allowing the Foundation to succeed in its aim of combining the best capabilities of CPU, GPU and accompanying technologies.
ARM has contributed to the Foundation’s working groups many of its top experts from the fields of computer processing, graphics processing, interconnect and compiler technology and software. These colleagues of mine are working alongside HSA ecosystem partners, ensuring that what the Foundation delivers is based on the knowledge and experience of the entire breadth of the industry: an example of the ARM partnership model at its finest.
Version 0.95 of the Programmer’s Reference Manual is the initial output in a line of specifications that the HSA Foundation will be releasing over the months to come.
Hopefully, the next to appear will be the Hardware System Architecture Specification…!
Jem is an ARM Fellow and likes to think of himself as “The Godfather” to technical talent in ARM. After spending some time in his youth writing software for satellites and traffic-lights among other fascinating things, Jem spotted the technical inflection point of the mobile industry: graphics, video and other visual computing. As VP of technology in the Media Processing Division of ARM, Jem is busy with a lot of projects involving the future of cool ARM technology, which will revolutionise how people experience and interact with digital devices.

 
All company and product names appearing in the ARM Blogs are trademarks and/or registered trademarks of ARM Limited per ARM’s official trademark list. All other product or service names mentioned herein are the trademarks of their respective owners.

 
http://blogs.arm.com/multimedia/973-first-delivery-from-heterogeneous-systems-architecture-foundation/?sf13357134=1

HSA Foundation has just released version 0.95 of the Programmer’s Reference Manual, which we affectionately refer to as “the HSAIL spec”

The HSA Foundation has just released version 0.95 of the Programmer’s Reference Manual, which we affectionately refer to as “the HSAIL spec”. This has been in development for more than a year, and I’m proud to finally be able to share our work with the external world. My role in the process was the working-group spec editor for the 0.95 version. The spec also benefitted significantly from the contributions of Norm Rubin (who wrote the original draft) and Tony Tye (who polished the final one), as well as the contributions from the many architects and experts from the companies in the working group.
 
The spec describes HSAIL (HSA Intermediate Language, pronounced “H-Sale”).  HSAIL is a low-level intermediate language for a wide variety of parallel processor architectures (including GPUs) supported by members of the HSA Foundation.   HSAIL is a preferred target for library writers and back-end compiler developers who want to target HSA compute devices, and who want to deliver their own optimizations and control the compilation.  HSAIL is architected such that register allocation and other complex compiler optimizations are done before HSAIL is generated, which leads to a robust and fast translation from HSAIL to the device instruction set.  HSAIL also includes a well-defined relaxed memory model including load.acquire, store.release, barrier, and fine-grained barrier operations.   HSAIL is designed to support a wide variety of high-level programming models such as OpenCL™, OpenMP™, C++, and Java.  HSAIL also defines a binary format called “BRIG” that can be embedded in executable files alongside the code for the host CPU instruction-set.
 
Writing in HSAIL is similar to writing in assembly language for a RISC CPU: the language uses a load/store architecture, supports fundamental integer and floating point operations, branches, atomic operations and multi-media operations, and uses a fixed-size pool of registers. HSAIL also contains built-in support for function pointers, exceptions and debugging information. Additionally, HSAIL defines group memory, hierarchical synchronization primitives, and wavefronts that should look familiar to programmers of GPU computing devices.
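To make the phrase “load/store architecture” concrete, here is a toy interpreter in Python. It is only a sketch under stated assumptions: the opcode names, the tuple encoding and the register count are invented for illustration and are not HSAIL syntax or semantics. The one property it is meant to show is that arithmetic works on registers only, while memory is touched solely by explicit load and store operations.

```python
NUM_REGISTERS = 8  # fixed-size register pool (size chosen for illustration)


def run(program, memory):
    """Execute (opcode, operands...) tuples on a toy load/store machine."""
    regs = [0] * NUM_REGISTERS
    for op, *args in program:
        if op == "ld":        # ld dst_reg, address: memory -> register
            dst, addr = args
            regs[dst] = memory[addr]
        elif op == "st":      # st src_reg, address: register -> memory
            src, addr = args
            memory[addr] = regs[src]
        elif op == "add":     # add dst, a, b: operands are registers only
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
        elif op == "mul":     # mul dst, a, b: operands are registers only
            dst, a, b = args
            regs[dst] = regs[a] * regs[b]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return memory


# Computes memory[2] = memory[0] * memory[1] + memory[0]
program = [
    ("ld", 0, 0),       # r0 <- memory[0]
    ("ld", 1, 1),       # r1 <- memory[1]
    ("mul", 2, 0, 1),   # r2 <- r0 * r1
    ("add", 2, 2, 0),   # r2 <- r2 + r0
    ("st", 2, 2),       # memory[2] <- r2
]
```

Running `run(program, [3, 4, 0])` returns `[3, 4, 15]`. Because the register pool has a fixed size, using a register index of 8 or above fails, which loosely mirrors why register allocation has to be settled before code in such a form is generated.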
 
Today, many accelerator devices have separate address spaces that require cumbersome copy operations, and prevent complex pointer-containing data structures from being used on the accelerator and host. HSA platforms address this challenge by requiring that all HSA Components can access the same shared, coherent memory space with high performance. HSAIL helps to address the other major challenge with programming heterogeneous computing devices: today accelerator programmers typically have to use a dedicated “compute language” such as OpenCL™ or CUDA™ to access the power of the accelerator. A primary goal of HSA is to bring the power of these compute devices to the programming languages that developers are already using, in a natural and easy-to-use manner. We are seeing this already with the introduction of C++ AMP, Java Aparapi, and Bolt – programmers are able to access parallel compute resources using programming models that are no more complex than those used for multi-core CPUs. HSAIL adds the benefits of a portable IR that is still low-level enough to give language and compiler vendors control over the code generation and associated optimizations. Additionally, HSAIL is a royalty-free open standard – and open standards spur innovation, eliminate reliance on a single vendor, and always win over time. The HSA Foundation will be providing publicly available assembler and disassembler tools for HSAIL, and will additionally provide a code generator back-end for the popular LLVM compiler infrastructure. LLVM already contains front-end parsers for many popular programming models. Combined with language constructs to identify parallel regions (some of which already exist), these can be naturally extended to leverage HSAIL and the power efficiency and performance benefits of heterogeneous computing.
 
It has been a long journey and we are excited to share the next steps with the broader development community!
 
-Ben Sander
AMD Fellow & Architect for HSA
Main Spec Editor for HSA Programmer Reference
 

HSA Programmer Reference: The Formation Of The New Specification

The HSA Foundation was founded on June 12, 2012, and discussions among board members from the founding companies commenced. We quickly reached a consensus to set the work of specifying HSA in motion, and to form the Programmer’s Reference Manual (PRM) working group as early as possible. We also strove to follow the examples set by successful standards bodies, and picked Khronos as our role model. The first two meetings of the working group, held on Aug 24 and Aug 31 of 2012, produced a Statement of Work, a meeting format, a schedule and a meeting frequency. The first working group of the young organization had embarked on its journey. The atmosphere of the working group was extremely friendly and cooperative. Within a couple of weeks, we were able to associate the voices with the names, through some trial and error and, of course, friendly reminders. Our original plan was to complete the work in 9 weeks, so that we could submit the spec for ratification by the end of 2012. Finishing the work in 9 weeks proved to be too ambitious: 3 additional months were needed to produce a version deemed ready for the public.
 
The ultimate mission of HSA is to advance Parallel Computing with GPUs, or any other kind of programmable device, to the next level in terms of ease of programming and power efficiency. We needed to repeatedly remind ourselves to strike a balance between the current state of the art and forward-looking ideas beyond the current, conventional way of programming GPUs or, for that matter, any SIMD-style processors. Also, by looking at use cases that do not yet exist in the marketplace, we needed to revisit some common themes in computing, such as precision, cache coherency and memory consistency, again and again. The goal is to create a standard that is practical not only for industry-wide adoption, but also for future innovation and differentiation.
 
The PRM, commonly referred to as the HSAIL (HSA Intermediate Language) spec, plays a central role in such a revolution in Parallel Computing. It provides a reference for HSAIL, which is intended to decouple software development from hardware development. One key and differentiating feature of HSAIL is that it is positioned as a virtual ISA for any programmable computing device participating in an HSA-compliant system. Programmers can assume that there is an HSAIL virtual machine supporting HSAIL, and all practical concerns and issues regarding performance and power can be addressed with respect to such a “machine”. Hardware designers can build their HSA-compliant computing devices with the goal of executing HSAIL code, through efficient Just-in-Time compilation, as close to the metal as possible. The HSAIL virtual machine is essentially a load/store architecture, supporting fundamental integer and floating point operations, branches, atomic operations and multimedia operations, and using a fixed-size pool of registers. Additionally, the machine supports group memory, hierarchical synchronization primitives, and wavefronts which, though they look familiar to programmers of GPUs, could potentially be leveraged in non-GPU computing devices as well.
 
For middleware, library and compiler developers, HSAIL is a perfect target due to its low-level nature, and its stability and universality compared to native hardware ISAs. They can invest in R&D on top of HSAIL, and be sure that they will get a return through the HSAIL ecosystem. Application developers can optimize their code manually in HSAIL, and/or leverage third-party HSAIL development tools or environments, and be confident that the real-world performance and efficiency of applications developed this way will match their expectations. Such assurance is achieved through hardware vendors striving to optimize their HSA-compliant devices for HSAIL. Since HSAIL defines a virtual machine, not a physical one, hardware companies can innovate and differentiate in their native ISAs and micro-architectures. One of the coolest things about HSAIL is that it can potentially enable an ecosystem in which advances in Parallel Computing can happen independently and synergistically between software and hardware companies.
 
Completing the task of releasing a spec within 6 months from a young foundation is truly an amazing feat. Although AMD provided an initial draft that was nearly complete in terms of features, many of these features required careful reexamination and re-specification. Foundation members sent their best architects to participate, with the mandate to give this work priority. Because of the high quality of collaboration, most issues were resolved through consensus. Only 2 issues had to be resolved by ballot. One ballot question decided whether we should treat FP64 as optional in the base profile. The second ballot question could not be avoided: it was the vote for ratification! Among issues resolved by consensus, naming and specifying the profiles was the stickiest. Due to different views on technology roadmaps, historical backgrounds and market positioning, the working group could not at first reach an agreement, and we asked the Board of Directors to arbitrate. And just as the US Supreme Court will sometimes return a case to the lower courts, the board sent the issue back to the working group! The working group reconsidered the issue and found a consensus.
 
The work of the PRM WG continues. There are many cross-group issues, for example, linkage, where the PRM WG plays a necessary role. Additionally, features continue to be examined and tuned. We have also turned our attention to enablement of implementation, by providing the HSAIL grammar and syntax in EBNF format. And we are correcting for consistency: the textual specification, programming examples, EBNF, and BRIG definitions are effectively four different ways of describing a feature.
 
As it happens, several participants are working in different working groups, and often considering the same issue from the perspective of the PRM, then from the point of view of system architecture, then in the context of the runtime, … We joked that we show split personalities when participating in different groups.
 
We have an outstanding team of processor, compiler and system architects. I am confident that what we have produced and will continue to produce will be superior to any proprietary solution. With such a great team, and the great companies behind it, I can proudly and confidently say that the future of Heterogeneous Parallel Computing is being shaped and defined here.
http://www.mediatek.com/_en/03_news/01-2_newsDetail.php?sn=1111&p=1
Chien-Ping Lu
Working Group Chair, Programmer’s Reference Manual
Sr. Director, Corporate Technology Office
MediaTek USA Inc.

HSA Foundation announces first specification

Programmer’s Reference Manual establishes framework for HSA ecosystem
 
Beaverton, OR, May 29, 2013 – The HSA Foundation has released Version 0.95 of its Programmer’s Reference Manual. The HSA (Heterogeneous System Architecture) Foundation is a not-for-profit consortium dedicated to developing architecture specifications that unlock the performance and power efficiency of the parallel computing engines found in most modern devices. This is the first output from the HSA Foundation, who have been collaborating on this project since its founding in June 2012. It represents an important step in the development of the HSA Foundation’s ecosystem because it enables software partners to develop libraries, tools and middleware and to code high performance kernels.
 
The Programmer’s Reference Manual provides a standardized method of accessing all available computing resources in HSA-compliant systems. This enables a wide range of system resources to cooperate on parallelizable tasks. It has been specifically designed to perform in the most energy efficient way without compromising on performance. The goal is to enable a heterogeneous architecture that is easy to program, opens up new and rich user experiences and improves performance and quality of service, whilst reducing energy consumption.
 
The programming architecture detailed in the HSA Programmer’s Reference Manual calls out features specifically exposed to programmers of the HSA architecture. HSA devices will typically include a broad class of devices, including GPUs and DSPs, and support a number of key hardware features that enable easier developer programmability. These include shared coherent virtual memory, platform atomics, user mode queuing and GPU self-queuing.
 
These features, in conjunction with the correct software stack make programming all devices in an HSA architecture as easy as programming a CPU, and because of this, closer interlinking of processing on all devices is made possible. HSA abstracts away the native instruction set of the parallel processor through the HSA Intermediate Language (HSAIL). This language has been designed for parallel processing and can be translated on-the-fly to many native instruction sets, supporting innovations in different underlying hardware implementations through consistent HSAIL-compiled programs.
 
The HSA architecture also benefits existing APIs such as OpenCL and Renderscript through avoidance of wasteful copies, low-latency dispatch, improved memory model and shared virtual memory between all HSA devices.
 
The HSA Foundation continues to work on the next specifications which will detail the hardware system architecture, run-time details and compliance requirements.
See blog by HSA Foundation PRM Working Group Chair, Chien-Ping Lu, for a very personal and insightful commentary on the PRM odyssey and what it and HSA can mean to the community.
 
About the HSA Foundation
The HSA (Heterogeneous System Architecture) Foundation is a not-for-profit consortium for SoC IP vendors, OEMs, Academia, SoC vendors, OSVs and ISVs whose goal is to make it easy to program for parallel computing. HSA members are building a heterogeneous compute ecosystem, rooted in industry standards, for combining scalar processing on the CPU with parallel processing on the GPU while enabling high bandwidth access to memory and high application performance at low power consumption. HSA defines interfaces for parallel computation utilizing CPU, GPU and other programmable and fixed function devices, and support for a diverse set of high-level programming languages, thereby creating the next foundation in general purpose computing.
 
Quotes:
“AMD is pleased to see that the HSA Foundation is strongly united around making it natural, easy and fun for programmers to utilize the capability of heterogeneous platforms and to innovate in creating modern applications with tremendous performance at low power,” says Manju Hegde, corporate vice president, Heterogeneous Solutions, AMD.
 
“ARM believes that we can tackle industry issues only by working together in partnership. ARM has collaborated with members of the HSA Foundation since early 2011 to help define standards for heterogeneous computing. This is the Foundation’s first publication and we hope that it is one of many steps forward towards realizing fully optimized applications,” said Jem Davies, vice president of Technology, Media Processing Division and Fellow, ARM. “The Manual enables many organizations to benefit from access to this information, which means that the software ecosystem will be able to create exciting new applications for a range of form factors and devices in energy-constrained systems.”
 
“Heterogeneous processing architectures represent the future of computing. As a founder member of the HSA, Imagination is delighted to play a role in driving APIs and tools that will help SoC designers create future computing platforms. The ratification of the HSAIL language specification is another step toward making heterogeneous processing usable by a far broader app developer community.” – Tony King-Smith, EVP Marketing, Imagination Technologies
 
“The latest version of the PRM by the HSA Foundation is an important first step in providing a standardized way to access a wide range of system resources. It will contribute greatly toward achieving higher system performance in smart devices.” — Dr. Seung-jong Choi, Senior Vice President SIC Lab, LG Electronics Inc.
 
“MediaTek is a staunch supporter of heterogeneous system architecture and very pleased with the public release of HSAIL. Opening and standardizing the interface between CPU and GPU allows for parallel operation of these 2 key processors in mobile chipsets and, most importantly, creates portability of high-level software applications,” said Mohit Bhushan, VP & GM, MediaTek.
 
“As a leading supplier of low power, multicore GPU technologies for smaller, faster, cooler products, we are pleased to contribute to Version 0.95 of the HSA Programmer’s Reference Manual. This milestone builds the foundation for heterogeneous architectures to be productized in leading silicon solutions that have the most stringent die area and low power requirements,” stated Wei-Jin Dai, President and CEO of Vivante.
 

###

Media Contact:
Morgan Fricke
HSA Foundation
P: 503.619.0663
E: press@standards.hsafoundation.com
 

Good Article from our Founding Member ARM: Flipping the FLOPS – how ARM measures GPU compute performance

 

It’s time we dealt with the measurement of compute performance in GPUs. In another in a series of ARM blogs intended to enlighten and reduce the amount of confusion in the graphics industry, I’d like to cover the issue of Floating-point Operations Per Second (FLOPS, or GFLOPS or TFLOPS).


In the past, Tom Olson talked about triangles per second (“the chocolate teapot of graphics processor performance metrics”), Ed Plowman talked about pixels per second (“Of Philosophy and When is a Pixel Not a Pixel?”), Sean Ellis addressed floating-point precision (“At Home on the Range – Why Floating Point Formats Matter in Graphics”) and hopefully we managed to amuse people as well as educate. Today let’s look at compute performance – it’s a useful measure.
Competition is good…
… But open and honest competition is better. The market for GPUs is very competitive, with a number of companies supplying IP as well as those who make their own, for inclusion in SoCs. I love competition; how else can you win if you don’t have competition? Or, as one of the most competitive people I know said to me: “What is the point in competing if you don’t win?” (she was a runner, but suffice to say there are a lot of people round here who want to win at anything they commit to). In this competitive environment, we know that our partners can sometimes struggle to understand performance metrics for GPUs. They need to compare the offerings from multiple suppliers and pick the right product for their needs. This can be a complex subject, but it doesn’t have to be as complex as some try to make it. I want to win on honest, open metrics…
Graphics is compute
Graphics is a really computationally intensive problem – you have to do lots of arithmetic in it, which is one reason people have been interested in utilising those capabilities for more than “just” graphics. To draw stuff, we start off by describing some objects in a three-dimensional space by dividing them into a number of triangles and listing the co-ordinates of each vertex of the triangles. We can argue about why we use triangles, and some have, but a triangle is simple, and the three points in it are guaranteed to form a plane. We then define some light sources and give them types and positions; we define the projection model (the camera) and give that a position; we define the colours and surface detail of the objects (made up of those triangles). Sometimes we add lots more detail; sometimes we animate the objects and make them move. After all that, we try to work out what a picture from the camera would look like, if it were projected onto a two-dimensional screen. As you can imagine, there are lots of 3-D equations to solve, and lots of trigonometry. Most of the numbers we use are floating-point numbers, so the rate at which we can perform floating-point arithmetic has a big effect on our graphics performance. It’s not the only thing, of course, but it is important. It is certainly good to understand it.


First describe the problem
In our GPUs (and lots of others) we have floating-point operations performed in all the places I described above. Some are in fixed-function units and some are in programmable units. Some examples may help here: when you load a value from a texture, the texture unit will calculate a memory address, based on the co-ordinates within the texture that you specify, and then possibly interpolate between several values in memory to produce the texture you want, possibly bi-linearly filtering between some adjacent values. And, if the texture was in a compressed format like ASTC, the values will have to be uncompressed as part of that process as well. That’s a lot of calculation (integer and floating-point). It’s very good for graphics, but utilising those units for more general-purpose compute is somewhere between a bit hard and impossible.
Some GPUs “just” do graphics and do not do general purpose compute. The Mali-400 family, for example, was designed for OpenGL ES 2.0, which has low precision requirements. Some operations need to be performed at 32-bit precision, some 24-bit and some 16-bit. OpenCL on NEON on the ARM CPU can be used as a compute companion.
Some GPUs do graphics and compute. For example, the Mali-T600 family of GPUs use the Midgard architecture (described by me in a previous blog). In that architecture, we have arithmetic pipelines that execute instructions like ADD and MUL. We have a balanced mix of scalar and vector (SIMD) units, so we can do multiple operations like that in parallel (e.g. four FP32, eight FP16). We also have dot product instructions and a bunch of trigonometry instructions (like sin, cos, tan etc.). How should you express the number of floating-point operations in a trigonometric function like sin()?
The Mali-T600 series was designed for compute and the newest graphics APIs like OpenCL, OpenGL ES 3.0 and Microsoft DirectX 11, so it supports full 32-bit precision floating-point operations conformant with IEEE-754-2008. We also do double-precision (64-bit floating-point) and, as an aside, we can also do a wide variety of integer operations including 64-bit as well (traditionally GPUs lack good integer capabilities).
To summarise, we have some GPUs with differing performance levels of integer and floating point arithmetic and differing precisions, with differing levels of usability from code.
Then define your metric

Now comes the thorny problem of how to define a metric that measures how much arithmetic is going on in a GPU: what to measure? Now here at ARM, we like to be inclusive: partnership is one of our big things, after all. So, I’m prepared to go as far as this: it doesn’t matter so much what you do, as long as you show your working (as UK teachers would say to students, i.e. explain the method you are using). However, anyone who doesn’t explain their numbers (in small print, even) must be trying to hide something, and that just won’t do. So, in the spirit of openness, how do we produce our numbers? Well, the headline is about FLOPS, so for the time being, we’re going to ignore integer arithmetic. Here are ARM’s rules:

  • ARM includes only directly-programmable arithmetic operations: classical arithmetic operations exposed to the shader programmer such as ADD, MUL, and vector versions of those.
  • We count the number of ADDs, MULs etc. (including those in dot product operations) that we can execute in one cycle, from a real piece of code in a compute shader. This is our architectural FLOPS rate (measured in FLOPS per cycle).
  • Although we can do some functions (like trig) really efficiently we don’t add anything into the mix for these – that way lies madness.
  • From a real, fully laid-out, placed-and-routed synthesis, using real physical IP libraries (e.g. TSMC 28nm HPM, specifying channel lengths etc.), we get a maximum operating frequency. We openly specify in what conditions (e.g. slow-slow silicon corner, Vdd at -10% of Vnom etc.). This is not just a PowerPoint number: our partners should easily be able to achieve this frequency. For most partners, who would use more “typical” parameters, they should easily exceed it. If you want to implement on a higher-speed process that burns more power, you can definitely exceed it. This is what we believe is right for an IP supplier. Silicon manufacturers will quote whatever frequency they guarantee their chips at.
  • We multiply the number of FLOPS per cycle by the number of arithmetic pipelines per core, then the number of cores, then by the frequency. That gives you a number of FLOPS. It’s a big number, so usually we specify a number of GFLOPS (gigaflops), but soon we’ll be using teraflops – we have teraflop cores being developed for delivery this year.
  • For the Mali-T600 series, the headline number is single-precision (32-bit floating-point). We quote a second number which is double-precision (64-bit) FLOPS. For most “graphics” GPUs, that 64-bit number is smaller. For a GPU targeted at high-performance computing or supercomputers (and we have been asked), it might be the same, or even bigger.
  • We’ll also show shader code that actually manages to include all those operations. We’ll show any difference between real code run on real silicon and the architectural FLOPS rate. Currently we can achieve 97% of the architectural GFLOPS rate on real silicon. We believe that’s a very high percentage number compared to others. Perhaps you know better?
  • We also run benchmarks. If you need to know the execution speed of real code, this is probably more useful information to you than looking at architectural numbers! ARM likes independent, third-party benchmarks and there are a host of them to measure performance achieved (rather than architectural numbers). Common ones used for compute-intensive numerical applications are SAXPY and SGEMM, originally from the LINPACK and LAPACK BLAS libraries, although recently companies have been starting to look at GPU computing on consumer devices, e.g. with CLBenchmark from Kishonti. This is a large subject and is really best left to a later blog.
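
As a minimal sketch of the multiplication described in the rules above, the following Python computes an architectural FLOPS figure from its four factors. Every input in it (16 FLOPS per cycle per pipeline, 2 pipelines per core, 4 cores, 600 MHz) is a made-up placeholder chosen for easy arithmetic, not a figure for any Mali product, and the function names are hypothetical. The only value taken from the text is the 97% ratio of achieved to architectural rate.

```python
def architectural_gflops(flops_per_cycle, pipelines_per_core, cores, frequency_hz):
    """Peak rate = FLOPS/cycle per pipeline x pipelines/core x cores x frequency."""
    flops = flops_per_cycle * pipelines_per_core * cores * frequency_hz
    return flops / 1e9  # convert FLOPS to GFLOPS


def achieved_gflops(architectural, efficiency):
    """Scale the architectural rate by the fraction measured on real silicon."""
    return architectural * efficiency


# Hypothetical configuration, used only to illustrate the calculation.
peak = architectural_gflops(
    flops_per_cycle=16,     # assumed ADD/MUL operations per cycle per pipeline
    pipelines_per_core=2,   # assumed
    cores=4,                # assumed
    frequency_hz=600e6,     # assumed 600 MHz from a placed-and-routed synthesis
)
measured = achieved_gflops(peak, 0.97)  # 97% is the ratio quoted in the text

print(f"architectural: {peak:.1f} GFLOPS")
print(f"achieved:      {measured:.1f} GFLOPS")
```

Quoting both numbers side by side, together with the inputs that produced them, is one way to "show your working" in the sense the rules ask for.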

What we don’t do

  • ARM does not include FLOPS from fixed-function units, or things only available from graphics, e.g. texture units, blending units, varying interpolation, triangle setup, Z-culling etc.
  • We don’t include any relaxed precision operations. We only include full IEEE-compliant ops. The subject of IEEE compliance, precision and rounding modes is complex and there is room for significant confusion here. Explaining and demystifying this is best left to a later blog.
  • We don’t make any assumptions about how many operations were involved in calculating any of the library functions that might be implemented as instructions.
  • We don’t quote a theoretical maximum frequency that we cannot justify from a real layout/synthesis. We can provide the EDA tools report to back up our claims.
  • We don’t quote a maximum frequency for ridiculously hot, leaky processes that cannot be sensibly used by most of our partners.
  • We don’t multiply the number we come up with by the ZIP code of our office in San Jose, or shift left by the telephone number of our HQ…:)


And finally

I have described how we define and produce our architectural FLOPS numbers. It should give you all the ammunition you need to go and question your supplier about how they calculate theirs. Hopefully that will lead to useful, productive conversations. Maybe we need a standard. Maybe it will lead to us changing the way we define our numbers to match others’ methods. That’s OK, as long as we’re open about it.
I’ve also indicated the role that benchmarks need to play in describing real-world performance. We need to get industry agreement about which benchmarks matter. Too many benchmarks can lead to confusion.
Like our method? Hate it? Think we’re wrong? Want to suggest anything different? Got any amusing tales to tell about how some others do it? Let us know. Feel free to comment on this blog.
Jem is an ARM Fellow and likes to think of himself as “The Godfather” to technical talent in ARM. After spending some time in his youth writing software for satellites and traffic-lights among other fascinating things, Jem spotted the technical inflection point of the mobile industry: graphics, video and other visual computing. As VP of technology in the Media Processing Division of ARM, Jem is busy with a lot of projects involving the future of cool ARM technology, which will revolutionise how people experience and interact with digital devices.

Direct Link to article.
http://blogs.arm.com/multimedia/950-flipping-the-flops-how-arm-measures-gpu-compute-performance/

Tensilica Joins HSA Foundation to Help Establish Standards for Embedded Heterogeneous Computing

 
SANTA CLARA, Calif. – March 19, 2013 – Tensilica®, Inc. today announced that it has joined the HSA (Heterogeneous System Architecture) Foundation, a not-for-profit consortium dedicated to developing architecture specifications that will unlock the performance and power efficiency of parallel computing engines found in many modern devices. Tensilica will contribute its years of experience assisting customers in bringing heterogeneous multicore SoC (system-on-chip) designs to market to the development and promotion of standards for parallel computing.
 
“Tensilica is a long-established leader in multicore technology, delivering unique solutions that enable both control plane and compute-intensive dataplane functions,” stated Steve Roddy, Tensilica’s vice president of product marketing and business development. “Tensilica customers today use multiple Tensilica processors for diverse functions such as audio offload, wireless baseband, image processing and general purpose control. We welcome the efforts and ambitions of the HSA to bring standards to the market that will greatly facilitate innovation in embedded applications.”
 
“Tensilica is a recognized industry pioneer in dataplane processor technology and multicore solutions, and we look forward to their valued contributions to the HSA Foundation,” said Greg Stoner, vice president and managing director of the HSA Foundation. “Tensilica’s dataplane processors are widely used by the world semiconductor leaders, and by embracing the standards established by the HSA Foundation, can reduce time-to-market while improving both performance and power efficiency.”
Tensilica’s DPUs (dataplane processing units) are used in chip designs for smartphones, digital televisions, tablets, personal and notebook computers, and storage and networking applications. These DPUs are most often used to offload and accelerate the compute-intensive tasks from the main CPU. Therefore, developing an efficient heterogeneous system architecture is of critical importance to designers using Tensilica’s DPUs.
 
About Tensilica
Tensilica, Inc. is the leader in dataplane processor IP core licensing with over 200 licensees. Dataplane processors (DPUs) combine the best capabilities of DSPs and CPUs while delivering 10 to 100x the performance because they can be optimized using Tensilica’s automated design tools to meet specific and demanding signal processing performance targets. Tensilica’s DPUs power SOC designs at system OEMs and seven out of the top 10 semiconductor companies for designs in mobile wireless, telecom and network infrastructure, computing and storage, and home and auto entertainment. Tensilica offers standard cores and hardware/software solutions that can be used as is or easily customized by semiconductor companies and OEMs for added differentiation. For more information on Tensilica’s patented, benchmark-proven DPUs visit www.tensilica.com.
 
About the HSA Foundation
The HSA (Heterogeneous System Architecture) Foundation is a not-for-profit consortium for SoC IP vendors, OEMs, academia, SoC vendors, OSVs and ISVs whose goal is to make it easy to program for parallel computing. HSA members are building a heterogeneous compute ecosystem, rooted in industry standards, for combining scalar processing on the CPU with parallel processing on the GPU while enabling high bandwidth access to memory and high application performance at low power consumption. HSA defines interfaces for parallel computation utilizing CPU, GPU and other programmable and fixed function devices, and support for a diverse set of high-level programming languages, thereby creating the next foundation in general purpose computing. For more information, visit www.hsafoundation.com.

# # #

Tensilica is a registered trademark belonging to Tensilica, Inc. All other company and product names mentioned are trademarks and/or registered trademarks of their respective owners.
 
Paula Jones – Director, Corporate Communications, Tensilica
Phone: 408-327-7343     Fax: 408-986-8919  Cell: 650-279-8997
Email: paula@tensilica.com  Web: www.tensilica.com
 

Happy Holidays From HSA Foundation

Looking back at the last six months, we have had exceptional acceptance of the effort to bring HSA forward as a true standard. We are now at 22 companies and 5 academic members, with still more coming in 2013.
It was just a little over a year ago that I stepped into AMD to see how we could move HSA beyond the vision Phil Rogers and his team had at AMD and turn it into an industry standard that truly scales from deeply embedded devices, smartphones, smart TVs and PCs all the way up to HPC-class systems. We are now on the path to making this happen.
We have strong involvement from the best engineers and innovators at Apical, AMD, ARM, Arteris, Ceva, Codeplay, DMP, Fabric Engine, Imagination Technologies, LG, Marvell, MediaTek, MultiCoreWare, Qualcomm, Samsung, Sonic, ST, ST Ericsson, Symbio, Tensilica, TI and Vivante, all driving forward with a single vision: to drive innovation around heterogeneous computing.
I am also proud to say we have started to attract some of the best minds in academia to bring HSA to the next level. It feels good to be starting 2013 right.

  • Professor Simon McIntosh-Smith – University of Bristol, Microelectronic Group
  • Professor Michael O’Boyle – University of Edinburgh, Director of Institute for Computing Systems Architecture
  • Professor Sarita Adve – University of Illinois at Urbana-Champaign (thanks to Hans Boehm for introducing us to Professor Adve)
  • Professor JenqKuen Lee – NTHU Programming Language Lab
  • Professor Yeh-Ching Chung – NTHU Systems Software Lab

One last thing: I am also happy to report that we are close to bringing our first specification to ratification, the HSA Programmer Reference Guide. After it is ratified we will be making this spec public sometime in Q1/2013.

Looking forward to what 2013 has in store for HSA.
 
Happy Holidays
Gregory Stoner
Managing Director
HSA Foundation

GPU Science Articles: Heterogeneous System Architecture: Purpose and Outlook

Great article on GPU Science about the HSA Foundation, based on a Moor Insights and Strategy white paper.
——————————————————————————————————————-
Moor Insights and Strategy was commissioned by AMD to produce a report, “Heterogeneous System Architecture (HSA): Purpose and Outlook”.
The HSA (Heterogeneous System Architecture) Foundation, known as the “HSAF”, is an open, industry-standard consortium founded to define and deliver open standards and tools that let hardware and software take full advantage of high-performance parallel compute engines in the lowest possible power envelope. This new environment will enable rich new user experiences never seen before, at incredibly low power.
read more at
http://gpuscience.com/cs/heterogeneous-system-architecture-purpose-and-outlook/