Good Article from our Founding Member ARM: Flipping the FLOPS – how ARM measures GPU compute performance

 

It’s time we dealt with the measurement of compute performance in GPUs. In another in a series of ARM blogs intended to enlighten and reduce the amount of confusion in the graphics industry, I’d like to cover the issue of Floating-point Operations Per Second (FLOPS, or GFLOPS or TFLOPS).

In the past, Tom Olson talked about triangles per second (“the chocolate teapot of graphics processor performance metrics”), Ed Plowman talked about pixels per second (“Of Philosophy and When is a Pixel Not a Pixel?”), Sean Ellis addressed floating-point precision (“At Home on the Range – Why Floating Point Formats Matter in Graphics”) and hopefully we managed to amuse people as well as educate. Today let’s look at compute performance – it’s a useful measure.
Competition is good…
… But open and honest competition is better. The market for GPUs is very competitive, with a number of companies supplying IP as well as those who make their own, for inclusion in SoCs. I love competition; how else can you win if you don’t have competition? Or, as one of the most competitive people I know said to me: “What is the point in competing if you don’t win?” (she was a runner, but suffice to say there are a lot of people round here who want to win at anything they commit to). In this competitive environment, we know that our partners can sometimes struggle to understand performance metrics for GPUs. They need to compare the offerings from multiple suppliers and pick the right product for their needs. This can be a complex subject, but it doesn’t have to be as complex as some try to make it. I want to win on honest, open metrics…
Graphics is compute
Graphics is a really computationally intensive problem – you have to do lots of arithmetic in it, which is one reason people have been interested in utilising those capabilities for more than “just” graphics. To draw stuff, we start off by describing some objects in a three-dimensional space by dividing them into a number of triangles and listing the co-ordinates of each vertex of the triangles. We can argue about why we use triangles, and some have, but a triangle is simple, and the three points in it are guaranteed to form a plane. We then define some light sources and give them types and positions; we define the projection model (the camera) and give that a position; we define the colours and surface detail of the objects (made up of those triangles). Sometimes we add lots more detail; sometimes we animate the objects and make them move. After all that, we try to work out what a picture from the camera would look like, if it were projected onto a two-dimensional screen. As you can imagine, there are lots of 3-D equations to solve, and lots of trigonometry. Most of the numbers we use are floating-point numbers, so the rate at which we can perform floating-point arithmetic has a big effect on our graphics performance. It’s not the only thing, of course, but it is important. It is certainly good to understand it.
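To give a feel for the arithmetic involved, here is a minimal Python sketch of one small step in that process: projecting the vertices of a single triangle onto a two-dimensional image plane. It assumes a simple pinhole camera at the origin looking down the +z axis; the function names and the sample triangle are hypothetical, and a real pipeline does far more (transforms, clipping, lighting, rasterisation).

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Vec2 = Tuple[float, float]


def project_vertex(v: Vec3, focal_length: float = 1.0) -> Vec2:
    """Pinhole projection of a camera-space point onto the plane z = focal_length."""
    x, y, z = v
    if z <= 0.0:
        raise ValueError("vertex must be in front of the camera (z > 0)")
    # One divide and two multiplies per vertex, all in floating point.
    scale = focal_length / z
    return (x * scale, y * scale)


def project_triangle(tri: List[Vec3], focal_length: float = 1.0) -> List[Vec2]:
    """Project each of the three vertices of a triangle."""
    return [project_vertex(v, focal_length) for v in tri]


# Hypothetical triangle in camera space (assumed values, for illustration only).
triangle = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 4.0)]
projected = project_triangle(triangle)
print(projected)
```

Even this toy step costs a few floating-point operations per vertex; multiply that by every vertex and every pixel in a scene and the importance of floating-point throughput becomes clear.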

First describe the problem
In our GPUs (and lots of others) we have floating-point operations performed in all the places I described above. Some are in fixed-function units and some are in programmable units. Some examples may help here: when you load a value from a texture, the texture unit will calculate a memory address, based on the co-ordinates within the texture that you specify, and then possibly interpolate between several values in memory to produce the texture you want, possibly bi-linearly filtering between some adjacent values. And, if the texture was in a compressed format like ASTC, the values will have to be uncompressed as part of that process as well. That’s a lot of calculation (integer and floating-point). It’s very good for graphics, but utilising those units for more general-purpose compute is somewhere between a bit hard and impossible.
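The texture example can be sketched in a few lines of Python. This is a simplified illustration, not a description of how any Mali texture unit is implemented: it assumes a single-channel, uncompressed texture stored as a list of rows, maps normalised coordinates directly onto the texel grid, and clamps at the edges. The function name and the sample texture are hypothetical.

```python
import math
from typing import List


def bilinear_sample(texture: List[List[float]], u: float, v: float) -> float:
    """Sample a single-channel texture at normalised coordinates (u, v) in [0, 1]."""
    height = len(texture)
    width = len(texture[0])
    # Address calculation: map normalised coordinates into texel space.
    x = u * (width - 1)
    y = v * (height - 1)
    x0 = int(math.floor(x))
    y0 = int(math.floor(y))
    # Clamp the neighbouring texel to the edge of the texture.
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)
    fx = x - x0
    fy = y - y0
    # Interpolate along x on the two rows, then along y between the rows.
    top = texture[y0][x0] * (1.0 - fx) + texture[y0][x1] * fx
    bottom = texture[y1][x0] * (1.0 - fx) + texture[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy


# Hypothetical 2x2 texture (assumed values, for illustration only).
texture = [[0.0, 1.0],
           [2.0, 3.0]]
centre = bilinear_sample(texture, 0.5, 0.5)
print(centre)
```

Each filtered sample here involves several multiplies and adds on top of the address arithmetic, and none of that work is directly programmable, which is why it is hard to reuse for general-purpose compute.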
Some GPUs “just” do graphics and do not do general-purpose compute. The Mali-400 family, for example, was designed for OpenGL ES 2.0, which has low precision requirements. Some operations need to be performed at 32-bit precision, some 24-bit and some 16-bit. OpenCL on NEON on the ARM CPU can be used as a compute companion.
Some GPUs do graphics and compute. For example, the Mali-T600 family of GPUs uses the Midgard architecture (described by me in a previous blog). In that architecture, we have arithmetic pipelines that execute instructions like ADD and MUL. We have a balanced mix of scalar and vector (SIMD) units, so we can do multiple operations like that in parallel (e.g. four FP32 or eight FP16). We also have dot product instructions and a bunch of trigonometry instructions (like sin, cos, tan etc.). How should you express the number of floating-point operations in a trigonometric function like sin()?
The Mali-T600 series was designed for compute and the newest graphics APIs like OpenCL, OpenGL ES 3.0 and Microsoft DirectX11, so it supports full 32-bit precision floating-point operations conformant with IEEE-754-2008. We also do double-precision (64-bit floating-point) and, as an aside, we can also do a wide variety of integer operations including 64-bit as well (traditionally GPUs lack good integer capabilities).
To summarise, we have some GPUs with differing performance levels of integer and floating point arithmetic and differing precisions, with differing levels of usability from code.
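To see why precision matters when comparing numbers, here is a small Python sketch that stores the same value in 16-bit, 32-bit and 64-bit IEEE 754 formats and reads it back. It uses the standard `struct` module's `e`, `f` and `d` format codes; the helper name `round_trip` and the test value 0.1 are just choices for illustration, and this runs on the CPU rather than on any GPU.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Pack a Python float into the given IEEE 754 format and unpack it again."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


value = 0.1
as_fp16 = round_trip(value, "<e")  # half precision (16-bit)
as_fp32 = round_trip(value, "<f")  # single precision (32-bit)
as_fp64 = round_trip(value, "<d")  # double precision (64-bit)

for name, stored in (("FP16", as_fp16), ("FP32", as_fp32), ("FP64", as_fp64)):
    # The absolute error shrinks as the format gets wider.
    print(f"{name}: stored {stored!r}, error {abs(stored - value):.3e}")
```

An operation counted at 16-bit precision is plainly not the same thing as one counted at 32-bit or 64-bit, so any FLOPS figure should say which precision it refers to.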
Then define your metric

Now comes the thorny problem of how to define a metric that measures how much arithmetic is going on in a GPU: what to measure? Now here at ARM, we like to be inclusive: partnership is one of our big things, after all. So, I’m prepared to go as far as this: it doesn’t matter so much what you do, as long as you show your working (as UK teachers would say to students, i.e. explain the method you are using). However, anyone who doesn’t explain their numbers (in small print, even) must be trying to hide something, and that just won’t do. So, in the spirit of openness, how do we produce our numbers? Well, the headline is about FLOPS, so for the time being, we’re going to ignore integer arithmetic. Here are ARM’s rules:

  • ARM includes only directly-programmable arithmetic operations: classical arithmetic operations exposed to the shader programmer such as ADD, MUL, and vector versions of those.
  • We count the number of ADDs, MULs etc. (including those in dot product operations) that we can execute in one cycle, from a real piece of code in a compute shader. This is our architectural FLOPS rate (measured in FLOPS per cycle).
  • Although we can do some functions (like trig) really efficiently we don’t add anything into the mix for these – that way lies madness.
  • From a real, fully laid-out, placed-and-routed synthesis, using real physical IP libraries (e.g. TSMC 28nm HPM, specifying channel lengths etc.), we get a maximum operating frequency. We openly specify in what conditions (e.g. slow-slow silicon corner, Vdd at -10% of Vnom etc.). This is not just a PowerPoint number: our partners should easily be able to achieve this frequency. For most partners, who would use more “typical” parameters, they should easily exceed it. If you want to implement on a higher-speed process that burns more power, you can definitely exceed it. This is what we believe is right for an IP supplier. Silicon manufacturers will quote whatever frequency they guarantee their chips at.
  • We multiply the number of FLOPS per cycle by the number of arithmetic pipelines per core, then the number of cores, then by the frequency. That gives you a number of FLOPS. It’s a big number, so usually we specify a number of GFLOPS (gigaflops), but soon we’ll be using teraflops – we have teraflop cores being developed for delivery this year.
  • For the Mali-T600 series, the headline number is single-precision (32-bit floating-point). We quote a second number which is double-precision (64-bit) FLOPS. For most “graphics” GPUs, that 64-bit number is smaller. For a GPU that we would target at high-performance computing or supercomputers (and we have been asked), it might be the same, or even bigger.
  • We’ll also show shader code that actually manages to include all those operations. We’ll show any difference between real code run on real silicon and the architectural FLOPS rate. Currently we can achieve 97% of the architectural GFLOPS rate on real silicon. We believe that’s a very high percentage number compared to others. Perhaps you know better?
  • We also run benchmarks. If you need to know the execution speed of real code, this is probably more useful information to you than looking at architectural numbers! ARM likes independent, third-party benchmarks and there are a host of them to measure performance achieved (rather than architectural numbers). Common ones used for compute-intensive numerical applications are SAXPY and SGEMM, originally from the LINPACK and LAPACK BLAS libraries, although recently companies have been starting to look at GPU computing on consumer devices, e.g. with CLBenchmark from Kishonti. This is a large subject and is really best left to a later blog.
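
The multiplication described in the rules above can be sketched in Python as follows. Everything here is an assumption made for illustration: the class name `GpuConfig`, the function names, and above all the numbers, which were picked to keep the arithmetic easy to follow and are not figures for any Mali GPU. The dot-product helper shows one plausible way to count the ADDs and MULs in an n-element dot product (n multiplies plus n - 1 adds).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuConfig:
    """Hypothetical GPU description; all fields are illustrative assumptions."""
    flops_per_cycle_per_pipe: int  # ADDs/MULs issued per cycle by one arithmetic pipeline
    pipes_per_core: int            # arithmetic pipelines per shader core
    cores: int                     # number of shader cores
    frequency_hz: float            # operating frequency from a real layout


def dot_product_flops(n: int) -> int:
    """An n-element dot product needs n multiplies and n - 1 adds."""
    return n + (n - 1)


def architectural_gflops(cfg: GpuConfig) -> float:
    """FLOPS per cycle x pipelines per core x cores x frequency, in GFLOPS."""
    flops_per_second = (cfg.flops_per_cycle_per_pipe
                        * cfg.pipes_per_core
                        * cfg.cores
                        * cfg.frequency_hz)
    return flops_per_second / 1e9


def efficiency(measured_gflops: float, peak_gflops: float) -> float:
    """Fraction of the architectural rate achieved by real code on real silicon."""
    return measured_gflops / peak_gflops


# Made-up configuration: 16 FLOPS/cycle/pipe, 2 pipes, 4 cores, 500 MHz.
example = GpuConfig(flops_per_cycle_per_pipe=16, pipes_per_core=2,
                    cores=4, frequency_hz=500e6)
peak = architectural_gflops(example)
print(f"architectural rate: {peak} GFLOPS")
print(f"efficiency at a measured 62.08 GFLOPS: {efficiency(62.08, peak):.0%}")
```

The point of writing it out is that every factor is visible: anyone quoting a GFLOPS number should be able to say what they used for each of the four terms, and how close measured code gets to the result.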

What we don’t do

  • ARM does not include FLOPS from fixed-function units, or things only available from graphics, e.g. texture units, blending units, varying interpolation, triangle setup, Z-culling etc.
  • We don’t include any relaxed precision operations. We only include full IEEE-compliant ops. The subject of IEEE compliance, precision and rounding modes is complex and there is room for significant confusion here. Explaining and demystifying this is best left to a later blog.
  • We don’t make any assumptions about how many operations were involved in calculating any of the library functions that might be implemented as instructions.
  • We don’t quote a theoretical maximum frequency that we cannot justify from a real layout/synthesis. We can provide the EDA tools report to back up our claims.
  • We don’t quote a maximum frequency for ridiculously hot, leaky processes that cannot be sensibly used by most of our partners.
  • We don’t multiply the number we come up with by the ZIP code of our office in San Jose, or shift left by the telephone number of our HQ…:)


And finally

I have described how we define and produce our architectural FLOPS numbers. It should give you all the ammunition you need to go and question your supplier about how they calculate theirs. Hopefully that will lead to useful, productive conversations. Maybe we need a standard. Maybe it will lead to us changing the way we define our numbers to match others’ methods. That’s OK, as long as we’re open about it.
I’ve also indicated the role that benchmarks need to play in describing real-world performance. We need to get industry agreement about which benchmarks matter. Too many benchmarks can lead to confusion.
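As a small illustration of the difference between an architectural number and an achieved one, here is a Python sketch of SAXPY, one of the benchmarks mentioned earlier, together with a crude timing loop. SAXPY computes a*x + y element by element, which is one multiply and one add (2 FLOPs) per element. The function names, vector size and repeat count are arbitrary choices, and because this is pure Python running on a CPU, the rate it reports reflects interpreter overhead rather than the capability of any hardware.

```python
import time
from typing import List


def saxpy(a: float, x: List[float], y: List[float]) -> List[float]:
    """Return a * x + y element-wise: one MUL and one ADD per element."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def measure_saxpy_flops(n: int = 100_000, repeats: int = 5) -> float:
    """Return the best achieved FLOPS rate over several timed SAXPY runs."""
    x = [float(i) for i in range(n)]
    y = [1.0] * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        saxpy(2.0, x, y)
        best = min(best, time.perf_counter() - start)
    # 2 floating-point operations per element.
    return (2 * n) / best


result = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)
print(f"achieved: {measure_saxpy_flops() / 1e6:.1f} MFLOPS (pure Python, CPU)")
```

A real benchmark would run an optimised kernel on the device under test, but the bookkeeping is the same: count the operations the code performs, divide by the measured time, and compare with the architectural rate.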
Like our method? Hate it? Think we’re wrong? Want to suggest anything different? Got any amusing tales to tell about how some others do it? Let us know. Feel free to comment to this blog.
Jem is an ARM Fellow and likes to think of himself as “The Godfather” to technical talent in ARM. After spending some time in his youth writing software for satellites and traffic-lights among other fascinating things, Jem spotted the technical inflection point of the mobile industry: graphics, video and other visual computing. As VP of technology in the Media Processing Division of ARM, Jem is busy with a lot of projects involving the future of cool ARM technology, which will revolutionise how people experience and interact with digital devices.

Direct Link to article.
http://blogs.arm.com/multimedia/950-flipping-the-flops-how-arm-measures-gpu-compute-performance/

Tensilica Joins HSA Foundation to Help Establish Standards for Embedded Heterogeneous Computing

 
SANTA CLARA, Calif. – March 19, 2013 – Tensilica®, Inc. today announced that it has joined the HSA (Heterogeneous System Architecture) Foundation, a not-for-profit consortium dedicated to developing architecture specifications that will unlock the performance and power efficiency of parallel computing engines found in many modern devices. Tensilica will contribute its years of experience assisting customers in bringing heterogeneous multicore SoC (system-on-chip) designs to market to the development and promotion of standards for parallel computing.
 
“Tensilica is a long-established leader in multicore technology, delivering unique solutions that enable both control plane and compute-intensive dataplane functions,” stated Steve Roddy, Tensilica’s vice president of product marketing and business development. “Tensilica customers today use multiple Tensilica processors for diverse functions such as audio offload, wireless baseband, image processing and general purpose control. We welcome the efforts and ambitions of the HSA to bring standards to the market that will greatly facilitate innovation in embedded applications.”
 
“Tensilica is a recognized industry pioneer in dataplane processor technology and multicore solutions, and we look forward to their valued contributions to the HSA Foundation,” said Greg Stoner, vice president and managing director of the HSA Foundation. “Tensilica’s dataplane processors are widely used by the world semiconductor leaders, and by embracing the standards established by the HSA Foundation, can reduce time-to-market while improving both performance and power efficiency.”
Tensilica’s DPUs (dataplane processing units) are used in chip designs for smartphones, digital televisions, tablets, personal and notebook computers, and storage and networking applications. These DPUs are most often used to offload and accelerate the compute-intensive tasks from the main CPU. Therefore, developing an efficient heterogeneous system architecture is of critical importance to designers using Tensilica’s DPUs.
 
About Tensilica
Tensilica, Inc. is the leader in dataplane processor IP core licensing with over 200 licensees. Dataplane processors (DPUs) combine the best capabilities of DSPs and CPUs while delivering 10 to 100x the performance because they can be optimized using Tensilica’s automated design tools to meet specific and demanding signal processing performance targets. Tensilica’s DPUs power SOC designs at system OEMs and seven out of the top 10 semiconductor companies for designs in mobile wireless, telecom and network infrastructure, computing and storage, and home and auto entertainment. Tensilica offers standard cores and hardware/software solutions that can be used as is or easily customized by semiconductor companies and OEMs for added differentiation. For more information on Tensilica’s patented, benchmark-proven DPUs visit www.tensilica.com.
 
About the HSA Foundation
The HSA (Heterogeneous System Architecture) Foundation is a not-for-profit consortium for SoC IP vendors, OEMs, academia, SoC vendors, OSVs and ISVs whose goal is to make it easy to program for parallel computing. HSA members are building a heterogeneous compute ecosystem, rooted in industry standards, for combining scalar processing on the CPU with parallel processing on the GPU while enabling high bandwidth access to memory and high application performance at low power consumption. HSA defines interfaces for parallel computation utilizing CPU, GPU and other programmable and fixed function devices, and support for a diverse set of high-level programming languages, thereby creating the next foundation in general purpose computing. For more information, visit www.hsafoundation.com.

# # #

Tensilica is a registered trademark belonging to Tensilica, Inc. All other company and product names mentioned are trademarks and/or registered trademarks of their respective owners.
 
Paula Jones – Director, Corporate Communications, Tensilica
Phone: 408-327-7343     Fax: 408-986-8919  Cell: 650-279-8997
Email: paula@tensilica.com  Web: www.tensilica.com
 

Happy Holidays From HSA Foundation

Looking back at the last six months, we have seen exceptional acceptance of the effort to bring HSA forward as a true standard. We are now at 22 companies and 5 academic members, with still more coming in 2013.
It was just a little over a year ago that I stepped into AMD to see how we could move HSA beyond a vision Phil Rogers and his team had at AMD to become an industry standard that truly scales from deeply embedded devices, smartphones, smart TVs and PCs all the way up to HPC-class systems. We are now on the path to truly make this happen.
We have strong involvement from the best engineers and innovators from Apical, AMD, ARM, Arteris, CEVA, Codeplay, DMP, Fabric Engine, Imagination Technologies, LG, Marvell, MediaTek, MultiCoreWare, Qualcomm, Samsung, Sonic, ST, ST Ericsson, Symbio, Tensilica, TI, and Vivante, all driving forward with a single vision: to drive innovation around heterogeneous computing.
I am also proud to say we have started to attract some of the best minds in academia to bring HSA to the next level. It feels good to be starting 2013 right.

  • Professor Simon McIntosh-Smith – University of Bristol, Microelectronic Group
  • Professor Michael O’Boyle – University of Edinburgh, Director of the Institute for Computing Systems Architecture
  • Professor Sarita Adve – University of Illinois at Urbana-Champaign (we would like to thank Hans Boehm for introducing us to Professor Adve)
  • Professor JenqKuen Lee – NTHU Programming Language Lab
  • Professor Yeh-Ching Chung – NTHU Systems Software Lab

One last thing: I am also happy to report that we are close to ratifying our first specification, the HSA Programmer Reference Guide. After it is ratified, we will make this spec public sometime in Q1 2013.
 
Looking forward to what 2013 has in store for HSA.
 
Happy Holidays
Gregory Stoner
Managing Director
HSA Foundation

GPU Science Articles: Heterogeneous System Architecture: Purpose and Outlook

Great article on GPU Science about the HSA Foundation, based on a Moor Insights and Strategy white paper.
——————————————————————————————————————-
Moor Insights and Strategy was commissioned by AMD to produce a report, “Heterogeneous System Architecture (HSA): Purpose and Outlook”.
The HSA (Heterogeneous System Architecture) Foundation, known as the “HSAF”, is an open, industry-standard consortium founded to define and deliver open standards and tools for hardware and software to fully take advantage of the high performance of parallel compute engines, and to do so in the lowest possible power envelope. This new environment will enable rich new user experiences never seen before, delivered at incredibly low power.
Read more at
http://gpuscience.com/cs/heterogeneous-system-architecture-purpose-and-outlook/

Ceva and Tensilica are new HSA Foundation Members

We welcome CEVA and Tensilica to the HSA Foundation. They both add a new dimension to the Foundation as leaders in acceleration solutions for mobile, networking, automotive and the digital home. Together with Qualcomm and TI, CEVA and Tensilica truly branch HSA into another acceleration domain, with support for non-GPU compute devices.

Techcon Keynote 2012: Sensor Integration and Improved User Experiences at Even Lower Power – HSA

Speaker: Phil Rogers

HSA Foundation President and AMD Corporate Fellow, AMD

Phil Rogers is an AMD Corporate Fellow and President of the HSA Foundation. At AMD, Phil is the lead architect for the Heterogeneous System Architecture, focused on drastically reducing the power consumed when running modern applications, and enabling the software ecosystem for heterogeneous computing.

Sessions

  Sensor Integration and Improved User Experiences at Even Lower Power – HSA

Location: Grand Ballroom C
Thursday, November 1, 2012, 1:30 PM-2:20 PM

HSA is a new computing platform architecture being standardized by the HSA Foundation, whose founding members are AMD, ARM, Imagination, TI, MediaTek, Samsung and Qualcomm. HSA is intended to make heterogeneous programming widespread by making purpose-built architectures as easy to program as modern CPUs. We start by doing this with the GPU, the most widely deployed companion processor to the CPU and one which especially complements the CPU in low power and performance workloads. This requires some hardware architecture changes that we have been working on for some time (in particular those that enable user mode scheduling, unified address space, unified shared memory, compute context switching, etc.) and which we have encapsulated into the spec currently under review by the HSA Foundation.
In short, HSA codifies the hardware architecture changes that are needed to enable mainstream programmers to develop heterogeneous applications with the same facility as CPU-only applications, by seamlessly integrating the sequential programming capability of the CPU with the parallel compute capability of the GPU. We describe the software stacks that are needed for HSA and the benefits that accrue to both developers and end users, and describe our vision of how HSA will help unify the ecosystems of the smartphone and tablet platforms as well as bring them closer to that of the traditional PC market. We will provide analysis of several examples which arise in applications and present data to validate the performance-per-watt benefit of HSA.

SMP, Asymmetric Multiprocessing And The HSA Foundation

When we hear the term “multiprocessing,” we often associate it with “symmetric multiprocessing (SMP).” This is because of SMP’s initial prevalence in the high-performance computing world, and now in x86/x64 servers and PCs. However, it’s been known for years that SMP’s ability to scale performance as the number of cores increases is poor. (For more information on SMP’s inability to scale well, read Jack Ganssle’s 2008 embedded.com article, “The Nulticore Effect,” or the IEEE Spectrum/Sandia Labs article, “Multicore is Bad News for Supercomputers: Adding cores slows data-intensive applications.”)
See more in this article by Kurt Shuler, VP of Marketing at Arteris, a member of the HSA Foundation, at http://chipdesignmag.com/sld/shuler/2012/09/27/smp-asymmetric-multiprocessing-and-the-hsa-foundation/
 

Arteris Joins Heterogeneous Systems Architecture (HSA) Foundation, Contributing NoC Interconnect Expertise

Network-on-chip pioneer’s protocol-agnostic technology and experience to help accelerate creation and adoption of standardized heterogeneous programming model
SUNNYVALE, California – August 31, 2012 – Arteris Inc., the inventor and leading supplier of network-on-chip (NoC) interconnect IP solutions, today announced that it has joined the recently announced HSA Foundation as a Supporter member and will be actively involved in numerous working groups. The organization offers an open, standards-based approach to heterogeneous computing, seeking to provide a set of common hardware and software specifications allowing software developers to more easily take advantage of modern heterogeneous processors. Arteris joins the HSA Foundation to bring to realization its vision of true “plug-and-play” heterogeneous IP core integration on systems-on-chip (SoCs). This will finally enable hardware designers to provide software developers the flexibility to make coordinated optimizations for speed, power consumption, and cost. For more information on the HSA Foundation, go to www.hsafoundation.com
Arteris will contribute its decade of experience optimizing NoC technology to connect multiple CPU, GPU, and other IP cores on SoCs. In addition, Arteris system IP technology enables on-chip visibility at runtime for real-time system performance monitoring and tuning. This gives higher level application programming interfaces (APIs) and operating system kernels the information to make energy efficient thread scheduling decisions. Arteris’ technical know-how, coupled with other HSA member companies’ technology, will unlock the performance and power efficiency of the computing engines found in heterogeneous processors.
“The next era of heterogeneous computing will be enabled by industry leaders coming together around a standards-based approach and programming model,” said Greg Stoner, vice president and managing director of the HSA Foundation. “HSA is built on technological leadership and innovation from companies like Arteris that are creating SoC IP focused on delivering unique user experiences and bringing greater efficiencies in the broad set of devices extending from mobile well into the cloud.”
“The heterogeneous processor SoC market is expanding rapidly and our industry is on the verge of making significant technical strides in multicore computing,” said K. Charles Janac, President and CEO of Arteris. “Arteris will contribute key pieces of this puzzle, not only with our interconnect technology but also system IP that provides critical visibility to optimize system performance. We look forward to working with other HSA members to achieve a standardized heterogeneous programming model.”
About Arteris
Arteris, Inc. provides Network-on-Chip interconnect IP and tools to accelerate System-on-Chip semiconductor (SoC) assembly for a wide range of applications. Results obtained by using the Arteris product line include lower power, higher performance, more efficient design reuse and faster development of ICs, SoCs and FPGAs.
 
Founded by networking experts and offering the first commercially available Network-on-Chip IP products, Arteris operates globally with headquarters in Sunnyvale, California and an engineering center in Paris, France. Arteris is a private company backed by a group of international investors including ARM Holdings, Crescendo Ventures, DoCoMo Capital, Qualcomm Incorporated, Synopsys, TVM Capital, and Ventech. More information can be found at www.arteris.com.
About Heterogeneous System Architecture (HSA)
Developers will benefit from the open standard programming of HSA for both the CPU and GPU, which allows the two processors to work cooperatively and directly in system memory. Additionally, HSA provides a single architecture across multiple operating systems and hardware designs. By maximizing the full compute capabilities of systems with both CPUs and GPUs, users can see performance and energy efficiency boosts across a variety of applications.
 
Arteris, FlexNoC and the Arteris logo are trademarks of Arteris. All other product or service names are the property of their respective owners.

###

For more Arteris information, contact:
Kurt Shuler
Arteris, Inc.
+1 408-470-7300
kurt.shuler@arteris.com

We are counting down to Phil Rogers’ IFA 2012 Keynote

A reminder if you’re at IFA 2012: go see the HSA Foundation President deliver the opening day keynote. See Phil Rogers, HSA Foundation President and AMD Corporate Fellow, reveal the next era of computing innovation.

Phil Rogers is one of two featured keynote speakers on opening day of IFA 2012, Friday, Aug. 31. Phil’s keynote, titled “The Next Era of Computing Innovation,” will examine the rapidly evolving computing and consumer electronics markets and show how new technologies are rapidly changing the way users interact with their electronics.
From the use of gestures to augmented reality and biometric recognition to the ever expanding video-centric world that we live in, this keynote will not just ask audience members to imagine a place, but will show them innovative applications that are available today as well as what’s coming. The HSA Foundation invites IFA 2012 attendees to join Phil and other industry leaders in the International Keynote Area, Hall 6.3 on Friday from 3:00 to 3:45 p.m.