Bringing C++ AMP Beyond Windows via Clang and LLVM

We are happy to report that, after some great work by MultiCoreWare in conjunction with support from AMD and Microsoft, we are today releasing a C++ AMP compiler based on Clang/LLVM so that we can bring C++ AMP to multiple platforms. We want to bring this out early so that we can work with the community and make sure we get their input before making this 1.0. So we are calling on all developers who are looking for a heterogeneous C++ compiler to help with finding bugs, driving features and creating optimizations, as well as building applications and libraries that drive a new class of applications.
You can get access to the compiler at the Bitbucket repository link:
https://bitbucket.org/multicoreware/cppamp-driver/
We also have Samples:
https://bitbucket.org/multicoreware/cxxamp_sandbox
FEATURES:
* Compiles C++ AMP to OpenCL C and Khronos Group Provisional SPIR 1.2 for Linux. Works across major GPU platforms.
* Leverages GMAC for CPU-GPU synchronization on non-HSA GPUs.
TODOs/Ongoing work:
* Fix the SPIR code generation issue. Right now system headers do not flow through the SPIR path, and that causes host code to fail compilation.
* HSAIL code generation and HSA-optimized layout
* Passing MS C++AMP conformance suite
* Async API
* Better address space support — right now small changes to user code are required when taking/passing a pointer to local memory buffers. See samples for details.
* Merge into official Clang main line
 
Remember, C++ AMP already has a rich set of libraries, which Microsoft has released under the Apache License.

  1. C++ AMP Algorithms Library (STL-style Algorithms)
  2. C++ AMP RNG Library (Random Number Generator)
  3. C++ AMP FFT Library (Fast Fourier Transform)
  4. C++ AMP BLAS Library (Basic Linear Algebra Subprograms)
  5. C++ AMP LAPACK Library (Linear Algebra Package)

Asymmetric Multiprocessing with Heterogeneous Architectures: Use the Best Tool for the Job

Contributor: Arteris SA

September 6, 2013 - Often, the term “multiprocessing” is associated with tightly-coupled symmetric multiprocessing (SMP) architectures, due in large part to SMP’s prevalence in high-performance computing, x86/x64 servers, and PCs. Unfortunately, SMP’s incremental performance scaling for most applications decreases significantly with increasing numbers of cores. This lack of scalability has prompted many processor companies to avoid purely SMP solutions for their mobile and consumer electronics applications. Instead, they have implemented asymmetric multiprocessing (AMP) architectures to make more efficient use of silicon.
An example of AMP is a mobile phone’s modem baseband SOC, containing an ARM processor and a DSP to handle control and signal processing, respectively. AMP architectures are also found in mobile phone application processors, which have multiple CPU cores and separate discrete graphics cores, video cores, audio cores and imaging cores. Heterogeneous architectures also dominate in most embedded consumer applications, such as digital TVs, set-top boxes, and automotive infotainment.
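
The diminishing returns of adding identical cores can be illustrated with Amdahl's law. The article does not cite it; it is brought in here only as a standard textbook model, and the 90% parallel fraction used in the discussion below is an assumed figure, not a measurement. A minimal sketch:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup under Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def marginal_gain(parallel_fraction: float, cores: int) -> float:
    """Extra speedup obtained by adding one more core to `cores` cores."""
    return (amdahl_speedup(parallel_fraction, cores + 1)
            - amdahl_speedup(parallel_fraction, cores))
```

Under this model, a workload that is 90% parallel can never run more than 10 times faster no matter how many identical cores are added, and each additional core contributes less than the one before it.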
 
 

Figure 1. The Qualcomm Snapdragon 800 is an example of a system-on-chip that implements an asymmetric multiprocessing (AMP) architecture with multiple processing units optimized for different functions. Source: Qualcomm.

 

Heat and power drive architecture decisions

Mobile applications face significant design constraints because of battery size and heat dissipation. As a result, processor designers are forced to use “the best core for the job.” So architectures in mobility have always been created from a baseline expectation of heterogeneous core AMP.
Server and PC chips have relatively unlimited power consumption and heat dissipation capabilities, making an SMP architecture tolerable. In these applications, it is often easier to add more cores of the same type, connect them using cache coherency, and reuse the legacy software to run on top. Comparatively little attention has been paid to heat dissipation and power consumption.
But PCs are becoming smaller and mobile. And server farms are eyeing power consumption as well, forcing designers to reconsider SMP architectures. For example, for server farms that power the likes of Google and Facebook, power consumption and heat dissipation have become huge cost and environmental issues. And in the PC space, we have run into a “gigahertz wall” where the only way to have a step function increase in performance is to have different cores optimized for different workload types.

AMP architectures struggle to break into PC/server applications

Why don’t AMP architectures dominate PC and server applications? Because they’re hard to implement!
In mobile designs, each heterogeneous processing core, whether graphics, audio, DSP, etc., typically has a custom firmware and software stack associated with it. This software must be integrated to communicate with the CPU cores’ operating system, requiring coding work in the OS hardware abstraction layer and drivers. In addition, these heterogeneous cores do not have a single view of system memory, so complicated synchronization schemes are usually implemented in hardware and software. Context switching and preemption are difficult to implement. Adding to the challenge, each of these cores requires an expert programmer, conversant in a particular core’s instruction set and tool chains, to code it.
These barriers have forced AMP to remain in the mobile and consumer electronics realm, which is closed to low-level, close-to-the-hardware software developers. Alternatively, SMP has flourished in the wide-open world of PCs and servers, aided by the ease of programming.
Heterogeneous system architectures (HSA) can span the chasm between mobile/consumer applications and PC/server applications, easing the design burden while delivering performance, scalability, improved heat dissipation and reduced power consumption.
Recently, a number of companies, including AMD, ARM, Imagination, MediaTek, Qualcomm, Samsung and Texas Instruments, founded the HSA Foundation. HSA defines interfaces for parallel computation utilizing CPU, GPU, and other programmable and fixed-function devices, and support for a diverse set of high-level programming languages, thereby creating the next foundation in general-purpose computing.
Its goals are to:

  • Make heterogeneous programming easy and a first-class pervasive complement to CPU computing.
  • Continue to increase the power efficiency of heterogeneous systems (AMP), keeping it the platform of choice from smartphones to the cloud.
  • Bring to market strong development solutions (tools, libraries, OS run-times) to drive innovative advanced content and applications.
  • Foster growth of heterogeneous computing talent through HSA developer training and academic programs to drive both learning and innovation.

The HSA approach requires a technical framework and architecture

There are several issues that must be addressed to successfully bring these two worlds together:

  • Unified programming model – Today, CPU and GPU (or other accelerator) cores are programmed separately, with the accelerator treated as a remote processor. To make the maximum use of hardware resources while balancing ease of programming, heterogeneous architectures should allow developers to target the CPU or GPU by writing in task-parallel languages, like the ones they use today when writing for multicore CPUs.
  • Unified address space – HSA supports virtual address translation amongst the heterogeneous cores with an HSA-specific memory management unit (HMMU). HSA compute engines will use the same pageable virtual address space as used by CPUs today.
  • Queuing – CPUs, GPUs and other cores can queue tasks to each other and to themselves through an HSA run-time. Queuing can be managed in hardware to avoid OS system calls and enable very low latency communication between cores.
  • Preemption and context switching – HSA enables job preemption, job scheduling and fault handling capabilities to overcome potential problems created by rogue or faulted processes.
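
The queuing idea in the list above can be sketched as a ring buffer that a producer writes to directly, with no system call on the submission path. This is only an illustration: the class and method names (`UserModeQueue`, `enqueue`, `drain`, `Packet`) are hypothetical and are not the HSA run-time API, and a plain Python callable stands in for a kernel.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Packet:
    """A unit of work: a kernel (any callable) plus its argument."""
    kernel: Callable[[int], int]
    arg: int


class UserModeQueue:
    """Fixed-size ring buffer shared by a producer and a consumer."""

    def __init__(self, capacity: int) -> None:
        self._slots: List[Optional[Packet]] = [None] * capacity
        self._capacity = capacity
        self._write_index = 0  # next slot the producer fills
        self._read_index = 0   # next slot the consumer drains

    def enqueue(self, packet: Packet) -> bool:
        """Producer side: write a packet straight into the ring."""
        if self._write_index - self._read_index == self._capacity:
            return False  # queue full; the caller retries later
        self._slots[self._write_index % self._capacity] = packet
        self._write_index += 1  # stands in for signalling the consumer
        return True

    def drain(self) -> List[int]:
        """Consumer side: run every pending packet in submission order."""
        results = []
        while self._read_index < self._write_index:
            slot = self._read_index % self._capacity
            packet = self._slots[slot]
            results.append(packet.kernel(packet.arg))
            self._slots[slot] = None
            self._read_index += 1
        return results
```

Either side can act as producer or consumer, which mirrors the point that cores can queue tasks to each other and to themselves.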

HSA Foundation provides key tools for unlocking heterogeneous programming

Today, CPUs and GPUs do not share a common view of system memory, requiring an application to explicitly copy data between the two devices. In addition, an application running on the CPU that wants to add work to the GPU’s queue must execute system calls that communicate through the CPU operating system’s device driver stack, and then communicate with a separate scheduler that manages the GPU’s work. This adds significant run-time latency, in addition to being very difficult to program.
HSA addresses the need for easy software programming of GPUs to take advantage of their unique capability to crunch parallel workloads much more efficiently than x86 or ARM CPUs.

HSA solution stack: Abstracting away hardware specifics

To enable easier programming, HSA allows developers to program at a higher abstraction level using mainstream programming languages and additional libraries. This HSA solution stack includes several components.
The key to enabling one language for heterogeneous core programming is to have an intermediate run-time layer that abstracts hardware specifics away from the software developer, leaving the hardware-specific coding to be done once by the hardware vendor or IP provider. The core of this intermediate layer is the HSA Intermediate Language or “HSAIL.”
 
 

Figure 2. The HSA Intermediate Language (HSAIL) is an intermediate run-time layer that abstracts hardware specifics away from the software developer. Source: AMD.

 
The HSA run-time stack is created by compiling a high-level language such as C++ with the HSA compilation stack. HSA’s compilation stack is based on the LLVM infrastructure, which is also used in OpenCL from the Khronos Group.
Creation of HSAIL can occur prior to run-time or during run-time. Here are two examples: The OpenCL run-time includes the compiler stack and is called at run-time to execute a program that is already in data-parallel form. Alternatively, Microsoft’s C++ AMP (C++ Accelerated Massive Parallelism) uses the compiler stack during program compilation rather than execution. The C++ AMP compiler extracts data-parallel code sections and runs them through the HSA compiler stack, and passes non-parallel code through the normal compilation path.
Figure 3 shows the HSA compilation stack, where programming code is compiled into HSAIL using the LLVM compilation infrastructure:
 
 

Figure 3. The HSA compilation stack creates the HSA Intermediate Language (HSAIL) prior to or during run-time. Source: AMD.

 

The hardware-specific HSA Finalizer is a key component

A key role is played by the hardware-specific “finalizer” which converts HSAIL to the computing unit’s native instruction set. Hardware and IP vendors are responsible for creating finalizers that support their hardware. The finalizer is lightweight and can be run at compile time, installation time or run-time depending on requirements.
Figure 4 shows the HSAIL and its path through the HSA run-time stack:
 
 

Figure 4. The hardware-specific components of the HSA run-time stack are the HSA Finalizer and the hardware driver. Source: AMD.

 
The HSA Finalizer is the point at which the specifics of different heterogeneous computing units are addressed. Initial HSA implementations will most likely support GPU compute with finalizers from GPU vendors such as AMD, Imagination, ARM, and Qualcomm. The quality and features of each vendor’s HSA Finalizer will help determine how software developers take advantage of each hardware element’s computing capabilities.
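
The division of labour described above, one portable intermediate form and one finalizer per hardware target, can be sketched as follows. Everything here is a toy under stated assumptions: the tuple-based instruction format is invented for illustration and is not HSAIL, and `vendor_a` and `vendor_b` are hypothetical targets with made-up native syntaxes.

```python
from typing import Callable, Dict, List, Tuple

# One portable instruction: (opcode, destination, source_a, source_b).
IrInstruction = Tuple[str, str, str, str]
Finalizer = Callable[[List[IrInstruction]], List[str]]


def finalize_for_vendor_a(program: List[IrInstruction]) -> List[str]:
    """Hypothetical target whose native syntax is 'OP dst, a, b'."""
    return [f"{op.upper()} {dst}, {a}, {b}" for op, dst, a, b in program]


def finalize_for_vendor_b(program: List[IrInstruction]) -> List[str]:
    """Hypothetical target whose native syntax is 'dst = a op b'."""
    symbols = {"add": "+", "mul": "*"}
    return [f"{dst} = {a} {symbols[op]} {b}" for op, dst, a, b in program]


# Each hardware vendor registers its own finalizer once.
FINALIZERS: Dict[str, Finalizer] = {
    "vendor_a": finalize_for_vendor_a,
    "vendor_b": finalize_for_vendor_b,
}


def finalize(program: List[IrInstruction], target: str) -> List[str]:
    """Translate the same portable program for the chosen target."""
    try:
        finalizer = FINALIZERS[target]
    except KeyError:
        raise ValueError(f"no finalizer registered for {target!r}") from None
    return finalizer(program)
```

The application developer produces the portable program once; only the registry entries differ per vendor.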

Benefiting from heterogeneous architectures requires smart scheduling

In addition to GPUs, many existing heterogeneous architectures have additional discrete processing units for functions such as audio (digital signal processing or stream processing), image and video processing (SIMD frame processing), and security. As HSA matures, hardware and IP vendors creating these processing units may want to enable HSA programmability on their hardware by creating hardware-specific finalizers.
Having multiple heterogeneous processing units will complicate workload scheduling from a system perspective. The harsh reality is that existing workload scheduling and OS scheduling algorithms are relatively simple and generally only take into account local activity on a processing unit or a cluster of homogeneous processing units (see the Linux Completely Fair Scheduler for one example of how scheduling is implemented).

Interconnect fabric-assisted scheduling is required to implement scalable HSA systems

Existing OS and middleware scheduling algorithms do not take into account the existing traffic throughout the system, nor a view into other processing units. This lack of a global perspective for scheduling virtually guarantees there will be contention and stalling as processing units wait for access to precious system resources, especially the DRAM. It’s like looking out the front door of your house to determine how bad the traffic will be on your commute to work: You are missing very relevant information that could help you determine the optimal route to take.
Probing current run-time data flows at critical points throughout a system’s SOC interconnect fabric can provide critical information to enhance workload scheduling. This information can then be used to assign priorities to workloads, and workloads to processing units. These priorities and assignments can be optimized based on performance requirements or power consumption requirements, as required for a particular use case. As heterogeneous processing becomes the norm, and more processing units are added to a system, this type of interconnect-assisted scheduling will be required.
In other words, the hardware interconnect is a key enabler to putting the heterogeneous into HSA.
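
The difference between a scheduler that sees only local load and one that also sees probed interconnect traffic can be sketched as below. This is a minimal illustration, not a description of any shipping scheduler: the `UnitState` fields, the 0-to-1 utilization scale, and the simple additive cost with a `traffic_weight` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UnitState:
    name: str
    local_load: float      # 0..1, all that a local-only scheduler sees
    fabric_traffic: float  # 0..1, utilization probed on the path to DRAM


def pick_local_only(units: List[UnitState]) -> str:
    """Choose the unit with the lowest local load, ignoring the fabric."""
    return min(units, key=lambda u: u.local_load).name


def pick_fabric_aware(units: List[UnitState],
                      traffic_weight: float = 1.0) -> str:
    """Choose the unit with the lowest combined load and traffic cost."""
    return min(
        units,
        key=lambda u: u.local_load + traffic_weight * u.fabric_traffic,
    ).name
```

With a lightly loaded unit sitting behind a congested path to DRAM, the two functions pick different units, which is the "front door" problem described above. The weight could be tuned toward performance or power depending on the use case.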

Resources

For more guidance on heterogeneous system architectures, visit the HSA Foundation or the Arteris websites.
“Heterogeneous System Architecture: A Technical Review” whitepaper by George Kyriazis (AMD), HSA Foundation, August 2012.
The HSA Compilation and Run-time Stack diagrams are from the whitepaper by George Kyriazis cited above.
 

By Kurt Shuler
Kurt Shuler is Vice President of Marketing, Arteris, Inc.
 
Go to the Arteris SA website to learn more.
http://www.soccentral.com/results.asp?EntryID=41133


HOT CHIPS 2013 - HSA Foundation Presented Deeper Detail on HSA and HSAIL

 

For those wanting to find out more about HSA: at Hot Chips 2013, Phil Rogers (AMD), Ben Gaster (Qualcomm), Ian Bratt (ARM), and Ben Sander (AMD) presented on HSA, the HSA Memory Model, the HSA Queueing Model and HSAIL this last Sunday. We now have the presentations posted on our developer publications page (http://107.170.238.52/publications/) and media presentations page (http://107.170.238.52/pubs-presos/), as well as on the HSA Foundation Slideshare (http://www.slideshare.net/hsafoundation).

Dig into the material and see if you want to join the exciting future of HSA-enabled devices.


HSAIL: Write-Once-Run-Everywhere for Heterogeneous Systems – IEEE article

Ben Sander of AMD and Chien-Ping Lu of MediaTek, HSA Foundation working group leader for the HSA Programmer’s Reference Manual, have penned a nice article on HSAIL and HSA technology.
 
“Power efficiency has emerged as a primary design goal for modern silicon chips.  Accelerators such as GPUs have well-known advantages in compute density per-watt and per-mm^2 – note for example that the systems at the top of the latest Green500 (http://www.green500.org/) and Top500 (http://www.top500.org/) lists are now based on heterogeneous designs.
However, these systems have traditionally been difficult to program, due to two challenges.  First, many accelerators support only dedicated address spaces that require cumbersome copy operations and prevent the use of pointer-based data structures on both the accelerator and the host processor.   Second, accelerator programming has traditionally required a specialized language such as OpenCL™ or CUDA™.  Some of these specialized languages are only supported by a single hardware vendor, which further constrains their adoption.
An intermediate language called HSAIL is helping to address some of the challenges. One of the benefits of HSAIL is its portability across multiple vendor products. Compilers that generate HSAIL can be assured that the resulting code will be able to run on a wide variety of target platforms. HSAIL also provides existing programming languages with an efficient parallel intermediate language that runs on a wide variety of hardware. This provides the underlying infrastructure and brings the benefits of heterogeneous computing to existing, popular programming models such as Java™, OpenMP™, C++, and more.” ... Read more at the link below:
http://www.computer.org/portal/web/computingnow/software%20engineering/content?g=53319&type=article&urlTitle=hsail%3A-write-once-run-everywhere-for-heterogenous-systems

What’s the Heterogeneous Point? The HSA Foundation provides an answer


// May 30th, 2013 // MultimediaProcessors
Our industry is littered with over-used terms whose meaning becomes ever more jaded as more people use them. We’re in danger of “heterogeneous” being another of them. But I sincerely hope we can live with that, because heterogeneous systems are going to be with us for a very long time to come. One of the first steps towards this is the recent announcement of the ratification of the HSAIL language specification from the HSA Foundation.
For as long as I’ve been involved in semiconductor processors – well over 30 years now – the desire for us to remove the bottlenecks of sequential processing has been insatiable. It hasn’t been solved until now for the simple reason that any traditional processor architecture has always suffered from the basic problem that any platform, no matter how clever, was always too “niche” and low volume for the mainstream software community to consider adopting in any meaningful way. And as we all know, the software community dwarfs the hardware community, and, more importantly, implements the code that enables our hardware brilliance to come to life for real end users. And software developers need high platform volumes for their software to be profitable. Hence, nothing ended up happening – until now.
At last we can see a way forward, thanks to the mass market adoption of traditional sequential CPUs combined with high-performance parallel processor-based GPUs in billions of mobile phones, tablets and other mass market products. At last, the software industry can move on (profitably) from the limitations of sequential processing into a world where processing scales linearly with silicon nodes, where processing efficiency per mW and per mm² leaps to new heights, and where the sheer breadth of processing power at any one time from low-end to high-end is measured in orders of magnitude – all at mobile power consumption levels.

Heterogeneous processing – the killer combination of processors

This CPU + GPU combination is a true heterogeneous processor: multiple datapath architectures, each very different in ISA and capabilities, but working together under the control of a single application. But how do we program these new beasts? Well, we’re doing it today: we use graphics APIs at higher levels of abstraction to talk to the GPU – and it works extremely well. Not surprising really – haven’t we actually been writing these heterogeneous applications for decades already, thanks to our games consoles and GPU-enriched PCs?
HSA Foundation infographic
The HSA Foundation was formed as an open industry standards body to unify the computing industry around a common approach*
But now we want to do more. We want to use all that processing power in the GPU not just for graphics, but for other things like image processing, database searching, fluid dynamics – all sorts of things requiring processing horsepower way beyond what the best mobile CPUs can hope to deliver. How do we write applications like that? Do we need new languages?
No – we just need new abstractions and APIs such as Khronos’ OpenCL to help us. But we need more: since these are performance-driven applications, often with demanding real-time constraints such as user interactivity measured in milliseconds, we need to ensure every part of our heterogeneous processor is being used effectively.

Enter the HSA Foundation

That’s where technologies such as HSA come in. The HSA Foundation was created by industry leaders including AMD, Imagination, MediaTek, Qualcomm and Samsung, to ensure that applications can manage their execution not only on the CPUs and GPUs in a system, but also the infrastructure connecting them. By creating an open standard around how to connect CPUs to GPUs and other processors, we break the dependence of such advanced apps on any particular chip or CPU or GPU architecture. This enables the silicon industry to innovate by allowing multiple vendors to create competing solutions – fuelling the innovation the semiconductor industry is so famous for.
Since such heterogeneous SoCs (Systems on Chip) are expected to often be in mobile platforms with limited memory bandwidth and tight power budgets, how we schedule low-level tasks and assign them to various parts of a GPU or one of the CPU cores in the system is critical to an application utilising the full capabilities of any heterogeneous SoC. The APIs associated with HSA (and other forms of heterogeneous processing) will be key to enabling this.
Applications will also need to get ever more sophisticated in their “discovery” phase during startup, where they explore the system they are running on to find out what processing resources they have, what bandwidths are available, and much more.
Heterogeneous applications targeting tomorrow’s SoCs have a lot to cope with – problems HSA and other heterogeneous approaches will help to solve. But by solving them in an open, standards-based way, we’ll end up with a software industry that is highly motivated to deliver apps using these open standards. These apps will not only achieve functionality we’ve never dreamt of; they’ll also adapt in ever more ingenious ways to whatever resources they have available to them.
It’s another exciting decade ahead for the world of computing! Have a look at the HSA Foundation website and their recent blog articles to find out more, and make sure you follow us on Twitter (@ImaginationPR, @GPUCompute and @HSAFoundation) to stay updated on the developments behind this exciting partnership.
 
* Image courtesy of AMD, all rights reserved.

Tony King-Smith

Tony joined Imagination Technologies in 2006 and is the company’s EVP of Marketing, responsible for all segment and technology marketing, communications, OEM relationships and ecosystems. He has extensive experience in product and segment marketing including many blue chip corporate relationships. Prior to Imagination, Tony held senior engineering and marketing positions with Panasonic, Hitachi (now Renesas), LSI Logic and INMOS.
 
http://withimagination.imgtec.com/index.php/multimedia/whats-the-heterogeneous-point-the-hsa-foundation-provides-an-answer

First delivery from Heterogeneous Systems Architecture Foundation

 
Yesterday the Heterogeneous Systems Architecture (HSA) Foundation released Version 0.95 of its Programmer’s Reference Manual. This release is the first yield from the Foundation and is the product of an entire year’s collaboration between leading companies throughout the entire length of the heterogeneous computing value chain, from silicon to IP to ISV.


Of course I have written here before about the founding of the HSA Foundation, its values and ARM’s early commitment to join it in defining an appropriate set of standards for the industry. Since that post, the HSA Foundation has striven to generate the defining standards of heterogeneous computing – so this Manual is the result of much hard work all-round. I am therefore naturally very pleased to see this release announced by the Foundation. This Reference Manual will now be the building block upon which ARM’s critical ecosystem of software developers, from tools to middleware, can design their products and, through this, develop the field of HSA, allowing the Foundation to succeed in its aim of combining the best capabilities of CPU, GPU and accompanying technologies.
ARM has contributed to the Foundation’s working groups many of its top experts from the fields of computer processing, graphics processing, interconnect and compiler technology and software. These colleagues of mine are working alongside HSA ecosystem partners, ensuring that what the Foundation delivers is based on the knowledge and experience of the entire breadth of the industry: an example of the ARM partnership model at its finest.
Version 0.95 of the Programmer’s Reference Manual is the initial output in a line of specifications that the HSA Foundation will be releasing over the months to come.
Hopefully, the next to appear will be the Hardware System Architecture Specification…!
Jem is an ARM Fellow and likes to think of himself as “The Godfather” to technical talent in ARM. After spending some time in his youth writing software for satellites and traffic-lights among other fascinating things, Jem spotted the technical inflection point of the mobile industry: graphics, video and other visual computing. As VP of technology in the Media Processing Division of ARM, Jem is busy with a lot of projects involving the future of cool ARM technology, which will revolutionise how people experience and interact with digital devices.

 
All company and product names appearing in the ARM Blogs are trademarks and/or registered trademarks of ARM Limited per ARM’s official trademark list. All other product or service names mentioned herein are the trademarks of their respective owners.

 
http://blogs.arm.com/multimedia/973-first-delivery-from-heterogeneous-systems-architecture-foundation/?sf13357134=1

HSA Foundation has just released version 0.95 of the Programmer’s Reference Manual, which we affectionately refer to as “the HSAIL spec”

The HSA Foundation has just released version 0.95 of the Programmer’s Reference Manual, which we affectionately refer to as “the HSAIL spec”. This has been in development for more than a year, and I’m proud to finally be able to share our work with the external world. My role in the process was the working-group spec editor for the 0.95 version. The spec also benefitted significantly from the contributions of Norm Rubin (who wrote the original draft) and Tony Tye (who polished the final one), as well as the contributions from the many architects and experts from the companies in the working group.
 
The spec describes HSAIL (HSA Intermediate Language, pronounced “H-Sale”).  HSAIL is a low-level intermediate language for a wide variety of parallel processor architectures (including GPUs) supported by members of the HSA Foundation.   HSAIL is a preferred target for library writers and back-end compiler developers who want to target HSA compute devices, and who want to deliver their own optimizations and control the compilation.  HSAIL is architected such that register allocation and other complex compiler optimizations are done before HSAIL is generated, which leads to a robust and fast translation from HSAIL to the device instruction set.  HSAIL also includes a well-defined relaxed memory model including load.acquire, store.release, barrier, and fine-grained barrier operations.   HSAIL is designed to support a wide variety of high-level programming models such as OpenCL™, OpenMP™, C++, and Java.  HSAIL also defines a binary format called “BRIG” that can be embedded in executable files alongside the code for the host CPU instruction-set.
 
Writing in HSAIL is similar to writing in assembly language for a RISC CPU: the language uses a load/store architecture, supports fundamental integer and floating point operations, branches, atomic operations and multi-media operations, and uses a fixed-size pool of registers. HSAIL also contains built-in support for function pointers, exceptions and debugging information. Additionally, HSAIL defines group memory, hierarchical synchronization primitives, and wavefronts that should look familiar to programmers of GPU computing devices.
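
To make "load/store architecture with a fixed-size pool of registers" concrete, here is a toy interpreter. It is not HSAIL: the instruction tuples, the opcode names and the pool size of 8 are all invented for illustration. What it does show is the defining property that arithmetic touches only registers, while memory is reached solely through explicit loads and stores.

```python
from typing import Dict, List, Tuple

NUM_REGISTERS = 8  # illustrative pool size, not taken from the HSAIL spec


def _check_register(index: int) -> None:
    """Reject register numbers outside the fixed pool."""
    if not 0 <= index < NUM_REGISTERS:
        raise ValueError(f"register {index} is outside the fixed pool")


def run(program: List[Tuple], memory: Dict[int, int]) -> Dict[int, int]:
    """Execute a toy load/store program and return the updated memory."""
    regs = [0] * NUM_REGISTERS
    mem = dict(memory)  # work on a copy; the caller's dict is untouched
    for instr in program:
        op = instr[0]
        if op == "ld":    # ("ld", reg, addr): memory -> register
            _, reg, addr = instr
            _check_register(reg)
            regs[reg] = mem[addr]
        elif op == "st":  # ("st", reg, addr): register -> memory
            _, reg, addr = instr
            _check_register(reg)
            mem[addr] = regs[reg]
        elif op in ("add", "mul"):  # arithmetic uses registers only
            _, dst, a, b = instr
            for index in (dst, a, b):
                _check_register(index)
            if op == "add":
                regs[dst] = regs[a] + regs[b]
            else:
                regs[dst] = regs[a] * regs[b]
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return mem
```

A program that computes `mem[2] = mem[0] * mem[1] + mem[0]` therefore needs two loads, two register operations and one store.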
 
Today, many accelerator devices have separate address spaces that require cumbersome copy operations, and prevent complex pointer-containing data structures from being used on the accelerator and host. HSA platforms address this challenge by requiring that all HSA Components can access the same shared, coherent memory space with high performance. HSAIL helps to address the other major challenge with programming heterogeneous computing devices: today accelerator programmers typically have to use a dedicated “compute language” such as OpenCL™ or CUDA™ to access the power of the accelerator. A primary goal of HSA is to bring the power of these compute devices to the programming languages that developers are already using, in a natural and easy-to-use manner. We are seeing this already with the introduction of C++ AMP, Java Aparapi, and Bolt – programmers are able to access parallel compute resources using programming models that are no more complex than those used for multi-core CPUs. HSAIL adds the benefits of a portable IR, yet is still low-level enough to give language and compiler vendors control over the code generation and associated optimizations. Additionally, HSAIL is a royalty-free open standard – and open standards spur innovation, eliminate reliance on a single vendor, and always win over time. The HSA Foundation will be providing publicly available assembler and disassembler tools for HSAIL, and will additionally provide a code generator back-end for the popular LLVM compiler infrastructure. LLVM already contains front-end parsers for many popular programming models. Combined with language constructs to identify parallel regions (some of which already exist), these can be naturally extended to leverage HSAIL and the power efficiency and performance benefits of heterogeneous computing.
 
It has been a long journey and we are excited to share the next steps with the broader development community!
 
-Ben Sander
AMD Fellow & Architect for HSA
Main Spec Editor for HSA Programmer Reference
 

HSA Programmer Reference: The Formation Of The New Specification

The HSA Foundation was founded on June 12, 2012, and discussions among board members from founding companies commenced. We quickly reached a consensus to set the work of specifying HSA in motion, and to form the Programmer’s Reference Manual (PRM) working group as early as possible. We also strove to leverage the examples set by successful standards bodies. We picked Khronos as our role model. The first two meetings of the working group, held on Aug 24 and Aug 31 of 2012, produced a Statement of Work, a meeting format, a schedule and meeting frequency. The first working group of the young organization embarked on its journey. The atmosphere of the working group was extremely friendly and cooperative. In a couple of weeks, we were able to associate the voices with the names, through some trial and error and, of course, friendly reminders. Our original plan was to complete the work in 9 weeks, so that we could submit the spec for ratification by the end of 2012. Finishing the work in 9 weeks proved to be too ambitious. Three additional months were needed to produce a version deemed ready for the public.
 
The ultimate mission of HSA is to advance Parallel Computing with GPUs, or any other kind of programmable device, to the next level in terms of ease of programming and power efficiency. We needed to repeatedly remind ourselves to strike a balance between the current state of the art and forward-looking ideas that go beyond the conventional way of programming GPUs or, for that matter, any SIMD-style processor. By looking at use cases that do not yet exist in the marketplace, we also needed to revisit some common themes in computing, such as precision, cache coherency and memory consistency, again and again. The goal is to create a standard that is practical for industry-wide adoption and that also leaves room for future innovation and differentiation.
 
The PRM, commonly referred to as the HSAIL (HSA Intermediate Language) spec, plays a central role in such a revolution in Parallel Computing. It provides a reference for HSAIL, which is intended to decouple software development from hardware development. One key and differentiating feature of HSAIL is that it is positioned as a virtual ISA for any programmable computing device participating in an HSA-compliant system. Programmers can assume that there is an HSAIL virtual machine supporting HSAIL, and all practical concerns and issues regarding performance and power can be addressed with respect to such a “machine”. Hardware designers can build their HSA-compliant computing devices with the goal of executing HSAIL code, through efficient Just-in-Time compilation, as close to the metal as possible. The HSAIL virtual machine is essentially a load/store architecture, supporting fundamental integer and floating-point operations, branches, atomic operations and multimedia operations, and using a fixed-size pool of registers. Additionally, the machine supports group memory, hierarchical synchronization primitives, and wavefronts, which, though they look familiar to GPU programmers, could potentially be leveraged in non-GPU computing devices as well.
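For readers unfamiliar with the term, the following C++ sketch shows what "load/store architecture with a fixed-size pool of registers" means in general. It is a toy interpreter, not HSAIL: the opcodes, the instruction encoding, the register count of 8 and the names run and sum_program are all assumptions made up for this illustration and do not come from the PRM. The point it shows is that arithmetic and branches operate only on registers, while memory is touched only by explicit load and store instructions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy opcodes for illustration only; these are NOT HSAIL.
enum class Op { Load, Store, AddImm, Add, BranchIfLess, Halt };

struct Instr {
    Op op;
    int a;  // first operand (usually a destination register index)
    int b;  // second operand (a register index)
    int c;  // third operand (register index, immediate, or branch target)
};

constexpr std::size_t kNumRegs = 8;  // fixed-size register pool (size is arbitrary)
using Regs = std::array<std::int32_t, kNumRegs>;

// Executes `program` against `memory` and returns the final register file.
// All registers start at zero.
Regs run(const std::vector<Instr>& program, std::vector<std::int32_t>& memory) {
    Regs r{};
    std::size_t pc = 0;
    while (pc < program.size()) {
        const Instr& in = program[pc];
        std::size_t next = pc + 1;
        switch (in.op) {
            case Op::Load:  // r[a] = memory[r[b] + c]; the only way to read memory
                r[in.a] = memory.at(static_cast<std::size_t>(r[in.b] + in.c));
                break;
            case Op::Store:  // memory[r[b] + c] = r[a]; the only way to write memory
                memory.at(static_cast<std::size_t>(r[in.b] + in.c)) = r[in.a];
                break;
            case Op::AddImm:  // r[a] = r[b] + c (register and immediate)
                r[in.a] = r[in.b] + in.c;
                break;
            case Op::Add:  // r[a] = r[b] + r[c] (registers only)
                r[in.a] = r[in.b] + r[in.c];
                break;
            case Op::BranchIfLess:  // if r[a] < r[b], jump to instruction c
                assert(in.c >= 0);
                if (r[in.a] < r[in.b]) next = static_cast<std::size_t>(in.c);
                break;
            case Op::Halt:
                return r;
        }
        pc = next;
    }
    return r;
}

// Builds a program that sums memory[0..n) into r2 and stores the result
// at memory[n]. r0 is the loop index, r1 the limit, r3 a scratch register.
std::vector<Instr> sum_program(int n) {
    return {
        {Op::AddImm, 1, 1, n},        // 0: r1 = n
        {Op::BranchIfLess, 0, 1, 3},  // 1: if r0 < r1 goto 3
        {Op::Halt, 0, 0, 0},          // 2: nothing to sum
        {Op::Load, 3, 0, 0},          // 3: r3 = memory[r0]
        {Op::Add, 2, 2, 3},           // 4: r2 += r3
        {Op::AddImm, 0, 0, 1},        // 5: r0 += 1
        {Op::BranchIfLess, 0, 1, 3},  // 6: loop while r0 < r1
        {Op::Store, 2, 1, 0},         // 7: memory[r1] = r2
        {Op::Halt, 0, 0, 0},          // 8: done
    };
}
```

A caller would prepare a memory vector with n inputs plus one slot for the result, then call run(sum_program(n), memory). A real HSAIL finalizer would compile such code to a native ISA rather than interpret it; the interpreter here only keeps the sketch short.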
 
For middleware, library and compiler developers, HSAIL is a perfect target because of its low-level nature and its stability and universality compared to native hardware ISAs. They can invest in R&D on top of HSAIL and be sure that they will get a return through the HSAIL ecosystem. Application developers can optimize their code manually in HSAIL, and/or leverage third-party HSAIL development tools or environments, and be confident that the real-world performance and efficiency of applications developed this way will match their expectations. Such assurance is achieved through hardware vendors striving to optimize their HSA-compliant devices for HSAIL. Since HSAIL defines a virtual machine, not a physical one, hardware companies can innovate and differentiate in their native ISAs and micro-architectures. One of the coolest things about HSAIL is that it can potentially enable an ecosystem in which advances in Parallel Computing happen independently and synergistically between software and hardware companies.
 
For a young foundation, releasing a spec within 6 months is truly an amazing feat. Although AMD provided an initial draft that was nearly complete in terms of features, many of these features required careful reexamination and re-specification. Foundation members sent their best architects to participate, with the mandate to give this work priority. Because of the high quality of collaboration, most issues were resolved through consensus. Only two issues had to be resolved by ballot. One ballot question decided whether we should treat FP64 as optional in the base profile. The second ballot question could not be avoided: it was the vote for ratification! Among the issues resolved by consensus, naming and specifying the profiles was the stickiest. Due to different views on technology roadmaps, historical backgrounds and market positioning, the working group could not reach an agreement, and we asked the Board of Directors to arbitrate. And just as the US Supreme Court will sometimes return a case to the lower courts, the board sent the issue back to the working group! The working group reconsidered the issue and found a consensus.
 
The work of the PRM WG continues. There are many cross-group issues, for example, linkage, where the PRM WG plays a necessary role. Additionally, features continue to be examined and tuned. We have also turned our attention to enablement of implementation, by providing the HSAIL grammar and syntax in EBNF format. And we are correcting for consistency: the textual specification, programming examples, EBNF, and BRIG definitions are effectively four different ways of describing a feature.
 
As it happens, several participants work in more than one working group and often consider the same issue first from the perspective of the PRM, then from the point of view of system architecture, then in the context of the runtime, … We joked that we show split personalities when participating in different groups.
 
We have an outstanding team of processor, compiler and system architects. I am confident that what we have produced and will continue to produce will be superior to any proprietary solution. With such a great team, and the great companies behind it, I can proudly and confidently say that the future of Heterogeneous Parallel Computing is being shaped and defined here.
http://www.mediatek.com/_en/03_news/01-2_newsDetail.php?sn=1111&p=1
Chien-Ping Lu
Working Group Chair, Programmer’s Reference Manual
Sr. Director, Corporate Technology Office
MediaTek USA Inc.