New HSA Academic Centers of Excellence Undertake Research and Development of HSA-Compliant Technologies

By Dr. John Glossner, Computing Now: https://www.computer.org/web/hsa-connections/content?g=54930593&type=article&urlTitle=new-hsa-academic-centers-of-excellence-undertake-research-and-development-of-hsa-compliant-technologies

The Heterogeneous System Architecture (HSA) Foundation recently added two new HSA Academic Centers of Excellence – Technische Universitaet (TU) Darmstadt, and Friedrich-Alexander-University Erlangen-Nurnberg (FAU), both in Germany.

As mentioned in previous HSA Connections posts, HSA is a standardized platform design that unlocks the performance and power efficiency of the parallel computing engines found in most modern electronic devices. It allows developers to easily and efficiently apply the hardware resources—including CPUs, GPUs, DSPs, FPGAs, fabrics and fixed function accelerators—in today’s complex systems-on-chip (SoCs).

The research these universities are undertaking around HSA could potentially have significant impact for large-scale commercial use, particularly in the data center.

The Embedded Systems and Applications Group (ESA) of TU Darmstadt, for instance, is working with the HSA Foundation to explore how reconfigurable computing (specifically via Field-Programmable Gate Arrays or FPGAs), can be employed in processing units within the HSA framework.

Two key technologies developed by ESA are driving development: The Threadpool Composer system automatically assembles high-performance multi-threaded compute accelerators on FPGAs from existing hardware blocks, providing both pre-built hardware interfaces (e.g., to external memories or the host CPU), as well as software services (e.g., for dispatching compute jobs to the FPGA).

The joint research performed with the HSA Foundation also encompasses major development work on ffLink, a flexible high-performance PCI Express Gen3 x8 interface developed by ESA, which is capable of reaching a transfer rate of more than 7 GB/s between the FPGA accelerator and the host. Both Threadpool Composer and ffLink have been released as open source at https://git.esa.informatik.tu-darmstadt.de.
In the press release, Professor Andreas Koch, who heads the Embedded Systems and Applications Group (ESA) of TU Darmstadt, noted that the collaboration with the HSA Foundation is enabling them to make great progress on research that would have been extremely difficult to tackle without the additional insight provided by the industry partners.
The Department of Computer Science at FAU is currently focusing on integrating image processing accelerators in an FPGA and developing an HSA compliant interface. They are developing a self-designed processor core (packet processor) which is able to process requests and send them back to a host CPU using an HSA interface. The packet processor is then connected to the FAU’s own accelerator core and a PCI Express link to the system’s main memory.

To enable a fast interconnect to the host CPU, FAU is collaborating with TU Darmstadt’s Embedded Systems and Applications Group, which is providing a PCI Express core. FAU is also working on a technical prototype with the HSA Foundation; to support this work AMD, an HSA Foundation member, is providing the host system and technical help for using AMD HSA-enabled APUs, CPUs and GPUs.

The key takeaway from all of this?

FAU Professor Dietmar Fey summed it up nicely: “We’re now able to reach out to many industrial partners and work with them in establishing a standardization for heterogeneous computing platforms. It’s now becoming possible to combine fundamental research from a university with real world industry architectures and applications.”

The two universities are joining Northeastern University in Boston as HSA Academic Centers of Excellence. We’re looking forward to engaging on more projects with other universities worldwide to help make true heterogeneous processing a reality.

Platform and Hardware Requirements for HSA Technologies

by Paul Blinzer, September 8th, Embedded Computing Design: http://embedded-computing.com/guest-blogs/platform-and-hardware-requirements-for-hsa-technologies/#

(HSA) is now a standardized platform design, supported by more than 40 technology companies and 17 universities, that unlocks the performance and power efficiency of the parallel computing engines found in most modern electronic devices. Spearheading HSA is the , a non-profit consortium of SoC IP vendors, OEMs, Academia, SoC vendors, OSVs, and ISVs, whose goal is making programming for parallel computing easy and pervasive.

http://d2lupdnmi5p5au.cloudfront.net/i__src4f34e627c7935f69caeda682d9742857_par5048b51a68336b802090b7d2d9ec401b.png

Briefly, HSA allows developers to easily and efficiently apply the hardware resources—including CPUs, GPUs, DSPs, , fabrics, and fixed function accelerators—in today’s complex systems-on-chip (SoCs).

In this first of two posts, I’ll focus on platform and hardware requirements for HSA technologies; the second part will center on software and toolchains. Both will be discussed in depth at a tutorial at the upcoming 25th International Conference on Parallel Architectures and Compilation Architectures (PACT). The conference will be held from Sept. 11-15 in Haifa, Israel.

The architecture pillars of HSA

One of the key benefits of HSA is a set of platform architecture features and a programming model that software can depend on for parallel computing. Software using accelerators through an API like OpenCL, CUDA, or similar typically cannot assume that certain hardware features are available on every platform and so must either set a lowest common denominator or support many wildly different ways to essentially implement the same thing all over again to take advantage of an optional API feature while still supporting a lesser-equipped platform.

[Figure 1 | Pillars of HSA requirements]

HSA, in contrast, sets a requirement for a select few modern hardware features that make using the accelerator way more efficient and simplify the programming enormously, similar to what a CPU ISA like x86-64 or ARM has accomplished for compilers and application software development, with a reasonable expectation that a program will run efficiently on a platform with a particular processor and OS.

Let’s start with a brief list of requirements with a short explanation why they’re important.

Security and “quality of service”

An HSA accelerator is used as a peer processor to the CPU by the application, with all benefits and obligations. One of the key obligations is not allowing bad application code to do bad things to the system. Therefore, an HSA accelerator has a “user mode” ingrained for execution, where the operating system (OS) runtime sets strict policies at the hardware level for the accelerator. It can only access data and execute code that is part of the application process and if it accesses anything outside of the expected data, the OS runtime gets notified and can shut down the accelerator access by the application without affecting the rest of the system. A memory management unit (MMU) and other hardware to support it are therefore required by the HSA standard.

Shared virtual memory (SVM)

SVM allows the accelerator to access the application’s data directly and process it. That requires an MMU in hardware. HSA accelerators require it for system security reasons already, so no problem here.

Accelerators without SVM using OpenCL 1.x or the common CUDA API require the CPU to do a lot of work including parse application data, select/copy all data to/from the accelerator (which may need a dedicated buffer), and validate results. There is no concept of passing a pointer and allowing the accelerator to operate on shared memory. Often multiple data must be copied to the accelerator, but only one set of data is chosen. This can waste a lot of time to copy if we don’t know in advance which datasets are required. This copy overhead can seriously degrade the performance benefit of the accelerator.

This performance degradation is eradicated on an HSA accelerator. SVM allows the accelerator to parse and only access the application data it needs directly. As an added benefit, the accelerator and the CPU can access the same data in memory, avoiding unwanted duplicates.

Platform atomic operations

Anyone that has programmed with multithreaded code on a CPU knows how useful atomic operations are for ensuring that two threads can operate on the same data safely. Atomic operations are used in many different ways for implementing semaphores, mutexes, histograms, and many other tasks that require a particular order of execution or of access to work correctly. HSA compatible accelerators must support 32-bit or 64-bit atomics, as threads running on the CPU and on the accelerator operate and synchronize with each other very efficiently using atomic operations. Older accelerators without it always require arbitration using software APIs on the CPU. Software arbitration is very inefficient and these systems end up using far more CPU cycles that are better spent on other tasks.

HSA signals and doorbells

This is something special to HSA and a very significant feature for power-efficiency. You can consider these “atomics with benefits.” HSA Signals are data types created by the runtime that behave similar to atomically updated variables in memory. However, they allow the hardware to monitor and notify state changes, e.g. when a value is changed by an application thread or by an accelerator. One or more accelerators can update an HSA signal, listen and wait on a signal state change, and – if they have nothing else to do – go to sleep while waiting and be woken up immediately when something has changed. By using HSA signals, one accelerator can notify other accelerators directly that data is ready and these can start their processing immediately. If implemented fully, the CPU doesn’t need to perform any coordination and can stay asleep. This is a significant power saving feature because the only hardware that is needed for a task is only active when needed. HSA signals can be used easily everywhere in the software architecture and even HSA-based OpenCL implementations benefit.

User queues and dispatch

If you have programmed any accelerator using OpenCL or CUDA and followed in the debugger how it reaches the hardware, you will have noticed layers upon layers of software levels that the accelerator code and data must pass through until it finally reaches the hardware to be processed. HSA removes this inefficiency and cuts out the middle layers by defining a hardware-based user queue dispatch mechanism that can be directly accessed from the application runtime. The architected queuing language (AQL) that a packet processor of an accelerator uses allows any accelerator to either dispatch work to itself, to the CPU of the system to call OS runtime functions, or to other accelerators.

Cache coherency

Most HSA accelerators have caches that keep frequently used data close to the execution units. But since other processors in the system may also access the cached system memory, the application and the hardware must make sure that their content doesn’t get stale – especially in multithreaded execution. HSA accelerators therefore must provide mechanisms to keep the cached data current and either flush out pending data or invalidate cached data if other processors access the cached system memory. Cache coherency can be automatically enforced by hardware bus protocols or alternatively require instruction controls, which in the case of HSA is part of the definition.

With that, I hope I’ve made you interested enough to eagerly wait for the next blog entry, where I will touch on the HSA memory model HSAIL and how HSA integrates into today’s embedded systems.

Paul Blinzer works on a wide variety of Platform System Software architecture projects and specifically on the Heterogeneous System Architecture System Software at Advanced Micro Devices, Inc. as a Fellow in the System Software group. Paul is the chairperson of the System Architecture Workgroup of the HSA Foundation. Contact Mr. Blinzer at Paul.Blinzer@AMD.com.

HSA Foundation Aims for Broader Adoption of Coherent Memory Standard for Heterogeneous Processors

July 4, 2016, BDTi: http://www.bdti.com/InsideDSP/2016/07/05/HSAFoundation
Modern SoCs increasingly contain a variety of processing resources: one or more CPU cores and a GPU, often with a DSP, programmable logic, or one or multiple special-purpose co-processors for tasks such as computer vision. Properly harnessed, such heterogeneous processors often deliver impressive performance at low cost and low power consumption. But mapping applications onto heterogeneous processors is challenging. OpenCL, a specification standard language and runtime from the Khronos Group, enables the development of code that utilizes processing elements within a heterogeneous single- or multi-chip system. However, any processing efficiency gains derived from specialized computing elements can easily be negated by the added latency (not to mention incremental power consumption) incurred by copying data between computing elements.
Memory coherency among these diverse processing elements enables them to more efficiently share data via pointer-passing and queue-updating operations, versus bundling data and moving it via clumsy I/O operations through complex device drivers. Memory coherency has been common for some time in multi-CPU implementations; expanding the concept to GPUs, DSPs and other dissimilar architectures, however, is more challenging. OpenCL and other heterogeneous programming standards such as OpenMP and C++ AMP don’t make any attempt to standardize memory coherency, which requires the implementation of specific hardware features in each processing element. That mission has been taken up by the HSA (Heterogeneous System Architecture) Foundation, an industry group that has its origins in AMD’s proprietary Fusion System Architecture program (Figure 1).
HSA_Membership

Figure 1. The HSA Foundation boasts a sizeable, diverse membership list, but so far only AMD has chips implementing the organization’s standards.

Founded in mid-2012, the HSA Foundation released v1.0 of its specification suite in the spring of last year. And with newly announced, backward-compatible v1.1, according to foundation president Dr. John Glossner (who is also CEO of General Processor Technologies), the specification further expands beyond its AMD-centric foundations, supporting additional types of processor elements, as well as adding a number of requested features such as more flexible coherent memory access (Figure 2). SoC compatibility with the hardware aspects of the HSA specifications is becoming increasingly common, according to Glossner, and ARM for one agrees. In a recent briefing with Lead Mobile Strategist James Bruce and GPU Developer Tools Product Manager Anand Patel, the two ARM representatives noted that not only the latest Cortex-A73 and Mali-G71 but also the last several generations of ARM CPUs and GPUs are, in combination with newer CoreLink variants of ARM’s AMBA (Advanced Microcontroller Bus Architecture) interconnect, fully compatible with HSA’s memory coherency standards.
HSA_1.0HSA_1.1

Figure 2. The initial v1.0 HSA specification was CPU- and GPU- specific, reflective of the AMD SoC platforms on which it was based (top), but the newer v1.1 spec is more vendor- and processor-agnostic, not to mention more flexible (bottom).

Hardware compatibility alone isn’t sufficient for full compliance with the HSA standards, however, which explains the current dearth of HSA-compliant SoCs in spite of significant industry backing for the HSA concept. At the core of HSA’s software scheme is HSAIL (the HSA Intermediate Language), an intermediate virtualized code abstraction created by HSA-cognizant compilers, which is then dynamically translated to a particular processor’s instruction set by a chip-vendor-supplied HSA Runtime layer. HSAIL-generating compilers are beginning to appear: AMD’s CLOC and the TUT (Tampere University of Technology) POCL both generate HSAIL from OpenCL source code, for example, while General Processor Technologies and Parmance have developed gccBrig, a BRIG (binary format) language front-end to GCC (the GNU Compiler Collection) that is a binary representation of HSAIL. Also, Continuum Analytics sponsors Numba, an open-source Python compiler with direct HSA support, specifically targeting GPU acceleration.
However, to date HSA Foundation creator AMD is the only member company to have developed an HSA Runtime, and then only for its latest Carrizo APU (accelerated processing unit, a CPU-GPU combo), which entered volume production at the end of last year. Even in AMD’s case, direct compilation to the end CPU and GPU instruction sets (versus to a HSAIL intermediate representation) is the preferred approach in AMD’s ROCm (Radeon Open Compute Platform). While we expect to see increased adoption of the HSA standards by AMD and other HSA Foundation member companies, some major chip suppliers are pursuing different approaches. Intel, for example, seems to prefer the Cilk scheme it’s championed, while NVIDIA continues to rely on its proprietary CUDA approach.
Researchers at Northeastern University recently validated that HSA, by removing the need for repeated data copy operations between heterogeneous processing elements, can dramatically improve algorithm performance– at least for a couple of algorithm examples (Figure 3). Three different memory access scenarios were considered: CL12 employs per-element buffers, while CL20 leverages a common albeit small shared virtual memory buffer; both employ only OpenCL. The full OpenCL-plus-HSA implementation, conversely, implements a unified memory space with fine-grained synchronization support, leverages regular pointers and doesn’t require copy operations. The evaluated FIR (finite impulse response) filter algorithm represents a memory-intensive streaming workload; AES (Advanced Encryption Standard) symmetric encryption and decryption conversely is a compute-intensive streaming workload. Glossner was also careful to point out that these results were measured on AMD’s Kaveri APU, which being a pre-HSA 1.0 device supports only limited coherent memory throughput.
HSA_FIRHSA_AES

Figure 3. Recent evaluations conducted by Northeastern University researchers highlight HSA’s performance-boosting potential in both memory-intensive (top) and compute-intensive (bottom) workloads, even with SoCs that aren’t fully HSA-optimized (A Comprehensive Performance Analysis of HSA and OpenCL 2.0, Proceedings of the 2016 International Symposium on Program Analysis and System Software, April 2016).

Any performance loss due to the HSAIL-plus-HSA Runtime multi-layer abstraction will, Glossner feels, be more than counterbalanced by the significant performance boost delivered by HSA’s support for full memory coherency between heterogeneous processing elements. AMD Carrizo APU-based systems are now shipping from PC OEMs such as ASUS, Dell and Lenovo, and Glossner anticipates additional HSA support announcements to arrive shortly from other SoC and IP core providers. Until then, though, HSA will remain an approach with industry-wide potential but limited deployment.
For more information on the HSA Foundation, see the following two videos from the May 2016 Embedded Vision Summit (Video 1 and 2).

AT&T adds fiber markets, Sprint weighs in on small cells … 5 things to know today

By Martha DeGrasse, June 27th, RCR Wireless News: http://www.rcrwireless.com/20160627/carriers/att-adds-fiber-markets-sprint-weighs-in-on-small-cells-5-things-to-know-today-tag4
1. AT&T is promising business customers download and upload speeds of up to 1 gigabit per second. The company said today that it has expanded its AT&T Fiber service in Texas, Tennessee, South Carolina and Oklahoma. In addition, the carrier is expanding AT&T Fiber in several major cities, including Los Angeles, San Francisco, San Diego and Fresno, California; Miami; Dallas and El Paso, Texas; and Louisville, Kentucky.
AT&T also said it is launching nationwide voice-over-IP service through AT&T Business Fiber. The service is available in 180 U.S. cities, most of which are located in the South, Southeast and Central U.S., or on the West Coast.
2. Sprint said it is not concerned about the pace at which its small cell rollout is proceeding. The company’s top executives report that “the permitting and approval stage for its small cell deployment has been ahead of expectations,” according to Wells Fargo analyst Jennifer Fritzsche, who met with Sprint’s CEO, CFO and CTO last week. In a separate meeting, Mobilitie CEO Gary Jabara told Fritzsche’s team that Mobilitie is cooperating closely with local authorities as it deploys on Sprint’s behalf, and that there have been no cities that have put a complete stop to small cell deployments.
Sprint said last summer that it planned to deploy tens of thousands of small cells, and so far has not released a public update to that number. Jabara said this spring that Mobilitie has deployed fewer than 2,000 small cells to date.
3. Apple is now licensing patents from Huawei, according to The Wall Street Journal. Citing “a person familiar with the matter,” the paper did not specify what type of patent Huawei reportedly licensed to Apple. Noting that Huawei spent $9.2 billion on research and development last year vs. Apple’s $8.1 billion, the report said the Chinese company is also the world’s largest filer of international patent applications under the Patent Cooperation Treaty.
In the smartphone market, Huawei is a distant third behind Samsung and Apple, but the company has made it clear that it wants to compete head-on with the two market leaders. Huawei already dominates the market for wireless infrastructure, where it holds a leading position despite political pressures that keep Huawei’s gear out of U.S. wireless networks.
4. Huawei’s smartphone ambitions may include a proprietary operating system. The company has reportedly hired former Apple designer Abigail Brody to work on an operating system that would be an alternative to Android if Huawei’s relationship with Android developer Google takes a turn for the worse. Other smartphone makers have tried to develop proprietary operating systems, but so far application developers have been reluctant to invest time and money in apps for platforms other than Android and iOS.
5. The Heterogeneous System Architecture Foundation has released a specification the group says will make it easier to integrate digital solutions that use disparate hardware. HSA is a standardized platform design supported by more than 40 technology companies and 17 universities. The new spec adds multivendor architecture support, which means manufacturers will be able to combine IP blocks from more than one vendor. One of the group’s primary goals is to enable heterogeneous computing for vision-based “internet of things” systems.

Toward a Hardware-Agnostic World: HSA Foundation Releases Specification v1.1

By Jim Turley, EE Journal: http://www.eejournal.com/archives/articles/20160601-hsa/
I think there’s something great and generic about goldfish. They’re everybody’s first pet. – Paul Rudd
It’s finally happened: processors are now completely generic and interchangeable.
Might as well go home, CPU designers. There is no differentiation left to exploit. All of your processor architectures, instruction sets, pipelines, code profiling, register files, clever ALUs, bus interfaces – all of it is now as generic and substitutable as 80’s hair band drummers. Your entire branch of technology has been supplanted by some programmers.
Okay, so maybe it’s not quite that dire. But we’re getting there.
You have the HSA Foundation to thank for that. Their job is to make CPUs, DSPs, GPUs, VLIW machines (and pretty much anything else that can execute code) totally interchangeable. In the big SWOT analysis of hardware resources, the CPU becomes a “don’t care.” That is, HSA (which stands for Heterogeneous Systems Architecture) tries to make any code execute on any processor, regardless of its architecture, instruction set, or number of cores. They’ll let you run your operating system on a DSP, your graphics code on an integer CPU, and your signal-processing algorithms on a GPU. Hardware is hardware; just write your code and let HSA sort it out.
At least, that’s the promise the group has been making for the past few years. It’s what Steve Jobs would’ve called, “a big hairy audacious goal.” Hey, let’s treat all programming languages the same and all hardware engines the same. Programmers can write their source code in whatever language(s) they prefer, and let it run on whatever hardware they have lying around. Most of all, HSA allows you to mix different processor architectures together (that’s the “heterogeneous” part) so that you can, for example, run a multicore x86 processor alongside a cluster of ARM cores, next to a gaggle of nVidia GPUs. Pay no attention to how those processors are interconnected, or how many there are, or even what type of chip you’ve got. It’s all good! Throw ’em all together and let the software sort ’em out!
Sound like magic? Kind of. Sound like a bad idea that’s already been done to death by a thousand different university students who think they’ve stumbled on a fantastic (and original) idea? You’d be correct there, too. The idea of a universal hardware platform is hardly new, and the road to hardware independence is paved with other people’s venture capital. Java is about the only example of hardware-independent software that made any kind of a dent in the industry – but dents can be good or bad.
But wait a sec – isn’t Java already hardware agnostic (as in, “we don’t believe hardware exists”), and if so, why do we need another one? And for that matter, isn’t all code written in C++ or Python or BASIC or any decent language also platform-independent? Wasn’t that the whole idea of high-level languages? What problem are we actually solving here, and hasn’t it already been solved anyhow?
Well, yes and no. Java bytecode is (ahem) more or less transportable across different CPU architectures… assuming the architecture in question has its own bytecode interpreter or JIT or equivalent translator. And C code is certainly transportable… right up until it’s compiled. At that point, it’s very hardware-specific. But neither of these examples really ignores the underlying architecture of the chip you’re programming. Nobody writes C code without knowing if it’s intended for a conventional CPU, or a DSP, or a graphics processor. Same goes for any other programming language. You always want to know something about the processor it’s going to run on, even if you’re not bit-twiddling individual configuration registers.
So HSA wants to abstract-away that last vestige of processor prejudice. This is particularly important and useful in today’s systems that mix and match so many different kinds of processors. How cool would it be to write your C or Python and truly not care how many processors, or of what type, were ultimately going to host it?
The core of HSA’s technology, as with so many other “universal hardware platforms,” is an intermediate virtual machine. In other words, you’re writing code for an imaginary CPU, and HSA-compliant tools then convert that to actual machine code for the hardware you really have. It’s not too different in concept from any other compiler, and pretty similar to the way Java is compiled.
This intermediate layer is called HSAIL (HSA Intermediate Language), and it’s specified just like a real CPU with a real instruction set and everything. In fact, you can download the HSAIL specification for free and build your own HSA-compliant toolchain if you like. The HSA Foundation would probably be happy to encourage you.
The only hardware requirement to using HSA is that all the processors in your system must share a single, cache-coherent memory space. That’s important, and it’s non-negotiable. It’s the key feature that allows HSA tools to allocate and reallocate code segments among processors. When everyone shares a memory map, a pointer is a pointer, regardless of who created it or who dereferences it. Cache coherence is also mandatory, for much the same reason. The results of one processor’s calculations have to be universally accessible to all the other processors, without careful planning or message-passing.
In fact, that lack of planning and messaging is one of HSA’s strengths, though it’s hardly unique. The group recently ran some benchmarks comparing HSA-compliant code with OpenCL (which also tolerates heterogeneous hardware resources). In HSA’s testing, their code did far better, of course, and often by orders of magnitude.
An FIR filter, for example, ran about 10x to 100x faster than the equivalent OpenCL code. Pretty impressive. But can a toolchain really make that much difference? Depends what you’re comparing it to. Software FIR filters are very memory-intensive, and the OpenCL implementation handles its data structures in a “pass by value” method. In other words, it copies all of the data from one processor’s memory space to another’s. That wastes a huge amount of time (and consumes a lot of memory). HSA, in contrast, does “pass by reference.” Voila – you’ve saved a mountain of time with a different toolchain.
So who’s behind the HSA Foundation? Who stands to gain from this? Like many consortia, HSA draws its members from industry. On the CPU side, they’ve got support from AMD, ARM, and Imagination Technologies. So there’s x86, ARM, and MIPS represented, as well as Radeon, Mali, and PowerVR graphics. Toshiba, Texas instruments, Tensilica, Analog Devices, Ceva, Synopsys (with ARC), and other second-tier CPU vendors also participate. A lot of universities are contributing manpower, and several research laboratories are represented, too. So a good cross-section of interested parties overall.
Does it really work? It seems to, at least in early testing. The group has just released version 1.1 of its specification (also available for free download), and they’re adding support for more compilers and more processors. Compared to v1.0, HSA v1.1 is now more closely compatible with gcc. It’s a long and tricky process, but the HSA Foundation seems to be making real progress toward making CPU designers obsolete.

HSA spec upgrade supports multivendor SoCs

By Peter Clarke, EE Times Europe: http://www.electronics-eetimes.com/news/hsa-spec-upgrade-supports-multivendor-socs
The Heterogeneous Systems Architecture (HSA) Foundation – a grouping of chip, IP and software companies – has released the HSA 1.1 specification claiming it takes developers closer to energy-efficient heterogeneous computing.
The specification update comes just over a year after v1.0 and enhances the ability to integrate proprietary IP blocks and blocks from multiple vendors in heterogeneous designs
The specification is intended to allow developers to write software and efficiently apply it to hardware resources of multiple types – CPUs, GPUs, DSPs, FPGAs, fabrics and fixed-function accelerators. This can be done by writing in OpenCL 2.X, C++, Java and compiling to HSAIL, the HSA intermediate language.
The additions under HSA 1.1 include: multi-vendor support, improved interoperation with graphics, cameras and other image processors, digital signal processors; a formal definition of the HSA memory model; support for system-level profiling; run-time improvements including the capability to wait on multiple signals, a non-temporal memory access that allows infrequently used values to be removed from a cache efficiently
There is also an open-source LLDB-based debugger sponsored by Codeplay Ltd supporting kernels compiled using the open source CLOC compiler and the HSA assembler. All are available from the HSA’s GitHub repository.
“HSA is increasing traction, with HSA compliant systems now in the market, an increasing number of developer tools available, and now the ability to leverage IP blocks from different vendors,” said HSA Foundation president John Glossner.

HSA updated to v1.1 with new features

by Charlie Demerjian, SemiAccurate: http://semiaccurate.com/2016/05/31/hsa-updated-v1-1-new-features/
The HSA Foundation is announcing v1.1 of their spec today with some important changes. Since SemiAccurate first brought you the news of HSA years ago, the slow progress forward has been picking up pace quickly.
Just over a year from the March 2015 launch of v1.0 of the HSA spec comes the new revision. Since fully HSA1.0 compatible devices are just hitting the market now, there is a Dell box imminent with all the features enabled on a Carrrizo desktop, how long will we have to wait for v1.1 hardware? That is the beauty of the v1.1 spec, there are no hardware changes required so if you are HSA1.0 compatible you are HSA1.1 compatible.
In light of this it may not seem like v1.1 brings much to the table but that is not the case. Understanding those differences may take a pretty technical mind but most SemiAccurate readers will probably keep up. If you aren’t familiar with the current state of HSA, we won’t rehash it in detail but start here and here and here.
The basics don’t change with v1.1 which is no surprise in light of the hardware compatibility, you still compile your code to the HSAIL intermediate language and that still uses the hQ structure to pass data. Memory is still pinned rather than copied ad infinitum, and efficiency of data use is still a top priority. What the new spec offers are features that most wanted earlier and wider support.
First up is the spreading of HSA to a wider range of hardware. Currently it only supports a CPU and GPU as targets because, well, that’s all the hardware that was out there. Looking at the breadth of HSA Foundation members it is clear that support will add a wider class of devices and v1.1 delivers just that. Image processors, DSPs, NICs, and a whole host of other ISA are now officially supported, and nearly anything with a bunch of parallel processors can be added. Look for more to be added as soon as hardware from supporting vendors is official announced.
Software and drivers are a key point in this type of multi-vendor interaction and HSA1.1 supports that in a few new ways. There are transparent features now and mulit-vendor IP blocks are better integrated too. Also added are several key pieces of information that can be explicitly queried too so if you want to know where a page table resides you can actually get that answer directly. This was a big request in older versions, sometimes you need to find information that a generic abstraction can’t provide.
Speaking of generic abstractions in v1.1 we have a vendor neutral generic driver, something that makes sense for a virtual ISA like the one HSA provides. Think of this driver as the layer from the system to HSA, each hardware vendor still needs to provide a specific driver for their hardware. What this will hopefully do is simplify and standardize the coding process for the user and software writer. If you think about it, abstracting the hardware from the user perspective is a key enabler for multiple classes of hardware, and more direct polling of structures does most of the rest.
Another big gap in the 1.0 spec surrounds pausing of threads. As you might be aware the hQ structure and massively threaded programming does a lot of pausing of threads and HSA is no exception. HSA1.0 has mechanisms for pausing and restarting threads but 1.1 takes it a step further with multi-wait signals. A thread can now wait for more than one signal and apply some simple logic to the unpause process. This may seem basic but it wasn’t there and now is.
Multi-wait signaling can be combined with so called forward progress rules, also new. This does exactly what it sounds like, it guarantees a thread will make progress under certain conditions. Between the two new features you can implement QoS and service guarantees, a necessary step for many media applications especially real-time ones. With HSA1.1 you don’t have to write it all yourself, the primitives are there for you and work across multiple hardware blocks.
Memory has a lot of new features too starting with a formal memory model definition. If you think about it, HSA already has both big and little endian hardware supported and with new additions a formal model will make life a lot saner for programmers. Non-temporal memory accesses, basically cache eviction after use, and multi-agent sharing of images is also now official, the latter being effectively access to pinned memory spaces from multiple blocks.
That brings us to the tools and here we start with the finalizer. It now allows linkage to standard code objects and supports versioning. The first of these is pretty self explanatory, the second tends to be more familiar to the non-consumer space. Versioning effectively allows the code to specify which variant of the finalizer it needs and more importantly get it. This may seem trivial to non-programmers but you just need to look at modern shenanigans by Microsoft and forced march upgrades to realize how valuable flexible versioning is.
Moving farther down the tool chain there is a new profiling API that supports multiple agents. Massively threaded code running across multiple differing hardware blocks can be a tad problematic to optimize and that is where the new API comes in. Depending on the hardware it can work in realtime or just gather data for future examination. There is now direct access to hardware counters and timestamps as well. Gaming is an obvious example of where this will pay off but media work will see huge benefits as well. If nothing else this will save a lot of programmer hair from hitting the floor after being torn out in chunks.
Last up we have added language support with the headliner being Python and the Numba NumPy aware compiler. It is available on GitHub with direct HSA support and does automatic parallization of supported functions. Java, Javascript, OpenCL, and C++ were supported too but the new C++17 standard will probably include a standard template library for HSA as well. All the HSA1.0 tools and toolchains will work with v1.1 as well and most new features will likely be supported in the near future.
Those are the major points of the new HSA1.1 spec. It runs on the same hardware, adds a few languages and profiling tools, and brings a lot of new hardware possibilities into the mix. To support this there is a standard virtual ISA, explicit querying of some structures, and a much more robust signaling and wait model. To the user it should work better on more devices, to the programmer it should be simpler to support and easier to fix when things go wrong. For the hardware vendors it can only increase the available software for their offerings, what’s not to like?

SoC Spec Defines Core Interfaces: HSA version 1.1 defines IP block interfaces

By Rick Merritt, EE Times: http://www.eetimes.com/document.asp?doc_id=1329786
SAN JOSE, Calif.—The Heterogeneous System Architecture (HSA) Foundation defined interfaces for third-party IP blocks so engineers can design SoCs compliant with its specs for shared coherent memory. To date, only Advanced Micro Devices, which initiated the group, ships a processor supporting the ad hoc standard for speeding up on-chip processes shared among different cores.
The HSA spec is positioned as a more open alternative to the SoC techniques and programming environments supported by Intel and Nvidia. The interfaces defined in HSA’s version 1.1 released today are already baked into a handful of CPU, GPU, DSP and fabric cores from at least three vendors including Arteris and Imagination Technologies.
“In our version 1.0, SoC makers didn’t need to define how, say, a DSP gets access to a shared memory-page table,” said John Glossner, HSA’s president. “Now what were opaque elements of the standard are specified in a way that gives us multivendor transparency, enabling vendor-neutral device drivers,” he said.
More than 40 companies are part of the group, including ARM and Mediatek which have said they will support the approach. The technology aims to serve a broad set of markets from mobile SoCs and desktops to high-performance computing systems.
“All our new products will support HSA,” said Glossner, who works for Optimum Semiconductor Technologies, a US-based IP licensing company that is part of a larger chip conglomerate in China.
Looking forward, “the next two years are about getting software up and working — this is the first multivendor spec we’ve released,” said Glossner, with a debug spec expected to be the biggest part of the work.
“Debugging multiple hetero cores from a single source is complex, and we want to make sure to get it right, trying to poll and set break points is difficult and needs higher level abstractions – a heterogenous debug tool suite should look like it is uni-threaded,” he said.
HSA already supports a basic form of debugging.
“We support passing high-level debug information to [an abstracted agent], and it is included in the generated code object,” said Glossner. “We are working towards making that support universal across tools just like we did for profiling,” he added.
The 1.1 spec already supports profiling across cores. Tools including open source compilers and a runtime environment for HSA are currently available with test results published by AMD and academics in the group.
The HSA spec is agnostic about programming environments although its main target has been OpenCL. A parallel version of C++ is due in 2017 that should be able to take advantage of the HSA techniques, Glossner said.
The group has not yet decided of future hardware will require additions that might generate a version 2.0 spec.

HSA 1.1 Specification Launched: Extending HSA to More Vendors & Processor Types

By Ryan Smith, AnandTech: http://www.anandtech.com/show/10387/hsa-11-specification-launched-multi-vendor-support
For the last few years now we have been keeping tabs on the development of the Heterogeneous System Architecture, a set of technical standards and an associated instruction language designed to allow efficient heterogeneous compute. Originally envisioned by AMD, HSA has been the cornerstone of their efforts to develop a fully functional ecosystem for heterogeneous hardware and software. Define a standard, make it easy(ier) for developers to create software around it, and, if all goes well, AMD’s big bet on GPU technology made almost 10 years ago will pay off.
However while AMD was the birthplace of what would become HSA, the standard as a whole has been about more than just one company. Which is why HSA has been under development of the HSA Foundation since 2012. The foundation is composed of many members, including a number of CPU/GPU design heavyweights such as ARM, Qualcomm, Imagination, and Samsung, all of whom are contributing to the development of HSA, and in turn we expect are at least giving themselves the option to leverage it in the future.
With all of that said, while HSA itself is a group project, in practice the first iteration of the standard was AMD-focused. This owed to not only AMD’s early involvement, but also due to the fact that the standard needed hardware to be developed against – to have a proof of concept, which was AMD’s Carrizo APU. As a result the HSA 1.0 specification did not offer as much flexibility with other architectures as the rest of the foundation would like. This is something that they have been working behind the scenes to address, and today the Foundation will finally be taking their next step with the publication of the HSA 1.1 standard.
The big (though not sole) focus of the 1.1 standard then is extending it to better work with non-AMD hardware. This includes not only SoCs and other integrated devices from other vendors (e.g. ARM), but additional classes of accelerators such as DSPs. The HSA foundation wants to make the HSA standard truly heterogeneous for more than just AMD APUs, and for more than just CPU + GPU combinations. The 1.1 standard, in turn, is their effort to form a more perfect standard for heterogeneous compute.
What’ on the table for HSA 1.1 then is not a radical departure from HSA 1.0, but it is an extension and a further tightening of the standard to meet the goals of the Foundation. On the hardware side this is fully backwards compatible – meaning it will work on 1.0 hardware such as Carrizo – while setting up the updated standard to work on additional hardware. This is especially the case with additional classes of accelerators, as 1.0 was primarily focused on abstracting away the GPU (per AMD’s APU needs).
As the first version of the true multi-vendor specification then, 1.1 is designed to help vendors be able to mix and match HSA-capable blocks in an effective manner. The Foundation itself has a heavy mobile SoC presence (be it integrators or IP providers), and it’s easy to imagine how someone like MediaTek would want to be able to ensure that they can easily make HSA-capable SoCs using both Imagination and ARM GPUs, or how Qualcomm may want to use HSA in the future in their Hexagon DSPs. To get there, the 1.1 specification makes transparent a number of system level issues. Information about shared page tables, signals, queues, and more are now better exposed through the updated standard, which plays a big part in bringing about multi-vendor support.
The 1.1 specification also addresses some other issues that were felt at one point or another with HSA. The memory model now has a formal definition (using cat/HERD, for those few of you who know those tools). Along those lines there is additional memory functionality such as support for non-temporal memory accesses, which specifically comes into play when you need to tell the cache to flush an item after it’s used (good for one-time-use items). And on the signal side of matters, HAS code can now understand how to wait on multiple signals, an improvement over 1.0 where code could only wait on a single device.
Finally, 1.1 also includes updates to help optimize performance and make the API play nicer with other code. A new profiling API has been introduced, which exposes hardware-level counters and other information to allow better profiling of the performance of HSA program execution, which can then be fed back into profile guided optimizations. Meanwhile the HSA finalizer has been reworked so that its internal representation isn’t quite so obscure, with the new, more standard-looking representation making the finalizer more suitable for linking to other tools.
Yet despite all of these updates, the change should be completely transparent to application developers. Because HSA is ultimately a means of abstracting away the work and the quirks necessary to make heterogeneous compute work well, application developers won’t see these changes. Who does see these changes are the hardware developers, who write the associated runtimes that actually compile HSAIL intermediate code down to device code. And even then, we’re told that updating a 1.0 implementation to 1.1 is not especially painful, particularly when compared to writing an implementation to begin with. AMD for their part already has a 1.1 implementation up and running, which, logically enough, is being used as the basis of their Radeon Open Compute Platform (ROCm), where ROCm adds the additional discrete GPU functionality AMD specifically needs.
Though with that said, the existence of ROCm and other platforms does bring up the struggle the HSA Foundation is facing on adoption. While ROCm is HSA-based, at the same time AMD is doing an end-run around the HSAIL, preferring to compile direct-to-ISA. This still utilizes the HSA Runtime, and as a result benefits from and validates the basic HSA strategy, but it’s an example of how the HSAIL aspect of the standard has struggled.
Similarly, earlier this week we saw ARM pass on embracing HSAIL as well for the heterogonous aspects of their Mali-G71 GPU, following the same train of thought as AMD. The G71 GPU is for all intents and purposes an HSA 1.1 GPU – following the HSA hardware specification to implement heterogonous computing in a common and sane way – but the company is not utilizing HSAIL.
ARM for their part is following an OpenCL-centric software strategy for exposing the heterogeneous aspects of their hardware to developers. The interesting thing about the ARM implementation is that they have gone above and beyond the basic aspects of OpenCL 2.0, offering finer-grained sync that OpenCL requires at a minimum. Finer-grained sync that would otherwise be something better suited for HSA. As a result part of the HSA Foundation’s efforts are focused on showcasing the additional benefits of HSA over OpenCL 2.0, to entice hardware manufacturers and developers into using HSA.
Ultimately, in our discussion with Foundation president John Glossner (who replaced Phil Rogers), despite these setbacks he’s still bullish on HSAIL. It’s his hope that as finalizers continue to improve, there will be less reason for companies like AMD to bypass HSAIL as they do now. And in the meantime, further success with HSA and HSAIL in general should help to encourage more hardware vendors to adopt HSAIL.
Finally, to pull one more slide out of the HSA deck as a “this is cool” subject, Continuum Analytics’ Numba Python compiler has recently added direct HSA support. This makes it a lot easier for developers writing code in Python to easily add HSA-compliant vectorization to their programs. Python is a widely used language, and while even “automatic” parallelization isn’t the true holy grail of no-effort program parallelization, it does get HSA one step closer, at least in this case.
Wrapping things up, today’s launch of the HSA 1.1 specification, despite the minor version number increment, should in the long run prove to be a significant event for the HSA Foundation. By finally getting the specification to the point where it can more readily support multi-vendor hardware, the ecosystem as a whole will have the opportunity to evolve into a true multi-vendor ecosystem, the ultimate purpose of AMD spinning their work off into the HSA Foundation to begin with. The hard part as an outside observer will be waiting; while the specification was ratified back in April, there is still going to be some lag between the ratification and when additional hardware will be ready to support it.