Heterogeneous System Architecture: Spearheading Bold New Standards for Tomorrow’s Computing

IEEE Computer Society: https://www.computer.org/publications/tech-news/spearheading-bold-new-standards-for-tomorrows-computing
“The first 90% of the code accounts for the first 90% of the development time. The remaining 10% of the code accounts for the other 90% of the development time.” Tom Cargill, Bell Labs.
Improving software and the systems it supports has always posed a challenge, but there are those who persistently work to improve computing with better ideas. The Heterogeneous System Architecture (HSA) Foundation represents more than 70 leading technology companies and universities. Combining their insights and resources, they have developed HSA specifications and compliant heterogeneous systems. To date, HSA teams have made major strides in establishing heterogeneous computing standards, with applications and achievements in several critically important areas.
Leveraging the Full Power of Parallel Computing
The HSA standardized platform leverages the performance and power efficiency of the parallel computing engines used in most of today's electronic devices. Developers can finally harness the CPUs, GPUs, DSPs, FPGAs, fabrics, and fixed-function accelerators found in today's complex Systems-on-Chip (SoCs) with remarkable ease and efficiency.
This has led to considerable progress in applying HSA to artificial intelligence, robotics, ADAS/autonomous driving, IoT, software-defined communications, and other applications. The drive to establish a unifying standard has spearheaded support for bold new advances, including innovations in product development, R&D, ecosystem formation, and conformance testing across related industries. Breakthroughs have surfaced in such key areas as standard application evaluation, instruction set architectures, system architecture, security protection, and network interconnection.
HSA Working Groups: Raising the Bar in Heterogeneous Computing
Uniting teams with diverse talents continues to take HSA computing to new levels of efficiency. At a recent summit, HSA working groups in the China Regional Committee (CRC) moved forward with a number of key insights:

  • Application & System Evaluation Working Group. This team has led the way in application and development trends in artificial intelligence.
  • Interconnect Working Group. Breaking new ground in Network-on-chip in heterogeneous computing SoCs, this team has made significant strides in ‘next-step’ verification and standardization work arrangements.
  • Compilation & Runtime LIB Working Group. Engaged in vector computing and related programming models, this group continues to offer recommendations for facilitating integration into HSA system architectures.
  • System Architecture Working Group. Leveraging HSA to address software-defined communications and radio, this team is leading the way in heterogeneous multi-core chip architecture and application development.
  • Security & Protection Working Group. Focusing on adapting heterogeneous computing to ancillary areas, this working group has made major strides in security protection.

Matching the Right Processor with the Right Application
“People have no idea how fast the world is changing,” said Peter Diamandis at a recent Singularity University Global Summit. It’s eye-opening just how many of today’s electronics are driven by SoCs. There’s little doubt that SoCs have become the heart of mobile devices, high-performance computing systems, AR/VR systems, machine learning, and servers. All are segueing to heterogeneous architectures made up of IP blocks, which are frequently designed and programmed in a variety of proprietary languages. This has created a “Tower of Babel” as hardware technology continues to move faster than expected, outpacing the software that supports it.
Addressing Barriers and Bottlenecks
The HSA paradigm addresses these inherent ‘barriers and bottlenecks’ by ensuring that each task runs on the processor best suited for the job. United with cache-coherent shared virtual memory, HSA systems provide high-bandwidth access to memory and boost application performance while also reducing power consumption.
The HSA structure allows programmers to write applications in their existing programming languages, free of concerns about the native instruction set architecture (ISA) of the underlying heterogeneous processors. Programs can be compiled to native code or to a virtual instruction set that is later finalized to a native ISA. In many cases no specialized programmer knowledge is needed to achieve parallelization.
Most Surveyed Would Include HSA
An HSA survey of SoC companies, IP providers, software providers, academics, OEMs, OS vendors, and software developers proved quite revealing: respondents concluded that their systems would all include heterogeneous features. They noted that they are currently leveraging CPUs, GPUs, digital signal processors (DSPs), FPGAs, and fixed-function accelerators.
There’s no denying that heterogeneous systems are at the epicenter of today’s tech disruptions—everything from tablets and smartphones to scientific computers. Heterogeneous architectures are leading the way to the next generation of increasingly smarter devices. This has opened up vast new markets in machine learning and AI, data centers, high performance computing, mobile, AR/VR and embedded IoT.
Powerful New Tools Extend HSA’s Reach
Innovative new tools continue to enhance and extend the reach of HSA’s model. A sampling:

  • HIP—Heterogeneous-Compute Interface for Portability (HIP) makes CUDA programs more portable. The first release of HIPCL demonstrates its viability and is already useful for end users: it runs most of the CUDA examples in the HIP repository, and the list of supported CUDA applications is expected to grow as new features are added.
  • POCL—Portable Computing Language (POCL) provides a flexible open source implementation framework for the OpenCL standard. POCL helps integrate as many diverse devices as possible in a single OpenCL context. This enables optimization at the system level, ultimately harnessing the power of all heterogeneous devices in the system under a single API.
  • FPGAs—Field Programmable Gate Arrays (FPGAs) equip today’s systems designers with a far more robust selection of hardware to work with. Composed mostly of simple lookup tables and flip-flops, these components can be dynamically configured to form complex algorithms directly in hardware. This structure offers low latency, virtually unlimited reconfigurability, and high energy efficiency due to the dedicated circuitry.
  • Memory Centric Architecture—This allows heterogeneous systems to reside in the same memory address space and share memory coherently. It enables multiple processors to write to memory without first copying data to a separate address space, while other processors simultaneously read from the same location. A major portion of legacy software can also be reused.

In its simplest form, HSA is a productivity application programming interface (API), one that leverages the power and potential of heterogeneous computing. It eliminates the ‘bottlenecks and barriers’ of traditional heterogeneous programming, allowing developers to focus on algorithms without micro-managing system resources. Applications can seamlessly blend scalar processing with high-performance computing on CPUs, GPUs, DSPs, Image Signal Processors, VLIWs, Neural Network Processors, FPGAs, and more. It’s a timely innovation that simplifies heterogeneous programming.

HIPCL: From CUDA to OpenCL Execution

by Pekka Jääskeläinen and Michal Babej, Customized Parallel Computing group, Tampere University, Finland. IEEE Computer Society: https://www.computer.org/publications/tech-news/from-cuda-to-opencl-execution

Heterogeneous-Compute Interface for Portability (HIP) is a runtime API and a conversion tool that helps make CUDA programs more portable. It was originally contributed by AMD to the open source community with the intention of easing the effort of making CUDA applications also work on AMD’s ROCm platform.
While AMD and NVIDIA share the vast majority of the discrete GPU market, it is useful to make this “CUDA portability enhancement route” available to an even wider set of platforms. Since the Khronos OpenCL standard remains the most widely adopted cross-platform heterogeneous programming API/middleware, it is interesting to study whether HIP could be ported on top of it, potentially expanding its scope to all OpenCL-supported devices. Our CPC group has worked on this project, known as HIPCL, for some time; it is now published and available on GitHub.
During the development of HIPCL, CPC has aimed to identify features lacking in OpenCL, along with other challenges in transitioning CUDA applications to OpenCL-supported platforms. The main challenge we have identified so far is the lack of an Intermediate Representation (IR) that is widely adopted by multiple OpenCL implementations and that also has solid open source infrastructure available. While the main program of a heterogeneous application is typically compiled to a native instruction-set binary for the host CPU, using an IR target and just-in-time compilation for the kernel parts helps keep the application executable portable across multiple heterogeneous devices.
Standard Portable Intermediate Representation (SPIR) versions 1.2 and 2.0 are the first IRs defined by Khronos for use with OpenCL. Because they are based on the LLVM compiler framework’s IR, they have relatively good support in the upstream LLVM compiler but lack wide adoption in vendor implementations: NVIDIA does not seem to support SPIR at all. The AMD APP SDK CPU-only implementation supports SPIR, but AMD’s GPU implementations don’t support any SPIR version. Older versions of the Intel OpenCL implementations support SPIR, while the newer versions (the proprietary CPU runtime and the new NEO runtime) support only SPIR-V. Overall, the focus of the OpenCL community is now on SPIR-V. SPIR-V supports both OpenCL (since version 2.1) and the now-popular graphics and compute API Vulkan. It also has open source LLVM-to-SPIR-V conversion tools available. SPIR-V thus seemed the most “future proof” choice of IR for HIPCL’s fat binary output.
In the first release of HIPCL we focused on testing the output on the Portable Computing Language (POCL) open source OpenCL implementation framework, as well as on the Intel OpenCL SDK “NEO” for GPUs, both of which had adequate SPIR-V support for our test cases. We plan to expand support to other OpenCL platforms in future releases as SPIR-V support in OpenCL implementations keeps maturing.
The first release of HIPCL is a proof of concept, but it is already useful for end users. It can run most of the CUDA examples in the HIP repository, and the list of supported CUDA applications will grow steadily as we add new features. That said, we naturally welcome contributions from the community in the form of pull requests and issue reports about CUDA applications that do not port. However, when reporting such issues, please first make sure that your application works with the upstream HIP-to-ROCm path, since our primary focus is on getting the OpenCL support on par with it. The current status of HIPCL will be kept up to date in the README on the front page of the GitHub repository. If you plan to work on a new feature, please let us know first by opening a feature request issue so that we do not end up working on the same thing at the same time.
On behalf of the CPC group, we want to thank key funding sources that are helping us realize a more open and diverse heterogeneous computing ecosystem: The HSA Foundation, Academy of Finland (funding decision 297548) and ECSEL JU project FitOptiVis (project number 783162).
For additional information, contact us via pekka.jaaskelainen@tuni.fi.

Accelerator Framework for Portable Computing Language (POCL)

by Pekka Jääskeläinen and Kati Tervo, Customized Parallel Computing group, Tampere University, Finland. IEEE Computer Society: https://www.computer.org/publications/tech-news/accelerator-framework-for-portable-computing-language

Diverse heterogeneous platforms that utilize various types of resources such as general purpose processors, customized co-processors, hardware accelerators and FPGAs have been at the core of the Customized Parallel Computing (CPC) group’s research interests for almost two decades. The group’s mission is to research and develop technologies that make customized heterogeneous parallel platforms easier to design and program, bringing their benefits to a wider range of applications and end users.
One of the activities of CPC has been to study, adopt and promote standards such as HSA, OpenCL and more recently C++. Along these lines, a major contribution of CPC to the heterogeneous platform community is Portable Computing Language (POCL), a flexible open source implementation framework of the OpenCL standard. The goal of POCL is to integrate as many diverse devices as possible in a single OpenCL context to allow system level optimizations, eventually harnessing all heterogeneous devices in the system for the application to use under a single API. We consider the OpenCL API a good core on top of which higher-level software layers can be added for increased engineering productivity, automatic adaptation and other purposes.
POCL already supports multiple device types: for example, HSA Base profile accelerators with native LLVM ISA-based compilation, NVIDIA GPUs via libcuda, multiple CPUs, and open source application-specific processors using the TCE target. It is also known to have multiple private backends that have not (yet) been upstreamed.
The latest class of devices we want to integrate into POCL is fixed-function hardware accelerators. Hardware accelerators are power-efficient implementations of challenging algorithms in heterogeneous computing platforms, used to make key tasks in applications such as video codecs or machine-vision pipelines faster, more power efficient, and less costly in chip area. Programming them at a high level and integrating them into application software in a standard, portable way is one of the interesting challenges CPC is currently studying.
While the efficiency benefits of hardware accelerators are clear, their trade-off in comparison to software-programmable co-processors is their post-fabrication inflexibility: the function the accelerator performs cannot be changed after the chip has been manufactured; the accelerator’s data path cannot be freely re-programmed to implement a new function outside the supported configuration parameters. However, there is a coarser, “task-level” degree of programmability to consider: even if the functions of the individual IP blocks in the system cannot be changed, it is essential that the accelerator functionality be integrated into the application software logic in a cohesive and efficient manner.
OpenCL 1.2 introduced a device type called custom device and the concept of built-in kernels, which brought hardware accelerators to OpenCL programmers. Using these concepts, a device driver can advertise itself as a non-programmable accelerator that supports a set of “kernels” identified only by name.
Since the semantics of the built-in kernels implemented by accelerators are not “centrally defined” anywhere, the end user is expected to know the accelerator by its device ID and the meaning of the built-in functions it provides. This, of course, reduces the portability of an OpenCL program across vendors when hardware accelerators are used.
As a framework for easily adopting the OpenCL standard across the diversity of heterogeneous devices, including hardware accelerators, what can POCL provide to make the application integration of hardware accelerators easier? In our first contribution of accelerator support to POCL, we rely on the following concepts, which have been shown to work well together in the customization of SoCs (FPGA-verified) that include hardware accelerators:
1) Define a standardized memory mapped hardware IP interface/wrapper which POCL can assume to be present in the address space of the process. We based the interface on a set of memory mapped registers for probing the device essentials (to implement “plug’n play” functionality for reconfigurable platforms) and utilized the bit-level-specified HSA AQL for kernel queueing.
2) Contribute an example custom device driver implementation to POCL upstream. The default implementation assumes the hardware accelerator has the standard interface. It integrates accelerator launches into POCL’s top-level task scheduling so that the accelerator “plays nice” with the other devices in the same context, for example by running accelerator execution in parallel with CPU kernel execution.
3) Provide a list of known built-in functions with pre-specified names, integer function identifiers, and argument interfaces, which can be expanded in future releases of POCL. This mitigates the portability problem if the list becomes a de facto standard, which can happen with widely adopted open source software.
Using the list of known accelerators, multiple vendors or, for example, open source community contributors can implement an accelerator in their own way, and application writers can use it knowing that the invoked built-in functions implement exactly the intended functionality, no matter who provided the IP block. A built-in function descriptor can be along the following lines (this textual description is implemented as software structs and enums):
Kernel called pocl.vecadd_i3 which implements element-wise addition of a vector of 3-bit integers stored in byte vectors. The first argument is a physical address pointer to the beginning of the 1st input vector, the 2nd to the beginning of the 2nd and 3rd argument points to the output (which can be the same as one of the inputs). The vector width (in rounded up number of bytes) is specified using the grid size dimension x.
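The descriptor idea above can be sketched in code. The following is a hypothetical illustration only: the enum value, struct layout, and simplified reference semantics (one 3-bit value stored per byte, host pointers instead of physical addresses) are our own stand-ins, not POCL's actual definitions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical registry of known built-in kernels. The names, integer IDs,
// and layout below are illustrative stand-ins, not POCL's real definitions.
enum BuiltinKernelId : uint32_t {
  BK_VECADD_I3 = 0x100,  // "pocl.vecadd_i3"
};

struct BuiltinKernelDescriptor {
  BuiltinKernelId id;   // stable integer function identifier
  const char* name;     // name advertised via the OpenCL custom device
  uint32_t num_args;    // summary of the argument interface
};

const BuiltinKernelDescriptor kKnownBuiltins[] = {
  {BK_VECADD_I3, "pocl.vecadd_i3", 3},
};

// Simplified reference semantics for pocl.vecadd_i3: element-wise addition
// of 3-bit integers, stored here one per byte for clarity (the real kernel
// takes physical-address pointers and packs the vector into bytes).
// Grid size dimension x supplies the element count.
void vecadd_i3(const uint8_t* in1, const uint8_t* in2, uint8_t* out,
               size_t grid_x) {
  for (size_t i = 0; i < grid_x; ++i)
    out[i] = (in1[i] + in2[i]) & 0x7;  // results wrap to 3 bits
}
```

With a registry like this, a driver can advertise `pocl.vecadd_i3` by name while applications match it against the known-ID table instead of guessing device-specific semantics.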
This initial framework and example implementations of the above-mentioned concepts have now been committed to the POCL master branch. The quick start instructions, along with an example accelerator created with the open ASIP tools we are developing, are available online. The work is ongoing and we are happy to receive your feedback. Pull requests are especially welcome! You can reach us via the POCL GitHub or the POCL discussion channels.
We are currently looking into improving efficient asynchronous host-autonomous execution across multiple accelerators, and are also investigating support for SoCs that have IOMMUs capable of system-shared fine-grained virtual memory, as mandated by the HSA Full Profile and OpenCL 2.0 System Shared Virtual Memory.
Finally, we would like to thank the funding sources that make our ongoing customized computing open source and academic contributions possible: The HSA foundation, Academy of Finland (funding decision 297548) and ECSEL JU project FitOptiVis (project number 783162).
For additional information, contact Pekka Jääskeläinen at: pekka.jaaskelainen@tuni.fi.

FPGAs in the World of Heterogeneous Systems

by Dr.-Ing. Marc Reichenbach and Philipp Holzinger (M.Sc.), Friedrich-Alexander University (FAU), Chair of Computer Architecture, IEEE Tech News: https://www.computer.org/publications/tech-news/fpgas-in-the-world-of-heterogeneous-systems

Recently emerging applications such as deep learning, ultra-high-definition image/video processing, and virtual reality demand an amount of processing power, in ever smaller devices, never seen before. Heterogeneous architectures are seen as a solution to this problem by both industry and academia, since they provide significantly better performance per watt. In addition to the more common GPUs and DSPs, Field Programmable Gate Arrays (FPGAs) further enhance the selection of hardware available to system designers. Internally, they are mostly made up of simple lookup tables and flip-flops. These components are dynamically configured to form complex algorithms directly in hardware. This structure offers unlimited reconfigurability, low latency and, most importantly, very high energy efficiency due to the dedicated circuitry.
These unique features make FPGAs an indispensable part of the solution to these imminent problems. However, their inherent complexity makes them significantly less accessible than other kinds of accelerators. Therefore, new ways to interact with FPGAs in heterogeneous systems are needed to use their full potential, especially when they are used in conjunction with other accelerator types.
For this purpose, programming an FPGA must be made considerably simpler for developers. Nowadays, heterogeneous systems typically use Khronos’ open standard OpenCL to interact with their devices. It includes the language OpenCL C for writing “kernels” that can be offloaded to an accelerator device, as well as a C/C++ API for describing these kernel dispatches from the host software.
In the past, this has also been successfully applied to FPGAs. A method called high-level synthesis (HLS) was created to make this form of automated hardware design from a higher-level language possible. Traditionally, accelerators are designed by hand, an expensive and time-consuming process that typically leads to the best results.
This process is not always feasible, for example when short and stringent time frames must be met. With HLS, however, the application behavior is described in a less complex source language and then compiled to synthesizable HDL (hardware description language) code. This makes writing time-consuming HDL less necessary, which considerably improves the accessibility of FPGAs.
However, relying on the OpenCL standard alone is not sufficient for today’s diverse landscape of programming languages. Every developer should be able to use the best suited language for the desired task, without compromises in using accelerators. What is really needed is not only a hardware independent API, but also a language agnostic specification on a sufficiently low level. With this, every kind of accelerator can be targeted from any programming language. This issue is especially urgent for FPGAs as they are inherently harder to use.

The HSA Approach

These management problems of heterogeneous hardware are targeted by the not-for-profit Heterogeneous System Architecture (HSA) Foundation. It has published a set of royalty-free specifications to provide a uniform solution for different architectures. Using this reference protocol, compilers can map constructs that describe parallelism from any language to the vendor-neutral API.
Furthermore, the HSA specification also provides a virtual ISA, called HSAIL, which abstracts the kernel code itself. A “finalizer” then generates the architecture-specific code just in time. Therefore, with an HSA-enabled compiler front end and a suitable finalizer, a language- and accelerator-independent workflow can be realized. However, HSAIL is modeled more closely after regular GPU assembly code than arbitrary logic. For this reason the mapping to FPGAs is significantly more complex than to instruction-based accelerators.
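The finalizer concept can be made concrete with a toy sketch. The virtual "instruction set" below is invented purely for this illustration; real HSAIL is far richer, and a real finalizer emits architecture-specific machine code rather than wrapping an interpreter as done here:

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Invented stand-ins for virtual-ISA instructions (NOT real HSAIL opcodes).
enum class VOp { ADD, MUL };

struct VInst {
  VOp op;     // virtual operation
  float imm;  // immediate operand
};

// "Finalize" a portable virtual instruction stream into a host-native
// callable at load time. A production finalizer would instead translate
// the stream into the target accelerator's native ISA.
std::function<float(float)> finalize(const std::vector<VInst>& kernel) {
  return [kernel](float x) {
    for (const VInst& i : kernel)
      x = (i.op == VOp::ADD) ? x + i.imm : x * i.imm;
    return x;
  };
}
```

The same virtual stream could be handed to a different finalizer for a GPU, DSP, or (with more effort, as the text notes) an FPGA, which is exactly the portability the intermediate representation buys.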

A Shared HSA Workflow for FPGAs

Nevertheless, an analysis showed that it is possible to build an HLS-based HSA workflow that seamlessly extends the existing GPU and DSP solutions. A proof of concept realized this by automatically combining generated circuits with traditional hardware components such as SIMT (single instruction, multiple thread) schedulers and caches.
With this concept, FPGAs can utilize the single-source programming paradigm and all source languages available to the HSA Foundation. Moreover, the language front ends, as well as the host and HSAIL compiler back ends, are completely independent of the accelerator; as a consequence, they can be shared between CPU, GPU, DSP, and FPGA. With this hybrid approach, newly supported languages from any vendor can be directly leveraged by FPGAs. A greater selection of languages and accelerators therefore becomes available, which significantly improves developer productivity. For companies, it also reduces the absolute time needed to develop a toolchain for their heterogeneous system.
About FAU
Marc Reichenbach is a postdoctoral researcher and Philipp Holzinger is a Ph.D. student at the Chair of Computer Architecture (headed by Prof. Dr.-Ing. Dietmar Fey) at FAU. Founded in 1743, FAU is a research university with an international perspective and one of the largest universities in Germany, with 40,174 students, 256 degree programs, 4,000 academic staff (including over 647 professors), and 500 partnerships with universities all over the world. Contact Mr. Reichenbach at marc.reichenbach@fau.de; Mr. Holzinger at philipp.holzinger@fau.de.

The Inherent Freedom of Heterogeneous Systems

By Zvonimir Bandic. IEEE Computer Society: https://www.computer.org/publications/tech-news/heterogeneous-system-architecture/The-Inherent-Freedom-of-Heterogeneous-Systems.
Few can deny that the choke points for many of today’s increasingly sophisticated systems continue to be hardware cost and power consumption. It’s no surprise that system performance has tapered off for lack of “architectural vision”. One way out of this dilemma has been the introduction of heterogeneous computing, a concept fundamentally based on the premise that systems can use more than one kind of processor or core. Performance and energy efficiency rise by adding dissimilar co-processors that typically use specialized processing capabilities to handle specific tasks.

The Rise of Open, Memory Centric Architecture
Taking heterogeneity to the next level calls for the creation of an architectural platform that uses a memory-centric architecture. This means that heterogeneous systems can now reside in the same memory address space and share memory coherently. It also means that multiple processors can write to memory without first copying data to a separate address space, while other processors simultaneously read from the same location. Access is coordinated using atomic operations and everything remains consistent. A key advantage is that this enables reuse of a considerable amount of legacy software.
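The coordination described above can be sketched with host threads standing in for heterogeneous processors. This is an analogy rather than a model of any particular coherence protocol: the point is that atomic updates to one shared location stay consistent without any copy into a private address space:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Several "processors" (threads here; distinct devices in a real
// memory-centric system) update one shared location with atomic
// operations while sharing a single coherent address space.
long shared_sum_after(int writers, int increments_each) {
  std::atomic<long> shared{0};  // one location, no per-writer copies
  std::vector<std::thread> pool;
  for (int w = 0; w < writers; ++w)
    pool.emplace_back([&] {
      for (int i = 0; i < increments_each; ++i)
        shared.fetch_add(1, std::memory_order_relaxed);  // no copy-out/in
    });
  for (auto& t : pool) t.join();
  return shared.load();
}
```

Because every increment is an atomic read-modify-write on the shared location, the result is deterministic no matter how the writers interleave, which is the consistency property the memory-centric design relies on.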
New Freedom to Connect RISC-V Compute Nodes: OmniXtend™ protocol
The open, memory-centric architecture provides vast new avenues of flexibility for today’s designers. Unlike earlier architectures that hamstrung designers in what they could and could not connect, the new architectural paradigm means that a diverse array of RISC-V compute nodes can be connected to universally shared memory (NUMA) using standardized and open coherence protocols, such as Western Digital’s OmniXtend™ (see Figure 1), a new open approach to providing cache-coherent memory over an Ethernet fabric. This leverages the full power and promise of heterogeneous computing, including capabilities based on an open-source architecture supported by a full spectrum of hardware, from the ubiquitous Ethernet physical layer to programmable Ethernet switches such as the Barefoot Tofino P4 programmable switch. One can imagine the flexibility this gives designers: a free hand to attach CPUs, AI devices, memories, and other devices and systems. It is worth mentioning that this new breed of heterogeneous system can easily be compatible with other ISA platforms and belong to the same memory domain. This means that companies can now bring low-cost, high-capacity memory solutions to systems using the new memory-centric architecture, and it allows an entire ecosystem to be built around innovative new peripherals beyond memories, including accelerator systems for AI workloads.

Figure 1. Memory-centric architecture with OmniXtend: it utilizes a P4 programmable Ethernet switch and the ubiquitous Ethernet PHY, allowing large numbers of RISC-V compute nodes to connect to universally shared memory (NUMA) using standardized and open coherence protocols. This concept also enables aggregation and disaggregation of memory through a memory appliance (a memory-heavy node).
Open Sourcing the Core
To ensure maximum efficiency, today’s vast silos of data must be closer to their compute “geographies.” Proprietary CPUs can no longer handle the demands of vastly sophisticated systems. RISC-V-based designs allow open standard interfaces to be used, which, in turn, enable specialty processing, memory-centric designs, unique storage, and flexible interconnect applications. One can foresee the emergence of new data-centric applications such as the Internet of Things (IoT), secure processing, industrial controls and more. This innovative, new open approach will finally deliver cache-coherent memory over an Ethernet “fabric.”
Open Sourced RISC-V Cores and Ecosystem Enablement
Building open networked cache-coherency schemes such as Western Digital’s OmniXtend requires a whole ecosystem of building blocks: processors, accelerators and compute cores. The RISC-V instruction set architecture is open and allows such an ecosystem to be built. As part of this process, Western Digital recently released the SweRV Core™ (see https://github.com/westerndigitalcorporation/swerv_eh1) and an associated instruction set simulator (ISS). The SweRV Core is a 32-bit, mostly in-order, 2-way superscalar core with a 9-stage pipeline and superior performance compared to current open-sourced RISC-V cores.
One is left with the inescapable conclusion that any truly viable system must be based on platforms that include a memory-centric architecture. Heterogeneous systems should all reside in the same memory domain and share memory in a coherent way. In addition, legacy software should be able to piggy-back on this new architecture to take advantage of the many benefits it provides today’s systems designers.
Western Digital, SweRV Core, and OmniXtend are registered trademarks or trademarks of Western Digital Corporation or its affiliates in the US and/or other countries. All other marks are the property of their respective owners.

New Generation of Heterogeneous Systems for AI Applications

Embedded Systems Engineering: http://eecatalog.com/machine-learning-ai/2019/01/15/new-generation-of-heterogeneous-systems-for-ai-applications/

Open computing platforms can take AI past the training wheels stage, with benefits for automotive, healthcare, industrial and more.
Today’s AI applications will touch every aspect of our lives—including transport, finance, retail, health care, smart manufacturing, education, and services industries. AI technologies will be at the forefront of digitally connected cars, smart manufacturing, and medical image recognition. The question to ask ourselves is: how can we leverage the power of AI across today’s diverse systems and protocols? The answer lies in an emerging new ecosystem designed to unite many of today’s heterogeneous “pieces of computing power.”
Bringing Abstraction to Heterogeneous Platforms
Because heterogeneous processors are widely available, new platforms will be expected to leverage a huge amount of computing power, including acceleration units (GPUs, DSPs, and FPGAs). Understandably, artificial intelligence, machine learning, and neural networks are at the forefront of this new computing paradigm. New architectures are also needed to address the massive computing capability augmented by CPU cluster-based computers. Migrating this approach to the mainstream presents a challenge, principally because heterogeneous programming models have not been standardized and lack portability.
Enter HSA’s Open Computing Platform
The challenge facing many industries is that existing architectures are inadequate for today’s AI and big data workloads. An open computing platform of Heterogeneous Systems Architecture (HSA) offers an elegantly viable solution. This new breed of architecture will spearhead an entirely new realm of opportunities, not the least of which are autonomous driving, more computing power, and robust data centers. Systems designers will finally have an efficient new ecosystem, one designed specifically to address today’s burgeoning array of computer architectures and protocols.
Easier Programming of Hetero Devices
The HSA Foundation’s consortium of semiconductor companies, tools/IP providers, software vendors, and academic institutions develops royalty-free standards and open-source software. This makes it dramatically easier to program heterogeneous computing devices. It reduces the complexity of heterogeneous systems through a new ecosystem, one that specifies runtime and system-architecture APIs built on cache-coherent shared virtual memory hardware. No more time-consuming operating system calls: systems now dispatch work at the user level. With single-source programming, both control code and compute code reside in the same file or project, so expert programmers no longer need to master a separate tool chain for each processor.
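The single-source idea described above can be illustrated with a short, hypothetical Python sketch. The agent names, capability sets, and dispatch logic are illustrative inventions, not part of the HSA specification; in a real HSA runtime, dispatch enqueues a packet to an agent's user-level queue rather than calling a function.

```python
# Illustrative sketch of single-source dispatch: control code and
# compute "kernels" live in one file, and a tiny runtime picks an agent.
# Agent names and capability sets below are hypothetical examples.

AGENTS = {
    "cpu": {"supports": {"scalar", "vector"}},
    "gpu": {"supports": {"vector"}},
    "dsp": {"supports": {"scalar"}},
}

def dispatch(kind, kernel, data):
    """Run `kernel` on the first agent that supports `kind` of work."""
    for name, caps in AGENTS.items():
        if kind in caps["supports"]:
            # A real HSA runtime would enqueue a dispatch packet to the
            # agent's user-level queue; here we simply call the function.
            return name, kernel(data)
    raise RuntimeError(f"no agent supports {kind!r} work")

# Control code and compute code in the same source file:
double = lambda xs: [2 * x for x in xs]
agent, result = dispatch("vector", double, [1, 2, 3])
print(agent, result)  # prints: cpu [2, 4, 6]
```

The point of the sketch is that the programmer writes one program; the runtime, not the programmer, decides which processor executes each piece of work.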
Programming in Standard Languages
Another key benefit for AI applications developers is that the HSA platform supports a variety of programming languages. Compilation tools are available from both proprietary and open-source projects (LLVM and GCC). HSA compilers are available for C/C++, OpenCL, OpenMP, C++ AMP, Python, and more. This flexibility vastly extends the power and reach of AI applications now on many drawing boards.
Leveraging Developer Productivity
Defined as a productivity engine that leverages the power and potential of heterogeneous computing, HSA removes many of the barriers of traditional heterogeneous programming. Developers can finally focus on their algorithms without having to micro-manage system resources. The goal is to sponsor applications that seamlessly blend scalar processing with high-performance computing on CPUs, GPUs, DSPs, Image Signal Processors, VLIWs, Neural Network Processors, FPGAs, and more.
There’s little doubt that AI applications will impact how we live, work, and play. AI technologies will be at the forefront of digitally connected transportation, smart manufacturing, and medical technologies. But it will be the power and flexibility of heterogeneous computing that will make these AI breakthroughs feasible and change the face of our world.


Dr. John Glossner is president of the HSA Foundation.

New Generation of Heterogeneous Systems for AI Applications

Computing Now, HSA Connections: https://www.computer.org/portal/web/hsa-connections/content?g=54930593&type=article&urlTitle=new-generation-of-heterogeneous-systems-for-ai-applications


Resolving the Challenges in Heterogeneous Computing

By Dr. John Glossner, HSA Foundation. Embedded Computing Design: http://www.embedded-computing.com/hardware/resolving-the-challenges-in-heterogeneous-computing
Programmers have historically faced major challenges in implementing applications in their existing programming languages, not the least of which were the multiple native instruction set architectures (ISAs) inherent in heterogeneous processors. Today, those concerns are being put to rest, thanks to the introduction of Heterogeneous System Architecture (HSA).
A complex System on Chip (SoC) is at the heart of most electronic products today. Typically composed of a wide range of IP blocks, often from different vendors, these blocks include everything from general-purpose processors (CPUs) to Deep Neural Networks (DNNs). Each is often designed and programmed in a different, proprietary language, creating a tech “Tower of Babel” for developers. Understandably, a solution had to be found, one that efficiently and cost-effectively addresses today’s growing hardware diversity.
The Move Toward Heterogeneous Architectures
Heterogeneous System Architecture has successfully addressed programming multiple different processors and exploiting the power of heterogeneity. Developers have grown increasingly aware of heterogeneous chips and their potential to dramatically reduce the power needed to perform complex compute applications. When programs are optimized for specialized heterogeneous systems, each system processor can execute code using the least power required for that particular function. The result is more performance at lower power than non-heterogeneous systems.
But there is another benefit to HSA, one that finally allows developers to design and program increasingly complex heterogeneous systems much faster. It helps to ensure that the right processor is used at the right time for the right task. Combined with cache-coherent shared virtual memory, HSA systems realize high-bandwidth access to memory, increased application performance and reduced power consumption.
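The "right processor at the right time" idea above can be sketched in a few lines of Python. The agent table and power figures are hypothetical examples for illustration, not measured data or part of any HSA specification: for each task, the scheduler chooses the lowest-power agent capable of executing it.

```python
# Hypothetical sketch: for each operation, select the least-power agent
# that can execute it — "the right processor for the right task".
# The agent list, supported ops, and wattages are illustrative only.

AGENTS = [
    {"name": "dsp", "ops": {"filter"}, "watts": 0.5},
    {"name": "gpu", "ops": {"matmul", "filter"}, "watts": 8.0},
    {"name": "cpu", "ops": {"matmul", "filter", "branchy"}, "watts": 15.0},
]

def pick_agent(op):
    """Return the lowest-power agent capable of running `op`."""
    capable = [a for a in AGENTS if op in a["ops"]]
    if not capable:
        raise ValueError(f"no agent can run {op!r}")
    return min(capable, key=lambda a: a["watts"])

print(pick_agent("filter")["name"])   # prints: dsp
print(pick_agent("matmul")["name"])   # prints: gpu
print(pick_agent("branchy")["name"])  # prints: cpu
```

Under this (simplified) policy, a signal-processing task lands on the DSP rather than burning CPU power, which is the intuition behind the lower total power consumption claimed for heterogeneous systems.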
Best of Both Worlds
Heterogeneous computing combines the best of general purpose computing and specialized computing. It specifies how a CPU can “talk” to an accelerator and often finds both integrated onto the same silicon die. So heterogeneous processors—such as CPUs, GPUs, DSPs, FPGAs, specialized accelerators and others—can finally be integrated and cooperate to achieve an ideal balance of performance and power consumption for a given application. Understandably, most designers today favor greater integration in the systems they build. While this adds a level of difficulty to the design process, the benefits of this approach—speed, fewer devices and lower overall cost—outweigh the inherent challenges.
Creating a Uniform Standard
HSA computing standards have made major strides since HSAF was founded in 2012. Today, there are not only royalty-free open specifications available but also fully operational production systems. HSA has become increasingly attractive to systems designers. It simplifies heterogeneous programming, creating standards that allow different types of processors to be programmed using many common programming languages, including C/C++, Python, OpenCL, and Java. HSA ingeniously uses a single source file and automatically distributes parts of an application to the processor best suited to do the actual computing.
Survey Underscores HSA’s Broad Appeal
In a recent survey of HSA Foundation members, 100 percent indicated their systems have HSA features and 80 percent are now HSA-compliant. Respondents also cited improved SoC design and programming processes, greater interoperability between blocks from different IP suppliers, higher performance and lower power consumption. Most companies indicated they would continue to use multiple programming languages including ISO C++, ISO C11/C99, OpenMP 3.1/4.0 with C, and several others. Respondents also indicated a need to develop solutions for technologies that include Global Debug, further defining the memory model, security, virtualization and extensions to HSAIL.
Boon to Users
Heterogeneous systems are at the core of a variety of technological disruptions. Tablets, smartphones and scientific computers were all created as specialized systems. Going forward, heterogeneous architectures are playing a vital role in creating the next generation of disruptive devices. This includes 46 percent of desktops and mobile devices; 69 percent of servers, IoT, and embedded devices; and 92 percent of AI and computer vision systems.
China and Beyond
The recently concluded Heterogeneous Computing Standards & International AI Conference, held in Xiamen, China, is helping to lay the groundwork for heterogeneous computing standards not only in China but worldwide. The conference brought together industry leaders to discuss processors, software, applications, machine learning, and fintech for heterogeneous systems in artificial intelligence applications. The subsequent series of China Regional Committee (CRC) Technical Symposiums this year has proposed the first version of standards and reviewed a number of additional proposals, with all working groups providing input on the overall framework, key content, and new features. The goal is for key scientists and companies from China to adopt and adapt these technologies, and for these specifications to be incorporated worldwide.
Lastly, the HSA Foundation will join the World Artificial Intelligence Conference (WAIC 2018), to be held in Shanghai West Bund September 17-19 (http://www.waic2018.com/index-en.html), and co-host the WAIC 2018 | Heterogeneous Computing Summit Forum on September 19 as an important part of the three-day event. Themed “Heterogeneous Computing, Standards Establishment and AI Empowerment,” the forum will invite global companies, world-renowned experts, Chinese officials, and representatives from industry, universities, and research institutes to introduce the latest developments in global heterogeneous computing, release the latest results of China’s heterogeneous computing standards research, and share typical applications of the deep integration of heterogeneous computing and AI.
About the HSA Foundation
The HSA (Heterogeneous System Architecture) Foundation is a non-profit consortium of SoC IP vendors, OEMs, Academia, SoC vendors, OSVs, and ISVs, whose goal is making programming for parallel computing easy and pervasive. HSA members are building a heterogeneous computing ecosystem, rooted in industry standards, which combines scalar processing on the CPU with parallel processing on the GPU while enabling high bandwidth access to memory and high application performance with low power consumption. HSA defines interfaces for parallel computation using CPU, GPU and other programmable and fixed function devices while supporting a diverse set of high-level programming languages, and creating the foundation for next-generation, general-purpose computing.

HSA Foundation China Regional Committee & China Standards Group for Heterogeneous System Architecture 2nd Technical Symposium to Be Held in Nanjing

To Focus on Advancements in China’s Heterogeneous Computing Standards
NANJING, China, June 12, 2018 — The Heterogeneous System Architecture (HSA) Foundation China Regional Committee (CRC) and China Standards Group for Heterogeneous System Architecture Technical Symposium – Nanjing Session, is scheduled to be held on June 20th in Jiangbei New District, Nanjing to discuss the progress of China’s heterogeneous computing standards and its applications and achievements in artificial intelligence, IoT, and robotics. HSAF President Dr. John Glossner will be one of the featured presenters.
The Symposium is sponsored by the China Electronic Standardization Institute (CESI), an HSA Foundation promoter member, and the HSA Foundation China Regional Committee (CRC). The China Standards Group for Heterogeneous System Architecture (CSH), Nanjing IC Industry Service Center, Huaxia General Processor Technologies (Huaxia GPT), and Nanjing University of Technology are serving as co-organizers.
HSA is a standardized platform design that unlocks the performance and power efficiency of the parallel computing engines found in most modern electronic devices. It allows developers to easily and efficiently apply the hardware resources—including CPUs, GPUs, DSPs, FPGAs, fabrics and fixed function accelerators—in today’s complex systems-on-chip (SoCs).
Previously, on May 29, the HSA Foundation CRC and CSH Technical Symposium – Hunan Session was successfully held at the Hunan Institute of Technology. More than 50 experts and scholars from member organizations of the HSA Foundation, the CRC, the CSH, relevant universities, research institutes, and companies attended the Symposium.
The Symposium focused on the latest developments in China’s heterogeneous computing standards research. During the Symposium participants discussed heterogeneous computing in artificial intelligence, robotics, ADAS/autonomous driving, IoT, software-defined communications and other applications. Technical discussions and proposals provided key insights to help support product development, R&D, ecosystem formation, and conformance tests in related industries.
The Symposium proposed the first version of standards and reviewed a number of additional standards proposals with all working groups providing input on the overall framework, key content, and new features. At the meeting, Dr. Xiaodong Zhang, the Chair of the CRC and CSH, introduced the working groups’ goals, progress, and plans for 2018. Afterwards, standard proposals and deliberations were conducted in areas such as standard application evaluation, instruction set architectures, system architecture, security protection, and network interconnection. Members attending included State Grid, Huaxia GPT, Fudan University, Jiangsu Research Center of Software Defined Radio, HME, NationZ Technologies, and Sun Yat-Sen University.
 
Follow the HSA Foundation on Twitter, Facebook, LinkedIn and Instagram.