HSA Foundation China Regional Committee & China Standard Group of Heterogeneous System Technical Symposium to Be Held in Hunan

YUEYANG, CHINA, May 22, 2018 — The Heterogeneous System Architecture (HSA) Foundation China Regional Committee (CRC) and China Standard Group of Heterogeneous System Technical Symposium – Hunan Session, will be held on May 29 in Yueyang, Hunan.
The Symposium is sponsored by the China Electronic Standardization Institute (CESI), an HSA Foundation promoter member, and the HSA Foundation China Regional Committee (CRC). China Standard Group of Heterogeneous System and Hunan Institute of Technology are serving as co-organizers.
The Symposium will focus on the latest developments in China’s heterogeneous computing standards research and on applications of heterogeneous computing in artificial intelligence, software-defined communications and other fields.
“The HSA Foundation CRC has led the way in nationalizing heterogeneous computing standards in China and continues to advance the mission of the Foundation, which is to make heterogeneous programming universally easier,” said Dr. John Glossner, the Foundation’s president.
 
AGENDA

  1. General Info
    1. Date/Time: May 29, 2018 | 9:00-18:00
    2. Venue: Hunan Institute of Technology, Yueyang, Hunan
    3. Participants: HSA Foundation Members, HSA Foundation China Regional Committee (CRC) Members, Related Universities, Research Institutes and Companies
  2. Topics
    a. Morning Session: 9:00-12:00

i. Interpreting the Industry Standard Establishment Process by the Ministry of Industry and Information Technology – Fang Liu, CESI (10
ii. Interpreting the Draft Standard Recommendations by the CRC Working Groups (60 minutes)

– Application & System Evaluation Working Group
– Virtual ISA Working Group
– System Architecture Working Group
– Compilation & Runtime LIB Working Group
– OS & Multivendor Working Group
– Interconnect Working Group
– Security & Protection Working Group
– Conformance Test Working Group

iii. System Requirements and Standardization Concepts for Software-Defined Satellite Chip Platforms (30 minutes)
iv. 2018 Working Objectives and Execution Plan for the CRC – Dr. Xiaodong Zhang, Chair of the CRC (30 minutes)
v. Recommendations for Implementation of Variable-Length Vector Parallel Computing in Heterogeneous Computing Standards – Dr. Lei Wang, Huaxia (Beijing) General Processor Technologies (30 minutes)

    b. Afternoon Session: 14:00-18:00

i. Recommendations for Heterogeneous Computing Instruction Architecture for Convolutional Neural Network Parallel Computing – Dr. Jun Han, State Key Laboratory of ASIC, Fudan University (30 minutes)
ii. Recommendations for Software-Defined Radio Instruction Architecture based on Heterogeneous Computing Standards – Dr. Chaoyang Tian, Jiangsu Research Center of Software Digital Radio (30 minutes)
iii. Several Enhancements to Security and Protection Strategies for Heterogeneous Computing Systems – Shaowei Chen, Senior Expert, Nationz Technologies (30 minutes)
iv. Recommendations for New Heterogeneous Computing Network Interconnection Specification Compatible with International Mainstream Standards – Dr. Zhiyi Yu, Sun Yat-sen University (30 minutes)
v. HSA Lecture Series (120 minutes)
1. HSA Basic Knowledge Series
2. HSA Basic Knowledge Series – the Use of HSA Open Source Tools and Programming Practices
3. HSA Basic Knowledge Series – the Status of HSA Artificial Intelligence Open Source Library

 
About the HSA Foundation
The HSA (Heterogeneous System Architecture) Foundation is a non-profit consortium of SoC IP vendors, OEMs, Academia, SoC vendors, OSVs and ISVs, whose goal is making programming for parallel computing easy and pervasive. HSA members are building a heterogeneous computing ecosystem, rooted in industry standards, which combines scalar processing on the CPU with parallel processing on the GPU, while enabling high bandwidth access to memory and high application performance with low power consumption. HSA defines interfaces for parallel computation using CPU, GPU and other programmable and fixed function devices, while supporting a diverse set of high-level programming languages, and creating the foundation for next-generation, general-purpose computing.
Follow the HSA Foundation on Twitter, Facebook, LinkedIn and Instagram.

Proof-of-Concept C++17 Parallel STL Offloading for GCC/libstdc++

Computing Now, HSA Connections: https://www.computer.org/portal/web/hsa-connections/content?g=54930593&type=article&urlTitle=proof-of-concept-c-17-parallel-stl-offloading-for-gcc-libstdc-
Introduction:
Parmance and General Processor Technologies have been collaborating on C++17 Parallel STL offloading support based on HSA (Heterogeneous System Architecture) and GCC (GNU Compiler Collection). A working proof-of-concept has now been released and is available at https://github.com/parmance/par_offload. This post is a high-level overview of the project.
Heterogeneous Offloading and C++17
The C++17 standard, released in December 2017, adds execution policies to the algorithm definitions of its standard template library (STL). Execution policies enable the programmer to declare that the algorithm library call, along with any user-defined functionality the call uses, is safe to execute in parallel. The user-defined functionality is referred to as “element access functions” (EAF) by the standard.
The PSTL (Parallel Standard Template Library) of C++17 focuses on forward progress guarantees and their implications for parallelization safety on homogeneous processors.
However, there is no “parallel heterogeneous offloading execution policy” yet in the C++ standard; there seems to be an implicit assumption that the parallel execution will occur in the same processor where it was invoked. To make the offloading decisions explicit, for our offloading implementation, we defined a new execution policy type ‘parallel_offload_policy’ (par_offload) which the programmer can use to declare “heterogeneous offload” or “multiple-ISA” safety for the involved user-defined functions.
A call to the ‘transform’ PSTL function with this policy looks like the following:
    std::transform(std::execution::experimental::par_offload,
                   pixel_data.begin(), pixel_data.end(),
                   pixel_data.begin(),
                   [](char c) -> char {
                       return c * 16;
                   });
In this case, a lambda function is applied to every element of pixel_data, a std::vector, with the processing offloaded to a heterogeneous device if one is available.
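As a rough illustration of how an algorithm can select a code path from an execution-policy argument, the sketch below uses tag dispatch on the policy type. The names seq_policy, offload_policy, transform_with and scale_pixels are hypothetical and are not taken from the par_offload project; the offload overload simply falls back to the host, since device launch is outside the scope of a short example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical policy tags for illustration only; these are not the
// par_offload project's actual types.
struct seq_policy {};
struct offload_policy {};

// Sequential overload: forwards straight to the standard algorithm.
template <class InIt, class OutIt, class Fn>
OutIt transform_with(seq_policy, InIt first, InIt last, OutIt out, Fn fn) {
    return std::transform(first, last, out, fn);
}

// Offload overload: chosen at compile time by the tag type. A real
// implementation would launch the work on a device here; this sketch
// always takes the host fallback path.
template <class InIt, class OutIt, class Fn>
OutIt transform_with(offload_policy, InIt first, InIt last, OutIt out, Fn fn) {
    return std::transform(first, last, out, fn);
}

// Mirrors the call above: multiply every element by 16 in place.
inline std::vector<char> scale_pixels(std::vector<char> pixels) {
    transform_with(offload_policy{}, pixels.begin(), pixels.end(),
                   pixels.begin(),
                   [](char c) -> char { return static_cast<char>(c * 16); });
    return pixels;
}
```

Because the policy is a distinct type, the choice between overloads costs nothing at run time, which is one plausible reason to express "offload safety" as a policy rather than as a run-time flag.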
Shared Virtual Memory
Explicit data management is problematic in the case of offloading general purpose C/C++ programs that assume a unified address space and allow passing pointers to functions without attached size information. Indeed, a single unified coherent address space across all the processors in a heterogeneous platform would remove a major obstacle in heterogeneous platforms and make programming such devices much simpler.
Heterogeneous System Architecture (HSA) (1.0 published in March 2015) is a language neutral standard targeting heterogeneous systems. It defines a cache-coherent shared global virtual memory as a core feature. That is, an HSAF heterogeneous platform supports data sharing across devices (called agents) as easily as in “homogeneous” C/C++ multithreaded programming.
In the GCC PSTL offloading work we used the HSA Runtime as heterogeneous platform middleware and relied on the coherent system memory capabilities of the HSA Full Profile. HSA is interesting for this use case most importantly because of its shared heterogeneous memory requirement, which is expected to work seamlessly with the C/C++ memory model. A wide selection of open source components implementing different parts of the specifications is also available. For example, its intermediate language HSAIL already has both front-end and back-end support in upstream GCC. There are also implementations of its runtime API that enable development and testing via offloading to CPU-based targets.
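The pointer-passing problem described in this section can be illustrated with a linked list. The sketch below is ordinary host C++ with hypothetical names (Node, sum_list, flatten_for_copy); it does not use the HSA Runtime. It only contrasts code that could run unchanged on any agent sharing the address space with the extra flattening step that explicit copying would require.

```cpp
#include <cassert>
#include <vector>

// A pointer-linked structure with no size information attached.
struct Node {
    int value;
    Node* next;
};

// With a shared, coherent virtual address space, any agent could run this
// as written: the pointers mean the same thing on every processor.
inline int sum_list(const Node* head) {
    int total = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        total += n->value;
    }
    return total;
}

// Without shared memory, the host would first have to walk the structure
// and pack it into a sized buffer that can be copied explicitly.
inline std::vector<int> flatten_for_copy(const Node* head) {
    std::vector<int> buffer;
    for (const Node* n = head; n != nullptr; n = n->next) {
        buffer.push_back(n->value);
    }
    return buffer;
}
```

The flattening step has to be written per data structure and repeated whenever the data changes, which is the kind of burden a unified address space is meant to remove.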
Implementation Status and Future Plans
We now have a working proof-of-concept offloading implementation of several PSTL algorithms, with multiple ways to define the user-specified functionality. The implementation supports lambda functors (with and without captures), C functions, std::functions containing C functions, function objects, and user-defined data types.
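For readers less familiar with these callable kinds, the following sketch shows each of them passed to the standard sequential std::transform. It does not use the offloading policy, and the names add_one, Scale, Pixel, apply_all and darken are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// A plain C-style function.
inline int add_one(int x) { return x + 1; }

// A function object (functor) carrying state.
struct Scale {
    int factor;
    int operator()(int x) const { return x * factor; }
};

// A user-defined data type used as the element type.
struct Pixel {
    int r, g, b;
};

// Applies each kind of callable in turn to the same vector.
inline std::vector<int> apply_all(std::vector<int> v) {
    int offset = 10;
    // Lambda without captures.
    std::transform(v.begin(), v.end(), v.begin(), [](int x) { return x * 2; });
    // Lambda with a capture.
    std::transform(v.begin(), v.end(), v.begin(),
                   [offset](int x) { return x + offset; });
    // C function.
    std::transform(v.begin(), v.end(), v.begin(), add_one);
    // std::function containing a C function.
    std::function<int(int)> wrapped = add_one;
    std::transform(v.begin(), v.end(), v.begin(), wrapped);
    // Function object.
    std::transform(v.begin(), v.end(), v.begin(), Scale{3});
    return v;
}

// Lambda operating on a user-defined element type.
inline std::vector<Pixel> darken(std::vector<Pixel> px) {
    std::transform(px.begin(), px.end(), px.begin(),
                   [](Pixel p) { return Pixel{p.r / 2, p.g / 2, p.b / 2}; });
    return px;
}
```

Each kind reaches the algorithm in a different form (a closure type, a function pointer, a type-erased wrapper), which is presumably why supporting all of them in an offloading implementation is worth listing separately.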
Next we plan to properly integrate the prototype into libstdc++ and GCC, implement the rest of the algorithms and finally optimize the performance.
Links and references
The code on GitHub: https://github.com/parmance/par_offload
ISO/IEC 14882:2017 Programming languages — C++ Publication date 2017-12 https://www.iso.org/standard/68564.html
Heterogeneous Systems Architecture Foundation http://www.hsafoundation.com

Developing Heterogeneous Cache Coherent SoCs

Chip Design: http://eecatalog.com/chipdesign/2017/12/19/developing-heterogeneous-cache-coherent-socs/
Automotive and other customer needs are not what they once were.
As with other challenges, the task of successfully developing heterogeneous cache coherent SoCs demands an understanding of your customers’ requirements. Conducting interviews with your customers aids this understanding. For example, through the interview you should learn:

  • How many IPs are needed to connect to the heterogeneous system;
  • What kind of bandwidth each IP requires;
  • The types of IPs that are in the system;
  • What kind of features you would enable in the interconnect IP.

The next step is to define “heterogeneity” because, while many people use the word “heterogeneous,” it has a number of meanings. Some guidelines:

  • You must have different types of processors within the same system;
  • Different processor types also have different cache structures. For example, an Arm CPU would use the same cache structure as another Arm core, but a different CPU may have a different cache structure;
  • Different types of IPs must also be considered:
    • CPUs, GPUs, and DSPs
    • IPs that make up an SoC, such as those for connectivity, USB, SATA, etc.

A highly flexible snoop filter architecture accommodates different cache structures of different kinds of processors. It also reduces the number of memory bits required to perform snoop filtering.
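As a software analogy for what a snoop filter tracks, the sketch below keeps one presence bit per caching agent for each cache line, so that a write only needs to snoop the agents that may hold a copy. This is a generic directory-style model with hypothetical names (SnoopFilter, on_fill, on_evict, snoop_targets) and an assumed limit of 32 agents; it is not a description of any vendor's hardware.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Illustrative directory-style snoop filter: one presence bit per
// caching agent, up to 32 agents (an assumption of this sketch).
class SnoopFilter {
public:
    // Record that `agent` now holds a copy of the line at `line_addr`.
    void on_fill(std::uint64_t line_addr, unsigned agent) {
        assert(agent < 32);
        sharers_[line_addr] |= (1u << agent);
    }

    // Record that `agent` evicted its copy of the line.
    void on_evict(std::uint64_t line_addr, unsigned agent) {
        assert(agent < 32);
        auto it = sharers_.find(line_addr);
        if (it == sharers_.end()) return;
        it->second &= ~(1u << agent);
        if (it->second == 0) sharers_.erase(it);  // no sharers left
    }

    // Bitmask of agents to snoop when `writer` wants exclusive access.
    std::uint32_t snoop_targets(std::uint64_t line_addr, unsigned writer) const {
        assert(writer < 32);
        auto it = sharers_.find(line_addr);
        if (it == sharers_.end()) return 0;  // nobody caches it: no snoops
        return it->second & ~(1u << writer);
    }

private:
    std::unordered_map<std::uint64_t, std::uint32_t> sharers_;
};
```

A line that no other agent has cached yields an empty mask, so no snoop traffic is generated for it; that saving is the purpose of the filter.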
Adapt to Changing Customer Needs
Understanding what the customer requirements are for non-coherency and coherency is a must. Are the coherent and non-coherent domains separated, a full merger, or a customized mix? ArterisIP, for instance, has developed a component called a non-coherent bridge. Its purpose is to drive non-coherent accesses into the coherent domain.
A few years ago, coherency systems were small and compact with a maximum of three to four different processors. Coherency was confined to CPU clusters, and functionality was grouped under an application. Coherency wasn’t necessarily distributed beyond a subsystem.
However, customer needs are changing, and today there is a need for greater processor performance. Companies are adding more and different types of processors. In addition:

  • SoC layouts are expanding tremendously;
  • Processors are growing larger;
  • Complex layouts are affecting the coherency domain;
  • The coherent domain is expanding all over the chip.

So how do you handle all of this? First, you must make sure the infrastructure is designed to distribute coherency system-wide. The interconnect technology must enable network packet transport and accommodate a variety of topologies, such as ring and mesh. The infrastructure must also be configurable and flexible because as design complexity continues to grow, designers need to understand which topologies best suit a particular chip layout. Having the proper tools to predict where complexities might cause performance and power issues in the chip layout stage is critical to adapting to the layout and discovering which topology best resolves these issues.
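One simple way to compare topologies such as ring and mesh is by hop count. The sketch below is a simplified model, assuming a bidirectional ring, a 2D mesh with nodes numbered row by row, and dimension-ordered routing; the function names are hypothetical, and a real interconnect evaluation would also weigh bandwidth, wiring and floorplan constraints.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

// Hops between nodes a and b on a bidirectional ring of n nodes.
inline int ring_hops(int a, int b, int n) {
    assert(n > 0 && a >= 0 && a < n && b >= 0 && b < n);
    int d = std::abs(a - b);
    return std::min(d, n - d);  // take the shorter way around
}

// Hops between nodes a and b on a 2D mesh that is `width` nodes wide,
// numbered row by row, with dimension-ordered (XY) routing.
inline int mesh_hops(int a, int b, int width) {
    assert(width > 0 && a >= 0 && b >= 0);
    return std::abs(a % width - b % width) + std::abs(a / width - b / width);
}

// Worst-case hop count over all pairs of the n nodes.
template <class HopFn>
int worst_case_hops(int n, HopFn hops) {
    int worst = 0;
    for (int a = 0; a < n; ++a) {
        for (int b = 0; b < n; ++b) {
            worst = std::max(worst, hops(a, b));
        }
    }
    return worst;
}
```

Under these assumptions, 16 nodes give a worst case of 8 hops on the ring and 6 hops on a 4x4 mesh, and the gap widens as the node count grows.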
Optimizing Power Consumption of Complex Systems
To optimize for power, first, you need to provide a power-ready IP. Once this is accomplished, you need to implement some tried and true techniques—these may include voltage domain, power domain, clock gating, and high-level clock gating.
When an IP is power-ready, it will have connectivity to a power interface and can be controlled by a PMU (Power Management Unit) in the system. The PMU will decide when to shut down the IP – i.e. when it is not in use or not needed by the system. At the application level, this power-aware controller (PMU) can lower system power consumption by putting an IP on idle.
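The PMU decision described above can be pictured as a small state machine. The sketch below is a toy model with hypothetical names (IpBlock, pmu_tick) and made-up thresholds; it gates an IP after a run of idle cycles, powers it off after a longer run, and wakes it as soon as work arrives.

```cpp
#include <cassert>

enum class PowerState { Active, Idle, Off };

// Hypothetical model of one power-ready IP block as seen by the PMU.
struct IpBlock {
    PowerState state = PowerState::Active;
    int idle_cycles = 0;
};

// Illustrative PMU policy: move to Idle after `idle_after` consecutive
// quiet cycles and to Off after `off_after`. Both thresholds are
// assumptions of this sketch, not values from the article.
inline void pmu_tick(IpBlock& ip, bool has_work, int idle_after, int off_after) {
    assert(0 < idle_after && idle_after <= off_after);
    if (has_work) {  // any request wakes the block and resets the count
        ip.state = PowerState::Active;
        ip.idle_cycles = 0;
        return;
    }
    ++ip.idle_cycles;
    if (ip.idle_cycles >= off_after) {
        ip.state = PowerState::Off;
    } else if (ip.idle_cycles >= idle_after) {
        ip.state = PowerState::Idle;
    }
}
```

A real PMU would also account for wake-up latency and the energy cost of each transition when choosing such thresholds.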
Maturing to Meet Challenges
Heterogeneous SoCs are still in development and haven’t yet matured. But processors in the coherent domain are now sharing data with each other. Other CPUs and GPUs have become cache coherent, although I’m confident we can do a lot more.
Moreover, data sharing is not only between the processor and the GPU, but among all the IPs of the system, a concept that is still a work in progress. This idea must be pushed a little bit farther to achieve total coherency. Today not many non-coherent IPs share data with coherent IPs. But applications are emerging that need coherency, and this will bring new requirements.
Some of these design challenges are hindering product development, for example, for Advanced Driver-Assistance Systems (ADAS) for automotive. Automotive applications have performance requirements and the need to share data with heterogeneous processors to achieve those requirements. We’ll see the introduction of new features to this market. Other markets will include artificial intelligence and machine learning.
A decade ago, mobile application processors were driving the need for cache coherency. Next, data center systems took over as the primary drivers. Now the automotive market is fuelling the race to extend cache coherency to all of the heterogeneous processing elements in SoCs. In two or three years, a new trend will emerge to extend heterogeneous cache coherency even further, but designers will need flexibility, configurability and scalability to ensure that these systems are high-performance, low-latency, and power- and cost-efficient.


J.P. Loison is Corporate SoC Application Architect, ArterisIP, which provides system-on-chip (SoC) interconnect IP to accelerate SoC semiconductor assembly for a wide range of applications. These applications include those spanning automobiles to mobile phones, IoT, cameras, SSD controllers, and servers for customers such as Samsung, Huawei / HiSilicon, Mobileye (Intel), Altera (Intel), and Texas Instruments. The company is located in Campbell, CA.

HSA Foundation China Regional Committee Wraps Up Successful 2nd Annual Symposium

Wide Array of Interfaces, Specs Discussed for Next Gen of Heterogeneous Computing, AI, SDR, and More
BEIJING, CHINA, DEC. 20, 2017 — The China Regional Committee (CRC) of the Heterogeneous System Architecture (HSA) Foundation has successfully concluded its 2nd Symposium in Beijing. The CRC was formed earlier this year; its mandate is to enhance the awareness of heterogeneous computing and promote the adoption of standards such as Heterogeneous System Architecture (HSA) in China.
More than 40 representatives of the CRC members and related companies, research institutes and universities throughout China attended the conference. HSA Foundation President Dr. John Glossner also participated in this important benchmark meeting that exchanged ideas on important topics including interfaces and specifications for the next generation of heterogeneous computing, vector parallel computing model, system security and protection, artificial intelligence, software defined radio, Network-on-Chip (NoC), and programming of commercial HSA chips. The meeting was co-organized by China Electronics Standardization Institute (CESI) and the HSA Foundation’s CRC, and sponsored by Huaxia General Processor Technologies.
Last year the HSA Foundation held its first Global Summit in Beijing. The CRC has actively carried out various work in conjunction with CESI for the development of global heterogeneous computing standards with a China focus.
At the meeting, each CRC working group shared its progress and insights on related key technologies:
Application & System Evaluation Working Group – “The application situation and development trend of artificial intelligence in China and typical rigid demands and key indicators of artificial intelligence” – presented by State Grid;
Virtual ISA Working Group – “Artificial intelligence instruction set design for heterogeneous computing and exploratory research of HSAIL artificial intelligence extended subset” – presented by Dr. Jun Han, Fudan University;
Interconnect Working Group – “Latest research results on network-on-chip in the heterogeneous computing SoCs, and the next step verification and standardization work arrangements” – presented by Dr. Zhiyi Yu, Sun Yat-sen University;
Compilation & Runtime LIB Working Group – “The latest research trends in vector computing models and related programming models, and basic recommendations for facilitating integration into HSA system architectures” – presented by Dr. Lei Wang, Huaxia General Processor Technologies;
System Architecture Working Group – “Using HSA to systematically address the basic views of software-defined communications, software-defined radio, heterogeneous multi-core chip architecture and application development” – presented by Wanting Tian, Sanechips Technology;
Security & Protection Working Group – “Research work and principles on adapting heterogeneous computing for security protection” – presented by Shaowei Chen, Nationz Technologies.
The CRC has been adding members since the first CRC Symposium in May; new members include Huaqiao University, Hunan University, Jimei University, Tsinghua University, Xiamen University, Xiamen University of Technology and Zhejiang University.
Supporting quotes:
“The HSA Foundation CRC has been laying the groundwork for standardization progress in heterogeneous computing standards in China for almost a year. It is focused on supporting the needs of HSA Foundation members in China and helping to fulfill the mission of the Foundation, which is to make heterogeneous programming universally easier.”
Dr. John Glossner, HSA Foundation President
“Since its formation, the CRC has received the support and attention of many academic institutions, companies, and government authorities in China. The work product and coverage of the CRC has been expanding and developing rapidly, making it one of China’s first “innovative brands” for standardization of heterogeneous computing. In 2018 the CRC and HSAF will work towards adoption of the v1.2 specifications and extensions enabling the transformation of HSA chips and platform products in many applications.”
Dr. Xiaodong Zhang, HSA Foundation CRC Chair
“The main research direction of our team is Software Defined Radio. Due to the flexibility of SDR, it allows for implementation across a wide range of applications. The earliest SDR platforms were based on FPGAs and DSPs with large size and high-power consumption making generalized SDR systems problematic. However, the HSA platform provides new possibilities for SDR research. HSA has many advantages such as low power consumption, low cost, and high integration. Those are hard to find in traditional SDR platforms.”
Dr. Ming Zhao, Professor, Tsinghua University
“Micro-Processor Research and Development Center (MPRC) of Peking University is the pioneer of innovating indigenous microprocessor (CPU) and computer systems in China. To minimize the digital gap between developed and developing countries, MPRC is committed to the development of computers with independently developed CPUs and heterogeneous SoCs. The advantage of a heterogeneous architecture is the ability to be adaptable. During the evolution from desktop computing to mobile computing to Big Data, systems that adapt are the ones that are most successful. MPRC will work together with other members in HSA Foundation to improve life with heterogeneous technology.”
Dr. Junlin Lu, Deputy Director of MPRC, Peking University

Developing Heterogeneous Cache Coherent SoCs – and More! Q&A with ArterisIP's J.P. Loison, Corporate SoC Application Architect

Computing Now, HSA Connections: https://www.computer.org/portal/web/hsa-connections/content?g=54930593&type=article&urlTitle=developing-heterogeneous-cache-coherent-socs-and-more-

Editor’s Note: 
ArterisIP provides system-on-chip (SoC) interconnect IP to accelerate SoC semiconductor assembly for a wide range of applications from automobiles to mobile phones, IoT, cameras, SSD controllers, and servers for customers such as Samsung, Huawei / HiSilicon, Mobileye (Intel), Altera (Intel), and Texas Instruments. The company is located in Campbell, CA.
Describe in detail the various design challenges faced today in developing Heterogeneous Cache Coherent SoCs.
The first thing you need to do is understand customer requirements. This includes asking the right questions, some of which may include:
  • How many IPs need to connect to the heterogeneous system?
  • What kind of bandwidth does each IP require?
  • What kind of IP is involved, and what kind of features can you enable with interconnect IP?
The next step is to define heterogeneity, because many people use the word “heterogeneous” but mean different things by it. Some key tasks and guidelines:
  • You must have different types of processors within the same family;
  • Then you have to accommodate different types of processors that are available on the market.
  • Different processor types also have different cache structures.
    • An ARM CPU would use the same cache structure as another ARM core all over the processor.
  • A different CPU poses a different cache structure.
  • Accommodate different types of IPs as well:
    • CPUs, GPUs, and DSPs;
    • Then there are all the other types of IP that you combine into an SoC, such as connectivity IP, USB, SATA, etc.
It’s also important to be able to accommodate different (cache) protocol systems in terms of coherent and non-coherent protocol. Some examples:
  • Flexible snoop filter capability accommodates different cache structures of different kinds of processors.
    • Snoop filter capabilities operate in two different directions to accommodate any cache structure of any processor that is available today.
    • Another challenge: Reduce the number of memory bits that you need to perform snoop filtering.
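To give a feel for the memory-bit challenge, the arithmetic below compares a filter that keeps one presence bit per agent with a coarser one that keeps one bit per cluster of agents. Grouping agents into clusters is a general textbook technique assumed here for illustration, not something the interview attributes to any product, and the function names and parameters are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Storage for a full-map filter: each entry holds an address tag plus
// one presence bit per caching agent.
inline std::uint64_t full_map_bits(std::uint64_t entries, unsigned tag_bits,
                                   unsigned agents) {
    return entries * (tag_bits + agents);
}

// Storage when one presence bit covers a whole cluster of agents.
// Coarser tracking saves bits but can trigger unnecessary snoops.
inline std::uint64_t coarse_map_bits(std::uint64_t entries, unsigned tag_bits,
                                     unsigned agents,
                                     unsigned agents_per_cluster) {
    assert(agents_per_cluster > 0);
    unsigned clusters = (agents + agents_per_cluster - 1) / agents_per_cluster;
    return entries * (tag_bits + clusters);
}
```

With 1,024 entries, 30-bit tags and 16 agents, the full map needs 47,104 bits, while clusters of four agents bring that down to 34,816 bits at the cost of less precise filtering.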
How do you integrate IP that is not cache coherent and achieve better performance? Can you provide a brief example or two?
You need to understand what the customer requirements are in terms of the mix of non-coherency and coherency requirements.  Are they separated, a full merger of both domains or a customized mix? Arteris, for instance, developed a component called a non-coherent bridge.  Its purpose is to drive non-coherent accesses back into the coherent domain.  It also enables a differentiator between the non-coherent and coherent domains.
How do you create a cache-coherent system that is easily placed on a chip?
A few years ago, coherency systems were small and compact – a max of three to four different processors. Coherency was confined to CPU clusters, functionality was grouped under an application and all subsystems were connected to an application.
But coherency wasn’t necessarily distributed beyond a subsystem. Customer needs are changing: there is a need for greater processor performance, and companies are adding more and different types of processors. In addition:
  • SoC layouts are expanding tremendously;
  • Processors are growing larger;
  • Complex layouts affect coherency domain;
  • Coherent domain is expanding all over the chip.
So how do you handle it?  First, you must make sure the infrastructure is designed to distribute coherency system-wide.  It has to be an interconnect technology that enables network packet transport and it also must accommodate a variety of topologies such as ring and mesh. The infrastructure must also be configurable and flexible because as design complexity continues to grow, designers must be able to understand which topologies are best suited for a particular chip layout. Having the proper tools that can predict where complexities might cause performance and power issues in the chip layout stage is critical to revising the layout and providing the best solution in terms of which topology might resolve these issues.
How can you optimize power consumption of complex systems? 
You first need to provide power-ready IP; once this is accomplished, you need to implement some well-known techniques – these may include voltage domain, power domain, clock gating and high-level clock gating.
If an IP is power-ready, it will also have connectivity to a power interface and can be controlled by a PMU (power management unit) in the system that will decide when to shut down the IP when it is not in use or not needed by the system. At the application level, this power-aware controller (PMU) can lower system power consumption by putting an IP on idle.
How long will it take to reasonably surmount some/all of the aforementioned issues?
Heterogeneous SoCs are still in development and haven’t yet matured. But processors in the coherent domain are now sharing data with each other. Other CPUs and GPUs have become cache coherent, although I’m confident we can do a lot more.
Data sharing is not only between processor and GPU, but between all of the IPs of the system – it’s a concept that is in progress. This idea must be pushed a little bit farther to achieve total coherency. Today there are still not too many non-coherent IPs sharing data with coherent IPs. But we’re now starting to see applications emerging that need coherency, and this will bring new requirements.
Are these design challenges currently hindering product development in select verticals?  If so, which ones?
Yes, one that comes to mind is ADAS (Advanced Driver-Assistance Systems) for automotive. Automotive applications will have a lot of requirements because of the need to add performance and share data with heterogeneous processors to achieve those requirements. We’ll see the introduction of new features to this market. Other markets will include artificial intelligence and machine learning.
A decade ago, mobile application processors were driving the need for cache coherency and then data center systems started becoming the primary driver. Now the automotive market is driving the need to extend cache coherency to all of the heterogeneous processing elements in SoCs. In two or three years, a new trend will emerge to extend heterogeneous cache coherency even further – but designers will need flexibility, configurability and scalability to ensure that these systems are high-performance, low-in-latency and reasonable in terms of power consumption and cost.

Everything You Need to Know About Why AMD Open Sourced the OpenCL Driver Stack for ROCm

Computing Now, HSA Connections: https://www.computer.org/portal/web/hsa-connections/content?g=54930593&type=article&urlTitle=hsa-connectio-1
Introduction: AMD is a co-founder and member of the HSA Foundation.  This article is excerpted and edited from a blog post by Vincent Hindriksen, founder of Stream HPC, a Netherlands-based software development company.
Last May, AMD open sourced the OpenCL driver stack for ROCm. With this they kept their promise to open source (almost) everything. Earlier the hcc compiler, kernel-driver and several other parts were open sourced.
Why is this a big thing?
There are indeed several open source OpenCL implementations, but with one big difference: they’re secondary to the official compiler/driver. So, implementations like PortableCL and Intel Beignet play catch-up. AMD’s open source implementations are primary.
They contain:
  • OpenCL 1.2 compatible language runtime and compiler
  • OpenCL 2.0 compatible kernel language support with OpenCL 1.2 compatible runtime
  • Support for offline compilation right now – in-process/in-memory JIT compilation is to be added.
Performance of ROCm was mostly on par with AMD’s closed source drivers, with a few outliers. A few months ago ROCm 1.6 was released, where again performance was noticeably improved. For the next release performance improvements are expected again.
Why was it open sourced?
There were several reasons. AMD listened carefully to their customers in HPC, while taking note of where the industry was going.
Get deeper understanding of how functions are implemented
It’s useful to understand how functions are implemented. For instance, the difference between sin() and native_sin() can tell you a lot more about which is best to use. It doesn’t tell how the functions are implemented on the GPU, but does tell which GPU-functions are called.
Learning a new platform has never been so easy. Deep understanding is needed if you want to go beyond “it works”.
Debug software deeper
Any software engineer has experience with libraries that don’t perform as promised or work as documented. Integration issues with “black box” libraries are therefore a typical reason for big project delays. If the library were open source, the debugger could step in and give all information needed to solve the problem quickly.
When working with drivers it’s about the same. GPU drivers and compilers are extremely complex and inevitably your project hits that one bug nobody encountered before. With all open source drivers, you can step into the driver with the same debugger. Moreover, the driver can be recompiled with fixed code instead of having to write a less secure work-around.
Get bugs solved quicker
A trace now includes the driver-stack and the line-numbers. Even a suggestion for a fix can be given. This also helps reduce the time to get the fix for all steps. When a fix is suggested AMD only needs to test for regression to accept it. This makes the work for tools like CLsmith a lot easier.
A bonus of open source projects is that over time the code quality becomes better than projects where code is never seen by outsiders, which also adds to quicker solving of bugs.
Get low-priority improvements in the driver
Popular software like Blender and the LuxMark benchmark can expect to get attention from driver developers. For the rest of us, we have to hope our special code-constructions are comparable to one that is targeted. This results in many forum comments and bug reports being written, for which the compiler team doesn’t have enough time. This is frustrating for both sides.
Now everyone can help build a driver for everyone.
Get support for completely new things
Proprietary code needs official access and legal documents that have all kinds of restrictions, which open source code does not.
More often there is opportunity in what is not there yet, and research needs to be done to break the chicken-egg conundrum. Optimized 128-bit computing? Easy complex numbers in OpenCL? Native support for Halide as an alternative to OpenCL? All up-to-date driver-code is available to make these possible.
Nurture other projects
Code can be “borrowed” from AMD’s projects and be used in (un)expected places. This ranges from GPU-simulators to experimental compilers.
Currently the forks of the ROCm-driver are mostly used to fix bugs or are thousands of commits behind. Who knows what the future brings.
Get better support in more Linux distributions
It’s easier to include open source drivers in Linux distributions. These OpenCL drivers do need binary firmware (which has been disassembled and seems to do as advertised). There is a discussion about whether firmware can be seen as hardware and marked as “libre”, but the fact is that AMD’s contributions to the Linux 4.x kernel do get accepted.
Improve and increase university collaborations
While the software was proprietary, it was only possible to work on AMD’s compiler infrastructure under strict contracts. In the end it was easier to focus on the open source backends of LLVM than to go through the legal path.
Universities are very important for finding unexpected opportunities, integrating the latest research, bringing in potential new employees and doing research collaborations. Timour Paltashev (senior manager, Radeon Technology Group, GPU architecture and global academic connections) can be reached via timour dot paltashev at amd dot com for more info.
Final words
It probably makes total sense to open source the drivers. The key advantages are reduced costs and increased control due to easier debugging and bug-solving.
AMD is now a modern hardware company that understands software is a crucial part of their products. They believe that open source software gives an edge over the competition and made this bold move to let everybody peek into their kitchen.

HSA Q&A with Dr. John Glossner

Computing Now: https://www.computer.org/web/computingnow/insights/content?g=53319&type=article&urlTitle=hsa-connections
HSA computing standards have progressed significantly since the HSA Foundation (HSAF) was established in 2012. Today, for instance, there are not only royalty free open specifications available but also fully operational production systems.

Pictured: Representatives from newly joined HSA Foundation members in China
In this Q&A, Dr. John Glossner, HSA Foundation president, provides additional insights on HSA-specific trends and issues:
What are the connections/differences between heterogeneous computing, general purpose computing and specialized computing? If heterogeneous computing is the future, what will happen to general purpose computing and specialized computing?
General purpose computing is what you find in a CPU. It is meant to be able to process any function, but streaming-data workloads, such as artificial intelligence (AI), might not always be processed efficiently on a CPU.
Specialized computing would be a design made for one particular application such as AI, but it would not be intended to run general purpose code (sometimes called control code). The specialized accelerator typically has the advantage that it uses much less power to execute the special purpose application (e.g., AI).
Heterogeneous computing combines the best of both. It specifies how a CPU can talk to an accelerator and often finds both integrated onto the same silicon die. So heterogeneous processors, meaning processors of different types such as CPUs, GPUs, DSPs, specialized accelerators and others, are all integrated together and cooperate to achieve an ideal balance of performance and power consumption for a given application.
What is the ultimate goal for the HSAF? What needs to be done to achieve this, and how?
The goal of the HSA Foundation is to make heterogeneous programming easier. That means creating standards that allow different types of processors to be programmed in the same language, using one single source file, and then automatically distributing parts of the application to the best processor to do the computing.
If research institutions and companies participate in establishing and promoting the standards of heterogeneous computing, will it affect their current development and solutions?
With open specifications and open source implementations of standards and tools, the Foundation’s hope is that it accelerates the pace of development and adoption of the technology. Corporations participating in HSAF enjoy royalty free access to all technologies developed.
The Foundation announced the formation of the China Regional Committee (CRC) in May. What were the motivations and goals in establishing the CRC and what are the connections/differences between CRC standards and HSA standards?
While the HSA Foundation has made a lot of progress, there are always regional considerations and research opportunities to improve current systems. Recently China has become a leader in AI and other semiconductor technologies. With the emergence of low latency applications such as AI and virtual reality (VR), the Foundation anticipates improvements to current specifications. As this is an area of research and development being led by China, it is natural to invite key scientists and companies from China to adopt and adapt technologies and specifications.
How many local organizations have joined the CRC? What are members’ perspectives?
More than 30 members have joined the CRC to date. They comprise semiconductor companies, research universities and institutes (e.g., Chinese Academy of Sciences), tools and algorithms designers, test verification, and China standardization groups.
What effects will in-depth research and development of heterogeneous computing standards and technologies have on promoting China’s semiconductor industry advances?
China has become a global leader in semiconductor development and algorithms such as AI that execute on semiconductor chips. Heterogeneous systems that are now emerging are expected to accelerate R&D throughout the global industry. The formation of the CRC and future global adoption of the work done by the CRC should advance China’s semiconductor industry as well as contribute to worldwide growth.
What are the implications of developing and promoting heterogeneous computing standards for the creation of China’s heterogeneous computing industry chain and ecosystem?
While the algorithms that the CRC is evaluating are of immediate concern within China, it is expected that the entire global community and ecosystem will benefit from the standardization work being performed by the CRC.
Do heterogeneous computing chips have a wide range of AI applications? What are the specific advantages?
Heterogeneous chips have the potential to dramatically reduce the electric power to perform AI applications. When programs are optimized for specialized heterogeneous systems, each processor in the system can execute code that is most power efficient for its own function. This provides higher performance at lower power than non-heterogeneous systems.
What should China do to rapidly cultivate the heterogeneous computing industry?
By participating in the HSAF CRC, China can adapt and adopt technologies related to heterogeneous systems for China-specific issues. However, it is anticipated that these enhancements will be integrated into global HSAF specifications because the problems are common to many semiconductor companies.

Heterogeneous Computing Standards & International AI Conference Paving the Way Towards Global HSA Specifications

Xiamen, Fujian, China, July 9, 2017 – The recently concluded Heterogeneous Computing Standards & International AI Conference, held in Xiamen, is helping to lay the groundwork for heterogeneous computing standards not only in China, but worldwide. The two-day event was co-hosted by the China Electronic Standardization Institute (CESI), the HSA Foundation and the Chinese Association of Artificial Intelligence, with an organizing committee including Huaxia General Processor Technologies, the HSA Foundation’s newly formed China Regional Committee (CRC), and the Xiamen Integrated Circuit Industry Association.
Heterogeneous System Architecture (HSA) is a standardized platform design that unlocks the performance and power efficiency of the parallel computing engines found in most modern electronic devices. It provides an ideal mainstream platform for next-generation SoCs in a range of applications including artificial intelligence.
The Heterogeneous Computing Standards & International AI Conference brought together a number of industry leaders to discuss processors, software, applications, machine learning, and fintech for heterogeneous systems in artificial intelligence applications. HSA Foundation members including AMD, Arteris, Cadence, CESI, Huaxia General Processor Technologies, Imagination Technologies, Shanghai Advanced Research Institute – Chinese Academy of Sciences, and Xiamen University shared their latest results with hundreds of participants at the event.
Other presenting companies included Creekspring AI, DeepGlint, DeepPhi Tech, Gold Medal Global Investment, ICETech, KACHIP, Sanechips Technology, State Grid, and others. Dozens of renowned scholars and officials from universities, institutes and related industry companies also participated in the event.
The recently formed HSA Foundation CRC is laying the groundwork for progress on heterogeneous computing standards in China. It is focused on supporting the needs of HSA Foundation members in China and helping to fulfill the mission of the Foundation, which is to make heterogeneous programming universally easier. The formation of the CRC and potential global adoption of the work done by the CRC will advance China’s semiconductor industry as well as contribute to worldwide growth.
“As China is emerging as a powerhouse in programming heterogeneous systems, AI and semiconductor technology, it is natural to invite key scientists and companies from China to adopt and adapt technologies and specifications. We fully anticipate that these changes will not remain local to CRC working groups but will be incorporated into the global specifications and adopted worldwide,” said Dr. John Glossner, HSA Foundation president.
The CRC has instituted the following working groups and elected their Chairs to evaluate, enhance and develop HSA technologies:
• Application & System Evaluation Working Group – Dr. Kunlun Gao, Global Energy Interconnection Research Institute
• Virtual ISA Working Group – Dr. Jun Han, Fudan University
• System Architecture Working Group – Wanting Tian, Sanechips Technology
• Compilation & Runtime LIB Working Group – Dr. Lei Wang, Huaxia GPT
• OS & Multivendor Working Group – Dr. Min Gong, Beijing Linx Technology
• Interconnect Working Group – Dr. Zhiyi Yu, Sun Yat-sen University
• Security & Protection Working Group – Dr. Songhai Liang, Nationz Technologies
• Conformance Test Working Group – Dawei Chen, CESI
Supporting Quotes:
“Nearly all SoCs are heterogeneous systems. The HSA Foundation’s technology makes programming these systems much simpler by providing single-source toolchains, common APIs, and a choice of programming languages. When executed on an HSA runtime, both high performance and low power can be achieved. GPT has licensable cores supporting HSA technologies and is actively contributing to the development of the specifications. By internally adopting HSA, GPT has accelerated development of heterogeneous systems in multiple application domains including machine learning and artificial intelligence.”

Kerry Li, CEO, Huaxia General Processor Technologies

“China is firmly placed at the heart of heterogeneous systems, AI and semiconductor technology, with the HSA Foundation playing a key role in increasing awareness within the industry of the challenges and driving the availability of solutions. The recent China Regional Committee event was a real triumph and Imagination was very pleased to participate in such a successful event. The event highlighted just how much potential heterogeneous computing has in terms of AI. As a founding member of the HSA Foundation, we look forward to continuing our work with other members to create specifications that make it easier to develop and program heterogeneous SoCs, as well as developing IP cores that enable the realization of such SoCs.”

James Liu, VP and GM China, Imagination Technologies

Application & System Evaluation Working Group
“Our goal is to verify the advanced nature of the HSA technology and the applicability of the HSA standards through a typical application demonstration. HSA has become a trend in advanced computing technology; its huge technical potential cannot just stay on paper in standards. It must also play a role in multiple applications, demonstrating its technical value through verification in actual cases. State Grid, the world’s largest public service corporation, has an urgent need for high-speed computing and artificial intelligence computing for ultra-large-scale power grids. We anticipate that HSA technology will be used in the future to meet these computing needs and ensure the smooth implementation of the national strategy on Global Energy Interconnection.”

Dr. Kunlun Gao, Director of Computing and Application Lab, Global Energy Interconnection Research Institute

Virtual ISA Working Group
“Our working group will focus on a virtual, explicitly parallel ISA that brings parallel acceleration to high-level languages. The virtual ISA, called HSAIL, can be finalized to native ISAs of different architectures such as CPU, GPU, DSP, custom accelerator, etc. Enabling data parallel programming is a key feature of HSAIL, so flexible vector processing, such as variable vector lengths and mixed-precision vector operations, will be involved in the technical discussion of our group. Moreover, some special instructions related to AI applications might also be considered for inclusion in HSAIL. This is an important open problem so far.”

Dr. Jun Han, Professor, Fudan University

System Architecture Working Group
“The establishment of the CRC will drive the HSA standardization process, and the CRC will become an important force in building HSA standards. The CRC System Architecture Working Group will study the necessity and performance advantages of heterogeneous architecture from an overall perspective, and topics brought by heterogeneous architecture on processor design, interconnected bus design, memory system design, low-power design, and testability design, etc. in order to form the heterogeneous architecture design methodology.”

Wanting Tian, Vice President, Sanechips Technology

Compilation & Runtime LIB Working Group
“The compiler and runtime are interrelated components that connect HSA and its working groups, supporting the virtual ISA and operating system interface specification. The compiler and runtime are the main way users evaluate the system, and they directly determine the developer/user experience with the HSA system.”

Dr. Lei Wang, Technical Director, Huaxia General Processor Technologies

OS & Multivendor Working Group
“We are dedicated to providing operating system support for the CRC and HSA Foundation. The main focus of the OS & Multivendor Working Group will include kernel work on system security and multi process resource sharing as well as coordinating multiple vendors on hardware-OS and application development.”

Dr. Min Gong, Chief Scientist, Beijing Linx Technology

Interconnect Working Group
“The interconnect network is becoming increasingly important due to the larger number of heterogeneous cores and more advanced fabrication technology. The CRC’s Interconnect Working Group will organize experts with a strong background from academia and industry. Our goal is to evaluate interconnect network protocols/standards for many-core heterogeneous systems, which will be efficient, scalable, and reusable in various systems.”

Dr. Zhiyi Yu, Professor, Sun Yat-sen University

Security & Protection Working Group
“There is no doubt that the security and protection issues have become the foundation of the key technologies of heterogeneous computing for heterogeneous system architectures. The main task of the Security & Protection Working Group is to systematically solve the problem of safe operation and system protection of the HSA, and to develop a corresponding interface strategy and specifications from various aspects of instruction, thread, process, storage, IO, on-chip interconnection, operating system, application, etc. This is to promote and ensure the sustainable development and healthy growth of the new generation of heterogeneous computing chip products and its ecosystem.”

Dr. Songhai Liang, Chief Scientist of SoC Design, Nationz Technologies

Conformance Test Working Group
“CESI’s role in the CRC is to handle standardization and conformance testing. HSA technology has a significant influence on the design of the next generation of SoCs. With the aim of promoting positive developments for the HSA Foundation, it is necessary for relevant parties to make efforts to research and develop relevant technical specifications of HSA and to lead relevant companies to adopt and commercialize the specifications. CESI can provide relevant products with tests and verifications that are compliant with the standards of the HSA.”

Dawei Chen, Professor & Research Center Director, CESI

About the HSA Foundation
The HSA (Heterogeneous System Architecture) Foundation is a non-profit consortium of SoC IP vendors, OEMs, Academia, SoC vendors, OSVs and ISVs, whose goal is making programming for parallel computing easy and pervasive. HSA members are building a heterogeneous computing ecosystem, rooted in industry standards, which combines scalar processing on the CPU with parallel processing on the GPU, while enabling high bandwidth access to memory and high application performance with low power consumption. HSA defines interfaces for parallel computation using CPU, GPU and other programmable and fixed function devices, while supporting a diverse set of high-level programming languages, and creating the foundation for next-generation, general-purpose computing.
Follow the HSA Foundation on Twitter, Facebook, LinkedIn and Instagram.