Proof-of-Concept C++17 Parallel STL Offloading for GCC/libstdc++

Posted on May 10, 2018 by mfrickie

Computing Now, HSA Connections: https://www.computer.org/portal/web/hsa-connections/content?g=54930593&type=article&urlTitle=proof-of-concept-c-17-parallel-stl-offloading-for-gcc-libstdc-
Introduction:
Parmance and General Processor Technologies have been collaborating on C++17 Parallel STL offloading support based on HSA (Heterogeneous System Architecture) and GCC (GNU Compiler Collection). A working proof-of-concept has now been released and is available at https://github.com/parmance/par_offload. This post is a high-level overview of the project.
Heterogeneous Offloading and C++17
The C++17 standard, released in December 2017, adds execution policies to the algorithm definitions of its standard template library (STL). Execution policies enable the programmer to declare that the algorithm library call, along with any user-defined functionality the call uses, is safe to execute in parallel. The user-defined functionality is referred to as “element access functions” (EAF) by the standard.
The PSTL (Parallel Standard Template Library) of C++17 focuses on forward progress guarantees and their implications for parallelization safety on homogeneous processors.
However, there is no “parallel heterogeneous offloading execution policy” yet in the C++ standard; there seems to be an implicit assumption that the parallel execution will occur on the same processor where it was invoked. To make the offloading decisions explicit in our implementation, we defined a new execution policy type ‘parallel_offload_policy’ (par_offload), which the programmer can use to declare “heterogeneous offload” or “multiple-ISA” safety for the involved user-defined functions.
A call to the ‘transform’ PSTL function with this policy looks like the following:
    std::transform(std::execution::experimental::par_offload,
                   pixel_data.begin(), pixel_data.end(),
                   pixel_data.begin(),
                   [](char c) -> char {
                       return c * 16;
                   });
In this case, a lambda function is applied to every element of pixel_data, a std::vector, and the processing is offloaded to a heterogeneous device if one is available.
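To illustrate how a policy type can select an offloading overload, here is a minimal sketch using tag dispatch. The namespace `sketch`, the overload, and the helper `scale_pixels` are hypothetical names invented for this illustration; the project's actual policy lives in its modified libstdc++ and may be implemented quite differently. The sketch falls back to the sequential host algorithm instead of dispatching to a device.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in namespace, not the project's real one.
namespace sketch {

// Tag type: carries no data and only selects an overload at compile time.
struct parallel_offload_policy {};
inline constexpr parallel_offload_policy par_offload{};

// Overload chosen when the first argument is the offload policy.
// A real implementation would hand the loop body to an offload device;
// this sketch simply runs the sequential host algorithm.
template <class InputIt, class OutputIt, class UnaryOp>
OutputIt transform(parallel_offload_policy, InputIt first, InputIt last,
                   OutputIt out, UnaryOp op) {
    return std::transform(first, last, out, op);
}

}  // namespace sketch

// Same shape as the call shown in the post: scale every pixel in place.
inline std::vector<char> scale_pixels(std::vector<char> pixel_data) {
    sketch::transform(sketch::par_offload,
                      pixel_data.begin(), pixel_data.end(),
                      pixel_data.begin(),
                      [](char c) -> char { return static_cast<char>(c * 16); });
    return pixel_data;
}
```

Because the policy is an empty tag, choosing it costs nothing at run time; the compiler resolves the overload from the argument type alone.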
Shared Virtual Memory
Explicit data management is problematic in the case of offloading general purpose C/C++ programs that assume a unified address space and allow passing pointers to functions without attached size information. Indeed, a single unified coherent address space across all the processors in a heterogeneous platform would remove a major obstacle in heterogeneous platforms and make programming such devices much simpler.
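The problem with pointers that carry no size information can be seen in a small host-only sketch. The `Node` type and `sum_list` function are illustrative names, not part of the project. The function receives only a pointer, so a runtime that manages data explicitly has no single address range to copy to the device.

```cpp
#include <cassert>

// A linked structure: each node is reachable only through a raw pointer,
// so there is no single (address, size) pair that describes the data.
struct Node {
    int value;
    const Node* next;
};

// The function sees only a pointer. With explicit data management, a runtime
// would have to discover and copy every node before offloading this code.
// With a shared virtual address space, a device could follow the same
// pointers that the host uses.
inline int sum_list(const Node* head) {
    int total = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        total += n->value;
    }
    return total;
}
```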
Heterogeneous System Architecture (HSA), whose 1.0 specification was published in March 2015, is a language-neutral standard targeting heterogeneous systems. It defines a cache-coherent shared global virtual memory as a core feature. That is, an HSA heterogeneous platform supports data sharing across devices (called agents) as easily as in “homogeneous” C/C++ multithreaded programming.
In the GCC PSTL offloading work we use the HSA Runtime as the heterogeneous platform middleware and rely on the coherent system memory capabilities of the HSA Full Profile. HSA is interesting for this use case most importantly because of its shared heterogeneous memory requirement, which is expected to work seamlessly with the C/C++ memory model. A wide selection of open source components implementing different parts of the specifications is also available. For example, its intermediate language HSAIL already has both front end and back end support in upstream GCC. There are also implementations of its runtime API that enable development and testing via offloading to CPU-based targets.
Implementation Status and Future Plans
We now have a working proof-of-concept offloading implementation of several PSTL algorithms, with multiple ways to define the user-specified functionality. The implementation supports lambda functors (with and without captures), C functions, std::functions containing C functions, function objects, and user-defined data types.
Next we plan to properly integrate the prototype into libstdc++ and GCC, implement the rest of the algorithms, and finally optimize the performance.
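As a rough illustration of the kinds of user-specified functionality listed above, the following sketch passes each one to the standard sequential std::transform. It does not use the project's par_offload policy, and all names here (`add_one`, `Doubler`, `Pixel`, `apply_all`, `invert`) are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Plain C-style function.
inline int add_one(int x) { return x + 1; }

// Function object.
struct Doubler {
    int operator()(int x) const { return 2 * x; }
};

// User-defined data type used as the element type.
struct Pixel {
    int r, g, b;
};

inline std::vector<int> apply_all(std::vector<int> v, int offset) {
    // Lambda without captures.
    std::transform(v.begin(), v.end(), v.begin(), [](int x) { return x * x; });
    // Lambda with a capture.
    std::transform(v.begin(), v.end(), v.begin(),
                   [offset](int x) { return x + offset; });
    // C function.
    std::transform(v.begin(), v.end(), v.begin(), add_one);
    // std::function containing a C function.
    std::function<int(int)> f = add_one;
    std::transform(v.begin(), v.end(), v.begin(), f);
    // Function object.
    std::transform(v.begin(), v.end(), v.begin(), Doubler{});
    return v;
}

// Lambda operating on a user-defined data type.
inline std::vector<Pixel> invert(std::vector<Pixel> pixels) {
    std::transform(pixels.begin(), pixels.end(), pixels.begin(),
                   [](Pixel p) { return Pixel{255 - p.r, 255 - p.g, 255 - p.b}; });
    return pixels;
}
```

Each of these callables reaches the algorithm in a different form (a closure type, a function pointer, a type-erased wrapper, a class with operator()), which is presumably why an offloading implementation has to handle them as separate cases.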
Links and references
The code on GitHub: https://github.com/parmance/par_offload
ISO/IEC 14882:2017 Programming languages — C++ Publication date 2017-12 https://www.iso.org/standard/68564.html
Heterogeneous Systems Architecture Foundation http://www.hsafoundation.com

Five Minutes with John Glossner, President of HSA Foundation

Posted on February 1, 2018 by mfrickie

Foundation President Dr. John Glossner was interviewed recently on Embedded Computing Design’s “Five Minutes With..” podcast. Please click here to listen to the segment.

Developing Heterogeneous Cache Coherent SoCs

Posted on December 21, 2017 by mfrickie

Chip Design: http://eecatalog.com/chipdesign/2017/12/19/developing-heterogeneous-cache-coherent-socs/
Automotive and other customer needs are not what they once were.
As with other challenges, the task of successfully developing heterogeneous cache coherent SoCs demands an understanding of your customers’ requirements. Conducting interviews with your customers aids this understanding. For example, through the interview you should learn:

How many IPs are needed to connect to the heterogeneous system;
What kind of bandwidth each IP requires;
The types of IPs that are in the system;
What kind of features you would enable in the interconnect IP.

The next step is to define “heterogeneity” because, while many people use the word “heterogeneous,” it has a number of meanings. Some guidelines:

You must have different types of processors within the same system;
Different processor types also have different cache structures. For example, an Arm CPU would use the same cache structure as another Arm core, but a different CPU may have a different cache structure;
Different types of IPs must also be considered:
- CPUs, GPUs, and DSPs
- IPs that make up an SoC, such as those for connectivity, USB, SATA, etc.

A highly flexible snoop filter architecture accommodates different cache structures of different kinds of processors. It also reduces the number of memory bits required to perform snoop filtering.
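For readers unfamiliar with snoop filtering, here is a minimal textbook-style sketch of a directory that tracks which caching agents may hold a line, so that a coherent access snoops only those agents. It is not ArterisIP's design; the class name, the agent count `kMaxAgents`, and the one-presence-bit-per-agent layout are all assumptions made for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

constexpr std::size_t kMaxAgents = 8;  // assumed number of caching agents

class SnoopFilter {
public:
    using Presence = std::bitset<kMaxAgents>;

    // Record that `agent` may now hold the cache line.
    void on_fill(std::uint64_t line, std::size_t agent) {
        presence_[line].set(agent);
    }

    // Record that `agent` evicted the cache line.
    void on_evict(std::uint64_t line, std::size_t agent) {
        auto it = presence_.find(line);
        if (it == presence_.end()) return;
        it->second.reset(agent);
        if (it->second.none()) presence_.erase(it);  // stop tracking unused lines
    }

    // Agents that must be snooped when `requester` accesses the line.
    Presence snoop_targets(std::uint64_t line, std::size_t requester) const {
        auto it = presence_.find(line);
        if (it == presence_.end()) return Presence{};  // nobody holds it
        Presence targets = it->second;
        targets.reset(requester);  // no need to snoop the requester itself
        return targets;
    }

private:
    std::unordered_map<std::uint64_t, Presence> presence_;
};
```

In this naive layout the storage cost grows with the number of tracked lines times the number of agents, which hints at why reducing the memory bits needed for snoop filtering matters in a real design.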
Adapt to Changing Customer Needs
Understanding what the customer requirements are for non-coherency and coherency is a must. Are the coherent and non-coherent domains separated, a full merger, or a customized mix? ArterisIP, for instance, has developed a component called a non-coherent bridge. Its purpose is to drive non-coherent accesses into the coherent domain.
A few years ago, coherency systems were small and compact with a maximum of three to four different processors. Coherency was confined to CPU clusters, and functionality was grouped under an application. Coherency wasn’t necessarily distributed beyond a subsystem.
However, customer needs are changing, and today there is a need for greater processor performance. Companies are adding more and different types of processors. In addition:

SoC layouts are expanding tremendously;
Processors are growing larger;
Complex layouts are affecting the coherency domain;
The coherent domain is expanding all over the chip.

So how do you handle all of this? First, you must make sure the infrastructure is designed to distribute coherency system-wide. The interconnect technology must enable network packet transport and accommodate a variety of topologies, such as ring and mesh. The infrastructure must also be configurable and flexible because, as design complexity continues to grow, designers need to understand which topologies best suit a particular chip layout. Having the proper tools to predict where complexities might cause performance and power issues in the chip layout stage is critical to adapting the layout and discovering which topology best resolves these issues.
Optimizing Power Consumption of Complex Systems
To optimize for power, you first need to provide a power-ready IP. Once this is accomplished, you need to implement some tried-and-true techniques, which may include voltage domains, power domains, clock gating, and high-level clock gating.
When an IP is power-ready, it will have connectivity to a power interface and can be controlled by a PMU (Power Management Unit) in the system. The PMU will decide when to shut down the IP, i.e. when it is not in use or not needed by the system. At the application level, this power-aware controller (PMU) can lower system power consumption by putting an IP into idle.
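The PMU's decision can be pictured with a small software model. This is only a sketch under assumed rules: the idle threshold, the per-cycle `tick` interface, and the type names are hypothetical, and a real PMU is a hardware block driven by power-interface signals rather than a C++ class.

```cpp
#include <cassert>
#include <cstdint>

enum class PowerState { Active, Idle };

// Hypothetical model of one power-ready IP block as seen by the PMU.
struct IpBlock {
    PowerState state = PowerState::Active;
    std::uint32_t idle_cycles = 0;  // cycles since the IP was last used
};

class Pmu {
public:
    // idle_threshold: assumed number of unused cycles before idling the IP.
    explicit Pmu(std::uint32_t idle_threshold)
        : idle_threshold_(idle_threshold) {}

    // Called once per cycle for an IP, with whether the system used it.
    void tick(IpBlock& ip, bool in_use) const {
        if (in_use) {
            ip.idle_cycles = 0;
            ip.state = PowerState::Active;  // wake on demand
            return;
        }
        if (ip.idle_cycles < idle_threshold_) ++ip.idle_cycles;
        if (ip.idle_cycles >= idle_threshold_) ip.state = PowerState::Idle;
    }

private:
    std::uint32_t idle_threshold_;
};
```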
Maturing to Meet Challenges
Heterogeneous SoCs are still in development and haven’t yet matured. But processors in the coherent domain are now sharing data with each other. Other CPUs and GPUs have become cache coherent, although I’m confident we can do a lot more.
Moreover, data sharing is not only between the processor and the GPU, but among all the IPs of the system, a concept that is still a work in progress. This idea must be pushed a little bit farther to achieve total coherency. Today not many non-coherent IPs share data with coherent IPs. But applications are emerging that need coherency, and this will bring new requirements.
Some of these design challenges are hindering product development, for example, for Advanced Driver-Assistance Systems (ADAS) for automotive. Automotive applications have performance requirements and the need to share data with heterogeneous processors to achieve those requirements. We’ll see the introduction of new features to this market. Other markets will include artificial intelligence and machine learning.
A decade ago, mobile application processors were driving the need for cache coherency. Next, data center systems took over as the primary drivers. Now the automotive market is fuelling the race to extend cache coherency to all of the heterogeneous processing elements in SoCs. In two or three years, a new trend will emerge to extend heterogeneous cache coherency even further, but designers will need flexibility, configurability and scalability to ensure that these systems are high-performance, low-latency, and power- and cost-efficient.

J.P. Loison is Corporate SoC Application Architect, ArterisIP, which provides system-on-chip (SoC) interconnect IP to accelerate SoC semiconductor assembly for a wide range of applications. These applications include those spanning automobiles to mobile phones, IoT, cameras, SSD controllers, and servers for customers such as Samsung, Huawei / HiSilicon, Mobileye (Intel), Altera (Intel), and Texas Instruments. The company is located in Campbell, CA.

HSA Foundation China Regional Committee Wraps Up Successful 2nd Annual Symposium

Posted on December 20, 2017 by glossner

Wide Array of Interfaces, Specs Discussed for Next Gen of Heterogeneous Computing, AI, SDR, and More
BEIJING, CHINA, DEC. 20, 2017 — The China Regional Committee (CRC) of the Heterogeneous System Architecture (HSA) Foundation has successfully concluded its 2nd Symposium in Beijing. The CRC was formed earlier this year; its mandate is to enhance the awareness of heterogeneous computing and promote the adoption of standards such as Heterogeneous System Architecture (HSA) in China.
More than 40 representatives of the CRC members and related companies, research institutes and universities throughout China attended the conference. HSA Foundation President Dr. John Glossner also participated in this important benchmark meeting that exchanged ideas on important topics including interfaces and specifications for the next generation of heterogeneous computing, vector parallel computing model, system security and protection, artificial intelligence, software defined radio, Network-on-Chip (NoC), and programming of commercial HSA chips. The meeting was co-organized by China Electronics Standardization Institute (CESI) and the HSA Foundation’s CRC, and sponsored by Huaxia General Processor Technologies.
Last year the HSA Foundation held its first Global Summit in Beijing. The CRC has actively carried out various work in conjunction with CESI for the development of global heterogeneous computing standards with a China focus.
At the meeting, each CRC working group shared its progress and insights on related key technologies:
• Application & System Evaluation Working Group – “The application situation and development trend of artificial intelligence in China and typical rigid demands and key indicators of artificial intelligence” – presented by State Grid;
• Virtual ISA Working Group – “Artificial intelligence instruction set design for heterogeneous computing and exploratory research of HSAIL artificial intelligence extended subset” – presented by Dr. Jun Han, Fudan University;
• Interconnect Working Group – “Latest research results on network-on-chip in the heterogeneous computing SoCs, and the next step verification and standardization work arrangements” – presented by Dr. Zhiyi Yu, Sun Yat-sen University;
• Compilation & Runtime LIB Working Group – “The latest research trends in vector computing models and related programming models, and basic recommendations for facilitating integration into HSA system architectures” – presented by Dr. Lei Wang, Huaxia General Processor Technologies;
• System Architecture Working Group – “Using HSA to systematically address the basic views of software-defined communications, software-defined radio, heterogeneous multi-core chip architecture and application development” – presented by Wanting Tian, Sanechips Technology;
• Security & Protection Working Group – “Research work and principles on adapting heterogeneous computing for security protection” – presented by Shaowei Chen, Nationz Technologies.
The CRC has been adding members since the first CRC Symposium in May; new members include Huaqiao University, Hunan University, Jimei University, Tsinghua University, Xiamen University, Xiamen University of Technology and Zhejiang University.
Supporting quotes:
“The HSA Foundation CRC has been laying the groundwork for standardization progress in heterogeneous computing standards in China for almost a year. It is focused on supporting the needs of HSA Foundation members in China and helping to fulfill the mission of the Foundation, which is to make heterogeneous programming universally easier.”
Dr. John Glossner, HSA Foundation President
“Since its formation, the CRC has received the support and attention of many academic institutions, companies, and government authorities in China. The work product and coverage of the CRC has been expanding and developing rapidly, making it one of China’s first “innovative brands” for standardization of heterogeneous computing. In 2018 the CRC and HSAF will work towards adoption of the v1.2 specifications and extensions enabling the transformation of HSA chips and platform products in many applications.”
Dr. Xiaodong Zhang, HSA Foundation CRC Chair
“The main research direction of our team is Software Defined Radio. Due to the flexibility of SDR, it allows for implementation across a wide range of applications. The earliest SDR platforms were based on FPGAs and DSPs with large size and high-power consumption making generalized SDR systems problematic. However, the HSA platform provides new possibilities for SDR research. HSA has many advantages such as low power consumption, low cost, and high integration. Those are hard to find in traditional SDR platforms.”
Dr. Ming Zhao, Professor, Tsinghua University
“Micro-Processor Research and development Center (MPRC) of Peking University is the pioneer of innovating indigenous microprocessor (CPU) and computer systems in China. To minimize the digital gap between developed and developing countries, MPRC is committed to the development of computers with independently developed CPUs and heterogeneous SoCs. The advantage of a heterogeneous architecture is the ability to be adaptable. During the evolution from desktop computing to mobile computing to Big Data, systems that adapt are the ones that are most successful. MPRC will work together with other members in HSA Foundation to improve life with heterogeneous technology.”
Dr. Junlin Lu, Deputy director of MPRC, Peking University
About the HSA Foundation
The HSA (Heterogeneous System Architecture) Foundation is a non-profit consortium of SoC IP vendors, OEMs, Academia, SoC vendors, OSVs and ISVs, whose goal is making programming for parallel computing easy and pervasive. HSA members are building a heterogeneous computing ecosystem, rooted in industry standards, which combines scalar processing on the CPU with parallel processing on the GPU, while enabling high bandwidth access to memory and high application performance with low power consumption. HSA defines interfaces for parallel computation using CPU, GPU and other programmable and fixed function devices, while supporting a diverse set of high-level programming languages, and creating the foundation for next-generation, general-purpose computing.
Follow the HSA Foundation on Twitter, Facebook, LinkedIn and Instagram.

New Survey from HSA Foundation Highlights Importance, Benefits of Heterogeneous Systems

Posted on December 5, 2017 by mfrickie

Beaverton, Oregon, Dec. 5, 2017 – The Heterogeneous System Architecture (HSA) Foundation today released key findings from a second comprehensive members survey. The survey reinforced why heterogeneous architectures are becoming integral for future electronic systems.
HSA is a standardized platform design supported by more than 70 technology companies and universities that unlocks the performance and power efficiency of the parallel computing engines found in most modern electronic devices. It allows developers to easily and efficiently apply the hardware resources—including CPUs, GPUs, DSPs, FPGAs, fabrics and fixed function accelerators—in today’s complex systems-on-chip (SoCs).
Some of the survey questions – and results:
Will the system have HSA features?
Last year, 58.82% of the respondents answered affirmatively; this year, 100%!
Will it be HSA-compliant?
In 2016, 69.23% said it would; 2017 figures rose to 80%.
What is the top challenge in implementing heterogeneous systems?
27.27% responded in 2016 that it was a lack of standards for software programming models; the 2017 survey also identified this as the most important issue, but the numbers decreased to 7.69%. Also, half of the respondents last year said it was a lack of developer ecosystem momentum.
Some remarks that further accentuate key survey findings:
“Many HSA Foundation members are currently designing, programming or delivering a wide range of heterogeneous systems – including those based on HSA,” said HSA Foundation President Dr. John Glossner. “Our 2017 survey provides additional insight into key issues and trends affecting these systems that power the electronic devices across every aspect of our lives.”
Greg Stoner, HSA Foundation Chairman and Managing Director said that “the Foundation is developing resources and ecosystems conducive to its members’ various focuses on different application areas, including machine learning, artificial intelligence, datacenter, embedded IoT, and high-performance computing. The Foundation has also been making progress in support of these ecosystems, getting closer to taking normal C++ code and compiling to an HSA system.”
Stoner added that “ROCm 1.7 by AMD will port HSA for Caffe and TensorFlow; GPT, in the meantime, is releasing an open-sourced HSAIL-based Caffe library, with the first version already up and running – this permits early access for developers.”
Dr. Xiaodong Zhang, from Huaxia General Processor Technologies, who serves as chairman of the China Regional Committee (CRC; established by the HSA Foundation to enhance global awareness of heterogeneous computing), said that “China’s semiconductor industry is rapidly developing, and the CRC is building an ecosystem in the region to include technology, talent, and markets together with an open approach to take advantage of synergies among industry, academia, research, and applications.”


Contact:
Neal Leavitt
Leavitt Communications
(760) 639-2900
neal@leavcom.com

Developing Heterogeneous Cache Coherent SoCs – and More! Q&A with ArterisIP's J.P. Loison, Corporate SoC Application Architect

Posted on November 14, 2017 by mfrickie

Computing Now, HSA Connections: https://www.computer.org/portal/web/hsa-connections/content?g=54930593&type=article&urlTitle=developing-heterogeneous-cache-coherent-socs-and-more-

Editor’s Note:

ArterisIP provides system-on-chip (SoC) interconnect IP to accelerate SoC semiconductor assembly for a wide range of applications from automobiles to mobile phones, IoT, cameras, SSD controllers, and servers for customers such as Samsung, Huawei / HiSilicon, Mobileye (Intel), Altera (Intel), and Texas Instruments. The company is located in Campbell, CA.

Describe in detail the various design challenges faced today in developing Heterogeneous Cache Coherent SoCs.

The first thing you need to do is understand customer requirements. This includes asking the right questions, some of which may include:

How many IPs need to connect to the heterogeneous system?
What kind of bandwidth does each IP require?
What kinds of IP are in the system, and what kind of features can you enable with the interconnect IP?

The next step is to define heterogeneity, because many people use the word “heterogeneous” but mean different things by it. Some key tasks and guidelines:

You must have different types of processors within the same system;
Then you have to accommodate the different types of processors that are available on the market;
Different processor types also have different cache structures:
- An ARM CPU would use the same cache structure as another ARM core all over the processor;
- A different CPU poses a different cache structure.
Accommodate different types of IPs as well:
- CPUs, GPUs, and DSPs;
- All the other types of IP that you combine into an SoC, such as connectivity IP, USB, SATA, etc.

It’s also important to be able to accommodate different (cache) protocol systems in terms of coherent and non-coherent protocol. Some examples:

Flexible snoop filter capability accommodates different cache structures of different kinds of processors.
- Snoop filter capabilities operate in two different directions to accommodate any cache structure of any processor that is available today.
- Another challenge: Reduce the number of memory bits that you need to perform snoop filtering.

How do you integrate IP that is not cache-coherent and achieve better performance? Can you provide a brief example or two?

You need to understand what the customer requirements are in terms of the mix of non-coherency and coherency requirements. Are they separated, a full merger of both domains or a customized mix? Arteris, for instance, developed a component called a non-coherent bridge. Its purpose is to drive non-coherent accesses back into the coherent domain. It also enables a differentiator between the non-coherent and coherent domains.

How do you create a cache-coherent system that is easily placed on a chip?

A few years ago, coherency systems were small and compact – a max of three to four different processors. Coherency was confined to CPU clusters, functionality was grouped under an application and all subsystems were connected to an application.

But coherency wasn’t necessarily distributed beyond a subsystem. Customer needs are changing: there is a need for greater processor performance, and companies are adding more and different types of processors. In addition:

SoC layouts are expanding tremendously;
Processors are growing larger;
Complex layouts are affecting the coherency domain;
The coherent domain is expanding all over the chip.

So how do you handle it? First, you must make sure the infrastructure is designed to distribute coherency system-wide. It has to be an interconnect technology that enables network packet transport and it also must accommodate a variety of topologies such as ring and mesh. The infrastructure must also be configurable and flexible because as design complexity continues to grow, designers must be able to understand which topologies are best suited for a particular chip layout. Having the proper tools that can predict where complexities might cause performance and power issues in the chip layout stage is critical to revising the layout and providing the best solution in terms of which topology might resolve these issues.

How can you optimize power consumption of complex systems?

You first need to provide power-ready IP; once this is accomplished, you need to implement some well-known techniques – these may include voltage domains, power domains, clock gating and high-level clock gating.

If the IP is power-ready, it will also have connectivity to a power interface and can be controlled by a PMU (Power Management Unit) in the system, which will decide when to shut down the IP because it is not in use or not needed by the system. At the application level, this power-aware controller (PMU) can lower system power consumption by putting an IP into idle.

How long will it take to reasonably surmount some/all of the aforementioned issues?

Heterogeneous SoCs are still in development and haven’t yet matured. But processors in the coherent domain are now sharing data with each other. Other CPUs and GPUs have become cache coherent, although I’m confident we can do a lot more.

Data sharing is not only between the processor and the GPU, but among all of the IPs of the system – it’s a concept that is still in progress. This idea must be pushed a little bit farther to achieve total coherency. Today there are still not many non-coherent IPs sharing data with coherent IPs. But we’re now starting to see applications emerging that need coherency, and this will bring new requirements.

Are these design challenges currently hindering product development in select verticals? If so, which ones?

Yes, one that comes to mind is ADAS (Advanced Driver-Assistance Systems) for automotive. Automotive applications will have a lot of requirements because of the need to add performance and share data with heterogeneous processors to achieve those requirements. We’ll see the introduction of new features to this market. Other markets will include artificial intelligence and machine learning.

A decade ago, mobile application processors were driving the need for cache coherency, and then data center systems started becoming the primary driver. Now the automotive market is driving the need to extend cache coherency to all of the heterogeneous processing elements in SoCs. In two or three years, a new trend will emerge to extend heterogeneous cache coherency even further – but designers will need flexibility, configurability and scalability to ensure that these systems are high-performance, low-latency and reasonable in terms of power consumption and cost.

Everything You Need to Know About Why AMD Open Sourced the OpenCL Driver Stack for ROCm

Posted on October 18, 2017 by mfrickie

Computing Now, HSA Connections: https://www.computer.org/portal/web/hsa-connections/content?g=54930593&type=article&urlTitle=hsa-connectio-1

Introduction: AMD is a co-founder and member of the HSA Foundation. This article is excerpted and edited from a blog post by Vincent Hindriksen, founder of Stream HPC, a Netherlands-based software development company.

Last May, AMD open sourced the OpenCL driver stack for ROCm. With this they kept their promise to open source (almost) everything. Earlier the hcc compiler, kernel-driver and several other parts were open sourced.

Why is this a big thing?

There are indeed several open source OpenCL implementations, but with one big difference: they’re secondary to the official compiler/driver. So, implementations like PortableCL and Intel Beignet play catch-up. AMD’s open source implementations are primary.

They contain:

OpenCL 1.2 compatible language runtime and compiler
OpenCL 2.0 compatible kernel language support with OpenCL 1.2 compatible runtime
Support for offline compilation right now – in-process/in-memory JIT compilation is to be added.

Performance of ROCm was mostly on par with AMD’s closed source drivers, with a few outliers. A few months ago ROCm 1.6 was released, where again performance was noticeably improved. For the next release performance improvements are expected again.

Why was it open sourced?

There were several reasons. AMD listened carefully to their customers in HPC, while taking note of where the industry was going.

Get deeper understanding of how functions are implemented

It’s useful to understand how functions are implemented. For instance the difference between sin() and native_sin() can tell you a lot more on what’s best to be used. It doesn’t tell how the functions are implemented on the GPU, but does tell which GPU-functions are called.

Learning a new platform has never been so easy. Deep understanding is needed if you want to go beyond “it works”.

Debug software deeper

Any software engineer has experience with libraries that don’t perform as promised or work as documented. Integration issues with “black box” libraries are therefore a typical reason for big project delays. If the library were open source, the debugger could step in and give all the information needed to solve the problem quickly.

When working with drivers it’s about the same. GPU drivers and compilers are extremely complex and inevitably your project hits that one bug nobody encountered before. With all open source drivers, you can step into the driver with the same debugger. Moreover, the driver can be recompiled with fixed code instead of having to write a less secure work-around.

Get bugs solved quicker

A trace now includes the driver-stack and the line-numbers. Even a suggestion for a fix can be given. This also helps reduce the time to get the fix for all steps. When a fix is suggested AMD only needs to test for regression to accept it. This makes the work for tools like CLsmith a lot easier.

A bonus of open source projects is that over time the code quality becomes better than projects where code is never seen by outsiders, which also adds to quicker solving of bugs.

Get low-priority improvements in the driver

Popular software like Blender and the LuxMark benchmark can expect to get attention from driver developers. For the rest of us, we have to hope our special code-constructions are comparable to one that is targeted. This results in many forums-comments and bug-reports being written, for which the compiler team doesn’t have enough time. This is frustrating for both sides.

Now everyone can help build a driver for everyone.

Get support for completely new things

Proprietary code needs official access and legal documents that have all kinds of restrictions, which open source code does not.

More often there is opportunity in what is not there yet, and research needs to be done to break the chicken-egg conundrum. Optimized 128-bit computing? Easy complex numbers in OpenCL? Native support for Halide as an alternative to OpenCL? All up-to-date driver-code is available to make these possible.

Nurture other projects

Code can be “borrowed” from AMD’s projects and be used in (un)expected places. This ranges from GPU-simulators to experimental compilers.

Currently the forks of the ROCm-driver are mostly used to fix bugs or are thousands of commits behind. Who knows what the future brings.

Get better support in more Linux distributions

It’s easier to include open source drivers in Linux distributions. These OpenCL drivers do need binary firmware (which has been disassembled and seems to do as advertised). There is a discussion about whether firmware can be seen as hardware and can be marked as “libre”, but the fact is that AMD’s contributions to the Linux 4.x kernel do get accepted.

Improve and increase university collaborations

When the software was protected, working on AMD’s compiler infrastructure was only possible under strict contracts. In the end it was easier to focus on the open source backends of LLVM than to go through the legal path.

Universities are very important for finding unexpected opportunities, integrating the latest research, attracting potential new employees and setting up research collaborations. Timour Paltashev (senior manager, Radeon Technology Group, GPU architecture and global academic connections) can be reached via timour dot paltashev at amd dot com for more info.

Final words

It makes sense to open source the drivers. The key advantages are reduced costs and increased control, thanks to easier debugging and bug-solving.

AMD is now a modern hardware company that understands software is a crucial part of their products. They believe that open source software gives an edge over the competition and made this bold move to let everybody peek into their kitchen.

Five Minutes With…John Glossner, President, HSA Foundation

Posted on October 4, 2017 by mfrickie

Foundation President Dr. John Glossner was interviewed recently on Embedded Computing Design’s “Five Minutes With…” podcast. Please click here to listen to the segment.

HSA and ROCm Architectures to be Highlighted at Next Week’s CppCon

Posted on September 19, 2017 by mfrickie

BEAVERTON, OR, Sept. 19, 2017 – The HSA (Heterogeneous System Architecture) Foundation and Foundation member AMD will be providing a comprehensive session on HSA technologies and AMD’s ROCm architecture at next week’s CppCon. The conference will be held from Sept. 24-29 in Bellevue, WA at the Meydenbauer Conference Center.
CppCon is an annual gathering for the worldwide C++ community and is geared to appeal to anyone from C++ novices to experts.
The presentation by AMD Fellow Paul Blinzer is included as part of a session on ‘concurrency and parallelism’ running from 8:30-10 PM on Tuesday, Sept. 28 at the Meydenbauer Conference Center, Harvard, Room #406. Attendees will learn about what allows these architectures to use computational hardware accelerators like GPUs, DSPs and others with native C++, without resorting to proprietary APIs, programming libraries or limited language features.
Heterogeneous System Architecture (HSA) is a standardized platform design that unlocks the performance and power efficiency of the parallel computing engines found in most modern electronic devices. It provides an ideal mainstream platform for next-generation SoCs in a range of applications including artificial intelligence.
For more information on the presentation and to register, please see https://cppcon.org/registration/.
For more information, including a full list of speakers, supporting organizations and sponsors please visit: https://cppcon.org/cppcon-2017-program/
About Paul Blinzer
Paul Blinzer works on a wide variety of Platform System Software architecture projects, and specifically on the Heterogeneous System Architecture (HSA) System Software, at Advanced Micro Devices, Inc. (AMD) as a Fellow in the System Software group. He lives in the Seattle, WA area and has worked since the early ’90s in various roles in system-level driver development, system software development, graphics architecture, and graphics & compute acceleration. Paul is the chairperson of the “System Architecture Workgroup” of the HSA Foundation. He has a degree in Electrical Engineering (Dipl.-Ing) from TU Braunschweig, Germany.
https://www.linkedin.com/in/paul-blinzer-4523602
About the HSA Foundation
The HSA (Heterogeneous System Architecture) Foundation is a non-profit consortium of SoC IP vendors, OEMs, Academia, SoC vendors, OSVs and ISVs, whose goal is making programming for parallel computing easy and pervasive. HSA members are building a heterogeneous computing ecosystem, rooted in industry standards, which combines scalar processing on the CPU with parallel processing on the GPU, while enabling high bandwidth access to memory and high application performance with low power consumption. HSA defines interfaces for parallel computation using CPU, GPU and other programmable and fixed function devices, while supporting a diverse set of high-level programming languages, and creating the foundation for next-generation, general-purpose computing.

Follow the HSA Foundation on Twitter, Facebook, LinkedIn and Instagram.

Contact:
Neal Leavitt
Leavitt Communications
(760) 639-2900
neal@leavcom.com

HSA Foundation, AMD Headlining HSA Technologies Tutorial at 26th International Conference on Parallel Architectures and Compilation Techniques

Posted on September 6, 2017 by mfrickie

BEAVERTON, OR, Sept. 6, 2017 – The HSA (Heterogeneous System Architecture) Foundation and Foundation member AMD will provide a half-day tutorial on HSA technologies and AMD’s Radeon™ Open Compute at this week’s 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). The conference will be held from Sept. 9-13 in Portland, OR.
PACT brings together researchers from architecture, compilers, applications and languages to present and discuss innovative research of common interest. PACT recently widened its scope to include insights useful for the design of machines and compilers from applications such as, but not limited to, machine learning, data analytics and computational biology.
The tutorial, presented by AMD Fellow Paul Blinzer, runs from 9 AM to 12 PM on Saturday, Sept. 9th. Key elements will include an introduction to HSA and the Radeon™ Open Compute runtime, followed by an in-depth session focusing on HSA, its components and the software ecosystem.
Heterogeneous System Architecture (HSA) is a standardized platform design that unlocks the performance and power efficiency of the parallel computing engines found in most modern electronic devices. It provides an ideal mainstream platform for next-generation SoCs in a range of applications including artificial intelligence.
The tutorial and other PACT sessions will be held at the Doubletree by Hilton Hotel Portland.
For more information on the tutorial and to register, please see https://parasol.tamu.edu/pact17/rates-registration.
For more information, including a full list of speakers, supporting organizations and sponsors please visit: https://parasol.tamu.edu/pact17/main-conference.
