Computing Now, HSA Connections: https://www.computer.org/portal/web/hsa-connections/content?g=54930593&type=article&urlTitle=proof-of-concept-c-17-parallel-stl-offloading-for-gcc-libstdc-
Introduction:
Parmance and General Processor Technologies have been collaborating on C++17 Parallel STL offloading support based on HSA (Heterogeneous System Architecture) and GCC (GNU Compiler Collection). A working proof-of-concept has now been released and is available at https://github.com/parmance/par_offload. This post is a high-level overview of the project.
Heterogeneous Offloading and C++17
The C++17 standard, published in December 2017, adds execution policies to the algorithm definitions of its standard template library (STL). An execution policy lets the programmer declare that an algorithm library call, along with any user-defined functionality the call uses, is safe to execute in parallel. The standard refers to this user-defined functionality as “element access functions” (EAF).
The PSTL (Parallel Standard Template Library) of C++17 focuses on forward progress guarantees and their implications for parallelization safety on homogeneous processors.
However, there is no “parallel heterogeneous offloading execution policy” yet in the C++ standard; there seems to be an implicit assumption that the parallel execution will occur on the same processor where it was invoked. To make the offloading decisions explicit, our offloading implementation defines a new execution policy type, ‘parallel_offload_policy’ (par_offload), which the programmer can use to declare “heterogeneous offload” or “multiple-ISA” safety for the involved user-defined functions.
A call to the ‘transform’ PSTL function with this policy looks like the following:
std::transform(std::execution::experimental::par_offload,
               pixel_data.begin(), pixel_data.end(),
               pixel_data.begin(),
               [](char c) -> char {
                   return c * 16;
               });
In this case, a lambda function is applied to every element of pixel_data, a std::vector, with the processing offloaded to a heterogeneous device if one is available.
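To illustrate how a policy type can select an offloading code path, here is a minimal sketch of tag dispatch on a policy argument. Everything in the `sketch` namespace is hypothetical and is not taken from the par_offload repository: the tag type only mirrors the name used in the post, `offload_device_available()` is a stand-in for a query that a real implementation would presumably make through the HSA runtime, and the device path is left out so the code always takes the host fallback.

```cpp
#include <algorithm>
#include <vector>

namespace sketch {

// Hypothetical tag type standing in for the post's parallel_offload_policy.
struct parallel_offload_policy {};
inline constexpr parallel_offload_policy par_offload{};

// Hypothetical device query. This sketch always reports "no device".
inline bool offload_device_available() { return false; }

// Overload chosen purely by the type of the first argument (tag dispatch).
template <class InputIt, class OutputIt, class UnaryOp>
OutputIt transform(parallel_offload_policy, InputIt first, InputIt last,
                   OutputIt d_first, UnaryOp op) {
    if (offload_device_available()) {
        // A real implementation would dispatch a kernel to the device here.
        // That part is intentionally omitted from this sketch.
    }
    // Host fallback: the ordinary sequential algorithm.
    return std::transform(first, last, d_first, op);
}

}  // namespace sketch

// Same shape as the call above, using the sketch's policy tag.
std::vector<char> scale_pixels(std::vector<char> pixel_data) {
    sketch::transform(sketch::par_offload,
                      pixel_data.begin(), pixel_data.end(),
                      pixel_data.begin(),
                      [](char c) -> char { return static_cast<char>(c * 16); });
    return pixel_data;
}
```

Because the policy is an ordinary type, the caller opts in to offloading at each call site, and the library remains free to fall back to host execution when no suitable device is present.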
Shared Virtual Memory
Explicit data management is problematic when offloading general-purpose C/C++ programs, which assume a unified address space and allow passing pointers to functions without attached size information. Indeed, a single unified coherent address space across all the processors in a heterogeneous platform removes a major obstacle and makes programming such devices much simpler.
Heterogeneous System Architecture (HSA), whose 1.0 specification was published in March 2015, is a language-neutral standard targeting heterogeneous systems. It defines a cache-coherent shared global virtual memory as a core feature. That is, an HSA heterogeneous platform supports data sharing across devices (called agents) as easily as in “homogeneous” C/C++ multithreaded programming.
In the GCC PSTL offloading work we use the HSA Runtime as heterogeneous platform middleware and rely on the coherent system memory capabilities of the HSA Full Profile. HSA is interesting for this use case most importantly because of its shared heterogeneous memory requirement, which is expected to work seamlessly with the C/C++ memory model. There is also a wide selection of open source components implementing different parts of the specifications. For example, its intermediate language HSAIL already has both front-end and back-end support in upstream GCC. There are also implementations of its runtime API that enable development and testing via offloading to CPU-based targets.
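The difference between the two memory models can be sketched in plain C++. The example below is illustrative only and runs entirely on the host; the names `Node`, `sum_list`, `marshal`, and `sum_buffer` are made up for this sketch. `sum_list` represents code that could run unchanged on another agent when the address space is shared, because it needs nothing but a pointer. `marshal` plus `sum_buffer` represent what explicit data management tends to require: the data must first be flattened into a buffer of known size before it can be handed over.

```cpp
#include <cstddef>
#include <vector>

// A pointer-linked structure: its total size cannot be derived from `head`.
struct Node {
    int value;
    const Node* next;
};

// Shared-address-space style: the callee receives only a pointer and simply
// follows it, exactly as ordinary multithreaded host code would.
int sum_list(const Node* head) {
    int total = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        total += n->value;
    }
    return total;
}

// Explicit-copy style, step 1: walk the structure on the host and flatten it
// into a contiguous buffer whose size is known.
std::vector<int> marshal(const Node* head) {
    std::vector<int> flat;
    for (const Node* n = head; n != nullptr; n = n->next) {
        flat.push_back(n->value);
    }
    return flat;
}

// Explicit-copy style, step 2: the callee works on (pointer, size) pairs only.
int sum_buffer(const int* data, std::size_t count) {
    int total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += data[i];
    }
    return total;
}
```

Both paths compute the same result, but the second one needs a marshalling step that knows the layout of the data, which is exactly the information a bare pointer argument does not carry.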
Implementation Status and Future Plans
We now have a working proof-of-concept offloading implementation of several PSTL algorithms that supports multiple ways of defining the user-specified functionality: lambda functors (with and without captures), C functions, std::functions containing C functions, function objects, and user-defined data types.
Next we plan to properly integrate the prototype into libstdc++ and GCC, implement the rest of the algorithms, and finally optimize the performance.
Links and references
The code on GitHub: https://github.com/parmance/par_offload
ISO/IEC 14882:2017 Programming languages — C++ Publication date 2017-12 https://www.iso.org/standard/68564.html
Heterogeneous Systems Architecture Foundation http://www.hsafoundation.com
Five Minutes with John Glossner, President of HSA Foundation
Foundation President Dr. John Glossner was interviewed recently on Embedded Computing Design’s “Five Minutes With..” podcast. Please click here to listen to the segment.
Developing Heterogeneous Cache Coherent SoCs
Chip Design: http://eecatalog.com/chipdesign/2017/12/19/developing-heterogeneous-cache-coherent-socs/
Automotive and other customer needs are not what they once were.
As with other challenges, the task of successfully developing heterogeneous cache coherent SoCs demands an understanding of your customers’ requirements. Conducting interviews with your customers aids this understanding. For example, through the interview you should learn:
- How many IPs are needed to connect to the heterogeneous system;
- What kind of bandwidth each IP requires;
- The types of IPs that are in the system;
- What kind of features you would enable in the interconnect IP.
The next step is to define “heterogeneity” because, while many people use the word “heterogeneous,” it has a number of meanings. Some guidelines:
- You must have different types of processors within the same system;
- Different processor types also have different cache structures. For example, an Arm CPU would use the same cache structure as another Arm core, but a different CPU may have a different cache structure;
- Different types of IPs must also be considered:
  - CPUs, GPUs, and DSPs;
  - IPs that make up an SoC, such as those for connectivity, USB, SATA, etc.
A highly flexible snoop filter architecture accommodates different cache structures of different kinds of processors. It also reduces the number of memory bits required to perform snoop filtering.
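As a rough illustration of what a snoop filter does, the sketch below keeps one bitmask of possible holders per cache line, so that a write only needs to snoop the agents that may actually have a copy. This is a generic, simplified directory-style model written for this example. It is not ArterisIP's architecture, the class and method names are hypothetical, and real designs bound the storage rather than using an unbounded map.

```cpp
#include <cstdint>
#include <unordered_map>

class SnoopFilter {
public:
    using AgentMask = std::uint32_t;  // one bit per caching agent (up to 32)

    // A read by `agent`: remember that it may now hold a copy of the line.
    void on_read(std::uint64_t line, unsigned agent) {
        sharers_[line] |= bit(agent);
    }

    // A write by `agent`: return the other agents that must be snooped
    // (invalidated) and leave the writer as the only recorded holder.
    AgentMask on_write(std::uint64_t line, unsigned agent) {
        AgentMask& mask = sharers_[line];
        const AgentMask to_snoop = mask & ~bit(agent);
        mask = bit(agent);
        return to_snoop;
    }

    // An eviction by `agent`: drop it from the line's holder set and free
    // the entry once nobody holds the line any more.
    void on_evict(std::uint64_t line, unsigned agent) {
        auto it = sharers_.find(line);
        if (it == sharers_.end()) return;
        it->second &= ~bit(agent);
        if (it->second == 0) sharers_.erase(it);
    }

private:
    static AgentMask bit(unsigned agent) { return AgentMask{1} << agent; }

    std::unordered_map<std::uint64_t, AgentMask> sharers_;
};
```

In this model the per-line bitmask is the memory cost of filtering, which is why the number of tracked lines and the width of each entry matter when the goal is to reduce the memory bits spent on snoop filtering.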
Adapt to Changing Customer Needs
Understanding what the customer requirements are for non-coherency and coherency is a must. Are the coherent and non-coherent domains separated, a full merger, or a customized mix? ArterisIP, for instance, has developed a component called a non-coherent bridge. Its purpose is to drive non-coherent accesses into the coherent domain.
A few years ago, coherency systems were small and compact with a maximum of three to four different processors. Coherency was confined to CPU clusters, and functionality was grouped under an application. Coherency wasn’t necessarily distributed beyond a subsystem.
However, customer needs are changing, and today there is a need for greater processor performance. Companies are adding more and different types of processors. In addition:
- SoC layouts are expanding tremendously;
- Processors are growing larger;
- Complex layouts are affecting the coherency domain;
- The coherent domain is expanding all over the chip.
So how do you handle all of this? First, you must make sure the infrastructure is designed to distribute coherency system-wide. The interconnect technology must enable network packet transport and accommodate a variety of topologies, such as ring and mesh. The infrastructure must also be configurable and flexible because, as design complexity continues to grow, designers need to understand which topologies best suit a particular chip layout. Having the proper tools to predict, at the chip layout stage, where complexities might cause performance and power issues is critical to adapting to the layout and discovering which topology best resolves these issues.
Optimizing Power Consumption of Complex Systems
To optimize for power, you first need to provide a power-ready IP. Once this is accomplished, you need to implement some tried and true techniques, which may include voltage domains, power domains, clock gating, and high-level clock gating.
When an IP is power-ready, it has connectivity to a power interface and can be controlled by a PMU (Power Management Unit) in the system. The PMU decides when to shut down the IP, i.e. when it is not in use or not needed by the system. At the application level, this power-aware controller (PMU) can lower system power consumption by putting an IP into idle.
Maturing to Meet Challenges
Heterogeneous SoCs are still in development and haven’t yet matured. But processors in the coherent domain are now sharing data with each other. Other CPUs and GPUs have become cache coherent, although I’m confident we can do a lot more.
Moreover, data sharing is not only between the processor and the GPU, but among all the IPs of the system, a concept that is still a work in progress. This idea must be pushed a little bit farther to achieve total coherency. Today not many non-coherent IPs share data with coherent IPs. But applications are emerging that need coherency, and this will bring new requirements.
Some of these design challenges are hindering product development, for example, for Advanced Driver-Assistance Systems (ADAS) for automotive. Automotive applications have performance requirements and the need to share data with heterogeneous processors to achieve those requirements. We’ll see the introduction of new features to this market. Other markets will include artificial intelligence and machine learning.
A decade ago, mobile application processors were driving the need for cache coherency. Next, data center systems took over as the primary drivers. Now the automotive market is fuelling the race to extend cache coherency to all of the heterogeneous processing elements in SoCs. In two or three years, a new trend will emerge to extend heterogeneous cache coherency even further, but designers will need flexibility, configurability and scalability to ensure that these systems are high-performance, low-latency, and power- and cost-efficient.
J.P. Loison is Corporate SoC Application Architect, ArterisIP, which provides system-on-chip (SoC) interconnect IP to accelerate SoC semiconductor assembly for a wide range of applications. These applications include those spanning automobiles to mobile phones, IoT, cameras, SSD controllers, and servers for customers such as Samsung, Huawei / HiSilicon, Mobileye (Intel), Altera (Intel), and Texas Instruments. The company is located in Campbell, CA.
HSA Foundation China Regional Committee Wraps Up Successful 2nd Annual Symposium
Wide Array of Interfaces, Specs Discussed for Next Gen of Heterogeneous Computing, AI, SDR, and More
BEIJING, CHINA, DEC. 20, 2017 — The China Regional Committee (CRC) of the Heterogeneous System Architecture (HSA) Foundation has successfully concluded its 2nd Symposium in Beijing. The CRC was formed earlier this year; its mandate is to enhance the awareness of heterogeneous computing and promote the adoption of standards such as Heterogeneous System Architecture (HSA) in China.
More than 40 representatives of the CRC members and related companies, research institutes and universities throughout China attended the conference. HSA Foundation President Dr. John Glossner also participated in the meeting, where attendees exchanged ideas on topics including interfaces and specifications for the next generation of heterogeneous computing, vector parallel computing models, system security and protection, artificial intelligence, software defined radio, Network-on-Chip (NoC), and programming of commercial HSA chips. The meeting was co-organized by China Electronics Standardization Institute (CESI) and the HSA Foundation’s CRC, and sponsored by Huaxia General Processor Technologies.
Last year the HSA Foundation held its first Global Summit in Beijing. The CRC has actively carried out various work in conjunction with CESI for the development of global heterogeneous computing standards with a China focus.
At the meeting, each CRC working group shared its progress and insights on related key technologies:
• Application & System Evaluation Working Group – “The application situation and development trend of artificial intelligence in China and typical rigid demands and key indicators of artificial intelligence” – presented by State Grid;
• Virtual ISA Working Group – “Artificial intelligence instruction set design for heterogeneous computing and exploratory research of HSAIL artificial intelligence extended subset” – presented by Dr. Jun Han, Fudan University;
• Interconnect Working Group – “Latest research results on network-on-chip in the heterogeneous computing SoCs, and the next step verification and standardization work arrangements” – presented by Dr. Zhiyi Yu, Sun Yat-sen University;
• Compilation & Runtime LIB Working Group – “The latest research trends in vector computing models and related programming models, and basic recommendations for facilitating integration into HSA system architectures” – presented by Dr. Lei Wang, Huaxia General Processor Technologies;
• System Architecture Working Group – “Using HSA to systematically address the basic views of software-defined communications, software-defined radio, heterogeneous multi-core chip architecture and application development” – presented by Wanting Tian, Sanechips Technology;
• Security & Protection Working Group – “Research work and principles on adapting heterogeneous computing for security protection” – presented by Shaowei Chen, Nationz Technologies.
The CRC has been adding members since the first CRC Symposium in May; new members include Huaqiao University, Hunan University, Jimei University, Tsinghua University, Xiamen University, Xiamen University of Technology and Zhejiang University.
Supporting quotes:
“The HSA Foundation CRC has been laying the groundwork for standardization progress in heterogeneous computing standards in China for almost a year. It is focused on supporting the needs of HSA Foundation members in China and helping to fulfill the mission of the Foundation, which is to make heterogeneous programming universally easier.”
Dr. John Glossner, HSA Foundation President
“Since its formation, the CRC has received the support and attention of many academic institutions, companies, and government authorities in China. The work product and coverage of the CRC has been expanding and developing rapidly, making it one of China’s first “innovative brands” for standardization of heterogeneous computing. In 2018 the CRC and HSAF will work towards adoption of the v1.2 specifications and extensions enabling the transformation of HSA chips and platform products in many applications.”
Dr. Xiaodong Zhang, HSA Foundation CRC Chair
“The main research direction of our team is Software Defined Radio. Due to the flexibility of SDR, it allows for implementation across a wide range of applications. The earliest SDR platforms were based on FPGAs and DSPs with large size and high-power consumption making generalized SDR systems problematic. However, the HSA platform provides new possibilities for SDR research. HSA has many advantages such as low power consumption, low cost, and high integration. Those are hard to find in traditional SDR platforms.”
Dr. Ming Zhao, Professor, Tsinghua University
“The Micro-Processor Research and Development Center (MPRC) of Peking University is a pioneer in developing indigenous microprocessors (CPUs) and computer systems in China. To minimize the digital gap between developed and developing countries, MPRC is committed to the development of computers with independently developed CPUs and heterogeneous SoCs. The advantage of a heterogeneous architecture is the ability to be adaptable. During the evolution from desktop computing to mobile computing to Big Data, systems that adapt are the ones that are most successful. MPRC will work together with other members in HSA Foundation to improve life with heterogeneous technology.”
Dr. Junlin Lu, Deputy director of MPRC, Peking University
About the HSA Foundation
The HSA (Heterogeneous System Architecture) Foundation is a non-profit consortium of SoC IP vendors, OEMs, Academia, SoC vendors, OSVs and ISVs, whose goal is making programming for parallel computing easy and pervasive. HSA members are building a heterogeneous computing ecosystem, rooted in industry standards, which combines scalar processing on the CPU with parallel processing on the GPU, while enabling high bandwidth access to memory and high application performance with low power consumption. HSA defines interfaces for parallel computation using CPU, GPU and other programmable and fixed function devices, while supporting a diverse set of high-level programming languages, and creating the foundation for next-generation, general-purpose computing.
Follow the HSA Foundation on Twitter, Facebook, LinkedIn and Instagram.
New Survey from HSA Foundation Highlights Importance, Benefits of Heterogeneous Systems
Beaverton, Oregon, Dec. 5, 2017 – The Heterogeneous System Architecture (HSA) Foundation today released key findings from a second comprehensive members survey. The survey reinforced why heterogeneous architectures are becoming integral for future electronic systems.
HSA is a standardized platform design supported by more than 70 technology companies and universities that unlocks the performance and power efficiency of the parallel computing engines found in most modern electronic devices. It allows developers to easily and efficiently apply the hardware resources—including CPUs, GPUs, DSPs, FPGAs, fabrics and fixed function accelerators—in today’s complex systems-on-chip (SoCs).
Some of the survey questions – and results:
Will the system have HSA features?
Last year, 58.82% of the respondents answered affirmatively; this year, 100%!
Will it be HSA-compliant?
In 2016, 69.23% said it would; 2017 figures rose to 80%.
What is the top challenge in implementing heterogeneous systems?
27.27% responded in 2016 that it was a lack of standards for software programming models; the 2017 survey also identified this as the most important issue, but the numbers decreased to 7.69%. Also, half of the respondents last year said it was a lack of developer ecosystem momentum.
Some remarks that further accentuate key survey findings:
“Many HSA Foundation members are currently designing, programming or delivering a wide range of heterogeneous systems – including those based on HSA,” said HSA Foundation President Dr. John Glossner. “Our 2017 survey provides additional insight into key issues and trends affecting these systems that power the electronic devices across every aspect of our lives.”
Greg Stoner, HSA Foundation Chairman and Managing Director said that “the Foundation is developing resources and ecosystems conducive to its members’ various focuses on different application areas, including machine learning, artificial intelligence, datacenter, embedded IoT, and high-performance computing. The Foundation has also been making progress in support of these ecosystems, getting closer to taking normal C++ code and compiling to an HSA system.”
Stoner added that “ROCm 7 by AMD will port HSA for Caffe and TensorFlow; GPT, in the meantime, is releasing an open-sourced HSAIL-based Caffe library, with the first version already up and running – this permits early access for developers.”
Dr. Xiaodong Zhang, from Huaxia General Processor Technologies, who serves as chairman of the China Regional Committee (CRC; established by the HSA Foundation to enhance global awareness of heterogeneous computing), said that “China’s semiconductor industry is rapidly developing, and the CRC is building an ecosystem in the region to include technology, talent, and markets together with an open approach to take advantage of synergies among industry, academia, research, and applications.”
Contact:
Neal Leavitt
Leavitt Communications
(760) 639-2900
neal@leavcom.com
Developing Heterogeneous Cache Coherent SoCs – and More! Q&A with Arterisip's J.P. Loison, Corporate SoC Application Architect
Editor’s Note:
- Understand how many IPs need to connect to the heterogeneous system;
- What kind of bandwidth each IP requires;
- What kinds of IP are in the system and what kinds of features you can enable with the interconnect IP.
- You must have different types of processors within the same family;
- Then you have to accommodate different types of processors that are available on the market.
- Different processor types also have different cache structures.
- An ARM CPU would use the same cache structure as another ARM core all over the processor.
- A different CPU poses a different cache structure.
- Accommodate different types of IPs as well:
- CPUs, GPUs, and DSPs;
- Then there are all the other types of IP that you combine into an SoC, such as connectivity IP, USB, SATA, etc.
- Flexible snoop filter capability accommodates different cache structures of different kinds of processors.
- Snoop filter capabilities operate in two different directions to accommodate any cache structure of any processor that is available today.
- Another challenge: Reduce the number of memory bits that you need to perform snoop filtering.
- SoC layouts are expanding tremendously;
- Processors are growing larger;
- Complex layouts affect the coherency domain;
- The coherent domain is expanding all over the chip.
Everything You Need to Know About Why AMD Open Sourced the OpenCL Driver Stack for ROCm
- OpenCL 1.2 compatible language runtime and compiler
- OpenCL 2.0 compatible kernel language support with OpenCL 1.2 compatible runtime
- Support for offline compilation right now – in-process/in-memory JIT compilation is to be added.
HSA and ROCm Architectures to be Highlighted at Next Week’s CppCon
BEAVERTON, OR, Sept. 19, 2017 – The HSA (Heterogeneous System Architecture) Foundation and Foundation member AMD will be providing a comprehensive session on HSA technologies and AMD’s ROCm architecture at next week’s CppCon. The conference will be held from Sept. 24-29 in Bellevue, WA at the Meydenbauer Conference Center.
CppCon is an annual gathering for the worldwide C++ community and is geared to appeal to anyone from C++ novices to experts.
The presentation by AMD Fellow Paul Blinzer is included as part of a session on ‘concurrency and parallelism’ running from 8:30-10 PM on Tuesday, Sept. 28 at the Meydenbauer Conference Center, Harvard, Room #406. Attendees will learn about what allows these architectures to use computational hardware accelerators like GPUs, DSPs and others with native C++, without resorting to proprietary APIs, programming libraries or limited language features.
Heterogeneous System Architecture (HSA) is a standardized platform design that unlocks the performance and power efficiency of the parallel computing engines found in most modern electronic devices. It provides an ideal mainstream platform for next-generation SoCs in a range of applications including artificial intelligence.
For more information on the presentation and to register, please see https://cppcon.org/registration/.
For more information, including a full list of speakers, supporting organizations and sponsors please visit: https://cppcon.org/cppcon-2017-program/
About Paul Blinzer
Paul Blinzer works on a wide variety of Platform System Software architecture projects and specifically on the Heterogeneous System Architecture (HSA) System Software at Advanced Micro Devices, Inc. (AMD) as a Fellow in the System Software group. Living in the Seattle, WA area, during his career he has worked in various roles on system level driver development, system software development, graphics architecture, graphics & compute acceleration since the early ’90s. Paul is the chairperson of the “System Architecture Workgroup” of the HSA Foundation. He has a degree in Electrical Engineering (Dipl.-Ing) from TU Braunschweig, Germany.
https://www.linkedin.com/in/paul-blinzer-4523602
HSA Foundation, AMD Headlining HSA Technologies Tutorial at 26th International Conference on Parallel Architectures and Compilation Techniques
BEAVERTON, OR, Sept. 6, 2017 – The HSA (Heterogeneous System Architecture) Foundation and Foundation member AMD will provide a half-day tutorial on HSA technologies and AMD’s Radeon™ Open Compute at this week’s 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). The conference will be held from Sept. 9-13 in Portland, OR.
PACT brings together researchers from architecture, compilers, applications and languages to present and discuss innovative research of common interest. PACT recently widened its scope to include insights useful for the design of machines and compilers from applications such as, but not limited to, machine learning, data analytics and computational biology.
The tutorial, presented by AMD Fellow Paul Blinzer, runs from 9 AM to 12 PM on Saturday, Sept. 9th. Key elements will include an introduction into HSA and Radeon™ Open Compute runtime, followed by an in-depth session focusing on HSA, its components and the software ecosystem.
Heterogeneous System Architecture (HSA) is a standardized platform design that unlocks the performance and power efficiency of the parallel computing engines found in most modern electronic devices. It provides an ideal mainstream platform for next-generation SoCs in a range of applications including artificial intelligence.
The tutorial and other PACT sessions will be held at the Doubletree by Hilton Hotel Portland.
For more information on the tutorial and to register, please see https://parasol.tamu.edu/pact17/rates-registration.
For more information, including a full list of speakers, supporting organizations and sponsors please visit: https://parasol.tamu.edu/pact17/main-conference.
