DATE is pleased to present a special hybrid format for its 2022 event: while the situation around COVID-19 is improving, safety measures and restrictions will remain uncertain across Europe and worldwide in the coming months. As a transition towards a future post-pandemic event, DATE 2022 will host a two-day live event in the city of Antwerp (just north of Brussels in Belgium) to bring the community together again, followed by activities carried out entirely online on the subsequent days. This setup combines the in-person experience with the opportunities of online participation, fostering networking and social interaction around a programme of selected talks and panels on emerging topics that complements the traditional DATE high-quality scientific, technical and educational activities.
- Monday Tutorials
- Tuesday Special Day on "Sustainable High Performance Computing": Sessions 1.1 Innovative technologies & architectures for tomorrow's compute platforms, 1.2 IT Sustainability (Embedded tutorials), 2.1 Emerging trends in the HPC industry landscape, 3.1 Sustainable solutions at large: bettering energy efficiency in HPC
- Wednesday Special Day on "Cyber-Physical Systems for I4.0 and Smart Industrial Processes": Sessions 4.1 Digital Twins, 5.1 AI, ML and Data Analytics, 6.1 CPS-related sensors, platforms and software, 7.1 Panel: Is EDA Ready for Cyber-Physical Systems?
- Thursday Special Initiative on "Autonomous Systems Design": Sessions 9.1 Autonomous Systems Design: Opening Panel, 10.1 Reliable Autonomous Systems: Dealing with Failure & Anomalies, IP.ASD_1 Interactive Presentations, 11.1 Safety Assurance of Autonomous Vehicles, K.5 Keynote, 12.1 Designing Autonomous Systems: Experiences, Technology and Processes, IP.ASD_2 Interactive Presentations, 13.1 Predictable Perception for Autonomous Systems, and Workshop W05 Special Initiative on Autonomous Systems Design (ASD)
- Friday Workshops
M02 Software-Defined Hardware: Digital Design in the 21st Century with Chisel
Date: Monday, 01 February 2021
Time: 07:00 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/mndgRXBrkGwf3D2fm
Organizer:
Martin Schoeberl, Technical University of Denmark, DK
Chisel is a hardware construction language implemented as a domain-specific language in Scala. Therefore, the full power of a modern programming language is available to describe hardware and, more importantly, hardware generators. Chisel was developed at UC Berkeley and has been successfully used for several tape-outs of RISC-V designs. Google has developed a tensor processing unit for edge devices in Chisel. At the Technical University of Denmark we use Chisel in the T-CREST project and in teaching digital electronics and advanced computer architecture.
In this tutorial I will give an overview of Chisel for describing circuits at the register-transfer level, show how to use Chisel's tester functionality to test and simulate digital circuits, present how to synthesize circuits for an FPGA, and present Chisel's advanced functionality for describing circuit generators.
The aim of the course is to convey a basic understanding of a modern hardware description language and the ability to describe simple circuits in Chisel. It provides a basis for exploring the more advanced concepts of circuit generators written in Chisel/Scala. The intended audience is hardware designers with some background in VHDL or Verilog, but Chisel is also a good entry language for software programmers moving into hardware design (e.g., porting software algorithms to FPGAs for speedup).
Besides the lectures, we will have lab sessions to describe small circuits, test them in Chisel simulation, and run them on an FPGA.
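As a flavour of what such a lab exercise looks like, here is a minimal Chisel sketch (assuming Chisel 3 on the class path): an up-counter whose bit width is an ordinary Scala constructor parameter, which is precisely the hardware-generator idea described above.

```scala
import chisel3._

// An n-bit up-counter with enable: the width is a plain Scala
// constructor parameter, making this module a small generator.
class UpCounter(width: Int) extends Module {
  val io = IO(new Bundle {
    val en    = Input(Bool())
    val count = Output(UInt(width.W))
  })
  val cnt = RegInit(0.U(width.W))   // register, synchronously reset to 0
  when(io.en) { cnt := cnt + 1.U }
  io.count := cnt
}
```

Instantiating `new UpCounter(8)` or `new UpCounter(32)` yields differently sized hardware from the same source, something that requires generate constructs or preprocessing in classic HDLs.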
Knowledge of a hardware description language like VHDL or Verilog is beneficial, but Chisel is also approachable for software engineers with knowledge of an object-oriented language such as Java or C#.
Installation instructions and a VM with all tools installed are available for download at: https://github.com/schoeberl/chisel-lab/blob/master/Setup.md. As this is an online tutorial, you may use your own FPGA board for the experiments or use Chisel in simulation only.
The book "Digital Design with Chisel" accompanies the tutorial. It is available in open access at https://github.com/schoeberl/chisel-book
M03 How Emerging Memory Technology Will Reshape Future Computing
Date: Monday, 01 February 2021
Time: 07:00 CET - 11:40 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/oS2aqSxJicHcBHYAk
Organizers:
Jian-Jia Chen, TU Dortmund University, DE
Hussam Amrouch, University of Stuttgart, DE
Jörg Henkel, KIT, DE
Speakers:
Jian-Jia Chen, TU Dortmund University, DE
Hussam Amrouch, University of Stuttgart, DE
Yuan-Hao Chang, Academia Sinica, TW
Motivation:
- Due to their low leakage power, high density, and low unit cost, emerging byte-addressable NVM architectures are being considered for main memory and storage in the near future.
- This tutorial presents and discusses these promising technologies and their impact, in order to outline potential research directions and collaborations.
- It is unique and timely for the DATE community, since the presenters have expertise on both the system/architecture level and the technology level, enabling them to draw a vision of novel emerging techniques and their impact across different abstraction layers.
Goal: The goal of this tutorial is to present and discuss promising technologies and their impact on emerging byte-addressable non-volatile memories (NVMs).
Technical Details:
In this tutorial we will discuss various non-volatile memories. One of them is the Ferroelectric Field-Effect Transistor (FeFET), a promising non-volatile, area-efficient and low-power device that combines logic and memory and is compatible with the existing CMOS fabrication process. We will show how bit errors induced in FeFETs by temperature and process variation can be modeled from the device level up to the system level, and how the impact of such bit errors can be accurately quantified and mitigated in the context of Binary Neural Networks (BNNs). Other types of NVMs will also be discussed, demonstrating how system-level management could be affected.
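To illustrate the cross-layer error analysis described above, here is a toy Monte Carlo sketch (our own illustration, not the tutorial's material): each binarized weight of a single BNN neuron is flipped with a given bit-error rate (BER), and the fraction of trials in which the neuron's sign output changes approximates the application-level impact of the device-level errors.

```scala
import scala.util.Random

// Propagate a device-level bit-error rate to one BNN neuron's output:
// flip each +1/-1 weight with probability `ber`, then check whether
// the sign of the weighted sum (the neuron's output) changes.
object BnnBitErrors extends App {
  val rng     = new Random(42)
  val n       = 256                                  // fan-in of the neuron
  val weights = Array.fill(n)(if (rng.nextBoolean()) 1 else -1)
  val input   = Array.fill(n)(if (rng.nextBoolean()) 1 else -1)

  def neuronOut(w: Array[Int]): Int = {
    val acc = w.zip(input).map { case (wi, xi) => wi * xi }.sum
    if (acc >= 0) 1 else -1
  }

  val golden = neuronOut(weights)
  val trials = 10000
  for (ber <- Seq(1e-4, 1e-3, 1e-2)) {
    val flips = (1 to trials).count { _ =>
      val faulty = weights.map(w => if (rng.nextDouble() < ber) -w else w)
      neuronOut(faulty) != golden
    }
    println(f"BER = $ber%.0e -> output flip rate = ${flips.toDouble / trials}%.4f")
  }
}
```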
Schedule (Time Zone: GMT +1):
- 07:00 - 07:10: Opening
- 07:10 - 08:00: Emerging Devices (ReRam, FeFET, NCFET)
- 08:00 - 08:05: coffee/breakfast break
- 08:05 - 08:40: Neural Network Techniques for Systems with NVMs, from CPU Cache, Main Memory and Processing-in-Memory
- 08:40 - 09:20: Random Forest Training Techniques for Systems with NVMs
- 09:20 - 09:30: break
- 09:30 - 10:00: Full system-level NVM simulation and optimization
- 10:00 - 10:25: Demo and Q&A
- 10:25 - 10:30: break
- 10:30 - 11:00: Vision of Integration of Future Technology
Necessary background:
- Computer Architecture
M04 Security in the Post-Quantum Era: Threats and Countermeasures
Date: Monday, 01 February 2021
Time: 07:00 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/h8fKGSqAXef62czt5
Organizers:
Anupam Chattopadhyay, Nanyang Technological University, SG
Swaroop Ghosh, Pennsylvania State University, US
Robert Wille, Johannes Kepler University Linz, AT
Francesco Regazzoni, ALaRI, CH
Speakers:
Koen Bertels, TU Delft, NL
Sujoy Sinha Roy, TU Graz, AT
Shivam Bhasin, Nanyang Technological University, SG
Following Feynman's idea of computing based on the intricate principles of quantum mechanics, the scientific community has embarked on a quest to tap the unprecedented potential of quantum computing. The concerted effort by industry and academia has produced commercial quantum computers and algorithms that offer a speed-up over their classical counterparts (at least in principle).
In spite of these promises and potentials, quantum computers are still at a nascent stage. On the device front, qubits are fragile and susceptible to noise and errors due to decoherence. New noise-tolerant qubits are being studied for this purpose. Another approach is to deploy quantum error correction (QEC), e.g., the Shor code, the Steane code, or the surface code. Variational algorithms and hybrid classical-quantum approaches have shown promise for solving practical problems with NISQ-era quantum computers.
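As a back-of-the-envelope illustration of the error-correction idea (our own sketch, using a classical bit-flip analogue of the repetition code rather than a true quantum code): encoding one logical bit into three physical bits and decoding by majority vote suppresses the logical error rate from p to roughly 3p² when p is small.

```scala
import scala.util.Random

// Classical toy analogue of the repetition code: a logical error
// occurs only when two or more of the three physical bits flip.
object RepetitionCode extends App {
  val rng    = new Random(7)
  val trials = 100000
  for (p <- Seq(0.01, 0.05, 0.10)) {
    val logicalErrors = (1 to trials).count { _ =>
      val flipped = Array.fill(3)(rng.nextDouble() < p) // true = bit flipped
      flipped.count(identity) >= 2                      // majority corrupted
    }
    // Theory: 3p^2(1-p) + p^3, i.e. strong suppression for small p.
    println(f"p = $p%.2f -> logical error rate = ${logicalErrors.toDouble / trials}%.5f")
  }
}
```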
Quantum computers are prophesied to break conventional cryptosystems, most notably by leveraging Shor's factorization algorithm. However, practical quantum systems need significant scaling and engineering effort to become a real threat. Envisioning that current public-key cryptosystems will be vulnerable to such advances, a new class of cryptographic algorithms known as Post-Quantum Cryptography is being developed. Quantum systems also bring new security promises in the form of Quantum Key Distribution and quantum-enabled security primitives, e.g., TRNGs. These primitives are not foolproof either and face a surge of new attacks.
The first phase of the tutorial will discuss the growth of scalable quantum computers, their challenges, and the latest research on solving practical problems with NISQ computers. This will be followed by a glue talk connecting and establishing the realistic threats originating from a quantum-enabled attacker. The third phase of the tutorial will discuss various post-quantum cryptographic primitives. The concluding talk will present new vulnerabilities in post-quantum cryptography, opening up a new research direction.
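To make the post-quantum material concrete, below is a minimal, self-contained sketch of a Regev-style learning-with-errors (LWE) encryption of a single bit, the core primitive behind several lattice-based post-quantum schemes. The parameters (n = 8, q = 97, error in {-1, 0, 1}) are toy values chosen for readability and are far too small for any real security; this is our illustration, not code from the tutorial.

```scala
import scala.util.Random

// Toy single-bit LWE encryption: b_i = <a_i, s> + e_i (mod q); a
// ciphertext sums a random subset of samples and adds bit * floor(q/2).
object ToyLWE extends App {
  val rng = new Random(1)
  val n = 8; val m = 16; val q = 97
  def mod(x: Int) = ((x % q) + q) % q

  // Key generation: secret s, public samples (a_i, b_i) with small noise.
  val s = Array.fill(n)(rng.nextInt(q))
  val A = Array.fill(m)(Array.fill(n)(rng.nextInt(q)))
  val b = A.map(a => mod(a.zip(s).map { case (x, y) => x * y }.sum + rng.nextInt(3) - 1))

  // Encrypt one bit by summing a random subset of the public samples.
  def encrypt(bit: Int): (Array[Int], Int) = {
    val subset = (0 until m).filter(_ => rng.nextBoolean())
    val u = Array.tabulate(n)(j => mod(subset.map(i => A(i)(j)).sum))
    val v = mod(subset.map(i => b(i)).sum + bit * (q / 2))
    (u, v)
  }

  // Decrypt: v - <u, s> is small noise (bit 0) or noise + q/2 (bit 1).
  def decrypt(ct: (Array[Int], Int)): Int = {
    val (u, v) = ct
    val d = mod(v - u.zip(s).map { case (x, y) => x * y }.sum)
    if (math.min(d, q - d) > q / 4) 1 else 0
  }

  for (bit <- Seq(0, 1, 1, 0))
    println(s"encrypted $bit -> decrypted ${decrypt(encrypt(bit))}")
}
```

Security rests on the hardness of recovering s from the noisy samples (a_i, b_i), a problem believed hard even for quantum computers; breaking the toy parameters above is trivial, which is the point of hedging them as illustration only.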
B.1 BarCamp
Date: Monday, 01 February 2021
Time: 09:00 CET - 17:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/RqjaanRSLWQpMJYmj
Session chair:
Anton Klotz, Cadence, DE
Session co-chair:
Georg Gläser, IMMS, DE
Organizers:
Kim Gruttner, OFFIS, DE
Gregor Nitsche, OFFIS, DE
To present and discuss ideas and results of ongoing scientific work, we invite researchers, engineers and students from the areas of electronic design automation (EDA), microelectronics and (embedded) systems design to our open BarCamp event. More information on the format of the BarCamp can be found at: https://www.date-conference.com/barcamp
M01 Industrial Control Systems Security
Date: Monday, 01 February 2021
Time: 15:00 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/KKvhsnTSuqhtFW8bg
Organizers:
Charalambos Konstantinou, Florida State University, US
Michail Maniatakos, NYUAD, AE
Fei Miao, University of Connecticut, US
This tutorial introduces basic and advanced topics on industrial control systems (ICS) security. It starts with operational security, providing guidance on recognizing weaknesses in everyday operations and information which can be valuable to attackers. A comparative analysis between traditional information technology (IT) and operational control system architectures is also presented, along with security vulnerabilities and mitigation strategies unique to the control system domain. Current trends, threats, and vulnerabilities will be discussed, as well as attacking and defending methodologies for ICS. Case studies on cyberattacks and defenses will be presented for two critical infrastructure sectors: the power grid and the chemical sector. The tutorial also discusses the need for an accurate assessment environment, achieved through the inclusion of hardware-in-the-loop (HIL) testbeds.
The participants of the tutorial will learn: (1) known vulnerabilities of ICS, (2) common attacks on ICS, their entry points and impact levels, (3) general strategies for the secure design of ICS and cyber-physical systems, (4) strategies for attack detection, (5) testing strategies for security objectives, and (6) economic aspects of secure design, such as the trade-off between security, usability and maintainability of features.
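As an illustration of point (4), the sketch below shows a residual-based detector of the kind studied in cyber-physical systems security: sensor readings are compared against a simple process-model prediction, and a CUSUM-style statistic raises an alarm when residuals accumulate. The tank model, constants and attack scenario are hypothetical, not taken from the tutorial.

```scala
// A leaky tank fills toward inflow/leak = 75%. From t = 21 an attacker
// replays the stale reading "50.0" to mask the rising level (false data
// injection); the model prediction keeps rising, so residuals accumulate.
object ResidualDetector extends App {
  val leak = 0.02; val inflow = 1.5
  var trueLevel = 50.0; var modelLevel = 50.0
  var cusum = 0.0; val slack = 0.2; val threshold = 3.0

  for (t <- 1 to 30) {
    trueLevel  = trueLevel  * (1 - leak) + inflow   // actual plant
    modelLevel = modelLevel * (1 - leak) + inflow   // detector's model
    val reported = if (t <= 20) trueLevel else 50.0 // spoofed after t = 20
    val residual = math.abs(reported - modelLevel)
    cusum = math.max(0.0, cusum + residual - slack) // CUSUM with slack
    val flag = if (cusum > threshold) "ALARM" else ""
    println(f"t=$t%2d reported=$reported%6.2f model=$modelLevel%6.2f cusum=$cusum%6.2f $flag")
  }
}
```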
Agenda:
Part 1: Introduction and Security of ICS
- Introduction to ICS security
- Motivation, Recent Incidents, Terminology, Common practices
- Testbeds and Security Studies
Break
Part 2: Requirements for ICS security studies
- Threat Modeling and Risk Assessment
- Modeling, Resources, and Metrics for ICS studies
- Demos of Denial-of-Service and Time-Delay Attacks in a Co-Simulation Testbed
Break
Part 3: Defense strategies for ICS
- Attack Detection and Secure Control of Cyber-Physical Systems
- Defense Methodologies and Best Practices
- Future Challenges and Concluding Remarks
M05 Automation goes both ways: ML for security and security for ML
Date: Monday, 01 February 2021
Time: 15:00 CET - 18:40 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/xpPtvsj9zfKHFnbxw
Organizers:
Alexandra Dmitrienko, University of Würzburg, DE
Siddharth Garg, New York University, US
Farinaz Koushanfar, University of California San Diego, US
This tutorial focuses on state-of-the-art research at the intersection of AI and security. On the one hand, recent advances in Deep Learning (DL) have enabled a paradigm shift to include machine intelligence in a wide range of autonomous tasks. As a result, a largely unexplored attack surface has opened up, jeopardizing the integrity of DL models and hindering their ubiquitous deployment across various intelligent applications. On the other hand, DL-based algorithms are also being employed to identify security vulnerabilities in long streams of multi-modal data and logs. In distributed, complex settings, this is often the only way to monitor and audit the security and robustness of the system. The tutorial integrates the views of three experts: Prof. Garg explores the emerging landscape of "adversarial ML" with the goal of answering basic questions about the trustworthiness and reliability of modern machine learning systems. Prof. Dmitrienko presents novel uses of federated and distributed learning for risk detection on mobile platforms, with a proof-of-concept realization and an evaluation on data from millions of users. Prof. Koushanfar discusses how end-to-end automated frameworks based on algorithm/hardware co-design help with both (1) realizing accelerated, low-overhead shields against DL attacks, and (2) enabling low-overhead, real-time intelligent security monitoring.
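As a concrete taste of the adversarial-ML landscape mentioned above (our own toy sketch, not the presenters' material), the fast gradient sign method (FGSM) perturbs each input feature by ε in the direction that increases the loss; even for a hand-coded logistic-regression "model" with assumed weights this is enough to flip a confident prediction.

```scala
// FGSM on a tiny logistic-regression model. For logistic loss with true
// label y = 1, sign(dLoss/dx_i) = sign(-w_i), so each feature is pushed
// by eps in that direction.
object Fgsm extends App {
  val w = Array(0.8, -0.4, 0.3, 0.6)          // assumed trained weights
  val b = -0.2
  def sigmoid(z: Double) = 1.0 / (1.0 + math.exp(-z))
  def predict(x: Array[Double]) =
    sigmoid(x.zip(w).map { case (xi, wi) => xi * wi }.sum + b)

  val x    = Array(1.0, 0.5, -0.3, 0.9)       // correctly classified input
  val eps  = 0.5
  val xAdv = x.zip(w).map { case (xi, wi) => xi + eps * math.signum(-wi) }

  println(f"clean score: ${predict(x)}%.3f")    // ~0.70 -> class 1
  println(f"adv   score: ${predict(xAdv)}%.3f") // ~0.45 -> flipped to class 0
}
```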
M06 CAD for SoC Security
Date: Monday, 01 February 2021
Time: 15:00 CET - 18:40 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/FNow6v5Kpb67Pdh4o
Speakers:
Mark Tehranipoor, University of Florida, US
Farimah Farahmandi, University of Florida, US
The growing complexity of systems-on-chip (SoCs) and the ever-increasing cost of IC fabrication have forced the semiconductor industry to shift from a vertical business model to a horizontal one, in which time-to-market and manufacturing costs are lowered through outsourcing and design reuse. More specifically, SoC designers license third-party intellectual property (3PIP) blocks, design an SoC by integrating the 3PIPs with their in-house IPs, and then often outsource the SoC design to contract design houses, foundries and assembly facilities for synthesis, DFT insertion, GDSII development, fabrication, test and packaging. With most of the entities involved in design, manufacturing, integration, and distribution located across the globe, SoC design houses no longer have the ability to monitor the entire process and ensure security and trust.
Furthermore, designers are not aware of all the vulnerabilities in a design, nor of the countermeasures needed to address them. Unfortunately, existing tools do little to alleviate the problem: they are developed to optimize designs for power, performance, and area, while security is ignored entirely. In fact, in some cases tools and designers unintentionally create vulnerabilities in a circuit through security-unaware design processes and practices. These issues, together with the lack of trust and control, have led to a large number of vulnerabilities. Hence, it is imperative to develop computer-aided design (CAD) tools with security in mind, to identify and address vulnerabilities throughout the design life-cycle.
To protect SoCs from such vulnerabilities, academic and industrial researchers have proposed many design-for-security and security assessment techniques, e.g., information flow tracking, side-channel leakage analysis, IP encryption, logic obfuscation, and design-for-anti-counterfeit. Some of these techniques are currently being evaluated by industry and are expected to be adopted in the near future. However, recent literature has pointed out some limitations of these approaches. It is therefore crucial to gain an in-depth understanding of the security provided by the different techniques and of their limitations.
The goal of this tutorial is to present (i) the threat posed by each entity in the SoC supply chain, (ii) vulnerabilities arising during the design process and life-cycle, (iii) CAD tools and methodologies for security assessment, (iv) countermeasure tools and methodologies for each vulnerability, and (v) the challenges and research roadmap ahead.
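To illustrate one of the countermeasures listed above, logic obfuscation (logic locking), the sketch below inserts XOR/XNOR key gates into a toy 3-input function: only the correct key makes the locked netlist match the original on all inputs. This is a minimal, hypothetical example of ours; production flows operate on full gate-level netlists.

```scala
// Logic locking in miniature: key bits k0, k1 guard internal nets of
// f(a,b,c) = (a AND b) OR c. With the correct key (k0=false, k1=true)
// the key gates are transparent; any other key corrupts some outputs.
object LogicLocking extends App {
  def original(a: Boolean, b: Boolean, c: Boolean): Boolean = (a && b) || c

  def locked(a: Boolean, b: Boolean, c: Boolean, k0: Boolean, k1: Boolean): Boolean = {
    val n1 = (a && b) ^ k0          // XOR key gate on the AND output
    (n1 || c) ^ !k1                 // XNOR-style key gate on the output
  }

  val inputs =
    for (a <- Seq(false, true); b <- Seq(false, true); c <- Seq(false, true)) yield (a, b, c)
  val correct = inputs.forall { case (a, b, c) => locked(a, b, c, k0 = false, k1 = true) == original(a, b, c) }
  val wrong   = inputs.forall { case (a, b, c) => locked(a, b, c, k0 = true,  k1 = true) == original(a, b, c) }
  println(s"correct key matches on all inputs: $correct")  // true
  println(s"wrong key matches on all inputs:   $wrong")    // false
}
```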
FM01.1 PhD Forum
Date: Monday, 01 February 2021
Time: 17:00 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/FtLuDBwq5KDuvpHzd
Session Chair:
Robert Wille, Johannes Kepler University Linz, AT
All registered conference delegates and exhibition visitors are kindly invited to join the DATE 2021 PhD Forum, which will take place on Monday from 17:00 - 19:00 at the DATE 2021 venue.
The PhD Forum of the DATE Conference is a poster session hosted by the European Design Automation Association (EDAA), the ACM Special Interest Group on Design Automation (SIGDA), and the IEEE Council on Electronic Design Automation (CEDA). The purpose of the PhD Forum is to offer a forum for PhD students to discuss their thesis and research work with people of the design automation and system design community. It represents a good opportunity for students to get exposure on the job market and to receive valuable feedback on their work.
To this end, the forum takes place in two parts:
- First, everybody is invited to an opening session of the PhD Forum, where all presenters will introduce their work by means of a one-minute pitch.
- After that (at approx. 17:30), all presenters will present their work in a 1.5-hour "poster" presentation in separate rooms. Within this timeframe, everyone can enter and leave the respective rooms and engage in the corresponding discussions.
Furthermore, for each presentation, a poster (in pdf) summarizing the presentation will be provided.
Time | Label | Presentation Title, Authors |
---|---|---|
17:00 CET | | OPENING OF THE PHD FORUM Speaker: Robert Wille, Johannes Kepler University Linz, AT |
17:00 CET | FM01.1.1 | EXPLOITING ERROR RESILIENCE OF ITERATIVE AND ACCUMULATION BASED ALGORITHMS FOR HARDWARE EFFICIENCY Speaker and Author: Dr. G.A. Gillani, University of Twente, NL Abstract While the efficiency gains due to process technology improvements are reaching the fundamental limits of computing, emerging paradigms like approximate computing provide promising efficiency gains for error resilient applications. However, the state-of-the-art approximate computing methodologies do not sufficiently address the accelerator designs for iterative and accumulation based algorithms. Keeping in view a wide range of such algorithms in digital signal processing, this thesis investigates systematic approximation methodologies to design high-efficiency accelerator architectures for iterative and accumulation based algorithms. As a case study of such algorithms, we have applied our proposed approximate computing methodologies to a radio astronomy calibration application./sites/date21/files/phdforum/FM01.1.1_ExploitingErrorResilienceOfIterativeAndAccumulationBasedAlgorithmsForHardwareEfficiency_Gillani.pdf |
17:00 CET | FM01.1.2 | IMPROVING ENERGY EFFICIENCY OF NEURAL NETWORKS Speaker and Author: Seongsik Park, Seoul National University, KR Abstract Deep learning with neural networks has shown remarkable performance in many applications. However, this success of deep learning is based on a tremendous amount of energy consumption, which becomes one of the major obstacles to deploying the deep learning model on mobile devices. To address this issue, many researchers have studied various methods for improving the energy efficiency of the neural networks to expand the applicability of deep learning. This dissertation is in line with those studies and contains mainly three approaches, including quantization, energy-efficient accelerator, and neuromorphic approach./sites/date21/files/phdforum/FM01.1.2_improving.energy.efficiency.of.neural.networks_seongsik.park.pdf |
17:00 CET | FM01.1.3 | DESIGN, IMPLEMENTATION AND ANALYSIS OF EFFICIENT HARDWARE-BASED SECURITY PRIMITIVES Speaker and Author: Nalla Anandakumar Nachimuthu, University of Florida, US Abstract Hardware-based security primitives play important roles in protecting and securing a system in Internet of Things (IoT) applications. The main primitives are physical unclonable functions (PUF) and true random number generator (TRNG) studied in this paper. Efficient FPGA implementation are proposed in the work along with relevant security analysis using prevalent metrics. Finally, an application of designed TRNG and PUF is proposed for implementing an authenticated key agreement protocol./sites/date21/files/phdforum/FM01.1.3_Design,_Implementation_and_Analysis_of_Efficient_Hardware-based_Security_Primitives_N._Nalla_Anandakumar.pdf |
17:00 CET | FM01.1.4 | FORMAL ABSTRACTION AND VERIFICATION OF ANALOG CIRCUITS Speaker and Author: Ahmad Tarraf, research assistant uni frankfurt, DE Abstract In the recently submitted dissertation the formal abstraction and verification of analog circuit is examined. The dissertation aims to contribute to the formal verification of AMS circuits by generating accurate behavioral models that can be used for verification. As accurate behavioral models are often handwritten, this dissertation proposes an automatic abstraction method based on sampling a Spice netlist at transistor level with full Spice BSIM accuracy. The approach generates a hybrid automaton (HA) that exhibits a linear behavior described by a state space representation in each of its locations, thereby modeling the nonlinear behavior of the netlist via multiple locations. Hence, due to the linearity of the obtained model, the approach is easily scalable. The HAs can be deployed in various output languages: Matlab, Verilog-A, and SystemC-AMS. Various extensions exist for the models enhancing their exhibited behavior./sites/date21/files/phdforum/FM01.1.4_Formal_Abstraction_and_Verification_of_Analog_Circuits_Ahmad_Tarraf.pdf |
17:00 CET | FM01.1.5 | OPTIMIZATION TOOLS FOR CONVNETS ON THE EDGE Speaker: Valentino Peluso, Politecnico di Torino, IT Authors: Valentino Peluso, Enrico Macii and Andrea Calimera, Politecnico di Torino, IT Abstract The shift of Convolutional Neural Networks (ConvNets) into low-power devices with limited compute and memory resources calls for cross-layer strategies spanning from hardware to software optimization. This work answers to this need, presenting a collection of tools for efficient deployment of ConvNets on the edge./sites/date21/files/phdforum/FM01.1.5_Optimization_Tools_for_ConvNets_on_the_Edge_Valentino_Peluso.pdf |
17:00 CET | FM01.1.6 | DESIGN SPACE EXPLORATION IN HIGH LEVEL SYNTHESIS Speaker and Author: Lorenzo Ferretti, Università della Svizzera italiana, CH Abstract High Level Synthesis (HLS) is a process which, starting from a high-level description of an application (C/C++), generates the corresponding RTL code describing the hardware implementation of the desired functionality. The HLS process is usually controlled by user-given directives (e.g., directives to set whether or not to unroll a loop) which influence the resulting implementation area and latency. By using HLS, designers are able to rapidly generate different hardware implementations of the same application, without the burden of directly specifying the low level implementation in detail. Nonetheless, the correlation among directives and resulting performance is often difficult to foresee and to quantify, and the high number of available directives leads to an exponential explosion in the number of possible configurations. In addition, sampling the design space involves a time-consuming hardware synthesis, making a brute-force exploration infeasible beyond very simple cases. However, for a given application, only few directive settings result in Pareto-optimal solutions (with respect to metrics such as area, run-time and power), while most are dominated. The design space exploration problem aims at identifying close to Pareto-optimal implementations while synthesising only a small portion of the possible configurations from the design space. In my Ph.D. dissertation I present an overview of the HLS design flow, followed by a discussion about existing strategies in literature. Moreover, I present new exploration methodologies able to automatically generate optimised implementations of hardware accelerators. The proposed approaches are able to retrieve a close approximation of the real Pareto solutions while synthesising only a small fraction of the possible design, either by smartly navigating their design space or by leveraging prior knowledge. I also present a database of design space explorations whose goal is to push the research boundaries by offering to researchers a tool for the standardisation of exploration evaluation, and a reliable source of knowledge for machine learning based approaches. Lastly, the stepping-stones of a new approach relying on deep learning strategies with graph neural networks is presented./sites/date21/files/phdforum/FM01.1.6_DesignSpaceExplorationInHigh-LevelSynthesis_LorenzoFerretti.pdf |
17:00 CET | FM01.1.7 | RELIABILITY IMPROVEMENT OF STT-MRAM CACHE MEMORIES IN DATA STORAGE SYSTEMS Speaker: Elham Cheshmikhani, Sharif University of Technology, IR Authors: Elham Cheshmikhani1, Hamed Farbeh2 and Hossein Asadi1 1Sharif University of Technology, IR; 2Amirkabir University of Technology, IR Abstract Spin-Transfer Torque Magnetic RAM (STT-MRAM) is known as the most promising replacement for SRAM technology in cache memories. Despite its high density, non-volatility, near-zero leakage power, and immunity to radiation-induced particle strikes as major advantages, STT-MRAM-based cache memory suffers from high error rates, mainly due to retention failures, read disturbances, and write failures. Existing studies are limited to estimating the rate of only one or two of these error types; the overall vulnerability of STT-MRAM caches, whose estimation is a must for designing cost-efficient reliable caches, has not been addressed in any previous study. Meanwhile, all existing reliability improvement schemes for STT-MRAM caches are limited to overcoming one or two error types, and the majority of them adversely affect the other error types. In this dissertation, we first propose a system-level framework for reliability exploration and characterization of error behavior in STT-MRAM caches. To this end, we formulate the cache vulnerability considering the inter-correlation of the error types, including retention failure, read disturbance, and write failure, as well as the dependency of error rates on workload behavior and Process Variations (PVs). Then, we investigate the effect of temperature on the STT-MRAM cache error rate and demonstrate that heat accumulation increases the error rate by 110.9 percent. We also illustrate that this heat accumulation is mainly due to the locality of committed write operations in the cache. In addition, we demonstrate that a) extra read accesses to the data and tag arrays, imposed to enhance the cache access time, significantly increase the read disturbance error rate; and b) the diversity in the number of '1's and switchings in the codewords of a data block significantly degrades the protection capability of error-correcting codes. We also propose a new cache architecture, so-called Reliability-Optimized STT-MRAM Memory (ROSTAM), to customize different parts of the cache structure for reliability enhancement. ROSTAM consists of four components: 1) a simple yet effective replacement policy, called TA-LRW, to prevent heat accumulation in the cache and reduce the rate of all three error types, 2) a novel tag array structure, so-called 3RSeT, to reduce the error rate by eliminating a significant portion of tag reads, 3) an effective scheme, so-called REAP-Cache, to prevent the accumulation of read disturbance in cache blocks and completely eliminate the adverse effect of concealed reads on cache reliability, and 4) a new ECC configuration, so-called ROBIN, to uniformly distribute the transitions between the codewords and maximize the ECC correction capability. We compare the proposed architecture with an 8-way L2 cache protected by SEC-DED(72,64) and using the LRU policy. The experimental results, using the gem5 full-system simulator and a comprehensive set of multi-programmed workloads from the SPEC CPU2006 benchmark suite on a quad-core processor, show that: 1) the rate of read disturbance errors is reduced by 4966.1x, achieved by integrating TA-LRW, 3RSeT, ROBIN, and REAP-Cache, 2) write failures are reduced by 3.7x, the effect of TA-LRW and ROBIN, 3) the retention failure rate is reduced by 8.1x because of the TA-LRW and REAP-Cache operations, and 4) the total error rate considering all error types is reduced by 10x. This significant reliability enhancement is achieved at the cost of less than a 2.7% increase in energy consumption, less than 1% area overhead, and an average of 2.3% performance degradation./sites/date21/files/phdforum/FM01.1.7_RELIABILITY_IMPROVEMENT_OF_STT-MRAM_CACHE_MEMORIES_IN_DATA_STORAGE_SYSTEMS_Cheshmikhani-DATEPhDForum.pdf |
17:00 CET | FM01.1.8 | ENABLING LOGIC-MEMORY SYNERGY USING INTEGRATED NON-VOLATILE TRANSISTOR TECHNOLOGIES FOR ENERGY-EFFICIENT COMPUTING Speaker and Author: Sandeep Krishna Thirumala, Purdue University, US Abstract Over the last decade, there has been an immense interest in the quest for emerging memory technologies which possess distinct advantages over traditional silicon-based memories. In the era of big-data, a key challenge is to achieve close integration of logic and memory sub-systems, to overcome the von-Neumann bottleneck associated with the long-distance data transmission between logic and memory. Moreover, brain-inspired deep neural networks which have transformed the field of machine learning in recent years, are not widely deployable in edge devices, mainly due to the aforementioned bottleneck. Therefore, there exists a need to explore solutions with tight logic-memory integration, in order to enable efficient computation for current and future generation of systems. Motivated by this, in this thesis, we harness the benefits offered by emerging technologies and propose devices, circuits, and systems which exhibit an amalgamation of logic and memory functionality. We propose two variants of memory devices: (a) Reconfigurable Ferroelectric transistors and (b) Valley-Coupled-Spin Hall effect-based magnetic random access memory, which exhibit unique logic-memory unification. Exploiting the intriguing features of the proposed devices, we carry out a cross-layer exploration from device-to-circuits-to-systems for energy-efficient computing. We investigate a wide spectrum of applications for the proposed devices including embedded memories, non-volatile logic, compute-in-memory fabrics and artificial intelligence systems. Overall, evaluation results of the proposed device-circuit-system techniques in this thesis, show significant reduction in energy consumption along with performance improvement of various systems when compared to conventional von Neumann-based approaches for several application workloads, addressing the critical need for logic-memory synergy in current/next-generation of computing./sites/date21/files/phdforum/FM01.1.8_Enabling_Logic-Memory_Synergy_using_Integrated_Non-Volatile_Transistor_Technologies_for_Energy-Efficient_Computing_Sandeep_Krishna_Thirumala.pdf |
17:00 CET | FM01.1.9 | HARDWARE SECURITY IN DRAMS AND PROCESSOR CACHES Speaker and Author: Wenjie Xiong, Facebook AI Research, US Abstract The cost reduction and performance improvement of silicon chips have made computing devices ubiquitous, from IoT to cloud servers. These devices have been deployed to collect and process an unprecedented amount of data around us. Also, to make full use of resources, often the system is shared among different applications. This raises a lot of security and privacy concerns. Meanwhile, memory and processor caches are essential components of modern computers, but they have been mainly designed for their functionality and performance, not for security. There are potential positive uses of hardware components that can improve security, but also, there are security attacks that make use of the vulnerabilities in hardware. This dissertation consequently studies both the positive and negative security aspects of Dynamic Random Access Memories (DRAMs) and caches on commercial devices. The proposed DRAM Physically Unclonable Functions (PUFs) can be deployed today for higher security, especially in low-end IoT devices and embedded systems currently utilized in health care, home automation, transportation, or energy grids, which lack other security mechanisms. The discovered cache LRU covert-channel attacks and DRAM temperature spying attacks show new types of vulnerabilities in today's systems, motivating new designs to protect applications in a shared system and to prevent malicious use of the physical features of the hardware./sites/date21/files/phdforum/FM01.1.9_Hardware-security-in-DRAMs-and-caches_WenjieXiong.pdf |
17:00 CET | FM01.1.11 | LESS IS MORE: EFFICIENT HARDWARE DESIGN THROUGH APPROXIMATE LOGIC SYNTHESIS Speaker and Author: Ilaria Scarabottolo, USI Lugano, CH Abstract As energy efficiency becomes a crucial concern in almost every kind of digital application, Approximate Computing gains popularity as a potential answer to this ever-growing energy quest. Approximate Computing is a design paradigm particularly suited for error-resilient applications, where small losses in accuracy do not represent a significant reduction in the quality of the result. In these scenarios, energy consumption and resources employment (such as electric power, or circuit area) can be significantly improved at the expense of a slight reduction in output accuracy. While Approximate Computing can be applied at different levels, my research focuses on the design of approximate hardware. In particular, my work explores Approximate Logic Synthesis, where the hardware functionality is automatically tuned to obtain more efficient counterparts, while always controlling the entailed error. Functional modifications include, among others, removal or substitution of gates and signals. A fundamental prerequisite for the application of these modifications is an accurate error model of the circuit under exam. My Ph.D. research work has deeply concentrated on the derivation of accurate error models of a circuit. These can, in turn, guide Approximate Logic Synthesis algorithms to optimal solutions and avoid expensive, time-consuming simulations. A precise error model allows to fully explore the design space and, potentially, adjust the desired level of accuracy even at runtime. I have also contributed to the state of the art in ALS techniques by devising a circuit pruning algorithm that produces efficient approximate circuits for given error constraints. The innovative aspect of my work is that it exploits circuit topology and graph partitioning to identify circuit portions that impact to a smaller extent on the final output. With this information, ALS algorithms can improve their efficiency by acting first on those less-influent portions. Indeed, this error characterisation proves to be very effective in guiding and modeling approximate synthesis./sites/date21/files/phdforum/FM01.1.11_LESS_IS_MORE_EFFICIENT_HARDWARE_DESIGN_THROUGH_APPROXIMATE_LOGIC_SYNTHESIS.pdf |
17:00 CET | FM01.1.12 | LONGLIVENOC: WEAR LEVELLING, WRITE REDUCTION AND SELECTIVE VC ALLOCATION FOR LONG LASTING DARK SILICON AWARE NOC INTERCONNECTS Speaker and Author: Khushboo Rani, IIT Guwahati, IN Abstract With the continuing advancement in semiconductor technologies, more and more cores are integrated on the same die that leads to the concept of Chip Multi-processor. The communication across these multiple cores is facilitated by the switch-based Network-on-Chip (NoC) for efficient and bursty on-chip communication. The power and performance of these interconnect is a significant factor as the communication network consumes a considerable share of the power budget. In particular, the buffers used at every port of the NoC router consume considerable dynamic as well as static power. It has been noticed that communication consumes almost 36% of the total chip power. With tighter power budgets and to meet the thermal design power (TDP) for the system, components like the cores/caches undergo voltage and frequency scaling and at times, power off. Powering off several components to stay within the TDP leads to the concept of dark silicon. In dark silicon, although the cores/caches are off, the communication network is expected to be available. In order to reduce the standby power of the network in such events, one looks for avenues in non-volatile memory (NVM) technologies. NVM technologies such as spin-transfer torque random access memory (STT-RAM), offer many advantages over conventional SRAM technology. These advantages include high density, good scalability, and low leakage power consumption. However, the buffers made from these memory technologies suffer from costly write operation and low write endurance. Thus, in my PhD research, I proposed wear-levelling and write reduction techniques to enhance the lifetime and reduce the effect of the costly write operation of NVM buffers in the dark silicon scenario. We evaluate our proposed approaches on a multi-core full system simulator Gem5, with Garnet2.0 as the interconnection network model for NoC performance. We evaluate our work with PARSEC and SPEC benchmark suites./sites/date21/files/phdforum/FM01.1.12_LongLiveNoC_rani_khushboo.pdf |
17:00 CET | FM01.1.13 | ENERGY EFFICIENT AND RUNTIME BASED APPROXIMATE COMPUTING TECHNIQUES FOR IMAGE COMPRESSION APPLICATION: AN INTEGRATED APPROACH COVERING CIRCUIT TO ALGORITHMIC LEVEL Speaker: Junqi Huang, University of Nottingham Malaysia, MY Authors: Junqi Huang1, Nandha kumar Thulasiraman2 and Haider Abbas Almurib1 1University of Nottingham Malaysia, MY; 2University of Nottingham, MY Abstract Approximate computing has been widely used in error resilient design for improving the energy performance by reducing circuit complexity and allowing circuits to produce acceptable error results (approximation). Generally, the approximate computing techniques have been developed and implemented either at algorithmic level or logic level or circuit level and with no feasibility of on-the-fly or Runtime change of approximation. Thus, different from the existing methods, this thesis presents novel energy-efficient integrated approach of implementing approximate computing techniques from circuit level to the algorithmic level that incorporate the change of approximation for a given circuit at Runtime without incurring any extra hardware requirement. The two new techniques are known as Frequency upscaling (FUS) technique and Voltage over scaling (VOS) technique. Meanwhile, these two new techniques developed for the logic/circuit level abstract are integrated into a new proposed algorithmic level approximate computing technique known as zigzag low-complexity approximate DCT (ZLCADCT). Thus, developing an integrated approach of implementing runtime based approximate computing technique from circuit level abstract to algorithmic level abstract for image compression application./sites/date21/files/phdforum/FM01.1.13_Energy_Efficient_and_Runtime_based_Approximate_Computing_Techniques_for_Image_Compression_Application_An_Integrate.pdf |
17:00 CET | FM01.1.14 | THESIS: PERFORMANCE AND PHYSICAL ATTACK SECURITY OF LATTICE-BASED CRYPTOGRAPHY Speaker and Author: Felipe Valencia, Univesità della Svizzera Italiana, CH Abstract This thesis addresses two problems that limit the widespread of LBC: 1) the physical security of real world implementations, where this thesis focuses on fault attacks, and 2) the not always satisfactory performance of lattice-based cryptography, focusing on accelerators and instruction set extensions./sites/date21/files/phdforum/FM01.1.14_PERFORMANCE_AND_PHYSICAL_ATTACK_SECURITY_OF_LATTICE-BASED_CRYPTOGRAPHY_M.Sc_Felipe_Valencia.pdf |
17:00 CET | FM01.1.15 | AMOEBA-INSPIRED SYSTEM CONTROLLER ON IOT EDGE Speaker: Anh Nguyen, Tokyo Institute of Technology, JP Authors: Anh Nguyen and Yuko Hara-Azumi, Tokyo Institute of Technology, JP Abstract This work aims at developing a light-weight yet efficient controller for IoT systems on the edge devices. The controller bases on a recent emerging computing model inspired by an amoeba to solve the Satisfiability problems (SAT), which can represent various IoT applications. Realizing the massive parallelism feature of this amoeba-inspired SAT solver, AmoebaSAT, we conducted its FPGA-based hardware implementations through a hardware/software co-design approach. By extending the original algorithm to help the solver escape local minima more quickly and utilizing the community structure of different IoT applications, we developed a high efficient IoT controller which well understands the characteristics of different application domains and outperformed state-of-the-arts./sites/date21/files/phdforum/FM01.1.15_Amoeba-inspired_System_Controller_for_IoT_Edge_Anh_Hoang_Ngoc_Nguyen.pdf |
17:00 CET | FM01.1.16 | MONITORING AND CONTROLLING INTERCONNECT CONTENTION IN CRITICAL REAL-TIME SYSTEMS Speaker: Jordi Cardona, Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, ES Authors: Jordi Cardona1, Carles Hernandez2, Enrico Mezzetti3, Jaume Abella4 and Francisco J Cazorla5 1Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, ES; 2Universitat Politècnica de València, ES; 3Barcelona Supercomputing Center (BSC), ES; 4Barcelona Supercomputing Center (BSC-CNS), ES; 5Barcelona Supercomputing Center, ES Abstract Computing performance needs in critical real-time systems (CRTS) domains such as automotive, avionics, railway, and space are on the rise. This is fueled by the trend towards implementing an increasing number of product functionalities in software that ends up managing huge amounts of data and implementing complex artificial-intelligence functionalities such as Advanced Driver Assistance Systems. Manycores are able to satisfy, in a cost-efficient manner, the computing needs of embedded real-time industry. In this line, building as much as possible on manycore solutions deployed in the high-performance (mainstream) market, contribute to further reduce costs and increase availability. However, commercial off the shelf (COTS) manycores bring several challenges for their adoption in the critical embedded market. One of those is deriving timing bounds to tasks’ execution times as part of the overall timing validation and verification processes. In particular, the network-on-chip (NoC) has been shown to be the main resource in which contention arises, and hence hampers deriving tight bounds to the timing of tasks. In this extended abstract ,we will show our proposed hardware/software solutions to reduce the worst-case execution time (WCET) of applications optimizing the NoC setup parameters and also our developed techniques to measure and control contention (first in centralized NoCs and later in distributed NoCs systems)./sites/date21/files/phdforum/FM01.1.16_Monitoring_and_Controlling_Interconnect_Contention_in_Critical_Real-time_Systems_Jordi_Cardona.pdf |
17:00 CET | FM01.1.17 | RELIABILITY CONSIDERATIONS IN THE USE OF HIGH-PERFORMANCE PROCESSORS IN SAFETY-CRITICAL SYSTEMS Speaker: Sergi Alcaide, Universitat Politècnica de Catalunya - Barcelona Supercomputing Center (BSC), ES Authors: Sergi Alcaide1, Leonidas Kosmidis2, Carles Hernandez3 and Jaume Abella4 1Universitat Politècnica de Catalunya - Barcelona Supercomputing Center (BSC), ES; 2Barcelona Supercomputing Center (BSC), ES; 3Universitat Politècnica de València, ES; 4Barcelona Supercomputing Center (BSC-CNS), ES Abstract High-Performance Computing (HPC) platforms are a must in Autonomous Driving (AD) systems due to the tremendous jump in performance required. However, since HPC components are not designed following the development process used in the automotive domain, some safety requirements are not met by default on those platforms. The automotive functional safety standard, ISO 26262, stipulates that automotive platforms must avoid Common Cause Failures (CCFs), i.e. any single fault that can cause a failure despite safety measures in place. CCFs can be avoided by enforcing diverse redundancy (e.g. lockstep execution), so that a single fault affecting redundant elements (e.g. a voltage droop) does not produce the same error in those redundant elements. This thesis presents software and hardware techniques to achieve a diverse redundant execution in multiple HPC components to enable their usage in the automotive domain./sites/date21/files/phdforum/FM01.1.17_Reliability_considerations_in_the_use_of_high-performance_processors_in_safety-critical_systems_SergiAlcaide.pdf |
17:00 CET | FM01.1.18 | HARDWARE SECURITY EVALUATION OF IOT EMBEDDED APPLICATIONS Speaker and Author: zahra kazemi, PhD. Candidate, FR Abstract In recent years, the broad adoption and accessibility of the Internet of Things (IoT) have created major concerns for the manufacturers and enterprises in the hardware security domain. The importance of software developers’ role in the evaluation of the system’s security has raised along with the demand for shortening the time to market and development cost. However, embedded software developers often lack the knowledge to consider the hardware-based threats and their effects on important assets. To overcome such challenges, it is essential for the security specialists to provide the embedded developers with practical necessary tools and evaluation methods against hardware-based attacks. In this thesis work, we develop an evaluation methodology and an easy to use hardware security assessment framework, against major physical attacks ( e.g. side-channel and fault injection attacks). It can assist the software developers to detect their system vulnerabilities and to protect important assets. This work can also guide on implementing software-level countermeasures, which can reduce the effects of the physical attack’s risks to an acceptable level. As a case study, we apply our approach to an IoT medical application named “SecPump” that models an infusion pump in the hospitals. This study mimics a real experimental evaluation process and highlights the potential risks of ignoring the physical attacks./sites/date21/files/phdforum/FM01.1.18_Hardware_Security_Evaluation_of_IoT_Embedded_Applications_zahrakazemi.pdf |
17:00 CET | FM01.1.19 | A COMPUTER-AIDED DESIGN SPACE EXPLORATION FOR DEPENDABLE CIRCUITS Speaker and Author: Stefan Scharoba, Brandenburg University of Technology, DE Abstract This thesis presents an automated toolset for exploring design choices which provide fault tolerance by means of hardware redundancy. Based on a given VHDL model, various fault tolerant implementations can be automatically created and evaluated regarding their overhead and reliability improvement./sites/date21/files/phdforum/FM01.1.19_A_Computer-Aided_Design_Space_Exploration_for_Dependable_Circuits_Stefan_Scharoba.pdf |
17:00 CET | FM01.1.20 | ROBUST AND ENERGY-EFFICIENT DEEP LEARNING SYSTEMS Speaker and Author: Muhammad Abdullah Hanif, Institute of Computer Engineering, Vienna University of Technology, AT Abstract Deep Learning (DL) has evolved to become the state-of-the-art machine learning algorithm for many AI applications such as image classification, object detection, object segmentation, voice recognition, and language translation. Due to the state-of-the-art accuracy of the models generated through DL, i.e., Deep Neural Networks (DNNs), they are also being adopted for safety-critical applications, e.g., autonomous driving, healthcare, and security & surveil-lance. Besides energy efficiency, for safety-critical applications, reliability against technology-induced faults (e.g., soft errors, device aging, and manufacturing defects) is one of the foremost concerns, as even a single neglected fault at a critical location can result in a significant drop in the application-level accuracy. This Ph.D. work aims at studying and exploiting the unique error-resilience characteristics of DNNs to improve their robustness against the technology-induced reliability threats at low overhead cost. This work also improves the power/performance/energy-efficiency of the systems through judicious approximations (i.e., carefully crafted designer-induced errors in less-sensitive neurons) that can be tolerated due to error-resilience characteristics of DNNs and can be leveraged to compensate for the overheads of reliability features, or alternatively, be spent for enhancing reliability levels./sites/date21/files/phdforum/FM01.1.20_Robust_and_Energy-Efficient_Deep_Learning_Systems_Muhammad_Abdullah_Hanif.pdf |
17:00 CET | FM01.1.21 | AUTOMATED DESIGN OF APPROXIMATE ACCELERATORS Speaker and Author: Jorge Castro-Godínez, Karlsruhe Institute of Technology (KIT), DE Abstract Approximate computing has emerged as a design paradigm suitable for applications with inherent error resilience. This paradigm aims to reduce the computing costs of exact calculations by lowering the accuracy of their results. In the last decade, many approximate circuits, particularly approximate adders and multipliers, have been reported in the literature. For an ongoing number of such approximate circuits, selecting those that minimize the required resources for designing and generating an approximate accelerator from a high-level specification while satisfying a previously defined accuracy constraint is a joint design space exploration and high-level synthesis challenge. This dissertation proposes automated methods for designing and implementing approximate accelerators built with approximate arithmetic circuits./sites/date21/files/phdforum/FM01.1.21_Automated_Design_of_Approximate_Accelerators_Jorge_Castro-Godínez.pdf |
17:00 CET | FM01.1.22 | NEXT GENERATION DESIGN FOR TESTABILITY, DEBUG AND RELIABILITY USING FORMAL TECHNIQUES Speaker and Author: Sebastian Huhn, University of Bremen, DE Abstract Several improvements in the Electronic Design Automation (EDA) flow enabled the design of highly complex Integrated Circuits (ICs). This complexity has been introduced to address the challenging intended application scenarios, for instance, in automotive systems, which typically require several heterogeneous functions to be jointly implemented on-chip at once. On the one hand, the complexity scales with the transistor count and, on the other hand, further non-functional aspects have to be considered, which leads to new demanding tasks during the state-of-the-art IC design and test. Thus, new measures are required to achieve the required level of testability, debug and reliability of the resulting circuit. This thesis proposes several novel approaches to, in the end, pave the way for the next generation of IC, which can be successfully and reliable integrated even in safety-critical applications. In particular, this thesis combines formal techniques - like the Satisfiability (SAT) problem and the Bounded Model Checking (BMC) - to address the arising challenges concerning the increase in Test Data Volume (TDV) as well as Test Application Time (TAT) and the required reliability. One contribution concerns the development of Test Vector Transmitting using enhanced compression-based TAP controllers (VecTHOR). VecTHOR proposes a newly designed compression architecture, which combines a codeword-based compression, a dynamically configurable dictionary and a run-length encoding scheme. VecTHOR fulfills a lightweight character and is seamlessly integrated within an IEEE 1149.1 Test Access Port (TAP) controller. VecTHOR achieves a significant reduction of the TDV and the TAT by 50%, which directly reduces the resulting test costs. Another contribution concerns the design and implementation of a retargeting framework to process existing test data off-chip once prior-to the transfer without the need for an expensive test regeneration. Different techniques have been implemented to provide choosable trade-offs between the resulting the TDV as well as the TAT and the required run-time of the retargeting process. These techniques include a fast heuristic approach and a formal optimization SAT-based method by invoking multiple objective functions. Besides this, one contribution concerns the development of a hybrid embedded compression architecture, which is specifically designed for Low-Pin Count Test (LPCT) in the field of safety-critical systems enforcing a zero-defect policy. This hybrid compression has been realized in close industrial cooperation with Infineon Germany. This approach allows reducing the resulting test time by a factor of approx. three. A further contribution is about the development of a new methodology to significantly enhance the robustness of sequential circuits against transient faults while neither introducing a large hardware overhead nor measurably impacting the latency of the circuit. Application-specific knowledge is conducted by applying SAT-based techniques as well as BMC to achieve this, which yields the synthesis of a highly efficient fault detection mechanism. 
The proposed techniques are presented in detail and evaluated extensively by considering industrial-representative candidates, which clearly demonstrated the proposed approaches' efficacy./sites/date21/files/phdforum/FM01.1.22_NEXT_GENERATION_DESIGN_FOR_TESTABILITY,_DEBUG_AND_RELIABILITY_USING_FORMAL_TECHNIQUES-Huhn.pdf |
17:00 CET | FM01.1.23 | DESIGN AUTOMATION FOR FIELD-COUPLED NANOTECHNOLOGIES Speaker: Marcel Walter, University of Bremen, DE Authors: Marcel Walter1 and Rolf Drechsler2 1University of Bremen, DE; 2University of Bremen/DFKI, DE Abstract Circuits based on complementary metal-oxide-semiconductors (CMOS) enabled the digital revolution and still provide the basis for almost all computational devices to this date. Nevertheless, the class of Field-coupled Nanocomputing (FCN) technologies is a promising candidate to outperform CMOS circuitry in various metrics. Not only does FCN process binary information inherently, but it also allows for absolute low-power in-memory computing with an energy dissipation that is magnitudes below that of CMOS. However, physical design for FCN technologies is still in its infancy. In this Student Research Forum Proposal, a complete flow for the physical design of FCN circuitry is presented. This includes exact and heuristic techniques for placement, routing, clocking, and timing, formal verification, and debugging. All proposed algorithms have been made publicly available in a holistic framework called fiction./sites/date21/files/phdforum/FM01.1.23_Design_Automation_for_Field-coupled_Nanotechnologies_Marcel_Walter.pdf |
17:00 CET | FM01.1.24 | HARDWARE AND SOFTWARE TECHNIQUES FOR SECURING INTELLIGENT CYBER-PHYSICAL SYSTEMS Speaker and Author: Faiq Khalid, TU Wien, AT Abstract This Ph.D. work aims to design a robust intelligent CPS against the hardware-level security attacks (e.g., hardware Trojans, communication network attacks for VANET) and software-level security attacks (e.g., adversarial attacks on ML-based components in CPS). Towards this goal, this work studies and analyzes the security vulnerabilities at hardware- and software levels to identify the potentially vulnerable components, SoCs, or systems in CPS. Based on these analyses, this work improves the security of the CPS by deploying efficient and low-overhead solutions. These solutions can either identify the potential attacks during run-time or provide an efficient defend against these attacks./sites/date21/files/phdforum/FM01.1.24_Hardware-and-Software-Techniques-for-Securing-Intelligent-Cyber-Physical-Systems_FaiqKhalid.pdf |
O.1 Opening Session: Plenary and Awards Ceremony
Date: Tuesday, 02 February 2021
Time: 07:00 CET - 07:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/P3YMvvxN7oEXc2FMt
Session chair:
Franco Fummi, University of Verona, IT
Session co-chair:
Ian O'Connor, Ecole Centrale de Lyon, FR
Time | Label | Presentation Title / Authors |
---|---|---|
07:00 CET | O.1.1 | WELCOME ADDRESSES Speakers: Franco Fummi1 and Ian O'Connor2 1Università di Verona, IT; 2Lyon Institute of Nanotechnology, FR Abstract Welcome messages by the general chair and the program chair. |
07:35 CET | O.1.2 | PRESENTATION OF AWARDS Speakers: Franco Fummi1 and Ian O'Connor2 1Università di Verona, IT; 2Lyon Institute of Nanotechnology, FR Abstract Presentation of awards. |
K.1 Opening Keynote: "Quantum supremacy using a programmable superconducting processor"
Date: Tuesday, 02 February 2021
Time: 07:50 CET - 08:40 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/8xs6zHEtqREs2vvjp
Session chair:
Mathias Soeken, Microsoft, CH
Session co-chair:
Marco Casale Rossi, Synopsys, IT
The promise of quantum computers is that certain computational tasks might be executed exponentially faster on a quantum processor than on a classical processor. A fundamental challenge is to build a high-fidelity processor capable of running quantum algorithms in an exponentially large computational space. Here we report the use of a processor with programmable superconducting qubits to create quantum states on 53 qubits, corresponding to a computational state-space of dimension 2^53 (about 10^16). Measurements from repeated experiments sample the resulting probability distribution, which we verify using classical simulations. Our Sycamore processor takes about 200 seconds to sample one instance of a quantum circuit a million times—our benchmarks currently indicate that the equivalent task for a state-of-the-art classical supercomputer would take approximately 10,000 years. This dramatic increase in speed compared to all known classical algorithms is an experimental realization of quantum supremacy for this specific computational task, heralding a much-anticipated computing paradigm. Bio: John Martinis did pioneering experiments on superconducting qubits in the mid-1980s for his PhD thesis. He has worked on a variety of low-temperature device physics topics during his career, focusing on quantum computation since the late 1990s. He was awarded the London Prize in low-temperature physics in 2014 for his work in this field. From 2014 to 2020 he worked at Google to build a useful quantum computer, culminating in a quantum supremacy experiment in 2019.
Time | Label | Presentation Title / Authors |
---|---|---|
07:50 CET | K.1.1 | QUANTUM SUPREMACY USING A PROGRAMMABLE SUPERCONDUCTING PROCESSOR Speaker and Author: John Martinis, Google, UCSB and Quantala, US Abstract The promise of quantum computers is that certain computational tasks might be executed exponentially faster on a quantum processor than on a classical processor. A fundamental challenge is to build a high-fidelity processor capable of running quantum algorithms in an exponentially large computational space. Here we report the use of a processor with programmable superconducting qubits to create quantum states on 53 qubits, corresponding to a computational state-space of dimension 2^53 (about 10^16). Measurements from repeated experiments sample the resulting probability distribution, which we verify using classical simulations. Our Sycamore processor takes about 200 seconds to sample one instance of a quantum circuit a million times—our benchmarks currently indicate that the equivalent task for a state-of-the-art classical supercomputer would take approximately 10,000 years. This dramatic increase in speed compared to all known classical algorithms is an experimental realization of quantum supremacy for this specific computational task, heralding a much-anticipated computing paradigm. |
08:30 CET | K.1.2 | LIVE Q&A Authors: John Martinis1 and Mathias Soeken2 1Google, UCSB and Quantala, US; 2Microsoft, CH Abstract Live question and answer session for interaction between the speaker and the audience. |
1.1 Innovative technologies & architectures for tomorrow’s compute platforms
Date: Tuesday, 02 February 2021
Time: 08:50 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/3sgkBGBTn54n2YA7T
Session chair:
Adrià Armejach, BSC, ES
Session co-chair:
Nehir Sonmez, BSC, ES
Organizers:
Gilles Sassatelli, LIRMM, FR
Miquel Moreto, BSC, ES
From the optimized use of heterogeneous systems for emerging workloads to leveraging the benefits of emerging technologies in computer architecture, this session demonstrates, through three papers, a range of possible novel hardware and software approaches for improving tomorrow's HPC and cloud computing infrastructure and applications, both performance- and energy-wise.
Time | Label | Presentation Title / Authors |
---|---|---|
08:50 CET | 1.1.1 | STORAGE CLASS MEMORY WITH COMPUTING ROW BUFFER: A DESIGN SPACE EXPLORATION Speaker: Valentin Egloff, CEA-List, FR Authors: Valentin Egloff1, Jean-Philippe Noel1, Maha Kooli1, Bastien Giraud1, Lorenzo Ciampolini1, Roman Gauchi1, Cesar Fuguet1, Eric Guthmuller1, Mathieu Moreau2 and Jean-Michel Portal2 1University Grenoble Alpes, CEA, List, FR; 2Aix Marseille Univ, Université de Toulon, CNRS, IM2NP, FR Abstract Today's compute-centric von Neumann architectures face strong limitations in the data-intensive context of numerous applications, such as deep learning. One of these limitations corresponds to the well-known von Neumann bottleneck. To overcome this bottleneck, the concepts of In-Memory Computing (IMC) and Near-Memory Computing (NMC) have been proposed. IMC solutions based on volatile memories, such as SRAM and DRAM, with nearly infinite endurance, only partially solve the data transfer problem from the Storage Class Memory (SCM). Computing in SCM is extremely limited by the intrinsically poor endurance of Non-Volatile Memory (NVM) technologies. In this paper, we propose to take the best of both solutions by introducing a Computing Row Buffer (C-RB), using a Computing SRAM (C-SRAM) model, in place of the standard Row Buffer (RB) in the SCM. The principle is to keep operations on large vectors in the C-RB of the SCM, reducing data movement to the CPU and thus drastically saving energy in the overall system (a toy data-movement comparison follows this session's listing). To evaluate the proposed architecture, we use an instruction-accurate platform based on the Intel Pin software, which instruments binaries at run time to obtain full memory traces of the applications. We achieve an energy reduction of 7.9x on average (up to 45x in the best case), a speedup of 3.8x on average (up to 13x in the best case), and a reduction of write accesses in the SCM of up to 18%, compared to a 512-bit SIMD architecture. |
09:10 CET | 1.1.2 | FROM A FPGA PROTOTYPING PLATFORM TO A COMPUTING PLATFORM: THE MANGO EXPERIENCE Speaker: Jose Flich, Universitat Politècnica de València, ES Authors: Josè Flich1, Rafael Tornero2, David Rodriguez3, Jose Maria Martínez2, Davide Russo4 and Carles Hernández2 1Universitat Politècnica de València, ES; 2TU Valencia, ES; 3Universitat Jaume I, ES; 4University Federico II - Naples, IT Abstract In this paper we describe the evolution of the FPGA-based cluster used in the MANGO project from a hardware prototyping platform for HPC architectures to a computing platform targeting HPC and AI applications in different European projects such as RECIPE and DeepHealth. Our main goal is to build further on the MANGO cluster by enabling its dual use for both large-scale hardware prototyping and high-performance computation. From this experience we draw several interesting conclusions about the complexities and hurdles that lie beneath FPGA technologies, shedding some light on the real obstacles that hinder the adoption of FPGAs in either large-scale pure HPC systems or hybrid systems (HPC + Big Data/AI). |
09:30 CET | 1.1.3 | HETEROGENEOUS COMPUTING SYSTEMS FOR COMPLEX SCIENTIFIC DISCOVERY WORKFLOWS Speaker: Christoph Hagleitner, IBM, CH Authors: Christoph Hagleitner1, Dionysios Diamantopoulos2, Burkhard Ringlein3, Constantinos Evangelinos4, Edward Pyzer-Knapp5, Michael Johnston6, Charles Johns4, Rong A. Chang4, Bruce D'Amora4, James Kahle4 and James Sexton4 1IBM, CH; 2IBM Research, CH; 3IBM Research - Zurich, CH; 4IBM, US; 5IBM, GB; 6IBM, IE Abstract With Moore's law progressively running out of steam, heterogeneous computing architectures have been powering the top supercomputers for several years and are now finding broader adoption. The trend towards sustainable computing also calls for domain-specific heterogeneous hardware architectures, which promise further gains in energy efficiency. At the same time, today's HPC applications have evolved from monolithic simulations in a single domain to complex workflows crossing multiple disciplines. In this paper, we explore how these trends affect system design decisions and what this means for future computing architectures. |
1.1.4 | LIVE JOINT Q&A Authors: Valentin Egloff1, Jose Flich2, Christoph Hagleitner3, Adrià Armejach4 and Nehir Sonmez4 1CEA-List, FR; 2Universitat Politècnica de València, ES; 3IBM, CH; 4BSC, ES Abstract 30 minutes of live joint question and answer time for interaction among speakers and audience. |
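The data-movement argument behind paper 1.1.1 can be pictured with a rough Python sketch that counts the bytes crossing the memory/CPU boundary for a chain of vector operations executed CPU-side versus inside a computing row buffer. The byte counts are illustrative placeholders and the model is deliberately naive; it is not the authors' Pin-based evaluation platform.

```python
# Toy byte-traffic model for a chain of vector operations on operands
# resident in storage-class memory. All sizes are illustrative placeholders.

ROW_BYTES = 4096  # assumed row-buffer-sized vector operand

def traffic_cpu_side(n_ops: int) -> int:
    # Each operation fetches two operands to the CPU and writes one result back.
    return n_ops * 3 * ROW_BYTES

def traffic_row_buffer(n_ops: int) -> int:
    # Intermediate vectors stay in the computing row buffer; only the final
    # result is (assumed to be) read out by the CPU once.
    return ROW_BYTES

for n in (1, 8, 64):
    print(f"{n:3d} ops: CPU-side {traffic_cpu_side(n):>8} B, "
          f"row buffer {traffic_row_buffer(n):>6} B")
```

Under this toy model the CPU-side traffic grows linearly with the operation count while the in-row-buffer traffic stays constant, which is the intuition behind the energy savings the paper quantifies.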
1.2 IT Sustainability (Embedded tutorials)
Date: Tuesday, 02 February 2021
Time: 08:50 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/AQHEMYJraSy8Z8CLS
Session chair:
Gilles Sassatelli, LIRMM, FR
Session co-chair:
Miquel Moreto, BSC, ES
Organizers:
Gilles Sassatelli, LIRMM, FR
Miquel Moreto, BSC, ES
This embedded tutorial session is devoted to surveying fundamental considerations of IT sustainability through two tutorials. The first tutorial gives a broad vision of the sustainability challenges and the overall impact of IT on them, covering aspects such as resources and product lifecycle. The second provides an in-depth analysis of the power consumption of contemporary and emerging workloads such as AI and discusses disruptive approaches to drastically lowering the carbon footprint of future-generation systems.
Time | Label | Presentation Title / Authors |
---|---|---|
08:50 CET | 1.2.1 | MOORE'S LAW AND ICT INNOVATION IN THE ANTHROPOCENE Speaker: David Bol, Université catholique de Louvain, BE Authors: David Bol, Thibault Pirson and Rémi Dekimpe, Université catholique de Louvain, BE Abstract In information and communication technologies (ICTs), innovation is intrinsically linked to empirical laws of exponential efficiency improvement such as Moore's law. By following these laws, the industry achieved an amazing relative decoupling of the improvement of key performance indicators (KPIs), such as the number of transistors, from physical resource usage such as silicon wafers. Concurrently, digital ICTs went from almost zero greenhouse gas (GHG) emissions in the middle of the twentieth century to a direct annual carbon footprint of approximately 1400 Mt CO2e today. Given that we have to strongly reduce global GHG emissions to limit global warming below 2°C, it is not clear whether simply following these trends can decrease the direct GHG emissions of the ICT sector on a trajectory compatible with the Paris Agreement. In this paper, we analyze the recent evolution of energy and carbon footprints from three ICT activity sub-sectors: semiconductor manufacturing, wireless Internet access and datacenter usage. By adopting a Kaya-like decomposition into technology affluence and efficiency factors (a toy decomposition follows this session's listing), we find that the KPI increase failed to reach an absolute decoupling from total energy consumption because technology affluence increases faster than efficiency. The same conclusion holds for GHG emissions, except for datacenters, where recent investment in renewable energy sources led to an absolute GHG reduction over the last years, despite a moderate energy increase. |
09:20 CET | 1.2.2 | FEW HINTS TOWARDS MORE SUSTAINABLE ARTIFICIAL INTELLIGENCE Speaker and Author: Marc Duranton, CEA, FR Abstract Artificial Intelligence (AI) is now everywhere and its domains of application grow every day. But its demand for data and computing power is also growing at an exponential rate, faster than Moore's law used to provide. The largest structures, like GPT-3, have impressive results but also trigger questions about the resources required for their learning phase, on the order of hundreds of MWh. Once the learning is done, the use of Deep Learning solutions (the "inference" phase) is far less energy-demanding, but the systems are often duplicated in large quantities (e.g. for consumer applications) and reused multiple times, so the cumulative energy consumption is also significant. It is therefore of paramount importance to improve the efficiency of AI solutions throughout their lifetime. This can only be achieved by combining efforts in several domains: on the algorithmic side, on application/algorithm/hardware co-design, on the hardware architecture and on the (silicon) technology, for example. The aim of this short tutorial is to raise awareness of the energy consumption of AI and to show different tracks for mitigating this problem, from distributed and federated learning, to optimization of Neural Networks and their data representation (e.g. using "spikes" for information coding), to architectures specialized for AI loads, including systems where memory and computation are close together and systems using emerging memories or 3D stacking. |
09:50 CET | 1.2.3 | LIVE JOINT Q&A Authors: Gilles Sassatelli1, Miquel Moreto2, David Bol3 and Marc Duranton4 1LIRMM, FR; 2BSC, ES; 3Université catholique de Louvain, BE; 4CEA, FR Abstract 30 minutes of live joint question and answer time for interaction among speakers and audience. |
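The Kaya-like decomposition mentioned in tutorial 1.2.1 factors a total footprint into an affluence term (how much activity there is) and efficiency terms (energy per unit of activity, emissions per unit of energy). A minimal Python sketch with made-up placeholder numbers, not the tutorial's data, shows why growing affluence can defeat improving efficiency:

```python
# Toy Kaya-like decomposition:
#   footprint = activity * (energy / activity) * (CO2e / energy)
# All figures below are illustrative placeholders, not measurements.

def footprint(activity: float, energy_per_activity: float,
              co2e_per_energy: float) -> float:
    return activity * energy_per_activity * co2e_per_energy

baseline = footprint(activity=1.0, energy_per_activity=1.0, co2e_per_energy=1.0)
# Efficiency doubles (energy per unit of activity halves),
# but affluence triples over the same period:
later = footprint(activity=3.0, energy_per_activity=0.5, co2e_per_energy=1.0)

print(later / baseline)  # 1.5: relative decoupling, yet no absolute reduction
```

Efficiency improved by 2x per unit of activity (a relative decoupling), yet the absolute footprint still grew by 50% because activity grew faster, which is precisely the failure of absolute decoupling the tutorial reports.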
1.3 The Road Towards Predictable Automotive High-Performance Platforms
Date: Tuesday, 02 February 2021
Time: 08:50 CET - 09:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/ME2W47nSPPcLFLXfp
Session chair:
Arne Hamann, Bosch, DE
Session co-chair:
Matteo Andreozzi, ARM, GB
Organizers:
Arne Hamann, Bosch, DE
Matteo Andreozzi, ARM, GB
Due to the trends of centralizing the E/E architecture and new computing-intensive applications, high-performance hardware platforms are currently finding their way into automotive systems. However, the Systems-on-Chip (SoCs) currently available on the market have significant weaknesses when it comes to providing predictable performance for time-critical applications. The main reason for this is that these platforms are optimized for average-case performance. This shortcoming represents one major risk in the development of current and future automotive systems. In this session we discuss how high-performance and predictability could (and should) be reconciled in future HW/SW platforms. We believe that this goal can only be reached via a close collaboration among system suppliers, IP providers, semiconductor companies, and OS/hypervisor vendors. Furthermore, academic input will be needed to solve remaining challenges and to further improve initial solutions.
Time | Label | Presentation Title / Authors |
---|---|---|
08:50 CET | 1.3.1 | SOFTWARE MECHANISMS FOR CONTROLLING QOS Speaker: Jörg Seitter, Robert Bosch GmbH, DE Authors: Falk Rehm and Jörg Seitter, Bosch, DE Abstract Available commercial off-the-shelf (COTS) platforms have little support for configuring Quality of Service (QoS) for various shared resources, for instance, the interconnect or the DRAM. To achieve predictable performance, one thus has to resort to software-based methods for controlling interference and reducing shared resource contention. Examples include memory bandwidth regulation and cache coloring, which can be implemented at the hypervisor or operating system level. The Bosch talk will give insights into currently developed VIPs, including potential pitfalls and the different software-based mechanisms being investigated for increasing performance predictability. |
09:05 CET | 1.3.2 | RESOURCE CONTENTION AVOIDANCE MECHANISMS IN HIGH-PERFORMANCE ARM-BASED SYSTEMS Speaker and Author: Jan-Peter Larsson, Arm, GB Abstract There exist a variety of software techniques that reduce shared resource contention. The drawback of such techniques is that they often require detailed knowledge of the hardware platform and its underlying IP, involve workload porting, or impose performance overheads that reduce the overall efficiency of the system. Hardware can and should do more to assist software in this task: by providing identification, monitoring and control mechanisms that help system software observe the behavior of competing workloads and apportion the shared resources among them, hardware-based resource contention avoidance mechanisms can improve on the efficiency and efficacy of purely software-based approaches. In this talk, we provide an overview of two Arm technologies: L3 cache partitioning in the DynamIQ Shared Unit (DSU), and control and monitoring features in the Armv8.4-A Memory Partitioning and Monitoring (MPAM) architecture extension. The DSU provides an L3 cache partitioning scheme under software control that can limit cache contention between competing workloads in a DynamIQ processor cluster. MPAM is an example of an architectural approach to resource contention avoidance and provides workload identification and attribution of memory traffic throughout the system, enabling software-controlled apportioning of system resources like cache capacity and memory bandwidth as well as monitoring of the performance of individual workloads. Finally, we provide examples of how these two complementary Arm technologies can work in tandem with system software to reduce shared resource contention, and we present the principles we believe will increase the determinism and predictability of real-time workloads that execute on high-performance Arm-based platforms. |
09:20 CET | 1.3.3 | ADMISSION CONTROL FOR GUARANTEEING E2E QOS IN MPSOCS Speaker and Author: Selma Saidi, TU Dortmund, DE Abstract An application in MPSoCs must generally acquire several shared (interconnect and memory) resources with independent arbiters, often provided by different vendors. One major challenge is to control the effect of interference on shared resources in an end-to-end fashion. In this talk, we discuss an alternative solution for controlling accesses to shared resources using admission control mechanisms. The goal is to decouple the data layer, where transmission is performed, from the control layer responsible for allocation and arbitration of available resources. Interference analysis can then account for the arrival of application requests at the resource-management control unit instead of the arrival of individual flits/packets at every (sub-)resource. The proposed approach simplifies system performance analysis by reducing the complexity of coupling the timing analyses of different resources, which usually leads to pessimistic formal guarantees or decreased performance and utilization. |
09:35 CET | 1.3.4 | SUPPORTING SYSTEM DESIGN WITH FORMAL PERFORMANCE ANALYSIS Speaker: Giovanni Stea, University of Pisa, IT Authors: Giovanni Stea1 and Raffaele Zippo2 1University of Pisa, IT; 2University of Florence and University of Pisa, IT Abstract From a software perspective, systems will evolve over development time, which drives the need for early predictions of system behaviour. System performance analysis is an important ingredient for successful system design. The highly dynamic behaviour due to caches, memory controllers, etc. makes system analysis at certain abstraction levels very difficult. The University of Pisa talk will discuss how the network calculus formalism can be used to predict performance in the context of vehicle integration platforms (a worked example of a basic network-calculus delay bound follows this session's listing). More precisely, it will present an approach for finding worst-case delay bounds in the access to a shared DRAM controller using the First-Ready First-Come-First-Served (FR-FCFS) arbitration policy. Furthermore, it will show that the algorithm to compute those bounds is simple (it runs in milliseconds) and adaptable to several DRAM models (all it takes is to incorporate the DRAM timing parameters and constraints). Benchmark results will show that the distance between the lower and upper bounds obtained by the approach is immaterial for practical purposes (a few percentage points at most). |
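For readers unfamiliar with the network calculus mentioned in talk 1.3.4: the textbook result states that a flow constrained by a token-bucket arrival curve alpha(t) = b + r*t, served by a rate-latency service curve beta(t) = R*max(0, t - T), suffers a worst-case delay of at most T + b/R whenever r <= R. The Python sketch below merely evaluates this classic bound with placeholder numbers; it is not the talk's FR-FCFS DRAM model.

```python
# Classic network-calculus delay bound:
#   arrival curve  alpha(t) = b + r*t          (burst b, sustained rate r)
#   service curve  beta(t)  = R*max(0, t - T)  (rate R, latency T)
# Worst-case delay D <= T + b/R, valid when r <= R.
# Parameter values below are illustrative placeholders.

def delay_bound(b: float, r: float, R: float, T: float) -> float:
    assert r <= R, "stability requires the arrival rate not to exceed the service rate"
    return T + b / R

# e.g. a 64-byte burst at 1 B/ns sustained, served at 4 B/ns after a 20 ns latency
print(delay_bound(b=64.0, r=1.0, R=4.0, T=20.0), "ns upper bound")  # 36.0
```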
1.4 HLS: from hardware optimization to security
Date: Tuesday, 02 February 2021
Time: 08:50 CET - 09:40 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/bfzhsyqpYFSn5Az25
Session chair:
Philippe Coussy, Universite de Bretagne-Sud / Lab-STICC, FR
Session co-chair:
Fabrizio Ferrandi, Politecnico di Milano, IT
Hardware optimization and security are key questions to be answered during accelerator design with high-level synthesis. This session consists of three regular papers and three IP papers that address these challenges using novel techniques. The first paper replaces memory accesses in a loop with scalar values, handling multiple write accesses in a loop body. The second paper integrates obfuscation into the back-end HLS algorithms to apply a set of key-based obfuscations on control and data paths. The third paper trains a machine learning model to represent the design space of an HLS design, driving the exploration towards the most promising directions by considering estimates from HLS, logic synthesis, and physical design, as well as both performance and resource metrics.
Time | Label | Presentation Title / Authors |
---|---|---|
08:50 CET | 1.4.1 | SCALAR REPLACEMENT IN THE PRESENCE OF MULTIPLE WRITE ACCESSES FOR ACCELERATOR DESIGN WITH HIGH-LEVEL SYNTHESIS Speaker and Author: Kenshu Seto, Tokyo City University, JP Abstract High-level synthesis (HLS) reduces the design time of domain-specific accelerators built from loop nests. Usually, naive usage of HLS leads to accelerators with insufficient performance, so very time-consuming manual optimizations of the input programs are necessary in such cases. Scalar replacement is a promising automatic memory access optimization that removes redundant memory accesses (a toy illustration of classic scalar replacement follows this session's listing). However, it cannot handle loops with multiple write accesses to the same array, which poses a severe limitation on its applicability. In this paper, we propose a new memory access optimization technique that breaks this limitation. Experimental results show that the proposed method achieves a 2.1x performance gain on average for benchmark programs that the state-of-the-art memory optimization techniques cannot optimize. |
09:05 CET | 1.4.2 | HOST: HLS OBFUSCATIONS AGAINST SMT ATTACK Speaker: Chandan Karfa, IIT Guwahati, IN Authors: Chandan Karfa1, TM Abdul Khader2, Yom Nigam2, Ramanuj Chouksey2 and Ramesh Karri3 1Indian Institute of Technology Guwahati, IN; 2IIT Guwahati, IN; 3NYU, US Abstract The fab-less IC design industry is at risk of IC counterfeiting and Intellectual Property (IP) theft by untrusted third party foundries. Logic obfuscation thwarts IP theft by locking the functions of gate-level netlists using a locking key. The complexity of circuit designs and migration to high level synthesis (HLS) expands the scope of logic locking to a higher abstraction. Automated RTL locking during HLS integrates obfuscation into the back-end HLS algorithms. This is tedious and requires implementing them in the source code of the HLS tools. Furthermore, recent work proposed an SMT attack on HLS-based obfuscation. In this work, we propose RTL locking tool HOST to thwart the SMT attack. The HOST approach is agnostic to the HLS tool. Experimental results show that the HOST obfuscations have low overhead and thwart SMT attacks. |
09:20 CET | IP1_1.1 | PARAMETRIC THROUGHPUT ORIENTED LARGE INTEGER MULTIPLIERS FOR HIGH LEVEL SYNTHESIS Speaker: Emanuele Vitali, Politecnico di Milano, IT Authors: Emanuele Vitali, Davide Gadioli, Fabrizio Ferrandi and Gianluca Palermo, Politecnico di Milano, IT Abstract The multiplication of large integers represents a significant computational effort in some cryptographic techniques. The use of dedicated hardware is an appealing solution to improve performance or efficiency. We propose a methodology to generate throughput-oriented hardware accelerators for large-integer multiplication leveraging High-Level Synthesis. The proposed micro-architectural template is composed of a combination of different multiplication algorithms. It exploits the recursive splitting of Karatsuba (a minimal software sketch of the recursion follows this session's listing), reuse strategies, and the efficiency of Comba to control the extra-functional properties of the generated multiplier. The goal is to enable the end-user to explore a wide range of possibilities, in terms of performance and resource utilization, without requiring them to know implementation and synthesis details. Experimental results show the large flexibility of the generated architectures and that the generated Pareto set of multipliers can outperform some state-of-the-art RTL designs. |
09:21 CET | IP1_1.2 | LOCKING THE RE-USABILITY OF BEHAVIORAL IPS: DISCRIMINATING THE SEARCH SPACE THROUGH PARTIAL ENCRYPTIONS Speaker: Zi Wang, University of Texas at Dallas, US Authors: Zi Wang and Benjamin Carrion Schaefer, University of Texas at Dallas, US Abstract Behavioral IPs (BIPs) have one salient advantage compared to traditional RTL IPs given in Verilog or VHDL, or even gate netlists: a BIP can be used to generate RTL implementations with very different characteristics by simply specifying different synthesis directives. These synthesis directives are typically specified in the source code in the form of pragmas (comments) and control how to synthesize arrays (e.g. registers or RAM), loops (unroll or fold) and functions (inline or not). This allows a BIP consumer to purchase a BIP once and re-use it in future projects by simply specifying a different mix of these synthesis directives. This obviously does not benefit the BIP provider, as the BIP consumer does not need to purchase the BIP again for future projects, as opposed to IPs bought at the RT or gate-netlist level. To address this, this work presents a method that enables the BIP provider to lock the search space of the BIP such that the user can only generate micro-architectures within the specified search space. This leads to a significant benefit for both parties: the BIP provider can now discriminate the BIP price based on how much of the search space is made visible to the BIP consumer, while the BIP consumer benefits from a cheaper BIP, albeit one limited in its search space. This approach is made possible through partial encryption of the BIP. Thus, this work presents a method that selectively fixes some synthesis directives and allows the BIP user to modify the rest of the directives such that the micro-architectures generated are guaranteed to lie within a given pre-defined search space. |
09:22 CET | 1.4.3 | (Best Paper Award Candidate) CORRELATED MULTI-OBJECTIVE MULTI-FIDELITY OPTIMIZATION FOR HLS DIRECTIVES DESIGN Speaker: Qi Sun, The Chinese University of Hong Kong, HK Authors: Qi Sun1, Tinghuan Chen1, Siting Liu1, Jin Miao2, Jianli Chen3, Hao Yu4 and Bei Yu1 1The Chinese University of Hong Kong, HK; 2Cadence Design Systems, US; 3Fudan University, CN; 4Southern University of Science and Technology, CN Abstract High-level synthesis (HLS) tools have gained great attention in recent years because they free engineers from complicated and heavy hardware description language coding, by using high-level languages and HLS directives. However, previous works fall short, due to time-consuming design processes, contradictions among design objectives, and the accuracy differences between the three stages (fidelities). To find good HLS directives, in this paper, a novel correlated multi-objective non-linear optimization algorithm is proposed to explore the Pareto solutions while making full use of data from different fidelities. A non-linear Gaussian process is proposed to model relationships among the analysis reports from different fidelities for the same objective. For the first time, correlated multivariate Gaussian process models are introduced into this domain to characterize the complex relationships of multiple objectives in each design fidelity. A tree-based method is proposed to prune invalid and obviously non-optimal solutions. Experimental results show that our non-linear and pioneering correlated models can approximate the Pareto frontier of the directive design space in a shorter time with much better performance and good stability, compared with the state-of-the-art. |
09:37 CET | IP1_4.1 | OPPORTUNISTIC IP BIRTHMARKING USING SIDE EFFECTS OF CODE TRANSFORMATIONS ON HIGH-LEVEL SYNTHESIS Speaker: Christian Pilato, Politecnico di Milano, IT Authors: Hannah Badier1, Christian Pilato2, Jean-Christophe Le Lann3, Philippe Coussy4 and Guy Gogniat5 1ENSTA Bretagne, FR; 2Politecnico di Milano, IT; 3ENSTA-Bretagne, FR; 4Universite de Bretagne-Sud / Lab-STICC, FR; 5Université Bretagne Sud, FR Abstract The increasing design and manufacturing costs are leading to globalize the semiconductor supply chain. However, a malicious attacker can resell a stolen Intellectual Property (IP) core, demanding methods to identify a relationship between a given IP and a potentially fraudulent copy. We propose a method to protect IP cores created with high-level synthesis (HLS): our method inserts a discrete birthmark in the HLS-generated designs that uses only intrinsic characteristics of the final RTL. The core of our process leverages the side effects of HLS due to specific source-code manipulations, although the method is HLS-tool agnostic. We propose two independent validation metrics, showing that our solution introduces minimal resource and delay overheads (<6% and <2%, respectively) and the accuracy in detecting illegal copies is above 96%. |
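Two of the techniques in this session lend themselves to tiny worked examples. First, the classic form of scalar replacement targeted by paper 1.4.1: the sketch below shows the textbook transformation on a three-point stencil, keeping values in scalars so each array element is read once; the paper's actual contribution, handling multiple writes to the same array, is deliberately not covered by this toy.

```python
# Scalar replacement on a 3-point stencil. The naive loop reads a[i-1],
# a[i] and a[i+1] every iteration; the transformed loop rotates the values
# through scalars ("registers") so each element is read from memory once.

def stencil_naive(a):
    return [a[i - 1] + a[i] + a[i + 1] for i in range(1, len(a) - 1)]

def stencil_scalar_replaced(a):
    out = []
    x0, x1 = a[0], a[1]        # carried in scalars across iterations
    for i in range(1, len(a) - 1):
        x2 = a[i + 1]          # the only array read per iteration
        out.append(x0 + x1 + x2)
        x0, x1 = x1, x2        # rotate the window
    return out

data = list(range(10))
assert stencil_naive(data) == stencil_scalar_replaced(data)
```

Second, the Karatsuba splitting exploited by IP1_1.1. The following is a plain software sketch of the recursion alone, under an assumed 16-bit base case; the paper's hardware template, its Comba base multiplier and its reuse strategies are not represented here.

```python
# Karatsuba: one n-bit multiply becomes three ~n/2-bit multiplies via
# (x1*B + x0)(y1*B + y0) = x1*y1*B^2
#                        + ((x1+x0)(y1+y0) - x1*y1 - x0*y0)*B + x0*y0,
# where B = 2**half.

def karatsuba(x: int, y: int, bits: int = 64) -> int:
    if bits <= 16:                       # assumed base case: plain multiply
        return x * y
    half = bits // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask
    y1, y0 = y >> half, y & mask
    hi = karatsuba(x1, y1, half)
    lo = karatsuba(x0, y0, half)
    mid = karatsuba(x1 + x0, y1 + y0, half + 1) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo

assert karatsuba(123456789, 987654321) == 123456789 * 987654321
```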
1.5 Adaptive and Learning Systems
Date: Tuesday, 02 February 2021
Time: 08:50 CET - 09:40 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/KnRTkpBoc8JDGhGmQ
Session chair:
Domenico Balsamo, Newcastle University, GB
Session co-chair:
Anuj Pathania, University of Amsterdam, NL
Recent advances in machine learning have pushed the boundaries of what is possible in self-adaptive and learning systems. This session explores this in two directions: the first investigates the state of art in efficient online and adaptive learning and inference for embedded systems, while the second exploits machine learning for improving the efficiency of emerging applications.
Time | Label | Presentation Title / Authors |
---|---|---|
08:50 CET | 1.5.1 | ONLINEHD: ROBUST, EFFICIENT, AND SINGLE-PASS ONLINE LEARNING USING HYPERDIMENSIONAL SYSTEM Speaker: Mohsen Imani, University of California Irvine, US Authors: Alejandro Hérnandez-Cano1, Namiko Matsumoto2, Eric Ping2 and Mohsen Imani3 1Universidad Nacional Autónoma de México, MX; 2University of California San Diego, US; 3University of California Irvine, US Abstract Hyper-Dimensional Computing (HDC) is a brain-inspired learning approach for efficient and robust learning on today's embedded devices. HDC supports single-pass learning, where it generates a classification model by looking at each training data point only once. However, the single-pass model provides weak classification accuracy due to model saturation caused by naively accumulating high-dimensional data. Although retraining the model for hundreds of iterations addresses model saturation and boosts accuracy, it comes with significant training costs. In this paper, we propose OnlineHD, an adaptive HDC training framework for accurate, efficient, and robust learning. During single-pass training, OnlineHD identifies common patterns and eliminates model saturation. For each data point, OnlineHD updates the model depending on how similar the data point is to the existing model, instead of naively accumulating it (a schematic sketch of this update rule follows this session's listing). We expand the OnlineHD framework to support highly accurate iterative training. We also exploit the holographic distribution of patterns in high-dimensional space to make OnlineHD ultra-robust against possible noise and hardware failures. Our evaluations on a wide range of classification problems show that OnlineHD adaptive training provides comparable classification accuracy to the retrained model while getting all the efficiency benefits that single-pass training provides. OnlineHD achieves, on average, 3.5× and 6.9× (3.7× and 5.8×) faster and more efficient training as compared to state-of-the-art machine learning (HDC) algorithms, while providing similar classification accuracy and 8.5× higher robustness to hardware errors. |
09:05 CET | 1.5.2 | ADAPTIVE GENERATIVE MODELING IN RESOURCE-CONSTRAINED ENVIRONMENTS Speaker: Jung-Eun Kim, Yale University, US Authors: Jung-Eun Kim1, Richard Bradford2, Max Del Giudice3 and Zhong Shao3 1Department of Computer Science, Yale University, US; 2Collins Aerospace, US; 3Yale University, US Abstract Modern generative techniques, deriving realistic data from incomplete or noisy inputs, require massive computation for rigorous results. These limitations hinder generative techniques from being incorporated in systems in resource-constrained environments, thus motivating methods that grant users control over the time-quality trade-offs for a reasonable "payoff" of execution cost. Hence, as a new paradigm for adaptively organizing and employing recurrent networks, we propose an architectural design for generative modeling achieving flexible quality. We boost the overall efficiency by introducing non-recurrent layers into stacked recurrent architectures. Accordingly, we design the architecture with no redundant recurrent cells so we avoid unnecessary overhead. |
09:20 CET | IP1_2.1 | OPERATING BEYOND FPGA TOOL LIMITATIONS: NERVOUS SYSTEMS FOR EMBEDDED RUNTIME MANAGEMENT Speaker: Martin Trefzer, University of York, GB Authors: Matthew Rowlings, Martin Albrecht Trefzer and Andy Tyrrell, University of York, GB Abstract Deep submicron fabrication issues throttle VLSI designs with pessimistic design constraints required to avoid failure of devices in the field. This imposes overly-conservative design approaches, including worst-case corners and speed-grade device binning, resulting in systems performing far below their maximum possible performance. An alternative is to monitor a device's operating state in the field and manage key parameters autonomously at runtime. In a modern SoC consisting of millions of transistors there are a huge number of potential monitoring and actuation points. This makes the autonomous management task difficult when using centralised intelligence for parameter decisions and is inherently non-scalable. An organism's decentralised control, the Nervous System, manages high degrees of scalability. Nervous Systems use a hierarchy of neural circuitry to: a) integrate sensory data, b) manage local feedback paths between sensory inputs (nerve cells) and local actuators (muscle cells), c) combine many integrated local sensory pathways together to form higher-level decisions that affect many actuators spread across the organism. This model maps well to VLSI designs: low-level sensors are formed of small sensory circuits (timing fault detectors, ring oscillators), low-level actuators map to configurable design elements (voltage islands, clock-tree delay elements) and high-level decision units manage global clock frequencies and device voltage rails which affect the whole chip. This paper motivates the adoption of a Nervous System-inspired approach. We explore the problem of device binning by presenting experimental results characterising an Artix-7 FPGA design. Our test circuit is overclocked by twice the maximum design tool frequency and run at 50 degrees Celsius above its maximum operating temperature without error. Our Configurable Intelligence Array is then introduced as a low-overhead intelligence platform, ideal for implementing nervous system signal pathways. This is used for a prototype neural circuit that closes the loop between a timing-fault detector and a programmable PLL. |
09:21 CET | IP1_2.2 | ADAPTIVE-LEARNING BASED BUILDING LOAD PREDICTION FOR MICROGRID ECONOMIC DISPATCH Speaker: Rumia Masburah, Student, IN Authors: Rumia Masburah1, Rajib Lochan Jana1, Ainuddin Khan1, Shichao Xu2, Shuyue Lan2, Soumyajit Dey1 and Qi Zhu2 1Indian Institute of Technology Kharagpur, IN; 2Northwestern University, US Abstract Given that building loads consume roughly 40% of the energy produced in developed countries, smart buildings with local renewable resources offer a viable alternative towards achieving a greener future. Building temperature control strategies typically employ detailed physical models capturing building thermal dynamics. Creating such models requires a significant amount of time, information and finesse. Even then, due to unknown building parameters and related inaccuracies, future power demands by the building loads are difficult to estimate. This creates unique challenges in the domain of microgrid economic power dispatch for satisfying building power demands through efficient control and scheduling of renewable and non-renewable local resources in conjunction with supply from the main grid. In this work, we estimate the real-time uncertainties in building loads using Gaussian Process (GP) learning and establish the effectiveness of run-time model correction in the context of microgrid economic dispatch. Our system architecture employs a Deep Reinforcement Learning (DRL) framework that adaptively triggers the GP model learning and updating phase to consistently provide accurate power demand prediction of the building load. We employ a Model Predictive Control (MPC) based microgrid power dispatch scheme enabled with our demand prediction framework and co-simulate it with the EnergyPlus building load simulator to establish the efficacy of our approach. |
09:22 CET | 1.5.3 | PERFORMANCE ANALYSIS AND AUTO-TUNING FOR SPARK IN-MEMORY ANALYTICS Speaker: Dimosthenis Masouros, National TU Athens, GR Authors: Dimitra Nikitopoulou1, Dimosthenis Masouros1, Sotirios Xydis1 and Dimitrios Soudris2 1National TU Athens, GR; 2NTUA, GR Abstract Recently, the Apache Spark in-memory computing framework has gained a lot of attention due to its increased performance on large-scale data processing. Although Spark is highly configurable, tuning it manually is time-consuming due to the high-dimensional configuration space. Prior research has produced frameworks able to analyze and model the performance of Spark applications; however, they either rely on empirical selection of important parameters and/or follow a purely application-specific modeling approach. In this paper, we propose an end-to-end performance auto-tuning framework for Spark in-memory analytics. By adopting statistical hypothesis testing techniques, we manage to extract the higher-order effects among different parameters and their significance for performance optimization. In addition, we propose a new systematic meta-model-driven approach utilizing cluster-wise, rather than application-wise, performance modeling for traversing the configuration search space. We evaluate our approach using real-scale analytics benchmarks from the HiBench suite and show that the proposed framework achieves an average performance gain of 3.07x for known and 2.01x for unknown applications, compared to the default configuration. |
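The similarity-weighted update described in paper 1.5.1 can be sketched in a few lines of numpy. This is a schematic reconstruction from the abstract alone, with assumed dimensionality, learning rate and random stand-ins for the encoded hypervectors; it is not the authors' implementation.

```python
import numpy as np

# Schematic OnlineHD-style update: weight each single-pass update by how
# novel the sample is (1 - cosine similarity to its class vector), instead
# of naively accumulating every encoded sample. Dimensions are assumptions.

D, CLASSES = 10_000, 4
rng = np.random.default_rng(0)
model = np.zeros((CLASSES, D))

def cosine(u, v):
    n = np.linalg.norm(u) * np.linalg.norm(v)
    return 0.0 if n == 0.0 else float(u @ v) / n

def online_update(h, label, lr=1.0):
    novelty = 1.0 - cosine(model[label], h)   # saturated class => tiny update
    model[label] += lr * novelty * h

# one pass over already-encoded training hypervectors (random stand-ins here)
for _ in range(200):
    label = int(rng.integers(CLASSES))
    online_update(rng.standard_normal(D), label)
```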
1.6 Soft error vulnerability analysis and mitigation, and hotspot identification
Date: Tuesday, 02 February 2021
Time: 08:50 CET - 09:40 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/DraY3EffZetwbDzh2
Session chair:
Said Hamdioui, TU Delft, NL
Session co-chair:
Luca Sterpone, Politecnico di Torino, IT
Soft errors are a growing concern. This session analyzes their impact on instruction sets and shows how flip-flops can be hardened against multiple upsets. The session then turns to improving the identification of lithography hotspots, which is essential in advanced technologies to avoid systematic defects.
Time | Label | Presentation Title / Authors |
---|---|---|
08:50 CET | 1.6.1 | GLAIVE: GRAPH LEARNING ASSISTED INSTRUCTION VULNERABILITY ESTIMATION Speaker: Jiajia Jiao, Cornell University, USA/Shanghai Maritime University, China, CN Authors: Jiajia Jiao1, Debjit Pal2, Chenhui Deng2 and Zhiru Zhang3 1Shanghai Maritime University & Cornell University, CN; 2Computer Systems Laboratory, Cornell University, US; 3Cornell University, US Abstract Due to continuous technology scaling and the lowering of operating voltages, modern computer systems are highly vulnerable to soft errors induced by high-energy particles. Soft errors can corrupt program outputs, leading to silent data corruption or a crash. To protect computer systems against such failures, architects need to precisely and quickly identify vulnerable program instructions that need to be protected. Traditional techniques for program reliability estimation either use expensive and time-consuming fault injection or inaccurate analytical models to identify the program instructions that need to be protected against soft errors. In this work, we present GLAIVE, a graph-learning-assisted model for fast, accurate, and transferable estimation of soft-error-induced instruction vulnerability. GLAIVE leverages a synergy between static analysis and data-driven statistical reasoning to automatically learn signatures of instruction-level vulnerabilities and their propagation to program outputs, using fine-grained error propagation information from bit-level program graphs of a set of realistic benchmarks. Our experiments show that the learned knowledge of instruction vulnerability is transferable to unseen programs. We further show that GLAIVE can achieve an average 221x speedup and up to 33.09% lower program vulnerability estimation error compared to a baseline fault-injection technique, up to 30.29% higher vulnerability estimation accuracy, and on average can cover up to 90.23% of vulnerable instructions for a given protection budget compared to a set of baseline machine learning algorithms. |
09:05 CET | 1.6.2 | TRIGON: A SINGLE-PHASE-CLOCKING LOW POWER HARDENED FLIP-FLOP WITH TOLERANCE TO DOUBLE-NODE-UPSET FOR HARSH ENVIRONMENTS APPLICATIONS Speaker: Yan Li, Karlsruhe Institute of Technology, DE Authors: Yan Li1, Jun Han2, Xiaoyang Zeng1 and Mehdi Tahoori3 1State Key Laboratory of ASIC and System, Fudan University, CN; 2Fudan University, CN; 3Karlsruhe Institute of Technology, DE Abstract Single Event Upset (SEU) is one of the most critical reliability issues for CMOS circuits in harsh environments, such as space, or even in a sea-level environment. Especially at advanced nanoscale nodes, the phenomenon of Multi-Node Upset (MNU) becomes more prominent. Although a lot of work has been proposed to solve this problem, most of it ignores the need for low power consumption. In particular, most existing solutions are no longer effective when operating at low supply voltage. Therefore, this paper proposes a novel flip-flop called TRIGON, based on a single-phase-clocking structure, to achieve low power consumption while being able to tolerate Double-Node Upset (DNU), even when operating at lower supply voltages. The experimental results show that TRIGON achieves a significant reduction in area and Power-Delay-Area Product (PDAP). In particular, it achieves about 80% energy saving on average when the input is static, compared with state-of-the-art circuits. |
09:20 CET | IP1_3.1 | FORSETI: AN EFFICIENT BASIC-BLOCK-LEVEL SENSITIVITY ANALYSIS FRAMEWORK TOWARDS MULTI-BIT FAULTS Speaker: Jinting Ren, Chongqing University, CN Authors: Jinting Ren, Xianzhang Chen, Duo Liu, Moming Duan, Renping Liu and Chengliang Wang, Chongqing University, CN Abstract The per-instruction sensitivity analysis framework is developed to evaluate the resiliency of a program and identify the segments of the program needing protection. However, for multi-bit hardware faults, the per-instruction sensitivity analysis frameworks can cause large overhead for redundant analyses. In this paper, we propose a basic-block-level sensitivity analysis framework, Forseti, to reduce the analysis overhead in analyzing impacts of modern microprocessors' multi-bit faults on programs. We implement Forseti in LLVM and evaluate it with five typical workloads. Extensive experimental results show that Forseti can achieve more than 90% sensitivity classification accuracy and 6.16x speedup over instruction-level analysis. |
09:21 CET | IP1_3.2 | MODELING SILICON-PHOTONIC NEURAL NETWORKS UNDER UNCERTAINTIES Speaker: Sanmitra Banerjee, Duke University, US Authors: Sanmitra Banerjee1, Mahdi Nikdast2 and Krishnendu Chakrabarty1 1Duke University, US; 2Colorado State University, US Abstract Silicon-photonic neural networks (SPNNs) offer substantial improvements in computing speed and energy efficiency compared to their digital electronic counterparts. However, the energy efficiency and accuracy of SPNNs are highly impacted by uncertainties that arise from fabrication-process and thermal variations. In this paper, we present the first comprehensive and hierarchical study on the impact of random uncertainties on the classification accuracy of a Mach-Zehnder Interferometer (MZI)-based SPNN. We show that such impact can vary based on both the location and characteristics (e.g., tuned phase angles) of a non-ideal silicon-photonic device. Simulation results show that in an SPNN with two hidden layers and 1374 tunable thermal phase shifters, random uncertainties even in mature fabrication processes can lead to a catastrophic 70% accuracy loss. |
09:22 CET | 1.6.3 | ENHANCEMENTS OF MODEL AND METHOD IN LITHOGRAPHY HOTSPOT IDENTIFICATION Speaker: Rui Zhang, HiSilicon Technologies Co., Ltd., CN Authors: Xuanyu Huang1, Rui Zhang2, Yu Huang2, Peiyao Wang2 and Mei Li2 1Center for Nano and Micro Mechanics, Tsinghua University, China, CN; 2HiSilicon Technologies Co., Ltd. Shenzhen, China, CN Abstract The manufacturing of integrated circuits has been continuously improved through the advancement of fabrication technology nodes. However, the lithographic hotspots (HSs) caused by optical diffraction problems seriously affect the yield of ICs. Although lithography simulation can accurately capture HSs by physically simulating the lithography process, it requires a lot of computing resources, typically > 100 CPU-hours/mm². Because HS identification is essentially an image-recognition task, state-of-the-art deep-learning-based identification algorithms have clear run-time advantages over traditional algorithms. However, their accuracy still needs to be enhanced, since many false alarms on non-hotspots (NHSs) and escapes of real HSs make them difficult to use as a signoff technique. In this paper, we propose two enhancements to HS identification. First, a hybrid deep learning model is proposed for lithography HS identification, which combines a CNN model with physical features. Second, an ensemble learning method based on multiple sub-models is proposed (a generic soft-voting sketch follows this session's listing). The proposed enhanced model and method achieve high HS identification accuracy on benchmarks 1-4 of the ICCAD 2012 dataset, with recall > 98.8%. In addition, they achieve 100% recall on benchmarks 1 and 3 while maintaining precision at a high level of over 93%. Moreover, for the first time, they achieve not only 100% recall on benchmark 5 but also a high precision of 61.8%, which is much higher than any published deep learning method for HS identification, to the best of our knowledge. The proposed enhanced model and methodology can be applied in industrial IC designs due to their effectiveness and efficiency. |
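The ensemble step in paper 1.6.3 can be pictured as soft voting over sub-models: average the per-model hotspot probabilities and threshold the mean. The sketch below is a generic illustration with made-up scores, not the paper's CNN ensemble or its weighting scheme.

```python
# Generic soft-voting ensemble for binary hotspot classification.
# Sub-model probabilities below are made-up placeholders.

def ensemble_predict(probabilities, threshold=0.5):
    """Average the sub-models' hotspot probabilities and threshold the mean."""
    mean = sum(probabilities) / len(probabilities)
    return mean >= threshold, mean

# three hypothetical sub-models scoring the same layout clip
is_hotspot, score = ensemble_predict([0.91, 0.62, 0.75])
print(is_hotspot, round(score, 2))  # True 0.76
```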
1.7 Novel Compilation Flows for Performance and Memory Footprint Optimization
Date: Tuesday, 02 February 2021
Time: 08:50 CET - 09:40 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/ksxqLWzPahhaFbojK
Session chair:
Nicola Bombieri, Università di Verona, IT
Session co-chair:
Andrea Bartolini, University of Bologna, IT
In the era of heterogeneous embedded systems, the diverse nature of computing elements pushes more than ever the need for compilers that improve the performance, energy efficiency and memory consumption of embedded software. This session tackles these issues with solutions that span from compilation flows based on machine learning to dataflow analysis and restructuring.
Time | Label | Presentation Title / Authors |
---|---|---|
08:50 CET | 1.7.1 | MLCOMP: A METHODOLOGY FOR MACHINE LEARNING-BASED PERFORMANCE ESTIMATION AND ADAPTIVE SELECTION OF PARETO-OPTIMAL COMPILER OPTIMIZATION SEQUENCES Speaker: Alessio Colucci, TU Wien, AT Authors: Alessio Colucci1, Dávid Juhász2, Martin Mosbeck1, Alberto Marchisio3, Semeen Rehman2, Manfred Kreutzer4, Guenther Nadbath4, Axel Jantsch2 and Muhammad Shafique5 1Vienna University of Technology (TU Wien), AT; 2TU Wien, AT; 3TU Wien (TU Wien), AT; 4ABIX GmbH, AT; 5New York University Abu Dhabi (NYUAD), AE Abstract Embedded systems have proliferated in various consumer and industrial applications with the evolution of Cyber-Physical Systems and the Internet of Things. These systems are subjected to stringent constraints so that embedded software must be optimized for multiple objectives simultaneously, namely reduced energy consumption, execution time, and code size. Compilers offer optimization phases to improve these metrics. However, proper selection and ordering of them depends on multiple factors and typically requires expert knowledge. State-of-the-art optimizers facilitate different platforms and applications case by case, and they are limited by optimizing one metric at a time, as well as requiring a time-consuming adaptation for different targets through dynamic profiling. To address these problems, we propose the novel MLComp methodology, in which optimization phases are sequenced by a Reinforcement Learning-based policy. Training of the policy is supported by Machine Learning-based analytical models for quick performance estimation, thereby drastically reducing the time spent for dynamic profiling. In our framework, different Machine Learning models are automatically tested to choose the best-fitting one. The trained Performance Estimator model is leveraged to efficiently devise Reinforcement Learning-based multi-objective policies for creating quasi-optimal phase sequences. Compared to state-of-the-art estimation models, our Performance Estimator model achieves lower relative error (<2%) with up to 50x faster training time over multiple platforms and application domains. Our Phase Selection Policy improves execution time and energy consumption of a given code by up to 12% and 6%, respectively. The Performance Estimator and the Phase Selection Policy can be trained efficiently for any target platform and application domain. |
09:05 CET | 1.7.2 | DATAFLOW RESTRUCTURING FOR ACTIVE MEMORY REDUCTION IN DEEP NEURAL NETWORKS Speaker: Antonio Cipolletta, Politecnico di Torino, IT Authors: Antonio Cipolletta and Andrea Calimera, Politecnico di Torino, IT Abstract Reducing the volume of the activation maps produced by the hidden layers of a Deep Neural Network (DNN) is a critical aspect in modern applications, as it affects on-chip memory utilization, the most limited and costly hardware resource. Despite the availability of many compression methods that leverage the statistical nature of deep learning to approximate and simplify the inference model, e.g., quantization and pruning, there is room for deterministic optimizations that instead tackle the problem from a computational view. This work belongs to this latter category, as it introduces a novel method for minimizing the active memory footprint. The proposed technique, which is data-, model-, compiler-, and hardware-agnostic, implements a function-preserving, automated graph restructuring in which the memory peaks are suppressed and distributed over time, leading to flatter profiles with less memory pressure (a toy peak-memory calculation follows this session's listing). Results collected on a representative class of convolutional DNNs with different topologies, from Vgg16 and SqueezeNetV1.1 to the recent MobileNetV2, ResNet18, and InceptionV3, provide clear evidence of applicability, showing remarkable memory savings (62.9% on average) with low computational overhead (8.6% on average). |
09:20 CET | IP1_4.2 | EFFICIENT TENSOR CORES SUPPORT IN TVM FOR LOW-LATENCY DEEP LEARNING Speaker: Wei Sun, Eindhoven University of Technology, NL Authors: Wei Sun1, Savvas Sioutas1, Sander Stuijk1, Andrew Nelson2 and Henk Corporaal3 1Eindhoven University of Technology, NL; 2TU Eindhoven, NL; 3TU/e (Eindhoven University of Technology), NL Abstract Deep learning algorithms are gaining popularity in autonomous systems. These systems typically have stringent latency constraints that are challenging to meet given the high computational demands of these algorithms. Nvidia introduced Tensor Cores (TCs) to speed up some of the most commonly used operations in deep learning algorithms. Compilers (e.g., TVM) and libraries (e.g., cuDNN) focus on the efficient usage of TCs when performing batch processing. Latency-sensitive applications, however, cannot exploit large batch processing. This paper presents an extension to the TVM compiler that generates low-latency TC implementations, particularly for batch size 1. Experimental results show that our solution reduces the latency on average by 14% compared to the cuDNN library on a desktop RTX2070 GPU, and by 49% on an embedded Jetson Xavier GPU. |
09:21 CET | 1.7.3 | REDUCING MEMORY ACCESS CONFLICTS WITH LOOP TRANSFORMATION AND DATA REUSE ON COARSE-GRAINED RECONFIGURABLE ARCHITECTURE Speaker: Yuge Chen, Department of Micro/Nano Electronics, Shanghai Jiao Tong University, CN Authors: Yuge Chen1, Zhongyuan Zhao2, Jianfei Jiang3, Guanghui He1, Zhigang Mao1 and Weiguang Sheng1 1Department of Micro-Nano Electronics, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, CN; 2Shanghai Jiao Tong University, CN; 3Shanghai Jiao Tong University, CN Abstract Coarse-Grained Reconfigurable Arrays (CGRAs) are promising accelerators thanks to their low power consumption and high energy efficiency. In recent years, many research works have focused on improving the programmability of CGRAs by enabling fast reconfiguration during execution. The performance of these CGRAs critically hinges upon the scheduling power of the compiler. One of the critical challenges is to reduce memory access conflicts using static compilation techniques. Memory access conflicts bring synchronization overhead, which causes pipeline stalls and reduces CGRA performance. Existing compilers usually tackle this challenge by orchestrating the data placement in the on-chip global memory (OGM) of the CGRA so that parallel memory accesses avoid bank conflicts. However, we find that bank conflicts are not the only cause of memory access conflicts. In some CGRAs, the bandwidth of the data network between the OGM and the processing element array (PEA) is also limited due to the low-power design principle, and unbalanced network bandwidth load is another cause of memory access conflicts. Furthermore, redundant data accesses across iterations are one of the primary causes of memory access conflicts. Based on these observations, we provide a comprehensive and generalized compilation flow to reduce memory conflicts. First, we develop a loop transformation model that maximizes the inter-iteration data reuse of loops to reduce memory access operations under the software pipelining scheme. Second, we enhance the bandwidth utilization of the network between the OGM and the PEA and avoid bank conflicts by providing a conflict-aware spatial mapping algorithm that can easily be integrated into existing CGRA modulo-scheduling compilation flows. Experimental results show our method is capable of improving performance by an average of 44% compared with a state-of-the-art CGRA compilation flow. |
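The core observation of paper 1.7.2, that a functionality-preserving reordering of a dataflow graph can flatten the activation-memory profile, can be illustrated with a toy peak-memory calculator over a schedule. Tensor names, sizes and the two schedules below are hypothetical; this is not the paper's restructuring algorithm.

```python
# Toy peak-memory profiler: each scheduled op allocates its output tensor
# and frees any input tensor that has no remaining consumers.
# Tensor sizes and schedules are made-up placeholders.

def peak_memory(schedule, sizes, consumer_count):
    live, peak = {}, 0
    remaining = dict(consumer_count)
    for op, inputs in schedule:
        live[op] = sizes[op]                  # allocate this op's output
        peak = max(peak, sum(live.values()))
        for t in inputs:                      # release inputs that are now dead
            remaining[t] -= 1
            if remaining[t] == 0:
                del live[t]
    return peak

sizes = {"a": 8, "b": 8, "c": 2, "d": 2}
consumer_count = {"a": 1, "b": 1, "c": 0, "d": 0}
# Both schedules compute c = f(a) and d = g(b); only the order differs.
eager  = [("a", []), ("b", []), ("c", ["a"]), ("d", ["b"])]
staged = [("a", []), ("c", ["a"]), ("b", []), ("d", ["b"])]
print(peak_memory(eager, sizes, consumer_count))   # 18: both 8-unit tensors live
print(peak_memory(staged, sizes, consumer_count))  # 12: peaks spread over time
```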
1.8 Industrial Design Methods and Tools: Future EDA Applications and Thermal Simulation for 3D
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 08:50 CET - 09:40 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/syEBmJaaLgL3SDPX9
Organizer:
Jürgen Haase, edacentrum GmbH, DE
This Exhibition Workshop features two talks on industrial design methods and tools. It is open to conference delegates as well as to exhibition visitors.
Time | Label | Presentation Title Authors |
---|---|---|
08:50 CET | 1.8.1 | ENABLING EARLY AND FAST THERMAL SIMULATION FOR 3D MULTI-DIE SYSTEM DESIGNS Speaker: Iyad Rayane, Zuken, FR Co-Author: Koga Kazunari, Zuken, FR Abstract As design complexity increases with 3DICs and time-to-market becomes a critical component in the automotive, wearables and IoT segments, reducing design cycle time while maintaining accuracy of analysis has become all the more important. To address this, a system-level co-design approach in step with multi-physics analysis is presented. To mitigate errors due to the manual exchange of data between the various engineering teams spread across chip, package and board, with design and analysis adding a further level of exchange, a design flow incorporating simplification at the layout level is shown. The flow enables various levels of simplified models to be used, wherein data transfer from the complex 3D structures in the layout to the thermal analysis tool is automated. The efficacy of the model simplification is verified through a test case showing comparable results for the simplified and full models. |
09:15 CET | 1.8.2 | FUTURE VISION OF ALTAIR FOR EDA APPLICATIONS Speaker: Philippe Le Marrec, Altair, FR Abstract Nowadays, the design of EDA applications is not focused solely on the hardware/software parts and requires team collaboration. In many cases, as in mechatronics, powertrain and control systems, the environment has to be used together with the design itself at different levels of abstraction. Altair provides environments that help users design these multi-physics systems and interact with dedicated solvers. Cloud solutions and data analytics can also be combined to find the best design tuning for powerful multi-physics simulations. |
ET Exhibition Theatre
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 08:50 CET - 18:20 CET
Location / Room: Exhibition Theatre
Organizer:
Jürgen Haase, edacentrum GmbH, DE
In addition to the conference programme, there will be Exhibition Workshops as part of the exhibition. These workshops will feature an Exhibition Keynote, technical presentations on the state-of-the-art in our industry, tutorials, and, as a special highlight, two sessions dedicated especially to a young audience.
The Exhibition Theatre sessions are open to conference delegates as well as to exhibition visitors.
IP1_1 Interactive Presentations
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 09:50 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/ehb3xpNP7hrKYpWKY
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Label | Presentation Title Authors |
---|---|
IP1_1.1 | PARAMETRIC THROUGHPUT ORIENTED LARGE INTEGER MULTIPLIERS FOR HIGH LEVEL SYNTHESIS Speaker: Emanuele Vitali, Politecnico di Milano, IT Authors: Emanuele Vitali, Davide Gadioli, Fabrizio Ferrandi and Gianluca Palermo, Politecnico di Milano, IT Abstract The multiplication of large integers represents a significant computational effort in some cryptographic techniques. The use of dedicated hardware is an appealing solution to improve performance or efficiency. We propose a methodology to generate throughput-oriented hardware accelerators for large-integer multiplication leveraging High-Level Synthesis. The proposed micro-architectural template is composed of a combination of different multiplication algorithms. It exploits the recursive splitting of Karatsuba, reuse strategies, and the efficiency of Comba to control the extra-functional properties of the generated multiplier (see the Karatsuba sketch after this session's listing). The goal is to enable the end-user to explore a wide range of possibilities, in terms of performance and resource utilization, without requiring them to know implementation and synthesis details. Experimental results show the large flexibility of the generated architectures and that the generated Pareto set of multipliers can outperform some state-of-the-art RTL designs. |
IP1_1.2 | LOCKING THE RE-USABILITY OF BEHAVIORAL IPS: DISCRIMINATING THE SEARCH SPACE THROUGH PARTIAL ENCRYPTIONS Speaker: Zi Wang, University of Texas at Dallas, US Authors: Zi Wang and Benjamin Carrion Schaefer, University of Texas at Dallas, US Abstract Behavioral IPs (BIPs) have one salient advantage compared to traditional RTL IPs given in Verilog or VHDL, or even gate netlists: a BIP can be used to generate RTLs with very different characteristics by simply specifying different synthesis directives. These synthesis directives are typically specified in the source code in the form of pragmas (comments) and control how to synthesize arrays (e.g., registers or RAM), loops (unroll or fold) and functions (inline or not). This allows a BIP consumer to purchase a BIP once and re-use it in future projects by simply specifying a different mix of these synthesis directives. This obviously does not benefit the BIP provider, as the BIP consumer would not need to purchase the BIP again for future projects, as opposed to IPs bought at the RT or gate-netlist level. To address this, this work presents a method that enables the BIP provider to lock the search space of the BIP such that the user can only generate micro-architectures within the specified search space. This leads to a significant benefit for both parties: the BIP provider can now discriminate the BIP price based on how much of the search space is made visible to the BIP consumer, while the BIP consumer benefits from a cheaper BIP, albeit one limited in its search space. This approach is made possible through partial encryption of the BIP. Thus, this work presents a method that selectively fixes some synthesis directives and allows the BIP user to modify the rest of the directives such that the generated micro-architectures are guaranteed to lie within a given pre-defined search space limit. |
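The Karatsuba splitting that IP1_1.1's micro-architectural template exploits is easy to state in software. The following is a minimal, hypothetical Python model of the recursion, not the paper's HLS template; in hardware, a Comba-style schoolbook multiplier would take over below a size threshold, for which plain multiplication stands in here.

```python
# Karatsuba: three recursive half-size multiplications instead of four
# (hypothetical software model; the paper generates hardware via HLS).
def karatsuba(x, y, bits):
    if bits <= 32:
        # Threshold base case: a Comba/schoolbook multiplier in hardware.
        return x * y
    half = bits // 2
    mask = (1 << half) - 1
    x_lo, x_hi = x & mask, x >> half
    y_lo, y_hi = y & mask, y >> half
    lo = karatsuba(x_lo, y_lo, half)
    hi = karatsuba(x_hi, y_hi, half)
    mid = karatsuba(x_lo + x_hi, y_lo + y_hi, half + 1) - lo - hi
    return (hi << (2 * half)) + (mid << half) + lo

assert karatsuba(12345678901234567890, 98765432109876543210, 128) == \
       12345678901234567890 * 98765432109876543210
```

The split point and base-case threshold are exactly the kind of knobs such a generator exposes to trade resources for throughput.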
IP1_2 Interactive Presentations
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 09:50 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/SzLa3CdcHoLXXd9TB
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Label | Presentation Title Authors |
---|---|
IP1_2.1 | OPERATING BEYOND FPGA TOOL LIMITATIONS: NERVOUS SYSTEMS FOR EMBEDDED RUNTIME MANAGEMENT Speaker: Martin Trefzer, University of York, GB Authors: Matthew Rowlings, Martin Albrecht Trefzer and Andy Tyrrell, University of York, GB Abstract Deep submicron fabrication issues throttle VLSI designs with pessimistic design constraints required to avoid failure of devices in the field. This imposes overly-conservative design approaches, including worst-case corners and speed-grade device binning, resulting in systems performing far below their maximum possible performance. An alternative is to monitor a device's operating state in the field and manage key parameters autonomously at runtime. In a modern SoC consisting of millions of transistors there are a huge number of potential monitoring and actuation points. This makes the autonomous management task difficult when using centralised intelligence for parameter decisions, and it is inherently non-scalable. An organism's decentralised control system, the Nervous System, achieves high degrees of scalability. Nervous Systems use a hierarchy of neural circuitry to: a) integrate sensory data, b) manage local feedback paths between sensory inputs (nerve cells) and local actuators (muscle cells), c) combine many integrated local sensory pathways together to form higher-level decisions that affect many actuators spread across the organism. This model maps well to VLSI designs: low-level sensors are formed of small sensory circuits (timing fault detectors, ring oscillators), low-level actuators map to configurable design elements (voltage islands, clock-tree delay elements) and high-level decision units manage global clock frequencies and device voltage rails which affect the whole chip. This paper motivates the adoption of a Nervous System-inspired approach. We explore the problem of device binning by presenting experimental results characterising an Artix-7 FPGA design. Our test circuit is overclocked to twice the maximum design-tool frequency and run at 50 degrees Celsius above its maximum operating temperature without error. Our Configurable Intelligence Array is then introduced as a low-overhead intelligence platform, ideal for implementing nervous system signal pathways. This is used for a prototype neural circuit that closes the loop between a timing-fault detector and a programmable PLL. |
IP1_2.2 | ADAPTIVE-LEARNING BASED BUILDING LOAD PREDICTION FOR MICROGRID ECONOMIC DISPATCH Speaker: Rumia Masburah, Student, IN Authors: Rumia Masburah1, Rajib Lochan Jana2, Ainuddin Khan2, Shichao Xu3, Shuyue Lan3, Soumyajit Dey1 and Qi Zhu3 1Indian Institute of Technology Kharagpur, IN; 2Indian Institute of Technology Kharagpur, IN; 3Northwestern University, US Abstract Given that building loads consume roughly 40% of the energy produced in developed countries, smart buildings with local renewable resources offer a viable alternative towards achieving a greener future. Building temperature control strategies typically employ detailed physical models capturing building thermal dynamics. Creating such models requires a significant amount of time, information and finesse. Even then, due to unknown building parameters and related inaccuracies, future power demands by the building loads are difficult to estimate. This creates unique challenges in the domain of microgrid economic power dispatch for satisfying building power demands through efficient control and scheduling of renewable and non-renewable local resources in conjunction with supply from the main grid. In this work, we estimate the real-time uncertainties in building loads using Gaussian Process (GP) learning and establish the effectiveness of run-time model correction in the context of microgrid economic dispatch. Our system architecture employs a Deep Reinforcement Learning (DRL) framework that adaptively triggers the GP model learning and updating phase for consistently providing accurate power demand prediction of the building load. We employ a Model Predictive Control (MPC)-based microgrid power dispatch scheme enabled with our demand prediction framework and co-simulate it with the EnergyPlus building load simulator to establish the efficacy of our approach (a GP-regression sketch follows this session's listing). |
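IP1_2.2's run-time correction rests on standard Gaussian Process regression. As a hedged illustration only, here is a pure-NumPy posterior-mean predictor on a toy one-day load trace; the RBF kernel, noise level and synthetic demand curve are assumptions of this sketch, not the paper's setup.

```python
# Minimal GP regression sketch (illustrative assumptions throughout).
import numpy as np

def rbf(A, B, length=1.0):
    d = A[:, None] - B[None, :]              # pairwise distances, 1-D inputs
    return np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x_train, y_train, x_test, noise=0.1):
    K = rbf(x_train, x_train) + noise**2 * np.eye(len(x_train))
    alpha = np.linalg.solve(K, y_train)      # (K + sigma^2 I)^{-1} y
    return rbf(x_test, x_train) @ alpha      # posterior mean at test points

hours = np.arange(24.0)                          # one day of load samples
load = 30 + 10 * np.sin(hours / 24 * 2 * np.pi)  # toy demand curve (kW)
print(gp_predict(hours, load, np.array([24.5, 25.0])))
```

The interesting part in the paper is when to retrain: a DRL agent triggers the GP update so the MPC dispatcher always sees a fresh demand model.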
IP1_3 Interactive Presentations
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 09:50 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/Q6vgHj9Zeeru2NetT
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Label | Presentation Title Authors |
---|---|
IP1_3.1 | FORSETI: AN EFFICIENT BASIC-BLOCK-LEVEL SENSITIVITY ANALYSIS FRAMEWORK TOWARDS MULTI-BIT FAULTS Speaker: Jinting Ren, Chongqing University, CN Authors: Jinting Ren, Xianzhang Chen, Duo Liu, Moming Duan, Renping Liu and Chengliang Wang, Chongqing University, CN Abstract The per-instruction sensitivity analysis framework is developed to evaluate the resiliency of a program and identify the segments of the program needing protection. However, for multi-bit hardware faults, the per-instruction sensitivity analysis frameworks can cause large overhead for redundant analyses. In this paper, we propose a basic-block-level sensitivity analysis framework, Forseti, to reduce the analysis overhead in analyzing impacts of modern microprocessors' multi-bit faults on programs. We implement Forseti in LLVM and evaluate it with five typical workloads. Extensive experimental results show that Forseti can achieve more than 90% sensitivity classification accuracy and 6.16x speedup over instruction-level analysis. |
IP1_3.2 | MODELING SILICON-PHOTONIC NEURAL NETWORKS UNDER UNCERTAINTIES Speaker: Sanmitra Banerjee, Duke University, US Authors: Sanmitra Banerjee1, Mahdi Nikdast2 and Krishnendu Chakrabarty1 1Duke University, US; 2Colorado State University, US Abstract Silicon-photonic neural networks (SPNNs) offer substantial improvements in computing speed and energy efficiency compared to their digital electronic counterparts. However, the energy efficiency and accuracy of SPNNs are highly impacted by uncertainties that arise from fabrication-process and thermal variations. In this paper, we present the first comprehensive and hierarchical study on the impact of random uncertainties on the classification accuracy of a Mach–Zehnder Interferometer (MZI)-based SPNN. We show that such impact can vary based on both the location and characteristics (e.g., tuned phase angles) of a non-ideal silicon-photonic device (see the Monte-Carlo sketch below). Simulation results show that in an SPNN with two hidden layers and 1374 tunable thermal phase shifters, random uncertainties even in mature fabrication processes can lead to a catastrophic 70% accuracy loss. |
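A hedged sketch of the style of analysis in IP1_3.2: perturb the phase settings of a single 2x2 Mach–Zehnder interferometer and measure how far its transfer matrix drifts. The decomposition used below (two 50/50 couplers around an inner phase, plus an outer phase) is one common textbook model, and the 0.05 rad noise level is an assumption; the paper's hierarchical, network-level study goes far beyond a single device.

```python
# Monte-Carlo phase-uncertainty sketch for one MZI (illustrative model).
import numpy as np

def mzi(theta, phi):
    bs = np.array([[1, 1j], [1j, 1]]) / np.sqrt(2)   # 50/50 coupler
    inner = np.diag([np.exp(1j * theta), 1.0])       # tuned phase shifter
    outer = np.diag([np.exp(1j * phi), 1.0])
    return outer @ bs @ inner @ bs

theta, phi = 0.7, 0.3
ideal = mzi(theta, phi)
rng = np.random.default_rng(0)
errs = [np.linalg.norm(mzi(theta + rng.normal(0, 0.05),
                           phi + rng.normal(0, 0.05)) - ideal)
        for _ in range(1000)]
print("mean transfer-matrix deviation:", np.mean(errs))
```

Composing thousands of such non-ideal devices is what turns small per-phase errors into the network-level accuracy collapse the paper reports.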
IP1_4 Interactive Presentations
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 09:50 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/drZX49ChBZw5F3Et8
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Label | Presentation Title Authors |
---|---|
IP1_4.1 | OPPORTUNISTIC IP BIRTHMARKING USING SIDE EFFECTS OF CODE TRANSFORMATIONS ON HIGH-LEVEL SYNTHESIS Speaker: Christian Pilato, Politecnico di Milano, IT Authors: Hannah Badier1, Christian Pilato2, Jean-Christophe Le Lann3, Philippe Coussy4 and Guy Gogniat5 1ENSTA Bretagne, FR; 2Politecnico di Milano, IT; 3ENSTA-Bretagne, FR; 4Universite de Bretagne-Sud / Lab-STICC, FR; 5Université Bretagne Sud, FR Abstract Increasing design and manufacturing costs are driving the globalization of the semiconductor supply chain. However, a malicious attacker can resell a stolen Intellectual Property (IP) core, demanding methods to identify a relationship between a given IP and a potentially fraudulent copy. We propose a method to protect IP cores created with high-level synthesis (HLS): our method inserts a discrete birthmark in the HLS-generated designs that uses only intrinsic characteristics of the final RTL. The core of our process leverages the side effects of HLS due to specific source-code manipulations, although the method is HLS-tool agnostic. We propose two independent validation metrics, showing that our solution introduces minimal resource and delay overheads (<6% and <2%, respectively) and that the accuracy in detecting illegal copies is above 96%. |
IP1_4.2 | EFFICIENT TENSOR CORES SUPPORT IN TVM FOR LOW-LATENCY DEEP LEARNING Speaker: Wei Sun, Eindhoven University of Technology, NL Authors: Wei Sun1, Savvas Sioutas1, Sander Stuijk1, Andrew Nelson2 and Henk Corporaal3 1Eindhoven University of Technology, NL; 2TU Eindhoven, NL; 3TU/e (Eindhoven University of Technology), NL Abstract Deep learning algorithms are gaining popularity in autonomous systems. These systems typically have stringent latency constraints that are challenging to meet given the high computational demands of these algorithms. Nvidia introduced Tensor Cores (TCs) to speed up some of the most commonly used operations in deep learning algorithms. Compilers (e.g., TVM) and libraries (e.g., cuDNN) focus on the efficient usage of TCs when performing batch processing. Latency-sensitive applications, however, cannot exploit large batch processing. This paper presents an extension to the TVM compiler that generates low-latency TC implementations, particularly for batch size 1 (see the tile-utilization sketch below). Experimental results show that our solution reduces the latency on average by 14% compared to the cuDNN library on a desktop RTX2070 GPU, and by 49% on an embedded Jetson Xavier GPU. |
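Why batch-1 inference underutilizes Tensor Cores, the gap IP1_4.2 closes inside TVM, comes down to tile granularity. The sketch below is back-of-the-envelope Python with 16x16x16 assumed as the tile shape (one common TC fragment size): a GEMM whose M dimension equals a batch size of 1 leaves most of each tile computing zeros.

```python
# Tile-utilization arithmetic (illustrative; the tile shape is an assumption).
def tc_utilization(m, n, k, tile=16):
    pad = lambda d: -(-d // tile) * tile   # round dimension up to tile size
    useful = m * n * k                     # multiply-accumulates we need
    issued = pad(m) * pad(n) * pad(k)      # multiply-accumulates TCs perform
    return useful / issued

print(tc_utilization(1, 1024, 1024))    # batch 1: ~6% of tile work is useful
print(tc_utilization(64, 1024, 1024))   # batch 64: tiles fully utilized
```

Recovering latency at batch 1 therefore means restructuring how work is packed into tiles rather than waiting to fill a batch.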
UB.01 University Booth
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 09:50 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/YLbHWvDXTm9zmr2eB
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.01 | JOINTER: JOINING FLEXIBLE MONITORS WITH HETEROGENEOUS ARCHITECTURES Speaker: Giacomo Valente, Università degli Studi dell'Aquila, IT Authors: Giacomo Valente1, Tiziana Fanni2, Carlo Sau3, Claudio Rubattu2, Francesca Palumbo2 and Luigi Pomante1 1Università degli Studi dell'Aquila, IT; 2Università degli Studi di Sassari, IT; 3Università degli Studi di Cagliari, IT Abstract As embedded systems grow more complex and shift toward heterogeneous architectures, understanding workload performance characteristics becomes increasingly difficult. In this regard, run-time monitoring systems can help obtain the visibility needed to characterize a system. This demo presents a framework for developing complex heterogeneous architectures, composed of programmable processors and dedicated accelerators on FPGA, together with customizable monitoring systems, while keeping the introduced overhead under control. The whole development flow (and the related prototype EDA tools) will be shown: it starts with accelerator creation from a dataflow model, proceeds in parallel with monitoring-system customization from a library of elements, and ends with the final joining of the two. Moreover, a comparison among different monitoring-system functionalities on different architectures developed on a Zynq-7000 SoC will be illustrated. |
UB.02 University Booth
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 09:50 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/XNqYPPDzpFuNzt8Fo
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.02 | TAPASCO: THE OPEN-SOURCE TASK-PARALLEL SYSTEM COMPOSER FRAMEWORK Speaker: Lukas Sommer, Embedded Systems & Applications, TU Darmstadt, DE Authors: Carsten Heinz, Lukas Sommer, Lukas Weber, Johannes Wirth and Andreas Koch, Embedded Systems & Applications, TU Darmstadt, DE Abstract Field-programmable gate arrays (FPGAs) are an established platform for highly specialized accelerators, but in a heterogeneous setup, the accelerator still needs to be integrated into the overall system. The open-source TaPaSCo (Task-Parallel System Composer) framework was created to serve this purpose: the fast integration of FPGA-based accelerators into compute platforms or systems-on-chip (SoC) and their connection to relevant components on the FPGA board. TaPaSCo can support developers in all steps of the development process: from cores resulting from High-Level Synthesis or cores written in an HDL, a complete FPGA design can be created. TaPaSCo will automatically connect all processing elements to the memory and host interfaces and generate a complete bitstream. The TaPaSCo Runtime API allows software to interface with accelerators and supports operations such as transferring data to the FPGA memory, passing values and controlling the execution of the accelerators. |
K.3 Keynote - Special day on sustainable HPC
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 10:20 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/xraBiZssFzyFz585E
Session chair:
Miquel Moreto, BSC, ES
Session co-chair:
Gilles Sassatelli, LIRMM, FR
Organizers:
Gilles Sassatelli, LIRMM, FR
Miquel Moreto, BSC, ES
High-Performance computers require continued scaling of performance and efficiency to handle more demanding applications and scales. With the end of Moore’s Law and Dennard Scaling, continued performance scaling will come primarily from specialization. Specialized hardware engines can achieve performance and efficiency from 10x to 10,000x relative to a CPU through specialization, parallelism, and optimized memory access. Graphics processing units are an ideal platform on which to build domain-specific accelerators. They provide very efficient, high-performance communication and memory subsystems, which are needed by all domains. Specialization is provided via “cores”, such as tensor cores that accelerate deep learning or ray-tracing cores that accelerate specific applications. Bio: Bill is Chief Scientist and Senior Vice President of Research at NVIDIA Corporation and a Professor (Research) and former chair of Computer Science at Stanford University. Bill is currently working on developing hardware and software to accelerate demanding applications including machine learning, bioinformatics, and logical inference. He has a history of designing innovative and efficient experimental computing systems. While at Bell Labs, Bill contributed to the BELLMAC32 microprocessor and designed the MARS hardware accelerator. At Caltech he designed the MOSSIM Simulation Engine and the Torus Routing Chip, which pioneered wormhole routing and virtual-channel flow control. At the Massachusetts Institute of Technology his group built the J-Machine and the M-Machine, experimental parallel computer systems that pioneered the separation of mechanisms from programming models and demonstrated very low-overhead synchronization and communication mechanisms. At Stanford University his group developed the Imagine processor, which introduced the concepts of stream processing and partitioned register organizations, the Merrimac supercomputer, which led to GPU computing, and the ELM low-power processor. Bill is a Member of the National Academy of Engineering, a Fellow of the IEEE, a Fellow of the ACM, and a Fellow of the American Academy of Arts and Sciences. He has received the ACM Eckert-Mauchly Award, the IEEE Seymour Cray Award, the ACM Maurice Wilkes Award, the IEEE-CS Charles Babbage Award, and the IPSJ FUNAI Achievement Award. He currently leads projects on computer architecture, network architecture, circuit design, and programming systems. He has published over 250 papers in these areas, holds over 160 issued patents, and is an author of the textbooks Digital Design: A Systems Approach, Digital Systems Engineering, and Principles and Practices of Interconnection Networks.
Time | Label | Presentation Title Authors |
---|---|---|
10:20 CET | K.3.1 | SUSTAINABLE HIGH-PERFORMANCE COMPUTING VIA DOMAIN-SPECIFIC ACCELERATORS Speaker and Author: William J. Dally, Stanford University and NVIDIA, US Abstract High-Performance computers require continued scaling of performance and efficiency to handle more demanding applications and scales. With the end of Moore’s Law and Dennard Scaling, continued performance scaling will come primarily from specialization. Specialized hardware engines can achieve performance and efficiency from 10x to 10,000x relative to a CPU through specialization, parallelism, and optimized memory access. Graphics processing units are an ideal platform on which to build domain-specific accelerators. They provide very efficient, high-performance communication and memory subsystems, which are needed by all domains. Specialization is provided via “cores”, such as tensor cores that accelerate deep learning or ray-tracing cores that accelerate specific applications. |
10:50 CET | K.3.2 | LIVE Q&A Author: William J. Dally, Stanford University and NVIDIA, US Abstract Live question and answer session for interaction between the speaker and the audience |
K.2 Opening Keynote: "Superconducting Quantum Materials and Systems (SQMS) – a new DOE National Quantum Information Science Research Center"
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 15:00 CET - 15:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/s49Rhjt8reC5GqbzE
Session chair:
Mathias Soeken, Microsoft, CH
Session co-chair:
Marco Casale Rossi, Synopsys, IT
In this talk I will describe the mission, goals and the partnership strengths of the new US National Quantum Information Research Center SQMS. SQMS brings the power of DOE laboratories, together with industry, academia and other federal entities, to achieve transformational advances in the major cross-cutting challenge of understanding and eliminating the decoherence mechanisms in superconducting 2D and 3D devices, with the final goal of enabling construction and deployment of superior quantum systems for computing and sensing. SQMS combines the strengths of an array of experts and world-class facilities towards these common goals. Materials science experts will work in understanding and mitigating the key limiting mechanisms of coherence in the quantum regime. Coherence time is the limit on how long a qubit can retain its quantum state before that state is ruined by noise. It is critical to advancing quantum computing, sensing and communication. SQMS is leading the way in extending the coherence time of superconducting quantum systems thanks to world-class materials science and world-leading expertise in superconducting RF cavities, which are integrated with industry-designed and -fabricated computer chips. Leveraging new understanding from the materials development, quantum device and quantum computing researchers will pursue device integration and quantum controls development for 2-D and 3-D superconducting architectures. One of the ambitious goals of SQMS is to build and deploy a beyond-state-of-the-art quantum computer based on superconducting technologies. Its unique high connectivity will provide an unprecedented opportunity to explore novel quantum algorithms. SQMS researchers will ultimately build quantum computer prototypes based on 2-D and 3-D architectures, enabling new quantum simulation for science applications. Bio: Anna Grassellino is the Director of the National Quantum Information Science Superconducting Quantum Materials and Systems Center, a Fermilab Senior Scientist and the head of the Fermilab SQMS division. Her research focuses on radio frequency superconductivity, in particular on understanding and improving SRF cavities performance to enable new applications spanning from particle accelerators to detectors to quantum information science. Grassellino is a fellow of the American Physical Society, and the recipient of numerous awards for her pioneering contributions to SRF technology, including the 2017 Presidential Early Career Award, the 2017 Frank Sacherer Prize of the European Physical Society, the 2016 IEEE PAST Award, the 2016 USPAS prize and a $2.5 million DOE Early Career Award. She holds a Ph.D. in physics from the University of Pennsylvania and a master's degree in electronic engineering from the University of Pisa, Italy.
Time | Label | Presentation Title Authors |
---|---|---|
15:00 CET | K.2.1 | SUPERCONDUCTING QUANTUM MATERIALS AND SYSTEMS (SQMS) – A NEW DOE NATIONAL QUANTUM INFORMATION SCIENCE RESEARCH CENTER Speaker and Author: Anna Grassellino, National Quantum Information Science Superconducting Quantum Materials and Systems Center, Fermilab, US Abstract In this talk I will describe the mission, goals and the partnership strengths of the new US National Quantum Information Research Center SQMS. SQMS brings the power of DOE laboratories, together with industry, academia and other federal entities, to achieve transformational advances in the major cross-cutting challenge of understanding and eliminating the decoherence mechanisms in superconducting 2D and 3D devices, with the final goal of enabling construction and deployment of superior quantum systems for computing and sensing. SQMS combines the strengths of an array of experts and world-class facilities towards these common goals. Materials science experts will work in understanding and mitigating the key limiting mechanisms of coherence in the quantum regime. Coherence time is the limit on how long a qubit can retain its quantum state before that state is ruined by noise. It is critical to advancing quantum computing, sensing and communication. SQMS is leading the way in extending the coherence time of superconducting quantum systems thanks to world-class materials science and world-leading expertise in superconducting RF cavities, which are integrated with industry-designed and -fabricated computer chips. Leveraging new understanding from the materials development, quantum device and quantum computing researchers will pursue device integration and quantum controls development for 2-D and 3-D superconducting architectures. One of the ambitious goals of SQMS is to build and deploy a beyond-state-of-the-art quantum computer based on superconducting technologies. Its unique high connectivity will provide an unprecedented opportunity to explore novel quantum algorithms. SQMS researchers will ultimately build quantum computer prototypes based on 2-D and 3-D architectures, enabling new quantum simulation for science applications. |
15:40 CET | K.2.2 | LIVE Q&A Authors: Anna Grassellino1 and Mathias Soeken2 1National Quantum Information Science Superconducting Quantum Materials and Systems Center, Fermilab, US; 2Microsoft, CH Abstract Live question and answer session for interaction between the speaker and the audience |
2.1 Emerging trends in the HPC industry landscape
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 16:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/y3tu535LC2J2NARwr
Session chair:
Nehir Sonmez, BSC, ES
Session co-chair:
Miquel Moreto, BSC, ES
Organizers:
Gilles Sassatelli, LIRMM, FR
Miquel Moreto, BSC, ES
The hardware HPC landscape is rapidly changing, with novel players challenging decade-long installed architectures and technologies. This session will shed light on the emerging trends through three contributions discussing the momentum around the RISC-V ecosystem and the emergence of tools enabling its use in HPC environments, the European Mont-Blanc strategy, and an academia-industry partnership that is about to give birth to a European processor architecture.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 2.1.1 | TRENDS IN HPC DRIVEN BY THE RACE TO EXASCALE Speaker and Author: Craig Prunty, SiPEARL, FR Abstract HPC in Europe (and worldwide) is pushing rapidly to exascale, where exascale is defined by the double-precision Linpack score. This push for performance, while remaining within cost and power consumption envelopes, and the need to support a wide variety of workloads are driving a diversification of processing elements into CPUs and accelerators. CPUs provide general processing capability across the bulk of workloads, with a focus in HPC toward balanced compute and memory bandwidth, demonstrated by HPCG performance. Accelerators push the envelope on vector processing, with high Linpack scores, and also offer a platform for AI. This is leading to some interesting trends in HPC, including modular system architectures, tight coupling between accelerators and general-purpose processors, and the emergence of AI to address some of the exascale challenges. |
16:20 CET | 2.1.2 | COYOTE: AN OPEN SOURCE SIMULATION TOOL TO ENABLE RISC-V IN HPC Speaker: Borja Perez, Barcelona Supercomputing Center, ES Authors: Borja Perez, Alexander Fell and John Davis, Barcelona Supercomputing Center, ES Abstract The confluence of technology trends and economics has reincarnated computer architecture and, specifically, software-hardware co-design. We are entering a new era of a completely open ecosystem, from applications to chips and everything in between. The software-hardware co-design of tomorrow's supercomputers requires flexible tools today that will take us to exascale and beyond. The MareNostrum Experimental Exascale Platform (MEEP) addresses this by proposing a flexible FPGA-based emulation platform, designed to explore hardware-software co-designs for future RISC-V supercomputers. This platform is part of an open ecosystem, allowing its infrastructure to be reused in other projects. MEEP's inaugural emulated system will be a RISC-V based self-hosted HPC vector and systolic array accelerator, with a special aim at efficient data movement. Early development stages for such an architecture require fast, scalable and easy-to-modify simulation tools, with the right granularity and fidelity, enabling rapid design space exploration. As part of MEEP, this paper introduces Coyote, a new open-source, execution-driven simulator based on the open-source RISC-V ISA, which can provide detailed results at various levels and granularities. Coyote focuses on data movement and the modelling of the memory hierarchy of the system, which is one of the main hurdles for high-performance sparse workloads, while omitting lower-level details. As a result, performance evaluation shows that Coyote achieves an aggregate simulation speed of up to 5 MIPS when modelling up to 128 cores. This enables the fast comparison of different designs for future RISC-V based HPC architectures. |
16:40 CET | 2.1.3 | MONT-BLANC 2020: TOWARDS SCALABLE AND POWER EFFICIENT EUROPEAN HPC PROCESSORS Speaker: Said Derradji, Atos, FR Authors: Adrià Armejach1, Bine Brank2, Jordi Cortina3, François Dolique4, Timothy Hayes5, Nam Ho2, Pierre-Axel Lagadec6, Romain Lemaire4, Guillem Lopez-Paradis7, Laurent Marliac6, Miquel Moreto7, Pedro Marcuello3, Dirk Pleiter2, Xubin Tan3 and Said Derradji6 1BSC & UPC, ES; 2Forschungszentrum Juelich, Juelich Supercomputing Centre, DE; 3Semidynamics Technology Services, ES; 4CEA-Leti, FR; 5Arm, GB; 6Atos, FR; 7BSC, ES Abstract The Mont-Blanc 2020 (MB2020) project has triggered the development of the next generation industrial processor for Big Data and High Performance Computing (HPC). MB2020 is paving the way to the future low-power European processor for exascale, defining the System-on-Chip (SoC) architecture and implementing new critical building blocks to be integrated in such an SoC. In this paper, we first present an overview of the MB2020 project, then we describe our experimental infrastructure, the requirements of relevant applications, and the IP blocks developed in the project. Finally, we present our emulation-based final demonstrator and explain how it integrates within our first generation of HPC processors. |
17:00 CET | 2.1.4 | LIVE JOINT Q&A Authors: Nehir Sonmez1, Miquel Moreto1, Craig Prunty2, Borja Perez1 and Said Derradji3 1BSC, ES; 2SiPEARL, FR; 3Atos, FR Abstract 30 minutes of live joint question and answer time for interaction among speakers and audience. |
2.2 3D integration: Today's practice and road ahead
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 16:00 CET - 17:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/BukgutER7n43hxbg8
Session chair:
Mahdi Nikdast, Colorado State University Fort Collins, US
Organizer:
Partha Pratim Pande, Washington State University, US
Three-dimensional (3D) integration has frequently been described as a means to overcome scaling bottlenecks, and advance both “More Moore” and “More Than Moore” through the use of vertical interconnects and die/wafer stacking. Moore’s law has entered a new phase characterized by vertical integration, also called “3D Power Scaling”. Recent industry trends show the viability of 3D integration in real products (e.g., the AMD Radeon R9 Fury X graphics card, Xilinx Virtex-7 2000T/H580T and UltraScale FPGAs). Flash memory producers have also demonstrated multiple layers of memory on top of each other, e.g., as many as 32 and 48 layers of Flash memory (Toshiba BiCS) have been reported. With the emergence of Monolithic 3D Integration (M3D), it is also important to understand the necessary design and manufacturing challenges associated with this new 3D integration paradigm. Moreover, 3D integration enables modular and heterogeneous architectures that permit scalable design and verification. This special session will articulate the challenges associated with 3D integration (chiplet, heterogeneous integration, testing, yield, manycore chip design, etc.) as well as highlight opportunities for deriving the maximum possible benefit from vertical integration. The presenters will describe the most compelling research advances, architectural breakthroughs, and design and test solutions that will contribute to closing the gap between hype and reality.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 2.2.1 | UNDERSTANDING CHIPLETS TODAY TO ANTICIPATE FUTURE INTEGRATION OPPORTUNITIES AND LIMITS Speaker: Gabriel Loh, Advanced Micro Devices, Inc., US Authors: Gabriel Loh, Samuel Naffziger and Kevin M. Lepak, Advanced Micro Devices, Inc., US Abstract Chiplet-based architectures have recently started attracting a lot of attention, and we are seeing real-world architectures utilizing chiplet technologies in high-volume commercial production in multiple mainstream markets. In this special session paper, we provide a technical overview of the current state of chiplet technology including its benefits and limitations. This provides background and grounding in the current state-of-the-art and also lays out a range of technical areas to consider for the remaining forward-looking papers in this special session. We discuss the benefits and costs of different approaches to splitting and modularizing a monolithic chip into chiplets. In particular, we cover supporting high bandwidth and low latency communication between the die, mixed integration of multiple process technology nodes, and silicon and IP reuse. We then explore future challenges for chiplet architectures looking into the next decade of innovation. |
16:15 CET | 2.2.2 | HETEROGENEOUS 3D ICS: CURRENT STATUS AND FUTURE DIRECTIONS FOR PHYSICAL DESIGN TECHNOLOGIES Speaker: Gauthaman Murali, Georgia Institute of Technology, US Authors: Gauthaman Murali and Sung Kyu Lim, Georgia Institute of Technology, US Abstract One of the advantages of 3D IC technology is its ability to integrate different devices such as CMOS, SRAM, and RRAM, or multiple technology nodes of single or different devices onto a single chip due to the presence of multiple tiers. This ability to create heterogeneous 3D ICs finds a wide range of applications, from improving processor performance by integrating better memory technologies to building compute-in-memory ICs to support advanced machine learning algorithms. This paper discusses the current trends and future directions for the physical design of heterogeneous 3D ICs. We summarize various physical design and optimization flows, integration techniques, and existing academic works on heterogeneous 3D ICs. |
16:30 CET | 2.2.3 | ADVANCES IN TESTING AND DESIGN-FOR-TEST SOLUTIONS FOR M3D INTEGRATED CIRCUITS Speaker: Krishnendu Chakrabarty, Duke University, US Authors: Sanmitra Banerjee1, Arjun Chaudhuri2, Krishnendu Chakrabarty1 and Shao-Chun Hung1 1Duke University, US; 2GLOBALFOUNDRIES US Inc., US Abstract Monolithic 3D (M3D) integration has the potential to achieve significantly higher device density compared to TSV-based 3D stacking. Sequential integration of transistor layers enables high-density vertical interconnects, known as inter-layer vias (ILVs). However, high integration density and aggressive scaling of the inter-layer dielectric make M3D integrated circuits especially prone to process variations and manufacturing defects. We explore the impact of these fabrication imperfections on chip-performance and present the associated test challenges. We introduce two M3D-specific design-for-test solutions – a low-cost built-in self-test architecture for the defect-prone ILVs and a tier-level fault localization method for yield learning. We describe the impact of defects on the efficiency of delay fault testing and highlight solutions for test generation under constraints imposed by the 3D power distribution network. |
16:45 CET | 2.2.4 | 3D++: UNLOCKING THE NEXT GENERATION OF HIGH-PERFORMANCE AND ENERGY-EFFICIENT ARCHITECTURES USING M3D INTEGRATION Speaker: Partha Pande, Washington State University, US Authors: Biresh Kumar Joardar, Aqeeb Iqbal Arka, Jana Doppa and Partha Pratim Pande, Washington State University, US Abstract Three-dimensional (3D) integration has frequently been described as a means to overcome scaling bottlenecks, and advance both “More Moore” and “More Than Moore” through the use of vertical interconnects and die/wafer stacking. Recent industry trends show the viability of 3D integration in real products. Flash memory producers have also demonstrated multiple layers of memory on top of each other. However, conventional TSV-based 3D designs cannot achieve the full potential of vertical integration and perform sub-optimally. Monolithic 3D (M3D) is an emerging vertical integration technology that promises significant power-performance-area benefits compared to TSVs. Hence, it is important to understand the necessary design trade-offs and challenges associated with this new paradigm. In this paper, we present both the advantages and the various design challenges in M3D-enabled system design, considering Processing-in-Memory (PIM) and manycore systems as suitable case studies. |
2.3 A Deep Dive into Future of Lightweight Cryptography: New Standards, Optimizations, and Attacks
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 16:00 CET - 17:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/EaAFPx7Zv8vGywHqe
Session chair:
Shivam Bhasin, Temasek labs, Nanyang Technological University, SG
Session co-chair:
Kris Gaj, George Mason University, US
Organizer:
Shivam Bhasin, Temasek labs, Nanyang Technological University, SG
The Internet of Things (IoT) has grown rapidly to 50 billion devices in 2020. IoT devices are also deployed in sensitive applications like healthcare, supply-chain management, smart factories, etc., raising security concerns. Even when deployed in non-sensitive applications, the huge population of these devices enables attacks, such as Mirai, that seriously hamper internet traffic. While standard cryptography, like AES, can run on resource-constrained devices, these algorithms were designed with a focus on desktop/server environments. In recent years, the U.S. National Institute of Standards and Technology (NIST) started a public effort to standardize lightweight cryptography (LWC). In particular, the on-going standardization process, similar to past cryptographic competitions such as AES and SHA-3, is evaluating candidates for Authenticated Encryption with Associated Data (AEAD). The candidates are evaluated in terms of security, performance and flexibility. Of special importance to lightweight applications are energy efficiency and affinity to side-channel and fault-attack countermeasures. This session will present recent research results covering the security and performance of AEAD candidates submitted to the NIST LWC Standardization Process, with a focus on (a) efficient implementations and (b) security analysis.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 2.3.1 | HARDWARE BENCHMARKING OF ROUND 2 CANDIDATES IN THE NIST LIGHTWEIGHT CRYPTOGRAPHY STANDARDIZATION PROCESS Speaker: Kris Gaj, George Mason University, US Authors: Kamyar Mohajerani, Richard Haeussler, Rishub Nagpal, Farnoud Farahmand, Abubakr Abdulgadir, Jens-Peter Kaps and Kris Gaj, George Mason University, US Abstract Twenty-five Round 2 candidates in the NIST Lightweight Cryptography (LWC) process have been implemented in hardware by groups from all over the world. All implementations compliant with the LWC Hardware API, proposed in 2019, have been submitted for hardware benchmarking to George Mason University’s LWC benchmarking team. The received submissions were first verified for correct functionality and compliance with the hardware API’s specification. Then, the execution times in clock cycles, as a function of input sizes, were determined using behavioral simulation. The compatibility of all implementations with FPGA toolsets from three major vendors, Xilinx, Intel, and Lattice Semiconductor, was verified. Optimized values of the maximum clock frequency and resource utilization metrics, such as the number of look-up tables (LUTs) and flip-flops (FFs), were obtained by running optimization tools, such as Minerva, ATHENa, and Xeda. The raw post-place-and-route results were then converted into values of the corresponding throughputs for long, medium-size, and short inputs. The results were presented in the form of easy-to-interpret graphs and tables, demonstrating the relative performance of all investigated algorithms. An effort was made to make the entire process as transparent as possible and the results easily reproducible by other groups. |
16:15 CET | 2.3.2 | A DEEPER LOOK AT ENERGY CONSUMPTION OF LIGHTWEIGHT BLOCK CIPHERS Speaker: Francesco Regazzoni, University of Amsterdam and ALaRI - USI, CH Authors: Andrea Caforio1, Fatih Balli1, Subhadeep Banik1 and Francesco Regazzoni2 1EPFL, CH; 2University of Amsterdam and ALaRI - USI, CH Abstract In the last few years, the field of lightweight cryptography has seen an influx in the number of block ciphers and hash functions being proposed. In the past there have been numerous papers that have looked at circuit-level implementations of block ciphers with respect to lightweight metrics like area, power and energy. In the paper by Banik et al. (SAC'15), for example, by studying the energy consumption model of a CMOS gate, it was shown that the energy consumed per cycle during the encryption operation of an r-round unrolled architecture of any block cipher is a quadratic function in r (a reconstruction of this energy model follows this session's listing). However, most of these explorative works were at a gate level, in which a circuit synthesizer would construct a circuit using gates from a standard cell library, and the area, power and energy would be estimated from the switching statistics of the nodes in the circuit. Since only a part of the EDA design flow was done, this did not account for issues that might arise when the circuit is finally mapped into silicon post-route. Metrics like area, power and energy would need to be re-estimated due to the effect of the parasitics introduced in the circuit by the connecting wires, nodes and interconnects. In this paper, we look to plug this very gap in the literature by re-examining the designs of lightweight block ciphers with respect to their performance after completing the placement and routing process. This is a timely exercise, since three of the block ciphers we analyze in the paper are used in around 13 of the 32 candidates in the second round of the NIST lightweight competition currently being conducted. |
16:30 CET | 2.3.3 | MACHINE LEARNING ASSISTED DIFFERENTIAL DISTINGUISHERS FOR LIGHTWEIGHT CIPHERS Speaker: Anubhab Baksi, Nanyang Technological University, SG Authors: Anubhab Baksi1, Jakub Breier2, Yi Chen3 and Xiaoyang Dong3 1Nanyang Technological University, Singapore, SG; 2Silicon Austria Labs, AT; 3Tsinghua University, CN Abstract At CRYPTO 2019, Gohr introduced deep-learning-based cryptanalysis of round-reduced SPECK. Using a deep residual network, Gohr trained several neural-network-based distinguishers on 8-round SPECK-32/64. The analysis follows an 'all-in-one' differential cryptanalysis approach, which considers the effect of all output differences under the same input difference. Usually, all-in-one differential cryptanalysis is more effective than using only a single differential trail. However, when the cipher is non-Markov or its block size is large, it is usually very hard to fully compute. Inspired by Gohr's work, we try to simulate the all-in-one differentials for non-Markov ciphers through machine learning. Our idea here is to reduce a distinguishing problem to a classification problem, so that it can be efficiently managed by machine learning (a toy sketch of this reduction also follows this session's listing). As a proof of concept, we show several distinguishers for four high-profile ciphers, each of which works with trivial complexity. In particular, we show differential distinguishers for 8-round Gimli-Hash, Gimli-Cipher and Gimli-Permutation; 3-round Ascon-Permutation; 10-round Knot-256 permutation and 12-round Knot-512 permutation; and 4-round Chaskey-Permutation. Finally, we explore the choice of an efficient machine learning model and observe that a neural network with only three layers suffices. Our analysis shows the attacker is able to reduce the complexity of finding distinguishers by using machine learning techniques. |
16:45 CET | 2.3.4 | DNFA: DIFFERENTIAL NO-FAULT ANALYSIS OF BIT PERMUTATION BASED CIPHERS ASSISTED BY SIDE-CHANNEL Speaker: Shivam Bhasin, Temasek Laboratories @ NTU, SG Authors: Xiaolu Hou1, Jakub Breier2 and Shivam Bhasin3 1Nanyang Technological University, SG; 2Silicon Austria Labs, AT; 3Temasek Laboratories, Nanyang Technological University, SG Abstract Physical security of NIST lightweight cryptography competition candidates is gaining importance as the standardization process progresses. Side-channel attacks (SCA) are a well-researched topic within the physical security of cryptographic implementations. It was shown that collisions in the intermediate values can be captured by side-channel measurements to reduce the complexity of the key retrieval to trivial numbers. In this paper, we target a specific bit permutation vulnerability in the block cipher GIFT that allows the attacker to mount a key recovery attack. We present a novel SCA methodology called DCSCA - Differential Ciphertext SCA, which follows principles of differential fault analysis, but instead of the usage of faults, it utilizes SCA and statistical distribution of intermediate values. We simulate the attack on a publicly available bitslice implementation of GIFT, showing the practicality of the attack. We further show the application of the attack on GIFT-based AEAD schemes (GIFT-COFB, ESTATE, HYENA, and SUNDAE-GIFT) proposed for the NIST LWC competition. DCSCA can recover the master key with 2^13.39 AEAD sessions, assuming 32 encryptions per session. |
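The quadratic per-cycle energy model that 2.3.2 revisits (due to Banik et al., SAC'15) implies an optimal round-unrolling degree. The block below is a hedged reconstruction of that argument; the coefficients a, b, c are illustrative placeholders, and the paper's actual contribution is re-measuring such models after place-and-route.

```latex
% Per-cycle energy of an r-round unrolled datapath (Banik et al., SAC'15):
%   E_cyc(r) = a r^2 + b r + c,  with cipher- and library-dependent a, b, c.
% A full R-round encryption takes roughly R/r cycles, so per encryption:
\[
  E_{\mathrm{enc}}(r) \approx \left(a r^2 + b r + c\right)\frac{R}{r}
                      = R\left(a r + b + \frac{c}{r}\right)
\]
% Setting the derivative to zero gives the energy-optimal unrolling degree:
\[
  \frac{\mathrm{d}E_{\mathrm{enc}}}{\mathrm{d}r}
    = R\left(a - \frac{c}{r^{2}}\right) = 0
  \quad\Longrightarrow\quad
  r^{*} = \sqrt{c/a}
\]
```

Unrolling pays off until the quadratic growth of per-cycle energy overtakes the fixed per-cycle overhead c, which is why small unrolling factors often win.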
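2.3.3 reduces distinguishing to classification. The hedged sketch below shows the shape of that pipeline on a deliberately weak 16-bit toy permutation, a stand-in for the real targets (Gimli, Ascon, Knot, Chaskey): label output differences of a fixed input difference against differences of unrelated random pairs, then train a small classifier.

```python
# Toy ML distinguisher (the permutation and all parameters are placeholders).
import numpy as np
from sklearn.neural_network import MLPClassifier

def toy_perm(x):
    # Deliberately weak 16-bit keyed-permutation stand-in.
    x = (x * 0x2545 + 0x1337) & 0xFFFF
    return ((x << 3) | (x >> 13)) & 0xFFFF        # rotate left by 3

IN_DIFF = 0x0040                                   # chosen input difference
rng = np.random.default_rng(1)

def sample(n, related):
    p0 = rng.integers(0, 1 << 16, n)
    p1 = p0 ^ IN_DIFF if related else rng.integers(0, 1 << 16, n)
    diff = toy_perm(p0) ^ toy_perm(p1)             # output difference
    return ((diff[:, None] >> np.arange(16)) & 1).astype(float)

X = np.vstack([sample(4000, True), sample(4000, False)])
y = np.array([1] * 4000 + [0] * 4000)
clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=300).fit(X, y)
# Accuracy measurably above 0.5 on fresh samples means a distinguisher.
X_test = np.vstack([sample(1000, True), sample(1000, False)])
y_test = np.array([1] * 1000 + [0] * 1000)
print("distinguishing accuracy:", clf.score(X_test, y_test))
```

The paper's point is that this reduction sidesteps computing the full all-in-one differential distribution, which is infeasible for non-Markov or large-block ciphers.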
2.4 Quantum Computing
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/tP8zdkrzxBGsH7CE2
Session chair:
Matthew Amy, Dalhousie University, CA
Session co-chair:
Carmen Almudever, TU Delft, NL
The session looks at techniques for the design, simulation, and mapping of quantum circuits to hardware. The first paper investigates the use of approximation to speed up and reduce memory usage in decision-diagram-based simulation of quantum circuits. Continuing with the theme of decision diagrams, the next paper investigates complementary techniques for simulating quantum circuits on noisy hardware. Finally, the session concludes with a presentation of a novel hardware mapping algorithm for reversible circuits.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 2.4.1 | (Best Paper Award Candidate) AS ACCURATE AS NEEDED, AS EFFICIENT AS POSSIBLE: APPROXIMATIONS IN DD-BASED QUANTUM CIRCUIT SIMULATION Speaker: Stefan Hillmich, Johannes Kepler University Linz, AT Authors: Stefan Hillmich1, Richard Kueng1, Igor L. Markov2 and Robert Wille1 1Johannes Kepler University Linz, AT; 2University of Michigan, US Abstract Quantum computers promise to solve important problems faster than conventional computers. However, unleashing this power has been challenging. In particular, design automation runs into (1) the probabilistic nature of quantum computation and (2) exponential requirements for computational resources on non-quantum hardware. In quantum circuit simulation, Decision Diagrams (DDs) have previously been shown to reduce the required memory in many important cases by exploiting redundancies in the quantum state. In this paper, we show that this reduction can be amplified by exploiting the probabilistic nature of quantum computers to achieve even more compact representations. Specifically, we propose two new DD-based simulation strategies that approximate the quantum states to attain more compact representations, while, at the same time, allowing the user to control the resulting degradation in accuracy (see the amplitude-truncation sketch after this session's listing). We also analytically prove the effect of multiple approximations on the attained accuracy and empirically show that the resulting simulation scheme enables speed-ups of up to several orders of magnitude. |
16:15 CET | 2.4.2 | STOCHASTIC QUANTUM CIRCUIT SIMULATION USING DECISION DIAGRAMS Speaker: Thomas Grurl, University of Applied Sciences Upper Austria, AT Authors: Thomas Grurl1, Richard Kueng2, Jürgen Fuß1 and Robert Wille2 1University of Applied Sciences Upper Austria, AT; 2Johannes Kepler University Linz, AT Abstract Recent years have seen unprecedented advances in the design and control of quantum computers. Nonetheless, their applicability is still restricted and access remains expensive. Therefore, a substantial amount of quantum algorithms research still relies on simulating quantum circuits on classical hardware. However, due to the sheer complexity of simulating real quantum computers, many simulators unrealistically simplify the problem and instead simulate perfect quantum hardware, i.e., they do not consider errors caused by the fragile nature of quantum systems. Stochastic quantum simulation provides a conceptually suitable solution to this problem: physically motivated errors are applied in a probabilistic fashion throughout the simulation. In this work, we propose to use decision diagrams, as well as concurrent executions, to substantially reduce the still-daunting resource requirements of stochastic quantum circuit simulation. Backed up by rigorous theory, empirical studies show that this approach allows for a substantially faster and much more scalable simulation for certain quantum circuits. |
16:30 CET | 2.4.3 | COMBINING SWAPS AND REMOTE TOFFOLI GATES IN THE MAPPING TO IBM QX ARCHITECTURES Speaker: Philipp Niemann, University of Bremen / DFKI GmbH, DE Authors: Philipp Niemann1, Chandan Bandyopadhyay2 and Rolf Drechsler3 1Cyber-Physical Systems, DFKI GmbH, DE; 2University of Bremen, DE; 3University of Bremen/DFKI, DE Abstract Quantum computation has received steadily growing attention in recent years, especially supported by the emergence of publicly available quantum computers like the popular IBM QX series. In order to execute a reversible or quantum circuit on those devices, a mapping is required that replaces each reversible or quantum gate by an equivalent cascade of elementary, i.e. directly executable, gates, a task which tends to induce a significant mapping overhead. Several approaches have been proposed for this task, which either rely on the swapping of physically adjacent qubits or the use of precomputed templates, so-called remote CNOT gates. In this paper, we show that combining both, swapping and remote gates, at the reversible circuit level has the prospect of significantly reducing the mapping overhead. We propose a methodology to compute the optimal combination of swaps and templates for Multiple-Controlled Toffoli gates. By using a formulation as a single-source shortest-path problem, a complete database of optimal combinations can be computed efficiently. Experimental results indicate that the mapping overhead can be significantly reduced. |
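The controlled-approximation idea of 2.4.1 can be previewed without decision diagrams at all. The sketch below transplants it onto a dense state vector, an assumption made purely for illustration (in DDs the saving comes from merging redundant nodes, not from zeroing entries): drop the smallest amplitudes within a user-chosen fidelity budget, then renormalize.

```python
# Amplitude-truncation sketch on a dense state vector (illustrative only).
import numpy as np

def approximate(state, budget):
    # Zero the smallest amplitudes while the total squared magnitude
    # removed stays below 'budget'; the fidelity loss is then <= budget.
    order = np.argsort(np.abs(state))       # smallest amplitudes first
    out, dropped = state.copy(), 0.0
    for i in order:
        if dropped + abs(state[i]) ** 2 > budget:
            break
        dropped += abs(state[i]) ** 2
        out[i] = 0.0
    return out / np.linalg.norm(out)        # renormalize the kept amplitudes

rng = np.random.default_rng(2)
psi = rng.normal(size=32) + 1j * rng.normal(size=32)
psi /= np.linalg.norm(psi)
phi = approximate(psi, budget=0.02)
print("fidelity:", abs(np.vdot(psi, phi)) ** 2,
      "nonzeros:", np.count_nonzero(phi))
```

Zeroed amplitudes create redundancy that a DD can fold into shared nodes, which is where the memory savings reported in the paper come from.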
2.5 Platform validation with simulation
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/2whM9Sdi8KmkbdrWt
Session chair:
Katell Morin-Allory, University Grenoble Alpes, FR
Session co-chair:
Sat Chaterjee, Google, US
The session deals with the utilization of simulation platforms for verification and validation across a variety of targets. The first paper describes a system for coverage-directed generation based on numerical optimization. The second paper presents a method for validating memory management units post-silicon. The third paper offers a fast and accurate method for simulating compute-in-memory extensions to processors. In addition, the session includes two interactive presentations, one on the efficient use of assertions at the edge and in the cloud, and the second on integrating IPs in concolic testing.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 2.5.1 | AUTOMATIC SCALABLE SYSTEM FOR THE COVERAGE DIRECTED GENERATION (CDG) PROBLEM Speaker: Avi Ziv, IBM Research - Haifa, IL Authors: Raviv Gal1, Eldad Haber2, Wesam Ibraheem3, Brian Irwin2, Ziv Nevo3 and Avi Ziv3 1IBM Research, Haifa, IL; 2University of British Columbia, CA; 3IBM Research - Haifa, IL Abstract We present AS-CDG, a novel automatic scalable system for data-driven coverage directed generation. The goal of AS-CDG is to find the test-templates that maximize the probability of hitting uncovered events. The system contains two components, one for a coarse-grained search that finds relevant parameters and the other for a fine-grained search for the settings of these parameters. To overcome the lack of evidence in the search, we replace the real target with an approximated target induced by neighboring events, for which we have evidence. Usage results on real-life units of high-end processors illustrate the ability of the proposed system to automatically find the desired test-templates and hit the previously uncovered target events. |
16:15 CET | 2.5.2 | POST SILICON VALIDATION OF THE MMU Speaker: Hillel Mendelson, IBM Research, IL Authors: Tom Kolan1, Hillel Mendelson2, Vitali Sokhin1, Shai Doron1, Hernan Theiler1, Shay Aviv1, Hagai Hadad2, Natalia Freidman2, Elena Tsanko3, John Ludden3 and Bryant Cockcroft3 1IBM Research - Haifa, IL; 2IBM, IL; 3IBM, US Abstract Post-silicon validation is a unique challenge in the design verification process. On one hand, it utilizes real silicon and is therefore able to cover a larger state-space. On the other, it suffers from debugging challenges due to a lack of observability into the design. These challenges dictate distinctive design choices, such as the simplicity of validation tools and a built-for-debugging software design methodology. The Memory Management Unit (MMU) is central to any design that uses virtual memory, and creates complex verification challenges, especially in many-core designs. We propose a novel method for post-silicon validation of the MMU that brings together previously undescribed techniques, based on several papers and patents. This method was implemented in Threadmill, a bare-metal exerciser, and was used in the verification of high-end industry-level POWER and ARM SoCs. It succeeded in increasing RTL coverage, hitting several hidden bugs, and saving hundreds of work-hours in the validation process. |
16:30 CET | IP2_1.1 | AN EFFECTIVE METHODOLOGY FOR INTEGRATING CONCOLIC TESTING WITH SYSTEMC-BASED VIRTUAL PROTOTYPES Speaker: Sören Tempel, University of Bremen, DE Authors: Sören Tempel1, Vladimir Herdt2 and Rolf Drechsler3 1University of Bremen, DE; 2DFKI, DE; 3University of Bremen/DFKI, DE Abstract In this paper we propose an effective methodology for integrating Concolic Testing (CT) with SystemC-based Virtual Prototypes (VPs) for verification of embedded SW binaries. Our methodology involves three steps: 1) integrating CT support with the Instruction Set Simulator (ISS) of the VP, 2) utilizing the standard TLM-2.0 extension mechanism for transporting concolic values alongside generic TLM transactions, and 3) providing lightweight concolic overlays for SystemC-based peripherals that enable non-intrusive CT support for peripherals and thus significantly reduce the CT integration effort. Our RISC-V experiments using the RIOT operating system demonstrate the effectiveness of our approach. |
16:31 CET | IP2_1.2 | A CONTAINERIZED ROS-COMPLIANT VERIFICATION ENVIRONMENT FOR ROBOTIC SYSTEMS Speaker: Samuele Germiniani, University of Verona, IT Authors: Stefano Aldegheri, Nicola Bombieri, Samuele Germiniani, Federico Moschin and Graziano Pravadelli, University of Verona, IT Abstract This paper proposes an architecture and a related automatic flow to generate, orchestrate and deploy a ROS-compliant verification environment for robotic systems. The architecture enables assertion-based verification by exploiting monitors automatically synthesized from LTL assertions. The monitors are encapsulated in plug-and-play ROS nodes that do not require any modification to the system under verification (SUV). To guarantee both verification accuracy and real-time constraints of the system in a resource-constrained environment even after the monitor integration, we define a novel approach to move the monitor evaluation across the different layers of an edge-to-cloud computing platform. The verification environment is containerized for both cloud and edge computing using Docker to enable system portability and to handle, at run-time, the resources allocated for verification. The effectiveness and efficiency of the proposed architecture have been evaluated on a complex distributed system implementing a mobile robot path planner based on 3D simultaneous localization and mapping. |
16:32 CET | 2.5.3 | SIM²PIM: A FAST METHOD FOR SIMULATING HOST INDEPENDENT & PIM AGNOSTIC DESIGNS Speaker: Luigi Carro, UFRGS, BR Authors: Paulo Cesar Santos1, Bruno Endres Forlin2 and Luigi Carro2 1UFRGS - Universidade Federal do Rio Grande do Sul, BR; 2UFRGS, BR Abstract Processing-in-Memory (PIM), with the help of modern memory integration technologies, has emerged as a practical approach to mitigate the memory wall and improve performance and energy efficiency in contemporary applications. However, there is a need for tools capable of quickly simulating different PIM designs and their suitable integration with different hosts. This work presents Sim2PIM, a Simple Simulator for PIM devices that seamlessly integrates any PIM architecture with the host processor and memory hierarchy. Sim2PIM’s simulation environment allows the user to describe a PIM architecture at different user-defined abstraction levels. The application code runs natively on the Host, with minimal overhead from the simulator integration, allowing Sim2PIM to collect precise metrics from the Hardware Performance Counters (HPCs). Our simulator is available to download at http://pim.computer/. |
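Paper 2.5.1 splits the search into a coarse-grained stage that finds which test-template parameters are relevant and a fine-grained stage that tunes their settings. The toy sketch below mimics only that two-stage structure; the hits_event stand-in, the parameter names, the relevance threshold, and the random fine search are all invented for illustration and are not the AS-CDG algorithms.

```python
import random

random.seed(0)
NAMES = ["p0", "p1", "p2", "p3", "p4"]

def hits_event(params):
    """Stand-in for running a test-template: in this toy model only
    p1 and p3 actually influence the uncovered event."""
    return params["p1"] > 0.8 and params["p3"] < 0.2

def coarse_search(trials=4000):
    """Screen which parameters correlate with hitting the event."""
    hits = []
    for _ in range(trials):
        params = {name: random.random() for name in NAMES}
        if hits_event(params):
            hits.append(params)
    relevant = []
    for name in NAMES:
        mean_hit = sum(p[name] for p in hits) / max(len(hits), 1)
        if abs(mean_hit - 0.5) > 0.15:    # crude relevance test
            relevant.append(name)
    return relevant

def fine_search(relevant, tries=500):
    """Tune only the relevant parameters, pinning the rest."""
    for _ in range(tries):
        params = {name: 0.5 for name in NAMES}
        for name in relevant:
            params[name] = random.random()
        if hits_event(params):
            return params
    return None

relevant = coarse_search()
print("relevant parameters:", relevant)
print("hitting template:", fine_search(relevant))
```

The payoff is the same as in the paper: the coarse stage shrinks the search space so the fine stage spends its budget only on parameters that matter.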
2.6 Hardware architectures for neural network applications with emerging technologies
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/H5ygzuagE255kK24q
Session chair:
Gage Hills, MIT, US
Session co-chair:
Bruno Schmitt, EPFL, CH
Neural networks have shown record-breaking performance on various artificial intelligence tasks and have attracted attention across multiple areas of the computing stack. This session highlights exciting advances in design methodologies leveraging emerging technologies, from optical neurons based on multi-operand ring resonators, to ReRAM-based crossbar computing-in-memory chips, to new architectures that minimize crossbar size via a mapping framework built on binary decision diagrams and nanoscale memristor crossbar arrays. (A small bipartite-coloring sketch of the BDD-to-crossbar constraint follows this session's listing.)
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 2.6.1 | (Best Paper Award Candidate) COMPACT: FLOW-BASED COMPUTING ON NANOSCALE CROSSBARS WITH MINIMAL SEMIPERIMETER Speaker: Sven Thijssen, University of Central Florida, US Authors: Sven Thijssen1, Sumit Kumar Jha2 and Rickard Ewetz1 1University of Central Florida, US; 2University of Texas at San Antonio, US Abstract In-memory computing is a promising solution strategy for data-intensive applications to circumvent the von Neumann bottleneck. Flow-based computing is the concept of performing in-memory computing using sneak paths in nanoscale crossbar arrays. The limitation of previous work is that the resulting crossbar representations have large dimensions. In this paper, we present a framework called COMPACT for mapping Boolean functions to crossbar representations with minimal semiperimeter (the number of wordlines plus bitlines). The COMPACT framework is based on an analogy between binary decision diagrams (BDDs) and nanoscale memristor crossbar arrays. More specifically, nodes and edges in a BDD correspond to wordlines/bitlines and memristors in a crossbar array, respectively. The relation enables a function represented by a BDD with n nodes and an odd cycle transversal of size k to be mapped to a crossbar with a semiperimeter of n+k. The k extra wordlines/bitlines are introduced due to crossbar connection constraints, i.e. wordlines (bitlines) cannot directly be connected to wordlines (bitlines). For multi-input multi-output functions, COMPACT can also be applied to shared binary decision diagrams (SBDDs), which further reduces the size of the crossbar representations. Compared with the state-of-the-art mapping technique, the semiperimeter is reduced from 2.13n to 1.09n on average, which translates into crossbar representations with 78% smaller area. The power consumption and the computation delay are reduced on average by 7% and 52%, respectively. |
16:15 CET | 2.6.2 | SQUEEZELIGHT: TOWARDS SCALABLE OPTICAL NEURAL NETWORKS WITH MULTI-OPERAND RING RESONATORS Speaker: Jiaqi Gu, University of Texas at Austin, US Authors: Jiaqi Gu1, Chenghao Feng1, Zheng Zhao1, Zhoufeng Ying1, Mingjie Liu2, Ray T. Chen1 and David Z. Pan1 1University of Texas at Austin, US; 2University of Texas Austin, US Abstract Optical neural networks (ONNs) have demonstrated promising potential for next-generation artificial intelligence acceleration with ultra-low latency, high bandwidth, and low energy consumption. However, due to high area cost and lack of efficient sparsity exploitation, previous ONN designs fail to provide scalable and efficient neuromorphic computing, which hinders the practical implementation of photonic neural accelerators. In this work, we propose a novel design methodology to enable a more scalable ONN architecture. We propose a nonlinear optical neuron based on multi-operand ring resonators to achieve neuromorphic computing with a compact footprint, low wavelength usage, learnable neuron balancing, and built-in nonlinearity. Structured sparsity is exploited to support more efficient ONN engines via a fine-grained structured pruning technique. A robustness-aware learning method is adopted to guarantee the variation tolerance of our ONN. Simulation and experimental results show that the proposed ONN achieves one-order-of-magnitude improvement in compactness and efficiency over previous designs with high fidelity and robustness. |
16:30 CET | IP2_2.1 | RECEPTIVE-FIELD AND SWITCH-MATRICES BASED RERAM ACCELERATOR WITH LOW DIGITAL-ANALOG CONVERSION FOR CNNS Speaker: Xun Liu, North China University of Technology, CN Authors: Yingxun Fu1, Xun Liu2, Jiwu Shu3, Zhirong Shen4, Shiye Zhang1, Jun Wu1 and Li Ma1 1North China University of Technology, CN; 2North China University of Technology, CN; 3Tsinghua University, CN; 4Xiamen University, CN Abstract Processing-in-Memory (PIM) based accelerators have become one of the best solutions for the execution of convolutional neural networks (CNNs). Resistive random access memory (ReRAM) is a classic type of non-volatile random-access memory that is very suitable for implementing PIM architectures. However, existing ReRAM-based accelerators mainly focus on improving calculation efficiency, ignoring the fact that the digital-analog signal conversion process consumes significant energy and execution time. In this paper, we propose a novel ReRAM-based accelerator named Receptive-Field and Switch-Matrices based CNN Accelerator (RFSM). In RFSM, we first propose a receptive-field based convolution strategy to analyze the data relationships, and then give a dynamic and configurable crossbar combination method to reduce the digital-analog conversion operations. The evaluation results show that, compared to existing works, RFSM gains up to 6.7x higher speedup and 7.1x lower energy consumption. |
16:31 CET | IP2_2.2 | (Best Paper Award Candidate) AN ON-CHIP LAYER-WISE TRAINING METHOD FOR RRAM BASED COMPUTING-IN-MEMORY CHIPS Speaker: Yiwen Geng, Tsinghua University, CN Authors: Yiwen Geng, Bin Gao, Qingtian Zhang, Wenqiang Zhang, Peng Yao, Yue Xi, Yudeng Lin, Junren Chen, Jianshi Tang, Huaqiang Wu and He Qian, Institute of Microelectronics, Tsinghua University, CN Abstract RRAM based computing-in-memory (CIM) chips have shown great potential to accelerate deep neural networks on edge devices by reducing data transfer between the memory and the computing unit. However, due to the non-ideal characteristics of RRAM, the accuracy of a neural network on an RRAM chip is usually lower than that of its software counterpart. Here we propose an on-chip layer-wise training (LWT) method to alleviate the adverse effects of RRAM imperfections and improve the accuracy of the chip. Using a locally validated dataset, LWT can reduce the communication between the edge and the cloud, which also benefits personalized data privacy. The simulation results on the CIFAR-10 dataset show that the LWT method can improve the accuracy of VGG-16 and ResNet-18 by more than 5% and 10%, respectively, with only 30% of the operations and 35% of the buffer compared with the back-propagation method. Moreover, the pipe-LWT method is presented to further improve throughput by three times. |
16:32 CET | 2.6.3 | A 3-D LUT DESIGN FOR TRANSIENT ERROR DETECTION VIA INTER-TIER IN-SILICON RADIATION SENSOR Speaker: Sarah Azimi, Politecnico di Torino, IT Authors: Sarah Azimi, Corrado De Sio and Luca Sterpone, Politecnico di Torino, IT Abstract Three-dimensional Integrated Circuits (3-D ICs) have gained much attention as a promising approach to increase IC performance due to their several advantages in terms of integration density, power dissipation, and achievable clock frequencies. However, achieving 3-D ICs resilient to soft errors resulting from radiation effects is a challenging problem. Traditional Radiation-Hardened-by-Design (RHBD) techniques are costly in terms of area, power, and performance overheads. In this work, we propose a new 3-D LUT design integrating error detection capabilities. The LUT has been designed on a two-tier IC model, improving radiation resiliency by selective upsizing of sensitive transistors. Besides, an in-silicon radiation sensor adopting an inverter chain has been implemented within the free volume of the 3-D structure. The proposed design shows a 37% reduction in sensitivity to SETs and an effective error detection rate of 83% without introducing any area overhead. |
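The central relation in paper 2.6.1 identifies BDD nodes with wordlines/bitlines and BDD edges with memristors, under the constraint that wordlines may not connect directly to wordlines (nor bitlines to bitlines); a node set that 2-colors cleanly maps with semiperimeter n, and each node on an odd cycle costs an extra line, giving n + k. The sketch below only illustrates that coloring constraint on a made-up graph: a BFS 2-coloring whose same-colored "conflict" edges witness odd cycles. It is not the COMPACT mapping algorithm, and the conflict count is just a crude witness (computing the minimum odd cycle transversal is NP-hard in general).

```python
from collections import deque

def two_color(n_nodes, edges):
    """Greedy BFS 2-coloring (0 = wordline, 1 = bitline). Returns the
    coloring plus 'conflict' edges joining same-colored nodes, which
    witness odd cycles and force extra wordlines/bitlines."""
    adj = [[] for _ in range(n_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    colors, conflicts = [None] * n_nodes, set()
    for start in range(n_nodes):
        if colors[start] is not None:
            continue
        colors[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if colors[v] is None:
                    colors[v] = 1 - colors[u]
                    queue.append(v)
                elif colors[v] == colors[u]:
                    conflicts.add(frozenset((u, v)))
    return colors, conflicts

# Toy 5-node graph with one odd cycle (0-1-2): one endpoint needs an
# extra line, so the semiperimeter becomes n + k = 5 + 1.
colors, conflicts = two_color(5, [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)])
print("colors:", colors)
print("conflict edges:", [tuple(e) for e in conflicts])
```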
2.7 Scheduling and Execution Time Variation
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/QpHBEpw4bYqYPixYQ
Session chair:
Marko Bertogna, University of Modena, IT
Session co-chair:
Claire Maiza, Grenoble INP, FR
Both scheduling and worst-case execution time estimation have to cope with increasingly complex architectures and the need for new functionality. In this session, the authors present new results on Lazy Round Robin scheduling, which also allows timing attacks to be mitigated. Moreover, for parallel tasks, Virtual Gang scheduling is proposed in the context of multicore systems. The scheduling results of this session are complemented by new ways of defining worst-case execution time bounds for mixed-criticality systems. (A worked statement of the Chebyshev bound used in the second paper follows this session's listing.)
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 2.7.1 | RESPONSE TIME ANALYSIS OF LAZY ROUND ROBIN Speaker: Yue Tang, Northeastern University, CN Authors: Yue Tang1, Nan Guan2, Zhiwei Feng1, Xu Jiang1 and Wang Yi3 1Northeastern University, CN; 2The Hong Kong Polytechnic University, HK; 3Northeastern University and Uppsala University, CN Abstract The Round Robin scheduling policy is used in many real-time embedded systems because of its simplicity and low overhead. In this paper, we study a variation of Round Robin used in practical systems, named Lazy Round Robin, which is simpler to implement and has lower runtime overhead than ordinary Round Robin. The key difference between Round Robin and Lazy Round Robin lies in when the scheduler reacts to newly released task instances. The Round Robin scheduler checks whether a newly released task instance is eligible for execution in the remaining part of the current round, while the Lazy Round Robin scheduler does not react to any task release until the end of the current round. This paper develops techniques to calculate upper bounds on the response times of tasks scheduled by Lazy Round Robin. Experiments are conducted to evaluate our analysis techniques and compare the real-time performance of Round Robin and Lazy Round Robin. |
16:15 CET | 2.7.2 | IMPROVING THE TIMING BEHAVIOUR OF MIXED-CRITICALITY SYSTEMS USING CHEBYSHEV'S THEOREM Speaker: Behnaz Ranjbar, TU Dresden, DE Authors: Behnaz Ranjbar1, Ali Hoseinghorban2, Siva Satyendra Sahoo1, Alireza Ejlali2 and Akash Kumar1 1TU Dresden, DE; 2Sharif University of Technology, IR Abstract In Mixed-Criticality (MC) systems, there are often multiple Worst-Case Execution Times (WCETs) for the same task, corresponding to the system operation mode. Determining the appropriate WCETs for lower criticality modes is non-trivial: while on the one hand, a low WCET for a mode can improve the processor utilization in that mode, on the other hand, using a larger WCET ensures that mode switches are minimized, thereby maximizing the quality-of-service for all tasks, albeit at the cost of processor utilization. Although there are many studies to determine the WCET in the highest criticality mode, no analytical solutions have been proposed to determine WCETs in the lower criticality modes. In this regard, we propose a scheme that determines WCETs using Chebyshev's theorem to make a trade-off between the number of scheduled tasks at design-time and the number of dropped low-criticality tasks at runtime as a result of frequent mode switches. Our experimental results show that our scheme improves the utilization of state-of-the-art MC systems by up to 85.29%, while maintaining a 9.11% mode switching probability in the worst-case scenario. |
16:30 CET | 2.7.3 | VIRTUAL GANG SCHEDULING OF PARALLEL REAL-TIME TASKS Speaker: Waqar Ali, University of Kansas, US Authors: Waqar Ali1, Rodolfo Pellizzoni2 and Heechul Yun3 1University of Kansas at Lawrence, US; 2University of Waterloo, CA; 3University of Kansas, US Abstract We consider the problem of executing parallel real-time tasks according to gang scheduling on a multicore system in the presence of shared resource interference. Specifically, we consider sets of gang-tasks with precedence constraints in the form of a DAG. We introduce the novel concept of a virtual gang: a group of parallel tasks that are scheduled together as a single entity. Employing virtual gangs allows us to tightly bound the effect of shared resource interference. It also transforms the original, complex scheduling problem into a form that can be easily implemented and is amenable to exact schedulability analysis, further reducing pessimism. We present and evaluate both optimal and heuristic methods for forming virtual gangs based on a known interference model and while respecting all precedence constraints among tasks. When precedence constraints are not considered, we also compare our approach against existing response-time analysis for globally scheduled gang-tasks, as well as general parallel tasks. The results show that our approach significantly outperforms state-of-the-art multicore schedulability analyses when shared-resource interference is considered. Even in the absence of interference, it performs better than the state-of-the-art for highly parallel tasksets. |
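Paper 2.7.2 derives the lower-criticality WCETs from Chebyshev's theorem. The abstract does not spell out the calibration, but the standard inequality already exposes the trade-off: a larger budget lowers the mode-switch probability at the cost of utilization. A minimal worked statement follows, using the abstract's 9.11% figure purely as an example target:

```latex
% Chebyshev's inequality for execution time C with mean \mu and
% standard deviation \sigma:
\Pr\bigl[\,|C - \mu| \ge k\sigma\,\bigr] \le \frac{1}{k^2}
% Choosing the low-criticality budget C^{\mathrm{LO}} = \mu + k\sigma
% therefore bounds the probability of an overrun (and hence of a mode
% switch) conservatively, since only the upper tail can overrun:
\Pr\bigl[\,C > \mu + k\sigma\,\bigr] \le \frac{1}{k^2}
% Example: a 9.11\% switch-probability target permits
% k = 1/\sqrt{0.0911} \approx 3.31, i.e. C^{\mathrm{LO}} \approx \mu + 3.31\,\sigma.
```

The bound needs only the mean and variance of the measured execution times, which is what makes it attractive when the full distribution is unknown; a one-sided (Cantelli) bound would be slightly tighter.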
2.8 Exhibition Keynote on Digital Twins and Invitation to Become a Book Author
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/MB6PyFhHcrzT9KqDK
Organizer:
Jürgen Haase, edacentrum GmbH, DE
This Exhibition Workshop features an Exhibition Keynote and a tutorial on the publication of research results in a book.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 2.8.1 | EXHIBITION KEYNOTE - DIGITAL TWIN: THE FUTURE IS NOW Speaker: Thomas Heurung, Siemens EDA, DE Abstract Virtually every discussion on business trends talks about digitalization—whether it's the digital thread, digital twin, the digital enterprise, or the digitalization of everything. The goal is to harness the power of the exponential to integrate data in unprecedented ways to deliver new value and performance. As a result, the next generation of SoCs will be driven by business workloads and energy efficiencies, where software performance will define semiconductor success. That is why a digital twin is becoming a necessity to virtually verify and validate the system performance of SoCs both pre-silicon and then throughout the lifecycle of the SoC. Thomas Heurung, technical director Europe for Siemens EDA, will explain how the digital twin is helping to drive Tera-scale IC and application systems. First, by accelerating design creation of custom accelerators. Then, by supporting the shift-left in SoC verification, leading to true system validation from IP to software to systems. And ultimately, the digital twin enables a digitalization of the SoC environment both pre-silicon and throughout the lifecycle of the IC. |
16:30 CET | 2.8.2 | BOOK PUBLISHING 101: THE WHY, HOW AND WHAT Speaker: Charles Glaser, Springer Nature, US Abstract Are you interested in learning more about the why, how (and what) of publishing a book? Publishing a book is a powerful tool, allowing you to communicate your ideas to a global audience, building your reputation in the field and accelerating your career. Join this upcoming webinar to hear from Springer Nature Editorial Director, Charles Glaser, about the stages in the publishing process and how Springer has helped many authors just like you publish a book. A Q&A will follow to answer all your book publishing questions. |
IP2_1 Interactive Presentations
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/JGXWJ2m4MyTpq2spH
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP2_1.1 | AN EFFECTIVE METHODOLOGY FOR INTEGRATING CONCOLIC TESTING WITH SYSTEMC-BASED VIRTUAL PROTOTYPES Speaker: Sören Tempel, University of Bremen, DE Authors: Sören Tempel1, Vladimir Herdt2 and Rolf Drechsler3 1University of Bremen, DE; 2DFKI, DE; 3University of Bremen/DFKI, DE Abstract In this paper we propose an effective methodology for integrating Concolic Testing (CT) with SystemC-based Virtual Prototypes (VPs) for verification of embedded SW binaries. Our methodology involves three steps: 1) integrating CT support with the Instruction Set Simulator (ISS) of the VP, 2) utilizing the standard TLM-2.0 extension mechanism for transporting concolic values alongside generic TLM transactions, and 3) providing lightweight concolic overlays for SystemC-based peripherals that enable non-intrusive CT support for peripherals and thus significantly reduce the CT integration effort. Our RISC-V experiments using the RIOT operating system demonstrate the effectiveness of our approach. |
IP2_1.2 | A CONTAINERIZED ROS-COMPLIANT VERIFICATION ENVIRONMENT FOR ROBOTIC SYSTEMS Speaker: Samuele Germiniani, University of Verona, IT Authors: Stefano Aldegheri, Nicola Bombieri, Samuele Germiniani, Federico Moschin and Graziano Pravadelli, University of Verona, IT Abstract This paper proposes an architecture and a related automatic flow to generate, orchestrate and deploy a ROS-compliant verification environment for robotic systems. The architecture enables assertion-based verification by exploiting monitors automatically synthesized from LTL assertions. The monitors are encapsulated in plug-and-play ROS nodes that do not require any modification to the system under verification (SUV). To guarantee both verification accuracy and real-time constraints of the system in a resource-constrained environment even after the monitor integration, we define a novel approach to move the monitor evaluation across the different layers of an edge-to-cloud computing platform. The verification environment is containerized for both cloud and edge computing using Docker to enable system portability and to handle, at run-time, the resources allocated for verification. The effectiveness and efficiency of the proposed architecture have been evaluated on a complex distributed system implementing a mobile robot path planner based on 3D simultaneous localization and mapping. |
IP2_2 Interactive Presentations
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/4taqRWQpMc3eRmd6X
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP2_2.1 | RECEPTIVE-FIELD AND SWITCH-MATRICES BASED RERAM ACCELERATOR WITH LOW DIGITAL-ANALOG CONVERSION FOR CNNS Speaker: Xun Liu, North China University of Technology, CN Authors: Yingxun Fu1, Xun Liu2, Jiwu Shu3, Zhirong Shen4, Shiye Zhang1, Jun Wu1 and Li Ma1 1North China University of Technology, CN; 2North China University of Technology, CN; 3Tsinghua University, CN; 4Xiamen University, CN Abstract Processing-in-Memory (PIM) based accelerators have become one of the best solutions for the execution of convolutional neural networks (CNNs). Resistive random access memory (ReRAM) is a classic type of non-volatile random-access memory that is very suitable for implementing PIM architectures. However, existing ReRAM-based accelerators mainly focus on improving calculation efficiency, ignoring the fact that the digital-analog signal conversion process consumes significant energy and execution time. In this paper, we propose a novel ReRAM-based accelerator named Receptive-Field and Switch-Matrices based CNN Accelerator (RFSM). In RFSM, we first propose a receptive-field based convolution strategy to analyze the data relationships, and then give a dynamic and configurable crossbar combination method to reduce the digital-analog conversion operations. The evaluation results show that, compared to existing works, RFSM gains up to 6.7x higher speedup and 7.1x lower energy consumption. |
IP2_2.2 | AN ON-CHIP LAYER-WISE TRAINING METHOD FOR RRAM BASED COMPUTING-IN-MEMORY CHIPS Speaker: Yiwen Geng, Tsinghua University, CN Authors: Yiwen Geng, Bin Gao, Qingtian Zhang, Wenqiang Zhang, Peng Yao, Yue Xi, Yudeng Lin, Junren Chen, Jianshi Tang, Huaqiang Wu and He Qian, Institute of Microelectronics, Tsinghua University, CN Abstract RRAM based computing-in-memory (CIM) chips have shown great potential to accelerate deep neural networks on edge devices by reducing data transfer between the memory and the computing unit. However, due to the non-ideal characteristics of RRAM, the accuracy of a neural network on an RRAM chip is usually lower than that of its software counterpart. Here we propose an on-chip layer-wise training (LWT) method to alleviate the adverse effects of RRAM imperfections and improve the accuracy of the chip. Using a locally validated dataset, LWT can reduce the communication between the edge and the cloud, which also benefits personalized data privacy. The simulation results on the CIFAR-10 dataset show that the LWT method can improve the accuracy of VGG-16 and ResNet-18 by more than 5% and 10%, respectively, with only 30% of the operations and 35% of the buffer compared with the back-propagation method. Moreover, the pipe-LWT method is presented to further improve throughput by three times. |
UB.03 University Booth
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/6qWF5PX2LAwP8bpLH
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.03 | 3D-MEM-THERM: A FAST, ACCURATE 3D MEMORY THERMAL SIMULATOR Speakers: Lokesh Siddhu and Rajesh Kedia, IIT Delhi, IN Authors: Lokesh Siddhu and Rajesh Kedia, IIT Delhi, IN Abstract Thermal issues have limited the widespread adoption of 3D memories. Fast and accurate thermal simulation can help in designing appropriate thermal management policies. Temperature-dependent leakage power, which causes significant heating in 3D memories, is not modeled accurately in existing thermal simulators like HotSpot. These simulators also do not account for the effect of process variations on leakage power. We augment HotSpot to address these challenges and propose a fast trace-based thermal simulator, namely 3D-Mem-Therm. 3D-Mem-Therm integrates several other novel features, such as support for evaluating thermal management policies, choosing memory layout from predefined 3D memory floorplans, etc. 3D-Mem-Therm is significantly faster than industry-standard simulators and estimates temperature accurately to within one degree Celsius. We plan to demonstrate these features of 3D-Mem-Therm for the rapid design of thermal management policies for 3D memories. |
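The temperature-dependent leakage that UB.03 highlights creates a feedback loop: higher temperature raises leakage power, which raises temperature again. A common way to model this is a fixed-point iteration between a thermal model and a leakage model, sketched below with invented constants (the ambient temperature, thermal resistance, and exponential leakage coefficient are illustrative values, not parameters of 3D-Mem-Therm).

```python
import math

# Illustrative constants, not from 3D-Mem-Therm.
T_AMB = 45.0      # ambient temperature, deg C
R_TH = 0.8        # thermal resistance, deg C per watt
P_DYN = 12.0      # dynamic power, watts
P_LEAK0 = 2.0     # leakage power at T_AMB, watts
BETA = 0.05       # leakage temperature sensitivity, 1/deg C

def leakage(temp):
    """Leakage grows roughly exponentially with temperature."""
    return P_LEAK0 * math.exp(BETA * (temp - T_AMB))

# Fixed-point iteration: temperature raises leakage, leakage raises
# temperature, until convergence (divergence would indicate thermal
# runaway under this model).
temp = T_AMB
for _ in range(100):
    new_temp = T_AMB + R_TH * (P_DYN + leakage(temp))
    if abs(new_temp - temp) < 1e-6:
        break
    temp = new_temp
print(f"steady state: {temp:.2f} deg C, leakage: {leakage(temp):.2f} W")
```

A trace-based simulator repeats such an evaluation per floorplan block and time step; a leakage-unaware model, which evaluates the thermal equation once with leakage fixed at its ambient value, underestimates the steady-state temperature here by more than a degree.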
UB.04 University Booth
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/RENWPfcgdEqvFtipR
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.04 | MODULAR HARDWARE AND SOFTWARE PLATFORM FOR THE RAPID IMPLEMENTATION OF ASIC-BASED BIOANALYTICAL TEST SYSTEMS Speaker: Alexander Hofmann, IMMS Institut für Mikroelektronik- und Mechatronik-Systeme gemeinnützige GmbH, DE Authors: Alexander Hofmann, Peggy Reich, Marco Götze, Alexander Rolapp, Sebastian Uziel, Thomas Elste, Bianca Leistritz, Wolfram Kattanek and Björn Bieske, IMMS Institut für Mikroelektronik- und Mechatronik-Systeme gemeinnützige GmbH (IMMS GmbH), DE Abstract To support the complex and lengthy development of sensor ASICs for point-of-care diagnostics, there is a need for modular, flexible and powerful test systems, mainly for the test and characterization of sensor ASICs and for the prototypical realization of bioanalytical measurements. At IMMS, a modular hardware and software platform has been developed which includes functional modules for power supply, electrical signal processing, digital signal processing, and modules for the control of fluidics, light sources, and heating elements. The platform software covers the control of the hardware modules as well as the acquisition and processing of measurement data. In addition, it enables the automated execution of measurement sequences. This contribution demonstrates the use of the platform with the example of an optoelectronic sensor ASIC for highly sensitive measurements of light absorption in liquids, where the sensor detects small signal changes in a wide dynamic range. |
UB.05 University Booth
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/kQJosNue2ApJAxLt6
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.05 | ELSA: FORMAL ABSTRACTION AND VERIFICATION OF ANALOG CIRCUITS Speaker: Ahmad Tarraf, University of Frankfurt, DE Authors: Ahmad Tarraf and Lars Hedrich, University of Frankfurt, DE Abstract The demonstration presents a recently published methodology that automatically generates accurate abstract models suited for verification and simulation routines with significant speed-up factors. The abstraction methodology is based on sampling a Spice netlist at transistor level with full Spice BSIM accuracy. The approach generates a hybrid automaton (HA) that exhibits a linear behavior described by a state space representation in each of its locations, thereby modeling the nonlinear behavior of the netlist via multiple locations. Hence, due to the linearity of the obtained model, the approach is easily scalable. As the eigenvalues of the linearized system play a significant role in the abstraction process, the tool was named Elsa: eigenvalue-based hybrid linear system abstraction. The HAs can be deployed in various output languages: Matlab, Verilog-A, and SystemC-AMS. |
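Elsa's output model class, a hybrid automaton whose locations each carry linear state-space dynamics, is easy to simulate once extracted. The sketch below steps such a model at its very simplest: two locations with scalar dynamics dx/dt = A·x + B·u, forward-Euler integration, and a guard that switches location. All matrices and the guard threshold are invented stand-ins, not values Elsa would produce from a real netlist.

```python
import numpy as np

# Two-location piecewise-linear hybrid automaton (made-up dynamics):
# location 0 is "slow" (dx/dt = -x + u), location 1 is "fast"
# (dx/dt = -5x + 5u); the guard x > 0.5 switches to location 1.
A = {0: np.array([[-1.0]]), 1: np.array([[-5.0]])}
B = {0: np.array([1.0]), 1: np.array([5.0])}

def step(loc, x, u, dt):
    """One forward-Euler step plus guard evaluation."""
    x = x + dt * (A[loc] @ x + B[loc] * u)
    loc = 1 if x[0] > 0.5 else 0
    return loc, x

loc, x, dt = 0, np.array([0.0]), 0.01
for _ in range(300):
    loc, x = step(loc, x, u=1.0, dt=dt)
print(f"after {300 * dt:.1f}s: location {loc}, x = {x[0]:.4f}")
```

Because every location is linear, simulation (and reachability-style verification) scales far better than transistor-level Spice, which is the speed-up the demonstration advertises.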
UB.06 University Booth
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/ecf9BKy54iZTgBL5s
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.06 | LEARNV: A RISC-V BASED EMBEDDED SYSTEM DESIGN FRAMEWORK FOR EDUCATION AND RESEARCH DEVELOPMENT Speaker: Noureddine Ait Said, TIMA Laboratory, FR Authors: Noureddine Ait Said and Mounir Benabdenbi, TIMA Laboratory, FR Abstract Designing a modern System on a Chip is based on the joint design of hardware and software (co-design). However, understanding the tight relationship between hardware and software is not straightforward. Moreover, validating new concepts in SoC design from the idea to the hardware implementation is time-consuming and often slowed by legacy issues (intellectual property of hardware blocks and expensive commercial tools). To overcome these issues, we propose to use the open-source Rocket Chip environment for educational purposes, combined with the open-source LowRISC architecture to implement a custom SoC design on an FPGA board. The demonstration will present how students and engineers can benefit from the environment to deepen their knowledge of HW and SW co-design. Using the LowRISC architecture, an image classification application based on the use of CNNs will serve as a demonstrator of the whole open-source hardware and software flow and will be mapped on a Nexys A7 FPGA board. |
3.1 Sustainable solutions at large: bettering energy efficiency in HPC
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 17:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/5LFowceockZhBcWM8
Session chair:
David Bol, Université catholique de Louvain, BE
Session co-chair:
Gilles Sassatelli, LIRMM, FR
Organizers:
Gilles Sassatelli, LIRMM, FR
Miquel Moreto, BSC, ES
The power consumption of supercomputing and cloud infrastructures is rising at an unprecedented rate and poses a serious challenge, not least because of the growing carbon footprint. This session is built around four contributions that attack the problem at different levels, from increasing heterogeneity and datacenter management techniques for better energy efficiency, to disruptive approaches to decarbonization using renewable energies and waste-heat reuse in massively distributed grid computing systems. (A toy power-capping sketch in the spirit of the demand-response paper follows this session's listing.)
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 3.1.1 | FUTURE OF HPC: DIVERSIFYING HETEROGENEITY Speaker: Dejan Milojicic, Hewlett Packard Enterprise, US Authors: Dejan Milojicic1, Paolo Faraboschi1, Nicolas Dube2 and Duncan Roweth3 1Hewlett Packard Labs, US; 2Hewlett Packard Enterprise, CA; 3Hewlett Packard Enterprise, GB Abstract After the end of Dennard scaling and with the imminent end of Moore’s Law, it has become challenging to continue scaling HPC systems within a given power envelope. This is exacerbated most in large systems, such as high end supercomputers. To alleviate this problem, general-purpose computing is no longer sufficient, and HPC systems and components are being augmented with special-purpose hardware. By definition, because of the narrow applicability of specialization, broad supercomputing adoption requires using different heterogeneous components, each optimized for a specific application domain. In this paper, we discuss the impact of the introduced heterogeneity of specialization across the HPC stack: interconnects including memory models, accelerators including power and cooling, use cases and applications including AI, and delivery models, such as traditional, as-a-Service, and federated. We believe that a stack that supports diversification across hardware and software is required to continue scaling performance and maintaining energy efficiency. |
17:45 CET | 3.1.2 | A DATA CENTER DEMAND RESPONSE POLICY FOR REAL-WORLD WORKLOAD SCENARIOS IN HPC Speaker: Daniel Wilson, Boston University, US Authors: Yijia Zhang, Daniel C. Wilson, Ioannis Ch. Paschalidis and Ayse K. Coskun, Boston University, US Abstract Demand response programs offer an opportunity for large power consumers to save on electricity costs by modulating their power consumption in response to demand changes in the electricity grid. Multiple types of such programs exist; for example, regulation service programs enable a consumer to bid for a sustainable amount of power draw over a time period, along with a reserve amount they are able to provide at the request of the electricity service provider. Data centers offer unique capabilities to participate in these programs since they have significant capacity to modify their power consumption through workload scheduling and CPU power limiting. This paper proposes a novel power management policy and a bidding policy that enable data centers to participate in regulation service programs under real-world constraints. The power management policy schedules computing jobs and applies server power-capping under both the constraints of power programs and the constraints of job Quality-of-Service (QoS). Simulations with workload traces from a real data center show that the proposed policies enable data centers to meet both the requirements of regulation service programs and the QoS requirements of jobs. We demonstrate that, by applying our policies, data centers can reduce their electricity costs by 10% while abiding by all the QoS constraints in a real-world scenario. |
18:00 CET | 3.1.3 | ACCELERATING DATA CENTER DECARBONIZATION AND MAXIMIZING RENEWABLE USAGE WITH GRID EDGE SOLUTIONS Speaker: John Glassmire, Hitachi ABB Power Grids, US Authors: John Glassmire1, Hamideh Bitaraf1, Stylianos Papadakis2 and Alexandre Oudalov2 1Hitachi ABB Power Grids, US; 2Hitachi ABB Power Grids, CH Abstract Data centers and other computing clusters have unique electrical power requirements. They demand high reliability with high power quality, while at the same time are being driven by society and industry to use renewables as their only electricity source. To date, many large data center users have focused on offsite renewable portfolio contracts and power purchases agreements to offset data center energy demands. However, this strategy misses several greenhouse gas contributors: the diesel and gas generators that provide back-up, and the reliance on existing fossil fuel generation that often balances renewable output power in utility networks. With grid edge solutions including microgrids and battery energy storage systems, data centers have an opportunity to maximize their usage of renewable generation while minimizing the usage of fossil-driven energy generation. This paper will explore the key considerations for using grid edge technologies to decarbonize the back-up supplies for data centers, as well as explore how they can stabilize the utility networks that supply data centers--even as the penetration of renewable generation in the network reach 100%. We will introduce the strategies for implementation, including an overview of the design, management, control, and optimization of their renewable energy supply. We will explore the economic considerations for these investments, while providing useful benchmarks for achievable goals in each of these areas. |
18:15 CET | 3.1.4 | DISTRIBUTED GRID COMPUTING MANAGER COVERING WASTE HEAT REUSE CONSTRAINTS Speaker: Rémi Bouzel, Qarnot Computing, FR Authors: Rémi Bouzel, Yanik Ngoko, Paul Benoit and Nicolas Sainthérant, Qarnot Computing, FR Abstract In this paper, we discuss a green and distributed type of datacenter, implemented by Qarnot Computing. This approach promotes a new computing paradigm in which computers are considered as machines that produce both computation and heat, and are therefore able to reuse the waste heat generated. It is based on two main technologies: a new model of servers and a new distributed grid computing manager which encloses a heat-aware job scheduler. This paper focuses on the infrastructures and cloud computing services that were developed to answer the constraints of this new HPC paradigm. The description covers the job scheduler that ensures security and resilience of Qarnot distributed computing resources in a non-regulated environment. We summarize the key computational challenges met and the strategies developed to solve them. A specific use case is detailed to show that, in spite of its thermal-aware specificity, spawning a job on the Qarnot platform remains as simple as on any other state-of-the-art job scheduler. |
18:30 CET | 3.1.5 | LIVE JOINT Q&A Authors: David Bol1, Gilles Sassatelli2, Dejan Milojicic3, Ayse Coskun4, John Glassmire5 and Rémi Bouzel6 1Université catholique de Louvain, BE; 2LIRMM CNRS / University of Montpellier 2, FR; 3Hewlett Packard Labs, US; 4Boston University, US; 5Hitachi ABB Power Grids, US; 6Qarnot Computing, FR Abstract 30 minutes of live joint question and answer time for interaction among speakers and audience. |
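The regulation-service setting of paper 3.1.2 can be made concrete with a toy controller: the data center bids an average power draw plus a reserve, the grid sends a signal each interval, and the center must track the resulting power target by capping servers without violating per-job QoS floors. Everything below (the bid, the reserve, the job table, and the proportional cap split) is an invented illustration, not the paper's scheduling or bidding policy.

```python
# Toy regulation-service tracking: target power = P_BID + signal * RESERVE,
# split across jobs via power caps that never drop below a QoS floor.
P_BID, RESERVE = 100.0, 20.0   # kW, illustrative numbers
JOBS = {
    "job_a": {"floor": 15.0, "max": 50.0},   # kW
    "job_b": {"floor": 20.0, "max": 60.0},
    "job_c": {"floor": 10.0, "max": 40.0},
}

def assign_caps(target):
    """Give every job its QoS floor, then share the remaining power
    budget proportionally to each job's headroom."""
    floors = sum(j["floor"] for j in JOBS.values())
    headroom = sum(j["max"] - j["floor"] for j in JOBS.values())
    extra = max(0.0, min(target - floors, headroom))
    return {name: j["floor"] + (j["max"] - j["floor"]) / headroom * extra
            for name, j in JOBS.items()}

for signal in (-1.0, 0.0, 0.7):        # grid signal in [-1, 1]
    target = P_BID + signal * RESERVE
    caps = assign_caps(target)
    print(f"signal {signal:+.1f} -> target {target:5.1f} kW, caps:",
          {k: round(v, 1) for k, v in caps.items()})
```

The paper's policies add what this sketch omits: job scheduling over time, QoS deadlines rather than static floors, and an optimized bid that maximizes the cost savings.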
3.2 Journey with Emerging Technologies and Architectures from Devices to System-Level Management
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 17:30 CET - 18:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/jDauCZ2zCKrT3FpAc
Session chair:
Jian-Jia Chen, University of Dortmund, DE
Session co-chair:
Georgios Zervakis, Karlsruhe Institute of Technology, DE
Organizers:
Hussam Amrouch, University of Stuttgart, DE
Jörg Henkel, Karlsruhe Institute of Technology, DE
The goal of this special session is to introduce and discuss various emerging technologies for logic circuitry and memory as well as novel architectures, with the key focus on how such innovations can reshape the future of computing - especially when it comes to memory-intensive applications like those in the Artificial Intelligence (AI) domain. The special session will cover various key research areas, starting from innovations in the underlying devices all the way up to innovations in computer architecture and system-level management. Three different promising emerging technologies will be presented: (i) Negative Capacitance Field-Effect Transistor (NCFET) as a new CMOS technology with advantages mainly for low power, (ii) Ferroelectric FET (FeFET) as an area-efficient, low-power non-volatile memory, as well as (iii) phase-change memory (PCM) and resistive RAM (ReRAM) offering a large potential for tackling the memory wall challenges in current technologies. In addition, the special session will also focus on discussing recent breakthroughs in computer architectures and demonstrate how innovations from both sides of the spectrum (i.e., low-level devices and high-level architectures) can be brought together towards significantly boosting the efficiency of computing in the upcoming generations. (A minimal sketch of the in-memory multiply-accumulate primitive underlying several of these talks follows this session's listing.)
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 3.2.1 | FEFET AND NCFET FOR FUTURE NEURAL NETWORKS: VISIONS AND OPPORTUNITIES Speaker: Hussam Amrouch, University of Stuttgart, DE Authors: Mikail Yayla1, Kuan-Hsun Chen1, Georgios Zervakis2, Joerg Henkel3, Jian-Jia Chen1 and Hussam Amrouch4 1TU Dortmund, DE; 2Karlsruhe Institute of Technology, DE; 3KIT, DE; 4University of Stuttgart, DE Abstract The goal of this special session paper is to introduce and discuss different emerging technologies for logic circuitry and memory as well as new lightweight architectures for neural networks. We demonstrate how the ever-increasing complexity of Artificial Intelligence (AI) applications, resulting in an immense increase in computational power, inevitably necessitates employing innovations starting from the underlying devices all the way up to the architectures. Two different promising emerging technologies will be presented: (i) Negative Capacitance Field-Effect Transistor (NCFET) as a new beyond-CMOS technology with advantages for offering low power and/or higher accuracy for neural network inference. (ii) Ferroelectric FET (FeFET) as a novel non-volatile, area-efficient and ultra-low power memory device. In addition, we demonstrate how Binary Neural Networks (BNNs) offer a promising alternative to traditional Deep Neural Networks (DNNs) due to their lightweight hardware implementation. Finally, we present the challenges of combining FeFET-based NVM with NNs and summarize our perspectives on future NNs and the vital role that emerging technologies may play. |
17:45 CET | 3.2.2 | EXPLOITING FEFETS VIA CROSS-LAYER DESIGN FROM IN-MEMORY COMPUTING CIRCUITS TO META LEARNING APPLICATIONS Speaker: Dayane Reis, University of Notre Dame, US Authors: Dayane Reis, Ann Franchesca Laguna, Michael Niemier and Xiaobo Sharon Hu, University of Notre Dame, US Abstract A ferroelectric FET (FeFET), made by integrating a ferroelectric material layer in the gate stack of a MOSFET, is a device that can behave as both a transistor and a non-volatile storage element. This unique property of FeFETs enables area-efficient and low-power merged logic and memory functionality, desirable for many data analytic and machine learning applications. To best exploit this unique feature of FeFETs, cross-layer design practices spanning from circuits and architectures to algorithms and applications are needed. The paper presents FeFET-based circuits and architectures that offer, either independently or in a configurable fashion, ternary content addressable memory (TCAM) and general-purpose compute-in-memory (GP-CiM) functionalities. These in-memory computing modules bring new opportunities to accelerating data-intensive applications. We discuss the use of these FeFET-based in-memory computing fabrics in meta-learning applications, specifically as attentional memory. System-level task mapping and end-to-end evaluation will be discussed. |
18:00 CET | 3.2.3 | FUTURE COMPUTING PLATFORM DESIGN: A CROSS-LAYER DESIGN APPROACH Speaker: Hsiang-Yun Cheng, Academia Sinica, TW Authors: Hsiang-Yun Cheng1, Chun-Feng Wu2, Christian Hakert3, Kuan-Hsun Chen3, Yuan-Hao Chang1, Jian-Jia Chen3, Chia-Lin Yang4 and Tei-Wei Kuo5 1Academia Sinica, TW; 2Academia Sinica and National Taiwan University, TW; 3TU Dortmund University, DE; 4National Taiwan University, TW; 5National Taiwan University and City University of Hong Kong, TW Abstract Future computing platforms are facing a paradigm shift with the emerging resistive memory technologies. First, they offer fast memory accesses and data persistence in a single large-capacity device deployed on the memory bus, blurring the boundary between memory and storage. Second, they enable computing-in-memory for neuromorphic computing to mitigate costly data movements. Due to the non-ideality of these resistive memory devices at the moment, we envision that cross-layer design is essential to bring such a system into practice. In this paper, we showcase a few examples to demonstrate how cross-layer design can be developed to fully exploit the potential of resistive memories and accelerate its adoption for future computing platforms. |
18:15 CET | 3.2.4 | INTELLIGENT ARCHITECTURES FOR INTELLIGENT COMPUTING SYSTEMS Speaker and Author: Onur Mutlu, ETH Zurich and Carnegie Mellon University, CH Abstract Computing is bottlenecked by data. Large amounts of application data overwhelm storage capability, communication capability, and computation capability of the modern machines we design today. As a result, many key applications' performance, efficiency and scalability are bottlenecked by data movement. In this invited special session talk, we describe three major shortcomings of modern architectures in terms of 1) dealing with data, 2) taking advantage of the vast amounts of data, and 3) exploiting different semantic properties of application data. We argue that an intelligent architecture should be designed to handle data well. We show that handling data well requires designing architectures based on three key principles: 1) data-centric, 2) data-driven, 3) data-aware. We give several examples for how to exploit each of these principles to design a much more efficient and high performance computing system. We especially discuss recent research that aims to fundamentally reduce memory latency and energy, and practically enable computation close to data, with at least two promising novel directions: 1) processing using memory, which exploits analog operational properties of memory chips to perform massively-parallel operations in memory, with low-cost changes, 2) processing near memory, which integrates sophisticated additional processing capability in memory controllers, the logic layer of 3D-stacked memory technologies, or memory chips to enable high memory bandwidth and low memory latency to near-memory logic. We discuss how to enable adoption of such fundamentally more intelligent architectures, which we believe are key to efficiency, performance, and sustainability. We conclude with some guiding principles for future computing architecture and system designs. This accompanying short paper provides a summary of the invited talk and points the reader to further work that may be beneficial to examine. |
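A primitive that recurs throughout this session's FeFET, ReRAM and PCM discussions is the analog multiply-accumulate performed by a resistive crossbar: weights are stored as conductances, inputs are applied as wordline voltages, and by Kirchhoff's current law each bitline current sums the products. A minimal idealized sketch follows; the conductance range and the lognormal variation model are illustrative assumptions standing in for the device non-idealities the speakers address.

```python
import numpy as np

rng = np.random.default_rng(1)

# Weights stored as conductances G (siemens) on a 4x3 crossbar,
# inputs applied as wordline voltages V (volts).
G = rng.uniform(1e-6, 1e-4, size=(4, 3))
V = np.array([0.2, 0.0, 0.1, 0.3])

# Ideal in-memory MAC: bitline current I_j = sum_i V_i * G_ij.
I_ideal = V @ G
print("ideal bitline currents (A):", I_ideal)

# Device-to-device variation perturbs the stored conductances; a simple
# lognormal model shows how the analog result drifts.
G_var = G * rng.lognormal(mean=0.0, sigma=0.1, size=G.shape)
print("with variation        (A):", V @ G_var)
print("relative error:", np.abs(V @ G_var - I_ideal) / I_ideal)
```

The cross-layer theme of the session follows directly: such errors must be absorbed somewhere, whether by device engineering, by circuit-level compensation, or by training methods that tolerate them.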
3.3 Vertical IP Protection of the Next-Generation Devices: Quo Vadis?
Add this session to my calendar
Date: Tuesday, 02 February 2021
Time: 17:30 CET - 18:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/nRP9chcosiRhfYH4B
Session chair:
Sebastian Huhn, University of Bremen, DE
Session co-chair:
Shubham Rai, TU Dresden, DE
Organizer:
Shubham Rai, TU Dresden, DE
With the advent of 5G and IoT applications, there is a greater thrust in terms of hardware security due to imminent risks caused by the high amount of intercommunication between various subsystems. Security gaps in integrated circuits thus represent high risks for both the manufacturers and the users of electronic systems. Particularly in the domain of Intellectual Property (IP) protection, there is an urgent need to devise security measures at all levels of abstraction so that we can be one step ahead of any kind of adversarial attack. In this special session, we will discuss IP protection measures from multiple perspectives---from system-level to device-level security measures, from discussing various attack methods such as reverse engineering and hardware Trojan insertion to proposing new-age protection measures such as DNA-based logic locking and secure information flow tracking. This special session will give a holistic overview of the current state-of-the-art measures and how well we are prepared for the next generation of circuits and systems. The main idea we want to put forward is that security should be one of the deciding factors during circuit and system design, and it should not be an afterthought. (A toy XOR key-gate locking sketch follows this session's listing.)
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 3.3.1 | IP PROTECTION, PRESENT AND FUTURE SCHEMES Speaker and Author: Ramesh Karri, NYU, US Abstract With the advent of 5G and IoT applications, there is a greater thrust in terms of hardware security due to imminent risks caused by the high amount of intercommunication between various subsystems. Security gaps in integrated circuits thus represent high risks for both the manufacturers and the users of electronic systems. Particularly in the domain of Intellectual Property (IP) protection, there is an urgent need to devise security measures at all levels of abstraction so that we can be one step ahead of any kind of adversarial attack. This work presents IP protection measures from multiple perspectives---from system-level down to device-level security measures, from discussing various attack methods such as reverse engineering and hardware Trojan insertions to proposing new-age protection measures such as DNA-based multi-valued logic locking and secure information flow tracking. This special session will give a holistic overview of the current state-of-the-art measures and how well we are prepared for the next generation circuits and systems. |
17:45 CET | 3.3.2 | SECURITY VALIDATION AT VP-LEVEL USING INFORMATION FLOW TRACKING Speaker and Author: Rolf Drechsler, University of Bremen/DFKI, DE Abstract Security is a crucial aspect of modern embedded systems that complements functional correctness to build safe and reliable systems. A very effective technique to validate security policies, and thus protect a system against a broad range of security-related exploits, is Information Flow Tracking (IFT). In this talk we present efficient IFT-based techniques at the system level using Virtual Prototypes (VPs). This enables validation of security policies very early in the design flow and hence helps prevent costly iterations later on. In particular, we present static and dynamic IFT-based techniques for security validation of the VP as well as the embedded SW running on the VP. Our experiments demonstrate the effectiveness of our approach. |
18:00 CET | 3.3.3 | MVLOCK: A MULTI-VALUED LOGIC LOCKING SCHEME FOR FUTURE-GENERATION COMPUTING SYSTEMS Speaker and Author: Farhad Merchant, Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, DE Abstract Future-generation computing systems will require sophisticated security mechanisms to prevent a variety of attacks. Especially with the emergence of neuromorphic computing, the underlying computations are not purely digital anymore. The complexity of future-generation computing systems also increases the attack surface for bad actors. The vulnerability of designs while in production at a third-party foundry is going to be a major concern. Logic locking is an emerging technique able to provide various measures of protection against foundry attacks such as hardware Trojan insertion, IP piracy, and counterfeiting. In this paper, we discuss the impact of post-CMOS technologies on security and how various logic locking paradigms can help us overcome hardware security challenges. In particular, we propose integrating soft (biological) intelligent systems as high-density building blocks to store information and create a subset of multi-valued logic locking (MVLock), and discuss the increased difficulty of breaking the locked circuits. Classical Boolean satisfiability test-based attacks and novel machine learning based attacks are analysed for key retrieval and prediction. |
18:15 CET | 3.3.4 | HARNESSING SECURITY THROUGH RUNTIME RECONFIGURABLE TRANSISTORS Speaker and Author: Akash Kumar, TU Dresden, DE Abstract Polymorphism is an indispensable property for ensuring security. Emerging reconfigurable nanotechnologies exhibit functional polymorphism at the very transistor level. Transistors belonging to this class showcase ambipolar behavior, where a single transistor can be configured to have either p-type or n-type functionality. This ambipolar behavior offers application opportunities both at the logic gate and at the circuit level. Transistors being the most fundamental piece in electronic circuits, polymorphism at this level can help in providing strong bottom-up security. In this talk, we discuss how runtime reconfigurability at the transistor level manifests itself in interesting circuit paradigms by offering more functionality per computation unit. We will discuss how transistor-level reconfigurability can be used for designing security primitives such as true random number generators. We will introduce a novel application scenario of a kill-switch based on these nanotechnologies, which is more disruptive as it is innocuous and completely hidden in normal circuit operation. |
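To ground the session's logic-locking discussion, the sketch below locks a tiny combinational function with textbook XOR key gates: the netlist computes the original function only under the correct key, while wrong keys corrupt some or all input patterns. This is the classic XOR-based baseline only; it is not MVLock's multi-valued scheme, and plain XOR locking is exactly the kind of construction the SAT-based attacks mentioned in 3.3.3 can break.

```python
from itertools import product

def f(a, b, c):
    """Original (unlocked) function: (NOT a AND b) XOR c."""
    return ((a ^ 1) & b) ^ c

def f_locked(a, b, c, k1, k2):
    """Locked netlist: an XOR key gate on input a and one on the output.
    It reproduces f only for the correct key (k1, k2) = (1, 0)."""
    return (((a ^ k1) & b) ^ c) ^ k2

CORRECT_KEY = (1, 0)
for key in product((0, 1), repeat=2):
    mismatches = sum(f(a, b, c) != f_locked(a, b, c, *key)
                     for a, b, c in product((0, 1), repeat=3))
    tag = "correct" if key == CORRECT_KEY else "wrong"
    print(f"key {key} ({tag}): {mismatches}/8 input patterns corrupted")
```

Real schemes insert many key gates and choose their locations so wrong keys are hard to prune; the session's argument is that emerging technologies such as multi-valued logic and reconfigurable transistors can make the locked behavior harder to reverse-engineer than this Boolean baseline.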
3.4 Advances with Emerging Technologies: Biochips, Memory-Centric Computing, and Ion Trap Quantum Architectures
Date: Tuesday, 02 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/qw4tW5LkJm3zrYzdP
Session chair:
Bing Li, TU Munich, DE
Session co-chair:
Said Hamdioui, TU Delft, NL
This session considers advances with various emerging technology platforms, including microfluidic biochips, memory-centric computer architectures, and quantum computing. The first presentation considers new approaches to monitor the health and state of microelectrodes, whose degradation can lead to erroneous bioassay outcomes. The next paper addresses applications of memory-centric computing to problems in graph processing. Finally, the last paper addresses design challenges with ion-trap-based quantum architectures.
Time | Label | Presentation Title / Authors |
---|---|---|
17:30 CET | 3.4.1 | FORMAL SYNTHESIS OF ADAPTIVE DROPLET ROUTING FOR MEDA BIOCHIPS Speaker: Mahmoud Elfar, Duke University, US Authors: Mahmoud Elfar, Tung-Che Liang, Krishnendu Chakrabarty and Miroslav Pajic, Duke University, US Abstract A digital microfluidic biochip (DMFB) enables the miniaturization of immunoassays, point-of-care clinical diagnostics, and DNA sequencing. A recent generation of DMFBs uses a micro-electrode-dot-array (MEDA) architecture, which provides fine-grained control of droplets and real-time droplet sensing using CMOS technology. However, microelectrodes in a MEDA biochip can degrade due to charge trapping when they are repeatedly charged and discharged during bioassay execution; such degradation leads to the failure of microelectrodes and erroneous bioassay outcomes. To address this problem, we first introduce a new microelectrode-cell design such that we can obtain the health status of all the microelectrodes in a MEDA biochip by employing the inherent sensing mechanism. Next, we present a stochastic game-based model for droplet manipulation, and a formal synthesis method for droplet routing that can dynamically change droplet transportation routes. This adaptation is based on the real-time health information obtained from microelectrodes. Comprehensive simulation results for four real-life bioassays show that our method increases the likelihood of successful bioassay completion with negligible impact on time-to-results. (An illustrative re-planning sketch follows this table.) |
17:45 CET | 3.4.2 | HYGRAPH: ACCELERATING GRAPH PROCESSING WITH HYBRID MEMORY-CENTRIC COMPUTING Speaker: Minxuan Zhou, University of California, San Diego, US Authors: Minxuan Zhou1, Muzhou Li1, Mohsen Imani2 and Tajana Rosing1 1UCSD, US; 2University of California Irvine, US Abstract Graph applications are challenging to run efficiently on conventional systems because of their large and irregular data. Several works have exploited near-data processing (NDP) based on emerging 3D-stacked memory to accelerate graph processing applications by offloading computations to massively parallel cores in the memory chip. Even though NDP can efficiently support parallel operations in a memory-scalable way, it still requires data movement between memory and near-memory cores. Such data movement introduces large overhead because of the random data access patterns in graph workloads. Furthermore, the parallelism provided by NDP systems is still insufficient for graph applications because of the limited number of processing cores. In this work, we tackle these challenges by integrating processing in-memory (PIM) technology in the NDP-based accelerator. We propose HyGraph, a software-hardware co-design for graph acceleration that exploits hybrid memory-centric computing technologies, including NDP and PIM. The design of HyGraph includes an optimization algorithm for hybrid memory layout, a run-time system combining both NDP and PIM processing flows, and customized hardware for efficiently enabling PIM functionality in NDP systems. Our experimental results show that HyGraph is up to 1.9× faster and 2.4× more energy-efficient than state-of-the-art memory-centric graph accelerators on several widely used graph algorithms with various real-world graphs. |
18:00 CET | IP3_2.1 | GENERIC SAMPLE PREPARATION FOR DIFFERENT MICROFLUIDIC PLATFORMS Speaker: Sudip Poddar, Johannes Kepler University Linz, Austria, AT Authors: Sudip Poddar1, Gerold Fink2, Werner Haselmayr1 and Robert Wille2 1Johannes Kepler University, AT; 2Johannes Kepler University Linz, AT Abstract Sample preparation plays a crucial role in several medical applications. Microfluidic devices, or Labs-on-Chips (LoCs), have become established as a suitable solution to realize this task in a miniaturized, integrated, and automatic fashion. Over the years, a variety of different microfluidic platforms has emerged, each with its respective pros and cons. Accordingly, numerous approaches for sample preparation have been proposed, each specialized for a single platform only. In this work, we propose an idea towards a generic sample preparation approach that generalizes the constraints of the different microfluidic platforms and thereby provides a platform-independent sample preparation method. This allows designers to quickly check which existing platform is most suitable for the considered task, and to easily support upcoming and future microfluidic platforms as well. We illustrate the applicability of the proposed method with examples for various platforms. |
18:01 CET | IP3_2.2 | RAISE: A RESISTIVE ACCELERATOR FOR SUBJECT-INDEPENDENT EEG SIGNAL CLASSIFICATION Speaker: Fan Chen, Duke University, US Authors: Fan Chen1, Linghao Song1, Hai (Helen) Li2 and Yiran Chen1 1Duke University, US; 2Duke University/TUM-IAS, US Abstract State-of-the-art deep neural networks (DNNs) for electroencephalography (EEG) signal classification focus on subject-related tasks, in which the test data and the training data need to be collected from the same subject. In addition, due to limited computing resources and strict power budgets at the edge, it is very challenging to deploy the inference of such DNN models on biological devices. In this work, we present an algorithm/hardware co-designed low-power accelerator for subject-independent EEG signal classification. We propose a compact neural network that is capable of identifying the common and stable structure among subjects. Based on it, we realize a robust subject-independent EEG signal classification model that can be extended to multiple BCI tasks with minimal overhead. On top of this model, we present RAISE, a low-power processing-in-memory inference accelerator leveraging emerging resistive memory. We compare the proposed model and hardware accelerator to prior art across various BCI paradigms. We show that our model achieves the best subject-independent classification accuracy, while RAISE achieves a 2.8x power reduction and a 2.5x improvement in performance per watt compared to the state-of-the-art resistive inference accelerator. |
18:02 CET | 3.4.3 | EXACT PHYSICAL DESIGN OF QUANTUM CIRCUITS FOR ION-TRAP-BASED QUANTUM ARCHITECTURES Speaker: Robert Wille, Institute for Integrated Circuits, Johannes Kepler University Linz, Austria, AT Authors: Oliver Keszocze1, Naser Mohammadzadeh2 and Robert Wille3 1Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), DE; 2Department of Computer Engineering, Shahed University, IR; 3Johannes Kepler University Linz, AT Abstract Quantum computers exploit quantum effects in a controlled manner in order to efficiently solve problems that are very hard to address on classical computers. Ion-trap-based technologies are a particularly advanced concept for realizing quantum computers, with advantages with respect to physical realization and fault tolerance. Accordingly, several physical design methods aiming at mapping quantum circuits to corresponding architectures have been proposed. However, all these methods are heuristic and cannot guarantee minimality. In this work, we propose a solution that generates exact physical designs, i.e., solutions which require a minimal number of time steps. To this end, satisfiability solvers are utilized. Experimental evaluations confirm that, despite the underlying computational complexity of the problem, this makes it possible to generate minimal physical designs for several quantum circuits for the first time. |
18:17 CET | IP3_3.1 | DOUBLE DQN FOR CHIP-LEVEL SYNTHESIS OF PAPER-BASED DIGITAL MICROFLUIDIC BIOCHIPS Speaker: Fang-Chi Wu, Department of Computer Science and Engineering, National Sun Yat-Sen University, TW Authors: Fang-Chi Wu1, Jian-De Li2, Katherine Shu-Min Li1, Sying-Jyan Wang2 and Tsung-Yi Ho3 1National Sun Yat-sen University, TW; 2National Chung Hsing University, TW; 3National Tsing Hua University, TW Abstract Paper-based digital microfluidic biochip (PB-DMFB) technology is one of the most promising solutions in biochemical applications due to the paper substrate. The paper substrate makes PB-DMFBs more portable, cost-effective, and less dependent on manufacturing equipment. However, the single-layer paper substrate, which entangles electrodes, conductive wires, and droplet routing in the same layer, raises challenges for chip-level synthesis of PB-DMFBs. Furthermore, current design automation tools have to address various design issues including manufacturing cost, reliability, and security. Therefore, a more flexible chip-level synthesis method is necessary. In this paper, we propose the first reinforcement learning-based chip-level synthesis for PB-DMFBs. Double deep Q-learning networks are adopted for the agent to select actions and estimate their values, from which we obtain the optimized synthesis results. Experimental results show that the proposed method is not only effective and efficient for chip-level synthesis but also scalable to reliability- and security-oriented schemes. |
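The adaptive routing idea in 3.4.1 can be pictured with a much simpler stand-in: re-plan a shortest droplet path whenever the sensed health map changes. The sketch below uses plain breadth-first search instead of the paper's stochastic-game synthesis, and the health threshold and map layout are invented for illustration.

```python
from collections import deque

def route(grid_health, src, dst, threshold=0.7):
    """Shortest droplet path over microelectrode cells whose sensed
    health is still above the threshold (4-neighbor BFS)."""
    rows, cols = len(grid_health), len(grid_health[0])
    prev, frontier = {src: None}, deque([src])
    while frontier:
        cell = frontier.popleft()
        if cell == dst:                        # reconstruct the path
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if (0 <= nr < rows and 0 <= nc < cols and nxt not in prev
                    and grid_health[nr][nc] >= threshold):
                prev[nxt] = cell
                frontier.append(nxt)
    return None   # no healthy route; caller may lower threshold or wait

# Re-plan after every health update read back from the sensing layer:
#   path = route(latest_health_map, droplet_pos, target_pos)
```

The real method additionally reasons about the probability of future degradation, which is what the stochastic-game formulation buys over this purely reactive re-planning.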
3.5 Regularity and Optimization for Logic Synthesis
Date: Tuesday, 02 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/ZjjT66x269LfLtPa9
Session chair:
Luca Amaru, Synopsys, US
Session co-chair:
Eleonora Testa, Synopsys, US
The first two papers in this session present algorithms for maximizing the autosymmetry degree of incompletely specified Boolean functions as well as rewriting and resubstitution for XOR-Majority graphs. The third paper presents an ILP-based optimization method for multiplier design.
Time | Label | Presentation Title / Authors |
---|---|---|
17:30 CET | 3.5.1 | PRESERVING SELF-DUALITY DURING LOGIC SYNTHESIS FOR EMERGING RECONFIGURABLE NANOTECHNOLOGIES Speaker: Shubham Rai, TU Dresden, DE Authors: Shubham Rai1, Heinz Riener2, Giovanni De Micheli3 and Akash Kumar1 1TU Dresden, DE; 2EPFL, CH; 3École Polytechnique Fédérale de Lausanne (EPFL), CH Abstract Emerging reconfigurable nanotechnologies allow the implementation of self-dual functions with fewer transistors than traditional CMOS technologies. To achieve better area results for Reconfigurable Field-Effect Transistor (RFET)-based circuits, a large portion of a logic representation must be mapped to self-dual logic gates. This, in turn, depends upon how self-duality is preserved in the logic representation during logic optimization and technology mapping. In the present work, we develop Boolean size-optimization methods, namely rewriting and resubstitution algorithms, using XOR-Majority Graphs (XMGs) as a logic representation aimed at better preserving self-duality during logic optimization. XMGs are more compact for both unate and binate logic functions than conventional logic representations such as And-Inverter Graphs (AIGs) or Majority-Inverter Graphs (MIGs). We evaluate the proposed algorithms over crafted benchmarks (with various levels of self-duality) and cryptographic benchmarks. For cryptographic benchmarks with a high self-duality ratio, the XMG-based logic optimization flow achieves an area reduction of up to 17% compared to AIG-based optimization flows implemented in the academic logic synthesis tool ABC. (A self-duality test sketch follows this table.) |
17:45 CET | 3.5.2 | AUTOSYMMETRY OF INCOMPLETELY SPECIFIED FUNCTIONS Speaker: Valentina Ciriani, Università degli Studi di Milano, IT Authors: Anna Bernasconi1 and Valentina Ciriani2 1Università di Pisa, IT; 2Università degli Studi di Milano, IT Abstract Autosymmetric Boolean functions are "regular functions" that are rather frequent among the Boolean functions describing standard circuits. Autosymmetry is typically exploited to improve the synthesis time and the quality of the optimized circuits. This paper presents the first non-naive study of the autosymmetry of incompletely specified functions, i.e., Boolean functions with don't-care conditions. The theory of autosymmetry for completely specified functions is extended to the incompletely specified case, and a new heuristic algorithm is provided for the detection of autosymmetry. The experimental results validate the theoretical study and show that 77% of the considered benchmarks have an improved autosymmetry degree. |
18:00 CET | IP3_1.1 | SYNTHESIS OF SI CIRCUITS FROM BURST-MODE SPECIFICATIONS Speaker: Alex Chan, Newcastle University, GB Authors: Alex Chan1, Danil Sokolov1, Victor Khomenko1, David Lloyd2 and Alex Yakovlev1 1Newcastle University, GB; 2Dialog Semiconductor, GB Abstract In this paper, we present a new workflow that is based on the conversion of Extended Burst-Mode (XBM) specifications to Signal Transition Graphs (STGs). While XBMs offer a simple design entry to specify asynchronous circuits, they cannot be synthesised into speed-independent (SI) circuits, due to the 'burst mode' timing assumption inherent in the model. Furthermore, XBM synthesis tools are no longer supported, and there are no dedicated tools for formal verification of XBMs. Our approach addresses these issues, by granting the XBMs access to sophisticated synthesis and verification tools available for STGs, as well as the possibility to synthesise SI circuits. Experimental results show that our translation only linearly increases the model size and that our workflow achieves a much improved synthesis success rate, with a 33% average reduction in the literal count. |
18:01 CET | IP3_1.2 | LOW-LATENCY ASYNCHRONOUS LOGIC DESIGN FOR INFERENCE AT THE EDGE Speaker: Adrian Wheeldon, Newcastle University, GB Authors: Adrian Wheeldon1, Alex Yakovlev1, Rishad Shafik1 and Jordan Morris2 1Newcastle University, GB; 2ARM Ltd, Newcastle University, GB Abstract Modern internet of things (IoT) devices leverage machine learning inference using sensed data on-device rather than offloading them to the cloud. Commonly known as inference at-the-edge, this gives many benefits to the users, including personalization and security. However, such applications demand high energy efficiency and robustness. In this paper we propose a method for reduced area and power overhead of self-timed early-propagative asynchronous inference circuits, designed using the principles of learning automata. Due to natural resilience to timing as well as logic underpinning, the circuits are tolerant to variations in environment and supply voltage whilst enabling the lowest possible latency. Our method is exemplified through an inference datapath for a low power machine learning application. The circuit builds on the Tsetlin machine algorithm further enhancing its energy efficiency. Average latency of the proposed circuit is reduced by 10x compared with the synchronous implementation whilst maintaining similar area. Robustness of the proposed circuit is proven through post-synthesis simulation with 0.25 V to 1.2 V supply. Functional correctness is maintained and latency scales with gate delay as voltage is decreased. |
18:02 CET | 3.5.3 | GOMIL: GLOBAL OPTIMIZATION OF MULTIPLIER BY INTEGER LINEAR PROGRAMMING Speaker: Weihua Xiao, Shanghai Jiao Tong University, CN Authors: Weihua Xiao1, Weikang Qian1 and Weiqiang Liu2 1Shanghai Jiao Tong University, CN; 2Nanjing University of Aeronautics and Astronautics, CN Abstract The multiplier is an important arithmetic circuit. State-of-the-art designs consist of a partial product generator (PPG), a compressor tree (CT), and a carry propagation adder (CPA), with the last two components dominating the area and delay. Existing representative works optimize the CT and the CPA separately, adding a rigid boundary between these two components. In this paper, we break the boundary by proposing GOMIL, a global optimization for multipliers by integer linear programming. Two ILP sub-problems are first formulated to optimize the CT and the prefix structure in the CPA, respectively. Then, they are unified to provide a global optimization of the multiplier. The proposed method is applicable not only to multipliers with an AND gate-based PPG, but also to those with a Booth encoding-based PPG. The experimental results show that the multipliers optimized by GOMIL can reduce the power-delay product by up to 71% compared to state-of-the-art multipliers developed in industry. The code of GOMIL is open-source. |
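Self-duality, which paper 3.5.1 above aims to preserve (see the note there), has a definition compact enough to test exhaustively for small functions: f is self-dual iff f(x) = NOT f(NOT x) for every input assignment. A minimal truth-table check, with the example gates chosen purely for illustration:

```python
from itertools import product

def is_self_dual(f, n):
    """Exhaustively test f(x) == NOT f(NOT x) over all 2**n inputs."""
    for bits in product((0, 1), repeat=n):
        neg = tuple(1 - b for b in bits)
        if f(*bits) != 1 - f(*neg):
            return False
    return True

maj3 = lambda a, b, c: int(a + b + c >= 2)   # majority-of-3: self-dual
and2 = lambda a, b: a & b                    # plain AND: not self-dual
print(is_self_dual(maj3, 3), is_self_dual(and2, 2))   # True False
```

Majority being self-dual is exactly why majority-based representations such as XMGs are a natural fit for RFET technologies, where self-dual gates come cheap.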
3.6 RF and High-Speed Design Challenges
Date: Tuesday, 02 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/YAc6b9SZ7aFfw6EPk
Session chair:
Manuel Barragan, TIMA, FR
Session co-chair:
Marc Margalef-Rovira, IEMN, FR
This session focuses on simulation, design and test challenges for state-of-the-art RF and high-speed circuits and systems. The first paper deals with the efficient simulation of noise in RF systems. The second paper proposes an open-source serial link. The third paper deals with the industrial testing of a complete transceiver using simple digital tester channels.
Time | Label | Presentation Title / Authors |
---|---|---|
17:30 CET | 3.6.1 | AN EVENT-DRIVEN SYSTEM-LEVEL NOISE ANALYSIS METHODOLOGY FOR RF SYSTEMS Speaker: Christoph Beyerstedt, RWTH Aachen University, DE Authors: Christoph Beyerstedt, Jonas Meier, Fabian Speicher, Ralf Wunderlich and Stefan Heinen, RWTH Aachen University, DE Abstract This paper presents an approach for system-level noise simulation in the frequency domain for RF systems. The noise analysis is able to depict frequency conversion due to nonlinear or time-varying circuits and can also consider the signal processing in the digital part. Therefore, it is possible to analyze the noise from all parts of the system at a point of choice, e.g., directly at the demodulator input. The approach is based on analog models of the genuine circuit-level implementation used for system-level exploration or verification purposes. It can be integrated into a conventional system-level simulation with nearly no overhead. The noise analysis is implemented on top of an existing event-driven RF simulation approach that uses a combination of SystemVerilog and C++ for modeling. MATLAB is used for post-processing and visualization of the results. |
17:45 CET | 3.6.2 | OPENSERDES: AN OPEN SOURCE PROCESS-PORTABLE ALL-DIGITAL SERIAL LINK Speaker: Gaurav Kumar K, Purdue University, US Authors: Gaurav Kumar K, Baibhab Chatterjee and Shreyas Sen, Purdue University, US Abstract Over the last decade, the growing influence of open-source software has necessitated the need to reduce the abstraction levels in hardware design. Open-source hardware significantly reduces development time, increases the probability of first-pass success, and enables developers to optimize software solutions based on hardware features, thereby reducing design costs. The introduction of the open-source Process Development Kit (OpenPDK) by SkyWater Technology in June 2020 has eliminated the barriers to Application-Specific Integrated Circuit (ASIC) design, which is otherwise considered expensive and not easily accessible. The OpenPDK is the first concrete step towards achieving the goal of open-source circuit blocks that can be imported, reused, and modified in ASIC design. With process technologies scaling down for better performance, the need for entirely digital designs, which can be synthesized in any standard Automatic Place-and-Route (APR) tool, has increased considerably for mapping physical designs to new process technologies. This work presents the first open-source all-digital Serializer/Deserializer (SerDes) for multi-GHz serial links, designed using the SkyWater OpenPDK 130 nm process node. To ensure that the design is fully synthesizable, the SerDes uses CMOS inverter-based drivers at the transmitter, while the receiver front end comprises a resistive-feedback inverter as a sensing element, followed by sampling elements. A fully digital oversampling CDR at the receiver end recovers the transmitter clock for proper decoding of data bits. The physical design flow utilizes OpenLANE, an open-source end-to-end tool for generating GDS from RTL. Cadence Virtuoso has been used for extracting parasitics for post-layout simulations, which exhibit the SerDes functionality at 2 Gbps for 34 dB channel loss while consuming 438 mW. The generated GDS and netlist files of the SerDes, along with the required documentation, are uploaded to a GitHub repository for public access. |
18:00 CET | IP3_3.2 | CONSTRUCTIVE USE OF PROCESS VARIATIONS: RECONFIGURABLE AND HIGH-RESOLUTION DELAY-LINE Speaker: Xiaolin Xu, Northeastern University, US Authors: Wenhao Wang1, Yukui Luo2 and Xiaolin Xu2 1ECE Department of Northeastern University, US; 2Northeastern University, US Abstract The delay-line is a critical circuit component for high-speed electronic design and testing, such as for high-performance FPGAs and ASICs, providing timing signals of specific duration or duty cycle. However, the performance of existing CMOS-based delay-lines is limited by various practical issues. For example, the minimum propagation delay (resolution) of CMOS gates is limited by the process variations from circuit fabrication. This paper presents a novel delay-line scheme which, instead of mitigating the process variations from circuit fabrication, constructively leverages them to generate time signals of specific duration. Moreover, the resolution of the proposed delay-line method is reconfigurable, for which we propose a machine learning modeling method to assist such reconfiguration, i.e., to generate time durations of different scales. The performance of the proposed delay-line is validated with HSPICE simulation and a prototype on a Xilinx Virtex-6 FPGA evaluation kit. The experimental results demonstrate that the proposed delay |
18:01 CET | 3.6.3 | (Best Paper Award Candidate) DIGITAL TEST OF ZIGBEE TRANSMITTERS: VALIDATION IN INDUSTRIAL TEST ENVIRONMENT Speaker: Thibault Vayssade, University of Montpellier, CNRS, LIRMM, FR Authors: Thibault Vayssade1, Florence Azais1, Laurent Latorre1 and François Lefevre2 1University of Montpellier, CNRS, LIRMM, FR; 2NXP Semiconductors, FR Abstract This paper presents the validation of a low-cost solution for production test of ZigBee transmitters in an industrial environment. The solution relies on 1-bit acquisition of a 2.4 GHz signal with a standard digital ATE channel using harmonic sampling. Dedicated post-processing algorithms are then applied to the low-frequency binary vector captured by the ATE to retrieve the RF signal characteristics and implement the tests specified by IEEE Std 802.15.4. Experimental results collected on more than 1,500 units of a ZigBee transceiver from NXP Semiconductors are presented. (An alias-arithmetic sketch follows this table.) |
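The harmonic-sampling idea in 3.6.3 rests on simple alias arithmetic: a tone at f_rf sampled at rate fs appears at the folded frequency min(f_rf mod fs, fs - (f_rf mod fs)). The sketch below picks an assumed round tester rate purely for illustration; it is not NXP's actual setup.

```python
def alias_frequency(f_rf: float, fs: float) -> float:
    """Apparent frequency of a tone at f_rf after sampling at rate fs,
    folded into the first Nyquist zone (all values in Hz)."""
    f = f_rf % fs
    return min(f, fs - f)

f_rf = 2.405e9      # ZigBee channel 11 carrier
fs = 800.0e6        # assumed digital ATE channel rate (illustrative)

# The 3rd harmonic of fs sits at 2.4 GHz, right next to the carrier, so the
# 1-bit capture sees a 5 MHz low-IF image that post-processing can analyze.
print(alias_frequency(f_rf, fs) / 1e6, "MHz")   # -> 5.0 MHz
```

Choosing fs so that one of its harmonics lands near the carrier is what lets a plain digital tester channel double as a crude RF receiver.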
3.7 Lightweight Machine Learning at the Edge
Date: Tuesday, 02 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/xWiyMQQuo7Bq8WctR
Session chair:
Marina Zapater Sancho, HEIG-VD / HES-SO, CH
Session co-chair:
Diana Goehringer, TU Dresden, DE
This session is dedicated to innovative methods and applications for the edge. The first paper introduces a quantization framework, AIQ, to support adaptation at the edge with inference-level bit widths, leveraging gated weight buffering and dynamic error scaling. The second paper proposes a lightweight dedicated hyperdimensional-computing platform that targets low power, high energy efficiency, and low latency, while being configurable to support various applications. The third regular paper and one of the IP papers both optimize communication costs in multi-camera systems: the former uses Long Short-Term Memory networks to overcome the limitations of convolutional networks in fall detection applications, while the latter uses a learning-based super-resolution enhancer to compensate for the resulting performance degradation. Finally, one IP paper proposes a Neural Architecture Search framework featuring dynamic channel scaling, to maximize accuracy under a given latency, and progressive space shrinking, to refine the search space.
Time | Label | Presentation Title / Authors |
---|---|---|
17:30 CET | 3.7.1 | A QUANTIZATION FRAMEWORK FOR NEURAL NETWORK ADAPTION AT THE EDGE Speaker: Mengyuan Li, University of Notre Dame, US Authors: Mengyuan Li and Xiaobo Sharon Hu, University of Notre Dame, US Abstract Edge devices employing a neural network (NN) inference engine running a pre-trained model often perform poorly, or simply fail, in unseen situations. Meta learning, consisting of meta training, NN adaptation and inference, has been shown to be quite effective in quickly learning and responding to a new environment. The adaptation phase, including both forward and backward computation, should be performed on edge devices to maximize the benefit in few-shot learning applications. However, deploying high-precision, full-blown training accelerators at the edge can be rather costly for most Internet of Things applications. This paper reveals some unique observations about the adaptation phase and introduces a quantization framework, AIQ, based on these observations, to support adaptation at the edge with inference-level bit widths. AIQ includes two key ideas, i.e., gated weight buffering and dynamic error scaling, to reduce memory and computational needs with minimal sacrifice in accuracy. Major modules of AIQ are synthesized and evaluated. Experimental results show that AIQ saves 41% and 70% of weight memory for two widely used datasets while incurring minimal hardware overhead and negligible accuracy loss. |
17:45 CET | 3.7.2 | TINY-HD: ULTRA-EFFICIENT HYPERDIMENSIONAL COMPUTING ENGINE FOR IOT APPLICATIONS Speaker: Behnam Khaleghi, University of California San Diego, US Authors: Behnam Khaleghi1, Hanyang Xu1, Justin Morris1 and Tajana Rosing2 1University of California, San Diego, US; 2UCSD, US Abstract Hyperdimensional computing (HD) is a new brain-inspired algorithm that mimics the human brain for cognitive tasks. Despite its inherent potential, the practical efficiency of HD is tied to the underlying hardware, which throttles the efficiency of HD on conventional microprocessors. In this paper, we propose tiny-HD, a lightweight dedicated HD platform that targets low power, high energy efficiency, and low latency, while being configurable to support various applications. We leverage an enhanced HD encoding that alleviates the memory requirements and also simplifies the dataflow, making tiny-HD flexible with an efficient architecture. We further augment tiny-HD by pipelining the stages and sharing resources, as well as by a data layout that enables opportunistic power reduction. We compared tiny-HD in terms of area, performance, power, and energy consumption with state-of-the-art HD platforms. tiny-HD occupies 0.5 mm^2 and consumes 1.6 mW standby and 9.6 mW runtime power (at 400 MHz), with a 0.016 ms latency on a set of IoT benchmarks. tiny-HD consumes an average per-query energy of 160 nJ, which outperforms the state-of-the-art FPGA and ASIC implementations by 95.5x and 11.2x, respectively. (An encode/train/query sketch follows this table.) |
18:00 CET | IP3_4.1 | RESOLUTION-AWARE DEEP MULTI-VIEW CAMERA SYSTEMS Speaker: Zeinab Hakimi, Pennsylvania State University, US Authors: Zeinab Hakimi1 and Vijaykrishnan Narayanan2 1Pennsylvania State University, US; 2Penn State University, US Abstract Recognizing 3D objects from multiple views is an important problem in computer vision. However, multi-view object recognition can be challenging for networked embedded intelligent systems (IoT devices), as they face data transmission limitations as well as computational resource constraints. In this work, we design an enhanced multi-view distributed recognition system which deploys a view-importance estimator to transmit data at different resolutions. Moreover, a multi-view learning-based super-resolution enhancer is used at the back end to compensate for the performance degradation caused by the information loss from resolution reduction. Extensive experiments on the benchmark dataset demonstrate that the designed resolution-aware multi-view system can decrease the endpoint's communication energy by a factor of 5x while sustaining accuracy. Further experiments on the enhanced multi-view recognition system show that an accuracy increase can be achieved with minimal effect on the computational cost of the back-end system. |
18:01 CET | IP3_4.2 | HSCONAS: HARDWARE-SOFTWARE CO-DESIGN OF EFFICIENT DNNS VIA NEURAL ARCHITECTURE SEARCH Speaker: Xiangzhong Luo, Nanyang Technological University, SG Authors: Xiangzhong Luo, Di Liu, Shuo Huai and Weichen Liu, Nanyang Technological University, SG Abstract In this paper, we present a novel multi-objective hardware-aware neural architecture search (NAS) framework, namely HSCoNAS, to automate the design of deep neural networks (DNNs) with high accuracy but low latency on target hardware. To accomplish this goal, we first propose an effective hardware performance modeling method to approximate the runtime latency of DNNs on target hardware, which is integrated into HSCoNAS to avoid tedious on-device measurements. In addition, we propose two novel techniques, i.e., dynamic channel scaling, to maximize accuracy under the specified latency, and progressive space shrinking, to refine the search space towards target hardware and alleviate the search overheads. These two techniques work jointly to allow HSCoNAS to perform fine-grained and efficient explorations. Finally, an evolutionary algorithm (EA) is incorporated to conduct the architecture search. Extensive experiments on ImageNet are conducted on diverse target hardware, i.e., GPU, CPU, and an edge device, to demonstrate the superiority of HSCoNAS over recent state-of-the-art approaches. |
18:02 CET | 3.7.3 | A VIDEO-BASED FALL DETECTION NETWORK BY SPATIO-TEMPORAL JOINT-POINT MODEL ON EDGE DEVICES Speaker: Ziyi Guan, Southern University of Science and Technology, CN Authors: Ziyi Guan1, Shuwei Li1, Yuan Cheng2, Changhai Man1, Wei Mao1, Ngai Wong3 and Hao Yu1 1Southern University of Science and Technology, CN; 2Shanghai Jiao Tong University, CN; 3University of Hong Kong, CN Abstract Tripping or falling is among the top threats in elderly healthcare, and the development of automatic fall detection systems is of considerable importance. With the fast development of the Internet of Things (IoT), camera vision-based solutions have drawn much attention in recent years. Traditional fall video analysis on the cloud incurs significant communication overhead. To overcome these hurdles, this work introduces a fast and lightweight video fall detection network based on a spatio-temporal joint-point model. Instead of detecting falling motion with traditional Convolutional Neural Networks (CNNs), we propose a Long Short-Term Memory (LSTM) model based on time-series joint-point features, extracted by a pose extractor and then filtered by a geometric joint-point filter. Experiments are conducted to verify the proposed framework, which shows a high sensitivity of 98.46% on the Multiple Cameras Fall Dataset and 100% on the UR Fall Dataset. Furthermore, our model can perform pose estimation tasks simultaneously, attaining 73.3 mAP on the COCO keypoint challenge dataset, which outperforms the OpenPose work by 8%. |
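For readers new to hyperdimensional computing (see 3.7.2 above), the entire encode/train/query loop fits in a few lines. This is a generic textbook-style HD classifier, not tiny-HD's enhanced encoding; the dimensionality, feature count, and quantization levels are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096                                                 # hypervector dimensionality
ITEM = {f: rng.choice([-1, 1], D) for f in range(16)}    # feature-id vectors
LEVEL = {v: rng.choice([-1, 1], D) for v in range(8)}    # quantized-value vectors

def encode(sample):
    """Bind each feature id with its level vector (elementwise multiply),
    then bundle all bindings into one bipolar hypervector (sum + sign)."""
    return np.sign(sum(ITEM[f] * LEVEL[v] for f, v in enumerate(sample)))

def train(samples_by_class):
    """One class hypervector = sign of the bundled training encodings."""
    return {c: np.sign(sum(encode(s) for s in ss))
            for c, ss in samples_by_class.items()}

def classify(model, sample):
    """Query: return the class whose hypervector is most similar."""
    q = encode(sample)
    return max(model, key=lambda c: int(np.dot(model[c], q)))

# Usage: samples are length-16 sequences of readings quantized to 0..7, e.g.
#   model = train({"idle": idle_samples, "gesture": gesture_samples})
#   label = classify(model, new_sample)
```

Because every step is elementwise or a dot product, the datapath maps naturally onto the kind of narrow, highly parallel hardware that dedicated HD engines implement.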
3.8 Industrial Design Methods and Tools: RISC-V
Date: Tuesday, 02 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/hdCNhvWzswFnA7Wwp
Organizer:
Jürgen Haase, edacentrum GmbH, DE
This Exhibition Workshop features industrial design methods and tools. It is open to conference delegates as well as to exhibition visitors.
Time | Label | Presentation Title / Authors |
---|---|---|
17:30 CET | 3.8.1 | ANDES RISC-V PROCESSOR IP SOLUTIONS Speaker: Florian Wohlrab, Andes Technology, TW Abstract The SoC industry has seen fast-growing and diversified demand for a wide range of RISC-V based products: from tiny low-power MCUs for consumer devices to chips powering enterprise-grade products and datacenter servers; from one power-efficient core to a thousand GHz+ cores working cohesively. To serve this market, Andes has developed a rich portfolio of AndesCore processor IPs already used in the above scenarios. They range from compact single-issue cores to feature-rich Linux-capable superscalar cores, from cacheless single cores to cache-coherent multicores, and from cores capable of processing floating-point and DSP data to those crunching large volumes of vector data. Building on this solid foundation, Andes continues to enrich its product offerings for higher performance efficiency as well as more flexible configurations. In this talk, we will first give an overview of the existing Andes V5 RISC-V processor lineup and present examples of how V5 processors are used in SoCs. Then, we will introduce V5 IPs newly added to the Andes processor portfolio, the associated software support, and their performance data. We will provide an update on the Andes Custom Extension™ (ACE) and show how it can further accelerate control and data paths in applications. |
IP3_1 Interactive Presentations
Date: Tuesday, 02 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/jpCmxWZZBBXmFoAEm
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Label | Presentation Title / Authors |
---|---|
IP3_1.1 | SYNTHESIS OF SI CIRCUITS FROM BURST-MODE SPECIFICATIONS Speaker: Alex Chan, Newcastle University, GB Authors: Alex Chan1, Danil Sokolov1, Victor Khomenko1, David Lloyd2 and Alex Yakovlev1 1Newcastle University, GB; 2Dialog Semiconductor, GB Abstract In this paper, we present a new workflow that is based on the conversion of Extended Burst-Mode (XBM) specifications to Signal Transition Graphs (STGs). While XBMs offer a simple design entry to specify asynchronous circuits, they cannot be synthesised into speed-independent (SI) circuits, due to the 'burst mode' timing assumption inherent in the model. Furthermore, XBM synthesis tools are no longer supported, and there are no dedicated tools for formal verification of XBMs. Our approach addresses these issues, by granting the XBMs access to sophisticated synthesis and verification tools available for STGs, as well as the possibility to synthesise SI circuits. Experimental results show that our translation only linearly increases the model size and that our workflow achieves a much improved synthesis success rate, with a 33% average reduction in the literal count. |
IP3_1.2 | LOW-LATENCY ASYNCHRONOUS LOGIC DESIGN FOR INFERENCE AT THE EDGE Speaker: Adrian Wheeldon, Newcastle University, GB Authors: Adrian Wheeldon1, Alex Yakovlev1, Rishad Shafik1 and Jordan Morris2 1Newcastle University, GB; 2ARM Ltd, Newcastle University, GB Abstract Modern internet of things (IoT) devices leverage machine learning inference using sensed data on-device rather than offloading them to the cloud. Commonly known as inference at-the-edge, this gives many benefits to the users, including personalization and security. However, such applications demand high energy efficiency and robustness. In this paper we propose a method for reduced area and power overhead of self-timed early-propagative asynchronous inference circuits, designed using the principles of learning automata. Due to natural resilience to timing as well as logic underpinning, the circuits are tolerant to variations in environment and supply voltage whilst enabling the lowest possible latency. Our method is exemplified through an inference datapath for a low power machine learning application. The circuit builds on the Tsetlin machine algorithm further enhancing its energy efficiency. Average latency of the proposed circuit is reduced by 10x compared with the synchronous implementation whilst maintaining similar area. Robustness of the proposed circuit is proven through post-synthesis simulation with 0.25 V to 1.2 V supply. Functional correctness is maintained and latency scales with gate delay as voltage is decreased. |
IP3_2 Interactive Presentations
Date: Tuesday, 02 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/G2ovS3MSZWxuNM6cP
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Label | Presentation Title / Authors |
---|---|
IP3_2.1 | GENERIC SAMPLE PREPARATION FOR DIFFERENT MICROFLUIDIC PLATFORMS Speaker: Sudip Poddar, Johannes Kepler University Linz, Austria, AT Authors: Sudip Poddar1, Gerold Fink2, Werner Haselmayr1 and Robert Wille2 1Johannes Kepler University, AT; 2Johannes Kepler University Linz, AT Abstract Sample preparation plays a crucial role in several medical applications. Microfluidic devices, or Labs-on-Chips (LoCs), have become established as a suitable solution to realize this task in a miniaturized, integrated, and automatic fashion. Over the years, a variety of different microfluidic platforms has emerged, each with its respective pros and cons. Accordingly, numerous approaches for sample preparation have been proposed, each specialized for a single platform only. In this work, we propose an idea towards a generic sample preparation approach that generalizes the constraints of the different microfluidic platforms and thereby provides a platform-independent sample preparation method. This allows designers to quickly check which existing platform is most suitable for the considered task, and to easily support upcoming and future microfluidic platforms as well. We illustrate the applicability of the proposed method with examples for various platforms. |
IP3_2.2 | RAISE: A RESISTIVE ACCELERATOR FOR SUBJECT-INDEPENDENT EEG SIGNAL CLASSIFICATION Speaker: Fan Chen, Duke University, US Authors: Fan Chen1, Linghao Song1, Hai (Helen) Li2 and Yiran Chen1 1Duke University, US; 2Duke University/TUM-IAS, US Abstract State-of-the-art deep neural networks (DNNs) for electroencephalography (EEG) signal classification focus on subject-related tasks, in which the test data and the training data need to be collected from the same subject. In addition, due to limited computing resources and strict power budgets at the edge, it is very challenging to deploy the inference of such DNN models on biological devices. In this work, we present an algorithm/hardware co-designed low-power accelerator for subject-independent EEG signal classification. We propose a compact neural network that is capable of identifying the common and stable structure among subjects. Based on it, we realize a robust subject-independent EEG signal classification model that can be extended to multiple BCI tasks with minimal overhead. On top of this model, we present RAISE, a low-power processing-in-memory inference accelerator leveraging emerging resistive memory. We compare the proposed model and hardware accelerator to prior art across various BCI paradigms. We show that our model achieves the best subject-independent classification accuracy, while RAISE achieves a 2.8x power reduction and a 2.5x improvement in performance per watt compared to the state-of-the-art resistive inference accelerator. |
IP3_3 Interactive Presentations
Date: Tuesday, 02 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/fd6MteFcgFbqQecQy
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Label | Presentation Title / Authors |
---|---|
IP3_3.1 | DOUBLE DQN FOR CHIP-LEVEL SYNTHESIS OF PAPER-BASED DIGITAL MICROFLUIDIC BIOCHIPS Speaker: Fang-Chi Wu, Department of Computer Science and Engineering, National Sun Yat-Sen University, TW Authors: Fang-Chi Wu1, Jian-De Li2, Katherine Shu-Min Li1, Sying-Jyan Wang2 and Tsung-Yi Ho3 1National Sun Yat-sen University, TW; 2National Chung Hsing University, TW; 3National Tsing Hua University, TW Abstract Paper-based digital microfluidic biochip (PB-DMFB) technology is one of the most promising solutions in biochemical applications due to the paper substrate. The paper substrate makes PB-DMFBs more portable, cost-effective, and less dependent on manufacturing equipment. However, the single-layer paper substrate, which entangles electrodes, conductive wires, and droplet routing in the same layer, raises challenges for chip-level synthesis of PB-DMFBs. Furthermore, current design automation tools have to address various design issues including manufacturing cost, reliability, and security. Therefore, a more flexible chip-level synthesis method is necessary. In this paper, we propose the first reinforcement learning-based chip-level synthesis for PB-DMFBs. Double deep Q-learning networks are adopted for the agent to select actions and estimate their values, from which we obtain the optimized synthesis results. Experimental results show that the proposed method is not only effective and efficient for chip-level synthesis but also scalable to reliability- and security-oriented schemes. |
IP3_3.2 | CONSTRUCTIVE USE OF PROCESS VARIATIONS: RECONFIGURABLE AND HIGH-RESOLUTION DELAY-LINE Speaker: Xiaolin Xu, Northeastern University, US Authors: Wenhao Wang1, Yukui Luo2 and Xiaolin Xu2 1ECE Department of Northeastern University, US; 2Northeastern University, US Abstract The delay-line is a critical circuit component for high-speed electronic design and testing, such as for high-performance FPGAs and ASICs, providing timing signals of specific duration or duty cycle. However, the performance of existing CMOS-based delay-lines is limited by various practical issues. For example, the minimum propagation delay (resolution) of CMOS gates is limited by the process variations from circuit fabrication. This paper presents a novel delay-line scheme which, instead of mitigating the process variations from circuit fabrication, constructively leverages them to generate time signals of specific duration. Moreover, the resolution of the proposed delay-line method is reconfigurable, for which we propose a machine learning modeling method to assist such reconfiguration, i.e., to generate time durations of different scales. The performance of the proposed delay-line is validated with HSPICE simulation and a prototype on a Xilinx Virtex-6 FPGA evaluation kit. The experimental results demonstrate that the proposed delay |
IP3_4 Interactive Presentations
Date: Tuesday, 02 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/c3kZHSMFp9WHTDNNG
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Label | Presentation Title / Authors |
---|---|
IP3_4.1 | RESOLUTION-AWARE DEEP MULTI-VIEW CAMERA SYSTEMS Speaker: Zeinab Hakimi, Pennsylvania State University, US Authors: Zeinab Hakimi1 and Vijaykrishnan Narayanan2 1Pennsylvania State University, US; 2Penn State University, US Abstract Recognizing 3D objects from multiple views is an important problem in computer vision. However, multi-view object recognition can be challenging for networked embedded intelligent systems (IoT devices), as they face data transmission limitations as well as computational resource constraints. In this work, we design an enhanced multi-view distributed recognition system which deploys a view-importance estimator to transmit data at different resolutions. Moreover, a multi-view learning-based super-resolution enhancer is used at the back end to compensate for the performance degradation caused by the information loss from resolution reduction. Extensive experiments on the benchmark dataset demonstrate that the designed resolution-aware multi-view system can decrease the endpoint's communication energy by a factor of 5x while sustaining accuracy. Further experiments on the enhanced multi-view recognition system show that an accuracy increase can be achieved with minimal effect on the computational cost of the back-end system. |
IP3_4.2 | HSCONAS: HARDWARE-SOFTWARE CO-DESIGN OF EFFICIENT DNNS VIA NEURAL ARCHITECTURE SEARCH Speaker: Xiangzhong Luo, Nanyang Technological University, SG Authors: Xiangzhong Luo, Di Liu, Shuo Huai and Weichen Liu, Nanyang Technological University, SG Abstract In this paper, we present a novel multi-objective hardware-aware neural architecture search (NAS) framework, namely HSCoNAS, to automate the design of deep neural networks (DNNs) with high accuracy but low latency on target hardware. To accomplish this goal, we first propose an effective hardware performance modeling method to approximate the runtime latency of DNNs on target hardware, which is integrated into HSCoNAS to avoid tedious on-device measurements. In addition, we propose two novel techniques, i.e., dynamic channel scaling, to maximize accuracy under the specified latency, and progressive space shrinking, to refine the search space towards target hardware and alleviate the search overheads. These two techniques work jointly to allow HSCoNAS to perform fine-grained and efficient explorations. Finally, an evolutionary algorithm (EA) is incorporated to conduct the architecture search. Extensive experiments on ImageNet are conducted on diverse target hardware, i.e., GPU, CPU, and an edge device, to demonstrate the superiority of HSCoNAS over recent state-of-the-art approaches. |
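The search loop described in IP3_4.2 can be caricatured in a few lines: sample architectures, keep those whose predicted latency fits the budget, and mutate the most accurate survivors. The two predictors below are throwaway stand-ins for HSCoNAS's trained latency and accuracy models, and the kernel-size search space is invented purely for illustration.

```python
import random

random.seed(0)
CHOICES = [3, 5, 7]                        # kernel sizes per layer (example)

def random_arch(n_layers=8):
    return [random.choice(CHOICES) for _ in range(n_layers)]

def predicted_latency(arch):               # placeholder hardware model
    return sum(k * k for k in arch) * 0.01

def predicted_accuracy(arch):              # placeholder accuracy proxy
    return sum(arch) / (7 * len(arch))

def evolve(budget_ms, pop=20, gens=30, mutate_p=0.2):
    """Latency-constrained evolutionary search over the toy space."""
    population = [random_arch() for _ in range(pop)]
    for _ in range(gens):
        feasible = [a for a in population if predicted_latency(a) <= budget_ms]
        parents = sorted(feasible or population,
                         key=predicted_accuracy)[-pop // 2:]
        children = [[random.choice(CHOICES) if random.random() < mutate_p else k
                     for k in p] for p in parents]
        population = parents + children
    feasible = [a for a in population if predicted_latency(a) <= budget_ms]
    return max(feasible, key=predicted_accuracy) if feasible else None

print(evolve(budget_ms=2.5))   # best toy architecture under a 2.5 ms budget
```

Dynamic channel scaling and progressive space shrinking would act on CHOICES and on the mutation step here; the constrained select-and-mutate loop itself is the part this sketch reproduces.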
UB.08 University Booth
Date: Tuesday, 02 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/yyKegZc6NTxSCQx6t
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title / Authors |
---|---|
UB.08 | DEFACTO: DESIGN AUTOMATION FOR SMART FACTORIES Speaker: Michele Lora, University of Verona, IT Authors: Michele Lora1, Pierluigi Nuzzo2 and Franco Fummi1 1University of Verona, IT; 2University of Southern California, US Abstract The DeFacto project develops modeling paradigms, algorithms, and tools for the design of advanced manufacturing systems. Central to the project is the CHASE framework, combining a pattern-based specification language with a rigorous synthesis and verification back-end based on assume-guarantee contracts. The front-end supports automatic translation of requirements into low-level mathematical languages. The synthesis and verification back-end uses the mathematical formalism of contracts to reason about the design from specification to implementation. The demonstration shows the application of CHASE to the design of the control software governing a set of manufacturing tasks. Components and operations are specified in CHASE and formalized using contracts. CHASE coordinates its back-end tools to validate system requirements and to generate and validate implementations, highlighting the effectiveness of the decomposition mechanisms provided by contracts in the design of complex systems. |
UB.09 University Booth
Date: Tuesday, 02 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/xGkFbnq8AAbkFkkCu
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title / Authors |
---|---|
UB.09 | GREYHOUND: DEEP FUZZING IOT WIRELESS PROTOCOLS Speaker: Matheus E. Garbelini, Singapore University of Technology and Design, SG Authors: Sudipta Chattopadhyay and Matheus E. Garbelini, Singapore University of Technology and Design, SG Abstract In this booth, we present our recent work on automatically discovering Internet-of-Things (IoT) wireless protocol vulnerabilities. We discuss how well-understood IoT wireless protocol features can go wrong at the design or implementation phase and contribute to the latest relevant Wi-Fi and Bluetooth vulnerabilities that challenge our current trust in IoT technologies. We also dive deep into state-of-the-art wireless testing, which currently lacks proper tools compared to common software testing, and present a unique insight into how to apply over-the-air testing to discover wireless vulnerabilities using off-the-shelf hardware. Lastly, we present the core ideas of our fuzzing approach (nicknamed "Greyhound") that made the discovery of SweynTooth possible (a set of Bluetooth Low Energy vulnerabilities affecting millions of IoT products) and discuss why related vulnerabilities can sit under the nose of wireless system-on-chip vendors for many years without notice. |
UB.10 University Booth
Date: Tuesday, 02 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/gvcJjbzpJ4ETDuGyh
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title / Authors |
---|---|
UB.10 | SRAM-PUF: PLATFORM FOR ACQUISITION OF SRAM-BASED PUFS FROM MICRO-CONTROLLERS Speaker: Sergio Vinagrero, University of Grenoble Alpes, FR Authors: Sergio Vinagrero1, Honorio Martin2, Ioana Vatajelu3 and Giorgio Di Natale3 1University of Grenoble Alpes, FR; 2University Carlos III of Madrid, ES; 3TIMA-CNRS, FR Abstract This demonstration shows a versatile platform for acquiring the power-up content of SRAM memories embedded in microcontrollers. The platform is able to power hundreds of microcontrollers off and on and to retrieve the content of their SRAMs through a scan chain connecting all boards. The collected data is then stored in a database to enable reliability analysis. |
4.1 Digital Twins
Date: Wednesday, 03 February 2021
Time: 07:00 CET - 08:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/rTGM9SfGvaQh2pYXM
Session chair:
Roland Jancke, Fraunhofer IIS, Division Engineering of Adaptive Systems EAS, DE
Session co-chair:
Jean-Marie Brunet, Mentor, US
Organizers:
Enrico Macii, Politecnico di Torino, IT
Frank Schirrmeister, Cadence, US
Digital Twins (DTs) are becoming essential instruments in the process of industry digitalization. This session presents different applications of digital twins. First, DTs can be used to enable and support car-as-a-service. Second, they can be exploited to monitor and optimize extra-functional aspects of production lines, such as energy consumption and communications. Next, cognitive DTs are introduced as the next stage in the advancement of digital twins, helping to realize the vision of Industry 4.0. Finally, technologies for dynamically introducing fault structures into DTs without the need to change the virtual prototype model are presented.
Time | Label | Presentation Title / Authors |
---|---|---|
07:00 CET | 4.1.1 | ENABLING AND SUPPORTING CAR-AS-A-SERVICE BY DIGITAL TWIN MODELING AND DEPLOYMENT Speaker: Charles Steinmetz, University of Applied Sciences Hamm-Lippstadt, DE Authors: Charles Steinmetz1, Greyce N. Schroeder2, Achim Rettberg3, Ricardo Nagel Rodrigues4 and Carlos Eduardo Pereira2 1Hochschule Hamm-Lippstadt - Campus Lippstadt, DE; 2Federal University of Rio Grande do Sul, BR; 3University of Applied Science Hamm-Lippstadt & University Oldenburg, DE; 4FURG, BR Abstract The Smart City is one area of application for the Internet of Things (IoT), and it has been attracting attention from both academia and industry. Cities will be composed of autonomous parts that communicate and provide services to each other. For instance, cars (autonomous or not) may be seen as a service that transports people from one point to another. Interactions between users and these kinds of services will grow, making it necessary to digitize all these parts of the Smart City. The Digital Twin (DT) concept proposes that real-world assets have a virtual representation connecting the physical world with the cyber world. This allows tracking the whole life-cycle of the object as well as performing simulations with current or previously stored data. In this context, this work proposes the use of Digital Twins to enable and support car-as-a-service (CaaS). A case study has been developed to demonstrate the modeling and deployment of the Digital Twin, highlighting how this concept can be one of the key enablers of CaaS. |
07:15 CET | 4.1.2 | DIGITAL TWIN EXTENSION WITH EXTRA-FUNCTIONAL PROPERTIES Speaker: Sara Vinco, Politecnico di Torino, IT Authors: Khaled Alamin1, Nicola Dall'Ora2, Enrico Fraccaroli3, Sara Vinco1, Davide Quaglia2 and Massimo Poncino1 1Politecnico di Torino, IT; 2University of Verona, IT; 3Università degli Studi di Verona, IT Abstract Digital twins of production lines do not have to focus solely on the management of the production process; they can also monitor and optimize other extra-functional aspects such as energy consumption and communications. This paper proposes the extension of the digital twin concept in these directions. First, we extend the digital twin with models of energy consumption that allow the monitoring of production-line components throughout the production lifetime. Then, we propose a flow to design the communication network starting from information obtained from the digital twin concerning the production, usage and flow of information through the plant. All these methodologies start from the production-line specification, enrich it with data collected during operation, and finally use this information to perform design and optimization. Results are shown for a real Industry 4.0 research facility. |
07:30 CET | 4.1.3 | COGNITIVE DIGITAL TWIN FOR MANUFACTURING SYSTEMS Speaker: Mohammad Abdullah Al Faruque, University of California, Irvine, US Authors: Mohammad Al Faruque1, Deepan Muthirayan2, Shih-Yuan Yu2 and Pramod P. Khargonekar2 1University of California Irvine, US; 2University of California, Irvine, US Abstract A digital twin is the virtual replica of a physical system. Digital twins are useful because they provide models and data for the design, production, operation, diagnostics, and autonomy of machines and products. Hence, the digital twin has been projected as a key enabler of the vision of Industry 4.0. The digital twin concept has become increasingly sophisticated and capable over time, enabled by many technologies. In this paper, we propose the cognitive digital twin as the next stage in the advancement of the digital twin that will help realize the vision of Industry 4.0. Cognition, inspired by advancements in cognitive science, machine learning, and artificial intelligence, will enable a digital twin to achieve some critical elements of cognition, e.g., attention (selective focusing), perception (forming useful representations of data), and memory (encoding and retrieval of information and knowledge). Our main thesis is that cognitive digital twins will allow enterprises to creatively, effectively, and efficiently exploit implicit knowledge drawn from the experience of existing manufacturing systems, and will enable the transfer of higher-performance decisions and control, improving performance across the enterprise at scale. Finally, we present open questions and challenges in realizing these capabilities in a digital twin. |
07:45 CET | 4.1.4 | DYNAMIC FAULT INJECTION INTO DIGITAL TWINS OF SAFETY-CRITICAL SYSTEMS Speaker: Thomas Markwirth, Fraunhofer EAS/IIS, DE Authors: Thomas Markwirth1, Roland Jancke2 and Christoph Sohrmann2 1Fraunhofer EAS/IIS, DE; 2Fraunhofer IIS/EAS, DE Abstract In this work we present a technology for dynamically introducing fault structures into digital twins without the need to change the virtual prototype model. The injection is done at the beginning of a simulation by dynamically rewiring the involved netlists. During simulation on a real-time platform, faults can be activated or deactivated, triggered by sequences, statistical effects, or events from the real world. In some cases the fault structures can even be auto-generated directly from a formal specification, which further automates the development process for safety-relevant systems. The approach is demonstrated on a SystemC/SystemC AMS virtual prototype of a safety-critical sub-system running on dSPACE real-time hardware. (An illustrative rewiring sketch follows this table.) |
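The rewiring mechanism of 4.1.4 can be pictured with a generic saboteur pattern. The actual work targets SystemC/SystemC AMS prototypes on dSPACE hardware; this Python sketch only illustrates the idea of splicing a controllable fault node into a net without touching the component model. All class and port names are hypothetical.

```python
class Net:
    """A trivially simple signal carrier in a toy netlist model."""
    def __init__(self, value=0):
        self.value = value
    def read(self):
        return self.value

class StuckAtSaboteur:
    """Fault node spliced between a port and its driving net; transparent
    until activated, then forces a stuck-at value."""
    def __init__(self, victim, stuck_value):
        self.victim, self.stuck_value, self.active = victim, stuck_value, False
    def read(self):
        return self.stuck_value if self.active else self.victim.read()

def rewire(component, port_name, stuck_value):
    """Splice a saboteur onto a component port at elaboration time,
    leaving the component model itself untouched."""
    fault = StuckAtSaboteur(getattr(component, port_name), stuck_value)
    setattr(component, port_name, fault)
    return fault

# During simulation (names hypothetical):
#   fault = rewire(brake_ctrl, "wheel_speed_in", stuck_value=0)
#   fault.active = True   # triggered by a sequence, statistics, or an event
```

Because the saboteur is inserted by rewiring rather than by editing the model, the same virtual prototype can run both nominal and fault campaigns.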
4.2 Computing for Autonomy
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 07:00 CET - 08:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/RbqNQAW2gYDXKksno
Session chair:
Saibal Mukhopadhyay, Georgia Tech, US
Session co-chair:
Dimitrios Serpanos, University of Patras, GR
Organizer:
Marilyn Wolf, University of Nebraska at Lincoln, US
Autonomous systems present computational loads that are profoundly different from both traditional algorithmic workloads and non-real-time machine learning. Autonomous systems must operate in real time with low latency for critical operations, and they must do so on limited power and thermal budgets given their untethered operation. Machine learning systems apply a wide range of numerical precision to manage bandwidth and power. Autonomous systems must also meet strict safety and security requirements. As a result, computing for autonomy will require novel computing devices and systems at the node and network levels. This special session will explore computing for autonomy as a co-design problem: how do we understand the requirements that autonomy poses on computing systems, and how do we build computing platforms that meet those requirements without overdesign?
Time | Label | Presentation Title Authors |
---|---|---|
07:00 CET | 4.2.1 | MISSION SPECIFICATION AND EXECUTION OF MULTIDRONE SYSTEMS Speaker: Markus Gutmann, University of Klagenfurt, AT Authors: Markus Gutmann1 and Bernhard Rinner2 1Alpen-Adria Universität Klagenfurt, AT; 2University of Klagenfurt, AT Abstract Small unmanned aerial vehicles, commonly called drones, enable novel applications in many domains. Multidrone systems are a current key trend where several drones operate collectively as an integrated networked autonomous system to complete various missions. The specification and execution of multidrone missions are particularly challenging, since substantial expertise of the mission domain, the drone’s capabilities, and the drones’ software environment is required to properly encode the mission. In this position paper, we introduce a specification language for multidrone missions and describe the transcoding of its components into the multidrone execution environment for both simulations and real drones. The key features of our approach include (i) domain-independence of the mission specification, (ii) readability and ease of use, and (iii) expandability. The specification language has a simple syntax and uses a parameterized description of execution blocks and mission capabilities, which are derived from native drone functions. Domain-independence and expandability are provided by a clear separation between the specification and the implementation of the mission tasks. We demonstrate the effectiveness of our approach with a selected multidrone mission example. |
07:15 CET | 4.2.2 | PERCEPTION COMPUTING-AWARE CONTROLLER SYNTHESIS FOR AUTONOMOUS SYSTEMS Speaker: Samarjit Chakraborty, UNC Chapel Hill, US Authors: Clara Hobbs1, Debayan Roy2, Sridhar Duggirala1, F. Donelson Smith1, Soheil Samii3, James Anderson1 and Samarjit Chakraborty1 1UNC Chapel Hill, US; 2TU Munich, DE; 3General Motors, US Abstract Feedback control loops are ubiquitous in any autonomous system. The design flow for any controller starts by determining a control strategy, while abstracting away all implementation details. However, when designing controllers for autonomous systems, there is significant computation associated with the perception modules. For example, this involves vision processing using deep neural networks on multicore CPU+accelerator platforms. Such computation can be organized in many different ways, with each choice resulting in very different sensor-to-actuator delays and tradeoffs between cost, delay, and accuracy. Further, each of these choices requires the control strategy to be designed accordingly. It is not possible for a control designer to enumerate and account for all of these choices manually, or abstract them away as "implementation details" as done in traditional controller design. In this paper we outline this problem and discuss how automated controller-synthesis techniques could help in addressing it. |
07:30 CET | 4.2.3 | CLOSED-LOOP APPROACH TO PERCEPTION IN AUTONOMOUS SYSTEM Speaker: Saibal Mukhopadhyay, Georgia Institute of Technology, US Authors: Saibal Mukhopadhyay1, Kruttidipta Samal1 and Marilyn Wolf2 1Georgia Institute of Technology, US; 2University of Nebraska, US Abstract Currently, functional tasks within autonomous systems are balkanized into several sub-systems, such as object detection, tracking, motion planning, and multi-sensor fusion, which are developed and tested in isolation. In recent times, deep learning has been used in perception systems for improved accuracy, but such algorithms are not adaptive to the transient real-world requirements of an autonomous system, such as latency and energy. These limitations are critical for resource-constrained systems such as autonomous drones. Therefore, a holistic closed-loop system design is required for building reliable and efficient perception systems for autonomous drones. The closed-loop perception system creates focus-of-attention-based feedback from the end task, such as motion planning, to control computation within the deep neural networks (DNNs) used in early perception tasks such as object detection. We observe that this closed-loop perception system improves the resource utilization of resource-hungry DNNs within the perception system with minimal impact on motion planning. |
07:45 CET | 4.2.4 | COMPUTING FOR CONTROL AND CONTROL FOR COMPUTING Speaker: Justin Bradley, University of Nebraska-Lincoln, US Authors: Xinkai Zhang and Justin Bradley, University of Nebraska-Lincoln, US Abstract Computing can be thought of as a service provided to a system to yield actionable tasks enacted by physical hardware. But rarely is control thought to be in the service of enhancing computation. Consideration of that perspective is what motivates co-regulation, our framework for holistic cyber-physical control of autonomous vehicles. In this paper we elaborate on how co-regulation will enable the next generation of autonomous vehicles precisely because it considers computation as both an enabler and a consumer of autonomous behavior. We report on the latest advances in this space, showing how co-regulation outperforms event-triggered, self-triggered, and fixed-rate control strategies, yielding more robustness and adaptivity to changing and uncertain conditions -- a requirement for next-generation autonomous vehicles. We then describe a co-regulated decision-making algorithm based on Markov Decision Processes, showing how full consideration of computational resource allocation can increase decision-making capabilities in uncertain environments. A toy sketch of the co-regulation idea follows this listing. |
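To make the co-regulation idea in 4.2.4 concrete, here is a minimal Python sketch in which the sampling period of a control loop is itself regulated, so that computation is spent where the physical dynamics need it. The gains, the first-order plant, and the period-update rule are illustrative assumptions, not the authors' implementation.

```python
# A minimal co-regulation sketch (assumptions: simple P-controller, crude
# first-order plant, ad-hoc period-update rule). The sampling period is
# itself regulated: a large tracking error buys more CPU (shorter period),
# a small error releases it to other tasks.

def co_regulated_step(x, x_ref, period, kp=0.8, rate_gain=0.05,
                      period_min=0.01, period_max=0.5):
    """One step of plant control plus regulation of the sampling period."""
    error = x_ref - x
    u = kp * error                          # physical control action
    # Cyber control: shrink the period while the error is large,
    # relax it as the error dies out.
    period -= rate_gain * (abs(error) - 0.1 * period)
    period = min(max(period, period_min), period_max)
    return u, period

x, period = 0.0, 0.1
for _ in range(50):
    u, period = co_regulated_step(x, x_ref=1.0, period=period)
    x += period * u                         # integrate the toy plant
print(f"state={x:.3f}, final period={period * 1000:.1f} ms")
```

The point of the sketch is the second output: once the state settles, the loop voluntarily runs slower, freeing computation for other consumers, which is the resource view of control that co-regulation formalizes.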
4.3 Approximate computing in processors and GPUs
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 07:00 CET - 07:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/nzwuAJHHfZnQqBMZP
Session chair:
Benjamin Carrion Schaefer, University of Texas at Dallas, US
Session co-chair:
Alberto Bosio, École Centrale de Lyon, FR
Approximate computing has been shown to deliver very good results on a variety of hardware platforms, and this session highlights the breadth of the area. The first contribution presents approximation techniques for GPUs, the second targets general-purpose processors, and the last targets neural network accelerators. The two IP papers further corroborate this, the first introducing a precision-tunable floating-point multiplier for GPUs and the second an instruction-set simulator framework for evaluating approximations in CPUs.
Time | Label | Presentation Title Authors |
---|---|---|
07:00 CET | 4.3.1 | QSLC: QUANTIZATION-BASED, LOW-ERROR SELECTIVE APPROXIMATION FOR GPUS Speaker: Sohan Lal, TU Berlin, DE Authors: Sohan Lal, Jan Lucas and Ben Juurlink, TU Berlin, DE Abstract GPUs use a large memory access granularity (MAG) that often results in a low effective compression ratio for memory compression techniques. The low effective compression ratio is caused by a significant fraction of compressed blocks that have a few bytes above a multiple of MAG. While MAG-aware selective approximation, based on a tree structure, has been used to increase the effective compression ratio and the performance gain, approximation results in a high error that is reduced by using complex optimizations. We propose a simple quantization-based approximation technique (QSLC) that can also selectively approximate a few bytes above MAG. While the quantization-based approximation technique has a similar performance to the state-of-the-art tree-based selective approximation, the average error for the quantization-based technique is 5× lower. We further trade off the two techniques and show that the area and power overheads of the quantization-based technique are 12.1× and 7.6× lower than the state-of-the-art, respectively. Our sensitivity analysis across different block sizes further shows the opportunities and the significance of MAG-aware selective approximation. A toy sketch of MAG-aware selective approximation follows this session's listing. |
07:15 CET | 4.3.2 | VALUE SIMILARITY EXTENSIONS FOR APPROXIMATE COMPUTING IN GENERAL-PURPOSE PROCESSORS Speaker: Younghoon Kim, Purdue University, US Authors: Younghoon Kim1, Swagath Venkataramani2, Sanchari Sen3 and Anand Raghunathan1 1Purdue University, US; 2IBM T. J. Watson Research Center, US; 3IBM T.J. Watson Research Center, US Abstract Approximate Computing (AxC) is a popular design paradigm wherein selected computations are executed approximately to gain efficiency with minimal impact on application-level quality. Most efforts in AxC target specialized accelerators and domain-specific processors, with relatively limited focus on General-Purpose Processors (GPPs). However, GPPs are still broadly used to execute applications that are amenable to AxC, making AxC for GPPs a critical challenge. A key bottleneck in applying AxC to GPPs is that their execution units account for only a small fraction of total energy, requiring a holistic approach targeting compute, memory and control front-ends. This paper proposes such an approach that leverages the application property of value similarity, i.e., input operands to computations that occur close-in-time take similar values. Such similar computations are dynamically pre-detected and the fetch-decode-execute of entire instruction sequences are skipped to benefit performance. To this end, we propose a set of lightweight micro-architectural and ISA extensions called VSX that enable: (i) similarity detection amongst values in a cache-line, (ii) skipping of pre-defined instructions and/or loop iterations when similarity is detected, and (iii) substituting outputs of skipped instructions with saved results from previously executed computations. We also develop compiler techniques, guided by user annotations, to benefit from VSX in the context of common Machine Learning (ML) kernels. Our RTL implementation of VSX for a low-power RISC-V processor incurred 2.13% area overhead and yielded 1.19X-3.84X speedup with <0.5% accuracy loss on 6 ML benchmarks. |
07:30 CET | IP5_5.2 | TRULOOK: A FRAMEWORK FOR CONFIGURABLE GPU APPROXIMATION Speaker: Mohsen Imani, University of California Irvine, US Authors: Ricardo Garcia1, Fatemeh Asgarinejad1, Behnam Khaleghi1, Tajana Rosing1 and Mohsen Imani2 1University of California San Diego, US; 2University of California Irvine, US Abstract In this paper, we propose TruLook, a framework that employs approximate computing techniques for GPU acceleration through computation reuse as well as approximate arithmetic operations to eliminate redundant and unnecessary exact computations. To enable computational reuse, the GPU is enhanced with small lookup tables placed close to the stream cores, which return already-computed values for exact and potentially inexact matches. Inexact matching is subject to a threshold that is controlled by the number of mantissa bits involved in the search. Approximate arithmetic is provided by a configurable approximate multiplier that dynamically detects and approximates operations which are not significantly affected by approximation. TruLook guarantees the accuracy bound required for an application by configuring the hardware at runtime. We have evaluated TruLook's efficiency on a wide range of multimedia and deep learning applications. Our evaluation shows that with a 0% and a less-than-1% quality loss budget, TruLook yields on average 2.1× and 5.6× energy-delay product improvements, respectively, on four popular networks on the ImageNet dataset. A toy sketch of the lookup-based reuse idea follows this session's listing. |
07:31 CET | IP4_1.2 | (Best Paper Award Candidate) AXPIKE: INSTRUCTION-LEVEL INJECTION AND EVALUATION OF APPROXIMATE COMPUTING Speaker: Isaias Felzmann, University of Campinas, BR Authors: Isaías Bittencourt Felzmann1, João Fabrício Filho2 and Lucas Wanner3 1University of Campinas, BR; 2Unicamp/UTFPR, BR; 3Unicamp, BR Abstract Representing the interaction between accurate and approximate hardware modules at the architecture level is essential to understand the impact of Approximate Computing in a general-purpose computing scenario. However, extensive effort is required to model approximations in a baseline instruction-level simulator and collect its execution metrics. In this work, we present the AxPIKE ISA simulation environment, a tool that allows designers to inject models of hardware approximation at the instruction level and evaluate their impact on the quality of results. AxPIKE embeds a high-level representation of a RISC-V system and produces a dedicated control mechanism that allows the simulated software to manage the approximate behavior of compatible execution scenarios. The environment also provides detailed execution statistics that are forwarded to dedicated tools for energy accounting. We apply the AxPIKE environment to inject integer multiplication and memory access approximations into different applications and demonstrate how the generated statistics are translated into energy-quality trade-offs. |
07:32 CET | 4.3.3 | A 1D-CRNN INSPIRED RECONFIGURABLE PROCESSOR FOR NOISE-ROBUST LOW-POWER KEYWORDS RECOGNITION Speaker: Bo Liu, Southeast University, CN Authors: Bo Liu, Zeyu Shen, Lepeng Huang, Yu Gong, Zilong Zhang and Hao Cai, Southeast University, CN Abstract A low-power, high-accuracy reconfigurable processor is proposed for noise-robust keyword recognition and evaluated in 22nm technology; it is based on an optimized one-dimensional convolutional recurrent neural network (1D-CRNN). In traditional DNN-based keyword recognition systems, speech feature extraction based on traditional algorithms and DNN-based keyword classification are two independent modules. Compared to the traditional architecture, both feature extraction and keyword classification are processed by the proposed 1D-CRNN, with weight/data bit widths quantized to 8/8 bits. Therefore, a unified training and optimization framework can be applied across various application scenarios and input loads. The proposed 1D-CRNN-based keyword recognition system can achieve higher recognition accuracy with fewer computation operations. Based on system-architecture co-design, an energy-efficient DNN accelerator is proposed that can be dynamically reconfigured to process the 1D-CRNN with different configurations. The processing circuits of the accelerator are optimized to further improve energy efficiency using a fine-grained precision-reconfigurable approximate multiplier. Compared to state-of-the-art architectures, this work can support real-time recognition of one to five keywords with lower power consumption, while maintaining higher system capability and adaptability. |
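To illustrate the MAG-aware selective approximation behind QSLC (4.3.1), the toy Python sketch below re-quantizes the trailing values of a block whose compressed size sits a few bytes above a multiple of the memory access granularity. The float32-to-float16 quantizer, the 8-byte threshold, and the block layout are illustrative assumptions; the paper's hardware quantizer and its error analysis are not reproduced.

```python
# Toy model of MAG-aware selective approximation in the spirit of QSLC.
# Assumption: a block is an array of float32 values, and the few bytes
# above a multiple of MAG are shaved off by re-quantizing trailing values
# to float16 (each float32 -> float16 conversion saves 2 bytes).
import numpy as np

MAG = 32          # memory access granularity in bytes
THRESHOLD = 8     # approximate only if at most this many bytes over

def selective_approximate(block: np.ndarray):
    size = block.nbytes                       # float32: 4 bytes per value
    overshoot = size % MAG
    if overshoot == 0 or overshoot > THRESHOLD:
        return block, False                   # store exactly
    n_shave = overshoot // 2                  # values to re-quantize
    approx = block.copy()
    approx[-n_shave:] = approx[-n_shave:].astype(np.float16)
    return approx, True

blk = np.arange(9, dtype=np.float32) * 0.1    # 36 bytes: 4 over a 32 B line
out, approximated = selective_approximate(blk)
print(approximated, np.abs(out - blk).max())  # True, small tail error
```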
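The lookup-based reuse in TruLook (IP5_5.2) can likewise be sketched in a few lines: results are memoized under keys formed from operands with their low mantissa bits masked off, so operands that agree in the retained bits hit the same entry. The Python dict, the key function, and the 12-bit setting are stand-ins for the paper's near-core lookup tables, chosen here for illustration.

```python
# Toy computation reuse with inexact matching: two operands match if they
# agree once the low bits of the float32 mantissa are masked off. The
# number of retained mantissa bits plays the role of TruLook's threshold.
import struct

def f32_key(x: float, mantissa_bits: int = 12) -> int:
    """float32 bit pattern keeping only the top `mantissa_bits` of the
    23-bit mantissa (sign and exponent are kept in full)."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    mask = (0xFFFFFFFF << (23 - mantissa_bits)) & 0xFFFFFFFF
    return bits & mask

table = {}

def approx_mul(a: float, b: float) -> float:
    key = (f32_key(a), f32_key(b))
    if key not in table:              # miss: compute exactly and memoize
        table[key] = a * b
    return table[key]                 # hit: reuse the earlier result

print(approx_mul(1.00000, 2.0))       # computed: 2.0
print(approx_mul(1.00002, 2.0))       # inexact hit: reuses 2.0
```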
4.4 Raising performance of Hybrid Memory Systems
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 07:00 CET - 07:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/69WD5QK6XmBfjrisY
Session chair:
Jeronimo Castrillon, TU Dresden, DE
Session co-chair:
Sharad Sinha, IIT Goa, IN
This session presents different solutions that take advantage of emerging memory systems or hardware accelerators. The papers include various microarchitectural techniques to improve the performance of hybrid memory systems, from a novel prefetcher to a hardware mechanism that ensures crash consistency for systems with NVM. The behavior of High-Bandwidth Memory (HBM) under reduced-voltage operating conditions is also thoroughly analyzed. Finally, a short paper discusses how to speed up a fully quantized BERT, a transformer-based model, on an FPGA-based accelerator.
Time | Label | Presentation Title Authors |
---|---|---|
07:00 CET | 4.4.1 | (Best Paper Award Candidate) LSP: COLLECTIVE CROSS-PAGE PREFETCHING FOR NVM Speaker: Haiyang Pan, Institute of Computing Technology, CAS, Beijing, China, CN Authors: Haiyang Pan1, Yuhang Liu1, Tianyue Lu1 and Mingyu Chen2 1Institute of Computing Technology, Chinese Academy of Sciences, CN; 2Professor, CN Abstract As an emerging technique, non-volatile memory (NVM) provides valuable opportunities for boosting the memory system, which is vital for computing system performance. However, one challenge preventing NVM from replacing DRAM as the main memory is that the latency of an NVM row activation is much longer (by approximately 10x) than that of DRAM. To address this issue, we present a collective cross-page prefetching scheme that can accurately open an NVM row in advance and then prefetch the data blocks from the opened row with low overhead. We identify a memory access pattern (referred to as a ladder stream) that facilitates prefetching across page boundaries, and propose the ladder stream prefetcher (LSP) for NVM. In LSP, two crucial components have been carefully designed. The Collective Prefetch Table reduces interference with demand requests by speculatively scheduling prefetches according to the state of the memory queue; it is implemented with low overhead by using a single entry to track multiple prefetches. The Memory Mapping Table enables accurate prefetching of future pages by maintaining the mapping between physical and virtual addresses. Experimental evaluations show that LSP improves memory system performance over no prefetching by 66%, and the improvements over the state-of-the-art prefetchers Access Map Pattern Matching Prefetcher (AMPM), Best-Offset Prefetcher (BOP) and Signature Path Prefetcher (SPP) are 26.6%, 21.7% and 27.4%, respectively. A toy sketch of ladder-stream detection follows this session's listing. |
07:15 CET | 4.4.2 | EFFICIENT HARDWARE-ASSISTED OUT-PLACE UPDATE FOR PERSISTENT MEMORY Speaker: Yifu Deng, Michigan Technological University, US Authors: Yifu Deng1, Jianhui Yue2, Zhiyuan Lu1 and Yifeng Zhu3 1Michigan Technological University, US; 2Michigan Tech. University, US; 3University of Maine, US Abstract Shadow paging can guarantee crash consistency for Persistent Memory (PM). However, shadow paging requires the use of an address mapping table to track shadow pages, and frequent accesses to this table introduce significant performance overhead. In addition, maintaining crash consistency at the granularity of a page causes a large amount of unnecessary write traffic. This paper proposes a novel hardware-assisted fine-grained out-place-update scheme at the granularity of a cache line to efficiently support crash consistency for PM. Our design fully leverages the Address Indirection Table (AIT) available in commodity PM to implement remapping. To ensure the atomicity and durability of AIT updates, we propose two policies: eager persisting and lazy persisting. We also employ an overflow log to handle the eviction of speculative AIT cache entries upon an overflow in the AIT cache. Evaluation results based on multicore workloads demonstrate that our proposed scheme can improve the transaction throughput over the state-of-the-art design by 24.0% on average. |
07:30 CET | IP4_1.1 | (Best Paper Award Candidate) HARDWARE ACCELERATION OF FULLY QUANTIZED BERT FOR EFFICIENT NATURAL LANGUAGE PROCESSING Speaker: Zejian Liu, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, CN Authors: Zejian Liu1, Gang Li2 and Jian Cheng1 1National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, CN; 2Institute of Automation, Chinese Academy of Sciences, CN Abstract BERT is the most recent Transformer-based model that achieves state-of-the-art performance in various NLP tasks. In this paper, we investigate the hardware acceleration of BERT on FPGA for edge computing. To tackle the issue of huge computational complexity and memory footprint, we propose to fully quantize the BERT (FQ-BERT), including weights, activations, softmax, layer normalization, and all the intermediate results. Experiments demonstrate that the FQ-BERT can achieve 7.94× compression for weights with negligible performance loss. We then propose an accelerator tailored for the FQ-BERT and evaluate it on Xilinx ZCU102 and ZCU111 FPGAs. It can achieve a performance-per-watt of 3.18 fps/W, which is 28.91× and 12.72× higher than an Intel(R) Core(TM) i7-8700 CPU and an NVIDIA K80 GPU, respectively. |
07:31 CET | 4.4.3 | UNDERSTANDING POWER CONSUMPTION AND RELIABILITY OF HIGH-BANDWIDTH MEMORY WITH VOLTAGE UNDERSCALING Speaker: SeyedSaber NabaviLarimi, Barcelona Supercomputing Center, ES Authors: Seyed Saber Nabavi Larimi1, Behzad Salami1, Osman Sabri Unsal1, Adrian Cristal Kestelman1, Hamid Sarbazi-Azad2 and Onur Mutlu3 1Barcelona Supercomputing Center, ES; 2Sharif University of Technology and IPM, IR; 3ETH, CH Abstract Modern computing devices employ High-Bandwidth Memory (HBM) to meet their memory bandwidth requirements. An HBM-enabled device consists of multiple DRAM layers stacked on top of one another next to a compute chip (e.g. CPU, GPU, and FPGA) in the same package. Although such HBM structures provide high bandwidth at a small form factor, the stacked memory layers consume a substantial portion of the package’s power budget. Therefore, power-saving techniques that preserve the performance of HBM are desirable. Undervolting is one such technique: it reduces the supply voltage to decrease power consumption without reducing the device’s operating frequency to avoid performance loss. Undervolting takes advantage of voltage guardbands put in place by manufacturers to ensure correct operation under all environmental conditions. However, reducing voltage without changing frequency can lead to reliability issues manifested as unwanted bit flips. In this paper, we provide the first experimental study of real HBM chips under reduced-voltage conditions. We show that the guardband regions for our HBM chips constitute 19% of the nominal voltage. Pushing the supply voltage down within the guardband region reduces power consumption by a factor of 1.5X for all bandwidth utilization rates. Pushing the voltage down further by 11% leads to a total of 2.3X power savings at the cost of unwanted bit flips. We explore and characterize the rate and types of these reduced-voltage-induced bit flips and present a fault map that enables the possibility of a three-factor trade-off among power, memory capacity, and fault rate. |
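As a rough illustration of the ladder-stream idea in LSP (4.4.1), the sketch below tracks a constant-stride access stream and flags the point where the next access would cross a 4 KiB page, which is where the prefetcher would open the next NVM row ahead of time. The Collective Prefetch Table, the Memory Mapping Table, and queue-aware scheduling are all omitted; only the detection step is shown, under assumed page and stride sizes.

```python
# Toy detector for a constant-stride stream that steps across page
# boundaries (a "ladder" when plotted as address vs. time). When the
# predicted next access lands on a new page, a real prefetcher would
# activate that page's NVM row early to hide the ~10x activation latency.
PAGE = 4096

class LadderDetector:
    def __init__(self):
        self.last = None
        self.stride = None

    def access(self, addr: int):
        if self.last is not None:
            stride = addr - self.last
            if stride == self.stride:            # stride confirmed twice
                nxt = addr + stride
                if nxt // PAGE != addr // PAGE:  # next access crosses a page
                    print(f"open row of page {nxt // PAGE:#x} early")
            self.stride = stride
        self.last = addr

d = LadderDetector()
for a in range(0x10000, 0x12000, 0x500):         # stride 0x500, two pages
    d.access(a)
```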
4.5 Sensing, security and performance in smart automotive and energy systems
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 07:00 CET - 07:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/C4zGATEvQ25dfPD3L
Session chair:
Philipp Mundhenk, Robert Bosch GmbH, DE
Session co-chair:
Dip Goswami, Eindhoven University of Technology, NL
This session provides three papers dealing with various aspects of smart automotive and energy systems. The first two papers deal with topics from automated driving and connected vehicles, namely a GPU-based fast depth sensing solution and a secure publish/subscribe protocol for vehicle to cloud communication. The third paper addresses the smart energy domain, in particular, the optimization of the latency of intermittent, storage-less systems powered only through renewable sources.
Time | Label | Presentation Title Authors |
---|---|---|
07:00 CET | 4.5.1 | (Best Paper Award Candidate) A GPU-ACCELERATED DEEP STEREO-LIDAR FUSION FOR REAL-TIME HIGH-PRECISION DENSE DEPTH SENSING Speaker: Haitao Meng, Sun Yat-sen University, CN Authors: Haitao Meng1, Chonghao Zhong1, Jianfeng Gu1 and Gang Chen2 1Sun Yat-Sen University, CN; 2Sun Yat-sen University, CN Abstract Active LiDAR and stereo vision are the most commonly used depth sensing techniques in autonomous vehicles. Each of them alone has weaknesses in terms of density and reliability and thus cannot perform well in all practical scenarios. Recent works use deep neural networks (DNNs) to exploit their complementary properties, achieving superior depth sensing. However, these state-of-the-art solutions are not satisfactory in terms of real-time responsiveness due to the high computational complexity of DNNs. In this paper, we present FastFusion, a fast deep stereo-LiDAR fusion framework for real-time high-precision depth estimation. FastFusion provides an efficient two-stage fusion strategy that leverages a binary neural network to integrate stereo-LiDAR information as input and uses cross-based LiDAR trust aggregation to further fuse the sparse LiDAR measurements in the back-end of stereo matching. More importantly, we present a GPU-based acceleration framework providing a low-latency implementation of FastFusion, gaining both accuracy improvement and real-time responsiveness. In the experiments, we demonstrate the effectiveness and practicability of FastFusion, which obtains a significant speedup over state-of-the-art baselines while achieving comparable accuracy on depth sensing. |
07:15 CET | 4.5.2 | SPPS: SECURE POLICY-BASED PUBLISH/SUBSCRIBE SYSTEM FOR V2C COMMUNICATION Speaker: Mohammad Hamad, Associate Professorship of Embedded Systems and Internet of Things, TUM Department of Electrical and Computer Engineering, TU Munich, DE Authors: Mohammad Hamad1, Emanuel Regnath1, Jan Lauinger1, Vassilis Prevelakis2 and Sebastian Steinhorst1 1TU Munich, DE; 2TU Braunschweig, DE Abstract The Publish/Subscribe (Pub/Sub) pattern is an attractive paradigm for supporting Vehicle-to-Cloud (V2C) communication. However, security threats against the confidentiality, integrity, and access control of published data challenge the adoption of the Pub/Sub model. To address this, our paper proposes a secure policy-based Pub/Sub model for V2C communication, which allows encrypting and controlling access to messages published by vehicles. A vehicle encrypts messages with a symmetric key while saving the key in distributed shares on semi-honest services, called KeyStores, using the concept of secret sharing. The security policy, generated by the same vehicle, authorizes certain cloud services to obtain the shares from the KeyStores. Here, granting access rights takes place without violating the decoupling requirement of the Pub/Sub model. Experimental results show that, besides the end-to-end security protection, our proposed system introduces significantly less overhead (almost 70% less) than the state-of-the-art approach, SSL, when reestablishing connections, a common scenario in the V2C context due to unreliable network connections. A minimal sketch of the secret-sharing primitive follows this session's listing. |
07:30 CET | IP4_2.1 | THERMAL COMFORT AWARE ONLINE ENERGY MANAGEMENT FRAMEWORK FOR A SMART RESIDENTIAL BUILDING Speaker: Daichi Watari, Osaka University, JP Authors: Daichi Watari1, Ittetsu Taniguchi1, Francky Catthoor2, Charalampos Marantos3, Kostas Siozios4, Elham Shirazi5, Dimitrios Soudris3 and Takao Onoye1 1Osaka University, JP; 2imec, KU Leuven, BE; 3National TU Athens, GR; 4Aristotle University of Thessaloniki, GR; 5imec, KU Leuven, EnergyVille, BE Abstract Energy management in buildings equipped with renewable energy is vital for reducing electricity costs and maximizing occupant comfort. Despite several studies on the scheduling of appliances, a battery, and heating, ventilation, and air-conditioning (HVAC), there is a lack of a comprehensive and time-scalable approach that integrates predictive information such as renewable generation and thermal comfort. In this paper, we propose an online energy management framework that incorporates optimal energy scheduling and prediction models of PV generation and thermal comfort via a model predictive control (MPC) approach. The energy management problem is formulated as three coordinated optimization problems covering fast and slow time-scales. This reduces the time complexity without a significant negative impact on the global nature and quality of the result. Experimental results show that the proposed framework achieves optimal energy management that takes into account the trade-off between the electricity bill and thermal comfort. |
07:31 CET | IP4_2.2 | ONLINE LATENCY MONITORING OF TIME-SENSITIVE EVENT CHAINS IN SAFETY-CRITICAL APPLICATIONS Speaker: Jonas Peeck, TU Braunschweig, Institute of Computer and Network Engineering, DE Authors: Jonas Peeck, Johannes Schlatow and Rolf Ernst, TU Braunschweig, DE Abstract Highly-automated driving involves chains of perception, decision, and control functions. These functions involve data-intensive algorithms that motivate the use of a data-centric middleware and a service-oriented architecture. As an example we use the open-source project Autoware.Auto. The function chains define a safety-critical automated control task with weakly-hard real-time constraints. However, providing the required assurance by formal analysis is challenged by the complex hardware/software structure of these systems and their dynamics. We propose an approach that combines measurement, suitable distribution of deadline segments, and application-level online monitoring that serves to supervise the execution of service-oriented software systems with multiple function chains and weakly-hard real-time constraints. We use DDS as middleware and apply it to an Autoware.Auto use case. |
07:32 CET | 4.5.3 | INTERMITTENT COMPUTING WITH EFFICIENT STATE BACKUP BY ASYNCHRONOUS DMA Speaker: Mingsong Lv, The Hong Kong Polytechnic University, HK Authors: Wei Zhang1, Songran Liu2, Mingsong Lv2, QiuLin Chen3 and Nan Guan1 1The Hong Kong Polytechnic University, HK; 2Northeastern University, CN; 3Huawei Technologies Co., Ltd., CN Abstract Energy harvesting promises to power billions of Internet-of-Things devices without being restricted by battery life. The energy output of harvesters is typically tiny and highly unstable, so computing systems must frequently back up program states into non-volatile memory to ensure that a program will progress in the presence of frequent power failures. However, state backup is a time-consuming process. In existing solutions to this problem, state backup is conducted sequentially with program execution, which considerably impacts system performance. This paper proposes techniques to parallelize state backup and program execution with asynchronous DMA. The challenge is that program states can be incorrectly backed up, which may further cause the program to deliver incorrect results. Our main idea is to allow errors to occur during parallel state backup and program execution, and to detect the errors at the end of the state backup. Moreover, we propose a technique that allows the system to tolerate backup errors during execution without harming logical correctness. We designed a run-time system to implement the proposed approach. Experimental results on an STM32F7-based platform show that execution performance can be considerably improved by parallelizing state backup and program execution. |
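The share-splitting step in SPPS (4.5.2) rests on secret sharing. The minimal n-of-n XOR variant below shows the core property the KeyStores rely on: any single share, indeed any subset short of all of them, reveals nothing about the key. The paper's actual sharing scheme, policy enforcement, and KeyStore protocol are not detailed in the abstract, so this is only an assumed, simplest-case primitive.

```python
# n-of-n XOR secret sharing: n - 1 shares are uniformly random, the last
# share is the key XORed with all of them, so the key is recovered only
# when every share is combined.
import os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split(key: bytes, n: int) -> list:
    shares = [os.urandom(len(key)) for _ in range(n - 1)]
    last = key
    for s in shares:
        last = xor(last, s)
    return shares + [last]

def reconstruct(shares: list) -> bytes:
    out = bytes(len(shares[0]))
    for s in shares:
        out = xor(out, s)
    return out

key = os.urandom(16)                  # per-message symmetric key
shares = split(key, 3)                # e.g., one share per KeyStore
assert reconstruct(shares) == key     # authorized service with all shares
assert shares[0] != key               # a single KeyStore learns nothing
```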
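The "let errors happen, then detect them" strategy of 4.5.3 can be mimicked in plain Python: a background thread stands in for the asynchronous DMA engine, and a digest comparison at backup completion stands in for the paper's error detection. The repair path shown (simply redoing the copy) is a deliberate simplification; the paper additionally tolerates some backup errors without redoing them.

```python
# Sketch: back up state in parallel with execution, then detect whether
# the program mutated the state mid-backup and fix up the copy.
import hashlib
import threading

def digest(buf) -> bytes:
    return hashlib.sha256(bytes(buf)).digest()

def dma_backup(src, dst):
    dst[:] = bytes(src)              # copies while the program keeps running

state = bytearray(64)                # volatile program state
nvm = bytearray(64)                  # stand-in for the non-volatile copy

t = threading.Thread(target=dma_backup, args=(state, nvm))
t.start()
state[0] = 1                         # program execution mutates state
t.join()

if digest(nvm) != digest(state):     # error detected at end of backup
    dma_backup(state, nvm)           # redo (a real system may repair less)
assert digest(nvm) == digest(state)
```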
4.6 Physical Attacks and Countermeasures
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 07:00 CET - 07:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/ahzhriQxSZXcKQE5m
Session chair:
Ileana Buhan, Radboud University, NL
Session co-chair:
Francesco Regazzoni, University of Amsterdam and ALaRI - USI, NL
The papers in this session cover powerful key-recovery strategies against cryptographic implementations, based on both side channels and fault injection. New fault-injection countermeasures are proposed, along with protections for a wide range of targets such as caches, post-quantum cryptography and FSMs.
Time | Label | Presentation Title Authors |
---|---|---|
07:00 CET | 4.6.1 | GRINCH: A CACHE ATTACK AGAINST GIFT LIGHTWEIGHT CIPHER Speaker: Cezar Reinbrecht, TU Delft, NL Authors: Cezar Rodolfo Wedig Reinbrecht1, Abdullah Aljuffri2, Said Hamdioui1, Mottaqiallah Taouil1 and Johanna Sepúlveda3 1Delft University of Technology, NL; 2Delft University of Technology, NL; 3Airbus Defence and Space, DE Abstract The National Institute of Standards and Technology (NIST) has recently started a competition with the objective of standardizing lightweight cryptography (LWC). The winning schemes will be deployed in Internet-of-Things (IoT) devices, a key step for the current and future information and communication technology market. GIFT is an efficient lightweight cipher used by one-fourth of the candidates in the NIST LWC competition. Thus, its security evaluation is critical. One vital threat to its security is the class of so-called logical side-channel attacks based on cache observations. In this work, we propose a novel cache attack on GIFT referred to as GRINCH. We analyzed the vulnerabilities of GIFT and exploited them in our attack. The results show that the attack is effective and that the full key can be recovered with fewer than 400 encryptions. |
07:15 CET | 4.6.2 | BLIND SIDE-CHANNEL SIFA Speaker: Kostas Papagiannopoulos, NXP Semiconductors, DE Authors: Melissa Azouaoui1, Kostas Papagiannopoulos2 and Dominik Zürner2 1Universite Catholique de Louvain, NXP Semiconductors Hamburg, DE; 2NXP Semiconductors Hamburg, DE Abstract Statistical Ineffective Fault Attacks (SIFA) have been recently proposed as very powerful key recovery strategies on symmetric cryptographic primitives' implementations. Specifically, they have been shown to bypass many common countermeasures against faults such as redundancy or infection, and to remain applicable even when side-channel countermeasures are deployed. In this work, we investigate combined side-channel and fault attacks and show that a SIFA-like attack can be applied despite not having any direct ciphertext knowledge. The proposed attack exploits the ciphertext's side-channel and fault characteristics to mount successful key recoveries, even in the presence of masking and duplication countermeasures. We analyze the attack using simulations, discuss its requirements, strengths and limitations, and compare different approaches to distinguish the correct key. Finally, we demonstrate its applicability on an ARM Cortex-M4 device, utilizing a combination of laser-based fault injection and microprobe-based EM side-channel analysis. |
07:30 CET | IP4_3.1 | FEEDING THREE BIRDS WITH ONE SCONE: A GENERIC DUPLICATION BASED COUNTERMEASURE TO FAULT ATTACKS Speaker: Jakub Breier, Silicon Austria Labs, AT Authors: Anubhab Baksi1, Shivam Bhasin2, Jakub Breier3, Anupam Chattopadhyay4 and Vinay B. Y. Kumar1 1Nanyang Technological University, Singapore, SG; 2Temasek Laboratories, Nanyang Technological University, SG; 3Silicon Austria Labs, AT; 4Nanyang Technological University, SG Abstract In the current world of the Internet-of-things and edge computing, computations are increasingly performed locally on small connected systems. As such, those devices are often vulnerable to adversarial physical access, enabling a plethora of physical attacks which is a challenge even if such devices are built for security. As cryptography is one of the cornerstones of secure communication among devices, the pertinence of fault attacks is becoming increasingly apparent in a setting where a device can be easily accessed in a physical manner. In particular, two recently proposed fault attacks, Statistical Ineffective Fault Attack (SIFA) and the Fault Template Attack (FTA) are shown to be formidable due to their capability to bypass the common duplication based countermeasures. Duplication based countermeasures, deployed to counter the Differential Fault Attack (DFA), work by duplicating the execution of the cipher followed by a comparison to sense the presence of any effective fault, followed by an appropriate recovery procedure. While a handful of countermeasures are proposed against SIFA, no such countermeasure is known to thwart FTA to date. In this work, we propose a novel countermeasure based on duplication, which can protect against both SIFA and FTA. The proposal is also lightweight with only a marginally additional cost over simple duplication based countermeasures. Our countermeasure further protects against all known variants of DFA, including Selmke, Heyszl, Sigl’s attack from FDTC 2016. It does not inherently leak side-channel information and is easily adaptable for any symmetric key primitive. The validation of our countermeasure has been done through gate-level fault simulation. |
07:31 CET | IP4_3.2 | SIDE-CHANNEL ATTACK ON RAINBOW POST-QUANTUM SIGNATURE Speaker: Petr Socha, Czech TU in Prague, CZ Authors: David Pokorný, Petr Socha and Martin Novotný, Czech TU in Prague, CZ Abstract Rainbow, a layered multivariate quadratic digital signature, is a candidate for standardization in a competition-like process organized by NIST. In this paper, we present a CPA side-channel attack on the submitted 32-bit reference implementation. We evaluate the attack on an STM32F3 ARM microcontroller, successfully revealing the full private key. Furthermore, we propose a simple masking scheme with minimum overhead. |
07:32 CET | 4.6.3 | PATRON: A PRAGMATIC APPROACH FOR ENCODING LASER FAULT INJECTION RESISTANT FSMS Speaker: Muhtadi Choudhury, University of Florida, US Authors: Muhtadi Choudhury1, Shahin Tajik2 and Domenic Forte1 1University of Florida, US; 2Worcester Polytechnic Institute, US Abstract Since Finite State Machines (FSMs) regulate the overall operation of the majority of digital systems, the security of an entire system can be jeopardized if the FSM is vulnerable to physical attacks. By injecting faults into an FSM, an attacker can attain unauthorized access to sensitive states, resulting in information leakage and privilege escalation. One powerful fault injection technique is laser-based fault injection (LFI), which enables an adversary to alter the states of individual flip-flops. While standard error correction/detection techniques have been used to protect FSMs from such fault attacks, their significant overhead makes them unattractive to designers. To keep the overhead minimal, we propose PATRON, a novel FSM encoding scheme based on decision diagrams that utilizes the don't-care states of the FSM. We demonstrate that PATRON outperforms conventional encoding schemes in terms of both security and scalability on popular benchmarks. Finally, we introduce a vulnerability metric to aid the security analysis, which precisely captures the susceptibility of FSM designs. A toy audit in the spirit of this metric follows this session's listing. |
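The vulnerability metric of PATRON (4.6.3) can be approximated by a simple audit: for a candidate encoding, check whether any single bit flip of a non-protected state's code lands on the protected state. The encodings below are made-up three-bit examples; the decision-diagram-based procedure that finds good encodings using don't-care states is not reproduced.

```python
# Audit an FSM state encoding: can one flipped flip-flop move the machine
# into the protected (e.g., unlocked) state?

def one_flip_away(enc_a: int, enc_b: int) -> bool:
    diff = enc_a ^ enc_b
    return diff != 0 and diff & (diff - 1) == 0   # Hamming distance == 1

def audit(encodings: dict, protected: str) -> list:
    target = encodings[protected]
    return [s for s, e in encodings.items()
            if s != protected and one_flip_away(e, target)]

weak   = {"IDLE": 0b000, "BUSY": 0b001, "UNLOCKED": 0b100}
strong = {"IDLE": 0b000, "BUSY": 0b011, "UNLOCKED": 0b101}
print(audit(weak, "UNLOCKED"))     # ['IDLE']: one laser fault unlocks
print(audit(strong, "UNLOCKED"))   # []: at least two faults are needed
```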
4.7 Improving Efficiency in Training and Inference of Spiking and Non-Spiking Neural Networks
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 07:00 CET - 07:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/fW67kfettgW8yQjfP
Session chair:
Anup Das, Drexel University, US
Session co-chair:
Akash Kumar, TU Dresden, DE
This session presents the latest research on Spiking Neural Network (SNN) implementations and on efficient training and inference with innovative methods. The first paper proposes four enhancement schemes for direct-training algorithms in SNNs to improve accuracy and the spike count per inference. The second proposes a hybrid analog-spiking Long Short-Term Memory that combines the energy efficiency of SNNs with the performance efficiency of non-spiking NNs. The third paper proposes a framework in which two networks, one for pre-processing and one for the core application, are trained together to optimize performance. Finally, an IP paper proposes using cellular automata and an ensemble Bloom filter to replace costly floating-point or integer calculations with binary operations.
Time | Label | Presentation Title Authors |
---|---|---|
07:00 CET | 4.7.1 | AN IMPROVED STBP FOR TRAINING HIGH-ACCURACY AND LOW-SPIKE-COUNT SPIKING NEURAL NETWORKS Speaker: Pai-Yu Tan, National Tsing Hua University, TW Authors: Pai-Yu Tan1, Cheng-Wen Wu2 and Juin-Ming Lu3 1National Tsing Hua University, TW; 2National Cheng Kung University, TW; 3Industrial Technology Research Institute, TW Abstract Spiking Neural Networks (SNNs), which facilitate energy-efficient neuromorphic hardware, are getting increasing attention. Directly training SNNs with backpropagation has already shown competitive accuracy compared with Deep Neural Networks. Besides accuracy, the number of spikes per inference has a direct impact on processing time and energy once deployed on neuromorphic processors. However, previous direct-training algorithms do not put great emphasis on this metric. Therefore, this paper proposes four enhancement schemes for the existing direct-training algorithm, Spatio-Temporal Back-Propagation (STBP), to improve not only the accuracy but also the spike count per inference. We first modify the reset mechanism of the spiking neuron model to address the information loss issue, which enables the firing threshold to be a trainable variable. Then we propose two novel output spike decoding schemes to effectively utilize the spatio-temporal information. Finally, we reformulate the derivative approximation of the non-differentiable firing function to simplify the computation of STBP without accuracy loss. In this way, we achieve higher accuracy and a lower spike count per inference on image classification tasks. Moreover, the enhanced STBP is amenable to future on-line learning hardware implementations. A toy sketch of the modified reset mechanism follows this session's listing. |
07:15 CET | 4.7.2 | HYBRID ANALOG-SPIKING LONG SHORT-TERM MEMORY FOR ENERGY EFFICIENT COMPUTING ON EDGE DEVICES Speaker: Wachirawit Ponghiran, Purdue University, US Authors: Wachirawit Ponghiran and Kaushik Roy, Purdue University, US Abstract Recurrent neural networks such as Long Short-Term Memory (LSTM) have been used in many sequential learning tasks such as speech recognition and language translation. Running large-scale LSTMs for real-world applications is known to be compute-intensive and often relies on cloud execution. To enable LSTM operations on edge devices that receive inputs in real time, there is a need to improve LSTM execution efficiency within the limited energy budgets of mobile platforms. We propose a hybrid analog-spiking LSTM that combines the energy efficiency of spiking neural networks (SNNs) with the performance efficiency of analog (non-spiking) neural networks (ANNs). An SNN, which processes and represents information as a sequence of sparse binary spikes or events, uses integrate-and-fire activation, hence consuming low power and energy for real-time inference (batch size of 1). The proposed analog-spiking LSTM is derived from a trained LSTM using a novel conversion method that transforms the fully-connected layers and the non-linearity functions to be compatible with SNNs. We show that the default LSTM non-linearities are sources of output mismatch between the ANN and the SNN. We propose a set of replacement functions that lead to minimal impact on the output quality of sequential learning problems. Our analyses of sequential image classification on the MNIST dataset and sequence-to-sequence translation on the IWSLT14 dataset indicate a <1% drop in average accuracy for row-wise and pixel-wise sequential image recognition and a <1.5-point drop in average BLEU score for the translation task. Implementation of the recognition system with the hybrid analog-spiking LSTM on Intel's spiking processor, Loihi, shows a 55.9x improvement in active energy per inference over the baseline system on an Intel i7-6700. Based on our analysis, we estimate this benefit to be a 3.38x reduction in active energy per inference for the translation task. |
07:30 CET | IP4_4.1 | BLOOMCA: A MEMORY EFFICIENT RESERVOIR COMPUTING HARDWARE IMPLEMENTATION USING CELLULAR AUTOMATA AND ENSEMBLE BLOOM FILTER Speaker: Dehua Liang, Graduate School of Information Science and Technology, Osaka University, JP Authors: Dehua Liang, Masanori Hashimoto and Hiromitsu Awano, Osaka University, JP Abstract In this work, we propose BloomCA, which utilizes cellular automata (CA) and an ensemble Bloom filter to build a reservoir computing (RC) system using only binary operations, making it suitable for hardware implementation. The rich pattern dynamics created by CA can map the input into a high-dimensional space and provide more features for the classifier. Using the ensemble Bloom filter as the classifier, the features can be memorized effectively. Our experiments reveal that applying the ensemble mechanism to the Bloom filter yields a significant reduction in inference memory cost. Compared with the state-of-the-art reference, BloomCA achieves a 43x reduction in memory cost without hurting accuracy. Our hardware implementation also demonstrates that BloomCA achieves over a 21x reduction in area and a 43.64% reduction in power. A minimal Bloom filter sketch follows this session's listing. |
07:31 CET | 4.7.3 | PAIRED TRAINING FRAMEWORK FOR TIME-CONSTRAINED LEARNING Speaker: Jung-Eun Kim, Yale University, US Authors: Jung-Eun Kim1, Richard Bradford2, Max Del Giudice3 and Zhong Shao3 1Department of Computer Science, Yale University, US; 2Collins Aerospace, US; 3Yale University, US Abstract This paper presents a design framework for machine learning applications that operate in systems, such as cyber-physical systems, where time is a scarce resource. We manage the tradeoff between processing time and solution quality by performing as much preprocessing of data as time allows. This approach leads us to a design framework with two separate learning networks: one for preprocessing and one for the core application functionality. We show how these networks can be trained together and how they can operate in an anytime fashion to optimize performance. |
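A toy version of the reset-mechanism change in 4.7.1: with reset-by-subtraction, the neuron keeps whatever charge exceeded the threshold instead of discarding it, which avoids the information loss the paper points to and is what allows the threshold itself to be trained. The parameter values are illustrative, and the surrogate gradient actually used for STBP training is not shown.

```python
# Leaky integrate-and-fire step contrasting a hard reset with
# reset-by-subtraction (soft reset).
import numpy as np

def lif_step(v, x, theta, leak=0.9, soft_reset=True):
    v = leak * v + x                          # integrate the input
    spike = (v >= theta).astype(v.dtype)      # fire where threshold is hit
    if soft_reset:
        v = v - spike * theta                 # keep the residual charge
    else:
        v = v * (1.0 - spike)                 # hard reset discards it
    return v, spike

v = np.zeros(1)
for x in [0.6, 0.9, 0.9]:
    v, s = lif_step(v, np.array([x]), theta=1.0)
    print(f"v={v[0]:.2f} spike={int(s[0])}")
```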
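BloomCA (IP4_4.1) classifies with ensemble Bloom filters so that inference needs only hashing and bitwise operations; the minimal filter below shows that primitive in isolation. The hash construction and filter size are arbitrary choices here, and the cellular-automaton feature expansion and ensemble voting are left out.

```python
# Minimal Bloom filter: k hashed positions per item, set/test single bits.
# Queries may return false positives but never false negatives.
import hashlib

class BloomFilter:
    def __init__(self, m_bits: int = 1024, k: int = 3):
        self.m, self.k, self.bits = m_bits, k, 0

    def _positions(self, item: bytes):
        for i in range(self.k):
            h = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(h[:8], 'little') % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits |= 1 << p               # pure binary operation

    def query(self, item: bytes) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add(b"feature-pattern-of-class-A")
print(bf.query(b"feature-pattern-of-class-A"))  # True
print(bf.query(b"unseen-pattern"))              # almost surely False
```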
5.1 AI, ML and Data Analytics
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 08:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/ZB25LydZB6i4Lp59x
Session chair:
Massimo Poncino, Politecnico di Torino, IT
Session co-chair:
Elias Fallon, Cadence, US
Organizers:
Enrico Macii, Politecnico di Torino, IT
Frank Schirrmeister, Cadence, US
Artificial Intelligence, Machine Learning and Data Analytics are tools that are finding applications in many different contexts, including intelligent manufacturing. This session illustrates how such tools can help to solve problems and issues specific to the I4.0 paradigm. First, a machine learning approach to the real-time bin picking of randomly piled 3D industrial parts is presented; the method is based on deep learning, with and without hybridizing conventional engineering approaches. Second, a system for in-situ defect monitoring and detection for metal additive manufacturing processes is introduced, which leverages an off-axis camera mounted on top of the machine. Next, machine learning is proposed as a key element for the full life cycle management of complex equipment, based on digital twins. Finally, recent advances are presented on the use of artificial neural networks to address the challenges of automation and performance-efficient realizations of mass spectrometry (MS) and nuclear magnetic resonance (NMR) spectroscopy.
Time | Label | Presentation Title Authors |
---|---|---|
08:00 CET | 5.1.1 | MACHINE LEARNING BASED REAL-TIME INDUSTRIAL BIN-PICKING: HYBRID AND DEEP LEARNING APPROACHES Speaker: Sukhan Lee, Intelligent Systems Research Institute, SKKU, KR Authors: Sukhan Lee and Soojin Lee, Sungkyunkwan University, KR Abstract The real-time pick-and-place of 3D industrial parts randomly piled in a part bin plays an important role in manufacturing automation. Approaches based solely on conventional engineering disciplines have shown limitations in handling multiple parts of arbitrary 3D geometries in real time. In this paper, we present a machine learning approach to the real-time bin picking of randomly piled 3D industrial parts, based on deep learning with and without hybridizing conventional engineering approaches. The proposed hybrid approach first makes use of deep-learning-based object detectors, configured in a cascaded form, to detect parts in a bin and extract features of the detected parts. Then, the part features and their positions are fed to the engineering approach to estimate their 3D poses in the bin. On the other hand, the proposed pure deep learning approach is based on first extracting the partial 3D point cloud of the object from its 2D image with the background removed and then transforming the extracted partial 3D point cloud into its full 3D point cloud representation. Alternatively, it may directly transform the object's 2D image with its background removed into the 3D point cloud representation. The experimental results demonstrate that the proposed approaches are able to perform real-time bin-picking operations for multiple 3D parts of arbitrary geometries with high precision. |
08:15 CET | 5.1.2 | IMAGE ANALYTICS AND MACHINE LEARNING FOR IN-SITU DEFECTS DETECTION IN ADDITIVE MANUFACTURING Speaker: Davide Cannizzaro, Politecnico di Torino, IT Authors: Davide Cannizzaro1, Antonio Giuseppe Varrella1, Stefano Paradiso2, Roberta Sampieri2, Enrico Macii1, Edoardo Patti1 and Santa Di Cataldo1 1Politecnico di Torino, IT; 2FCA Product Development AM Centre, IT Abstract In the context of Industry 4.0, metal Additive Manufacturing (AM) is considered a promising technology for the medical, aerospace and automotive fields. However, the lack of assurance of the quality of the printed parts can be an obstacle to its wider adoption in industry. To date, AM is most of the time a trial-and-error process, where faulty artefacts are detected only after the end of part production. This impacts the processing time and the overall cost of the process. A possible solution to this problem is the in-situ monitoring and detection of defects, taking advantage of the layer-by-layer nature of the build. In this paper, we describe a system for in-situ defect monitoring and detection for metal Powder Bed Fusion (PBF) that leverages an off-axis camera mounted on top of the machine. A set of fully automated algorithms based on Computer Vision and Machine Learning allows the timely detection of a number of powder bed defects and the monitoring of the object's profile for the entire duration of the build. |
08:30 CET | 5.1.3 | STRENGTHENING DIGITAL TWIN APPLICATIONS BASED ON MACHINE LEARNING FOR COMPLEX EQUIPMENT Speaker: Zijie Ren, South China University of Technology, CN Authors: Zijie Ren and Jiafu Wan, South China University of Technology, CN Abstract Digital twin technology and machine learning are emerging technologies of recent years. Digital twin technology makes it possible to virtualize a product, process or service and to support the information interaction and co-evolution between the physical and information worlds. Machine Learning (ML) can improve the cognitive, reasoning and decision-making abilities of the digital twin through knowledge extraction. The full life cycle management of complex equipment is considered the key to the intelligent transformation and upgrading of the modern manufacturing industry. The application of these two technologies to the full life cycle management of complex equipment will make each stage of the life cycle more responsive, predictable and adaptable. In this study, we propose a full life cycle digital twin architecture for complex equipment. We describe four specific scenarios in which two typical machine learning algorithms based on deep reinforcement learning are applied to enhance the digital twin in various stages of complex equipment. Finally, we summarize the application advantages of combining digital twins and machine learning and address future research directions in this domain. |
08:45 CET | 5.1.4 | ARTIFICIAL INTELLIGENCE FOR MASS SPECTROMETRY AND NUCLEAR MAGNETIC RESONANCE SPECTROSCOPY Speaker: Michael Hübner, Brandenburg University of Technology Cottbus - Senftenberg, DE Authors: Florian Fricke1, Safdar Mahmood2, Javier Hoffmann2, Marcelo Brandalero3, Sascha Liehr4, Simon Kern4, Klas Meyer4, Stefan Kowarik5, Stephan Westerdick1, Michael Maiwald4 and Michael Huebner6 1Ruhr-Universität Bochum, DE; 2Brandenburgische Technische Universität, DE; 3Brandenburg University of Technology, DE; 4Bundesanstalt für Materialforschung und -prüfung (BAM), DE; 5University of Graz, AT; 6Brandenburg TU Cottbus, DE Abstract Mass Spectrometry (MS) and Nuclear Magnetic Resonance (NMR) spectroscopy are critical components of every industrial chemical process, as they provide information on the concentrations of individual compounds and by-products. These analyses are carried out manually by specialists, which takes a substantial amount of time and prevents their utilization for real-time closed-loop process control. This paper presents recent advances from two projects that use Artificial Neural Networks (ANNs) to address the challenges of automation and performance-efficient realizations of MS and NMR. In the first part, a complete toolchain has been developed to generate simulated spectra and train ANNs to identify compounds in MS. In the second part, a limited number of experimental NMR spectra have been augmented by simulated spectra to train an ANN with better prediction performance and speed than state-of-the-art analysis. These results suggest that, in the context of the digital transformation of the process industry, we are now on the threshold of a strongly simplified use of MS and NMR, with the accompanying data evaluation performed by machine-supported procedures, and can utilize both methods much more widely for reaction and process monitoring or quality control. |
09:00 CET | 5.1.5 | LIVE JOINT Q&A Authors: Massimo Poncino1, Elias Fallon2, Sukhan Lee3, Davide Cannizzaro1, Michael Huebner4 and Zijie Ren5 1Politecnico di Torino, IT; 2Cadence, US; 3Intelligent Systems Research Institute, Sungkyunkwan University, KR; 4Brandenburg TU Cottbus, DE; 5South China University of Technology, CN Abstract 30 minutes of live joint question and answer time for interaction among speakers and audience. |
5.2 Micro-Architectural Attacks and Defenses
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 08:00 CET - 08:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/3PmrqsT6Re3hcMqTE
Session chair:
Nele Mentens, Leiden University, BE
Session co-chair:
Fatemeh Ganji, WPI, US
This session covers micro-architectural attacks and defenses, a hot topic in recent years due to the vulnerabilities discovered in well-known processors. The papers in the session present novel cache architectures for mitigating attacks against caches and directories. In addition, the exploration of microarchitectural side-channels and the prevention of such leakage are also discussed.
Time | Label | Presentation Title Authors |
---|---|---|
08:00 CET | 5.2.1 | SCRAMBLE CACHE: AN EFFICIENT CACHE ARCHITECTURE FOR RANDOMIZED SET PERMUTATION Speaker: Amine Jaamoum, University Grenoble Alpes, CEA, LETI MINATEC Campus, FR Authors: Amine Jaamoum1, Thomas Hiscock1 and Giorgio Di Natale2 1University Grenoble Alpes, CEA, LETI MINATEC Campus, FR; 2TIMA, FR Abstract Driven by the need for performance-efficient computation, a large number of systems resort to cache memories. In this context, cache side-channel attacks have proven to be a serious threat to many applications. Many solutions and countermeasures exist in the literature. Nevertheless, the majority of them do not cope with the constraints and limitations imposed by embedded systems. In this paper, we introduce a novel cache architecture that leverages randomized set placement to defeat cache side-channel analysis. A key property of this architecture is its low impact on performance and its small area overhead. We demonstrate that this countermeasure protects the system against known cache side-channel attacks while guaranteeing small overheads, making the solution suitable also for embedded systems. A toy sketch of keyed set-index permutation follows this session's listing. |
08:15 CET | 5.2.2 | (Best Paper Award Candidate) MICROARCHITECTURAL TIMING CHANNELS AND THEIR PREVENTION ON AN OPEN-SOURCE 64-BIT RISC-V CORE Speaker: Nils Wistoff, ETH Zurich, CH Authors: Nils Wistoff1, Moritz Schneider1, Frank Gürkaynak1, Luca Benini2 and Gernot Heiser3 1ETH Zurich, CH; 2Università di Bologna and ETH Zurich, IT; 3UNSW and Data61, CSIRO, AU Abstract Microarchitectural timing channels use variations in the timing of events, resulting from competition for limited hardware resources, to leak information in violation of the operating system’s security policy. Such channels also exist on a simple in-order RISC-V core, as we demonstrate on the open-source RV64GC Ariane core. Time protection, recently proposed and implemented in the seL4 microkernel, aims to prevent timing channels, but depends on a controlled reset of microarchitectural state. Using Ariane, we show that software techniques for performing such a reset are insufficient and highly inefficient. We demonstrate that adding a single flush instruction is sufficient to close all five evaluated channels at negligible hardware costs, while requiring only minor modifications to the software stack. |
08:30 CET | IP5_1.1 | EXPLORING MICRO-ARCHITECTURAL SIDE-CHANNEL LEAKAGES THROUGH STATISTICAL TESTING Speaker: Sarani Bhattacharya, KU Leuven, BE Authors: Sarani Bhattacharya1 and Ingrid Verbauwhede2 1Phd, BE; 2KU Leuven - COSIC, BE Abstract Micro-architectural side-channel leakages have received a lot of attention due to their high impact on software security on complex out-of-order processors. These are extremely specialised threat models and can only be realised in practice with high-precision measurement code that triggers micro-architectural behavior that leaks information. In this paper, we present a tool to support the inexperienced user in verifying their code for side-channel leakage. We combine two very useful tools, statistical testing and hardware performance monitors, to bridge the gap between the understanding of general-purpose users and the most precise speculative execution attacks. We first show that these event counters are more powerful than observing timing variabilities of an executable. We extend Dudect, where the raw hardware events are collected over the target executable, and leakage detection tests are incorporated on the statistics of the observed events, following the principles of non-specific t-tests. Finally, we show the applicability of our tool on the most popular speculative micro-architectural and data-sampling attack models. (An illustrative sketch of this style of test follows this session's listing.) |
08:31 CET | IP5_1.2 | SECLUSIVE CACHE HIERARCHY FOR MITIGATING CROSS-CORE CACHE AND COHERENCE DIRECTORY ATTACKS Speaker: Vishal Gupta, Indian Institute of Technology, Kanpur, IN Authors: Vishal Gupta1, Vinod Ganesan2 and Biswabandan Panda3 1Indian Institute of Technology, Kanpur, IN; 2Indian Institute of Technology Madras, IN; 3IIT Kanpur, IN Abstract Cross-core cache attacks glean sensitive data by exploiting the fundamental interference at shared resources such as the last-level cache (LLC) and coherence directories. Complete non-interference would make cross-core cache attacks unsuccessful. To this end, we propose a seclusive cache hierarchy with zero storage overhead and a marginal increase in on-chip traffic that provides non-interference by employing cache privatization on demand. Upon a cross-core eviction by an attacker core at the LLC, the block is back-filled into the private cache of the victim core. Our back-fill strategy mitigates cross-core conflict-based LLC and coherence-directory-based attacks. We show the efficacy of the seclusive cache hierarchy by comparing it with existing cache hierarchies. |
08:32 CET | 5.2.3 | TINY-CFA: A MINIMALISTIC APPROACH FOR CONTROL FLOW ATTESTATION USING VERIFIED PROOFS OF EXECUTION Speaker: Sashidhar Jakkamsetti, UC Irvine, US Authors: Ivan De Oliveira Nunes1, Sashidhar Jakkamsetti2 and Gene Tsudik3 1University of California Irvine, US; 2UC Irvine, US; 3UCI, US Abstract The design of tiny trust anchors attracted much attention over the past decade, to secure low-end MCU-s that cannot afford more expensive security mechanisms. In particular, hardware/software (hybrid) co-designs offer low hardware cost, while retaining similar security guarantees as (more expensive) hardware-based techniques. Hybrid trust anchors support security services (such as remote attestation, proofs of software update/erasure/reset, and proofs of remote software execution) in resource-constrained MCU-s, e.g., MSP430 and AVR AtMega32. Despite these advances, detection of control-flow attacks in low-end MCU-s remains a challenge, since hardware requirements for the cheapest mitigation techniques are often more expensive than the MCU-s themselves. In this work, we tackle this challenge by designing Tiny-CFA – a Control-Flow Attestation (CFA) technique with a single hardware requirement – the ability to generate proofs of remote software execution (PoX). In turn, PoX can be implemented very efficiently and securely in low-end MCU-s. Consequently, our design achieves the lowest hardware overhead of any CFA technique, while relying on a formally verified PoX as its sole hardware requirement. With respect to runtime overhead, Tiny-CFA also achieves better performance than prior CFA techniques based on code instrumentation. We implement and evaluate Tiny-CFA, analyze its security, and demonstrate its practicality using real-world publicly available applications. |
08:47 CET | IP5_2.1 | TOWARDS A FIRMWARE TPM ON RISC-V Speaker: Marouene Boubakri, University of Carthage, TN Authors: Marouene Boubakri1, Fausto Chiatante2 and Belhassen Zouari1 1Mediatron Lab, Higher School of Communications of Tunis, University of Carthage, Tunisia, TN; 2NXP, FR Abstract To develop the next generation of Internet of Things and edge devices and systems, which leverage progress in enabling technologies such as 5G, distributed computing and artificial intelligence (AI), several requirements need to be put in place to make the devices smarter. A major requirement for all the above applications is a long-term security and trusted computing infrastructure. Trusted computing requires the introduction of a Trusted Platform Module (TPM) inside the platform. Traditionally, a TPM was a discrete and dedicated module plugged into the platform to provide TPM capabilities. Recently, processor manufacturers have started integrating trusted computing features into their processors. A significant drawback of this approach is the need for a permanent modification of the processor microarchitecture. In this context, we present an analysis and a design of a software-only TPM for RISC-V processors based on the seL4 microkernel and OP-TEE. |
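As a concrete aside to IP5_1.1 above: the paper builds on non-specific statistical testing in the style of Dudect. The sketch below is a minimal illustration of that style of test (a Welch's t-test over two populations of hypothetical hardware-counter readings), not the authors' tool; the counter values and the conventional |t| > 4.5 threshold are illustrative assumptions.

```python
# Welch's t-test over two populations of hardware-event counts (e.g., LLC
# misses): one collected with a fixed input, one with random inputs.
import math

def welch_t(xs, ys):
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)   # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

fixed_class  = [102, 99, 101, 103, 100, 98, 102, 101]   # hypothetical counters
random_class = [141, 138, 150, 129, 144, 137, 149, 132]
t = welch_t(fixed_class, random_class)
print(f"|t| = {abs(t):.2f} ->", "leak suspected" if abs(t) > 4.5 else "no evidence")
```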
5.3 Systolic Array Architectures for Machine Learning Acceleration
Date: Wednesday, 03 February 2021
Time: 08:00 CET - 08:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/x94HY6u5nNjBptmuj
Session chair:
Giovanni Ansaloni, EPFL, CH
Session co-chair:
Henk Corporaal, TU/e, NL
The systolic array is an efficient design pattern for accelerating matrix multiplication and convolution using a regular grid of multiply-accumulate (MAC) units with local communication (a minimal simulation of this dataflow follows this session's listing). Nonetheless, implementing modern DNN models with depthwise separable convolutions and sparsity remains a challenge. The first paper in the session presents a new layer, namely FuSeConv, which makes depthwise separable convolution systolic-friendly. The second paper presents a Heterogeneous Systolic Array design that solves the problem by increasing the data-reuse ratio. In the third paper, the authors propose a hardware-software co-design approach to support the execution of pruned DNNs on a systolic array augmented with additional multiplexers for the operands. The last paper also addresses the efficient acceleration of sparse DNNs, this time by proposing a scheme that maintains a constant probability of index matching and therefore achieves high system performance.
Time | Label | Presentation Title / Authors |
---|---|---|
08:00 CET | 5.3.1 | FUSECONV: FULLY SEPARABLE CONVOLUTIONS FOR FAST INFERENCE ON SYSTOLIC ARRAYS Speaker: Surya Selvam, IIT Madras / Purdue University, IN Authors: Surya Selvam1, Vinod Ganesan1 and Pratyush Kumar2 1Indian Institute of Technology Madras, IN; 2IIT Madras, IN Abstract Both efficient neural networks and hardware accelerators are being explored to speed up DNN inference on edge devices. For example, MobileNet uses depthwise separable convolution to achieve much lower latency, while systolic arrays provide much higher performance per watt. Interestingly, however, the combination of these two ideas is inefficient: the computational patterns of depthwise separable convolution are not systolic and lack the data reuse to saturate the systolic array's constrained dataflow. In this paper, we propose FuSeConv (Fully-Separable Convolution) as a drop-in replacement for depthwise separable convolution. FuSeConv generalizes the decomposition of convolutions fully to separable 1D convolutions along the spatial and depth dimensions. The resultant computation is systolic and efficiently utilizes the systolic array with a slightly modified dataflow. With FuSeConv, we achieve a significant speed-up of 3x-7x with the MobileNet family of networks on a systolic array of size 64x64, with comparable accuracy on the ImageNet dataset. The high speed-up motivates the exploration of hardware-aware Neural Operator Search (NOS) to complement ongoing efforts on Neural Architecture Search (NAS). |
08:15 CET | 5.3.2 | HESA: HETEROGENEOUS SYSTOLIC ARRAY ARCHITECTURE FOR COMPACT CNNS HARDWARE ACCELERATORS Speaker: Rui Xu, National University of Defense Technology, CN Authors: Rui Xu, Sheng Ma, Yaohua Wang and Yang Guo, National University of Defense Technology, CN Abstract Compact convolutional neural networks have become a hot research topic. However, we find that hardware accelerators with systolic arrays are extremely performance-inefficient when processing compact models, especially their depthwise convolutional layers. To make systolic arrays efficient for compact convolutional neural networks, we propose the heterogeneous systolic array (HeSA) architecture. It introduces heterogeneous processing elements that support multiple modes of dataflow, which further exploit the data-reuse opportunities of depthwise convolutional layers without changing the architecture of the naïve systolic array. By increasing the utilization rate of the processing elements in the array, HeSA improves performance, throughput, and energy efficiency compared to the standard baseline. Based on our evaluation with typical workloads, HeSA improves the utilization rate of computing resources in depthwise convolutional layers by 4.5×-5.5× and achieves a 1.5-2.2× total performance speedup compared to the standard systolic array architecture. HeSA also improves on-chip data reuse and saves over 20% of energy consumption. Meanwhile, the area of HeSA is basically unchanged compared to the baseline due to its simple design. |
08:30 CET | IP4_5.2 | SPRITE: SPARSITY-AWARE NEURAL PROCESSING UNIT WITH CONSTANT PROBABILITY OF INDEX-MATCHING Speaker: Sungju Ryu, POSTECH, KR Authors: Sungju Ryu1, Youngtaek Oh2, Taesu Kim1, Daehyun Ahn1 and Jae-Joon Kim3 1Pohang University of Science and Technology, KR; 2Pohang University of Science and Technology, KR; 3POSTECH, KR Abstract Sparse neural networks are widely used for memory savings. However, irregular indices of non-zero input activations and weights tend to degrade the overall system performance. This paper presents a scheme to maintain a constant probability of index-matching for weights and inputs over a wide range of sparsity, overcoming a critical limitation of previous works. A sparsity-aware neural processing unit based on the proposed scheme improves the system performance by up to 6.1X compared to previous sparse convolutional neural network hardware accelerators. |
08:31 CET | 5.3.3 | HARDWARE-SOFTWARE CODESIGN OF WEIGHT RESHAPING AND SYSTOLIC ARRAY MULTIPLEXING FOR EFFICIENT CNNS Speaker: Jingyao Zhang, Xidian University, CN Authors: Jingyao Zhang1, Huaxi Gu1, Grace Li Zhang2, Bing Li2 and Ulf Schlichtmann2 1Xidian University, CN; 2TU Munich, DE Abstract The last decade has witnessed the breakthrough of deep neural networks (DNNs) in various fields, e.g., image/speech recognition. With the increasing depth of DNNs, the number of multiply-accumulate (MAC) operations with weights explodes significantly, preventing their application on resource-constrained platforms. Weight pruning is considered an effective method to compress neural networks for acceleration. However, the weights after pruning usually exhibit irregular patterns. Implementing MAC operations with such irregular weight patterns on hardware platforms with regular designs, e.g., GPUs and systolic arrays, might result in an underutilization of hardware resources. To utilize hardware resources efficiently, in this paper we propose a hardware-software codesign framework for acceleration on systolic arrays. First, weights after unstructured pruning are reorganized into a dense cluster. Second, various blocks are selected to cover the cluster seamlessly. To support the concurrent computation of such blocks on systolic arrays, a multiplexing technique and the corresponding systolic architecture are developed for various CNNs. The experimental results demonstrate that the performance of CNN inference can be improved significantly without accuracy loss. |
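For readers unfamiliar with the dataflow all four papers above build on, here is a minimal, cycle-by-cycle software model of an output-stationary systolic matrix multiplication. It is a generic textbook illustration, not the architecture of any paper in this session; the skewed operand delivery (operand pair k reaches PE(i, j) at cycle i + j + k) is the assumption being modeled.

```python
import numpy as np

def systolic_matmul(A, B):
    # Output-stationary systolic simulation: A streams in from the left
    # (skewed by row), B from the top (skewed by column); PE(i, j)
    # multiplies the operands passing through it and accumulates C[i, j].
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    # Cycle t delivers A[i, t - i - j] and B[t - i - j, j] to PE(i, j).
    for t in range(M + N + K - 2):
        for i in range(M):
            for j in range(N):
                k = t - i - j
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.arange(6).reshape(2, 3).astype(float)
B = np.arange(12).reshape(3, 4).astype(float)
assert np.allclose(systolic_matmul(A, B), A @ B)
```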
5.4 Approximation in neural networks
Date: Wednesday, 03 February 2021
Time: 08:00 CET - 08:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/hrvyGCXkG2iJmSzvX
Session chair:
Nikos Bellas, University of Thessaly, GR
Session co-chair:
Lukas Sekanina, Brno University of Technology, CZ
Approximate computing is nowadays widely used for optimizing neural network computing in terms of performance, energy and area. This session presents three papers at the confluence of these two emerging paradigms. First, a methodology is presented that formally studies the effects of approximate (e.g. undervolted) memory, in terms of the bit flips it may induce, on the functionality of Binarized Neural Networks (BNNs). The second paper presents a modified CNN re-training methodology to better accommodate CNN-specific approximation errors. The third paper proposes a stochastic computing accelerator for an efficient implementation of neural networks (a sketch of the underlying stochastic arithmetic follows this session's listing). An IP paper focuses on the hardware implementation of a Bayesian Confidence Propagation Neural Network with reduced memory consumption.
Time | Label | Presentation Title / Authors |
---|---|---|
08:00 CET | 5.4.1 | (Best Paper Award Candidate) MARGIN-MAXIMIZATION IN BINARIZED NEURAL NETWORKS FOR OPTIMIZING BIT ERROR TOLERANCE Speaker: Mikail Yayla, TU Dortmund, DE Authors: Sebastian Buschjäger, Jian-Jia Chen, Kuan-Hsun Chen, Mario Günzel, Christian Hakert, Katharina Morik, Rodion Novkin, Lukas Pfahler and Mikail Yayla, TU Dortmund, DE Abstract To overcome the memory wall in neural network (NN) inference systems, recent studies have proposed to use approximate memory, in which the supply voltage and access latency parameters are tuned, for lower energy consumption and faster access at the cost of reliability. To tolerate the resulting bit errors, state-of-the-art approaches apply bit flip injections to the NNs during training, which incur high overheads and do not scale well for large NNs and high bit error rates. In this work, we focus on binarized NNs (BNNs), whose simpler structure allows better exploration of bit error tolerance metrics based on margins. We provide formal proofs to quantify the maximum number of bit flips that can be tolerated. With the proposed margin-based metrics and the well-known hinge loss for maximum-margin classification in support vector machines (SVMs), we construct a modified hinge loss (MHL) to train BNNs for bit error tolerance without any bit flip injection. Our experimental results indicate that the MHL enables BNNs to tolerate higher bit error rates than bit flip training and therefore allows to further lower the requirements on the approximate memories used for BNNs. |
08:15 CET | 5.4.2 | KNOWLEDGE DISTILLATION AND GRADIENT ESTIMATION FOR ACTIVE ERROR COMPENSATION IN APPROXIMATE NEURAL NETWORKS Speaker: Cecilia De la Parra, Robert Bosch GmbH, DE Authors: Cecilia De la Parra1, Xuyi Wu2, Akash Kumar3 and Andre Guntoro1 1Robert Bosch GmbH, DE; 2TU Muenchen, DE; 3TU Dresden, DE Abstract Approximate computing is a promising approach for optimizing computational resources of error-resilient applications such as Convolutional Neural Networks (CNNs). However, such approximations introduce an error that needs to be compensated by optimization methods, which typically include a retraining or fine-tuning stage. To efficiently recover from the introduced error, this fine-tuning process needs to be adapted to take CNN approximations into consideration. In this work, we present a novel methodology for fine-tuning approximate CNNs with ultra-low bit-width quantization and large approximation error, which combines knowledge distillation and gradient estimation to recover the lost accuracy due to approximations. With our proposed methodology, we demonstrate energy savings of up to 38% in complex approximate CNNs with weights quantized to 4 bits and 8-bit activations, with less than 3% accuracy loss w.r.t. the full precision model. |
08:30 CET | IP4_4.2 | APPROXIMATE COMPUTATION OF POST-SYNAPTIC SPIKES REDUCES BANDWIDTH TO SYNAPTIC STORAGE IN A MODEL OF CORTEX Speaker: Yu Yang, KTH Royal Institute of Technology, SE Authors: Dimitrios Stathis1, Yu Yang2, Ahmed Hemani3 and Anders Lansner4 1KTH Royal Institute of Technology, SE; 2Royal Institute of Technology - KTH, SE; 3KTH - Royal Institute of Technology, SE; 4Stockholm University and KTH Royal Institute of Technology, SE Abstract The Bayesian Confidence Propagation Neural Network (BCPNN) is a spiking model of the cortex. The synaptic weights of BCPNN are organized as matrices. They require substantial synaptic storage and a large bandwidth to it. The algorithm requires a dual access pattern to these matrices, both row-wise and column-wise, to access its synaptic weights. In this work, we exploit an algorithmic optimization that eliminates the column-wise accesses. The new computation model approximates the post-synaptic spikes computation with the use of a predictor. We have adopted this approximate computational model to improve upon the previously reported ASIC implementation, called eBrainII. We also present the error analysis of the approximation to show that it is negligible. The reduction in storage and bandwidth to the synaptic storage results in a 48% reduction in energy compared to eBrainII. The reported approximation method also applies to other neural network models based on a Hebbian learning rule. |
08:31 CET | 5.4.3 | GEO: GENERATION AND EXECUTION OPTIMIZED STOCHASTIC COMPUTING ACCELERATOR FOR NEURAL NETWORKS Speaker: Tianmu Li, University of California, Los Angeles, US Authors: Tianmu Li1, Wojciech Romaszkan2, Sudhakar Pamarti3 and Puneet Gupta2 1University of California, Los Angeles, US; 2UCLA, US; 3University of California Los Angeles, US Abstract Stochastic computing (SC) has seen a renaissance in recent years as a means for machine learning acceleration due to its compact arithmetic and approximation properties. Still, SC accuracy remains an issue, with prior works either not fully utilizing the computational density or suffering from significant accuracy losses. In this work, we propose GEO – Generation and Execution Optimized Stochastic Computing Accelerator for Neural Networks – which optimizes the stream generation and execution components of SC and bridges the accuracy gap between stochastic computing and fixed-point neural networks. We improve accuracy by coupling controlled stream sharing with training and by balancing OR and binary accumulations. GEO further optimizes the SC execution through progressive shadow buffering and architectural optimizations. GEO improves accuracy compared to state-of-the-art SC by 2.2-4.0 percentage points while being up to 4.4X faster and 5.3X more energy-efficient. GEO eliminates the accuracy gap between SC and fixed-point architectures while delivering up to 5.6X higher throughput and 2.6X lower energy. |
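As background for GEO (5.4.3): in classic unipolar stochastic computing, a value p in [0, 1] is encoded as a random bitstream with P(bit = 1) = p, and multiplication of two independent streams reduces to a bitwise AND. The sketch below shows only this textbook substrate, not GEO's optimized stream generation or execution; the stream length and values are illustrative.

```python
import random

def to_stream(p, n, rng):
    # Encode p as n Bernoulli(p) bits.
    return [1 if rng.random() < p else 0 for _ in range(n)]

def from_stream(bits):
    # Decode a unipolar stream back to a probability estimate.
    return sum(bits) / len(bits)

rng = random.Random(0)
n = 4096                          # longer streams -> lower approximation error
a, b = 0.75, 0.5
sa, sb = to_stream(a, n, rng), to_stream(b, n, rng)
prod = [x & y for x, y in zip(sa, sb)]   # AND of independent streams multiplies
print(from_stream(prod), "~", a * b)     # e.g. 0.37... ~ 0.375
```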
5.5 Optimizing the Memory System for Latency and Throughput
Date: Wednesday, 03 February 2021
Time: 08:00 CET - 08:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/Yi7Z64cbuY2EKexK8
Session chair:
Georgios Keramidas, Aristotle University of Thessaloniki, GR
Session co-chair:
Caroline Collange, INRIA, FR
This session presents an application-aware replacement policy and a new scalable memory design for multi-core and many-core architectures. An efficient solution for enabling on-the-fly error detection and correction in latency-sensitive caches is also described. The session concludes with a two-level approach to fine-tuning the memory system by using both simulators and real hardware.
Time | Label | Presentation Title / Authors |
---|---|---|
08:00 CET | 5.5.1 | A FAIRNESS CONSCIOUS CACHE REPLACEMENT POLICY FOR LAST LEVEL CACHE Speaker: Shirshendu Das, Indian Institute of Technology Ropar, Punjab, India, IN Authors: Kousik Kumar Dutta, Prathamesh Nitin Tanksale and Shirshendu Das, Indian Institute of Technology Ropar, IN Abstract Multicore systems with a shared Last Level Cache (LLC) pose a significant challenge in allocating the LLC space among the multiple applications running in the system. Since all applications use the shared LLC, interference between them may evict important blocks of other applications, which results in premature eviction and may also lead to thrashing. Replacement policies applied locally to a set distribute the sets dynamically among the applications. However, previous work on replacement techniques focused on the re-reference aspect of a block or on application behavior to improve the overall system performance. The paper proposes a novel cache replacement technique, Application-Aware Re-reference Interval Prediction (AARIP), that considers application behavior, the re-reference interval, and premature block eviction when replacing a cache block. Experimental evaluation on a four-core system shows that AARIP improves overall performance by 7.26%, throughput by 4.9%, and overall system fairness by 7.85% compared to the traditional SRRIP replacement policy (a toy model of the SRRIP baseline follows this session's listing). |
08:15 CET | 5.5.2 | MEMPOOL: A SHARED-L1 MEMORY MANY-CORE CLUSTER WITH A LOW-LATENCY INTERCONNECT Speaker: Matheus Cavalcante, ETH Zürich, CH Authors: Matheus Cavalcante1, Samuel Riedel1, Antonio Pullini2 and Luca Benini3 1ETH Zurich, CH; 2GreenWaves Technologies, FR; 3Università di Bologna and ETH Zurich, IT Abstract A key challenge in scaling shared-L1 multi-core clusters towards many-core (more than 16 cores) configurations is to ensure low-latency and efficient access to the L1 memory. In this work we demonstrate that it is possible to scale up the shared-L1 architecture: we present MemPool, a 32-bit many-core system with 256 fast RV32IMA "Snitch" cores featuring application-tunable execution units, running at 700 MHz in typical conditions (TT/0.80 V/25 °C). MemPool is easy to program, with all the cores sharing a global view of a large L1 scratchpad memory pool, accessible within at most five cycles of zero-load latency. In MemPool's physical-aware design, we emphasized the exploration, design, and optimization of the low-latency processor-to-L1-memory interconnect. We compare three candidate topologies, analyzing them in terms of latency, throughput, and back-end feasibility. The chosen topology keeps the average latency at fewer than six cycles, even for a heavy injected load of 0.33 requests/core/cycle. We also propose a lightweight addressing scheme that maps each core's private data to a memory bank accessible within one cycle, which leads to performance gains of up to 20% in real-world signal processing benchmarks. The addressing scheme is also highly efficient in terms of energy consumption, since requests to local banks consume only half of the energy required to access remote banks. Our design achieves competitive performance with respect to an ideal, non-implementable full-crossbar baseline. |
08:30 CET | IP4_5.1 | MEMORY HIERARCHY CALIBRATION BASED ON REAL HARDWARE IN-ORDER CORES FOR ACCURATE SIMULATION Speaker: Quentin Huppert, LIRMM, FR Authors: Quentin Huppert1, Timon Evenblij2, Manu Komalan2, Francky Catthoor3, Lionel Torres4 and David Novo5 1Lirmm, FR; 2imec, BE; 3IMEC, BE; 4University of Montpellier, FR; 5CNRS, LIRMM, University of Montpellier, FR Abstract Computer system simulators are major tools used by architecture researchers. Two key elements play a role in the credibility of simulator results: (1) the simulator’s accuracy, and (2) the quality of the baseline architecture. Some simulators, such as gem5, already provide highly accurate parameterized models. However, finding the right values for all these parameters to faithfully model a real architecture is still a problem. In this paper, we calibrate the memory hierarchy of an in-order core gem5 simulation to accurately model a real mobile Arm SoC. We execute small programs, which are designed to stress specific parts of the memory system, to deduce key parameter values for the model. We compare the execution of SPEC CPU2006 benchmarks on the real hardware with the gem5 simulation. Our results show that our calibration reduces the average and worst-case IPC error by 36% and 50%, respectively, when compared with a gem5 simulation configured with the default parameters. |
08:31 CET | 5.5.3 | SRAM ARRAYS WITH BUILT-IN PARITY COMPUTATION FOR REAL-TIME ERROR DETECTION IN CACHE TAG ARRAYS Speaker: Ramon Canal, Universitat Politècnica de Catalunya, ES Authors: Ramon Canal1, Yiannakis Sazeides2 and Arkady Bramnik3 1Universitat Politècnica de Catalunya, ES; 2University of Cyprus, CY; 3Intel, IL Abstract This work proposes an SRAM array with built-in real-time error detection (RTD) capabilities. Each cell in the new RTD-SRAM array computes its part of the real-time parity of an SRAM array column on the fly. RTD-based arrays detect a fault right after it occurs, rather than when the data is read. RTD therefore breaks the serialization between data access and error detection and can thus speed up the access time of arrays that use on-the-fly error detection and correction. The paper presents an analysis and optimization of an RTD-SRAM and its application to a tag array. Compared to state-of-the-art tag array protection, the evaluated scheme has comparable error detection and correction strength and, depending on the array dimensions, reduces access time by 5% to 18%, energy by 20% to 40%, and area by up to 30%. |
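Paper 5.5.1 above evaluates against SRRIP, so a toy one-set model of that baseline may help readers place the result. This is our own simplification (2-bit re-reference prediction values, a single set, string tags), not the paper's implementation.

```python
# A toy one-set SRRIP model: hits promote a line to "near-immediate"
# re-reference; misses evict a line predicted for the distant future.
class SRRIPSet:
    MAX_RRPV = 3                       # 2-bit re-reference prediction values

    def __init__(self, ways):
        self.lines = {}                # tag -> RRPV
        self.ways = ways

    def access(self, tag):
        if tag in self.lines:          # hit: predict near-immediate re-reference
            self.lines[tag] = 0
            return "hit"
        if len(self.lines) >= self.ways:
            # Age all lines until some line reaches RRPV == 3, then evict it.
            while not any(v == self.MAX_RRPV for v in self.lines.values()):
                for t in self.lines:
                    self.lines[t] += 1
            victim = next(t for t, v in self.lines.items() if v == self.MAX_RRPV)
            del self.lines[victim]
        self.lines[tag] = self.MAX_RRPV - 1   # insert with "long" re-reference
        return "miss"

s = SRRIPSet(ways=4)
print([s.access(t) for t in "ABCADEA"])  # reused block A survives the scan
```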
5.6 From applications to circuit layout - Industrial perspectives
Date: Wednesday, 03 February 2021
Time: 08:00 CET - 08:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/ucSCpqWvJpbHKvgAw
Session chair:
Emil Matus, TU Dresden, DE
Session co-chair:
Nicolas Ventroux, Thales Research and Technology, FR
This session covers a wide range of topics, from an avionics use case, through Linux driver validation, to automatic layout generation for DRAM processes. The first paper assesses the feasibility of using two graphics-based academic methodologies for GPUs, OpenGL SC 2.0 and Brook Auto/BRASIL, in an industrial safety-critical use case (an avionics application). The second one shows a method for checking the conformance of a VirtIO driver implementation to its specification using the Clock Constraint Specification Language (CCSL) and a tool to check for violations. The last paper in this session presents generator-script-based, process-independent layout generation for area-optimized logic cells in advanced DRAM processes.
Time | Label | Presentation Title / Authors |
---|---|---|
08:00 CET | 5.6.1 | COMPARISON OF GPU COMPUTING METHODOLOGIES FOR SAFETY-CRITICAL SYSTEMS: AN AVIONICS CASE STUDY Speaker: Leonidas Kosmidis, Barcelona Supercomputing Center (BSC) and Universitat Politècnica de Catalunya (UPC), ES Authors: Marc Benito1, Matina Maria Trompouki2, Leonidas Kosmidis3, Juan David Garcia4, Sergio Carretero4 and Ken Wenger5 1Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES; 2Universitat Politècnica de Catalunya, ES; 3Barcelona Supercomputing Center (BSC), ES; 4Airbus Defence and Space, ES; 5CoreAVI, CA Abstract Introducing advanced functionalities in safety-critical systems requires more powerful architectures such as GPUs. However, software in safety-critical industries is subject to functional certification, which cannot be achieved using standard GPU programming languages such as CUDA and OpenCL. Fortunately, GPUs are already used in certified critical systems for display tasks, using safety-certified solutions such as OpenGL SC 2.0. In this paper, we compare two state-of-the-art graphics-based methodologies, OpenGL SC 2.0 and Brook Auto/BRASIL, for the implementation of a prototype avionics case study. We evaluate both methods on a realistic industrial setup, composed of an avionics-grade GPU and a safety-certified GPU driver, in terms of development metrics and performance, showing their feasibility. |
08:15 CET | 5.6.2 | VERIFYING THE CONFORMANCE OF A DRIVER IMPLEMENTATION TO THE VIRTIO SPECIFICATION Speaker: Matias Ezquiel Vara Larsen, Huawei Research Center, FR Author: Matias Vara Larsen, Huawei, FR Abstract VirtIO is a specification that enables developers to rely on a common interface when implementing devices and drivers for virtual environments. This paper proposes the verification and analysis of the VirtIO specification by using the Clock Constraint Specification Language (CCSL). In our proof-of-concept approach, a verification engineer translates requirements from the specification into a CCSL specification. Then, the tool TimeSquare is used to detect inconsistencies with an implementation, but also to understand what the specification permits. This paper aims to present the approach and to foster face-to-face discussion and debate about its benefits, drawbacks and trade-offs. (A toy clock-constraint check in this spirit follows this session's listing.) |
08:30 CET | 5.6.3 | PROCESS-PORTABLE AND PROGRAMMABLE LAYOUT GENERATION OF DIGITAL CIRCUITS IN ADVANCED DRAM TECHNOLOGIES Speaker: Youngbog Yoon, SK Hynix, KR Authors: Youngbog Yoon1, Daeyong Han1, Shinho Chu1, Sangho Lee1, Jaeduk Han2 and Junhyun Chun1 1SK Hynix, KR; 2Hanyang University, KR Abstract This paper introduces a physical layout design methodology that produces DRC-clean, area-efficient, and programmable layouts of digital circuits in advanced DRAM processes. The proposed methodology automates the layout generation process to enhance design productivity, while still providing rich customization for efficient area and routing-resource utilization. Process-specific parameterized cells (PCells) are combined with process-independent place-and-route functions to automatically generate area-efficient and programmable layouts. Routing grids are optimized to enhance area and routing efficiency. The proposed method reduced the design time of digital layouts by 80% compared to a manual design while maintaining high layout quality, significantly enhancing design productivity. |
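To give a flavor of the clock-constraint checking used in 5.6.2: a CCSL-style causality constraint "a precedes b" can be checked over a finite event trace by verifying that b's tick count never exceeds a's. The toy checker below is our own illustration; TimeSquare works on full CCSL specifications, and the event names are hypothetical.

```python
def check_precedes(trace, a, b):
    # Causality: at every point in the trace, b must not have ticked
    # more often than a.
    ticks_a = ticks_b = 0
    for step, event in enumerate(trace):
        ticks_a += event == a
        ticks_b += event == b
        if ticks_b > ticks_a:
            return f"violation at step {step}: {b} overtook {a}"
    return "trace conforms"

# e.g. a driver must not signal 'used' before the device saw 'avail'
print(check_precedes(["avail", "used", "avail", "used"], "avail", "used"))
print(check_precedes(["avail", "used", "used"], "avail", "used"))
```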
5.7 Machine learning dependability and test
Date: Wednesday, 03 February 2021
Time: 08:00 CET - 08:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/TZGcZcMg3h6G5fbuj
Session chair:
Rishad Shafik, Newcastle University, GB
Session co-chair:
Georgios Karakonstantis, Queen's University Belfast, GB
This session focuses on the reliability of machine learning accelerators and applications. The papers in this session present architectural solutions to improve the reliability of different neural network accelerators, including DNNs and spiking neural networks. The presented solutions aim at improving dependability while minimizing its cost in terms of area and performance.
Time | Label | Presentation Title / Authors |
---|---|---|
08:00 CET | 5.7.1 | HYDREA: TOWARDS MORE ROBUST AND EFFICIENT MACHINE LEARNING SYSTEMS WITH HYPERDIMENSIONAL COMPUTING Speaker: Justin Morris, University of California, San Diego, US Authors: Justin Morris1, Kazim Ergun2, Behnam Khaleghi1, Mohsen Imani3, Baris Aksanli4 and Tajana Rosing5 1University of California, San Diego, US; 2University of California San Diego, US; 3University of California Irvine, US; 4San Diego State University, US; 5UCSD, US Abstract Today’s systems, especially in the age of federated learning, rely on sending all the data to the cloud and then using complex algorithms, such as Deep Neural Networks, which require billions of parameters and many hours to train a model. In contrast, the human brain can do much of this learning effortlessly. Hyperdimensional (HD) Computing aims to mimic the behavior of the human brain by utilizing high-dimensional representations. This leads to various desirable properties that other Machine Learning (ML) algorithms lack, such as robustness to noise in the system and simple, highly parallel operations. In this paper, we propose HyDREA, an HD computing system that is Robust, Efficient, and Accurate. To evaluate the feasibility of HyDREA in a federated learning environment with wireless communication noise, we utilize NS-3, a popular network simulator that models a real-world environment with wireless communication noise. We found that HyDREA is 48x more robust to noise than other comparable ML algorithms. We additionally propose a Processing-in-Memory (PIM) architecture that adaptively changes the bitwidth of the model based on the signal-to-noise ratio (SNR) of the incoming sample to maintain the robustness of the HD model while achieving high accuracy and energy efficiency. Our results indicate that our proposed system loses less than 1% classification accuracy, even in scenarios with an SNR of 6.64. Our PIM architecture is also able to achieve 255x better energy efficiency and speed up execution time by 28x compared to the baseline PIM architecture. (A minimal HD-computing sketch follows this session's listing.) |
08:15 CET | 5.7.2 | (Best Paper Award Candidate) DNN-LIFE: AN ENERGY-EFFICIENT AGING MITIGATION FRAMEWORK FOR IMPROVING THE LIFETIME OF ON-CHIP WEIGHT MEMORIES IN DEEP NEURAL NETWORK HARDWARE ARCHITECTURES Speaker: Muhammad Abdullah Hanif, Vienna University of Technology (TU Wien), AT Authors: Muhammad Abdullah Hanif1 and Muhammad Shafique2 1Institute of Computer Engineering, Vienna University of Technology, AT; 2New York University Abu Dhabi (NYUAD), AE Abstract Negative Bias Temperature Instability (NBTI)-induced aging is one of the critical reliability threats in nano-scale devices. This paper makes the first attempt to study NBTI aging in the on-chip weight memories (composed of 6T-SRAM cells) of deep neural network (DNN) hardware accelerators subjected to complex DNN workloads. We propose DNN-Life, a specialized aging-mitigation framework for DNNs, which jointly exploits hardware- and software-level knowledge to improve the lifetime of the DNN weight memory with reduced energy overhead. At the software level, we analyze the effects of different DNN quantization methods on the distribution of the bits of the weight values. Based on the insights gained from this analysis, we propose a micro-architecture that employs low-cost memory write (and read) transducers to achieve an optimal duty cycle at run time in the memory cells, thereby balancing the aging of the complementary parts in the 6T-SRAM cells of the weight memory. As a result, our DNN-Life framework enables efficient aging mitigation of the weight memory of a given DNN hardware at minimal energy overhead during the inference process. |
08:30 CET | IP4_6.1 | WISER: DEEP NEURAL NETWORK WEIGHT-BIT INVERSION FOR STATE ERROR REDUCTION IN MLC NAND FLASH Speaker: Jaehun Jang, SungKyunkwan University, KR Authors: Jaehun Jang1 and Jong Hwan Ko2 1Department of Semiconductor and Display Engineering, SungKyunkwan University, Memory Division, Samsung Electronics, KR; 2Sungkyunkwan University (SKKU), KR Abstract When Flash memory is used to store deep neural network (DNN) weights, inference accuracy can degrade due to Flash memory state errors. To protect the weights from state errors, existing methods rely on ECC (Error Correction Code) or parity, which can incur power/storage overhead. In this study, we propose a weight-bit inversion method that minimizes the accuracy loss due to Flash memory state errors without using ECC or parity. The method first applies WISE (Weight-bit Inversion for State Elimination), which removes the most error-prone state from MLC NAND, thereby improving both the error robustness and the MSB page read speed. If the initial accuracy loss due to the weight inversion of WISE is unacceptable, we apply WISER (Weight-bit Inversion for State Error Reduction), which reduces weight mapping to the error-prone state with minimal weight value changes. The simulation results show that after 16K program-erase cycles in NAND Flash, WISER reduces the CIFAR-100 accuracy loss by 2.92X for VGG-16 compared to existing methods. |
08:31 CET | IP4_6.2 | OR-ML: ENHANCING RELIABILITY FOR MACHINE LEARNING ACCELERATOR WITH OPPORTUNISTIC REDUNDANCY Speaker: Zheng Wang, Shenzhen Institutes of Advanced Technology, CN Authors: Bo Dong1, Zheng Wang2, Wenxuan Chen3, Chao Chen2, Yongkui Yang2 and Zhibin Yu2 1Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; School of Microelectronics, Xidian University, CN; 2Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, CN; 3Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; School of Microelectronics, Xidian University, CN Abstract Reliability plays a central role in deep sub-micron and nanometre IC fabrication technology and has recently been reported to be one of the key issues affecting the inference phase of neural networks. State-of-the-art machine learning (ML) accelerators exploit the massive computing parallelism of neural networks to achieve high energy efficiency. The topology of ML engines' computing fabric, which consists of large arrays of processing elements (PEs), has been growing dramatically to accommodate the huge size and heterogeneity of rapidly evolving ML algorithms. However, it is commonly observed that activations of zero value lead to reduced PE utilization. In this work, we present a novel and low-cost approach, named OR-ML, that enhances the reliability of generic ML accelerators by opportunistically exploiting runtime redundancy provided by neighbouring PEs. In contrast to conventional redundancy techniques, the proposed technique introduces no additional computing resources and therefore significantly reduces the implementation overhead while achieving a notable level of protection. The design prototype is evaluated using emulated fault injection on FPGA, executing mainstream neural networks for object classification and detection. |
08:32 CET | 5.7.3 | NEURON FAULT TOLERANCE IN SPIKING NEURAL NETWORKS Speaker: Theofilos Spyrou, Sorbonne Université, CNRS, LIP6, FR Authors: Theofilos Spyrou1, Sarah A. El-Sayed1, Engin Afacan1, Luis A. Camuñas-Mesa2, Bernabé Linares-Barranco2 and Haralampos-G. Stratigopoulos1 1Sorbonne Université, CNRS, LIP6, FR; 2Instituto de Microelectrónica de Sevilla (IMSE-CNM), CSIC y Universidad de Sevilla, ES Abstract The error-resiliency of Artificial Intelligence (AI) hardware accelerators is a major concern, especially when they are deployed in mission-critical and safety-critical applications. In this paper, we propose a neuron fault tolerance strategy for Spiking Neural Networks (SNNs). It is optimized for low area and power overhead by leveraging observations made from a large-scale fault injection experiment that pinpoints the critical fault types and locations. We describe the fault modeling approach, the fault injection framework, the results of the fault injection experiment, the fault-tolerance strategy, and the fault-tolerant SNN architecture. The idea is demonstrated on two SNNs that we designed for two SNN-oriented datasets, namely the N-MNIST and IBM's DVS128 gesture datasets. |
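As background for HyDREA (5.7.1): hyperdimensional computing encodes items as very wide random vectors, builds class prototypes by bitwise majority, and classifies by nearest Hamming distance, which is why random bit errors degrade it so gracefully. The sketch below is a generic textbook illustration with illustrative dimensions and class names, not HyDREA's PIM design.

```python
import random

D = 10000
rng = random.Random(42)

def rand_hv():
    return [rng.randint(0, 1) for _ in range(D)]

def bundle(hvs):                      # bitwise majority of a list of hypervectors
    return [1 if sum(bits) * 2 > len(hvs) else 0 for bits in zip(*hvs)]

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

# Class prototypes bundle their training encodings; inference is nearest
# prototype in Hamming distance, so noisy bits barely move the result.
cat = [rand_hv() for _ in range(5)]
dog = [rand_hv() for _ in range(5)]
proto = {"cat": bundle(cat), "dog": bundle(dog)}
query = cat[0][:]
for i in rng.sample(range(D), 2000):  # flip 20% of bits to emulate channel noise
    query[i] ^= 1
print(min(proto, key=lambda c: hamming(proto[c], query)))  # still "cat"
```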
6.8 How STM32 Enables Digital Transformation in Industries
Date: Wednesday, 03 February 2021
Time: 09:00 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/D4CJPnhsL2NDCCGbP
Session Chairs:
Marcello Coppola, STMicroelectronics, FR
Antonio Lionetto, STMicroelectronics, IT
Organizer:
Jürgen Haase, edacentrum GmbH, DE
One of the most demanding challenges for European industry is the digital transformation of startups, SMEs and midcaps. FED4SAE, a programme funded by the EU as part of its Horizon 2020 initiative, facilitates access to leading technologies, competencies, and industrial platforms. The session speakers will demonstrate in a pragmatic way how the STM32 microcontroller and its ecosystem have enabled Bettair, Energica Motor Company, Safecility and Zannini to address technical challenges in smart cities, building safety, the FIM MotoE™ World Cup and Industry 4.0. You will learn more about these success stories and what STM32 really means for:
- Bettair - who will present the full-cycle development of a complete solution to monitor the air quality of urban areas in real time
- Energica Motor Company - who will share the Energica LPWAN Low Power EV Battery Monitor System used in the FIM MotoE™ World Cup
- Safecility - who will present how to enhance building safety across Europe through IoT automation
- Zannini - who will show an integrated system for monitoring and controlling industrial machines.
Time | Label | Presentation Title / Authors |
---|---|---|
09:00 CET | 6.8.1 | INTRODUCTION: HOW STM32 ENABLES DIGITAL TRANSFORMATION IN INDUSTRIES Speaker: Marcello Coppola, STMicroelectronics, FR |
09:08 CET | 6.8.2 | ENERGICA LPWAN LOW POWER EV BATTERY MONITOR SYSTEM Speaker: Giovanni Gherardi, Energica Motor Company, IT |
09:26 CET | 6.8.3 | AIR-QUALITY MONITORING SYSTEM VIA LORAWAN NETWORK Speakers: Leonardo Santiago and Jaume Ribot, Bettair Cities, ES |
09:44 CET | 6.8.4 | ENHANCING BUILDING SAFETY ACROSS EUROPE THROUGH IOT AUTOMATION BUILT ON STM32: HOW SAFECILITY DELIVER IOT EMERGENCY LIGHTING Speaker: Cian O Flaherty, Safecility, IE |
10:02 CET | 6.8.5 | AN INTEGRATED SYSTEM FOR MONITORING AND CONTROLLING INDUSTRIAL MACHINES Speaker: Lorenzo Zampetti, Z4tec, IT |
IP4_1 Interactive Presentations
Date: Wednesday, 03 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/fWPCcb8gEwtPBrbnD
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title / Authors |
---|---|---|
IP4_1.1 | HARDWARE ACCELERATION OF FULLY QUANTIZED BERT FOR EFFICIENT NATURAL LANGUAGE PROCESSING Speaker: Zejian Liu, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, CN Authors: Zejian Liu1, Gang Li2 and Jian Cheng1 1National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, CN; 2Institute of Automation, Chinese Academy of Sciences, CN Abstract BERT is the most recent Transformer-based model that achieves state-of-the-art performance on various NLP tasks. In this paper, we investigate the hardware acceleration of BERT on FPGAs for edge computing. To tackle the issue of huge computational complexity and memory footprint, we propose to fully quantize BERT (FQ-BERT), including the weights, activations, softmax, layer normalization, and all intermediate results. Experiments demonstrate that FQ-BERT can achieve 7.94× compression of the weights with negligible performance loss. We then propose an accelerator tailored for FQ-BERT and evaluate it on Xilinx ZCU102 and ZCU111 FPGAs. It can achieve a performance-per-watt of 3.18 fps/W, which is 28.91× and 12.72× better than an Intel(R) Core(TM) i7-8700 CPU and an NVIDIA K80 GPU, respectively. (A sketch of the basic quantization step follows this session's listing.) |
IP4_1.2 | AXPIKE: INSTRUCTION-LEVEL INJECTION AND EVALUATION OF APPROXIMATE COMPUTING Speaker: Isaias Felzmann, University of Campinas, BR Authors: Isaías Bittencourt Felzmann1, João Fabrício Filho2 and Lucas Wanner3 1University of Campinas, BR; 2Unicamp/UTFPR, BR; 3Unicamp, BR Abstract Representing the interaction between accurate and approximate hardware modules at the architecture level is essential to understand the impact of Approximate Computing in a general-purpose computing scenario. However, extensive effort is required to model approximations in a baseline instruction-level simulator and collect its execution metrics. In this work, we present the AxPIKE ISA simulation environment, a tool that allows designers to inject models of hardware approximation at the instruction level and evaluate their impact on the quality of results. AxPIKE embeds a high-level representation of a RISC-V system and produces a dedicated control mechanism that allows the simulated software to manage the approximate behavior of compatible execution scenarios. The environment also provides detailed execution statistics that are forwarded to dedicated tools for energy accounting. We apply the AxPIKE environment to inject integer multiplication and memory access approximations into different applications and demonstrate how the generated statistics are translated into energy-quality trade-offs. |
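As a concrete aside to IP4_1.1 above: fully quantized inference rests on linear quantization of tensors. The sketch below shows plain symmetric per-tensor quantization under our own assumptions (8-bit, one scale per tensor); FQ-BERT's actual scheme for softmax, layer normalization and intermediate results is more involved.

```python
import numpy as np

def quantize(x, bits=8):
    # Symmetric linear quantization: map the float range to signed integers.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax        # one scale per tensor, for simplicity
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
qw, s = quantize(w)
print("max abs error:", np.abs(w - dequantize(qw, s)).max())  # roughly scale/2
```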
IP4_2 Interactive Presentations
Date: Wednesday, 03 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/YyAdz2Y4a4M3EQPKY
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title / Authors |
---|---|---|
IP4_2.1 | THERMAL COMFORT AWARE ONLINE ENERGY MANAGEMENT FRAMEWORK FOR A SMART RESIDENTIAL BUILDING Speaker: Daichi Watari, Osaka University, JP Authors: Daichi Watari1, Ittetsu Taniguchi1, Francky Catthoor2, Charalampos Marantos3, Kostas Siozios4, Elham Shirazi5, Dimitrios Soudris3 and Takao Onoye1 1Osaka University, JP; 2imec, KU Leuven, BE; 3National TU Athens, GR; 4Aristotle University of Thessaloniki, GR; 5imec, KU Leuven, EnergyVille, BE Abstract Energy management in buildings equipped with renewable energy is vital for reducing electricity costs and maximizing occupant comfort. Despite several studies on the scheduling of appliances, batteries, and heating, ventilation, and air-conditioning (HVAC), there is a lack of a comprehensive and time-scalable approach that integrates predictive information such as renewable generation and thermal comfort. In this paper, we propose an online energy management framework that incorporates optimal energy scheduling and prediction models of PV generation and thermal comfort through the model predictive control (MPC) approach. The energy management problem is formulated as three coordinated optimization problems covering fast and slow time scales. This reduces the time complexity without a significant negative impact on the global nature and quality of the result. Experimental results show that the proposed framework achieves optimal energy management that takes into account the trade-off between the electricity bill and thermal comfort. |
IP4_2.2 | ONLINE LATENCY MONITORING OF TIME-SENSITIVE EVENT CHAINS IN SAFETY-CRITICAL APPLICATIONS Speaker: Jonas Peeck, TU Braunschweig, Institute of Computer and Network Engineering, DE Authors: Jonas Peeck, Johannes Schlatow and Rolf Ernst, TU Braunschweig, DE Abstract Highly automated driving involves chains of perception, decision, and control functions. These functions involve data-intensive algorithms that motivate the use of a data-centric middleware and a service-oriented architecture. As an example, we use the open-source project Autoware.Auto. The function chains define a safety-critical automated control task with weakly-hard real-time constraints. However, providing the required assurance by formal analysis is challenged by the complex hardware/software structure of these systems and their dynamics. We propose an approach that combines measurement, a suitable distribution of deadline segments, and application-level online monitoring to supervise the execution of service-oriented software systems with multiple function chains and weakly-hard real-time constraints. We use DDS as the middleware and apply the approach to an Autoware.Auto use case. |
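As a concrete aside to IP4_2.2 above: a weakly-hard constraint bounds how many deadline misses may occur in any window of K consecutive instances. The monitor below is a toy illustration of such a check over end-to-end chain latencies; the class name, m/K values and deadline are our assumptions, not the paper's design.

```python
from collections import deque

class WeaklyHardMonitor:
    """Flags a violation when more than m of the last K chain instances
    missed their end-to-end deadline (a hypothetical (m, K) constraint)."""

    def __init__(self, m, K, deadline):
        self.m, self.deadline = m, deadline
        self.window = deque(maxlen=K)      # 1 = miss, 0 = hit

    def report(self, latency):
        self.window.append(1 if latency > self.deadline else 0)
        return sum(self.window) <= self.m  # False -> constraint violated

mon = WeaklyHardMonitor(m=1, K=4, deadline=0.10)   # latencies in seconds
for lat in [0.08, 0.12, 0.09, 0.07, 0.13, 0.11]:
    print(lat, "ok" if mon.report(lat) else "VIOLATION")
```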
IP4_3 Interactive Presentations
Date: Wednesday, 03 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/2nAZpfhAejudsEauZ
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title / Authors |
---|---|---|
IP4_3.1 | FEEDING THREE BIRDS WITH ONE SCONE: A GENERIC DUPLICATION BASED COUNTERMEASURE TO FAULT ATTACKS Speaker: Jakub Breier, Silicon Austria Labs, AT Authors: Anubhab Baksi1, Shivam Bhasin2, Jakub Breier3, Anupam Chattopadhyay4 and Vinay B. Y. Kumar1 1Nanyang Technological University, Singapore, SG; 2Temasek Laboratories, Nanyang Technological University, SG; 3Silicon Austria Labs, AT; 4Nanyang Technological University, SG Abstract In the current world of the Internet of Things and edge computing, computations are increasingly performed locally on small connected systems. As such, those devices are often vulnerable to adversarial physical access, enabling a plethora of physical attacks, which is a challenge even if such devices are built for security. As cryptography is one of the cornerstones of secure communication among devices, the pertinence of fault attacks is becoming increasingly apparent in settings where a device can easily be accessed physically. In particular, two recently proposed fault attacks, the Statistical Ineffective Fault Attack (SIFA) and the Fault Template Attack (FTA), have been shown to be formidable due to their capability to bypass common duplication-based countermeasures. Duplication-based countermeasures, deployed to counter the Differential Fault Attack (DFA), work by duplicating the execution of the cipher followed by a comparison to sense the presence of any effective fault, followed by an appropriate recovery procedure. While a handful of countermeasures have been proposed against SIFA, no such countermeasure is known to thwart FTA to date. In this work, we propose a novel countermeasure based on duplication, which can protect against both SIFA and FTA. The proposal is also lightweight, with only a marginal additional cost over simple duplication-based countermeasures. Our countermeasure further protects against all known variants of DFA, including Selmke, Heyszl and Sigl's attack from FDTC 2016. It does not inherently leak side-channel information and is easily adaptable for any symmetric-key primitive. The validation of our countermeasure has been done through gate-level fault simulation. |
IP4_3.2 | SIDE-CHANNEL ATTACK ON RAINBOW POST-QUANTUM SIGNATURE Speaker: Petr Socha, Czech TU in Prague, CZ Authors: David Pokorný, Petr Socha and Martin Novotný, Czech TU in Prague, CZ Abstract Rainbow, a layered multivariate quadratic digital signature, is a candidate for standardization in a competition-like process organized by NIST. In this paper, we present a CPA side-channel attack on the submitted 32-bit reference implementation. We evaluate the attack on an STM32F3 ARM microcontroller, successfully revealing the full private key. Furthermore, we propose a simple masking scheme with minimum overhead. |
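As a concrete aside to IP4_3.2 above: correlation power analysis (CPA) recovers a secret by correlating a leakage model (here, Hamming weight) with measured traces for every key guess. The sketch below runs generic CPA on simulated traces of a toy "leaky XOR" target; the Rainbow-specific intermediates attacked in the paper are omitted so the example stays self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)
HW = np.array([bin(v).count("1") for v in range(256)])  # Hamming-weight table

# Simulate traces that leak HW(input XOR secret) plus Gaussian noise.
secret = 0x3C
inputs = rng.integers(0, 256, size=500)
traces = HW[inputs ^ secret] + rng.normal(0, 1.0, size=500)

def cpa(inputs, traces):
    # For each guess, correlate the modeled leakage with the traces;
    # the correct guess yields the highest |Pearson correlation|.
    return max(range(256),
               key=lambda g: abs(np.corrcoef(HW[inputs ^ g], traces)[0, 1]))

print(hex(cpa(inputs, traces)))  # 0x3c
```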
IP4_4 Interactive Presentations
Date: Wednesday, 03 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/YfkeNJuv3YFTbt9i8
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title / Authors |
---|---|---|
IP4_4.1 | BLOOMCA: A MEMORY EFFICIENT RESERVOIR COMPUTING HARDWARE IMPLEMENTATION USING CELLULAR AUTOMATA AND ENSEMBLE BLOOM FILTER Speaker: Dehua Liang, Graduate School of Information Science and Technology, Osaka University, JP Authors: Dehua Liang, Masanori Hashimoto and Hiromitsu Awano, Osaka University, JP Abstract In this work, we propose BloomCA, which utilizes cellular automata (CA) and an ensemble Bloom filter to organize a reservoir computing (RC) system using only binary operations, which is suitable for hardware implementation. The rich pattern dynamics created by the CA map the input into a high-dimensional space and provide more features for the classifier. Utilizing the ensemble Bloom filter as the classifier, the features can be memorized effectively. Our experiments reveal that applying the ensemble mechanism to the Bloom filter yields a significant reduction in inference memory cost. Compared with the state-of-the-art reference, BloomCA achieves a 43x reduction in memory cost without hurting accuracy. Our hardware implementation also demonstrates that BloomCA achieves over 21x and a 43.64% reduction in area and power, respectively. (A minimal Bloom filter sketch follows this session's listing.) |
IP4_4.2 | APPROXIMATE COMPUTATION OF POST-SYNAPTIC SPIKES REDUCES BANDWIDTH TO SYNAPTIC STORAGE IN A MODEL OF CORTEX Speaker: Yu Yang, KTH Royal Institute of Technology, SE Authors: Dimitrios Stathis1, Yu Yang2, Ahmed Hemani3 and Anders Lansner4 1KTH Royal Institute of Technology, SE; 2Royal Institute of Technology - KTH, SE; 3KTH - Royal Institute of Technology, SE; 4Stockholm University and KTH Royal Institute of Technology, SE Abstract The Bayesian Confidence Propagation Neural Network (BCPNN) is a spiking model of the cortex. The synaptic weights of BCPNN are organized as matrices. They require substantial synaptic storage and a large bandwidth to it. The algorithm requires a dual access pattern to these matrices, both row-wise and column-wise, to access its synaptic weights. In this work, we exploit an algorithmic optimization that eliminates the column-wise accesses. The new computation model approximates the post-synaptic spikes computation with the use of a predictor. We have adopted this approximate computational model to improve upon the previously reported ASIC implementation, called eBrainII. We also present the error analysis of the approximation to show that it is negligible. The reduction in storage and bandwidth to the synaptic storage results in a 48% reduction in energy compared to eBrainII. The reported approximation method also applies to other neural network models based on a Hebbian learning rule. |
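As a concrete aside to IP4_4.1 above: the memory savings of BloomCA come from Bloom filters, which trade exactness for a fixed-size bit array with one-sided error. Below is a minimal generic Bloom filter; the sizes and the SHA-256-based hash construction are our own assumptions, not the paper's design.

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)      # one byte per bit, for clarity

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests of the item.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item):                 # false positives possible, never false negatives
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("pattern-42")
print(bf.query("pattern-42"), bf.query("pattern-7"))  # True, (almost surely) False
```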
IP4_5 Interactive Presentations
Date: Wednesday, 03 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/aTD3XZgEET2TWjiKy
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title / Authors |
---|---|---|
IP4_5.1 | MEMORY HIERARCHY CALIBRATION BASED ON REAL HARDWARE IN-ORDER CORES FOR ACCURATE SIMULATION Speaker: Quentin Huppert, LIRMM, FR Authors: Quentin Huppert1, Timon Evenblij2, Manu Komalan2, Francky Catthoor3, Lionel Torres4 and David Novo5 1Lirmm, FR; 2imec, BE; 3IMEC, BE; 4University of Montpellier, FR; 5CNRS, LIRMM, University of Montpellier, FR Abstract Computer system simulators are major tools used by architecture researchers. Two key elements play a role in the credibility of simulator results: (1) the simulator’s accuracy, and (2) the quality of the baseline architecture. Some simulators, such as gem5, already provide highly accurate parameterized models. However, finding the right values for all these parameters to faithfully model a real architecture is still a problem. In this paper, we calibrate the memory hierarchy of an in-order core gem5 simulation to accurately model a real mobile Arm SoC. We execute small programs, which are designed to stress specific parts of the memory system, to deduce key parameter values for the model. We compare the execution of SPEC CPU2006 benchmarks on the real hardware with the gem5 simulation. Our results show that our calibration reduces the average and worst-case IPC error by 36% and 50%, respectively, when compared with a gem5 simulation configured with the default parameters. |
IP4_5.2 | SPRITE: SPARSITY-AWARE NEURAL PROCESSING UNIT WITH CONSTANT PROBABILITY OF INDEX-MATCHING Speaker: Sungju Ryu, POSTECH, KR Authors: Sungju Ryu1, Youngtaek Oh2, Taesu Kim1, Daehyun Ahn1 and Jae-Joon Kim3 1Pohang University of Science and Technology, KR; 2Pohang University of Science and Technology, KR; 3POSTECH, KR Abstract Sparse neural networks are widely used for memory savings. However, irregular indices of non-zero input activations and weights tend to degrade the overall system performance. This paper presents a scheme to maintain a constant probability of index-matching for weights and inputs over a wide range of sparsity, overcoming a critical limitation of previous works. A sparsity-aware neural processing unit based on the proposed scheme improves the system performance by up to 6.1X compared to previous sparse convolutional neural network hardware accelerators. |
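As a concrete aside to IP4_5.2 above: with compressed sparse operands, a MAC unit only does useful work when a weight index matches an activation index, so the expected useful work scales with the product of the two densities, which is the utilization problem SPRITE targets. The sketch below merely quantifies that effect under illustrative densities; it does not implement SPRITE's scheme.

```python
import random

def sparse_dot(w, a):
    # w, a: dicts mapping index -> non-zero value (compressed sparse form).
    matches = w.keys() & a.keys()      # only matching indices produce a MAC
    return sum(w[i] * a[i] for i in matches), len(matches)

rng = random.Random(0)
n, dw, da = 1000, 0.3, 0.2             # vector length and densities (illustrative)
w = {i: rng.random() for i in rng.sample(range(n), int(n * dw))}
a = {i: rng.random() for i in rng.sample(range(n), int(n * da))}
_, matched = sparse_dot(w, a)
print(matched, "matches ~", n * dw * da, "expected")   # about 60 of 1000 slots
```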
IP4_6 Interactive Presentations
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/sQdNyqetXmw6iE6R9
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Time | Label | Presentation Title Authors |
---|---|---|
IP4_6.1 | WISER: DEEP NEURAL NETWORK WEIGHT-BIT INVERSION FOR STATE ERROR REDUCTION IN MLC NAND FLASH Speaker: Jaehun Jang, SungKyunkwan University, KR Authors: Jaehun Jang1 and Jong Hwan Ko2 1Department of Semiconductor and Display Engineering, SungKyunkwan University, Memory Division, Samsung Electronics, KR; 2Sungkyunkwan University (SKKU), KR Abstract When Flash memory is used to store the deep neural network (DNN) weights, inference accuracy can degrade due to the Flash memory state errors. To protect the weights from the state errors, the existing methods relied on ECC (Error Correction Code) or parity, which can incur power/storage overhead. In this study, we propose a weight-bit inversion method that minimizes the accuracy loss due to the Flash memory state errors without using the ECC or parity. The method first applies WISE (Weight-bit Inversion for State Elimination), which removes the most error-prone state from MLC NAND, thereby improving both the error robustness and the MSB page read speed. If the initial accuracy loss due to the weight inversion of WISE is unacceptable, we apply WISER (Weight-bit Inversion for State Error Reduction), which reduces weight mapping to the error-prone state with minimum weight value changes. The simulation results show that after 16K program-erase cycles in NAND Flash, WISER reduces CIFAR-100 accuracy loss by 2.92X for VGG-16 compared to the existing methods. |
IP4_6.2 | OR-ML: ENHANCING RELIABILITY FOR MACHINE LEARNING ACCELERATOR WITH OPPORTUNISTIC REDUNDANCY Speaker: Zheng Wang, Shenzhen Institutes of Advanced Technology, CN Authors: Bo Dong1, Zheng Wang2, Wenxuan Chen1, Chao Chen2, Yongkui Yang2 and Zhibin Yu2 1Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; School of Microelectronics, Xidian University, CN; 2Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, CN Abstract Reliability plays a central role in deep sub-micron and nanometre IC fabrication technology and has recently been reported to be one of the key issues affecting the inference phase of neural networks. State-of-the-art machine learning (ML) accelerators exploit the massive computing parallelism of neural networks to achieve high energy efficiency. The topology of ML engines' computing fabric, which constitutes large arrays of processing elements (PEs), has been growing dramatically to accommodate the huge size and heterogeneity of rapidly evolving ML algorithms. However, it is commonly observed that activations of zero value lead to reduced PE utilization. In this work, we present a novel and low-cost approach to enhance the reliability of generic ML accelerators by Opportunistically exploring the chances of runtime Redundancy provided by neighbouring PEs, named OR-ML. In contrast to conventional redundancy techniques, the proposed technique introduces no additional computing resources, therefore significantly reducing the implementation overhead while achieving a meaningful level of protection. The design prototype is evaluated using emulated fault injection on FPGA, executing mainstream neural networks for object classification and detection. |
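The weight-bit inversion idea in IP4_6.1 can be sketched in a few lines: per group of MLC cells, invert all weight bits if doing so reduces how many cells land in an error-prone state, and record one flag bit per group. The group size and the identity of the error-prone state below are assumptions for illustration, not values from the paper.

```python
# Hedged sketch of WISER-style weight-bit inversion. Inverting all bits
# of a 2-bit cell flips which MLC state it lands in; we keep the
# inversion only if it reduces occurrences of the assumed error-prone
# state, storing a single inversion-flag bit per group.
ERROR_PRONE_STATE = 0b01     # assumed most error-prone MLC state
GROUP = 8                    # assumed group size in 2-bit cells

def count_bad(cells):
    return sum(c == ERROR_PRONE_STATE for c in cells)

def wiser_encode(cells):
    """Return (possibly inverted) cells plus a one-bit inversion flag."""
    inverted = [c ^ 0b11 for c in cells]   # bit inversion of a 2-bit cell
    if count_bad(inverted) < count_bad(cells):
        return inverted, 1
    return cells, 0

weights = [0b01, 0b01, 0b10, 0b01, 0b00, 0b01, 0b11, 0b01]
encoded, flag = wiser_encode(weights)
print(f"error-prone cells: {count_bad(weights)} -> {count_bad(encoded)}, flag={flag}")
```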
UB.11 University Booth
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/SHkuEzh2qyHwnwadk
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.11 | CATANIS: CAD TOOL FOR AUTOMATIC NETWORK SYNTHESIS Speaker: Davide Quaglia, University of Verona, IT Authors: Davide Quaglia and Enrico Fraccaroli, University of Verona, IT Abstract The proliferation of communication technologies for embedded systems opened the way for new applications, e.g., Smart Cities and Industry 4.0. In such applications, hundreds or thousands of smart devices interact through different types of channels and protocols. This increasing communication complexity forces computer-aided design methodologies to scale up from embedded systems in isolation to the global inter-connected system. This booth will demonstrate the functionality of a graphic tool for automatic network synthesis, developed in Python and QT to be lightweight and cross-platform. It allows users to graphically specify the communication requirements of the application as a set of interacting tasks and the constraints of the environment (e.g., its map can be considered), together with a library of node types and communication protocols to be used. |
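As a rough illustration of the kind of input such a network-synthesis tool consumes, the snippet below invents a small specification with interacting tasks, a node-type library and a protocol library; the actual CATANIS file format and synthesis algorithm are not shown in the abstract.

```python
# Invented example of a network-synthesis input (not the real CATANIS
# schema): communication requirements as interacting tasks, plus the
# libraries the synthesizer would choose from.
spec = {
    "tasks": {
        "temp_sensor": {"sends_to": "controller", "payload_kbps": 2},
        "controller":  {"max_latency_ms": 200},
    },
    "node_library": [{"type": "mcu_node", "protocols": ["zigbee", "wifi"]}],
    "protocol_library": {"zigbee": {"bw_kbps": 250}, "wifi": {"bw_kbps": 54000}},
}

def feasible_protocols(task_name):
    """Toy stand-in for the synthesis step: keep protocols whose
    bandwidth covers the task's traffic."""
    need = spec["tasks"][task_name].get("payload_kbps", 0)
    return [p for p, props in spec["protocol_library"].items()
            if props["bw_kbps"] >= need]

print(feasible_protocols("temp_sensor"))   # -> ['zigbee', 'wifi']
```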
6.1 CPS-related sensors, platforms and software
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 09:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/m88hQeddDnmm25PXz
Session chair:
Michael Huebner, Brandenburg University of Technology, DE
Session co-chair:
Rainer Kress, Infineon, DE
Organizers:
Enrico Macii, Politecnico di Torino, IT
Frank Schirrmeister, Cadence, US
Sensors, platforms and software are key enabling components for the implementation of the I4.0 paradigm. This session starts by offering an embedded tutorial in which the use of micro sensors and sensor nodes is practically demonstrated in the context of the “Innovation Campus Electronics and Micro Sensor Technologies Cottbus” (iCampμs). Next, the focus is on the software systems in use in the industrial and automotive domains, discussing issues and solutions for the current move from monolithic and tightly integrated run-time environments to virtual platforms implemented on fewer domain computers with heterogeneous physical architectures. Finally, the last contribution addresses the need to master the variability of functional control software, such as that used in the context of Industry 4.0, and illustrates the challenges in documenting the dependencies of different software parts, including their variability, using family models.
Time | Label | Presentation Title Authors |
---|---|---|
09:30 CET | 6.1.1 | ICAMPμS: DEVELOPMENT AND TRANSFER PLATFORM FOR INTEGRATED MICROSENSOR TECHNOLOGIES IN A CONNECTED WORLD Speaker and Author: Harald Schenk, Fraunhofer IPMS, DE Abstract The “Innovation Campus Electronics and Micro Sensor Technologies Cottbus” (iCampμs) in Germany focuses on the development of micro sensors and sensor nodes to provide solutions for a wide range of specific applications. iCampμs places special emphasis on providing multi-purpose technology platforms to lower entry barriers, in particular for the needs of small and medium-sized companies. The tutorial provides insight into the strategy, set-up and objectives of iCampμs. Development activities and their expected impact will be presented by means of various example applications. This comprises, e.g., mobile ultra-low-power radar and micro-machined ultrasound technologies for medical and industrial applications, as well as the merging of sensors with AI for advanced predictive maintenance. |
10:00 CET | 6.1.2 | EFFICIENT RUN-TIME ENVIRONMENTS FOR SYSTEM-LEVEL LET PROGRAMMING Speaker: Rolf Ernst, TU Braunschweig, DE Authors: Kai-Björn Gemlau, Leonie Köhler and Rolf Ernst, TU Braunschweig, DE Abstract Growing requirements of large industrial and automotive software systems have initiated an ongoing move from monolithic and tightly integrated run-time environments (RTE) to virtual platforms implemented on fewer domain computers with heterogeneous physical architectures. This trend has given rise to new programming paradigms to enable specification, implementation and supervision of software systems that are predictable and robust under interference and change. One of those paradigms, the Logical Execution Time (LET), is now part of the automotive software standard, AUTOSAR. While originally applied to single shared-memory multicore processors, System-level LET (SL-LET) extends this approach to virtual and distributed platforms providing a powerful paradigm for CPS in future industrial systems. This contribution explains and demonstrates the resulting challenges to the RTE and the opportunities to improve its efficiency, in particular the communication stack. |
10:15 CET | 6.1.3 | MANAGING VARIABILITY AND REUSE OF EXTRA-FUNCTIONAL CONTROL SOFTWARE IN CPPS Speaker: Birgit Vogel-Heuser, TU Munich, DE Authors: Birgit Vogel-Heuser1, Juliane Fischer1, Dieter Hess2, Eva-Maria Neumann1 and Marcus Würr3 1TU Munich, DE; 2CODESYS GmbH, DE; 3Schneider Electric Automation GmbH, DE Abstract Cyber-Physical Production Systems (CPPS) are long-living and variant-rich systems. Challenges and trends in the context of Industry 4.0 such as a high degree of customization, evolution, and different disciplines involved, e.g., mechanics, electrics/electronics and software, cause a high amount of variability. Mastering the variability of functional control software, e.g., different control variants of an actuator type, is itself a challenge in the development and reuse of CPPS software. This task becomes even more complex when considering the variability of human-machine interface (HMI) software and extra-functional software such as operating modes, diagnosis or fault handling. Moreover, the interdependencies between functional, extra-functional and HMI software pose an additional challenge for variability management and the planned reuse of these software parts. This paper illustrates the challenges in documenting the dependencies of these software parts including their variability using family models. Additionally, the current state-of-practice in industry, which was derived in questionnaires and interview studies, is shown and compared to the potential of increasing the software’s reusability and thus its flexibility in the context of Industry 4.0 by concepts using the object-oriented extension of IEC 61131-3. |
10:30 CET | 6.1.4 | LIVE JOINT Q&A Authors: Michael Huebner1, Harald Schenk2, Rolf Ernst3 and Birgit Vogel-Heuser4 1Brandenburg TU Cottbus, DE; 2Brandenburg University of Technology Cottbus-Senftenberg, DE; 3TU Braunschweig, DE; 4TU Munich, DE Abstract 30 minutes of live joint question and answer time for interaction among speakers and audience. |
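To make the Logical Execution Time paradigm discussed in 6.1.2 concrete, here is a minimal, generic simulation (not the authors' run-time environment): a task reads its inputs at the start of its logical interval and its outputs become visible exactly at the interval's end, regardless of when the computation physically finishes.

```python
import heapq

PERIOD = 10                      # logical execution time of the task
shared_in, shared_out = 5, None  # communication variables

def let_task(x):                 # physical execution may finish any time
    return x * 2                 # inside the interval; LET hides the jitter

events = []
for k in range(3):
    start = k * PERIOD
    heapq.heappush(events, (start, "read"))
    heapq.heappush(events, (start + PERIOD, "publish"))

pending = None
while events:
    now, what = heapq.heappop(events)
    if what == "read":
        pending = let_task(shared_in)   # result computed, but not yet visible
    else:
        shared_out = pending            # visible only at the interval's end
        print(f"t={now}: output {shared_out} published")
```

At each boundary the publish event is ordered before the next read, matching LET's contract that outputs appear exactly at the logical deadline, which is what makes timing predictable under interference.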
6.3 Security in Machine Learning
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 09:30 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/JA4mvRfYewc6bgasL
Session chair:
Ioana Vatajelu, TIMA/CNRS, FR
Session co-chair:
Elif Bilge Kavun, University of Passau, DE
This session deals with security problems of neural networks in machine learning as well as using neural networks for security applications. Novel methods against adversarial attacks targeting neural networks are presented, ranging from detection of weight attacks and fault injection to new mitigation techniques using inherent structural parameters. Furthermore, using neural networks to ensure trust in manufacturing is also covered in the session.
Time | Label | Presentation Title Authors |
---|---|---|
09:30 CET | 6.3.1 | SECURING DEEP SPIKING NEURAL NETWORKS AGAINST ADVERSARIAL ATTACKS THROUGH INHERENT STRUCTURAL PARAMETERS Speaker: Ihsen Alouani, Université Polytechnique Hauts-De-France, FR Authors: Rida El-Allami1, Alberto Marchisio2, Muhammad Shafique3 and Ihsen Alouani4 1INSA Hauts-de-France, Université Polytechnique Hauts-de-France, FR; 2TU Wien, AT; 3New York University Abu Dhabi (NYUAD), AE; 4IEMN lab, INSA Hauts-de-France, Université Polytechnique Hauts-De-France, FR Abstract Deep Learning (DL) algorithms have gained popularity owing to their practical problem-solving capacity. However, they suffer from a serious integrity threat, i.e., their vulnerability to adversarial attacks. In the quest for DL trustworthiness, recent works claimed the inherent robustness of Spiking Neural Networks (SNNs) to these attacks, without considering the variability in their structural spiking parameters. This paper explores the security enhancement of SNNs through internal structural parameters. Specifically, we investigate the SNNs' robustness to adversarial attacks with different values of the neuron's firing voltage thresholds and time window boundaries. We thoroughly study SNN security under different adversarial attacks in the strong white-box setting, with different noise budgets and under variable spiking parameters. Our results show a significant impact of the structural parameters on the SNNs' security, and promising sweet spots can be reached to design trustworthy SNNs with 85% higher robustness than a traditional non-spiking DL system. To the best of our knowledge, this is the first work that investigates the impact of structural parameters on SNNs' robustness to adversarial attacks. The proposed contributions and the experimental framework are available online to the community for reproducible research. |
09:45 CET | 6.3.2 | GNNUNLOCK: GRAPH NEURAL NETWORKS-BASED ORACLE-LESS UNLOCKING SCHEME FOR PROVABLY SECURE LOGIC LOCKING Speaker: Lilas Alrahis, Khalifa University, AE Authors: Lilas Alrahis1, Satwik Patnaik2, Faiq Khalid3, Muhammad Abdullah Hanif4, Hani Saleh1, Muhammad Shafique5 and Ozgur Sinanoglu6 1Khalifa University, AE; 2New York University, US; 3TU Wien, AT; 4Institute of Computer Engineering, Vienna University of Technology, AT; 5New York University Abu Dhabi (NYUAD), AE; 6New York University Abu Dhabi, AE Abstract Logic locking is a holistic design-for-trust technique that aims to protect the design intellectual property (IP) from untrustworthy entities throughout the supply chain. Functional and structural analysis-based attacks successfully circumvent state-of-the-art, provably secure logic locking (PSLL) techniques. However, such attacks are not holistic and target specific implementations of PSLL. Automating the detection and subsequent removal of protection logic added by PSLL while accounting for all possible variations is an open research problem. In this paper, we propose GNNUnlock, the first-of-its-kind oracle-less machine learning-based attack on PSLL that can identify any desired protection logic without focusing on a specific syntactic topology. The key is to leverage a well-trained graph neural network (GNN) to identify all the gates in a given locked netlist that belong to the targeted protection logic, without requiring an oracle. This approach fits perfectly with the targeted problem since a circuit is a graph with an inherent structure and the protection logic is a sub-graph of nodes (gates) with specific and common characteristics. GNNs are powerful in capturing the nodes' neighborhood properties, facilitating the detection of the protection logic. To rectify any misclassifications induced by the GNN, we additionally propose a connectivity analysis-based post-processing algorithm to successfully remove the predicted protection logic, thereby retrieving the original design. Our extensive experimental evaluation demonstrates that GNNUnlock is 99.24%-100% successful in breaking various benchmarks locked using stripped-functionality logic locking [1], tenacious and traceless logic locking [2], and Anti-SAT [3]. Our proposed post-processing enhances the detection accuracy, reaching 100% for all of our tested locked benchmarks. Analysis of the results corroborates that GNNUnlock is powerful enough to break the considered schemes under different parameters, synthesis settings, and technology nodes. The evaluation further shows that GNNUnlock successfully breaks corner cases where even the most advanced state-of-the-art attacks [4], [5] fail. We also open source our attack framework [6]. |
10:00 CET | IP5_5.1 | RUNTIME FAULT INJECTION DETECTION FOR FPGA-BASED DNN EXECUTION USING SIAMESE PATH VERIFICATION Speaker: Xianglong Feng, Rutgers University, US Authors: Xianglong Feng, Mengmei Ye, Ke Xia and Sheng Wei, Rutgers University, US Abstract Deep neural networks (DNNs) have been deployed on FPGAs to achieve improved performance, power efficiency, and design flexibility. However, the FPGA-based DNNs are vulnerable to fault injection attacks that aim to compromise the original functionality. The existing defense methods either duplicate the models and check the consistency of the results at runtime, or strengthen the robustness of the models by adding additional neurons. However, these existing methods could introduce huge overhead or require retraining the models. In this paper, we develop a runtime verification method, namely Siamese path verification (SPV), to detect fault injection attacks for FPGA-based DNN execution. By leveraging the computing features of the DNN and designing the weight parameters, SPV adds neurons to check the integrity of the model without impacting the original functionality and, therefore, model retraining is not required. We evaluate the proposed SPV approach on Xilinx Virtex-7 FPGA using the MNIST dataset. The evaluation results show that SPV achieves the security goal with low overhead. |
10:01 CET | 6.3.3 | RADAR: RUN-TIME ADVERSARIAL WEIGHT ATTACK DETECTION AND ACCURACY RECOVERY Speaker: Jingtao Li, Arizona State University, US Authors: Jingtao Li1, Adnan Siraj Rakin2, Zhezhi He3, Deliang Fan2 and Chaitali Chakrabarti2 1School of Electrical, Computer and Energy Engineering, Arizona State University, US; 2Arizona State University, US; 3Department of ECE, Arizona State University, US Abstract Adversarial attacks on Neural Network weights, such as the progressive bit-flip attack (PBFA), can cause a catastrophic degradation in accuracy by flipping a very small number of bits. Furthermore, PBFA can be conducted at run time on the weights stored in DRAM main memory. In this work, we propose RADAR, a Run-time adversarial weight Attack Detection and Accuracy Recovery scheme to protect DNN weights against PBFA. We organize weights that are interspersed in a layer into groups and employ a checksum-based algorithm on weights to derive a 2-bit signature for each group. At run time, the 2-bit signature is computed and compared with the securely stored golden signature to detect the bit-flip attacks in a group. After successful detection, we zero out all the weights in a group to mitigate the accuracy drop caused by malicious bit-flips. The proposed scheme is embedded in the inference computation stage. For the ResNet-18 ImageNet model, our method can detect 9.6 bit-flips out of 10 on average. For this model, the proposed accuracy recovery scheme can restore the accuracy from below 1% caused by 10 bit flips to above 69%. The proposed method has extremely low time and storage overhead. System-level simulation on gem5 shows that RADAR only adds <1% to the inference time, making this scheme highly suitable for run-time attack detection and mitigation. |
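The detect-and-recover flow of 6.3.3 can be sketched compactly: derive a 2-bit signature per weight group, recompute it at run time against a securely stored golden copy, and zero out any mismatching group. The sum-based checksum below is a stand-in chosen for illustration; the paper's exact signature derivation may differ.

```python
# Sketch of RADAR-style detection and recovery (6.3.3); group size and
# checksum are illustrative assumptions.
def signature(group):
    """Derive a 2-bit signature from integer-quantized weights."""
    return sum(group) & 0b11

def protect(weights, group_size=8):
    groups = [weights[i:i+group_size] for i in range(0, len(weights), group_size)]
    return groups, [signature(g) for g in groups]  # golden copies stored securely

def check_and_recover(groups, golden):
    for g, sig in zip(groups, golden):
        if signature(g) != sig:      # bit-flip detected in this group
            g[:] = [0] * len(g)      # zero out the group to limit accuracy loss

groups, golden = protect(list(range(32)))
groups[1][3] ^= 1                    # simulate a low-order PBFA bit-flip in DRAM
check_and_recover(groups, golden)
print(groups[1])                     # group zeroed after detection
```

Note that a 2-bit checksum can only catch a subset of flip patterns; the trade-off between signature width and detection coverage is exactly what the paper's scheme tunes.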
6.4 SSD Storage and Predictable Cache Coherence
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 09:30 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/A5W4iARtkkgzZxyep
Session chair:
David Novo, LIRMM University of Montpellier, FR
Session co-chair:
Jordi Cortadella, Universitat Politecnica de Catalunya, ES
This session explores solutions for efficient storage in Solid State Drives (SSD) and predictable coherence protocols in multicore cache memory systems. The first paper presents a data structure for key-value pairs that is optimized for SSDs and that introduces three different merge strategies for log-structured merge trees. The second paper proposes a method to eliminate duplicate writes induced by write-ahead logging on SSDs with minimal overhead. Finally, the third paper introduces an approach to automatically synthesize predictable high-performance cache coherence protocols from a high-level domain-specific language specification.
Time | Label | Presentation Title Authors |
---|---|---|
09:30 CET | 6.4.1 | PTIERDB: BUILDING BETTER READ-WRITE COST BALANCED KEY-VALUE STORES FOR SMALL DATA ON SSD Speaker: Li Liu, Huazhong University of Science and Technology, CN Authors: Li Liu and Ke Zhou, Huazhong University of Science and Technology, CN Abstract The popular Log-Structured Merge (LSM) tree based Key-Value (KV) stores make trade-offs between write cost and read cost via different merge policies, i.e., leveling and tiering. It has been widely documented that leveling severely hampers write throughput, while tiering hampers read throughput. The characteristics of modern workloads are seriously challenging LSM-tree based KV stores for high performance and high scalability on SSDs. In this work, we present PTierDB, an LSM-tree based KV store that strikes a better balance between read cost and write cost for small data on SSD via an adaptive tiering principle and three merge policies in the LSM-tree, leveraging both the sequential and random performance characteristics of SSDs. Adaptive tiering introduces two merge principles: prefix-based data split, which bounds the lookup cost, and coexisting merge and move, which reduces data merging. Based on adaptive tiering, three merge policies decide whether to merge-sort or move data during the merging processes at different levels. We demonstrate the advantages of PTierDB with both microbenchmarks and YCSB workloads. Experimental results show that, compared with state-of-the-art KV stores and KV implementations with popular merge policies, PTierDB achieves a better balance between read cost and write cost, and yields up to a 2.5x improvement in performance and a 50% reduction of write amplification. |
09:45 CET | 6.4.2 | SW-WAL: LEVERAGING ADDRESS REMAPPING OF SSDS TO ACHIEVE SINGLE-WRITE WRITE-AHEAD LOGGING Speaker: Qiulin Wu, Huazhong University of Science and Technology, CN Authors: Qiulin Wu, You Zhou, Fei Wu, Ke Wang, Hao Lv, Jiguang Wan and Changsheng Xie, Huazhong University of Science and Technology, CN Abstract Write-ahead logging (WAL) has been widely used to provide transactional atomicity in databases, such as SQLite and MySQL/InnoDB. However, the WAL introduces duplicate writes, where changes are recorded in the WAL file and then written to the database file, called checkpointing writes. On the other hand, NAND flash-based SSDs, which have an inherent indirection software layer, called flash translation layer (FTL), become commonplace in modern storage systems. Innovative SSD designs have been proposed to eliminate the WAL overheads by exploiting the FTL, such as providing an atomic write interface or utilizing its address remapping. However, these designs introduce significant performance overheads of maintaining and persisting extra transactional information to guarantee the transactional atomicity or mapping consistency. In this paper, we propose single-write WAL (SW-WAL), a novel cross-layer design, to eliminate WAL-induced duplicate writes on SSDs with minimal overheads. The SSD exposes an address remapping interface to the host, through which the checkpointing writes can be completed without conducting real data writes. To ensure the transactional atomicity and mapping consistency, we make the SSD aware of the transactional writes to the WAL file. Specifically, when transactional data are written to the WAL file, both transactional and mapping semantics are delivered from the host to the SSD and persisted in relevant flash pages as housekeeping metadata without any extra overheads. We implement a prototype of SW-WAL, which runs a popular database SQLite on an emulated NVMe SSD. Experimental results show that SW-WAL improves the database performance by up to 62% compared with original SQLite that bears the WAL overheads and up to 32% compared with the state-of-the-art design that eliminates the WAL overheads. |
10:00 CET | IP5_3.1 | M2H: OPTIMIZING F2FS VIA MULTI-LOG DELAYED WRITING AND MODIFIED SEGMENT CLEANING BASED ON DYNAMICALLY IDENTIFIED HOTNESS Speaker: Lihua Yang, Huazhong University of Science and Technology, CN Authors: Lihua Yang, Zhipeng Tan, Fang Wang, Shiyun Tu and Jicheng Shao, Huazhong University of Science and Technology, CN Abstract With the widespread use of flash memory from mobile devices to large data centers, the flash-friendly file system (F2FS), designed for flash memory features, has become popular. However, F2FS suffers from severe cleaning overhead due to its logging-scheme writes. Mixed storage of data with different hotness in the file system aggravates segment cleaning. We propose Multi-log delayed writing and Modified segment cleaning based on dynamically identified Hotness (M2H). M2H defines hotness by the file block update distance and uses K-means clustering to identify hotness accurately under dynamic access patterns. Based on fine-grained hotness, we design multi-log delayed writing and modify the selection and release of the victim segment. A hotness metadata cache is used to reduce the overheads induced by hotness metadata management and clustering calculations. Compared with the existing F2FS strategy, M2H reduces the number of blocks migrated during segment cleaning by 36.05% to 36.51% and cumulatively increases file system bandwidth by 69.52% to 70.43%. |
10:01 CET | IP5_3.2 | CHARACTERIZING AND OPTIMIZING EDA FLOWS FOR THE CLOUD Speaker: Abdelrahman Hosny, Brown University, US Authors: Abdelrahman Hosny and Sherief Reda, Brown University, US Abstract Cloud computing accelerates design space exploration in logic synthesis, and parameter tuning in physical design. However, deploying EDA jobs on the cloud requires EDA teams to deeply understand the characteristics of their jobs in cloud environments. Unfortunately, there has been little to no public information on these characteristics. Thus, in this paper, we formulate the problem of migrating EDA jobs to the cloud. First, we characterize the performance of four main EDA applications, namely: synthesis, placement, routing and static timing analysis. We show that different EDA jobs require different machine configurations. Second, using observations from our characterization, we propose a novel model based on Graph Convolutional Networks to predict the total runtime of a given application on different machine configurations. Our model achieves a prediction accuracy of 87%. Third, we develop a new formulation for optimizing cloud deployments in order to reduce deployment costs while meeting deadline constraints. We present a pseudo-polynomial optimal solution using a multi-choice knapsack mapping that reduces costs by 35.29%. |
10:02 CET | 6.4.3 | AUTOMATED SYNTHESIS OF PREDICTABLE AND HIGH-PERFORMANCE CACHE COHERENCE PROTOCOLS Speaker: Anirudh Kaushik, University of Waterloo, CA Authors: Anirudh Kaushik and Hiren Patel, University of Waterloo, CA Abstract We present SYNTHIA, an open and automated tool for synthesizing predictable and high-performance cache coherence protocols for multi-core processors in multi-processor system-on-chips (MPSoCs) deployed in real-time systems. SYNTHIA automates the complex analysis associated with designing predictable and high-performance cache coherence protocols, and constructs new states (transient states) and corresponding transitions that achieve predictability and performance. We use SYNTHIA to construct complete protocol implementations from simple specifications of common protocols (MSI, MESI, and MOESI protocols). We validated the correctness, predictability, and performance guarantees of the generated protocol implementations from SYNTHIA using manually implemented versions, and a micro-architectural simulator. |
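The cloud-deployment optimization in IP5_3.2 is a multi-choice knapsack: pick exactly one machine configuration per EDA job so that the deadline is met at minimum cost. The toy below solves a two-job instance by brute force rather than the paper's pseudo-polynomial formulation; all numbers are invented, and jobs are assumed to run in parallel on separate machines.

```python
from itertools import product

# (runtime_hours, cost_dollars) per machine configuration, per job
jobs = [
    [(4, 2.0), (2, 5.0), (1, 9.0)],   # e.g. synthesis on small/medium/large VM
    [(6, 3.0), (3, 7.0), (2, 12.0)],  # e.g. routing on small/medium/large VM
]
DEADLINE = 6  # hours; jobs assumed to run in parallel on separate machines

# Keep choices whose slowest job meets the deadline, then minimize cost.
feasible = [c for c in product(*jobs) if max(j[0] for j in c) <= DEADLINE]
best = min(feasible, key=lambda c: sum(j[1] for j in c))
print("configs:", best, "cost:", sum(j[1] for j in best))
```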
6.5 Applications on Reconfigurable Systems
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 09:30 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/4c8rrPJKfcwYLahCt
Session chair:
Marco Santambrogio, Politecnico Di Milano, IT
Session co-chair:
Thomas Preußer, Accemic Technologies, US
This session explores a variety of applications and their mapping onto reconfigurable systems, including protein back-translation and dense SLAM, along with a mapping approach for novel via-switch-based FPGAs. Two IPs discuss key-value stores and big data serialization on FPGAs.
Time | Label | Presentation Title Authors |
---|---|---|
09:30 CET | 6.5.1 | FPGA ACCELERATION OF PROTEIN BACK-TRANSLATION AND ALIGNMENT Speaker: Sahand Salamat, University of California San Diego, US Authors: Sahand Salamat1, Jaeyoung Kang2, Yeseong Kim3, Mohsen Imani2, Niema Moshiri1 and Tajana Rosing4 1University of California, San Diego, US; 2University of California San Diego, US; 3DGIST, KR; 4UCSD, US Abstract Identifying genome functionality changes our understanding of humans and helps us in disease diagnosis, as well as in drug, bio-material, and genetic engineering of plants and animals. Comparing the structure of protein sequences, when only sequence information is available, against a database with known functionality helps us to identify and recognize the functionality of an unknown sequence. The process of predicting the possible RNA sequence that a specific protein has originated from is called back-translation. Aligning the back-translated RNA sequence against the database locates the most similar sequences, which are used to predict the functionality of the unknown protein sequence. Providing massive parallelism, FPGAs can accelerate bioinformatics applications substantially. In this paper, we propose FabP, an optimized FPGA-based accelerator for aligning a back-translated protein sequence against a database of DNA/RNA sequences. FabP is deeply optimized to fully utilize the FPGA resources and the DRAM memory bandwidth to maximize performance. |
09:45 CET | 6.5.2 | (Best Paper Award Candidate) FPGA ARCHITECTURES FOR APPROXIMATE DENSE SLAM COMPUTING Speaker: Maria Rafaela Gkeka, University of Thessaly, GR Authors: Maria-Rafaela Gkeka, Alexandros Patras, Christos D. Antonopoulos, Spyros Lalis and Nikolaos Bellas, University of Thessaly, GR Abstract Simultaneous Localization and Mapping (SLAM) is the problem of constructing and continuously updating a map of an unknown environment while keeping track of an agent’s trajectory within this environment. SLAM is widely used in robotics, navigation, and odometry for augmented and virtual reality. In particular, dense SLAM algorithms construct and update the map at pixel granularity at a very high computational and energy cost especially when operating under real-time constraints. Dense SLAM algorithms can be approximated, however care must be taken to ensure that these approximations do not prevent the agent from navigating correctly in the environment. Our work introduces and evaluates a plethora of embedded MPSoC FPGA designs for KinectFusion (a well-known dense SLAM algorithm), featuring a variety of optimizations and approximations, to highlight the interplay between SLAM performance and accuracy. Based on an extensive exploration of the design space, we show that properly designed approximations, which exploit SLAM domain knowledge and efficient management of FPGA resources, enable high-performance dense SLAM in embedded systems, at almost 28 fps, with high energy efficiency and without compromising agent tracking and map construction. |
10:00 CET | IP5_4.1 | HETEROKV: A SCALABLE LINE-RATE KEY-VALUE STORE ON HETEROGENEOUS CPU-FPGA PLATFORMS Speaker: Haichang Yang, Institute of Microelectronics, Tsinghua University, CN Authors: Haichang Yang1, Zhaoshi Li2, Jiawei Wang2, Shouyi Yin2, Shaojun Wei2 and Leibo Liu2 1Tsinghua University, CN; 2Tsinghua University, CN Abstract In-memory key-value store (KVS) has become crucial for many large-scale Internet service providers to build high-performance data centers. While most of the state-of-the-art KVS systems are optimized for read-intensive applications, a wide range of applications have been proven to be insert-intensive or scan-intensive, which scale poorly with the current implementations. With the availability of FPGA-based smart NICs in data centers, hardware-aided and hardware-based KVS systems are gaining popularity. In this paper, we present HeteroKV, a scalable line-rate KVS on heterogeneous CPU-FPGA platforms, aiming to provide high throughput in read-, insert- and scan-intensive scenarios. To achieve this, HeteroKV leverages a heterogeneous data structure consisting of a B+ tree whose leaf nodes are cache-aware partitioned hash tables. Experiments demonstrate HeteroKV's high performance in all scenarios. Specifically, a single-node HeteroKV is able to achieve 430M, 315M and 15M key-value operations per second in read-, insert- and scan-intensive scenarios respectively, which are more than 1.5x, 1.4x and 5x higher than state-of-the-art implementations. |
10:02 CET | 6.5.3 | MUX GRANULARITY-ORIENTED ITERATIVE TECHNOLOGY MAPPING FOR IMPLEMENTING COMPUTE-INTENSIVE APPLICATIONS ON VIA-SWITCH FPGA Speaker: Takashi Imagawa, Ritsumeikan University, JP Authors: Takashi Imagawa1, Jaehoon Yu2, Masanori Hashimoto3 and Hiroyuki Ochi1 1Ritsumeikan University, JP; 2Tokyo Institute of Technology, JP; 3Osaka University, JP Abstract This paper proposes a technology mapping algorithm for implementing application circuits on via-switch FPGA (VS-FPGA). The via-switch is a novel non-volatile and rewritable memory element. Its small footprint and low parasitic RC are expected to improve the area- and energy-efficiency of an FPGA system. Some unique features of the VS-FPGA require a dedicated technology mapping strategy for implementing application circuits with maximum energy-efficiency. One of the features is the small ratio of logic blocks to arithmetic blocks (ABs). Given an application circuit, the proposed algorithm first detects word-wise circuit elements, such as MUXs. These elements are evaluated with an index of how resource utilization and fan-out change when the corresponding element is implemented with AB. All these elements are sorted in descending order based on this index. According to this order, each element is mapped to AB one by one, and synthesis and evaluation are repeated iteratively until satisfying given design constraints. The experimental results show that resource utilization and maximum fan-out can be reduced by about 30% to 50% and 12% to 87%, respectively. The proposed algorithm is not limited to the VS-FPGA and is expected to improve computation density and energy-efficiency of various FPGAs dedicated to compute-intensive signal processing applications. |
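The back-translation step accelerated in 6.5.1 reduces, combinatorially, to expanding each amino acid into its synonymous codons. The sketch below shows that core with an abbreviated standard codon table; FabP's FPGA data path and alignment stage are not reproduced.

```python
from itertools import product

CODONS = {   # abbreviated standard codon table (RNA alphabet)
    "M": ["AUG"],
    "W": ["UGG"],
    "F": ["UUU", "UUC"],
    "K": ["AAA", "AAG"],
}

def back_translate(protein):
    """Yield every RNA sequence that translates to the given protein."""
    for combo in product(*(CODONS[aa] for aa in protein)):
        yield "".join(combo)

candidates = list(back_translate("MFK"))
print(len(candidates), "candidate RNA sequences, e.g.", candidates[0])
```

The candidate set grows multiplicatively with sequence length, which is why aligning back-translated sequences against a database benefits so much from massive hardware parallelism.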
6.6 Energy-Efficient Platforms for Novel Computing Paradigms
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 09:30 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/3MCcHfP8FYep96XRF
Session chair:
Qinru Qiu, Syracuse University, US
Session co-chair:
Alok Prakash, NTU, SG
This session presents efficient hardware platforms for novel computing paradigms via hardware and algorithm innovations to achieve aggressive energy efficiency. The first paper discusses a lightweight automata processor for which the model can be compressed into SRAMs. The second paper implements a manifold trainable encoder for energy-efficient hyper-dimensional computing. The third paper is about energy efficient compute-in-memory hardware with low-bit ADCs for mapping binary ResNets.
Time | Label | Presentation Title Authors |
---|---|---|
09:30 CET | 6.6.1 | LAP: A LIGHTWEIGHT AUTOMATA PROCESSOR FOR PATTERN MATCHING TASKS Speaker: Haojun Xia, University of Science and Technology of China, CN Authors: Haojun Xia, Lei Gong, Chao Wang, Xianglan Chen and Xuehai Zhou, University of Science and Technology of China, CN Abstract A growing number of applications employ finite automata as their basic computational model. These applications match tens to thousands of patterns on large amounts of data, which brings great challenges to conventional processors. Hardware-based solutions have achieved high-throughput automata processing. However, they are too heavy to be integrated into small chips. Besides, they have to rely on DRAMs or other high-capacity memories to store their underlying automata models. We focus on building a more lightweight automata processor, which can store the whole automata model in SRAMs of limited size and run independently. We propose LAP, a lightweight automata processor. LAP achieves extremely high storage efficiency by leveraging a novel automata model (ADFA) and efficient packing algorithms. Besides, we exploit software-hardware co-design to achieve faster processing speed. We observe that ADFA's traversal algorithm is parallelizable. Thus, we propose novel hardware instructions to parallelize the additional memory accesses in the ADFA model and hide their access overhead. LAP is organized into a four-stage pipeline and prototyped on a Xilinx Artix-7 FPGA at 263 MHz. Evaluations show that LAP achieves extremely high storage efficiency, exceeding IBM's RegX and Micron's AP by 8 times. Besides, LAP achieves significant improvements in processing speed, ranging from 32% to 91% compared with previous lightweight implementations. As a result, a low-power CPU equipped with five LAP cores can achieve 9.5 Gbps processing throughput matching 400 patterns simultaneously. |
09:45 CET | 6.6.2 | MANIHD: EFFICIENT HYPER-DIMENSIONAL LEARNING USING MANIFOLD TRAINABLE ENCODER Speaker: Zhuowen Zou, University of California Irvine, US Authors: Zhuowen Zou1, Yeseong Kim2, M. Hassan Najafi3 and Mohsen Imani4 1University of California San Diego, US; 2Daegu Institute of Science and Technology, KR; 3University of Louisiana, US; 4University of California Irvine, US Abstract Hyper-Dimensional (HD) computing emulates the functionality of human short-term memory by computing with hypervectors as an alternative to computing with numbers. The main goal of HD computing is to map data points into a sparse high-dimensional space where the learning task can be performed in a linear and hardware-friendly way. Existing HD computing algorithms use a static, non-trainable encoder; thus, they require very high dimensionality to provide acceptable accuracy. However, this high dimensionality results in high computational cost, especially on realistic learning problems. In this paper, we propose ManiHD, which supports an adaptive and trainable encoder for efficient learning in high-dimensional space. ManiHD explicitly considers non-linear interactions between the features during the encoding. This enables ManiHD to provide maximum learning accuracy using much lower dimensionality. ManiHD not only enhances the learning accuracy but also significantly improves the learning efficiency during both training and inference phases. ManiHD also enables online learning by sampling data points and capturing the essential features in an unsupervised manner. We also propose a quantization method that trades off accuracy and efficiency for an optimal configuration. Our evaluation on a wide range of classification tasks shows that ManiHD provides 4.8% higher accuracy than the state-of-the-art HD algorithms. In addition, ManiHD provides, on average, 12.3× (3.2×) faster and 19.3× (6.3×) more energy-efficient training (inference) compared to state-of-the-art learning algorithms. |
10:00 CET | 6.6.3 | MAPPING BINARY RESNETS ON COMPUTING-IN-MEMORY HARDWARE WITH LOW-BIT ADCS Speaker: Yulhwa Kim, Pohang University of Science and Technology, KR Authors: Yulhwa Kim1, Hyungjun Kim2, Jihoon Park1, Hyunmyung Oh1 and Jae-Joon Kim2 1Pohang University of Science and Technology, KR; 2POSTECH, KR Abstract Implementing binary neural networks (BNNs) on computing-in-memory (CIM) hardware has several attractive features such as small memory requirement and minimal overhead in peripheral circuits such as analog-to-digital converters (ADCs). On the other hand, one of the downsides of using BNNs is that it degrades the classification accuracy. Recently, ResNet-style BNNs are gaining popularity with higher accuracy than conventional BNNs. The accuracy improvement comes from the high-resolution skip connection which binary ResNets use to compensate for the information loss caused by binarization. However, the high-resolution skip connection forces the CIM hardware to use high-bit ADCs again, so that area and energy overhead become larger. In this paper, we demonstrate that binary ResNets can also be mapped on CIM with low-bit ADCs via aggressive partial-sum quantization and input-splitting combined with retraining. As a result, the key advantages of BNN CIM such as small area and energy consumption can be preserved with higher accuracy. |
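A small numerical sketch of the setting in 6.6.3: a CIM column accumulates a partial sum of binary products, a low-bit ADC quantizes it, and the digital side adds the quantized partial sums; splitting the inputs keeps each partial sum within the ADC range. The ADC resolution, full-scale and split size below are assumed values, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
ADC_BITS, SPLIT = 3, 32            # assumed ADC resolution and input-split size

def adc(x, bits=ADC_BITS, full_scale=SPLIT):
    """Emulate a low-bit ADC: uniform quantization with clipping."""
    levels = 2 ** bits - 1
    step = full_scale / levels
    return np.clip(np.round(x / step), 0, levels) * step

w = rng.integers(0, 2, 256)        # binary weights on one CIM column
a = rng.integers(0, 2, 256)        # binary input activations
exact = int(w @ a)
quantized = sum(adc(w[i:i+SPLIT] @ a[i:i+SPLIT]) for i in range(0, 256, SPLIT))
print(f"exact partial sum = {exact}, with {ADC_BITS}-bit ADCs = {quantized:.1f}")
```

The residual quantization error visible here is what the paper's retraining step teaches the network to tolerate.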
6.7 Challenges in implementing edge nodes of IoT systems
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 09:30 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/8Si8G6D2CfdW8tMHx
Session chair:
Rodolfo Pellizzoni, University of Waterloo, CA
Session co-chair:
Daniele Jahier Pagliari, Politecnico di Torino, IT
Distributed IoT and embedded systems are becoming more and more pervasive in everyday life. The papers in this session address several challenges related to predictability, energy management, and low power for distributed IoT systems, and explore openings provided by open-source hardware to question the hardware/software tradeoff.
Time | Label | Presentation Title Authors |
---|---|---|
09:30 CET | 6.7.1 | A MODEL-BASED DESIGN FLOW FOR ASYNCHRONOUS IMPLEMENTATIONS FROM SYNCHRONOUS SPECIFICATIONS Speaker: Yu Bai, Hebei University of Science and Technology, CN Authors: Yu Bai1, Omair Rafique2 and Klaus Schneider2 1Hebei University of Science and Technology, CN; 2University of Kaiserslautern, DE Abstract The synthesis of distributed embedded systems from dataflow models like Kahn Process Networks (KPN) has to deal with particular problems like absence of deadlocks and buffer overflows. However, the verification of the absence of these problems for a KPN model is in general not decidable. Starting with synchronous models, desynchronization avoids such design difficulties by generating sound dataflow networks by correctness of construction. In this paper, we present a design flow following such an approach. Our design flow differs from previous work in the following aspects: The synchronous models are specified by an imperative synchronous language and are therefore better suited for control-intensive applications. Verification of desynchronization criteria is carried out efficiently with the help of model checking and SAT-solving, ensuring the compliance of the functional behavior. Qualified code is translated automatically into the KPN model. Finally, the KPN model is automatically synthesized to the open computing language (OpenCL) based implementation which is platform independent and can be executed on various commercial off-the-shelf target platforms. |
09:45 CET | 6.7.2 | SURVIVING TRANSIENT POWER FAILURES WITH SRAM DATA RETENTION Speaker: Mingsong Lv, The Hong Kong Polytechnic University, HK Authors: Songran Liu1, Wei Zhang2, Mingsong Lv1, Qiulin Chen3 and Nan Guan2 1Northeastern University, CN; 2The Hong Kong Polytechnic University, HK; 3Huawei Technologies Co., Ltd., CN Abstract Many computing systems, such as those powered by energy harvesting or deployed in harsh working environments, may experience unpredictable and frequent transient power failures in their lifetime. Such systems may fail to deliver correct computation results or never progress, as computation is frequently interrupted by the power failures. A possible solution could be frequently saving program states to non-volatile memory (NVM), such as using checkpoints, so that the system can progress incrementally. However, this approach is too costly, since frequent NVM writes are time- and energy-consuming and may wear out the NVM device. In this work, we propose an approach that enables a system to use volatile SRAM to correctly progress in the presence of transient power failures, since SRAM is capable of retaining its data for seconds or minutes with the charge remaining in the battery/capacitor after the CPU core stops at its brown-out voltage. The main problem is to validate whether the data in SRAM are actually retained during power failures. In our approach, we validate only a subset of the program states with a Cyclic Redundancy Check for efficiency. The validation technique requires maintaining a backup version of the program states, which additionally provides the system with the ability to progress incrementally. We implement a run-time system with the proposed approach. Experimental results on an MSP430 platform show that the system can correctly progress on SRAM in the presence of transient power failures with low overhead. |
10:00 CET | IP5_2.2 | RISC-V FOR REAL-TIME MCUS - SOFTWARE OPTIMIZATION AND MICROARCHITECTURAL GAP ANALYSIS Speaker: Robert Balas, ETH Zurich, CH Authors: Robert Balas1 and Luca Benini2 1ETH Zürich, CH; 2Università di Bologna and ETH Zurich, IT Abstract Processors using the RISC-V ISA are finding increasing real-world use in IoT and embedded systems in the MCU segment. However, many real-life use cases in this segment have real-time constraints. In this paper we analyze the current state of real-time support for RISC-V with respect to the ISA, the available hardware and the software stack, focusing on the RV32IMC subset of the ISA. As a reference point, we use the CV32E40P, an open-source industrially supported RV32IMFC core, and FreeRTOS, a popular open-source real-time operating system, to establish a baseline characterization. We perform a series of software optimizations on the vanilla RISC-V FreeRTOS port, where we also explore and make use of ISA and micro-architectural features, improving the context-switch time by 25% and the interrupt latency by 33% on average and 20% in the worst case on a CV32E40P, evaluated on power control unit firmware and synthetic benchmarks. This improved version then serves in a comparison against the ARM Cortex-M series, which in turn allows us to highlight gaps and challenges to be tackled in the RISC-V ISA as well as in the hardware/software ecosystem to achieve competitive maturity. |
10:01 CET | 6.7.3 | SOURCE CODE CLASSIFICATION FOR ENERGY EFFICIENCY IN PARALLEL ULTRA LOW-POWER MICROCONTROLLERS Speaker: Emanuele Parisi, Università di Bologna, IT Authors: Emanuele Parisi, Francesco Barchi, Andrea Bartolini, Giuseppe Tagliavini and Andrea Acquaviva, Università di Bologna, IT Abstract The analysis of source code through machine learning techniques is an increasingly explored research topic aiming at increasing smartness in the software toolchain to exploit modern architectures in the best possible way. In the case of low-power, parallel embedded architectures, this means finding the configuration, for instance in terms of the number of cores, leading to minimum energy consumption. Depending on the kernel to be executed, the energy optimal scaling configuration is not trivial. While recent work has focused on general-purpose systems to learn and predict the best execution target in terms of the execution time of a snippet of code or kernel (e.g. offload OpenCL kernel on multicore CPU or GPU), in this work we focus on static compile-time features to assess if they can be successfully used to predict the minimum energy configuration on PULP, an ultra-low-power architecture featuring an on-chip cluster of RISC-V processors. Experiments show that using machine learning models on the source code to select the best energy scaling configuration automatically is viable and has the potential to be used in the context of automatic system configuration for energy minimisation. |
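The validation idea of 6.7.2 can be sketched as follows (the state layout and recovery policy here are assumptions): before a power failure, save a CRC over the validated subset of program state together with a backup copy; after power returns, recompute the CRC to decide whether the SRAM contents were retained or the backup must be restored.

```python
import zlib

def snapshot(state: bytes):
    """Save the CRC of the validated subset plus a backup copy of the state."""
    return zlib.crc32(state), bytes(state)

def resume(sram_after_reboot: bytes, crc: int, backup: bytes) -> bytes:
    if zlib.crc32(sram_after_reboot) == crc:
        return sram_after_reboot       # retention succeeded: keep progressing
    return backup                      # corruption detected: restore backup

state = bytes(range(64))
crc, backup = snapshot(state)
corrupted = bytes([state[0] ^ 0xFF]) + state[1:]   # simulate a retention failure
print(resume(corrupted, crc, backup) == backup)    # -> True, backup restored
```

Validating only a subset keeps the check cheap, which is the efficiency argument the abstract makes, at the cost of leaving the unvalidated state to the backup mechanism.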
6.2 EDA Meets Quantum Computing: Bringing Two Communities Together
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 09:45 CET - 10:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/523LnhqKZNnaXmmDk
Session chair:
Aida Todri-Sanial, CNRS, LIRMM/University of Montpellier, FR
Session co-chair:
Elena Gnani, Università di Bologna, IT
Organizer:
Aida Todri-Sanial, CNRS, LIRMM/University of Montpellier, FR
The session will highlight the crossover research needs, challenges and prospects of how EDA can help to advance the roadmap of quantum computing. Companies, governments and research institutions are driving innovation to bring their quantum computing technology online, even though the number of qubits is still quite low. But plans for quantum computers with hundreds and even thousands of qubits have recently been announced. Doubling performance every year is now the benchmark for quantum computers as designers look to the EDA community for new automation tools. The industry is asking practical questions; for example, how can the current EDA tools and IC industry support quantum research? Can EDA tools help to correct the errors of qubits or enable quantum error correction? IBM has made impressive accomplishments in building quantum chips based on superconducting circuits as qubits, accessible via the cloud. But such circuits constitute only one possible platform for quantum computing. Significant efforts are also under way to enable CMOS-based quantum computing. Consequently, many EDA methods can be of direct interest to drive innovation on CMOS silicon qubits. The session will showcase progress in both the Quantum Technologies and EDA communities, with speakers from both industry and academia.
Time | Label | Presentation Title Authors |
---|---|---|
09:45 CET | 6.2.2 | QUANTUM COMPUTING WITH CMOS TECHNOLOGY Speaker: Miguel Fernando Gonzalez Zalba, Quantum Motion Technologies, ES Author: Fernando Gonzalez-Zalba, Quantum Motion Technologies, GB Abstract Quantum computing is poised to be the innovation driver of the next decade. Its information processing capabilities will radically accelerate drug discovery, improve online security, or even boost artificial intelligence [1]. Building a quantum computer promises to have a major positive impact on society; however, building the hardware that will enable that paradigm change is one of the greatest technological challenges for humanity. The spins of isolated electrons in silicon are one of the most promising solid-state systems to achieve that goal. With the recent demonstrations of long coherence times [2], high-fidelity spin readout [3], and one- and two-qubit gates [4-7], the basic requirements to build a fault-tolerant quantum computer have now been fulfilled. These are promising initial results for this relatively recent approach to quantum computing, indicating that attempting to build a quantum computer based on silicon technology is a realistic proposition. However, many technological challenges lie ahead. So far, most of the aforementioned milestones were achieved with small-scale devices (one- or two-qubit systems) fabricated in academic cleanrooms offering a relatively modest level of process control and reproducibility. Now, a transition from lab-based demonstrations to spin qubits manufactured at scale is necessary. Recently, important developments in the field of nanodevice engineering have shown this may be possible by using modified field-effect transistors (FETs) [8,9], thus creating an opportunity to leverage the scaling capabilities of the complementary metal-oxide-semiconductor (CMOS) industry to address the challenge. From a technological perspective, CMOS-based quantum computing brings compatibility with the well-established, highly reproducible Very Large-Scale Integration (VLSI) techniques of the CMOS industry, which routinely manufacture billions of quasi-identical transistors on an area the size of a fingertip. Furthermore, using CMOS technology for quantum computing could enable hybrid integration of quantum and classical technologies, facilitating data management and fast information feedback between processing blocks. In this paper, I will present a series of results on silicon FETs manufactured in an industrial environment that show this technology could provide a platform onto which electron spin qubits can be implemented at scale. I will present our efforts to develop a qubit-specific measurement technique that is accurate and scalable while being compatible with industrial fabrication processes [10-12]. Using this methodology, I will show the first report of electron spin readout in a silicon industry-fabricated device [13]. On the architecture side, I will present results that combine, on chip, digital and quantum devices to perform time-multiplexed readout [14]. And finally, I will show our strategy to use small CMOS quantum processing units in a multi-core approach to solve hybrid quantum-classical algorithms that benefit from massive parallelisation [15,16]. |
10:00 CET | 6.2.3 | STRUCTURED OPTIMIZED ARCHITECTING OF FULL-STACK QUANTUM SYSTEMS IN THE NISQ ERA Speaker: Carmen G. Almudever, Delft University of Technology, NL Authors: Carmen G. Almudever1 and Eduard Alarcon2 1TU Delft, NL; 2TU Catalonia, UPC BarcelonaTech, ES Abstract In the midst of the NISQ era of quantum computers, the challenges are gravitating towards architecting and full-stack engineering aspects, which are inherently algorithm-driven, so that bottom-up and top-down design approaches begin to converge, which we coin the Quantum Architecting (QuArch) era. Faced with manifold design proposals, this paper postulates and proposes applying Design Space Exploration (DSE) to the full vertical stack of quantum systems as an instrumental methodology to address this design-diversity challenge. This structured design method, which composes a multidimensional input design space and compresses the set of output performance metrics into an optimization-oriented overall figure of merit, provides a framework for optimization and performance comparison. It also yields a way to discriminate among alternative techniques at and across all layers, ultimately providing a structured, comprehensive and design-oriented formal framework that addresses the complexity of quantum-system design and evaluation. The paper concludes by illustrating instances of this methodology: optimizing and comparing mapping techniques for today's resource-constrained NISQ chips, and carrying out a quantitative gap analysis of scalability trends towards many-core distributed quantum architectures. |
10:15 CET | 6.2.4 | VISUALIZING DECISION DIAGRAMS FOR QUANTUM COMPUTING Speaker and Author: Robert Wille, Johannes Kepler University Linz, AT Abstract With the emergence of more and more applications for quantum computing, also the development of corresponding methods for design automation is receiving increasing interest. In this respect, decision diagrams provide a promising basis for many design tasks such as simulation, synthesis, verification, and more. However, users of the corresponding tools often do not have a proper background or an intuition about how these methods based on decision diagrams work and what their strengths and limits are. In an effort to make decision diagrams for quantum computing more accessible, we present a visualization tool which visualizes quantum decision diagrams and allows to explore their behavior when used in the design tasks mentioned above. The installation-free web-tool allows users to interactively learn how decision diagrams can be used in quantum computing, e.g., to (1) compactly represent quantum states and the functionality of quantum circuits, (2) to efficiently simulate quantum circuits, and (3) to verify the equivalence of two circuits. The tool is available at https://iic.jku.at/eda/research/quantum_dd/tool. |
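A toy instance of the design-space exploration advocated in 6.2.3, with invented dimensions, metric models and weights: enumerate the multidimensional input design space, evaluate each point's output metrics, compress them into a single figure of merit, and rank the candidates.

```python
from itertools import product

# Invented design space; the real dimensions span the full vertical stack.
design_space = {
    "qubits_per_core": [8, 16, 32],
    "mapper":          ["trivial", "lookahead"],
}

def metrics(point):
    """Stand-in performance models for a design point."""
    q, m = point["qubits_per_core"], point["mapper"]
    fidelity = 0.99 - 0.001 * q - (0.01 if m == "trivial" else 0.0)
    comm_overhead = 1.0 / q + (0.02 if m == "lookahead" else 0.05)
    return fidelity, comm_overhead

def figure_of_merit(point, w_fid=1.0, w_comm=2.0):
    """Compress the output metrics into one optimization-oriented score."""
    fid, comm = metrics(point)
    return w_fid * fid - w_comm * comm   # higher is better

points = [dict(zip(design_space, v)) for v in product(*design_space.values())]
print(max(points, key=figure_of_merit))
```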
IP5_1 Interactive Presentations
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/7A587R9TDKzgQCXNF
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Label | Presentation Title and Authors |
---|---|
IP5_1.1 | EXPLORING MICRO-ARCHITECTURAL SIDE-CHANNEL LEAKAGES THROUGH STATISTICAL TESTING Speaker: Sarani Bhattacharya, KU Leuven, BE Authors: Sarani Bhattacharya1 and Ingrid Verbauwhede2 1KU Leuven, BE; 2KU Leuven - COSIC, BE Abstract Micro-architectural side-channel leakages have received a lot of attention due to their high impact on software security on complex out-of-order processors. These are extremely specialised threat models and can only be realised in practice with high-precision measurement code that triggers micro-architectural behavior which leaks information. In this paper, we present a tool to support the inexperienced user in verifying their code for side-channel leakage. We combine two very useful tools, statistical testing and hardware performance monitors, to bridge the gap between the understanding of general-purpose users and the most precise speculative execution attacks. We first show that these event counters are more powerful than observing timing variabilities of an executable. We extend Dudect: the raw hardware events are collected over the target executable, and leakage detection tests are incorporated on the statistics of the observed events following the principles of non-specific t-tests. Finally, we show the applicability of our tool on the most popular speculative micro-architectural and data-sampling attack models. (A minimal sketch of such a leakage test follows this table.) |
IP5_1.2 | SECLUSIVE CACHE HIERARCHY FOR MITIGATING CROSS-CORE CACHE AND COHERENCE DIRECTORY ATTACKS Speaker: Vishal Gupta, Indian Institute of Technology, Kanpur, IN Authors: Vishal Gupta1, Vinod Ganesan2 and Biswabandan Panda3 1Indian Institute of Technology, Kanpur, IN; 2Indian Institute of Technology Madras, IN; 3IIT Kanpur, IN Abstract Cross-core cache attacks glean sensitive data by exploiting the fundamental interference at shared resources like the last-level cache (LLC) and coherence directories. Complete non-interference would make cross-core cache attacks unsuccessful. To this end, we propose a seclusive cache hierarchy, with zero storage overhead and a marginal increase in on-chip traffic, that provides non-interference by employing cache privatization on demand. Upon a cross-core eviction by an attacker core at the LLC, the block is back-filled into the private cache of the victim core. Our back-fill strategy mitigates cross-core conflict-based LLC attacks and coherence-directory-based attacks. We show the efficacy of the seclusive cache hierarchy by comparing it with existing cache hierarchies. |
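To make the statistical machinery of IP5_1.1 concrete, the sketch below runs a non-specific (fixed-vs-random) Welch t-test on synthetic hardware-event counts. The counter model, numbers and threshold are illustrative assumptions, not the authors' tool, which collects real performance-counter readings over the target executable.

```python
# Leakage detection in the dudect style: compare event-count distributions
# for a fixed-input class vs. a random-input class; a large |t| flags
# data-dependent micro-architectural behaviour.
import numpy as np

def welch_t(a, b):
    """Welch's t statistic between two sets of counter measurements."""
    va, vb = a.var(ddof=1), b.var(ddof=1)
    return (a.mean() - b.mean()) / np.sqrt(va / len(a) + vb / len(b))

rng = np.random.default_rng(1)
n = 5000
# Hypothetical leaky code: random inputs trigger slightly more events.
leaky_a = rng.poisson(lam=1000, size=n)
leaky_b = rng.poisson(lam=1004, size=n)
# Hypothetical constant-behaviour code: both classes look identical.
const_a = rng.poisson(lam=1000, size=n)
const_b = rng.poisson(lam=1000, size=n)

THRESHOLD = 4.5   # conventional |t| bound for declaring leakage
for name, (a, b) in {"leaky": (leaky_a, leaky_b),
                     "constant-behaviour": (const_a, const_b)}.items():
    t = welch_t(a.astype(float), b.astype(float))
    verdict = "LEAK" if abs(t) > THRESHOLD else "ok"
    print(f"{name}: |t| = {abs(t):.1f} -> {verdict}")
```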
IP5_2 Interactive Presentations
Date: Wednesday, 03 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/p433buGvoP3pkCaog
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Label | Presentation Title and Authors |
---|---|
IP5_2.1 | TOWARDS A FIRMWARE TPM ON RISC-V Speaker: Marouene Boubakri, University of Carthage, TN Authors: Marouene Boubakri1, Fausto Chiatante2 and Belhassen Zouari1 1Mediatron Lab, Higher School of Communications of Tunis, University of Carthage, Tunisia, TN; 2NXP, FR Abstract To develop the next generation of Internet of Things (IoT) and edge devices and systems, which leverage progress in enabling technologies such as 5G, distributed computing and artificial intelligence (AI), several requirements need to be developed and put in place to make the devices smarter. A major requirement for all the above applications is a long-term security and trusted computing infrastructure. Trusted computing requires the introduction of a Trusted Platform Module (TPM) inside the platform. Traditionally, a TPM was a discrete and dedicated module plugged into the platform to provide TPM capabilities. Recently, processor manufacturers started integrating trusted computing features into their processors. A significant drawback of this approach is the need for a permanent modification of the processor microarchitecture. In this context, we present an analysis and a design of a software-only TPM for RISC-V processors based on the seL4 microkernel and OP-TEE. |
IP5_2.2 | RISC-V FOR REAL-TIME MCUS - SOFTWARE OPTIMIZATION AND MICROARCHITECTURAL GAP ANALYSIS Speaker: Robert Balas, ETH Zurich, CH Authors: Robert Balas1 and Luca Benini2 1ETH Zürich, CH; 2Università di Bologna and ETH Zurich, IT Abstract Processors using the RISC-V ISA are finding increasing real-world use in IoT and embedded systems in the MCU segment. However, many real-life use cases in this segment have real-time constraints. In this paper, we analyze the current state of real-time support for RISC-V with respect to the ISA, the available hardware and the software stack, focusing on the RV32IMC subset of the ISA. As a reference point, we use the CV32E40P, an open-source, industrially supported RV32IMFC core, and FreeRTOS, a popular open-source real-time operating system, for a baseline characterization. We perform a series of software optimizations on the vanilla RISC-V FreeRTOS port, where we also explore and make use of ISA and micro-architectural features, improving the context-switch time by 25% and the interrupt latency by 33% on average and 20% in the worst case on a CV32E40P, evaluated on power control unit firmware and synthetic benchmarks. This improved version then serves in a comparison against the ARM Cortex-M series, which in turn allows us to highlight gaps and challenges to be tackled in the RISC-V ISA as well as in the hardware/software ecosystem to achieve competitive maturity. |
IP5_3 Interactive Presentations
Date: Wednesday, 03 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/k6wgwYQCKRZDDRDpx
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Label | Presentation Title and Authors |
---|---|
IP5_3.1 | M2H: OPTIMIZING F2FS VIA MULTI-LOG DELAYED WRITING AND MODIFIED SEGMENT CLEANING BASED ON DYNAMICALLY IDENTIFIED HOTNESS Speaker: Lihua Yang, Huazhong University of Science and Technology, CN Authors: Lihua Yang, Zhipeng Tan, Fang Wang, Shiyun Tu and Jicheng Shao, Huazhong University of Science and Technology, CN Abstract With the widespread use of flash memory from mobile devices to large data centers, the flash-friendly file system (F2FS), designed for flash memory characteristics, has become popular. However, F2FS suffers from severe cleaning overhead due to its log-structured writing scheme. Mixed storage of data with different hotness in the file system aggravates segment cleaning. We propose multi-log delayed writing and modified segment cleaning based on dynamically identified hotness (M2H). M2H defines hotness by the file block update distance and uses K-means clustering to identify hotness accurately for dynamic access patterns. Based on fine-grained hotness, we design multi-log delayed writing and modify the selection and release of the victim segment. A hotness metadata cache is used to reduce the overheads induced by hotness metadata management and clustering calculations. Compared with the existing strategy of F2FS, M2H reduces the blocks migrated during segment cleaning by 36.05% to 36.51% and increases file system bandwidth by 69.52% to 70.43% cumulatively. |
IP5_3.2 | CHARACTERIZING AND OPTIMIZING EDA FLOWS FOR THE CLOUD Speaker: Abdelrahman Hosny, Brown University, US Authors: Abdelrahman Hosny and Sherief Reda, Brown University, US Abstract Cloud computing accelerates design space exploration in logic synthesis, and parameter tuning in physical design. However, deploying EDA jobs on the cloud requires EDA teams to deeply understand the characteristics of their jobs in cloud environments. Unfortunately, there has been little to no public information on these characteristics. Thus, in this paper, we formulate the problem of migrating EDA jobs to the cloud. First, we characterize the performance of four main EDA applications, namely: synthesis, placement, routing and static timing analysis. We show that different EDA jobs require different machine configurations. Second, using observations from our characterization, we propose a novel model based on Graph Convolutional Networks to predict the total runtime of a given application on different machine configurations. Our model achieves a prediction accuracy of 87%. Third, we develop a new formulation for optimizing cloud deployments in order to reduce deployment costs while meeting deadline constraints. We present a pseudo-polynomial optimal solution using a multi-choice knapsack mapping that reduces costs by 35.29%. |
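The deadline-constrained cost optimization of IP5_3.2 maps naturally onto a multi-choice knapsack: each EDA job picks exactly one machine configuration, and a pseudo-polynomial dynamic program over the time budget finds the cheapest feasible assignment. The sketch below is a minimal rendering of that mapping with made-up job and configuration numbers; the paper's actual formulation and constants are not reproduced here.

```python
# Multi-choice knapsack: minimise total cost, one (runtime, cost) option per
# job, subject to a deadline on total runtime. Pseudo-polynomial in deadline.
import math

def mckp_min_cost(jobs, deadline):
    """jobs: list of option lists [(runtime, cost), ...]; runtimes are ints."""
    INF = math.inf
    best = [INF] * (deadline + 1)   # best[t] = min cost using exactly t time
    best[0] = 0.0
    for options in jobs:
        nxt = [INF] * (deadline + 1)
        for t in range(deadline + 1):
            if best[t] == INF:
                continue
            for rt, cost in options:          # choose exactly one option
                if t + rt <= deadline:
                    nxt[t + rt] = min(nxt[t + rt], best[t] + cost)
        best = nxt
    return min(best)

# Three hypothetical jobs (e.g. synthesis, placement, STA), each with
# small/medium/large machine options as (runtime_minutes, dollar_cost):
jobs = [
    [(120, 1.0), (60, 2.5), (30, 6.0)],
    [(200, 2.0), (90, 4.0), (45, 9.0)],
    [(80, 0.5), (40, 1.5), (20, 4.0)],
]
print(mckp_min_cost(jobs, deadline=240))  # cheapest mix meeting 240 minutes
```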
IP5_4 Interactive Presentations
Date: Wednesday, 03 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/GfdMuDtRsmQm9Jfss
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Label | Presentation Title and Authors |
---|---|
IP5_4.1 | HETEROKV: A SCALABLE LINE-RATE KEY-VALUE STORE ON HETEROGENEOUS CPU-FPGA PLATFORMS Speaker: Haichang Yang, Institute of Microelectronics, Tsinghua University, CN Authors: Haichang Yang1, Zhaoshi Li2, Jiawei Wang2, Shouyi Yin2, Shaojun Wei2 and Leibo Liu2 1Tsinghua University, CN; 2Tsinghua University, CN Abstract In-memory key-value stores (KVS) have become crucial for many large-scale Internet service providers to build high-performance data centers. While most state-of-the-art KVS systems are optimized for read-intensive applications, a wide range of applications have been proven to be insert-intensive or scan-intensive, and these scale poorly with current implementations. With the availability of FPGA-based smart NICs in data centers, hardware-aided and hardware-based KVS systems are gaining popularity. In this paper, we present HeteroKV, a scalable line-rate KVS on heterogeneous CPU-FPGA platforms, aiming to provide high throughput in read-, insert- and scan-intensive scenarios. To achieve this, HeteroKV leverages a heterogeneous data structure consisting of a B+ tree whose leaf nodes are cache-aware partitioned hash tables. Experiments demonstrate HeteroKV's high performance in all scenarios. Specifically, a single-node HeteroKV is able to achieve 430M, 315M and 15M key-value operations per second in read-, insert- and scan-intensive scenarios respectively, which are more than 1.5x, 1.4x and 5x higher than state-of-the-art implementations. |
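The heterogeneous data structure at the core of HeteroKV (an ordered top level whose leaves are partitioned hash tables) can be mimicked in plain software to see why it serves reads, inserts and scans at once: point operations hash inside one bucket, while range scans only touch and sort the few buckets they overlap. The sketch below is a simplified software analogue with hypothetical bucket boundaries, not the CPU-FPGA design itself.

```python
# Ordered list of bucket boundaries on top, hash tables (dicts) in the leaves.
import bisect

class HybridKV:
    def __init__(self, boundaries):
        self.bounds = sorted(boundaries)          # lower bound of each bucket
        self.buckets = [dict() for _ in self.bounds]

    def _bucket(self, key):
        i = bisect.bisect_right(self.bounds, key) - 1
        if i < 0:
            raise KeyError(f"{key!r} below first bucket boundary")
        return self.buckets[i]

    def put(self, key, value):                    # O(1) hash insert
        self._bucket(key)[key] = value

    def get(self, key):                           # O(1) hash lookup
        return self._bucket(key)[key]

    def scan(self, lo, hi):
        """Yield (key, value) for lo <= key < hi in key order."""
        first = max(bisect.bisect_right(self.bounds, lo) - 1, 0)
        last = bisect.bisect_right(self.bounds, hi)
        for bucket in self.buckets[first:last]:   # only the touched buckets
            for k in sorted(k for k in bucket if lo <= k < hi):
                yield k, bucket[k]

kv = HybridKV(boundaries=[0, 100, 200, 300])
for k in (5, 150, 151, 250, 42):
    kv.put(k, f"v{k}")
print(kv.get(150))                 # v150
print(list(kv.scan(40, 260)))      # ordered: 42, 150, 151, 250
```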
IP5_5 Interactive Presentations
Date: Wednesday, 03 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/47vFGRJEaTagrvs2Y
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Label | Presentation Title and Authors |
---|---|
IP5_5.1 | RUNTIME FAULT INJECTION DETECTION FOR FPGA-BASED DNN EXECUTION USING SIAMESE PATH VERIFICATION Speaker: Xianglong Feng, Rutgers University, US Authors: Xianglong Feng, Mengmei Ye, Ke Xia and Sheng Wei, Rutgers University, US Abstract Deep neural networks (DNNs) have been deployed on FPGAs to achieve improved performance, power efficiency, and design flexibility. However, the FPGA-based DNNs are vulnerable to fault injection attacks that aim to compromise the original functionality. The existing defense methods either duplicate the models and check the consistency of the results at runtime, or strengthen the robustness of the models by adding additional neurons. However, these existing methods could introduce huge overhead or require retraining the models. In this paper, we develop a runtime verification method, namely Siamese path verification (SPV), to detect fault injection attacks for FPGA-based DNN execution. By leveraging the computing features of the DNN and designing the weight parameters, SPV adds neurons to check the integrity of the model without impacting the original functionality and, therefore, model retraining is not required. We evaluate the proposed SPV approach on Xilinx Virtex-7 FPGA using the MNIST dataset. The evaluation results show that SPV achieves the security goal with low overhead. |
IP5_5.2 | TRULOOK: A FRAMEWORK FOR CONFIGURABLE GPU APPROXIMATION Speaker: Mohsen Imani, University of California Irvine, US Authors: Ricardo Garcia1, Fatemeh Asgarinejad1, Behnam Khaleghi1, Tajana Rosing1 and Mohsen Imani2 1University of California San Diego, US; 2University of California Irvine, US Abstract In this paper, we propose TruLook, a framework that employs approximate computing techniques for GPU acceleration through computation reuse as well as approximate arithmetic operations to eliminate redundant and unnecessary exact computations. To enable computational reuse, GPU is enhanced with small lookup tables which are placed close to the stream cores that return already computed values for exact and potential inexact matches. Inexact matching is subject to a threshold that is controlled by the number of mantissa bits involved in the search. Approximate arithmetic is provided by a configurable approximate multiplier that dynamically detects and approximates operations which are not significantly affected by approximation. TruLook guarantees the accuracy bound required for an application by configuring the hardware at runtime. We have evaluated TruLook efficiency on a wide range of multimedia and deep learning applications. Our evaluation shows that with 0% and less than 1% quality loss budget, TruLook yields on average 2.1× and 5.6× energy-delay product improvement over four popular networks on ImageNet dataset. |
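The computation-reuse idea in IP5_5.2 (TruLook) can be illustrated in software with a lookup table keyed on operands whose mantissas are truncated to a configurable number of bits, so nearby inputs hit the table instead of recomputing; per the abstract, the matching threshold is controlled by the number of mantissa bits involved in the search. The sketch below is our own scalar analogue of that mechanism; the class name, parameters and error figures are illustrative, not the GPU design.

```python
# Approximate memoization: keys are floats with all but the top few
# mantissa bits zeroed, so "close enough" operands reuse a cached result.
import math
import struct

def quantize(x, mantissa_bits):
    """Zero out all but the top `mantissa_bits` bits of a double's mantissa."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    mask = ~((1 << (52 - mantissa_bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack("<d", struct.pack("<Q", bits & mask))[0]

class ApproxLUT:
    def __init__(self, fn, mantissa_bits=8):
        self.fn, self.bits, self.table = fn, mantissa_bits, {}
        self.hits = self.misses = 0

    def __call__(self, x):
        key = quantize(x, self.bits)
        if key in self.table:              # exact or inexact match: reuse
            self.hits += 1
            return self.table[key]
        self.misses += 1
        self.table[key] = result = self.fn(x)
        return result

lut = ApproxLUT(math.exp, mantissa_bits=8)
ys = [lut(1.0 + i * 1e-6) for i in range(1000)]   # nearly-equal inputs
errs = [abs(y - math.exp(1.0 + i * 1e-6)) / math.exp(1.0)
        for i, y in enumerate(ys)]
print(f"hits={lut.hits} misses={lut.misses}")     # mostly reuse
print(f"max rel. error: {max(errs):.2e}")         # bounded by the key width
```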
UB.12 University Booth
Date: Wednesday, 03 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/QbaNDvemHm8TyhgvQ
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title and Authors |
---|---|
UB.12 | HARDBLOCK: DEMONSTRATOR OF PHYSICALLY BINDING AN IOT DEVICE TO A NON-FUNGIBLE TOKEN IN ETHEREUM BLOCKCHAIN Speakers: Javier Arcenegui, Rosario Arjona and Iluminada Baturone, Universidad de Sevilla - CSIC, ES Authors: Javier Arcenegui, Rosario Arjona and Iluminada Baturone, Universidad de Sevilla - CSIC, ES Abstract Nowadays, blockchain is a growing technology in the Internet of Things (IoT) ecosystem. In this work, we show a demonstrator of an IoT device bound to a Non-Fungible Token (NFT) based on the ERC-721 standard of the Ethereum blockchain. The advantage of our solution is that IoT devices can be controlled securely by events from the blockchain and authenticated users, besides being able to carry out blockchain transactions. The IoT device generates its own Blockchain Account (BCA) using a secret seed first generated by a True Random Number Generator (TRNG) and then reconstructed by a Physical Unclonable Function (PUF). A Pycom WiPy 3.0 board with the ESP32 microcontroller is employed as the IoT device. The internal SRAM of the microcontroller acts as PUF and TRNG. The SRAM is controlled by firmware developed in ESP-IDF. A smart contract developed in Solidity using the Remix IDE creates the token. The Kovan testnet and a Graphical User Interface programmed in Python are employed to show the results. |
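The seed handling described in UB.12 (a TRNG-generated seed later reconstructed from a noisy SRAM-PUF response) follows the general pattern of a fuzzy extractor, which the toy model below simulates with a simple repetition code. All parameters, the noise model and the helper-data construction are our own simplifications for illustration, with no security claims; they are not the demonstrator's actual scheme.

```python
# Toy repetition-code fuzzy extractor: bind a TRNG seed to a noisy PUF
# response at enrolment; reconstruct the seed by majority vote at boot.
import secrets
import random

REP = 15          # repetition factor per seed bit (hypothetical)
SEED_BITS = 32
NOISE = 0.05      # probability an SRAM cell flips across power-ups

def powerup(reference, noise):
    """One noisy power-up read of the SRAM PUF cells."""
    return [b ^ (random.random() < noise) for b in reference]

reference = [secrets.randbits(1) for _ in range(SEED_BITS * REP)]  # ideal PUF
seed = [secrets.randbits(1) for _ in range(SEED_BITS)]             # from TRNG

# Enrolment: helper data = PUF response XOR repetition-encoded seed.
helper = [r ^ seed[i // REP] for i, r in enumerate(reference)]

# Reconstruction at boot: noisy response XOR helper, then majority vote,
# which cancels the reference bits and leaves (seed bit XOR noise flips).
noisy = powerup(reference, NOISE)
votes = [n ^ h for n, h in zip(noisy, helper)]
recovered = [int(sum(votes[i * REP:(i + 1) * REP]) > REP // 2)
             for i in range(SEED_BITS)]
print("seed recovered:", recovered == seed)
```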
UB.13 University Booth
Date: Wednesday, 03 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/ukhjmT9xMo3Zd5L5e
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title and Authors |
---|---|
UB.13 | EUCLID-NIR GPU: AN ON-BOARD PROCESSING GPU-ACCELERATED SPACE CASE STUDY DEMONSTRATOR Speaker: Ivan Rodriguez Ferrandez, BSC/UPC, ES Authors: Ivan Rodriguez and Leonidas Kosmidis, BSC / UPC, ES Abstract Embedded Graphics Processing Units (GPUs) are very attractive candidates for on-board payload processing of future space systems, thanks to their high performance and low power consumption. Although there is significant interest from both academia and industry, there is as yet no open and publicly available case study showing their capabilities. In this master thesis project, which was performed within the GPU4S (GPU for Space) ESA-funded project, we have parallelised and ported the Euclid NIR (Near Infrared) image processing algorithm, used in the European Space Agency's (ESA) mission to be launched in 2022, to an automotive GPU platform, the NVIDIA Xavier. In the demo we will present in real time the significantly higher performance achieved compared to the original sequential implementation. In addition, visitors will have the opportunity to examine the images on which the algorithm operates, as well as to inspect the algorithm parallelisation through profiling and code inspection. |
K.4 Keynote - Special day on CPS and I4.0
Date: Wednesday, 03 February 2021
Time: 15:00 CET - 15:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/omCcNzNqB4zQCZxk7
Session chair:
Enrico Macii, Politecnico di Torino, IT
Session co-chair:
Frank Schirrmeister, Cadence, US
Cyber-Physical Systems are the underlying force enabling Industry 4.0. They pave the way to new and powerful capabilities, like factory digital twins, and a continuum of AI-based decision making at the Edge and in the Cloud. Enabling all of these opportunities requires a coherent portfolio of MCUs and MPUs covering the full spectrum from high-performance to ultra-low power CPUs, connectivity solutions, end-to-end security, and cost-effective sensing, while embedding artificial intelligence and machine learning for end nodes. STMicroelectronics is in a privileged position at the front lines of both designing and manufacturing products for these cyber-physical systems, on which Philippe Magarshack has some insights to share. Bio: Since 2016, Philippe Magarshack has been MDG Group Vice President at STMicroelectronics, in charge of Microcontrollers and Digital ICs Group (MDG) Strategy, Technology & System Architecture. Magarshack was President of the Minalogic Collaborative R&D Cluster in Grenoble, France, from 2014 to 2020. In 2012, he was VP for Design Enablement & Services, with a focus on the 28nm FD-SOI design ecosystem, and then, during 2015, CTO of the Embedded Processing Solutions. In 2005, Magarshack was appointed Group VP and GM of ST's Central CAD and Design Solutions for technologies ranging from CMOS to BiCMOS and embedded NVM. In 1994, Magarshack joined the Central R&D Group of SGS-THOMSON Microelectronics (now STMicroelectronics), where he held several roles in CAD and Libraries management for advanced integrated-circuit manufacturing processes. From 1985 to 1989, Magarshack worked as a microprocessor designer at AT&T Bell Labs in the USA. Magarshack graduated with an engineering degree in Physics from Ecole Polytechnique, Paris, France, and with an Electronics Engineering degree from Ecole Nationale Supérieure des Télécommunications in Paris, France.
Time | Label | Presentation Title and Authors |
---|---|---|
15:00 CET | K.4.1 | CYBER-PHYSICAL SYSTEMS FOR INDUSTRY 4.0: AN INDUSTRIAL PERSPECTIVE Speaker and Author: Philippe Magarshack, STMicroelectronics, FR Abstract Cyber-Physical Systems are the underlying force enabling Industry 4.0. They pave the way to new and powerful capabilities, like factory digital twin, and a continuum of AI-based decision making at the Edge and in the Cloud. Enabling all of these opportunities requires a coherent portfolio of MCUs and MPUs covering the full spectrum from high-performance to ultra-low power CPUs, connectivity solutions, end-to-end security, and cost-effective sensing, while embedding artificial intelligence and machine learning for end nodes. STMicroelectronics is in a privileged position at the front lines of both designing and manufacturing products for these cyber-physical systems on which Philippe Magarshack has some insights to share. |
7.1 Panel - Special day on CPS and I4.0
Date: Wednesday, 03 February 2021
Time: 16:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/TgZw6anGFuaxMRv9P
Session chair:
Enrico Macii, Politecnico di Torino, IT
Session co-chair:
Frank Schirrmeister, Cadence, US
Organizer:
Frank Schirrmeister, Cadence, US
Sensor-based, communication-enabled autonomous cyber-physical systems (CPS) require design and development techniques that holistically address hardware, software, electronic and mechanical components. This panel will discuss the resulting requirements that CPS places on development tools, as they have to span multiple disciplines, either holistically or via proper interfaces. The panelists will discuss gaps in current electronic design automation flows, potential enhancements, interfaces to simulation and analysis in the computational software domain, the role of standardization, and priorities for the industry to address. Panelists: Ed Sperling, SemiEngineering.com, USA (Moderator); Laurent Maillet-Contoz, STMicroelectronics, France; Jean-Marie Brunet, Siemens EDA, USA; Maurizio Griva, Reply, Italy; Frank Schirrmeister, Cadence, USA
Time | Label | Presentation Title and Authors |
---|---|---|
16:00 CET | 7.1.1 | PANEL: IS EDA READY FOR CYBER-PHYSICAL SYSTEMS? Panelists: Ed Sperling1, Laurent Maillet-Contoz2, Jean-Marie Brunet3, Maurizio Griva4 and Frank Schirrmeister5 1SemiEngineering.com, US; 2STMicroelectronics, FR; 3Siemens EDA, US; 4Reply, IT; 5Cadence Design Systems, US Abstract Sensor-based, communication-enabled autonomous cyber-physical systems (CPS) require design and development techniques that holistically address hardware, software, electronic and mechanical components. This panel will discuss the resulting requirements that CPS places on development tools, as they have to span multiple disciplines, either holistically or via proper interfaces. The panelists will discuss gaps in current electronic design automation flows, potential enhancements, interfaces to simulation and analysis in the computational software domain, the role of standardization, and priorities for the industry to address. Panelists: Ed Sperling, SemiEngineering.com, USA (Moderator); Laurent Maillet-Contoz, STMicroelectronics, France; Jean-Marie Brunet, Siemens EDA, USA; Maurizio Griva, Reply, Italy; Frank Schirrmeister, Cadence, USA |
17:00 CET | 7.1.2 | PANEL Q&A Panelists: Ed Sperling1, Laurent Maillet-Contoz2, Jean-Marie Brunet3, Maurizio Griva4 and Frank Schirrmeister5 1SemiEngineering.com, US; 2STMicroelectronics, FR; 3Siemens EDA, US; 4Reply, IT; 5Cadence Design Systems, US Abstract 30 minutes of live question and answer time for interaction among panelists and audience. |
7.2 Algorithm-hardware co-design approaches for low-power, real-time and robust artificial intelligence
Date: Wednesday, 03 February 2021
Time: 16:00 CET - 17:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/2ouvWhzReFazXZupg
Session chair:
Amit Ranjan Trivedi, Electrical and Computer Engineering, University of Illinois at Chicago, US
Session co-chair:
Priyadarshini Panda, Electrical Engineering, Yale University, US
Organizers:
Priyadarshini Panda, Electrical Engineering, Yale University, US
Amit Ranjan Trivedi, Electrical and Computer Engineering, University of Illinois at Chicago, US
Artificial Intelligence (AI) algorithms have shown that the growing volume and variety of data, faster computing power, and efficient storage can be leveraged for highly accurate predictions and decision-making in complex computing problems. Earlier AI platforms were generally confined to static models where prediction accuracy mattered the most, while the computational efficiency of algorithms was not critical. At present, however, AI has been gaining momentum towards real-time applications. Real-time applications require dynamic decision-making based on evolving inputs and thus require AI models to predict with minimal latency. To mitigate concerns over unpredictable latency and privacy in cloud-based processing, it is preferable to implement real-time AI algorithms at the edge node itself. Meanwhile, since edge nodes are often battery-operated, complex AI algorithms must operate within a stringent energy budget. Robustness of AI algorithms also becomes more critical for real-time applications, such as self-driving cars and surgical robotics, where mispredictions can have fatal consequences. Interacting with increasingly sophisticated decision-making systems is becoming more and more a part of our daily life. This creates an immense responsibility for the designers of these systems to build them in a way that guarantees safe interaction with their users and good performance in the presence of noise, changes in the environment, and model misspecification and uncertainty. Any progress in this area will be a huge step forward in using decision-making algorithms in emerging high-stakes applications, such as autonomous driving, robotics, power systems, health care, recommendation systems, and finance. To address the emerging constraints on AI, this special session brings together experts focusing on various critical approaches for low-power and real-time AI, ranging from spiking neural networks to compute-in-memory to digital accelerators. A unique contribution of this special session is to raise awareness of reliability and robustness constraints on low-power AI and to highlight the potential of co-design approaches to address these key challenges through our recent works as well as the recent literature.
Time | Label | Presentation Title and Authors |
---|---|---|
16:00 CET | 7.2.1 | EFFICIENCY-DRIVEN HARDWARE OPTIMIZATION FOR ADVERSARIALLY ROBUST NEURAL NETWORKS Speaker: Abhiroop Bhattacharjee, Yale University, US Authors: Priyadarshini Panda, Abhiroop Bhattacharjee and Abhishek Moitra, Yale University, US Abstract With a growing need to enable intelligence in embedded devices in the Internet of Things (IoT) era, secure hardware implementation of DNNs has become imperative. We will focus on how to address adversarial robustness for Deep Neural Networks (DNNs) through efficiency-driven hardware optimizations. Since memory (specifically, dot-product operations) is a key energy-spending component for DNNs, hardware approaches in the past have focused on optimizing the memory. One such approach is approximate digital CMOS memories with hybrid 6T-8T SRAM cells that enable supply voltage (Vdd) scaling yielding low-power operation, without significantly affecting the performance due to read/write failures incurred in the 6T cells. In this talk, we will show how the bit-errors in the 6T cells of hybrid 6T-8T memories minimize the adversarial perturbations in a DNN. Essentially, we find that for different configurations of 8T-6T ratios and scaled Vdd operation, noise incurred in the hybrid memory architectures is bound within specific limits. This hardware noise can potentially interfere in the creation of adversarial attacks in DNNs yielding robustness. Another memory optimization approach involves using analog memristive crossbars that perform Matrix-Vector-Multiplications (MVMs) efficiently with low energy and area requirements. However, crossbars generally suffer from intrinsic non-idealities that cause errors in performing MVMs, leading to degradation in the accuracy of the DNNs. We will show how the intrinsic hardware variations manifested through crossbar non-idealities yield adversarial robustness to the mapped DNNs without any additional optimization. |
16:15 CET | 7.2.2 | COMPUTE-IN-MEMORY UPSIDE DOWN: A LEARNING OPERATOR CO-DESIGN PERSPECTIVE FOR SCALABILITY Speaker: Amit Trivedi, University of Illinois at Chicago, US Authors: Amit Trivedi, Shamma Nasrin, Shruthi Jaisimha and Priyesh Shukla, University of Illinois at Chicago, US Abstract In this paper, we discuss the potential of model-hardware co-design to considerably simplify the implementation complexity of compute-in-SRAM deep learning. Although compute-in-SRAM has emerged as a promising approach to improve the energy efficiency of DNN processing, current implementations suffer due to complex and excessive mixed-signal peripherals, such as the need for parallel digital-to-analog converters (DACs) at each input port. Comparatively, our approach inherently obviates complex peripherals by co-designing learning operators to SRAM's operational constraints. For example, our co-designed implementation is DAC-free even for multi-bit precision DNN processing. Additionally, we also discuss the interaction of our compute-in-SRAM operator with Bayesian inference of DNNs. We show a synergistic interaction of Bayesian inference with our framework where Bayesian methods allow achieving similar accuracy with a much smaller network size. Although each iteration of sample-based Bayesian inference is computationally expensive, the cost is minimized by our compute-in-SRAM approach. Meanwhile, by reducing the network size, Bayesian methods reduce the footprint cost of compute-in-SRAM implementation which is a key concern for the method. We characterize this interaction for deep learning-based pose (position and orientation) estimation for a drone. |
16:30 CET | 7.2.3 | RELIABLE EDGE INTELLIGENCE IN UNRELIABLE ENVIRONMENT Speaker: Saibal Mukhopadhyay, Georgia Institute of Technology, US Authors: Minah Lee, Xueyuan She, Biswadeep Chakraborty, Saurabh Dash, Burhan Ahmad Mudassar and Saibal Mukhopadhyay, Georgia Institute of Technology, US Abstract A key challenge for deployment of artificial intelligence (AI) in real-time safety-critical systems at the edge is to ensure reliable performance even in unreliable environments. This paper will present a broad perspective on how to design AI platforms to achieve this unique goal. First, we will present examples of AI architecture and algorithm that can assist in improving robustness against input perturbations. Next, we will discuss examples of how to make AI platforms robust against hardware induced noise and variation. Finally, we will discuss the concept of using lightweight networks as reliability estimators to generate early warning of potential task failures. |
16:45 CET | 7.2.4 | EXPLORING SPIKE-BASED LEARNING FOR NEUROMORPHIC COMPUTING: PROSPECTS AND PERSPECTIVES Speaker: Kaushik Roy, Purdue University, US Authors: Nitin Rathi, Amogh Agrawal, Chankyu Lee, Adarsh Kosta and Kaushik Roy, Purdue University, US Abstract Spiking neural networks (SNNs) operating with sparse binary signals (spikes) implemented on event-driven hardware can potentially be more energy-efficient than traditional artificial neural networks (ANNs). However, SNNs perform computations over time, and the neuron activation function does not have a well-defined derivative leading to unique training challenges. In this paper, we discuss the various spike representations and training mechanisms for deep SNNs. Additionally, we review applications that go beyond classification, like gesture recognition, motion estimation, and sequential learning. The unique features of SNNs, such as high activation sparsity and spike-based computations, can be leveraged in hardware implementations for energy-efficient processing. To that effect, we discuss various SNN implementations, both using digital ASICs as well as analog in-memory computing primitives. Finally, we present an outlook on future applications and open research areas for both SNN algorithms and hardware implementations. |
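As background for 7.2.4: the basic unit of an SNN is the leaky integrate-and-fire (LIF) neuron, whose membrane potential leaks, integrates input current, and emits a binary spike on crossing a threshold. The sketch below simulates one such neuron in discrete time with illustrative parameters (not the authors' models) and shows the rate coding and activation sparsity that event-driven hardware exploits.

```python
# Discrete-time leaky integrate-and-fire neuron with hard reset.
import numpy as np

def lif(inputs, leak=0.9, threshold=1.0):
    """inputs: (T,) synaptic current per step; returns a 0/1 spike train."""
    v, spikes = 0.0, []
    for i in inputs:
        v = leak * v + i          # leaky integration of input current
        if v >= threshold:        # fire and reset the membrane potential
            spikes.append(1)
            v = 0.0
        else:
            spikes.append(0)
    return np.array(spikes)

rng = np.random.default_rng(0)
weak = lif(rng.uniform(0.0, 0.2, size=100))
strong = lif(rng.uniform(0.0, 0.6, size=100))
# Rate coding: stronger input current -> higher output spike rate, while
# most time steps stay silent (the sparsity SNN hardware leverages).
print("weak input rate:", weak.mean(), "strong input rate:", strong.mean())
```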
7.3 Stochastic and Approximate Computing
Date: Wednesday, 03 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/zkKjDQW6KMHipmgL4
Session chair:
Ilia Polian, Universität Stuttgart, DE
Session co-chair:
Frank Sill Torres, German Aerospace Center (DLR) / Institute for the Protection of Maritime Infrastructures, DE
This session focuses on the application of stochastic and approximate computing concepts for emerging technologies. The first work of the session discusses the application of an FSM-based circuit for generating Low-Discrepancy input bitstreams that can be employed for stochastic computing. The following presentation concentrates on the application of printed electronics for implementing neural networks based on stochastic computing. The final work of this session presents a workload-aware approach that aims at the identification of ideal configurations of approximate units in order to minimize energy consumption.
Time | Label | Presentation Title and Authors |
---|---|---|
16:00 CET | 7.3.1 | A LOW-COST FSM-BASED BIT-STREAM GENERATOR FOR LOW-DISCREPANCY STOCHASTIC COMPUTING Speaker: Sina Asadi, University of Louisiana at Lafayette, US Authors: Sina Asadi1, M. Hassan Najafi2 and Mohsen Imani3 1University of Louisiana, US; 2University of Louisiana at Lafayette, US; 3University of California Irvine, US Abstract Low-discrepancy (LD) bit-streams have been proposed to improve the accuracy and computation speed of stochastic computing (SC) circuits. These bit-streams are conventionally generated by using a quasi-random number generator, such as a Sobol sequence generator, and a comparator. The high hardware cost of such number generators makes the current comparator-based generators expensive in terms of area and power. The hardware cost issue is further aggravated when increasing the number of inputs and the precision of the data. A finite state machine (FSM)-based LD bit-stream generator was proposed recently to mitigate this hardware cost. That generator, however, can only generate a specific LD pattern and hence cannot be used where multiple independent LD bit-streams are needed. In this work, we propose a low-cost FSM-based LD bit-stream generator that supports the generation of any number of independent LD bit-streams. The proposed generator reduces the hardware area and the area-delay product by up to 80% compared to those of the state-of-the-art comparator-based LD bit-stream generator while generating accurate bit-streams. We develop a parallel design of the proposed generator and show that the 8x parallel implementation reduces the hardware cost on average by more than 82% compared to the cost of the state-of-the-art parallel LD generator. Taking advantage of the resulting area savings, we improve the fault tolerance of the bit-stream generation unit, a vulnerable component in SC systems, by orders of magnitude. We show the effectiveness of using the proposed generator in the SC-based design of the convolution function as a case study. (An illustrative bit-stream sketch follows this table.) |
16:15 CET | 7.3.2 | PRINTED STOCHASTIC COMPUTING NEURAL NETWORKS Speaker: Dennis Weller, Karlsruhe Institute of Technology, DE Authors: Dennis Weller1, Nathaniel Bleier2, Michael Hefenbrock1, Jasmin Aghassi-Hagmann3, Michael Beigl1, Rakesh Kumar4 and Mehdi Tahoori1 1Karlsruhe Institute of Technology, DE; 2Universitiy of Illinois Urbana-Champaign, US; 3Offenburg University of Applied Sciences, DE; 4University of Illinois Urbana-Champaign, US Abstract Printed electronics (PE) offers flexible, extremely low-cost, and on-demand hardware due to its additive manufacturing process, enabling emerging ultra-low-cost applications, including machine learning applications. However, large feature sizes in PE limit the complexity of a machine learning classifier (e.g., a neural network) in PE. Stochastic computing Neural Networks (SC-NNs) can reduce area in silicon technologies, but still require complex designs due to unique implementation tradeoffs in PE. In this paper, we propose a printed mixed-signal system, which substitutes complex and power-hungry conventional stochastic computing (SC) components by printed analog designs. The printed mixed-signal SC consumes only 35% of power consumption and requires only 25% of area compared to a conventional 4-bit NN implementation. We also show that the proposed mixed-signal SC-NN provides good accuracy for popular neural network classification problems. We consider this work as an important step towards the realization of printed SC-NN hardware for near-sensor-processing. |
16:30 CET | 7.3.3 | WORKLOAD-AWARE APPROXIMATE COMPUTING CONFIGURATION Speaker: Xun Jiao, Villanova University, US Authors: Dongning Ma1, Rahul Thapa1, Xingjian Wang1, Cong Hao2 and Xun Jiao1 1Villanova University, US; 2UIUC, US Abstract Approximate computing has recently gained traction due to its success in many error-tolerant applications, such as multimedia processing. Various approximation methods have demonstrated the effectiveness of relaxing precision requirements in a specific arithmetic unit. This provides a basis for exploring the simultaneous use of multiple approximate units to improve efficiency. In this paper, we aim to identify a proper approximation configuration of the approximate units in a program to minimize energy consumption while meeting quality constraints. To do this, we formulate a constrained optimization problem and develop a tool called WOAxC that uses a genetic algorithm to solve it. WOAxC considers the impact of different input workloads on application quality. We evaluate the efficacy of WOAxC in minimizing the energy consumption of several image processing applications with varying size (i.e., number of operations), workload (i.e., input datasets), and quality constraints. Our evaluation shows that the configuration provided by WOAxC for a system with multiple approximate units improves energy efficiency by, on average, 79.6%, 77.4%, and 70.94% for quality losses of 5%, 2.5% and 0% (no loss), respectively. To the best of our knowledge, WOAxC is the first workload-aware approach to identify a proper approximation configuration for energy minimization under a quality guarantee. |
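For readers new to 7.3.1's setting: in stochastic computing, a value x in [0,1] is encoded as a bit-stream with 1-density x, and a single AND gate multiplies two independent streams. The sketch below builds such streams with the conventional comparator-based generator that the paper's FSM design replaces, using Halton radical inverses in bases 2 and 3 as two independent low-discrepancy sequences (our choice for illustration; the paper discusses Sobol generators).

```python
# Stochastic-computing multiplication with low-discrepancy bit-streams.

def radical_inverse(i, base):
    """Halton/van der Corput radical inverse: an LD sequence in [0, 1)."""
    f, r = 1.0, 0.0
    while i:
        f /= base
        r += f * (i % base)
        i //= base
    return r

def ld_stream(x, n, base):
    """Comparator-based generator: bit i is 1 iff the LD sample is below x."""
    return [1 if radical_inverse(i, base) < x else 0 for i in range(n)]

N = 1024
x, y = 0.30, 0.70
sx = ld_stream(x, N, base=2)
sy = ld_stream(y, N, base=3)   # different base keeps the streams independent
product = sum(a & b for a, b in zip(sx, sy)) / N   # one AND gate per bit
print(f"SC estimate of {x}*{y}: {product:.4f} (exact {x * y:.4f})")
```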
7.4 AI accelerator design with in-memory computing and emerging memories
Date: Wednesday, 03 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/vJBHmQpgkyWx7Ks5L
Session chair:
Jae-Sun Seo, Arizona State University, US
Session co-chair:
Huichu Liu, Facebook, US
Driven by the ever-growing need for energy efficiency, emerging memories and compute-in-memory techniques have continuously attracted interest for AI accelerator design. This session focuses on several innovative design frameworks and mapping methodologies for state-of-the-art AI accelerators, with three papers on in-memory computing: (1) a peripheral circuit-aware pruning framework for ReRAM-based in-memory computing, (2) a runtime reconfigurable design methodology for FeFET-based in-memory computing and (3) a modeling framework for an SRAM-based in-memory computing accelerator. In addition, the session also discusses data and computation mapping optimization using different emerging memories in machine learning accelerators.
Time | Label | Presentation Title and Authors |
---|---|---|
16:00 CET | 7.4.1 | (Best Paper Award Candidate) TINYADC: PERIPHERAL CIRCUIT-AWARE WEIGHT PRUNING FRAMEWORK FOR MIXED-SIGNAL DNN ACCELERATORS Speaker: Geng Yuan, Northeastern University, US Authors: Geng Yuan1, Payman Behnam2, Yuxuan Cai1, Ali Shafiee3, Jingyan Fu4, Zhiheng Liao4, Zhengang Li1, Xiaolong Ma1, Jieren Deng5, Jinhui Wang6, Mahdi Bojnordi7, Yanzhi Wang1 and Caiwen Ding5 1Northeastern University, US; 2Georgia Institute of Technology, US; 3Samsung, US; 4North Dakota State University, US; 5University of Connecticut, US; 6University of South Alabama, US; 7University of Utah, US Abstract As the number of weight parameters in deep neural networks (DNNs) continues growing, the demand for ultra-efficient DNN accelerators has motivated research on non-traditional architectures with emerging technologies. Resistive Random-Access Memory (ReRAM) crossbars have been utilized to perform in-situ matrix-vector multiplication for DNNs. DNN weight pruning techniques have also been applied to ReRAM-based mixed-signal DNN accelerators, focusing on reducing weight storage and accelerating computation. However, existing works capture very few peripheral circuit features, such as analog-to-digital converters (ADCs), during neural network design. Unfortunately, ADCs have become the main contributor to the power consumption and area cost of current mixed-signal accelerators, and the large overhead of these peripheral circuits is not solved efficiently. To address this problem, we propose a novel weight pruning framework for ReRAM-based mixed-signal DNN accelerators, named TinyADC, which effectively reduces the required bits for ADC resolution and hence the overall area and power consumption of the accelerator, without introducing any computational inaccuracy. Compared to state-of-the-art pruning work on the ImageNet dataset, TinyADC achieves 3.5X and 2.9X power and area reduction, respectively. The TinyADC framework improves the throughput of a state-of-the-art architecture design by 29% and 40% in terms of throughput per square millimeter and throughput per watt, respectively. |
16:15 CET | 7.4.2 | A RUNTIME RECONFIGURABLE DESIGN OF COMPUTE-IN-MEMORY BASED HARDWARE ACCELERATOR Speaker: Anni Lu, School of Electrical and Computer Engineering, Georgia Institute of Technology, US Authors: Anni Lu, Xiaochen Peng, Yandong Luo, Shanshi Huang and Shimeng Yu, Georgia Institute of Technology, US Abstract Compute-in-memory (CIM) is an attractive solution to address the "memory wall" challenges posed by the extensive computation in machine learning hardware accelerators. Although prior CIM-based architectures can adapt to different neural network models at design time, each is implemented as a separate custom chip; a specific chip instance is therefore restricted to a specific network at runtime. However, the development cycle of the hardware normally lags far behind the emergence of new algorithms. In this paper, a runtime reconfigurable design methodology for CIM-based accelerators is proposed to support a class of convolutional neural networks running on one pre-fabricated chip instance. First, several design aspects are investigated: 1) a reconfigurable weight mapping method; 2) the input side of data transmission, mainly concerning weight reloading; 3) the output side of data processing, mainly concerning reconfigurable accumulation. Then, system-level performance benchmarking is performed for the inference of different models, such as VGG-8 on the CIFAR-10 dataset and AlexNet, GoogLeNet, ResNet-18 and DenseNet-121 on the ImageNet dataset, to measure the tradeoffs between runtime reconfigurability, chip area, memory utilization, throughput and energy efficiency. |
16:30 CET | IP6_4.1 | A CASE FOR EMERGING MEMORIES IN DNN ACCELERATORS Speaker: Avilash Mukherjee, University of British Columbia, CA Authors: Avilash Mukherjee1, Kumar Saurav2, Prashant Nair1, Sudip Shekhar1 and Mieszko Lis1 1University of British Columbia, CA; 2QUALCOMM INDIA, IN Abstract The popularity of Deep Neural Networks (DNNs) has led to many DNN accelerator architectures, which typically focus on the on-chip storage and computation costs. However, much of the energy is spent on accesses to off-chip DRAM memory. While emerging resistive memory technologies such as MRAM, PCM, and RRAM can potentially reduce this energy component, they suffer from drawbacks such as low endurance that prevent them from being a DRAM replacement in DNN applications. In this paper, we examine how DNN accelerators can be designed to overcome these limitations and how emerging memories can be used for off-chip storage. We demonstrate that through (a) careful mapping of DNN computation to the accelerator and (b) a hybrid setup (both DRAM and an emerging memory), we can reduce inference energy over a DRAM-only design by a factor ranging from 1.12x on EfficientNetB7 to 6.3x on ResNet-50, while also increasing the endurance from 2 weeks to over a decade. As the energy benefits vary dramatically across DNN models, we also develop a simple analytical heuristic solely based on DNN model parameters that predicts the suitability of a given DNN for emerging-memory-based accelerators. |
16:31 CET | 7.4.3 | MODELING AND OPTIMIZATION OF SRAM-BASED IN-MEMORY COMPUTING HARDWARE DESIGN Speaker: Jyotishman Saikia, Arizona State University, US Authors: Jyotishman Saikia1, Shihui Yin1, Sai Kiran Cherupally1, Bo Zhang2, Jian Meng1, Mingoo Seok2 and Jae-sun Seo1 1Arizona State University, US; 2Columbia University, US Abstract In-memory computing (IMC) has been demonstrated as a promising technique to significantly improve energy efficiency for deep neural network (DNN) hardware accelerators. However, designing one involves setting many design variables, such as the number of parallel rows to assert, the analog-to-digital converter (ADC) at the periphery of the memory sub-array, the activation/weight precisions of DNNs, etc., which affect energy efficiency, DNN accuracy, and area. While individual IMC designs have been presented in the literature, they have not investigated this multi-dimensional design optimization. In this paper, to fill this knowledge gap, we present an SRAM-based IMC hardware modeling and optimization framework. A unified systematic study closely models IMC hardware and investigates how a number of design variables and nonidealities (e.g., device mismatch and ADC quantization) affect the DNN accuracy of IMC designs. To maintain high DNN accuracy for IMC SRAM hardware, it is shown that the number of activated rows, ADC resolution, ADC quantization range, and different sources of variability/noise need to be carefully selected and co-optimized with the underlying DNN algorithm to be implemented. |
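A minimal behavioural rendering of the modelling question in 7.4.3: each read asserts R rows of a column, the bit-line accumulates an analogue partial sum, and a b-bit ADC quantises it, so accuracy hinges on the interplay of row count, ADC resolution and noise. The sketch below uses illustrative noise and sizing parameters, not the paper's calibrated SRAM models.

```python
# Behavioural model of an IMC column: binary activations x binary weights,
# R rows asserted per conversion, additive analogue noise, b-bit ADC.
import numpy as np

rng = np.random.default_rng(2)

def imc_dot(acts, weights, rows_per_turn, adc_bits, sigma=0.3):
    """Column dot-product of 0/1 vectors, computed R rows at a time."""
    levels = 2 ** adc_bits
    total = 0.0
    for start in range(0, len(acts), rows_per_turn):
        partial = float(np.dot(acts[start:start + rows_per_turn],
                               weights[start:start + rows_per_turn]))
        partial += rng.normal(0.0, sigma)          # device/offset noise
        # ADC: quantise [0, rows_per_turn] into 2^b levels, then rescale
        step = rows_per_turn / (levels - 1)
        total += round(np.clip(partial, 0, rows_per_turn) / step) * step
    return total

acts = rng.integers(0, 2, size=256)
wts = rng.integers(0, 2, size=256)
exact = int(np.dot(acts, wts))
for bits in (8, 6, 4):   # accuracy degrades as ADC resolution shrinks
    est = np.mean([imc_dot(acts, wts, rows_per_turn=64, adc_bits=bits)
                   for _ in range(200)])
    print(f"{bits}-bit ADC, 64 rows: mean={est:.1f} (exact {exact})")
```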
7.5 Design of emerging technologies from spin wave logic to quantum systems
Date: Wednesday, 03 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/ARPmNZe7t2xniSfDc
Session chair:
Thomas Ernst, CEA-Leti, FR
Session co-chair:
Luca Sterpone, Politecnico di Torino, IT
As conventional methods to improve computing are offering diminishing returns, the future of computing relies more and more on emerging technologies to continue performance improvements. In this session, we explore exciting technology options including: novel spin wave devices with new techniques to realize multi-output logic gates; Adiabatic Quantum-Flux-Parametron (AQFP) superconducting technology, which can offer significantly enhanced power efficiency over state-of-the-art CMOS, and finally, MEMS resonator-based logic gates, which have recently shown interesting aspects of resonators such as reprogrammability, reduced circuit complexity, durability, and low power consumption.
Time | Label | Presentation Title and Authors |
---|---|---|
16:00 CET | 7.5.1 | FAN-OUT OF 2 TRIANGLE SHAPE SPIN WAVE LOGIC GATES Speaker: Abdulqader Mahmoud, Delft University of Technology, NL Authors: Abdulqader Mahmoud1, Frederic Vanderveken2, Florin Ciubotaru2, Christoph Adelmann2, Sorin Cotofana1 and Said Hamdioui1 1TU Delft, NL; 2IMEC, BE Abstract Having multi-output logic gates saves much energy because the same structure can be used to feed multiple inputs of next-stage gates simultaneously. This paper proposes novel triangle-shaped fan-out-of-2 spin wave Majority and XOR gates; the Majority gate is realized by phase detection, whereas the XOR gate is realized by threshold detection. The proposed logic gates are validated by means of micromagnetic simulations. Furthermore, the energy and delay are estimated for the proposed structures and compared with state-of-the-art spin wave, 16nm and 7nm CMOS logic gates. The results demonstrate that the proposed structures provide an energy reduction of 25%-50% in comparison to the other 2-output spin-wave devices while having the same delay, and an energy reduction of 0.8x-43x when compared to the 16nm and 7nm CMOS counterparts, at a delay overhead of 11x-40x. |
16:15 CET | 7.5.2 | TOWARDS AQFP-CAPABLE PHYSICAL DESIGN AUTOMATION Speaker: Hongjia Li, Northeastern University, US Authors: Hongjia Li1, Mengshu Sun1, Tianyun Zhang2, Olivia Chen3, Nobuyuki Yoshikawa3, Bei Yu4, Yanzhi Wang1 and Yibo Lin5 1Northeastern University, US; 2Syracuse University, US; 3Yokohama National University, JP; 4The Chinese University of Hong Kong, HK; 5Peking University, CN Abstract Adiabatic Quantum-Flux-Parametron (AQFP) superconducting technology exhibits high energy efficiency among superconducting electronics; however, it lacks effective design automation tools. In this work, we develop the first efficient placement and routing framework for AQFP circuits that considers their unique features and constraints, using the MIT-LL technology as an example. Our proposed placement framework iteratively executes a fixed-order, row-wise placement algorithm, where the row-wise algorithm derives an optimal solution with polynomial-time complexity. To address the maximum-wirelength constraint of AQFP circuits, a whole row of buffers (or even more rows) is inserted. An A* routing algorithm is adopted as the backbone, incorporating a dynamic step size and a net-negotiation process to reduce the computational complexity while accounting for AQFP characteristics, improving overall routability. Extensive experimental results demonstrate the effectiveness of our proposed framework. |
16:30 CET | IP6_1.1 | IMPLEMENTATION OF A MEMS RESONATOR-BASED DIGITAL TO FREQUENCY CONVERTER USING ARTIFICIAL NEURAL NETWORKS Speaker: Xuecui Zou, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia, SA Authors: Xuecui Zou1, Sally Ahmed2 and Hossein Fariborzi2 1KAUST, Thuwal, Saudi Arabia, SA; 2King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia, SA Abstract This paper proposes a novel approach to micro-electromechanical resonator-based digital-to-frequency converter (DFC) design using artificial neural networks (ANNs). The DFC is a key building block for multiple digital and interface units. We present the design of a 4-bit DFC device which consists of an in-plane clamped-clamped micro-beam resonator and 6 partial electrodes. The digital inputs, which are DC signals applied to the corner partial electrodes, modulate the beam's resonance frequency using the electrostatic softening effect. The main challenge in the design is to find the air gap size between each input electrode and the beam that achieves the desired relationship between the digital input combinations and the corresponding resonance frequencies for a given application. We use a shallow, fully connected feedforward neural network model to estimate the air gaps corresponding to the desired resonance frequency distribution, with less than 1% error. Two special cases are discussed for two applications: equal air gaps for implementing a full adder (FA), and weight-adjusted air gaps for implementing a 4-bit digital-to-analog converter (DAC). The training, validation, and testing datasets are extracted from finite-element-method (FEM) simulations by obtaining the resonance frequencies of the 16 input combinations for different air gap sets. The proposed ANN-based method paves the way for a new design paradigm for MEMS resonator-based logic and opens new routes for designing more complex digital and interface circuits. |
16:31 CET | IP6_1.2 | COMPILATION FLOW FOR CLASSICALLY DEFINED QUANTUM OPERATIONS Speaker: Bruno Schmitt, EPFL, CH Authors: Bruno Schmitt1, Ali Javadi-Abhari2 and Giovanni De Micheli3 1EPFL, CH; 2IBM Research, US; 3EPFL, CH Abstract We present a flow for synthesizing quantum operations that are defined by classical combinational functions. The discussion will focus on out-of-place computation, i.e., $U_f : |x\rangle|y\rangle|0\rangle^k \mapsto |x\rangle|y \oplus f(x)\rangle|0\rangle^k$. Our flow allows users to express this function at a high level of abstraction. At its core, there is an improved version of the current state-of-the-art algorithm for synthesizing oracles [oracle19]. As a result, our synthesized circuits use up to 25% fewer qubits and up to 43% fewer Clifford gates. Crucially, these improvements are possible without increasing the number of $T$ gates or the execution time. |
16:32 CET | 7.5.3 | CIRCUIT MODELS FOR THE CO-SIMULATION OF SUPERCONDUCTING QUANTUM COMPUTING SYSTEMS Speaker: Rohith Acharya, Department of Electrical Engineering (ESAT), KU Leuven, Belgium, BE Authors: Rohith Acharya1, Fahd A. Mohiyaddin2, Anton Potočnik2, Kristiaan De Greve2, Bogdan Govoreanu2, Iuliana P. Radu2, Georges Gielen3 and Francky Catthoor2 1Department of Electrical Engineering (ESAT), Katholieke University Leuven, BE; 2IMEC, BE; 3KU Leuven, BE Abstract Quantum computers based on superconducting qubits have emerged as a leading candidate for a scalable quantum processor architecture. The core of a quantum processor consists of quantum devices that are manipulated using classical electronic circuits, which need to be co-designed for optimal performance and operation. As the principles governing the behavior of the classical circuits and the quantum devices are different, this presents a unique challenge in terms of the simulation, design and optimization of the joint system. A methodology is presented to transform the behavior of small-scale quantum processors to equivalent circuit models that are usable with classical circuits in a generic electrical simulator, enabling the detailed analysis of the impact of many important non-idealities. The methodology has specifically been employed to derive a circuit model of a superconducting qubit interacting with the quantized electromagnetic field of a superconducting resonator. Based on this technique, a comprehensive analysis of the qubit operation is performed, including the coherent control and readout of the qubit using electrical signals. Furthermore, the effect of several non-idealities in the system such as qubit relaxation, decoherence and leakage out of the computational subspace are captured, in contrast to previous works. As the presented method enables the co-simulation of the control electronics with the quantum system, it facilitates the design and optimization of near-term superconducting quantum processors. |
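For reference, the qubit-resonator physics that the equivalent circuit model of 7.5.3 must reproduce is conventionally captured by the Jaynes-Cummings Hamiltonian (standard textbook form, not an equation taken from the paper):

```latex
% Jaynes--Cummings Hamiltonian for a qubit coupled to one resonator mode:
\begin{equation}
  H = \hbar \omega_r \, a^{\dagger} a
    + \tfrac{1}{2} \hbar \omega_q \, \sigma_z
    + \hbar g \left( a^{\dagger} \sigma^{-} + a \, \sigma^{+} \right)
\end{equation}
% \omega_r: resonator frequency, \omega_q: qubit frequency, g: coupling rate.
% In the dispersive regime |\omega_q - \omega_r| \gg g, the qubit state shifts
% the resonator by \pm\chi with \chi = g^2 / (\omega_q - \omega_r), which is
% what makes the electrical readout of the qubit possible.
```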
7.6 System Modeling and Specification
Date: Wednesday, 03 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/28tZHWz7JsiTcQX88
Session chair:
Pierluigi Nuzzo, University of Southern California, US
Session co-chair:
Frederic Mallet, Université Cote d'Azur, FR
The work of this session focuses on high-level modeling and specification of digital systems. The first paper leverages SAT solving to take full advantage of the European Train Control System (ETCS) level 3. The final goal is to get efficient timetables (or schedules) for trains, check the validity of existing schedules and have a better use of train stations and tracks. This is a requirement to deal with high-density traffic in busy stations. The second paper is related to high-level modeling of energy-driven systems using hybrid automata and statistical model-checking so as to pick the right battery characteristics for executing nodes and also support the energy transfer on-the-fly from one node to another when required. The third paper analyzes on-chip networks using a hybrid model by coupling dynamic simulations with analytical formulae to compute network latency.
Time | Label | Presentation Title and Authors |
---|---|---|
16:00 CET | 7.6.1 | TOWARDS AUTOMATIC DESIGN AND VERIFICATION FOR LEVEL 3 OF THE EUROPEAN TRAIN CONTROL SYSTEM Speaker: Robert Wille, Johannes Kepler University Linz / SCCH Hagenberg, AT Authors: Robert Wille1, Tom Peham1, Judith Przigoda2 and Nils Przigoda2 1Johannes Kepler University Linz, AT; 2Siemens AG, DE Abstract For centuries, block signaling has been the fundamental principle of today's railway systems to prevent trains from running into each other. But the corresponding infrastructure of physical blocks, each requiring train detection methods, is costly. Therefore, initiatives such as the European Train Control System (ETCS) and, in particular, Level 3 of ETCS aim for the utilization of virtual sections, which allow for a much higher degree of freedom and provide significant potential for increasing the efficiency of today's train schedules. However, exploiting this potential is a highly non-trivial task which, thus far, has mainly relied on manual labor. In this work, we provide an initial automatic methodology which aids designers of corresponding railway networks and train schedules. For the first time, the methodology utilizes design automation expertise (here, in terms of satisfiability solvers) to unveil the potential of ETCS Level 3. Case studies (including a real-life example inspired by the Norwegian Railways) confirm the applicability and suitability of the proposed methodology. (A toy constraint-solving sketch in this spirit follows this table.) |
16:15 CET | 7.6.2 | MODELING AND ANALYSIS FOR ENERGY-DRIVEN COMPUTING USING STATISTICAL MODEL-CHECKING Speaker: Abdoulaye Gamatie, LIRMM - University Montpellier, CNRS, FR Authors: Abdoulaye Gamatie1, Gilles Sassatelli2 and Marius Mikučionis3 1CNRS LIRMM / University of Montpellier, FR; 2LIRMM CNRS / University of Montpellier 2, FR; 3Department of Computer Science, Aalborg University, Aalborg, Denmark, DK Abstract Energy-driven computing is a recent paradigm that promotes energy harvesting as an alternative solution to conventional power supply systems. A crucial challenge in that context lies in the dimensioning of system resources w.r.t. energy harvesting conditions while meeting some given timing QoS requirements. Existing simulation and debugging tools do not make it possible to clearly address this issue. This paper defines a generic modeling and analysis framework to support the design exploration for energy-driven computing. It uses stochastic hybrid automata and statistical model-checking. It advocates a distributed system design, where heterogeneous nodes integrate computing and harvesting components and support inter-node energy transfer. Through a simple case-study, the paper shows how this framework addresses the aforementioned design challenge in a flexible manner and helps in reducing energy storage requirements. |
16:30 CET | IP6_2.1 | BLENDER: A TRAFFIC-AWARE CONTAINER PLACEMENT FOR CONTAINERIZED DATA CENTERS Speaker: Zhaorui Wu, Jinan University, Guangzhou, China, CN Authors: Zhaorui Wu1, Yuhui Deng2, Hao Feng1, Yi Zhou3 and Geyong Min4 1Department of Computer Science, Jinan University, Guangzhou, Guangdong Province, China, CN; 2State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Department of Computer Science, Jinan University, Guangzhou, Guangdong Province, China, CN; 3Department of Computer Science, Columbus State University, Columbus, US; 4College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter EX4 4QF, United Kingdom, GB Abstract Instantiated containers of an application are distributed across multiple Physical Machines (PMs) to achieve high parallel performance. Container placement plays a vital role in network traffic and the performance of containerized data centers. Existing container placement techniques do not consider container traffic patterns, which is inadequate. To address this, we investigate network traffic between containers and observe that it exhibits a Zipf-like distribution. We propose a novel container placement approach, Blender, that leverages this Zipf-like distribution. Based on network traffic correlation, Blender employs RefineAlg and SplitAlg to divide containers of applications into blocks, and places these blocks across virtual machines. Blender exhibits two salient features: (i) it minimizes inter-block traffic by placing the containers that communicate frequently in the same block; (ii) it achieves good load balancing by combining blocks according to the resource types they require and distributing them across multiple PMs. We compare Blender against two state-of-the-art methods, SBP and CA-WFD. The experimental results show that Blender significantly reduces communication traffic. In particular, for the same number of PMs, Blender reduces the traffic of SBP and CA-WFD by 22% and 32%, respectively. Furthermore, with Blender in place, the physical resources of hosting PMs are well balanced and utilized. |
16:31 CET | IP6_2.2 | SC4MEC: AUTOMATED IMPLEMENTATION OF A SECURE HIERARCHICAL CALCULUS FOR MOBILE EDGE COMPUTING Speaker: Jiaqi Yin, East China Normal University, CN Authors: Jiaqi Yin1, Huibiao Zhu1 and Yuan Fei2 1East China Normal University, CN; 2Shanghai Normal University, CN Abstract Mobile Edge Computing (MEC), as an emerging technology, is proposed to solve the time delay problem in the 5G era, especially in the field of autonomous driving. The core idea of MEC is to offload a task to the nearest device/server for computation, i.e., sinking the computation, so as to reduce delay and congestion. There is a large body of research on MEC offloading strategies, but little on formally modeling their offloading characteristics. Therefore, in this paper, we first propose a secure hierarchical calculus, SC4MEC, to describe the features of MEC. We give the syntax and operational semantics of this calculus at the process and network levels, and simulate the calculus in Maude. Meanwhile, local ecology is applied to the communication channel to further reduce the authentication delay of devices with the same identity and to ensure the security of the transmitted data. We also propose to extend the communication radius of the MEC server or cloud server via the rule Enlarge, in order to ensure the mobile devices' connectivity while minimizing resource consumption. Finally, we apply the SC4MEC calculus to a small example of device-to-device communication with an automated implementation. |
16:32 CET | 7.6.3 | NOC PERFORMANCE MODEL FOR EFFICIENT NETWORK LATENCY ESTIMATION Speaker: Oumaima Matoussi, Université Paris-Saclay, CEA, List, FR Author: Oumaima Matoussi, CEA LIST, FR Abstract We propose a flexible, lightweight and parametric NoC model designed for fast performance estimation at early design stages. Our NoC model combines the benefits of both analytical and simulation-based NoC models. It features an abstract router model whose buffers are updated at runtime with information about the actual traffic. This traffic information is fed to a closed-form expression that computes packet latency and accounts for network contention on a per-router basis. We evaluated our hybrid NoC model in terms of estimation accuracy and simulation speed. We compared the simulation results to the ones obtained with a cycle-accurate NoC simulator called Garnet. Our NoC model achieves less than 17% error in average network latency estimation and attains up to 14× speedup for an 8×8 mesh. |
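The closed-form expression of 7.6.3 is not reproduced in the program, but the following minimal Python sketch illustrates the general shape of such hybrid estimates: a fixed per-hop router and link delay plus an M/M/1-style queueing term driven by observed channel utilization. All delays and load values here are illustrative placeholders, not the paper's calibrated parameters.

```python
def packet_latency(hops, t_router=3, t_link=1, serialization=4, utilizations=None):
    """Estimate average packet latency (in cycles) along a route.

    hops: number of routers traversed; utilizations: per-hop channel load in [0, 1).
    The queueing delay per hop is an M/M/1-style term, rho / (1 - rho) times the
    packet service time, fed by utilization observed during dynamic simulation.
    """
    utilizations = utilizations if utilizations is not None else [0.0] * hops
    latency = serialization  # head flit followed by body flits
    for rho in utilizations:
        queueing = (rho / (1.0 - rho)) * serialization if rho < 1.0 else float("inf")
        latency += t_router + t_link + queueing
    return latency

# A 4-hop route with a congested middle: contention dominates the estimate.
print(packet_latency(hops=4, utilizations=[0.2, 0.5, 0.5, 0.1]))
```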
7.7 Secure Implementations
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/h9oECLYRsHhwyp83H
Session chair:
Johanna Sepulveda, Airbus, DE
Session co-chair:
Cedric Marchand, Ecole Centrale de Lyon, FR
This session is on the protection of security implementations from side-channel attacks. All papers evaluate the proposed protection mechanisms on practical implementations.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 7.7.1 | (Best Paper Award Candidate) MAKING OBFUSCATED PUFS SECURE AGAINST POWER SIDE-CHANNEL BASED MODELING ATTACKS Speaker: Trevor Kroeger, University of Maryland Baltimore County, US Authors: Trevor Kroeger1, Wei Cheng2, Sylvain Guilley2, Jean-Luc Danger2 and Naghmeh Karimi1 1University of Maryland Baltimore County, US; 2Institut Polytechnique de Paris, FR Abstract To enhance the security of digital circuits, there is often a desire to dynamically generate, rather than statically store, random values used for identification and authentication purposes. Physically Unclonable Functions (PUFs) provide the means to realize this feature in an efficient and reliable way by utilizing commonly overlooked process variations that unintentionally occur during the manufacturing of integrated circuits (ICs) due to imperfections of the fabrication process. When given a challenge, PUFs produce a unique response. However, PUFs have been found to be vulnerable to modeling attacks, in which an adversary uses a set of collected challenge-response pairs (CRPs) to train a machine learning model that predicts the responses to unseen challenges. To combat this vulnerability, researchers have proposed techniques such as challenge obfuscation. However, as shown in this paper, this technique can be compromised by modeling the PUF's power side-channel. We first show the vulnerability of a state-of-the-art Challenge Obfuscated PUF (CO-PUF) against power analysis attacks by presenting our attack results on the targeted CO-PUF. Then we propose two countermeasures, as well as their hybrid version, that when applied to the CO-PUFs make them resilient against power side-channel based modeling attacks. We also provide insights on the design metrics to consider when implementing these mitigations. Our simulation results show the high success of our attack in compromising the original Challenge Obfuscated PUFs (success rate > 98%) as well as the significant improvement in resilience of the obfuscated PUFs against power side-channel based modeling when equipped with our countermeasures. |
16:15 CET | 7.7.2 | AUTOMATED MASKING OF SOFTWARE IMPLEMENTATIONS ON INDUSTRIAL MICROCONTROLLERS Speaker: Florian Bache, Ruhr-University Bochum, Germany, DE Authors: Arnold Abromeit1, Florian Bache2, Leon A. Becker3, Marc Gourjon4, Tim Güneysu5, Sabrina Jorn1, Amir Moradi6, Maximilian Orlt7 and Falk Schellenberg8 1TüV-IT, DE; 2Ruhr-University Bochum, DE; 3Ruhr-University Bochum, DE; 4Hamburg University of Technology, Germany & NXP Semiconductors Germany GmbH, DE; 5Ruhr-Universität Bochum & DFKI, DE; 6Ruhr University Bochum, DE; 7TU Darmstadt, DE; 8Max Planck Institute for Security and Privacy, DE Abstract Physical side-channel attacks threaten the security of exposed embedded devices, such as microcontrollers. Dedicated countermeasures, like masking, are necessary to prevent these powerful attacks. However, a gap between well-studied leakage models and the leakage observed on real devices makes the application of these countermeasures non-trivial. In this work, we provide a gadget-based concept for automated masking that covers practically relevant leakage models in order to achieve security on real-world devices. We realize this concept with a fully automated compiler that transforms unprotected microcontroller implementations of cryptographic primitives into masked executables, capable of being executed on the target device. In a case study, we apply our approach to a bitsliced LED implementation and perform a TVLA-based security evaluation of its core component: the PRESENT s-box. |
16:30 CET | IP6_3.1 | STEALTHY LOGIC MISUSE FOR POWER ANALYSIS ATTACKS IN MULTI-TENANT FPGAS Speaker: Dennis Gnad, Karlsruhe Institute of Technology (KIT), DE Authors: Dennis Gnad1, Vincent Meyers1, Nguyen Minh Dang1, Falk Schellenberg2, Amir Moradi3 and Mehdi Tahoori1 1Karlsruhe Institute of Technology, DE; 2Max Planck Institute for Security and Privacy, DE; 3Ruhr University Bochum, DE Abstract FPGAs have been used in the cloud for several years, for workloads such as machine learning, database processing and security tasks. As for other cloud services, a highly desired feature is virtualization, in which multiple tenants share a single FPGA to increase utilization and thereby efficiency. By solely using standard FPGA logic in the untrusted tenant, on-chip logic sensors have recently been proposed, allowing remote power analysis side-channel and covert-channel attacks on the victim tenant. However, such sensors are implemented by unusual circuit constructions, such as ring oscillators or delay lines, which might be easily detected by bitstream and/or netlist checking. In this paper we show that such structural checking methods are not universal solutions, as the attacks can make use of "benign-looking" circuits. We demonstrate this by showing a successful Correlation Power Analysis attack on the Advanced Encryption Standard. |
16:31 CET | IP6_3.2 | ENHANCED DETECTION RANGE FOR EM SIDE-CHANNEL ATTACK PROBES UTILIZING CO-PLANAR CAPACITIVE ASYMMETRY SENSING Speaker: Dong-Hyun Seo, Purdue University, KR Authors: Dong-Hyun Seo1, Mayukh Nath1, Debayan Das1, Santosh Ghosh2 and Shreyas Sen1 1Purdue University, US; 2Intel Corporation, US Abstract Electromagnetic (EM) side-channel analysis (SCA) attacks, which break cryptographic implementations, have become a major concern in the design of circuits and systems. This paper focuses on EM SCA and proposes the detection of an approaching EM probe even before an attack is performed. The proposed method of co-planar capacitive asymmetry sensing consists of a grid of four metal plates of the same size and dimension. As an EM probe approaches the sensing metal plates, the symmetry of the sensing metal plate system breaks, and the capacitance between each pair diverges from its baseline value. Using Ansys Maxwell Finite Element Method (FEM) simulations, we demonstrate that co-planar capacitive asymmetry sensing has an enhanced detection range compared to other sensing methods. At a distance of 1 mm between the sensing metal plates and the approaching EM probe, it shows a >17% change in capacitance, leading to a >10x improvement in detection range over existing inductive sensing methods. At a distance of 0.1 mm, a >45% change in capacitance is observed, leading to a >3x and >11x sensitivity improvement over capacitive parallel sensing and inductive sensing, respectively. Finally, we show that co-planar capacitive asymmetry sensing is sensitive to both E-field and H-field probes, unlike inductive sensing, which cannot detect an E-field probe. |
16:32 CET | 7.7.3 | A HARDWARE ACCELERATOR FOR POLYNOMIAL MULTIPLICATION OPERATION OF CRYSTALS-KYBER PQC SCHEME Speaker: Ferhat Yaman, Sabanci University, TR Authors: Ferhat Yaman, Ahmet Can Mert, Erdinc Ozturk and Erkay Savas, Sabanci University, TR Abstract Polynomial multiplication is one of the most time-consuming operations utilized in lattice-based post-quantum cryptography (PQC) schemes. CRYSTALS-KYBER is a lattice-based key encapsulation mechanism (KEM) and it was recently announced as one of the four finalists at round three in NIST's PQC Standardization. Therefore, efficient implementations of polynomial multiplication operation are crucial for high-performance CRYSTALS-KYBER applications. In this paper, we propose three different hardware architectures (lightweight, balanced, high-performance) that implement the NTT, Inverse NTT (INTT) and polynomial multiplication operations for the CRYSTALS-KYBER scheme. The proposed architectures include a unified butterfly structure for optimizing polynomial multiplication and can be utilized for accelerating the key generation, encryption and decryption operations of CRYSTALS-KYBER. Our high-performance hardware with 16 butterfly units shows up to 112x, 132x and 109x improved performance for NTT, INTT and polynomial multiplication, respectively, compared to the high-speed software implementations on Cortex-M4. |
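The polynomial arithmetic accelerated in 7.7.3 can be pinned down precisely. The sketch below gives the reference semantics of CRYSTALS-KYBER polynomial multiplication in Z_q[x]/(x^n + 1), with the scheme's published parameters q = 3329 and n = 256. The NTT computes this product in O(n log n); the schoolbook O(n^2) version here is only a correctness oracle for checking a fast implementation, not the accelerated datapath of the paper.

```python
Q, N = 3329, 256  # CRYSTALS-KYBER modulus and polynomial degree

def polymul_negacyclic(a, b):
    """Schoolbook multiplication in Z_q[x] / (x^N + 1).

    Reducing modulo x^N + 1 means x^N = -1, so coefficients that wrap
    around re-enter with a sign flip (a negacyclic convolution).
    """
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:
                c[k - N] = (c[k - N] - a[i] * b[j]) % Q
    return c

# Sanity check: x * x^(N-1) = x^N = -1 mod (x^N + 1)
a = [0] * N; a[1] = 1
b = [0] * N; b[N - 1] = 1
assert polymul_negacyclic(a, b)[0] == Q - 1
```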
7.8 ICT Innovation Funding for Manufacturing SMEs
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/m9jFGZh36c2PvNPrD
Session Chairs:
Maria Roca, Fundingbox, PL
Irina Frigioiu, Fundingbox, PL
Organizer:
Jürgen Haase, edacentrum GmbH, DE
The I4MS initiative supports EU-funded projects that develop digital-transformation solutions of potential interest to any company, in any region of Europe, especially those lagging behind.
I4MS offers platforms and services for the digital transformation of manufacturing SMEs. Different EU projects and Digital Innovation Hubs (DIHs) offer solutions for companies to improve their production processes, products or business models with digital technologies. Although these marketplaces have a high potential to offer useful services, DIHs should strengthen their role in raising awareness of the available technologies and the benefits of digital transformation. DIHs can help SMEs select the appropriate tools or services for their digital transformation needs.
This session will present the technologies and funding opportunities offered to manufacturing SMEs. A total of €35 million will be distributed over the next 2.5 years. Open calls will start in January 2021, and manufacturing SMEs will have the opportunity to receive funding and technological support from the following projects:
- AI REGIO; Sergio Gusmeroli, Politecnico di Milano
- Better Factory; Ali Muhammad, VTT
- Change2Twin; Tor Dokken, SINTEF
- DIGITbrain; Antonio M. Ortiz, PNO Consultants
- DIH-World; David Vidal, CARSA
- KITT4SME; Andrea Bettoni, SUPSI
- PULSATE; Pablo Romero, AIMEN
- VOJEXT; Xenia Beltran, Polytechnic University of Madrid
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 7.8.1 | PRESENTATION OF I4MS PHASE 4 INITIATIVE Speaker: Maria Roca, Fundingbox, PL |
16:03 CET | 7.8.2 | BETTER FACTORY PROJECT PRESENTATION & OPEN CALL OPPORTUNITIES Speaker: Anca Marin, FundingBox, PL |
16:10 CET | 7.8.3 | CHANGE2TWIN PROJECT PRESENTATION & OPEN CALL OPPORTUNITIES Speaker: Tor Dokken, SINTEF, NO |
16:17 CET | 7.8.4 | DIGITBRAIN PROJECT PRESENTATION & OPEN CALL OPPORTUNITIES Speakers: Andreas Ocklenburg and Andrea Hanninger, cloudSME, DE |
16:24 CET | 7.8.5 | KITT4SME PROJECT PRESENTATION & OPEN CALL OPPORTUNITIES Speaker: Davorka Moslac, Innovation Centre Nikola Tesla, HR |
16:31 CET | 7.8.6 | VOJEXT PROJECT PRESENTATION & OPEN CALL OPPORTUNITIES Speaker: Maria Roca, Fundingbox, PL |
16:38 CET | 7.8.7 | I4MS PHASE 3 SUCCESS STORY Speakers: Andreas Ocklenburg and Andrea Hanninger, cloudSME, DE |
16:45 CET | 7.8.8 | CLOSING Speaker: Maria Roca, Fundingbox, PL |
IP6_1 Interactive Presentations
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/mKJkCkqi97oh45b6w
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP6_1.1 | IMPLEMENTATION OF A MEMS RESONATOR-BASED DIGITAL TO FREQUENCY CONVERTER USING ARTIFICIAL NEURAL NETWORKS Speaker: Xuecui Zou, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia, SA Authors: Xuecui Zou1, Sally Ahmed2 and Hossein Fariborzi2 1KAUST, Saudi Arabia, SA; 2King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia, SA Abstract This paper proposes a novel approach for micro-electromechanical resonator-based digital to frequency converter (DFC) design using artificial neural networks (ANNs). The DFC is a key building block for multiple digital and interface units. We present the design of a 4-bit DFC device which consists of an in-plane clamped-clamped micro-beam resonator and 6 partial electrodes. The digital inputs, which are DC signals applied to the corner partial electrodes, modulate the beam resonance frequency using the electrostatic softening effect. The main challenge in the design is to find the air gap size between each input electrode and the beam to achieve the desired relationship between the digital input combinations and the corresponding resonance frequencies for a given application. We use a shallow, fully-connected feedforward neural network model to estimate the air gaps corresponding to the desired resonance frequency distribution, with less than 1% error. Two special cases are discussed for two applications: equal air gaps for implementing a full adder (FA), and weight-adjusted air gaps for implementing a 4-bit digital to analog converter (DAC). The training, validation, and testing datasets are extracted from finite-element-method (FEM) simulations, by obtaining resonance frequencies for the 16 input combinations for different air gap sets. The proposed ANN-based method paves the way for a new design paradigm for MEMS resonator-based logic and opens new routes for designing more complex digital and interface circuits. |
IP6_1.2 | COMPILATION FLOW FOR CLASSICALLY DEFINED QUANTUM OPERATIONS Speaker: Bruno Schmitt, EPFL, CH Authors: Bruno Schmitt1, Ali Javadi-Abhari2 and Giovanni De Micheli3 1EPFL, CH; 2IBM Research, US; 3EPFL, CH Abstract We present a flow for synthesizing quantum operations that are defined by classical combinational functions. The discussion will focus on out-of-place computation, i.e., $U_f : |x\rangle|y\rangle|0\rangle^k \mapsto |x\rangle|y \oplus f(x)\rangle|0\rangle^k$. Our flow allows users to express this function at a high level of abstraction. At its core, there is an improved version of the current state-of-the-art algorithm for synthesizing oracles [oracle19]. As a result, our synthesized circuits use up to 25% fewer qubits and up to 43% fewer Clifford gates. Crucially, these improvements are possible without increasing the number of $T$ gates or the execution time. |
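The out-of-place oracle semantics in IP6_1.2 can be made concrete in a few lines of numpy: on computational basis states, $U_f$ is simply a permutation. The function below is an illustrative functional model (the $|0\rangle^k$ ancillae are omitted), not the paper's synthesis flow; the name oracle_unitary and the example f are hypothetical.

```python
import numpy as np

def oracle_unitary(f, n_in, n_out):
    """Build U_f |x>|y> = |x>|y XOR f(x)> as a permutation matrix.

    f maps an n_in-bit integer to an n_out-bit integer. Each basis state
    |x, y> is sent to exactly one basis state, so U_f is a 0/1 permutation.
    """
    dim = 2 ** (n_in + n_out)
    u = np.zeros((dim, dim))
    for x in range(2 ** n_in):
        for y in range(2 ** n_out):
            src = (x << n_out) | y
            dst = (x << n_out) | (y ^ f(x))
            u[dst, src] = 1.0
    return u

u = oracle_unitary(lambda x: x & 1, n_in=2, n_out=1)   # f(x) = LSB of x
assert np.allclose(u @ u.conj().T, np.eye(8))          # U_f is unitary (here, self-inverse)
```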
IP6_2 Interactive Presentations
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/nQBAcKKzky2E7YusC
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP6_2.1 | BLENDER: A TRAFFIC-AWARE CONTAINER PLACEMENT FOR CONTAINERIZED DATA CENTERS Speaker: Zhaorui Wu, Jinan University, Guangzhou, China, CN Authors: Zhaorui Wu1, Yuhui Deng2, Hao Feng1, Yi Zhou3 and Geyong Min4 1Department of Computer Science, Jinan University, Guangzhou, Guangdong Province, China, CN; 2State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Department of Computer Science, Jinan University, Guangzhou, Guangdong Province, China, CN; 3Department of Computer Science, Columbus State University, Columbus, US; 4College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter EX4 4QF, United Kingdom, GB Abstract Instantiated containers of an application are distributed across multiple Physical Machines (PMs) to achieve high parallel performance. Container placement plays a vital role in network traffic and the performance of containerized data centers. Existing container placement techniques do not consider container traffic patterns, which is inadequate. To address this, we investigate network traffic between containers and observe that it exhibits a Zipf-like distribution. We propose a novel container placement approach, Blender, that leverages this Zipf-like distribution. Based on network traffic correlation, Blender employs RefineAlg and SplitAlg to divide containers of applications into blocks, and places these blocks across virtual machines. Blender exhibits two salient features: (i) it minimizes inter-block traffic by placing the containers that communicate frequently in the same block; (ii) it achieves good load balancing by combining blocks according to the resource types they require and distributing them across multiple PMs. We compare Blender against two state-of-the-art methods, SBP and CA-WFD. The experimental results show that Blender significantly reduces communication traffic. In particular, for the same number of PMs, Blender reduces the traffic of SBP and CA-WFD by 22% and 32%, respectively. Furthermore, with Blender in place, the physical resources of hosting PMs are well balanced and utilized. |
IP6_2.2 | SC4MEC: AUTOMATED IMPLEMENTATION OF A SECURE HIERARCHICAL CALCULUS FOR MOBILE EDGE COMPUTING Speaker: Jiaqi Yin, East China Normal University, CN Authors: Jiaqi Yin1, Huibiao Zhu1 and Yuan Fei2 1East China Normal University, CN; 2Shanghai Normal University, CN Abstract Mobile Edge Computing (MEC), as an emerging technology, is proposed to solve the time delay problem in the 5G era, especially in the field of autonomous driving. The core idea of MEC is to offload a task to the nearest device/server for computation, i.e., sinking the computation, so as to reduce delay and congestion. There is a large body of research on MEC offloading strategies, but little on formally modeling their offloading characteristics. Therefore, in this paper, we first propose a secure hierarchical calculus, SC4MEC, to describe the features of MEC. We give the syntax and operational semantics of this calculus at the process and network levels, and simulate the calculus in Maude. Meanwhile, local ecology is applied to the communication channel to further reduce the authentication delay of devices with the same identity and to ensure the security of the transmitted data. We also propose to extend the communication radius of the MEC server or cloud server via the rule Enlarge, in order to ensure the mobile devices' connectivity while minimizing resource consumption. Finally, we apply the SC4MEC calculus to a small example of device-to-device communication with an automated implementation. |
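To illustrate why the Zipf-like traffic observation in IP6_2.1 above makes placement tractable, here is a toy greedy grouping that co-locates the heaviest-communicating container pairs. Blender's actual RefineAlg and SplitAlg are more sophisticated; the traffic matrix, container count, and PM capacity below are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, capacity = 8, 4                              # containers, containers per PM (illustrative)
traffic = np.triu(rng.zipf(2.0, (n, n)), k=1)   # Zipf-skewed pairwise traffic, i < j

# Greedy: repeatedly merge the heaviest-talking pair of groups that still fits one PM.
groups = [{i} for i in range(n)]
pairs = sorted(((traffic[i, j], i, j) for i in range(n) for j in range(i + 1, n)),
               reverse=True)
for _, i, j in pairs:
    gi = next(g for g in groups if i in g)
    gj = next(g for g in groups if j in g)
    if gi is not gj and len(gi) + len(gj) <= capacity:
        groups.remove(gj)
        gi.update(gj)

# Traffic crossing PM boundaries: the Zipf skew means the few heavy pairs,
# once co-located, remove most of the inter-PM volume.
cross = sum(traffic[i, j] for i in range(n) for j in range(i + 1, n)
            if not any(i in g and j in g for g in groups))
print("placement:", groups, "inter-PM traffic:", cross)
```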
IP6_3 Interactive Presentations
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/EWp3hPpDCZ27XPEvv
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP6_3.1 | STEALTHY LOGIC MISUSE FOR POWER ANALYSIS ATTACKS IN MULTI-TENANT FPGAS Speaker: Dennis Gnad, Karlsruhe Institute of Technology (KIT), DE Authors: Dennis Gnad1, Vincent Meyers1, Nguyen Minh Dang1, Falk Schellenberg2, Amir Moradi3 and Mehdi Tahoori1 1Karlsruhe Institute of Technology, DE; 2Max Planck Institute for Security and Privacy, DE; 3Ruhr University Bochum, DE Abstract FPGAs have been used in the cloud for several years, for workloads such as machine learning, database processing and security tasks. As for other cloud services, a highly desired feature is virtualization, in which multiple tenants share a single FPGA to increase utilization and thereby efficiency. By solely using standard FPGA logic in the untrusted tenant, on-chip logic sensors have recently been proposed, allowing remote power analysis side-channel and covert-channel attacks on the victim tenant. However, such sensors are implemented by unusual circuit constructions, such as ring oscillators or delay lines, which might be easily detected by bitstream and/or netlist checking. In this paper we show that such structural checking methods are not universal solutions, as the attacks can make use of "benign-looking" circuits. We demonstrate this by showing a successful Correlation Power Analysis attack on the Advanced Encryption Standard. |
IP6_3.2 | ENHANCED DETECTION RANGE FOR EM SIDE-CHANNEL ATTACK PROBES UTILIZING CO-PLANAR CAPACITIVE ASYMMETRY SENSING Speaker: Dong-Hyun Seo, Purdue University, KR Authors: Dong-Hyun Seo1, Mayukh Nath1, Debayan Das1, Santosh Ghosh2 and Shreyas Sen1 1Purdue University, US; 2Intel Corporation, US Abstract Electromagnetic (EM) side-channel analysis (SCA) attacks, which break cryptographic implementations, have become a major concern in the design of circuits and systems. This paper focuses on EM SCA and proposes the detection of an approaching EM probe even before an attack is performed. The proposed method of co-planar capacitive asymmetry sensing consists of a grid of four metal plates of the same size and dimension. As an EM probe approaches the sensing metal plates, the symmetry of the sensing metal plate system breaks, and the capacitance between each pair diverges from its baseline value. Using Ansys Maxwell Finite Element Method (FEM) simulations, we demonstrate that co-planar capacitive asymmetry sensing has an enhanced detection range compared to other sensing methods. At a distance of 1 mm between the sensing metal plates and the approaching EM probe, it shows a >17% change in capacitance, leading to a >10x improvement in detection range over existing inductive sensing methods. At a distance of 0.1 mm, a >45% change in capacitance is observed, leading to a >3x and >11x sensitivity improvement over capacitive parallel sensing and inductive sensing, respectively. Finally, we show that co-planar capacitive asymmetry sensing is sensitive to both E-field and H-field probes, unlike inductive sensing, which cannot detect an E-field probe. |
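Both papers in this slot revolve around power side channels, and the canonical attack they enable or defend against is Correlation Power Analysis (CPA), named explicitly in IP6_3.1. The sketch below runs a toy CPA on synthetic traces of the 4-bit PRESENT S-box (the same component evaluated in 7.7.2): the key guess whose predicted Hamming-weight leakage correlates best with the traces is recovered. The leakage model, trace count, and noise level are illustrative assumptions, not data from either paper.

```python
import numpy as np

SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]    # PRESENT S-box
hw = [bin(v).count("1") for v in range(16)]         # Hamming weights of nibbles

rng = np.random.default_rng(1)
key = 0xB
pt = rng.integers(0, 16, 2000)                      # random 4-bit plaintexts
# Synthetic traces: leakage = HW of the S-box output, plus Gaussian noise.
traces = np.array([hw[SBOX[p ^ key]] for p in pt]) + rng.normal(0, 1.0, 2000)

# CPA: correlate each key hypothesis' predicted leakage with the measured traces.
scores = [abs(np.corrcoef([hw[SBOX[p ^ k]] for p in pt], traces)[0, 1])
          for k in range(16)]
print("recovered key:", int(np.argmax(scores)), "true key:", key)
```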
IP6_4 Interactive Presentations
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/GQ8HuYanxq2iZc7fm
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP6_4.1 | A CASE FOR EMERGING MEMORIES IN DNN ACCELERATORS Speaker: Avilash Mukherjee, University of British Columbia, CA Authors: Avilash Mukherjee1, Kumar Saurav2, Prashant Nair1, Sudip Shekhar1 and Mieszko Lis1 1University of British Columbia, CA; 2QUALCOMM INDIA, IN Abstract The popularity of Deep Neural Networks (DNNs) has led to many DNN accelerator architectures, which typically focus on the on-chip storage and computation costs. However, much of the energy is spent on accesses to off-chip DRAM memory. While emerging resistive memory technologies such as MRAM, PCM, and RRAM can potentially reduce this energy component, they suffer from drawbacks such as low endurance that prevent them from being a DRAM replacement in DNN applications. In this paper, we examine how DNN accelerators can be designed to overcome these limitations and how emerging memories can be used for off-chip storage. We demonstrate that through (a) careful mapping of DNN computation to the accelerator and (b) a hybrid setup (both DRAM and an emerging memory), we can reduce inference energy over a DRAM-only design by a factor ranging from 1.12x on EfficientNetB7 to 6.3x on ResNet-50, while also increasing the endurance from 2 weeks to over a decade. As the energy benefits vary dramatically across DNN models, we also develop a simple analytical heuristic solely based on DNN model parameters that predicts the suitability of a given DNN for emerging-memory-based accelerators. |
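The kind of back-of-the-envelope reasoning behind IP6_4.1's analytical heuristic can be sketched in a few lines: route read-only weight traffic to the lower-energy emerging memory and keep frequently rewritten activations in DRAM. All per-byte energies and model footprints below are made-up placeholders for illustration, not measurements or parameters from the paper.

```python
# Hedged energy arithmetic for off-chip DNN traffic (all numbers illustrative).
E_DRAM_PJ_PER_BYTE = 40.0   # assumed DRAM access energy
E_NVM_PJ_PER_BYTE = 15.0    # assumed emerging-NVM read energy

def inference_energy_uj(weight_mb, activation_mb, hybrid=True):
    """Off-chip energy per inference: weights from NVM (if hybrid), activations from DRAM."""
    w_bytes, a_bytes = weight_mb * 2**20, activation_mb * 2**20
    e_w = E_NVM_PJ_PER_BYTE if hybrid else E_DRAM_PJ_PER_BYTE
    return (w_bytes * e_w + a_bytes * E_DRAM_PJ_PER_BYTE) * 1e-6

# Rough footprints (placeholders): weight-heavy models benefit most from the hybrid.
for model, w, a in [("ResNet-50-like", 25.0, 8.0), ("EfficientNet-like", 66.0, 90.0)]:
    dram = inference_energy_uj(w, a, hybrid=False)
    hyb = inference_energy_uj(w, a, hybrid=True)
    print(f"{model}: {dram:.0f} uJ DRAM-only vs {hyb:.0f} uJ hybrid ({dram / hyb:.2f}x)")
```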
UB.14 University Booth
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/C5nNmBtLZGBpBzR96
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.14 | BROOK SC: HIGH-LEVEL CERTIFICATION-FRIENDLY PROGRAMMING FOR GPU-POWERED SAFETY CRITICAL SYSTEMS Speakers: Leonidas Kosmidis and Marc Benito, Barcelona Supercomputing Center (BSC) and Universitat Politècnica de Catalunya (UPC), ES Authors: Leonidas Kosmidis and Marc Benito, Barcelona Supercomputing Center (BSC) and Universitat Politècnica de Catalunya (UPC), ES Abstract GPUs can provide the increased performance required in future critical systems. However, their programming models, e.g. CUDA or OpenCL, cannot be used in such systems as they violate safety critical programming guidelines. Brook SC (https://github.com/lkosmid/brook) was developed at UPC/BSC to allow safety-critical applications to be programmed in a CUDA-like GPU language, Brook, which enables certification while increasing productivity. In our demo, an avionics application running on a realistic safety critical GPU software stack and hardware is showcased. In this Bachelor's thesis project, which was awarded a 2019 HiPEAC Technology Transfer Award and a bronze medal at the ACM SRC at ICCAD 2020, an Airbus prototype application performing general-purpose computations with a safety-critical graphics API was ported to Brook SC in record time, achieving an order of magnitude reduction in the lines of code needed to implement the same functionality without performance penalty. |
UB.15 University Booth
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/rGFBhpR7qbDHkSqB4
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.15 | MAEVE: 3D HUMAN MOTION ANALYSIS EVALUATION FROM VIDEO AT THE EDGE Speakers: Michele Boldo and Enrico Martini, University of Verona, IT Authors: Michele Boldo and Enrico Martini, University of Verona, IT Abstract In recent years, human pose estimation has become a trending topic in human motion analysis. Its fields of application are continuously growing, including diagnostic use for functional rehabilitation. Until now, IMUs and marker-based systems have been used to estimate the pose, with the drawback of being invasive and expensive. This project presents the implementation of a ROS2-based approach that allows a single RGB-D camera, attached to a low-power device with a GPU, to estimate the 3D human pose through a CNN-based inference application, without using any external equipment in addition to the device. This allows for a flexible, modular architecture that is independent of the CNN model used to obtain the pose. Compared to state-of-the-art solutions that rely on 2D pose estimation, require high-performance computers, or provide only 2D support at the edge, our approach allows for 3D pose estimation on a low-power and low-cost embedded device while meeting real-time constraints. |
UB.16 University Booth
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/JzKWLuRQS4nmxxiqN
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.16 | MICRORV32: A SPINALHDL BASED RISC-V IMPLEMENTATION FOR FPGAS Speaker: Sallar Ahmadi-Pour, University of Bremen, DE Authors: Sallar Ahmadi-Pour1, Vladimir Herdt2 and Rolf Drechsler2 1University of Bremen, DE; 2University of Bremen / DFKI, DE Abstract We propose a demonstration of a lightweight RISC-V implementation called MicroRV32 that is suitable for FPGAs. The entire design flow is based on open source tools. The core itself is implemented in the modern Scala-based SpinalHDL hardware description language. For the FPGA flow, the IceStorm suite is utilized. On the iCE40 HX8K FPGA the design requires about 50% of the resources and can be run at a maximum clock frequency of 34.02 MHz. Besides the core, the design also includes basic peripherals and software examples. MicroRV32 is particularly suitable as a lightweight implementation for research and education. The complete design flow can be executed on a Linux system by means of open source tools, which makes the platform very accessible. |
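To give a flavor of the RV32I base ISA that MicroRV32 implements, the sketch below decodes the fixed R-type field layout of a 32-bit instruction word; the core performs the equivalent decoding in SpinalHDL-generated hardware. The field positions follow the published RISC-V base encoding; the function name and example are illustrative, not part of the MicroRV32 codebase.

```python
def decode_rv32(insn):
    """Extract the base RV32I R-type fields from a 32-bit instruction word."""
    return {
        "opcode": insn & 0x7F,          # bits 6:0
        "rd":     (insn >> 7) & 0x1F,   # bits 11:7
        "funct3": (insn >> 12) & 0x7,   # bits 14:12
        "rs1":    (insn >> 15) & 0x1F,  # bits 19:15
        "rs2":    (insn >> 20) & 0x1F,  # bits 24:20
        "funct7": (insn >> 25) & 0x7F,  # bits 31:25
    }

# add x3, x1, x2 -> funct7=0, rs2=2, rs1=1, funct3=0, rd=3, opcode=0b0110011
print(decode_rv32(0x002081B3))
```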
8.2 Machine Learning Meets Logic Synthesis
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:30 CET - 18:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/gwzWF6c9ByQXKjD6H
Session chair:
Akash Kumar, TU Dresden, DE
Session co-chair:
Walter Lau Neto, University of Utah, US
Organizer:
Shubham Rai, TU Dresden, DE
This Special Session explores the intersection of machine learning (ML) and logic synthesis (LS). Both ML and LS perform data processing. In particular, LS transforms functionality into structure, which, at its simplest, means turning truth tables into logic circuits. ML, on the other hand, looks for the information implicit in large data sets, for example, detecting familiar objects in images or videos. The two types of data processing can be leveraged for a mutual benefit. Thus, ML can be used in LS to analyze the performance of LS algorithms, resulting in efficient optimization scripts. It is also possible to use LS in ML. In this case, logic circuits generated by LS are used as ML models in some application domain. It is this second, lesser-known connection, solving ML problems using LS methods, that is explored in depth in this Special Session.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 8.2.1 | LOGIC SYNTHESIS MEETS MACHINE LEARNING: TRADING EXACTNESS FOR GENERALIZATION Speaker: Satrajit Chatterjee, Google AI, US Authors: Shubham Rai1, Walter Lau Neto2, Yukio Miyasaka3, Xinpei Zhang4, Mingfei Yu5, Qinyang Yi6, Masahiro Fujita5, Guilherme Barbosa Manske7, Matheus Pontes8, Leomar Rosa Jr9, Marilton S. de Aguiar10, Paulo Butzen11, Po-Chun Chien12, Yu-Shan Huang12, Hao-Ren Wang12, Jie-Hong Roland Jiang12, Jiaqi Gu13, Zheng Zhao14, Zixuan Jiang14, David Z. Pan14, Brunno Abreu15, Isac Campos16, Augusto Berndt17, Cristina Meinhardt18, Jonata Tyska Carvalho19, Mateus Grellert20, Sergio Bampi21, Aditya Lohana1, Akash Kumar1, Wei Zeng22, Azadeh Davoodi22, Rasit Onur Topaloglu23, Yuan Zhou24, Jordan Dotzel24, Yichi Zhang24, Hanyu Wang24, Zhiru Zhang24, Valerio Tenace2, Pierre-Emmanuel Gaillardon2, Alan Mishchenko25 and Satrajit Chatterjee26 1TU Dresden, DE; 2University of Utah, US; 3University of California, Berkeley, US; 4The University of Tokyo, JP; 5University of Tokyo, JP; 6University of Tokyo, JP; 7Universidade Federal de Pelotas, BR; 8UFPEL, BR; 9UFPel, BR; 10Universidade Federal de Pelotas, BR; 11Universidade Federal do Rio Grande do Sul, BR; 12National Taiwan University, TW; 13University of Texas at Austin, US; 14University of Texas at Austin, US; 15UFRGS, BR; 16Universidade Federal de Santa Catarina, BR; 17Universidade Federal do Rio Grande do Sul, BR; 18UFSC, BR; 19Universidade Federal de Santa Catarina, BR; 20Federal University of Santa Catarina, BR; 21UFRGS - Federal University of Rio Grande do Sul, BR; 22University of Wisconsin - Madison, US; 23IBM, US; 24Cornell University, US; 25UC Berkeley, US; 26Google AI, US Abstract Logic synthesis is a fundamental step in hardware design whose goal is to find structural representations of Boolean functions while minimizing delay and area. If the function is completely-specified, the implementation accurately represents the function. If the function is incompletely-specified, the implementation has to be true only on the care set. While most of the algorithms in logic synthesis rely on SAT and Boolean methods to exactly implement the care set, we investigate learning in logic synthesis, attempting to trade exactness for generalization. This work is directly related to machine learning where the care set is the training set and the implementation is expected to generalize on a validation set. We present learning incompletely-specified functions based on the results of a competition conducted at IWLS 2020. The goal of the competition was to implement 100 functions given by a set of care minterms for training, while testing the implementation using a set of validation minterms sampled from the same function. We make this benchmark suite available and offer a detailed comparative analysis of the different approaches to learning. |
17:45 CET | 8.2.2 | LOGIC SYNTHESIS FOR GENERALIZATION AND LEARNING ADDITION Speaker: Yukio Miyasaka, University of California, Berkeley, US Authors: Yukio Miyasaka1, Xinpei Zhang2, Mingfei Yu3, Qingyang Yi3 and Masahiro Fujita3 1University of California, Berkeley, US; 2The University of Tokyo, JP; 3University of Tokyo, JP Abstract Logic synthesis generates a logic circuit of a given Boolean function, where the size and depth of the circuit are optimized for small area and low delay. On the other hand, machine learning has been extensively studied and used for many applications these days. Its general approach of training a model from a set of input-output samples is similar to logic synthesis with external don't-cares, except that in the case of machine learning the goal is to come up with a general understanding from the given samples. Seeing this resemblance from another perspective, we can think of logic synthesis targeting a generalization of the care set. In this paper, we explore logic synthesis that generates a logic circuit generalizing the given incomplete relation between inputs and outputs. We compared popular logic synthesis methods and machine learning models and analyzed their characteristics. We found that there were some arithmetic functions that these conventional models cannot effectively learn. Among these, we further experimented with addition operations using tree models and found that a heuristic BDD minimization method achieves the highest accuracy. |
18:00 CET | 8.2.3 | ESPRESSO-GPU: BLAZINGLY FAST TWO-LEVEL LOGIC MINIMIZATION Speaker: Massoud Pedram, University of Southern California, US Authors: Hitarth Kanakia1, Mahdi Nazemi2, Arash Fayyazi3 and Massoud Pedram2 1University of Southern California, US; 2USC, US; 3University of Southern California, US Abstract Two-level logic minimization has found applications in new problems such as the efficient realization of deep neural network inference. Important characteristics of these new applications are that they tend to produce very large Boolean functions (in terms of the supporting variables and/or the initial sum-of-products representation) and have don't-care sets that are much larger than the on-set and off-set. Applying conventional single-threaded logic minimization heuristics to these problems becomes unwieldy. This work introduces ESPRESSO-GPU, a parallel version of ESPRESSO-II, which takes advantage of the computing capabilities of general-purpose graphics processors to achieve a huge speedup compared to existing serial implementations. Simulation results show that ESPRESSO-GPU achieves an average speedup of 97x compared to ESPRESSO-II. |
18:15 CET | 8.2.4 | LOGICNETS: CO-DESIGNED NEURAL NETWORKS AND CIRCUITS FOR EXTREME-THROUGHPUT APPLICATIONS Speaker: Nicholas J. Fraser, Xilinx Research, IE Authors: Nicholas J. Fraser1, Yaman Umuroglu1, Yash Akhauri2 and Michaela Blott1 1Xilinx Research, IE; 2Intel Labs, IN Abstract Machine learning algorithms have been gradually displacing traditional programming techniques across multiple domains, including domains that require extremely high-throughput data rates, such as telecommunications and network packet filtering. Although high accuracy has been demonstrated, very few works have shown how to run these algorithms under such high-throughput constraints. To address this, we propose LogicNets, a co-design method to construct a neural network and its inference engine at the same time. We create a set of layer primitives called neuron equivalent circuits (NEQs) which map neural network layers directly to the hardware building blocks (HBBs) available on an FPGA. From this, we can design and execute networks with low activation bitwidth and high sparsity at extremely high data rates and low latency, while only using a small amount of FPGA resources. |
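The core trick of 8.2.4 is that a neuron with a small fan-in and 1-bit activations is fully described by its truth table, so it maps one-to-one onto an FPGA LUT. The sketch below enumerates one such thresholded binary neuron; the weights and threshold are hypothetical examples, and LogicNets' real layers use quantized multi-bit activations and enforced sparsity to keep the fan-in (and hence LUT size) small.

```python
from itertools import product

def neuron_to_lut(weights, threshold):
    """Enumerate a binary neuron y = [sum(w_i * x_i) >= threshold] into a LUT.

    With small fan-in and a 1-bit activation, the neuron is fully described
    by its truth table and maps directly onto one FPGA LUT.
    """
    lut = {}
    for xs in product([0, 1], repeat=len(weights)):
        lut[xs] = int(sum(w * x for w, x in zip(weights, xs)) >= threshold)
    return lut

lut = neuron_to_lut(weights=[2, -1, 1, 1], threshold=2)
print(lut[(1, 0, 0, 0)], lut[(0, 1, 1, 1)])  # prints: 1 0
```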
8.3 The New Frontiers in Scalable Quantum Compilation
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:30 CET - 18:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/YugoDNL7rs9ekNnWd
Session chair:
Robert Wille, Johannes Kepler University Linz, AT
Session co-chair:
Kaitlin Smith, University of Chicago, US
Organizers:
Heinz Riener, EPFL, Lausanne, CH
Kaitlin Smith, University of Chicago, US
Quantum computing promises to solve some computationally intensive problems more efficiently than classical algorithms on conventional computers. Integer factorization, database search, and simulation of chemical processes on atomic level are only a few examples of proposed quantum applications. To accompany steady progress in quantum technology, scalable optimizing compilers will soon be necessary to map large-scale quantum programs developed in high-level languages onto quantum hardware.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 8.3.1 | FROM BOOLEAN FUNCTIONS TO QUANTUM CIRCUITS: A SCALABLE QUANTUM COMPILATION FLOW IN C++ Speaker: Bruno Schmitt, EPFL, CH Authors: Bruno Schmitt1, Fereshte Mozafari1, Giulia Meuli1, Heinz Riener1 and Giovanni De Micheli2 1EPFL, CH; 2École Polytechnique Fédérale de Lausanne (EPFL), CH Abstract We propose a flow for automated quantum compilation. Our flow takes a Boolean function implemented in Python as input and translates it into a format appropriate for reversible logic synthesis. We focus on two quantum compilation tasks: uniform state preparation and oracle synthesis. To illustrate the use of our flow, we solve IBM's virtual hackathon challenge of 2019, called the Zed city problem, an instance of vertex coloring, by using quantum search algorithms. The expressiveness of Python in combination with automated compilation algorithms allows us to express quantum algorithms at a high level of abstraction, which reduces the effort to implement them, and leads to better and more flexible implementations. We show that our proposed flow generates a lower-cost circuit implementation of the oracle needed to solve IBM's challenge when compared to the winning submission. |
17:45 CET | 8.3.2 | A RESOURCE ESTIMATION AND VERIFICATION WORKFLOW IN Q# Speaker: Mathias Soeken, Microsoft, CH Authors: Mathias Soeken, Mariia Mykhailova, Vadym Kliuchnikov, Christopher Granade and Alexander Vaschillo, Microsoft Quantum, US Abstract An important branch in quantum computing involves accurate resource estimation to assess the cost of running a quantum algorithm on future quantum hardware. A comprehensive and self-contained workflow with the quantum program in its center allows programmers to build comprehensible and reproducible resource estimation projects. We show how to systematically create such workflows using the quantum programming language Q#. Our approach uses simulators for verification, debugging, and resource estimation, as well as rewrite steps for optimization. |
18:00 CET | 8.3.3 | HIQ-PROJECTQ: TOWARDS USER-FRIENDLY AND HIGH-PERFORMANCE QUANTUM COMPUTING ON GPUS Speaker: Damien Nguyen, Huawei Research, CH Authors: Damien Nguyen1, Dmitry Mikushin1 and Yung Man-Hong2 1Zurich Research Center, Data Center Technology Laboratory, 2012 Laboratories, Huawei, CH; 2Central Research Institute, Data Center Technology Laboratory, 2012 Laboratories, Huawei, CN Abstract In this work, we present some of the latest efforts made at Huawei Research to improve the overall performance of ProjectQ, the quantum computing framework used as the foundation of our quantum research. For some years, performance assessments of the framework using profiling tools have shown that a significant portion of the compilation time is spent on memory management linked to the lifetime and access of Python objects. The main purpose of this work is therefore to address some of these limitations by introducing a new C++ processing backend for ProjectQ, as well as starting a complete rewrite of the simulator code already written in C++. The core of this work is centered on a new C++ API for ProjectQ that moves most of the compiler processing into natively compiled code. We achieve this by providing a new compiler engine to perform the conversion from Python to C++, which ensures that we retain maximum compatibility with existing user code while providing significant speed-ups. We also introduce some of our work aimed at porting the existing C++ code to offload the more demanding calculations onto GPUs for better performance. We then investigate the performance of this new API by comparing it with the original ProjectQ implementation, as well as some other existing quantum computing frameworks. The preliminary results show that the C++ API is able to considerably reduce the cost of compilation, to the point that the compilation process becomes mostly limited by the performance of the simulator. Unfortunately, due to some last-minute developments and the time frame required for the submission of the present manuscript, we are unable to provide conclusive benchmark results for the GPU-capable implementation of the simulator. These will most likely be presented during the keynote presentation at the DATE conference in 2021. |
18:15 CET | 8.3.4 | COMPILERS FOR THE NISQ ERA Speaker and Author: Ross Duncan, Cambridge Quantum Computing Ltd, US Abstract We survey the distinctive features of NISQ devices, their differences from more conventional computing systems, and the challenges they pose for compilers. |
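For context on 8.3.3, here is a minimal program in ProjectQ's current Python frontend, assuming the open-source projectq package and its documented MainEngine/gate syntax. Every gate application creates Python command objects that flow through the compiler engine chain, which is exactly the per-object overhead the new C++ backend is designed to remove.

```python
from projectq import MainEngine
from projectq.ops import All, CNOT, H, Measure

# Each "gate | qubits" line creates Python command objects that traverse the
# compiler engines before reaching the backend: the overhead targeted in 8.3.3.
eng = MainEngine()                  # default simulator backend
qubits = eng.allocate_qureg(2)
H | qubits[0]                       # put qubit 0 into superposition
CNOT | (qubits[0], qubits[1])       # entangle: Bell state (|00> + |11>)/sqrt(2)
All(Measure) | qubits
eng.flush()                         # push all cached commands through the engine chain
print([int(q) for q in qubits])     # correlated outcome: [0, 0] or [1, 1]
```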
8.4 Emerging Technologies for Neuromorphic Computing
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/C99maWhgFT9MmQJ5S
Session chair:
Michael Niemier, University of Notre Dame, US
Session co-chair:
Xunzhao Yin, Zhejiang University, CN
This session investigates platforms for neural networks and neuromorphic computing applications based on emerging technologies. The first paper discusses optical neural networks, and aims to improve computational efficiency when both operands are dynamically-encoded light signals. The second paper presents new methods to efficiently program memristor-based crossbars. The last presentation considers how RRAM-based crossbars may be made more resilient to cycle-to-cycle variations.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 8.4.1 | O2NN: OPTICAL NEURAL NETWORKS WITH DIFFERENTIAL DETECTION-ENABLED OPTICAL OPERANDS Speaker: Jiaqi Gu, University of Texas at Austin, US Authors: Jiaqi Gu, Zheng Zhao, Chenghao Feng, Zhoufeng Ying, Ray T. Chen and David Z. Pan, University of Texas at Austin, US Abstract Optical neuromorphic computing has demonstrated promising performance with ultra-high computation speed, high bandwidth, and low energy consumption. The traditional optical neural network (ONN) architectures realize neuromorphic computing via electrical weight encoding. However, previous ONN design methodologies can only handle static linear projection with stationary synaptic weights, thus fail to support efficient and flexible computing when both operands are dynamically-encoded light signals. In this work, we propose a novel ONN engine O2NN based on wavelength-division multiplexing and differential detection to enable high-performance, robust, and versatile photonic neural computing with both light operands. Balanced optical weights and augmented quantization are introduced to enhance the representability and efficiency of our architecture. Static and dynamic variations are discussed in detail with a knowledge-distillation-based solution given for robustness improvement. Discussions on hardware cost and efficiency are provided for a comprehensive comparison with prior work. Simulation and experimental results show that the proposed ONN architecture provides flexible, efficient, and robust support for high-performance photonic neural computing with fully-optical operands under low-bit quantization and practical variations. |
17:45 CET | 8.4.2 | AN EFFICIENT PROGRAMMING FRAMEWORK FOR MEMRISTOR-BASED NEUROMORPHIC COMPUTING Speaker: Grace Li Zhang, TU Munich, DE Authors: Grace Li Zhang1, Bing Li1, Xing Huang1, Chen Shen1, Shuhang Zhang1, Florin Burcea1, Helmut Graeb1, Tsung-Yi Ho2, Hai (Helen) Li3 and Ulf Schlichtmann1 1TU Munich, DE; 2National Tsing Hua University, TW; 3Duke University/TUM-IAS, US Abstract Memristor-based crossbars are considered to be promising candidates to accelerate vector-matrix computation in deep neural networks. Before being applied for inference, memristors in the crossbars should be programmed to conductances corresponding to the weights after software training. Existing programming methods, however, adjust conductances of memristors individually with many programming-reading cycles. In this paper, we propose an efficient programming framework for memristor crossbars, where the programming process is partitioned into the predictive phase and the fine-tuning phase. In the predictive phase, multiple memristors are programmed simultaneously with a memristor programming model and IR-drop estimation. To deal with the programming inaccuracy resulting from process variations, noise and IR-drop, memristors are fine tuned afterwards to reach a specified programming accuracy. Simulation results demonstrate that the proposed method can reduce the number of programming-reading cycles by up to 94.77% and 90.61% compared to existing one-by-one and row-by-row programming methods. |
18:00 CET | IP7_3.1 | EFFICIENT IDENTIFICATION OF CRITICAL FAULTS IN MEMRISTOR CROSSBARS FOR DEEP NEURAL NETWORKS Speaker: Ching-Yuan Chen, Duke University, US Authors: Ching-Yuan Chen1 and Krishnendu Chakrabarty2 1Graduate Institute of Electronics Engineering, TW; 2Duke University, US Abstract Deep neural networks (DNNs) are becoming ubiquitous, but hardware-level reliability is a concern when DNN models are mapped to emerging neuromorphic technologies such as memristor-based crossbars. As DNN architectures are inherently fault-tolerant and many faults do not affect inferencing accuracy, careful analysis must be carried out to identify faults that are critical for a given application. We present a misclassification-driven training (MDT) algorithm to efficiently identify critical faults (CFs) in the crossbar. Our results for two DNNs on the CIFAR-10 data set show that MDT can rapidly and accurately identify a large number of CFs, up to 20× faster than a baseline method of forward inferencing with randomly injected faults. We use the set of CFs obtained using MDT and the set of benign faults obtained using forward inferencing to train a machine learning (ML) model to efficiently classify all the crossbar faults in terms of their criticality. We show that the ML model can classify millions of faults within minutes with a remarkably high classification accuracy of over 99%. We present a fault-tolerance solution that exploits this high degree of criticality-classification accuracy, leading to a 93% reduction in the redundancy needed for fault tolerance. |
18:01 CET | 8.4.3 | DIGITAL OFFSET FOR RRAM-BASED NEUROMORPHIC COMPUTING: A NOVEL SOLUTION TO CONQUER CYCLE-TO-CYCLE VARIATION Speaker: Ziqi Meng, Shanghai Jiao Tong University, CN Authors: Ziqi Meng1, Weikang Qian1, Yilong Zhao1, Yanan Sun2, Rui Yang1 and Li Jiang1 1Shanghai Jiao Tong University, CN; 2Department of Micro-Nano Electronics, Shanghai Jiao Tong University, CN Abstract Resistance variation in memristor devices hinders the practical use of resistive random access memory (RRAM) crossbars as neural network (NN) accelerators. Previous fault-tolerant methods cannot effectively handle cycle-to-cycle variation (CCV). Many of them also use a pair of positive-weight and negative-weight crossbars to store a weight matrix, which implicitly enhances fault tolerance but doubles the hardware cost. This paper proposes a novel solution that dramatically reduces the NN accuracy loss under CCV, while still using a single crossbar to store a weight matrix. The key idea is to introduce digital offsets into the crossbar, which further enables two techniques to conquer CCV. The first is a variation-aware weight optimization method that determines the optimal target weights to be written into the crossbar; the second is a post-writing tuning method that optimally sets the digital offsets to recover the accuracy loss due to variation. Simulation results show that the accuracy stays at the ideal value for LeNet with MNIST and drops by only 2.77% from the ideal value for ResNet-18 with CIFAR-10 under a large resistance variation. Moreover, compared to state-of-the-art fault-tolerant methods, our method achieves better NN accuracy with at least 50% fewer crossbars. |
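A toy version of the digital-offset idea in 8.4.3 fits in a few lines: write the weights once into noisy conductances, then choose a small per-weight digital correction instead of repeatedly rewriting the analog devices. The bit width, noise level, and offset range below are illustrative assumptions, not the paper's configuration, and the paper's variation-aware weight optimization step is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
target = rng.uniform(-1, 1, 16)                  # trained weights
lsb = 1 / 8                                      # assumed analog level spacing
levels = np.round(target / lsb) * lsb            # quantized conductance targets
written = levels + rng.normal(0, 0.08, 16)       # cycle-to-cycle write variation

# Post-writing tuning: pick, per weight, the digital offset (here limited to
# -2..+2 LSBs) that best cancels the observed write error, with no rewriting.
offsets = np.clip(np.round((target - written) / lsb), -2, 2) * lsb

print("rms error before:", np.sqrt(np.mean((written - target) ** 2)))
print("rms error after: ", np.sqrt(np.mean((written + offsets - target) ** 2)))
```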
8.5 Employing non-volatile devices across the memory hierarchy: from cache to storage
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/KtDTurZT5mLpH5mJq
Session chair:
Mengying Zhao, Shandong University, CN
Session co-chair:
Stefan Slesazeck, Namlab, DE
As non-volatile memories become increasingly popular, it is vital to revisit memory system design to cope with the challenges brought by emerging technologies. This session explores how to tune the various components of the memory hierarchy, including caches, content-addressable memories, and solid-state drives (SSDs), based on the unique characteristics of non-volatile devices, such as multi-level cells, ternary matching, high write cost, and read disturbance, so as to improve the performance, energy efficiency, and/or reliability of each component.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 8.5.1 | (Best Paper Award Candidate) IN-MEMORY NEAREST NEIGHBOR SEARCH WITH FEFET MULTI-BIT CONTENT-ADDRESSABLE MEMORIES Speaker: Arman Kazemi, University of Notre Dame, US Authors: Arman Kazemi1, Mohammad Mehdi Sharifi1, Ann Franchesca Laguna1, Franz Mueller2, Ramin Rajaei3, Ricardo Olivo2, Thomas Kaempfe2, Michael Niemier1 and X. Sharon Hu1 1University of Notre Dame, US; 2Fraunhofer IPMS-CNT, DE; 3Department of Computer Science and Engineering, University of Notre Dame, US Abstract Nearest neighbor (NN) search is an essential operation in many applications, such as one/few-shot learning and image classification. As such, fast and low-energy hardware support for accurate NN search is highly desirable. Ternary content-addressable memories (TCAMs) have been proposed to accelerate NN search for few-shot learning tasks by implementing L∞ and Hamming distance metrics, but they cannot achieve software-comparable accuracies. This paper proposes a novel distance function that can be natively evaluated with multi-bit content-addressable memories (MCAMs) based on ferroelectric FETs (FeFETs) to perform a single-step, in-memory NN search. Moreover, this approach achieves accuracies comparable to floating-point precision implementations in software for NN classification and one/few-shot learning tasks. As an example, the proposed method achieves a 98.34% accuracy for a 5-way, 5-shot classification task for the Omniglot dataset (only 0.8% lower than software-based implementations) with a 3-bit MCAM. This represents a 13% accuracy improvement over state-of-the-art TCAM-based implementations at iso-energy and iso-delay. The presented distance function is resilient to the effects of FeFET device-to-device variations. Furthermore, this work experimentally demonstrates a 2-bit implementation of FeFET MCAM using AND arrays from GLOBALFOUNDRIES to further validate the proof of concept. |
17:45 CET | 8.5.2 | ENERGY-AWARE DESIGNS OF FERROELECTRIC TERNARY CONTENT ADDRESSABLE MEMORY Speaker: Yu Qian, Zhejiang University, CN Authors: Yu Qian1, Zhenhao Fan2, Haoran Wang1, Chao Li1, Mohsen Imani3, Kai Ni4, Grace Li Zhang5, Bing Li5, Ulf Schlichtmann5, Cheng Zhuo1 and Xunzhao Yin1 1Zhejiang University, CN; 2Zhejiang University, CN; 3University of California Irvine, US; 4Rochester Institute of Technology, US; 5TU Munich, DE Abstract Ternary content addressable memories (TCAMs) are a special form of computing-in-memory (CiM) circuits that aim to address the so-called memory wall issues by merging the parallel search function with memory blocks. Due to their efficient content-addressing nature, TCAMs have been increasingly utilized for search-intensive tasks in low-power, data analytic applications, such as IP routers, associative memories, and learning models. While most state-of-the-art TCAM designs mainly focus on improving the TCAM density by harnessing compact nonvolatile memories (NVMs), little effort has been spent on reducing and optimizing the energy consumption of NVM-based TCAMs. In this paper, exploiting the ferroelectric FET (FeFET) as a representative NVM, we propose two compact and energy-aware designs of ferroelectric TCAMs for low-power applications. We first introduce a novel 2FeFET-based XOR-like gate structure that can also be adopted for other NVMs, and then leverage the structure to propose two TCAM designs that achieve high energy efficiency by either reducing the associated precharge overhead (2FeFET-1T cell) or eliminating the precharge phase typically required by TCAMs (2FeFET-2T cell). We evaluate and compare the designs w.r.t. area, search energy, and delay at the array level with other existing designs, and benchmark the proposed TCAM designs in an associative-memory-based GPU architecture. The results suggest that the proposed 2FeFET-1T/2FeFET-2T TCAM designs consume 3.03X/8.08X less search energy than the conventional 16T CMOS TCAM, while the proposed design cell area is only 32.1%/39.3% of the latter. Compared with the state-of-the-art 2FeFET-only TCAM array, our proposed designs still achieve 1.79X and 4.79X search energy reduction, respectively. Moreover, our proposed designs can achieve, on average, 45.2%/51.5% energy saving compared with the conventional GPU-based architecture at the application level. |
18:00 CET | IP7_1.1 | BLOCK ATTRIBUTE-AWARE DATA REALLOCATION TO ALLEVIATE READ DISTURB IN SSDS Speaker: Mingwang Zhao, Southwest University of China, CN Authors: Jianwei Liao1, Mingwang Zhao1, Zhigang Cai1, Jun Li2 and Yuanquan Shi3 1Southwest University of China, CN; 2Southwest University, CN; 3Huaihua University, CN Abstract This paper proposes a data reallocation method for RR processes that takes account of block attributes including P/E cycles and read counts. Specifically, it distributes the data in the RR block onto a number of available SSD blocks, allowing different blocks to keep their own near-optimal (tolerable) read counts. It can thus reduce the number of read refresh operations and the number of raw bit errors. Through a series of simulation experiments based on several realistic disk traces, we demonstrate that the proposed method can decrease the read response time by between 15.11% and 24.96% and the raw bit error rate by 4.14% on average, compared with state-of-the-art approaches. |
18:01 CET | IP7_1.2 | DYNAMIC TERNARY CONTENT-ADDRESSABLE MEMORY IS INDEED PROMISING: DESIGN AND BENCHMARKING USING NANOELECTROMECHANICAL RELAYS Speaker: Hongtao Zhong, Tsinghua University, CN Authors: Hongtao Zhong, Shengjie Cao, Huazhong Yang and Xueqing Li, Tsinghua University, CN Abstract Ternary content addressable memory (TCAM) has been a critical component in caches, routers, etc., in which density, speed, power efficiency, and reliability are the major design targets. There are conventional low-write-power but bulky SRAM-based TCAM designs, as well as denser but less reliable or higher-write-power TCAM designs using nonvolatile memory (NVM) devices. Meanwhile, some TCAM designs using dynamic memories have also been proposed. Although dynamic TCAM is denser than CMOS SRAM TCAM and more reliable than NVM TCAM, the conventional row-by-row refresh operations become a bottleneck by interfering with normal TCAM activities. Therefore, this paper proposes a custom low-power dynamic TCAM using nanoelectromechanical (NEM) relay devices that utilizes one-shot refresh to solve the memory refresh problem. By harnessing the unique NEM relay characteristics with a proposed novel cell structure, the proposed TCAM occupies a small footprint of only 3 transistors (with two NEM relays integrated on top through the back-end-of-line process), which significantly outperforms the density of 16-transistor SRAM-based TCAM. In addition, evaluations show that the proposed TCAM improves the write energy efficiency by 2.31x, 131x, and 13.5x over SRAM, RRAM, and FeFET TCAMs, respectively; the search energy-delay product is improved by 12.7x, 1.30x, and 2.83x over SRAM, RRAM, and FeFET TCAMs, respectively. |
18:02 CET | 8.5.3 | IMPROVING THE ENERGY EFFICIENCY OF STT-MRAM BASED APPROXIMATE CACHE Speaker: Wei Zhao, Huazhong University of Science and Technology, CN Authors: Wei Zhao1, Wei Tong1, Dan Feng1, Jingning Liu1, Zhangyu Chen1, Jie Xu2, Bing Wu1, Chengning Wang1 and Bo Liu3 1Huazhong University of Science and Technology, CN; 2Wuhan National Lab for Optoelectronics, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China, CN; 3Hikstor Technology Co., LTD, Hangzhou, China, CN Abstract Approximate computing applications place large energy consumption and performance demands on the memory system. However, traditional SRAM-based caches cannot satisfy these demands due to high leakage power and limited density. Spin Transfer Torque Magnetic RAM (STT-MRAM) is a promising cache candidate due to its low leakage power and high density. However, STT-MRAM suffers from high write energy. To leverage the ability to tolerate acceptable quality loss via approximations to data, we propose an STT-MRAM based APProximate cache architecture (APPcache) that writes/reads approximate data, thus largely reducing energy. We find that many similar elements (e.g., pixels in images) exist in cache lines while running approximate computing applications. Therefore, APPcache uses several lightweight similarity-based encoding schemes to eliminate the similar elements and reduce the data size, thus reducing the write energy of the STT-MRAM based cache. Besides, we design a software interface to manually control the output quality. APPcache can significantly eliminate similar elements, thus improving energy efficiency. Experimental results show that our scheme can reduce write energy and improve the image raw data compression ratio by 21.9% and 38.0%, respectively, compared with the state-of-the-art scheme at a 1% error rate. As for output quality, the losses of all benchmarks are within 5% at a 1% error rate. |
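As a rough illustration of the similarity-based encoding idea in paper 8.5.3 above, the sketch below replaces cache-line elements that are close to a base element with one-bit flags, shrinking the data actually written. The threshold and encoding layout are assumptions made for the sketch, not the paper's format.

```python
# Minimal sketch of similarity-based cache-line encoding (assumed layout):
# elements "similar enough" to the line's base element are decoded back as
# the base value, which is acceptable for approximate-computing data.

def encode_line(elems, threshold=4):
    base = elems[0]
    flags, literals = [], [base]
    for e in elems[1:]:
        if abs(int(e) - int(base)) <= threshold:  # similar: drop the value
            flags.append(1)
        else:
            flags.append(0)
            literals.append(e)                    # dissimilar: store verbatim
    return flags, literals

def decode_line(flags, literals):
    base, rest = literals[0], iter(literals[1:])
    return [base] + [base if f else next(rest) for f in flags]

line = [120, 121, 119, 200, 122, 118, 95, 120]    # e.g. neighboring pixels
flags, lits = encode_line(line)
print("stored values:", lits, "(vs", len(line), "originals)")
print("decoded      :", decode_line(flags, lits))  # approximate reconstruction
```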
8.6 Advances in Formal Verification
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/8L3XhH2BX7njqm4xE
Session chair:
Cimatti Alessandro, FBK, IT
Session co-chair:
Yakir Vizel, Technion, IL
The session presents several new techniques in verification of hardware and software. The technical papers propose an effective method for the verification of dividers, an automated approach to improve dead code detection using IC3, and a new methodology to apply refinement techniques to the verification of general hardware modules. Two interactive presentations describe how to optimize BDDs for classification, and how to use information flow tracking to detect exploitable buffer overflows.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 8.6.1 | VERIFYING DIVIDERS USING SYMBOLIC COMPUTER ALGEBRA AND DON'T CARE OPTIMIZATION Speaker: Christoph Scholl, University of Freiburg, DE Authors: Christoph Scholl1, Alexander Konrad1, Alireza Mahzoon2, Daniel Grosse3 and Rolf Drechsler4 1University Freiburg, DE; 2University of Bremen, DE; 3Johannes Kepler University Linz, AT; 4University of Bremen/DFKI, DE Abstract In this paper we build on methods based on Symbolic Computer Algebra that have been applied successfully to multiplier verification and, more recently, to divider verification as well. We show that existing methods are not sufficient to verify optimized non-restoring dividers, and we enhance those methods by a novel optimization method for polynomials w.r.t. satisfiability don't cares. The optimization is reduced to Integer Linear Programming (ILP). Our experimental results show that this method is the key to enabling the verification of large and optimized non-restoring dividers (with bit widths up to 512). |
17:45 CET | 8.6.2 | ICP AND IC3 Speaker: Felix Winterer, University of Freiburg, DE Authors: Karsten Scheibler1, Felix Winterer2, Tobias Seufert2, Tino Teige1, Christoph Scholl3 and Bernd Becker2 1BTC Embedded Systems AG, DE; 2University of Freiburg, DE; 3University Freiburg, DE Abstract If embedded systems are used in safety-critical environments, they need to meet several standards. For example, in the automotive domain the ISO 26262 standard requires that the software running on such systems does not contain unreachable code. Software model checking is one effective approach to automatically detect such dead code. Being used in a commercial product, iSAT3 already performs very well in this context. In this paper we integrate IC3 into iSAT3 in order to improve its dead code detection capabilities even further. |
18:00 CET | IP7_2.1 | OPTIMIZING BINARY DECISION DIAGRAMS FOR INTERPRETABLE MACHINE LEARNING CLASSIFICATION Speaker: Gianpiero Cabodi, Politecnico di Torino, IT Authors: Gianpiero Cabodi1, Paolo Camurati1, Alexey Ignatiev2, Joao Marques-Silva3, Marco Palena1 and Paolo Pasini4 1Politecnico di Torino, IT; 2Monash University, AU; 3University of Toulouse, FR; 4Polytechnic University of Turin, IT Abstract Motivated by the need to understand the behaviour of complex machine learning (ML) models, there has been recent interest in learning optimal (or sub-optimal) decision trees (DTs). This interest is explained by the fact that DTs are widely regarded as interpretable by human decision makers. An alternative to DTs are Binary Decision Diagrams (BDDs), which can also be deemed interpretable. Compared to DTs, and despite a fixed variable order, BDDs offer the advantage of more compact representations in practice, due to the sharing of nodes. Moreover, there is also extensive experience in the efficient manipulation of BDDs. Our work proposes preliminary inroads in two main directions: (a) proposing a SAT-based model for computing a decision tree as a smallest Reduced Ordered Binary Decision Diagram, consistent with given training data; and (b) exploring heuristic approaches for deriving sub-optimal (i.e., not minimal) ROBDDs, in order to improve the scalability of the proposed technique. The heuristic approach is related to recent work on using BDDs for classification. Whereas previous works addressed size reduction by general logic synthesis techniques, our work adds the contribution of generalized cofactors, which are a well-known compaction technique specific to BDDs, once a care (or equivalently a don't care) set is given. Preliminary experimental results are also provided, proposing a direct comparison between optimal and sub-optimal solutions, as well as an evaluation of the impact of the proposed size reduction steps. |
18:01 CET | IP7_2.2 | BOFT: EXPLOITABLE BUFFER OVERFLOW DETECTION BY INFORMATION FLOW TRACKING Speaker: Muhammad Monir Hossain, University of Florida, US Authors: Muhammad Monir Hossain, Farimah Farahmandi, Mark Tehranipoor and Fahim Rahman, University of Florida, US Abstract Buffer overflow is one of the most critical software vulnerabilities, with numerous functional and security impacts on memory boundaries and program calls. An exploitable buffer overflow, which can be directly or indirectly triggered through external user-domain inputs, is of greater concern because it can be misused at run-time with adversarial intent. Although some existing tools offer buffer overflow detection to certain extents, there are major limitations, such as poor detection coverage and ad-hoc/manual verification efforts, due to inadequate predefined executions for static analysis and a substantially large input subspace for dynamic verification. In this paper, to provide static-time program verification with high detection coverage, we propose an automated framework for Exploitable Buffer Overflow Detection by Information Flow Tracking (BOFT). We achieve this goal in three steps -- first, BOFT analyzes the usage of arrays, pointers, and vulnerable application programming interfaces (APIs) in the program code and automatically inserts the assertions required for buffer overflow detection. Second, BOFT instruments the program with taints for direct and indirect information flow tracking using an extensive set of formal expressions. Finally, it symbolically analyzes the instrumented code for maximum coverage and provides the list of exploitable buffer overflow vulnerabilities. BOFT is evaluated on standard benchmarks from the SAMATE Juliet Test Suite (NIST) with a successful detection of ~94.87% (minimum) of exploitable buffer overflows with zero false positives. |
18:02 CET | 8.6.3 | (Best Paper Award Candidate) LEVERAGING PROCESSOR MODELING AND VERIFICATION FOR GENERAL HARDWARE MODULES Speaker: Yue Xing, Princeton University, US Authors: Yue Xing, Huaixi Lu, Aarti Gupta and Sharad Malik, Princeton University, US Abstract For processors, an instruction-set-architecture (ISA) provides a complete functional specification that can be used to formally verify an implementation. There has been recent work in specifying accelerators using formal instruction sets, referred to as Instruction-Level Abstractions (ILAs), and using them to formally verify their implementations by leveraging processor verification techniques. In this paper, we generalize ILAs for specification of general hardware modules and formal verification of their RTL implementations. This includes automated generation of a complete set of functional (not including timing) specification properties using the ILA instructions. We address the challenges posed by this generalization and provide several case studies to demonstrate the applicability of this technique, including all the modules in an open-source 8051 micro-controller. This verification identified three bugs and completed in reasonable time. |
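Paper 8.6.1 above builds on symbolic computer algebra, where each gate becomes a polynomial over 0/1-valued variables and verification amounts to reducing the specification polynomial to zero. The following worked toy example applies that style to a half adder; the paper's optimized non-restoring dividers are, of course, far harder:

```python
from sympy import symbols, expand

# Worked toy example of symbolic-computer-algebra verification: encode each
# gate as a polynomial over {0,1}-valued variables and check that the
# implementation reduces the specification polynomial to zero.

a, b = symbols('a b')

# Gate polynomials over {0,1}: AND(a,b) = a*b, XOR(a,b) = a + b - 2*a*b.
s = a + b - 2*a*b      # sum bit (XOR gate)
c = a*b                # carry bit (AND gate)

# Specification: the output bits (c, s) encode the arithmetic sum a + b.
residue = expand(2*c + s - (a + b))
print(residue)         # 0 -> the netlist implements the specification
assert residue == 0
```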
8.7 Multicore and Distributed Real-Time Systems
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/gs9gyqxqpyQn5xFFw
Session chair:
Liliana Cucu-Grosjean, Inria, FR
Session co-chair:
Marko Bertogna, Università di Modena e Reggio Emilia, IT
The arrival of multicore and distributed platforms is a major challenge faced by the real-time systems community. The first contribution of the session presents new advances in latency guarantees in this context. A second contribution deals with a CAN-oriented solution, while transparent communication solutions are presented alongside cache coherence requirements and cache partitioning. Last but not least, multicore platforms push for greater use of hardware accelerators, and new analyses of memory interference are presented in the last paper.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 8.7.1 | DUETTO: LATENCY GUARANTEES AT MINIMAL PERFORMANCE COST Speaker: Reza Mirosanlou, University of Waterloo, CA Authors: Reza Mirosanlou1, Mohamed Hassan2 and Rodolfo Pellizzoni1 1University of Waterloo, CA; 2McMaster University, CA Abstract The management of shared hardware resources in multi-core platforms has been characterized by a fundamental trade-off: high-performance arbiters typically employed in COTS systems offer no worst-case guarantees, while dedicated real-time controllers provide timing guarantees at the cost of significantly degrading system performance. In this paper, we overcome this trade-off by introducing Duetto, a novel hardware resource management paradigm. Duetto pairs a real-time arbiter with a high-performance arbiter and a latency estimator module. Based on the observation that the resource is rarely overloaded, Duetto executes the high-performance arbiter most of the time, switching to the real-time arbiter only in the rare cases when the latency estimator deems that timing guarantees risk being violated. We demonstrate our approach on the case study of a multi-bank memory. Our evaluation based on cycle-accurate simulations shows that Duetto can provide the same latency guarantees as the real-time arbiter with limited loss of performance compared to the high-performance arbiter. |
17:45 CET | 8.7.2 | VPROFILE: VOLTAGE-BASED ANOMALY DETECTION IN CONTROLLER AREA NETWORKS Speaker: Nathan Liu, University of Waterloo, CA Authors: Nathan Liu, Carlos Moreno, Murray Dunne and Sebastian Fischmeister, University of Waterloo, CA Abstract Modern cars are becoming more accessible targets for cyberattacks due to the proliferation of wireless communication channels. The intra-vehicle Controller Area Network (CAN) bus lacks authentication, which exposes critical components to interference from less secure, wirelessly compromised modules. To address this issue, we propose vProfile, a sender authentication system based on voltage fingerprints of Electronic Control Units (ECUs). vProfile exploits the physical properties of ECU output voltages on the CAN bus to determine the authenticity of bus messages, which enables the detection of both hijacked ECUs and external devices connected to the bus. We show the potential of vProfile using experiments on two production vehicles with precision and recall scores of over 99.99%. The improved identification rates and more straightforward design of vProfile make it an attractive improvement over existing methods. |
18:00 CET | IP7_3.2 | MODELING, IMPLEMENTATION, AND ANALYSIS OF XRCE-DDS APPLICATIONS IN DISTRIBUTED MULTI-PROCESSOR REAL-TIME EMBEDDED SYSTEMS Speaker: Saeid Dehnavi, Eindhoven University of Technology (TU/e), The Netherlands, NL Authors: Saeid Dehnavi1, Dip Goswami2, Martijn Koedam2, Andrew Nelson3 and Kees Goossens4 1Eindhoven University of Technology (TU/e), NL; 2Eindhoven University of Technology, NL; 3TU Eindhoven, NL; 4Eindhoven University of Technology, NL Abstract The Publish-Subscribe paradigm is a design pattern for transparent communication in many recent distributed applications. Data Distribution Service (DDS) is a machine-to-machine communication standard that aims to provide reliable, high-performance, inter-operable, and real-time data exchange based on the publish–subscribe paradigm. However, the high resource requirements of DDS limit its usage in low-cost embedded systems. XRCE-DDS is a Client-Agent based standard that enables resource-constrained small embedded systems to connect to the DDS global data space. Current XRCE-DDS implementations suffer from dependencies on host operating systems, target only single processing units, and lack performance analysis methods. In this paper, we present a bare-metal implementation of the XRCE-DDS standard on the CompSOC platform as an instance of a Multi-Processor System on Chip (MPSoC). The proposed framework includes a hard real-time side hosting the XRCE-DDS Client, and a soft real-time side hosting the XRCE-DDS Agent. A Scenario Aware Data Flow (SADF) model is proposed to capture the dynamism of the system behavior in terms of different execution scenarios. We analyze the long-term expected value of throughput by capturing the probabilistic scenario switching using a proposed Markov model, which is experimentally validated. |
18:01 CET | IP7_4.1 | ANALYZING MEMORY INTERFERENCE OF FPGA ACCELERATORS ON MULTICORE HOSTS IN HETEROGENEOUS RECONFIGURABLE SOCS Speaker: Maxim Mattheeuws, ETH Zürich, CH Authors: Maxim Mattheeuws1, Björn Forsberg2, Andreas Kurth3 and Luca Benini4 1ETH, CH; 2ETH Zürich, CH; 3ETH Zurich, CH; 4Università di Bologna and ETH Zurich, IT Abstract Reconfigurable heterogeneous systems-on-chips (SoCs) integrating multiple accelerators are cost-effective and feature the processing power required for complex embedded applications. However, to enable their usage in real-time settings, it is crucial to control interference on the shared main memory for reliable performance. Interference causes performance degradation due to simultaneous memory requests by components such as CPUs, caches, accelerators, and DMAs. We propose a methodology to characterize the interference to multicore host processors caused by accelerators implemented in the FPGA fabric of reconfigurable heterogeneous SoCs. Based on it, we extend the roofline model to account for the performance degradation of the computing platform. The extended model makes it possible to determine efficiently at which point memory interference becomes critical for a given platform and workload. We apply our methodology to a modern Xilinx UltraScale+ SoC integrating a multicore ARM Cortex-A CPU and a Kintex-grade FPGA. To the best of our knowledge, our results experimentally show for the first time that programs with intensities below 5 flop/byte -- workloads with low cache locality -- can suffer from slowdowns of up to an order of magnitude. |
18:02 CET | 8.7.3 | FLEXIBLE CACHE PARTITIONING FOR MULTI-MODE REAL-TIME SYSTEMS Speaker: Ohchul Kwon, TU Munich, KR Authors: Ohchul Kwon1, Gero Schwäricke2, Tomasz Kloda2, Denis Hoornaert1, Giovani Gracioli3 and Marco Caccamo1 1TU Munich, DE; 2TUM, DE; 3UFSC, Brazil, BR Abstract Cache partitioning is a well-studied technique that mitigates the inter-processor cache interference in multiprocessor systems. The resulting optimization problem involves allocating portions of the cache to individual processors. In multi-mode applications (e.g., flight control system that runs in take-off, cruise, or landing mode), the cache memory requirement can change over time, making runtime cache repartitioning necessary. This paper presents a cache partition allocation framework enabling flexible cache partitioning for multi-mode real-time systems. The main objective is to guarantee timing predictability in the steady states and during mode changes. We evaluate the effectiveness of our approach for multiple embedded benchmarks with different ranges of cache size sensitivity. The results show increased schedulability compared to static partitioning approaches. |
18:17 CET | IP7_4.2 | EMPIRICAL EVIDENCE FOR MPSOCS IN CRITICAL SYSTEMS: THE CASE OF NXP’S T2080 CACHE COHERENCE Speaker: Roger Pujol, Barcelona Supercomputing Center, ES Authors: Roger Pujol1, Hamid Tabani1, Jaume Abella2, Mohamed Hassan3 and Francisco J Cazorla1 1Barcelona Supercomputing Center, ES; 2Barcelona Supercomputing Center (BSC-CNS), ES; 3McMaster University, CA Abstract The adoption of complex MPSoCs in critical real-time embedded systems mandates a detailed analysis of their architecture to facilitate certification. This analysis is hindered by the lack of a thorough understanding of the MPSoC system due to the unobvious and/or insufficiently documented behavior of some key hardware features. Confidence in those features can only be regained by building specific tests to both assess whether their behavior matches specifications and unveil their behavior when it is not fully known a priori. In this work, we introduce a systematic approach that constructs this thorough understanding of the MPSoC architecture, and assesses it against its specification in the processor documentation, with a focus on the cache coherence protocol in the avionics-relevant NXP T2080 architecture as our use case. Our approach covers all transitions in the MESI cache coherence protocol, with emphasis on the coherence between DMA and processing cores. We build evidence of their behavior based on available debug support and performance monitors. Our analysis discloses unexpected behavior for coherence-related notifications as well as some hardware monitors. |
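To make the Duetto paradigm of paper 8.7.1 above concrete, here is a minimal sketch of its management loop: a latency estimator decides per request whether the high-performance arbiter may keep control or the real-time arbiter must take over. The arbiters, the pessimistic estimator, and the budget below are illustrative stand-ins, not the paper's hardware design.

```python
from collections import namedtuple

# Sketch of the Duetto-style dual-arbiter idea (assumed interfaces): run the
# high-performance arbiter in the common case, fall back to the real-time
# arbiter only when the latency estimate approaches the worst-case budget.

Request = namedtuple("Request", "rid service_time")

class FRFCFS:                     # stand-in high-performance arbiter
    def pick(self, queue): return max(queue, key=lambda r: r.service_time)

class RoundRobin:                 # stand-in real-time arbiter
    def pick(self, queue): return queue[0]

class DuettoManager:
    def __init__(self, hp_arbiter, rt_arbiter, wcl_budget):
        self.hp, self.rt = hp_arbiter, rt_arbiter
        self.wcl_budget = wcl_budget  # per-request worst-case latency bound

    def estimate_worst_case(self, request, queue):
        # Pessimistic estimate: everything queued ahead is serviced first.
        return (len(queue) + 1) * request.service_time

    def grant(self, request, queue):
        if self.estimate_worst_case(request, queue) < self.wcl_budget:
            return self.hp.pick(queue)  # common case: high performance
        return self.rt.pick(queue)      # rare case: guarantee latency

mgr = DuettoManager(FRFCFS(), RoundRobin(), wcl_budget=50)
q = [Request(1, 10), Request(2, 5)]
print(mgr.grant(q[0], q))
```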
8.8 Industrial Design Methods and Tools: Multidimensional Design Reuse and Extended Role of Test for Automotive
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/zfe4jcWSSMiyZgPoJ
Organizer:
Jürgen Haase, edacentrum GmbH, DE
This Exhibition Workshop features two talks on industrial design methods and tools. It is open to conference delegates as well as to exhibition visitors.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 8.8.1 | EARLIER SOC INTEGRATION WITH A MULTIDIMENSIONAL DESIGN REUSE Speaker: Chouki Aktouf, Defacto Technologies, FR Abstract SoC design starts with design assembly connecting IP blocks, which is just the beginning of the integration process. The difficult part is reaching the best possible PPA (Power, Performance, Area) combination within tight deadlines, while keeping engineering costs under control. In the conventional EDA (Electronic Design Automation) design flow, each task (power consumption, architecture, testing, etc.) is performed separately by an ultra-specialized team of engineers, and significant design time is lost in iteration loops. The number of iterations has a great impact on the cost and time frame of the whole project. This presentation will illustrate how to start the SoC build process much earlier than in traditional design flows, using a joint API that handles a variety of design domains and design formats, including RTL, constraints, power, physical, and test. Such an API allows non-design experts to take important design decisions. Also, a new dimension of design extraction is presented, with a focus on “Power SoC Integration”. It is shown how the design reuse ratio is augmented while keeping engineering cost reasonably low. |
17:55 CET | 8.8.2 | EXTENDING THE ROLE OF TEST TO MEET AUTOMOTIVE SAFETY AND SECURITY REQUIREMENTS Speaker: Lee Harrison, Siemens EDA, GB Abstract The role of test is expanding from its traditional role into one that includes managing the entire silicon lifecycle. To ensure that ICs work safely and as expected throughout their operational life, the industry needs to expand from production test to a model that includes ongoing monitoring for defects, degradations, bugs, attacks, and use case surprises. |
IP7_1 Interactive Presentations
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/JQTHeB6SDwGigrYpW
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Time | Label | Presentation Title Authors |
---|---|---|
IP7_1.1 | BLOCK ATTRIBUTE-AWARE DATA REALLOCATION TO ALLEVIATE READ DISTURB IN SSDS Speaker: Mingwang Zhao, Southwest University of China, CN Authors: Jianwei Liao1, Mingwang Zhao1, Zhigang Cai1, Jun Li2 and Yuanquan Shi3 1Southwest University of China, CN; 2Southwest University, CN; 3Huaihua University, CN Abstract This paper proposes a data reallocation method for RR processes that takes account of block attributes including P/E cycles and read counts. Specifically, it distributes the data in the RR block onto a number of available SSD blocks, allowing different blocks to keep their own near-optimal (tolerable) read counts. It can thus reduce the number of read refresh operations and the number of raw bit errors. Through a series of simulation experiments based on several realistic disk traces, we demonstrate that the proposed method can decrease the read response time by between 15.11% and 24.96% and the raw bit error rate by 4.14% on average, compared with state-of-the-art approaches. |
IP7_1.2 | DYNAMIC TERNARY CONTENT-ADDRESSABLE MEMORY IS INDEED PROMISING: DESIGN AND BENCHMARKING USING NANOELECTROMECHANICAL RELAYS Speaker: Hongtao Zhong, Tsinghua University, CN Authors: Hongtao Zhong, Shengjie Cao, Huazhong Yang and Xueqing Li, Tsinghua University, CN Abstract Ternary content addressable memory (TCAM) has been a critical component in caches, routers, etc., in which density, speed, power efficiency, and reliability are the major design targets. There are conventional low-write-power but bulky SRAM-based TCAM designs, as well as denser but less reliable or higher-write-power TCAM designs using nonvolatile memory (NVM) devices. Meanwhile, some TCAM designs using dynamic memories have also been proposed. Although dynamic TCAM is denser than CMOS SRAM TCAM and more reliable than NVM TCAM, the conventional row-by-row refresh operations become a bottleneck by interfering with normal TCAM activities. Therefore, this paper proposes a custom low-power dynamic TCAM using nanoelectromechanical (NEM) relay devices that utilizes one-shot refresh to solve the memory refresh problem. By harnessing the unique NEM relay characteristics with a proposed novel cell structure, the proposed TCAM occupies a small footprint of only 3 transistors (with two NEM relays integrated on top through the back-end-of-line process), which significantly outperforms the density of 16-transistor SRAM-based TCAM. In addition, evaluations show that the proposed TCAM improves the write energy efficiency by 2.31x, 131x, and 13.5x over SRAM, RRAM, and FeFET TCAMs, respectively; the search energy-delay product is improved by 12.7x, 1.30x, and 2.83x over SRAM, RRAM, and FeFET TCAMs, respectively. |
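A minimal sketch of the block-attribute-aware reallocation idea in IP7_1.1 above: valid pages from the reclaimed block are spread greedily over candidate blocks with the largest remaining read budget. The budget model, a simple function of P/E cycles and read count, is an assumption of this sketch rather than the paper's calibrated model.

```python
# Sketch of attribute-aware data reallocation (assumed wear model): blocks
# with more P/E cycles tolerate fewer reads, so hot pages from the reclaimed
# block are steered toward blocks with the largest remaining read budget.

class Block:
    MAX_READS_FRESH = 100_000           # tolerable reads for a fresh block

    def __init__(self, bid, pe_cycles, read_count):
        self.bid, self.pe_cycles, self.read_count = bid, pe_cycles, read_count

    def read_budget(self):
        # Assumed model: each P/E cycle shaves 1% off the read tolerance.
        tolerable = self.MAX_READS_FRESH * (1 - 0.01 * self.pe_cycles)
        return max(0.0, tolerable - self.read_count)

def reallocate(hot_pages, candidates):
    """Greedily place pages from the reclaimed block on the coolest blocks."""
    placement = {}
    for page in hot_pages:
        target = max(candidates, key=Block.read_budget)
        placement[page] = target.bid
        target.read_count += 20_000     # assumed extra reads the page brings
    return placement

blocks = [Block(0, 50, 20_000), Block(1, 10, 70_000), Block(2, 30, 5_000)]
print(reallocate(["pA", "pB", "pC"], blocks))
```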
IP7_2 Interactive Presentations
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/vBDNdKJbARXSDrrfs
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Time | Label | Presentation Title Authors |
---|---|---|
IP7_2.1 | OPTIMIZING BINARY DECISION DIAGRAMS FOR INTERPRETABLE MACHINE LEARNING CLASSIFICATION Speaker: Gianpiero Cabodi, Politecnico di Torino, IT Authors: Gianpiero Cabodi1, Paolo Camurati1, Alexey Ignatiev2, Joao Marques-Silva3, Marco Palena1 and Paolo Pasini4 1Politecnico di Torino, IT; 2Monash University, AU; 3University of Toulouse, FR; 4Polytechnic University of Turin, IT Abstract Motivated by the need to understand the behaviour of complex machine learning (ML) models, there has been recent interest in learning optimal (or sub-optimal) decision trees (DTs). This interest is explained by the fact that DTs are widely regarded as interpretable by human decision makers. An alternative to DTs are Binary Decision Diagrams (BDDs), which can also be deemed interpretable. Compared to DTs, and despite a fixed variable order, BDDs offer the advantage of more compact representations in practice, due to the sharing of nodes. Moreover, there is also extensive experience in the efficient manipulation of BDDs. Our work proposes preliminary inroads in two main directions: (a) proposing a SAT-based model for computing a decision tree as a smallest Reduced Ordered Binary Decision Diagram, consistent with given training data; and (b) exploring heuristic approaches for deriving sub-optimal (i.e., not minimal) ROBDDs, in order to improve the scalability of the proposed technique. The heuristic approach is related to recent work on using BDDs for classification. Whereas previous works addressed size reduction by general logic synthesis techniques, our work adds the contribution of generalized cofactors, which are a well-known compaction technique specific to BDDs, once a care (or equivalently a don't care) set is given. Preliminary experimental results are also provided, proposing a direct comparison between optimal and sub-optimal solutions, as well as an evaluation of the impact of the proposed size reduction steps. |
IP7_2.2 | BOFT: EXPLOITABLE BUFFER OVERFLOW DETECTION BY INFORMATION FLOW TRACKING Speaker: Muhammad Monir Hossain, University of Florida, US Authors: Muhammad Monir Hossain, Farimah Farahmandi, Mark Tehranipoor and Fahim Rahman, University of Florida, US Abstract Buffer overflow is one of the most critical software vulnerabilities, with numerous functional and security impacts on memory boundaries and program calls. An exploitable buffer overflow, which can be directly or indirectly triggered through external user-domain inputs, is of greater concern because it can be misused at run-time with adversarial intent. Although some existing tools offer buffer overflow detection to certain extents, there are major limitations, such as poor detection coverage and ad-hoc/manual verification efforts, due to inadequate predefined executions for static analysis and a substantially large input subspace for dynamic verification. In this paper, to provide static-time program verification with high detection coverage, we propose an automated framework for Exploitable Buffer Overflow Detection by Information Flow Tracking (BOFT). We achieve this goal in three steps -- first, BOFT analyzes the usage of arrays, pointers, and vulnerable application programming interfaces (APIs) in the program code and automatically inserts the assertions required for buffer overflow detection. Second, BOFT instruments the program with taints for direct and indirect information flow tracking using an extensive set of formal expressions. Finally, it symbolically analyzes the instrumented code for maximum coverage and provides the list of exploitable buffer overflow vulnerabilities. BOFT is evaluated on standard benchmarks from the SAMATE Juliet Test Suite (NIST) with a successful detection of ~94.87% (minimum) of exploitable buffer overflows with zero false positives. |
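To illustrate the information-flow-tracking step that BOFT (IP7_2.2 above) automates, the toy sketch below propagates taint bits through a three-address mini-IR and reports stores whose index derives from user-domain input without a bound check. The IR, the program, and the buffer size are invented for the sketch; BOFT itself operates on real program code with formal taint expressions.

```python
# Toy direct taint tracking over an invented three-address mini-IR: values
# derived from external inputs carry a taint bit, and a tainted, unchecked
# index into a fixed-size buffer is flagged as an overflow candidate.

program = [
    ("input", "n"),              # n comes from the user domain -> tainted
    ("const", "i", 0),
    ("add",   "idx", "n", "i"),  # idx inherits n's taint
    ("store", "buf", "idx"),     # buf has 16 slots; idx is unconstrained!
]

BUF_SIZE = 16
taint, reports = {}, []

for instr in program:
    op = instr[0]
    if op == "input":
        taint[instr[1]] = True
    elif op == "const":
        taint[instr[1]] = False
    elif op == "add":
        _, dst, a, b = instr
        taint[dst] = taint[a] or taint[b]   # propagate taint to the result
    elif op == "store":
        _, buf, idx = instr
        if taint[idx]:
            # Here a BOFT-style tool would insert: assert 0 <= idx < BUF_SIZE
            reports.append(f"store to {buf}[{idx}]: tainted index, "
                           f"bound {BUF_SIZE} not enforced")

print("\n".join(reports) or "no exploitable overflow candidates")
```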
IP7_3 Interactive Presentations
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/dm8AQeeAWJrnuJTc8
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Time | Label | Presentation Title Authors |
---|---|---|
IP7_3.1 | EFFICIENT IDENTIFICATION OF CRITICAL FAULTS IN MEMRISTOR CROSSBARS FOR DEEP NEURAL NETWORKS Speaker: Ching-Yuan Chen, Duke University, US Authors: Ching-Yuan Chen1 and Krishnendu Chakrabarty2 1Graduate Institute of Electronics Engineering, TW; 2Duke University, US Abstract Deep neural networks (DNNs) are becoming ubiquitous, but hardware-level reliability is a concern when DNN models are mapped to emerging neuromorphic technologies such as memristor-based crossbars. As DNN architectures are inherently fault-tolerant and many faults do not affect inferencing accuracy, careful analysis must be carried out to identify faults that are critical for a given application. We present a misclassification-driven training (MDT) algorithm to efficiently identify critical faults (CFs) in the crossbar. Our results for two DNNs on the CIFAR-10 data set show that MDT can rapidly and accurately identify a large number of CFs, up to 20× faster than a baseline method of forward inferencing with randomly injected faults. We use the set of CFs obtained using MDT and the set of benign faults obtained using forward inferencing to train a machine learning (ML) model to efficiently classify all the crossbar faults in terms of their criticality. We show that the ML model can classify millions of faults within minutes with a remarkably high classification accuracy of over 99%. We present a fault-tolerance solution that exploits this high degree of criticality-classification accuracy, leading to a 93% reduction in the redundancy needed for fault tolerance. |
IP7_3.2 | MODELING, IMPLEMENTATION, AND ANALYSIS OF XRCE-DDS APPLICATIONS IN DISTRIBUTED MULTI-PROCESSOR REAL-TIME EMBEDDED SYSTEMS Speaker: Saeid Dehnavi, Eindhoven University of Technology (TU/e), The Netherlands, NL Authors: Saeid Dehnavi1, Dip Goswami2, Martijn Koedam2, Andrew Nelson3 and Kees Goossens4 1Eindhoven University of Technology (TU/e), NL; 2Eindhoven University of Technology, NL; 3TU Eindhoven, NL; 4Eindhoven University of Technology, NL Abstract The Publish-Subscribe paradigm is a design pattern for transparent communication in many recent distributed applications. Data Distribution Service (DDS) is a machine-to-machine communication standard that aims to provide reliable, high-performance, inter-operable, and real-time data exchange based on the publish–subscribe paradigm. However, the high resource requirements of DDS limit its usage in low-cost embedded systems. XRCE-DDS is a Client-Agent based standard that enables resource-constrained small embedded systems to connect to the DDS global data space. Current XRCE-DDS implementations suffer from dependencies on host operating systems, target only single processing units, and lack performance analysis methods. In this paper, we present a bare-metal implementation of the XRCE-DDS standard on the CompSOC platform as an instance of a Multi-Processor System on Chip (MPSoC). The proposed framework includes a hard real-time side hosting the XRCE-DDS Client, and a soft real-time side hosting the XRCE-DDS Agent. A Scenario Aware Data Flow (SADF) model is proposed to capture the dynamism of the system behavior in terms of different execution scenarios. We analyze the long-term expected value of throughput by capturing the probabilistic scenario switching using a proposed Markov model, which is experimentally validated. |
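The long-term throughput analysis in IP7_3.2 above can be pictured with a tiny worked example: model scenario switching as a Markov chain and weight the per-scenario throughputs by the stationary distribution. The transition probabilities and throughput numbers below are made up for illustration, not taken from the paper.

```python
import numpy as np

# Worked toy version of long-term expected throughput under probabilistic
# scenario switching: two execution scenarios form a Markov chain, and the
# expected throughput is the stationary-weighted average.

P = np.array([[0.9, 0.1],        # P[i, j] = Pr(next scenario j | scenario i)
              [0.4, 0.6]])
thr = np.array([1000.0, 250.0])  # frames/s in scenario 0 and scenario 1

# Stationary distribution pi solves pi = pi P with sum(pi) = 1.
A = np.vstack([P.T - np.eye(2), np.ones(2)])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]

print("stationary distribution:", pi)        # [0.8, 0.2]
print("long-term expected throughput:", pi @ thr)   # 850.0
```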
IP7_4 Interactive Presentations
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/GxFxAcKvzuffC2Kzz
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Time | Label | Presentation Title Authors |
---|---|---|
IP7_4.1 | ANALYZING MEMORY INTERFERENCE OF FPGA ACCELERATORS ON MULTICORE HOSTS IN HETEROGENEOUS RECONFIGURABLE SOCS Speaker: Maxim Mattheeuws, ETH Zürich, CH Authors: Maxim Mattheeuws1, Björn Forsberg2, Andreas Kurth3 and Luca Benini4 1ETH, CH; 2ETH Zürich, CH; 3ETH Zurich, CH; 4Università di Bologna and ETH Zurich, IT Abstract Reconfigurable heterogeneous systems-on-chips (SoCs) integrating multiple accelerators are cost-effective and feature the processing power required for complex embedded applications. However, to enable their usage in real-time settings, it is crucial to control interference on the shared main memory for reliable performance. Interference causes performance degradation due to simultaneous memory requests by components such as CPUs, caches, accelerators, and DMAs. We propose a methodology to characterize the interference to multicore host processors caused by accelerators implemented in the FPGA fabric of reconfigurable heterogeneous SoCs. Based on it, we extend the roofline model to account for the performance degradation of the computing platform. The extended model makes it possible to determine efficiently at which point memory interference becomes critical for a given platform and workload. We apply our methodology to a modern Xilinx UltraScale+ SoC integrating a multicore ARM Cortex-A CPU and a Kintex-grade FPGA. To the best of our knowledge, our results experimentally show for the first time that programs with intensities below 5 flop/byte -- workloads with low cache locality -- can suffer from slowdowns of up to an order of magnitude. |
IP7_4.2 | EMPIRICAL EVIDENCE FOR MPSOCS IN CRITICAL SYSTEMS: THE CASE OF NXP’S T2080 CACHE COHERENCE Speaker: Roger Pujol, Barcelona Supercomputing Center, ES Authors: Roger Pujol1, Hamid Tabani1, Jaume Abella2, Mohamed Hassan3 and Francisco J Cazorla1 1Barcelona Supercomputing Center, ES; 2Barcelona Supercomputing Center (BSC-CNS), ES; 3McMaster University, CA Abstract The adoption of complex MPSoCs in critical real-time embedded systems mandates a detailed analysis of their architecture to facilitate certification. This analysis is hindered by the lack of a thorough understanding of the MPSoC system due to the unobvious and/or insufficiently documented behavior of some key hardware features. Confidence in those features can only be regained by building specific tests to both assess whether their behavior matches specifications and unveil their behavior when it is not fully known a priori. In this work, we introduce a systematic approach that constructs this thorough understanding of the MPSoC architecture, and assesses it against its specification in the processor documentation, with a focus on the cache coherence protocol in the avionics-relevant NXP T2080 architecture as our use case. Our approach covers all transitions in the MESI cache coherence protocol, with emphasis on the coherence between DMA and processing cores. We build evidence of their behavior based on available debug support and performance monitors. Our analysis discloses unexpected behavior for coherence-related notifications as well as some hardware monitors. |
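A back-of-the-envelope version of the interference-extended roofline from IP7_4.1 above: re-evaluate min(peak, intensity × bandwidth) with the bandwidth effectively left to the host once accelerator traffic is subtracted. The numbers and the linear interference model are illustrative assumptions, not measurements from the paper's platform.

```python
# Sketch of an interference-extended roofline (assumed numbers and a naive
# linear interference model): FPGA accelerator traffic reduces the memory
# bandwidth available to the host, lowering the memory-bound roof.

def roofline(intensity, peak_flops, mem_bw):
    return min(peak_flops, intensity * mem_bw)

PEAK = 32e9     # host peak, flop/s (assumed)
BW = 12e9       # unloaded DRAM bandwidth, byte/s (assumed)

for accel_bw in (0.0, 6e9, 10e9):          # traffic injected by the FPGA
    host_bw = BW - accel_bw                # bandwidth left to the host
    perf = roofline(2.0, PEAK, host_bw)    # I = 2 flop/byte: memory-bound
    print(f"accelerator traffic {accel_bw/1e9:4.0f} GB/s "
          f"-> host perf {perf/1e9:5.1f} Gflop/s")
```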
UB.17 University Booth
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/26knQX25zLbDKnyAH
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.17 | MELODI: A MASS E-LEARNING SYSTEM FOR DESIGN, TEST, AND PROTOTYPING OF DIGITAL HARDWARE Speakers: Philipp-Sebastian Vogt and Felix Georg Braun, TU Wien, AT Authors: Philipp-Sebastian Vogt and Felix Georg Braun, TU Wien, AT Abstract Teaching and learning design, test, and prototyping of digital hardware requires substantial resources from both students (tool-chains, licenses, and FPGA development kits) and universities (laboratories, licenses, and substantial human resources). Mass E-Learning Of design, test, and prototyping DIgital hardware (MELODI) provides an efficient and economical full-stack solution to reduce these requirements and enables an effective online learning platform. MELODI is based on our previous experience with E-Learning (the open-source system VELS) and communicates with students via email. It automatically generates randomized tasks (designed by teachers) for the students, evaluates their submissions, and provides feedback to students and teachers. Lastly, it uses partial reconfiguration for efficient resource allocation on an FPGA with which students can interact remotely using a web interface and video stream. Our demonstrator shows MELODI from a task request to the interactive web page. /system/files/webform/tpc_registration/UB.17-MELODI.pdf |
UB.18 University Booth
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/sQeFvMyaDTm9JbbTD
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.18 | NEUROMUSCULAR SYNERGIES BASED CYBER-PHYSICAL PLATFORM FOR THE FAST LACK OF BALANCE RECOGNITION Speaker: Giovanni Mezzina, Politecnico di Bari, IT Authors: Giovanni Mezzina, Sardar Mehboob Hussain and Daniela De Venuto, Politecnico di Bari, IT Abstract This demonstration proposes the preliminary version of a novel pre-impact fall detection (PIFD) strategy. The multi-sensing architecture jointly analyzes the muscular and cortical activity, from 10 EMG electrodes on the lower limbs and 13 EEG sites all along the scalp. Recorded data are numerically treated by an algorithm composed of two main units: the EMG computation branch and the EEG one. The first one has two main roles: (i) it processes the EMGs, translating them into binary signals; (ii) it uses these signals to enable the EEG branch. The EEG computation branch evaluates the rate of variation of the EEG power spectral density, named m, to describe the cortical responsiveness in five bands of interest. The proposed architecture has been validated on five tasks: walking steps, curves, Timed Up&Go (TUG) test, obstacle avoidance, and slip. Experimental validation on 9 subjects showed that the system can identify a potential fall in 370.62 ms, with a sensitivity of 93.33%. /system/files/webform/tpc_registration/UB.18-.pdf |
UB.19 University Booth
Add this session to my calendar
Date: Wednesday, 03 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/QqQrnZXRYodBzZdbH
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.19 | NYUZI: AN OPEN SOURCE GPGPU FOR GRAPHICS, ENHANCED WITH OPENCL COMPILER FOR CALCULATIONS Speaker: Edwin Willegger, GMX, AT Authors: Nima TaheriNejad, Edwin Willegger, Mariusz Wojcik, Markus Kessler, Johannes Blatnik, Ioannis Daktylidis, Jonas Ferdig and Daniel Haslauer, TU Wien, AT Abstract Nyuzi is an open source processor designed for highly parallel, computationally intensive tasks and GPGPU applications. It was inspired by Intel's Larrabee, although the instruction set and the micro architecture are different. Among fully open source GPUs (with soft IPs), Nyuzi provides the most complete tool set. It is the only open source GPGPU with proven support for graphic applications. Moreover, we have recently added OpenCL compilation capabilities to it, enabling it to perform scientific calculations too. Hence, Nyuzi can be used to experiment with microarchitectural and instruction set design trade-offs for both graphic and scientific applications. The project includes a synthesizable hardware design written in System Verilog, an instruction set emulator, an LLVM based C/C++/OpenCL compiler, software libraries, and tests. In this demo, you will see Nyuzi in action: rendering graphics on FPGA and running OpenCL codes. /system/files/webform/tpc_registration/UB.19-Nyuzi-GPGPU.pdf |
9.1 Autonomous Systems Design: Opening Panel
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 07:00 CET - 08:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/ZjsTrqcyv7ocgh8y9
Session chair:
Rolf Ernst, TU Braunschweig, DE
Session co-chair:
Selma Saidi, TU Dortmund, DE
Organizers:
Rolf Ernst, TU Braunschweig, DE
Selma Saidi, TU Dortmund, DE
Fueled by the progress of artificial intelligence, autonomous systems become more and more integral parts of many Internet-of-Things (IoT) and Cyber-Physical Systems (CPS) applications, such as automated driving, robotics, avionics and industrial automation. Autonomous systems are self-governed and self-adaptive systems that are designed to operate in an open and evolving environment that has not been completely defined at design time. This poses a unique challenge to the design and verification of dependable autonomous systems. In this opening session of the DATE Special Initiative on Autonomous Systems Design, industry leaders will talk about their visions of autonomous systems, the challenges they see in the development of autonomous systems as well as how autonomous systems will impact the business in their industries. These inputs will be discussed in an open floor panel. Panelists: - Thomas Kropf (Robert Bosch GmbH) - Pascal Traverse (Airbus) - Juergen Bortolazzi (Porsche AG) - Peter Liggesmeyer (Fraunhofer IESE) - Joseph Sifakis (University of Grenoble/VERIMAG) - Sandeep Neema (DARPA)
Time | Label | Presentation Title Authors |
---|---|---|
07:00 CET | 9.1.1 | AUTONOMOUS SYSTEMS DESIGN: OPENING PANEL Panelists: Rolf Ernst1, Selma Saidi2, Thomas Kropf3, Pascal Traverse4, Juergen Bortolazzi5, Peter Liggesmeyer6, Joseph Sifakis7 and Sandeep Neema8 1TU Braunschweig, DE; 2TU Dortmund, DE; 3Robert Bosch GmbH, DE; 4Airbus, FR; 5Porsche AG, DE; 6Fraunhofer IESE, DE; 7University of Grenoble/VERIMAG, FR; 8DARPA, US Abstract Fueled by the progress of artificial intelligence, autonomous systems become more and more integral parts of many Internet-of-Things (IoT) and Cyber-Physical Systems (CPS) applications, such as automated driving, robotics, avionics and industrial automation. Autonomous systems are self-governed and self-adaptive systems that are designed to operate in an open and evolving environment that has not been completely defined at design time. This poses a unique challenge to the design and verification of dependable autonomous systems. In this opening session of the DATE Special Initiative on Autonomous Systems Design, industry leaders will talk about their visions of autonomous systems, the challenges they see in the development of autonomous systems as well as how autonomous systems will impact the business in their industries. These inputs will be discussed in an open floor panel. Panelists: - Thomas Kropf (Robert Bosch GmbH) - Pascal Traverse (Airbus) - Juergen Bortolazzi (Porsche AG) - Peter Liggesmeyer (Fraunhofer IESE) - Joseph Sifakis (University of Grenoble/VERIMAG) - Sandeep Neema (DARPA) |
9.2 IP Protection
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 07:00 CET - 07:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/2FaTQWjerFFsifGrE
Session chair:
Johann Knechtel, NYU Abu Dhabi, AE
Session co-chair:
Elif Bilge Kavun, University of Passau, DE
This session deals with methods for IP protection, which is especially important in industrial applications. The papers cover novel defensive techniques against model checking and SAT-based attacks on logic encryption, as well as cost-effective design methods for IC locking. Furthermore, a new method for hiding critical IC parts via soft eFPGA insertion is also presented.
Time | Label | Presentation Title Authors |
---|---|---|
07:00 CET | 9.2.1 | FA-SAT: FAULT-AIDED SAT-BASED ATTACK ON COMPOUND LOGIC LOCKING TECHNIQUES Speaker: Nimisha Limaye, New York University, US Authors: Nimisha Limaye1, Satwik Patnaik2 and Ozgur Sinanoglu3 1New York University, US; 2New York University, US; 3New York University Abu Dhabi, AE Abstract Logic locking has received significant traction as a one-stop solution to thwart attacks at an untrusted foundry, test facility, and end-user. Compound locking schemes were proposed that integrate a low-corruption and a high-corruption locking technique to circumvent both tailored SAT-based and structural-analysis-based attacks. In this paper, we propose Fa-SAT, a generic attack framework that builds on the existing, open-source SAT tool to attack compound locking techniques. We consider the recently proposed bilateral logic encryption (BLE) and Anti-SAT coupled with random logic locking as case studies to showcase the efficacy of our proposed approach. Since the SAT-based attack alone cannot break these defenses, we integrate a fault-injection-based process into the SAT attack framework to successfully expose the logic added for locking and obfuscation. Our attack can circumvent these schemes' security guarantees with 100% success across multiple trials of designs from diverse benchmark suites (ISCAS-85, MCNC, and ITC-99) synthesized with industry-standard tools for different key sizes. Finally, we make our attack framework (as a web interface) and associated benchmarks available to the research community. |
07:15 CET | 9.2.2 | A COGNITIVE SAT TO SAT-HARD CLAUSE TRANSLATION-BASED LOGIC OBFUSCATION Speaker: Rakibul Hassan, George Mason University, US Authors: Rakibul Hassan1, Gaurav Kolhe1, Setareh Rafatirad1, Houman Homayoun2 and Sai Manoj Pudukotai Dinakarrao1 1George Mason University, US; 2University of California Davis, US Abstract Logic obfuscation is introduced as a pivotal defense mechanism against emerging hardware threats on Integrated Circuits (ICs) such as reverse engineering (RE) and intellectual property (IP) theft. The effectiveness of logic obfuscation is challenged by the recently introduced Boolean satisfiability (SAT) attack and its variants. A plethora of countermeasures have been proposed to thwart these attacks. Irrespective of the implemented defenses, large power, performance, and area (PPA) overheads are seen to be indispensable. In contrast, we propose a neural network-based cognitive SAT to SAT-hard clause translator under the constraint of minimal PPA overheads, while preserving the original functionality with impenetrable security. Our proposed method is incubated with a SAT-hard clause generator that translates the existing conjunctive normal form (CNF) through minimal perturbations, such as the inclusion of a pair of inverters or buffers, or adding a new lightweight SAT-hard block depending on the provided CNF. For efficient SAT-hard clause generation, the proposed method is equipped with a multi-layer neural network that first learns the dependencies of features (literals and clauses), followed by a long short-term memory (LSTM) network to validate and backpropagate the SAT-hardness for better learning and translation. For a fair comparison with the state-of-the-art, we evaluate our proposed technique on ISCAS'85 and ISCAS'89 benchmarks. It is seen to successfully defend against multiple state-of-the-art SAT attacks devised for hardware RE with minimal overheads. |
07:30 CET | IP8_3.1 | SEQUENTIAL LOGIC ENCRYPTION AGAINST MODEL CHECKING ATTACK Speaker: Amin Rezaei, California State University, Long Beach, US Authors: Amin Rezaei1 and Hai Zhou2 1California State University, Long Beach, US; 2Northwestern University, US Abstract Due to high IC design costs and emergence of countless untrusted foundries, logic encryption has been taken into consideration more than ever. In state-of-the-art logic encryption works, a lot of performance is sold to guarantee security against both the SAT-based and the removal attacks. However, the SAT-based attack cannot decrypt the sequential circuits if the scan chain is protected or if the unreachable states encryption is adopted. Instead, these security schemes can be defeated by the model checking attack that searches iteratively for different input sequences to put the activated IC to the desired reachable state. In this paper, we propose a practical logic encryption approach to defend against the model checking attack on sequential circuits. The robustness of the proposed approach is demonstrated by experiments on around fifty benchmarks. |
07:31 CET | IP8_3.2 | RISK-AWARE COST-EFFECTIVE DESIGN METHODOLOGY FOR INTEGRATED CIRCUIT LOCKING Speaker: Yinghua Hu, University of Southern California, US Authors: Yinghua Hu, Kaixin Yang, Subhajit Dutta Chowdhury and Pierluigi Nuzzo, University of Southern California, US Abstract We introduce a systematic framework for logic locking of integrated circuits based on the analysis of the sources of information leakage from both the circuit and the locking scheme and their formalization into a notion of risk that can guide the design against existing and possible future attacks. We further propose a two-level optimization-based methodology to generate locking strategies minimizing a cost function and balancing security, risk, and implementation overhead, out of a collection of locking primitives. Optimization results on a set of case studies show the potential of layering multiple locking primitives to provide high security at significantly lower risk. |
07:32 CET | 9.2.3 | HARDWARE REDACTION VIA DESIGNER-DIRECTED FINE-GRAINED SOFT EFPGA INSERTION Speaker: Prashanth Mohan, Carnegie Mellon University, US Authors: Prashanth Mohan, Oguz Atli, Joseph Sweeney, Onur Kibar, Larry Pileggi and Ken Mai, Carnegie Mellon University, US Abstract In recent years, IC reverse engineering and IC fabrication supply chain security have grown to become significant economic and security threats for designers, system integrators, and end customers. Many of the existing logic locking and obfuscation techniques have been shown to be vulnerable to attack once the attacker has access to the design netlist, either through reverse engineering or through an untrusted fabrication facility. We introduce soft embedded FPGA redaction, a hardware obfuscation approach that allows the designer to substitute security-critical IP blocks within a design with a synthesizable eFPGA fabric. This method fully conceals the logic and the routing of the critical IP and is compatible with standard ASIC flows for easy integration and process portability. To demonstrate eFPGA redaction, we obfuscate a RISC-V control path and a GPS P-code generator. We also show that the modified netlists are resilient to SAT attacks with moderate VLSI overheads. The secure RISC-V design has 1.89x area and 2.36x delay overhead while the GPS design has 1.39x area and negligible delay overhead when implemented on an industrial 22nm FinFET CMOS process. |
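As background for this session, the following is a minimal Python sketch of the oracle-guided attack loop underlying the SAT-based attacks discussed in 9.2.1 and 9.2.2: repeatedly find a distinguishing input pattern (DIP) on which the surviving key candidates disagree, query an activated (unlocked) chip, and prune the keys inconsistent with its response. The two-gate locked circuit and the brute-force search (standing in for a SAT solver) are illustrative toys, not any scheme or tool from the papers.

```python
from itertools import product

def locked(x, k):
    # Toy key-locked circuit: two XOR key gates feeding an AND.
    # (Purely illustrative; real attacks operate on gate-level netlists.)
    return (x[0] ^ k[0]) & (x[1] ^ k[1])

def oracle(x):
    # An activated chip holding the correct (secret) key.
    return locked(x, (0, 1))

def sat_attack(n_in=2, n_key=2):
    keys = set(product((0, 1), repeat=n_key))
    while len(keys) > 1:
        # DIP: an input on which the remaining key candidates disagree;
        # a SAT solver finds these on real netlists, brute force
        # suffices for this toy.
        dip = next((x for x in product((0, 1), repeat=n_in)
                    if len({locked(x, k) for k in keys}) > 1), None)
        if dip is None:
            break  # all surviving keys are functionally equivalent
        y = oracle(dip)  # query the unlocked chip
        keys = {k for k in keys if locked(dip, k) == y}
    return keys

print(sat_attack())  # -> {(0, 1)}: only the correct key survives
```

On real netlists the candidate-key set is kept implicit in a miter circuit and each DIP is found by one SAT call, which is what makes the attack scale.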
9.3 Heterogeneous Architectures and Design Space Exploration
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 07:00 CET - 07:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/uQmiy3GQzGHhTZjvC
Session chair:
Lars Bauer, Karlsruhe Institute of Technology, DE
Session co-chair:
Lana Josipović, EPFL, CH
The session is devoted to design space exploration for heterogeneous coarse-grained architectures, optimizing the mapping of large applications to platforms ranging from coarse-grained reconfigurable arrays and accelerator-rich designs to CPU/GPU systems. The first paper presents a hierarchical algorithm that maps multi-dimensional kernels to CGRAs, where individual iterations are first mapped to a virtual systolic array serving as an intermediate abstraction layer that exploits regularity. The second paper explores the design space for applications mapped onto smaller accelerator blocks, maximizing accelerator reuse by considering the similarity between the mapped functionality. The third paper extends a Unified Virtual Memory simulation tool with a dynamically adaptive engine that identifies CPU-GPU interconnect traffic patterns in order to choose the best existing memory management policy and trigger policy switching accordingly. A toy sketch of such pattern-driven policy switching follows this session's listing.
Time | Label | Presentation Title Authors |
---|---|---|
07:00 CET | 9.3.1 | HIMAP: FAST AND SCALABLE HIGH-QUALITY MAPPING ON CGRA VIA HIERARCHICAL ABSTRACTION Speaker: Dhananjaya Wijerathne, National University of Singapore, SG Authors: Dhananjaya Wijerathne1, Zhaoying Li1, Anuj Pathania2, Tulika Mitra1 and Lothar Thiele3 1National University of Singapore, SG; 2University of Amsterdam, NL; 3ETH Zurich, CH Abstract Coarse-Grained Reconfigurable Array (CGRA) has emerged as a promising hardware accelerator due to the excellent balance among reconfigurability, performance, and energy efficiency. The CGRA performance strongly depends on a high-quality compiler to map the application kernels on the architecture. Unfortunately, the state-of-the-art compilers fall short in generating high-quality mapping within an acceptable compilation time, especially with increasing CGRA size. We propose HiMap -- a fast and scalable CGRA mapping approach -- that is also adept at producing close-to-optimal solutions for multi-dimensional kernels prevalent in existing and emerging application domains. The key strategy behind HiMap's efficiency and scalability is to exploit the regularity in the loop iteration dependencies by employing a virtual systolic array as an intermediate abstraction layer in a hierarchical mapping. Experimental results confirm that HiMap can generate application mappings that hit the performance envelope of the CGRA. HiMap offers 17.3x and 5x improvement in performance and energy efficiency of the mappings compared to the state-of-the-art. The compilation time of HiMap for near-optimal mappings is less than 15 minutes for a 64x64 CGRA while existing approaches take days to generate inferior mappings. |
07:15 CET | 9.3.2 | MG-DMDSE: MULTI-GRANULARITY DOMAIN DESIGN SPACE EXPLORATION CONSIDERING FUNCTION SIMILARITY Speaker: Jinghan Zhang, Northeastern University, US Authors: Jinghan Zhang1, Aly Sultan1, Hamed Tabkhi2 and Gunar Schirner1 1Northeastern University, US; 2UNC Charlotte, US Abstract Heterogeneous accelerator-rich (ACC-rich) platforms combining general-purpose cores and specialized HW accelerators (ACCs) promise high-performance and low-power streaming application deployments in a variety of domains such as video analytics and software-defined radio. In order to benefit a domain of applications, a domain platform exploration tool must take advantage of structural and functional similarities across applications by allocating a common set of ACCs. A previous approach, the GenetIc Domain Exploration tool (GIDE), applied a restrictive binding algorithm that mapped application functions to monolithic accelerators, which resulted in lower average application throughput across applications and reduced platform generality. This paper introduces a Multi-Granularity based Domain Design Space Exploration tool (MG-DmDSE) that makes the generated platform efficient for many more applications, improving both average application throughput and platform generality. To achieve that goal, the key contributions of MG-DmDSE are: (1) applying a multi-granular decomposition of coarse-grain application functions into more granular compute kernels; (2) examining compute similarity between functions in order to produce more generic functions; (3) configuring monolithic ACCs by selectively bypassing compute elements within them during DSE to expose more functionality. To assess MG-DmDSE, both GIDE and MG-DmDSE were applied to applications in the OpenVX library. MG-DmDSE achieves an average 2.84x greater application throughput compared to GIDE. Additionally, 87.5% of applications benefited from running on the platform produced by MG-DmDSE vs 50% for GIDE, indicating increased platform generality. |
07:30 CET | IP8_2.1 | FORMULATION OF DESIGN SPACE EXPLORATION PROBLEMS BY COMPOSABLE DESIGN SPACE IDENTIFICATION Speaker: Rodolfo Jordão, KTH Royal Institute of Technology, SE Authors: Rodolfo Jordão, Ingo Sander and Matthias Becker, KTH Royal Institute of Technology, SE Abstract Design space exploration (DSE) is a key activity in embedded system design methodologies and can be supported by well-defined models of computation (MoCs) and predictable platform architectures. The original design model, covering the application models, platform models and design constraints, needs to be converted into a form analyzable by computer-aided decision procedures such as mathematical programming or genetic algorithms. This conversion is the process of design space identification (DSI), which becomes very challenging if the design domain comprises several MoCs and platforms. For a systematic solution to this problem, separation of concerns between the design domain and decision domain is of key importance. We propose in this paper a systematic DSI scheme that is (a) composable, as it enables the stepwise and simultaneous extension of both design and decision domain, and (b) tuneable, because it also enables different DSE solving techniques given the same design model. We exemplify this DSI scheme by an illustrative example that demonstrates the mechanisms for composition and tuning. Additionally, we show how different compositions can lead to the same decision model as an important property of this DSI scheme. |
07:31 CET | IP8_2.2 | RTL DESIGN FRAMEWORK FOR EMBEDDED PROCESSOR BY USING C++ DESCRIPTION Speaker: Eiji Yoshiya, Tokyo Institute of Technology, JP Authors: Eiji Yoshiya, Tomoya Nakanishi and Tsuyoshi Isshiki, Tokyo Institute of Technology, JP Abstract In this paper, we propose a method to directly describe the RTL structure of a pipelined RISC-V processor with cache, memory management unit (MMU) and AXI bus interface using the C++ language. This processor C++ model serves as a near cycle-accurate simulation model of the RISC-V core, while our C2RTL framework translates the processor C++ model into a cycle-accurate RTL description in Verilog-HDL and an RTL-equivalent C model. Our design methodology is unique compared to other existing methodologies since both the simulation model and the RTL model are derived from the same C++ source, which greatly simplifies the design verification and optimization processes. The effectiveness of our design methodology is demonstrated on a RISC-V processor running the Linux OS on an FPGA board, as well as by the significantly shorter simulation times of the original C++ processor model and the RTL-equivalent C model compared to a commercial RTL simulator. |
07:32 CET | 9.3.3 | AN ADAPTIVE FRAMEWORK FOR OVERSUBSCRIPTION MANAGEMENT IN CPU-GPU UNIFIED MEMORY Speaker: Debashis Ganguly, Department of Computer Science, School of Computing and Information, University of Pittsburgh, US Authors: Debashis Ganguly1, Rami Melhem1 and Jun Yang2 1Department of Computer Science, School of Computing and Information, University of Pittsburgh, US; 2Electrical and Computer Engineering Department, University of Pittsburgh, US Abstract Hardware support for fault-driven page migration and on-demand memory allocation, along with the advancements in the unified memory runtime of modern graphics processing units (GPUs), simplifies memory management in discrete CPU-GPU heterogeneous memory systems and ensures higher programmability. GPUs are adopted to accelerate general-purpose applications and are now an integral part of heterogeneous computing platforms ranging from supercomputers to commodity clouds. However, data-intensive applications face the challenge of device-memory oversubscription, as the limited capacity of bandwidth-optimized GPU memory fails to accommodate their increasing working sets. The performance overhead under memory oversubscription comes from the thrashing of memory pages over the slow CPU-GPU interconnect. Depending on its computing and memory access pattern, each application demands special attention from memory management. As a result, the responsibility of effectively utilizing the plethora of memory management techniques supported by GPU programming libraries and runtimes falls squarely on the application programmer. This paper presents a smart runtime that leverages fault and page-migration information to detect underlying patterns in CPU-GPU interconnect traffic. Based on this online workload characterization, the extended unified memory runtime dynamically chooses and employs a suitable policy from a wide array of memory management strategies to address the issues with memory oversubscription. Experimental evaluation shows that this smart adaptive runtime provides 18% and 30% (geometric mean) performance improvement across all benchmarks compared to the default unified memory runtime under 125% and 150% device memory oversubscription, respectively. |
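To make the policy-switching idea of 9.3.3 concrete, here is a deliberately simplified sketch of a runtime that classifies fault traffic and picks a memory management policy. The class name, policy names, window size, and threshold are all hypothetical illustrations, not the paper's actual runtime or any CUDA unified-memory API.

```python
class AdaptiveUVMRuntime:
    """Toy stand-in for an adaptive unified-memory runtime."""

    def __init__(self, window=1000, reuse_threshold=0.5):
        self.window = window              # faults per decision epoch
        self.reuse_threshold = reuse_threshold
        self.faults = []                  # page IDs that faulted
        self.policy = "default"

    def record_fault(self, page):
        self.faults.append(page)
        if len(self.faults) >= self.window:
            self.policy = self._classify()
            self.faults.clear()

    def _classify(self):
        # High reuse: the same pages thrash back and forth over the
        # interconnect, so pin the hot pages in device memory.
        # Low reuse: streaming access, so prefetch ahead / pre-evict.
        reuse = 1.0 - len(set(self.faults)) / len(self.faults)
        return ("pin-hot-pages" if reuse > self.reuse_threshold
                else "prefetch-and-pre-evict")

rt = AdaptiveUVMRuntime(window=4)
for page in (7, 7, 7, 7):                 # thrashing on one page
    rt.record_fault(page)
print(rt.policy)                          # -> pin-hot-pages
```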
9.4 Analog layout conquers new frontiers
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 07:00 CET - 07:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/XJahjaABdT3x4yQqd
Session chair:
Helmut Graeb, TUM, DE
Session co-chair:
Salvador Mir, TIMA, FR
This session presents various exciting developments in analog layout. The first paper presents a new approach to analog layout that combines placement and routing. The second paper challenges the current treatment of matching in common-centroid and interdigitated layouts. The third paper deals with the design of layout primitives in FinFET technologies. A small common-centroid placement sketch follows this session's listing.
Time | Label | Presentation Title Authors |
---|---|---|
07:00 CET | 9.4.1 | PERFORMANCE-DRIVEN ROUTING METHODOLOGY WITH INCREMENTAL PLACEMENT REFINEMENT FOR ANALOG LAYOUT DESIGN Speaker: Hao-Yu Chi, National Chiao Tung University, TW Authors: Hao-Yu Chi1, Han-Chung Chang1, Chih-Hsin Yang2, Chien-Nan Liu1 and Jing-Yang Jou2 1National Chiao Tung University, TW; 2National Central University, TW Abstract Analog layout is often considered a difficult task because many layout-dependent effects impact final circuit performance. In the literature, many automation techniques have been proposed for analog placement and routing respectively. However, very few works are able to consider the two steps simultaneously to obtain the best performance and cost after layout. Most routing-aware placement techniques optimize the layout based on an assumed routing result, which may be quite different from the final layout. In this work, we propose an automatic two-step layout methodology for analog circuits to alleviate the performance loss during the layout process. Instead of using a rough routing prediction during the placement stage, a crossing-aware global routing technique is first performed to provide an accurate routing resource estimation of the given compact placement. Then, an improved CDL-based layout migration technique is adopted to make fast adjustments to the placement and routing, reducing the difference between the estimation and the final layout while keeping the optimality of the given placement. As shown in the experimental results, the proposed methodology improves the accuracy of routing resource estimation and thus the final layout quality and circuit performance. |
07:15 CET | 9.4.2 | COMMON-CENTROID LAYOUTS FOR ANALOG CIRCUITS: ADVANTAGES AND LIMITATIONS Speaker: Arvind Kumar Sharma, University of Minnesota, US Authors: Arvind Kumar Sharma1, Meghna Madhusudan1, Steven M. Burns2, Parijat Mukherjee2, Soner Yaldiz2, Ramesh Harjani1 and Sachin S. Sapatnekar1 1University of Minnesota, US; 2Intel Corporation, US Abstract Common-centroid (CC) layouts are widely used in analog design to make circuits resilient to variations by matching device characteristics. However, CC layout may involve increased routing complexity and higher parasitics than other alternative layout schemes. This paper critically analyzes the fundamental assumptions behind the use of common-centroid layouts, incorporating considerations related to systematic and random variations as well as the performance impact of common-centroid layout. Based on this study, conclusions are drawn on when CC layout styles can reduce variation, improve performance (even if they do not reduce variation), and when non-CC layouts are preferable. |
07:30 CET | IP8_4.2 | OPTIMIZED MULTI-MEMRISTOR MODEL BASED LOW ENERGY AND RESILIENT CURRENT-MODE MULTIPLIER DESIGN Speaker: Shengqi Yu, Newcastle University, GB Authors: Shengqi Yu1, Rishad Shafik2, Thanasin Bunnam2, Kaiyun Chen2 and Alex Yakovlev2 1Newcastle University, GB; 2Newcastle University, GB Abstract Multipliers are central to modern arithmetic-heavy applications, such as signal processing and artificial intelligence (AI). However, the complex logic chain in conventional multipliers, particularly due to cascaded carry propagation circuits, contributes to high energy and performance costs. This paper proposes a novel current-mode multiplier design that reduces the carry propagation chain and improves current amplification. Fundamental to this design is a one-transistor, multiple-memristor (1TxM) cell architecture. In each cell, the transistor can be switched ON/OFF to determine cell selection, while the memristor states determine the cell's output current when selected. The high/low resistive states as well as the biasing configurations of each memristor are suitably optimized through a new memristor model. The number of memristors in each cell is determined by the significance of the cell's current path, so as to achieve the required amplification. Consequently, the design reduces the need for current mirror circuits in each current path, while also ensuring high resilience to transitional bias voltages. Parallel cell currents are then directed to a common current accumulation path to generate the multiplier output without requiring any carry propagation chain. We carried out a wide range of experiments to extensively validate our multiplier design in the Cadence Virtuoso analogue design environment for functional and parametric properties. The results show that the proposed multiplier reduces latency by up to 84.9% and energy cost by up to 98.5% compared with recently proposed approaches. |
07:31 CET | 9.4.3 | ANALOG LAYOUT GENERATION USING OPTIMIZED PRIMITIVES Speaker: Meghna Madhusudan, University of Minnesota, Twin Cities, US Authors: Meghna Madhusudan1, Arvind Kumar Sharma1, Yaguang Li2, Jiang Hu2, Sachin S. Sapatnekar1 and Ramesh Harjani1 1University of Minnesota, US; 2Texas A&M University, US Abstract Hierarchical analog layout generators proceed from leaf cells (“primitives”) to progressively larger blocks that are placed and routed. The quality of primitive cell layout is critical for design performance. This paper proposes a methodology that defines and optimizes the performance metrics of primitives during leaf cell layout. It incorporates layout parasitics and layout-dependent effects, providing a set of optimized layout choices for use by the place-and-route engine, as well as wire sizing guidelines for connections outside the cell. For FinFET-based designs of a high-frequency amplifier, a StrongARM comparator, and a fully differential VCO, our approach outperforms existing methods and is competitive with time-intensive manual layout. |
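As a companion to the matching discussion in 9.4.2, here is a small sketch of common-centroid pattern generation: two matched devices A and B are split into unit cells placed point-symmetrically about the array centre, so both devices share the same centroid and linear process gradients cancel to first order. This is a textbook construction, not a tool from the session's papers.

```python
def common_centroid(rows, cols):
    """Assign unit cells of devices A and B in a common-centroid pattern."""
    assert (rows * cols) % 2 == 0, "need an even number of unit cells"
    grid = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is None:
                # Assign a device to (r, c) and to its point-symmetric
                # twin, so each device's centroid sits at array centre.
                dev = "A" if (r + c) % 2 == 0 else "B"
                grid[r][c] = dev
                grid[rows - 1 - r][cols - 1 - c] = dev
    return grid

for row in common_centroid(2, 4):
    print(" ".join(row))   # ->  A B A B
                           #     B A B A
```

As 9.4.2 argues, the matching benefit of such patterns has to be weighed against the routing complexity and parasitics they introduce.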
9.5 Energy-Efficient and Thermal Aware Systems for Machine Learning
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 07:00 CET - 07:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/iYaQRwLcwoBLG9Y8X
Session chair:
Pascal Vivet, CEA-Leti, FR
Session co-chair:
Shao-Yun Fang, National Taiwan University of Science and Technology, TW
This session highlights energy-efficient and thermal-aware system solutions using advanced technologies and circuit-level techniques. The first paper introduces an energy-efficient accelerator for deep learning using non-volatile memory and monolithic 3D technologies, the second paper reports thermal-aware placement for 2.5D chiplet-based architectures, and the last paper presents circuit-level techniques combining critical-path manipulation and approximate computing for low-voltage operation. A back-of-the-envelope sketch of a thermal utilization bound follows this session's listing.
Time | Label | Presentation Title Authors |
---|---|---|
07:00 CET | 9.5.1 | MARVEL: A VERTICAL RESISTIVE ACCELERATOR FOR LOW-POWER DEEP LEARNING INFERENCE IN MONOLITHIC 3D Speaker: Fan Chen, Duke University, US Authors: Fan Chen1, Linghao Song1, Hai (Helen) Li2 and Yiran Chen1 1Duke University, US; 2Duke University/TUM-IAS, US Abstract Resistive memory (ReRAM) based Deep Neural Network (DNN) accelerators have achieved state-of-the-art DNN inference throughput. However, the power efficiency of such resistive accelerators is greatly limited by their peripheral circuitry, including analog-to-digital converters (ADCs), digital-to-analog converters (DACs), SRAM registers, and eDRAM buffers. These power-hungry components consume 87% of the total system power, despite the high power efficiency of the ReRAM computing cores. In this paper, we propose MARVEL, a monolithic 3D stacked resistive DNN accelerator, which consists of carbon nanotube field-effect transistor (CNFET) based low-power ADCs/DACs, CNFET logic, CNFET SRAM, and high-density global buffers implemented by cross-point Spin Transfer Torque Magnetic RAM (STT-MRAM). To compensate for the loss of inference throughput incurred by the slow CNFET ADCs, we propose to integrate more ADC layers into MARVEL. Unlike CMOS-based ADCs, which can only be implemented in the bottom layer of the 3D structure, multiple CNFET layers can be implemented using a monolithic 3D stacking technique. Compared to prior ReRAM-based DNN accelerators, on average, MARVEL achieves the same inference throughput with a 4.5x improvement in performance per watt. We also demonstrate that increasing the number of integration layers enables MARVEL to further achieve 2x inference throughput with 7.6x improved power efficiency. |
07:15 CET | 9.5.2 | TAP-2.5D: A THERMALLY-AWARE CHIPLET PLACEMENT METHODOLOGY FOR 2.5D SYSTEMS Speaker: Ajay Joshi, Boston University, US Authors: Yenai Ma1, Leila Delshadtehrani1, Cansu Demirkiran1, Jose L. Abellan2 and Ajay Joshi1 1Boston University, US; 2Universidad Católica San Antonio de Murcia, ES Abstract Heterogeneous systems are commonly used today to sustain the historic benefits we have achieved through technology scaling. 2.5D integration technology provides a cost-effective solution for designing heterogeneous systems. The traditional physical design of a 2.5D heterogeneous system closely packs the chiplets to minimize wirelength, but this leads to a thermally-inefficient design. We propose TAP-2.5D: the first open-source network routing and thermally-aware chiplet placement methodology for heterogeneous 2.5D systems. TAP-2.5D strategically inserts spacing between chiplets to jointly minimize the temperature and total wirelength, and in turn, increases the thermal design power envelope of the overall system. We present three case studies demonstrating the usage and efficacy of TAP-2.5D. |
07:30 CET | IP8_4.1 | THERMAL-AWARE DESIGN AND MANAGEMENT OF EMBEDDED REAL-TIME SYSTEMS Speaker and Author: Youngmoon Lee, Hanyang University, KR Abstract Modern embedded systems face challenges in managing on-chip temperature as they are increasingly realized on powerful system-on-chips. This paper presents thermal-aware design and management of embedded systems by tightly coupling two mechanisms: a thermal-aware utilization bound and real-time dynamic thermal management. The former provides the processor-utilization upper bound that meets the chip temperature constraint, which depends not only on the system configuration and workload but also on the chip's cooling capacity and environment. The latter adaptively optimizes the rates of individual task executions subject to the thermal-aware utilization bound. Our experiments on an automotive controller demonstrate the thermal-aware utilization bound and an 18.2% improvement in system utilization compared with existing approaches. |
07:31 CET | IP8_1.1 | (Best Paper Award Candidate) PREDICTION OF THERMAL HAZARDS IN A REAL DATACENTER ROOM USING TEMPORAL CONVOLUTIONAL NETWORKS Speaker: Mohsen Seyedkazemi Ardebili, University of Bologna, IT Authors: Mohsen Seyedkazemi Ardebili1, Marcello Zanghieri1, Alessio Burrello2, Francesco Beneventi3, Andrea Acquaviva4, Luca Benini5 and Andrea Bartolini1 1University of Bologna, IT; 2Department of Electrical and Electronic Engineering, University of Bologna, IT; 3DEI - University of Bologna, IT; 4Politecnico di Torino, IT; 5Università di Bologna and ETH Zurich, IT Abstract Datacenters play a vital role in today's society. At large, a datacenter room is a complex controlled environment composed of thousands of computing nodes, which consume kilowatts of power. To dissipate the power, forced air/liquid flow is employed, at a cost of millions of euros per year. Reducing this cost involves using free cooling and average-case design, which can create a cooling shortage and thermal hazards. When a thermal hazard happens, the system administrators and the facility manager must stop production to avoid IT equipment damage and wear-out. In this paper, we study the signatures of thermal hazards in a Tier-0 datacenter room's monitored data during a full year of production. We define a set of rules for detecting thermal hazards based on the inlet and outlet temperatures of all nodes in a room. We then propose a custom Temporal Convolutional Network (TCN) to predict the hazards in advance. The results show that our TCN can predict thermal hazards with an F1-score of 0.98 for a randomly sampled test set. When causality is enforced between the training and validation sets, the F1-score drops to 0.74, demanding in-place online re-training of the network, which motivates further research in this context. |
07:32 CET | 9.5.3 | CRITICAL PATH ISOLATION AND BIT-WIDTH SCALING ARE HIGHLY COMPATIBLE FOR VOLTAGE OVER-SCALABLE DESIGN Speaker: Yutaka Masuda, Nagoya University, JP Authors: Yutaka Masuda1, Jun Nagayama2, TaiYu Cheng3, Tohru Ishihara1, Yoichi Momiyama2 and Masanori Hashimoto3 1Nagoya University, JP; 2Socionext Inc., JP; 3Osaka University, JP Abstract This work proposes a design methodology that saves power under voltage over-scaling (VOS) operation. The key idea of the proposed methodology is to combine critical path isolation (CPI) and bit-width scaling (BWS) under a constraint on computational quality, e.g., Peak Signal-to-Noise Ratio (PSNR). Conventional CPI inherently cannot reduce the delay of intrinsic critical paths (CPs), which may significantly restrict the power saving effect. The proposed methodology, in contrast, reduces both intrinsic and non-intrinsic CPs. Therefore, our design dramatically reduces the supply voltage and power dissipation while satisfying the quality constraint. Moreover, to reduce the co-design exploration space, the proposed methodology exploits the exclusiveness of the paths targeted by CPI and BWS: CPI aims at reducing the minimum supply voltage of non-intrinsic CPs, while BWS focuses on intrinsic CPs in arithmetic units. Thanks to this exclusiveness, the proposed design splits the simultaneous optimization problem into three sub-problems: (1) determining the bit-width reduction, (2) optimizing the timing of non-intrinsic CPs, and (3) finding the minimum supply voltage of the BWS- and CPI-applied circuit under the quality constraint. This splitting lets the methodology efficiently find a quality-constrained minimum-power design. Evaluation results show that CPI and BWS are highly compatible and significantly enhance the efficacy of VOS. In a case study of a GPGPU processor, the proposed design reduces power dissipation by 42.7% for an image-processing workload and by 51.2% for a neural-network inference workload. |
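As a back-of-the-envelope companion to the thermal-aware utilization bound in IP8_4.1, the sketch below computes the largest utilization that keeps steady-state chip temperature under a limit, assuming a lumped model T = T_amb + R_th * (P_idle + u * P_dyn). The model and every parameter value are illustrative assumptions, not the paper's analysis.

```python
def utilization_bound(t_max, t_amb, r_th, p_idle, p_dyn):
    """Largest utilization u in [0, 1] keeping steady-state T <= t_max.

    Assumes T = t_amb + r_th * (p_idle + u * p_dyn); temperatures in
    degrees Celsius, r_th in K/W, powers in watts. Values are made up.
    """
    u = (t_max - t_amb - r_th * p_idle) / (r_th * p_dyn)
    return max(0.0, min(1.0, u))

# 85 C junction limit, 55 C ambient, 8 K/W package, 2 W idle, 3 W dynamic
print(utilization_bound(85, 55, 8.0, 2.0, 3.0))   # -> ~0.583
```

A real bound additionally depends on workload and scheduling, which is exactly what the paper's coupling with dynamic thermal management addresses.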
9.6 System-level Security
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 07:00 CET - 07:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/HodD6RRekGnEAnGAK
Session chair:
Aydin Aysu, NC State University, US
Session co-chair:
Lionel Torres, University of Montpellier, FR
The session focuses on security from a high-level perspective: a practical approach to watermarking behavioral HLS descriptions targeting ASIC design, a software implementation of NTRUEncrypt (one of the finalists in the NIST PQC competition), a Private Membership Test (PMT) with low computational complexity using Bloom filters and homomorphic encryption, and an intra-process memory protection mechanism for RISC-V. A plain Bloom-filter sketch of the membership-test mechanics follows this session's listing.
Time | Label | Presentation Title Authors |
---|---|---|
07:00 CET | 9.6.1 | WATERMARKING OF BEHAVIORAL IPS: A PRACTICAL APPROACH Speaker: Jianqi Chen, University of Texas at Dallas, US Authors: Jianqi Chen and Benjamin Carrion Schaefer, University of Texas at Dallas, US Abstract This paper proposes a practical method to watermark behavioral IPs (BIPs) for High-Level Synthesis (HLS), such that the watermark can be unequivocally retrieved from the generated RTL code while being unremovable. The main approaches to watermarking BIPs so far have focused on modifying the HLS process by, e.g., introducing watermarking-aware scheduling or register-binding algorithms. The main problem with these approaches is that they require full control over the HLS tool's internal behavior, which is not practically possible: state-of-the-art HLS tools do not expose this type of controllability, so these approaches currently cannot be implemented. Commercial HLS tools do, however, make extensive use of synthesis directives in the form of pragmas. In this work we use these synthesis directives to assign operations in the source code to specific functional units from the technology library in order to create the watermark. Experimental results show that our proposed method is effective in creating strong watermarks while remaining practical. |
07:15 CET | 9.6.2 | AVRNTRU: LIGHTWEIGHT NTRU-BASED POST-QUANTUM CRYPTOGRAPHY FOR 8-BIT AVR MICROCONTROLLERS Speaker: Hao Cheng, University of Luxembourg, LU Authors: Hao Cheng, Johann Großschädl, Peter B. Rønne and Peter Y. A. Ryan, University of Luxembourg, LU Abstract Introduced in 1996, NTRUEncrypt is not only one of the earliest but also one of the most scrutinized lattice-based cryptosystems, and it is expected to remain secure in the upcoming era of quantum computing. Furthermore, NTRUEncrypt offers some efficiency benefits over “pre-quantum” cryptosystems like RSA or ECC since the low-level arithmetic operations are less computation-intensive and, thus, more suitable for constrained devices. In this paper we present AVRNTRU, a highly optimized implementation of NTRUEncrypt for 8-bit AVR microcontrollers that we developed from scratch to reach high performance and resistance to timing attacks. AVRNTRU complies with the EESS #1 v3.1 specification and supports product-form parameter sets such as ees443ep1, ees587ep1, and ees743ep1. An entire encryption (including mask generation and blinding-polynomial generation) using the ees443ep1 parameters requires 847973 clock cycles on an ATmega1281 microcontroller; the decryption is more costly and has an execution time of 1051871 cycles. We achieved these results with the help of a novel hybrid technique for multiplication in a truncated polynomial ring, whereby one of the operands is a sparse ternary polynomial in product form and the other an arbitrary element of the ring. A constant-time multiplication in the ring given by the ees443ep1 parameters takes only 192577 cycles, which sets a new speed record for the arithmetic part of a lattice-based cryptosystem on AVR. |
07:30 CET | IP9_1.1 | SEALPK: SEALABLE PROTECTION KEYS FOR RISC-V Speaker: Leila Delshadtehrani, Boston University, US Authors: Leila Delshadtehrani, Sadullah Canakci, Manuel Egele and Ajay Joshi, Boston University, US Abstract With the continuous increase in the number of software-based attacks, there has been a growing effort towards isolating sensitive data and trusted software components from untrusted third-party components. Recently, Intel introduced a new hardware feature for intra-process memory isolation, called Memory Protection Keys (MPK). The limited number of unique domains (16) provided by Intel MPK prohibits its use in cases where a large number of domains are required. Moreover, Intel MPK suffers from the protection key use-after-free vulnerability. To address these shortcomings, in this paper, we propose an efficient intra-process isolation technique for the RISC-V open ISA, called SealPK, which supports up to 1024 unique domains. Additionally, we devise three novel sealing features to protect the allocated domains, their associated pages, and their permissions from modifications or tampering by an attacker. We demonstrate the efficiency of SealPK by leveraging it to implement an isolated secure shadow stack on an FPGA prototype. |
07:31 CET | 9.6.3 | REAL-TIME PRIVATE MEMBERSHIP TEST USING HOMOMORPHIC ENCRYPTION Speaker: Eduardo Chielle, New York University Abu Dhabi, AE Authors: Eduardo Chielle, Homer Gamil and Michail Maniatakos, New York University Abu Dhabi, AE Abstract With the ever-increasing volume of private data residing on the cloud, privacy is becoming a major concern. Oftentimes, sensitive information is leaked during a querying process between a client and an online server hosting a database: the query may leak information about the element the client is looking up, while sensitive details about the contents of the database can leak on the server side. The ability to check whether an element is included in a database while maintaining both the client's and the server's privacy is known as the Private Membership Test. In this context, we propose a method to privately query a database with computational complexity O(1) using Bloom filters and Homomorphic Encryption. The proposed methodology also enables post-encryption insertions and deletions without requiring a new setup. Experimental results show that our proposed solution has practical setup, insertion, and deletion times for databases of up to a few million entries, with a constant query time of less than 0.3s for a false positive rate below 0.001. We instantiate our methodology as a URL denylisting service and demonstrate that it can provide solid security guarantees without affecting the user experience. |
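For readers unfamiliar with the data structure behind 9.6.3, here is a plain, unencrypted Bloom-filter membership test. It shows only the constant-time filter mechanics, with made-up sizes and a URL-denylist flavour; in the paper's scheme the query and lookups are additionally evaluated under homomorphic encryption so that neither party learns the other's data.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1 << 20, k=7):
        self.m, self.k = m, k             # m bits, k hash functions
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def query(self, item):
        # True  = "possibly in the set" (false positives possible),
        # False = "definitely not in the set".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
bf.add("https://malicious.example")
print(bf.query("https://malicious.example"))  # True
print(bf.query("https://benign.example"))     # almost surely False
```

The fixed number of hash lookups per query is what gives the O(1) cost that the paper preserves in the encrypted domain.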
ASD Autonomous Systems Design (ASD): A Two-Day Special Initiative
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 07:00 CET - 18:10 CET
Organizers:
Rolf Ernst, Technical University Braunschweig, DE
Selma Saidi, Technische Universität Dortmund, DE
Dirk Ziegenbein, Robert Bosch GmbH, DE
Sebastian Steinhorst, Technical University of Munich, DE
Jyotirmoy Deshmukh, University of Southern California, US
Christian Laugier, INRIA Grenoble, FR
Two-Day Special Initiative
Fueled by the progress of artificial intelligence, autonomous systems are becoming integral parts of many Internet of Things (IoT) and Cyber-Physical Systems (CPS) applications, such as automated driving, robotics, avionics and industrial automation. Autonomous systems are self-governed and self-adaptive systems that are designed to operate in an open and evolving environment that has not been completely defined at design time. This poses a unique challenge to the design and verification of dependable autonomous systems. The DATE Special Initiative on Autonomous Systems Design (ASD) on Thursday and Friday will include high-profile keynotes and panel discussions, as well as peer-reviewed papers, invited contributions and interactive sessions addressing these challenges.
The Thursday of the DATE Special Initiative on Autonomous Systems Design (ASD) will start with an opening session where industry leaders from Airbus, Porsche and Robert Bosch will talk about their visions of autonomous systems, the challenges they see in the development of autonomous systems, and how autonomous systems will impact the business in their industries. These inputs will then be discussed in an open-floor panel with eminent speakers from academia. After the opening session, three sessions will present peer-reviewed papers on "Reliable Autonomous Systems: Dealing with Failure & Anomalies", "Safety Assurance of Autonomous Vehicles" and "Designing Autonomous Systems: Experiences, Technology and Processes". Furthermore, a special session will discuss the latest research on "Predictable Perception for Autonomous Systems".
The Friday Interactive Day of the DATE Special Initiative on Autonomous Systems Design (ASD) features keynotes from industry leaders as well as interactive discussions initiated by short presentations on several hot topics. Presentations from General Motors and BMW on predictable perception, as well as a session on dynamic risk assessment, will fuel the discussion on how to maximize safety in a technically feasible manner. Speakers from TTTech and APEX.AI will present insights into Motionwise and ROS2 as platforms for automated vehicles. Further sessions will highlight topics such as explainable machine learning, self-adaptation for robustness and self-awareness for autonomy, as well as cybersecurity for connected vehicles.
Autonomous Systems Design (ASD) Thursday Sessions
07:00 - 08:00 9.1 Autonomous Systems Design: Opening Panel
08:00 - 09:00 10.1 Reliable Autonomous Systems: Dealing with Failure & Anomalies
09:00 - 09:30 IP.ASD_1 Interactive Presentations
09:30 - 10:30 11.1 Safety Assurance of Autonomous Vehicles
15:00 - 15:50 K.5 Keynote AUTONOMY: ONE STEP BEYOND ON COMMERCIAL AVIATION by Pascal Traverse, Airbus, FR
16:00 - 17:00 12.1 Designing Autonomous Systems: Experiences, Technology and Processes
17:00 - 17:30 IP.ASD_2 Interactive Presentations
17:30 - 18:30 13.1 Predictable Perception for Autonomous Systems
18:30 - 19:00 ASD Reception
ASD Friday Interactive Day
Detailed Program: W05 Special Initiative on Autonomous Systems Design (ASD)
Sessions
08:30 - 09:00 Opening & Introduction
09:00 - 10:00 Dynamic Risk Assessment in Autonomous Systems
10:00 - 11:00 Cybersecurity for Connected Autonomous Vehicles
11:00 - 12:00 Self-adaptive safety- and mission-critical CPS: wishful thinking or absolute necessity?
14:00 - 15:00 Predictable Perception
15:00 - 16:00 Perspicuous Computing
16:00 - 17:00 Production Architectures & Platforms for Automated Vehicles
17:00 - 18:00 Self-Awareness for Autonomy
Poll
Please contribute to our poll on your perspective on Autonomous Systems Design: http://www.polljunkie.com/poll/ygmrfc/date-2021-special-initiative-on-autonomous-systems-design-poll
Registration
For attending the ASD Thursday Sessions, please obtain a DATE conference registration.
For attending the ASD Friday Interactive Day (W05), a separate free registration sponsored by Argo AI is required.
Both registrations can be made here: https://www.date-conference.com/registration
Technical Program Committee
- Houssam Abbas, Oregon State University, USA
- Rasmus Adler, Fraunhofer IESE, Germany
- Eric Armengaud, AVL, Germany
- Bart Besselink, University of Groningen, Netherlands
- Philippe Bonnifait, UTC Compiegne, France
- Paolo Burgio, Università degli Studi di Modena e Reggio Emilia, Italy
- Arvind Easwaran, Nanyang Technological University, Singapore
- Sebastian Fischmeister, University of Waterloo, Canada
- Roman Gansch, Robert Bosch GmbH, Germany
- Sabine Glesner, TU Berlin, Germany
- Dominique Gruyer, Université Gustave Eiffel, France
- Mohammad Hamad, Technical University Munich, Germany
- Xiaoqing Jin, Apple, USA
- Martina Maggio, Saarland University, Germany
- Philipp Mundhenk, AUDI, Germany
- Alessandra Nardi, Cadence, USA
- Gabor Orosz, University of Michigan, USA
- Claire Pagetti, Onera, France
- Daniele Palossi, ETH Zurich, Switzerland
- Alessandro Papadopoulos, Mälardalen University, Sweden
- Alessandro Renzaglia, INRIA, France
- Shreejith Shanker, Trinity College Dublin, Ireland
- Dongwha Shin, Soongsil University, Korea
- Aviral Shrivastava, Arizona State University, USA
- Andrei Terechko, NXP Semiconductors, Netherlands
- Lin Xue, Northeastern University, USA
K.6 Embedded Keynote: "Privacy this unknown - The new design dimension of computing architecture"
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 07:00 CET - 07:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/LL4TqfhGHPh5ddLuz
Session chair:
Ian O'Connor, Ecole Centrale de Lyon, FR
Session co-chair:
Francesco Regazzoni, University of Amsterdam and ALaRI - USI, NL
Security is often presented as being based on the CIA triad, where the “C” stands for Confidentiality. Indeed, in many human activities we like to keep some things confidential, or “private”; this is particularly true when these activities are carried out in the cyber world, where a lot of our private data are transmitted, processed, and stored. In this talk, we will first introduce the concept of privacy, and then see how it is interlaced with two important research threads. First, we will discuss how computer architectures, and particularly “trusted components” in processors, could help protect privacy, allowing us to trust remote systems. Finally, we will discuss the issue of side channels (in a broad sense, not only in processors) that could lead to leaks of private information. Bio: Mauro Conti is Full Professor at the University of Padua, Italy. He is also affiliated with TU Delft and the University of Washington, Seattle. He obtained his Ph.D. from Sapienza University of Rome, Italy, in 2009. After his Ph.D., he was a Post-Doc Researcher at Vrije Universiteit Amsterdam, The Netherlands. In 2011 he joined the University of Padua as Assistant Professor, where he became Associate Professor in 2015 and Full Professor in 2018. He has been a Visiting Researcher at GMU, UCLA, UCI, TU Darmstadt, UF, and FIU. He was awarded a Marie Curie Fellowship (2012) by the European Commission and a Fellowship by the German DAAD (2013). His research is also funded by companies, including Cisco, Intel, and Huawei. His main research interest is in the area of Security and Privacy; in this area, he has published more than 350 papers in topmost international peer-reviewed journals and conferences. He is Area Editor-in-Chief for IEEE Communications Surveys & Tutorials, and Associate Editor for several journals, including IEEE Transactions on Information Forensics and Security, IEEE Transactions on Dependable and Secure Computing, and IEEE Transactions on Network and Service Management. He was Program Chair for TRUST 2015, ICISS 2016, WiSec 2017, ACNS 2020, and General Chair for SecureComm 2012 and ACM SACMAT 2013. He is a Senior Member of the IEEE and ACM, and a member of the Blockchain Expert Panel of the Italian Government.
Time | Label | Presentation Title Authors |
---|---|---|
07:00 CET | K.6.1 | PRIVACY THIS UNKNOWN - THE NEW DESIGN DIMENSION OF COMPUTING ARCHITECTURES Speaker and Author: Mauro Conti, Università di Padova, IT Abstract Security is often presented as being based on the CIA triad, where the "C" stands for Confidentiality. Indeed, in many human activities we like to keep something confidential, or "private"; this is particularly true when these activities are done in the cyber world, where a lot of our private data are transmitted, processed, and stored. In this talk, we will introduce the concept of privacy, and then see how it is interlaced with two important research threads. First we’ll discuss how computer architectures, and particularly “trusted components” in processors, could be helpful to protect privacy, allowing us to trust remote systems. Then, we will discuss the issues of side channels (in a broad sense) that could lead to leaks of private information. |
10.1 Reliable Autonomous Systems: Dealing with Failure & Anomalies
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 08:00 CET - 09:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/Gz6kTsdb9rNGixqMQ
Session chair:
Rolf Ernst, TU Braunschweig, DE
Session co-chair:
Rasmus Adler, Fraunhofer IESE, DE
Organizers:
Rolf Ernst, TU Braunschweig, DE
Selma Saidi, TU Dortmund, DE
Autonomous systems need novel approaches to detect and handle failures and anomalies. The first paper introduces an approach that adapts the placement of applications on a vehicle platform by adjusting optimization criteria under safety-goal restrictions. The second paper presents a formal worst-case failover timing analysis for online verification, assuring safe vehicle operation under failover safety constraints. The third paper proposes an explanation component that observes and analyses an autonomous system and tries to derive explanations for anomalous behavior. A toy worst-case failover-time calculation follows this session's listing.
Time | Label | Presentation Title Authors |
---|---|---|
08:00 CET | 10.1.1 | C-PO: A CONTEXT-BASED APPLICATION-PLACEMENT OPTIMIZATION FOR AUTONOMOUS VEHICLES Speaker: Tobias Kain, Volkswagen AG, Wolfsburg, Germany, DE Authors: Tobias Kain1, Hans Tompits2, Timo Frederik Horeis3, Johannes Heinrich3, Julian-Steffen Müller4, Fabian Plinke3, Hendrik Decke5 and Marcel Aguirre Mehlhorn4 1Volkswagen AG, AT; 2Technische Universität Wien, AT; 3Institut für Qualitäts- und Zuverlässigkeitsmanagement GmbH, DE; 4Volkswagen AG, DE; 5Volkswagen AG, Wolfsburg, DE Abstract Autonomous vehicles are complex distributed systems consisting of multiple software applications and computing nodes. Determining the assignment between these software applications and computing nodes is known as the application-placement problem. The input to this problem is a set of applications, their requirements, a set of computing nodes, and their provided resources. Due to the potentially large solution space of the problem, an optimization goal defines which solution is desired most. However, the optimization goal used for the application-placement problem is not static; it has to be adapted according to the current context the vehicle is experiencing. Therefore, an approach for a context-based determination of the optimization goal for a given instance of the application-placement problem is required. In this paper, we introduce C-PO, an approach addressing this issue. C-PO ensures that if the safety level of the system drops due to an occurring failure, the optimization goal for the subsequently executed application-placement determination aims to restore the safety level. Once the highest safety level is reached, C-PO optimizes the application placement according to the current driving situation. Furthermore, we introduce two methods for dynamically determining the required level of safety. |
08:15 CET | 10.1.2 | WORST-CASE FAILOVER TIMING ANALYSIS OF DISTRIBUTED FAIL-OPERATIONAL AUTOMOTIVE APPLICATIONS Speaker: Philipp Weiss, TU Munich, DE Authors: Philipp Weiss1, Sherif Elsabbahy1, Andreas Weichslgartner2 and Sebastian Steinhorst1 1TU Munich, DE; 2AUDI AG, DE Abstract Enabling fail-operational behavior of safety-critical software is essential to achieve autonomous driving. At the same time, automotive vendors have to regularly deliver over-the-air software updates. Here, the challenge is to enable flexible and dynamic system behavior while offering, at the same time, predictable and deterministic behavior of time-critical software. Thus, it is necessary to verify that timing constraints can be met even during failover scenarios. For this purpose, we present a formal analysis to derive the worst-case application failover time. Without such an automated worst-case failover timing analysis, it would not be possible to enable dynamic behavior of safety-critical software within safe bounds. We support our formal analysis by conducting experiments on a hardware platform using a distributed fail-operational neural network. Our randomly generated worst-case results come as close as 6.0% below our analytically derived exact bound. Overall, our worst-case failover timing analysis allows an automated analysis to be conducted at run time to verify that the system operates within the bounds of the failover timing constraint, such that dynamic and safe behavior of autonomous systems can be ensured. |
08:30 CET | IP.ASD_1.1 | DECENTRALIZED AUTONOMOUS ARCHITECTURE FOR RESILIENT CYBER-PHYSICAL PRODUCTION SYSTEMS Speaker: Laurin Prenzel, TU Munich, DE Authors: Laurin Prenzel and Sebastian Steinhorst, TU Munich, DE Abstract Real-time decision-making is a key element in the transition from Reconfigurable Manufacturing Systems to Autonomous Manufacturing Systems. In Cyber-Physical Production Systems (CPPS) and Cloud Manufacturing, most decision-making algorithms are either centralized, creating vulnerabilities to failures, or decentralized, struggling to reach the performance of their centralized counterparts. In this paper, we combine the performance of centralized optimization algorithms with the resilience of a decentralized consensus. We propose a novel autonomous system architecture for CPPS featuring automatic production plan generation, functional validation, and a two-stage consensus algorithm combining a majority vote on safety and optimality with a unanimous vote on feasibility and authenticity. The architecture is implemented in a simulation framework. In a case study, we exhibit the timing behavior of the configuration procedure and the subsequent reconfiguration following a device failure, showing the feasibility of a consensus-based decision-making process. |
08:31 CET | 10.1.3 | ANOMALY DETECTION AND CLASSIFICATION TO ENABLE SELF-EXPLAINABILITY OF AUTONOMOUS SYSTEMS Speaker: Verena Klös, TU Berlin, DE Authors: Florian Ziesche, Verena Klös and Sabine Glesner, TU Berlin, DE Abstract While the importance of autonomous systems in our daily lives and in industry increases, we have to ensure that this development is accepted by their users. A crucial factor for successful cooperation between humans and autonomous systems is a basic understanding that allows users to anticipate the behavior of the systems. Due to their complexity, complete understanding is neither achievable nor desirable. Instead, we propose self-explainability as a solution. A self-explainable system autonomously explains behavior that differs from anticipated behavior. As a first step towards this vision, we present an approach for detecting anomalous behavior that requires an explanation, and for reducing the huge search space of possible reasons for this behavior by classifying it into classes with similar reasons. We envision our approach to be part of an explanation component that can be added to any autonomous system. |
08:46 CET | IP.ASD_1.2 | PROVABLY ROBUST MONITORING OF NEURON ACTIVATION PATTERNS Speaker and Author: Chih-Hong Cheng, DENSO AUTOMOTIVE Deutschland GmbH, DE Abstract For deep neural networks (DNNs) to be used in safety-critical autonomous driving tasks, it is desirable to monitor at operation time whether the input to the DNN is similar to the data used in DNN training. While recent results in monitoring DNN activation patterns provide a sound guarantee by building an abstraction of the training data set, reducing false positives due to slight input perturbations has been an obstacle to successfully adopting the techniques. We address this challenge by integrating formal symbolic reasoning into the monitor construction process. The algorithm performs a sound worst-case estimate of neuron values with inputs (or features) subject to perturbation, before the abstraction function is applied to build the monitor. The provable robustness is further generalized to cases where monitoring a single neuron can use more than one bit, implying that one can record activation patterns with a fine-grained decision on the neuron value interval. |
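To give a feel for the kind of bound derived in 10.1.2, here is a deliberately simplified worst-case failover-time calculation: failure detection by a periodic heartbeat monitor, plus state restoration and standby startup, checked against a failover deadline. The decomposition and all parameter values are hypothetical; the paper's formal analysis accounts for platform and scheduling effects that this toy ignores.

```python
def worst_case_failover_ms(heartbeat_period, missed_heartbeats,
                           state_restore, app_startup):
    """Upper bound on failover time, in milliseconds.

    Assumes failure is declared after `missed_heartbeats` consecutive
    monitoring periods without a heartbeat, after which the standby
    restores state and starts the application. Illustrative only.
    """
    detection = missed_heartbeats * heartbeat_period
    return detection + state_restore + app_startup

bound = worst_case_failover_ms(heartbeat_period=10, missed_heartbeats=3,
                               state_restore=25, app_startup=40)
print(f"worst-case failover: {bound} ms")   # -> 95 ms
assert bound <= 100, "failover deadline would be violated"
```

Checking such a bound online, as the paper proposes, is what allows dynamic reconfiguration to stay within verified failover constraints.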
10.2 Multi-Partner Innovative Research Projects
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 08:00 CET - 08:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/kQbMhMNhK7n9Zg87S
Session chair:
Francisco J. Cazorla, BSC, ES
Session co-chair:
Carles Hernandez, Universitat Politècnica de València, ES
In this exciting session, papers present multi-partner, innovative and highly technological research projects covering a wide range of topics, from cyber-physical systems to nanoelectronics. The papers introduce several projects that are in their initial execution stages (i.e., recently accepted) and summarize the technical work done and the lessons learnt in other projects that are in the final stages of their execution.
Time | Label | Presentation Title Authors |
---|---|---|
08:00 CET | 10.2.1 | GPU4S: MAJOR PROJECT OUTCOMES, LESSONS LEARNT AND WAY FORWARD Speaker: Leonidas Kosmidis, Barcelona Supercomputing Center (BSC) and Universitat Politècnica de Catalunya (UPC), ES Authors: Leonidas Kosmidis1, Ivan Rodriguez Ferrandez2, Alvaro Jover-Alvarez3, Sergi Alcaide4, Jérôme Lachaize5, Olivier Notebaert5, Antoine Certain5 and David Steenari6 1Barcelona Supercomputing Center (BSC), ES; 2Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES; 3Universitat Politècnica de Catalunya (UPC) and Barcelona Supercomputing Center (BSC), ES; 4Universitat Politècnica de Catalunya - Barcelona Supercomputing Center (BSC), ES; 5Airbus Defence and Space, Toulouse, FR; 6European Space Agency, NL Abstract Embedded GPUs have been identified by both private and government space agencies as a promising hardware technology to satisfy the increased needs of payload processing. The GPU4S (GPU for Space) project, funded by the European Space Agency (ESA), has explored in detail the feasibility and the benefit of using them for space workloads. With the project now in its closing phase, in this paper we describe the main project outcomes and explain the lessons we learnt. In addition, we provide some guidelines for the next steps towards their adoption in space. |
08:15 CET | 10.2.2 | EVEREST: A DESIGN ENVIRONMENT FOR EXTREME-SCALE BIG DATA ANALYTICS ON HETEROGENEOUS PLATFORMS Speaker: Christian Pilato, Politecnico di Milano, IT Authors: Christian Pilato1, Stanislav Bohm2, Fabien Brocheton3, Jeronimo Castrillon4, Riccardo Cevasco5, Vojtech Cima2, Radim Cmar6, Dionysios Diamantopoulos7, Fabrizio Ferrandi1, Jan Martinovic2, Gianluca Palermo1, Michele Paolino8, Antonio Parodi9, Lorenzo Pittaluga5, Daniel Raho8, Francesco Regazzoni10, Katerina Slaninova2 and Christoph Hagleitner11 1Politecnico di Milano, IT; 2IT4Innovations, VSB – TU Ostrava, CZ; 3NUMTECH, FR; 4TU Dresden, DE; 5Duferco Energia, IT; 6Sygic, SK; 7IBM Research, CH; 8Virtual Open System, FR; 9Centro Internazionale di Monitoraggio Ambientale, IT; 10University of Amsterdam and ALaRI - USI, CH; 11IBM, CH Abstract High-Performance Big Data Analytics (HPDA) applications are characterized by huge volumes of distributed and heterogeneous data that require efficient computation for knowledge extraction and decision making. Designers are moving towards a tight integration of computing systems combining HPC, Cloud, and IoT solutions with artificial intelligence (AI). Matching the application and data requirements with the characteristics of the underlying hardware is key to improving predictions through high performance and better use of resources. We present EVEREST, a novel H2020 project started on October 1, 2020, that aims at developing a holistic environment for the co-design of HPDA applications on heterogeneous, distributed, and secure platforms. EVEREST focuses on programmability issues through a data-driven design approach, the use of hardware-accelerated AI, and efficient runtime monitoring with virtualization support. In the different stages, EVEREST combines state-of-the-art programming models, emerging communication standards, and novel domain-specific extensions. We describe the EVEREST approach and the use cases that drive our research. |
08:30 CET | IP.MPIRP_1.1 | PROJECT OVERVIEW FOR STEP-UP!CPS – PROCESS, METHODS AND TECHNOLOGIES FOR UPDATING SAFETY-CRITICAL CYBER-PHYSICAL SYSTEMS Speaker: Carl Philipp Hohl, FZI Forschungszentrum Informatik, DE Authors: Thomas Strathmann1, Georg Hake2, Houssem Guissouma3, Carl Philipp Hohl4, Yosab Bebawy1, Sebastian Vander Maelen1 and Andrew Koerner5 1OFFIS e.V., DE; 2University of Oldenburg, DE; 3Karlsruhe Institute of Technology (KIT), DE; 4FZI Forschungszentrum Informatik, DE; 5DLR, DE Abstract We describe the challenges addressed by Step-Up!CPS, a three-year German national collaborative research project currently in its third year. The goal of the project is to develop software methods and technologies for modular updates of safety-critical cyber-physical systems. To make this possible, contracts are utilized that formally describe the behaviour of an update and make it verifiable at different points in the update life cycle. We have defined a development process that allows for continuous improvement of such systems by monitoring their operation, identifying the need for updates, and developing and deploying these updates in a safe and secure manner. We highlight the points along the update process that are critical to a secure update and show how we address them in a contractually secured update process. |
08:31 CET | IP.MPIRP_1.2 | VERIDEVOPS: AUTOMATED PROTECTION AND PREVENTION TO MEET SECURITY REQUIREMENTS IN DEVOPS Speaker: Eduard Enoiu, MDH, SE Authors: Andrey Sadovykh1, Gunnar Widforss2, Dragos Truscan3, Eduard Paul Enoiu2, Wissam Mallouli4, Rosa Iglesias5, Alessandra Bagnto6 and Olga Hendel2 1Innopolis University, RU; 2Mälardalen University, SE; 3Åbo Akademi University, FI; 4Montimage EURL, FR; 5Ikerlan Technology Research Centre, ES; 6Softeam, FR Abstract VeriDevOps is a Horizon 2020 funded research project in its initial stage. The project started on 1.10.2020 and will run for three years. VeriDevOps is about fast, flexible software engineering that efficiently integrates development, delivery, and operations, thus aiming at quality deliveries with short cycle times to address ever-evolving challenges. Current software development practices are increasingly based on using both COTS and legacy components, which makes such systems prone to security vulnerabilities. The modern practice addressing ever-changing conditions, DevOps, promotes frequent software deliveries; however, verification methods and artifacts must be updated in a timely fashion to cope with the pace of the process. VeriDevOps aims at providing a faster feedback loop for verifying the security requirements and other quality attributes of large-scale cyber-physical systems. VeriDevOps focuses on optimizing the security verification activities by automatically creating verifiable models directly from security requirements, and using these models to check security properties on design models and generate artefacts such as automatically generated tests or monitors that can be used later in the DevOps process. The main drivers for these advances are: Natural Language Processing, a combined formal verification and model-based testing approach, and machine-learning-based security monitors. In this paper, we present the planned contributions of the project, its consortium and its planned way of working to accomplish the expected results. |
08:32 CET | 10.2.3 | NANO SECURITY: FROM NANO-ELECTRONICS TO SECURE SYSTEMS Speaker: Ilia Polian, University of Stuttgart, DE Authors: Ilia Polian1, Frank Altmann2, Tolga Arul3, Christian Boit4, Lucas Davi5, Rolf Drechsler6, Nan Du7, Thomas Eisenbarth8, Tim Güneysu9, Sascha Herrmann10, Matthias Hiller11, Rainer Leupers12, Farhad Merchant13, Thomas Mussenbrock14, Stefan Katzenbeisser3, Akash Kumar15, Wolfgang Kunz16, Thomas Mikolajick17, Vivek Pachauri18, Jean-Pierre Seifert4, Frank Sill Torres19 and Jens Trommer20 1University of Stuttgart, DE; 2Fraunhofer Institute for Microstructure of Materials and Systems IMWS, DE; 3Chair of Computer Engineering, University of Passau, DE; 4TU Berlin, DE; 5Secure Software Systems Group, University Duisburg-Essen, DE; 6University of Bremen/DFKI, DE; 7Material Systems and Nanoelectronics Group, TU Chemnitz, DE; 8University of Lübeck & WPI, DE; 9Ruhr-Universität Bochum & DFKI, DE; 10Center for Microtechnologies, TU Chemnitz, DE; 11Fraunhofer AISEC, DE; 12RWTH Aachen University, DE; 13Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, DE; 14Chair of Plasma Technology, Ruhr University Bochum, DE; 15Chair for Processor Design, TU Dresden, DE; 16TU Kaiserslautern, DE; 17NaMLab Gmbh / TU Dresden, DE; 18Institute of Materials in Electrical Engineering 1, RWTH Aachen University, DE; 19German Aerospace Center, DE; 20Namlab gGmbH, DE Abstract The field of computer hardware stands at the verge of a revolution driven by recent breakthroughs in emerging nano-devices. "Nano Security" is a new Priority Program recently approved by DFG, the German Research Council. This initial-stage project initiative at the crossroads of nano-electronics and hardware-oriented security includes 11 projects with a total of 23 Principal Investigators from 18 German institutions. It considers the interplay between security and nano-electronics, focusing on the dichotomy that emerging nano-devices (and their architectural implications) present for system security. The projects within the Priority Program consider both potential security threats and vulnerabilities stemming from novel nano-electronics, and innovative approaches to establishing and improving system security based on nano-electronics. This paper provides an overview of the Priority Program's overall philosophy and discusses the scientific objectives of its individual projects. |
08:47 CET | IP.MPIRP_2.1 | THE UP2DATE BASELINE RESEARCH PLATFORMS Speaker: Alvaro Jover, Barcelona Supercomputing Center (BSC) and Universitat Politècnica de Catalunya (UPC), ES Authors: Alvaro Jover-Alvarez1, Alejandro J. Calderón2, Ivan Rodriguez Ferrandez3, Leonidas Kosmidis4, Kazi Asifuzzaman4, Patrick Uven5, Kim Grüttner6, Tomaso Poggi7 and Irune Agirre8 1Universitat Politècnica de Catalunya (UPC) and Barcelona Supercomputing Center (BSC), ES; 2Ikerlan Technology Research Centre, ES; 3Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES; 4Barcelona Supercomputing Center (BSC), ES; 5OFFIS, DE; 6OFFIS - Institute for Information Technology, DE; 7IKERLAN, ES; 8Ikerlan, ES Abstract The UP2DATE H2020 project focuses on high-performance heterogeneous embedded platforms for critical systems. We will develop observability and controllability solutions to support online updates while ensuring safety and security for mixed-criticality tasks. In this paper, we describe the rationale behind the selection of the baseline research platforms which will be used to develop and demonstrate the project concepts, including a performance comparison to identify the most efficient one. |
10.3 Hardware-aware Training
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 08:00 CET - 08:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/id7KyqmwgKYxFg3Z7
Session chair:
Cong (Callie) Hao, Georgia Institute of Technology, US
Session co-chair:
Bert Moons, Qualcomm, US
Even with recent innovations in hardware and software platforms, both training and inference of deep learning models remain costly. These costs can be alleviated by optimizing neural networks towards their hardware targets and by using hardware-efficient number representations during training and deployment. This session presents novel methodologies for hardware-aware training. In the first paper, the authors present a form of multi-objective differentiable NAS and focus on better controlling the correlation between projected latency during training (soft-architecture) and real latency during deployment (hard-architecture). The second paper proposes to use a recent innovation in numerical representation known as posits as an efficient number format for generative adversarial network (GAN) training and inference. The third paper proposes a new DNN training technique that quantizes each network layer to a different bit-width based on the sparsity of output activations for that layer, resulting in performance and energy-efficiency improvements. The last paper proposes a novel sparsity pattern with mixed granularity named joint sparsity, combining the advantages of both unstructured and structured sparsity.
Time | Label | Presentation Title Authors |
---|---|---|
08:00 CET | 10.3.1 | MDARTS: MULTI-OBJECTIVE DIFFERENTIABLE NEURAL ARCHITECTURE SEARCH Speaker: Sunghoon Kim, Samsung Electronics, KR Authors: Sunghoon Kim1, Hyunjeong Kwon1, Eunji Kwon2, Youngchang Choi1, Taehyun Oh1 and Seokhyeong Kang3 1POSTECH, KR; 2Postech, KR; 3Pohang University of Science and Technology, KR Abstract In this work, we present a differentiable neural architecture search (NAS) method that takes into account two competing objectives, quality of result (QoR) and quality of service (QoS), under hardware design constraints. NAS research has recently received a lot of attention due to its ability to automatically find architecture candidates that can outperform handcrafted ones. However, NAS approaches that comply with actual HW design constraints have been under-explored. A naive approach would be to optimize a combination of the two criteria of QoR and QoS, but this simple extension of the prior art often yields degenerated architectures and suffers from sensitive hyperparameter tuning. In this work, we propose a multi-objective differentiable neural architecture search, called MDARTS. MDARTS has an affordable search time and can find the Pareto frontier of QoR versus QoS. We also identify the problematic gap between all existing differentiable NAS results and the final post-processed architectures, where soft connections are binarized. This gap leads to performance degradation when the model is deployed. To mitigate it, we propose a separation loss that discourages indefinite connections of components by implicitly minimizing entropy. |
08:15 CET | 10.3.2 | POSIT ARITHMETIC FOR THE TRAINING AND DEPLOYMENT OF GENERATIVE ADVERSARIAL NETWORKS Speaker: Nhut Minh Ho, National University of Singapore, SG Authors: Nhut-Minh Ho1, Duy-Thanh Nguyen2, Himeshi De Silva1, John L. Gustafson1, Weng-Fai Wong1 and Ik Joon Chang3 1National University of Singapore, SG; 2Kyung Hee University, KR; 3Kyunghee University, KR Abstract This paper proposes a set of methods which enable low-precision posit™ arithmetic to be successfully used for the training of Generative Adversarial Networks (GANs) with minimal quality loss. We show that ultra-low-precision posits, as small as 6 bits, can achieve high-quality output for the generation phase after training. We also evaluate the usage of other floating-point formats and compare them to 8-bit posits in the context of GAN training. Our scaling and adaptive calibration techniques are capable of producing superior training quality for 8-bit posits that surpasses 8-bit floating point and matches the results of half precision. Hardware simulation results indicate that our methods have higher energy efficiency compared to both 16- and 8-bit float training systems. (A minimal decoder for the posit format itself is sketched after this table.) |
08:30 CET | IP9_1.2 | JOINT SPARSITY WITH MIXED GRANULARITY FOR EFFICIENT GPU IMPLEMENTATION Speaker: Chuliang Guo, Zhejiang University, CN Authors: Chuliang Guo1, Xingang Yan1, Yufei Chen1, He Li2, Xunzhao Yin1 and Cheng Zhuo1 1Zhejiang University, CN; 2University of Cambridge, GB Abstract Given the over-parameterization property of recent deep neural networks, sparsification is widely used to compress networks and save memory footprint. Unstructured sparsity, i.e., fine-grained pruning, helps preserve model accuracy, while structured sparsity, i.e., coarse-grained pruning, is preferred for general-purpose hardware, e.g., GPUs. This paper proposes a novel joint sparsity pattern using mixed granularity to take advantage of both unstructured and structured sparsity. We utilize a heuristic strategy to infer the joint sparsity pattern by mixing vector-wise fine-grained and block-wise coarse-grained pruning masks. Experimental results show that joint sparsity achieves higher model accuracy and sparsity ratio while consistently maintaining moderate inference speed for VGG-16 on CIFAR-100 in comparison to the commonly used block-sparsity and balanced-sparsity strategies. (A toy mixed-granularity mask generator is sketched after this table.) |
08:31 CET | 10.3.3 | ACTIVATION DENSITY BASED MIXED-PRECISION QUANTIZATION FOR ENERGY EFFICIENT NEURAL NETWORKS Speaker: Yeshwanth Venkatesha, Yale University, US Authors: Karina Vasquez1, Yeshwanth Venkatesha2, Abhishek Moitra2, Abhiroop Bhattacharjee2 and Priya Panda2 1University of Engineering and Technology (UTEC), Peru, PE; 2Yale University, US Abstract As neural networks gain widespread adoption in embedded devices, there is a growing need for model compression techniques to facilitate seamless deployment in resource-constrained environments. Quantization is one of the go-to methods yielding state-of-the-art model compression. Most quantization approaches take a fully trained model, then apply different heuristics to determine the optimal bit-precision for different layers of the network, and finally retrain the network to regain any drop in accuracy. Based on Activation Density (the proportion of non-zero activations in a layer), we propose a novel in-training quantization method. Our method calculates the optimal bit-width/precision for each layer during training, yielding an energy-efficient mixed-precision model with competitive accuracy. Since we train lower-precision models progressively during training, our approach yields the final quantized model at lower training complexity and also eliminates the need for re-training. We run experiments on benchmark datasets like CIFAR-10, CIFAR-100 and TinyImagenet on VGG19/ResNet18 architectures and report the corresponding accuracy and energy estimates. We achieve up to 4.5x benefit in terms of estimated multiply-and-accumulate (MAC) reduction while reducing the training complexity by 50% in our experiments. To further evaluate the energy benefits of our proposed method, we develop a mixed-precision scalable Processing-In-Memory (PIM) hardware accelerator platform. The hardware platform incorporates shift-add functionality for handling multi-bit precision neural network models. Evaluating the quantized models obtained with our proposed method on the PIM platform yields about 5x energy reduction compared to baseline 16-bit models. Additionally, we find that integrating activation-density-based quantization with activation-density-based pruning (both conducted during training) yields up to ~198x and ~44x energy reductions for VGG19 and ResNet18 architectures respectively on the PIM platform compared to baseline 16-bit-precision, unpruned models. (A sketch of the density measurement, with a hypothetical bit-width rule, follows this table.) |
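For readers unfamiliar with the posit format referenced in 10.3.2, below is a minimal, self-contained decoder for the standard posit encoding (sign, two's-complement negation, regime run, exponent, fraction). It illustrates the number format only, not the paper's scaling or calibration techniques; the default `nbits`/`es` values are common choices, not necessarily the paper's.

```python
def decode_posit(x: int, nbits: int = 8, es: int = 1) -> float:
    """Decode an nbits-wide posit bit pattern (given as an unsigned int)."""
    mask = (1 << nbits) - 1
    x &= mask
    if x == 0:
        return 0.0
    if x == 1 << (nbits - 1):           # 100...0 encodes Not-a-Real
        return float("inf")
    sign = x >> (nbits - 1)
    if sign:                            # negative posits are two's complement
        x = (-x) & mask
    body = (x << 1) & mask              # drop the sign bit
    first = body >> (nbits - 1)         # leading regime bit
    run, probe = 0, body
    while run < nbits - 1 and (probe >> (nbits - 1)) == first:
        run += 1
        probe = (probe << 1) & mask
    k = run - 1 if first else -run      # regime value
    rem = max(nbits - (1 + run + 1), 0) # bits left for exponent + fraction
    payload = x & ((1 << rem) - 1)
    e_bits = min(es, rem)
    exp = (payload >> (rem - e_bits)) << (es - e_bits)  # pad cut-off exponent
    frac_bits = rem - e_bits
    frac = payload & ((1 << frac_bits) - 1)
    f = frac / (1 << frac_bits) if frac_bits else 0.0
    useed = 1 << (1 << es)              # useed = 2^(2^es)
    val = (useed ** k) * (2 ** exp) * (1.0 + f)
    return -val if sign else val
```

For example, `decode_posit(0b01000000)` returns 1.0 and `decode_posit(0b11000000)` returns -1.0; the 8-bit, es=1 maximum `decode_posit(0b01111111)` is 4096.0.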
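The mixed-granularity idea of IP9_1.2 can be pictured with a toy mask generator: keep the strongest blocks (coarse, structured stage), then keep the largest weights within each column vector of the surviving blocks (fine, vector-wise stage). The block size, keep ratios and the exact mixing rule below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def joint_sparsity_mask(w, block=(4, 4), block_keep=0.5, vec_keep=0.5):
    """Toy mixed-granularity pruning mask over a 2-D weight matrix."""
    rows, cols = w.shape
    mask = np.zeros_like(w, dtype=bool)
    br, bc = block
    # Coarse stage: rank blocks by their L1 norm and keep the strongest.
    norms, coords = [], []
    for r in range(0, rows, br):
        for c in range(0, cols, bc):
            norms.append(np.abs(w[r:r + br, c:c + bc]).sum())
            coords.append((r, c))
    keep = max(1, int(len(norms) * block_keep))
    for i in np.argsort(norms)[-keep:]:
        r, c = coords[i]
        blk = np.abs(w[r:r + br, c:c + bc])
        k = max(1, int(blk.shape[0] * vec_keep))
        # Fine stage: within each column vector of the block, keep top-|w|.
        for j in range(blk.shape[1]):
            top = np.argsort(blk[:, j])[-k:]
            mask[r + top, c + j] = True
    return mask

w = np.random.randn(8, 8)
w_pruned = w * joint_sparsity_mask(w)   # apply the joint mask
```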
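Paper 10.3.3's driving signal, activation density, is cheap to measure during training. A minimal sketch follows; the density-to-bit-width mapping shown is a hypothetical stand-in, since the paper derives its own in-training rule.

```python
import math
import numpy as np

def activation_density(acts: np.ndarray) -> float:
    """Fraction of non-zero activations in a layer's output tensor."""
    return float(np.count_nonzero(acts) / acts.size)

def bitwidth_from_density(density: float, full: int = 16, floor: int = 2) -> int:
    """Hypothetical mapping: layers with sparser outputs tolerate coarser
    quantization, so they get fewer bits (the paper derives its own rule)."""
    return max(floor, math.ceil(full * density))
```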
10.4 EDA tools: the next generation
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 08:00 CET - 08:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/owWfFvDXga3iFLXve
Session chair:
Marie-Minerve Louerat, LIP6, FR
Session co-chair:
Florence Azais, University of Montpellier, CNRS, FR
This session presents new proposals for EDA tools dedicated to analog circuits, targeting various aspects such as the recognition of building blocks, analog sizing that accounts for layout parasitics, and 2D electric field analysis.
Time | Label | Presentation Title Authors |
---|---|---|
08:00 CET | 10.4.1 | LIBRARY-FREE STRUCTURE RECOGNITION FOR ANALOG CIRCUITS Speaker: Maximilian Neuner, TU Munich, DE Authors: Maximilian Neuner, Inga Abel and Helmut Graeb, TU Munich, DE Abstract Extracting structural information of a design is one crucial aspect of many circuit verification and synthesis methods. State-of-the-art structure recognition methods use a predefined building block library to identify the basic building blocks in a circuit. However, the capability of these algorithms is limited by the scope, correctness and completeness of the provided library. This paper presents a new method to automatically generate the recognition rules required to identify a given circuit topology in a large design. Device pairs are grouped into building blocks by analyzing their characteristics, e.g., their connectivity, to enable a structure recognition that is as unambiguous as possible. The resulting blocks are consecutively assembled into larger blocks until the full building block description of the given topology has been established. Building block libraries dedicated to one specific topology type, e.g., operational amplifiers, can be obtained by applying the method to the type's basic version and subsequently extending the generated library with the additional elements required to identify its topology variants. Experimental results for six folded-cascode amplifier and five level-shifter topologies are given. |
08:15 CET | 10.4.2 | PARASITIC-AWARE ANALOG CIRCUIT SIZING WITH GRAPH NEURAL NETWORKS AND BAYESIAN OPTIMIZATION Speaker: Mingjie Liu, University of Texas at Austin, US Authors: Mingjie Liu1, Walker Turner2, George Kokai2, Brucek Khailany2, David Z. Pan3 and Haoxing Ren2 1University of Texas Austin, US; 2NVIDIA Corporation, US; 3University of Texas at Austin, US Abstract Layout parasitics significantly impact the performance of analog integrated circuits, leading to discrepancies between schematic and post-layout performance and requiring several iterations to achieve design convergence. Prior work has accounted for parasitic effects during the initial design phase but relies on automated layout generation for estimating parasitics. In this work, we leverage recent developments in parasitic prediction using graph neural networks to eliminate the need for in-the-loop layout generation. We propose an improved surrogate performance model using parasitic graph embeddings from the pre-trained parasitic prediction network. We further leverage dropout as an efficient prediction of uncertainty for Bayesian optimization to automate transistor sizing. Experimental results demonstrate that the proposed surrogate model has a 20% better R² prediction score and improves optimization convergence by 3.7x and 2.1x compared to conventional Gaussian process regression and neural-network-based Bayesian linear regression, respectively. Furthermore, the inclusion of parasitic prediction in the optimization loop can guarantee satisfaction of all design constraints, while schematic-only optimization fails numerous constraints when verified with parasitic estimations. (A sketch of dropout-based uncertainty for Bayesian optimization follows this table.) |
08:30 CET | IP8_1.2 | (Best Paper Award Candidate) SYSTEM LEVEL VERIFICATION OF PHASE-LOCKED LOOP USING METAMORPHIC RELATIONS Speaker: Muhammad Hassan, DFKI GmbH, DE Authors: Muhammad Hassan1, Daniel Grosse2 and Rolf Drechsler3 1Cyber Physical Systems, DFKI, DE; 2Johannes Kepler University Linz, AT; 3University of Bremen/DFKI, DE Abstract In this paper we build on Metamorphic Testing (MT), a verification technique which has been employed very successfully in the software domain. The core idea is to uncover bugs by relating consecutive executions of the program under test. Recently, MT has been applied successfully to the verification of Radio Frequency (RF) amplifiers at the system level as well. However, this is clearly not sufficient, as the true complexity stems from Analog/Mixed-Signal (AMS) systems. In this paper, we go beyond pure analog systems, i.e., we expand MT to verify AMS systems. As a challenging AMS system, we consider an industrial PLL. We devise a set of eight generic Metamorphic Relations (MRs). These MRs allow verifying the PLL behavior at both the component level and the system level. To this end, we have created MRs considering analog-to-digital as well as digital-to-digital behavior. We found a critical bug in the industrial PLL, which clearly demonstrates the quality and potential of MT for AMS verification. |
08:31 CET | 10.4.3 | DATA-DRIVEN ELECTROSTATICS ANALYSIS BASED ON PHYSICS-CONSTRAINED DEEP LEARNING Speaker: Wentian Jin, University of California, Riverside, US Authors: Wentian Jin1, Shaoyi Peng1 and Sheldon Tan2 1University of California, Riverside, US; 2University of California at Riverside, US Abstract Computing the electric potential and electric field is important for modeling and analysis of VLSI chips and high-speed circuits. For instance, it is an important step in DC analysis for high-speed circuits as well as in dielectric reliability and capacitance extraction for VLSI interconnects. In this paper, we propose a new data-driven meshless 2D analysis method, called PCEsolve, for electric potential and electric fields based on a physics-constrained deep learning scheme. We show how to formulate the differential loss functions to consider the Laplace differential equations with voltage boundary conditions for typical electrostatic analysis problems so that the supervised learning process can be carried out. We apply the resulting PCEsolve solver to calculate electric potential and electric field for VLSI interconnects with complicated boundaries. We show the potential and limitations of physics-constrained deep learning for practical electrostatics analysis. Our study of purely label-free training (in which no information from an FEM solver is provided) shows that PCEsolve can get accurate results around the boundaries, but the accuracy degenerates in regions far away from the boundaries. To mitigate this problem, we explore adding some simulation data or labels at collocation points derived from FEM analysis, and the resulting PCEsolve is much more accurate across the whole solution domain. Numerical results demonstrate that PCEsolve achieves an average error rate of 3.6% on 64 cases with random boundary conditions and is 27.5x faster than COMSOL on test cases. The speedup can be further boosted to ~38000x in single-point estimations. We also study the impacts of weights on different components of loss functions to improve the model accuracy for both voltage and electric field. (A sketch of the physics-constrained loss idea follows this table.) |
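Paper 10.4.2 uses dropout as an inexpensive uncertainty estimate inside Bayesian optimization. A minimal PyTorch sketch of that mechanism, with the surrogate network (and its parasitic-graph embeddings) abstracted behind `model`; this is an illustration of Monte-Carlo dropout plus a standard expected-improvement acquisition, not the paper's exact pipeline.

```python
import torch

def mc_dropout_predict(model, x, n=32):
    """Monte-Carlo dropout: keep dropout active at inference and sample the
    surrogate n times to estimate predictive mean and uncertainty.
    (Assumes a dropout-only surrogate; BatchNorm layers would need care.)"""
    model.train()                      # keeps nn.Dropout layers stochastic
    with torch.no_grad():
        ys = torch.stack([model(x) for _ in range(n)])
    return ys.mean(dim=0), ys.std(dim=0)

def expected_improvement(mu, sigma, best, xi=0.01):
    """Standard EI acquisition (maximization) over the surrogate's mean/std."""
    from torch.distributions import Normal
    z = (mu - best - xi) / sigma.clamp_min(1e-9)
    n = Normal(0.0, 1.0)
    return (mu - best - xi) * n.cdf(z) + sigma * n.log_prob(z).exp()
```

In a sizing loop, one would score candidate transistor sizes with `expected_improvement` and simulate only the highest-scoring candidate each iteration.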
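The physics-constrained training behind PCEsolve (10.4.3) pairs a Laplace-equation residual with boundary conditions and, optionally, FEM labels at collocation points. A compact PyTorch sketch, assuming a network `net` mapping 2D coordinates to a scalar potential; the weighting of the loss terms (uniform here) is a simplification the paper explicitly studies.

```python
import torch

def laplace_residual(net, xy):
    """Residual of the 2-D Laplace equation u_xx + u_yy = 0, via autograd."""
    xy = xy.clone().requires_grad_(True)
    u = net(xy)
    g = torch.autograd.grad(u.sum(), xy, create_graph=True)[0]
    u_xx = torch.autograd.grad(g[:, 0].sum(), xy, create_graph=True)[0][:, 0]
    u_yy = torch.autograd.grad(g[:, 1].sum(), xy, create_graph=True)[0][:, 1]
    return u_xx + u_yy

def pce_loss(net, interior, bnd_pts, bnd_vals, fem_pts=None, fem_vals=None):
    """PDE residual + boundary mismatch (+ optional FEM labels)."""
    loss = laplace_residual(net, interior).pow(2).mean()
    loss = loss + (net(bnd_pts).squeeze(-1) - bnd_vals).pow(2).mean()
    if fem_pts is not None:            # extra labels at FEM collocation points
        loss = loss + (net(fem_pts).squeeze(-1) - fem_vals).pow(2).mean()
    return loss
```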
10.5 Coarse Grained Reconfigurable Architectures
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 08:00 CET - 08:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/sH3MveXQ9wGZGfzSh
Session chair:
Daniel Ziener, TU Ilmenau, DE
Session co-chair:
Michaela Blott, Xilinx, US
This session explores approaches for improving coarse-grained reconfigurable architectures, including an automated refinement approach for accelerators, an improved scheduling approach, and an architecture with extensions to support deep neural network workloads. Two IPs explore an optimised multiply-accumulate unit for deep learning and a custom accelerator for video encoding.
Time | Label | Presentation Title Authors |
---|---|---|
08:00 CET | 10.5.1 | AURORA: AUTOMATED REFINEMENT OF COARSE-GRAINED RECONFIGURABLE ACCELERATORS Speaker: Cheng Tan, Pacific Northwest National Lab, US Authors: Cheng Tan1, Chenhao Xie1, Ang Li1, Kevin Barker1 and Antonino Tumeo2 1Pacific Northwest National Lab, US; 2Pacific Northwest National Laboratory, US Abstract Coarse-grained reconfigurable arrays (CGRAs), loosely defined as arrays of functional units interconnected through a network-on-chip (NoC), provide higher flexibility than domain-specific ASIC accelerators while offering increased hardware efficiency with respect to fine-grained reconfigurable devices, such as Field Programmable Gate Arrays (FPGAs). Unfortunately, designing a CGRA for a specific application domain involves enormous software/hardware engineering effort (e.g., designing the CGRA, mapping operations onto the CGRA, etc.) and requires exploring a large design space (e.g., applying appropriate loop transformations to each application, specializing the reconfigurable processing elements of the CGRA, refining the network topology, deciding the size of the data memory, etc.). In this paper, we propose AURORA -- a software/hardware co-design framework to automatically synthesize an optimal CGRA given a set of applications of interest. |
08:15 CET | 10.5.2 | SUBGRAPH DECOUPLING AND RESCHEDULING FOR INCREASED UTILIZATION IN CGRA ARCHITECTURE Speaker: Chen Yin, Shanghai Jiao Tong University, CN Authors: Chen Yin, Qin Wang, Jianfei Jiang, Weiguang Sheng, Guanghui He, Zhigang Mao and Naifeng Jing, Shanghai Jiao Tong University, CN Abstract As coarse-grained reconfigurable array architectures (CGRAs) shift towards general-purpose use, complex control flows, such as nested loops, conditional branches and data dependences, may hamper them and reduce processing element (PE) array utilization by breaking the intact dataflow graph (DFG) into multiple regions under inconsistent control. This paper proposes subgraph decoupling and rescheduling, which decouples the inconsistent regions into control-independent subgraphs. Each subgraph can be rescheduled with zero-cost domino context switching and parallelized to fully utilize the PE resources. We then propose lightweight hardware changes to a general CGRA architecture to enable our design. The experimental results show that our proposal can improve performance and energy efficiency by 1.35x and 1.18x over a static-mapped CGRA (Plasticine), and by 1.27x and 1.45x over an instruction-driven CGRA (TIA). |
08:30 CET | IP8_5.1 | A 93 TOPS/WATT NEAR-MEMORY RECONFIGURABLE SAD ACCELERATOR FOR HEVC/AV1/JEM ENCODING Speaker: Jainaveen Sundaram Priya, Intel Corporation, US Authors: Jainaveen Sundaram Priya1, Srivatsa Rangachar Srinivasa2, Dileep Kurian3, Indranil Chakraborty4, Sirisha Kale1, Nilesh Jain1, Tanay Karnik5, Ravi Iyer5 and Anuradha Srinivasan6 1Intel Corporation, US; 2Intel Labs, US; 3Intel technologies, IN; 4Purdue University, US; 5Intel, US; 6Intel, IN Abstract Motion Estimation (ME) is a major bottleneck of a video encoding pipeline. This paper presents a low-power near-memory Sum of Absolute Differences (SAD) accelerator for ME. The accelerator is composed of 64 modular SAD Processing Elements (PEs) on a reconfigurable fabric, offering maximal parallelism to support traditional and futuristic Rate-Distortion-Optimization (RDO) schemes consistent with HEVC/AV1/JEM. The accelerator offers up to 55% speedup over state-of-the-art accelerators and a 7x speedup when compared to a 12-core Intel Xeon E5 processor. Our solution achieves 93 TOPS/Watt running at 500 MHz, capable of processing real-time 4K 30fps video. Synthesized in a 22nm process, the accelerator occupies 0.08 mm² and consumes 5.46 mW of dynamic power. (A software reference for the SAD kernel follows this table.) |
08:31 CET | IP8_5.2 | TRIPLE FIXED-POINT MAC UNIT FOR DEEP LEARNING Speaker: Madis Kerner, Taltech, EE Authors: Madis Kerner1, Kalle Tammemäe1, Jaan Raik2 and Thomas Hollstein2 1Taltech, EE; 2Tallinn University of Technology, EE Abstract Deep Learning (DL) algorithms have proved to be successful in various domains. Typically, the models use Floating-Point (FP) numeric formats and are executed on Graphical Processing Units (GPUs). However, Field Programmable Gate Arrays (FPGAs) are more energy-efficient and, therefore, a better platform for resource-constrained devices. As the FP design infers many FPGA resources, it is replaced with quantized fixed-point implementations in the state of the art. The loss of precision is mitigated by dynamically adjusting the radix point on network layers, reconfiguration, and re-training. In this paper, we present the first Triple Fixed-Point (TFxP) architecture, which provides the computational precision of FP while using significantly fewer hardware resources and does not need network re-training. Based on a comparison of FP and existing Fixed-Point (FxP) implementations, in combination with a detailed precision analysis of YOLOv2 weights and activation values, the novel TFxP format is introduced. |
08:32 CET | 10.5.3 | NP-CGRA: EXTENDING CGRAS FOR EFFICIENT PROCESSING OF LIGHT-WEIGHT DEEP NEURAL NETWORKS Speaker: Jongeun Lee, Dept. of Electrical Engineering, Ulsan National Institute of Science and Technology (UNIST), KR Authors: Jungi Lee1 and Jongeun Lee2 1UNIST, KR; 2Ulsan National Institute of Science and Technology (UNIST), KR Abstract Coarse-grained reconfigurable architectures (CGRAs) can provide both high energy efficiency and flexibility, making them well-suited for machine learning applications. However, previous work on CGRAs has very limited support for deep neural networks (DNNs), especially for recent light-weight models such as depthwise separable convolution (DSC), which are an important workload for mobile environments. In this paper, we propose a set of architecture extensions and a mapping scheme to greatly enhance CGRA performance for DSC kernels. Our experimental results using MobileNets demonstrate that our proposed CGRA enhancement can deliver an 8-18x improvement in area-delay product depending on layer type, over a baseline CGRA with a state-of-the-art CGRA compiler. Moreover, our proposed CGRA architecture can also speed up 3D convolution with efficiency similar to previous work, demonstrating the effectiveness of our architectural features beyond DSC layers. |
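The kernel IP8_5.1 accelerates, the sum of absolute differences (SAD) over candidate motion vectors, is easy to state in software; this reference sketch shows the exhaustive search that the 64 processing elements parallelize. Block size and search range are illustrative choices, not the accelerator's configuration.

```python
import numpy as np

def sad(block_a, block_b):
    """Sum of absolute differences between two pixel blocks."""
    return int(np.abs(block_a.astype(np.int32) - block_b.astype(np.int32)).sum())

def best_match(cur, ref, top, left, size=8, search=4):
    """Exhaustive motion search: evaluate SAD for every candidate displacement
    within +/- search pixels; returns (best motion vector, best SAD)."""
    best_mv, best_cost = None, 1 << 31
    blk = cur[top:top + size, left:left + size]
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= ref.shape[0] - size and 0 <= x <= ref.shape[1] - size:
                cost = sad(blk, ref[y:y + size, x:x + size])
                if cost < best_cost:
                    best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost
```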
10.6 Smart Cities, Internet of Everything, Smart Consumer Electronics
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 08:00 CET - 08:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/FsH8KA9oW9cyhdN5r
Session chair:
Fabrizio Lamberti, Politecnico di Torino, IT
Session co-chair:
Himanshu Thapliyal, University of Kentucky, US
This session covers a range of consumer electronics technologies relevant to the development of smart cities, supported by the ongoing Internet of Things revolution. In particular, presentations will tackle human-activity recognition, sensor fusion, device localization and tracking, as well as security aspects in vehicular networks, mostly backed by machine learning and deep learning solutions.
Time | Label | Presentation Title Authors |
---|---|---|
08:00 CET | 10.6.1 | (Best Paper Award Candidate) ORIGIN: ENABLING ON-DEVICE INTELLIGENCE FOR HUMAN ACTIVITY RECOGNITION USING ENERGY HARVESTING WIRELESS SENSOR NETWORKS Speaker: Cyan Subhra Mishra, Pennsylvania State University, US Authors: Cyan Subhra Mishra1, John (Jack) Sampson2, Mahmut Kandemir3 and Vijaykrishnan Narayanan4 1The Pennsylvania State University, US; 2Penn State, US; 3PSU, US; 4Penn State University, US Abstract There is an increasing demand for performing machine learning tasks, such as human activity recognition (HAR), on emerging ultra-low-power internet of things (IoT) platforms. Recent works show substantial efficiency boosts from performing inference tasks directly on the IoT nodes rather than merely transmitting raw sensor data. However, the computation and power demands of deep neural network (DNN) based inference pose significant challenges when executed on the nodes of an energy-harvesting wireless sensor network (EH-WSN). Moreover, managing inferences requiring responses from multiple energy-harvesting nodes imposes challenges at the system level in addition to the constraints at each node. This paper presents a novel scheduling policy along with an adaptive ensemble learner to efficiently perform HAR on a distributed energy-harvesting body area network. Our proposed policy, Origin, strategically ensures efficient and accurate individual inference execution at each sensor node by using a novel activity-aware scheduling approach. It also leverages the continuous nature of human activity when coordinating and aggregating results from all the sensor nodes to improve final classification accuracy. Further, Origin proposes an adaptive ensemble learner to personalize the optimizations based on each individual user. Experimental results using two different HAR datasets show Origin, while running on harvested energy, to be at least 2.5% more accurate than a classical battery-powered, energy-aware HAR classifier continuously operating at the same average power. |
08:15 CET | 10.6.2 | A DEEP LEARNING APPROACH OF SENSOR FUSION INFERENCE AT THE EDGE Speaker: Thomas Becnel, University of Utah, US Authors: Thomas Becnel and Pierre-Emmanuel Gaillardon, University of Utah, US Abstract The advent of large-scale urban sensor networks has enabled a paradigm shift in how we collect and interpret data. By equipping these sensor nodes with emerging low-power hardware accelerators, they become powerful edge devices, capable of locally inferring latent features and trends from their fused multivariate data. Unfortunately, traditional inference techniques are not well suited for operation in edge devices, or simply fail to capture many statistical aspects of these low-cost sensors. As a result, these methods struggle to accurately model nonlinear events. In this work, we propose a deep learning methodology that is able to infer unseen data by learning complex trends and the distribution of the fused time-series inputs. This novel hybrid architecture combines a multivariate Long Short-Term Memory (LSTM) branch and two convolutional branches to extract time-series trends as well as short-term features. By normalizing each input vector, we are able to magnify features and better distinguish trends between series. As a demonstration of the broad applicability of this technique, we use data from a currently deployed pollution monitoring network of low-cost sensors to infer hourly ozone concentrations at the device level. Results indicate that our technique greatly outperforms traditional linear regression techniques by 6x and state-of-the-art multivariate time-series techniques by 1.4x in mean squared error. Remarkably, we also show that inferred quantities can achieve lower variability than the primary sensors which produce the input data. (A sketch of such a hybrid model follows this table.) |
08:30 CET | IP9_2.1 | NEIGHBOR OBLIVIOUS LEARNING (NOBLE) FOR DEVICE LOCALIZATION AND TRACKING Speaker: Zichang Liu, Rice University, US Authors: Zichang Liu, Li Chou and Anshumali Shrivastava, Rice University, US Abstract On-device localization and tracking are increasingly crucial for various applications. Machine learning (ML) techniques are widely adopted along with the rapidly growing amount of data. However, during training, almost none of the ML techniques incorporate known structural information such as the floor plan, which can be especially useful in indoor or other structured environments. The problem is incredibly hard because the structural properties are not explicitly available, making most structural learning approaches inapplicable. We study our method through the intuitions of manifold learning. Whereas existing manifold methods utilize neighborhood information such as Euclidean distances, we quantize the output space to measure closeness on the structure. We propose Neighbor Oblivious Learning (NObLe) and demonstrate our approach's effectiveness on two applications, WiFi-based fingerprint localization and inertial measurement unit (IMU) based device tracking. We show that NObLe gives a significant improvement over state-of-the-art prediction accuracy. |
08:31 CET | IP9_2.2 | A LOW-COST BLE-BASED DISTANCE ESTIMATION, OCCUPANCY DETECTION, AND COUNTING SYSTEM Speaker: Florenc Demrozi, Computer Science Department, University of Verona, Italy, IT Authors: Florenc Demrozi1, Fabio Chiarani1 and Graziano Pravadelli2 1Computer Science Department, University of Verona, IT; 2University of Verona, IT Abstract This article presents a low-cost system for distance estimation, occupancy counting, and presence detection based on Bluetooth Low Energy radio signal variation patterns, which mitigates the limitations of existing approaches related to economic cost, privacy concerns, computational requirements, and lack of ubiquity. To assess the approach's effectiveness, exhaustive tests have been carried out on four different datasets by exploiting several pattern recognition models. |
08:32 CET | 10.6.3 | REAL-TIME DETECTION AND LOCALIZATION OF DENIAL-OF-SERVICE ATTACKS IN HETEROGENEOUS VEHICULAR NETWORKS Speaker: Meenu Rani Dey, Indian Institute of Technology, Guwahati, IN Authors: Meenu Rani Dey1, Moumita Patra1 and Prabhat Mishra2 1Indian Institute of Technology Guwahati, IN; 2University of Florida, US Abstract Vehicular communication has emerged as a powerful tool for providing a safe and comfortable driving experience for users. Long Term Evolution (LTE) supports and enhances the quality of vehicular communication due to its properties such as high data rate, spatial reuse, and low delay. However, the high mobility of vehicles introduces a wide variety of security threats, including Denial-of-Service (DoS) attacks. In this paper, we propose an effective solution for real-time detection and localization of DoS attacks in an LTE-based vehicular network with mobile network components (e.g., vehicles, femto access points, etc.). We consider malicious data transmission by vehicles in two ways -- using real identification (unintentional) and using fake identification. Our attack detection technique is based on a data packet counter and average packet delivery ratio, which helps detect attacks faster than traditional approaches. We use a triangulation method for localizing the attacker, and analyze the average packet delay incurred by vehicles by modelling the system as an M/M/m queue. Simulation results demonstrate that our proposed technique significantly outperforms state-of-the-art techniques. (The M/M/m delay formula is worked out after this table.) |
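The hybrid model of 10.6.2, an LSTM branch for trends plus convolutional branches for short-term features over normalized inputs, can be sketched in a few lines of PyTorch. Layer sizes and kernel widths are guesses for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HybridFusionNet(nn.Module):
    """Toy LSTM + dual-conv fusion model for multivariate time series."""
    def __init__(self, n_series: int, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(n_series, hidden, batch_first=True)
        self.conv_short = nn.Conv1d(n_series, hidden, kernel_size=3, padding=1)
        self.conv_long = nn.Conv1d(n_series, hidden, kernel_size=7, padding=3)
        self.head = nn.Linear(3 * hidden, 1)

    def forward(self, x):                       # x: (batch, time, series)
        # Normalize each series to magnify trends and features.
        x = (x - x.mean(dim=1, keepdim=True)) / (x.std(dim=1, keepdim=True) + 1e-6)
        trend = self.lstm(x)[0][:, -1]          # last LSTM hidden state
        xc = x.transpose(1, 2)                  # (batch, series, time) for convs
        short = self.conv_short(xc).mean(dim=2) # pooled short-term features
        long = self.conv_long(xc).mean(dim=2)
        return self.head(torch.cat([trend, short, long], dim=1))
```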
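Paper 10.6.3 models delay with an M/M/m queue, for which the mean sojourn time has a closed form via the Erlang-C formula. The helper below is the standard textbook result, included only to make the delay model concrete; it is not code from the paper.

```python
from math import factorial

def mmm_avg_delay(lam: float, mu: float, m: int) -> float:
    """Mean sojourn time W of an M/M/m queue.
    lam = arrival rate, mu = per-server service rate, m = number of servers."""
    a = lam / mu                       # offered load in Erlangs
    rho = a / m
    assert rho < 1, "queue is unstable"
    p0 = 1.0 / (sum(a**k / factorial(k) for k in range(m))
                + a**m / (factorial(m) * (1 - rho)))
    p_wait = a**m / (factorial(m) * (1 - rho)) * p0   # Erlang-C probability
    return p_wait / (m * mu - lam) + 1.0 / mu         # queueing + service time
```

As a sanity check, with m = 1 this reduces to the familiar M/M/1 result W = 1/(mu - lam).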
10.7 Software and architectural techniques for dependability
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 08:00 CET - 08:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/3AkYmayZ8dLRX3nnZ
Session chair:
Paolo Rech, UFRGS, BR
Session co-chair:
Alessandro Savino, Politecnico di Torino, IT
This session covers dependability solutions at different abstraction levels, from the architecture up to the software layer. The solutions span from traditional replication to approximation and are implemented at the compiler, operating-system and architecture levels, all with low overheads.
Time | Label | Presentation Title Authors |
---|---|---|
08:00 CET | 10.7.1 | CHITIN: A COMPREHENSIVE IN-THREAD INSTRUCTION REPLICATION TECHNIQUE AGAINST TRANSIENT FAULTS Speaker: Hwisoo So, Yonsei University, KR Authors: HwiSoo So1, Moslem Didehban2, Jinhyo Jung1, Aviral Shrivastava3 and Kyoungwoo Lee1 1Yonsei University, KR; 2Cadence Design Systems, US; 3Arizona State University, US Abstract Soft errors have become one of the most important design concerns due to drastic technology scaling. Software-based error detection techniques are attractive due to their flexibility and hardware independence. However, our in-depth analysis reveals that state-of-the-art techniques in the area cannot provide comprehensive fault coverage: i) their control-flow protection schemes provide incomplete redundancy of original instructions, ii) they do not protect function calls and returns, and iii) their instruction scheduling leaves many vulnerabilities open. In this paper, we propose CHITIN, a set of code transformations for soft-error resilience that adopts the load-back checking scheme of nZDC, an improved version of the SWIFT-like control-flow protection scheme, and contiguous scheduling of the original and redundant instructions, dramatically reducing vulnerability to soft errors that disrupt the control flow. Our fault injection experiments demonstrate that CHITIN can eliminate more than 89% of the silent data corruptions observed with state-of-the-art solutions. |
08:15 CET | 10.7.2 | ESTIMATION OF LINUX KERNEL EXECUTION PATH UNCERTAINTY FOR SAFETY SOFTWARE TEST COVERAGE Speaker: Imanol Allende, Ikerlan, ES Authors: Imanol Allende1, Nicholas Mc Guire2, Jon Perez3, Lisandro Gabriel Monsalve1, Javier Fernández4 and Roman Obermaisser5 1Ikerlan Technology Research Centre, ES; 2Opentech EDV Research GmbH, AT; 3Ikerlan, ES; 4Ikerlan Technological Research Center & Universitat Politécnica de Catalunya, ES; 5University of Siegen, DE Abstract With the advent of next-generation safety-related systems, different industries face multiple challenges in ensuring the safe operation of these systems according to traditional safety and assurance techniques. The increasing complexity that characterizes these systems hampers the maximum achievable test coverage during system verification and, consequently, often results in untested behaviors that hinder safety assurance and represent potential risk sources during system operation. In the context of paving the way towards quantifying the risks caused by software malfunction and, hence, towards the safety compliance of next-generation safety-related systems, this paper studies and provides a method to estimate the probability of Linux kernel execution paths that remain unobserved during the test campaign. (A classical estimator for such unseen-path probability mass is sketched after this table.) |
08:30 CET | IP8_6.1 | AUTOMATED SOFTWARE COMPILER TECHNIQUES TO PROVIDE FAULT TOLERANCE FOR REAL-TIME OPERATING SYSTEMS Speaker: Benjamin James, Brigham Young University, US Authors: Benjamin James and Jeffrey Goeders, Brigham Young University, US Abstract In this work we explore applying automated software fault-tolerance techniques to protect a Real-Time Operating System (RTOS) and present experimental results showing that the protected programs can achieve anywhere from a 1.3x to 257x improvement in Mean Work To Failure (MWTF). |
08:31 CET | IP8_6.2 | EXPLORING DEEP LEARNING FOR IN-FIELD FAULT DETECTION IN MICROPROCESSORS Speaker: Stefano Di Carlo, Politecnico di Torino, IT Authors: Simone Dutto, Alessandro Savino and Stefano Di Carlo, Politecnico di Torino, IT Abstract Nowadays, due to technology advancement, faults are increasingly compromising all kinds of computing machines, from servers to embedded systems. Recent advances in machine learning are opening new opportunities to achieve fault detection by inspecting hardware metrics, thus avoiding heavy software techniques or product-specific error-reporting mechanisms. This paper investigates the capability of different deep learning models, trained on data collected through simulation-based fault injection, to generalize over different software applications. |
08:32 CET | 10.7.3 | RELIABILITY-AWARE QUANTIZATION FOR ANTI-AGING NPUS Speaker: Sami Salamin, Karlsruhe Institute of Technology, DE Authors: Sami Salamin1, Georgios Zervakis1, Ourania Spantidi2, Iraklis Anagnostopoulos2, Joerg Henkel3 and Hussam Amrouch4 1Karlsruhe Institute of Technology, DE; 2Southern Illinois University Carbondale, US; 3KIT, DE; 4University of Stuttgart, DE Abstract Transistor aging is one of the major concerns challenging designers in advanced technologies. It profoundly degrades the reliability of circuits over their lifetime, as it slows down transistors, resulting in errors due to timing violations unless large guardbands are included, which leads to considerable performance losses. When it comes to Neural Processing Units (NPUs), where increasing the inference speed is the primary goal, such performance losses cannot be tolerated. In this work, we are the first to propose a reliability-aware quantization to eliminate aging effects in NPUs while completely removing guardbands. Our technique delivers a graceful inference accuracy degradation over time while compensating for the aging-induced delay increase of the NPU. Our evaluation, over ten state-of-the-art neural network architectures trained on the ImageNet dataset, demonstrates that for an entire lifetime of 10 years, the average accuracy loss is merely 3%. In the meantime, our technique achieves 23% higher performance due to the elimination of the aging guardband. |
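Paper 10.7.2 estimates the probability of execution paths that were never observed in testing. One classical estimator of such "missing mass" is Good-Turing: the fraction of observations that occurred exactly once. The paper develops its own method, so treat this purely as a framing of the problem, not as the paper's technique.

```python
from collections import Counter

def unseen_path_mass(observed_paths) -> float:
    """Good-Turing estimate of the total probability of execution paths that
    were never observed: (number of paths seen exactly once) / (total samples).
    observed_paths is any iterable of hashable path identifiers."""
    counts = Counter(observed_paths)
    singletons = sum(1 for c in counts.values() if c == 1)
    total = sum(counts.values())
    return singletons / max(1, total)

# Example: of 6 samples, paths "a" and "c" are singletons -> estimate 2/6.
print(unseen_path_mass(["a", "b", "b", "c", "b", "d", "d"]))
```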
11.2 Executive track: Silicon Photonics & Optical Computing
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:00 CET - 10:40 CET
Session chair:
Twan Korthorst, Synopsys, US
Session co-chair:
Giovanni De Micheli, EPFL, CH
The gap between memory and xPU bandwidth has been constant at 2-3X over the last two decades and, by now, there is a 1,000X gap between what is possible and what is required by applications such as AI. N5/N3 and Wafer-Scale Integration (WSI) represent the extreme attempt to keep classical Moore's Law afloat but, like fossil energy, this attempt is not sustainable; there is not much room for improvement beyond 12" wafers and 3-nanometer CMOS. We are approaching an inflection point: an innovation tsunami is looming. Silicon photonics may leverage the existing semiconductor technology and its supply chain to provide the foundation for Optical Computing, which is faster and more power-efficient than Electronic Computing, and whose roadmap neither relies on nanometer manufacturing nor suffers from its complexity. Most optical computing research aims at replacing electronic components with optical equivalents, building an optical digital computer processing binary data. While this approach appears to offer the best short-term commercial prospects for optical computing, since optical components could be integrated into traditional computers to produce an opto-electronic hybrid, these devices use 30% of their energy converting electronic energy into photons and back. More unconventional research aims at building all-optical computers that eliminate the need for optical-electrical-optical (OEO) conversions.
Time | Label | Presentation Title Authors |
---|---|---|
09:00 CET | 11.2.1 | SILICON PHOTONICS & OPTICAL COMPUTING: CHAIR’S INTRODUCTION Speaker and Author: Twan Korthorst, Synopsys, US Abstract A 10-minute introduction about silicon photonics and optical computing. |
09:10 CET | 11.2.2 | PROGRAMMABLE PHOTONIC CIRCUITS FOR LINEAR PROCESSING Speaker and Author: Wim Bogaerts, Ghent University & IMEC, BE Abstract Photonic circuits are an ideal platform for implementing complex interferometers, which perform linear transformations on coherent optical signals. Such linear transformations, which are computationally equivalent to matrix-vector multiplications or multiply-accumulate operations, are at the core of many signal processing algorithms, neuromorphic computing paradigms, and quantum information processing. These photonic circuits can be made programmable by electronically reconfiguring the weights and phases in the interferometer network, and this provides opportunities for massive acceleration of certain computational functions in the optical domain. We will discuss the current developments in such programmable photonics, and the technology stack needed to realize fully-functional accelerators or even general-purpose photonic processors. (A toy numerical stand-in for such a mesh is sketched after this table.) Speaker's Bio: Wim Bogaerts is a professor in Silicon Photonics at Ghent University and IMEC, Belgium. He pioneered wafer-scale fabrication of silicon photonics, and his current research focuses on the challenges of large-scale photonic circuits, including design challenges such as variability. He co-founded Luceda Photonics in 2014 to market the design tools developed at Ghent University and IMEC. Today, he is again a full-time academic researcher, with a consolidator grant from the European Research Council (ERC). His main research focus today is on large-scale programmable photonic circuits. He is a fellow of IEEE, and a senior member of OSA and SPIE. |
09:30 CET | 11.2.3 | UNLOCKING TRANSFORMATIVE AI WITH PHOTONIC COMPUTING Speaker and Author: Laurent Daudet, LightOn, FR Abstract Recent large-scale AI models, such as OpenAI's GPT-3 model for NLP, may have an even deeper economic and societal impact than Deep Learning had in the last decade. However, training such models requires massive amounts of computing resources, already challenging the capacity of some of the largest supercomputing architectures. In this talk I will present LightOn's view on how future AI hardware should be designed to address some of the hardest computing challenges, such as language models, recommender systems, or big science. In particular, I will demonstrate how LightOn Optical Processing Units (OPUs) can be seamlessly integrated into a variety of hybrid photonics/silicon pipelines implementing state-of-the-art Machine Learning algorithms. Speaker's Bio: Laurent Daudet is currently employed as CTO at LightOn, a startup he co-founded in 2016, where he manages cross-disciplinary R&D projects involving machine learning, optics, signal processing, electronics, and software engineering. Laurent is a recognized expert in signal processing and wave physics, and is currently on leave from his position of Professor of Physics at Paris Diderot University, Paris. Prior to that or in parallel, he has held various academic positions: fellow of the Institut Universitaire de France, associate professor at Universite Pierre et Marie Curie, Visiting Senior Lecturer at Queen Mary University of London, UK, Visiting Professor at the National Institute for Informatics in Tokyo, Japan. Laurent has authored or co-authored more than 200 scientific publications, has been a consultant to various small and large companies, and is a co-inventor in several patents. He is a graduate in physics from Ecole Normale Superieure in Paris, and holds a PhD in Applied Mathematics from Marseille University. |
09:50 CET | 11.2.4 | NO BOTTLENECKS ALLOWED Speaker and Author: Eyal Cohen, CogniFiber, IL Abstract Like many photonic computing start-ups, CogniFiber strives to accelerate computing by using photons instead of electrons. But in the general HW market, and AI HW in particular, the elephant in the room is still largely ignored. Embedding a light-speed chip in a system 1000-fold slower than it might accelerate specific phases and result in a 10x system-level acceleration, but such solutions do not fully utilize the advantages of photonic computing. At CogniFiber we took it upon ourselves to invent, design and build a "No-Bottleneck" system where all computations, from start to end, are done by light. This quasi-all-optical AI system will reach 1000x system-level acceleration in our coming products. In the talk we will present the conceptual difference between CogniFiber and other photonics endeavors and give an update on our recent achievements. Speaker's Bio: Dr. Eyal Cohen is the Co-Founder and CEO of CogniFiber. Inspired by the progress in photonics and recent developments in deep learning networks, Eyal and his partner prof. Zeev Zalevsky (Bar Ilan University) set out to harness the benefits of light-carried information in the new era of photonic computing. Following a unique set of inventions, they raised a $2.5M seed investment in early 2019 and rapidly built an outstanding R&D team that completed a successful PoC system with profound implications. CogniFiber is raising a funding round of $18M and will solidify its position as a leading company in photonic computing, with first products expected early 2023. Eyal has a B.Sc. in Electric Engineering and a B.A. in Biology (Technion, both Cum Laude), an M.Sc. and Ph.D. in Neuroscience (Weizmann Institute), and experience as a HW engineer (Oren Semiconductor, Mellanox, Saifun Semiconductor) and in ML algorithms (M.S. Tech). Eyal has published several papers both as a neuroscientist and as a photonics researcher. |
10:10 CET | 11.2.5 | LIVE JOINT Q&A Authors: Twan Korthorst1, Wim Bogaerts2, Laurent Daudet3 and Eyal Cohen4 1Synopsys, US; 2Ghent University & IMEC, BE; 3LightOn, FR; 4CogniFiber, IL Abstract 20 minutes of live joint question and answer time for interaction among speakers and audience. |
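Wim Bogaerts' point in 11.2.2, that reconfigurable interferometer meshes perform linear transformations, can be mimicked numerically: cascaded 2x2 rotations (a simplified, real-valued stand-in for phase-programmed MZI stages) compose into an arbitrary orthogonal matrix applied to the input vector. This is a conceptual sketch only; real meshes work on complex amplitudes.

```python
import numpy as np

def mesh_matvec(stages, x):
    """Apply a cascade of 2x2 rotations to adjacent channels of x.
    stages is a list of (channel_index, theta) pairs; with enough stages such
    a mesh realizes any orthogonal transform (Givens decomposition), which is
    the numerical analogue of programming weights/phases in the mesh."""
    x = np.array(x, dtype=float)
    for i, theta in stages:                   # each stage mixes channels i, i+1
        c, s = np.cos(theta), np.sin(theta)
        xi, xj = x[i], x[i + 1]
        x[i], x[i + 1] = c * xi - s * xj, s * xi + c * xj
    return x

# Example: three stages acting on a 3-channel input.
print(mesh_matvec([(0, 0.3), (1, 1.1), (0, -0.5)], [1.0, 0.0, 0.0]))
```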
IP.ASD_1 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/DoZTZhX3vc3YiysKh
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Time | Label | Presentation Title Authors |
---|---|---|
IP.ASD_1.1 | DECENTRALIZED AUTONOMOUS ARCHITECTURE FOR RESILIENT CYBER-PHYSICAL PRODUCTION SYSTEMS Speaker: Laurin Prenzel, TU Munich, DE Authors: Laurin Prenzel and Sebastian Steinhorst, TU Munich, DE Abstract Real-time decision-making is a key element in the transition from Reconfigurable Manufacturing Systems to Autonomous Manufacturing Systems. In Cyber-Physical Production Systems (CPPS) and Cloud Manufacturing, most decision-making algorithms are either centralized, creating vulnerabilities to failures, or decentralized, struggling to reach the performance of their centralized counterparts. In this paper, we combine the performance of centralized optimization algorithms with the resilience of a decentralized consensus. We propose a novel autonomous system architecture for CPPS featuring automatic production plan generation, functional validation, and a two-stage consensus algorithm, combining a majority vote on safety and optimality with a unanimous vote on feasibility and authenticity. The architecture is implemented in a simulation framework. In a case study, we exhibit the timing behavior of the configuration procedure and the subsequent reconfiguration following a device failure, showing the feasibility of a consensus-based decision-making process. (A toy version of the two-stage vote is sketched after this table.) |
IP.ASD_1.2 | PROVABLY ROBUST MONITORING OF NEURON ACTIVATION PATTERNS Speaker and Author: Chih-Hong Cheng, DENSO AUTOMOTIVE Deutschland GmbH, DE Abstract For deep neural networks (DNNs) to be used in safety-critical autonomous driving tasks, it is desirable to monitor at operation time whether the input to the DNN is similar to the data used in DNN training. While recent results in monitoring DNN activation patterns provide a sound guarantee by building an abstraction out of the training data set, reducing false positives due to slight input perturbations has been an issue towards successfully adopting the technique. We address this challenge by integrating formal symbolic reasoning inside the monitor construction process. The algorithm performs a sound worst-case estimate of neuron values with inputs (or features) subject to perturbation before the abstraction function is applied to build the monitor. The provable robustness is further generalized to cases where monitoring a single neuron can use more than one bit, implying that one can record activation patterns with a fine-grained decision on the neuron value interval. (A toy interval-based activation monitor is sketched after this table.) |
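A toy version of the two-stage consensus in IP.ASD_1.1: a majority must judge a production plan safe and optimal, while feasibility and authenticity require unanimity. The node methods used here are hypothetical names, not the paper's API.

```python
def two_stage_vote(nodes, plan) -> bool:
    """Accept a plan only if a majority finds it safe and optimal AND every
    node finds it feasible and authentic (two-stage consensus sketch).
    `safe_and_optimal` / `feasible_and_authentic` are assumed node methods."""
    majority_ok = sum(n.safe_and_optimal(plan) for n in nodes) > len(nodes) / 2
    unanimous_ok = all(n.feasible_and_authentic(plan) for n in nodes)
    return majority_ok and unanimous_ok
```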
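The abstraction-based monitoring in IP.ASD_1.2 can be caricatured with per-neuron intervals recorded on the training set. The paper's contribution is making the abstraction sound under input perturbation via symbolic reasoning, which this sketch deliberately does not attempt.

```python
import numpy as np

class IntervalMonitor:
    """Record, per monitored neuron, the [min, max] activation seen on the
    training set; at run time, flag inputs whose activations fall outside."""
    def fit(self, activations: np.ndarray):     # shape: (samples, neurons)
        self.lo = activations.min(axis=0)
        self.hi = activations.max(axis=0)
        return self

    def flag(self, act: np.ndarray, eps: float = 0.0) -> bool:
        # eps widens the intervals to trade false positives for coverage.
        return bool(np.any((act < self.lo - eps) | (act > self.hi + eps)))

train_acts = np.random.rand(1000, 16)           # stand-in training activations
monitor = IntervalMonitor().fit(train_acts)
print(monitor.flag(np.full(16, 2.0)))           # clearly out-of-distribution
```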
IP8_1 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/TfXgowugJaNsDihjz
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Time | Label | Presentation Title Authors |
---|---|---|
IP8_1.1 | PREDICTION OF THERMAL HAZARDS IN A REAL DATACENTER ROOM USING TEMPORAL CONVOLUTIONAL NETWORKS Speaker: Mohsen Seyedkazemi Ardebili, University of Bologna, IT Authors: Mohsen Seyedkazemi Ardebili1, Marcello Zanghieri1, Alessio Burrello2, Francesco Beneventi3, Andrea Acquaviva4, Luca Benini5 and Andrea Bartolini1 1University of Bologna, IT; 2Department of Electric and Electronic Engineering, University of Bologna, IT; 3DEI - University of Bologna, IT; 4Politecnico di Torino, IT; 5Università di Bologna and ETH Zurich, IT Abstract Datacenters play a vital role in today's society. At large, a datacenter room is a complex controlled environment composed of thousands of computing nodes, which consume kilowatts of power. To dissipate this power, forced air/liquid flow is employed, at a cost of millions of euros per year. Reducing this cost involves free cooling and average-case design, which can create cooling shortages and thermal hazards. When a thermal hazard happens, the system administrators and the facility manager must stop production to avoid IT equipment damage and wear-out. In this paper, we study thermal hazard signatures in a Tier-0 datacenter room's monitored data over a full year of production. We define a set of rules for detecting thermal hazards based on the inlet and outlet temperatures of all nodes in the room. We then propose a custom Temporal Convolutional Network (TCN) to predict the hazards in advance. The results show that our TCN can predict thermal hazards with an F1-score of 0.98 on a randomly sampled test set. When causality is enforced between the training and validation sets, the F1-score drops to 0.74, calling for in-place online re-training of the network and motivating further research in this context. |
IP8_1.2 | SYSTEM LEVEL VERIFICATION OF PHASE-LOCKED LOOP USING METAMORPHIC RELATIONS Speaker: Muhammad Hassan, DFKI GmbH, DE Authors: Muhammad Hassan1, Daniel Grosse2 and Rolf Drechsler3 1Cyber Physical Systems, DFKI, DE; 2Johannes Kepler University Linz, AT; 3University of Bremen/DFKI, DE Abstract In this paper we build on Metamorphic Testing (MT), a verification technique which has been employed very successfully in the software domain. The core idea is to uncover bugs by relating consecutive executions of the program under test. Recently, MT has been applied successfully to the verification of Radio Frequency (RF) amplifiers at the system level as well. However, this is clearly not sufficient, as the true complexity stems from Analog/Mixed-Signal (AMS) systems. In this paper, we go beyond pure analog systems, i.e., we expand MT to verify AMS systems. As a challenging AMS system, we consider an industrial PLL. We devise a set of eight generic Metamorphic Relations (MRs). These MRs allow verifying the PLL behavior at both the component level and the system level. To this end, we have created MRs considering analog-to-digital as well as digital-to-digital behavior. We found a critical bug in the industrial PLL, which clearly demonstrates the quality and potential of MT for AMS verification. (A generic metamorphic-test harness is sketched after this table.) |
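A generic metamorphic-test harness matching the idea in IP8_1.2: run the model twice, once on a transformed stimulus, and check a relation between the two outputs, with no golden reference needed. The example relation and the trivial PLL stand-in model are hypothetical illustrations; the paper's eight MRs are not spelled out in the abstract.

```python
def check_mr(simulate, transform_input, relate_outputs, stimulus) -> bool:
    """Generic metamorphic test: relate the outputs of a base run and a
    follow-up run on a transformed stimulus."""
    base = simulate(stimulus)
    follow_up = simulate(transform_input(stimulus))
    return relate_outputs(base, follow_up)

def run_pll(f_ref: float, multiplier: int = 8) -> float:
    """Stand-in behavioral model: an ideal locked PLL multiplies f_ref."""
    return multiplier * f_ref

# Hypothetical relation: scaling the reference frequency should scale the
# locked output frequency by the same factor.
ok = check_mr(
    simulate=run_pll,
    transform_input=lambda f_ref: 2.0 * f_ref,
    relate_outputs=lambda f1, f2: abs(f2 - 2.0 * f1) <= 1e-9 * abs(f2),
    stimulus=10e6,
)
print(ok)
```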
IP8_2 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/5BFi7JiSbihLBStp3
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Time | Label | Presentation Title Authors |
---|---|---|
IP8_2.1 | FORMULATION OF DESIGN SPACE EXPLORATION PROBLEMS BY COMPOSABLE DESIGN SPACE IDENTIFICATION Speaker: Rodolfo Jordão, KTH Royal Institute of Technology, SE Authors: Rodolfo Jordao, Ingo Sander and Matthias Becker, KTH Royal Institute of Technology, SE Abstract Design space exploration (DSE) is a key activity in embedded system design methodologies and can be supported by well-defined models of computation (MoCs) and predictable platform architectures. The original design model, covering the application models, platform models and design constraints, needs to be converted into a form analyzable by computer-aided decision procedures such as mathematical programming or genetic algorithms. This conversion is the process of design space identification (DSI), which becomes very challenging if the design domain comprises several MoCs and platforms. For a systematic solution to this problem, separation of concerns between the design domain and the decision domain is of key importance. We propose in this paper a systematic DSI scheme that is (a) composable, as it enables the stepwise and simultaneous extension of both design and decision domain, and (b) tuneable, because it also enables different DSE solving techniques given the same design model. We exemplify this DSI scheme by an illustrative example that demonstrates the mechanisms for composition and tuning. Additionally, we show how different compositions can lead to the same decision model as an important property of this DSI scheme. (A toy decision model of this kind is sketched after this table.) |
|
IP8_2.2 | RTL DESIGN FRAMEWORK FOR EMBEDDED PROCESSOR BY USING C++ DESCRIPTION Speaker: Eiji Yoshiya, Information and Communications Engineering, Tokyo Institute of Technology, JP Authors: Eiji Yoshiya, Tomoya Nakanishi and Tsuyoshi Isshiki, Tokyo Institute of Technology, JP Abstract In this paper, we propose a method to directly describe the RTL structure of a pipelined RISC-V processor with cache, memory management unit (MMU) and AXI bus interface using the C++ language. This processor C++ model serves as a near cycle-accurate simulation model of the RISC-V core, while our C2RTL framework translates the processor C++ model into a cycle-accurate RTL description in Verilog-HDL and an RTL-equivalent C model. Our design methodology is unique compared to other existing methodologies since both the simulation model and the RTL model are derived from the same C++ source, which greatly simplifies the design verification and optimization processes. The effectiveness of our design methodology is demonstrated on a RISC-V processor that runs the Linux OS on an FPGA board, as well as by the significantly shorter simulation time of the original C++ processor model and the RTL-equivalent C model compared to a commercial RTL simulator. |
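As a down-to-earth illustration of the kind of decision model a design space identification step might hand to a solver (IP8_2.1 above), here is a brute-force toy in Python. All task workloads, core speeds, and energy weights are invented; real DSE flows would use mathematical programming or genetic algorithms instead of enumeration.

```python
# Toy DSE decision model (illustrative; all numbers are invented).
from itertools import product

tasks = {"t1": 4.0, "t2": 3.0, "t3": 5.0}    # workload units per task
cores = {"big": 2.0, "little": 1.0}          # relative core speeds
deadline = 8.0                               # makespan constraint
energy_weight = {"big": 2.5, "little": 1.0}  # cost of busy time per core type

best = None
for mapping in product(cores, repeat=len(tasks)):   # every task-to-core choice
    busy = {c: 0.0 for c in cores}
    for task, core in zip(tasks, mapping):
        busy[core] += tasks[task] / cores[core]
    makespan = max(busy.values())
    energy = sum(busy[c] * energy_weight[c] for c in cores)
    if makespan <= deadline and (best is None or energy < best[0]):
        best = (energy, dict(zip(tasks, mapping)))

print("cheapest feasible mapping:", best)
```

Even this toy shows the separation of concerns the paper argues for: the design domain (tasks, cores, constraints) is stated once, while the decision procedure beneath it could be swapped out.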
IP8_3 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/7nyhLxRAuHBfPuecR
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Time | Label | Presentation Title Authors |
---|---|---|
IP8_3.1 | SEQUENTIAL LOGIC ENCRYPTION AGAINST MODEL CHECKING ATTACK Speaker: Amin Rezaei, California State University, Long Beach, US Authors: Amin Rezaei1 and Hai Zhou2 1California State University, Long Beach, US; 2Northwestern University, US Abstract Due to high IC design costs and the emergence of countless untrusted foundries, logic encryption has been taken into consideration more than ever. In state-of-the-art logic encryption works, considerable performance is sacrificed to guarantee security against both the SAT-based and the removal attacks. However, the SAT-based attack cannot decrypt sequential circuits if the scan chain is protected or if unreachable-states encryption is adopted. Instead, these security schemes can be defeated by the model checking attack, which searches iteratively for different input sequences to put the activated IC into the desired reachable state. In this paper, we propose a practical logic encryption approach to defend against the model checking attack on sequential circuits. The robustness of the proposed approach is demonstrated by experiments on around fifty benchmarks. (The basic key-gate idea behind logic locking is sketched after this table.) |
|
IP8_3.2 | RISK-AWARE COST-EFFECTIVE DESIGN METHODOLOGY FOR INTEGRATED CIRCUIT LOCKING Speaker: Yinghua Hu, University of Southern California, US Authors: Yinghua Hu, Kaixin Yang, Subhajit Dutta Chowdhury and Pierluigi Nuzzo, University of Southern California, US Abstract We introduce a systematic framework for logic locking of integrated circuits based on the analysis of the sources of information leakage from both the circuit and the locking scheme and their formalization into a notion of risk that can guide the design against existing and possible future attacks. We further propose a two-level optimization-based methodology to generate locking strategies minimizing a cost function and balancing security, risk, and implementation overhead, out of a collection of locking primitives. Optimization results on a set of case studies show the potential of layering multiple locking primitives to provide high security at significantly lower risk. |
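Both papers in this table build on key-based locking. Purely as background, the sketch below shows the textbook idea of inserting XOR key gates so that only the correct key restores the original function; the papers' actual schemes (unreachable-state encryption, risk-driven layering of locking primitives) are far more elaborate.

```python
# Textbook XOR key-gate locking (illustrative background, not IP8_3.1/IP8_3.2).
def original(a, b, c):
    return (a & b) ^ c

def locked(a, b, c, k0, k1):
    n1 = (a & b) ^ k0          # key gate on an internal net
    return (n1 ^ c) ^ k1       # key gate on the output

CORRECT_KEY = (0, 0)           # identity key for this toy; real keys are secret

patterns = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
assert all(locked(*p, *CORRECT_KEY) == original(*p) for p in patterns)

wrong = sum(locked(*p, 1, 0) != original(*p) for p in patterns)
print(f"wrong key corrupts {wrong}/8 input patterns")   # 8/8 for this key choice
```

Attacks such as the SAT-based and model-checking attacks try to recover the key from an activated chip, which is exactly the threat model the two papers harden against.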
IP8_4 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/AjWT3m2j7jagtK8w5
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Time | Label | Presentation Title Authors |
---|---|---|
IP8_4.1 | THERMAL-AWARE DESIGN AND MANAGEMENT OF EMBEDDED REAL-TIME SYSTEMS Speaker and Author: Youngmoon Lee, Hanyang University, KR Abstract Modern embedded systems face challenges in managing on-chip temperature as they are increasingly realized in powerful system-on-chips. This paper presents thermal-aware design and management of embedded systems by tightly coupling two mechanisms: a thermal-aware utilization bound and real-time dynamic thermal management. The former provides the processor utilization upper bound to meet the chip temperature constraint, which depends not only on the system configurations and workloads but also on the chip's cooling capacity and environment. The latter adaptively optimizes the rates of individual task executions subject to the thermal-aware utilization bound. Our experiments on an automotive controller validate the thermal-aware utilization bound and show an 18.2% improvement in system utilization compared with existing approaches. (A toy rate-scaling sketch follows this table.) |
|
IP8_4.2 | OPTIMIZED MULTI-MEMRISTOR MODEL BASED LOW ENERGY AND RESILIENT CURRENT-MODE MULTIPLIER DESIGN Speaker: Shengqi Yu, Newcastle University, GB Authors: Shengqi Yu1, Rishad Shafik2, Thanasin Bunnam2, Kaiyun Chen2 and Alex Yakovlev2 1Newcastle University, GB; 2Newcastle University, GB Abstract Multipliers are central to modern arithmetic-heavy applications, such as signal processing and artificial intelligence (AI). However, the complex logic chain in conventional multipliers, particularly due to cascaded carry propagation circuits, contributes to high energy and performance costs. This paper proposes a novel current-mode multiplier design that reduces the carry propagation chain and improves the current amplification. Fundamental to this design is a one-transistor multiple-memristors (1TxM) cell architecture. In each cell, the transistor can be switched ON/OFF to determine the cell selection, while the memristor states determine the corresponding cell output current when selected. The high/low resistive states as well as the biasing configurations of each memristor are suitably optimized through a new memristor model. Depending on the significance of the cell current path, the number of memristors in each cell is determined to achieve the required amplification. Consequently, the design reduces the need for current mirror circuits in each current path, while also ensuring high resilience in transitional bias voltages. Parallel cell currents are then directed to a common current accumulation path to generate the multiplier output without requiring any carry propagation chain. We carried out a wide range of experiments to extensively validate our multiplier design in the Cadence Virtuoso analogue design environment for functional and parametric properties. The results show that the proposed multiplier reduces latency by up to 84.9% and energy cost by up to 98.5% when compared with recently proposed approaches. |
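To show what enforcing a utilization upper bound can look like in software (the second of the two coupled mechanisms in IP8_4.1), here is a toy rate-scaling sketch. The task set and the bound value are invented; the paper derives its bound from the chip's thermal model rather than picking a constant.

```python
# Toy rate scaling under a utilization bound (illustrative; numbers invented).
tasks = [("control", 2.0, 100.0),     # (name, WCET in ms, rate in Hz)
         ("logging", 5.0, 50.0)]
U_THERMAL = 0.25                      # hypothetical thermal-aware bound

util = sum(wcet / 1000.0 * rate for _, wcet, rate in tasks)
if util > U_THERMAL:                  # uniformly slow task rates to fit the bound
    scale = U_THERMAL / util
    tasks = [(name, wcet, rate * scale) for name, wcet, rate in tasks]

print("scaled rates (Hz):", [(name, round(rate, 1)) for name, _, rate in tasks])
```

The paper's contribution sits at both ends of this loop: deriving a bound that actually guarantees the temperature constraint, and adapting per-task rates instead of scaling them uniformly.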
IP8_5 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/jf88jCfEaHLEWb8zX
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Time | Label | Presentation Title Authors |
---|---|---|
IP8_5.1 | A 93 TOPS/WATT NEAR-MEMORY RECONFIGURABLE SAD ACCELERATOR FOR HEVC/AV1/JEM ENCODING Speaker: Jainaveen Sundaram Priya, Intel Corporation, US Authors: Jainaveen Sundaram Priya1, Srivatsa Rangachar Srinivasa2, Dileep Kurian3, Indranil Chakraborty4, Sirisha Kale1, Nilesh Jain1, Tanay Karnik5, Ravi Iyer5 and Anuradha Srinivasan6 1Intel Corporation, US; 2Intel Labs, US; 3Intel Technologies, IN; 4Purdue University, US; 5Intel, US; 6Intel, IN Abstract Motion Estimation (ME) is a major bottleneck of the video encoding pipeline. This paper presents a low-power near-memory Sum of Absolute Differences (SAD) accelerator for ME. The accelerator is composed of 64 modular SAD Processing Elements (PEs) on a reconfigurable fabric, offering maximal parallelism to support traditional and futuristic Rate-Distortion-Optimization (RDO) schemes consistent with HEVC/AV1/JEM. The accelerator offers up to 55% speedup over state-of-the-art accelerators and a 7x speedup when compared to a 12-core Intel Xeon E5 processor. Our solution achieves 93 TOPS/Watt running at a 500 MHz frequency, capable of processing real-time 4K 30fps video. Synthesized in a 22nm process, the accelerator occupies 0.08mm2 and consumes 5.46mW dynamic power. (A plain-software SAD reference is sketched after this table.) |
|
IP8_5.2 | TRIPLE FIXED-POINT MAC UNIT FOR DEEP LEARNING Speaker: Madis Kerner, Taltech, EE Authors: Madis Kerner1, Kalle Tammemäe1, Jaan Raik2 and Thomas Hollstein2 1Taltech, EE; 2Tallinn University of Technology, EE Abstract Deep Learning (DL) algorithms have proved to be successful in various domains. Typically, the models use Floating-Point (FP) numeric formats and are executed on Graphics Processing Units (GPUs). However, Field Programmable Gate Arrays (FPGAs) are more energy-efficient and, therefore, a better platform for resource-constrained devices. Because an FP design consumes many FPGA resources, it is replaced with quantized fixed-point implementations in the state of the art. The loss of precision is mitigated by dynamically adjusting the radix point on network layers, reconfiguration, and re-training. In this paper, we present the first Triple Fixed-Point (TFxP) architecture, which provides the computational precision of FP while using significantly fewer hardware resources and does not need network re-training. Based on a comparison of FP and existing Fixed-Point (FxP) implementations, in combination with a detailed precision analysis of YOLOv2 weights and activation values, the novel TFxP format is introduced. |
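The kernel that IP8_5.1's 64 processing elements parallelize is simple enough to state in a few lines; here is a plain NumPy reference of SAD-based block matching. The frame size and search window are arbitrary.

```python
# Plain-software SAD block matching (reference semantics, not the accelerator).
import numpy as np

def sad(block, candidate):
    return int(np.abs(block.astype(np.int32) - candidate.astype(np.int32)).sum())

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (64, 64), dtype=np.uint8)   # reference frame
cur = np.roll(ref, (2, 3), axis=(0, 1))                # current frame, shifted

block = cur[16:24, 16:24]                              # 8x8 block to match
best = min((sad(block, ref[16 + dy:24 + dy, 16 + dx:24 + dx]), (dy, dx))
           for dy in range(-4, 5) for dx in range(-4, 5))
print("best motion vector:", best[1])                  # (-2, -3) for this shift
```

Every candidate SAD is independent of the others, which is why the operation maps so naturally onto an array of parallel processing elements placed near memory.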
IP8_6 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:00 CET - 09:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/kJomE5nL9WxaaioJf
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session
Time | Label | Presentation Title Authors |
---|---|---|
IP8_6.1 | AUTOMATED SOFTWARE COMPILER TECHNIQUES TO PROVIDE FAULT TOLERANCE FOR REAL-TIME OPERATING SYSTEMS Speaker: Benjamin James, Brigham Young University, US Authors: Benjamin James and Jeffrey Goeders, Brigham Young University, US Abstract In this work, we explore applying automated software fault-tolerance techniques to protect a Real-Time Operating System (RTOS) and present experimental results showing that the protected programs can achieve anywhere from a 1.3x to a 257x improvement in mean work to failure (MWTF). |
|
IP8_6.2 | EXPLORING DEEP LEARNING FOR IN-FIELD FAULT DETECTION IN MICROPROCESSORS Speaker: Stefano Di Carlo, Politecnico di Torino, IT Authors: Simone Dutto, Alessandro Savino and Stefano Di Carlo, Politecnico di Torino, IT Abstract Nowadays, due to technology enhancement, faults are increasingly compromising all kinds of computing machines, from servers to embedded systems. Recent advances in machine learning are opening new opportunities to achieve fault detection by inspecting hardware metrics, thus avoiding the use of heavy software techniques or product-specific error-reporting mechanisms. This paper investigates the capability of different deep learning models trained on data collected through simulation-based fault injection to generalize over different software applications. (A minimal bit-flip fault injector is sketched after this table.) |
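Since IP8_6.2 trains its detectors on data produced by simulation-based fault injection, a minimal example of such an injector may help; the sketch below flips one bit in the IEEE-754 encoding of a float32 value. What gets injected (array values here, microarchitectural state in the paper) is a modeling choice, so treat this as an assumption-laden toy.

```python
# Single-bit-flip injector for float32 values (illustrative fault model).
import random
import struct
import numpy as np

def flip_bit(value, bit):
    """Flip one bit of a float32 value via its IEEE-754 bit pattern."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", float(value)))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped

values = np.float32([0.5, -1.25, 3.0])
i, b = random.randrange(values.size), random.randrange(32)
faulty = values.copy()
faulty[i] = flip_bit(values[i], b)
print(f"flipped bit {b} of element {i}: {values[i]} -> {faulty[i]}")
```

Sweeping the injection site and recording the observable hardware metrics afterwards is what yields the labeled training data such studies rely on.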
11.1 Safety Assurance of Autonomous Vehicles
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:30 CET - 10:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/WymRmxKbnn387f8Di
Session chair:
Sebastian Steinhorst, TU Munich, DE
Session co-chair:
Simon Schliecker, Volkswagen AG, DE
Organizers:
Rolf Ernst, TU Braunschweig, DE
Selma Saidi, TU Dortmund, DE
Safety of autonomous vehicles is a core requirement for their social acceptance. Hence, this session introduces three technical perspectives on this important field. The first paper presents a statistics-based view on risk taking when passing by pedestrians such that automated decisions can be taken with probabilistic reasoning. The second paper proposes a hardening of image classification for highly automated driving scenarios by identifying the similarity between target classes. The third paper improves the computational efficiency of safety verification of deep neural networks by reusing existing proof artifacts.
Time | Label | Presentation Title Authors |
---|---|---|
09:30 CET | 11.1.1 | AUTOMATED DRIVING SAFETY - THE ART OF CONSCIOUS RISK TAKING - MINIMUM LATERAL DISTANCES TO PEDESTRIANS Speaker: Bert Böddeker, private, DE Authors: Bert Böddeker1, Wilhard von Wendorff2, Nam Nguyen3, Peter Diehl4, Roland Meertens5 and Rolf Johannson6 1private, DE; 2SGS-TÜV Saar GmbH, DE; 3Hochschule München für angewandte Wissenschaften, DE; 4Private, DE; 5private, NL; 6Private, SE Abstract The announced release dates for Automated Driving Systems (ADS) with conditional (SAE-L3) and high (SAE-L4) levels of automation according to SAE J3016 are getting closer. Still, there is no established state of the art for proving the safety of these systems. The ISO 26262 for automotive functional safety is still valid for these systems but only covers risks from malfunctions of electric and electronic (E/E) systems. A framework for considering issues caused by weaknesses of the intended functionality itself is standardized in the upcoming release of ISO 21448 - Safety of the Intended Functionality (SOTIF). Rich experience regarding limitations of the safety performance of complex sensors can be found in this standard. In this paper, we highlight another aspect of SOTIF that becomes important for higher levels of automation, especially in urban areas: ‘conscious risk taking’. In traditional automotive systems, conflicting goal resolutions are generally left to the car driver. With SAE-level 3 or, at the latest, SAE-level 4 ADS, the driver is not available for decisions anymore. Even ‘safe drivers’ do not use the safest possible driving behavior. In the example of occlusions next to the street, a driver balances the risk of occluded pedestrians against the speed of the traffic flow. Our aim is to make such decisions explicit and sufficiently safe. Using the example of crossing pedestrians, we show how statistics can be used to derive a conscious quantitative risk-based decision from a previously defined acceptance criterion. The acceptance criterion is derived from accident statistics involving pedestrians. (A toy risk computation in this spirit is sketched after this table.) |
09:45 CET | 11.1.2 | ON SAFETY ASSURANCE CASE FOR DEEP LEARNING BASED IMAGE CLASSIFICATION IN HIGHLY AUTOMATED DRIVING Speaker: Himanshu Agarwal, HELLA GmbH & Co. KGaA, Lippstadt, Germany and Carl von Ossietzky University Oldenburg, Germany, DE Authors: Himanshu Agarwal1, Rafal Dorociak2 and Achim Rettberg3 1HELLA GmbH & Co. KGaA and Carl von Ossietzky University Oldenburg, DE; 2HELLA GmbH & Co. KGaA, DE; 3University of Applied Science Hamm-Lippstadt & University Oldenburg, DE Abstract Assessing the overall accuracy of a deep learning classifier is not a sufficient criterion to argue for the safety of classification-based functions in highly automated driving. The causes of deviation from the intended functionality must also be rigorously assessed. In the context of functions related to image classification, one of the causes can be the failure to take into account, during implementation, the classifier’s vulnerability to misclassification due to high similarity between the target classes. In this paper, we emphasize that while developing the safety assurance case for such functions, the argumentation over the appropriate implementation of the functionality must also address the vulnerability to misclassification due to class similarities. Using the traffic sign classification function as our case study, we propose to aid the development of its argumentation by: (a) conducting a systematic investigation of the similarity between the target classes, (b) assigning a corresponding classifier vulnerability rating to every possible misclassification, and (c) ensuring that the claims against the misclassifications that induce higher risk (scored on the basis of vulnerability and severity) are supported with more compelling sub-goals and evidence as compared to the claims against misclassifications that induce lower risk. |
10:00 CET | 11.1.3 | CONTINUOUS SAFETY VERIFICATION OF NEURAL NETWORKS Speaker: Rongjie Yan, Institute of Software, Chinese Academy of Sciences, CN Authors: Chih-Hong Cheng1 and Rongjie Yan2 1DENSO AUTOMOTIVE Deutschland GmbH, Eching, Germany, DE; 2Institute of Software, Chinese Academy of Sciences, CN Abstract Deploying deep neural networks (DNNs) as core functions in autonomous driving creates unique verification and validation challenges. In particular, the continuous engineering paradigm of gradually perfecting a DNN-based perception can make the previously established result of safety verification no longer valid. This can occur either due to newly encountered examples (i.e., input domain enlargement) inside the Operational Design Domain or due to the subsequent parameter fine-tuning activities of a DNN. This paper considers approaches to transfer results established in the previous DNN safety verification problem to the modified problem setting. By considering the reuse of state abstractions, network abstractions, and Lipschitz constants, we develop several sufficient conditions that only require formally analyzing a small part of the DNN in the new problem. The overall concept is evaluated on a 1/10-scale vehicle equipped with a DNN controller that determines the visual waypoint from the perceived image. |
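To illustrate the shape of the statistical argument in 11.1.1 (an acceptance criterion turned into a minimum lateral distance), consider the toy computation below. The exponential harm model and every constant in it are invented for illustration; the paper derives its acceptance criterion from real pedestrian accident statistics.

```python
# Toy risk-based minimum lateral distance (model and all numbers invented).
import math

ACCEPTED_RISK = 1e-7   # tolerated probability of harm per pass (hypothetical)
P_STEP = 1e-3          # P(pedestrian unexpectedly steps sideways) (hypothetical)

def p_harm(lateral_distance_m):
    # assumed model: harm probability decays exponentially with distance
    return P_STEP * math.exp(-2.0 * lateral_distance_m)

d = 0.0
while p_harm(d) > ACCEPTED_RISK:
    d += 0.01
print(f"minimum lateral distance: {d:.2f} m")   # ~4.61 m under these assumptions
```

The point of making the decision explicit, as the paper argues, is that the acceptance criterion, not an opaque planner heuristic, determines the distance.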
11.3 Trust in Manufacturing
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:30 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/y7nFJxtuyX6ht7yvE
Session chair:
David Hély, LCIS/G-INP, FR
Session co-chair:
Johann Heyszl, Fraunhofer AISEC, DE
This session covers topics on ensuring trust in manufacturing processes. The papers in the session present novel techniques based on using sensors and switching activity for the identification of hardware devices and recycled ICs. In addition, the detection of Hardware Trojans in SoCs using state-of-the-art methods in machine learning is also discussed in this session.
Time | Label | Presentation Title Authors |
---|---|---|
09:30 CET | 11.3.1 | HTNET: TRANSFER LEARNING FOR GOLDEN CHIP-FREE HARDWARE TROJAN DETECTION Speaker: Sina Faezi, University of California, Irvine, US Authors: Sina Faezi1, Rozhin Yasaei2 and Mohammad Al Faruque2 1University of California, Irvine, US; 2University of California Irvine, US Abstract Design and fabrication outsourcing has made integrated circuits (IC) vulnerable to malicious modifications by third parties known as hardware Trojans (HT). Over the last decade, the use of side-channel measurements for detecting the malicious manipulation of ICs has been extensively studied. However, the suggested approaches often suffer from three major limitations: reliance on a trusted identical chip (i.e., golden chip), untraceable footprints of subtle hardware Trojans which remain inactive during the testing phase, and the need to identify the best discriminative features that can be used for separating side-channel signals coming from HT-free and HT-infected circuits. To overcome these shortcomings, we propose a novel neural network design (i.e., HTNet) and a feature extractor training methodology that can be used for HT detection at run time. We create a library of known hardware Trojans, collect electromagnetic and power side-channel signals for each case, and train HTNet to learn the best discriminative features based on this library. Then, at test time, we fine-tune HTNet to learn the behavior of the particular chip under test. We use HTNet followed by an anomaly detection mechanism at run time to monitor the chip behavior and report malicious activities in the side-channel signals. We evaluate our methodology using Trust-Hub benchmarks and show that HTNet can extract a robust set of features that can be used for HT detection. |
09:45 CET | 11.3.2 | MALICIOUS ROUTING: CIRCUMVENTING BITSTREAM-LEVEL VERIFICATION FOR FPGAS Speaker: Qazi Arbab Ahmed, Paderborn University, DE Authors: Qazi Arbab Ahmed1, Tobias Wiersema1 and Marco Platzner2 1Paderborn University, DE; 2University of Paderborn, DE Abstract The battle of developing hardware Trojans and corresponding countermeasures has taken adversaries towards ingenious ways of compromising hardware designs by circumventing even advanced testing and verification methods. Besides conventional methods of inserting Trojans into a design by a malicious entity, the design flow for field-programmable gate arrays (FPGAs) can also be surreptitiously compromised to assist the attacker in performing a successful malfunctioning or information-leakage attack. The advanced stealthy malicious look-up-table (LUT) attack activates a Trojan only when generating the FPGA bitstream and can thus not be detected by register-transfer- and gate-level testing and verification. However, even this attack was recently revealed by a bitstream-level proof-carrying hardware (PCH) approach. In this paper, we present a novel attack that leverages malicious routing of the inserted Trojan circuit to acquire a dormant state even in the generated and transmitted bitstream. The Trojan's payload is connected to primary inputs/outputs of the FPGA via a programmable interconnect point (PIP). The Trojan is detached from inputs/outputs during place-and-route and re-connected only when the FPGA is being programmed, thus activating the Trojan circuit without any need for a trigger logic. Since the Trojan is injected in a post-synthesis step and remains unconnected in the bitstream, the presented attack can currently neither be prevented by conventional testing and verification methods nor by recent bitstream-level verification techniques. |
10:00 CET | IP9_6.2 | IDENTIFICATION OF HARDWARE DEVICES BASED ON SENSORS AND SWITCHING ACTIVITY: A PRELIMINARY STUDY Speaker: Honorio Martin, University Carlos III of Madrid, ES Authors: Honorio Martin1, Elena Ioana Vatajelu2 and Giorgio Di Natale2 1University Carlos III of Madrid, ES; 2TIMA, FR Abstract Hardware device identification has become an important feature for enhancing the security and the trust of interconnected objects. In this paper, we present a device identification method based on measuring physical and electrical properties of the device while controlling its switching activity. The method is general and applicable to a large range of devices, from FPGAs to processors, as long as they embed sensors (such as temperature and voltage) and their measurements are available. The method is enabled by the fact that both the sensors and the effects of the switching activity on the circuit are uniquely affected by manufacturing-induced process variability. The device identification based on this method is made possible by the use of machine learning. The efficiency of the method has been evaluated by a preliminary study conducted on eleven FPGAs. |
10:01 CET | IP9_6.1 | (Best Paper Award Candidate) A DIFFERENTIAL AGING SENSOR TO DETECT RECYCLED ICS USING SUB-THRESHOLD LEAKAGE CURRENT Speaker: Turki Alnuayri, Dept. of Electrical Engineering & Electronics, University of Liverpool, Liverpool, UK and the Dept. of Computer Engineering, Taibah University, Medina, Saudi Arabia, SA Authors: Turki Alnuayri1, Saqib Khursheed2, Antonio Leonel Hernandez Martinez2 and Daniele Rossi3 1Dept. of Electrical Engineering & Electronics, University of Liverpool, Liverpool, UK and Dept. of Computer Engineering, Taibah University, Medina, Saudi Arabia, GB; 2Dept. of Electrical Engineering & Electronics, University of Liverpool, Liverpool, UK, GB; 3Dept. of Information Engineering, University of Pisa, Pisa, Italy, IT Abstract Integrated circuits (ICs) may be exposed to counterfeiting due to the involvement of untrusted parties in the semiconductor supply chain; this threatens the security and reliability of electronic systems. This paper focuses on the most common type of counterfeiting, namely recycled and remarked ICs. The goal is to develop a technique to differentiate between new and recycled ICs that have been used for only a short period of time. Detecting recycled ICs using aging sensors has been researched using sub-threshold leakage current and frequency degradation utilizing ring oscillators (ROs). The resolution of these sensors requires further development to accurately detect short usage times. This paper proposes a differential aging sensor to detect recycled ICs using ring oscillators with sub-threshold leakage current to detect aging effects due to bias temperature instability (BTI) and hot carrier injection (HCI) on a 22-nm CMOS technology provided by GlobalFoundries. Simulation results confirm that we are able to detect recycled ICs with high confidence using the proposed technique. It is shown that the discharge time increases by 14.72% after only 15 days and by 60.49% after 3 years’ usage, and the approach outperforms techniques that use frequency degradation only, whilst considering process and temperature variation. |
10:02 CET | 11.3.3 | GNN4TJ: GRAPH NEURAL NETWORKS FOR HARDWARE TROJAN DETECTION AT REGISTER TRANSFER LEVEL Speaker: Rozhin Yasaei, University of California Irvine, US Authors: Rozhin Yasaei, Shih-Yuan Yu and Mohammad Al Faruque, University of California Irvine, US Abstract Time-to-market pressure and resource constraints have pushed System-on-Chip (SoC) designers toward outsourcing the design and using third-party Intellectual Property (IP). This has created an opportunity for rogue entities in the Integrated Circuit (IC) supply chain to insert malicious circuits in the hardware design, known as Hardware Trojans (HT). HT detection is a major hardware security challenge, and early discovery is crucial because postponing the removal of an HT to late in the design or after the fabrication process would be very expensive. Current works suffer from several shortcomings, such as reliance on a golden HT-free reference, inability to identify all types of HTs or unknown ones, burdening the designer with the manual review of code, or scalability issues. To overcome these limitations, we propose GNN4TJ, a novel golden reference-free HT detection method at the register transfer level (RTL) based on Graph Neural Networks (GNN). GNN4TJ represents the hardware design as its intrinsic data structure, a graph, and generates the data flow graphs (DFGs) for RTL codes. We utilize a GNN to extract the features from the DFG, learn the circuit's behavior, and identify the presence of an HT, in a fully automated pipeline. We evaluate our model on a dataset that we create by expanding the Trust-Hub HT benchmarks. The results demonstrate that GNN4TJ detects unknown HTs with 97% recall (true positive rate) in only 21.1 ms. (A one-layer GNN propagation step is sketched after this table.) |
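The propagation step at the heart of a graph convolutional network, as used by GNN4TJ (11.3.3) on data-flow graphs, fits in a few lines of NumPy. The tiny graph, the feature sizes, and the symmetric normalization below are illustrative simplifications, not the paper's model.

```python
# One GCN propagation step on a toy data-flow graph (illustrative only).
import numpy as np

A = np.array([[0, 1, 0],                  # toy DFG: node 0 -> 1 -> 2
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
A_hat = A + np.eye(3)                     # add self-loops
D_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization

H = np.random.default_rng(0).normal(size=(3, 4))  # node features (3 nodes, 4 dims)
W = np.random.default_rng(1).normal(size=(4, 8))  # learnable layer weights

H_next = np.maximum(0.0, A_norm @ H @ W)  # H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)
print(H_next.shape)                       # (3, 8)
```

Stacking such layers and pooling the node embeddings into one graph embedding is what lets a classifier judge a whole RTL design without a golden reference.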
11.4 Techniques for Low-Latency Neural Network Inference
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:30 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/mWHPsEtPDSJAQzD8i
Session chair:
Ben Keller, NVIDIA, US
Session co-chair:
Ganapati Bhat, Washington State University, US
Efficient deep neural network (DNN) inference is the key enabler of applications from IoT to datacenter, and the best performance can be achieved through co-optimization of algorithms, compilers, and hardware. This session includes five papers with diverse approaches to reducing DNN inference latency. The first paper applies new techniques to the challenging problem of deploying DNN models to hardware, achieving significant speedups on GPU inference workloads. The second work proposes a new approach to hardware acceleration of sparse neural networks. The third paper demonstrates how DNN models can be evaluated in the frequency domain, which can greatly reduce the amount of required computation compared to classical techniques. The final presentations explore biomedical image segmentation and stochastic computing.
Time | Label | Presentation Title Authors |
---|---|---|
09:30 CET | 11.4.1 | DEEP NEURAL NETWORK HARDWARE DEPLOYMENT OPTIMIZATION VIA ADVANCED ACTIVE LEARNING Speaker: Qi Sun, The Chinese University of Hong Kong, HK Authors: Qi Sun1, Chen Bai2, Hao Geng2 and Bei Yu1 1The Chinese University of Hong Kong, HK; 2The Chinese University of Hong Kong, CN Abstract Recent years have witnessed the great successes of deep neural network (DNN) models, while deploying DNN models on hardware platforms is still challenging and widely discussed. Some works proposed dedicatedly designed accelerators for specific DNN models, while others proposed general-purpose deployment frameworks that can optimize the hardware configurations on various hardware platforms automatically. However, the extremely large design space and the very time-consuming on-chip tests bring great challenges to the hardware configuration optimization process. In this paper, to optimize the hardware deployment, we propose an advanced active learning framework which is composed of batch transductive experiment design (BTED) and Bootstrap-guided adaptive optimization (BAO). The BTED method generates a diverse initial configuration set filled with representative configurations. Based on the Bootstrap method and adaptive sampling, the BAO method guides the selection of hardware configurations during the searching process. To the best of our knowledge, these two methods are both introduced into general DNN deployment frameworks for the first time. We embed our advanced framework into AutoTVM, and the experimental results show that our methods reduce the model inference latency by up to 28.08% and decrease the variance of inference latency by up to 92.74%. |
09:45 CET | 11.4.2 | APPROACH TO IMPROVE THE PERFORMANCE USING BIT-LEVEL SPARSITY IN NEURAL NETWORKS Speaker: Yesung Kang, EE Department, POSTECH, KR Authors: Yesung Kang1, Eunji Kwon2, Seunggyu Lee2, Younghoon Byun3, Youngjoo Lee4 and Seokhyeong Kang5 1POSTECH, KR; 2POSTECH, KR; 3POSTECH, KR; 4Pohang University of Science and Technology (POSTECH), KR; 5Pohang University of Science and Technology, KR Abstract This paper presents a convolutional neural network (CNN) accelerator that can skip zero weights and handle outliers, which are few but have a significant impact on the accuracy of CNNs, to achieve speedup and increase the energy efficiency of CNNs. We propose an offline weight-scheduling algorithm which can skip zero weights and combine two non-outlier weights simultaneously using the bit-level sparsity of CNNs. We use a reconfigurable multiplier-and-accumulator (MAC) unit for two purposes: usually to compute two combined non-outliers and sometimes to compute outliers. We further improve the speedup of our accelerator by clipping some of the outliers with negligible accuracy loss. Compared to the DaDianNao [7] and Bit-Tactical [16] architectures, our CNN accelerator can improve speed by 3.34x and 2.31x and reduce energy consumption by 29.3% and 30.2%, respectively. |
10:00 CET | IP9_7.1 | MORPHABLE CONVOLUTIONAL NEURAL NETWORK FOR BIOMEDICAL IMAGE SEGMENTATION Speaker: Huaipan Jiang, Pennsylvania State University, US Authors: Huaipan Jiang1, Anup Sarma1, Mengran Fan2, Jihyun Ryoo1, Meenakshi Arunachalam3, Sharada Naveen4 and Mahmut Kandemir5 1Pennsylvania State University, US; 2Pennsylvania State University, US; 3Intel Corp., US; 4Intel, US; 5Pennsylvania State University, US Abstract We propose a morphable convolution framework, which can be applied to irregularly shaped regions of the input feature map. This framework reduces the computational footprint of a regular CNN operation in the context of biomedical semantic image segmentation. The traditional CNN-based approach has high accuracy but suffers from high training and inference computation costs, compared to a conventional edge-detection-based approach. In this work, we combine the concept of morphable convolution with edge detection algorithms, resulting in a hierarchical framework which first detects the edges and then generates a layer-wise annotation map. The annotation map guides the convolution operation to be run only on a small, useful fraction of pixels in the feature map. We evaluate our framework on three cell tracking datasets, and the experimental results indicate that our framework saves ~30% and ~10% execution time on CPU and GPU, respectively, without loss of accuracy, compared to the baseline conventional CNN approaches. |
10:01 CET | IP9_7.2 | (Best Paper Award Candidate) SPEEDING UP MUX-FSM BASED STOCHASTIC COMPUTING FOR ON-DEVICE NEURAL NETWORKS Speaker: Jongsung Kang, Seoul National University, KR Authors: Jongsung Kang and Taewhan Kim, Seoul National University, KR Abstract We propose an acceleration technique for processing multiplication operations using stochastic computing (SC) in on-device neural networks. Recently, MUX-FSM based SCs, which employ a MUX controlled by an FSM to generate a bit stream for a multiplication operation, have considerably reduced the processing time of MAC operations over the traditional stochastic number generator based SC. Nevertheless, the existing MUX-FSM based SCs still do not meet the multiplication processing time required for a wide adoption of on-device neural networks in practice, even though they offer a very economical hardware implementation. In this respect, this work proposes a solution to the problem of speeding up the conventional MUX-FSM based SCs. More precisely, we analyze the bit counting pattern produced by the MUX-FSM and replace the counting redundancy by a shift operation, shortening the length of the required bit sequence significantly, together with analytically formulating the number of computation cycles. Through experiments, it is shown that our enhanced SC technique is able to reduce the processing time by 44.1% on average over the conventional MUX-FSM based SCs. |
10:02 CET | 11.4.3 | ACCELERATING FULLY SPECTRAL CNNS WITH ADAPTIVE ACTIVATION FUNCTIONS ON FPGA Speaker: Shuanglong Liu, Hunan Normal University, CN Authors: Shuanglong Liu1, Hongxiang Fan2 and Wayne Luk3 1Hunan Normal University, CN; 2Imperial College London, GB; 3Imperial College, GB Abstract Computing convolutional layers in the frequency domain can largely reduce the computation overhead for training and inference of convolutional neural networks (CNNs). However, existing designs with this idea require repeated spatial- and frequency-domain transforms due to the absence of nonlinear functions in the frequency domain, which makes the benefit less attractive for low-latency inference. This paper presents a fully spectral CNN approach by proposing a novel adaptive Rectified Linear Unit (ReLU) activation in the spectral domain. The proposed design maintains the non-linearity in the network while taking into account the hardware efficiency at the algorithm level. The spectral model size is further optimized by merging and fusing layers. Then, a customized hardware architecture is proposed to implement the designed spectral network on an FPGA device, with DSP optimizations for 8-bit fixed-point multipliers. Our hardware accelerator is implemented on Intel's Arria 10 device and applied to the MNIST, SVHN, AT&T and CIFAR-10 datasets. Experimental results show a speed improvement of 6x to 10x and 4x to 5.7x compared to state-of-the-art spatial and FFT-based designs, respectively, while achieving similar accuracy across the benchmark datasets. (A short frequency-domain convolution check follows this table.) |
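The convolution theorem is what makes the spectral approach of 11.4.3 attractive: a (circular) 2-D convolution becomes one element-wise product per frequency bin. The NumPy check below demonstrates that equivalence; the paper's actual contribution, an adaptive ReLU that works inside the spectral domain, is not shown here.

```python
# Frequency-domain convolution equals circular convolution (NumPy check).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))     # input feature map
k = rng.normal(size=(8, 8))     # kernel, padded to the map size

# spectral path: one multiply per frequency bin instead of a sliding window
y_spec = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k)))

# direct circular convolution for comparison
y_dir = np.zeros_like(x)
for u in range(8):
    for v in range(8):
        for i in range(8):
            for j in range(8):
                y_dir[u, v] += x[i, j] * k[(u - i) % 8, (v - j) % 8]

print(np.allclose(y_spec, y_dir))   # True
```

The catch the abstract describes is that a plain ReLU has no such frequency-domain form, which is why prior designs kept transforming back and forth between the two domains.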
11.5 Approximate arithmetic and synthesis
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:30 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/JB6m6r8ff9nQGqErq
Session chair:
Daniel Menard, INSA Rennes, FR
Session co-chair:
Vojtech Mrazek, Brno University of Technology, CZ
Reducing energy with approximate computing requires efficient techniques to reduce cost without sacrificing quality, as well as tools to explore the approximate design space. In approximate computing techniques, refining data precision is an efficient approach to explore the trade-off between accuracy and energy consumption. For neural networks, a hybrid representation combining 8-bit fixed-point data and a dynamic shared exponent has been proposed to accelerate the training process. To explore the precision design space efficiently, a hybrid algorithm combining Bayesian optimization and a fast local search has been proposed to solve the word-length optimization problem. In the context of approximate logic design space exploration for large Boolean networks, a novel method is proposed to select the portions of the circuit leading to the best quality–cost trade-off.
Time | Label | Presentation Title Authors |
---|---|---|
09:30 CET | 11.5.1 | TRAINING DEEP NEURAL NETWORKS IN 8-BIT FIXED POINT WITH DYNAMIC SHARED EXPONENT MANAGEMENT Speaker: Hisakatsu Yamaguchi, Fujitsu Laboratories Ltd., JP Authors: Hisakatsu Yamaguchi, Makiko Ito, Katsuhiro Yoda and Atsushi Ike, Fujitsu Laboratories Ltd., JP Abstract The increase in complexity and depth of deep neural networks (DNNs) has created a strong need to improve computing performance. Quantization methods for training DNNs can effectively improve the computation throughput and energy efficiency of hardware platforms. We have developed an 8-bit quantization training method representing the weight, activation, and gradient tensors in an 8-bit fixed-point data format. The shared exponent for each tensor is managed dynamically on the basis of the distribution of the tensor elements calculated in the previous training phase, not in the current training phase, which improves computation throughput. This method provides up to 3.7 times the computation throughput of FP32 computation without accuracy degradation. (A minimal shared-exponent quantizer is sketched after this table.) |
09:45 CET | 11.5.2 | LEVERAGING BAYESIAN OPTIMIZATION TO SPEED UP AUTOMATIC PRECISION TUNING Speaker: Van-Phu Ha, Inria, FR Authors: Van-Phu Ha and Olivier Sentieys, INRIA, FR Abstract Using just the right amount of numerical precision is an important aspect of guaranteeing performance and energy efficiency requirements. Word-Length Optimization (WLO) is the automatic process for tuning the precision, i.e., bit-width, of variables and operations represented using fixed-point arithmetic. However, state-of-the-art precision tuning approaches do not scale well in large applications where many variables are involved. In this paper, we propose a hybrid algorithm combining Bayesian optimization (BO) and a fast local search to speed up the WLO procedure. Through experiments, we first show some evidence of how this combination can improve exploration time. Then, we propose an algorithm to automatically determine a reasonable transition point between the two algorithms. By statistically analyzing the convergence of the probabilistic models constructed during BO, we derive a stopping condition that determines when to switch to the local search phase. Experimental results indicate that our algorithm can reduce exploration time by 50% to 80% for large benchmarks. |
10:00 CET | IP9_3.1 | FTAPPROX: A FAULT-TOLERANT APPROXIMATE ARITHMETIC COMPUTING DATA FORMAT Speaker: Ye Wang, Harbin Institute of Technology, CN Authors: Ye Wang1, Jian Dong1, Qian Xu2 and Gang Qu3 1Harbin Institute of Technology, CN; 2University of Maryland, US; 3University of Maryland, College Park, US Abstract Approximate computing (AC) is an effective energy-efficient method for error-resilient applications. The essence behind AC is to reduce energy consumption by purposefully sacrificing a small amount of computation accuracy while providing quality-acceptable results. On the other hand, soft errors are a common problem during program execution and may cause unacceptable outputs or catastrophic failure of the system. As AC introduces errors and soft errors are mitigated by fault-tolerant mechanisms, the two have conflicting goals and contradictory approaches. To the best of our knowledge, there have been no previous efforts to consider the two at the same time. In this paper, we study the problem of AC under soft errors in order to guarantee the safe execution of the program while reducing energy (by AC). More specifically, we propose FTApprox, a fault-tolerant approximate arithmetic computing data format, to enable the detection and correction of soft errors. As an approximate data format, FTApprox can use 16 bits to approximate any 32-bit integers and fixed-point numbers, and will select only the most significant part of the operands for AC at runtime. Energy saving is obtained by converting 32-bit arithmetic operations to 8-bit operations. Meanwhile, for soft errors such as random bit flips, FTApprox can not only detect all single-bit flips and most 2-bit flips but also correct most of these errors. The experimental results show that FTApprox has significant resistance against soft errors while providing 66.4%-79.6% energy saving. |
10:01 CET | 11.5.3 | APPROXIMATE LOGIC SYNTHESIS OF VERY LARGE BOOLEAN NETWORKS Speaker: Jorge Echavarria, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), DE Authors: Jorge Echavarria, Stefan Wildermann and Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), DE Abstract For very large Boolean circuits, most approximate logic synthesis techniques successively apply local approximation transformations affecting only a portion of the whole design, allowing such transformations to be implemented in polynomial time and giving better control of the introduced error. To improve the existing approximate logic synthesis flows, one key issue is to derive a more efficient technique for selecting from all the possible portions of the design those more likely to yield better trade-offs between hardware resources and quality of the result. Because the likelihood of “error masking” grows with increasing circuit complexity, we expect the likelihood of a local transformation “reaching”, i.e., being observable at, the primary outputs to decrease at a similar rate. Comparatively, the closer a portion undergoing a local transformation is to the primary outputs, the more likely the error introduced can be observed at the primary outputs. Based on this observation, this paper proposes a novel methodology for the selection of portions, or “sub”-functions, of Boolean circuits (represented by Boolean networks) for approximation according to their “degree of connectivity” with other portions of the design. Our selection criterion is based on the observation that a Boolean “sub”-function is a better candidate for approximation when it drives many other “sub”-functions, especially those being driven by many other “sub”-functions. We introduce, integrate, and compare our connectivity-based selection methodology with a state-of-the-art approximate logic synthesis framework. Experimental results show that our selection technique yields better trade-offs between hardware resources and accuracy of the resulting approximated circuits. Moreover, our technique is efficient and can speed up the design space exploration of the aforementioned framework. |
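As a small companion to 11.5.1's data format, here is one way to quantize a tensor to 8-bit fixed point with a single shared exponent. The max-based exponent choice below is the obvious one; the paper's contribution is managing that exponent dynamically from the previous training phase's statistics, which this sketch does not do.

```python
# int8 quantization with one shared exponent (illustrative, max-based choice).
import numpy as np

def quantize_int8_shared_exp(t):
    """Return (int8 mantissas q, shared exponent e) with t ~= q * 2**e."""
    max_abs = float(np.max(np.abs(t))) or 1.0
    e = int(np.ceil(np.log2(max_abs))) - 7          # sign bit + 7 mantissa bits
    q = np.clip(np.round(t / 2.0 ** e), -128, 127).astype(np.int8)
    return q, e

t = np.random.default_rng(0).normal(scale=0.1, size=1000).astype(np.float32)
q, e = quantize_int8_shared_exp(t)
err = np.abs(q.astype(np.float32) * 2.0 ** e - t).max()
print(f"shared exponent {e}, max abs error {err:.2e}")
```

Because the exponent is shared across the whole tensor, the multiply-accumulate datapath can stay purely integer, which is where the throughput gain comes from.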
11.6 Timing and Reconfigurable Logic
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:30 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/kLGiG6g2jD5XwpQR3
Session chair:
Ibrahim Abe Elfadel, Khalifa University, AE
Session co-chair:
Jürgen Teich, Universität Erlangen-Nürnberg, DE
The first paper in the session proposes a technique for generating dynamic input and output assertions to be used in the early stages of hierarchical timing analysis and optimization. The second paper addresses the placement problem for FPGAs with heterogeneous architectures and clock constraints. The third paper presents techniques for the full implementation of datapaths based on clock-less wave-propagated pipelining, providing a sign-off-quality design.
Time | Label | Presentation Title Authors |
---|---|---|
09:30 CET | 11.6.1 | (Best Paper Award Candidate) TECHNOLOGY LOOKUP TABLE BASED DEFAULT TIMING ASSERTIONS FOR HIERARCHICAL TIMING CLOSURE Speaker: Ravi Ledalla, IBM Corporation, US Authors: Ravi Ledalla1, Chaobo Li1, Debjit Sinha2, Adil Bhanji1, Gregory Schaeffer3, Hemlata Gupta1 and Jennifer Basile1 1IBM Corporation, US; 2IBM, US; 3IBM Corporation, US Abstract This paper presents an approach to dynamically generating representative external driving cell and external wire parasitic assertions for the ports of sub-blocks of a hierarchical design. The assertions are based on a technology lookup table and use attributes of the port and the hierarchical wire connected to the port as keys. A concept of reverse timing calculation at the input of the driving cell is described, which enables the approach to drive efficient timing optimization of the boundary paths of design sub-blocks. Experimental results in an industrial timing environment demonstrate significantly improved timing optimization accuracy when compared to prior work. (A toy lookup-table sketch follows this table.) |
09:45 CET | 11.6.2 | TIMING-DRIVEN PLACEMENT FOR FPGAS WITH HETEROGENEOUS ARCHITECTURES AND CLOCK CONSTRAINTS Speaker: Zhifeng Lin, Fuzhou University, CN Authors: Zhifeng Lin1, Yanyue Xie2, Gang Qian3, Jianli Chen1, Sifei Wang3, Jun Yu3 and Yao-Wen Chang4 1Fuzhou University, CN; 2Northeastern University, US; 3Fudan University, CN; 4National Taiwan University, TW Abstract Modern FPGAs often contain heterogeneous architectures and clocking resources, which must be considered to achieve the desired solutions. As design complexity keeps growing, placement has become critical for FPGA timing closure. In this paper, we present an analytical placement algorithm for heterogeneous FPGAs that optimizes the worst slack while simultaneously honoring clock constraints. First, a heterogeneity-aware and memory-friendly delay model is developed to accurately and rapidly assess each connection delay. Then, a two-stage clock region refinement method is presented to effectively resolve clock and resource violations. Finally, we develop a novel timing-based co-optimization method to generate an optimized placement without any clocking violations. Compared with the state-of-the-art placer based on the advanced commercial tool Xilinx Vivado 2019.1 with the Xilinx 7 Series FPGA architecture, our algorithm achieves the best worst slack and routed wirelength while satisfying all clock constraints. |
10:00 CET | IP9_4.1 | ALIFROUTER: A PRACTICAL ARCHITECTURE-LEVEL INTER-FPGA ROUTER FOR LOGIC VERIFICATION Speaker: Zhen Zhuang, Fuzhou University, CN Authors: Zhen Zhuang1, Xing Huang2, Genggeng Liu1, Wenzhong Guo1, Weikang Qian3 and Wen-Hao Liu4 1Fuzhou University, CN; 2TU Munich, DE; 3Shanghai Jiao Tong University, CN; 4Block Implementation, ICD, Cadence Design Systems, Austin, TX, US Abstract As the scale of VLSI circuits increases rapidly, multi-FPGA prototyping systems have been widely used for logic verification. Due to the limited number of connections between FPGAs, however, the routability of prototyping systems is a bottleneck. As a consequence, the time-division multiplexing (TDM) technique has been proposed to improve the usability of prototyping systems, but it causes a dramatic increase in system delay. In this paper, we propose ALIFRouter, a practical architecture-level inter-FPGA router, to improve chip performance by reducing the corresponding system delay. ALIFRouter consists of three major stages, including i) routing topology generation, ii) TDM ratio assignment, and iii) system delay optimization. Additionally, a multi-thread parallelization method is integrated into the three stages to improve the efficiency of ALIFRouter. With the proposed algorithm, major performance indicators of multi-FPGA systems, such as the signal multiplexing ratio, can be improved significantly. |
10:01 CET | 11.6.3 | WAVEPRO 2.0: SIGNOFF-QUALITY IMPLEMENTATION AND VALIDATION OF ENERGY-EFFICIENT CLOCK-LESS WAVE PROPAGATED PIPELINING Speaker: Yehuda Kra, Bar Ilan University, IL Authors: Yehuda Kra, Tzachi Noy and Adam Teman, Bar-Ilan University, IL Abstract The design of computational datapaths with the clockless wave-propagated pipelining (CWPP) approach is an area- and energy-efficient alternative to traditional pipelined logic. Removal of the internal registers saves both the area and the toggling power of these complex gates, while also simplifying the clock tree. However, this approach is very rarely used in modern scaled technologies, due to the complexity of implementation and the lack of a robust, scalable, and automated design methodology that meets rigid industry standards. In this paper, we present WavePro 2.0, an extension of the original WavePro algorithm and automation utility, which demonstrated how to apply CWPP to any generic combinatorial circuit using a CMOS standard cell library. WavePro 2.0 advances this concept with full-flow implementation capabilities, producing a post-layout CWPP-ready design that meets signoff-quality industry timing requirements. The WavePro 2.0 utility interfaces with commercial design automation software to balance a post-synthesis netlist to achieve a high CWPP launch rate (frequency). We demonstrate the calculation of a fused dot-product unit, implemented with a 65nm standard cell library, enabling a launch rate of 1 GHz under worst-case conditions with post-silicon field configuration capabilities for dealing with variation. The worst-case launch rate is comparable to a sequential design implemented with between 3 and 4 pipeline stages. This is achieved with a 30% power reduction and 15% less area than a 3-stage sequential design. |
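To picture the lookup-table mechanism of 11.6.1, a dictionary keyed by port attributes is the simplest possible stand-in. Every key, cell name, and capacitance below is an invented placeholder; real tables would be characterized per technology.

```python
# Toy technology lookup table for default boundary assertions (values invented).
TECH_LUT = {
    # (port direction, wire-length bucket) -> (external cell, wire cap in fF)
    ("input",  "short"): ("BUF_X2", 5.0),
    ("input",  "long"):  ("BUF_X8", 40.0),
    ("output", "short"): ("BUF_X2", 6.0),
    ("output", "long"):  ("BUF_X4", 55.0),
}

def default_assertion(direction, wire_length_um):
    """Pick a default external driver/load and parasitic for a sub-block port."""
    bucket = "short" if wire_length_um < 100.0 else "long"
    cell, cap_ff = TECH_LUT[(direction, bucket)]
    return {"port_dir": direction, "ext_cell": cell, "ext_cap_ff": cap_ff}

print(default_assertion("input", 250.0))
```

The value of such defaults is that sub-block timing can be optimized before the surrounding hierarchy is final, which is exactly the hierarchical-closure gap the paper targets.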
11.7 Artificial Intelligence and Fault Injection in Test
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:30 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/vN6zmNScE8rxfKnx4
Session chair:
Paolo Bernardi, Politecnico di Torino, IT
Session co-chair:
Melanie Schillinsky, NXP Semiconductors GmbH, DE
The session starts with an application of machine learning in test; then, the design of a robust AI system is considered. Finally, the focus is on fault injection in FPGAs.
Time | Label | Presentation Title Authors |
---|---|---|
09:30 CET | 11.7.1 | A LEARNING-BASED METHODOLOGY FOR ACCELERATING CELL-AWARE MODEL GENERATION Speaker: Pierre d'Hondt, STMicroelectronics, FR Authors: Pierre d'Hondt1, Aymen Ladhar1, Patrick Girard2 and Arnaud Virazel3 1STMicroelectronics, FR; 2LIRMM / CNRS, FR; 3LIRMM, FR Abstract Cell-aware (CA) model generation refers to the process of characterizing cell-internal defects, a key step to ensure high test and diagnosis quality. The main limitation of this process is the generation effort, which is costly in terms of run time, SPICE simulator license usage and flow complexity. In this work, a methodology that does not use any electrical defect simulation is developed to predict the response of a cell-internal defect once it is injected in a standard cell. More widely, the aim is to use existing cell-aware models from various standard cell libraries and technologies to predict cell-aware models for new standard cells independently of the technology. A Random Forest classification algorithm is used for prediction. Experiments on several cell libraries using different technologies demonstrate the accuracy and performance of the method. The paper concludes with the presentation of a new hybrid CA model generation flow. (A bare-bones classifier skeleton is sketched after this table.) |
09:45 CET | 11.7.2 | RELIABILITY-DRIVEN NEUROMORPHIC COMPUTING SYSTEMS DESIGN Speaker: Qi Xu, University of Science and Technology of China, CN Authors: Qi Xu1, Junpeng Wang2, Hao Geng3, Song Chen2 and Xiaoqing Wen4 1Hefei University of Technology, CN; 2University of Science and Technology of China, CN; 3The Chinese University of Hong Kong, CN; 4Kyushu Institute of Technology, JP Abstract In recent years, memristive crossbar-based neuromorphic computing systems (NCS) have provided a promising solution to the acceleration of neural networks. However, stuck-at faults (SAFs) in the memristor devices significantly degrade the computing accuracy of NCS. Besides, memristors suffer from process variations, causing the deviation of actual programming resistance from its target resistance. In this paper, we propose a novel reliability-driven design framework for a memristive crossbar-based NCS in combination with general and chip-specific design optimizations. First, we design a general reliability-aware training scheme to enhance the robustness of NCS to SAFs and device variations; a dropout-inspired approach is developed to alleviate the impact of SAFs; a new weighted error function, including cross-entropy error (CEE), the l2-norm of weights, and the sum of squares of first-order derivatives of CEE with respect to weights, is proposed to obtain a smooth error curve, where the effects of variations are suppressed. Second, given the neural network model generated by the reliability-aware training scheme, we exploit chip-specific mapping and re-training to further reduce the computation accuracy loss incurred by SAFs. Experimental results clearly demonstrate that the proposed method can boost the computation accuracy of NCS and improve the NCS robustness. |
10:00 CET | IP9_4.2 | TESTING RESISTIVE MEMORY BASED NEUROMORPHIC ARCHITECTURES USING REFERENCE TRIMMING Speaker: Christopher Münch, Karlsruhe Institute of Technology, DE Authors: Christopher Münch and Mehdi Tahoori, Karlsruhe Institute of Technology, DE Abstract Neuromorphic architectures based on emerging resistive memories are in the spotlight of today's research, as they are able to solve complex problems with unmatched efficiency. In particular, resistive approaches offer multiple advantages over CMOS-based designs. Most prominently, they are non-volatile and offer small device footprints in addition to very low-power operation. However, regular memory testing used for conventional resistive Random Access Memory (RAM) architectures cannot detect all possible faults in the synaptic operations done in a resistive neuromorphic architecture. At the same time, testing all neuromorphic operations from the logic testing perspective is infeasible. In this paper, we propose to use reference resistance trimming for the test phase and derive a generic test sequence to detect all the faults impacting the neuromorphic operations, based on an extensive defect injection analysis. By exploiting the resistive nature of the underlying architecture, we are able to reduce the testing time from the exponential complexity necessary for a conventional logic testing approach to a linear complexity, and reduce this by another 50% with the help of resistance trimming. |
10:01 CET | IP9_5.1 | FAULT-CRITICALITY ASSESSMENT FOR AI ACCELERATORS USING GRAPH CONVOLUTIONAL NETWORKS Speaker: Arjun Chaudhuri, Duke University, US Authors: Arjun Chaudhuri1, Jonti Talukdar1, Jinwook Jung2, Gi-Joon Nam2 and Krishnendu Chakrabarty1 1Duke University, US; 2IBM Research, US Abstract Owing to the inherent fault tolerance of deep neural networks (DNNs), many structural faults in DNN accelerators tend to be functionally benign. In order to identify functionally critical faults, we analyze the functional impact of stuck-at faults in the processing elements of a 128x128 systolic-array accelerator that performs inferencing on the MNIST dataset. We present a 2-tier machine-learning framework that leverages graph convolutional networks (GCNs) for quick assessment of the functional criticality of structural faults. We describe a computationally efficient methodology for data sampling and feature engineering to train the GCN-based framework. The proposed framework achieves up to 90% classification accuracy with negligible misclassification of critical faults. |
10:02 CET | 11.7.3 | (Best Paper Award Candidate) DEVICE- AND TEMPERATURE DEPENDENCY OF SYSTEMATIC FAULT INJECTION RESULTS IN ARTIX-7 AND ICE40 FPGAS Speaker: Christian Fibich, Dept. of Electronic Engineering, University of Applied Sciences Technikum Wien, Vienna, Austria, AT Authors: Christian Fibich1, Martin Horauer1 and Roman Obermaisser2 1University of Applied Sciences Technikum Wien, AT; 2University of Siegen, DE Abstract Systematic fault injection into the configuration memory of SRAM-based FPGAs promises to gain insight into the criticality of individual configuration bits. Current approaches implicitly assume that results obtained on one FPGA device can be generalized to all devices of that type, and hence allow fault injection to be parallelized. This work, to the best of our knowledge, is the first to challenge this assumption. To that end, a synthetic test design was subjected to systematic fault injection on 16 Xilinx Artix-7 as well as 10 Lattice iCE40 FPGAs, for which bitstream documentation is publicly available. The results of these experiments indicate that the derived sets of critical configuration bits vary from device to device of the same type, especially if the interconnect is targeted. Furthermore, temperature is observed to influence the fault injection results on Artix-7. Suggestions for dealing with the implications in future fault injection experiments are provided. |
10:17 CET | IP9_5.2 | ANALYZING ARM CORESIGHT ETMV4.X DATA TRACE STREAM WITH A REAL-TIME HARDWARE ACCELERATOR Speaker: Ali Zeinolabedin, Chair of Highly-Parallel VLSI Systems and Neuro-Microelectronics, TU Dresden, Dresden, Germany, DE Authors: Seyed Mohammad Ali Zeinolabedin, Johannes Partzsch and Christian Mayr, Dresden University of Technology, DE Abstract Debugging and verification of modern SoCs is a vital step in realizing complex systems consisting of various components. Monitoring memory operations such as data transfer address and value is an essential debugging and verification feature. ARM CoreSight technology generates a specific debug trace stream standard to monitor the memory without affecting the normal execution of the system. This paper proposes a hardware architecture to analyze the debug trace stream in real-time. It is implemented on the Xilinx Virtex xc6vcx75t-2ff784 FPGA device and can operate at 125 MHz and occupies less than 8% of the FPGA resources. |
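For readers unfamiliar with the classification step named in 11.7.1, the sketch below shows, in minimal form, how a Random Forest classifier can stand in for per-defect electrical simulation. The feature encoding and labels are hypothetical placeholders, not the paper's actual feature set:

```python
# Minimal sketch of a learning-based defect-response prediction step
# (cf. 11.7.1). Features and labels are hypothetical stand-ins: in reality
# they would be derived from existing cell-aware models across libraries.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy training set: each row encodes (defect-type id, location id, cell
# input pattern bits, ...); label = 1 if the defect is detected (the cell
# output flips) for that pattern, 0 otherwise.
X = rng.integers(0, 2, size=(5000, 10))
y = (X[:, 0] ^ X[:, 3] ^ X[:, 7]).astype(int)  # stand-in ground truth

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```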
11.8 Industrial Design Methods and Tools: Neural Network Design
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 09:30 CET - 10:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/R3BgeviJ3jGivTJXN
Organizer:
Jürgen Haase, edacentrum GmbH, DE
This Exhibition Workshop features industrial design methods and tools. It is open to conference delegates as well as to exhibition visitors.
Time | Label | Presentation Title Authors |
---|---|---|
09:30 CET | 11.8.1 | AUTOMATING TINY NEURAL NETWORK DESIGN WITH MCU DEPLOY-ABILITY IN THE LOOP Speaker: Danilo Pau, STMicroelectronics, IT Abstract Tiny Machine Learning (TinyML) is a growing, widely popular community focusing on the deployment of Deep Learning (DL) models on microcontrollers (MCUs). To run a trained DL model on an MCU, developers must have the necessary skills to handcraft network topologies and associated hyperparameters to fit a wide range of hardware requirements, including operating frequency, embedded SRAM and embedded Flash memory, along with the corresponding power consumption requirements. Unfortunately, a hand-crafted design methodology poses multiple challenges: 1) AI and embedded developers exhibit different, orthogonal skills, which do not meet each other during the development of AI applications until their validation in an operational environment; 2) tools for automated network design often assume virtually unlimited resources (typically, deep networks are trained on cloud- or GPU-based systems); 3) the time-to-market from conception to realization of an AI system is usually quite long. Consequently, mass-market adoption of AI technologies at the deep edge is jeopardized. Our solution is based on Sequential Model-Based Optimization (SMBO) – aka Bayesian Optimization (BO) – which is the standard methodology for Automated Machine Learning (AutoML) and Neural Architecture Search (NAS). Although AutoML and NAS are successfully applied on large GPU/cloud platforms (e.g., some AutoML/NAS tools are commercialized by Google, Amazon and Microsoft), their application is still an issue in the case of tiny devices, such as MCUs. Our approach, instead, includes “deployability” constraints – related to the hardware resources of the MCUs – in the hyperparameter optimization process, leading to this new “AutoTinyML” perspective. This talk will present our approach, along with its pros and cons with respect to multi-objective optimization (usually adopted to reduce resource usage on the cloud). A set of relevant results will be presented and discussed, providing an overview of the next open challenges and perspectives in the AutoTinyML field. |
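The "deployability in the loop" idea from 11.8.1 can be illustrated with a deliberately simplified loop: candidate configurations whose estimated footprint exceeds the MCU budget are rejected before any training effort is spent. The cost model, budgets and scoring below are hypothetical, and a real SMBO/BO flow would replace the random proposals with surrogate-guided ones:

```python
# Simplified sketch of deployability-constrained hyperparameter search.
# FLASH/RAM budgets and the footprint model are invented placeholders.
import random

FLASH_BUDGET_KB = 512   # hypothetical MCU Flash limit
RAM_BUDGET_KB = 128     # hypothetical MCU SRAM limit

def estimate_footprint(cfg):
    # Crude stand-in cost model: weights go to Flash, activations to RAM.
    params = cfg["layers"] * cfg["width"] ** 2
    return params * 4 / 1024, cfg["width"] * 4 / 1024  # (flash_kb, ram_kb)

def train_and_score(cfg):
    # Placeholder for actual training; returns a mock validation accuracy.
    return random.random()

best = None
for _ in range(100):
    cfg = {"layers": random.randint(1, 8),
           "width": random.choice([16, 32, 64, 128])}
    flash_kb, ram_kb = estimate_footprint(cfg)
    if flash_kb > FLASH_BUDGET_KB or ram_kb > RAM_BUDGET_KB:
        continue  # not deployable on the target MCU: skip training entirely
    score = train_and_score(cfg)
    if best is None or score > best[0]:
        best = (score, cfg)

print("best deployable config:", best)
```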
IP9_1 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/AzbDJJpYc8YHyEd3D
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP9_1.1 | SEALPK: SEALABLE PROTECTION KEYS FOR RISC-V Speaker: Leila Delshadtehrani, Boston University, US Authors: Leila Delshadtehrani, Sadullah Canakci, Manuel Egele and Ajay Joshi, Boston University, US Abstract With the continuous increase in the number of software-based attacks, there has been a growing effort towards isolating sensitive data and trusted software components from untrusted third-party components. Recently, Intel introduced a new hardware feature for intra-process memory isolation, called Memory Protection Keys (MPK). The limited number of unique domains (16) provided by Intel MPK prohibits its use in cases where a large number of domains are required. Moreover, Intel MPK suffers from the protection key use-after-free vulnerability. To address these shortcomings, in this paper, we propose an efficient intra-process isolation technique for the RISC-V open ISA, called SealPK, which supports up to 1024 unique domains. Additionally, we devise three novel sealing features to protect the allocated domains, their associated pages, and their permissions from modifications or tampering by an attacker. We demonstrate the efficiency of SealPK by leveraging it to implement an isolated secure shadow stack on an FPGA prototype. |
IP9_1.2 | JOINT SPARSITY WITH MIXED GRANULARITY FOR EFFICIENT GPU IMPLEMENTATION Speaker: Chuliang Guo, Zhejiang University, CN Authors: Chuliang Guo1, Xingang Yan1, Yufei Chen1, He Li2, Xunzhao Yin1 and Cheng Zhuo1 1Zhejiang University, CN; 2University of Cambridge, GB Abstract Given the over-parameterization property of recent deep neural networks, sparsification is widely used to compress networks and reduce their memory footprint. Unstructured sparsity, i.e., fine-grained pruning, can help preserve model accuracy, while structured sparsity, i.e., coarse-grained pruning, is preferred for general-purpose hardware, e.g., GPUs. This paper proposes a novel joint sparsity pattern using mixed granularity to take advantage of both unstructured and structured sparsity. We utilize a heuristic strategy to infer the joint sparsity pattern by mixing vector-wise fine-grained and block-wise coarse-grained pruning masks. Experimental results show that the joint sparsity can achieve higher model accuracy and sparsity ratio while consistently maintaining moderate inference speed for VGG-16 on CIFAR-100, in comparison to the commonly used block sparsity and balanced sparsity strategies. |
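A rough illustration of the mixed-granularity idea in IP9_1.2: combine a block-wise coarse mask with a row-wise fine-grained magnitude mask. The block size and keep ratios below are illustrative choices, not the paper's settings:

```python
# Hedged sketch of mixing pruning granularities: a coarse block mask
# (whole blocks kept/dropped) ANDed with a fine per-row magnitude mask.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
B = 8  # block edge (illustrative)

# Coarse mask: keep the ~50% of BxB blocks with the largest L1 norm.
blocks = np.abs(W).reshape(64 // B, B, 64 // B, B).sum(axis=(1, 3))
keep = blocks >= np.median(blocks)
coarse = np.kron(keep.astype(int), np.ones((B, B), dtype=int)).astype(bool)

# Fine mask: inside each row (vector-wise), keep the top ~50% magnitudes.
thresh = np.median(np.abs(W), axis=1, keepdims=True)
fine = np.abs(W) >= thresh

W_sparse = W * (coarse & fine)
print("density:", (W_sparse != 0).mean())
```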
IP9_2 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/iLJvRaH73wHNj9Z24
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP9_2.1 | NEIGHBOR OBLIVIOUS LEARNING (NOBLE) FOR DEVICE LOCALIZATION AND TRACKING Speaker: Zichang Liu, Rice University, US Authors: Zichang Liu, Li Chou and Anshumali Shrivastava, Rice University, US Abstract On-device localization and tracking are increasingly crucial for various applications. Machine learning (ML) techniques are widely adopted along with the rapidly growing amount of data. However, during training, almost none of the ML techniques incorporate known structural information, such as the floor plan, which can be especially useful in indoor or other structured environments. The problem is incredibly hard because the structural properties are not explicitly available, making most structural learning approaches inapplicable. We develop our method through intuitions from manifold learning. Whereas existing manifold methods utilize neighborhood information such as Euclidean distances, we quantize the output space to measure closeness on the structure. We propose Neighbor Oblivious Learning (NObLe) and demonstrate our approach's effectiveness on two applications, WiFi-based fingerprint localization and inertial measurement unit (IMU) based device tracking. We show that NObLe gives significant improvement over state-of-the-art prediction accuracy. |
IP9_2.2 | A LOW-COST BLE-BASED DISTANCE ESTIMATION, OCCUPANCY DETECTION, AND COUNTING SYSTEM Speaker: Florenc Demrozi, Computer Science Department, University of Verona, Italy, IT Authors: Florenc Demrozi1, Fabio Chiarani1 and Graziano Pravadelli2 1Computer Science Department, University of Verona, IT; 2University of Verona, IT Abstract This article presents a low-cost system for distance estimation, occupancy counting, and presence detection based on Bluetooth Low Energy radio signal variation patterns, which mitigates the limitations of existing approaches related to economic cost, privacy concerns, computational requirements, and lack of ubiquity. To assess the approach's effectiveness, exhaustive tests have been carried out on four different datasets by exploiting several pattern recognition models. |
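As background for IP9_2.2, the textbook log-distance path-loss model shows how a single BLE RSSI reading maps to a distance estimate; the paper itself goes further and feeds RSSI variation patterns into pattern-recognition models. The tx_power_dbm and path-loss exponent below are environment-dependent guesses:

```python
# Background sketch only: classic log-distance path-loss model underlying
# RSSI-based ranging. Calibration values are illustrative assumptions.
def rssi_to_distance(rssi_dbm, tx_power_dbm=-59.0, n=2.0):
    """Estimate distance (m) via d = 10 ** ((P_tx - RSSI) / (10 * n))."""
    return 10 ** ((tx_power_dbm - rssi_dbm) / (10.0 * n))

for rssi in (-59, -69, -79):
    print(rssi, "dBm ->", round(rssi_to_distance(rssi), 2), "m")
```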
IP9_3 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/cGN8F96Tx9N87rFMp
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP9_3.1 | FTAPPROX: A FAULT-TOLERANT APPROXIMATE ARITHMETIC COMPUTING DATA FORMAT Speaker: Ye Wang, Harbin Institute of Technology, CN Authors: Ye Wang1, Jian Dong1, Qian Xu2 and Gang Qu3 1Harbin Institute of Technology, CN; 2University of Maryland, US; 3University of Maryland, College Park, US Abstract Approximate computing (AC) is an effective energy-efficient method for error-resilient applications. The essence behind AC is to reduce energy consumption by purposefully sacrificing a small amount of computation accuracy while providing quality-acceptable results. On the other hand, soft errors are a common problem during program execution and may cause unacceptable outputs or catastrophic failure of the system. As AC introduces errors while soft errors are mitigated by fault-tolerant mechanisms, the two have conflicting goals and contradictory approaches. To the best of our knowledge, there are no previous efforts that consider the two at the same time. In this paper, we study the problem of AC in the presence of soft errors, in order to guarantee the safe execution of the program while reducing energy (by AC). More specifically, we propose FTApprox, a fault-tolerant approximate arithmetic computing data format, to enable the detection and correction of soft errors (SEs). As an approximate data format, FTApprox can use 16 bits to approximate any 32-bit integer or fixed-point number, and selects only the most significant part of the operands for AC at runtime. Energy saving is obtained by converting 32-bit arithmetic operations into 8-bit operations. Meanwhile, for soft errors such as random bit flips, FTApprox not only detects all single-bit flips and most 2-bit flips, but can also correct most of these errors. The experimental results show that FTApprox has significant resistance against soft errors while providing 66.4%-79.6% energy savings. |
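A hypothetical sketch of the core idea behind FTApprox (IP9_3.1): keep only the most significant part of each operand so that a 32-bit multiplication reduces to an 8-bit one. The actual 16-bit FTApprox encoding and its fault-detection/correction bits are not reproduced here:

```python
# Illustration of "most significant part" approximation, NOT the actual
# FTApprox format: mantissa width and encoding are assumptions.
def msb_approx(x, bits=8):
    """Return (mantissa, shift) with mantissa holding the top `bits` bits."""
    if x == 0:
        return 0, 0
    shift = max(x.bit_length() - bits, 0)
    return x >> shift, shift

def approx_mul(a, b, bits=8):
    ma, sa = msb_approx(a, bits)
    mb, sb = msb_approx(b, bits)
    return (ma * mb) << (sa + sb)   # 8-bit x 8-bit core multiplication

a, b = 123_456_789, 987_654
exact, approx = a * b, approx_mul(a, b)
print("relative error:", abs(exact - approx) / exact)
```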
IP9_4 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/ipxBMtaczNPreSCuF
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP9_4.1 | ALIFROUTER: A PRACTICAL ARCHITECTURE-LEVEL INTER-FPGA ROUTER FOR LOGIC VERIFICATION Speaker: Zhen Zhuang, Fuzhou University, CN Authors: Zhen Zhuang1, Xing Huang2, Genggeng Liu1, Wenzhong Guo1, Weikang Qian3 and Wen-Hao Liu4 1Fuzhou University, CN; 2TU Munich, DE; 3Shanghai Jiao Tong University, CN; 4Block Implementation, ICD, Cadence Design Systems, Austin, TX, US Abstract As the scale of VLSI circuits increases rapidly, multi-FPGA prototyping systems have been widely used for logic verification. Due to the limited number of connections between FPGAs, however, the routability of prototyping systems is a bottleneck. As a consequence, the time-division multiplexing (TDM) technique has been proposed to improve the usability of prototyping systems, but it causes a dramatic increase in system delay. In this paper, we propose ALIFRouter, a practical architecture-level inter-FPGA router, to improve chip performance by reducing the corresponding system delay. ALIFRouter consists of three major stages, including i) routing topology generation, ii) TDM ratio assignment, and iii) system delay optimization. Additionally, a multi-thread parallelization method is integrated into the three stages to improve the efficiency of ALIFRouter. With the proposed algorithm, major performance indicators of multi-FPGA systems, such as the signal multiplexing ratio, can be improved significantly. |
IP9_4.2 | TESTING RESISTIVE MEMORY BASED NEUROMORPHIC ARCHITECTURES USING REFERENCE TRIMMING Speaker: Christopher Münch, Karlsruhe Institute of Technology, DE Authors: Christopher Münch and Mehdi Tahoori, Karlsruhe Institute of Technology, DE Abstract Neuromorphic architectures based on emerging resistive memories are in the spotlight of today's research as they are able to solve complex problems with an unmatched efficiency. In particular, resistive approaches offer multiple advantages over CMOS-based designs. Most prominently, they are non-volatile and offer small device footprints in addition to very low-power operation. However, regular memory testing used for conventional resistive Random Access Memory (RAM) architectures cannot detect all possible faults in the synaptic operations performed in a resistive neuromorphic architecture. At the same time, testing all neuromorphic operations from the logic testing perspective is infeasible. In this paper, we propose to use reference resistance trimming for the test phase and derive a generic test sequence to detect all the faults impacting the neuromorphic operations, based on an extensive defect injection analysis. By exploiting the resistive nature of the underlying architecture, we are able to reduce the testing time from the exponential complexity of a conventional logic testing approach to a linear complexity, and to reduce this by another 50% with the help of resistance trimming. |
IP9_5 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/qcf5595uySh4MXpAm
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP9_5.1 | FAULT-CRITICALITY ASSESSMENT FOR AI ACCELERATORS USING GRAPH CONVOLUTIONAL NETWORKS Speaker: Arjun Chaudhuri, Duke University, US Authors: Arjun Chaudhuri1, Jonti Talukdar1, Jinwook Jung2, Gi-Joon Nam2 and Krishnendu Chakrabarty1 1Duke University, US; 2IBM Research, US Abstract Owing to the inherent fault tolerance of deep neural networks (DNNs), many structural faults in DNN accelerators tend to be functionally benign. In order to identify functionally critical faults, we analyze the functional impact of stuck-at faults in the processing elements of a 128x128 systolic-array accelerator that performs inferencing on the MNIST dataset. We present a 2-tier machine-learning framework that leverages graph convolutional networks (GCNs) for quick assessment of the functional criticality of structural faults. We describe a computationally efficient methodology for data sampling and feature engineering to train the GCN-based framework. The proposed framework achieves up to 90% classification accuracy with negligible misclassification of critical faults. (A toy sketch of one GCN propagation step follows this table.) |
IP9_5.2 | ANALYZING ARM CORESIGHT ETMV4.X DATA TRACE STREAM WITH A REAL-TIME HARDWARE ACCELERATOR Speaker: Ali Zeinolabedin, Chair of Highly-Parallel VLSI Systems and Neuro-Microelectronics, TU Dresden, Dresden, Germany, DE Authors: Seyed Mohammad Ali Zeinolabedin, Johannes Partzsch and Christian Mayr, Dresden University of Technology, DE Abstract Debugging and verification of modern SoCs is a vital step in realizing complex systems consisting of various components. Monitoring memory operations such as data transfer address and value is an essential debugging and verification feature. ARM CoreSight technology generates a specific debug trace stream standard to monitor the memory without affecting the normal execution of the system. This paper proposes a hardware architecture to analyze the debug trace stream in real-time. It is implemented on the Xilinx Virtex xc6vcx75t-2ff784 FPGA device and can operate at 125 MHz and occupies less than 8% of the FPGA resources. |
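For IP9_5.1, the GCN building block can be illustrated with one toy propagation step; the graph, features and weights below are random stand-ins for the paper's netlist-derived data:

```python
# One GCN propagation step, H' = ReLU(D^-1/2 (A + I) D^-1/2 H W), on a
# random toy graph. Everything here is a placeholder, not the paper's data.
import numpy as np

rng = np.random.default_rng(0)
n, f_in, f_out = 6, 4, 2

A = (rng.random((n, n)) < 0.3).astype(float)
A = np.maximum(A, A.T)            # symmetric adjacency
A_hat = A + np.eye(n)             # add self-loops
d = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(d ** -0.5)

H = rng.normal(size=(n, f_in))    # per-node (e.g., per-gate) features
W = rng.normal(size=(f_in, f_out))

H_next = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)
print(H_next.shape)               # (6, 2): new node embeddings
```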
IP9_6 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/k5E7iuofp3ormz7n6
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP9_6.1 | A DIFFERENTIAL AGING SENSOR TO DETECT RECYCLED ICS USING SUB-THRESHOLD LEAKAGE CURRENT Speaker: Turki Alnuayri, Dept. of Electrical Engineering & Electronics, University of Liverpool, Liverpool, UK and Dept. of Computer Engineering, Taibah University, Medina, Saudi Arabia, SA Authors: Turki Alnuayri1, Saqib Khursheed2, Antonio Leonel Hernandez Martinez2 and Daniele Rossi3 1Dept. of Electrical Engineering & Electronics, University of Liverpool, Liverpool, UK and Dept. of Computer Engineering, Taibah University, Medina, Saudi Arabia, GB; 2Dept. of Electrical Engineering & Electronics, University of Liverpool, Liverpool, UK, GB; 3Dept. of Information Engineering, University of Pisa, Pisa, Italy, IT Abstract Integrated circuits (ICs) may be exposed to counterfeiting due to the involvement of untrusted parties in the semiconductor supply chain; this threatens the security and reliability of electronic systems. This paper focuses on the most common type of counterfeiting, namely recycled and remarked ICs. The goal is to develop a technique to differentiate between new and recycled ICs that have been used for only a short period of time. Detecting recycled ICs using aging sensors has been researched using sub-threshold leakage current and frequency degradation utilizing ring oscillators (ROs). The resolution of these sensors requires further development to accurately detect short usage times. This paper proposes a differential aging sensor that detects recycled ICs using ring oscillators with sub-threshold leakage current, capturing aging effects caused by bias temperature instability (BTI) and hot carrier injection (HCI) on a 22-nm CMOS technology provided by GlobalFoundries. Simulation results confirm that we are able to detect recycled ICs with high confidence using the proposed technique. It is shown that the discharge time increases by 14.72% after only 15 days and by 60.49% after 3 years' usage, outperforming techniques that use frequency degradation only, whilst considering process and temperature variation. |
IP9_6.2 | IDENTIFICATION OF HARDWARE DEVICES BASED ON SENSORS AND SWITCHING ACTIVITY: A PRELIMINARY STUDY Speaker: Honorio Martin, University Carlos III of Madrid, ES Authors: Honorio Martin1, Elena Ioana Vatajelu2 and Giorgio Di Natale2 1University Carlos III of Madrid, ES; 2TIMA, FR Abstract Hardware device identification has become an important feature for enhancing the security and trust of interconnected objects. In this paper, we present a device identification method based on measuring physical and electrical properties of the device while controlling its switching activity. The method is general and applicable to a large range of devices, from FPGAs to processors, as long as they embed sensors (such as temperature and voltage) and their measurements are available. The method is enabled by the fact that both the sensors and the effects of the switching activity on the circuit are uniquely affected by manufacturing-induced process variability. The device identification based on this method is made possible by the use of machine learning. The efficiency of the method has been evaluated in a preliminary study conducted on eleven FPGAs. |
IP9_7 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 10:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/HSgTHvyx6ASbctJiH
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP9_7.1 | MORPHABLE CONVOLUTIONAL NEURAL NETWORK FOR BIOMEDICAL IMAGE SEGMENTATION Speaker: Huaipan Jiang, Pennsylvania State University, US Authors: Huaipan Jiang1, Anup Sarma1, Mengran Fan2, Jihyun Ryoo1, Meenakshi Arunachalam3, Sharada Naveen4 and Mahmut Kandemir5 1Pennsylvania State University, US; 2Pennsylvania State University, US; 3Intel Corp., US; 4Intel, US; 5Pennsylvania State University, US Abstract We propose a morphable convolution framework, which can be applied to irregularly shaped regions of the input feature map. This framework reduces the computational footprint of a regular CNN operation in the context of biomedical semantic image segmentation. The traditional CNN-based approach has high accuracy, but suffers from high training and inference computation costs compared to a conventional edge-detection-based approach. In this work, we combine the concept of morphable convolution with edge detection algorithms, resulting in a hierarchical framework which first detects the edges and then generates a layer-wise annotation map. The annotation map guides the convolution operation to run only on a small, useful fraction of pixels in the feature map. We evaluate our framework on three cell tracking datasets, and the experimental results indicate that our framework saves ~30% and ~10% of execution time on CPU and GPU, respectively, without loss of accuracy, compared to baseline conventional CNN approaches. |
IP9_7.2 | SPEEDING UP MUX-FSM BASED STOCHASTIC COMPUTING FOR ON-DEVICE NEURAL NETWORKS Speaker: Jongsung Kang, Seoul National University, KR Authors: Jongsung Kang and Taewhan Kim, Seoul National University, KR Abstract We propose an acceleration technique for processing multiplication operations using stochastic computing (SC) in on-device neural networks. Recently, MUX-FSM based SCs, which employ a MUX controlled by an FSM to generate a bit stream for a multiplication operation, have considerably reduced the processing time of MAC operations over the traditional stochastic number generator based SC. Nevertheless, the existing MUX-FSM based SCs still do not meet the multiplication processing time required for a wide adoption of on-device neural networks in practice, even though they offer a very economical hardware implementation. In this respect, this work proposes a solution to the problem of speeding up the conventional MUX-FSM based SCs. Precisely, we analyze the bit counting pattern produced by the MUX-FSM and replace the counting redundancy with a shift operation, significantly shortening the length of the required bit sequence, and we analytically formulate the resulting number of computation cycles. Through experiments, it is shown that our enhanced SC technique is able to reduce the processing time by 44.1% on average over the conventional MUX-FSM based SCs. |
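As background for IP9_7.2, conventional unipolar stochastic computing represents a value p in [0, 1] as a bit-stream with P(bit = 1) = p, so a single AND gate multiplies. The sketch below shows only this baseline; the paper's MUX-FSM shift-based speed-up is not reproduced:

```python
# Baseline unipolar stochastic computing, not the paper's technique:
# values are encoded as random bit-streams and an AND gate multiplies them.
import random

def to_stream(p, n=4096):
    return [random.random() < p for _ in range(n)]

def from_stream(s):
    return sum(s) / len(s)

a, b = 0.75, 0.5
sa, sb = to_stream(a), to_stream(b)
prod = [x and y for x, y in zip(sa, sb)]   # AND gate acts as the multiplier
print("expected", a * b, "got", round(from_stream(prod), 3))
```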
W03-P1 Cadence - BTU - Europractice Workshop - Generation and Implementation of an industry-grade ASSP core (Part 1)
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 11:00 CET - 15:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/RrBfEKipDjWKo53PA
Organizers:
Anton Klotz, Cadence Design Systems, DE
Michael Hübner, Brandenburg University of Technology Cottbus, DE
Florian Fricke, Brandenburg University of Technology Cottbus, DE
Marcus Binning, Cadence Design Systems, GB
Chris Skinner, Cadence Design Systems, GB
Charis Kalantzi, Cadence Design Systems, DE
Clive Holmes, Europractice, GB
Loganathan Sabesan, Cadence Design Systems, GB
Simone Fini, Cadence Design Systems, GB
Aspasia Karanasiou, Cadence Design Systems, GB
Arturs Kozlovskis, Cadence Design Systems, GB
The Tensilica ASSP is offered to universities by Cadence Design Systems within the framework of the Tensilica University Program. It is possible to create a model of a processor core and extend it with special instructions that accelerate certain operations. After a before/after comparison of the optimization, the core is exported to RTL and then processed through a physical design flow using the Cadence tools Genus and Innovus towards GDS.
On the first day of the workshop, attendees will explore the Tensilica Fusion F1 core, extend it with a simple extension using the TIE language, and compare the performance increase. The optimized core will be streamed out to Verilog RTL. On the second day, attendees will perform synthesis, placement, clock synthesis, routing, timing optimization and streamout to GDS.
The workshop will include various hands-on exercises, which attendees can perform using cloud-based tools from Cadence Design Systems. Every attendee will receive a personal account for performing the exercises. The cloud platform will be provided by Europractice.
Attendees will be able to earn a digital badge for attending the workshop and completing the hands-on exercises.
K.5 Keynote - Special day on ASD
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 15:00 CET - 15:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/Rv6DcWWGXHvRDApRS
Session chair:
Selma Saidi, TU Dortmund, DE
Session co-chair:
Rolf Ernst, TU Braunschweig, DE
Autonomy is in the air: on the one hand, automation is clearly a lever to improve safety margins; on the other hand, technologies are maturing, pulled by the automotive market. In this context, Airbus is building a concept airplane from a blank sheet with the objective of improving human-machine teaming for better overall performance. The foundation of this new concept is that, when they are made aware of the “big picture” with enough time to analyze it, humans are still the best at making strategic decisions. Autonomy technologies are the main enabler of this concept. Benefits are expected both in a two-crew cockpit and eventually in Single Pilot Operations. Bio: Pascal Traverse is General Manager for the Autonomy “fast track” at Airbus. Autonomy is a top technical focus area for Airbus. The General Manager creates a vision and coordinates R&T activities with the objective of accelerating the growth of knowledge within Airbus. Before his nomination last year, Pascal coordinated Airbus Commercial R&T activities related to the cockpit and flight operations. Earlier in his career, Pascal participated in the A320/A330/A340/A380 fly-by-wire developments, certification harmonization with the FAA and EASA, management of Airbus safety activities, and quality activities on the A380 Final Assembly Line. Pascal holds master's and doctoral degrees in embedded systems from N7, conducted research at LAAS and UCLA, and is a 3AF Fellow.
Time | Label | Presentation Title Authors |
---|---|---|
15:00 CET | K.5.1 | AUTONOMY: ONE STEP BEYOND ON COMMERCIAL AVIATION Speaker and Author: Pascal Traverse, Airbus, FR Abstract Autonomy is in the air: on the one hand, automation is clearly a lever to improve safety margins; on the other hand, technologies are maturing, pulled by the automotive market. In this context, Airbus is building a concept airplane from a blank sheet with the objective of improving human-machine teaming for better overall performance. The foundation of this new concept is that, when they are made aware of the “big picture” with enough time to analyze it, humans are still the best at making strategic decisions. Autonomy technologies are the main enabler of this concept. Benefits are expected both in a two-crew cockpit and eventually in Single Pilot Operations. |
12.1 Designing Autonomous Systems: Experiences, Technology and Processes
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 16:00 CET - 17:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/J65mDxoTBDp4Cn857
Session chair:
Selma Saidi, TU Dortmund, DE
Session co-chair:
Philipp Mundhenk, Robert Bosch GmbH, DE
Organizers:
Rolf Ernst, TU Braunschweig, DE
Selma Saidi, TU Dortmund, DE
This session discusses technology innovation, experiences and processes in building autonomous systems. The first paper presents Fünfliber, a nano-sized Unmanned Aerial Vehicle (UAV) built on a modular open-hardware robotic platform controlled by a parallel ultra-low-power system-on-chip (PULP) capable of running sophisticated autonomous DNN-based navigation workloads. The second paper presents an abstracted runtime for managing adaptation and integrating FPGA accelerators into autonomous software frameworks; a case study of its integration into ROS is demonstrated. The third paper discusses current processes in engineering dependable collaborative autonomous systems and new business models based on agile approaches to innovation management.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 12.1.1 | FüNFLIBER-DRONE: A MODULAR OPEN-PLATFORM 18-GRAMS AUTONOMOUS NANO-DRONE Speaker: Hanna Müller, Integrated Systems Laboratory, CH Authors: Hanna Mueller1, Daniele Palossi2, Stefan Mach1, Francesco Conti3 and Luca Benini4 1Integrated Systems Laboratory - ETH Zurich, CH; 2Integrated Systems Laboratory - ETH Zurich, Switzerland, Dalle Molle Institute for Artificial Intelligence - University of Lugano and SUPSI, CH; 3Department of Electrical, Electronic and Information Engineering - University of Bologna, Italy, IT; 4Integrated Systems Laboratory - ETH Zurich, Department of Electrical, Electronic and Information Engineering - University of Bologna, CH Abstract Miniaturizing an autonomous robot is a challenging task – not only the mechanical but also the electrical components have to operate within limited space, payload, and power. Furthermore, the algorithms for autonomous navigation, such as state-of-the-art (SoA) visual navigation deep neural networks (DNNs), are becoming increasingly complex, striving for more flexibility and agility. In this work, we present a sensor-rich, modular, nano-sized Unmanned Aerial Vehicle (UAV), almost as small as a five Swiss Franc coin – called Fünfliber – with a total weight of 18g and 7.2cm in diameter. We conceived our UAV as an open-source hardware robotic platform, controlled by a parallel ultra-low power (PULP) system-on-chip (SoC) with a wide set of onboard sensors, including three cameras (i.e., infrared, optical flow, and standard QVGA), multiple Time-of-Flight (ToF) sensors, a barometer, and an inertial measurement unit. Our system runs the tasks necessary for a flight controller (sensor acquisition, state estimation, and low-level control), requiring only 10% of the computational resources available aboard and consuming only 9mW – 13x less than an equivalent Cortex-M4-based system. Pushing our system to its limit, we can use the remaining onboard computational power for sophisticated autonomous navigation workloads, as we showcase with an SoA DNN running at up to 18Hz, with a total electronics power consumption of 271mW. |
16:15 CET | 12.1.2 | RUNTIME ABSTRACTION FOR AUTONOMOUS ADAPTIVE SYSTEMS ON RECONFIGURABLE HARDWARE Speaker: Alex R Bucknall, University of Warwick, GB Authors: Alex R. Bucknall1 and Suhaib A. Fahmy2 1University of Warwick, GB; 2KAUST, SA Abstract Autonomous systems increasingly rely on on-board computation to avoid the latency overheads of offloading to more powerful remote computing. This requires the integration of hardware accelerators to handle the complex computations demanded by data-intensive sensors. FPGAs offer hardware acceleration with ample flexibility and interfacing capabilities when paired with general purpose processors, with the ability to reconfigure at runtime using partial reconfiguration (PR). Managing dynamic hardware is complex and has been left to designers to address in an ad-hoc manner, without first-class integration in autonomous software frameworks. This paper presents an abstracted runtime for managing adaptation of FPGA accelerators, including PR and parametric changes, that presents as a typical interface used in autonomous software systems. We present a demonstration using the Robot Operating System (ROS), showing negligible latency overhead as a result of the abstraction. (A purely illustrative sketch of such a runtime interface follows this session's listing.) |
16:30 CET | IP.ASD_2.1 | SYSTEMS ENGINEERING ROADMAP FOR DEPENDABLE AUTONOMOUS CYBER-PHYSICAL SYSTEMS Speaker and Author: Rasmus Adler, Fraunhofer IESE, DE Abstract Autonomous cyber-physical systems have enormous potential to make our lives more sustainable, more comfortable, and more economical. Artificial Intelligence and connectivity enable autonomous behavior, but often stand in the way of market launch. Traditional engineering techniques are no longer sufficient to achieve the desired dependability; current legal and normative regulations are inappropriate or insufficient. This paper discusses these issues, proposes advanced systems engineering to overcome these issues, and provides a roadmap by structuring fields of action. |
16:31 CET | 12.1.3 | DDI: A NOVEL TECHNOLOGY AND INNOVATION MODEL FOR DEPENDABLE, COLLABORATIVE AND AUTONOMOUS SYSTEMS Speaker: Eric Armengaud, Armengaud Innovate GmbH, AT Authors: Eric Armengaud1, Daniel Schneider2, Jan Reich2, Ioannis Sorokos2, Yiannis Papadopoulos3, Marc Zeller4, Gilbert Regan5, Georg Macher6, Omar Veledar7, Stefan Thalmann8 and Sohag Kabir9 1Armengaud Innovate GmbH, AT; 2Fraunhofer IESE, DE; 3University of Hull, GB; 4Siemens AG, DE; 5Lero @DKIT, IE; 6Graz University of Technology, AT; 7AVL List GmbH, AT; 8University of Graz, AT; 9University of Bradford, GB Abstract Digital transformation fundamentally changes established practices in the public and private sectors. Hence, it represents an opportunity to improve value creation processes (e.g., “Industry 4.0”) and to rethink how to address customers' needs, such as “data-driven business models” and “Mobility-as-a-Service”. Dependable, collaborative and autonomous systems play a central role in this transformation process. Furthermore, the emergence of data-driven approaches combined with autonomous systems will lead to new business models and market dynamics. Innovative approaches to reorganise the value creation ecosystem, to enable distributed engineering of dependable systems, and to answer urgent questions such as liability will be required. Consequently, digital transformation requires a comprehensive multi-stakeholder approach which properly balances technology, ecosystem and business innovation. The targets of this paper are (a) to introduce digital transformation and the role of, and opportunities provided by, autonomous systems; (b) to introduce Digital Dependability Identities (DDI) – a technology for dependability engineering of collaborative, autonomous CPS; and (c) to propose an appropriate agile approach for innovation management based on business model innovation and co-entrepreneurship. |
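The kind of interface argued for in 12.1.2 can be suggested with a purely illustrative Python sketch: adaptation requests (accelerator swap via partial reconfiguration, or parameter updates) hide behind one call, so client code never manages FPGA details directly. All names and the bitstream table are hypothetical, and the actual work integrates with ROS rather than plain Python:

```python
# Hypothetical runtime-abstraction interface, NOT the paper's API: it only
# suggests how PR and parametric changes can sit behind a single call.
class AcceleratorRuntime:
    def __init__(self, bitstreams):
        self.bitstreams = bitstreams      # name -> partial bitstream path
        self.active = None
        self.params = {}

    def adapt(self, accelerator=None, **params):
        """Swap the accelerator via PR and/or update runtime parameters."""
        if accelerator is not None and accelerator != self.active:
            path = self.bitstreams[accelerator]
            # A real runtime would hand `path` to the PR controller here.
            self.active = accelerator
        self.params.update(params)

rt = AcceleratorRuntime({"stereo": "stereo.bit", "cnn": "cnn.bit"})
rt.adapt(accelerator="cnn", batch=1)
print(rt.active, rt.params)
```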
12.2 When FPGA Turns Against You: Side-Channel and Fault Attacks in Shared FPGAs
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 16:00 CET - 17:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/fkPYdfa8kc9LaQ3xe
Session chair:
Francesco Regazzoni, University of Amsterdam and ALaRI - USI, NL
Session co-chair:
Daniel Holcomb, University of Massachusetts Amherst, US
Organizers:
Mirjana Stojilovic, EPFL, CH
Francesco Regazzoni, University of Amsterdam and ALaRI - USI, NL
FPGAs are now part of the cloud acceleration-as-a-service portfolio offered by major cloud providers. Cloud is naturally a multi-tenant platform. However, FPGA multitenancy raises security concerns, fueled by the recent research works that showed how a malicious cloud user can deploy remotely-controlled attacks to extract secret information from the FPGA co-tenants or inject faults. This hot-topic session aims at spreading the awareness of the threats and attack techniques and discussing the limitations of existing countermeasures, hopefully leading to a deeper understanding of the problem of developing the most appropriate mitigation techniques.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 12.2.1 | REMOTE AND STEALTHY FAULT ATTACKS ON VIRTUALIZED FPGAS Speaker: Jonas Krautter, Karlsruhe Institute of Technology (KIT), DE Authors: Jonas Krautter1, Dennis Gnad2 and Mehdi Tahoori2 1Karlsruhe Institute of Technology (KIT), DE; 2Karlsruhe Institute of Technology, DE Abstract The increasing amount of resources per FPGA chip makes virtualization and multi-tenancy a promising direction to improve utilization and efficiency of these flexible accelerators in the cloud. However, the freedom given to untrusted parties on a multi-tenant FPGA can result in severe security issues. Side-channel, fault, and Denial-of-Service attacks are possible through malicious use of FPGA logic resources. In this work, we perform a detailed analysis of fault attacks between logically isolated designs on a single FPGA. Previous attacks were often based on mapping a massive number of ring oscillators into FPGA logic, which naturally induces a high current and subsequent voltage drop. However, such oscillators are easy to detect as combinational loops and can be prevented by a hypervisor. Here, we demonstrate how even elaborate fault attacks to recover a secret key of an AES encryption module can be deployed using seemingly benign benchmark circuits, or even AES modules themselves, to generate critical voltage fluctuations. |
16:15 CET | 12.2.2 | EXTENDED ABSTRACT: COVERT CHANNELS AND DATA EXFILTRATION FROM FPGAS Speaker: Kasper Rasmussen, University of Oxford, GB Authors: Ilias Giechaskiel1, Ken Eguro2 and Kasper Rasmussen3 1Independent Researcher, GB; 2Microsoft Research, US; 3University of Oxford, GB Abstract In complex FPGA designs, implementations of algorithms and protocols from third-party sources are common. However, the monolithic nature of FPGAs means that all subcircuits share common on-chip infrastructure, such as routing resources. This presents an attack vector for all FPGAs that contain designs from multiple vendors, especially for FPGAs used in multi-tenant cloud environments or integrated into multi-core processors: hardware imperfections can be used to infer high-level state and break security guarantees. In this paper, we demonstrate how “long” routing wires can be used for covert communication between disconnected cores, or by a malicious core to exfiltrate secrets. The information leakage is measurable for both static and dynamic signals, and it can be detected using small on-board circuits. In our prototype we achieved a bandwidth of 6 kbps with 99.9% accuracy, and a side channel which can recover signals kept constant for only 128 cycles with an accuracy of more than 98.4%. |
16:30 CET | 12.2.3 | REMOTE POWER SIDE-CHANNEL ATTACKS ON BNN ACCELERATORS IN FPGAS Speaker: Russell Tessier, University of Massachusetts Amherst, US Authors: Shayan Moini1, Shanquan Tian2, Daniel Holcomb3, Jakub Szefer2 and Russell Tessier4 1School of Electrical and Computer Engineering, University of Massachusetts Amherst, US; 2Yale University, US; 3UMass Amherst, US; 4University of Massachusetts, US Abstract Multi-tenant FPGAs have recently been proposed, where multiple independent users simultaneously share a remote FPGA. Despite its benefits for cost and utilization, multi-tenancy opens up the possibility of malicious users extracting sensitive information from co-located victim users. To demonstrate the dangers, this paper presents a remote, power-based side-channel attack on a binarized neural network (BNN) accelerator. This work shows how to remotely obtain voltage estimates as the BNN circuit executes, and how the information can be used to recover the inputs to the BNN. The attack is demonstrated with a BNN used to recognize handwriting images from the MNIST dataset. With the use of precise time-to-digital converters (TDCs) for remote voltage estimation, the MNIST inputs can be successfully recovered with a maximum normalized cross-correlation of 75% between the input image and the recovered image. (A short sketch of this normalized cross-correlation metric follows this session's listing.) |
16:45 CET | 12.2.4 | SHARED FPGAS AND THE HOLY GRAIL: PROTECTIONS AGAINST SIDE-CHANNEL AND FAULT ATTACKS Speaker: Mirjana Stojilovic, EPFL, CH Authors: Ognjen Glamocanin1, Dina Mahmoud1, Francesco Regazzoni2 and Mirjana Stojilovic1 1EPFL, CH; 2University of Amsterdam and ALaRI - USI, CH Abstract In this paper, we survey recently proposed methods for protecting against side-channel and fault attacks in shared FPGAs. These methods are quite versatile, targeting FPGA compilation flow, real-time timing-fault detection, on-chip active fences, automated bitstream verification, etc. Despite their versatility, they are mostly designed to counteract a specific class of attacks. To understand how to address the problem of security in shared FPGAs in a comprehensive way, we discuss their individual strengths and weaknesses, in an attempt to identify research directions necessitating further investigation. |
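The figure of merit quoted in 12.2.3, the normalized cross-correlation between the victim's input image and the attacker's reconstruction, can be computed as in the sketch below; the arrays are random placeholders for the MNIST input and the TDC-derived estimate:

```python
# Normalized cross-correlation between an input image and a noisy
# reconstruction. Both arrays are synthetic placeholders.
import numpy as np

def normalized_cross_correlation(x, y):
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return float((x * y).mean())   # 1.0 = identical up to scale/offset

rng = np.random.default_rng(0)
victim = rng.random((28, 28))                              # stand-in input
recovered = victim + 0.5 * rng.normal(size=victim.shape)   # noisy estimate
print(round(normalized_cross_correlation(victim, recovered), 3))
```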
12.3 Emerging in memory computing paradigms
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 16:00 CET - 17:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/24ansQcEPjaMi4Krn
Session chair:
Farhad Merchant, ICE, RWTH, DE
Session co-chair:
Mark Wijtvliet, TU Dresden, DE
Organizer:
Shubham Rai, TU Dresden, DE
With the rise of neuromorphic computing, the traditional von Neumann architecture is finding it difficult to cope with the rising demands of machine learning workloads. This requirement has fueled the search for technologies that can mimic the human brain by efficiently combining memory and computation in a single device. In this special session, we present the state-of-the-art research in the domain of in-memory computing. In particular, we take a look at memristors and their widespread application in neuromorphic computation. We introduce ReRAMs in terms of their novel computing paradigms and present ReRAM-specific design flows. We address the various circuit opportunities and challenges related to reliability and fault tolerance associated with them. Finally, we look at an emerging nanotechnology involving the co-integration of CMOS and FeFETs, which has the potential to provide both memory and computation from a single device.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 12.3.1 | RE-ENGINEERING COMPUTING WITH MEMRISTOR DEVICES Speaker and Author: Said Hamdioui, Delft University of Technology, NL Abstract This talk addresses the potential and the design of Computation-in-Memory (CIM) architectures based on non-volatile (NV) devices such as ReRAM, PCM and STT-MRAM. It classifies the state-of-the-art computer architectures and highlights how the trend is moving toward CIM architectures in order to eliminate and/or significantly reduce the limitations of today's technologies. The concept of CIM based on NV devices is discussed, and logic and arithmetic circuit designs using such devices, as well as how they enable such architectures, are covered; measurement data are shown to demonstrate the CIM concept in silicon. The strong dependency of the appropriate CIM architecture and its building blocks on the application domain, as well as the huge potential of CIM (realizing on the order of fJ/operation), is illustrated through several case studies. (An idealized crossbar arithmetic sketch follows this session's listing.) |
16:15 CET | 12.3.2 | TESTING OF RERAM CROSSBARS FOR IN-MEMORY COMPUTING Speaker and Author: Krishnendu Chakrabarty, Duke University, US Abstract Deep Learning (DL) applications are becoming increasingly ubiquitous. However, recent research has highlighted a number of reliability concerns associated with deep neural networks (DNNs) used for DL. In particular, hardware-level reliability of DNNs is of particular concern when DL models are mapped to specialized neuromorphic hardware such as ReRAM-based crossbars. However, DNN architectures are inherently fault-tolerant and many faults do not have any significant impact on inferencing accuracy. Therefore, careful analysis must be carried out to identify faults that are critical for a given application. We will describe a computationally efficient technique to identify critical faults (CFs) in massive crossbars for very large DNNs. |
16:30 CET | 12.3.3 | DESIGN AUTOMATION FOR IN-MEMORY COMPUTING USING RERAM CROSSBAR Speaker and Author: Anupam Chattopadhyay, Nanyang Technological University, SG Abstract ReRAM devices enable high endurance, non- volatile storage and low leakage power while at the same time allowing functionally complete Boolean logic operations. Various logic families, such as, Majority-Inverter, Material Implication and Multi-input NOR gates have been practically demonstrated. Such capabilities can be leveraged to develop non-Von Neumann Computing platforms, and thereby address the memory bottleneck. In this talk, we will discuss quantifiable benefits of in-memory computing and representative platforms. Subsequently, we will introduce the novel design automation problems brought forth by such platforms. We will present our solutions in the domain of logic synthesis and technology mapping for ReRAM crossbar array taking into account area, delay and crossbar dimension constraints. |
16:45 CET | 12.3.4 | COMBINING MEMORY AND LOGIC USING FERRO-ELECTRIC TRANSISTORS Speaker and Author: Akash Kumar, TU Dresden, DE Abstract Ferro-electric field-effect transistors (FeFETs) based on hafnium oxide offer great opportunities for logic-in-memory applications, due to their natural ability to combine logic (transistor) and memory (ferroelectric material), their low-power operation, and CMOS-compatible integration. This provides a strategy for manufacturing Schottky-type nanoscale transistors with an add-on non-volatile option. In particular, the device concept is of great interest for achieving non-volatile polarity modification in reconfigurable field-effect transistors. While reconfigurability at the transistor level allows more functionality per unit, the non-volatility can provide quick data access. In this talk, we will look at possible circuit design paradigms incorporating both computation and logic in a single transistor, which can help address the ever-increasing need for resource optimization in neuromorphic computing. |
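The in-memory arithmetic several of these talks build on can be stated in two lines: with conductances G programmed into the crossbar cells and voltages V applied to the columns, Kirchhoff's law delivers the row currents I = GV, an analog matrix-vector multiply in one step. The idealized sketch below ignores all device non-idealities (variation, stuck-at faults, wire resistance):

```python
# Idealized ReRAM crossbar matrix-vector multiply: I = G @ V.
# Conductance and voltage values are arbitrary illustrative ranges.
import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(4, 8))  # cell conductances (siemens)
V = rng.uniform(0.0, 0.2, size=8)         # input voltages on columns (V)

I = G @ V                                  # currents sensed on the rows (A)
print(I)
```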
12.4 NoC looking for higher performance
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/QEhZXKon2mRbE64sL
Session chair:
Romain Lemaire, CEA, FR
Session co-chair:
Pierre-Axel Lagadec, ATOS, FR
Various NoC approaches have been proposed over the last two decades to provide efficient communication infrastructures in complex systems-on-chip. Starting from 2D wired topologies, these networks are expanding to a large spectrum of architectures integrating emerging technologies. This session presents advanced NoC-based designs and mechanisms leveraging different innovative solutions: optical, wireless or 3D vertical communication links. Those solutions target applications where the complexity of the communication patterns is increasing and can become more critical than computation for overall system performance.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 12.4.1 | FAST: A FAST AUTOMATIC SWEEPING TOPOLOGY CUSTOMIZATION METHOD FOR APPLICATION-SPECIFIC WAVELENGTH-ROUTED OPTICAL NOCS Speaker: Moyuan Xiao, TU Munich, DE Authors: Moyuan Xiao1, Tsun-Ming Tseng2 and Ulf Schlichtmann2 1TU München, DE; 2TU Munich, DE Abstract The optical network-on-chip (ONoC) is an emerging upgrade to the electronic network-on-chip (ENoC). As a kind of ONoC, the wavelength-routed optical network-on-chip (WRONoC) shows ultra-high bandwidth and ultra-low latency in data communication. Manually designed WRONoC topologies typically reserve all-to-all communication. Topologies customized for application-specific networks can save resources, but their efficient design requires automation. The state-of-the-art design automation method proposes an integer-linear-programming (ILP) model. The runtime for solving the ILP model increases exponentially with the growth of communication density. Besides, the locations of the physical ports are not taken into consideration in the model, which causes unavoidable detours and crossings in the physical layout. In this work, we present FAST: an automatic topology customization and optimization method combining ILP and a sweeping technique. FAST overcomes the runtime problem and provides multiple topology variations with different port orders for physical layout. Experimental results show that FAST is thousands of times faster when tackling dense communications and ten to thousands of times faster when tackling sparse communications, while providing multiple better or equivalent topologies regarding resource usage and worst-case insertion loss. |
16:15 CET | 12.4.2 | FUZZY-TOKEN: AN ADAPTIVE MAC PROTOCOL FOR WIRELESS-ENABLED MANYCORES Speaker: Antonio Franques, University of Illinois at Urbana-Champaign, US Authors: Antonio Franques1, Sergi Abadal2, Haitham Hassanieh1 and Josep Torrellas3 1University of Illinois at Urbana-Champaign, US; 2N3Cat at Universitat Politècnica de Catalunya (UPC), ES; 3University of Illinois Urbana Champaign, US Abstract Recent computer architecture trends herald the arrival of manycores with over one hundred cores on a single chip. In this context, traditional on-chip networks do not scale well in latency or energy consumption, leading to bottlenecks in the execution. The Wireless Network-on-Chip (WNoC) paradigm holds considerable promise for the implementation of on-chip networks that will enable such highly-parallel manycores. However, one of the main challenges in WNoCs is the design of mechanisms that provide fast and efficient access to the wireless channel, while adapting to the changing traffic patterns within and across applications. Existing approaches are either slow or complicated, and do not provide the required adaptivity. In this paper, we propose Fuzzy-Token, a simple WNoC protocol that leverages the unique properties of the on-chip scenario to deliver efficient and low-latency access to the wireless channel irrespective of the application characteristics. We substantiate our claim via simulations with a synthetic traffic suite and with real application traces. Fuzzy-Token consistently provides one of the lowest packet latencies among the evaluated WNoC MAC protocols. On average, the packet latency in Fuzzy-Token is 4.4x and 2.6x lower than in a state-of-the art contention-based WNoC MAC protocol and in a token-passing protocol, respectively. |
16:30 CET | IP10_1.1 | A HYBRID ADAPTIVE STRATEGY FOR TASK ALLOCATION AND SCHEDULING FOR MULTI-APPLICATIONS ON NOC-BASED MULTICORE SYSTEMS WITH RESOURCE SHARING Speaker: Navonil Chatterjee, Lab-STICC, CNRS, UBS Research Center, FR Authors: Suraj Paul1, Navonil Chatterjee2, Prasun Ghosal1 and Jean-Philippe Diguet3 1Indian Institute of Engineering Science and Technology, IN; 2Université Bretagne Sud, FR; 3CNRS, Lab-STICC, UBS research center, FR Abstract Allocation and scheduling of applications affect the timing response and system performance, particularly for Network-on-Chip (NoC) based multicore systems executing real-time applications. These systems with multitasking processors provide improved opportunity for parallel application execution. In dynamic scenarios, runtime task allocation improves the system resource utilization and adapts to varying application workload. In this work, we present an efficient hybrid strategy for unified allocation and scheduling of tasks at runtime. By considering multitasking capability of processors, communication cost and task timing characteristics, potential allocation solutions are obtained at design-time. These are adapted for dynamic mapping and scheduling of computation and communication workloads of real-time applications. Simulation results show that the proposed approach achieves 34.2% and 26% average reduction in network latency and communication cost of the allocated applications. Also, the deadline satisfaction of the tasks improves on average by 42.1% while reducing the allocation-time overhead by 32% when compared with existing techniques. |
16:31 CET | 12.4.3 | (Best Paper Award Candidate) REGRAPHX: NOC-ENABLED 3D HETEROGENEOUS RERAM ARCHITECTURE FOR TRAINING GRAPH NEURAL NETWORKS Speaker: Aqeeb Iqbal Arka, Washington State University, US Authors: Aqeeb Iqbal Arka1, Biresh Kumar Joardar1, Jana Doppa1, Partha Pratim Pande1 and Krishnendu Chakrabarty2 1Washington State University, US; 2Duke University, US Abstract Graph Neural Network (GNN) is a variant of Deep Neural Networks (DNNs) operating on graphs. However, GNNs are more complex compared to traditional DNNs as they simultaneously exhibit features of both DNN and graph applications. As a result, architectures specifically optimized for either DNNs or graph applications are not suited for GNN training. In this work, we propose a 3D heterogeneous manycore architecture for on-chip GNN training to address this problem. The proposed architecture, ReGraphX, involves heterogeneous ReRAM crossbars to fulfill the disparate requirements of both DNN and graph computations simultaneously. The ReRAM-based architecture is complemented with a multicast-enabled 3D NoC to improve the overall achievable performance. We demonstrate that ReGraphX outperforms conventional GPUs by up to 3.5X (3X on average) in terms of execution time, while reducing energy consumption by as much as 11X. |
12.5 Intelligent Resource Optimization and Prediction for Power Management
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/aWixnqshwbfetWfwY
Session chair:
Andrea Calimera, Politecnico di Torino, IT
Session co-chair:
Laleh Behjat, University of Calgary, CA
The session highlights innovative techniques for embedded power management optimization and prediction, ranging from formal verification and machine learning to implementation in real systems. The first paper presents novel formal modeling techniques to bridge run-time decisions and design-time exploration, the second introduces battery lifetime management that uses a prediction model to optimize operating points in real systems, and the last describes a power prediction model based on a Long Short-Term Memory neural network.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 12.5.1 | (Best Paper Award Candidate) EFFICIENT RESOURCE MANAGEMENT OF CLUSTERED MULTI-PROCESSOR SYSTEMS THROUGH FORMAL PROPERTY EXPLORATION Speaker: Ourania Spantidi, Southern Illinois University Carbondale, US Authors: Ourania Spantidi1, Iraklis Anagnostopoulos1 and Georgios Fainekos2 1Southern Illinois University Carbondale, US; 2Arizona State University, US Abstract Modern embedded systems have adopted the clustered Chip Multi-Processor (CMP) paradigm in conjunction with dynamic frequency scaling techniques to improve application performance and power consumption. Nonetheless, modern applications are becoming more aggressive in terms of computational power. At the same time, the integration of multiple cores in the same cluster has resulted in a significant increase in power consumption, creating thermal hotspots. Conventional design approaches consider fixed power and temperature constraints, which are mostly extracted experimentally, often leading to pessimistic run-time decisions and performance losses. In this paper, we present a unified framework for efficient resource management of clustered CMPs by enabling formal property exploration and integrating robustness analysis. Specifically, we bridge the gap between run-time decisions and design-time exploration by using Parametric Signal Temporal Logic (PSTL) to mine the values of system constraints. Then, we utilize the extracted values to enhance the decisions of the run-time resource manager. Results on the Odroid-XU3 show that the proposed methodology offers more coarse- and fine-grain optimizations. |
16:15 CET | 12.5.2 | WORKLOAD- AND USER-AWARE BATTERY LIFETIME MANAGEMENT FOR MOBILE SOCS Speaker: Sofiane Chetoui, Brown University, US Authors: Sofiane Chetoui and Sherief Reda, Brown University, US Abstract Mobile devices have become an essential part of daily life with their increased computing capabilities and features. For battery-powered devices, the user experience depends on both quality-of-service (QoS) and battery lifetime. Previous works have been proposed to balance the QoS and battery lifetime of mobile devices; however, they often consider only the CPU. Additionally, they fail to consider the user's desired battery lifetime while exhibiting high QoS variation, which undermines user satisfaction. In this work, we propose a CPU-GPU workload- and user-aware battery lifetime management technique for mobile devices using machine learning. Firstly, we design a workload-aware governor through an offline and an online analysis. A set of CPU and GPU performance counters is used during the offline analysis to identify a set of canonical phases (CP). At runtime, k-means is used to classify each sample of the performance counters into one of the predefined CPs. Afterwards, we build a model that predicts the energy consumption given the user's usage history. Finally, the energy model is used to find the optimal frequency settings for the CPU and GPU that provide the best QoS while meeting the target battery lifetime. The evaluation of the proposed work against state-of-the-art techniques on a commercial smartphone shows 15.8% and 9.4% performance improvement on the CPU and GPU, respectively. The proposed technique also shows a 10× improvement in QoS variation, while meeting the desired battery lifetime. |
16:30 CET | 12.5.3 | LONG SHORT-TERM MEMORY NEURAL NETWORK-BASED POWER FORECASTING OF MULTI-CORE PROCESSORS Speaker: Mark Sagi, TU Munich, DE Authors: Mark Sagi1, Martin Rapp2, Heba Khdr3, Yizhe Zhang1, Nael Fasfous1, Nguyen Anh Vu Doan1, Thomas Wild1, Joerg Henkel4 and Andreas Herkersdorf5 1TU Munich, DE; 2Karlsruhe Institute of Technology, DE; 3Karlsruhe Institute of Technology (KIT), DE; 4KIT, DE; 5TU München, DE Abstract We propose a novel technique to forecast the power consumption of processor cores at run-time. Power consumption varies strongly with different running applications and within their execution phases. Accurately forecasting future power changes is highly relevant for proactive power/thermal management. While forecasting power is straightforward for known or periodic workloads, the challenge for general unknown workloads at different voltage/frequency (v/f)-levels is still unsolved. Our technique is based on a long short-term memory (LSTM) recurrent neural network (RNN) to forecast the average power consumption for both the next 1 ms and 10 ms periods. The runtime inputs for the LSTM RNN are current and past power information as well as performance counter readings. An LSTM RNN enables this forecasting due to its ability to preserve the history of power and performance counters. Our LSTM RNN needs to be trained only once at design-time while adapting during run-time to different system behavior through its internal memory. We demonstrate that our approach accurately forecasts power for unseen applications at different v/f-levels. The experimental results show that the forecasts of our LSTM RNN provide 43% lower worst-case error for the 1 ms forecasts and 38% for the 10 ms forecasts, compared to the state of the art. |
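The runtime loop in 12.5.2 above classifies performance-counter samples into canonical phases and picks frequencies accordingly. Below is a minimal sketch of that step; the centroids, counter choices, and frequency table are invented placeholders, not values from the paper.

```python
# Minimal sketch of runtime phase classification + frequency lookup in the
# spirit of 12.5.2. All numbers here are made up for illustration.
import numpy as np

# Offline analysis would produce these (here: fabricated 3-phase centroids
# over [instructions/cycle, cache-miss rate, GPU utilisation]).
centroids = np.array([[2.0, 0.01, 0.1],   # compute-bound phase
                      [0.5, 0.20, 0.1],   # memory-bound phase
                      [0.8, 0.05, 0.9]])  # GPU-heavy phase
freq_table_mhz = {0: (2400, 300), 1: (1200, 300), 2: (1600, 700)}  # (CPU, GPU)

def classify_and_set(sample):
    # Nearest-centroid assignment is exactly the k-means inference step.
    phase = int(np.argmin(np.linalg.norm(centroids - sample, axis=1)))
    return phase, freq_table_mhz[phase]

sample = np.array([0.6, 0.18, 0.12])      # one runtime counter snapshot
phase, (f_cpu, f_gpu) = classify_and_set(sample)
print(f"phase {phase} -> CPU {f_cpu} MHz, GPU {f_gpu} MHz")
```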
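The forecaster in 12.5.3 is an LSTM over past power and performance-counter readings. The skeleton below shows the general shape of such a model in PyTorch; the layer sizes, window length, and synthetic training data are assumptions for illustration only, not the paper's configuration.

```python
# Skeleton of an LSTM power forecaster in the spirit of 12.5.3.
import torch
import torch.nn as nn

WINDOW, FEATS = 20, 5          # 20 past samples of [power + 4 counters]

class PowerLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(FEATS, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):                    # x: (batch, WINDOW, FEATS)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])      # forecast from last hidden state

model = PowerLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(256, WINDOW, FEATS)         # stand-in for traced sequences
y = x[:, :, 0].mean(dim=1, keepdim=True)    # toy target: mean of past power
for _ in range(50):                          # design-time training loop
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
print("final training loss:", float(loss))
```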
12.6 Attacks and defense mechanisms
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/QdoahRyEmTARoxXvj
Session chair:
David Hely, University Grenoble Alpes, Grenoble INP, LCIS, FR
Session co-chair:
Nektarios Tsoutsos, University of Delaware, US
This session addresses attacks and mitigation techniques at the application level. In the first paper, a novel approach is proposed for creating a timing-based covert channel on a dynamically partitioned shared Last Level Cache. Then several protection techniques are presented: a hardware mechanism to detect and mitigate cross-core cache attacks, a Row-Hammer attack mitigation technique based on time-varying activation probabilities, a non-intrusive malware detection method for PLCs, and a method for detecting memory corruption vulnerability exploits that relies on fuzzing and dynamic data flow analysis.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 12.6.1 | (Best Paper Award Candidate) EXPLOITING SECRETS BY LEVERAGING DYNAMIC CACHE PARTITIONING OF LAST LEVEL CACHE Speaker: Shirshendu Das, Indian Institute of Technology Ropar, IN Authors: Anurag Agarwal, Jaspinder Kaur and Shirshendu Das, Indian Institute of Technology Ropar, IN Abstract Dynamic cache partitioning for shared Last Level Caches (LLC) is deployed in most modern multicore systems to achieve process isolation and fairness among applications and to avoid security threats. Since the LLC has visibility of all cache blocks requested by the applications running on a multicore system, a malicious application can threaten the system by leveraging the dynamic partitioning schemes applied to the LLC to create a timing-based covert channel attack. We call it the Cache Partitioned Covert Channel (CPCC) attack. The malicious applications may contain a trojan and a spy and use the underlying shared memory to create the attack. Through this attack, secret information such as encryption keys can be transmitted between the intended parties. We have observed that CPCC can target single or multiple cache sets to achieve a higher transmission rate with a maximum error rate of only 5%. The paper also addresses a few defense strategies that can avoid such cache-partitioning-based covert channel attacks. |
16:15 CET | 12.6.2 | PIPOMONITOR: MITIGATING CROSS-CORE CACHE ATTACKS USING THE AUTO-CUCKOO FILTER Speaker: Fengkai Yuan, State Key Laboratory of Information Security, Institute of Information Engineering, CN Authors: Fengkai Yuan1, Kai Wang2, Rui Hou1, Xiaoxin Li1, Peinan Li1, Lutan Zhao1, Jiameng Ying1, Amro Awad3 and Dan Meng1 1State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, CN; 2Department of Computer Science and Technology, Harbin Institute of Technology, CN; 3Department of Electrical and Computer Engineering, North Carolina State University, US Abstract Cache side channel attacks observe victim cache line access footprints to infer security-critical information. Among them, cross-core attacks exploiting the shared last level cache are more threatening owing to their simplicity of setup and high capacity. Stateful approaches to detection-based mitigation observe precise cache behaviors and protect specific cache lines that are suspected of being attacked. However, their recording structures incur large storage overhead and are vulnerable to reverse engineering attacks. Exploiting the intrinsically non-deterministic layout of a traditional Cuckoo filter, this paper proposes a space-efficient Auto-Cuckoo filter to record access footprints, which succeeds in decreasing storage overhead and resisting reverse engineering attacks at the same time. With the Auto-Cuckoo filter, we propose PiPoMonitor to detect Ping-Pong patterns and prefetch specific cache lines to interfere with adversaries' cache probes. Security analysis shows that PiPoMonitor can effectively mitigate cross-core attacks and the Auto-Cuckoo filter is immune to reverse engineering attacks. Evaluation results indicate PiPoMonitor has negligible impact on performance and the storage overhead is only 0.37%, an order of magnitude lower than previous stateful approaches. |
16:30 CET | IP10_2.1 | TOWARDS NON-INTRUSIVE MALWARE DETECTION FOR INDUSTRIAL CONTROL SYSTEMS Speaker: Prashant Hari Narayan Rajput, NYU Tandon School of Engineering, US Authors: Prashant Hari Narayan Rajput1 and Michail Maniatakos2 1NYU Tandon School of Engineering, US; 2New York University Abu Dhabi, AE Abstract The convergence of the Operational Technology (OT) sector with the Internet of Things (IoT) devices has increased cyberattacks on prominent OT devices such as Programmable Logic Controllers (PLCs). These devices have limited computational capabilities, no antivirus support, strict real-time requirements, and often older, unpatched operating systems. The use of traditional malware detection approaches can impact the real-time performance of such devices. Due to these constraints, we propose Amaya, an external malware detection mechanism based on a combination of signature detection and machine learning. This technique employs remote analysis of malware binaries collected from the main memory of the PLC by a non-intrusive method using the Joint Test Action Group (JTAG) port. We evaluate Amaya against in-the-wild malware for the ARM and x86 architectures, achieving an accuracy of 98% and 94.7%, respectively. Furthermore, we analyze concept drift, spatial experimental bias, and the effects of downsampling the feature vector to understand the behavior of the model in a real-world setting. |
16:31 CET | IP10_2.2 | TOWARDS AUTOMATED DETECTION OF HIGHER-ORDER MEMORY CORRUPTION VULNERABILITIES IN EMBEDDED DEVICES Speaker: Lei Yu, UCAS, CN Authors: Lei Yu, Linyu Li, Haoyu Wang, Xiaoyu Wang and Xiaorui Gong, UCAS, CN Abstract The rapid growth and limited security protection of networked embedded devices put the threat of remote code execution related memory corruption attacks front and center among security concerns. Current detection approaches can detect single-step and single-process memory corruption vulnerabilities well by fuzzing tests, and often assume that data stored in the current embedded device or even the embedded device connected to it is safe. However, an adversary might corrupt memory via multi-step exploits if she manages first to abuse the embedded application to store the attack payload and later use this payload in a security-critical operation on memory. These exploits usually lead to persistent code execution attacks and complete control of the device in practice but are rarely covered in state-of-the-art dynamic testing techniques. To address these stealthy yet harmful threats, we identify a large class of such multi-step memory corruption attacks and define them as higher-order memory corruption vulnerabilities (HOMCVs). We abstract detailed multi-step exploit models for these vulnerabilities and expose various attacker-controllable data stores (ACDS) that contribute to memory corruption. Aided by the abstract models, a dynamic data flow analysis (DDFA) based solution is developed to detect data stores that would be transferred to memory and then identify HOMCVs. Our proposed method is validated on an experimental embedded system injected with different variants of higher-order memory corruption vulnerabilities and two real-world embedded devices. We demonstrate that successful detection can be accomplished with an automatic system named Higher-Order Fuzzing Framework (HOFF) which realizes the DDFA-based solution. |
16:32 CET | 12.6.3 | TIVAPROMI: TIME-VARYING PROBABILISTIC ROW-HAMMER MITIGATION Speaker: Hassan Nassar, Karlsruhe Institute of Technology, DE Authors: Hassan Nassar, Lars Bauer and Joerg Henkel, Karlsruher Institut für Technologie, DE Abstract Row-Hammering is a challenge for computing systems that use DRAM. It can cause bit flips in a DRAM row by accessing its neighboring rows. Several mitigation techniques on the memory controller level have already been suggested. The techniques fall into two categories: the first category uses static probabilities, which leads to a performance penalty due to a high number of extra row activations. The second category is based on so-called Tabled Counters, which have large hardware requirements and are mostly infeasible to implement. We introduce a novel Row-Hammer mitigation technique that uses time-varying probabilities combined with a relatively small history table. Our technique reduces the number of extra row activations compared to static probabilistic techniques and demands less storage than Tabled Counters techniques. Compared to the state of the art, our technique offers a good compromise, with a 9×–27× lower storage requirement than Tabled Counters and 6×–12× fewer activations than probabilistic techniques. |
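The recording structure in 12.6.2 above builds on a Cuckoo filter. As a reference point, here is a minimal textbook cuckoo filter with partial-key hashing; the paper's Auto-Cuckoo extensions are not reproduced, and all sizes and hash choices are arbitrary.

```python
# Minimal textbook cuckoo filter (background for 12.6.2's Auto-Cuckoo filter).
import random

BUCKETS, SLOTS, FP_BITS, MAX_KICKS = 256, 4, 8, 64   # BUCKETS: power of two
table = [[] for _ in range(BUCKETS)]

def fingerprint(x):                     # 8-bit fingerprint, never zero
    return (hash(("fp", x)) & ((1 << FP_BITS) - 1)) or 1

def alt_index(i, fp):                   # partial-key cuckoo hashing:
    return (i ^ hash(("alt", fp))) % BUCKETS   # applying twice returns i

def insert(x):
    fp = fingerprint(x)
    i1 = hash(("idx", x)) % BUCKETS
    i2 = alt_index(i1, fp)
    for i in (i1, i2):
        if len(table[i]) < SLOTS:
            table[i].append(fp)
            return True
    i = random.choice((i1, i2))         # both full: evict and relocate
    for _ in range(MAX_KICKS):
        j = random.randrange(SLOTS)
        fp, table[i][j] = table[i][j], fp
        i = alt_index(i, fp)            # move evicted fp to its other bucket
        if len(table[i]) < SLOTS:
            table[i].append(fp)
            return True
    return False                        # filter considered full

def contains(x):
    fp = fingerprint(x)
    i1 = hash(("idx", x)) % BUCKETS
    return fp in table[i1] or fp in table[alt_index(i1, fp)]

for line in range(100):
    insert(("cache_line", line))
print(contains(("cache_line", 7)), contains(("cache_line", 512)))
```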
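To make the probabilistic mitigation idea in 12.6.3 concrete, the toy model below refreshes the neighbours of an activated row with a probability that rises for rows seen recently in a small history table. The specific time-varying schedule is a guess made purely for illustration, not the paper's policy.

```python
# Toy model of probabilistic Row-Hammer mitigation in the spirit of 12.6.3.
import random
from collections import OrderedDict

P_BASE, P_HOT, HISTORY = 0.001, 0.05, 16
history = OrderedDict()                    # small FIFO of recent aggressor rows

def on_activate(row, refresh):
    p = P_HOT if row in history else P_BASE    # time-varying probability
    history[row] = None                        # remember this activation
    history.move_to_end(row)
    if len(history) > HISTORY:
        history.popitem(last=False)            # evict oldest entry
    if random.random() < p:                    # probabilistic neighbour refresh
        refresh(row - 1)
        refresh(row + 1)

refreshed = []
for _ in range(100000):                    # hammer row 42 amid background noise
    row = 42 if random.random() < 0.5 else random.randrange(1 << 16)
    on_activate(row, refreshed.append)
print("extra refreshes issued:", len(refreshed))
```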
12.7 Advancement in modelling of defects and yield in memories
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 16:00 CET - 16:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/TgvxGr2c7S8qQ7wLc
Session chair:
Naghmeh Karimi, UMBC, US
Session co-chair:
Hank Walker, Texas A&M University, US
This session addresses the characterization and fault modeling of intermediate state defects in STT-MRAMs, continues with a scheme for yield estimation of an SRAM array, and concludes with a method to mitigate the effect of stuck-at faults in ReRAM crossbar architectures.
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 12.7.1 | (Best Paper Award Candidate) CHARACTERIZATION AND FAULT MODELING OF INTERMEDIATE STATE DEFECTS IN STT-MRAM Speaker: Lizhou Wu, TUDelft, NL Authors: Lizhou Wu1, Siddharth Rao2, Mottaqiallah Taouil1, Erik Jan Marinissen3, Gouri Sankar Kar3 and Said Hamdioui1 1Delft University of Technology, NL; 2imec, BE; 3imec, BE Abstract Understanding the defects in magnetic tunnel junctions (MTJs) and their faulty behaviors is paramount for developing high-quality tests for STT-MRAM. This paper characterizes and models intermediate (IM) state defects in MTJs; the IM state manifests itself as an abnormal third resistive state, apart from the two bi-stable states of the MTJ. We performed silicon measurements on MTJ devices with diameters ranging from 60nm to 120nm; the results reveal that the occurrence probability of the IM state strongly depends on the switching direction, device size, and applied bias voltage. To test such defects, appropriate fault models are needed. Therefore, we use the advanced device-aware modeling approach, where we first physically model the defect, incorporate it into a Verilog-A MTJ compact model, and calibrate it with silicon data. Thereafter, we use a systematic fault analysis to accurately validate a theoretically predefined fault space and derive realistic fault models. Our simulation results show that the IM state defect causes intermittent write transition faults. This paper also demonstrates that the conventional resistor-based fault modeling and test approach fails to appropriately model IM defects, and is hence incapable of detecting such defects. |
16:15 CET | 12.7.2 | AN EFFICIENT YIELD ESTIMATION METHOD FOR LAYOUTS OF HIGH DIMENSIONAL AND HIGH SIGMA SRAM ARRAYS Speaker: Yue Shen, Fudan University, CN Authors: Yue Shen1, Changhao Yan2, Sheng-Guo Wang3, Dian Zhou4 and Xuan Zeng1 1Fudan University, CN; 2Fudan University, CN; 3University of North Carolina at Charlotte, US; 4University of Texas, US Abstract This paper focuses on the yield estimation problem for post-layout simulation of high dimensional SRAM arrays. Post-layout simulation is much more credible than pre-layout simulation; however, it introduces strong correlation among SRAM columns. A Multi-Fidelity Gaussian Process model between the small and the large SRAM arrays near the Optimal Shift Vector (OSV) is built. An iterative strategy is proposed, and a Multi-Modal method is applied to obtain more prior knowledge of the small SRAM arrays and further accelerate convergence. Experimental results show that the proposed method can gain a 5-7x speedup with smaller relative errors than the state-of-the-art method for 384D cases. |
16:30 CET | IP10_1.2 | MODELING OF THRESHOLD VOLTAGE DISTRIBUTION IN 3D NAND FLASH MEMORY Speaker: Weihua Liu, Huazhong University of Science and Technology, CN Authors: Weihua Liu1, Fei Wu1, Jian Zhou2, Meng Zhang1, Chengmo Yang3, Zhonghai Lu4, Yu Wang1 and Changsheng Xie1 1Huazhong University of Science and Technology, CN; 2University of Central Florida, US; 3University of Delaware, US; 4KTH Royal Institute of Technology in Stockholm, SE Abstract 3D NAND flash memory faces unprecedentedly complicated interference compared with planar NAND flash memory, resulting in more concerns regarding reliability and performance. Stronger error correction codes (ECC) and adaptive reading strategies have been proposed to improve reliability and performance, taking a threshold voltage (Vth) distribution model as the backbone. However, existing modeling methods are challenged to develop such a Vth distribution model for 3D NAND flash memory. To facilitate this, in this paper, we propose a machine learning-based modeling method. It employs a neural network that takes advantage of the existing modeling methods and fully considers the multiple interferences and variations in 3D NAND flash memory. Compared with state-of-the-art models, evaluations demonstrate it is more accurate and efficient for predicting the Vth distribution. |
16:31 CET | 12.7.3 | COST- AND DATASET-FREE STUCK-AT FAULT MITIGATION FOR RERAM-BASED DEEP LEARNING ACCELERATORS Speaker: Mohammed Fouda, University of California Irvine, US Authors: Giju Jung1, Mohammed Fouda2, Sugil Lee3, Jongeun Lee3, Ahmed Eltawil4 and Fadi Kurdahi5 1Ulsan National Institute of Science and Technology, KR; 2University of California Irvine, US; 3Ulsan National Institute of Science and Technology (UNIST), KR; 4Professor, US; 5University of California, Irvine, US Abstract Resistive RAMs can implement extremely efficient matrix-vector multiplication, drawing much attention for deep learning accelerator research. However, a high fault rate is one of the fundamental challenges of ReRAM crossbar array-based deep learning accelerators. In this paper we propose a dataset-free, cost-free method to mitigate the impact of stuck-at faults in ReRAM crossbar arrays for deep learning applications. Our technique exploits the statistical properties of deep learning applications and is hence complementary to previous hardware or algorithmic methods. Our experimental results using the MNIST and CIFAR-10 datasets in binary networks demonstrate that our technique is very effective, both alone and together with previous methods, up to a 20% fault rate, which is higher than previous remapping methods can handle. We also evaluate our method in the presence of other non-idealities such as variability and IR drop. |
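The yield flow in 12.7.2 above samples near an Optimal Shift Vector (OSV). The sketch below shows only the underlying mean-shift importance-sampling estimate of a small failure probability on a stand-in failure predicate; the paper's Multi-Fidelity Gaussian Process model is not reproduced here.

```python
# Mean-shift importance sampling for a high-sigma failure probability,
# background for 12.7.2. The failure predicate and shift are fabricated.
import numpy as np

rng = np.random.default_rng(0)
dim, n = 6, 20000
shift = np.full(dim, 1.8)                    # stand-in OSV toward failures

def fails(x):                                # toy "SRAM cell fails" predicate
    return x.sum(axis=1) > 11.0

x = rng.standard_normal((n, dim)) + shift    # sample around the shift vector
# Likelihood ratio N(0,I)/N(shift,I) evaluated at x:
w = np.exp(-x @ shift + 0.5 * shift @ shift)
p_fail = np.mean(fails(x) * w)               # unbiased estimate of P(fail)
print(f"estimated failure prob: {p_fail:.3e}, yield: {1 - p_fail:.6f}")
```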
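One simple flavour of the dataset-free mitigation idea in 12.7.3 is to flip the sign of an entire crossbar row, which is compensable downstream in a binary network, whenever flipping makes more stuck cells agree with the stored weights. The heuristic below illustrates that idea under stated assumptions; it is not the paper's exact method.

```python
# Row-sign remapping heuristic for stuck-at cells in a binary-weight crossbar,
# loosely in the spirit of 12.7.3. Fault map assumed known from testing.
import numpy as np

rng = np.random.default_rng(1)
rows, cols, fault_rate = 64, 64, 0.10
w = rng.choice([-1, 1], size=(rows, cols))           # trained binary weights
stuck = rng.random((rows, cols)) < fault_rate        # which cells are stuck
stuck_val = rng.choice([-1, 1], size=(rows, cols))   # stuck-at-high/low value

def mismatches(weights):
    return int(np.sum(stuck & (weights != stuck_val)))

flip = np.ones(rows, dtype=int)
for r in range(rows):                                # per-row sign decision
    cur = np.sum(stuck[r] & (w[r] != stuck_val[r]))
    flipped = np.sum(stuck[r] & (-w[r] != stuck_val[r]))
    if flipped < cur:                                # flip row if it helps
        flip[r] = -1

print("mismatched stuck cells before:", mismatches(w))
print("mismatched stuck cells after :", mismatches(w * flip[:, None]))
```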
12.8 Career forum: YPP panel
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 16:00 CET - 17:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/n4S3Jug5K4658nWMX
Session chair:
Sara Vinco, Politecnico di Torino, IT
Session co-chair:
Anton Klotz, Cadence, DE
Organizers:
Xavier Salazar Forn, BSC, ES
Antonio Louro, BSC, ES
Panel session on the different career opportunities open to those with an education in microelectronics. Panelists: Dr. Ioana Vatajelu (University Grenoble-Alpes), Dr. Paul McLellan (Cadence Design Systems Inc.), Dr. David Moloney (Ubotica Technologies), Dr. Andrea Kells (ARM), Dr. John Davis (Barcelona Supercomputing Center).
Time | Label | Presentation Title Authors |
---|---|---|
16:00 CET | 12.8.1 | YPP PANEL ON DIFFERENT CAREER OPPORTUNITIES HAVING EDUCATION IN MICROELECTRONICS Panelists: Sara Vinco1, Anton Klotz2, Elena Ioana Vatajelu3, Paul McLellan2, David Moloney4, Andrea Kells5 and John Davis6 1Politecnico di Torino, IT; 2Cadence, US; 3TIMA, FR; 4Ubotica Technologies, IE; 5ARM, GB; 6BSC, ES Abstract The session will feature a round table discussion with different views and opportunities in computer science high-end research, as speakers from Academia, Industry, and even self-made Entrepreneurs have been invited to give their insights and valuable knowledge on these different paths. Panelists: Dr. Ioana Vatajelu (University Grenoble-Alpes), Dr. Paul McLellan (Cadence Design Systems Inc.), Dr. David Moloney (Ubotica Technologies), Dr. Andrea Kells (ARM), Dr. John Davis (Barcelona Supercomputing Center). |
EXTH_K.YPP YPP Keynote
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/jMCsSq4MB4yAqnbyr
Session chair:
Xavier Salazar Forn, BSC, ES
Session co-chair:
Antonio Louro, BSC, ES
Organizers:
Sara Vinco, Politecnico di Torino, IT
Anton Klotz, Cadence, DE
With the slow-down of the pace of Moore's law, concerns have been expressed about IC design becoming completely commodified. These concerns are misplaced: on the contrary, innovation through design is regaining importance with respect to innovation through technology evolution, as we need to become more creative to work around the hard brick walls in scaling devices. I will draw a couple of examples from my recent experience in designing machine learning accelerators and open source processors, to concretely illustrate how it is not only possible to find fun jobs in IC design, but also that there are interesting new options and business models to innovate in this area.
Time | Label | Presentation Title Authors |
---|---|---|
17:00 CET | EXTH_K.YPP.1 | MOORE'S LAW IS IN TROUBLE... MORE JOBS IN IC DESIGN! Speaker and Author: Luca Benini, Università di Bologna and ETH Zurich, IT Abstract With the slow-down of the pace of Moore's law, concerns have been expressed about IC design becoming completely commodified. These concerns are misplaced: on the contrary, innovation through design is regaining importance with respect to innovation through technology evolution, as we need to become more creative to work around the hard brick walls in scaling devices. I will draw a couple of examples from my recent experience in designing machine learning accelerators and open source processors, to concretely illustrate how it is not only possible to find fun jobs in IC design, but also that there are interesting new options and business models to innovate in this area. |
IP.ASD_2 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/nM6kYLXg8nwB5C4un
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
17:00 CET | IP.ASD_2.1 | SYSTEMS ENGINEERING ROADMAP FOR DEPENDABLE AUTONOMOUS CYBER-PHYSICAL SYSTEMS Speaker and Author: Rasmus Adler, Fraunhofer IESE, DE Abstract Autonomous cyber-physical systems have enormous potential to make our lives more sustainable, more comfortable, and more economical. Artificial Intelligence and connectivity enable autonomous behavior, but often stand in the way of market launch. Traditional engineering techniques are no longer sufficient to achieve the desired dependability; current legal and normative regulations are inappropriate or insufficient. This paper discusses these issues, proposes advanced systems engineering to overcome these issues, and provides a roadmap by structuring fields of action. |
IP.MPIRP_1 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/9HJsmCiF4z9qJ4NYs
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP.MPIRP_1.1 | PROJECT OVERVIEW FOR STEP-UP!CPS – PROCESS, METHODS AND TECHNOLOGIES FOR UPDATING SAFETY-CRITICAL CYBER-PHYSICAL SYSTEMS Speaker: Carl Philipp Hohl, FZI Forschungszentrum Informatik, DE Authors: Thomas Strathmann1, Georg Hake2, Houssem Guissouma3, Carl Philipp Hohl4, Yosab Bebawy1, Sebastian Vander Maelen1 and Andrew Koerner5 1OFFIS e.V., DE; 2University of Oldenburg, DE; 3Karlsruhe Institute of Technology (KIT), DE; 4FZI Forschungszentrum Informatik, DE; 5DLR, DE Abstract We describe the challenges addressed by the three-year German national collaborative research project Step-Up!CPS, which is currently in its third year. The goal of the project is to develop software methods and technologies for modular updates of safety-critical cyber-physical systems. To make this possible, contracts are utilized, which formally describe the behaviour of an update and make it verifiable at different times of the update life cycle. We have defined a development process that allows for continuous improvement of such systems by monitoring their operation, identifying the need for updates, and developing and deploying these updates in a safe and secure manner. We highlight the points along the update process where attacks are possible and show how we counteract them in a contractually secured update process. |
IP.MPIRP_1.2 | VERIDEVOPS: AUTOMATED PROTECTION AND PREVENTION TO MEET SECURITY REQUIREMENTS IN DEVOPS Speaker: Eduard Enoiu, MDH, SE Authors: Andrey Sadovykh1, Gunnar Widforss2, Dragos Truscan3, Eduard Paul Enoiu2, Wissam Mallouli4, Rosa Iglesias5, Alessandra Bagnato6 and Olga Hendel2 1Innopolis University, RU; 2Mälardalen University, SE; 3Åbo Akademi University, FI; 4Montimage EURL, FR; 5Ikerlan Technology Research Centre, ES; 6Softeam, FR Abstract VeriDevOps is a Horizon 2020 funded research project in its initial stage. The project started on 1.10.2020 and will run for three years. VeriDevOps is about fast, flexible software engineering that efficiently integrates development, delivery, and operations, aiming at quality deliveries with short cycle times to address ever-evolving challenges. Current software development practices are increasingly based on using both COTS and legacy components, which makes such systems prone to security vulnerabilities. DevOps, the modern practice for addressing ever-changing conditions, promotes frequent software deliveries; however, verification artifacts should be updated in a timely fashion to cope with the pace of the process. VeriDevOps aims at providing a faster feedback loop for verifying the security requirements and other quality attributes of large-scale cyber-physical systems. VeriDevOps focuses on optimizing the security verification activities by automatically creating verifiable models directly from security requirements, and using these models to check security properties on design models and generate artefacts, such as automatically generated tests or monitors, that can be used later in the DevOps process. The main drivers for these advances are: Natural Language Processing, a combined formal verification and model-based testing approach, and machine-learning-based security monitors. In this paper, we present the planned contributions of the project, its consortium, and its planned way of working to accomplish the expected results. |
IP.MPIRP_2 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/9PXs3DNoMeqKW3ENC
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP.MPIRP_2.1 | THE UP2DATE BASELINE RESEARCH PLATFORMS Speaker: Alvaro Jover, Barcelona Supercomputing Center (BSC) and Universitat Politècnica de Catalunya (UPC), ES Authors: Alvaro Jover-Alvarez1, Alejandro J. Calderón2, Ivan Rodriguez Ferrandez3, Leonidas Kosmidis4, Kazi Asifuzzaman4, Patrick Uven5, Kim Grüttner6, Tomaso Poggi7 and Irune Agirre8 1Universitat Politècnica de Catalunya (UPC) and Barcelona Supercomputing Center (BSC), ES; 2Ikerlan Technology Research Centre, ES; 3Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES; 4Barcelona Supercomputing Center (BSC), ES; 5OFFIS, DE; 6OFFIS - Institute for Information Technology, DE; 7IKERLAN, ES; 8Ikerlan, ES Abstract The UP2DATE H2020 project focuses on high-performance heterogeneous embedded platforms for critical systems. We will develop observability and controllability solutions to support online updates while ensuring safety and security for mixed-criticality tasks. In this paper, we describe the rationale behind the selection of the baseline research platforms which will be used to develop and demonstrate the project concepts, including a performance comparison to identify the most efficient one. |
IP10_1 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/dzoHzrWrfbEiGchKo
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP10_1.1 | A HYBRID ADAPTIVE STRATEGY FOR TASK ALLOCATION AND SCHEDULING FOR MULTI-APPLICATIONS ON NOC-BASED MULTICORE SYSTEMS WITH RESOURCE SHARING Speaker: Navonil Chatterjee, Lab-STICC, CNRS, UBS Research Center, FR Authors: Suraj Paul1, Navonil Chatterjee2, Prasun Ghosal1 and Jean-Philippe Diguet3 1Indian Institute of Engineering Science and Technology, IN; 2Université Bretagne Sud, FR; 3CNRS, Lab-STICC, UBS research center, FR Abstract Allocation and scheduling of applications affect the timing response and system performance, particularly for Network-on-Chip (NoC) based multicore systems executing real-time applications. These systems with multitasking processors provide improved opportunity for parallel application execution. In dynamic scenarios, runtime task allocation improves the system resource utilization and adapts to varying application workload. In this work, we present an efficient hybrid strategy for unified allocation and scheduling of tasks at runtime. By considering multitasking capability of processors, communication cost and task timing characteristics, potential allocation solutions are obtained at design-time. These are adapted for dynamic mapping and scheduling of computation and communication workloads of real-time applications. Simulation results show that the proposed approach achieves 34.2% and 26% average reduction in network latency and communication cost of the allocated applications. Also, the deadline satisfaction of the tasks improves on average by 42.1% while reducing the allocation-time overhead by 32% when compared with existing techniques. |
IP10_1.2 | MODELING OF THRESHOLD VOLTAGE DISTRIBUTION IN 3D NAND FLASH MEMORY Speaker: Weihua Liu, Huazhong University of Science and Technology, CN Authors: Weihua Liu1, Fei Wu1, Jian Zhou2, Meng Zhang1, Chengmo Yang3, Zhonghai Lu4, Yu Wang1 and Changsheng Xie1 1Huazhong University of Science and Technology, CN; 2University of Central Florida, US; 3University of Delaware, US; 4KTH Royal Institute of Technology in Stockholm, SE Abstract 3D NAND flash memory faces unprecedentedly complicated interference compared with planar NAND flash memory, resulting in more concerns regarding reliability and performance. Stronger error correction codes (ECC) and adaptive reading strategies have been proposed to improve reliability and performance, taking a threshold voltage (Vth) distribution model as the backbone. However, existing modeling methods are challenged to develop such a Vth distribution model for 3D NAND flash memory. To facilitate this, in this paper, we propose a machine learning-based modeling method. It employs a neural network that takes advantage of the existing modeling methods and fully considers the multiple interferences and variations in 3D NAND flash memory. Compared with state-of-the-art models, evaluations demonstrate it is more accurate and efficient for predicting the Vth distribution. |
IP10_2 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/z9eWFHa2iEPLZKjHm
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP10_2.1 | TOWARDS NON-INTRUSIVE MALWARE DETECTION FOR INDUSTRIAL CONTROL SYSTEMS Speaker: Prashant Hari Narayan Rajput, NYU Tandon School of Engineering, US Authors: Prashant Hari Narayan Rajput1 and Michail Maniatakos2 1NYU Tandon School of Engineering, US; 2New York University Abu Dhabi, AE Abstract The convergence of the Operational Technology (OT) sector with the Internet of Things (IoT) devices has increased cyberattacks on prominent OT devices such as Programmable Logic Controllers (PLCs). These devices have limited computational capabilities, no antivirus support, strict real-time requirements, and often older, unpatched operating systems. The use of traditional malware detection approaches can impact the real-time performance of such devices. Due to these constraints, we propose Amaya, an external malware detection mechanism based on a combination of signature detection and machine learning. This technique employs remote analysis of malware binaries collected from the main memory of the PLC by a non-intrusive method using the Joint Test Action Group (JTAG) port. We evaluate Amaya against in-the-wild malware for the ARM and x86 architectures, achieving an accuracy of 98% and 94.7%, respectively. Furthermore, we analyze concept drift, spatial experimental bias, and the effects of downsampling the feature vector to understand the behavior of the model in a real-world setting. |
IP10_2.2 | TOWARDS AUTOMATED DETECTION OF HIGHER-ORDER MEMORY CORRUPTION VULNERABILITIES IN EMBEDDED DEVICES Speaker: Lei Yu, UCAS, CN Authors: Lei Yu, Linyu Li, Haoyu Wang, Xiaoyu Wang and Xiaorui Gong, UCAS, CN Abstract The rapid growth and limited security protection of networked embedded devices put the threat of remote code execution related memory corruption attacks front and center among security concerns. Current detection approaches can detect single-step and single-process memory corruption vulnerabilities well by fuzzing tests, and often assume that data stored in the current embedded device or even the embedded device connected to it is safe. However, an adversary might corrupt memory via multi-step exploits if she manages first to abuse the embedded application to store the attack payload and later use this payload in a security-critical operation on memory. These exploits usually lead to persistent code execution attacks and complete control of the device in practice but are rarely covered in state-of-the-art dynamic testing techniques. To address these stealthy yet harmful threats, we identify a large class of such multi-step memory corruption attacks and define them as higher-order memory corruption vulnerabilities (HOMCVs). We abstract detailed multi-step exploit models for these vulnerabilities and expose various attacker-controllable data stores (ACDS) that contribute to memory corruption. Aided by the abstract models, a dynamic data flow analysis (DDFA) based solution is developed to detect data stores that would be transferred to memory and then identify HOMCVs. Our proposed method is validated on an experimental embedded system injected with different variants of higher-order memory corruption vulnerabilities and two real-world embedded devices. We demonstrate that successful detection can be accomplished with an automatic system named Higher-Order Fuzzing Framework (HOFF) which realizes the DDFA-based solution. |
UB.20 University Booth
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/DwHNYF7gkrTmPdKex
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.20 | SKELETOR: AN OPEN SOURCE EDA TOOL FLOW FROM HIERARCHY SPECIFICATION TO HDL DEVELOPMENT Speaker: Ivan Rodriguez Ferrandez, BSC/UPC, ES Authors: Ivan Rodriguez Ferrandez, Guillem Cabo, Javier Barrera, Jeremy Giesen, Alvaro Jover and Leonidas Kosmidis, BSC / UPC, ES Abstract Large hardware design projects have high overhead for project bootstrapping, requiring significant effort for translating hardware specifications to hardware design language (HDL) files and setting up their corresponding development and verification infrastructure. Skeletor (https://github.com/jaquerinte/Skeletor) is an open source EDA tool developed as a student project at UPC/BSC, which simplifies this process by increasing developer productivity and reducing typing errors, while at the same time lowering the bar for entry into hardware development. Skeletor uses a C/Verilog-like language for specifying the modules in a hardware project hierarchy and their connections, which is used to automatically generate the required skeleton of source files, verification testbenches and simulation scripts. We also recently added support for the easy creation of pipelined designs, which integrates with the existing KiCad schematics integration. |
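To illustrate the kind of generation UB.20 automates, the toy below emits a Verilog module stub from a Python dict; the input format is invented for this sketch and is unrelated to Skeletor's real specification language.

```python
# Toy HDL-skeleton generator: a Python dict stands in for a hierarchy spec.
spec = {"name": "alu",
        "inputs": [("a", 32), ("b", 32), ("op", 3)],
        "outputs": [("result", 32), ("zero", 1)]}

def emit_module(spec):
    # Single-bit ports are emitted as [0:0] to keep the generator trivial.
    ports = [f"    input  wire [{w-1}:0] {n}" for n, w in spec["inputs"]]
    ports += [f"    output wire [{w-1}:0] {n}" for n, w in spec["outputs"]]
    body = ",\n".join(ports)
    return (f"module {spec['name']} (\n{body}\n);\n\n"
            f"// TODO: implementation\n\nendmodule\n")

print(emit_module(spec))
```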
UB.21 University Booth
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:00 CET - 17:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/G5bfayjLNTZoNF9TA
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.21 | SHM-LSNN: DEMONSTRATION OF A BRAIN-INSPIRED STRUCTURAL HEALTH MONITORING SYSTEM Speaker: Luca Zanatta, Università di Bologna, IT Authors: Luca Zanatta, Emanuele Parisi, Francesco Barchi, Andrea Bartolini and Andrea Acquaviva, Università di Bologna, IT Abstract In this booth, we will demonstrate a data acquisition and processing system for Structural Health Monitoring (SHM) applications. SHM is a promising research field to provide reliable, effective and low-cost solutions for early detection of anomalies and ageing in civil structures. Energy consumption in such an application is of crucial importance to ensure a long monitoring lifetime without human intervention. The booth will show a demo exploiting MEMS accelerometers and low-power deeply embedded microcontrollers to build a brain-inspired, event-based framework. Being inspired by the functionality of the brain, the pipeline uses spike events to process and classify structure health status in an energy-efficient way. The demo will show how the brain-inspired framework works as well as the real-time power consumption of its main components. The demo will also demonstrate a potential implementation of the spike processing on a neuromorphic chip. |
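UB.21's pipeline converts accelerometer samples into spike events. One common event-based encoder is delta modulation, sketched below; whether SHM-LSNN uses exactly this scheme is an assumption made only for illustration.

```python
# Delta-modulation spike encoding of a vibration signal (illustrative only).
import math

def delta_encode(samples, threshold=0.1):
    # Emit an UP (+1) or DOWN (-1) spike each time the signal moves more
    # than `threshold` away from the last spike-triggering reference level.
    spikes, ref = [], samples[0]
    for t, s in enumerate(samples):
        while s - ref > threshold:
            spikes.append((t, +1)); ref += threshold
        while ref - s > threshold:
            spikes.append((t, -1)); ref -= threshold
    return spikes

signal = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
events = delta_encode(signal, threshold=0.05)
print(f"{len(events)} spike events from {len(signal)} samples")
```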
13.1 Predictable Perception for Autonomous Systems
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:30 CET - 18:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/HEQdxwdJ4stmTYZMC
Session chair:
Soheil Samii, General Motors R&D, US
Session co-chair:
Qing Rao, BMW AG, DE
Organizers:
Selma Saidi, TU Dortmund, DE
Rolf Ernst, TU Braunschweig, DE
Modern autonomous systems - such as autonomous vehicles or robots - consist of two major components: (a) the decision making unit, which is often made up of one or more feedback control loops, and (b) a perception unit that feeds the environmental state to the control unit and is made up of camera, radar and lidar sensors and their associated processing algorithms and infrastructure. While there has been a lot of work on the formal verification of the decision making (or the control) unit, the ultimate correctness of the autonomous system also heavily relies on the behavior of the perception unit. The verification of the correctness of the perception unit is however significantly more challenging and not much progress has been made here. This is because the algorithms used by perception units now increasingly rely on machine learning techniques (like deep neural networks) that run on complex hardware made up of CPU+accelerator platforms. The accelerators are made up of GPUs, TPUs and FPGAs. This combination of algorithmic + implementation platform complexity and heterogeneity currently makes it very difficult to provide either functional or timing correctness guarantees of the perception unit, while both of these guarantees are needed to ensure the correct functioning of the control loop and the overall autonomous system. This is a part of the overall challenge of verifying the correctness of autonomous systems. This session will feature four invited talks, with each of them focusing on different aspects of this problem - some on the timing predictability of the perception unit, others on the functional correctness of the processing algorithms used in the perception units, and the remaining on the reliability and the performance/cost tradeoffs involved in designing perception units for autonomous systems.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 13.1.1 | TIMING-PREDICTABLE VISION PROCESSING FOR AUTONOMOUS SYSTEMS Speaker: Tanya Amert, UNC Chapel Hill, US Authors: Tanya Amert1, Michael Balszun2, Martin Geier3, F. Donelson Smith4, Jim Anderson4 and Samarjit Chakraborty5 1University of North Carolina at Chapel Hill, US; 2TU-München, DE; 3TU Munich, DE; 4University of North Carolina at Chapel Hill, US; 5UNC Chapel Hill, US Abstract Vision processing for autonomous systems today involves implementing machine learning algorithms and vision processing libraries on embedded platforms consisting of CPUs, GPUs and FPGAs. Because many of these use closed-source proprietary components, it is very difficult to perform any timing analysis on them. Even measuring or tracing their timing behavior is challenging, although it is the first step towards reasoning about the impact of different algorithmic and implementation choices on the end-to-end timing of the vision processing pipeline. In this paper we discuss some recent progress in developing tracing, measurement and analysis infrastructure for determining the timing behavior of vision processing pipelines implemented on state-of-the-art FPGA and GPU platforms. |
17:45 CET | 13.1.2 | BOUNDING PERCEPTION NEURAL NETWORK UNCERTAINTY FOR SAFE CONTROL OF AUTONOMOUS SYSTEMS Speaker: Qi Zhu, Northwestern University, US Authors: Zhilu Wang1, Chao Huang2, Yixuan Wang1, Clara Hobbs3, Samarjit Chakraborty4 and Qi Zhu1 1Northwestern University, US; 2Department of Electrical and Computer Engineering, Northwestern University, US; 3University of North Carolina at Chapel Hill, US; 4UNC Chapel Hill, US Abstract Future autonomous systems will rely on advanced sensors and deep neural networks for perceiving the environment, and then utilize the perceived information for system planning, control, adaptation, and general decision making. However, due to the inherent uncertainties from the dynamic environment and the lack of methodologies for predicting neural network behavior, the perception modules in autonomous systems often cannot provide deterministic guarantees and may sometimes lead the system into unsafe states (e.g., as evident by a number of high-profile accidents with experimental autonomous vehicles). This has significantly impeded the broader application of machine learning techniques, particularly those based on deep neural networks, in safety-critical systems. In this paper, we will discuss these challenges, define open research problems, and introduce our recent work in developing formal methods for quantitatively bounding the output uncertainty of perception neural networks with respect to input perturbations, and leveraging such bounds to formally ensure the safety of system control. Unlike most existing works that only focus on either the perception module or the control module, our approach provides a holistic end-to-end framework that bounds the perception uncertainty and addresses its impact on control. |
18:00 CET | 13.1.3 | HARDWARE- AND SITUATION-AWARE SENSING FOR ROBUST CLOSED-LOOP CONTROL SYSTEMS Speaker: Dip Goswami, Eindhoven University of Technology, NL Authors: Sayandip De1, Yingkai Huang2, Sajid Mohamed1, Dip Goswami1 and Henk Corporaal3 1Eindhoven University of Technology, NL; 2Electronic Systems Group, Eindhoven University of Technology, NL; 3TU/e (Eindhoven University of Technology), NL Abstract While vision is an attractive alternative to many sensors targeting closed-loop controllers, it comes with a high, time-varying workload and robustness issues when targeted to edge devices with limited energy, memory and computing resources. Replacing classical vision processing pipelines, e.g., lane detection using a Sobel filter, with deep learning algorithms is a way to deal with the robustness issues, while hardware-efficient implementation is crucial for their adoption in safe closed-loop systems. However, when implemented on an embedded edge device, the performance of these algorithms highly depends on their mapping onto the target hardware and on the situation encountered by the system. That is, first, the timing performance numbers (e.g., latency, throughput) depend on the algorithm schedule, i.e., what part of the AI workload runs where (e.g., GPU, CPU) and their invocation frequency (e.g., how frequently we run a classifier). Second, the perception performance (e.g., detection accuracy) is heavily influenced by the situation – e.g., snowy and sunny weather conditions provide very different lane detection accuracy. These factors directly influence the closed-loop performance, for example, the lane-following accuracy in a lane-keep assist system (LKAS). We propose a hardware- and situation-aware design of AI perception where the idea is to define the situations by a set of relevant environmental factors (e.g., weather, road, etc. in an LKAS). We design the learning algorithms and parameters, the overall hardware mapping and its schedule taking the situation into account. We show the effectiveness of our approach considering a realistic LKAS case study on the heterogeneous NVIDIA AGX Xavier platform in a hardware-in-the-loop framework. Our approach provides robust LKAS designs with 32% better performance compared to traditional approaches. |
18:15 CET | 13.1.4 | ORCHESTRATION OF PERCEPTION SYSTEMS FOR RELIABLE PERFORMANCE IN HETEROGENEOUS PLATFORMS Speaker: Soumyajit Dey, Indian Institute of Technology Kharagpur, IN Authors: Anirban Ghose1, Srijeeta Maity2, Arijit Kar1 and Soumyajit Dey3 1Indian Institute of Technology, Kharagpur, IN; 2student, IN; 3IIT Kharagpur, IN Abstract Delivering driving comfort in this age of connected mobility is one of the primary goals of semi-autonomous perception systems increasingly being used in modern automotives. The performance of such perception systems is a function of the execution rate, which demands on-board platform-level support. With the advent of GPGPU compute support in automobiles, there exists an opportunity to adaptively enable higher execution rates for such Advanced Driver Assistant System (ADAS) tasks subject to different vehicular driving contexts. This can be achieved through a combination of program-level locality optimizations such as kernel fusion, thread coarsening and core-level DVFS techniques, while keeping in mind their effects on task-level deadline requirements and platform-level thermal reliability. In this communication, we present a future-proof, learning-based adaptive scheduling framework that strives to deliver reliable and predictable performance of ADAS tasks while accommodating increased task-level throughput requirements. |
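Determining timing behavior, as discussed in 13.1.1 above, starts with timestamping pipeline stages. The minimal software-only sketch below wraps each stage and records per-stage latencies; the stage names and workloads are dummies, and real GPU/FPGA stages require device-side instrumentation that host-side wrappers like this cannot capture.

```python
# Host-side latency tracing of a toy vision pipeline (illustrative only).
import time
from collections import defaultdict

trace = defaultdict(list)

def traced(stage):
    def wrap(fn):
        def inner(*args, **kwargs):
            t0 = time.perf_counter()
            out = fn(*args, **kwargs)
            trace[stage].append(time.perf_counter() - t0)  # record latency
            return out
        return inner
    return wrap

@traced("preprocess")
def preprocess(frame): return [p * 0.5 for p in frame]

@traced("detect")
def detect(frame): return sum(frame) > 100    # stand-in for a DNN stage

for _ in range(50):
    detect(preprocess(list(range(64))))

for stage, samples in trace.items():
    print(f"{stage:>10}: max {max(samples)*1e6:8.1f} us, "
          f"mean {sum(samples)/len(samples)*1e6:8.1f} us")
```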
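A classic building block for the kind of output bounds discussed in 13.1.2 is interval bound propagation through affine and ReLU layers, sketched here with numpy on random weights. The paper's actual method and networks are more sophisticated than this.

```python
# Interval bound propagation (IBP) through a tiny random ReLU network.
import numpy as np

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((8, 4)), rng.standard_normal(8)),
          (rng.standard_normal((2, 8)), rng.standard_normal(2))]

def ibp(lo, hi):
    for i, (W, b) in enumerate(layers):
        Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)    # split by weight sign
        lo, hi = Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b
        if i < len(layers) - 1:                        # ReLU is monotone
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

x = rng.standard_normal(4)
eps = 0.05                                 # L-infinity input perturbation bound
lo, hi = ibp(x - eps, x + eps)
print("output lower bounds:", lo)
print("output upper bounds:", hi)
```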
13.2 Hardware-based Malware Detectors: A Special Session on Embedded Security
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:30 CET - 18:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/RRxAxqhitDj5B47Xp
Session chair:
Christian Pilato, Politecnico di Milano, IT
Session co-chair:
Kanad Basu, University of Texas at Dallas, US
Organizer:
Kanad Basu, University of Texas at Dallas, US
Malware has proliferated across a variety of electronic platforms, e.g., PCs, servers, and hand-held electronic appliances such as smartphones. Traditionally, anti-virus software (AVS) is used to detect Malware. However, with the advent of Internet-of-Things (IoT) devices, several pitfalls of AVS have become manifest, which has motivated researchers to develop Hardware-based Malware Detectors (HMDs). HMDs use trusted hardware features to distinguish between Malware and benign programs, thus incorporating the hardware as a root of trust to improve system security. This special session will provide a holistic coverage of the research challenges associated with the design and practice of HMDs in IoT systems. In the first presentation, HMDs will be introduced and an application in the context of automotive security will be discussed. The next two presentations focus on two specific challenges in using Machine Learning (ML) for HMDs, namely, hardening HMDs against adversarial attacks and developing an explainable ML model for HMDs. The final presentation will discuss a system-on-chip security architecture for software assurance. This special session will not only provide an overview of HMDs to the DATE attendees, it will also enable cross-pollination by sparking innovative ideas for the application of HMDs in IoT systems. Unlike other special or regular sessions on hardware security, the proposed session will emphasize how hardware as a root of trust can be used for enhancing IoT security. We believe that this special session will foster a valuable cross-domain exchange of ideas between experts in computer architecture, EDA, cyber-security and machine learning that is likely to benefit the industry practitioners and researchers in the DATE community.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 13.2.1 | HARDWARE-ASSISTED DETECTION OF MALWARE IN AUTOMOTIVE-BASED SYSTEMS Speaker: Abraham Kuruvila, University of Texas at Dallas, US Authors: Yug Pratap Singh, Abraham Kuruvila and Kanad Basu, University of Texas at Dallas, US Abstract In the age of Internet-of-Things (IoT), automobiles have become heavily integrated and reliant on computerized components for system functionality. Modern vehicles have many Electronic Control Units (ECUs) that control ignition timing, suspension control, and transmission shifting. The Engine Control Module (ECM) is generally recognized as one of the most essential components owing to its functionality of regulating air and fuel input to the engine. Consequently, automotive security is an emerging problem that will only escalate as vehicles integrate more computerized components in conjunction with wireless system connectivity. Attackers that successfully gain access to important vehicular components and compromise existing functionality can induce a plethora of malevolent activities. With the evolution and exponential proliferation of Malware, identifying malicious entities is critical for maintaining proper system performance. Traditional anti-virus software is inadequate against complex Malware, which has engendered a push towards Hardware-assisted Malware Detection (HMDs) using Hardware Performance Counters (HPCs). HPCs are special purpose registers that track low-level micro-architectural events. In this paper, we propose using Machine Learning models trained on HPC data to identify malicious entities in the ECM. Our experimental results determine that the proposed ML-based models can successfully identify malicious actions in an automotive system with a classification accuracy of up to 96.7%. |
17:45 CET | 13.2.2 | HMD-HARDENER: ADVERSARIALLY ROBUST AND EFFICIENT HARDWARE-ASSISTED RUNTIME MALWARE DETECTION Speaker: Sai Manoj Pudukotai Dinakarrao, George Mason University, US Authors: Abhijitt Dhavlle, Sanket Shukla, Setareh Rafatirad, Houman Homayoun and Sai Manoj Pudukotai Dinakarrao, George Mason University, US Abstract To overcome the performance overheads incurred by traditional software-based malware detection techniques, machine learning (ML) based Hardware-assisted Malware Detection (HMD) has emerged as a panacea to detect malicious applications and provide security. HMD primarily relies on low-level microarchitectural events captured through Hardware Performance Counters (HPCs). This work proposes an adversarial attack on HMD systems that tampers with their security by introducing perturbations in performance counter traces with an adversarial sample generator application. To craft the attack, we first deploy an adversarial sample predictor to predict the adversarial HPC pattern for a given application to be misclassified by the deployed ML classifier in the HMD. Further, as the attacker has no direct access to manipulate the HPCs generated during runtime, we devise, based on the adversarial sample predictor's output, an adversarial sample generator wrapped around the victim application to produce HPC patterns similar to the predictor's estimated trace. With the proposed attack, malware detection accuracy is reduced to 18.1% from 82%. To render the HMD robust to such attacks, we further propose adversarially training the HMD to demonstrate that hardening can render the HMD resilient against attacks; the detection accuracy after hardening rises to 81.2%. |
18:00 CET | 13.2.3 | HARDWARE-ASSISTED MALWARE DETECTION USING MACHINE LEARNING Speaker: Prabhat Mishra, University of Florida, US Authors: Zhixin Pan, Jennifer Sheldon, Chamika Sudusinghe, Subodha Charles and Prabhat Mishra, University of Florida, US Abstract Malicious software, popularly known as malware, is a serious threat to modern computing systems. A comprehensive cybercrime study by Ponemon Institute highlights that malware is the most expensive attack for organizations, with an average revenue loss of $2.6 million per organization in 2018 (11% increase compared to 2017). Recent high-profile malware attacks coupled with serious economic implications have dramatically changed our perception of threat from malware. Software-based solutions, such as anti-virus programs, are not effective since they rely on matching patterns (signatures) that can be easily fooled by carefully crafted malware with obfuscation or other deviation capabilities. Moreover, software-based solutions are not fast enough for real-time malware detection in safety-critical systems. In this paper, we investigate promising approaches for hardware-assisted malware detection using machine learning. Specifically, we explore how machine learning can be effective for malware detection utilizing hardware performance counters, embedded trace buffer as well as on-chip network traffic analysis. |
18:15 CET | 13.2.4 | CASTLE: ARCHITECTING ASSURED SYSTEM-ON-CHIP FIRMWARE INTEGRITY Speaker: Atul Prasad Deb Nath, University of Florida, US Authors: Sandip Ray, Atul Prasad Deb Nath, Kshitij Raj and Swarup Bhunia, University of Florida, US Abstract Modern System-on-Chip (SoC) designs include a large number of embedded microcontrollers that execute custom firmware. Firmware provides the flexibility of updating security features, i.e., it enables patching or in-field updates in response to an emerging security threat, bug, or changing requirements. Unfortunately, current firmware update mechanisms are complex, manual, and error-prone. In this paper, we present CASTLE, an architectural framework to enable systematic and assured updates to SoC firmware. The main workhorse of CASTLE is a centralized, dedicated IP in the SoC that is responsible for receiving, authenticating, and installing a patch. The architecture works with off-chip firmware validation flows, e.g., a cloud-based service for validating a proposed patch and identifying compatibility constraints on other firmware resident in the SoC. The result is a comprehensive infrastructure that works seamlessly across architectures, vendors, and service providers, while meeting deployment and usability requirements. We demonstrate the application of the proposed framework in addressing functional and security flaws of existing firmware patching mechanisms, including firmware incompatibility, inadequate authentication, and time-of-check vs. time-of-use (TOCTOU) constraints. |
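At its core, the HPC-driven detection flow described in 13.2.1 and 13.2.3 amounts to training a classifier on vectors of hardware-counter readings collected per sampling window. The minimal sketch below illustrates that pipeline on synthetic data; the counter names, the synthetic traces, and the random-forest model are illustrative assumptions, not details taken from the papers.

```python
# Minimal sketch of HPC-based malware classification (illustrative only).
# Counter names and the synthetic data are assumptions; the papers' actual
# feature sets, sampling windows, and models may differ.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
HPC_FEATURES = ["instructions", "branch-misses", "cache-misses", "LLC-loads"]

# Synthetic stand-in for counter traces: one row per sampling window.
benign = rng.normal(loc=1.0, scale=0.2, size=(500, len(HPC_FEATURES)))
malware = rng.normal(loc=1.4, scale=0.3, size=(500, len(HPC_FEATURES)))
X = np.vstack([benign, malware])
y = np.array([0] * 500 + [1] * 500)  # 0 = benign, 1 = malicious

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(f"classification accuracy: {accuracy_score(y_te, clf.predict(X_te)):.3f}")
```

On real systems, the feature rows would come from counter sampling (e.g., periodic reads of a small set of HPCs) rather than from synthetic distributions; the learning step itself is unchanged.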
13.3 High Performance and Energy Efficiency: Microarchitecture to the rescue
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/mzzDRkPKwACD3pFzy
Session chair:
Leonidas Kosmidis, Barcelona Supercomputing Center, ES
Session co-chair:
Magnus Själander, Norwegian University of Science and Technology, NO
This session presents microarchitectural proposals for obtaining high performance and energy efficiency. We begin with a design that uses indirection stream semantic registers to accelerate sparse-dense algebra operations, followed by a reconfiguration-based solution that performs multiply-and-accumulate operations using lower-precision units. Register-file conflicts in GPGPUs are reduced by combining narrow-width requests into a single access, and the power consumption of instruction caches is lowered through a dynamic early tag lookup scheme. A toy software sketch of the indirection pattern addressed by the first paper follows the session table.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 13.3.1 | INDIRECTION STREAM SEMANTIC REGISTER ARCHITECTURE FOR EFFICIENT SPARSE-DENSE LINEAR ALGEBRA Speaker: Paul Scheffler, Integrated Systems Laboratory, ETH Zurich, CH Authors: Paul Scheffler1, Florian Zaruba1, Fabian Schuiki1, Torsten Hoefler2 and Luca Benini3 1Integrated Systems Laboratory, ETH Zurich, CH; 2Scalable Parallel Computing Laboratory, ETH Zurich, CH; 3Università di Bologna and ETH Zurich, IT Abstract Sparse-dense linear algebra is crucial in many domains, but challenging to handle efficiently on CPUs, GPUs, and accelerators alike; multiplications with sparse formats like CSR and CSF require indirect memory lookups. In this work, we enhance a memory-streaming RISC-V ISA extension to accelerate sparse-dense products through streaming indirection. We present efficient dot, matrix-vector, and matrix-matrix product kernels using our hardware, enabling single-core FPU utilizations of up to 80% and speedups of up to 7.2x over an optimized baseline without extensions. A matrix-vector implementation on a multi-core cluster is up to 5.8x faster and 2.7x more energy-efficient with our kernels than an optimized baseline. We propose further uses for our indirection hardware, such as scatter-gather operations and codebook decoding, and compare our work to state-of-the-art CPU, GPU, and accelerator approaches, measuring a 2.8x higher peak FP64 utilization in CSR matrix-vector multiplication than a GTX 1080 Ti GPU running a cuSPARSE kernel. |
17:45 CET | 13.3.2 | A RECONFIGURABLE MULTIPLE-PRECISION FLOATING-POINT DOT PRODUCT UNIT FOR HIGH-PERFORMANCE COMPUTING Speaker: Wei Mao, Southern University of Science and Technology, CN Authors: Wei Mao1, Kai Li1, Xinang Xie1, Shirui Zhao1, He Li2 and Hao Yu1 1Southern University of Science and Technology, CN; 2University of Cambridge, GB Abstract There is an emerging need to optimize floating-point (FP) dot product units (DPUs) for high-performance scientific computing as well as for training deep learning models. Because applications have different precision requirements, a reconfigurable multiple-precision DPU can largely reduce area and power costs. However, existing methods not only introduce redundant bits in the unit multipliers, but also leave hardware resources idle for operations in different precisions. In this paper, a reconfigurable multiple-precision FP DPU design is proposed for high-performance computing (HPC) applications. The FP DPU can be reconfigured as follows. A bit-partitioning method minimizes the redundant bits with a configurable mixed-precision multiplier for three operation modes: 20 half-precision dot product (DP), 5 single-precision DP, and 1 double-precision DP operations. Any of the modes can be executed in two successive clock cycles without idle hardware resources. The proposed design is realized in the UMC 55-nm process and evaluated through simulation. Compared with existing multiple-precision FP methods, the proposed DPU achieves 88.9% and 35.8% area savings for FP16 and FP32 operations, respectively. Moreover, on benchmarked HPC applications where multiple precisions can be used, the proposed reconfigurable DPU achieves up to 4× and 20× higher maximum throughput rates compared with fixed FP32 and FP64 operations, respectively. |
18:00 CET | IP11_1.1 | POWER REDUCTION OF A SET-ASSOCIATIVE INSTRUCTION CACHE USING A DYNAMIC EARLY TAG LOOKUP Speaker: Chun-Chang Yu, National Taiwan University, TW Authors: Chun-Chang Yu1, Yu Hen Hu2, Yi-Chang Lu1 and Charlie Chung-Ping Chen1 1National Taiwan University, TW; 2University of Wisconsin - Madison, US Abstract An energy-efficient instruction cache lookup technique with low area overheads is proposed. The key concept of this Dynamic Early Tag Lookup (DETL) method is to exploit the presence of instruction fetch-bubble cycles. In a fetch-bubble cycle, the index of the matching cache set can be determined earlier. Hence, the dynamic energy for parallel memory accesses to irrelevant cache banks can be saved. We implemented the proposed DETL algorithm on a 4-way set-associative instruction cache in a RISC-V micro-architecture, and tested its performance using the SPEC CPU2006 benchmark suite. The experimental results showed a 19.38% dynamic power reduction with an area overhead smaller than 0.1%. |
18:01 CET | 13.3.3 | CMRC: COMPREHENSIVE MICROARCHITECTURAL REGISTER COALESCING FOR GPGPUS Speaker: Ahmad Radaideh, Qualcomm Technologies Inc., US Authors: Ahmad Radaideh1 and Paul Gratz2 1Qualcomm Technologies, Inc., US; 2Texas A&M University, US Abstract Graphics processing units (GPUs) deploy a large register file (RF) to achieve high compute throughput. This RF, however, consumes a large portion of the total dynamic power in the GPU. Additionally, the RF banks and operand collectors (OCs) are designed with a limited number of ports, causing access serialization and negatively impacting performance. In this work, we introduce CMRC, a coalescing-aware RF organization that takes advantage of the frequent narrow-width data present in general-purpose applications to increase performance and reduce energy for GPGPUs. CMRC is a low-cost comprehensive approach to register coalescing capable of combining narrow-width read and write accesses from the same or different warp instructions into fewer accesses, reducing port contention and access pressure. On general-purpose applications, CMRC reduces RF accesses by 31.8%, achieves a performance speedup of 16.5%, and reduces overall GPU energy by 32.2% on average, outperforming best-of-class prior work by ~1.8x without requiring compiler support. |
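To make the indirection bottleneck targeted by 13.3.1 concrete, the sketch below shows a plain-software CSR matrix-vector product: every multiply requires an indirect load `x[col[j]]`, which is exactly the address stream that stream-semantic indirection registers resolve in hardware. This is only a reference illustration of the access pattern, not a model of the paper's ISA extension.

```python
# Software illustration of the indirect accesses in a CSR mat-vec product.
# The indirect load x[col[j]] is the pattern that indirection stream
# semantic registers (13.3.1) turn into a hardware-generated stream.
import numpy as np

def csr_matvec(val, col, row_ptr, x):
    """y = A @ x for A stored in CSR form (val / col / row_ptr)."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[j] * x[col[j]]   # indirect memory lookup
    return y

# Tiny 3x3 example: [[1,0,2],[0,3,0],[4,0,5]]
val = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
col = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 1.0, 1.0])
print(csr_matvec(val, col, row_ptr, x))  # -> [3. 3. 9.]
```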
13.4 Machine Learning for Physical Design: It’s Never Too Late
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/vAo4ut5w3GNG2Ysde
Session chair:
Rasit Topaloglu, IBM, US
Session co-chair:
Igor Markov, Facebook Inc., US
From early routability prediction, to actual routing, to dealing with design rules - it's never too late to use machine learning in physical design. This session introduces three techniques that empower and apply machine learning: netlist generation to provide enough training samples, reinforcement learning to guide net ordering, and constraint-driven learning to handle exploding design-rule decks. A minimal software sketch of routability prediction cast as supervised learning follows the session table.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 13.4.1 | MACHINE LEARNING FRAMEWORK FOR EARLY ROUTABILITY PREDICTION WITH ARTIFICIAL NETLIST GENERATOR Speaker: Daeyeon Kim, POSTECH, KR Authors: Daeyeon Kim1, Hyunjeong Kwon1, Sung-Yun Lee1, Seungwon Kim2, Mingyu Woo2 and Seokhyeong Kang3 1POSTECH, KR; 2University of California at San Diego, US; 3Pohang University of Science and Technology, KR Abstract Recent routability research has exploited machine learning (ML)-based modeling methodologies to consider the various routability factors derived from a placement solution. These factors are closely related to circuit characteristics (e.g., pin density, routing congestion, demand for routing resources), and a lack of circuit benchmarks in training can lead to poor predictability for 'unseen' circuit designs. In this paper, we propose an ML framework for early routability prediction modeling. The framework includes a new artificial netlist generator (ANG) that generates artificial, yet realistic, gate-level netlists from user-specified topological characteristics. Because ANG supplies the ground truths used to train the ML-based model, a training dataset covering a wide range of topological characteristics gives the model a strong ability to infer on noisy, previously unseen data. Compared to the design-specific training dataset [4] used for routability prediction modeling, we increase the test accuracy of binary classification ('pass' or 'fail') on timing, DRC and routability by 6.3%, 8.6% and 6.6%, respectively, and reduce the generalization error by as much as 87%. |
17:45 CET | 13.4.2 | ASYNCHRONOUS REINFORCEMENT LEARNING FRAMEWORK FOR NET ORDER EXPLORATION IN DETAILED ROUTING Speaker: Tong Qu, Institute of Microelectronics of the Chinese Academy of Sciences, CN Authors: Tong Qu1, Yibo Lin2, Zongqing Lu2, Yajuan Su1 and Yayi Wei1 1Institute of Microelectronics of the Chinese Academy of Sciences, CN; 2Peking University, CN Abstract The net orders in detailed routing are crucial to routing closure, especially for modern routers, most of which follow a sequential routing manner with a rip-up-and-reroute scheme. In advanced technology nodes, detailed routing has to deal with complicated design rules and large problem sizes, making its performance more sensitive to the order of nets to be routed. In the literature, net orders are mostly determined by simple heuristic rules tuned for specific benchmarks. In this work, we propose an asynchronous reinforcement learning (RL) framework to search for optimal ordering strategies automatically. By asynchronously querying the router and training the RL agents, we can generate high-performance routing sequences that achieve better solution quality. |
18:00 CET | IP11_2.1 | GLOBAL PLACEMENT WITH DEEP LEARNING-ENABLED EXPLICIT ROUTABILITY OPTIMIZATION Speaker: Siting Liu, The Chinese University of Hong Kong, HK Authors: Siting Liu1, Qi Sun1, Peiyu Liao1, Yibo Lin2 and Bei Yu1 1The Chinese University of Hong Kong, HK; 2Peking University, CN Abstract Placement and routing (PnR) is the most time-consuming part of the physical design flow. Recognizing the routing performance ahead of time can assist designers and design tools to optimize placement results in advance. In this paper, we propose a fully convolutional network model to predict congestion hotspots and then incorporate this prediction model into a placement engine, DREAMPlace, to get a more route-friendly result. The experimental results on ISPD2015 benchmarks show that with the superior accuracy of the prediction model, our proposed approach can achieve up to 9.05% reduction in congestion rate and 5.30% reduction in routed wirelength compared with the state-of-the-art. |
18:01 CET | IP11_2.2 | MAVIREC: ML-AIDED VECTORED IR-DROP ESTIMATION AND CLASSIFICATION Speaker: Vidya Chhabria, University of Minnesota, US Authors: Vidya A. Chhabria1, Yanqing Zhang2, Haoxing Ren3, Ben Keller3, Brucek Khailany3 and Sachin S. Sapatnekar1 1University of Minnesota, US; 2NVIDIA, US; 3NVIDIA Corporation, US Abstract Vectored IR drop analysis is a critical step in chip signoff that checks the power integrity of an on-chip power delivery network. Due to the prohibitive runtimes of dynamic IR drop analysis, the large number of test patterns must be whittled down to a small subset of worst-case IR vectors. Unlike traditional slow heuristic methods that select a few vectors with incomplete coverage, MAVIREC uses machine learning techniques---3D convolutions and regression-like layers---for accurately recommending a larger subset of test patterns that exercise worst-case scenarios. In under 30 minutes, MAVIREC profiles 100K-cycle vectors and provides better coverage than a state-of-the-art industrial flow. Further, MAVIREC's IR drop predictor shows 10x speedup with under 4mV RMSE relative to an industrial flow. |
18:02 CET | 13.4.3 | GENERATING LAYOUTS OF STANDARD CELLS BY IMPLICIT LEARNING ON DESIGN RULES FOR ADVANCED PROCESSES Speaker: Chia-Wei Liang, National Chiao-Tung University, TW Authors: Chia-Wei (Aaron) Liang1, Hsuan-Ming Huang2 and Hung-Pin (Charles) Wen1 1National Chiao-Tung University (NCTU), TW; 2MediaTek Inc. (MTK), TW Abstract For advanced process technologies (e.g., finFET with EUV), design rules (DRs) are the most challenging issue in the generation of cell layouts, and all DR violations must be resolved to obtain a legal cell layout. However, most previous works explicitly encode selected DRs into the routing engine and cannot accommodate the rapid growth in the size and complexity of DRs as processes continue to advance. Therefore, in this paper, we propose two implicit-learning techniques, (1) experience-guidance learning (EGL) and (2) constraint-driven learning (CDL), to effectively address these two DR problems, and we develop an automatic cell-layout generation (ACLG) framework for efficiently generating legal cell layouts. The experimental results showed that in a finFET-EUV process [1], EGL and CDL successfully remove all DR violations on eight target cells, where each case takes three minutes on average. As a result, without manual effort, ACLG is capable of generating legal layouts of standard cells by implicit learning on the DRs of advanced processes. |
18:17 CET | IP11_3.1 | A GPU-ENABLED LEVEL-SET METHOD FOR MASK OPTIMIZATION Speaker: Bei Yu, The Chinese University of Hong Kong, HK Authors: Ziyang Yu1, Guojin Chen2, Yuzhe Ma1 and Bei Yu1 1The Chinese University of Hong Kong, HK; 2The Chinese University of Hong Kong, HK Abstract As the feature size of advanced integrated circuits keeps shrinking, resolution enhancement technique (RET) is utilized to improve the printability in the lithography process. Optical proximity correction (OPC) is one of the most widely used RETs aiming at compensating the mask to generate a more precise wafer image. In this paper, we put forward a level-set based OPC with high mask optimization quality and fast convergence. In order to suppress the disturbance of the condition fluctuation in lithography process, we propose a new process window-aware cost function. Then, a novel momentum-based evolution technique is adopted, which demonstrates substantial improvement. Moreover, graphics processing unit (GPU) is leveraged for accelerating the proposed algorithm. Experimental results on ICCAD 2013 benchmarks show that our algorithm outperforms all previous OPC algorithms in terms of both solution quality and runtime overhead. |
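Several papers in this session cast routability as supervised learning over placement-derived features. The bare-bones version of that formulation is sketched below: a binary pass/fail classifier trained on per-design feature vectors. The feature names, synthetic labels, and the logistic-regression model are illustrative assumptions; the papers' actual models (e.g., the fully convolutional network in IP11_2.1) operate on richer 2-D congestion maps.

```python
# Bare-bones routability pass/fail classifier over placement-derived
# features (illustrative; feature names and data are assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Assumed per-design features: [mean pin density, peak congestion, utilization]
X = rng.uniform(0.0, 1.0, size=(400, 3))
# Synthetic labels: designs with high peak congestion tend to fail routing.
noise = rng.normal(0.0, 0.1, size=400)
y = (0.5 * X[:, 0] + 1.5 * X[:, 1] + 0.3 * X[:, 2] + noise > 1.2).astype(int)

model = LogisticRegression().fit(X, y)
new_design = np.array([[0.4, 0.9, 0.7]])  # hypothetical placement snapshot
print("predicted probability of routing failure:",
      model.predict_proba(new_design)[0, 1])
```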
13.5 Accelerating Clustering and Machine Learning Algorithms for Large Amounts of Data
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/eGCRRxKg3eYky5a6t
Session chair:
Amit Singh, University of Essex, GB
Session co-chair:
Marina Zapater, Haute école d'Ingénierie et de Gestion du Canton de Vaud, CH
This session covers accelerators geared towards algorithmic approaches for handling large amounts of data, such as binary clustering, transformer neural networks and deep neural networks. The first paper presents in-memory computing approaches that accelerate transformer networks, whereas the second paper proposes the concept of layer removal for convolutional neural networks, taking into consideration latency and application deadlines. Lastly, the third paper proposes an adaptive framework for efficient binary clustering targeting high-dimensional data spaces; a small software sketch of its Hamming-distance clustering idea follows the session table. The session also includes two IP papers, one geared towards a low-power visual sensor node, and a second presenting accelerated whole-genome read mapping.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 13.5.1 | IN-MEMORY COMPUTING BASED ACCELERATOR FOR TRANSFORMER NETWORKS FOR LONG SEQUENCES Speaker: Ann Franchesca Laguna, University of Notre Dame, US Authors: Ann Franchesca Laguna, Arman Kazemi, Michael Niemier and X. Sharon Hu, University of Notre Dame, US Abstract Transformer networks have outperformed recurrent neural networks and convolutional neural networks in various sequential tasks. However, scaling transformer networks for long sequences has been challenging because of memory and compute bottlenecks. Transformer networks are impeded by memory bandwidth limitations because of their low operations-per-byte ratio, resulting in low utilization of the GPU's computing resources. In-memory processing can mitigate memory bottlenecks by eliminating the transfer time between memory and compute units. Furthermore, transformer networks use neural attention mechanisms to characterize the relationships between sequence elements. Efficient hardware solutions have been proposed to implement efficient attention mechanisms, which include ternary content addressable memories (TCAMs), crossbar arrays (XBars), and processing in-memory (PIM). However, these solutions do not implement a multi-head self-attention mechanism. We propose using a combination of XBars and CAMs to accelerate transformer networks. We improve the speed of transformer networks by (1) computing in-memory, thus minimizing the memory transfer overhead, (2) caching reusable parameters to reduce the number of operations, (3) exploiting the available parallelism in the attention mechanism, and (4) using locality sensitive hashing to filter the number of sequence elements by their importance. Our approach achieves a 200x speedup and a 41x energy improvement for a sequence length of 4098. |
17:45 CET | 13.5.2 | NETCUT: REAL-TIME DNN INFERENCE USING LAYER REMOVAL Speaker: Mehrshad Zandigohar, Northeastern University, US Authors: Mehrshad Zandigohar, Deniz Erdogmus and Gunar Schirner, Northeastern University, US Abstract Deep Learning plays a significant role in assisting humans in many aspects of their lives. As these networks tend to get deeper over time, they extract more features to increase accuracy at the cost of additional inference latency. This accuracy-performance trade-off makes it more challenging for embedded systems, i.e., resource-constrained processors with strict deadlines, to deploy them efficiently. It can lead to the selection of networks that meet a specified deadline prematurely, with excess slack time that could have contributed to increased accuracy. In this work, we propose: (i) the concept of layer removal as a means of constructing TRimmed Networks (TRNs), based on removing problem-specific features of a pretrained network used in transfer learning, and (ii) NetCut, a methodology based on an empirical or an analytical latency estimator, which only proposes and retrains TRNs that can meet the application's deadline, hence reducing the exploration time significantly. We demonstrate that TRNs can expand the Pareto frontier that trades off latency and accuracy to provide networks that can meet arbitrary deadlines with potential accuracy improvement over off-the-shelf networks. Our experimental results show that such utilization of TRNs, while transferring to a simpler dataset, in combination with NetCut, can lead to the proposal of networks that can achieve a relative accuracy improvement of up to 10.43% over existing off-the-shelf neural architectures while meeting a specific deadline, and a 27x speedup in exploration time. |
18:00 CET | IP11_4.1 | FLYDVS: AN EVENT-DRIVEN WIRELESS ULTRA-LOW POWER VISUAL SENSOR NODE Speaker: Alfio Di Mauro, ETH Zurich, CH Authors: Alfio Di Mauro1, Moritz Scherer1, Jordi Fornt Mas1, Basile Bougenot1, Michele Magno1 and Luca Benini2 1ETH Zurich, CH; 2Università di Bologna and ETH Zurich, IT Abstract Event-based cameras, also called dynamic vision sensors (DVS), inspired by the human vision system, are gaining popularity due to their potential energy savings, since they generate asynchronous events only from pixel changes in the field of view. Unfortunately, in most current uses, data acquisition, processing, and streaming of data from event-based cameras are performed by power-hungry hardware, mainly high-power FPGAs. For this reason, the overall power consumption of an event-based system that includes digital capture and streaming of events is in the order of hundreds of milliwatts or even watts, significantly reducing usability in real-life low-power applications such as wearable devices. This work presents FlyDVS, the first event-driven wireless ultra-low-power visual sensor node, which includes a low-power Lattice FPGA and a Bluetooth wireless system-on-chip, and hosts a commercial ultra-low-power DVS camera module. Experimental results show that the low-power FPGA can reach up to 874 efps (event-frames per second) with only 17.6mW of power, and the sensor node consumes an overall power of 35.5 mW (including wireless streaming) at 200 efps. We demonstrate FlyDVS in a real-life scenario, namely, acquiring event frames of a gesture recognition data set. |
18:01 CET | IP11_4.2 | PLEDGER: EMBEDDED WHOLE GENOME READ MAPPING USING ALGORITHM-HW CO-DESIGN AND MEMORY-AWARE IMPLEMENTATION Speaker: Sidharth Maheshwari, School of Engineering, Newcastle University, GB Authors: Sidharth Maheshwari1, Rishad Shafik1, Alex Yakovlev1, Ian Wilson2, Venkateshwarlu Yellaswamy Gudur3 and Amit Acharyya4 1Newcastle University, GB; 2Department of Biosciences, Newcastle University, GB; 3Indian Institute of Technology Hyderabad (IIT Hyderabad), IN; 4Indian Institute of Technology, Hyderabad, IN Abstract With over 6000 known genetic disorders, genomics is a key driver to transform the current generation of healthcare from reactive to personalized, predictive, preventive and participatory (P4) form. High throughput sequencing technologies produce large volumes of genomic data, making genome reassembly and analysis computationally expensive in terms of performance and energy. In this paper, we propose an algorithm-hardware co-design driven acceleration approach for enabling translational genomics. Core to our approach is a Pyopencl based tooL for gEnomic workloaDs tarGeting Embedded platforms (PLEDGER). PLEDGER is a scalable, portable and energy-efficient solution to genomics targeting low-cost embedded platforms. It is a read mapping tool for genome reassembly, which is a crucial prerequisite to genomics. Using bit-vectors and variable-level optimisations, we propose a low-memory-footprint, dynamic-programming-based filtration and verification kernel capable of accelerated execution on heterogeneous parallel compute units. We demonstrate, for the first time, read mapping of real reads to the whole human genome on a memory-restricted embedded platform using novel memory-aware preprocessed data structures. We compare the performance and accuracy of PLEDGER with state-of-the-art RazerS3, Hobbes3, BLINDED1 and BLINDED2 on two systems: 1) Intel i7-8750H CPU + Nvidia GTX 1050 Ti; 2) Odroid N2 with quad-core Arm Cortex-A73 + dual-core Cortex-A53. PLEDGER demonstrates persistent energy and accuracy advantages compared to state-of-the-art read mappers, producing up to 11x speedups and 5.9x energy savings compared to state-of-the-art hardware resources. |
18:02 CET | 13.5.3 | A FRAMEWORK FOR EFFICIENT AND BINARY CLUSTERING IN HIGH-DIMENSIONAL SPACE Speaker: Alejandro Hérnandez-Cano, Universidad Nacional Autónoma de México, MX Authors: Alejandro Hérnandez-Cano1, Yeseong Kim2 and Mohsen Imani3 1Universidad Nacional Autónoma de México, MX; 2Daegu Institute of Science and Technology, KR; 3University of California Irvine, US Abstract Today’s applications generate a large amount of data, and the majority of these data are not associated with any labels. Clustering methods are the most commonly used algorithms for data analysis, especially in healthcare. However, running clustering algorithms on embedded devices is significantly slow, as the computation involves a large number of complex pairwise similarity measurements. In this paper, we propose FebHD, an adaptive framework for efficient and fully binary clustering in high-dimensional space. Instead of using complex similarity metrics, e.g., Euclidean distance, FebHD introduces a non-linear encoder to map data points into a sparse high-dimensional space. The FebHD encoder simplifies the similarity search, the most costly and frequent clustering operation, to a Hamming distance computation, which can be accelerated in today’s hardware. FebHD performs clustering by assigning each data point to a set of initialized centers. Then, it updates the centers adaptively based on: (i) the data points assigned to each cluster, and (ii) the confidence of the model in the clustering prediction. This adaptive update enables FebHD to provide high-quality clustering with very few learning iterations. We also propose an end-to-end FPGA implementation that parallelizes the entire FebHD computation by exploiting FPGA bit-level granularity. Our evaluation shows that FebHD provides comparable accuracy to state-of-the-art clustering algorithms, while being 6.2× and 9.1× (4.7× and 5.8×) faster and more energy efficient when running on the same FPGA (GPU) platform. |
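FebHD's key move (13.5.3) is to replace Euclidean similarity with Hamming distance between binary hypervectors, which hardware evaluates with XOR and popcount. The sketch below reproduces that idea in miniature: a random-projection binary encoder plus k-means-style assignment under Hamming distance. The dimensionality, the sign-based encoder, and the majority-vote center update are assumptions for illustration; the paper's encoder and confidence-driven adaptive update are more elaborate.

```python
# Miniature binary clustering in high-dimensional space (FebHD-flavoured).
# Encoder: random projection + sign -> binary hypervectors; similarity:
# Hamming distance (XOR + popcount in hardware). Parameters are assumed.
import numpy as np

rng = np.random.default_rng(2)
D = 2048  # hyperdimensional space size (assumption)
K = 3     # number of clusters

def encode(X, proj):
    """Non-linear binary encoding: sign of a random projection."""
    return (X @ proj > 0).astype(np.uint8)

def hamming(a, b):
    return np.count_nonzero(a != b)

# Three synthetic Gaussian blobs in a 16-D feature space.
X = np.vstack([rng.normal(c, 0.5, size=(100, 16)) for c in (-2.0, 0.0, 2.0)])
proj = rng.normal(size=(16, D))
H = encode(X, proj)

centers = H[rng.choice(len(H), K, replace=False)]
for _ in range(10):  # a few k-means-style iterations
    assign = np.array([min(range(K), key=lambda k: hamming(h, centers[k]))
                       for h in H])
    # Update each center to the bitwise majority vote of its members.
    for k in range(K):
        members = H[assign == k]
        if len(members):
            centers[k] = (members.mean(axis=0) > 0.5).astype(np.uint8)
print("cluster sizes:", np.bincount(assign, minlength=K))
```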
13.6 Non-volatile memories, an enabler for highly efficient AI computing
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/GsvY2aLHoScH3MyvC
Session chair:
Jean-Michel Portal, Aix-Marseille Université, FR
Session co-chair:
Alexandre Levisse, EPFL, CH
The introduction of emerging non-volatile memories into systems is a key enabler for the future of AI computing. In this context, this session tackles the challenges associated with AI workloads by exploring unconventional memory circuits, including high-density bitcells, hybrid volatile/non-volatile memories, and in-memory computing circuits. A small sketch of the ternary arithmetic targeted by the first paper follows the session table.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 13.6.1 | SPINLIM: SPIN ORBIT TORQUE MEMORY FOR TERNARY NEURAL NETWORKS BASED ON THE LOGIC-IN-MEMORY ARCHITECTURE Speaker: Lichuan Luo, Beihang University, CN Authors: Lichuan Luo, He Zhang, Jinyu Bai, Youguang Zhang, Wang Kang and Weisheng Zhao, Beihang University, CN Abstract Logic-in-memory architecture based on spintronic memories shows fascinating prospects in neural networks (NNs) owing to its high energy efficiency and good endurance. In this work, we leverage two magnetic tunnel junctions (MTJs), driven by the interplay of field-free spin-orbit torque (SOT) and spin-transfer torque (STT) effects, to achieve a novel stateful logic-in-memory paradigm for ternary multiplication operations. Based on this paradigm, we further propose a highly parallel array structure to serve ternary neural networks (TNNs). Our results demonstrate the advantage of our design in power consumption compared with CPU, GPU and other state-of-the-art works. |
17:45 CET | 13.6.2 | A FERAM BASED VOLATILE/NON-VOLATILE DUAL-MODE BUFFER MEMORY FOR DEEP NEURAL NETWORK TRAINING Speaker: Yandong Luo, Georgia Institute of Technology, US Authors: Yandong Luo, Yuan-Chun Luo and Shimeng Yu, Georgia Institute of Technology, US Abstract Deep neural network (DNN) training produces a large amount of intermediate data. As off-chip DRAM access is both energy and time consuming, sufficient on-chip buffer is preferred to achieve high energy efficiency for DNN accelerator designs. However, the low integration density and high leakage current of SRAM lead to large area cost and high standby power. The frequent refresh of embedded DRAM (eDRAM) degrades the energy efficiency due to its short refresh interval (40~100μs). In this paper, a dual-mode buffer memory that can operate in both volatile eDRAM mode and non-volatile ferroelectric RAM (FeRAM) mode is proposed, which is based on the CMOS compatible HfZrO2 material. The functionality of the proposed dual-mode memory design is verified using SPICE simulation with the multi-domain Preisach model. A data lifetime-aware memory mode configuration protocol is proposed to optimize the buffer access energy. The architectural benchmark for DNN training shows 33.8%, 17.1% and 109.4% higher energy efficiency than baseline designs with eDRAM, FeRAM and SRAM with the same buffer area, respectively. The chip standby power is reduced by 26.8×~47.5× and 1.5×~10.6× compared with the SRAM and eDRAM baselines. The chip area overhead of the dual-mode buffer design is 5.7%. |
18:00 CET | IP11_1.2 | DENSITY ENHANCEMENT OF RRAMS USING A RESET WRITE TERMINATION FOR MLC OPERATION Speaker: Hassan Aziza, IM2NP - Aix-Marseille University, FR Authors: Hassan Aziza1, Said Hamdioui2, Moritz Fieback2, Mottaqiallah Taouil2 and Mathieu Moreau1 1Aix-Marseille Université, FR; 2Delft University of Technology, NL Abstract Multi-Level Cell (MLC) technology can greatly reduce Resistive RAM (RRAM) die sizes to achieve a breakthrough in cost structure. In this paper, a novel design scheme is proposed to realize reliable and uniform MLC RRAM operation without the need of any read verification. MLC is implemented based on a strict control of the cell programming currents of 1T-1R HfO2-based RRAM cells. Specifically, a self-adaptive write termination circuit is proposed to control the RRAM RESET current. Eight different resistance states are obtained by varying the compliance current, which is defined as the minimal current allowed by the termination circuit in the RESET direction. |
18:01 CET | 13.6.3 | RUNNING EFFICIENTLY CNNS ON THE EDGE THANKS TO HYBRID SRAM-RRAM IN-MEMORY COMPUTING Speaker: Marco Rios, EPFL, CH Authors: Marco Rios1, Flavio Ponzina1, Giovanni Ansaloni1, Alexandre Levisse1 and David Atienza2 1EPFL, CH; 2École Polytechnique Fédérale de Lausanne (EPFL), CH Abstract The increasing size of Convolutional Neural Networks (CNNs) and the high computational workload required for inference pose major challenges for their deployment on resource-constrained edge devices. In this paper, we address them by proposing a novel In-Memory Computing (IMC) architecture. Our IMC strategy allows us to efficiently perform arithmetic operations based on bitline computing, enabling a high degree of parallelism while reducing energy-costly data transfers. Moreover, it features a hybrid memory structure, where a portion of each subarray, dedicated to storing CNN weights, is implemented as high-density, zero-standby-power Resistive RAM. Finally, it exploits an innovative method for storing quantized weights based on their value, named Weight Data Mapping (WDM), which further increases efficiency. Compared to state-of-the-art IMC alternatives, our solution provides up to 93% improvements in energy efficiency and up to 6x less run-time when performing inference on Mobilenet and AlexNet neural networks. |
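The logic-in-memory design of 13.6.1 stores ternary weights and evaluates ternary multiplications inside the memory array. Functionally, each multiply is trivial (sign selection or suppression), which is why it maps well to MTJ-based logic. The sketch below shows only the arithmetic being accelerated, with weights and activations restricted to {-1, 0, +1}; the ternarization threshold and layer sizes are assumptions, and nothing here models the SOT/STT circuit itself.

```python
# Reference computation for a ternary neural network (TNN) layer:
# weights and activations take values in {-1, 0, +1}, so each product
# reduces to sign selection or suppression -- the operation that SpinLIM
# (13.6.1) evaluates inside the memory array. Parameters are assumed.
import numpy as np

def ternarize(x, threshold=0.3):
    """Map real values to {-1, 0, +1} (threshold is an assumption)."""
    t = np.zeros_like(x, dtype=np.int8)
    t[x > threshold] = 1
    t[x < -threshold] = -1
    return t

rng = np.random.default_rng(3)
weights = ternarize(rng.normal(size=(4, 8)))   # 4 neurons, 8 inputs
activations = ternarize(rng.normal(size=8))

# Ternary dot product: no real multiplier needed, only add / subtract / skip.
out = weights @ activations
print("ternary pre-activations:", out)
```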
13.7 Advancements in embedded and cyber-physical systems design
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:30 CET - 18:20 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/MRm9rr4mQeEDeSCAN
Session chair:
Davide Quaglia, University of Verona, IT
Session co-chair:
Qi Zhu, Northwestern University, US
This session presents new design approaches in the context of embedded control systems. The first paper proposes design countermeasures against the effects of computational overload (overruns), suitable for commercial off-the-shelf control systems. The second paper sheds new light on the schedulability problem by modifying the dynamics of the physical system to be controlled; a toy sketch of the period-adjustment question it raises follows the session table. Finally, the third paper addresses the efficient transmission of control signals over a fieldbus by exploiting the flexible data rates of CAN-FD.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 13.7.1 | (Best Paper Award Candidate) ADAPTIVE DESIGN OF REAL-TIME CONTROL SYSTEMS SUBJECT TO SPORADIC OVERRUNS Speaker: Paolo Pazzaglia, University of Saarland, DE Authors: Paolo Pazzaglia1, Arne Hamann2, Dirk Ziegenbein2 and Martina Maggio3 1Universität des Saarlandes, DE; 2Robert Bosch GmbH, DE; 3Lund University, SE Abstract Most off-the-shelf embedded control systems lack proper mechanisms to handle computational overload conditions. Therefore, delays may accumulate and produce overruns, potentially harming the stability and performance of the controlled system. In this paper, we explore a controller implementation in which overrun events are tolerated and tackled with a proper countermeasure that can be easily plugged into existing controller implementations, in particular commercial off-the-shelf control systems. When an overrun occurs, the control period of the next job is reinitialized and its control parameters are adjusted to counteract the additional delay of the previous job. The main strength of this approach resides in its straightforward applicability and high flexibility in deployment. It neither requires a stochastic model of the timing evolution of the system, nor relies on prediction of future delays. We provide an exact tool to determine the system stability, which requires only the knowledge of the worst-case response time. The final controlled system exhibits a good trade-off between simplicity and performance, both during nominal and overload conditions. |
17:45 CET | 13.7.2 | TIMING DEBUGGING FOR CYBER-PHYSICAL SYSTEMS Speaker: Debayan Roy, TU Munich, DE Authors: Debayan Roy1, Clara Hobbs2, James H. Anderson2, Marco Caccamo1 and Samarjit Chakraborty2 1TU Munich, DE; 2University of North Carolina at Chapel Hill, US Abstract This paper is concerned with the following question: Given a set of control tasks that are not schedulable, i.e., their required timing properties cannot be satisfied, what should be changed? While the real-time systems literature proposes many different schedulability analysis techniques, it surprisingly provides almost no guidelines on what should be changed to make a task set schedulable, when it is not. We show that when the tasks in question are control tasks, this timing debugging question in the context of cyber-physical systems (CPS) may be answered by exploiting the dynamics of the physical systems that these control tasks are expected to influence. Towards this, we study a very simple setup, viz., when a set of periodic tasks with implicit deadlines is not schedulable, by how much should the periods be changed in order to make the task set schedulable? Among the many ways in which the periods can be modified, our proposed strategy is to change the periods in a manner such that while the task set becomes schedulable, the poles of the closed-loop system experience the minimal shift. Since the poles influence the closed-loop dynamics of the system, we thereby ensure that we obtain a system with the desired timing properties whose dynamics is very similar to the dynamics of the original (non-schedulable) system. We formulate this CPS timing debugging strategy as an optimization problem and illustrate it with a concrete example. |
18:00 CET | 13.7.3 | EFFICIENT AUTOSAR-COMPLIANT CAN-FD FRAME PACKING WITH OBSERVED OPTIMALITY Speaker: Wenhong Ma, Hunan University, CN Authors: Wenhong Ma1, Guoqi Xie1, Renfa Li1, Weichen Liu2, Hai (Helen) Li3 and Wanli Chang4 1Hunan University, CN; 2Nanyang Technological University, SG; 3Duke University/TUM-IAS, US; 4University of York, GB Abstract With the trend towards automated driving, Controller Area Network (CAN) is migrating to CAN with Flexible Data-Rate (CAN-FD), where frame packing (i.e., packing signals of various periods, deadlines, and payloads into frames following the standard CAN-FD format) is critical to address the high bandwidth demand with limited resources. Existing works have applied Integer Linear Programming (ILP), which easily gets intractable as the number of signals to be packed increases, or proposed heuristics, which are not able to obtain the optimal solution. In addition, the security model employed does not meet the AUTOSAR SecOC specification. This paper reports a novel frame packing approach for CAN-FD with an AUTOSAR-compliant security model. We establish the theory that extending the existing frame to pack signals with the same period leads to shorter WCTT (worst-case transmission time) and thus lower bus utilization compared to creating a new frame. Following this principle, the design space is tremendously pruned. As shown in the comprehensive experiments, only 1e-9 of the original size or even a smaller portion needs to be explored, while the optimality is kept. The computational time is correspondingly reduced, generating solutions within 15 minutes to large-scale problems that are otherwise intractable with ILP. |
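Paper 13.7.2 asks by how much task periods must change to make a non-schedulable set schedulable while minimally perturbing the closed-loop poles. A much cruder, but concrete, version of the feasibility side of that question is sketched below: for implicit-deadline periodic tasks under EDF, uniformly scaling all periods until total utilization reaches 1 suffices. The WCET and period values are made up, and the paper's pole-shift objective is not modeled here.

```python
# Crude illustration of the timing-debugging question in 13.7.2:
# given WCETs and periods with total utilization U > 1, uniformly
# scale all periods so that U <= 1 (EDF, implicit deadlines).
# The paper instead searches for per-task period changes that
# minimally shift the closed-loop poles; that is not modeled here.

def uniform_period_fix(wcets, periods):
    """Return periods scaled so that total utilization U <= 1 under EDF."""
    u = sum(c / t for c, t in zip(wcets, periods))
    if u <= 1.0:
        return list(periods)  # already schedulable under EDF
    # Stretching every period by a factor of U brings utilization to exactly 1.
    return [t * u for t in periods]

wcets = [2.0, 3.0, 5.0]     # worst-case execution times (ms), assumed values
periods = [4.0, 6.0, 10.0]  # U = 0.5 + 0.5 + 0.5 = 1.5 -> not schedulable
print(uniform_period_fix(wcets, periods))  # each period stretched 1.5x, U = 1.0
```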
13.8 Career Forum: Which career path to take: SWOT analysis
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 17:30 CET - 18:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/zeTMNFzQSziPGHzEj
Session chair:
Sara Vinco, Politecnico di Torino, IT
Session co-chair:
Anton Klotz, Cadence, DE
Organizer:
Laleh Behjat, University of Calgary, CA
YPP SWOT analysis: Prof. Laleh Behjat has developed a Strengths, Weaknesses, Opportunities and Threats (SWOT) analysis for students, to help them identify their character traits and better determine the optimal career path for each individual student. In this interactive session, Prof. Behjat will provide SWOT analysis templates to each participant and discuss the outcomes.
Time | Label | Presentation Title Authors |
---|---|---|
17:30 CET | 13.8.1 | CAREER FORUM: WHICH CAREER PATH TO TAKE: SWOT ANALYSIS – ROOM 1 Authors: Laleh Behjat1, Anton Klotz2, Sara Vinco3 and Elena Ioana Vatajelu4 1University of Calgary, CA; 2Cadence, DE; 3Politecnico di Torino, IT; 4TIMA, FR Abstract Career Forum: Which career path to take: SWOT analysis – Room 1. A meeting room with: - Laleh Behjat, University of Calgary, Canada - Anton Klotz, Cadence Design Systems, Germany - Sara Vinco, Politecnico di Torino, Italy - Ioana Vatajelu, University Grenoble-Alpes, France |
17:30 CET | 13.8.2 | CAREER FORUM: WHICH CAREER PATH TO TAKE: SWOT ANALYSIS – ROOM 2 Authors: Laleh Behjat1, Anton Klotz2, Sara Vinco3 and Paul McLellan4 1University of Calgary, CA; 2Cadence, DE; 3Politecnico di Torino, IT; 4Cadence, US Abstract Career Forum: Which career path to take: SWOT analysis – Room 2. A meeting room with: - Laleh Behjat, University of Calgary, Canada - Anton Klotz, Cadence Design Systems, Germany - Sara Vinco, Politecnico di Torino, Italy - Paul McLellan, Cadence, USA |
17:30 CET | 13.8.3 | CAREER FORUM: WHICH CAREER PATH TO TAKE: SWOT ANALYSIS – ROOM 3 Authors: Laleh Behjat1, Anton Klotz2, Sara Vinco3 and David Moloney4 1University of Calgary, CA; 2Cadence, DE; 3Politecnico di Torino, IT; 4Ubotica Technologies, IE Abstract Career Forum: Which career path to take: SWOT analysis – Room 3. A meeting room with: - Laleh Behjat, University of Calgary, Canada - Anton Klotz, Cadence Design Systems, Germany - Sara Vinco, Politecnico di Torino, Italy - David Moloney, Ubotica Technologies, Ireland |
17:30 CET | 13.8.4 | CAREER FORUM: WHICH CAREER PATH TO TAKE: SWOT ANALYSIS – ROOM 4 Authors: Laleh Behjat1, Anton Klotz2 and Andrea Kells3 1University of Calgary, CA; 2Cadence, DE; 3ARM, GB Abstract Career Forum: Which career path to take: SWOT analysis – Room 4. A meeting room with: - Laleh Behjat, University of Calgary, Canada - Anton Klotz, Cadence Design Systems, Germany - Sara Vinco, Politecnico di Torino, Italy - Andrea Kells, ARM, UK |
17:30 CET | 13.8.5 | CAREER FORUM: WHICH CAREER PATH TO TAKE: SWOT ANALYSIS – ROOM 5 Authors: Laleh Behjat1, Anton Klotz2, Sara Vinco3 and John Davis4 1University of Calgary, CA; 2Cadence, DE; 3Politecnico di Torino, IT; 4BSC, ES Abstract Career Forum: Which career path to take: SWOT analysis – Room 5. A meeting room with: - Laleh Behjat, University of Calgary, Canada - Anton Klotz, Cadence Design Systems, Germany - Sara Vinco, Politecnico di Torino, Italy - John Davis, BSC, Spain |
ASD.REC ASD Reception
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/jPSN2q5euq43mmf2G
Session chair:
Rolf Ernst, TU Braunschweig, DE
Session co-chair:
Selma Saidi, TU Dortmund, DE
Organizers:
Rolf Ernst, TU Braunschweig, DE
Selma Saidi, TU Dortmund, DE
In the virtual reception session of the DATE Special Initiative on Autonomous Systems Design, we will (1) enjoy the talk by Dr. Gabor Karsai (Vanderbilt University) on “Towards Assurance-based Learning-enabled Cyber-Physical Systems”, (2) introduce the Friday topics, (3) exchange thoughts on autonomous systems design, and (4) collect feedback regarding the special initiative.
Time | Label | Presentation Title Authors |
---|---|---|
18:30 CET | ASD.REC.1 | TOWARDS ASSURANCE-BASED LEARNING-ENABLED CYBER-PHYSICAL SYSTEMS Speaker and Author: Gabor Karsai, Vanderbilt University, US Abstract This talk will provide an overview of the DARPA Assured Autonomy program and give project examples. |
IP11_1 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/ovqZawXYXsa7ErK87
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP11_1.1 | POWER REDUCTION OF A SET-ASSOCIATIVE INSTRUCTION CACHE USING A DYNAMIC EARLY TAG LOOKUP Speaker: Chun-Chang Yu, National Taiwan University, TW Authors: Chun-Chang Yu1, Yu Hen Hu2, Yi-Chang Lu1 and Charlie Chung-Ping Chen1 1National Taiwan University, TW; 2University of Wisconsin - Madison, US Abstract An energy-efficient instruction cache lookup technique with low area overheads is proposed. The key concept of this Dynamic Early Tag Lookup (DETL) method is to exploit the presence of instruction fetch-bubble cycles. In a fetch-bubble cycle, the index of the matching cache set can be determined earlier. Hence, the dynamic energy for parallel memory accesses to irrelevant cache banks can be saved. We implemented the proposed DETL algorithm on a 4-way set-associative instruction cache in a RISC-V micro-architecture, and tested its performance using the SPEC CPU2006 benchmark suite. The experimental results showed a 19.38% dynamic power reduction with an area overhead smaller than 0.1%. |
IP11_1.2 | DENSITY ENHANCEMENT OF RRAMS USING A RESET WRITE TERMINATION FOR MLC OPERATION Speaker: Hassan Aziza, IM2NP - Aix-Marseille University, FR Authors: Hassan Aziza1, Said Hamdioui2, Moritz Fieback2, Mottaqiallah Taouil2 and Mathieu Moreau1 1Aix-Marseille Université, FR; 2Delft University of Technology, NL Abstract Multi-Level Cell (MLC) technology can greatly reduce Resistive RAM (RRAM) die sizes to achieve a breakthrough in cost structure. In this paper, a novel design scheme is proposed to realize reliable and uniform MLC RRAM operation without the need of any read verification. MLC is implemented based on a strict control of the cell programming currents of 1T-1R HfO2-based RRAM cells. Specifically, a self-adaptive write termination circuit is proposed to control the RRAM RESET current. Eight different resistance states are obtained by varying the compliance current, which is defined as the minimal current allowed by the termination circuit in the RESET direction. |
IP11_2 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/2S43XSicWq8TXr7wL
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP11_2.1 | GLOBAL PLACEMENT WITH DEEP LEARNING-ENABLED EXPLICIT ROUTABILITY OPTIMIZATION Speaker: Siting Liu, The Chinese University of Hong Kong, HK Authors: Siting Liu1, Qi Sun1, Peiyu Liao1, Yibo Lin2 and Bei Yu1 1The Chinese University of Hong Kong, HK; 2Peking University, CN Abstract Placement and routing (PnR) is the most time-consuming part of the physical design flow. Recognizing the routing performance ahead of time can assist designers and design tools to optimize placement results in advance. In this paper, we propose a fully convolutional network model to predict congestion hotspots and then incorporate this prediction model into a placement engine, DREAMPlace, to get a more route-friendly result. The experimental results on ISPD2015 benchmarks show that with the superior accuracy of the prediction model, our proposed approach can achieve up to 9.05% reduction in congestion rate and 5.30% reduction in routed wirelength compared with the state-of-the-art. |
IP11_2.2 | MAVIREC: ML-AIDED VECTORED IR-DROP ESTIMATION AND CLASSIFICATION Speaker: Vidya Chhabria, University of Minnesota, US Authors: Vidya A. Chhabria1, Yanqing Zhang2, Haoxing Ren3, Ben Keller3, Brucek Khailany3 and Sachin S. Sapatnekar1 1University of Minnesota, US; 2NVIDIA, US; 3NVIDIA Corporation, US Abstract Vectored IR drop analysis is a critical step in chip signoff that checks the power integrity of an on-chip power delivery network. Due to the prohibitive runtimes of dynamic IR drop analysis, the large number of test patterns must be whittled down to a small subset of worst-case IR vectors. Unlike traditional slow heuristic methods that select a few vectors with incomplete coverage, MAVIREC uses machine learning techniques---3D convolutions and regression-like layers---for accurately recommending a larger subset of test patterns that exercise worst-case scenarios. In under 30 minutes, MAVIREC profiles 100K-cycle vectors and provides better coverage than a state-of-the-art industrial flow. Further, MAVIREC's IR drop predictor shows 10x speedup with under 4mV RMSE relative to an industrial flow. |
IP11_3 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/gDryGEPMERSNS2pzc
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP11_3.1 | A GPU-ENABLED LEVEL-SET METHOD FOR MASK OPTIMIZATION Speaker: Bei Yu, The Chinese University of Hong Kong, HK Authors: Ziyang Yu1, Guojin Chen2, Yuzhe Ma1 and Bei Yu1 1The Chinese University of Hong Kong, HK; 2The Chinese University of Hong Kong, HK Abstract As the feature size of advanced integrated circuits keeps shrinking, resolution enhancement technique (RET) is utilized to improve the printability in the lithography process. Optical proximity correction (OPC) is one of the most widely used RETs aiming at compensating the mask to generate a more precise wafer image. In this paper, we put forward a level-set based OPC with high mask optimization quality and fast convergence. In order to suppress the disturbance of the condition fluctuation in lithography process, we propose a new process window-aware cost function. Then, a novel momentum-based evolution technique is adopted, which demonstrates substantial improvement. Moreover, graphics processing unit (GPU) is leveraged for accelerating the proposed algorithm. Experimental results on ICCAD 2013 benchmarks show that our algorithm outperforms all previous OPC algorithms in terms of both solution quality and runtime overhead. |
IP11_4 Interactive Presentations
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/zXQu4gt7W8kPn8yei
Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.
Time | Label | Presentation Title Authors |
---|---|---|
IP11_4.1 | FLYDVS: AN EVENT-DRIVEN WIRELESS ULTRA-LOW POWER VISUAL SENSOR NODE Speaker: Alfio Di Mauro, ETH Zurich, CH Authors: Alfio Di Mauro1, Moritz Scherer1, Jordi Fornt Mas1, Basile Bougenot1, Michele Magno1 and Luca Benini2 1ETH Zurich, CH; 2Università di Bologna and ETH Zurich, IT Abstract Event-based cameras, also called dynamic vision sensors (DVS), inspired by the human vision system, are gaining popularity due to their potential energy savings, since they generate asynchronous events only from pixel changes in the field of view. Unfortunately, in most current uses, data acquisition, processing, and streaming of data from event-based cameras are performed by power-hungry hardware, mainly high-power FPGAs. For this reason, the overall power consumption of an event-based system that includes digital capture and streaming of events is in the order of hundreds of milliwatts or even watts, significantly reducing usability in real-life low-power applications such as wearable devices. This work presents FlyDVS, the first event-driven wireless ultra-low-power visual sensor node, which includes a low-power Lattice FPGA and a Bluetooth wireless system-on-chip, and hosts a commercial ultra-low-power DVS camera module. Experimental results show that the low-power FPGA can reach up to 874 efps (event-frames per second) with only 17.6mW of power, and the sensor node consumes an overall power of 35.5 mW (including wireless streaming) at 200 efps. We demonstrate FlyDVS in a real-life scenario, namely, acquiring event frames of a gesture recognition data set. |
IP11_4.2 | PLEDGER: EMBEDDED WHOLE GENOME READ MAPPING USING ALGORITHM-HW CO-DESIGN AND MEMORY-AWARE IMPLEMENTATION Speaker: Sidharth Maheshwari, School of Engineering, Newcastle University, GB Authors: Sidharth Maheshwari1, Rishad Shafik1, Alex Yakovlev1, Ian Wilson2, Venkateshwarlu Yellaswamy Gudur3 and Amit Acharyya4 1Newcastle University, GB; 2Department of Biosciences, Newcastle University, GB; 3Indian Institute of Technology Hyderabad (IIT Hyderabad), IN; 4Indian Institute of Technology, Hyderabad, IN Abstract With over 6000 known genetic disorders, genomics is a key driver to transform the current generation of healthcare from reactive to personalized, predictive, preventive and participatory (P4) form. High throughput sequencing technologies produce large volumes of genomic data, making genome reassembly and analysis computationally expensive in terms of performance and energy. In this paper, we propose an algorithm-hardware co-design driven acceleration approach for enabling translational genomics. Core to our approach is a Pyopencl based tooL for gEnomic workloaDs tarGeting Embedded platforms (PLEDGER). PLEDGER is a scalable, portable and energy-efficient solution to genomics targeting low-cost embedded platforms. It is a read mapping tool for genome reassembly, which is a crucial prerequisite to genomics. Using bit-vectors and variable-level optimisations, we propose a low-memory-footprint, dynamic-programming-based filtration and verification kernel capable of accelerated execution on heterogeneous parallel compute units. We demonstrate, for the first time, read mapping of real reads to the whole human genome on a memory-restricted embedded platform using novel memory-aware preprocessed data structures. We compare the performance and accuracy of PLEDGER with state-of-the-art RazerS3, Hobbes3, BLINDED1 and BLINDED2 on two systems: 1) Intel i7-8750H CPU + Nvidia GTX 1050 Ti; 2) Odroid N2 with quad-core Arm Cortex-A73 + dual-core Cortex-A53. PLEDGER demonstrates persistent energy and accuracy advantages compared to state-of-the-art read mappers, producing up to 11x speedups and 5.9x energy savings compared to state-of-the-art hardware resources. |
UB.22 University Booth
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/2WGogKmXtKPWKosAS
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.22 | FPGA ACCELERATION OF APACHE SPARK SQL USING APACHE ARROW AND FLETCHER Speakers: Ákos Hadnagy and Zaid Al-Ars, Delft University of Technology, NL Authors: Ákos Hadnagy and Zaid Al-Ars, Delft University of Technology, NL Abstract Apache Spark is one of the most widely used big data analytics frameworks due to its user-friendly API. However, the high level of abstraction in Spark introduces large overheads when accessing modern high-performance heterogeneous hardware accelerators such as FPGAs. This demo discusses solutions to accelerate Spark SQL queries using FPGAs to offload these computations transparently with little user configuration. In order to achieve this, we use the Apache Arrow in-memory format and the Fletcher hardware interface generator to exchange data efficiently with the accelerators. The performance of the proposed approach was benchmarked on a Power9 system with OpenCAPI, where our proof-of-concept accelerator was able to achieve more than 10x speedup for a filter-reduce query use case compared to a CPU-based Apache Spark implementation. |
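The UB.22 demo offloads Spark SQL operators to an FPGA behind the scenes; from the user's side, the workload remains an ordinary DataFrame query like the filter-reduce below. This snippet shows only the software-visible query in standard PySpark (a local session and synthetic data are assumed), not the Arrow/Fletcher offload machinery itself.

```python
# The kind of filter-reduce Spark SQL query the UB.22 demo accelerates.
# Plain PySpark shown here; the Arrow/Fletcher FPGA offload is meant to
# be transparent to this code. Local session and data are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("filter-reduce-demo").getOrCreate()
df = spark.createDataFrame(
    [(i, float(i % 100)) for i in range(100_000)], ["id", "val"])

# Filter-reduce: sum of 'val' over the rows passing the predicate.
result = df.filter(F.col("val") > 50.0).agg(F.sum("val")).collect()
print(result)
spark.stop()
```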
UB.23 University Booth
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/hvb5kskexGXbHYzEE
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.23 | MDD2FPGA: ROS-BASED EXPERIMENTAL ENVIRONMENT TOWARDS MODEL-DRIVEN-DEVELOPMENT WITH FPGA Speaker: Hiroki Hashimoto, Tokai University, JP Authors: Hiroki Hashimoto, Harumi Watanabe and Takeshi Ohkawa, Tokai University, JP Abstract Performing the complex calculations needed for the action planning of multiple robots on an FPGA accelerates the process and contributes to the realization of model-driven development (MDD) in a multi-robot system. By using the ROS environment to develop such systems, it becomes possible to implement the calculation process on the FPGA and reduce the action-planning latency for the entire multi-robot system. The problem is that the system's processing time increases when the planning process to control multiple robots using ROS has a high computational cost. To solve this problem, we implement the planning process on an FPGA acting as a ROS node, building an environment that accelerates the system's processing. In this poster, the shortest path problem is adopted as the system's planning process. We evaluate the usefulness of the proposed environment by conducting processing experiments in a pilot environment. |
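UB.23 implements the robots' shortest-path planning inside an FPGA-backed ROS node. The planning kernel itself is classical; a plain-Python Dijkstra over a small graph is sketched below as the reference computation such a node would accelerate. The graph and weights are made up for illustration, and nothing here models the ROS or FPGA integration.

```python
# Reference shortest-path kernel for the planning step in UB.23.
# Classical Dijkstra; the demo runs this kind of computation inside
# an FPGA-backed ROS node. Graph and weights are illustrative.
import heapq

def dijkstra(adj, src):
    """adj: {node: [(neighbor, weight), ...]} -> dict of shortest distances."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale priority-queue entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

adj = {"A": [("B", 1.0), ("C", 4.0)],
       "B": [("C", 2.0), ("D", 5.0)],
       "C": [("D", 1.0)]}
print(dijkstra(adj, "A"))  # {'A': 0.0, 'B': 1.0, 'C': 3.0, 'D': 4.0}
```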
UB.24 University Booth
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 18:30 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/KALriG5XqT6veKiR8
Session Chair:
Frédéric Pétrot, IMAG, FR
Session Co-Chair:
Nicola Bombieri, Università di Verona, IT
Label | Presentation Title Authors |
---|---|
UB.24 | DESIGN AUTOMATION FOR EXTENDED BURST-MODE AUTOMATA IN WORKCRAFT Speaker: Alex Chan, Newcastle University, GB Authors: Alex Chan, Danil Sokolov, Victor Khomenko and Alex Yakovlev, Newcastle University, GB Abstract Asynchronous circuits are known for their low latency, robustness and low power consumption, which are particularly beneficial for the area of so-called “little digital” controllers. Finite State Machines (FSMs), such as Extended Burst-Mode (XBM) automata, offer a simple design entry for specifying asynchronous circuits. However, they are limited by outdated synthesis tools, and there are no dedicated tools for the formal verification of XBMs. On the other hand, Signal Transition Graphs (STGs) are an alternative to FSMs, offering access to sophisticated synthesis and verification tools. In this demo, we show the automation of XBMs in the Workcraft toolkit (https://workcraft.org), including support for all features of the XBM model. Formal verification and logic synthesis of XBMs are also implemented via conversion to the established STG model, allowing XBMs to benefit from existing methods and CAD tools. |
C.1 Closing Session: Plenary and Awards Ceremony
Add this session to my calendar
Date: Thursday, 04 February 2021
Time: 19:00 CET - 19:30 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/yh5FPg9Sw9Pb2MRzS
Session chair:
Franco Fummi, University of Verona, IT
Session co-chair:
Ian O'Connor, Ecole Centrale de Lyon, FR
Time | Label | Presentation Title Authors |
---|---|---|
19:00 CET | C.1.1 | CLOSING REMARKS Speakers: Franco Fummi1 and Ian O'Connor2 1Università di Verona, IT; 2Lyon Institute of Nanotechnology, FR Abstract Closing remarks. |
19:10 CET | C.1.2 | PRESENTATIONS OF AWARDS AND PRIZES Speakers: Lorena Anghel1, Marisa Lopez-Vallejo2 and Robert Wille3 1Grenoble-Alpes University, FR; 2Universidad Politécnica de Madrid, ES; 3Johannes Kepler University Linz, AT Abstract Presentations of awards and prizes |
19:25 CET | C.1.3 | SAVE THE DATE 2022 Speaker: Cristiana Bolchini, Politecnico di Milano, IT Abstract Information about DATE 2022. |
W03-P2 Cadence - BTU - Europractice Workshop - Generation and Implementation of an industry-grade ASSP core (Part 2)
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 08:00 CET - 17:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/83DCvhASS7dz3k7gr
Organizers:
Anton Klotz, Cadence Design Systems, DE
Michael Hübner, Brandenburg University of Technology Cottbus, DE
Florian Fricke, Brandenburg University of Technology Cottbus, DE
Marcus Binning, Cadence Design Systems, GB
Chris Skinner, Cadence Design Systems, GB
Charis Kalantzi, Cadence Design Systems, DE
Clive Holmes, Europractice, GB
Loganathan Sabesan, Cadence Design Systems, GB
Simone Fini, Cadence Design Systems, GB
Aspasia Karanasiou, Cadence Design Systems, GB
Arturs Kozlovskis, Cadence Design Systems, GB
The Tensilica ASSP is offered to universities by Cadence Design Systems within the framework of the Tensilica University Program. It allows creating a model of a processor core and extending it with special instructions which accelerate certain operations. After comparing performance before and after the optimization, the core is exported to RTL and then processed through a physical design flow using the Cadence tools Genus and Innovus towards GDS.
This is the event description for the second day of the workshop. The first day takes place on Thursday between 11:00 and 15:00 CET. On the first day, attendees will explore the Tensilica Fusion F1 core, extend it with a simple extension written in the TIE language, and compare the resulting performance increase. The optimized core will be streamed out to Verilog RTL. On the second day, attendees will perform synthesis, placement, clock tree synthesis, routing, timing optimization and streamout to GDS.
The workshop includes various hands-on exercises, which attendees can perform using cloud-based tools from Cadence Design Systems. Every attendee will receive a personal account for the exercises. The cloud platform is provided by Europractice.
Attendees will be able to earn a digital badge for attending the workshop and completing the hands-on exercises.
W01A Automotive Reliability and Test in Europe (ARTe)
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 08:30 CET - 11:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/oAFjc2xiBqynCC5cm
Organizers:
Wim DOBBELAERE, ON Semiconductor, BE
Alberto Bosio, Lyon Institute of Nanotechnology, FR
Michelangelo GROSSO, AMS R&D STMicroelectronics s.r.l., IT
Elena Ioana Vatajelu, Grenoble INP, TIMA, CNRS, FR
Daniel TILLE, Infineon Technologies, DE
Paolo BERNARDI, Politecnico di Torino, IT
Christian Pacha, Infineon Technologies, DE
Marcello TRAIOLA, École Centrale de Lyon, FR
Riccardo CANTORO, Politecnico di Torino, IT
Davide Appello, ST Microelectronics, IT
Yervant ZORIAN, Synopsys, US
Haralampos-G. Stratigopoulos, Sorbonne Universités, CNRS, LIP6, FR
Speakers:
Alberto Bosio, Lyon Institute of Nanotechnology, FR
Lee Harrison, Mentor, a Siemens Business, GB
Tobias Kilian, Infineon Technologies, DE
Wei Pan, ON Semiconductor, US
Ralf Arnold, Infineon Technologies, DE
Santina Bevilacqua, STMicroelectronics, IT
François Bergeret, ippon innovation, FR
Ben Niewenhuis, Texas Instruments Inc., US
Nektar Xama, KU Leuven, BE
Stephen Sunter, Mentor, a Siemens Business, CA
Vladimir Zivkovic, Infineon Technologies, DE
Mark Hutner, Teradyne, US
Hans-Martin Von Staudt, Dialog Semiconductor, DE
Georges Gielen, University of Leuven, BE
Davide Appello, ST Microelectronics, IT
Martin Keim, Mentor, a Siemens Business, US
Fei Su, Intel, US
Anthony Coyette, ON Semiconductor, BE
Chen He, NXP, US
Eric Faehn, STMicroelectronics, FR
Daniel TILLE, Infineon Technologies, DE
Aim of the workshop
Automotive electronics is becoming more and more relevant in daily life, especially with the advent of autonomous driving, to the point that people will depend entirely on the proper operation of electronic systems. The Automotive Reliability and Test workshop in Europe (ARTe, or ART in Europe) focuses exclusively on the test and reliability of automotive electronics, including IC design, test development, system-level integration, production testing, in-field test, diagnosis and repair solutions, as well as architectures and methods for reliable and safe operation in the field.
Given the success of the first five editions in the US, we would like to celebrate the ART anniversary with an extraordinary edition held in Europe. The ARTe workshop aims to offer a forum for industry specialists and academic researchers to present and discuss these challenges and emerging solutions. For this first edition in the frame of the DATE conference, special focus is given to Design-for-Test solutions.
For further information, please visit the workshop website: http://cas.polito.it/ARTe2021/
- Technical program
Morning sessions
8:30-8:40 CET
Opening of the first “Automotive Test and Reliability Workshop in Europe 2021”
General Chairs:
Yervant Zorian – Synopsys, USA
Paolo Bernardi – Politecnico di Torino, Italy
Program Chairs:
Wim Dobbelaere – ON Semiconductor, Belgium
Riccardo Cantoro – Politecnico di Torino, Italy
8:40-9:40 CET
Technical session: “Fault detection, grading, and monitoring techniques”
Moderator: to be announced
1.A: “Automatic and Scalable Implementation Flow of Performance Monitors for Automotive MCU Using Functional Path Ring Oscillators”
Authors:
Tobias Kilian – Infineon Technologies (DE), Technical University of Munich (DE)
Heiko Ahrens, Daniel Tille, Martin Huch – Infineon Technologies (DE)
Ulf Schlichtmann – Technical University of Munich (DE)
Abstract:
The automotive industry requires high dependability of its microcontrollers (MCUs). Therefore, MCU manufacturers seek precise performance screening. This paper presents an automated and scalable method for creating functional path ring oscillators. The implementation flow is presented alongside the development of a new MCU.
1.B: “Low-Cost Error-Detection Mechanisms for Multi-Core Real-Time Systems”
Authors:
Gennaro S. Rodrigues, Fernanda L. Kastensmidt – Universidade Federal do Rio Grande do Sul (UFRGS), Brasil
Alberto Bosio – Ecole Centrale de Lyon, Lyon Institute of Nanotechnology, France
Vincent Pouget – IES, University of Montpellier, CNRS, France
Abstract:
This paper presents an approximate error-detection technique developed for multi-core embedded systems. The proposed technique exploits approximate computing to reduce the memory footprint and can provide redundancy-based error detection at lower cost than traditional Duplication With Comparison (DWC). In addition, the smaller memory footprint has shown improvements in reliability. The technique is implemented on a dual-core ARM processor and evaluated under laser fault injection. Results show that we can improve on the error detection of DWC at lower cost in terms of memory footprint.
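As a hedged illustration of the idea (not the paper's implementation), the sketch below pairs a precise kernel with a reduced-precision replica and flags an error only when the two results diverge by more than the expected approximation error; on the paper's dual-core ARM target the two kernels would run on separate cores:

```python
# Illustrative sketch of approximate Duplication With Comparison (DWC):
# the redundant copy runs at reduced precision (smaller memory footprint),
# so the checker must tolerate the known approximation error.
def precise_kernel(samples):
    return sum(x * x for x in samples)

def approximate_kernel(samples):
    # Reduced-precision replica: quantised inputs stand in for the
    # paper's approximation (our assumption, for illustration only).
    return sum(round(x, 1) ** 2 for x in samples)

def dwc_check(samples, tolerance):
    """Return (result, error_detected)."""
    reference = precise_kernel(samples)
    replica = approximate_kernel(samples)
    return reference, abs(reference - replica) > tolerance

# A fault perturbing the result beyond `tolerance` is detected, while
# benign approximation noise is not.
result, faulty = dwc_check([0.31, 1.72, 2.05], tolerance=0.5)
```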
1.C: “Claiming Coverage Credit for ATPG Setup using Functional Fault Grading”
Authors:
Lee Harrison, Michael Wittke – Mentor, a Siemens Business, USA
Abstract:
To meet the extremely high quality requirements of automotive ICs, a large amount of effort is put into architecting a DFT solution for these devices, keeping the inserted test logic down to an absolute minimum and making sure that any test logic inserted into the design is as testable as possible. However, regardless of how much effort is applied, there are always faults within the design that are not testable with DFT test structures. Using functional fault grading technology in the context of structural test, we can evaluate faults in the design that are inherently caught by the setup of the design to run structural test, but are not actually accounted for in the running of those tests; in essence, claiming coverage for faults already covered without the need for additional patterns.
1.D: “Unique Challenges and Effectiveness of Voltage-Stressing for Activating Latent Metal-Bridging Defects in Automotive Reliability”
Authors:
W PAN, Lieyi Sheng – ON Semiconductor, USA
Zdenek Axman, Vilem Bucek – ON Semiconductor, Czech Republic
Abstract:
Overall, we have demonstrated for the first time that a well-implemented V-stress scheme is very effective in activating and screening latent inter-metal defects, despite their highly irregular and variable electrical characteristics.
10:00-11:00 CET
Technical session: “Test quality and Reliability in Automotive”
Moderator: to be announced
2.A: “PPM Targets of Different Markets with Focus on Automotive Market”
Author:
Ralf Arnold – Infineon Technologies, Germany
Abstract:
Quality is defined as the quotient of all counted customer rejects divided by the number of all delivered devices, and is measured in PPM (parts per million). Today, automotive customers request zero-PPM quality. The reason for this quality target lies in the complexity of today's cars: a medium-class car easily contains 50-60 microcontroller units (MCUs) in applications ranging from engine management, braking, radar, powertrain control and transmission up to autonomous driving and many more. This paper describes the results of a questionnaire from ETS and ITC 2019 in which the PPM targets of different markets are shown. The main focus is set on high automotive quality.
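Spelled out as a formula (our illustration of the metric, with invented numbers): PPM = (counted customer rejects / delivered devices) × 10^6, so, for example, 3 rejected parts across 1,500,000 delivered devices correspond to 2 PPM; a zero-PPM target therefore effectively demands zero field rejects.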
2.B: “From space to automotive: early detection of burnin rejects on SiC products using a new statistical screening approach”
Authors:
Santi Alessandrino, Santina Bevilacqua – STMicroelectronics, Italy
Aurore Archimbaud, François Bergeret – ippon innovation, France
Abstract:
Quality and reliability are more and more important in the automotive industry, as the number of components per car keeps increasing and will increase dramatically in the next years with electric and autonomous vehicles. This is especially true when new technologies are deployed to address power and other challenges, in particular Silicon Carbide (SiC). To support high quality standards, we have developed a new statistical multivariate method called Good Average Testing (GAT), which is an efficient tool for screening outliers, i.e. potential reliability issues. It is already in use in the space industry, and we have adapted it to the automotive context. We show its efficiency on SiC products.
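GAT itself is not public; as a generic stand-in for this kind of statistical multivariate screening, the sketch below flags parts whose joint parametric signature lies far from the population in Mahalanobis distance (all names, data and the threshold are our assumptions):

```python
import numpy as np

def screen_outliers(measurements, threshold=4.0):
    """measurements: (n_parts, n_tests) parametric test results.
    Returns a boolean mask marking suspected reliability outliers,
    i.e. parts that may pass every single test yet sit far from the
    population when all tests are considered jointly."""
    mu = measurements.mean(axis=0)
    cov = np.cov(measurements, rowvar=False)
    inv_cov = np.linalg.pinv(cov)  # pinv tolerates near-singular covariance
    centred = measurements - mu
    d2 = np.einsum("ij,jk,ik->i", centred, inv_cov, centred)
    return np.sqrt(d2) > threshold

# Example: 1000 parts, 5 correlated tests; flagged parts get rescreened.
rng = np.random.default_rng(0)
parts = rng.normal(size=(1000, 5))
suspects = screen_outliers(parts)
```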
2.C: “Improving IO Timing Characterization and Test Quality for Automotive Devices”
Authors:
Ben Niewenhuis, Krishna Panda, Devanathan VR, Mike Hayenga, Jason Wicker, Anthony Hill – Texas Instruments Inc., USA
Abstract:
With the increasing number of IO interfaces in automotive SoCs, test and silicon characterization of these interfaces is critical to meeting strict quality and reliability requirements. In this paper, we present silicon characterization and test quality improvements on IO interfaces, with results from production automotive SoCs.
2.D: “Test Methodology Enhancements using Visual Inspection Pattern Analysis”
Authors:
Nektar Xama, Jhon Gomez, Georges Gielen – KU Leuven, Belgium
Wim Dobbelaere, Ronny Vanhooren, Anthony Coyette – ON Semiconductor, Belgium
Abstract:
As automotive testing requirements continue to become more stringent and considering the new demands in terms of functional safety, novel viewpoints and additional sources of information are investigated to guarantee and improve automotive IC reliability. Using visual inspection information from wafers obtained between different process steps opens new possibilities to improve testing methodologies. The data can lead to more efficient and directed test development, as well as open up new ways to detect defects that threaten reliability.
Afternoon sessions
15:00-15:45 CET
Keynote: "More Compute Performance under the Hood: New Applications and Complexity Challenges for Automotive Microcontrollers"
Speaker: Dr. Christian Pacha – Infineon Technologies, Germany
Moderator: Paolo Bernardi – Politecnico di Torino, Italy
Abstract:
Modern automotive microcontrollers are evolving towards heterogeneous SoCs with an increasing HW/SW complexity. The traditional engineering paradigms of high quality and safety requirements are combined with enhanced computational functionalities, high-bandwidth communication interfaces, advanced signal processing capabilities and multi-level memory hierarchies. The market driver behind this evolution is the transformation process in the automotive industry, notably hybrid and full-electric vehicles, advanced driver assistance systems and autonomous driving, as well as new electrical/electronic car architectures.
This talk gives an overview of how the diversifying application landscape motivates new SoC integration paradigms. We will illustrate their impact on SoC architecture and design based on the example of Infineon's Aurix family. Typical dependencies across the development value chain, ranging from the initial conceptual phase to post-silicon, such as power-performance and thermal integrity, will be presented and analyzed considering generic industry trends in scaled microelectronics.
15:45-16:45 CET
Panel “Automotive Test and Reliability: How could IEEE1687.2 change our future?”
Moderator: Wim Dobbelaere – ON Semiconductor, Belgium
Panelists:
Stephen Sunter – Mentor, a Siemens Business, Canada
Vladimir Zivkovic – Infineon Technologies, Germany
Mark Hutner – Teradyne, USA
Hans-Martin Von Staudt – Dialog Semiconductor, Germany
Georges Gielen – University of Leuven, Belgium
Davide Appello – STMicroelectronics, Italy
17:00-17:45 CET
Embedded Tutorial: “Memory Testing for Automotive Applications beyond plain Test Quality”
Speaker: Dr. Martin Keim – Mentor, a Siemens Business, USA
Moderator: Elena-Ioana Vatajelu – Grenoble INP, TIMA, CNRS, France
Abstract:
Everyone understands that automotive applications demand the highest test quality. This is true for logic testing as well as for memory testing. But the story does not end with high-quality tests. In fact, quality is just one of many critical aspects of implementing memory testing for such applications. In this short embedded tutorial, we will touch on several such aspects, including Memory BIST IP design and qualification, EDA tool requirements, algorithm selection and qualification, test application and repair, design and system interaction, all the way to diagnosis of field returns.
17:45-18:45 CET
Panel: “New automotive design methodologies for catching latent defects and detecting anomalies online”
Moderator: Haralampos-G. Stratigopoulos – Sorbonne Universités, CNRS, LIP6, France
Panelists:
Fei Su – Intel, USA
Anthony Coyette – ON Semiconductor, Belgium
Chen He – NXP, USA
Eric Faehn – StMicroelectronics, France
Daniel Tille – Infineon Technologies, Germany
Abstract:
The ever-increasing number of ICs deployed in a modern vehicle, combined with the increasing number of safety-related and autonomous-driving features, calls for more comprehensive test methods to reach sub-ppm defect escape levels. This also requires dealing with latent defects that manifest themselves in the field, after post-manufacturing test, due to usage time or aging, or that are triggered under specific operating conditions. It requires modeling and testing for latent defects to screen out chips with hidden reliability hazards at post-manufacturing test time, but also on-chip mechanisms for online concurrent detection, and possibly healing, of operation anomalies due to latent defects, with the aim of expanding safety features. The panel gathers five experts in the field to exchange views and discuss available solutions, trends, and open challenges.
18:45-19:45 CET
Closing of the first “Automotive Test and Reliability Workshop in Europe 2021”
General Chairs:
Yervant Zorian – Synopsys, USA
Paolo Bernardi – Politecnico di Torino, Italy
Program Chairs:
Wim Dobbelaere – ON Semiconductor, Belgium
Riccardo Cantoro – Politecnico di Torino, Italy
W05 Friday Interactive Day of the Special Initiative on Autonomous Systems Design (ASD)
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 08:30 CET - 18:10 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/Lry62RT7FnY38NY2b
Organizers:
Rolf Ernst, Technical University Braunschweig, DE
Selma Saidi, Technische Universität Dortmund, DE
Dirk Ziegenbein, Robert Bosch GmbH, DE
Sebastian Steinhorst, Technical University of Munich, DE
Jyotirmoy Deshmukh, University of Southern California, US
Christian Laugier, INRIA Grenoble, FR
The Friday Interactive Day of the DATE Special Initiative on Autonomous Systems Design (ASD) features keynotes from industry leaders as well as interactive discussions initiated by short presentations on several hot topics. Presentations from General Motors and BMW on predictable perception, as well as a session on dynamic risk assessment, will fuel the discussion on how to maximize safety in a technically feasible manner. Speakers from TTTech and APEX.AI will present insights into MotionWise and ROS 2 as platforms for automated vehicles. Further sessions will highlight topics such as explainable machine learning, self-adaptation for robustness and self-awareness for autonomy, as well as cybersecurity for connected vehicles.
Registration
Free registration for the ASD Friday Interactive Day (W05) sponsored by Argo AI can be obtained here: https://www.date-conference.com/registration
Program
ASDW05-01 Opening & Introduction
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 08:30 CET - 09:00 CET
- Welcome and Introduction by ASD Organizers
- “Argo AI's mission and technology at a glance” by Alexandre Haag, Managing Director, Argo AI GmbH
ASDW05-02 Dynamic Risk Assessment in Autonomous Systems
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 09:00 CET - 09:55 CET
Organizers / Chairs:
- Peter Liggesmeyer, Fraunhofer IESE
- Rasmus Adler, Fraunhofer IESE
- Richard Hawkins, University of York
Session Abstract:
An autonomous system is capable of independently achieving a predefined goal in accordance with the demands of the current situation. In safety-critical applications, the operational situations may demand some actions from the system in order to keep risks at an acceptable level. This motivates the implementation of algorithms that estimate, assess and control risks during operation. In particular, risk assessment at runtime is challenging as it implies moral decision making about the acceptability of risks: “How safe is safe enough?”. However, it is also challenging to find a suitable notion of risk. ISO and IEC standards define the term “risk” differently, following two “root” definitions: “combination of the probability of occurrence of harm and the severity of that harm” and “effect of uncertainty on objectives”. The first definition is related to the way integrity levels like SIL and ASIL are determined at design time. In the session, we will discuss to what extent existing design-time approaches can be adapted to implement autonomous risk management at runtime. For instance, is it reasonable to implement algorithms that determine integrity levels at runtime?
Speakers:
- Detlev Richter, TÜV SÜD: Digital twin-based hazard analysis at runtime for resilient production
- Simon Burton, Fraunhofer IKS: Prerequisites for dynamic risk management
- Patrik Feth, Sick AG: Sensors for Dynamic Risk Assessment
- Michael Woon, retrospect: Being Certain of Uncertainty in Risk
ASDW05-03 Cybersecurity for Connected Autonomous Vehicles
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 10:00 CET - 10:55 CET
Organizers / Chairs:
- Sebastian Steinhorst, Technical University of Munich, Germany
- Mohammad Hamad, Technical University of Munich, Germany
Session Abstract:
Today's vehicles are increasingly connected, and tomorrow's vehicles will be automated, autonomous, capable of sensing their environment and navigating through cities without human input. This comes at the cost of a new set of threats and cyber-attacks that can yield high recall costs, property loss, and even jeopardize human safety. In this session, three partners of the nIoVe H2020 EU project will present security challenges and solutions to improve the security of future autonomous vehicles. The session starts with a short presentation giving insights into the challenges and limitations of providing cyberthreat protection for public-transport autonomous shuttles. The second talk discusses the need to make autonomous vehicles proactively able to react to intrusions and the challenges in achieving such a capability. The last talk addresses further cyber-security challenges that nIoVe aims to solve for autonomous vehicles. The session concludes with an open discussion of the introduced challenges and other related aspects.
Talks:
- Niels Nijdam, University of Geneva (UNIGE), Switzerland: The Perils of Cybersecurity in Connected Automated Vehicles
- Mohammad Hamad, Technical University of Munich (TUM), Germany: Toward a Multi-layer Intrusion Response System for Autonomous Vehicles
- Konstantinos Votis, Institute/Centre for Research and Technologies Hellas (CERTH/ITI), Greece: Cyber-security Solutions for Autonomous Vehicles
ASDW05-04 Self-adaptive safety- and mission-critical CPS: wishful thinking or absolute necessity?
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 11:00 CET - 11:55 CET
Organizers / Chairs:
- Andy Pimentel (University of Amsterdam, Netherlands)
- Martina Maggio (University of Saarland, Germany)
Session Abstract:
Due to the increasing performance demands of mission- and safety-critical Cyber-Physical Systems (CPS), these systems exhibit a rapidly growing complexity, manifested by an increasing number of (distributed) computational cores and application components connected via complex networks. However, with the growing complexity and interconnectivity of these systems, the chances of hardware failures as well as disruptions due to cyber-attacks will also quickly increase. System adaptivity, for example in the form of dynamic remapping of application components to processing cores, represents a promising technique for handling this challenging scenario. In this session, we address the (consequences of the) idea of deploying runtime adaptivity in mission- and safety-critical CPS, yielding dynamically morphing systems, to establish robustness against computational hurdles, component failures, and cyber-attacks.
Speakers:
- Clemens Grelck (University of Amsterdam, Netherlands): The TeamPlay Coordination Language for Dependable Systems
- Sasa Misailovic (University of Illinois at Urbana-Champaign, USA): Programming Systems for Helping Developers Cope with Uncertainty
- Stefanos Skalistis (Raytheon Technologies, Ireland): Certification challenges of adaptive avionics systems
ASDW05-05 Predictable Perception
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 14:00 CET - 14:55 CET
Organizers / Chairs:
- Samarjit Chakraborty (U North Carolina, Chapel Hill, USA)
- Petru Eles (Linköping University, SE)
Session Abstract:
Modern autonomous systems, such as autonomous vehicles or robots, consist of two major components: (a) the decision-making unit, which is often made up of one or more feedback control loops, and (b) a perception unit that feeds the environmental state to the control unit and is made up of camera, radar and lidar sensors and their associated processing algorithms and infrastructure. While there has been a lot of work on the formal verification of the decision-making (or control) unit, the ultimate correctness of the autonomous system also heavily relies on the behavior of the perception unit. Verifying the correctness of the perception unit is, however, significantly more challenging, and not much progress has been made here. This is because the algorithms used by perception units now increasingly rely on machine learning techniques (like deep neural networks) that run on complex hardware made up of CPU+accelerator platforms, with the accelerators being GPUs, TPUs and FPGAs. This combination of algorithmic and implementation-platform complexity and heterogeneity currently makes it very difficult to provide either functional or timing correctness guarantees for the perception unit, while both of these guarantees are needed to ensure the correct functioning of the control loop and the overall autonomous system. This is part of the overall challenge of verifying the correctness of autonomous systems.
Speakers:
- Qing Rao (BMW, Munich, Germany): New Era in Autonomous Driving and the Role of IT - Will Traditional Carmakers Keep Pace?
- Soheil Samii (General Motors R&D, USA): Dependable sensing system architecture for predictable perception in autonomous vehicles
- Deepak Shankar (Mirabilis Design, USA): Design Tools for Predictable Hw/Sw Architectures for Autonomous Vehicles
- Cong Liu (UT Dallas, USA): Towards Timing-Predictable & Robust Autonomy in Autonomous Embedded Systems
- Hamed Tabkhi (University of North Carolina at Charlotte, USA): Toward AI-in-the-Loop Autonomous Safety System - Algorithmic and Timing Challenges
ASDW05-06 Perspicuous Computing
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 15:00 CET - 15:55 CET
Organizers:
- M. Christakis (MPI SWS)
- H. Hermanns (U Saarland, Germany)
Session Abstract:
From autonomous vehicles to Industry 4.0, from smart homes to smart cities: cyber-physical technology increasingly participates in actions and decisions that affect humans. However, our understanding of how these applications interact and of what causes a specific automated decision is lagging far behind. This comes with a gradual loss in understanding.
The root cause of this problem is that contemporary systems do not have any built-in concepts to explicate their behaviour. They calculate and propagate outcomes of computations, but are not designed to provide explanations. They are not perspicuous.
The key to enabling comprehension in a cyber-physical world is a science of perspicuous computing. This session will discuss the foundational, industrial and societal dimensions of the perspicuous computing challenge. It is organized by the Center for Perspicuous Computing (TRR 248), a Collaborative Research Center funded by the German Research Foundation DFG.
Session Structure:
- Introduction: “Enabling comprehension in a cyber-physical world with the human in the loop”, Holger Hermanns, Universität des Saarlandes
- Panel: “Is industry or society in need for perspicuous computing? Both? Or neither?”
Panel Moderator: Christel Baier, Technische Universität Dresden
Panelists:
- Bernd Finkbeiner, CISPA Helmholtz Center for Information Security
- Christof Fetzer, Technische Universität Dresden
- Raimund Dachselt, Technische Universität Dresden
- Prof. Rupak Majumdar, Max Planck Institute for Software Systems
- Dr. Lena Kästner, Universität des Saarlandes, representing EIS
More details: https://www.perspicuous-computing.science/date-2021-session-perspicuous-computing/
ASDW05-07 Production Architectures & Platforms for Automated Vehicles
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 16:00 CET - 16:55 CET
Organizer / Chair: Rolf Ernst, TU Braunschweig, Germany
Session Abstract:
Highly automated vehicles need high-performance HW/SW platforms to execute complex software systems for safety-critical functions. This is a usually underestimated challenge when automation comes to production vehicles. The session starts with two short presentations of platform architectures that approach the resulting design quality and safety challenge with different methods. The session will continue with an open discussion of these and possibly other approaches.
Speakers:
- W. Steiner, TTTech, Austria: MotionWise - A Brief Introduction and Outlook
- D. Pangercic, APEX.AI, USA: Open-source and Developer Centric SW Platform for the New Breed of Vehicles
ASDW05-08 Self-Awareness for Autonomy
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 17:00 CET - 17:55 CET
Organizer / Chair: Nikil Dutt (UC Irvine, USA)
Session Abstract:
Self-awareness principles promise to endow autonomous systems with high degrees of adaptivity and resilience, borrowing from an abundance of examples in biology and nature. However, engineering dependable and predictable autonomous systems poses significant challenges for explainability, testing, and bounding safe behaviors. This session begins with short presentations by academic and industry speakers on these topics, followed by an interactive discussion with the audience. The first presentation, by Prof. Andreas Herkersdorf (TU Munich), discusses how transparent machine learning techniques can be coupled with self-awareness to improve dependability in autonomous systems. The second presentation, by Dr. Ahmed Nassar (Nvidia), addresses issues in training and testing of self-aware autonomous agents. The third presentation, by Dr. Prakash Sarathy (Northrop Grumman), describes how to bound the emergent behavior of autonomous systems using a self-aware dataflow computing paradigm. The session then closes with an open discussion between the audience and the speakers on research challenges and future directions at the intersection of self-awareness and autonomy.
Speakers:
- A. Herkersdorf, TU Munich, Germany: Transparent ML as a means of enhancing dependable autonomy
- A. Nassar, Nvidia, USA: Continuous Training and Testing of Autonomous Agents: The Road to Self-Awareness
- P. Sarathy, Northrop Grumman, USA: Self-aware dataflow computing for Bounded Behavior Assurance
ASDW05-09 ASD Closing
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 18:00 CET - 18:10 CET
- Closing address by the ASD organizers.
W04 Workshop on Interdependent Challenges of Reliability, Security and Quality (RESCUE 2021)
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 08:45 CET - 18:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/P6Tu45MQ7EifvudCs
Organizers:
Anton Klotz, Cadence Design Systems, DE
Matteo SONZA REORDA, Politecnico di Torino, IT
Marcelo Brandalero, Brandenburg University of Technology Cottbus–Senftenberg, DE
Luca Sterpone, Politecnico di Torino, IT
Said Hamdioui, Delft University of Technology, NL
Giorgio Di Natale, Univ. Grenoble Alpes, FR
Maksim Jenihhin, Tallinn University of Technology, EE
Milos Krstic, IHP Microelectronics, DE
Peter Langendörfer, IHP Microelectronics, DE
Speakers:
Raphael Segabinazzi Ferreira, Brandenburg University of Technology Cottbus–Senftenberg, DE
Troya Çağıl Köylü, Delft University of Technology, NL
Dmytro Petryk, Innovations for High Performance Microelectronics (IHP), DE
Aleksa Damljanovic, Politecnico di Torino, IT
Felipe Augusto da Silva, Cadence Design Systems, DE
Thomas Lange, iRoC Technologies, FR
Pieter Weckx, imec, BE
Steve Carlson, Cadence Design Systems, US
Paul Duplys, Robert Bosch GmbH, DE
Vladimir Herdt, University of Bremen, DE
Muhammad Usama Sardar, Technische Universität Dresden, DE
Buse Ustaoglu, DFKI GmbH, DE
Dan Alexandrescu, iROC Technologies, FR
Jürgen Alt, Infineon, DE
Marc Witteman, Riscure, NL
Georgios Selimis, Intrinsic ID, NL
Davide Appello, ST Microelectronics, IT
Francky Catthoor, imec, BE
Zain Ul Abideen, Tallinn University of Technology, EE
Jan Reznicek, Czech Technical University in Prague, CZ
Tiago Diadami Perez, Tallinn University of Technology, EE
News
Last updated: 04.02.2021
- Final program and proceedings are now live!
Overview of the RESCUE Workshop
The relationship between reliability, quality, and security is often conflicting: despite pursuing the same goal, i.e., the safe and correct operation of a computing system, they start from completely different assumptions. Quality and reliability address technical errors occurring during the design and manufacturing of the computing system and along its lifetime, while security aims at defeating malicious attempts to alter the normal behavior of the system. These design aspects set mutual constraints and contradicting requirements; nevertheless, techniques from one discipline can be successfully applied to the other (e.g., fault tolerance to counteract fault-injection attacks). Eventually, designers should guarantee the correct operation of the system regardless of the origin of the malfunctioning, be it malicious or technical. In this context, it is essential that designers are aware of the complete picture, and that researchers from these interdependent disciplines meet and exchange ideas and challenges.
RESCUE 2021 is looking forward to novel ideas, overview, application, or position papers on the interdisciplinary nature of security, reliability, quality, testing, and verification -- you can find the call for contributions below! At least one author of an accepted paper is required to register for DATE’21.
The workshop will also present some of the methods, tools, architectures, and results achieved in the European Training Network RESCUE (https://www.rescue-etn.eu) funded by the European Union H2020 Programme under the Marie Skłodowska-Curie Action.
Important dates
- Submission deadline: January 8, 2021
- Notification of acceptance: January 17, 2021
- Virtual Event: February 5, 2021 (08:45 - 18:00)
Proceedings
The proceedings for the event can be downloaded via this password-protected link. The password will be shared with workshop participants on the day of the event.
Schedule
- 08:45 - 09:00 Opening Session
- 09:00 - 09:40 Industrial Invited Talk 1 by Pieter Weckx, IMEC
- 09:40 - 11:10 Session "Reliability, Security and Quality"
- 11:10 - 11:20 Coffee Break
- 11:20 - 12:00 Industrial Invited Talk 2 by Paul Duplys, Robert Bosch GmbH
- 12:00 - 13:00 Lunch Break
- 13:00 - 13:40 Keynote by Said Hamdioui, TU Delft
- 13:40 - 15:10 RESCUE Project Research Highlights
- 15:10 - 15:20 Coffee Break
- 15:20 - 16:00 Industrial Invited Talk 3 by Steve Carlson, Cadence Design Systems
- 16:00 - 16:10 Coffee Break
- 16:10 - 17:40 Panel "Reliability, Security, and Quality: Will They Really Integrate?"
- 17:40 - 18:00 Closing Session

Technical Program
08:45 - 09:00: Opening Session
Opening Session - Maksim Jenihhin, Anton Klotz, Matteo Sonza Reorda, Marcelo Brandalero.
09:00 - 09:40: Industrial Invited Talk #1
Design Optimisation by Exploiting Workload Dependence of Aging.
- Pieter Weckx, R&D Team Leader, IMEC, BE.
Session Chair: Marcelo Brandalero, Brandenburg University of Technology.
Abstract: Physical degradation mechanisms are an inevitable reality of real-world CMOS devices. They introduce a significant design challenge for IDMs, foundries, and fabless manufacturers. Design margins are necessary to ensure reliable operation of integrated circuits over extreme ranges of environmental variations (voltage, temperature) and manufacturing process variations. On top of PVT variations, aging-related parametric drift (e.g. caused by NBTI, HCI, and EM) also limits performance by requiring additional margin. The industry-adopted corner-based design paradigm involves applying margins that may be too optimistic or too pessimistic and tends to ignore the correlation effects which exist inherently due to the circuit topology and the workload. In this talk we will discuss a workload-dependent, reliability-aware design paradigm. Here an optimal margining scheme can be found which is determined by the system workload. The interdependence between system workloads and aging mechanisms is thereby exploited to achieve the desired Power-Performance-Area (PPA) goals at minimum reliability penalty.
Speaker Bio: Pieter Weckx received the B.Sc. degree in Electronic Engineering, the M.Sc. degree in Nanoscience and Nanotechnology, and the Ph.D. degree in Engineering from the Katholieke Universiteit Leuven, Belgium, in 2009, 2011, and 2016, respectively. In 2015 he joined imec as a researcher working on design and system technology optimizations for advanced future scaled CMOS. Currently, he is R&D team leader of the circuit design and architecture team for technology explorations.
09:40 - 11:10: Session: Reliability, Security, and Quality
Session Chairs: Matteo Sonza Reorda, Politecnico di Torino & Marcelo Brandalero, Brandenburg University of Technology.
- 09:40 - 09:55: Performance and Security Analysis using a Flexible Design Obfuscation Method. Zain Ul Abideen and Samuel Pagliarini.
- Tallinn University of Technology, Estonia.
- 09:55 - 10:10: Towards Formalization of Enhanced Privacy ID (EPID)-based Remote Attestation in Intel SGX. Muhammad Usama Sardar and Christof Fetzer.
- Technische Universität Dresden, Germany.
- 10:10 - 10:25: A Novel Side-Channel Trojan Insertion via ECO. Tiago Diadami Perez and Samuel Pagliarini.
- Tallinn University of Technology, Estonia.
- 10:25 - 10:40: Methods for Calculating Non-Homogeneous Continuous Time Markov Chains. Jan Řezníček, Martin Kohlík and Hana Kubátová.
- Czech Technical University in Prague, Czech Republic.
- 10:40 - 10:55: A Memory-Upscaled Boolean Satisfiability Solver for Complex On-Chip Self-Verification Tasks. Buse Ustaoglu*, Sebastian Huhn*^ and Rolf Drechsler*^.
- 10:55 - 11:10: Efficient Techniques to Boost RISC-V Compliance Testing. Vladimir Herdt and Rolf Drechsler.
- University of Bremen, Germany.
11:10 - 11:20: Coffee Break
11:20 - 12:00: Industrial Invited Talk #2
Automating Safety and Security from Chip to Cloud: Challenges and Opportunities.
- Paul Duplys, Security, Privacy & Safety, Corporate Research, Robert Bosch GmbH.
Session Chair: Luca Sterpone, Politecnico di Torino.
Abstract: The combination of ever-growing system complexity and connectivity being added to ever more devices has led to a rapid increase of the attack surface, and not only in automotive electronic systems. In a similar way, the introduction of open-context systems like (highly) automated driving has created new safety challenges resulting from a huge state space, a large number of corner cases, and a high rate of change. Contrary to the above developments, resources for either one, safety or security, remain largely unchanged: nearly the same number of security people need to guard a vastly larger attack surface, and almost the same number of safety people need to tame a much higher complexity. As a result, automation becomes key for both safety and security. In this talk, I will cover both challenges and opportunities for automation in security and safety, from the chip to the cloud. I will cover challenges in different stages of the product life cycle and describe the opportunities that automation offers.
Bio: Paul Duplys is a security researcher and leads the security, privacy & safety research program at the Corporate Research division of Robert Bosch GmbH, a Tier-1 automotive supplier and manufacturer of industrial, residential, and consumer goods. He is doing applied research in various fields of information security since 2007. Paul's current research interests include security automation, software security, monitoring, intrusion detection & honeypots, threat intelligence, AI applications for security and security of AI, privacy engineering, privacy-preserving technologies, and edge & cloud security and safety. Paul holds a Ph.D. degree from the University of Tuebingen on side-channel evaluation for the automotive domain.
12:00 - 13:00: Lunch Break
13:00 - 13:40: Keynote
Re-Engineering Test and Reliability for Emerging Computing Technologies.
- Said Hamdioui, TU Delft, NL.
Session Chair: Peter Langendörfer, IHP Microelectronics.
Abstract: Emerging applications are extremely demanding in terms of storage, computing power, and energy efficiency. On the other hand, both today's computer architectures and device technologies are facing major challenges that make them incapable of delivering the required functionalities and features at an economically affordable cost. For computing systems to continue to deliver sustainable benefits to society for the foreseeable future, alternative computing architectures are being explored in the light of emerging new device technologies. This calls for new methods to guarantee the outgoing product quality and reliability of such new computing engines.
This talk first briefly addresses the need for a new computing paradigm with an energy efficiency on the order of fJ/operation to enable zillions of, e.g., edge applications. Then, a classification of computing architectures is presented, and computation-in-memory (CIM) as an alternative architecture is defined. Logic and arithmetic circuit designs using memristor devices and how they enable such architectures are briefly covered, and some measurement data are shown to demonstrate the CIM concept and its potential. Thereafter, testing of memristor-based CIM is discussed; it will be demonstrated that the traditional approach to fault modeling and test development cannot deal with realistic defects in emerging CIM devices, and a new approach called Device-Aware Test (DAT) will be covered. Industrial data are presented to show that DAT sensitizes realistic faults as well as new unique defects and faults that can never be caught with the traditional approaches. Moreover, reliability concerns of such computing engines will be discussed; these concerns can arise from device nonidealities as well as from the characteristics of the employed circuit architecture. Finally, future CIM challenges, including architectures, design, test, and reliability, will be highlighted.
Bio: Said Hamdioui is currently the Head of the Quantum and Computer Engineering department, and serves as Head of the Computer Engineering Laboratory (CE-Lab) of the Delft University of Technology, the Netherlands. He is also co-founder and CEO of Cognitive-IC, a start-up focusing on hardware dependability solutions. Said Hamdioui received the MSEE and Ph.D. degrees (both with honors) from TUDelft. Prior to joining TUDelft as a professor, Said Hamdioui worked at Intel Corporation (California, USA), at Philips Semiconductors R&D (Crolles, France), and at Philips/ NXP Semiconductors (Nijmegen, The Netherlands). His research focuses on two domains: emerging technologies and computing paradigms, and hardware dependability. He published over 230 technical conference and journal papers on these topics.
13:40 - 15:10: RESCUE Project Research Highlights
Session Chair: Maksim Jenihhin, Tallinn University of Technology
- 13:40 - 13:55: Processor Behavior Under Faults. Troya Köylü*, Cezar Reinbrecht*, Marcelo Brandalero^, Said Hamdioui*, Mottaqiallah Taouil*.
- * Delft University of Technology, Netherlands and ^ Brandenburg University of Technology, Germany.
- 13:55 - 14:10: Run-time Dynamic Configuration of Functional Units Redundancy for Mixed-Critical Scenarios. Raphael Segabinazzi Ferreira, Jörg Nolte.
- Brandenburg University of Technology, Germany.
- 14:10 - 14:25: Radiation Hardness Does Not Mean Tamper Resistance. Dmytro Petryk, Zoya Dyka, Milos Krstic, Peter Langendörfer.
- IHP – Leibniz-Institut für innovative Mikroelektronik, Germany.
- 14:25 - 14:40: AutoSoC: A Suite of Open-Source Automotive SoC Benchmarks. Felipe Augusto da Silva*^, Ahmet Cagri Bagbaba*, Said Hamdioui^, Christian Sauer*
- * Cadence Design Systems, DE and ^ Delft University of Technology, Netherlands.
- 14:40 - 14:55: IEEE 1687 IJTAG - Interdependent Aspects of Quality and Reliability. Aleksa Damljanovic, Giovanni Squillero
- Politecnico di Torino, Italy.
- 14:55 - 15:10: Machine Learning to Tackle the Challenges of Soft Errors in Complex Circuits. Thomas Lange*^, Aneesh Balakrishnan*+, Dan Alexandrescu*, Luca Sterpone^.
- * iRoC Technologies, France and ^ Politecnico di Torino, Italy and + Tallinn University of Technology, Estonia.
15:10 - 15:20: Coffee Break
15:20 - 16:00: Industrial Invited Talk #3
PPA, Reliability, Security, and Quality: They Have to Integrate!
- Steve Carlson, Director/Solution Architect Aerospace and Defense, Cadence Design Systems, San José, US.
Session Chair: Milos Krstic, IHP Microelectronics.
Abstract: The limits of siloed, serially applied expertise are being reached. To wring the most out of today's silicon process technologies, which serve more complex markets with multi-factor value points, PPA, Reliability, Security, and Quality must be addressed earlier in the design process, and in a manner where the effects of pushing on one axis can be measured and weighed against the other axes. In this talk, a methodology is proposed that opens the door to a holistic, integrated view of PPA, Reliability, Security, and Quality.
Speaker Bio: A thirty-year veteran of the electronics industry, Steve has been focused on Aerospace and Defense system solutions for the past six years. His current research interests are in hardware and system security, systems verification, digital twinning, advanced semiconductor design processes, and approaches to parts obsolescence. Previously, Steve headed marketing for the Chief Strategy Office and for the Silicon Realization Group, which included the digital, custom/analog, and sign-off products. Steve joined Cadence in April 2003 via the Get2Chip acquisition, where he was the VP of Marketing. Prior to Get2Chip, Steve was the CEO of Tharas Systems, a hardware acceleration company. Steve has also held various management positions at Escalade, LSI Logic, United Technologies, and Synopsys.
At Synopsys, Steve was part of the original Design Compiler technical team responsible for timing analysis and optimization. Steve was the author of the industry's first book on high-level design, titled “Introduction to HDL-based Design Using VHDL”. Steve earned a BSEE, BSCS, and an MSEE from the University of Colorado. He has authored numerous books, papers, and articles related to advanced electronics design. Steve is a volunteer and supporter of the Lazarex Cancer Foundation (www.lazarex.org).
16:00 - 16:10: Coffee Break
16:10 - 17:40: Panel
Reliability, Security, and Quality: Will They Really Integrate?
Moderator: Giorgio di Natale, Director of Research, CNRS, LIRMM, Montpellier, FR.
Participants:
- Dan Alexandrescu, IROC, FR
- Davide Appello, ST Microelectronics, IT
- Francky Catthoor, IMEC, BE
- Georgios Selimis, Intrinsic ID, NL
- Jürgen Alt, Infineon, DE
- Marc Witteman, Riscure, NL
- Paul Duplys, Bosch, DE
- Steve Carlson, Cadence Design Systems, US
17:40 - 18:00: Closing Session
Closing Session - Maksim Jenihhin, Anton Klotz, Matteo Sonza Reorda, Marcelo Brandalero.
- Call for Contributions
RESCUE 2021 is looking forward to novel ideas, overview, application, or position contributions on topics including, but not limited to:
- design methods and tools for reliable, secure, and high-quality systems;
- adaptive techniques for improving reliability and security;
- HW/SW co-design towards reliability, security, and quality;
- test, reliability, security, and verification of emerging technologies;
- self-aware systems for improved reliability and security.
Submissions for technical presentations can be in the form of abstracts (2 pages) or full papers (4-6 pages) prepared according to the IEEE format and submitted via EasyChair. Accepted contributions will be distributed among the workshop participants. At least one author of an accepted paper is required to register for the workshop.
- Organizing Committee
General Chairs
- Maksim Jenihhin, Tallinn UT, EE
- Anton Klotz, Cadence, DE
Program Chairs
- Matteo Sonza Reorda, POLITO, IT
- Marcelo Brandalero, BTU, DE
Program Committee
- D. Alexandrescu, IROC, FR
- Z. Dyka, IHP, DE
- M. Glorieux, IROC, FR
- S. Hamdioui, TU Delft, NL
- M. Huebner, BTU, DE
- M. Krstic, IHP, DE
- P. Langendoerfer, IHP, DE
- J. Nolte, BTU, DE
- J. Raik, Tallinn UT, EE
- C. Sauer, Cadence, DE
- G. Selimis, IID, NL
- G. Squillero, POLITO, IT
- L. Sterpone, POLITO, IT
- M. Taouil, TU Delft, NL
- H.T. Vierhaus, BTU, DE
Acknowledgments
The RESCUE ETN project has received funding from the European Union’s Horizon 2020 Programme under the Marie Skłodowska-Curie actions for research, technological development and demonstration, under grant no. 722325.
ASDW05-02 Dynamic Risk Assessment in Autonomous Systems
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 09:00 CET - 09:55 CET
-
Organizers / Chairs:
-
Peter Liggesmeyer, Fraunhofer IESE
-
Rasmus Adler, Fraunhofer IESE
-
Richard Hawkins, University of York
-
-
Session Abstract:
An autonomous system is capable of independently achieving a predefined goal in accordance with the demands of the current situation. In safety-critical applications, the operational situations may demand some actions from the system in order to keep risks at an acceptable level. This motivates the implementation of algorithms that estimate, assess and control risks during operation. In particular, the risk assessment at runtime is challenging as it implies moral decision making about acceptability of risks: “How safe is safe enough?”. However, it is also challenging to find a suitable notion of risk. IEC and IEC standards define the term “risk” differently following two “root” definitions: “combination of the probability of occurrence of harm, and the severity of that harm” and “effect of uncertainty on objectives”. The first definition is related to the way how integrity levels like SIL and ASIL are determined at design-time. In the session, we will discuss in how far existing design-time approaches can be adopted to implement an autonomous risk management at runtime. For instance, is it reasonable to implement algorithms that determine integrity levels at runtime?
-
Speakers:
-
Detlev Richter, TüV SüD:
Digital twin-based hazard analysis at runtime for resilient production
-
Simon Burton, Fraunhofer IKS:
Prerequisites for dynamic risk management
-
Patrik Feth, Sick AG:
Sensors for Dynamic Risk Assessment
-
Michael Woon, retrospect:
Being Certain of Uncertainty in Risk
-
ASDW05-03 Cybersecurity for Connected Autonomous Vehicles
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 10:00 CET - 10:55 CET
-
Organizers / Chairs:
-
Sebastian Steinhorst, Technical University of Munich, Germany
-
Mohammad Hamad, Technical University of Munich, Germany
-
-
Session Abstract:
Today's vehicles are increasingly connected and tomorrow's vehicles will be automated, autonomous, capable of sensing their environment and navigating through cities without human input. This comes at the cost of a new set of threats and cyber-attacks that can yield high recall costs, property loss, and even jeopardize human safety. In this session, three partners of the nIoVe H2020 EU project will present security challenges and solutions to improve future autonomous vehicles' security. The session starts with a short presentation about insights on the challenges and limitations of providing cyberthreat protections for public transport autonomous shuttles. The second talk discusses the need to make autonomous vehicles proactively able to react to intrusions and the challenges to achieving such a capability. The last talk addresses further cyber-security challenges that nIoVe aims to solve for autonomous vehicles. The session will continue with an open discussion to discuss all the introduced challenges and other related aspects.
- Talks:
-
Niels Nijdam, University of Geneva (UNIGE), Switzerland
The Perils of Cybersecurity in Connected Automated Vehicles
-
Mohammad Hamad, Technical University of Munich (TUM), Germany
Toward a Multi-layer Intrusion Response System for Autonomous Vehicles
-
Konstantinos Votis, Institute/Centre for Research and Technologies Hellas (CERTH/ITI), Greece
Cyber-security Solutions for Autonomous Vehicles
-
ASDW05-04 Self-adaptive safety- and mission-critical CPS: wishful thinking or absolute necessity?
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 11:00 CET - 11:55 CET
-
Organizers / Chairs:
-
Andy Pimentel (University of Amsterdam, Netherlands)
-
Martina Maggio (University of Saarland, Germany)
-
-
Session Abstract:
Due to the increasing performance demands of mission- and safety-critical Cyber Physical Systems (CPS), these systems exhibit a rapidly growing complexity, manifested by an increasing number of (distributed) computational cores and application components connected via complex networks. However, with the growing complexity and interconnectivity of these systems, the chances of hardware failures as well as disruptions due to cyber-attacks will also quickly increase. System adaptivity, for example in the form of dynamically remapping of application components to processing cores, represents a promising technique to handle this challenging scenario. In this session, we address the (consequences of the) idea of deploying runtime adaptivity to mission- and safety-critical CPS, yielding dynamically morphing systems, to establish robustness against computational hurdles, component failures, and cyber-attacks.
-
Speakers:
-
Clemens Grelck (University of Amsterdam, Netherlands)
The TeamPlay Coordination Language for Dependable Systems
-
Sasa Misailovic (University of Illinois at Urbana-Champaign, USA)
Programming Systems for Helping Developers Cope with Uncertainty
-
Stefanos Skalistis (Raytheon Technologies, Ireland)
Certification challenges of adaptive avionics systems
-
ASDW05-05 Predictable Perception
Add this session to my calendar
Date: Friday, 05 February 2021
Time: 14:00 CET - 14:55 CET
-
Organizers / Chairs:
-
Samarjit Chakraborty (U North Carolina, Chapel Hill, USA)
-
Petru Eles (Linköping University, SE)
-
-
Session Abstract:
Modern autonomous systems - such as autonomous vehicles or robots - consist of two major components: (a) the decision making unit, which is often made up of one or more feedback control loops, and (b) a perception unit that feeds the environmental state to the control unit and is made up of camera, radar and lidar sensors and their associated processing algorithms and infrastructure. While there has been a lot of work on the formal verification of the decision making (or the control) unit, the ultimate correctness of the autonomous system also heavily relies on the behavior of the perception unit. The verification of the correctness of the perception unit is however significantly more challenging and not much progress has been made here. This is because the algorithms used by perception units now increasingly rely on machine learning techniques (like deep neural networks) that run on a complex hardware made up CPU+accelerator platforms. The accelerators are made up of GPUs, TPUs and FPGAs. This combination of algorithmic + implementation platform complexity and heterogeneity currently makes it very difficult to provide either functional or timing correctness guarantees of the perception unit, while both of these guarantees are needed to ensure the correct functioning of the control loop and the overall autonomous system. This is a part of the overall challenge of verifying the correctness of autonomous systems.
Speakers:
- Qing Rao (BMW, Munich, Germany): New Era in Autonomous Driving and the Role of IT - Will Traditional Carmakers Keep Pace?
- Soheil Samii (General Motors R&D, USA): Dependable sensing system architecture for predictable perception in autonomous vehicles
- Deepak Shankar (Mirabilis Design, USA): Design Tools for Predictable Hw/Sw Architectures for Autonomous Vehicles
- Cong Liu (UT Dallas, USA): Towards Timing-Predictable & Robust Autonomy in Autonomous Embedded Systems
- Hamed Tabkhi (University of North Carolina at Charlotte, USA): Toward AI-in-the-Loop Autonomous Safety System - Algorithmic and Timing Challenges
ASDW05-06 Perspicuous Computing
Date: Friday, 05 February 2021
Time: 15:00 CET - 15:55 CET
Organizers:
Maria Christakis (MPI-SWS, Germany)
Holger Hermanns (Saarland University, Germany)
Session Abstract:
From autonomous vehicles to Industry 4.0, from smart homes to smart cities – cyber-physical technology increasingly participates in actions and decisions that affect humans. However, our understanding of how these applications interact, and of what causes a specific automated decision, lags far behind, and this gap in comprehension keeps widening.
The root cause of this problem is that contemporary systems do not have any built-in concepts to explicate their behaviour. They calculate and propagate outcomes of computations, but are not designed to provide explanations. They are not perspicuous.
The key to enable comprehension in a cyber-physical world is a science of perspicuous computing. This session will discuss the foundational, the industrial and the societal dimensions of the perspicuous computing challenge. It is organized by the Center for Perspicuous Computing – TRR 248 – a Collaborative Research Center funded by the German Research Foundation DFG.
Session Structure:
Introduction: “Enabling comprehension in a cyber-physical world with the human in the loop”
Holger Hermanns, Universität des Saarlandes
Panel: “Is industry or society in need of perspicuous computing? Both? Or neither?”
Panel Moderator: Christel Baier, Technische Universität Dresden
Panelists:
- Bernd Finkbeiner, CISPA Helmholtz Center for Information Security
- Christof Fetzer, Technische Universität Dresden
- Raimund Dachselt, Technische Universität Dresden
- Rupak Majumdar, Max Planck Institute for Software Systems
- Lena Kästner, Universität des Saarlandes, representing EIS
More details: https://www.perspicuous-computing.science/date-2021-session-perspicuous-computing/
W01B Automotive Reliability and Test in Europe (ARTe)
Date: Friday, 05 February 2021
Time: 15:00 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/5WXLiPb9KkbF2Rsp5
Organizers:
Wim Dobbelaere, ON Semiconductor, BE
Michelangelo Grosso, AMS R&D STMicroelectronics s.r.l., IT
Elena-Ioana Vatajelu, Grenoble INP, TIMA, CNRS, FR
Daniel Tille, Infineon Technologies, DE
Paolo Bernardi, Politecnico di Torino, IT
Marcello Traiola, École Centrale de Lyon, FR
Riccardo Cantoro, Politecnico di Torino, IT
Yervant Zorian, Synopsys, US
Aim of the workshop
Automotive electronics is becoming more and more relevant in daily life, especially with the advent of autonomous driving, which will make people fully dependent on the proper operation of electronic systems. The Automotive Reliability and Test workshop in Europe (ARTe, or ART in Europe) focuses exclusively on the test and reliability of automotive electronics, including IC design, test development, system-level integration, production testing, in-field test, diagnosis and repair solutions, as well as architectures and methods for reliable and safe operation in the field.
Given the success of the first five editions in the US, we would like to celebrate the ART anniversary with an extraordinary edition held in Europe. The ARTe workshop offers a forum for industry specialists and academic researchers to present and discuss these challenges and emerging solutions. For this first edition in the frame of the DATE conference, special focus will be given to Design-for-Test solutions.
For further information, please visit the workshop website: http://cas.polito.it/ARTe2021/
Technical program
Morning sessions
8:30-8:40 CET
Opening of the first “Automotive Test and Reliability Workshop in Europe 2021”
General Chairs:
Yervant Zorian – Synopsys, USA
Paolo Bernardi – Politecnico di Torino, Italy
Program Chairs:
Wim Dobbelaere – ON Semiconductor, Belgium
Riccardo Cantoro – Politecnico di Torino, Italy
8:40-9:40 CET
Technical session: “Fault detection, grading, and monitoring techniques”
Moderator: to be announced
1.A: “Automatic and Scalable Implementation Flow of Performance Monitors for Automotive MCU Using Functional Path Ring Oscillators”
Authors:
Tobias Kilian – Infineon Technologies (DE), Technical University of Munich (DE)
Heiko Ahrens, Daniel Tille, Martin Huch – Infineon Technologies (DE)
Ulf Schlichtmann – Technical University of Munich (DE)
Abstract:
The automotive industry requires high dependability from its microcontrollers (MCUs); MCU manufacturers therefore seek precise performance screening. This paper presents an automated and scalable method for creating functional path ring oscillators. The implementation flow is presented alongside the development of a new MCU.
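For background, the textbook relation behind ring-oscillator-based monitors (standard theory, not taken from the paper itself): an N-stage inverting ring oscillates at a frequency set by the per-stage delay, so a measured frequency translates directly into a path-delay estimate.

```latex
% Textbook ring-oscillator relation (background, not from the paper):
% an N-stage inverting ring with per-stage delay t_p oscillates at
\[
  f_{\mathrm{RO}} = \frac{1}{2\,N\,t_p}
  \quad\Longrightarrow\quad
  t_p = \frac{1}{2\,N\,f_{\mathrm{RO}}}
\]
% Example with invented numbers: N = 13 stages measured at
% f_RO = 250 MHz gives t_p = 1/(2*13*250e6) ~ 154 ps per stage.
```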
1.B: “Low-Cost Error-Detection Mechanisms for Multi-Core Real-Time Systems”
Authors:
Gennaro S. Rodrigues, Fernanda L. Kastensmidt – Universidade Federal do Rio Grande do Sul (UFRGS), Brasil
Alberto Bosio – Ecole Centrale de Lyon, Lyon Institute of Nanotechnology, France
Vincent Pouget – IES, University of Montpellier, CNRS, France
Abstract:
This paper presents an approximate error-detection technique for multi-core embedded systems. The proposed technique exploits approximate computing to reduce the memory footprint, providing redundancy-based error detection at lower cost than traditional Duplication With Comparison (DWC). In addition, the smaller memory footprint has been shown to improve reliability. The technique is implemented on a dual-core ARM processor and evaluated under laser fault injection. Results show that we can improve on the error detection of DWC at lower cost in terms of memory footprint.
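A minimal sketch of the baseline technique named in the abstract, Duplication With Comparison, next to a reduced-precision comparison in the spirit of approximate computing (the quantization scheme and values are illustrative assumptions, not the authors' design):

```python
# Illustrative sketch, not the authors' implementation.

def quantize(v, bits=8):
    """Reduce precision; storing quantized replicas shrinks the footprint."""
    scale = 2 ** bits
    return round(v * scale) / scale

def dwc_error(a, b):
    """Classic DWC: any mismatch between the two replicas flags an error."""
    return a != b

def approx_dwc_error(a, b, bits=8):
    """Approximate variant: deviations below the quantization step pass."""
    return quantize(a, bits) != quantize(b, bits)

golden = 3.14159265
upset = golden + 2 ** -20                   # tiny perturbation
print(dwc_error(golden, upset))             # True: exact comparison flags it
print(approx_dwc_error(golden, upset))      # False: below the approx threshold
print(approx_dwc_error(golden, golden + 0.5))  # True: large upsets still caught
```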
1.C: “Claiming Coverage Credit for ATPG Setup using Functional Fault Grading”
Authors:
Lee Harrison, Michael Wittke – Mentor, a Siemens Business, USA
Abstract:
To meet the extremely high quality requirements of automotive ICs, a large amount of effort is put into architecting a DFT solution for these devices, keeping the inserted test logic to an absolute minimum and making sure that any test logic inserted into the design is itself as testable as possible. However, regardless of how much effort is applied, there are always faults within the design that are not testable with DFT test structures. Using functional fault grading technology in the context of structural test, we can evaluate faults in the design that are inherently caught by the setup of the design to run structural test, but are not actually accounted for in the running of those tests. In essence, this claims coverage credit for faults already covered, without the need for additional patterns.
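One simple way to picture the coverage credit being claimed (an illustration of the bookkeeping, not necessarily the tool's exact formula): faults detected only by the functional grading run are added to the structurally detected ones before dividing by the total fault population.

```latex
% Illustrative bookkeeping, not necessarily the tool's exact formula:
\[
  \mathrm{coverage}
  = \frac{\lvert D_{\mathrm{ATPG}} \rvert
        + \lvert D_{\mathrm{func}} \setminus D_{\mathrm{ATPG}} \rvert}
         {\lvert F \rvert}
\]
% F: total fault list; D_ATPG: faults detected by structural patterns;
% D_func: faults graded as inherently covered by the structural-test setup.
```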
1.D: “Unique Challenges and Effectiveness of Voltage-Stressing for Activating Latent Metal-Bridging Defects in Automotive Reliability”
Authors:
W. Pan, Lieyi Sheng – ON Semiconductor, USA
Zdenek Axman, Vilem Bucek – ON Semiconductor, Czech Republic
Abstract:
We demonstrate for the first time that a well-implemented voltage-stress (V-stress) scheme is very effective at activating and screening latent inter-metal defects, despite their highly irregular and variable electrical characteristics.
10:00-11:00 CET
Technical session: “Test Quality and Reliability in Automotive”
Moderator: to be announced
2.A: “PPM Targets of Different Markets with Focus on Automotive Market”
Author:
Ralf Arnold – Infineon Technologies, Germany
Abstract:
Quality is defined as the number of counted customer rejects divided by the number of delivered devices, measured in parts per million (PPM). Today, automotive customers request zero-PPM quality. This quality target stems from the complexity of cars: today's medium-class cars easily contain 50-60 microcontroller units (MCUs) in applications ranging from engine management, braking, radar, powertrain control, and transmission up to autonomous driving. This paper describes the results of a questionnaire from ETS and ITC 2019 in which the PPM targets of different markets are compared, with the main focus on high automotive quality.
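The metric as defined in the abstract, with a small worked example (the numbers are invented):

```latex
% The metric as defined in the abstract, with an invented worked example:
\[
  \mathrm{PPM}
  = \frac{\text{counted customer rejects}}{\text{delivered devices}}
    \times 10^{6}
\]
% Example: 3 rejects across 2{,}000{,}000 delivered devices gives
% (3 / 2e6) * 1e6 = 1.5 PPM -- still short of the zero-PPM target.
```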
2.B: “From space to automotive: advance detection of burn-in rejects on SiC products using a new statistical screening approach”
Authors:
Santi Alessandrino, Santina Bevilacqua – STMicroelectronics, Italy
Aurore Archimbaud, François Bergeret – ippon innovation, France
Abstract:
Quality and reliability are increasingly important in the automotive industry, as the number of components per car keeps increasing and will grow dramatically in the coming years with electric and autonomous vehicles. This is especially true when new technologies are deployed to address power and other challenges, notably Silicon Carbide (SiC). To support high quality standards, we have developed a new statistical multivariate method called Good Average Testing (GAT), an efficient tool for screening outliers, i.e., potential reliability issues. It is already in use in the space industry, and we have adapted it to the automotive context. We show its efficiency on SiC products.
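GAT itself is multivariate and not described here; the sketch below only illustrates the general idea of statistical outlier screening, using a robust univariate z-score (median/MAD) to flag parts that pass test limits but sit far from the population (data and threshold invented):

```python
# Univariate analogue only; the GAT method in the talk is multivariate.
# Robust z-scores (median/MAD) flag parts that pass test limits yet sit
# far from the population -- candidate latent-reliability outliers.
import statistics

def robust_outliers(values, k=6.0):
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-12
    return [v for v in values if abs(v - med) / (1.4826 * mad) > k]

leakage_nA = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 23.5]  # last part is suspect
print(robust_outliers(leakage_nA))  # -> [23.5]
```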
2.C: “Improving IO Timing Characterization and Test Quality for Automotive Devices”
Authors:
Ben Niewenhuis, Krishna Panda, Devanathan VR, Mike Hayenga, Jason Wicker, Anthony Hill – Texas Instruments Inc., USA
Abstract:
With the increasing number of IO interfaces in automotive SoCs, test and silicon characterization of these interfaces is critical to meeting strict quality and reliability requirements. In this paper, we present silicon characterization and test quality improvements for IO interfaces, with results from production automotive SoCs.
2.D: “Test Methodology Enhancements using Visual Inspection Pattern Analysis”
Authors:
Nektar Xama, Jhon Gomez, Georges Gielen – KU Leuven, Belgium
Wim Dobbelaere, Ronny Vanhooren, Anthony Coyette – ON Semiconductor, Belgium
Abstract:
As automotive testing requirements continue to become more stringent, and considering new demands in terms of functional safety, novel viewpoints and additional sources of information are being investigated to guarantee and improve automotive IC reliability. Using visual inspection information from wafers, obtained between different process steps, opens new possibilities for improving testing methodologies. The data can lead to more efficient and directed test development, and can open up new ways to detect defects that threaten reliability.
Afternoon sessions
15:00-15:45 CET
Keynote: Title and abstract will be announced soon
Speaker: Dr. Christian Pacha – Infineon Technologies, Germany
Moderator: Paolo Bernardi – Politecnico di Torino, Italy
15:45-16:45 CET
Panel “Automotive Test and Reliability: How could IEEE1687.2 change our future?”
Moderator: Wim Dobbelaere – ON Semiconductor, Belgium
Panelists:
Stephen Sunter – Mentor, a Siemens Business, Canada
Vladimir Zivkovic – Infineon Technologies, Germany
Mark Hutner – Teradyne, USA
Hans-Martin von Staudt – Dialog Semiconductor, Germany
Georges Gielen – University of Leuven, Belgium
Davide Appello – STMicroelectronics, Italy
17:00-17:45 CET
Embedded Tutorial: “Memory Testing for Automotive Applications beyond plain Test Quality”
Speaker: Dr. Martin Keim – Mentor, a Siemens Business, USA
Moderator: Elena-Ioana Vatajelu – Grenoble INP, TIMA, CNRS, France
Abstract:
Everyone understands that automotive applications demand the highest test quality. This is true for logic testing as well as for memory testing. But the story does not end with high-quality tests: quality is just one of many critical aspects of implementing memory testing for such applications. In this short embedded tutorial, we touch on several of these aspects, including memory BIST IP design and qualification, EDA tool requirements, algorithm selection and qualification, test application and repair, design and system interaction, all the way to the diagnosis of field returns.
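As a taste of the algorithm-selection aspect, here is a minimal simulation of the well-known March C- memory test on a toy memory model with an injected stuck-at fault (illustrative only; this does not describe the tutorial's BIST IP):

```python
# Illustrative only: March C- = up(w0); up(r0,w1); up(r1,w0);
# down(r0,w1); down(r1,w0); down(r0).

def march_c_minus(mem, n):
    def element(addrs, ops):
        for a in addrs:
            for op, v in ops:
                if op == "w":
                    mem.write(a, v)
                elif mem.read(a) != v:
                    return False            # expected value not seen: fault
        return True
    steps = [
        (range(n),             [("w", 0)]),
        (range(n),             [("r", 0), ("w", 1)]),
        (range(n),             [("r", 1), ("w", 0)]),
        (range(n - 1, -1, -1), [("r", 0), ("w", 1)]),
        (range(n - 1, -1, -1), [("r", 1), ("w", 0)]),
        (range(n - 1, -1, -1), [("r", 0)]),
    ]
    return all(element(addrs, ops) for addrs, ops in steps)

class FaultyMemory:
    """8-bit memory with bit 3 stuck at 0 -- a simple injectable fault."""
    def __init__(self, n):
        self.bits = [0] * n
    def write(self, a, v):
        self.bits[a] = 0 if a == 3 else v
    def read(self, a):
        return self.bits[a]

print(march_c_minus(FaultyMemory(8), 8))  # False: the stuck-at-0 is caught
```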
17:45-18:45 CET
Panel: “New automotive design methodologies for catching latent defects and detecting anomalies online”
Moderator: Haralampos-G. Stratigopoulos – Sorbonne Universités, CNRS, LIP6, France
Panelists:
Fei Su – Intel, USA
Anthony Coyette – ON Semiconductor, Belgium
Chen He – NXP, USA
Erik Faehn – STMicroelectronics, France
Daniel Tille – Infineon Technologies, Germany
Abstract:
The ever-increasing number of ICs deployed in a modern vehicle, combined with the growing number of safety-related and autonomous-driving features, calls for more comprehensive test methods that reach sub-ppm defect escape levels. This also requires dealing with latent defects that manifest themselves in the field, after post-manufacturing test, because of usage time and aging, or that are triggered only under specific operating conditions. It requires modeling and testing for latent defects to screen out chips with hidden reliability hazards at post-manufacturing test time, but also on-chip mechanisms for online concurrent detection, and possibly healing, of operation anomalies due to latent defects, with the aim of expanding safety features. The panel gathers five experts in the field to exchange views and discuss available solutions, trends, and open challenges.
18:45-19:45 CET
Closing of the first “Automotive Test and Reliability Workshop in Europe 2021”
General Chairs:
Yervant Zorian – Synopsys, USA
Paolo Bernardi – Politecnico di Torino, Italy
Program Chairs:
Wim Dobbelaere – ON Semiconductor, Belgium
Riccardo Cantoro – Politecnico di Torino, Italy
W02 Workshop on System-level Design Methods for Deep Learning on Heterogeneous Architectures (SLOHA 2021)
Date: Friday, 05 February 2021
Time: 15:00 CET - 19:00 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/zRgzCdb7EyzDLDQzk
Organizers:
Luca Benini, ETHZ, CH
Frank Hannig, Friedrich-Alexander University Erlangen-Nürnberg (FAU), DE
Michaela Blott, Xilinx Research, IE
Paolo Meloni, University of Cagliari, IT
Matteo Spallanzani, ETH Zürich, CH
Matthias Ziegler, Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, DE
Speakers:
Walther Carballo Hernández, Université Clermont Auvergne, FR
Olivia Weng, University of California, San Diego, US
Mikail Yayla, TU Dortmund, DE
Dainius Jenkus, Newcastle University, GB
Muhammad Sabih, Friedrich-Alexander University Erlangen-Nürnberg (FAU), DE
Francesco Paissan, Fondazione Bruno Kessler, IT
Quentin Ducasse, ENSTA Bretagne, FR
Christopher Metz, University of Bremen, DE
Cristina Chesta, Santer Reply SpA, IT
Etienne Dupuis, Lyon Institute of Nanotechnology, FR
Simon Pfenning, Friedrich-Alexander University Erlangen-Nürnberg (FAU), DE
Sorin Grigorescu, Transilvania University of Brasov and Elektrobit Automotive, RO
Augusto Vega, IBM T. J. Watson Research Center, US
Machine learning, particularly deep learning (DL), is the fastest-growing domain of artificial intelligence. DL is being applied in more and more disciplines for recognition, identification, classification, and prediction tasks. A large part of the activity focuses on image, video, and speech processing, but more general signal-processing tasks also benefit from DL.
Prominent application areas and markets include human-machine interaction using vocal commands, biological signals processing on wearable devices for medical and fitness applications, visual environment understanding applications such as those used in robotics and advanced driver assistance systems, or predictive maintenance in industrial automation. These applications are often embedded into a broader technical context, imposing stringent constraints on power and size. Thus, many applications should be executed on highly customized heterogeneous low-power computing platforms at the edge.
On the one hand, there exists a large zoo of DL models and tools that target standard hardware platforms; on the other hand, targeting heterogeneous and edge devices raises many challenges. Several software and hardware design choices have to be made when developing such systems: How can the search for a neural network be automated, and how can the network be efficiently deployed or synthesized on multiple target platforms, such as different heterogeneous low-power computing platforms and edge devices? What should the neural network look like to achieve the best possible result on a given hardware platform with limited computing power and a limited energy budget?
The workshop's primary goal is to address these research questions and facilitate the implementation of DL applications on heterogeneous low-power computing platforms and edge devices, including accelerators such as (embedded) GPUs, FPGAs, CGRAs, and TPUs.
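A toy rendering of that search question (every name and number below is invented): among a catalogue of candidate networks with profiled accuracy, latency, and energy, keep the most accurate one that fits the platform's budgets.

```python
# Toy illustration of hardware-aware model selection; all profiles invented.

candidates = [  # (name, accuracy, latency_ms, energy_mJ)
    ("net_large",  0.92, 48.0, 210.0),
    ("net_medium", 0.89, 21.0,  95.0),
    ("net_tiny",   0.83,  6.0,  22.0),
]

def best_fit(candidates, max_latency_ms, max_energy_mJ):
    feasible = [c for c in candidates
                if c[2] <= max_latency_ms and c[3] <= max_energy_mJ]
    return max(feasible, key=lambda c: c[1], default=None)

print(best_fit(candidates, max_latency_ms=30.0, max_energy_mJ=100.0))
# -> ('net_medium', 0.89, 21.0, 95.0)
```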
For further information and the workshop program, please visit:
https://www12.cs.fau.de/ws/sloha2021/
W02.O Welcome and Introduction
Date: Friday, 05 February 2021
Time: 15:00 CET - 15:10 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/iK4mEPMnzsQitsWWC
Session Chairs:
Frank Hannig, Friedrich-Alexander University Erlangen-Nürnberg (FAU), DE
Paolo Meloni, University of Cagliari, IT
Matteo Spallanzani, ETH Zürich, CH
Matthias Ziegler, Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, DE
W02.K.1 Keynote Speech 1
Date: Friday, 05 February 2021
Time: 15:10 CET - 15:50 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/8H2ofmqK8xvo2SLyo
Session Chair:
Luca Benini, ETHZ, CH
Time | Label | Presentation Title | Speaker
---|---|---|---
15:10 CET | W02.K.1.1 | In-Sensor ML — Heterogeneous Computing in a mW | Keynote Speaker: Luca Benini, ETHZ, CH
W02.K.2 Keynote Speech 2
Date: Friday, 05 February 2021
Time: 18:15 CET - 18:55 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/FmgJ4oMihBNLXL5cu
Session Chair:
Michaela Blott, Xilinx Research, IE
Time | Label | Presentation Title | Speaker
---|---|---|---
18:15 CET | W02.K.2.1 | Specialization in Hardware Architectures for Deep Learning | Keynote Speaker: Michaela Blott, Xilinx Research, IE
W02.C Closing
Date: Friday, 05 February 2021
Time: 19:00 CET - 19:05 CET
Virtual Conference Room: https://virtual21.date-conference.com/meetings/virtual/acdxqkAJ6SmfaYPoi
Session Chairs:
Frank Hannig, Friedrich-Alexander University Erlangen-Nürnberg (FAU), DE
Paolo Meloni, University of Cagliari, IT
Matteo Spallanzani, ETH Zürich, CH
Matthias Ziegler, Fraunhofer Institute for Integrated Circuits (IIS), Erlangen, DE
ASDW05-07 Production Architectures & Platforms for Automated Vehicles
Date: Friday, 05 February 2021
Time: 16:00 CET - 16:55 CET
Organizer / Chair: Rolf Ernst, TU Braunschweig, Germany
Session Abstract:
Highly automated vehicles need high-performance HW/SW platforms to execute complex software systems for safety-critical functions. This challenge is usually underestimated when automation comes to production vehicles. The session starts with two short presentations of platform architectures that approach the resulting design-quality and safety challenge with different methods, and continues with an open discussion of these and possibly other approaches.
Speakers:
- W. Steiner (TTTech, Austria): MotionWise - A Brief Introduction and Outlook
- D. Pangercic (APEX.AI, USA): Open-source and Developer Centric SW Platform for the New Breed of Vehicles
ASDW05-08 Self-Awareness for Autonomy
Date: Friday, 05 February 2021
Time: 17:00 CET - 17:55 CET
Organizer / Chair: Nikil Dutt (UC Irvine, USA)
Session Abstract:
Self-awareness principles promise to endow autonomous systems with high degrees of adaptivity and resilience, borrowing from an abundance of examples in biology and nature. However, the engineering of dependable and predictable autonomous systems poses significant challenges for explainability, testing, and bounding safe behaviors. This session begins with short presentations by academic and industry speakers on these topics, followed by an interactive discussion with the audience. The first presentation, by Prof. Andreas Herkersdorf (TU Munich), discusses how transparent machine learning techniques can be coupled with self-awareness to improve dependability in autonomous systems. The second presentation, by Dr. Ahmed Nassar (Nvidia), addresses issues in the training and testing of self-aware autonomous agents. The third presentation, by Dr. Prakash Sarathy (Northrop Grumman), describes how to bound the emergent behavior of autonomous systems using a self-aware dataflow computing paradigm. The session concludes with an open discussion between the audience and the speakers on research challenges and future directions at the intersection of self-awareness and autonomy.
Speakers:
- Andreas Herkersdorf (TU Munich, Germany): Transparent ML as a means of enhancing dependable autonomy
- Ahmed Nassar (Nvidia, USA): Continuous Training and Testing of Autonomous Agents: The Road to Self-Awareness
- Prakash Sarathy (Northrop Grumman, USA): Self-aware dataflow computing for Bounded Behavior Assurance
ASDW05-09 ASD Closing
Date: Friday, 05 February 2021
Time: 18:00 CET - 18:10 CET
Closing address by the ASD organizers.