
To PiM or Not to PiM

The case for in-memory inferencing of quantized CNNs at the edge

Gabriel Falcao and João Dinis Ferreira

As artificial intelligence becomes a pervasive tool for the billions of IoT (Internet of things) devices at the edge, the data movement bottleneck imposes severe limitations on the performance and autonomy of these systems. PiM (processing-in-memory) is emerging as a way of mitigating the data movement bottleneck while satisfying the stringent performance, energy efficiency, and accuracy requirements of edge imaging applications that rely on CNNs (convolutional neural networks).

The globalization of affordable Internet access has spurred a revolution in computer architectures, characterized by the accelerated widespread adoption of smartphones, tablets, and other smart devices, which are now commonplace.15 The rise of IoT applications in a wide range of domains (e.g., personal computing, education, industry, military, healthcare, digital agriculture) brings with it the ability to integrate billions of devices on the Internet, as depicted in figure 1.

[Figure 1]

 

This integration presents unprecedented challenges, such as the need for inexpensive computation and communication, capable of crunching the increasing volumes of data generated every day.

The IoT paradigm also promises a more intimate connection between the cyber and physical worlds, as data becomes a ubiquitous asset exchanged among all manner of connected (smart) devices. Moreover, the data flow is often bidirectional, taking place not just from the physical world to the cyber world, but also from the cyber world back to the physical world.

Bringing this vision to fruition will require that IoT devices exhibit AI. The most promising approaches today are based on empowering systems with the ability to learn autonomously from experience by assimilating large amounts of data—using ML (machine learning) algorithms—with a particular focus on deep learning and image inference.

 

AI Demands at the Edge Will Grow

The total volume of digital data created, replicated, and consumed within a year surpassed dozens of ZBs (zettabytes, 10²¹ bytes) in 2020, and the International Data Corporation estimates that this number will grow to hundreds of ZBs in the coming years.18 The Covid-19 pandemic contributed to this figure because of widespread work-from-home mandates and a sudden increase in videoconferencing and streaming data. A significant portion of this data is consumed at the edge, often with processing performed entirely on smartphones and embedded systems. The rise of many other AI-based applications tied to the big data revolution exacerbates the problem by placing increasing stress on computing and memory systems, in particular those operating at the edge.

The case for edge computing as the enabler of sustainable AI scaling is strengthened because a sizable portion of the data generated by modern digital systems originates from sensors located at the edge—under this paradigm, data is processed where it is generated, as figure 2 illustrates. In contrast with the conventional approach shown on the left, with the emergence of cloud-edge hierarchies, AI moves to the edge layer, shown on the right in the figure. This new paradigm creates pressure for more intensive computation on the edge processing nodes, but it also decreases the time and energy spent communicating with the cloud and introduces data privacy and latency benefits.

[Figure 2]

 

As data generation and the demand for AI applications grow at increasing rates, five goals justify moving AI processing to the edge:1,6,20,24

Latency. Many applications have an interactive nature and thus cannot endure long latencies, especially for memory, storage, and network requests.

Reliability. Communication networks are not reliable at all places, at all times. To ensure maximum uptime, AI-based decision-making must not rely on always-available communication networks.

Privacy/Security. Some applications require sensitive data to be kept in a controlled local environment, avoiding its circulation to and from the cloud. Examples include medical, financial, and autonomous-driving applications, among many others.

Bandwidth. Data that is processed near where it is collected does not need to be sent to the cloud, which reduces the overall bandwidth demand on the network and the edge-computing systems.

Data provenance. Provenance issues may prevent data from being processed far from where it is generated. Data-center storage may need to comply with regional data protection legislation such as GDPR (General Data Protection Regulation) in Europe and PIPA (Personal Information Protection Act) in Canada.

To achieve these goals, computer architects and software developers must adopt a holistic vision of the combined cloud+edge system and keep the unnecessary movement of data between components to a minimum by processing data where it is generated and stored, since data movement is the dominant performance and energy bottleneck.9

 

AI at the Edge: ML Solutions for Data Challenges

At present, neural networks are widely used in many domains and are becoming integral components of other emerging applications, such as self-driving cars, always-on biosignal monitoring, augmented and virtual reality, critical IoT, and voice communication (which represents up to 25 percent of the use cases of 5G at the edge), all of which require AI algorithms to operate on high volumes of data at the edge (see figure 3). Examples include (a) autonomous vehicles, (b) digital agriculture, (c) smartphones, and (d) smart IoT devices that process substantial volumes of data while running AI kernels at the edge. The systems that support these emerging applications will be expected to make decisions faster—and, often, better—than their human counterparts, with support for continuous fine-tuning of their decision-making by factoring in ever-increasing volumes of data for training and inference.

[Figure 3]

 

In particular, CNNs are the de facto standard for image-based decision-making tasks. These models make heavy use of the convolution and MAC (multiply-and-accumulate) operations, which represent more than 90 percent of the total cost of computation.11 For this reason, state-of-the-art neural network accelerators (e.g., Google's Tensor Processing Unit) have focused on optimizing the performance and energy efficiency of MAC operations.
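To make that cost concrete, here is a minimal sketch of the inner loop of a single-channel 2D convolution in plain Python (the function and variable names are illustrative, not taken from any framework): every output element is produced by a chain of MAC operations over the filter window, which is why accelerating MACs pays off so handsomely.

def conv2d_single_channel(image, kernel):
    """Naive 2D convolution; every output pixel costs kH * kW MACs."""
    iH, iW = len(image), len(image[0])
    kH, kW = len(kernel), len(kernel[0])
    oH, oW = iH - kH + 1, iW - kW + 1
    output = [[0.0] * oW for _ in range(oH)]
    macs = 0
    for y in range(oH):
        for x in range(oW):
            acc = 0.0
            for ky in range(kH):
                for kx in range(kW):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]  # one MAC
                    macs += 1
            output[y][x] = acc
    return output, macs

# A 224-by-224 input with a 3-by-3 filter already needs roughly 0.44 million
# MACs for a single input/output channel pair; real CNN layers multiply this
# by hundreds of channels.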

 

Characterizing and optimizing CNN architectures

Typical CNNs consist of hundreds of millions of parameters and require the execution of billions of operations per second to achieve real-time performance. As their accuracy improves, CNNs include more parameters and layers, becoming wider and deeper. The use of compact data representations provided by quantization mitigates some of the overhead of these more complex network architectures and allows for high degrees of parallelism and data reuse, which are especially useful in constrained processing environments.

Data quantization significantly reduces the computation and storage requirements of neural networks by decreasing the bit width of the model's weights. Quantizing these values to under 8 bits while retaining accuracy, however, requires manual effort, hyperparameter tuning, and intensive retraining.

While training requires a high dynamic range, inference does not: In most cases, 2- to 4-bit precision achieves the desired levels of accuracy.10 Going further, it is even possible to approximate CNNs by binarizing (i.e., quantizing to one bit) their input, weights, and/or activations.10,17

In many practical settings, and in particular for edge computing, the performance and energy-efficiency benefits of binary neural networks outweigh the accuracy loss. A further benefit of binary neural networks is the ability to approximate the convolution operation required by CNNs, by combining the much more efficient bitwise XNOR and bit-counting operations.
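As an illustration of that XNOR/bit-counting approximation (a sketch under the assumption that weights and activations are quantized to ±1 and bit-packed; all names here are hypothetical), the dot product at the heart of a binary convolution reduces to a few integer operations:

def binary_dot_product(a_bits, w_bits, n):
    """Approximate a dot product of n values quantized to {-1, +1}.

    a_bits and w_bits are integers whose n least-significant bits encode
    +1 as 1 and -1 as 0. XNOR marks the positions where the signs agree;
    the bit count (popcount) then yields the number of agreements, from
    which the {-1, +1} dot product is recovered as 2*agreements - n.
    """
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ w_bits) & mask      # 1 where signs agree
    agreements = bin(xnor).count("1")     # bit count (popcount)
    return 2 * agreements - n

# Example: a = [+1, -1, +1, +1], w = [+1, +1, -1, +1], packed LSB first.
print(binary_dot_product(0b1101, 0b1011, 4))  # -> 0, same as the exact dot product

In hardware, the bit count maps to a popcount instruction, and both steps are natural candidates for the bulk bitwise operations offered by the PiM substrates discussed later in this article.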

Table 1 illustrates the memory and processing requirements of five widely used CNN models across a range of devices, from data-center servers to IoT nodes. The table indicates the number of parameters and MAC operations for all the networks. W/A refers to the bit widths of weights and activations for each level of quantization. All networks have 224-by-224 input resolution. These networks were selected from a large set of neural network models because they fit edge nodes, namely in terms of the memory required to run them. Their size is shown as a function of weight/activation bit widths (32-bit, 2-bit, 1-bit), as well as the number of MAC operations used by each network.

[Table 1]

 

Specialization as the enabler of high-performance AI

Modern deep-learning algorithms have substantial computational, memory, and energy requirements, which makes their implementation on edge devices challenging. This challenge can be addressed by exploiting two unique characteristics of ML algorithms: First, each class of deep-learning algorithm relies on a limited set of specialized operations; second, in many cases these algorithms provide good accuracy even when they use low-bit-width operations.13

In recent years, several frameworks (e.g., TensorFlow, PyTorch, TensorRT) have helped bridge the semantic gap between the high-level description of a neural network model and its underlying hardware mapping by using specialized instructions. This is achieved by performing operations in a bulk parallel manner, minimizing memory accesses and maximizing compute-resource utilization.

 

Efficient Edge AI: Architecting Data-Centric Systems

These observations and constraints lead to the formulation of a set of well-defined target metrics for AI at the edge:

Accuracy. The success rate of the AI task10 (e.g., image classification, object detection, sentence generation, translation).

Throughput. The rate of processing of input data. Many real-time AI applications that support video must sustain processing rates on the order of thousands of FPS (frames per second)—for example, 2300 FPS in self-driving cars23 or hundreds to thousands of FPS in ultrasound medical devices.

Latency. The critical path delay associated with the processing of a single input element. 5G standards define a maximum latency of one millisecond for positioning and tracking systems; self-driving cars must provide latencies within the same order of magnitude.23

Power and energy. Most edge devices are battery-powered, and maximizing battery life is a key design target. For reference, the computational system of self-driving cars requires a power supply of up to 2.5kW.23

Data precision. AI data need not always be represented in 64- or 32-bit floating-point precision. For many inference applications, integer precision of eight bits or less suffices.4,13
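As a sketch of what such reduced precision can look like in practice (symmetric linear quantization with illustrative names and values; real frameworks add zero-points, per-channel scales, and calibration), a vector of float weights can be mapped to 8-bit integers and back:

def quantize_symmetric(weights, bits=8):
    """Map float weights to signed integers with a single scale factor."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.99]
q, scale = quantize_symmetric(weights, bits=8)
print(q)                     # [41, -127, 7, 97]
print(dequantize(q, scale))  # values close to the originals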

 

The good old processor-centric computing model

The late 20th and early 21st centuries saw the widespread use of the processor-centric computational model. In this model, programs and data are stored in memory, and processing takes place in specialized ALUs (arithmetic logic units). Together with Moore's law, the introduction of caches, branch predictors, out-of-order execution, multithreading, and several other hardware and software optimizations enabled a steady and largely uninterrupted series of performance improvements over the past decades.

In contrast, memory systems have improved at a much slower pace. This performance gap between the processor and main memory—compounded by the fact that the two technologies remain several process-node generations apart—has given rise to a critical data movement bottleneck, dubbed the memory wall.14 Memory is the dominant performance and energy bottleneck in modern computing systems; data movement is far more expensive than computation, in both latency and energy. The data movement bottleneck will remain relevant as the number of smart devices connected to the Internet—as of this writing, already in the billions, as depicted in figure 1—continues to grow.

 

A compelling possibility: processing data where it resides

As the demand for inferencing at the edge grows, accessing data more efficiently becomes increasingly relevant. Proposed improvements span (1) data reuse, by exploiting temporal and spatial locality; (2) algorithm design, with the introduction of optimized neural network topologies and the use of quantization; (3) specialized hardware, introducing dedicated vectorized instructions designed to address the demands of these workloads (e.g., SIMD MAC operations); and (4) PiM architectures. When performed at the edge, PiM enables higher throughput for AI applications without compromising device autonomy.17

PiM solutions differ primarily in their proximity to the data:14 In the PnM (processing-near-memory) paradigm, computation takes place close to where data resides, but in a different medium—for example, in the logic layer of a 3D-stacked memory; in contrast, the PuM (processing-using-memory) paradigm takes advantage of the storage medium's physical properties to perform computation—for example, ReRAM (resistive random access memory) or DRAM (dynamic random access memory) cells.

 

Processing-near-Memory

3D-stacked memories are an emerging type of memory architecture that enables the vertical stacking of memory layers on top of a logic layer. This logic layer can be designed to feature hardware support for several operations, thus enabling computation inside the memory units.19

 

Processing-using-Memory

DRAM technology is especially well suited to supporting bitwise operations, since rows in the same subarray share bitlines and can therefore interact with one another through them.

Ambit21 supports bulk bitwise majority/AND/OR/NOT functions by exploiting the analog operation of DRAM. The combination of these operations allows the design of full applications. Recent studies show that Ambit's core operating principle can be performed in commodity off-the-shelf DRAM chips with no changes to DRAM.7 Ambit improves performance by 30 to 45 times and reduces energy consumption by 25 to 60 times for the execution of bulk bitwise operations, resulting in an overall speedup in database queries of four to 12 times.
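A short functional sketch can make Ambit's principle concrete (Python integers stand in for DRAM rows here; this models the bitwise semantics only, not the analog circuit behavior): triple-row activation computes a bitwise majority, and presetting one operand row to all zeros or all ones specializes that majority into AND or OR.

def majority3(a, b, c):
    """Bitwise three-input majority: the primitive behind triple-row activation."""
    return (a & b) | (b & c) | (a & c)

ROW_WIDTH = 8                       # illustrative row width, in bits
ZEROS = 0
ONES = (1 << ROW_WIDTH) - 1

row_a, row_b = 0b11001010, 0b10011100

bulk_and = majority3(row_a, row_b, ZEROS)   # MAJ(A, B, 0) == A AND B
bulk_or = majority3(row_a, row_b, ONES)     # MAJ(A, B, 1) == A OR B

assert bulk_and == row_a & row_b
assert bulk_or == row_a | row_b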

SIMDRAM8 creates an optimized graph representation of a user-defined arbitrary operation using bitwise majority and NOT operations, which can be performed using the triple-row activation command defined in Ambit. The SIMDRAM control unit orchestrates the computation from start to finish by executing the previously defined DRAM commands. SIMDRAM improves performance by 88 times over a CPU and 5.8 times over a GPU, and reduces energy consumption by 257 and 31 times, respectively.
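One way to see how majority and NOT suffice for arbitrary operations is the bit-serial full adder, the building block of in-memory addition and, by extension, of MAC-style arithmetic (this is a textbook decomposition used here for illustration; SIMDRAM derives its own optimized operation graphs).

def majority3(a, b, c):
    return (a & b) | (b & c) | (a & c)

def bitwise_not(a, width):
    return ~a & ((1 << width) - 1)

def full_adder(a, b, carry_in, width=1):
    """One adder stage expressed only with majority and NOT.

    carry_out = MAJ(a, b, cin)
    sum       = MAJ(NOT(carry_out), MAJ(a, b, NOT(cin)), cin)
    """
    carry_out = majority3(a, b, carry_in)
    s = majority3(bitwise_not(carry_out, width),
                  majority3(a, b, bitwise_not(carry_in, width)),
                  carry_in)
    return s, carry_out

# Exhaustive check of the single-bit truth table.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin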

The DRAM-based PuM architecture pLUTo5 extends the flexibility and performance of PuM by introducing a mechanism for bulk in-DRAM value lookups. The lookups take place entirely within the DRAM subarray and therefore do not require that data be moved off-chip at any point. With pLUTo, it is possible to implement arbitrarily complex functions as table lookups (so long as the memory arrays are sufficiently large to accommodate them), while minimizing the overall movement of data. pLUTo improves performance by 33 times over a CPU and 8 times over a GPU, and reduces energy consumption by 110 and 80 times, respectively.
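The underlying idea can be sketched as ordinary software (this models the lookup-table principle, not pLUTo's DRAM interface; the activation function and quantization scale below are arbitrary choices for illustration): any function of a narrow input is precomputed once into a table, and later evaluations become lookups rather than arithmetic.

import math

def build_lut(fn, bits=8):
    """Precompute fn over every possible value of a bits-wide input."""
    return [fn(x) for x in range(2 ** bits)]

# A nonlinear activation tabulated over all 8-bit quantized inputs
# (assumed scale of 1/32 per step, chosen only for illustration).
sigmoid_lut = build_lut(lambda q: 1.0 / (1.0 + math.exp(-(q - 128) / 32.0)))

def activate(quantized_inputs):
    return [sigmoid_lut[q] for q in quantized_inputs]  # lookups, no exp/div

print(activate([0, 128, 255]))  # approximately [0.018, 0.5, 0.981]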

PRIME2 and ISAAC22 are two promising neural network accelerators based on ReRAM. These proposals leverage the ReRAM crossbar array to perform matrix-vector multiplication efficiently in the analog domain. These solutions report performance and energy consumption improvements of up to 2360 and 895 times, respectively, relative to state-of-the-art neural-processing unit designs.

 

Enabling the adoption of PiM

Because quantized neural networks rely on simple operations, they are a prime target for exploiting PiM to process AI kernels at the edge. Performing this more straightforward computation close to where data resides greatly reduces the overall movement of data, which improves latency, throughput, and energy efficiency. These techniques can likely be applied even more efficiently to quantized and binary CNNs, which use XNOR and bit-count operators.

To validate the performance and energy merits of PiM, the next section presents a quantitative analysis of the improvements to CNN image inference at the edge, offered by Ambit, a DRAM-based PuM architecture that supports bitwise logic operations.

 

To PiM or Not to PiM: A Quantitative Analysis

This section evaluates the accuracy, performance, and energy costs of performing inference for binary and quantized versions of the most recent neural networks introduced in table 1. To this end, table 2 shows the energy consumption of Ambit-based PiM designs performing inference on each of these networks. These estimates account for the cost of the dominant MAC operations required by the convolutions. Energy is calculated both for execution on ARM CPUs at the edge (the baseline) and for PiM technology, using an analytical model. (For full access to the original references, see https://github.com/joaodinissf/qcnn-accuracy/.)

[Table 2]

 

Table 2 shows that different accuracy-performance-energy tradeoffs are possible. Table 3 analyzes the performance of the four most recent neural networks from table 1, targeting 15 and 60 FPS, for 3-, 2-, and 1-bit precision, with supported inference throughput (in FPS) for Qualcomm Snapdragon 865, Intel Xeon Gold 6154, and Edge TPU baselines, compared with an Ambit-based PiM architecture. Cells with a '-' indicate lack of real-time support; cells with a '+' indicate real-time support. Empirical observations show that these quantized models ensure adequate accuracy for many applications. Ingenious quantization techniques enable further compression of even the smallest neural network models and make very large networks portable to power-constrained devices, with small, tunable accuracy losses.

[Table 3]

 

Accuracy

The average accuracy loss is 6.8 percent, and in the best case the loss is as low as 2.2 percent. In cases where 1-bit quantization results in too great an accuracy loss, settling for 2- or 3-bit quantization often yields good accuracy.

 

Performance

The described Ambit-based accelerator achieves 15 FPS in eight cases out of 12, and 60 FPS for the other four cases. In contrast, the Qualcomm Snapdragon, Intel Xeon, and Edge TPU baselines sustain 15 FPS in, respectively, zero, three, and nine cases out of 12, and 60 FPS in zero, zero, and nine cases out of 12. The average speedup of the Ambit-based accelerator over each of these baselines is 58.8, 21.3, and 4.6 times.

Two key conclusions can be drawn. First, the Edge TPU can sustain a processing rate of only 3.4 FPS for VGG-16, because of this network's high number of parameters. This illustrates how poorly current specialized neural network accelerators scale to very large networks. Second, the flexible degree of parallelism in the PiM implementation, attained by operating multiple subarrays in parallel, allows the amount of parallelism to scale quasi-linearly with the size of the network, which enables near-constant inference time even for the largest networks.

 

Energy

Average energy savings of 35.4 times were observed for 1-bit relative to 32-bit precision. Energy gains are proportional to the degree of quantization.

 

Open Challenges

As the volume of data to process approaches the installed processing capacity, the successful implementation of AI at the edge will depend on the development of optimized architectures that are able to perform AI tasks while meeting strict performance and energy-efficiency requirements. After the introduction of 1,000-core processors and the expansion of parallelism in the compute units, computer architects and software developers must now turn to memory to design the next generation of high-performance and highly efficient systems. While PiM architectures have shown promise,16 enabling their adoption at the edge layer will require answers to the following open questions:

 

For manufacturers: designing PiM

Software/Hardware co-design. Integrating PiM technology and its compilers with new firmware and operating systems should be made straightforward, enabling the design of edge systems with maximal performance and energy efficiency that meet the computational demands of AI applications and use cases. PiM can be specialized for the edge by supporting common image-processing operations, which are especially relevant for performance.

Low design complexity and cost. PiM designs should be made sufficiently low-cost to entice hardware manufacturers to integrate them into their products. This will entail the development of early-access commercial prototypes and proof-of-concept products that encourage early adopters to disseminate the technology and offset R&D costs. PiM technology is reaching a point of maturity, and it will soon make its way to self-driving cars, healthcare, digital agriculture, and other edge AI applications with massive total addressable markets.

Multitiered PiM architectures. As PiM technology matures, it will populate multiple levels of the existing memory hierarchy with complementary processing components, each with its own benefits and drawbacks.3,12 For example, processing-in-cache architectures have been proposed, which trade off capacity (relative to DRAM) for greater speed and support for a wider range of operations. Analogously, processing-in-storage (typically oriented toward nonvolatile memories) is well suited for the execution of simpler operations on vast volumes of data, with very high throughput.

The compiler. PiM substrates will require custom compilers; these compilers can aid software developers by identifying common memory-access and computational patterns and by automating certain steps of the compilation process to yield maximal performance (e.g., adopting an ideal data mapping or circumventing bottlenecking memory accesses).

 

For end users: software development for PiM

Algorithm design. Mapping existing and emerging applications to PiM substrates requires the algorithm to exploit data quantization, exposed parallelism, and the coalescing of memory accesses in order to offset the system's high-cost operations, which are especially expensive at the edge. The programmer is responsible for achieving the desired tradeoff among performance, energy efficiency, and accuracy to satisfy the requirements of the target AI application.

The development framework. A low-cost approach to empower the compiler with knowledge about the application's data flow is to create an intuitive and expressive API that exposes as many of the high-efficiency operations supported by the PiM substrate as possible in an easy-to-use way. It is tempting to draw an analogy between the current state of PiM and the early days of GPGPU (general-purpose computing on graphics processing units): CUDA and OpenCL were instrumental in enabling the mass-scale adoption of GPUs, and a similar API must assume that role for PiM.

Benchmarking tools. Standard benchmarking, profiling, simulation, and analysis tools enable the comparison of different architectures and are therefore essential for the R&D stage of PiM. This is especially important for emerging applications, as is the case for many edge AI workloads. Similar to the role that MLPerf (developed by a consortium of AI leaders from academia, research labs, and industry) plays in the development of machine-learning algorithms, a set of standardized tests would allow the development of PiM to proceed steadily and openly.

 

What the Future Will Bring and the Role of PiM

Demand for AI at the edge will continue to grow. High-performance and energy-efficient PiM-based edge architectures for the processing of AI should have the following qualities:

Support quantized data structures. As shown exhaustively in the literature, particularly for inference, the use of reduced-bit-width representations yields only small losses in accuracy. This is particularly useful for reducing the movement of data within/from/to the memory subsystem and for increasing the degree of vectorization and parallelism.

Support specialized instructions. Bitwise and LUT (lookup table)-based operations can be implemented in the memory substrate as row-level operations or in the logic layer, with sizable benefits for running AI kernels with low hardware complexity, high bandwidth, and high parallelism.

Avoid the unnecessary movement of data. The compiler should be able to detect what part of the AI kernel should run on the processor and what part should run in memory. Not only are different portions of the kernel more suitable for distinct subsystems, but it is also fundamental to balance the workload among them to maximize performance. The minimization of energy consumption is another target that the computer architecture community should aim for.

Foster a PiM ecosystem. Achieving critical mass in the adoption of PiM is a crucial milestone for its success, and reaching it requires a programmer-friendly interface, intuitive compilers, and comprehensive test suites, with a set of industry-standard benchmarks.

Where do we go from here?

Several PiM architectures have demonstrated the ability to perform AI tasks with unprecedented levels of efficiency by meeting the criteria presented here. Existing PiM designs support a limited range of operations, however, and further work is necessary to meet the requirements of AI tasks.

PiM is not a silver bullet, and it will not supersede conventional computing—although it may soon become evident that computers and other smart devices benefit from incorporating both conventional processing units and PiM-enabled memories. When designing new architectures, computer architects must remain mindful that data movement is expensive in both latency and energy. Emerging technologies and architectures enable the mitigation of data movement and, as such, pave a path for the design of more efficient computing devices. The current paradigm will take us only so far; PiM presents a compelling alternative.

 

Acknowledgments

This work was partially supported by Instituto de Telecomunicações and Fundação para a Ciência e a Tecnologia, Portugal, under grants EXPL/EEI-HAC/1511/2021, PTDC/EEIHAC/30485/2017 and UIDB/EEA/50008/2020.

 

References

1. Bonawitz, K., Kairouz, P., McMahan, B., Ramage, D. 2021. Federated learning and privacy: building privacy-preserving systems for machine learning and data science on decentralized data. acmqueue 19(5), 87–114; https://dl.acm.org/doi/10.1145/3494834.3500240.

2. Chi, P., Li, S., Xu, C., Zhang, T., Zhao, J., Liu, Y., Wang, Y., Xie, Y. 2016. PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. ACM SIGARCH Computer Architecture News 44(3), 27-39; https://dl.acm.org/doi/10.1145/3007787.3001140.

3. Devaux, F. 2019. The true processing in memory accelerator. In IEEE Hot Chips 31 Symposium (HCS). IEEE Computer Society, 1–24; https://ieeexplore.ieee.org/document/8875680.

4. Duarte, P., Tomas, P., Falcao, G. 2017. SCRATCH: an end-to-end application-aware soft-GPGPU architecture and trimming tool. In 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 165-177; https://ieeexplore.ieee.org/document/8686531.

5. Ferreira, J. D., Falcao, G., Gómez-Luna, J., Alser, M., Orosa, L., Sadrosadati, M., Kim, J. S., Oliveira, G. F., Shahroodi, T., Nori, A., Mutlu, O. 2022. pLUTo: enabling massively parallel computation in DRAM via lookup tables. In 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 900-919; https://www.computer.org/csdl/proceedings-article/micro/2022/627200a900/1HMSBFwBUOc.

6. Fuketa, H., Uchiyama, K. 2021. Edge artificial intelligence chips for the cyberphysical systems era. Computer 54(1), 84-88; https://ieeexplore.ieee.org/document/9321799.

7. Gao, F., Tziantzioulis, G., Wentzlaff, D. 2019. ComputeDRAM: in-memory compute using off-the-shelf DRAMs. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 100-113; https://dl.acm.org/doi/10.1145/3352460.3358260.

8. Hajinazar, N., Oliveira, G., Gregorio, S., Ferreira, J., Ghiasi, N., Patel, M., Alser, M., Ghose, S., Gómez-Luna, J., Mutlu, O. 2021. SIMDRAM: a framework for bit-serial SIMD processing using DRAM. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 329-345; https://dl.acm.org/doi/10.1145/3445814.3446749.

9. Hennessy, J. L., Patterson, D. A. 2019. A new golden age for computer architecture. Communications of the ACM 62(2), 48-60; https://dl.acm.org/doi/10.1145/3282307.

10. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y. 2017. Quantized neural networks: training neural networks with low precision weights and activations. Journal of Machine Learning Research 18, 1-30; https://www.jmlr.org/papers/volume18/16-456/16-456.pdf.

11. Jouppi, N., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A. 2017. In-datacenter performance analysis of a tensor processing unit. In ACM/IEEE 44th Annual International Symposium on Computer Architecture, 1-12; https://ieeexplore.ieee.org/abstract/document/8192463/similar.

12. Kwon, Y.-C., Lee, S., Lee, J., Kwon, S.-H., Ryu, J., Son, J.-P., Seongil, O., Yu, H.-S., Lee, H., Kim, S. 2021. 25.4 A 20nm 6GB function-in-memory DRAM, based on HBM2 with a 1.2 TFLOPS programmable computing unit using bank-level parallelism, for machine learning applications. In IEEE International Solid-state Circuits Conference (ISSCC), 350-352; https://ieeexplore.ieee.org/document/9365862.

13. Marques, J., Andrade, J., Falcao, G. 2017. Unreliable memory operation on a convolutional neural network processor. In IEEE International Workshop on Signal Processing Systems (SiPS), 1-6; https://ieeexplore.ieee.org/document/8110024.

14. Mutlu, O., Ghose, S., Gómez-Luna, J., Ausavarungnirun, R. 2020. A modern primer on processing in memory. arXiv; https://arxiv.org/abs/2012.03112.

15. Pandey, P., Pompili, D. 2019. Handling limited resources in mobile computing via closed-loop approximate computations. IEEE Pervasive Computing 18 (1), 39–48; https://ieeexplore.ieee.org/document/8705029.

16. Radojkovic, P., Carpenter, P., Esmaili-Dokht, P., Cimadomo, R., Charles, H.-P., Sebastian, A., Amato, P. 2021. Processing in memory: the tipping point. European Technology Platform for High-performance Computing, white paper; https://www.etp4hpc.eu/pujades/files/ETP4HPC_WP_Processing-In-Memory_FINAL.pdf.

17. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A. 2020. Enabling AI at the edge with XNOR-networks. Communications of the ACM 63(12), 83-90; https://dl.acm.org/doi/10.1145/3429945.

18. Reinsel, D., Gantz, J., Rydning, J. 2018. The digitization of the world—from edge to core. IDC White Paper, I. D. Corporation; https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf.

19. Rosenfeld, P. 2014. Performance exploration of the hybrid memory cube. Ph.D. dissertation, University of Maryland; https://user.eng.umd.edu/~blj/papers/thesis-PhD-paulr--HMC.pdf.

20. Satyanarayanan, M. 2020. Edge computing: a new disruptive force. Keynote address, 13th ACM International Systems and Storage Conference.

21. Seshadri, V., Lee, D., Mullins, T., Hassan, H., Boroumand, A., Kim, J., Kozuch, M., Mutlu, O., Gibbons, P., Mowry, T. 2017. Ambit: in-memory accelerator for bulk bitwise operations using commodity DRAM technology. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 273-287; https://dl.acm.org/doi/10.1145/3123939.3124544.

22. Shafiee, A., Nag, A., Muralimanohar, N., Balasubramonian, R., Strachan, J., Hu, M., Williams, R., Srikumar, V. 2016. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News 44(3), 14-26; https://dl.acm.org/doi/10.1145/3007787.3001139.

23. Talpes, E., Sarma, D., Venkataramanan, G., Bannon, P., McGee, B., Floering, B., Jalote, A., Hsiong, C., Arora, S., Gorti, A. 2020. Compute solution for Tesla's full self-driving computer. IEEE Micro 40(2), 25-35; https://ieeexplore.ieee.org/document/9007413.

24. Zheng, H., Hu, H., Han, Z. 2020. Preserving user privacy for machine learning: local differential privacy or federated machine learning? IEEE Intelligent Systems 35(4), 5–14; https://ieeexplore.ieee.org/document/9144394.

Gabriel Falcao received his Ph.D. from the University of Coimbra, Portugal, in 2010. He is a researcher at Instituto de Telecomunicações and a tenured assistant professor at the Department of Electrical and Computer Engineering of the University of Coimbra. He was a visiting professor at EPFL (Swiss Federal Institute of Technology Lausanne) in 2011-2012 and again in 2017, and a visiting academic at ETH Zurich in 2018. Falcao is co-author of more than 100 peer-reviewed publications, and his research interests include parallel computer architectures for data-intensive signal-processing applications. He is working on building low-power architectures for AI, with a particular focus on PiM hardware. In 2021 Falcao was local chair of Euro-Par, and general co-chair of IEEE SiPS, the theme of which was Signal Processing at the Edge. Falcao is a senior member of IEEE and a member of the HiPEAC Network of Excellence in Europe. Contact him at [email protected].

João Dinis Ferreira holds a bachelor's degree in electrical and computer engineering from the University of Coimbra, where he graduated in the top three percent of his class. He is a research and teaching assistant at ETH Zurich, where he is pursuing a master's degree in electrical engineering and information technology. His research interests include memory systems, processing-in-memory, and machine-learning hardware. Contact him at [email protected].

Copyright © 2022 held by owner/author. Publication rights licensed to ACM.

acmqueue

Originally published in Queue vol. 20, no. 6




