Software-Directed Hardware Reliability for ML Systems

Monday, February 26, 2024

The rapid advancement of AI over the past decade has been fueled by impressive hardware innovations in computing chips such as GPUs and TPUs. However, the traditional trajectory of processor architecture advancement, driven by Moore’s Law and Dennard Scaling, has plateaued due to physical and fundamental limitations. To sustain and surpass this growth, a pivotal shift towards hardware-software codesign is essential. Yet, as we embark on this transformative journey, we encounter a critical challenge: modern-day transistors are increasingly susceptible to errors in the field, due to phenomena such as high-energy particle strikes or manufacturing defects. How then do we go about designing the AI processors of tomorrow, while navigating the tradeoff between performance, energy efficiency, and reliability?

In this talk, I will describe my contributions to this pressing domain by presenting novel software-directed tools and techniques for processor design and reliability enhancement. I will challenge the notion that quantization alone can propel processor and model innovation forward, advocating instead for a nuanced approach to numerical data formats complemented by robust hardware support. Moreover, I will emphasize the imperative of elevating hardware reliability to a primary design consideration in the computing landscape. I will share insights into my endeavors aimed at integrating reliability as a foundational element in the design process. Finally, I will describe how multi-modal techniques will help form the backbone of future processor design and optimizations, and my future research vision of co-designing ML systems for high performance, scalable reliability, and intelligent resource allocation.

Post Talk Link: Click Here

Passcode: %op7C.XD

Speaker/s

Abdulrahman Mahmoud is a postdoctoral fellow at Harvard University in the Architecture, Circuits, and Compilers group. His research interests are at the intersection of computer architecture and machine learning, with the goals of co-designing efficient and reliable systems to accelerate ML applications and of developing accurate and performant ML tools and techniques that aid future computer design and automation across the stack. He has open-sourced multiple research tools (including PyTorchFI and GoldenEye), which have collectively been downloaded over 50,000 times and garnered accolades in academia and industry alike. His work has been featured in top-tier systems venues (ASPLOS, SC, MICRO), machine learning venues (NeurIPS, ICLR, MLSys), and leading reliability and design automation conferences (DSN, ISSRE, DATE). He is currently the co-chair of the ML and Systems Rising Stars program. Abdulrahman completed his PhD at UIUC under the guidance of Dr. Sarita Adve in the RSim Research Group. During his graduate studies, he was very fortunate to be the recipient of the Mavis Future Faculty Fellowship, to be invited to the 7th Heidelberg Laureate Forum, and to receive multiple awards for teaching and mentoring undergraduate students. Prior to joining UIUC, Abdulrahman completed his BSE at Princeton University, where he was the recipient of the John Ogden Bigelow Jr. Prize in Electrical Engineering. He is on the steering committees of the Computer Architecture Student Association (CASA), the Computer Architecture Long Term Mentoring (CALM) program, and the Fatima Fellowship, and is an organizer of the uArch workshop, all initiatives aimed at broadening participation and providing mentorship opportunities for students in the computer systems and ML communities.
