Computer Engineering / Bilgisayar Mühendisliği
Permanent URI for this collectionhttps://hdl.handle.net/11147/10
Browse
2 results
Search Results
Now showing 1 - 2 of 2
Conference Object Citation - WoS: 1Citation - Scopus: 2Predicting the Soft Error Vulnerability of Gpgpu Applications(Institute of Electrical and Electronics Engineers Inc., 2022) Topçu, Burak; Öz, IşılAs Graphics Processing Units (GPUs) have evolved to deliver performance increases for general-purpose computations as well as graphics and multimedia applications, soft error reliability becomes an important concern. The soft error vulnerability of the applications is evaluated via fault injection experiments. Since performing fault injection takes impractical times to cover the fault locations in complex GPU hardware structures, prediction-based techniques have been proposed to evaluate the soft error vulnerability of General-Purpose GPU (GPGPU) programs based on the hardware performance characteristics.In this work, we propose ML-based prediction models for the soft error vulnerability evaluation of GPGPU programs. We consider both program characteristics and hardware performance metrics collected from either the simulation or the profiling tools. While we utilize regression models for the prediction of the masked fault rates, we build classification models to specify the vulnerability level of the programs based on their silent data corruption (SDC) and crash rates. Our prediction models achieve maximum prediction accuracy rates of 96.6%, 82.6%, and 87% for masked fault rates, SDCs, and crashes, respectively.Article Citation - WoS: 23Citation - Scopus: 31A Survey on Multithreading Alternatives for Soft Error Fault Tolerance(Association for Computing Machinery (ACM), 2019) Öz, Işıl; Arslan, SanemSmaller transistor sizes and reduction in voltage levels in modern microprocessors induce higher soft error rates. This trend makes reliability a primary design constraint for computer systems. Redundant multithreading (RMT) makes use of parallelism in modern systems by employing thread-level time redundancy for fault detection and recovery. RMT can detect faults by running identical copies of the program as separate threads in parallel execution units with identical inputs and comparing their outputs. In this article, we present a survey of RMT implementations at different architectural levels with several design considerations. We explain the implementations in seminal papers and their extensions and discuss the design choices employed by the techniques. We review both hardware and software approaches by presenting the main characteristics and analyze the studies with different design choices regarding their strengths and weaknesses. We also present a classification to help potential users find a suitable method for their requirement and to guide researchers planning to work on this area by providing insights into the future trend.
