Computer Engineering / Bilgisayar Mühendisliği
Permanent URI for this collection: https://hdl.handle.net/11147/10
Now showing 1 - 10 of 12

Research Project
FTGPGPU - Genel amaçlı grafik işlemci birimi uygulamaları için donanım hatası toleransı analizi [FTGPGPU - Hardware Fault Tolerance Analysis for General-Purpose Graphics Processing Unit Applications] (2022) Öz, Işıl
The use of graphics processing units for general-purpose computation (GPGPU) increases the criticality of hardware faults, making it more important to evaluate the transient-fault vulnerability of programs and to apply suitable fault tolerance techniques. For systems that target both performance and reliability by protecting the most fault-sensitive program regions, detailed regional fault vulnerability analysis is essential. This project aims to measure and analyze the transient hardware fault vulnerability of GPGPU applications, to relate the results of these analyses to program characteristics, and to put them to use by developing a selective fault tolerance method. The first contribution of the project is a debugger-based fault injection and fault propagation analysis tool that works at the assembly level, linking software to hardware, in order to determine the transient-fault vulnerability of GPGPU applications region by region. Using this tool, experiments were carried out that inject faults into selected code regions of GPGPU programs with different structures and characteristics, and the fault vulnerability of the code regions and the propagation of the resulting errors to different data structures during program execution were examined. The second contribution of the project is an investigation of the relationship between the characteristics of GPGPU code segments and their vulnerability to faults that may occur while they execute. The performance and architectural characteristics of code segments in GPGPU programs were obtained through profiling and simulation, and silent data corruption, crash, and correct execution outcomes were identified from experiments in which faults were injected into the selected code segments with the fault injection tool developed in the first step.
By examining the relationship between program characteristics and fault vulnerability, the fault vulnerability values of a GPGPU application with given program characteristics were predicted using machine learning methods. The prediction models developed achieved accuracy rates of 82% for silent data corruption, 87% for crashes, and 96% for correct executions. The third contribution of the project is a selective fault tolerance method based on replicating the more fault-sensitive code regions. The fault tolerance scheme, developed at the compiler level, replicates code regions marked in the source code by the program developer or user; it allows the specified kernel functions to be replicated as a redundant kernel function, as redundant threads within a single kernel function, or by using CUDA streams. Different replicated execution modes can thus be selected according to the parallelism and data usage characteristics of the application, and replication is achieved efficiently with coarse-grained output checking.

Article
Soft Error Vulnerability Prediction of GPGPU Applications (Springer, 2022) Topçu, Burak; Öz, Işıl
As graphics processing units (GPUs) evolve to offer high performance for general-purpose computations in addition to inherently fault-tolerant graphics applications, soft error reliability becomes a significant concern. Fault injection provides a method of evaluating the soft error vulnerability of target programs. Since performing fault injection experiments for complex GPU hardware structures takes an impractically long time, prediction-based techniques that evaluate the soft error vulnerability of general-purpose GPU (GPGPU) programs based on metrics from different domains become crucial for both HPC developers and GPU vendors. In this work, we propose machine learning (ML)-based prediction frameworks for the soft error vulnerability evaluation of GPGPU programs.
We consider program characteristics, hardware usage, and performance metrics collected from simulation and profiling tools. While we utilize regression models to predict the masked fault rates, we build classification models to specify the vulnerability level of the GPGPU programs based on their silent data corruption (SDC) and crash rates. Our prediction models achieve maximum prediction accuracy rates of 95.9%, 88.46%, and 85.7% for masked fault rates, SDCs, and crashes, respectively.

Article
Performance and Accuracy Predictions of Approximation Methods for Shortest-Path Algorithms on GPUs (Elsevier, 2022) Aktılav, Busenur; Öz, Işıl
Approximate computing techniques, where less-than-perfect solutions are acceptable, present performance-accuracy trade-offs by performing inexact computations. Moreover, heterogeneous architectures, which combine diverse compute units, offer high performance as well as energy efficiency. Graph algorithms utilize the parallel computation units of heterogeneous GPU architectures as well as the performance improvements offered by approximation methods. Since different approximations yield different speedup and accuracy loss for the target execution, it becomes impractical to test all methods with various parameters. In this work, we perform approximate computations for three shortest-path graph algorithms and propose a machine learning framework to predict the impact of the approximations on program performance and output accuracy. We evaluate random predictions for both synthetic and real road-network graphs, and predictions of the large graph cases from small graph instances.
We achieve prediction error rates below 5% for speedup and inaccuracy values.

Conference Object
Citation - WoS: 1, Citation - Scopus: 2
Predicting the Soft Error Vulnerability of GPGPU Applications (Institute of Electrical and Electronics Engineers Inc., 2022) Topçu, Burak; Öz, Işıl
As Graphics Processing Units (GPUs) have evolved to deliver performance increases for general-purpose computations as well as graphics and multimedia applications, soft error reliability has become an important concern. The soft error vulnerability of applications is evaluated via fault injection experiments. Since fault injection takes an impractically long time to cover the fault locations in complex GPU hardware structures, prediction-based techniques have been proposed to evaluate the soft error vulnerability of General-Purpose GPU (GPGPU) programs based on hardware performance characteristics. In this work, we propose ML-based prediction models for the soft error vulnerability evaluation of GPGPU programs. We consider both program characteristics and hardware performance metrics collected from either simulation or profiling tools. While we utilize regression models for the prediction of the masked fault rates, we build classification models to specify the vulnerability level of the programs based on their silent data corruption (SDC) and crash rates. Our prediction models achieve maximum prediction accuracy rates of 96.6%, 82.6%, and 87% for masked fault rates, SDCs, and crashes, respectively.

Article
Citation - WoS: 4, Citation - Scopus: 4
Predicting the Soft Error Vulnerability of Parallel Applications Using Machine Learning (Springer, 2021) Öz, Işıl; Arslan, Sanem
With the widespread use of multicore systems with smaller transistor sizes, soft errors have become an important issue for parallel program execution. Fault injection is a prevalent method to quantify the soft error rates of applications.
However, it is very time consuming to perform detailed fault injection experiments. Therefore, prediction-based techniques have been proposed to evaluate the soft error vulnerability in a faster way. In this work, we present a soft error vulnerability prediction approach for parallel applications using machine learning algorithms. We define a set of features including thread communication, data sharing, parallel programming, and performance characteristics, and train our models based on three ML algorithms. This study is the first to use the parallel programming features, as well as the combination of all features, in vulnerability prediction of parallel programs. We propose two models for soft error vulnerability prediction: (1) a regression model with rigorous feature selection analysis that estimates correct execution rates, and (2) a novel classification model that predicts the vulnerability level of the target programs. We obtain a maximum prediction accuracy rate of 73.2% for the regression-based model, and achieve an 89% F-score for our classification model.

Article
Citation - WoS: 5, Citation - Scopus: 5
Regional Soft Error Vulnerability and Error Propagation Analysis for GPGPU Applications (Springer, 2021) Öz, Işıl; Karadaş, Ömer Faruk
The wide use of GPUs for general-purpose computations as well as graphics programs makes soft errors a critical concern. Evaluating the soft error vulnerability of GPGPU programs and employing efficient fault tolerance techniques for more reliable execution become more important. Protecting only the most error-sensitive program regions maintains an acceptable reliability level while eliminating the large performance overheads due to redundant operations. Therefore, fine-grained regional soft error vulnerability analysis is crucial for systems targeting both performance and reliability.
In this work, we present a regional fault injection framework and perform a detailed error propagation analysis to evaluate the soft error vulnerability of GPGPU applications. We evaluate both intra-kernel and inter-kernel vulnerabilities for a set of programs and quantify the severity of the data corruptions by considering metrics other than SDC rates. Our experimental study demonstrates that the code regions inside GPGPU programs exhibit different characteristics in terms of soft error vulnerability and that soft errors corrupting variables propagate into the program output in several ways. After compiling the observations from our empirical work, we present the potential impact of our analysis by discussing usage scenarios.

Article
Citation - WoS: 1, Citation - Scopus: 1
A User-Assisted Thread-Level Vulnerability Assessment Tool (Wiley, 2019) Öz, Işıl; Topçuoğlu, Haluk Rahmi; Tosun, Oğuz
System reliability becomes a critical concern in modern architectures as circuits scale down. To deal with soft errors, the replication of system resources has been used at both hardware and software levels. Since redundancy causes performance degradation, partial redundancy techniques that replicate only the most vulnerable parts of the code need to be explored. The redundancy level of user applications depends on user preferences and may differ for users with different requirements. In this work, we propose a user-assisted reliability assessment tool based on critical thread analysis for redundancy in parallel architectures. Our analysis evaluates the application threads of a parallel program by considering their criticality in the execution and selects the most critical thread or threads to be replicated. Moreover, we extend our analysis by exploring critical regions of individual threads and executing only those regions redundantly to reduce redundancy overhead.
Our experimental evaluation indicates that the replication of the most critical thread improves the system reliability more (up to 10% for the blackscholes application) than the replication of any other thread. Partial thread replication based on critical region analysis also reduces the vulnerability of the system by taking a fine-grained approach.

Article
Citation - WoS: 3, Citation - Scopus: 3
Regression-Based Prediction for Task-Based Program Performance (World Scientific Publishing, 2019) Öz, Işıl; Bhatti, Muhammad Khurram; Popov, Konstantin; Brorsson, Mats
As multicore systems evolve by increasing the number of parallel execution units, parallel programming models have been released to exploit parallelism in applications. The task-based programming model uses task abstractions to specify parallel tasks and schedules tasks onto processors at runtime. To increase efficiency and obtain the highest performance, it is necessary to identify which runtime configuration is needed and how processor cores must be shared among tasks. Exploring the design space for all possible scheduling and runtime options, especially for large input data, becomes infeasible and requires statistical modeling. Regression-based modeling determines the effects of multiple factors on a response variable, and makes predictions based on statistical analysis. In this work, we propose a regression-based modeling approach to predict task-based program performance for different scheduling parameters with variable data size. We execute a set of task-based programs by varying the runtime parameters, and systematically measure the factors that influence execution time. Our approach uses executions with different configurations for a set of input data, and derives different regression models to predict execution time for larger input data.
Our results show that regression models provide accurate predictions for validation inputs, with a mean error rate as low as 6.3%, and 14% on average across four task-based programs.

Article
Citation - WoS: 23, Citation - Scopus: 31
A Survey on Multithreading Alternatives for Soft Error Fault Tolerance (Association for Computing Machinery (ACM), 2019) Öz, Işıl; Arslan, Sanem
Smaller transistor sizes and reduction in voltage levels in modern microprocessors induce higher soft error rates. This trend makes reliability a primary design constraint for computer systems. Redundant multithreading (RMT) makes use of parallelism in modern systems by employing thread-level time redundancy for fault detection and recovery. RMT can detect faults by running identical copies of the program as separate threads in parallel execution units with identical inputs and comparing their outputs. In this article, we present a survey of RMT implementations at different architectural levels with several design considerations. We explain the implementations in seminal papers and their extensions and discuss the design choices employed by the techniques. We review both hardware and software approaches by presenting their main characteristics, and analyze the studies with different design choices regarding their strengths and weaknesses. We also present a classification to help potential users find a suitable method for their requirements and to guide researchers planning to work in this area by providing insights into future trends.

Article
Citation - WoS: 9, Citation - Scopus: 8
Scalable Parallel Implementation of Migrating Birds Optimization for the Multi-Objective Task Allocation Problem (Springer Verlag, 2021) Öz, Dindar; Öz, Işıl
As distributed computing systems have come to be widely used in many research and industrial areas, the problem of efficiently allocating tasks to the available processors in the system has become an important concern.
Since the problem is proven to be NP-hard, heuristic-based optimization techniques have been proposed to solve the task allocation problem. In particular, current cloud-based systems have grown massively, requiring multiple features such as lower cost, higher reliability, and higher throughput; therefore, the problem has become more challenging and approximate methods have gained more importance. The migrating birds optimization (MBO) algorithm offers successful solutions, especially for quadratic assignment problems. Inspired by the movement of birds, it exhibits good results through its population-based approach. Since the algorithm needs to deal with many individuals in the population, and the neighbor solution generation phase takes substantial time for large problem instances, parallelism is needed to improve execution time and make the algorithm practical for large-scale problems. In this work, we propose a scalable parallel implementation of the MBO algorithm, PMBO, for the multi-objective task allocation problem. We redesigned the implementation of the MBO algorithm so that its computationally heavy independent tasks are executed concurrently in separate threads. We compare our implementation with three parallel island-based approaches. The experimental results demonstrate that our implementation exhibits substantial solution quality improvements for difficult problem instances as the computing resources, namely parallelism, increase. Our scalability analysis also shows that higher parallelism levels offer larger solution improvements for PMBO over the island-based parallel implementations on very hard problem instances.
