Scopus İndeksli Yayınlar Koleksiyonu / Scopus Indexed Publications Collection

Permanent URI for this collection: https://hdl.handle.net/11147/7148

Now showing 1 - 3 of 3
  • Conference Object
    Teaching Accelerated Computing with Hands-On Experience
    (IEEE Computer Soc, 2025) Oz, Isil; Iheme, Leonardo O.
    Heterogeneous computing systems deliver high performance through parallel hardware resources. Graphics Processing Units (GPUs), with many efficient parallel cores and high-bandwidth memory structures, enable accelerated computing for high-performance, deep learning, and embedded programs from diverse domains. Developing expertise in GPU programming requires significant effort to utilize parallel computational units efficiently. Teaching programming for heterogeneous systems is also difficult due to dedicated hardware requirements and the need for up-to-date course materials. In this paper, we present our teaching experience in an undergraduate parallel programming course, where we adopt NVIDIA Deep Learning Institute workshop and teaching kit contents and GPU devices at different scales to expose students to a range of hardware platforms with hands-on coding experience.
  • Article
    Demystifying Power and Performance Variations in GPU Systems Through Microarchitectural Analysis
    (Comsis Consortium, 2025) Topcu, Burak; Karabacak, Deniz; Oz, Isil
    Graphics Processing Units (GPUs) provide efficient parallel execution of general-purpose computations in high-performance computing and embedded systems. While performance concerns guide the main optimization efforts, power issues become significant for energy-efficient and sustainable GPU executions. Profilers and simulators report statistics about the target execution; however, they either present only performance metrics at a coarse kernel-function level or lack visualization support that would enable microarchitectural performance analysis or performance-power consumption comparison. Evaluating runtime performance and power consumption dynamically across GPU components enables a comprehensive tradeoff analysis for GPU architects and software developers. In this work, we present a novel memory performance and power monitoring tool for GPU programs, GPPRMon, which performs systematic metric collection and provides useful visualization views to guide power and performance analysis for target executions. Our simulation-based framework dynamically gathers SM- and memory-related microarchitectural metrics by monitoring individual instructions and reports dynamic performance and power values. Our interface presents spatial and temporal views of the execution. While the former shows the performance and power metrics across GPU memory components, the latter shows the corresponding information at instruction granularity on a timeline. We demonstrate performance and power analysis for memory-bound graph applications and resource-critical embedded programs from GPU benchmark suites. Our case studies reveal potential usages of our tool in memory-bound kernel identification, performance bottleneck analysis of a memory-intensive workload, performance-power evaluation of an embedded application, and the impact of input size on the memory structures of an embedded system.
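    As an illustration of the temporal view described above (not GPPRMon's actual code), the following minimal Python sketch bins hypothetical per-instruction power samples into timeline buckets, yielding per-interval, per-component averages; sample tuples, component names, and the bin size are all assumptions for the example:

    ```python
    from collections import defaultdict

    def timeline_view(samples, bin_cycles):
        """Aggregate (cycle, component, power_watts) samples into a
        per-bin, per-component average power -- a simplified stand-in
        for a timeline visualization's underlying data."""
        sums = defaultdict(lambda: [0.0, 0])  # (bin, component) -> [total, count]
        for cycle, component, power in samples:
            key = (cycle // bin_cycles, component)
            sums[key][0] += power
            sums[key][1] += 1
        return {key: total / count for key, (total, count) in sums.items()}

    # Example: three samples for a hypothetical "L2" component, 100-cycle bins.
    samples = [(0, "L2", 2.0), (50, "L2", 4.0), (120, "L2", 6.0)]
    view = timeline_view(samples, bin_cycles=100)
    # Bin 0 averages the first two samples; bin 1 holds the third.
    ```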
  • Conference Object
    Community Detection for Large Graphs on GPUs With Unified Memory
    (Institute of Electrical and Electronics Engineers Inc., 2024) Dincer, Emre; Oz, Isil
    While GPUs accelerate applications from different domains with different characteristics, processing large datasets becomes infeasible on target systems with limited device memory. Unified memory support makes it possible to work with data larger than the available GPU memory. However, the page migration overhead for executions with irregular memory access patterns, such as graph processing workloads, induces severe performance degradation. While memory hints help to deal with page movements by keeping data in suitable memory spaces, coarse-grained configurations still cannot avoid migrations for executions with diverse data structures. In this work, we target the state-of-the-art CUDA implementation of the Louvain community detection algorithm and evaluate the impact of fine-grained unified memory hints on performance. Our experimental evaluation shows that memory hints configured for specific data structures yield significant performance improvements and enable us to work efficiently with large graphs.
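    For context, the Louvain algorithm greedily maximizes Newman modularity, Q = Σ_c [L_c/m − (d_c/(2m))²], where L_c counts intra-community edges, d_c sums member degrees, and m is the total edge count. A minimal Python sketch of that modularity computation (illustrative only; the paper targets a CUDA implementation, and the function and variable names are assumptions):

    ```python
    def modularity(edges, community):
        """Newman modularity Q for an undirected, unweighted graph.

        edges: iterable of (u, v) pairs, each undirected edge listed once
        community: dict mapping node -> community label
        """
        m = len(edges)
        intra = {}   # L_c: number of edges inside community c
        degree = {}  # per-node degree
        for u, v in edges:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
            if community[u] == community[v]:
                c = community[u]
                intra[c] = intra.get(c, 0) + 1
        q = 0.0
        for c in set(community.values()):
            l_c = intra.get(c, 0)
            d_c = sum(deg for n, deg in degree.items() if community[n] == c)
            q += l_c / m - (d_c / (2 * m)) ** 2
        return q

    # Example: two triangles joined by a bridge edge, each triangle a community.
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    comm = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    q = modularity(edges, comm)
    ```

    Louvain itself repeatedly moves nodes between communities whenever a move increases this Q, then contracts communities into super-nodes and repeats.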