Behnam Pourghassemi

My name is Behnam Pourghassemi and I'm a Ph.D. student in computer engineering at the University of California, Irvine. My research interests lie in high-performance computing (HPC), network systems, and performance analysis. I am interested in designing and developing scalable, low-overhead profiling tools for web services as well as for large-scale, parallel codebases such as web browsers. I also work on HPC and parallel computing, with a focus on GPU computing and on optimizing deep neural networks.


Ph.D. in Computer Engineering, University of California, Irvine

M.S. in Computer Engineering, University of California, Irvine

B.S. in Electrical Engineering, Sharif University of Technology


  • AdPerf: Characterizing the Performance of Third-party Ads
    SIGMETRICS 2021 Details PDF
  • Only Relative Speed Matters: Virtual Causal Profiling
    IFIP Performance 2020 Details PDF Presentation
  • On the Limits of Parallelizing Convolutional Neural Networks on GPUs
  • Scalable Dynamic Analysis of Browsers for Privacy and Performance
    SIGMETRICS PER 2020 Details PDF
  • What-If Analysis of Page Load Time in Web Browsers Using Causal Profiling
    SIGMETRICS 2019 (nominated for best paper award) Details PDF
  • Platform for Concurrent Execution of GPU Operations
    US Patent application number: 16/442,440 (in contract with Samsung Electronics) Details PDF
  • Platform for Concurrent Execution of GPU Operations
    US Patent application number: 16/442,447 (in contract with Samsung Electronics) Details PDF
  • CudaCR: An In-kernel Application-level Checkpoint/restart Scheme for CUDA-enabled GPUs
    CLUSTER 2017 Details PDF
  • Unsteady Navier-Stokes Computations on GPU Architectures
    AIAA 2017 Details PDF

Selected Projects

Leveraging parallelism for non-linear convolutional neural networks on GPUs

GPUs are currently the platform of choice for training neural networks. However, training a deep neural network (DNN) is a time-consuming process even on GPUs because of the massive number of parameters that have to be learned; as a result, accelerating DNN training has been an area of significant research in the last couple of years. While earlier networks such as AlexNet had a linear dependency between layers and operations, more recent networks such as ResNet, PathNet, and GoogLeNet have a non-linear structure that exhibits a higher degree of inter-operation parallelism. However, popular deep learning (DL) frameworks such as TensorFlow and PyTorch launch the majority of neural network operations, especially convolutions, serially on GPUs and do not exploit this inter-op parallelism. Accordingly, we make a case for the need for, and potential benefit of, exploiting this rich parallelism in state-of-the-art non-linear networks to reduce training time. We identify the challenges and limitations in enabling concurrent layer execution on the GPU backends (such as cuDNN) of DL frameworks and propose potential solutions.
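The scheduling idea can be sketched in plain Python. The layer names, dependency graph, and thread pool below are illustrative stand-ins; in a real framework the independent branches would be convolutions launched on separate CUDA streams:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical non-linear network fragment: two convolution branches that
# depend only on the input, then a layer that merges them (as in Inception
# blocks). A purely serial launcher would run conv_a and conv_b one by one.
deps = {"conv_a": [], "conv_b": [], "concat": ["conv_a", "conv_b"]}

def run_layer(name, inputs):
    # Stand-in for launching a real GPU operation; returns a trace string.
    return f"{name}({','.join(inputs)})"

def schedule(deps):
    """Launch layers as soon as all of their dependencies have completed."""
    done, results = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(deps):
            ready = [n for n in deps if n not in done
                     and all(d in done for d in deps[n])]
            futs = {n: pool.submit(run_layer, n,
                                   [results[d] for d in deps[n]])
                    for n in ready}
            for n, f in futs.items():
                results[n] = f.result()
                done.add(n)
    return results

print(schedule(deps)["concat"])  # conv_a and conv_b were submitted together
```

The same topological "ready set" drives stream assignment on a GPU: every layer in `ready` can be placed on its own stream in the same iteration.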

adPerf: Performance Characterization of Web Ads

We perform an in-depth, first-of-its-kind performance evaluation of web ads without using ad blockers. We aim to characterize the cost of every component of an ad so that publishers, ad syndicates, and advertisers can improve ad performance with detailed guidance. For this purpose, we develop an infrastructure, adPerf, for the Chrome browser that classifies page-loading workloads into ad-related and main-content categories at the granularity of browser activities (such as JavaScript and Layout). Our evaluations show that online advertising entails more than 15% of the browser's page-loading workload, and approximately 88% of that is spent on JavaScript. adPerf also tracks the sources and delivery chains of web ads and analyzes performance based on the origin of the ad content. We observe that two well-known third-party ad domains contribute 35% of the ads' performance cost, and, surprisingly, top news websites implicitly include unknown third-party ads that in some cases account for more than 37% of the ads' performance cost.
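A minimal sketch of the attribution step, with a hypothetical ad-domain list and a toy trace (adPerf itself classifies real browser activity traces and follows the full ad delivery chain):

```python
from urllib.parse import urlparse

# Hypothetical ad-domain list for illustration only.
AD_DOMAINS = {"doubleclick.net", "adnxs.com"}

def is_ad(url):
    host = urlparse(url).netloc
    return any(host == d or host.endswith("." + d) for d in AD_DOMAINS)

def attribute(events):
    """Split browser-activity costs (ms) into ad vs. main-content buckets."""
    cost = {"ad": 0.0, "main": 0.0}
    for url, ms in events:
        cost["ad" if is_ad(url) else "main"] += ms
    return cost

trace = [("https://news.example.com/index.js", 120.0),
         ("https://ads.doubleclick.net/pixel.js", 30.0),
         ("https://news.example.com/style.css", 50.0)]
print(attribute(trace))  # the ad bucket is 30 of 200 ms, i.e. 15% here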

COZ+: Causal Performance Analysis of Browsers

We apply a comprehensive, quantitative what-if analysis to the web browser's page-loading process to detect performance bottlenecks. Unlike conventional profiling methods, we apply causal profiling to precisely determine the impact of each computation stage, such as HTML parsing and layout, on page load time (PLT). For this purpose, we develop COZ+, a high-performance causal profiler capable of analyzing large software systems such as the Chromium browser. COZ+ highlights the most influential spots for further optimization, which can be leveraged by browser developers and/or website designers. For instance, COZ+ shows that optimizing JavaScript by 40% is expected to improve the Chromium desktop browser's page-loading performance by more than 8.5% under typical network conditions.
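The what-if question can be illustrated with a toy serial model of page loading (stage names and times are made up). Note the simplification: in a real browser the stages overlap and interleave, which is exactly why a simple sum is wrong and causal profiling, which measures the effect on the live program by inserting delays, is needed:

```python
# Toy serial model: stage -> time in ms (hypothetical numbers).
stages = {"html_parse": 200.0, "javascript": 500.0,
          "layout": 150.0, "paint": 100.0}

def what_if(stages, stage, speedup):
    """Predicted PLT if `stage` ran `speedup` (0..1) faster, assuming
    stages execute strictly one after another."""
    plt = sum(stages.values())
    return plt - speedup * stages[stage]

base = sum(stages.values())
faster = what_if(stages, "javascript", 0.40)
print(f"{100 * (base - faster) / base:.1f}% PLT improvement")
```

In this serial toy the improvement is just the stage's share times the speedup; COZ+ exists because Chromium's concurrency makes the real answer (e.g. 8.5% for a 40% JavaScript speedup) far smaller than such a naive estimate.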

cudaCR: In-kernel Checkpoint/restart for GPU

As large-scale machines shift from peta-scale to exa-scale, their mean time between failures is dropping, so errors may occur while GPU nodes are executing a kernel. Unlike previous frameworks, we design and implement an application-level checkpoint/restart scheme for CUDA applications that can capture the in-kernel state of the GPU and restart from the previous clean state inside the kernel. We inject checkpoint/restart code into the application source code that handles data movement and the memory footprint of threads. This scheme is well suited to compute-intensive, long-running kernels.
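The checkpoint/restart control flow can be sketched in Python (the iterative "kernel", fault injection, and checkpoint interval are all illustrative; cudaCR does the equivalent state copies in device memory inside a CUDA kernel):

```python
import copy

def long_kernel(steps, fault_at=None, interval=3):
    """Iterative 'kernel' that checkpoints every `interval` steps and
    rolls back to the last clean state when a fault is injected."""
    state = {"i": 0, "acc": 0}
    ckpt = copy.deepcopy(state)
    while state["i"] < steps:
        if state["i"] % interval == 0:
            ckpt = copy.deepcopy(state)   # checkpoint the clean state
        if fault_at is not None and state["i"] == fault_at:
            fault_at = None               # the fault strikes once
            state = copy.deepcopy(ckpt)   # restart from the checkpoint
            continue
        state["acc"] += state["i"]        # the actual per-step work
        state["i"] += 1
    return state["acc"]

# The faulty run recomputes a few steps but produces the same result.
print(long_kernel(10), long_kernel(10, fault_at=7))
```

The trade-off the scheme manages is visible even here: a shorter `interval` means less recomputation after a fault but more checkpointing overhead.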

VCoz: Virtual Causal Profiler

Causal profiling is a novel and powerful profiling technique that quantifies the potential impact of optimizing a code segment on the program's execution time. In this project, we first theoretically model and prove causal profiling, the missing piece in the original paper; we then derive the necessary condition for achieving virtual causal profiling on a secondary device. Building on the theory, we design VCoz, a virtual causal profiler that enables profiling applications on target devices by running experiments on a host device. We implement a prototype of VCoz by tuning multiple hardware components to preserve the relative execution speeds of code segments. Our experiments demonstrate that VCoz can generate causal-profiling reports for a Nexus 6P (an ARM-based device) on a host x86 system with less than 16% variance.
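A toy check of the "relative speed" condition with hypothetical per-segment timings: a causal profile taken on the host transfers to the target when every code segment is slowed down by roughly one common factor, so the segments keep the same relative speeds:

```python
# Hypothetical per-segment times (ms) on the two devices.
host   = {"compute": 10.0, "memory": 4.0, "io": 2.0}   # x86 host
target = {"compute": 25.0, "memory": 10.0, "io": 5.0}  # ARM target

def relative_speed_preserved(host, target, tol=0.01):
    """True if all target/host slowdown ratios agree within `tol`."""
    ratios = [target[s] / host[s] for s in host]
    return max(ratios) - min(ratios) <= tol * max(ratios)

print(relative_speed_preserved(host, target))  # one uniform 2.5x factor
```

When the check fails (say the target's memory is disproportionately slow), the host must be tuned, e.g. its memory throttled, before its causal profile says anything about the target; that tuning is what the VCoz prototype does.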

HiPer: Computational Fluid Dynamics Simulation on Heterogeneous System

The HPC Lab started a joint project with the Department of Mechanical and Aerospace Engineering at UCI to improve the performance of their CFD simulator. Our team implemented a scalable, high-performance second-order finite-volume Navier-Stokes simulator on a heterogeneous system. My contribution was to accelerate the GPU stencil computation using geometric multigrid and to simulate different test cases, such as a cylinder channel, for steady/unsteady flows.
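The kernel at the heart of such solvers is a stencil sweep. A minimal 1-D illustration, using a weighted-Jacobi smoothing step for the model problem -u'' = f (not the project's actual 2nd-order finite-volume scheme): geometric multigrid accelerates convergence by running sweeps like this on a hierarchy of coarser grids:

```python
def jacobi_sweep(u, f, h):
    """One Jacobi update of the interior points of -u'' = f on a uniform
    grid with spacing h; boundary values stay fixed (Dirichlet)."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return new

n = 8
h = 1.0 / n
u = [0.0] * (n + 1)   # initial guess with zero boundary values
f = [1.0] * (n + 1)   # constant source term
for _ in range(100):
    u = jacobi_sweep(u, f, h)
print(u[n // 2])      # ≈ 0.125, the midpoint of the exact u(x) = x(1-x)/2
```

On a GPU each interior point updates independently, so one sweep maps directly to one thread per grid point; the data reuse between neighboring points is what stencil-specific optimizations exploit.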

HetroFHE: Fully Homomorphic Encryption on Heterogeneous CPU-GPU System

Fully homomorphic encryption (FHE) is a relatively new asymmetric cryptosystem (the first lattice-based construction was presented in Gentry's Ph.D. thesis in 2009) that carries out computation on ciphertext. In other words, it lets users apply operations to encrypted data and obtain the same result as applying them to the unencrypted data (plaintext). FHE is therefore a good candidate to substitute for existing standard cryptosystems such as AES in cloud computing. However, its operations, such as encryption, key generation, and multiplication, require intensive computation over big integers. We implemented some of these operations on a heterogeneous CPU-GPU system: on the CPU, we used the big-integer library NTL for initialization and encryption/decryption, and on the device side, we used the Chinese Remainder Theorem (CRT) for multiplication and addition.
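The CRT trick can be shown in a few lines of Python (the moduli below are toy values; real implementations pick many machine-word-sized primes). A big integer is represented by its residues modulo coprime primes, arithmetic happens residue-wise, and the result is reconstructed at the end; the residue-wise step is embarrassingly parallel, which is what makes it GPU-friendly:

```python
from math import prod

# Toy coprime moduli; their product bounds the representable integers.
PRIMES = [1000003, 1000033, 1000037, 1000039]
M = prod(PRIMES)

def to_residues(x):
    return [x % p for p in PRIMES]

def mul(xs, ys):
    # Independent per-prime multiplications: one GPU thread per modulus.
    return [(a * b) % p for a, b, p in zip(xs, ys, PRIMES)]

def from_residues(rs):
    """CRT reconstruction via modular inverses (requires Python 3.8+)."""
    x = 0
    for r, p in zip(rs, PRIMES):
        Mi = M // p
        x = (x + r * Mi * pow(Mi, -1, p)) % M
    return x

a, b = 123456789, 987654321
assert from_residues(mul(to_residues(a), to_residues(b))) == a * b
```

Addition works the same way, residue by residue; only the occasional reconstruction (or comparison) needs the full big-integer value.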


  • Programming: C/C++, Java, Verilog, Python, MATLAB
  • Parallel Environment: CUDA, MPI, OpenMP
  • Hardware: GPU, FPGA, x86, ARM
  • Operating System: Linux
  • Software: Visual Studio, CUDA debugger/profiler, Vivado (HLS & design suite), Quartus
  • Miscellaneous: LaTeX, R, Git

Curriculum Vitae

You can download my CV from here.

Contact Me

You can send me your message here.