Zehra Sura

contact information

Research Staff Member
Thomas J. Watson Research Center, Yorktown Heights, NY USA
+1-914-945-1653

profile


I am interested in compiler technology, computer architecture, and programming models.

My work has focused on parallel computing, multithreading, and memory access optimizations for emerging multicore systems, including heterogeneous and accelerator-based systems. I have worked on several high performance computing architectures, including systems with GPUs, in-memory processors, the BlueGene/Q system, and the Cell Broadband Engine.


PREVIOUS RESEARCH PROJECTS

Compiler for the Active Memory Cube (AMC) Processing-in-Memory System

The AMC system is a heterogeneous system design that integrates in-memory processors in a logic layer within 3D DRAM. The in-memory processors were custom-designed for high performance and power efficiency. Several architectural features made the AMC challenging to compile for: multiple dimensions of parallelism that must be exploited, an exposed pipeline, non-conventional register files, no caches, and a software-managed instruction buffer. The AMC compiler was able to match or beat hand-optimized code for several targeted applications.


Performance Acceleration of a Single Thread Using Fine-grained Parallelism

This work defined an execution model using groups of cores, where each group has a primary core and some associated secondary cores that collaborate to speed up execution of sequential code. Cores within a group have dedicated queues for low-latency transfer of values between them. Compiler analyses and transformations were developed to automatically derive fine-grained parallel code from sequential code in order to target such groups of cores. The execution model was simulated to evaluate the tradeoff between speedup and transfer latency; an average speedup of 2.05x was observed using groups of 4 cores.
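
The decoupling can be pictured with ordinary Java threads standing in for a core group: a secondary thread produces values and streams them through a dedicated queue to the primary thread, which preserves the sequential semantics of the original loop. This is a minimal illustrative sketch, with invented names and a blocking queue standing in for the hardware value queues:

    import java.util.concurrent.ArrayBlockingQueue;

    public class CorePairDemo {
        public static void main(String[] args) throws InterruptedException {
            int n = 1_000_000;
            double[] a = new double[n];
            ArrayBlockingQueue<Double> q = new ArrayBlockingQueue<>(1024);

            // "Secondary core": computes an expensive value per iteration and
            // streams it to the primary core through a dedicated queue.
            Thread secondary = new Thread(() -> {
                try {
                    for (int i = 0; i < n; i++) q.put(Math.sqrt(i));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            secondary.start();

            // "Primary core": consumes values in order, preserving the
            // semantics of the original single-threaded loop.
            for (int i = 0; i < n; i++) a[i] = q.take() + i;
            secondary.join();
            System.out.println(a[n - 1]);
        }
    }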


Dynamic Optimization with the XL Open Framework (XLOF)

XLOF is an Eclipse plugin, written in Java, that allows users to reuse and customize transformations implemented in the IBM XL compiler. XLOF provides users with the capability to both view and modify code at intermediate stages of the compilation process. The XLOF framework was used to implement an online optimization pass that dynamically profiles, analyzes, recompiles, and patches executing code to improve data prefetching.


Assist Threads for Software Data Prefetching

This work explored data prefetching using separate hardware threads (assist threads) that run asynchronously alongside the main application thread. Data prefetching brings data into a processor's cache ahead of time, reducing the number of memory stall cycles and improving performance. Compiler transformations were used to automatically generate code for the assist threads and to synchronize their execution with the application thread.
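
Java has no prefetch intrinsic, so the closest plain-Java analogue is a helper thread that runs a bounded distance ahead of the main thread and touches data to pull it into the cache, loosely synchronized through a shared progress counter. The sketch below is illustrative only (names invented); a real assist thread would be compiler-generated and bound to a hardware thread context:

    import java.util.concurrent.atomic.AtomicInteger;

    public class AssistThreadDemo {
        static volatile long sink; // keeps the assist thread's loads from being optimized away

        public static void main(String[] args) throws InterruptedException {
            int n = 1 << 22, ahead = 4096;
            long[] data = new long[n];
            for (int i = 0; i < n; i++) data[i] = i;
            AtomicInteger progress = new AtomicInteger();

            // Assist thread: stays at most 'ahead' elements in front of the
            // main thread, touching one element per cache line to warm the cache.
            Thread assist = new Thread(() -> {
                long s = 0;
                for (int i = 0; i < n; i += 8) {
                    while (i > progress.get() + ahead) Thread.onSpinWait();
                    s += data[i];
                }
                sink = s;
            });
            assist.start();

            // Main application thread: does the real computation and cheaply
            // publishes its progress for the assist thread to track.
            long sum = 0;
            for (int i = 0; i < n; i++) {
                sum += data[i] * 3;
                progress.lazySet(i);
            }
            assist.join();
            System.out.println(sum);
        }
    }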


OpenMP Compiler for the Cell Broadband Engine (BE) Architecture

This work developed the first single-source compiler for the Cell BE architecture. The Cell BE, used in Sony PlayStation 3 systems, has multiple heterogeneous cores on a single chip. The compiler automatically handled the complexity of multiple ISAs and multiple levels of available parallelism (SIMD, multithreading, multiple cores, and cores with heterogeneous capabilities). Static buffers were used to optimize data transfers between cores and to overlap computation with communication, and buffer sizes were tuned to the limited local memory available.
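
The computation-communication overlap can be sketched in plain Java, with System.arraycopy standing in for the Cell's asynchronous DMA transfers, a single background thread playing the DMA engine, and two fixed-size arrays modeling the static buffers. All names are illustrative:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class DoubleBufferDemo {
        public static void main(String[] args) throws Exception {
            int chunk = 4096;
            double[] global = new double[64 * chunk]; // data in "main memory"
            java.util.Arrays.fill(global, 1.0);
            double[][] local = { new double[chunk], new double[chunk] }; // static buffers
            ExecutorService dma = Executors.newSingleThreadExecutor(); // stand-in DMA engine

            // Start fetching the first chunk.
            Future<?> pending = dma.submit(() -> System.arraycopy(global, 0, local[0], 0, chunk));
            double sum = 0;
            for (int c = 0; c < global.length / chunk; c++) {
                pending.get(); // wait until the current chunk has arrived
                int cur = c & 1;
                if ((c + 1) * chunk < global.length) {
                    // Kick off the transfer of the next chunk into the other
                    // buffer while this chunk is being processed.
                    final int next = c + 1;
                    pending = dma.submit(() ->
                        System.arraycopy(global, next * chunk, local[next & 1], 0, chunk));
                }
                for (double v : local[cur]) sum += v; // compute on the current buffer
            }
            dma.shutdown();
            System.out.println(sum); // 262144.0
        }
    }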

This work also explored automatically configuring a software cache to maximize its performance on a per-application basis. Compiler analysis was used to estimate data access properties, including reuse distance (the amount of data accessed before the next use of a data item) and the amount of spatially and temporally co-located data.


Analysis of Inter-thread Dependences for Parallel Execution

This work used escape analysis, synchronization analysis, and delay set analysis to determine inter-thread dependences in a parallel program. Novel analysis algorithms were designed to be efficient enough for incremental, just-in-time compilation. The analyses were used in a software implementation of memory consistency to determine the ordering constraints imposed by the consistency model. This work demonstrated the feasibility of providing sequential consistency in a Java virtual machine with tolerable performance degradation (10% slowdown on average on an Intel Xeon system).
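
A classic illustration of the orderings at stake: without enforced ordering, the two threads below can both read 0, an outcome sequential consistency forbids. Declaring the fields volatile approximates the fences a consistency-enforcing compiler inserts at the conflicting accesses that delay set analysis identifies (the example itself is a standard textbook case, not taken from the original work):

    public class DekkerDemo {
        static volatile int x = 0, y = 0; // conflicting shared accesses found by the analyses
        static int r1, r2;

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> { x = 1; r1 = y; });
            Thread t2 = new Thread(() -> { y = 1; r2 = x; });
            t1.start(); t2.start();
            t1.join();  t2.join();
            // With the volatile ordering in place, (r1, r2) == (0, 0) is
            // impossible, matching what sequential consistency requires.
            System.out.println("r1=" + r1 + " r2=" + r2);
        }
    }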


Pointer Analysis Extended for Array Elements

A novel pointer analysis algorithm was developed to improve precision when analyzing array elements in Java. The analysis was used to automatically parallelize numerical codes and to generate optimized executable code.
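
The precision issue shows up in code like the following: an analysis that collapses all elements of an array into one abstract location must assume rows[i] and rows[j] may refer to the same object and serialize the loop, while element-wise precision establishes that the iterations are independent. The example is illustrative, not from the original work:

    public class RowInitDemo {
        public static void main(String[] args) {
            double[][] rows = new double[8][];
            // Each element points to a distinct array; an element-insensitive
            // points-to analysis cannot see this and must assume aliasing.
            for (int i = 0; i < rows.length; i++) rows[i] = new double[1024];

            // Safe to parallelize once rows[i] != rows[j] is established.
            java.util.stream.IntStream.range(0, rows.length).parallel()
                .forEach(i -> java.util.Arrays.fill(rows[i], i));

            System.out.println(rows[3][0]); // 3.0
        }
    }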


Performance of Numerical Java Codes

This work identified a set of kernel templates commonly used in numerical computing. A static compiler was used to expose (via code transformations) and recognize code sections conforming to a kernel template, and to mark them for specific optimization. This technique enabled a Java virtual machine runtime interpreter to deliver performance comparable to pre-compiled code for compute-intensive kernels.
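
As an illustration, a DAXPY-style loop is the kind of shape such a kernel template might capture; once a code section is matched against a template, the runtime can dispatch it to a specialized, pre-optimized implementation. The template choice and names below are assumptions for illustration:

    public class DaxpyKernel {
        // A loop in exactly the shape a DAXPY template would recognize:
        // y[i] += a * x[i] over conformable arrays.
        static void daxpy(double a, double[] x, double[] y) {
            for (int i = 0; i < y.length; i++) y[i] += a * x[i];
        }

        public static void main(String[] args) {
            double[] x = new double[1024], y = new double[1024];
            java.util.Arrays.fill(x, 2.0);
            daxpy(3.0, x, y);
            System.out.println(y[0]); // 6.0
        }
    }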