Posts by Collection

portfolio

publications

Power, Energy and Thermal Considerations in SSD-Based I/O Acceleration

Published in USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage), 2014

Solid State Disks (SSDs) have risen to prominence as an I/O accelerator with low power consumption and high energy efficiency. In this paper, we question some common assumptions regarding SSDs’ operating temperature, dynamic power, and energy consumption through extensive empirical analysis. We examine three different real high-end SSDs that respectively employ multiple channels, cores, and flash chips. Our evaluations reveal that the dynamic power consumption of a many-resource SSD is, on average, 5x and 4x worse than that of an enterprise-scale SSD and HDD, respectively…

CoDEN: A Hardware/Software CoDesign Emulation Platform for SSD-Accelerated Near Data Processing

Published in IEEE Non-Volatile Memory Systems and Applications Symposium (NVMSA), 2015

Over the past few decades, solid state disks (SSDs) have significantly revamped their internal system architecture by employing more compute resources, multiple data channels, and tens or hundreds of non-volatile memory (NVM) packages. These ample internal resources in turn enable modern SSDs to accelerate near data processing. While prior simulation-based work uncovered potential benefits of offloading computation from a host to the SSDs, its analytical models make several assumptions that ignore not only detailed…

OpenNVM: An Open-Sourced FPGA-based NVM Controller for Low Level Memory Characterization

Published in IEEE International Conference on Computer Design (ICCD), 2015

In this paper, we present OpenNVM, an open-sourced, highly configurable FPGA-based evaluation/characterization platform for various NVM technologies. Through OpenNVM, this work reveals important low-level NVM characteristics, including i) static and dynamic latency disparity, ii) error rate variation, iii) power consumption behavior, and iv) the interrelationship between frequency and NVM operational current. In addition, we also examine state-of-the-art write-once-memory (WOM) codes on a real NVM device and study diverse system-level performance impacts based on our findings…

NVMMU: A Non-Volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures

Published in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2015

In this work, NVMMU unifies two discrete software stacks (one for the SSD and the other for the GPU) in two major ways. A new interface provided by NVMMU directly forwards file data between the GPU runtime library and the I/O runtime library, and it supports non-volatile direct memory access (NDMA) that pairs the GPU and SSD devices via physically shared system memory blocks. This unification in turn can eliminate unnecessary user/kernel-mode switching, improve memory management, and remove data copy overheads…

Integrating 3D Resistive Memory Cache into GPGPU for Energy-Efficient Data Processing

Published in International Conference on Parallel Architectures and Compilation Techniques (PACT), 2015

In this work, we redesign the shared last-level cache (LLC) of GPU devices by introducing non-volatile memory (NVM), which can address the cache thrashing issues with low energy consumption. Specifically, we investigate two architectural approaches: one employs a 2D planar resistive random-access memory (RRAM) as our baseline NVM-cache, and the other employs a 3D-stacked RRAM technology. Our baseline NVM-cache replaces the SRAM-based L2 cache with RRAM of similar area size…

DUANG: Fast and Lightweight Page Migration in Asymmetric Memory Systems

Published in IEEE Symposium on High Performance Computer Architecture (HPCA), 2016

In this paper, we propose a novel resistive memory architecture sharing a set of row buffers between a pair of neighboring banks. It enables two attractive techniques: (1) migrating memory pages between slow and fast banks with little performance overhead and (2) adaptively allocating more row buffers to busier banks based on memory access patterns…

An In-Depth Study of Next Generation Interface for Emerging Non-Volatile Memories

Published in IEEE Non-Volatile Memory Systems and Applications Symposium (NVMSA), 2016

Non-Volatile Memory Express (NVMe) is designed with the goal of unlocking the potential of low-latency, random-access, memory-based storage devices. Specifically, NVMe employs various rich communication and queuing mechanisms that can ideally schedule four billion I/O instructions for a single storage device. To explore NVMe with assorted user scenarios, we model diverse interface-level design parameters such as PCI Express, NVMe protocol, and different rich queuing mechanisms by considering a wide spectrum of host-level system configurations. In this work, we also assemble a comprehensive memory stack with different types of emerging NVM technologies, which can give us detailed NVMe-related statistics like I/O request lifespans and I/O thread-related parallelism…
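
The "four billion" figure in the abstract above can be reproduced with simple arithmetic. The sketch below is only an illustration: it assumes the commonly cited NVMe limits of 65,535 I/O queues with up to 65,536 entries each. Those two constants and the function name are assumptions added here, not values stated in the abstract.

```python
# Assumed limits (commonly cited for NVMe, not stated in the abstract):
# up to 65,535 I/O queues, each with up to 65,536 (64K) entries.
MAX_IO_QUEUES = 65_535
MAX_QUEUE_ENTRIES = 65_536


def max_queue_slots(queues: int = MAX_IO_QUEUES,
                    entries: int = MAX_QUEUE_ENTRIES) -> int:
    """Upper bound on I/O queue slots that one device can expose."""
    return queues * entries


if __name__ == "__main__":
    total = max_queue_slots()
    # Prints the slot count and its size in billions.
    print(f"{total:,} queue slots (~{total / 1e9:.1f} billion)")
```

Under these assumptions the bound is roughly 2^32, which is consistent with the "four billion" wording.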

ROSS: A Design of Read-Oriented STT-MRAM Storage for Energy-Efficient Non-Uniform Cache Architecture

Published in USENIX Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW), 2016

In this paper, we propose a hybrid non-uniform cache architecture (NUCA) by employing STT-MRAM as a read-oriented on-chip storage. The key observation here is that many cache lines in LLC are only touched by read operations without any further write updates. These cache lines, referred to as singular-writes, can be internally migrated from SRAM to STT-MRAM in our hybrid NUCA. Our approach can significantly improve the system performance by avoiding many cache read misses with the larger STT-MRAM cache blocks, while it maintains the cache lines requiring write updates in the SRAM cache…

Couture: Tailoring STT-MRAM for Persistent Main Memory

Published in USENIX Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW), 2016

In this work, we present Couture – a main memory design using tailored STT-MRAM that can offer a storage density comparable to DRAM and high performance with low power consumption. In addition, we propose an intelligent data scrubbing method (iScrub) to ensure data integrity with minimum overhead…

SimpleSSD: Modeling Solid State Drive for Holistic System Simulation

Published in IEEE Computer Architecture Letters (CAL), 2017

Existing solid state drive (SSD) simulators unfortunately lack hardware and/or software architecture models. Consequently, they are far from capturing the critical features of contemporary SSD devices. More importantly, while the performance of modern systems that adopt SSDs can vary based on their numerous internal design parameters and storage-level configurations, a full system simulation with traditional SSD models often requires unreasonably long runtimes and excessive computational resources. In this work, we propose SimpleSSD, a high-fidelity simulator that models all detailed characteristics of hardware and software…

Enabling Realistic Logical Device Interface and Driver for NVM Express Enabled Full System Simulations

Published in IFIP International Conference on Network and Parallel Computing (NPC) and Invited for International Journal of Parallel Programming (IJPP), 2017

In this work, we implement an NVMe disk and controller to enable a realistic storage stack of next generation interfaces and integrate them into gem5 and a high-fidelity solid state disk simulation model. We verify the functionalities of NVMe that we implemented, using a standard user-level tool, called NVMe command line interface…

An In-depth Performance Analysis of Many-Integrated Core for Communication Efficient Heterogeneous Computing

Published in IFIP International Conference on Network and Parallel Computing (NPC), 2017

Many-integrated core (MIC) architecture combines dozens of reduced x86 cores onto a single chip to offer high degrees of parallelism. The parallel user applications executed across many cores that exist in one or more MICs require a series of work related to data sharing and synchronization with the host. In this work, we build a real CPU+MIC heterogeneous cluster and analyze its performance behaviors by examining different communication methods such as message passing method and remote direct memory accesses…

Understanding System Characteristics of Online Erasure Coding on Scalable, Distributed and Large-Scale SSD Array Systems

Published in IEEE International Symposium on Workload Characterization (IISWC), 2017

Large-scale systems with arrays of solid state disks (SSDs) have become increasingly common in many computing segments. To make such systems resilient, we can adopt erasure coding such as Reed-Solomon (RS) code as an alternative to replication because erasure coding can offer a significantly lower storage cost than replication. To understand the impact of using erasure coding on system performance and other system aspects such as CPU utilization and network traffic, we build a storage cluster consisting of approximately one hundred processor cores with more than fifty high-performance SSDs…
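
To make the storage-cost comparison above concrete, here is a minimal sketch that computes the raw-capacity overhead of replication and of a Reed-Solomon code with `k` data chunks and `m` parity chunks. The function names and the example parameters (3-way replication, RS(6, 3)) are illustrative assumptions and are not taken from the paper.

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per byte of user data with N-way replication."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return float(copies)


def rs_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data with RS(k, m):
    k data chunks plus m parity chunks, tolerating up to m lost chunks."""
    if k < 1 or m < 0:
        raise ValueError("k must be >= 1 and m must be >= 0")
    return (k + m) / k


if __name__ == "__main__":
    # Illustrative parameters only.
    print("3-way replication:", replication_overhead(3))  # 3.0x raw capacity
    print("RS(6, 3):", rs_overhead(6, 3))                 # 1.5x raw capacity
```

In this example the erasure-coded layout needs half the raw capacity of 3-way replication while tolerating three lost chunks instead of two lost copies. The price is the encoding and reconstruction work whose CPU and network impact the study measures.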

TraceTracker: Hardware/Software Co-Evaluation for Large-Scale I/O Workload Reconstruction

Published in IEEE International Symposium on Workload Characterization (IISWC), 2017

Block traces are widely used for system studies, model verifications, and design analyses in both industry and academia. While such traces include detailed block access patterns, existing trace-driven research unfortunately often fails to find true-north due to a lack of runtime contexts such as user idle periods and system delays, which are fundamentally linked to the characteristics of target storage hardware. In this work, we propose TraceTracker, a novel hardware/software co-evaluation method that allows users to reuse a broad range of the existing block traces by keeping most of their execution contexts and user scenarios while adjusting them with new system information…

ReveNAND: A Fast-Drift Aware Resilient 3D NAND Flash Design

Published in ACM Transactions on Architecture and Code Optimization (TACO), 2018

In this work, we first present an elastic read reference (VRef) scheme (ERR) for reducing such errors in ReveNAND, our fast-drift aware 3D NAND design. To address the inherent limitation of the adaptive VRef, we introduce a new intra-block page organization (hitch-hike) that can enable stronger error correction for the error-prone pages. In addition, we propose a novel reinforcement-learning-based smart data refill scheme (iRefill) to counter the impact of fast-drift with minimum performance and hardware overhead. Finally, we present the first analytic model to characterize fast-drift and evaluate its system-level impact…

CIAO: Cache Interference-Aware Throughput-Oriented Architecture and Scheduling for GPUs

Published in 32nd IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2018

A modern GPU aims to simultaneously execute more warps for higher Thread-Level Parallelism (TLP) and performance. When generating many memory requests, however, warps contend for limited cache space and thrash the cache, which in turn severely degrades performance. To reduce such cache thrashing, we may adopt cache locality-aware warp scheduling, which gives higher execution priority to warps with higher potential of data locality. However, we observe that warps with high potential of data locality often incur far more cache thrashing or interference than warps with low potential of data locality…

FlashAbacus: A Self-governing Flash-based Accelerator for Low-power Systems

Published in The European Conference on Computer Systems (EuroSys), 2018

Energy efficiency and computing flexibility are some of the primary design constraints of heterogeneous computing. In this paper, we present FlashAbacus, a data-processing accelerator that self-governs heterogeneous kernel executions and data storage accesses by integrating many flash modules in lightweight multiprocessors. The proposed accelerator can simultaneously process data from different applications with diverse types of operational functions, and it allows multiple kernels to directly access flash without the assistance of a host-level file system or an I/O runtime library…

FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs

Published in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018

In this paper, we propose FlashShare to help ultra-low-latency (ULL) SSDs satisfy different levels of I/O service latency requirements for different co-running applications. Specifically, FlashShare is a holistic cross-stack approach, which can significantly reduce I/O interference among co-running applications at a server without any change in applications. At the kernel-level, we extend the data structures of the storage stack to pass attributes of (co-running) applications through all the layers of the underlying storage stack spanning from the OS kernel to the SSD firmware…

Amber: Enabling Precise Full-System Simulation with Detailed Modeling of All SSD Resources

Published in The 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018

SSDs have become a major storage component in modern memory hierarchies, and SSD research demands simulation-based studies that integrate SSD subsystems into a full-system environment. However, several challenges exist in modeling SSDs under full-system simulation: SSDs are built on their own complete system and architecture, which employs all necessary hardware, such as CPUs, DRAM, and an interconnect network. Employing these hardware components, SSDs also need to have…

FUSE: Fusing STT-MRAM into GPUs to Alleviate Off-Chip Memory Access Overheads

Published in International Symposium on High Performance Computer Architecture (HPCA), 2019

In this work, we propose FUSE, a novel GPU cache system that integrates spin-transfer torque magnetic random-access memory (STT-MRAM) into the on-chip L1D cache. FUSE can minimize the number of outgoing memory accesses over the interconnection network of GPU multiprocessors, which in turn can considerably improve the level of massive computing parallelism in GPUs…

FlashGPU: Placing New Flash Next to GPU Cores

Published in The 56th Design Automation Conference (DAC), 2019

We propose FlashGPU, a new GPU architecture that tightly blends new flash (Z-NAND) with massive GPU cores. Specifically, we replace global memory with Z-NAND that exhibits ultra-low latency. We also architect a flash core to manage request dispatches and address translations underneath L2 cache banks of GPU cores…

Maximizing GPU Cache Utilization with Adjustable Cache Line Management

Published in Korea Computer Congress (KCC), 2019

Executing irregular applications on general-purpose graphics processing units (GPGPUs) poses serious challenges to their cache system. This paper proposes JUSTIT, an adjustable cache line management design that maximizes the GPU L1D cache utilization by being aware of the memory request access granularity…

Exploring Fault-Tolerant Erasure Codes for Scalable All-Flash Array Clusters

Published in IEEE Transactions on Parallel and Distributed Systems (TPDS), 2019

To understand the impact of using erasure coding on the system performance and other system aspects such as CPU utilization and network traffic, we build a storage cluster that consists of approximately 100 processor cores with more than 50 high-performance solid-state drives (SSDs), and evaluate the cluster with a popular open-source distributed parallel file system, called Ceph…

Faster than Flash: An In-Depth Study of System Challenges for Emerging Ultra-Low Latency SSDs

Published in IEEE International Symposium on Workload Characterization (IISWC), 2019

In this work, we comprehensively perform empirical evaluations with 800GB ULL SSD prototypes and characterize ULL behaviors by considering a wide range of I/O path parameters, such as different queues and access patterns. We then analyze the efficiencies and challenges of the polled-mode and hybrid polling I/O completion methods (added into Linux kernels 4.4 and 4.10, respectively) and compare them with the efficiencies of a conventional interrupt-based I/O path…

DRAM-less: Hardware Acceleration of Data Processing with New Memory

Published in International Symposium on High Performance Computer Architecture (HPCA), 2020

In this work, we propose DRAM-less, a hardware automation approach that precisely integrates many state-of-the-art phase change memory (PRAM) modules into its data processing network to dramatically reduce unnecessary data copies with a minimum of software modifications. We implement a new memory controller that connects a real 3x nm multi-partition PRAM to 28nm technology FPGA logic cells and integrate its design into a real PCIe accelerator emulation platform…

Scalable Parallel Flash Firmware for Many-core Architectures

Published in USENIX Conference on File and Storage Technologies (FAST), 2020

We propose DeepFlash, a novel manycore-based storage platform that can process more than a million I/O requests in a second (1MIOPS) while hiding long latencies imposed by its internal flash media. Inspired by a parallel data analysis system, we design the firmware based on a many-to-many threading model that can be scaled horizontally. The proposed DeepFlash can extract the maximum performance of the underlying flash memory complex by concurrently executing multiple firmware components across many cores within the device…
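
A back-of-the-envelope sketch shows why sustaining 1MIOPS over slow flash requires massive parallelism. By Little's law, the number of requests that must be in flight equals throughput multiplied by latency. The 100 µs flash latency and 4 KiB request size below are illustrative assumptions, not figures from the paper, and the function names are hypothetical.

```python
def required_in_flight(target_iops: int, latency_us: int) -> float:
    """Little's law: concurrency = throughput x latency.
    Latency is given in microseconds to keep the arithmetic exact."""
    return target_iops * latency_us / 1_000_000


def bandwidth_bytes_per_s(target_iops: int, request_bytes: int) -> int:
    """Data rate implied by a given IOPS target and request size."""
    return target_iops * request_bytes


if __name__ == "__main__":
    iops = 1_000_000    # the 1MIOPS target named in the abstract
    latency_us = 100    # assumed flash access latency (illustrative)
    req_bytes = 4096    # assumed 4 KiB requests (illustrative)
    print("requests in flight:", required_in_flight(iops, latency_us))
    print("bytes per second:", bandwidth_bytes_per_s(iops, req_bytes))
```

Under these assumptions, about 100 requests must be outstanding at every instant and the device must move roughly 4.1 GB/s. That is the kind of pressure that motivates spreading firmware work across many cores.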

FastDrain: Removing Page Victimization Overheads in NVMe Storage Stack

Published in IEEE Computer Architecture Letters (CAL), 2020

Host-side page victimizations can easily overflow the SSD internal buffer, which interferes with the I/O services of diverse user applications, thereby degrading user-level experiences. To address this, we propose FastDrain, a co-design of OS kernel and flash firmware to avoid the buffer overflow caused by page victimizations. Specifically, FastDrain can detect a triggering point where a near-future page victimization introduces an overflow of the SSD internal buffer…

ZnG: Architecting GPU Multi-Processors with New Flash for Scalable Data Analysis

Published in International Symposium on Computer Architecture (ISCA), 2020

We propose ZnG, a new GPU-SSD integrated architecture, which can maximize the memory capacity in a GPU and address performance penalties imposed by an SSD. Specifically, ZnG replaces all GPU internal DRAMs with an ultra-low-latency SSD to maximize the GPU memory capacity. ZnG further removes the performance bottleneck of the SSD by replacing its flash channels with a high-throughput flash network and integrating SSD firmware in the GPU's MMU to reap the benefits of hardware accelerations…

Revamping Storage Class Memory With Hardware Automated Memory-Over-Storage Solution

Published in International Symposium on Computer Architecture (ISCA), 2021

HAMS aggregates the capacity of NVDIMM and ultra-low latency flash archives (ULL-Flash) into a single large memory space, which can be used as a working memory expansion or persistent memory expansion, in an OS-transparent manner. To make HAMS more energy-efficient and reliable, we propose an “advanced HAMS” which removes unnecessary data transfers between NVDIMM and ULL-Flash after optimizing the datapath and hardware modules of HAMS…

Ohm-GPU: Integrating New Optical Network and Heterogeneous Memory into GPU Multi-Processors

Published in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2021

We propose Ohm-GPU, a new optical network based heterogeneous memory design for GPUs. Specifically, Ohm-GPU can expand the memory capacity by combining a set of high-density 3D XPoint and DRAM modules as heterogeneous memory. To prevent memory channels from throttling the throughput of the GPU memory system, Ohm-GPU replaces the electrical lanes in the traditional memory channel with a high-performance optical network…

BeaconGNN: Large-Scale GNN Acceleration with Out-of-Order Streaming In-Storage Computing

Published in IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024

We propose BeaconGNN, an in-storage computing (ISC) design for GNN that supports both large-scale graph structures and feature tables. First, it utilizes a novel graph format to enable out-of-order GNN neighbor sampling, improving flash resource utilization. Second, it deploys near-data processing engines across multiple levels of the flash hierarchy (i.e., controller, channel, and die)…

LearnedFTL: A Learning-based Page-level FTL for Reducing Double Reads in Flash-based SSDs

Published in IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024

We present LearnedFTL, a new on-demand page-level flash translation layer (FTL) design, which employs learned indexes to improve the address translation efficiency of flash-based SSDs. The first of its kind, it reduces the number of double reads induced by address translation in random read accesses. LearnedFTL proposes three key techniques…
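
The idea of a "double read" can be illustrated with a toy model. When a logical-to-physical mapping is not available on-device, the SSD first reads a translation page from flash and then reads the data page. A learned model that predicts the physical page directly can skip the first read. The sketch below is a deliberately simplified illustration with a single linear model and a hypothetical `ToyLearnedFTL` class. It is not the design proposed in the paper, and it omits the three key techniques the abstract mentions.

```python
class ToyLearnedFTL:
    """Toy model: one linear predictor ppn = base_ppn + (lpn - base_lpn).

    `table` stands in for translation pages that live in flash.
    `exact` records the LPNs whose prediction is known to be correct.
    """

    def __init__(self, mapping):
        self.table = dict(mapping)
        self.base_lpn = min(self.table)
        self.base_ppn = self.table[self.base_lpn]
        self.exact = {lpn for lpn, ppn in self.table.items()
                      if self._predict(lpn) == ppn}
        self.flash_reads = 0

    def _predict(self, lpn):
        # Assumes physically contiguous placement starting at base_ppn.
        return self.base_ppn + (lpn - self.base_lpn)

    def read(self, lpn):
        """Return the physical page number and count flash reads."""
        if lpn in self.exact:
            self.flash_reads += 1      # data page only
            return self._predict(lpn)
        self.flash_reads += 2          # translation page + data page
        return self.table[lpn]


if __name__ == "__main__":
    # LPNs 0..2 are laid out contiguously; LPN 3 was relocated elsewhere.
    ftl = ToyLearnedFTL({0: 100, 1: 101, 2: 102, 3: 500})
    print(ftl.read(1), ftl.flash_reads)  # model hit: single read
    print(ftl.read(3), ftl.flash_reads)  # model miss: double read
```

In this toy, every lookup the model covers costs one flash read instead of two. A real FTL would additionally have to retrain or invalidate the model as garbage collection and updates move pages.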

Midas Touch: Invalid-Data Assisted Reliability and Performance Boost for 3D High-Density Flash

Published in IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024

This work proposes invalid-data assisted strategies for performance and reliability boosting of valid data in 3D QLC-based flash storage systems. We first propose a high-efficiency re-programming (RP) scheme to reprogram the valid data and a high-reliability not-programming (NP) scheme to program data on the partially-invalid WLs…

StreamPIM: Streaming Matrix Computation in Racetrack Memory

Published in IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2024

We propose StreamPIM, a new processing-in-RM architecture, which tightly couples the memory core and the computation units. Specifically, StreamPIM directly constructs a matrix processor from domain-wall nanowires without the usage of CMOS-based computation units. It also designs a domain-wall nanowire-based bus, which can eliminate electromagnetic conversion…

Achieving Near-Zero Read Retry for 3D NAND Flash Memory

Published in ACM Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024

We characterize different types of real flash chips, based on which we further develop models for the correlation among the optimal read offsets of read voltages required for reading each page. By leveraging characterization observations and the models, we propose a methodology to generate a tailored read-retry table (RRT) for each flash model…

Flagger: Cooperative Acceleration for Large-Scale Cross-Silo Federated Learning Aggregation

Published in IEEE/ACM International Symposium on Computer Architecture (ISCA), 2024

We propose Flagger, an efficient and high-performance federated learning (FL) aggregator. Flagger meticulously integrates the data processing unit (DPU) with computational storage drives (CSD), employing these two distinct near-data processing (NDP) accelerators as a holistic architecture to collaboratively enhance FL aggregation…

talks

FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs

Published:

A modern datacenter server aims to achieve high energy efficiency by co-running multiple applications. Some of these applications (e.g., web search) are latency sensitive. Therefore, they require low-latency I/O services to respond quickly to requests from clients. However, we observe that simply replacing the storage devices of servers with Ultra-Low-Latency (ULL) SSDs does not notably reduce the latency of I/O services, especially when co-running multiple applications. In this paper, we propose FlashShare to help ULL SSDs satisfy different levels of I/O service latency requirements for different co-running applications. Specifically, FlashShare is a holistic cross-stack approach, which can significantly reduce I/O interference among co-running applications at a server without any change in applications. At the kernel-level, we extend the data structures of the storage stack to pass attributes of (co-running) applications through all the layers of the underlying storage stack spanning from the OS kernel to the SSD firmware. For given attributes, the block layer and NVMe driver of FlashShare manage the I/O scheduler and interrupt handler of NVMe differently. We also enhance the NVMe controller and cache layer at the SSD firmware-level, by dynamically partitioning DRAM in the ULL SSD and adjusting its caching strategies to meet diverse user requirements. The evaluation results demonstrate that FlashShare can shorten the average and 99th-percentile turnaround response times of co-running applications by 22% and 31%, respectively.

Bridging the New Memory Technologies with Computer Systems

Published:

Memory and storage systems have recently experienced significant technology shifts. Such technology advances have motivated researchers to rethink and redesign the existing system organization and hardware architecture. This talk mainly shares our research experience of building up the connection between the existing computer hardware/software and the new memory/storage technologies, including the emerging non-volatile memory and the new types of NAND flash media. For example, the emerging Z-NAND flash exhibits a high memory density, retains a long lifetime, and reduces the flash access latency from hundreds of microseconds to only a few microseconds, which makes it a promising candidate to replace the traditional DRAM. I will first introduce our research on the system-level characterization of the Z-NAND flash. I will then explain our research efforts to make the existing server storage stack (from kernel to firmware) adaptive to the new flash technology. In addition, I will show how the new flash technology can change the current computer architecture to satisfy various demands.

teaching

IIT 6036 Computer Organization and Design: teaching assistant

Teaching assistant, Graduate course, School of Integrated Technology, Yonsei University, 2015

This course covers architecture fundamentals. Students are expected to have some hardware and computer science background. In this course, we provide two or three simple projects, each leveraging a different style of simulation model built for an educational purpose. One of the goals behind these projects is for students to learn i) how to use full system simulation software and ii) how to perform simulation-based architectural studies, which in turn can be a good stepping stone for their future research. The simulation framework builds on most 32-bit and 64-bit flavors of UNIX and Windows NT-based operating systems. Although the projects will be relatively simple (compared to what an advanced computer architecture course usually deals with), we expect these projects to help students develop the ability to freely analyze and modify software models written in C/C++.

IIT 3002 Operating Systems: teaching assistant

Teaching assistant, Undergraduate course, School of Integrated Technology, Yonsei University, 2015

The purpose of this course is to teach the general concepts and principles behind operating systems. The topics covered in this class include i) kernel and process abstractions and programming, ii) scheduling and synchronization, iii) memory management and address translation, iv) caching and virtual memory, v) file systems, storage devices, files and reliability, and vi) full and para-virtualization. In addition to these lectures, we also provided term projects, which use an operating system simulator/emulator built for an educational purpose. In these projects, we expect that the students can not only learn Linux practices but also make great strides in their studies of operating systems design and implementation. The projects are written in C/C++, rather than Java or Python. We believe that these projects can provide students with a more realistic experience in operating systems. In this course, all homework is treated as an individual assignment, whereas projects are treated as group assignments. Because it is typically difficult to figure out the contribution of each member, submissions for these projects were made through one git repository per group (e.g., Bitbucket), and I checked all push and pull transactions to grade a team.

IIT 7024 Advanced System Architecture

Teaching assistant, Graduate course, School of Integrated Technology, Yonsei University, 2016

This course mainly introduced computer organization and design, including the following topics: i) instruction-level parallelism, including parallel processing, superscalar, VLIW, static instruction scheduling, dynamic scheduling, and precise exception handling, ii) memory-level parallelism, iii) data-level parallelism, including multi-core architecture and GPU, iv) thread-level parallelism, and v) NVM-level parallelism. This course is project-centric; I prepared five gem5 lab projects. Most projects are step-by-step tutorials that teach students how to do simulation-based architectural explorations and studies. They include CPU design analysis, exploring different branch predictors, multi-threading on full-system mode evaluations, and SSD internal parallelism analysis on gem5. For undergraduate students, the course also included quick review lectures covering instruction set architecture, MIPS/RISC architecture, pipelining, hazards, and cache architecture.