High Bandwidth Flash (HBF): SK Hynix Introduces Hybrid Memory Architecture for Improved AI Inference

By Sebastian Gerstl | Translated by AI | 3 min reading time


SK Hynix wants to address the growing demands of memory-intensive AI applications with a new approach to memory architecture. The concept, called H3, combines High Bandwidth Memory (HBM) and High Bandwidth Flash (HBF) on a shared interposer. According to the accompanying IEEE study, simulations show significant efficiency gains.

Left: High Bandwidth Flash (HBF) stacks multiple layers of NAND chips to significantly increase storage capacity; right: concept of the hybrid H3 architecture presented in the IEEE study.
(Image: Sandisk (left) / SK Hynix (right))

In an IEEE paper, SK Hynix describes H3, a hybrid memory concept for AI accelerators. The aim is to match bandwidth and capacity more closely to the requirements of large language models during the inference phase.

The core of the idea is the combination of High Bandwidth Memory (HBM) and High Bandwidth Flash (HBF) on a common interposer next to the GPU. While current designs, such as those based on Nvidia's Blackwell B200, connect only HBM directly, H3 supplements the DRAM stacks with stacked NAND flash offering high parallelism.

HBF as a Capacity Tier Alongside HBM

HBF stacks several 3D NAND dies in an HBM-like package structure. Unlike classic SSD architectures, the concept relies on a highly parallelized sub-array structure with independent read and write channels. This shortens internal data paths and increases the effective I/O parallelism.

Compared to HBM, HBF offers significantly higher capacity, but at the cost of higher access latency and limited write endurance of typically around 100,000 cycles. Its bandwidth is significantly higher than that of NVMe SSDs but remains below that of DRAM.

In H3, the HBM and HBF stacks are connected in cascade. Access takes place via a shared address space; the GPU can use both memory regions as main memory. A prefetch and latency-hiding buffer integrated into the HBM base die is intended to mask the higher NAND latencies.
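The latency-hiding idea can be illustrated with a minimal software sketch: while the consumer works on one block from a small fast buffer, the next block is already being fetched from the slow tier in the background. The function and the block abstraction below are illustrative assumptions, not from the paper:

```python
import queue
import threading

def stream_with_prefetch(read_block, n_blocks, depth=2):
    """Yield blocks 0..n_blocks-1, overlapping slow reads with consumption."""
    buf = queue.Queue(maxsize=depth)  # stand-in for the small HBM-side buffer

    def producer():
        for i in range(n_blocks):
            buf.put(read_block(i))  # slow HBF-style read runs ahead of the consumer
        buf.put(None)               # sentinel: no more blocks

    threading.Thread(target=producer, daemon=True).start()
    while (block := buf.get()) is not None:
        yield block
```

As long as consuming a block takes at least as long as fetching the next one, the slow tier's latency is hidden behind useful work; only the first access pays the full cost.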

Focus on KV Cache in Inference

The concept is driven by the growing memory requirements of large language models during inference. In particular, the key-value cache (KV cache), which holds context information, scales sharply with sequence length and batch size.

Sequences in the million-token range can require caches in the terabyte range. In today's systems, the limited HBM capacity means that data has to be offloaded to local SSDs or additional GPUs have to be added. Both increase latency and energy consumption.
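A rough back-of-the-envelope calculation shows why million-token contexts reach the terabyte range. The model configuration below (80 layers, 8 KV heads with grouped-query attention, head dimension 128, FP16) is an illustrative assumption, not taken from the article:

```python
def kv_cache_bytes(seq_len, batch_size, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2):
    # One K and one V tensor per layer -> factor 2; FP16 -> 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch_size

# Assumed config resembling a large GQA model:
size = kv_cache_bytes(seq_len=10_000_000, batch_size=1,
                      n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{size / 1e12:.2f} TB")  # ~3.28 TB for a single 10M-token sequence
```

Even with aggressive KV-head sharing, a single 10-million-token sequence lands in the multi-terabyte range, far beyond the capacity of today's HBM stacks.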

H3 provides for read-only data, such as model weights or precomputed, shared KV caches, to be stored in HBF, while dynamic data remains in HBM. This relieves HBM of capacity pressure and focuses it on bandwidth-critical operations.
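The tiering rule described above can be sketched as a simple placement policy. The names and the binary decision are illustrative assumptions, not an API from the paper:

```python
from enum import Enum

class Tier(Enum):
    HBM = "hbm"  # low latency, high bandwidth, limited capacity
    HBF = "hbf"  # high capacity, higher latency, limited write endurance

def place(is_read_only: bool, is_shared: bool) -> Tier:
    # Read-mostly data (model weights, precomputed shared KV-cache blocks)
    # goes to HBF; frequently rewritten per-request data stays in HBM,
    # which also protects the NAND from write-endurance wear.
    if is_read_only or is_shared:
        return Tier.HBF
    return Tier.HBM
```

Steering writes away from the flash tier matters twice over here: it keeps HBM bandwidth for latency-critical traffic and spares the NAND its roughly 100,000-cycle endurance budget.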

According to SK Hynix, simulations with eight HBM3E stacks and eight HBF stacks in combination with a Blackwell B200 GPU show an up to 2.69-fold increase in performance per watt compared to HBM-only configurations. With a KV cache of 10 million tokens, the possible batch size increased by a factor of 18.8.

Technical Hurdles and Standardization

The integration of NAND into HBM-related packaging poses considerable challenges. In addition to latency, controller design, wear leveling and the management of block-based addressing are particularly critical. Write performance is also becoming increasingly important for KV cache applications.

The energy requirement per access is also higher than that of HBM. The architecture therefore assumes that workloads are clearly read-intensive or are optimized accordingly by software. Cache-augmented generation is a possible application scenario here.

At the same time, several vendors are pushing standardization forward. Samsung Electronics and SK Hynix are working with SanDisk in a consortium on specifications for HBF. The goal is commercialization starting in 2027.

In the competition for memory-centric inference architectures, H3 thus positions itself as a complement to HBM, not a replacement for it. Whether the concept prevails will depend largely on packaging complexity, cost structure and the software ecosystem. (sg)
