## Intro Jemalloc HPA is hugepages aware implementation of pages allocator. HPA leverages hugepages to reduce cost of TLB misses and thereby improve application performance. ## Glossary ### Pageslab Pageslab is hugepage aligned and sized memory range. You can think of it as a set of pages packed together into a hugepage. Pageslab is not necessary backed up by a hugepage. ### Active Page Active page is a memory page that potentially stores application data. ### Dirty Page Dirty page is a memory page that might have had application data on it in the past, but has no application data now. It can be reused (became active again) or returned back to the OS anytime. ### Purge Purge is a process of returning dirty pages back to the OS. ### Hugification Hugification is a request to the OS back pageslab by a hugepage. ## Constants ### `PAGE` `PAGE` is the size of a jemalloc page. By default it is `4096` bytes on x86\_64. ### `HUGEPAGE` `HUGEPAGE` is the size of the OS hugepage. By default it is `2097152` bytes (2 MiB) on x86\_64. ### `HUGEPAGES_PAGES` Number of pages in a single hugepage: `HUGEPAGE / PAGE`. By default it is `512` on x86\_64. ## Documentation HPA is under active development and options are not described in official documentation yet. Below is a brief description of currently available options. ### hpa HPA enabled/disabled. Master switch to enable HPA. This option should be enabled to make other options to work. Boolean. Default value: `false`. ### hpa_slab_max_alloc Maximum allocation size in bytes allowed to be served from HPA page allocator. Allocations of greater size will be served from a more general (for now) classic page allocator (PAC), which can handle allocation requests of any size. Slab allocations will always be served out of HPA, even when the `hpa_slab_max_alloc` option is set to a small value like `PAGE` due to implementation quirks. This implementation quirks can be leveraged to serve out of HPA only small allocations (small in jemalloc definition is allocation less than 16 KiB). Unsigned integer. Default value: `65536` bytes. Minimum value: `PAGE` bytes. Maximum value: `HUGEPAGE` bytes. ### hpa_hugification_threshold Minimum number of active bytes in a pageslab necessary for pageslab to be placed into hugification queue. Pageslab always produced in a non-huge state. Over time, when number of active bytes became greater or equal than `hpa_hugification_threshold`, jemalloc puts pageslab into hugification queue. Unsigned integer. Default value: `0.95 * HUGEPAGE` bytes. Minimum value: `PAGE` bytes. Maximum value: `HUGEPAGE` bytes. ### hpa_hugification_threshold_ratio Minimum percent of active bytes in a pageslab necessary for pageslab to be placed into hugification queue. This option has the same semantic as `hpa_hugification_threshold`, but in percent notation. Fixed-point fractional. Default value: `0.95`. Minimum value: `0`. Maximum value: `1.0`. ### hpa_hugify_delay_ms Time in milliseconds required for pageslab to spent in hugification queue, before jemalloc requests OS to back pageslab by a hugepage. Hugification queue is ordered by timestamp, when pageslab was placed into the queue, with head of the queue being pageslab placed into the queue earliest and tail of the queue being pageslab there latest. When pageslab stops meeting hugification criteria: number of active bytes is less than `hpa_hugification_threshold` it is **not** removed from hugification queue. Only purge can remove pageslab from hugification queue. Unsigned integer. Default value: `10000` milliseconds. ### hpa_hugify_sync Switch to use synchronous hugification requests. Use `madvise(..., MADV_COLLAPSE)` to request OS back up pageslab by a hugepage alongside `madvise(..., MADV_HUGEPAGE)`. Increments `stats.arenas..hpa_shard.nhugify_failures` counter on failure. Usual asynchronous hugification introduces delay of unknown length, between request to OS has been made to hugify a pageslab and OS actually backs up pageslab by a hugepage. This option allows to eliminate this delay. Requires Linux 6.1 or higher. Boolean. Default value: `false`. ### hpa_min_purge_interval_ms Minimum time between two consecutive purge phases in milliseconds. Each `hpa_min_purge_interval_ms` jemalloc will check if purging criteria are met and if they are, it will purge as much pageslabs as needed until purging criteria are no longer met. Minimal unit of purging is pageslab, meaning all dirty pages will be returned back to the OS from chosen pageslab, even if less pages required to be purged to reach purging target. If there are few consecutive dirty pages, one syscall will be issued to purge them together in one go. Unsigned integer. Default value: 5000 milliseconds. ### hpa_peak_demand_window_ms Length of peak demand sliding window in milliseconds. Time component of purging criteria. Jemalloc will track the maximum number of active pages used within `hpa_peak_demand_window_ms` milliseconds sliding window. Jemalloc will purge dirty pages above that peak usage. It is easier to explain in an example. Suppose `ncurrent` is the number of active pages currently in use and `npeak` is the peak (maximum) number of active pages within the last 10 seconds. Then jemalloc is allowed to keep `npeak - ncurrent` dirty pages and will purge the rest of them if there are any. Option `hpa_peak_demand_window_ms` works in combination with `hpa_dirty_mult`. Unsigned integer. Default value: 0 milliseconds (disabled by default). ### hpa_dirty_mult Maximum of dirty to active pages ratio jemalloc is allowed to keep. Ratio based component of purging criteria. Jemalloc is trying to estimate the maximum amount of active memory application might likely need in the near future. It does so by projecting future active memory demand (based on peak active memory usage observed in the past within a sliding window) and adds slack on top of it (an overhead it is reasonable to have in exchange on higher hugepages coverage). When peak demand tracking is off, projection of future active memory is current active memory usage. Estimation is essentially the same as `npeak * (1 + hpa_dirty_mult)`. In case, when `hpa_peak_demand_window_ms` is set to `0`, then `npeak` equals to `ncurrent` and expression became `ncurrent * hpa_dirty_mult`. When `hpa_dirty_mult` is `0`, then the expression becomes just `npeak`. Option `hpa_dirty_mult` works in combination with `hpa_peak_demand_window_ms`. Fixed-point fractional or `-1`. Default value is `0.25` (not a great default). When set to `-1` disables purging completely. ### hpa_sec_nshards Number of small extent cache (SEC) shards. SEC is a cache layer above the HPA page allocator. Requests are distributed across small extent cache shards `[0, nshards - 1)`. If a request can not be served out of SEC, it will be forwarded to the HPA page allocator. I can not say I saw cases when the SEC helped much. Probably, more work is required to make SEC useful. Unsigned integer. Default value: 4 shards. When set to `0` disables small extent cache (SEC). ### hpa_sec_max_alloc Maximum size of allocation in bytes, that can be served out of SEC. Jemalloc will refuse to cache any objects if their size is greater than `hpa_sec_max_alloc` and forward such objects to the HPA page allocator. Unsigned integer. Default value: 32768 bytes. Minimum value: `PAGE` bytes. Maximum value: `32768` bytes. ### hpa_sec_max_bytes Maximum number of bytes small extent cache shard allowed to cache. When shard cached bytes size exceeds `hpa_sec_max_bytes`, jemalloc will flush bins until the number of cached bytes falls below `hpa_sec_bytes_after_flush`. Unsigned integer. Default value: `262144` bytes. Minimum value: `PAGE`. ### hpa_sec_bytes_after_flush Maximum number of bytes SEC is allowed to have after flush caused by exceeding `hpa_sec_max_bytes`. This option should be less than `hpa_sec_max_bytes` for SEC to be useful. Unsigned integer. Default value: `131072` bytes. Minimum value: `PAGE`. ### hpa_sec_batch_fill_extra Number of extra objects to fill on SEC miss. When allocation request can not be satisfied out of SEC, because there are no available ones cached, jemalloc brings `hpa_sec_batch_fill_extra` additional objects to SEC out of HPA page allocator. Unsigned integer. Default value: `0`. Maximum value: `HUGEPAGES_PAGES`. ### experimental_hpa_max_purge_nhp Maximum number of pageslabs to purge on each purging phase. Experimental option that likely will be removed soon. Limits number of pageslab to purge on each purging phase. Signed integer. Default value: `-1` (disabled by default). ## Acknowledgements Thanks to Kevin Svetlitski, whose note introduced me to the HPA world.