Dmitry Olshansky
2 min readJun 24, 2023


Reserve huge mmap-ed arena. Reserve total machine memory + chunk size. Adjust starting address to be multiple of chunk size.

Let chunk be 2mb huge page.

Reserve pool table:

TotalMem / 2Mb of uints, 0. — empty chunk, 1 — pool #0, etc.

(2Tb heap requires 4mb reserve)

Global state:

  • slice of memory arena (head immutable)
  • pointer to pool table (head immutable)
  • pool table freelist
  • pointer to each size class chunks scan/noscan
  • pointer to past last chunk in the arena, bump allocated when a given size class exhausted and no recycled chunks
  • circular doubly-linked list of thread-caches
  • spinlock on thread-cache list

Small size classes — 16, 32, 48, 64, … 2048, 2560, 3072, 3580, 4096.

Sustained pattern — free-list on top of mmaped region.

2Mb chunks, each chunk is split into blocks of 64 objects in small pools.

Metablock for small chunk (64 bytes for metadata):

  1. size class
  2. size class reciprocal multiplier
  3. scan/noscan attribute of this pool
  4. counter of blocks with some free objects
  5. markbits pointer
  6. freebits pointer
  7. appendable bits pointer

Per thread cache is assortment of TLABs — small for each size class. Intrusive list next/prev pointers.

Freebits are used as 64bit ulongs for each block in small pool.

Small TLAB:

  • pointer to block of objects, the block treated as marked during collection
  • word of freebits
  • last allocated offset in word, shift mask

Searching freebits TLABs is obtained by CASing 0 to freebits table. For small segregated pool use:

Option 1:

if(!bits) goto find_block;

ptr = block + size*tzcnt(bits);

bits = bits & (bits-1);

return ptr;

Option 2:

// initialization

end = block + 64*size;

mask = 1;

// the common path

If(!bits) goto find_block;

for (ptr<end;ptr+= size) {

. if (bits & mask) {

. bits ^= mask;

. mask <<=1;

. ptr += size;

. return ptr;

. }

. mask <<= 1;

. ptr += size;



  1. Atomic load for counter of objects, if less then certain threshold may go and find a better pool
  2. bit-scan for first non-zero byte starting at random offset, wrap around once ar the end. Round down to ulong then try CAS. If we did stale reads during bitscan, CAS will fail. On CAS failure continue from 1.
  3. On CAS success atomic subtract object count


  1. Linear scan of metadata for size class in allocated range (atomically load ranges from global state)
  2. If found with a free counter > X, pick that pool
  3. Failing that, acquire spinlock
  4. Scan metadata from last loaded offset to current value (typically nothing), if found proper pool, release lock
  5. Pick recycled pool from free-list or bump pointer
  6. Prepare metadata pool variables except size class, release lock
  7. Allocate bits
  8. Atomically set size class of pool

Large pool may span multiple chunks;

  1. Size class = max small+1
  2. Pool slice (ptr+length)
  3. Freelists pointer (4, 8, 12, 16, …, 128kb, beyond)
  4. Markbits per page
  5. Freebits per page
  6. Appendable bits per page

Huge pool indicates single huge allocation:

  1. Size class size_t.max
  2. Pool slice (ptr+length)
  3. Free / mark / appendable flag