Memory 1
Introduction
This article organizes how Linux operates from the perspective of memory management, particularly memory pages. I wanted to cover the fundamental aspects of memory comprehensively, so it became quite long, though I ran out of steam toward the end and couldn’t cover everything. Through this article, you should be able to understand the mental model of how page management works.
Preparation
Virtual Memory Space
This is a concept that provides virtual memory beyond the actual physical memory. It is realized through modules such as the kernel, page tables, and MMU. I think one of the major benefits is that a virtual memory space is created for each process.
What is MMU
MMU (Memory Management Unit) is implemented as hardware rather than software because it “cannot exist on the memory it manages itself.” In modern systems, it exists as a circuit within the CPU. The MMU has the function of converting virtual memory addresses to physical memory addresses. Essentially, it achieves efficient memory management through page tables and flag management.
Physical Aspects
To understand performance characteristics, it’s necessary to know the characteristics of hardware and its physical location.
DRAM (Dynamic Random-Access Memory) is used as main memory. In SMP, memory is accessed via the system bus. All DRAM is accessed uniformly. In NUMA, each CPU socket has local memory, and access to remote memory has higher latency. Therefore, NUMA locality is very important.
The CPU cache sits between the CPU and main memory and is much faster to access. An L1 cache hit takes around 1ns, roughly 100 times faster than a main memory access. Note that since physical addresses are used as cache keys, this refers to cases where the TLB hits and the physical address can be found quickly. (The TLB lookup itself is included in the 1ns.)
Since PTE (Page Table Entry) mentioned later can be cached in CPU cache, page tables are involved in cache coherence issues. From here on, I won’t touch on CPU cache, but it’s necessary to know as a related area.
Demo: Learning the Mental Model of MMU
This time, I wrote code to learn the mental model. I grasped how the major module groups coordinate with each other. I didn’t write actually working code, but rather abstracted it to learn the general mechanism.
STEP0
First, I’ll introduce the defined structures.
- PhysicalMemory: Physical memory
- PTE: Page Table Entry. This is composed of 64 bits
- PageTable: Page table. In reality, it only holds the starting physical address and has read and write implemented.
- TLBEntry: TLB entry. Holds the virtual page number and corresponding physical frame number
- TLB: Manages TLBEntry using a queue approach.
- MMU: Has a TLB internally and converts virtual addresses to physical addresses with the translate method.
- Process: Holds the physical address of its own page table.
- Kernel: Manages PhysicalMemory, Process, etc.
- CPU: Has an MMU. Also has a cr3 register.
Using these structures, we understand the relationships between them at an abstracted level.
STEP1: Basics
I implemented the MMU’s translate method as follows.
```rust
/// Virtual address → physical address conversion
fn translate<F>(
    &mut self,
    cr3: usize,
    pmem: &mut PhysicalMemory,
    vaddr: u16,
    mut page_fault_handler: F,
) -> Result<u16, Fault>
where
    F: FnMut(u8, &mut PhysicalMemory, usize) -> u8,
{
    let vpn = (vaddr >> 8) as u8;
    let offset = (vaddr & 0xFF) as u8;

    // 1. TLB lookup
    if let Some(pfn) = self.tlb.lookup(vpn) {
        return Ok(((pfn as u16) << 8) | offset as u16);
    }

    // 2. Page table lookup
    let pt = PageTable { phys_addr: cr3 };
    let pte = pt.read_pte(pmem, vpn);
    let pfn = if !pte_present(pte) {
        // The MMU raises a page fault to the CPU
        let new_pfn = page_fault_handler(vpn, pmem, cr3);
        pt.write_pte(pmem, vpn, new_pfn);
        new_pfn
    } else {
        pte_pfn(pte)
    };

    // 3. Update the TLB
    self.tlb.insert(vpn, pfn);
    Ok(((pfn as u16) << 8) | offset as u16)
}
```
In this code, virtual addresses are composed of 16 bits. The first half represents VPN (Virtual Page Number), and the second half represents the offset. The offset indicates the position within the page.
First, we look in the TLB, which caches page-table translations. On a hit, the entry contains the corresponding PFN (Physical Frame Number), so the actual physical address can be reconstructed from that value. Note that the offset is irrelevant to the TLB; this works because VPN and PFN are defined in terms of a fixed page size.
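To make the split concrete, here is a tiny worked example using the same 8-bit VPN / 8-bit offset layout (the address and the PFN are made-up values):

```rust
fn main() {
    // Hypothetical values: vaddr 0x1234 splits into VPN 0x12 and offset 0x34.
    let vaddr: u16 = 0x1234;
    let vpn = (vaddr >> 8) as u8;      // 0x12
    let offset = (vaddr & 0xFF) as u8; // 0x34

    // Suppose the TLB (or the page table) maps VPN 0x12 → PFN 0x05.
    let pfn: u8 = 0x05;

    // The physical address keeps the offset and swaps the VPN for the PFN.
    let paddr = ((pfn as u16) << 8) | offset as u16;
    assert_eq!(paddr, 0x0534);
    println!("vaddr {vaddr:#06x} → paddr {paddr:#06x}, vpn {vpn:#04x}");
}
```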
Next, if it’s not in the TLB, we access the page table. We can see that the cr3 register stores the physical address of the page table that the currently executing process should use. For example, when a context switch occurs, the value of cr3 is rewritten.
We need to read the PTE at the corresponding location. The target PTE is found by advancing from the table's starting address by the VPN, and since one PTE is 8 bytes (64 bits), we multiply the VPN by 8.
```rust
fn read_pte(&self, pmem: &PhysicalMemory, vpn: u8) -> PTE {
    let addr = self.phys_addr + (vpn as usize) * 8;
    pmem.read_u64(addr)
}
```
After reading, we check the present bit in the PTE. If true, it means there is data corresponding to physical memory, so we get the corresponding PFN. Otherwise, a page fault occurs. In page_fault_handler, new memory is allocated and the corresponding PFN is returned. After that, we write the content to the PTE and return the PFN.
After obtaining a valid PFN in this way, we update the TLB and finally construct and return the actual physical address.
This part is the first thing that should be understood.
STEP2: Multi-level Paging
Next, we introduce multi-level paging. Here we use 8 bits to represent the VPN, but real x86_64 uses 48-bit virtual addresses; with a flat table and 4 KiB pages that would mean 2^36 PTEs (2^48 addresses / 2^12 bytes per page). That is inefficient, so we split the VPN. Here we divide the 8 bits into 4 bits * 2 and adopt a 2-level page table. The number of PTEs we need to prepare up front is then only 2^4 = 16. Each of those entries holds the address of a second-level page table, and since a second-level table only needs to be initialized once we know its entry will actually be used, overall memory is reduced. This kind of design is sometimes called a sparse structure.
Below is the section performing a two-level page walk. For L1, if it’s not present, we allocate the L2 page table and update the PTE. After that, we also perform a present check for L2 and call the page fault handler we saw in the previous step.
```rust
// Helper: split the 8-bit VPN into two 4-bit indices
fn split_vpn(vpn: u8) -> (u8, u8) {
    let l1 = vpn >> 4;   // upper 4 bits
    let l2 = vpn & 0x0F; // lower 4 bits
    (l1, l2)
}

// 2. Page table lookup: start the two-level page walk.
//    Using 2 levels reduces the number of PTEs that must exist up front.
let (l1_index, l2_index) = split_vpn(vpn);

// ---- L1 table (directory) ----
let l1 = PageTable { phys_addr: cr3 };
let pte = l1.read_pte(&kernel.pmem, l1_index);
let (_l1_new_pte, l2_pfn) = if pte_present(pte) {
    (pte, pte_pfn(pte))
} else {
    // The L1 entry doesn't exist yet.
    // As a simplification, the MMU side "arbitrarily allocates a page for the L2 table" here.
    // In a real OS the kernel builds the page table; we do it this way for learning purposes.
    let (_l2_phys, l2_pfn) = kernel.alloc_phys_page();
    // Write the "L2 table location" into the L1 PTE (present=1, RW=1, USER=0 is fine)
    let l1_pte_new = make_pte(l2_pfn, true, true, false, false);
    // Physical address = L1 table start + (entry index × size of one entry)
    //                  = cr3 + (l1_index × 8 bytes (64 bits))
    let addr = l1.phys_addr + (l1_index as usize) * 8;
    kernel.pmem.write_u64(addr, l1_pte_new);
    (l1_pte_new, l2_pfn)
};

// From here, traverse to the actual page.
// Recover the physical address of the L2 page table.
let l2_phys = (l2_pfn as usize) * PAGE_SIZE;
let l2 = PageTable { phys_addr: l2_phys };

// ---- L2 table (PTE pointing to the actual page) ----
let pte = l2.read_pte(&kernel.pmem, l2_index);
// Get the updated L2 PTE and the physical frame number to access
let (l2_new_pte, pfn) = if pte_present(pte) {
    (pte, pte_pfn(pte))
} else {
    println!(
        "page fault on VPN {:02x} (L1 {:x}, L2 {:x})",
        vpn, l1_index, l2_index
    );
    let new_pfn = page_fault_handler(kernel, pid, vpn, pte, vaddr);
    l2.write_pte(&mut kernel.pmem, l2_index, new_pfn);
    let new_pte = l2.read_pte(&kernel.pmem, l2_index);
    (new_pte, new_pfn)
};
```
Comparison of Single-level and Two-level
- PTE size is constant (8 bytes)
- VPN size is also constant (1 byte)
- The number of pages that can be addressed doesn't change (256 either way).
- Single-level
  - VPN layout [ 00000000 ]
  - 2^8 = 256 PTEs must exist up front
- Two-level
  - VPN layout [ 0000 | 0000 ]
  - 2^4 = 16 PTEs per table; only the L1 table must exist up front, L2 tables are allocated on demand
- The difference is that the PTE area that needs to be initialized initially is dramatically smaller.
In actual x86_64
The 48-bit virtual address (Canonical form) is divided as follows:
- PML4 9 bits
- PDPT 9 bits
- PD 9 bits
- PT 9 bits
- Offset 12 bits
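A quick arithmetic check of this layout: each level indexes 9 bits, so every table holds 512 entries of 8 bytes, which is exactly one 4 KiB page.

```rust
fn main() {
    const LEVEL_BITS: u32 = 9;   // PML4 / PDPT / PD / PT each index 9 bits
    const OFFSET_BITS: u32 = 12; // 4 KiB pages
    const ENTRY_SIZE: usize = 8; // one PTE is 8 bytes

    let entries_per_table = 1usize << LEVEL_BITS;     // 512
    let table_size = entries_per_table * ENTRY_SIZE;  // 4096 bytes = one page
    let va_bits = 4 * LEVEL_BITS + OFFSET_BITS;       // 9*4 + 12 = 48

    assert_eq!(table_size, 4096);
    assert_eq!(va_bits, 48);
}
```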
STEP3: Copy On Write
Next, I implemented Copy On Write, a technique to save memory. A use case is when a process forks. In this case, the process state at that point needs to be copied, so memory also needs to be copied, but in reality, this doesn’t happen until a write occurs. This is because if it’s only reads, we can share the same thing. So, what happens at fork time is copying the page table and updating the COW-related flags in the PTE. I haven’t explained the PTE layout until now, but the PTE layout at this point is as follows.
```rust
type PTE = u64;

// Flag definitions
const PTE_PRESENT: u64 = 1 << 0;
const PTE_RW: u64 = 1 << 1;
const PTE_USER: u64 = 1 << 2;
const PTE_COW: u64 = 1 << 3;
const PTE_NX: u64 = 1 << 63;

// The PFN is stored from bit 8 upward
const PFN_SHIFT: u64 = 8;

fn make_pte(pfn: u8, present: bool, rw: bool, user: bool, nx: bool) -> PTE {
    let mut pte = (pfn as u64) << PFN_SHIFT;
    if present {
        pte |= PTE_PRESENT;
    }
    if rw {
        pte |= PTE_RW;
    }
    if user {
        pte |= PTE_USER;
    }
    if nx {
        pte |= PTE_NX;
    }
    pte
}
```
Whether it’s present is of course included here. At fork time, we set RW to false to prohibit writes. We also set the COW flag to true. This indicates that physical memory allocation is needed on the next write.
I said we update the PTE, but the implementation is a bit more complex, so I’ll show it below.
```rust
// Create the child process's page table and update both parent and child page tables appropriately.
fn fork(&mut self, parent_pid: u32, cpu: &mut CPU) -> u32 {
    //----------------------------------------------------------------------
    // (1) Get the parent process's L1 table
    //----------------------------------------------------------------------
    // Get the parent process (Process struct)
    let parent = self.find_process_mut(parent_pid);
    // Treat the physical address of the L1 table used by the parent as a PageTable.
    // This is the top level (first level) of the virtual address space.
    let parent_l1 = PageTable {
        phys_addr: parent.page_table_phys,
    };

    //----------------------------------------------------------------------
    // (2) Allocate a new L1 table for the child process
    //----------------------------------------------------------------------
    // The L1 table fits in one page (256 B = 32 PTE slots; this model uses 16),
    // so allocating a single page is enough.
    let (child_l1_phys, _) = self.alloc_phys_page();
    // Treat it as a PageTable (contents are still empty, nothing filled in yet)
    let child_l1 = PageTable {
        phys_addr: child_l1_phys,
    };

    //----------------------------------------------------------------------
    // (3) Scan all L1 entries (0~15 = 4 bits)
    //
    // L1 being 4 bits → only 16 slots is a simplified model for learning.
    // Real x86_64 uses 9 bits (512 entries).
    //----------------------------------------------------------------------
    for l1_index in 0u8..16 {
        // Read the parent's L1 PTE
        let l1_pte = parent_l1.read_pte(&self.pmem, l1_index);
        // If it doesn't exist in the parent's L1 (= this area is unallocated),
        // it's fine for the child to leave it unallocated too.
        if !pte_present(l1_pte) {
            continue;
        }

        //------------------------------------------------------------------
        // (4) Recover the location of the parent's L2 table from the L1 PTE
        //------------------------------------------------------------------
        // Parent L2 PFN (Physical Frame Number)
        let parent_l2_pfn = pte_pfn(l1_pte);
        // Convert PFN → physical address (1 page = 256 B)
        let parent_l2_phys = (parent_l2_pfn as usize) * PAGE_SIZE;
        // Treat it as the parent's L2 table
        let parent_l2 = PageTable {
            phys_addr: parent_l2_phys,
        };

        //------------------------------------------------------------------
        // (5) Allocate a completely new L2 table for the child side as well
        //
        // This is the point: at fork time we copy the page table itself,
        // but we don't copy data pages (lazy copy).
        //------------------------------------------------------------------
        let (child_l2_phys, child_l2_pfn) = self.alloc_phys_page();

        //------------------------------------------------------------------
        // (6) Write the "child L2 table address" into the child's L1 entry
        //
        // Create the PTE with make_pte():
        //   - present = 1
        //   - RW = 1 (L2 is a directory, so RW is fine)
        //   - USER = 1
        //   - NX = 0
        //------------------------------------------------------------------
        let child_l1_pte = make_pte(
            child_l2_pfn,
            true,  // present
            true,  // RW
            true,  // USER
            false, // NX
        );
        // Write to the L1 slot
        let child_l1_addr = child_l1.phys_addr + (l1_index as usize) * 8;
        self.pmem.write_u64(child_l1_addr, child_l1_pte);

        //------------------------------------------------------------------
        // (7) Scan all L2 entries (0~15).
        //     Here we "COW-ify" the PTEs that point to data pages.
        //------------------------------------------------------------------
        for l2_index in 0u8..16 {
            let pte = parent_l2.read_pte(&self.pmem, l2_index);
            // Skip PTEs that haven't been allocated yet
            if !pte_present(pte) {
                continue;
            }

            //------------------------------------------------------------------
            // (8) Change parent and child PTEs as follows:
            //
            //   Original: PFN = x, RW = 1, USER = 1, COW = 0
            //   New:      PFN = x (still shared)
            //             RW  = 0 (changed to write-prohibit)
            //             COW = 1 (copy on write)
            //
            // We want to detect writes.
            // The essence is: as long as nothing writes, sharing the PFN is fine.
            // Whether parent or child, once a write occurs sharing is no longer
            // possible, so a new memory area is needed.
            // That can be observed as COW behavior in the page fault handler.
            // Here we just set the PTE flags so the next write can be detected.
            //
            // This is the same mechanism as Linux's fork().
            //------------------------------------------------------------------
            let new_pte = (pte & !PTE_RW) | PTE_COW;
            // COW-ify the parent's L2 PTE
            let parent_l2_addr = parent_l2.phys_addr + (l2_index as usize) * 8;
            self.pmem.write_u64(parent_l2_addr, new_pte);
            // COW-ify the child's L2 PTE as well
            let child_l2_addr = child_l2_phys + (l2_index as usize) * 8;
            self.pmem.write_u64(child_l2_addr, new_pte);
        }
    }

    // I realized later that TLB entries should hold the flag information contained in PTEs,
    // and we should flush at this point. If stale entries remain in the TLB, physical memory
    // could be rewritten without the COW protection being noticed.
    if self.current_pid == parent_pid {
        cpu.mmu.flush_tlb();
    }

    //----------------------------------------------------------------------
    // (9) Assign and register the child PID
    //----------------------------------------------------------------------
    let child_pid = self.processes.len() as u32;
    // The child process holds the child's L1 root physical address
    self.processes.push(Process::new(child_pid, child_l1_phys));
    child_pid
}
```
First, we need an L1 page table for the newly created process, so we allocate one. We also need to prepare L2 page tables, but only for the parts that actually exist: we check each PTE in the parent's L1, and if it is present, we fill in the child's L1 PTE and initialize a child L2 page table. Then we update both the parent's and the child's L2 PTEs. At first I didn't understand why the parent-side PTE also has to change, but at the moment of fork the existing memory is shared read-only, so even on the parent side the next write must allocate a new memory area and write the value there. Whether parent or child, as long as no write occurs, reading the shared value is fine. That is why it's called Copy On Write.
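The demo above only marks the PTEs; actually resolving the COW on the next write is the page fault handler's job. Below is a minimal sketch of what that write-fault path could look like in the same toy model; note that copy_page and the exact signature are assumptions made for illustration, not part of the demo's code.

```rust
// Sketch: resolving a COW fault when a write hits a read-only, COW-marked page.
// `kernel.pmem.copy_page` is a hypothetical helper that copies one 256 B frame.
fn handle_cow_write_fault(kernel: &mut Kernel, l2: &PageTable, l2_index: u8, pte: PTE) -> u8 {
    let old_pfn = pte_pfn(pte);

    // 1. Give the writer its own physical frame.
    let (_new_phys, new_pfn) = kernel.alloc_phys_page();

    // 2. Copy the shared page's contents into the new frame.
    kernel.pmem.copy_page(old_pfn, new_pfn);

    // 3. Rewrite the PTE: point at the new frame, restore RW, and drop COW.
    let new_pte = make_pte(new_pfn, true, true, true, false);
    let addr = l2.phys_addr + (l2_index as usize) * 8;
    kernel.pmem.write_u64(addr, new_pte);

    // 4. The caller must flush the stale translation from the TLB.
    new_pfn
}
```

In a real kernel the reference count of the original page would also be checked: if only one mapping remains, the page can simply be made writable again instead of being copied.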
STEP4: filemap
What we’ve seen so far is a type of memory page called Anonymous Page. This is a memory area that a process dynamically allocates and is not tied to a file. However, in actual systems, the ability to map file contents to memory and access them is important.
In file mapping, a specific range of the virtual address space is mapped to file contents. At this time, a structure called VMA (Virtual Memory Area) manages the virtual address range and its attributes (file ID, offset, protection attributes, etc.). There are two types of VMAs: one for anonymous pages and one for file mapping.
| Type | Data Source | Sharing | Writes |
|---|---|---|---|
| Anonymous | Generated within process (initial value zero) | Generally not shared | Written to swap |
| File-backed | Reads file contents | Shared between processes | Writeback when dirty |
Below, to distinguish from anonymous pages, we call the map_file function when allocating file-backed memory areas.
```rust
#[derive(Clone)]
struct Process {
    pid: u32,
    page_table_phys: usize,
    vmas: Vec<(u16, u16, VmaKind)>, // (start_vaddr, end_vaddr, VmaKind)
}

impl Process {
    fn new(pid: u32, page_table_phys: usize) -> Self {
        Self {
            pid,
            page_table_phys,
            vmas: Vec::new(),
        }
    }

    fn map_file(&mut self, start: u16, end: u16, file_id: u32) {
        self.vmas.push((
            start,
            end,
            VmaKind::File {
                file_id,
                start_offset: 0,
            },
        ));
    }
}
```
When accessing a file-mapped area, a page fault occurs initially. At this time, the page fault handler performs the following processing:
- Search for the corresponding VMA and confirm it’s a file mapping
- Check the page cache (Page Cache) from the file ID and offset
- If the corresponding page exists in the page cache (PageFaultFileHit):
- Reuse the existing physical page
- Record that page’s PageKind as File
- If it doesn’t exist in the page cache (PageFaultFileMiss):
- Allocate a new physical page
- Read data from the file at the corresponding offset (in actual implementation)
- Register it in the page cache
- Record PageKind as File
The modified page_fault_handler function is shown below.
```rust
// Determine whether the faulting address is file-backed by consulting the VMAs
let vmas = {
    let process = kernel.find_process_mut(pid);
    process.vmas.clone()
};

for (start, end, kind) in &vmas {
    if *start <= vaddr && vaddr < *end {
        if let VmaKind::File {
            file_id,
            start_offset,
        } = kind
        {
            let file_offset = (*start_offset + (vaddr - start) as u64) / PAGE_SIZE as u64;

            // Page cache hit: reuse the existing physical page
            if let Some(&pfn) = kernel.pagecache.get(&(*file_id, file_offset)) {
                kernel.events.push(Event::PageFaultFileHit {
                    pid,
                    file_id: *file_id,
                    file_offset,
                    pfn,
                });
                return pfn;
            }

            // Page cache miss → allocate a new PFN and register it in the page cache
            let (_phys, pfn) = kernel.alloc_phys_page();
            kernel.pmem.kinds[pfn as usize] = PageKind::File {
                file_id: *file_id,
                offset: file_offset,
            };
            kernel.pagecache.insert((*file_id, file_offset), pfn);
            kernel.events.push(Event::PageFaultFileMiss {
                pid,
                file_id: *file_id,
                file_offset,
                pfn,
            });
            return pfn;
        }
    }
}
```
kernel.pagecache represents the page cache area in the kernel’s memory space and shows “which file’s which offset is on which PFN.”
I think its role is similar to a page table: both are structures in the kernel's memory space for looking up physical locations. The differences are that the page cache is keyed by file ID and offset, and it is not separated per process. This means that even if two processes use different virtual addresses, as long as they refer to the same file region, the page cache resolves them to the same page.
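To make that sharing concrete, here is a minimal standalone sketch (a plain HashMap rather than the demo's Kernel type) of two processes resolving the same (file_id, page offset) key to the same PFN:

```rust
use std::collections::HashMap;

fn main() {
    // Key: (file_id, page index within the file); value: PFN.
    let mut pagecache: HashMap<(u32, u64), u8> = HashMap::new();

    // Process A faults on file 7, page 0: the kernel fills a frame (say PFN 0x20)
    // and records it in the page cache.
    pagecache.insert((7, 0), 0x20);

    // Process B later maps the same file at a completely different virtual address.
    // Its fault handler looks up the same key and reuses the same frame,
    // so both processes' PTEs end up pointing at PFN 0x20.
    let pfn = pagecache.get(&(7, 0)).copied().expect("page cache hit");
    assert_eq!(pfn, 0x20);
}
```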
Key Concepts
I’ve grasped the concepts to some extent, so now I’ll learn concepts that couldn’t be accurately represented or weren’t handled in the above demo.
HugePage
This is a feature that aims to improve performance by making the page size larger than the default 4KB (sizes vary by architecture, such as 2MB or 1GB).
Since the size becomes larger, the number of page table entries decreases in comparison, and TLB cache hit rate increases. The page table itself also becomes smaller, so the memory capacity for page tables decreases. HugePages are effectively protected and only provided to applications that request HugePages. Swap-out and page-out don’t occur.
HugePage support depends on the CPU architecture and kernel. To use them, you use functions like:
- mmap() system call
- shmget() function
There are mainly two types of implementations (a usage sketch follows the list).

- HugeTLB: Requires explicitly reserving memory areas before process startup, and requires application-side support. The pool is configured with a kernel parameter at boot, or at runtime by writing to /proc/sys/vm/nr_hugepages (a sysctl).
- Transparent HugePage (THP): A later development in which the kernel allocates HugePages automatically when certain conditions are met, with no application-side awareness needed. However, when the conditions stop holding, there is a risk of performance degradation from splitting HugePages back into 4KB pages, so THP is often intentionally disabled for databases and the like.
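As a usage sketch for the HugeTLB side, here is a minimal example, assuming the libc crate and that huge pages have already been reserved (e.g. via /proc/sys/vm/nr_hugepages), that requests one 2 MB huge page with mmap():

```rust
use std::ptr;

fn main() {
    let len = 2 * 1024 * 1024; // one 2 MB huge page
    let addr = unsafe {
        libc::mmap(
            ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_HUGETLB,
            -1,
            0,
        )
    };
    if addr == libc::MAP_FAILED {
        // Fails if no huge pages are reserved or the platform doesn't support them.
        eprintln!("mmap(MAP_HUGETLB) failed: {}", std::io::Error::last_os_error());
        return;
    }
    // Touching the mapping is what actually faults the huge page in.
    unsafe {
        ptr::write_bytes(addr as *mut u8, 0, len);
        libc::munmap(addr, len);
    }
}
```

For System V shared memory, shmget() with the SHM_HUGETLB flag plays the equivalent role.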
Memory Allocation
About memory allocation. Since components at many different layers are all called "allocators," I want to organize them here.
Rough understanding
[0] Hardware / Physical memory
[1] Kernel: Page allocator (Buddy)
[2] Kernel: Slab / SLUB / SLOB
[3] User space: libc malloc (ptmalloc / jemalloc / tcmalloc)
[4] Application: new/malloc/free
(1) Buddy Allocator
This layer manages physical memory in page units: with a 4 KiB page size, a page is 4096 bytes of physically contiguous address space. The buddy allocator hands out blocks in units of 2^n × 4 KiB and does not split memory any finer than a single page.
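To make those units concrete, here is a small sketch that prints the buddy block sizes (order 10, i.e. 4 MiB, is the largest block on common Linux configurations):

```rust
fn main() {
    const PAGE_SIZE: usize = 4096; // 4 KiB base page

    // The buddy allocator hands out blocks of 2^order contiguous pages.
    for order in 0..=10 {
        let bytes = PAGE_SIZE << order;
        println!("order {:2}: {:4} pages = {:>5} KiB", order, 1usize << order, bytes / 1024);
    }
    // order 0 = 4 KiB, order 1 = 8 KiB, ... order 10 = 4 MiB
}
```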
(2) Slab Allocator
Uses the Buddy allocator. Manages small objects used by the kernel. Think of it as managing within 1 page.
(3) libc Allocator
- brk/sbrk: a method of extending the heap area; used for small mallocs
- mmap: used when the allocation is large or has alignment requirements
(4) Application Level
For example, in C, developers consciously perform memory allocation and deallocation themselves. In Go and Java (JVM), the runtime allocates appropriately, and deallocation is performed by GC.
Memory Deallocation
There are mainly five methods. I've listed them below in the rough priority order of the reclaim flow driven by kswapd. (The bottom two are not performed by kswapd.)

- Free List: A list of unused pages that can be allocated immediately. Also called idle memory. One free list exists per NUMA node.
- Page Cache: The portion of physical memory mapped when reading files, as handled in the demo code earlier. Also called the filesystem cache. If a page is not dirty there is no need to write it back, so reclaiming it is cheaper than swapping.
- Swapping: Page-out of anonymous pages, i.e. writing them to the swap area.
- Reaping: When a threshold is exceeded, kernel modules and the kernel slab allocators can be instructed to immediately free memory that is easy to give up. Also called shrinking. In other words, this is the stage at which the impact reaches kernel-side caches.
- OOM Killer: As a last resort, it forcibly terminates the process chosen by select_bad_process. It activates when memory is needed but no other means of freeing memory remain.
Note that kswapd runs asynchronously, but when available memory falls below a certain amount, synchronous page deallocation called direct reclaim is performed.
Source: How Katalyst Guarantees Memory QoS for Colocated Applications | CNCF
Memory Metrics and How to Think About Them
Memory Usage
The free command provides information about physical memory and the swap area. (Note that depending on the free version, the exact output definitions may change.)

- total: literally the total size.
- free: memory that is completely unused. It is desirable for this to decrease over time; if it stays large, memory is being wasted.
- shared: shared memory in use, mainly by tmpfs.
- buff/cache: memory used by the buffer cache and page cache.
- available: the effectively usable memory, i.e. free plus releasable areas such as buff/cache.
- used: memory considered in use; in recent versions of free this is calculated as total minus free minus buff/cache.
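The numbers that free reports ultimately come from /proc/meminfo; a minimal sketch that pulls out the relevant fields:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // MemAvailable is the kernel's own estimate of memory that can be handed out
    // without swapping (roughly free + reclaimable cache); free(1) shows it as "available".
    let meminfo = fs::read_to_string("/proc/meminfo")?;
    for line in meminfo.lines() {
        for key in ["MemTotal:", "MemFree:", "MemAvailable:", "Buffers:", "Cached:", "Shmem:"] {
            if line.starts_with(key) {
                println!("{line}");
            }
        }
    }
    Ok(())
}
```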
Classification of pages in virtual memory
1. Unallocated
   - Virtual address space has not been allocated by mmap/brk.
2. VMA exists but unmapped (no page fault yet)
   - Unaccessed area immediately after malloc
   - Unaccessed portion of a file-backed mmap
   - Physical memory has not been allocated yet.
3. Mapped to physical memory (resident)
   - State where physical pages have been allocated by page fault.
   - Counted in RSS.
4. Evicted to swap (swapped-out)
   - Only anonymous memory is targeted.
   - Physical pages have been reclaimed; swap-in is needed when accessed again.
VSS (Virtual Set Size) is the sum of all virtual address ranges (VMAs) a process has allocated (categories 2+3+4 above) and is unrelated to physical memory. RSS (Resident Set Size) is the sum of the pages that actually reside in physical memory (category 3).
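These two values can be checked per process in /proc/<pid>/status, where VmSize roughly corresponds to VSS and VmRSS to RSS; a minimal sketch for the current process:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // VmSize ≈ VSS (all mapped virtual ranges), VmRSS = resident pages only.
    let status = fs::read_to_string("/proc/self/status")?;
    for line in status.lines() {
        if line.starts_with("VmSize:") || line.starts_with("VmRSS:") {
            println!("{line}");
        }
    }
    Ok(())
}
```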
sar -B
Normally memory gets used up by caches, but when free memory runs low, page scanning starts. So sustained non-zero values of the following metrics (for 10 seconds or more) are a sign of memory pressure:

- pgscank/s: number of pages scanned by the kswapd daemon per second
- pgscand/s: number of pages scanned directly per second
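On recent kernels, the cumulative counters behind these fields live in /proc/vmstat (sar reports their per-second deltas); a minimal sketch:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // pgscan_kswapd corresponds to pgscank/s, pgscan_direct to pgscand/s.
    let vmstat = fs::read_to_string("/proc/vmstat")?;
    for line in vmstat.lines() {
        if line.starts_with("pgscan_kswapd") || line.starts_with("pgscan_direct") {
            println!("{line}");
        }
    }
    Ok(())
}
```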
/proc/pressure/memory
Pressure appears to be recorded here by a function called psi_memstall_enter(), which is called just before processing that might stall on memory; when that processing ends, recording stops via psi_memstall_leave(). psi_memstall_enter() is called from memory reclaim and related paths, for example:

- blkcg_maybe_throttle_blkg()
- submit_bio()
- wait_on_page_bit_common()
- mem_cgroup_handle_over_high()
- __alloc_pages_direct_compact()
- __perform_reclaim()
- try_to_free_mem_cgroup_pages()
- kcompactd() ... kcompactd
- balance_pgdat() ... kswapd
So what is a "stall", in other words, what does it mean for processing to be kept waiting on memory?
The kernel recovers physical memory in the event of a shortage by page reclamation and/or compaction. Both methods are implemented in a similar fashion. As the amount of free memory falls below the low threshold (watermark), memory pages are reclaimed asynchronously via kswapd or compacted via kcompactd. If the free memory continues to fall below a minimum watermark, any allocation request is forced to perform reclamation/compaction synchronously before it can be fulfilled. The latter synchronous method is referred to as the “direct” path and is considerably slower owing to being stalled waiting for memory to be reclaimed. The corresponding stall in the caller results in a non-deterministic increased latency for the operation it is performing and is typically perceived as an impact on performance.
When free memory falls below the minimum watermark, reclamation/compaction is performed synchronously. That is the nature of a stall: because the processing is synchronous, the caller is forced to wait.
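The pressure file can also be read directly to see how much stall time has accumulated: the "some" line means at least one task was stalled on memory, while "full" means all non-idle tasks were. A minimal sketch:

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Each line looks like: some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
    // avg* are the percentage of time stalled; total is cumulative stall time in µs.
    for line in fs::read_to_string("/proc/pressure/memory")?.lines() {
        println!("{line}");
    }
    Ok(())
}
```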
vmstat si, so
The number of swap-ins and swap-outs per second. In environments where no swap is configured (as on many AWS EC2 instances), these should always be 0.