Memory 2
Introduction
This article covers advanced content, building on Chapter 3 "Let's Implement Memory Management" of Build and Learn: How OS Works I (hereinafter, the OS book). In the previous article we covered the basics of memory management; this time we look at how it is actually implemented.
Byte-level Allocator
- After receiving memory information from UEFI, we call exit_from_efi_boot_services to leave the UEFI world, then set up memory allocation.
- This manages what is called the kernel heap.
pub fn init_basic_runtime(
    image_handle: EfiHandle,
    efi_system_table: &EfiSystemTable,
) -> MemoryMapHolder {
    let mut memory_map = MemoryMapHolder::new();
    exit_from_efi_boot_services(image_handle, efi_system_table, &mut memory_map);
    ALLOCATOR.init_with_mmap(&memory_map);
    memory_map
}
- The global_allocator attribute allows us to replace the memory allocator with a custom implementation.
#[global_allocator]
pub static ALLOCATOR: FirstFitAllocator = FirstFitAllocator {
    first_header: RefCell::new(None),
};
- What this ALLOCATOR is: it holds available physical memory areas in a LinkedList.
- We implement the alloc method. This method returns an address.
- How to allocate initially:
- Get areas with the CONVENTIONAL_MEMORY attribute from the memory map
- Build a LinkedList based on that
- The LinkedList node looks like this. Note that the node itself also consumes capacity.
struct Header {
    next_header: Option<Box<Header>>,
    size: usize,
    is_allocated: bool,
    _reserved: usize,
}
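To make the first-fit idea concrete, here is a simplified, self-contained sketch. Note this is not the actual wasabi implementation (which threads Header nodes through the free memory itself); it models the free list as a Vec of regions purely for illustration.

```rust
/// A contiguous free region of memory (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Region {
    addr: usize,
    size: usize,
}

/// First-fit: scan the free list and return the first address that can hold
/// `size` bytes at the requested `align`ment (a power of two), shrinking or
/// removing the region that was used. Alignment padding at the front of a
/// region is simply discarded here; a real allocator would keep it.
fn first_fit(free: &mut Vec<Region>, size: usize, align: usize) -> Option<usize> {
    for i in 0..free.len() {
        let r = free[i];
        // Round the region start up to the requested alignment.
        let start = (r.addr + align - 1) & !(align - 1);
        let pad = start - r.addr;
        if r.size >= pad + size {
            let remaining = r.size - pad - size;
            if remaining > 0 {
                // Keep the tail of the region on the free list.
                free[i] = Region { addr: start + size, size: remaining };
            } else {
                free.remove(i);
            }
            return Some(start);
        }
    }
    None // no region fits
}
```

This mirrors how alloc satisfies a Layout(size, alignment) request: walk the list in order and take the first region that fits.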
Paging
The paging mechanism (where the OS instructs the CPU about address translation) differs by architecture, so we’ll focus on x86_64.
It has evolved through 16-bit mode, 32-bit mode, and 64-bit mode. In 64-bit mode, paging is mandatory. By the time UEFI hands control to OS code, the CPU is already in 64-bit mode and paging itself is already enabled.
About Hierarchical Page Tables
- From a speed perspective: a page table is essentially an array that uses (part of) the virtual address as an index and lists the physical addresses of the pages it translates to
- From a capacity perspective: on x86_64, each page-table structure is 4 KiB in size and alignment. Roughly speaking, each structure is an array holding the physical addresses of the next-level page tables (or pages). Since an address is 64 bits, i.e. 8 bytes, one page table is a list of 512 pointers to the next level
The conceptual parts are as introduced in the previous article.
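Since each level's index is 9 bits (512 entries × 8 bytes = 4 KiB per table), splitting a virtual address into its indices is just bit shifting. A small sketch (a standalone helper, not code from the book):

```rust
/// Split an x86_64 virtual address into its 4-level page-table indices and
/// page offset. Each index is 9 bits wide (512 entries per table).
fn pt_indices(vaddr: u64) -> (usize, usize, usize, usize, usize) {
    let l4 = ((vaddr >> 39) & 0x1ff) as usize; // PML4 index
    let l3 = ((vaddr >> 30) & 0x1ff) as usize; // PDPT index
    let l2 = ((vaddr >> 21) & 0x1ff) as usize; // page directory index
    let l1 = ((vaddr >> 12) & 0x1ff) as usize; // page table index
    let offset = (vaddr & 0xfff) as usize;     // offset within a 4 KiB page
    (l4, l3, l2, l1, offset)
}
```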
Actual Verification
- Since UEFI provides paging functionality, we can verify how it’s implemented from the memory state.
- As we also implemented in the previous article, we reference the page table from the cr3 register.
pub type RootPageTable = [u8; 4096]; // one page-table structure is 4 KiB

pub fn read_cr3() -> *mut RootPageTable {
    let mut cr3: *mut RootPageTable;
    // Read CR3, which holds the physical address of the root page table.
    unsafe { asm!("mov rax, cr3", out("rax") cr3) }
    cr3
}
let cr3 = wasabi::x86::read_cr3();
println!("cr3 = {cr3:#p}");
hexdump(unsafe { &*cr3 });
- The output I verified locally in this way is as follows.
In this run, the value of cr3 is 0xbfc01000, so the 4 KiB of bytes starting at this address are the data of the root page table. A page table records, in 8-byte entries, pointers to the next-level page tables along with their attributes (readable/writable, etc.).
cr3 = 0x00000000bfc01000
00000000: 23 20 C0 BF 00 00 00 00 00 00 00 00 00 00 00 00 |# ..............|
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
(snip)
(16 bytes per line)
- In the implementation from the previous article, we also made PTEs 64 bits (8 bytes).
- The first entry is "23 20 C0 BF 00 00 00 00"
23 20 C0 BF 00 00 00 00
=> 0x00000000_BFC02023 // Interpret as little-endian
=> 0xBFC02000 | 0x023 // Split off the lower 12 bits (the flags)
=> 0xBFC02000 | 0b0010_0011 // Flags in binary
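The same decoding can be done mechanically. This is a throwaway helper (not code from the book) that interprets 8 raw dump bytes as a little-endian entry and splits it:

```rust
/// Interpret 8 raw bytes from the dump as a little-endian page-table entry,
/// returning (entry, next-level physical address, 12 flag bits).
fn decode_entry(bytes: [u8; 8]) -> (u64, u64, u64) {
    let entry = u64::from_le_bytes(bytes);
    let addr = entry & !0xfff; // upper bits: physical address of next table
    let flags = entry & 0xfff; // lower 12 bits: attribute flags
    (entry, addr, flags)
}
```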
- The format definition is documented in the Intel specification, p140.
- Looking at "PML4E: present", the fields from bit 0 upward are:
- 1 (Present)
- R/W
- U/S
- PWT
- PCD
- A
- Ign
- Rsvd
- According to the OS book, these are:
- P: Whether the content is valid
- W: Controls write access
- U: Controls access from user mode
- T: Controls write cache behavior
- C: Whether cache is enabled
- A: CPU sets this to 1 when this entry is used for memory access
- X: Ignored
- 0: Fixed to 0
- In this example, the P, W, and A flags are set.
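As a sanity check, the flag value 0x23 can be decoded against the bit labels above. This helper (made up for illustration) uses the OS book's single-letter names:

```rust
/// Name the set bits among the low 8 flag bits, using the OS book's
/// single-letter labels (P, W, U, T, C, A, X, 0 from bit 0 upward).
fn flag_names(flags: u64) -> String {
    const NAMES: [&str; 8] = ["P", "W", "U", "T", "C", "A", "X", "0"];
    (0..8usize)
        .filter(|i| (flags >> i) & 1 == 1)
        .map(|i| NAMES[i])
        .collect()
}
```

For 0x23 = 0b0010_0011 this yields "PWA": present, writable, and accessed.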
- Since the structure is hierarchical, we follow the addresses down.
cr3 = 0x00000000bfc01000
(snip)
Some(L4Table @ 0x00000000bfc01000 {
entry[ 0] = L4Entry @ 0x00000000bfc01000 { 0x00000000BFC02023 PWS }
}
(snip)
Some(L3Table @ 0x00000000bfc02000 {
entry[ 0] = L3Entry @ 0x00000000bfc02000 { 0x00000000BFC03023 PWS }
(snip)
Some(L2Table @ 0x00000000bfc03000 {
entry[ 0] = L2Entry @ 0x00000000bfc03000 { 0x00000000000000E3 PWS }
(snip)
None
- The lower 12 bits are flags, and bits 49 and above are not used (we treat bits 12-48 as the address).
- L4 -> L3: 0000BFC02
- L3 -> L2: 0000BFC03
- L2 -> ??: 000000000
- Here we can see that L2 points to 000000000.
- Looking at the specification, in L2 or L3 entries, whether bit 7 is 1 or 0 changes the interpretation of the address part.
- Looking at the flags 0E3, we can see that bit 7 is 1.
- The rightmost two 0s of 000000000 are ignored, and the remaining 0000000 is the address of a 2 MB page frame.
- In other words, a HugePage is being used.
For example, virtual address 0 uses index 0 of the L4 table, index 0 of the L3 table, and index 0 of the L2 table, so it translates to physical address 0.
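The final step of that translation can be checked in isolation. Given the L2 entry value 0xE3 from the dump, a hypothetical helper for the 2 MiB huge-page case (bit 7 = PS set) would compute:

```rust
/// Translate `vaddr` through a single L2 entry in the 2 MiB huge-page case.
/// Hypothetical helper for illustration, not the book's code.
fn translate_2mib(l2e: u64, vaddr: u64) -> Option<u64> {
    if l2e & 1 == 0 {
        return None; // P = 0: entry not present
    }
    if l2e & (1 << 7) == 0 {
        return None; // PS = 0: entry points to an L1 table, not a page
    }
    let frame = l2e & 0x000f_ffff_ffe0_0000; // 2 MiB-aligned frame address
    Some(frame | (vaddr & 0x1f_ffff))        // low 21 bits pass through
}
```

With l2e = 0xE3 the frame address is 0, so virtual address 0 maps to physical address 0, matching the dump.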
Interrupt Handling and Exception Handling
This is content we didn't cover much in the previous article; we had only roughly implemented page_fault_handler. Having interrupt handling in place also makes it easier to verify paging behavior.
- When an unexpected situation occurs during program execution
  - The CPU transfers control to the interrupt handler (interrupt in the broad sense)
- Synchronous interrupt
  - Occurs as a result of CPU instruction execution
- Asynchronous interrupt (interrupt in the narrow sense)
  - Caused by external devices
- A number is assigned to each type of interrupt (the interrupt number)
On x86_64, the data structure that lists the settings for each interrupt number is called the Interrupt Descriptor Table (IDT).
After the OS builds the IDT (an array of IdtDescriptor) in memory, it must tell the CPU where the IDT is. The instruction used for this is LIDT (Load Interrupt Descriptor Table). By passing it a pointer to an IdtrParameters structure describing the IDT's base address and size, the CPU will from then on refer to this IDT when handling exceptions and interrupts.
One more thing must be set up before the IDT: the GDT (Global Descriptor Table).
Setting aside the historical flow, understand it as follows:
- Global Descriptor Table
- Historically important, but currently we need to tell the CPU about the TSS descriptor, so we store it.
- In 32-bit mode (protected mode), logical addresses were managed with segment registers + offset
Each GDT entry stores a base address, a limit (size), access permissions (read-only, writable, executable, etc.), and a privilege level. There are six segment registers in total (CS, DS, ES, FS, GS, SS).
- Task State Segment
- A structure for specifying the “safe kernel stack” that the CPU uses when exceptions occur
- Born when task switching was introduced as a CPU feature in 32-bit mode (protected mode)
- In 64-bit mode, it’s only used for stack switching when interrupts occur
- Stack pointer (RSP)
- IST (Interrupt Stack Table)
The IST is an array inside the TSS that can hold seven stack pointers, numbered 1 through 7.
- Interrupt Descriptor Table
- A table for telling the CPU which function to jump to from the exception number
- Works on the premise of GDT and TSS
- Order:
- If the IDT entry’s IST is not 0, use IST[n]
- If IST specification is 0, TSS.RSP0 is used
- For Ring0 → Ring0 exceptions, no stack switching
- Once the stack is determined, the CPU pushes register groups onto that stack
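The selection order above can be summarized as a small decision function. This is conceptual only (the CPU performs this in hardware on interrupt entry), and the names are made up for illustration:

```rust
/// Which stack the CPU switches to when an interrupt arrives, following the
/// order described above. `ist` holds the seven IST pointers, which the IDT
/// entry references by index 1..=7 (0 means "unused").
fn select_stack(
    ist_index: u8,    // IST field of the IDT entry (0 = unused)
    ist: &[u64; 7],   // TSS.IST1 ..= TSS.IST7
    rsp0: u64,        // TSS.RSP0
    current_rsp: u64, // stack in use when the interrupt arrives
    ring0_to_ring0: bool,
) -> u64 {
    if ist_index != 0 {
        ist[(ist_index - 1) as usize] // an explicit IST entry wins
    } else if ring0_to_ring0 {
        current_rsp // Ring0 -> Ring0: no stack switch
    } else {
        rsp0 // privilege change with IST = 0: use TSS.RSP0
    }
}
```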
- This topic is explained in this video.
- These data structures are needed to handle interrupts:
- GDT is needed to hold TSS
- TSS is for a safe stack during exceptions
- IDT is the mapping between exception numbers and handlers
- IST is a special stack
- RSP0 is the general kernel stack for exceptions
- Interrupts are assigned different numbers according to their cause (the interrupt number)
- Based on the interrupt number, the CPU consults the IDT to carry out interrupt handling
- An IDT entry contains the interrupt handler's address, stack settings, and so on
- The GDT was created in the 32-bit era, but it is still used in the 64-bit era for the sake of the TSS
- The TSS contains the IST, which is consulted when switching stacks during interrupt handling
Page Table Implementation
- UEFI’s page table is only for UEFI, so we need to reimplement the page table to match OS requirements.
- First stage: Initialize PML4 table
- Next, determine the maximum virtual address from the memory information received from UEFI.
- Then, initialize PTEs for the virtual address range.
- The ReadWriteKernel attribute is Present and Writable
- Finally, by calling write_cr3, we transition from UEFI’s page table to our own. At this time, the TLB is also cleared.
pub fn init_paging(memory_map: &MemoryMapHolder) {
    let mut table = PML4::new();
    let mut end_of_mem = 0x1_0000_0000u64;
    for e in memory_map.iter() {
        match e.memory_type() {
            CONVENTIONAL_MEMORY | LOADER_CODE | LOADER_DATA => {
                end_of_mem = max(
                    end_of_mem,
                    e.physical_start() + e.number_of_pages() * (PAGE_SIZE as u64),
                )
            }
            _ => (),
        }
    }
    table
        .create_mapping(0, end_of_mem, 0, PageAttr::ReadWriteKernel)
        .expect("Failed to create initial page mapping");
    unsafe {
        write_cr3(Box::into_raw(table));
    }
}
Thoughts/Questions
- I felt that the part implementing a custom allocator based on memory map information received from UEFI is truly the core part
- The mechanism where the heap user requests memory with Layout(size, alignment), and the allocator returns an address that fits the Layout from available areas
- The part using LinkedList was also good
- The question arose: how does this program itself run? It runs in areas prepared by UEFI
- UEFI application binaries are apparently in PE/COFF format.
- UEFI determines the areas. UEFI applications (kernel) use them.
.text ← Code
.rdata ← Read-only data
.data ← Initialized data
.pdata ← Exception handling table (for x64)
.reloc ← Relocation information
- UEFI also provides the stack, but in practice, it seems the kernel itself initializes its own kernel stack after ExitBootServices()
- As mentioned above, the heap is made available by the kernel itself implementing an allocator based on the memory map provided by UEFI.
- I wonder if UEFI providing paging means it writes the necessary data for paging in memory (on the page table).
- Is it the OS’s job to interpret that?
- And UEFI is something closely related to architecture, and such things are defined in Intel’s specification, right?
- I learned that UEFI’s paging is ultimately paging for UEFI
- As an OS, we rebuild it ourselves to realize necessary functionality
- That page table is built on the kernel heap
- Regarding paging logic, the thinking was the same as the mental model demo code we implemented in the previous article, which was good
- The interrupt part was difficult with many parts implemented using the asm! macro
- I need to understand GDT, TSS, and the like a bit more