1.. SPDX-License-Identifier: GPL-2.0
2
3=================
4Process Addresses
5=================
6
7.. toctree::
8   :maxdepth: 3
9
10
11Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
12'VMA's of type :c:struct:`!struct vm_area_struct`.
13
Each VMA describes a virtually contiguous memory range with identical
attributes. Userland access outside of VMAs is invalid except in the case
where an adjacent stack VMA could be extended to contain the accessed address.
18
19All VMAs are contained within one and only one virtual address space, described
20by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
21threads) which share the virtual address space. We refer to this as the
22:c:struct:`!mm`.
23
24Each mm object contains a maple tree data structure which describes all VMAs
25within the virtual address space.
26
27.. note:: An exception to this is the 'gate' VMA which is provided by
28          architectures which use :c:struct:`!vsyscall` and is a global static
29          object which does not belong to any specific mm.
30
31-------
32Locking
33-------
34
The kernel is designed to be highly scalable against concurrent read operations
on VMA **metadata**, so a complicated set of locks is required to ensure that
no memory corruption occurs.
38
39.. note:: Locking VMAs for their metadata does not have any impact on the memory
40          they describe nor the page tables that map them.
41
42Terminology
43-----------
44
* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
  which locks at process address space granularity and can be acquired via
  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
* **VMA locks** - The VMA lock is at VMA granularity (of course) and behaves
  as a read/write semaphore in practice. A VMA read lock is obtained via
  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
  automatically when the mmap write lock is released). To take a VMA write lock
  you **must** have already acquired an :c:func:`!mmap_write_lock`.
* **rmap locks** - When accessing VMAs through the reverse mapping via a
  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
  (reachable from a folio via :c:member:`!folio->mapping`), VMAs must be
  stabilised via :c:func:`!anon_vma_[try]lock_read` or
  :c:func:`!anon_vma_[try]lock_write` for anonymous memory and
  :c:func:`!i_mmap_[try]lock_read` or :c:func:`!i_mmap_[try]lock_write` for
  file-backed memory. We refer to these locks as the reverse mapping locks, or
  'rmap locks' for brevity.
61
62We discuss page table locks separately in the dedicated section below.
63
The first thing **any** of these locks achieves is to **stabilise** the VMA
within the MM tree. That is, it guarantees that the VMA object will neither be
deleted from under you nor modified (except for some specific fields
described below).
68
69Stabilising a VMA also keeps the address space described by it around.
70
71Lock usage
72----------
73
74If you want to **read** VMA metadata fields or just keep the VMA stable, you
75must do one of the following:
76
77* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
78  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
79  you're done with the VMA, *or*
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries
  to acquire the lock atomically so it might fail; if it returns
  :c:macro:`!NULL`, fall-back logic is required to obtain an mmap read lock
  instead (see the sketch after this list), *or*
84* Acquire an rmap lock before traversing the locked interval tree (whether
85  anonymous or file-backed) to obtain the required VMA.
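
As an illustration of the read-side rules above, here is a minimal sketch of
the fall-back pattern used on the page fault path. The helper name
:c:func:`!stabilise_vma` is hypothetical, but :c:func:`!lock_vma_under_rcu`,
:c:func:`!vma_lookup` and the mmap lock functions are the real interfaces:

.. code-block:: c

  /* Hypothetical helper: stabilise the VMA containing addr for reading. */
  struct vm_area_struct *stabilise_vma(struct mm_struct *mm,
                                       unsigned long addr)
  {
          struct vm_area_struct *vma;

          /* Fast path: per-VMA read lock; may fail under contention. */
          vma = lock_vma_under_rcu(mm, addr);
          if (vma)
                  return vma; /* Released via vma_end_read(). */

          /* Slow path: fall back to the coarse-grained mmap read lock. */
          mmap_read_lock(mm);
          vma = vma_lookup(mm, addr);
          /* Released via mmap_read_unlock(); a real caller must track
           * which of the two locks it ended up holding. */
          return vma;
  }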
86
87If you want to **write** VMA metadata fields, then things vary depending on the
88field (we explore each VMA field in detail below). For the majority you must:
89
90* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
91  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
92  you're done with the VMA, *and*
93* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
94  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
95  called.
96* If you want to be able to write to **any** field, you must also hide the VMA
97  from the reverse mapping by obtaining an **rmap write lock**.
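
To make the write-side rules concrete, consider a minimal sketch which updates
VMA flags. The helper is hypothetical; note that :c:func:`!vm_flags_set`
itself invokes :c:func:`!vma_start_write`, so the VMA write lock is acquired
implicitly here:

.. code-block:: c

  /* Hypothetical helper: set flags on a VMA belonging to mm. */
  static void sketch_set_vma_flags(struct mm_struct *mm,
                                   struct vm_area_struct *vma,
                                   vm_flags_t flags)
  {
          mmap_write_lock(mm);
          /* vm_flags_set() internally performs vma_start_write(). */
          vm_flags_set(vma, flags);
          /* This also releases every VMA write lock taken. */
          mmap_write_unlock(mm);
  }

No rmap lock is required here because the flags are not used to locate the VMA
within the reverse mapping interval trees.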
98
VMA locks are special in that you must obtain an mmap **write** lock **first**
in order to obtain a VMA **write** lock. A VMA **read** lock however can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
release an RCU lock to look up the VMA for you).
103
104This constrains the impact of writers on readers, as a writer can interact with
105one VMA while a reader interacts with another simultaneously.
106
.. note:: The primary users of VMA read locks are page fault handlers, which
          means that without a VMA write lock, page faults will run concurrently
          with whatever you are doing.
110
111Examining all valid lock states:
112
113.. table::
114
115   ========= ======== ========= ======= ===== =========== ==========
116   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
117   ========= ======== ========= ======= ===== =========== ==========
118   \-        \-       \-        N       N     N           N
119   \-        R        \-        Y       Y     N           N
120   \-        \-       R/W       Y       Y     N           N
121   R/W       \-/R     \-/R/W    Y       Y     N           N
122   W         W        \-/R      Y       Y     Y           N
123   W         W        W         Y       Y     Y           Y
124   ========= ======== ========= ======= ===== =========== ==========
125
.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock,
             attempting to do the reverse is invalid as it can result in deadlock - if
             another task already holds an mmap write lock and attempts to acquire a VMA
             write lock, that attempt will deadlock on the VMA read lock.
130
131All of these locks behave as read/write semaphores in practice, so you can
132obtain either a read or a write lock for each of these.
133
134.. note:: Generally speaking, a read/write semaphore is a class of lock which
135          permits concurrent readers. However a write lock can only be obtained
136          once all readers have left the critical region (and pending readers
137          made to wait).
138
139          This renders read locks on a read/write semaphore concurrent with other
140          readers and write locks exclusive against all others holding the semaphore.
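
Expressed against the mmap lock, these semantics look as follows (a sketch
only):

.. code-block:: c

  /* Readers may enter the critical region concurrently... */
  mmap_read_lock(mm);
  /* ... read (but do not write) VMA metadata here ... */
  mmap_read_unlock(mm);

  /* ...whereas a writer waits for all readers and writers to leave,
   * then excludes everybody else for the duration. */
  mmap_write_lock(mm);
  /* ... exclusive access here ... */
  mmap_write_unlock(mm);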
141
142VMA fields
143^^^^^^^^^^
144
145We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
146easier to explore their locking characteristics:
147
148.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
149          are in effect an internal implementation detail.
150
151.. table:: Virtual layout fields
152
153   ===================== ======================================== ===========
154   Field                 Description                              Write lock
155   ===================== ======================================== ===========
156   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
157                         VMA describes.                           VMA write,
158                                                                  rmap write.
159   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
160                         VMA describes.                           VMA write,
161                                                                  rmap write.
162   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
163                         the original page offset within the      VMA write,
164                         virtual address space (prior to any      rmap write.
165                         :c:func:`!mremap`), or PFN if a PFN map
166                         and the architecture does not support
167                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
168   ===================== ======================================== ===========
169
These fields describe the size, start and end of the VMA, and as such cannot be
modified without first being hidden from the reverse mapping since these fields
are used to locate VMAs within the reverse mapping interval trees.
173
174.. table:: Core fields
175
176   ============================ ======================================== =========================
177   Field                        Description                              Write lock
178   ============================ ======================================== =========================
179   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
180                                                                         initial map.
181   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
182                                protection bits determined from VMA
183                                flags.
184   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
185                                attributes of the VMA, in union with
186                                private writable
187                                :c:member:`!__vm_flags`.
188   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
189                                field, updated by
190                                :c:func:`!vm_flags_*` functions.
191   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
192                                struct file object describing the        initial map.
193                                underlying file, if anonymous then
194                                :c:macro:`!NULL`.
   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - written once on
196                                the driver or file-system provides a     initial map by
197                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
198                                object describing callbacks to be
199                                invoked on VMA lifetime events.
200   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
201                                driver-specific metadata.
202   ============================ ======================================== =========================
203
204These are the core fields which describe the MM the VMA belongs to and its attributes.
205
206.. table:: Config-specific fields
207
208   ================================= ===================== ======================================== ===============
209   Field                             Configuration option  Description                              Write lock
210   ================================= ===================== ======================================== ===============
211   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
212                                                           :c:struct:`!struct anon_vma_name`        VMA write.
213                                                           object providing a name for anonymous
214                                                           mappings, or :c:macro:`!NULL` if none
215                                                           is set or the VMA is file-backed. The
                                                           underlying object is reference counted
                                                           and can be shared across multiple VMAs
                                                           for scalability.
219   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
220                                                           to perform readahead. This field is      swap-specific
221                                                           accessed atomically.                     lock.
222   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
223                                                           describes the NUMA behaviour of the      VMA write.
224                                                           VMA. The underlying object is reference
                                                           counted.
226   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
227                                                           describes the current state of           numab-specific
228                                                           NUMA balancing in relation to this VMA.  lock.
229                                                           Updated under mmap read lock by
230                                                           :c:func:`!task_numa_work`.
231   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
232                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
233                                                           either of zero size if userfaultfd is
234                                                           disabled, or containing a pointer
235                                                           to an underlying
236                                                           :c:type:`!userfaultfd_ctx` object which
237                                                           describes userfaultfd metadata.
238   ================================= ===================== ======================================== ===============
239
240These fields are present or not depending on whether the relevant kernel
241configuration option is set.
242
243.. table:: Reverse mapping fields
244
245   =================================== ========================================= ============================
246   Field                               Description                               Write lock
247   =================================== ========================================= ============================
248   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
249                                       mapping is file-backed, to place the VMA  i_mmap write.
250                                       in the
251                                       :c:member:`!struct address_space->i_mmap`
252                                       red/black interval tree.
253   :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
254                                       interval tree if the VMA is file-backed.  i_mmap write.
255   :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW’d     mmap read, anon_vma write.
256                                       :c:type:`!anon_vma` objects and
257                                       :c:member:`!vma->anon_vma` if it is
258                                       non-:c:macro:`!NULL`.
259   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        When :c:macro:`NULL` and
260                                       anonymous folios mapped exclusively to    setting non-:c:macro:`NULL`:
261                                       this VMA. Initially set by                mmap read, page_table_lock.
262                                       :c:func:`!anon_vma_prepare` serialised
263                                       by the :c:macro:`!page_table_lock`. This  When non-:c:macro:`NULL` and
264                                       is set as soon as any page is faulted in. setting :c:macro:`NULL`:
265                                                                                 mmap write, VMA write,
266                                                                                 anon_vma write.
267   =================================== ========================================= ============================
268
These fields are used both to place the VMA within the reverse mapping and,
for anonymous mappings, to access the related :c:struct:`!struct anon_vma`
objects and the :c:struct:`!struct anon_vma` in which folios mapped
exclusively to this VMA should reside.
273
274.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
275          then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
276          trees at the same time, so all of these fields might be utilised at
277          once.
278
279Page tables
280-----------
281
We won't speak exhaustively on the subject but broadly speaking, the kernel
maps virtual addresses to physical ones through a series of page tables, each
of which contains entries with physical addresses for the next page table level
(along with flags), and at the leaf level the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker. Offsets into these pages are provided
by the virtual address itself.
289
290In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
291pages might eliminate one or two of these levels, but when this is the case we
292typically refer to the leaf level as the PTE level regardless.
293
.. note:: In instances where the architecture supports fewer page table levels
          than five, the kernel cleverly 'folds' page table levels, that is,
          stubs out functions related to the skipped levels. This allows us to
          conceptually act as if there were always five levels, even if the
          compiler might, in practice, eliminate any code relating to missing
          ones.
300
301There are four key operations typically performed on page tables:
302
3031. **Traversing** page tables - Simply reading page tables in order to traverse
304   them. This only requires that the VMA is kept stable, so a lock which
305   establishes this suffices for traversal (there are also lockless variants
306   which eliminate even this requirement, such as :c:func:`!gup_fast`).
3072. **Installing** page table mappings - Whether creating a new mapping or
308   modifying an existing one in such a way as to change its identity. This
309   requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
310   rmap locks).
3113. **Zapping/unmapping** page table entries - This is what the kernel calls
312   clearing page table mappings at the leaf level only, whilst leaving all page
313   tables in place. This is a very common operation in the kernel performed on
314   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
315   :c:func:`!madvise`, and others. This is performed by a number of functions
316   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`.
317   The VMA need only be kept stable for this operation.
3184. **Freeing** page tables - When finally the kernel removes page tables from a
319   userland process (typically via :c:func:`!free_pgtables`) extreme care must
320   be taken to ensure this is done safely, as this logic finally frees all page
321   tables in the specified range, ignoring existing leaf entries (it assumes the
322   caller has both zapped the range and prevented any further faults or
323   modifications within it).
324
325.. note:: Modifying mappings for reclaim or migration is performed under rmap
326          lock as it, like zapping, does not fundamentally modify the identity
327          of what is being mapped.
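
As an example of **zapping** (operation 3 above), consider file truncation,
which must zap all userland mappings of the truncated range while leaving the
page tables themselves in place. A minimal sketch, mirroring what
:c:func:`!truncate_pagecache` does via :c:func:`!unmap_mapping_range` (and
assuming :c:member:`!newsize` is already page-aligned):

.. code-block:: c

  /* Sketch: zap (but do not free) all user page table mappings of the
   * inode's pages from newsize onwards. A holelen of 0 means 'to the
   * end of the file'; even_cows == 1 also zaps private CoW copies. */
  static void sketch_truncate_zap(struct inode *inode, loff_t newsize)
  {
          unmap_mapping_range(inode->i_mapping, newsize, 0, 1);
  }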
328
329**Traversing** and **zapping** ranges can be performed holding any one of the
330locks described in the terminology section above - that is the mmap lock, the
331VMA lock or either of the reverse mapping locks.
332
That is - as long as you keep the relevant VMA **stable** - you are good to go
ahead and perform these operations on page tables (though internally, kernel
operations that perform writes also acquire page table locks to serialise - see
the page table implementation detail section for more details).
337
338When **installing** page table entries, the mmap or VMA lock must be held to
339keep the VMA stable. We explore why this is in the page table locking details
340section below.
341
342.. warning:: Page tables are normally only traversed in regions covered by VMAs.
343             If you want to traverse page tables in areas that might not be
344             covered by VMAs, heavier locking is required.
345             See :c:func:`!walk_page_range_novma` for details.
346
347**Freeing** page tables is an entirely internal memory management operation and
348has special requirements (see the page freeing section below for more details).
349
350.. warning:: When **freeing** page tables, it must not be possible for VMAs
351             containing the ranges those page tables map to be accessible via
352             the reverse mapping.
353
354             The :c:func:`!free_pgtables` function removes the relevant VMAs
355             from the reverse mappings, but no other VMAs can be permitted to be
356             accessible and span the specified range.
357
358Lock ordering
359-------------
360
361As we have multiple locks across the kernel which may or may not be taken at the
362same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
363the **order** in which locks are acquired and released becomes very important.
364
365.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
366   but in doing so inadvertently cause a mutual deadlock.
367
368   For example, consider thread 1 which holds lock A and tries to acquire lock B,
369   while thread 2 holds lock B and tries to acquire lock A.
370
371   Both threads are now deadlocked on each other. However, had they attempted to
372   acquire locks in the same order, one would have waited for the other to
373   complete its work and no deadlock would have occurred.
374
375The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required
376ordering of locks within memory management code:
377
378.. code-block::
379
380  inode->i_rwsem        (while writing or truncating, not reading or faulting)
381    mm->mmap_lock
382      mapping->invalidate_lock (in filemap_fault)
383        folio_lock
384          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
385            vma_start_write
386              mapping->i_mmap_rwsem
387                anon_vma->rwsem
388                  mm->page_table_lock or pte_lock
389                    swap_lock (in swap_duplicate, swap_info_get)
390                      mmlist_lock (in mmput, drain_mmlist and others)
391                      mapping->private_lock (in block_dirty_folio)
392                          i_pages lock (widely used)
393                            lruvec->lru_lock (in folio_lruvec_lock_irq)
394                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
395                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
396                        sb_lock (within inode_lock in fs/fs-writeback.c)
397                        i_pages lock (widely used, in set_page_dirty,
398                                  in arch-dependent flush_dcache_mmap_lock,
399                                  within bdi.wb->list_lock in __sync_single_inode)
400
401There is also a file-system specific lock ordering comment located at the top of
402:c:macro:`!mm/filemap.c`:
403
404.. code-block::
405
406  ->i_mmap_rwsem                        (truncate_pagecache)
407    ->private_lock                      (__free_pte->block_dirty_folio)
408      ->swap_lock                       (exclusive_swap_page, others)
409        ->i_pages lock
410
411  ->i_rwsem
412    ->invalidate_lock                   (acquired by fs in truncate path)
413      ->i_mmap_rwsem                    (truncate->unmap_mapping_range)
414
415  ->mmap_lock
416    ->i_mmap_rwsem
417      ->page_table_lock or pte_lock     (various, mainly in memory.c)
418        ->i_pages lock                  (arch-dependent flush_dcache_mmap_lock)
419
420  ->mmap_lock
421    ->invalidate_lock                   (filemap_fault)
422      ->lock_page                       (filemap_fault, access_process_vm)
423
424  ->i_rwsem                             (generic_perform_write)
425    ->mmap_lock                         (fault_in_readable->do_page_fault)
426
427  bdi->wb.list_lock
428    sb_lock                             (fs/fs-writeback.c)
429    ->i_pages lock                      (__sync_single_inode)
430
431  ->i_mmap_rwsem
432    ->anon_vma.lock                     (vma_merge)
433
434  ->anon_vma.lock
435    ->page_table_lock or pte_lock       (anon_vma_prepare and various)
436
437  ->page_table_lock or pte_lock
438    ->swap_lock                         (try_to_unmap_one)
439    ->private_lock                      (try_to_unmap_one)
440    ->i_pages lock                      (try_to_unmap_one)
441    ->lruvec->lru_lock                  (follow_page_mask->mark_page_accessed)
442    ->lruvec->lru_lock                  (check_pte_range->folio_isolate_lru)
443    ->private_lock                      (folio_remove_rmap_pte->set_page_dirty)
444    ->i_pages lock                      (folio_remove_rmap_pte->set_page_dirty)
445    bdi.wb->list_lock                   (folio_remove_rmap_pte->set_page_dirty)
446    ->inode->i_lock                     (folio_remove_rmap_pte->set_page_dirty)
447    bdi.wb->list_lock                   (zap_pte_range->set_page_dirty)
448    ->inode->i_lock                     (zap_pte_range->set_page_dirty)
449    ->private_lock                      (zap_pte_range->block_dirty_folio)
450
Please check the current state of these comments, which may have changed since
the time of writing of this document.
453
454------------------------------
455Locking Implementation Details
456------------------------------
457
458.. warning:: Locking rules for PTE-level page tables are very different from
459             locking rules for page tables at other levels.
460
461Page table locking details
462--------------------------
463
464In addition to the locks described in the terminology section above, we have
465additional locks dedicated to page tables:
466
467* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
468  and PUD each make use of the process address space granularity
469  :c:member:`!mm->page_table_lock` lock when modified.
470
* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
  either kept within the folios describing the page tables or allocated
  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
  mapped into high memory (on 32-bit systems) and carefully locked via
  :c:func:`!pte_offset_map_lock`.
477
478These locks represent the minimum required to interact with each page table
479level, but there are further requirements.
480
Importantly, note that on a **traversal** of page tables, sometimes no such
locks are taken. However, at the PTE level, at least concurrent page table
deletion must be prevented (using RCU) and, if the page table resides in high
memory, it must be mapped into kernel address space; see below.
485
486Whether care is taken on reading the page table entries depends on the
487architecture, see the section on atomicity below.
488
489Locking rules
490^^^^^^^^^^^^^
491
492We establish basic locking rules when interacting with page tables:
493
* When changing a page table entry, the page table lock for that page table
  **must** be held, except if you can safely assume nobody can access the page
  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
497* Reads from and writes to page table entries must be *appropriately*
498  atomic. See the section on atomicity below for details.
* Populating previously empty entries requires that the mmap or VMA locks are
  held (read or write); doing so with only rmap locks would be dangerous (see
  the warning below).
502* As mentioned previously, zapping can be performed while simply keeping the VMA
503  stable, that is holding any one of the mmap, VMA or rmap locks.
504
505.. warning:: Populating previously empty entries is dangerous as, when unmapping
506             VMAs, :c:func:`!vms_clear_ptes` has a window of time between
507             zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via
508             :c:func:`!free_pgtables`), where the VMA is still visible in the
509             rmap tree. :c:func:`!free_pgtables` assumes that the zap has
510             already been performed and removes PTEs unconditionally (along with
511             all other page tables in the freed range), so installing new PTE
512             entries could leak memory and also cause other unexpected and
513             dangerous behaviour.
514
515There are additional rules applicable when moving page tables, which we discuss
516in the section on this topic below.
517
518PTE-level page tables are different from page tables at other levels, and there
519are extra requirements for accessing them:
520
521* On 32-bit architectures, they may be in high memory (meaning they need to be
522  mapped into kernel memory to be accessible).
523* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
524  rmap lock for reading in combination with the PTE and PMD page table locks.
525  In particular, this happens in :c:func:`!retract_page_tables` when handling
526  :c:macro:`!MADV_COLLAPSE`.
527  So accessing PTE-level page tables requires at least holding an RCU read lock;
528  but that only suffices for readers that can tolerate racing with concurrent
529  page table updates such that an empty PTE is observed (in a page table that
530  has actually already been detached and marked for RCU freeing) while another
531  new page table has been installed in the same location and filled with
532  entries. Writers normally need to take the PTE lock and revalidate that the
533  PMD entry still refers to the same PTE-level page table.
  If the writer does not care whether it is the same PTE-level page table, it
  can take the PMD lock and revalidate that the contents of the PMD entry still
  meet the requirements. In particular, this also happens in
  :c:func:`!retract_page_tables` when handling :c:macro:`!MADV_COLLAPSE`.
538
539To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
540:c:func:`!pte_offset_map` can be used depending on stability requirements.
541These map the page table into kernel memory if required, take the RCU lock, and
542depending on variant, may also look up or acquire the PTE lock.
543See the comment on :c:func:`!__pte_offset_map_lock`.
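
For example, a hedged sketch of a PTE-level walk over a single PMD's worth of
pages, in the style of the many :c:func:`!*_pte_range` helpers in mm/ (names
and error handling simplified):

.. code-block:: c

  /* Sketch: visit each present PTE mapping [addr, end), where the
   * range lies within a single PMD and addr/end are page-aligned. */
  static void sketch_walk_ptes(struct mm_struct *mm, pmd_t *pmd,
                               unsigned long addr, unsigned long end)
  {
          spinlock_t *ptl;
          pte_t *start_pte, *pte;

          start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
          if (!pte)
                  return; /* PTE table vanished under us; caller may retry. */

          for (; addr < end; addr += PAGE_SIZE, pte++) {
                  pte_t entry = ptep_get(pte);

                  if (pte_none(entry))
                          continue;
                  /* ... operate on the stable entry here ... */
          }
          pte_unmap_unlock(start_pte, ptl);
  }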
544
545Atomicity
546^^^^^^^^^
547
Regardless of page table locks, the MMU hardware concurrently updates accessed
and dirty bits (perhaps more, depending on architecture). Additionally, page
table traversal operations may run in parallel (though holding the VMA stable),
and functionality like GUP-fast locklessly traverses (that is, reads) page
tables without even keeping the VMA stable at all.
553
554When performing a page table traversal and keeping the VMA stable, whether a
555read must be performed once and only once or not depends on the architecture
556(for instance x86-64 does not require any special precautions).
557
558If a write is being performed, or if a read informs whether a write takes place
559(on an installation of a page table entry say, for instance in
560:c:func:`!__pud_install`), special care must always be taken. In these cases we
561can never assume that page table locks give us entirely exclusive access, and
562must retrieve page table entries once and only once.
563
564If we are reading page table entries, then we need only ensure that the compiler
565does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
566functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
567:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.
568
569Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
570the page table entry only once.
571
However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must go further and use a hardware atomic
operation as, for example, in :c:func:`!ptep_get_and_clear`.
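
For instance, a sketch of why this matters when clearing an entry whose
hardware-updated bits we care about (the helper name is illustrative only):

.. code-block:: c

  /* Sketch: preserving hardware-updated bits when clearing a PTE. */
  static pte_t sketch_clear_pte(struct mm_struct *mm, unsigned long addr,
                                pte_t *ptep)
  {
          /*
           * WRONG: the MMU may set the dirty/accessed bits between the
           * read and the clear, silently losing them:
           *
           *     pte_t old = ptep_get(ptep);
           *     pte_clear(mm, addr, ptep);
           *     return old;
           */

          /* RIGHT: fetch and clear the entry in a single atomic step. */
          return ptep_get_and_clear(mm, addr, ptep);
  }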
575
576Equally, operations that do not rely on the VMA being held stable, such as
577GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like
578:c:func:`!gup_fast_pte_range`), must very carefully interact with page table
579entries, using functions such as :c:func:`!ptep_get_lockless` and equivalent for
580higher level page table levels.
581
582Writes to page table entries must also be appropriately atomic, as established
583by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
584:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.
585
Equally, functions which clear page table entries must be appropriately atomic,
587as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
588:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
589:c:func:`!pte_clear`.
590
591Page table installation
592^^^^^^^^^^^^^^^^^^^^^^^
593
594Page table installation is performed with the VMA held stable explicitly by an
595mmap or VMA lock in read or write mode (see the warning in the locking rules
596section for details as to why).
597
598When allocating a P4D, PUD or PMD and setting the relevant entry in the above
599PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
600acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
601:c:func:`!__pmd_alloc` respectively.
602
603.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
604   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
605   references the :c:member:`!mm->page_table_lock`.
606
607Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
608:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
609physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
610:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
611:c:func:`!__pte_alloc`.
612
613Finally, modifying the contents of the PTE requires special treatment, as the
614PTE page table lock must be acquired whenever we want stable and exclusive
615access to entries contained within a PTE, especially when we wish to modify
616them.
617
618This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
619ensure that the PTE hasn't changed from under us, ultimately invoking
620:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
621the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
622must be released via :c:func:`!pte_unmap_unlock`.
623
624.. note:: There are some variants on this, such as
625   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
626   for brevity we do not explore this.  See the comment for
627   :c:func:`!__pte_offset_map_lock` for more details.
628
629When modifying data in ranges we typically only wish to allocate higher page
630tables as necessary, using these locks to avoid races or overwriting anything,
631and set/clear data at the PTE level as required (for instance when page faulting
632or zapping).
633
A typical pattern taken when traversing page table entries to install a new
mapping is to optimistically determine whether the page table entry in the
table above is empty; if so, only then acquire the page table lock and check
again to see if it was allocated underneath us.
638
639This allows for a traversal with page table locks only being taken when
640required. An example of this is :c:func:`!__pud_alloc`.
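
A sketch of the shape of this pattern (simplified from :c:func:`!pud_alloc`
and :c:func:`!__pud_alloc`, with the helper name invented and error handling
elided):

.. code-block:: c

  /* Sketch: optimistically check, then re-check under the lock. */
  static void sketch_alloc_pud(struct mm_struct *mm, p4d_t *p4d,
                               unsigned long addr)
  {
          if (p4d_none(*p4d)) {                       /* unlocked, optimistic */
                  pud_t *new = pud_alloc_one(mm, addr);

                  spin_lock(&mm->page_table_lock);
                  if (!p4d_present(*p4d))             /* re-check under lock */
                          p4d_populate(mm, p4d, new); /* we won any race */
                  else
                          pud_free(mm, new);          /* somebody beat us */
                  spin_unlock(&mm->page_table_lock);
          }
  }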
641
At the leaf page table, that is the PTE, we can't entirely rely on this
pattern, as we have separate PMD and PTE locks, and a THP collapse, for
instance, might have eliminated the PMD entry as well as the PTE from under us.
645
646This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
647for the PTE, carefully checking it is as expected, before acquiring the
648PTE-specific lock, and then *again* checking that the PMD entry is as expected.
649
650If a THP collapse (or similar) were to occur then the lock on both pages would
651be acquired, so we can ensure this is prevented while the PTE lock is held.
652
653Installing entries this way ensures mutual exclusion on write.
654
655Page table freeing
656^^^^^^^^^^^^^^^^^^
657
658Tearing down page tables themselves is something that requires significant
659care. There must be no way that page tables designated for removal can be
660traversed or referenced by concurrent tasks.
661
It is insufficient to simply hold an mmap write lock and VMA lock (which will
prevent racing faults and rmap operations), as a file-backed mapping can be
truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.
665
666As a result, no VMA which can be accessed via the reverse mapping (either
667through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
668address_space->i_mmap` interval trees) can have its page tables torn down.
669
670The operation is typically performed via :c:func:`!free_pgtables`, which assumes
671either the mmap write lock has been taken (as specified by its
672:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.
673
It carefully removes the VMA from all reverse mappings; however, it is
important that no new VMAs overlap these, and that no route remains to permit
access to addresses within the range whose page tables are being torn down.
677
678Additionally, it assumes that a zap has already been performed and steps have
679been taken to ensure that no further page table entries can be installed between
680the zap and the invocation of :c:func:`!free_pgtables`.
681
Since it is assumed that all such steps have been taken, page table entries are
cleared without page table locks (in the :c:func:`!pgd_clear`, :c:func:`!p4d_clear`,
:c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions).
685
.. note:: It is possible for leaf page tables to be torn down independently of
          the page tables above them, as is done by
          :c:func:`!retract_page_tables`, which is performed under the i_mmap
          read lock and the PMD and PTE page table locks, without this level
          of care.
690
691Page table moving
692^^^^^^^^^^^^^^^^^
693
694Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
695page tables). Most notable of these is :c:func:`!mremap`, which is capable of
696moving higher level page tables.
697
698In these instances, it is required that **all** locks are taken, that is
699the mmap lock, the VMA lock and the relevant rmap locks.
700
701You can observe this in the :c:func:`!mremap` implementation in the functions
702:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
703side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.
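
At the time of writing, the rmap side looks approximately like the below
(simplified from mm/rmap locking use in mm/mremap.c); note that the i_mmap
lock is taken before the anon_vma lock, in keeping with the lock ordering
described earlier:

.. code-block:: c

  /* Simplified sketch of take_rmap_locks() in mm/mremap.c. */
  static void take_rmap_locks(struct vm_area_struct *vma)
  {
          if (vma->vm_file)
                  i_mmap_lock_write(vma->vm_file->f_mapping);
          if (vma->anon_vma)
                  anon_vma_lock_write(vma->anon_vma);
  }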
704
705VMA lock internals
706------------------
707
708Overview
709^^^^^^^^
710
711VMA read locking is entirely optimistic - if the lock is contended or a competing
712write has started, then we do not obtain a read lock.
713
714A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
715calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
716critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
717before releasing the RCU lock via :c:func:`!rcu_read_unlock`.
718
719VMA read locks hold the read lock on the :c:member:`!vma->vm_lock` semaphore for
720their duration and the caller of :c:func:`!lock_vma_under_rcu` must release it
721via :c:func:`!vma_end_read`.
722
VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
VMA is about to be modified; unlike :c:func:`!vma_start_read`, the lock is always
acquired. An mmap write lock **must** be held for the duration of the VMA write
lock, and releasing or downgrading the mmap write lock also releases the VMA
write lock, so there is no :c:func:`!vma_end_write` function.
728
Note that the semaphore write lock is not held for the duration of a VMA write
lock. Rather, a sequence number is used for serialisation, and the write
semaphore is only acquired at the point of write lock to update this.
732
733This ensures the semantics we require - VMA write locks provide exclusive write
734access to the VMA.
735
736Implementation details
737^^^^^^^^^^^^^^^^^^^^^^
738
739The VMA lock mechanism is designed to be a lightweight means of avoiding the use
740of the heavily contended mmap lock. It is implemented using a combination of a
741read/write semaphore and sequence numbers belonging to the containing
742:c:struct:`!struct mm_struct` and the VMA.
743
744Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
745operation, i.e. it tries to acquire a read lock but returns false if it is
746unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
747called to release the VMA read lock.
748
749Invoking :c:func:`!vma_start_read` requires that :c:func:`!rcu_read_lock` has
750been called first, establishing that we are in an RCU critical section upon VMA
751read lock acquisition. Once acquired, the RCU lock can be released as it is only
752required for lookup. This is abstracted by :c:func:`!lock_vma_under_rcu` which
753is the interface a user should use.
754
Writing requires the mmap lock to be write-locked and the VMA lock to be
acquired via :c:func:`!vma_start_write`; however, the write lock is released by
the release or downgrade of the mmap write lock, so no :c:func:`!vma_end_write`
is required.
758
759All this is achieved by the use of per-mm and per-VMA sequence counts, which are
760used in order to reduce complexity, especially for operations which write-lock
761multiple VMAs at once.
762
763If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
764sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
765they differ, then it is not.
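
Conceptually, this check looks like the following sketch (the exact field
names and types have varied across kernel versions):

.. code-block:: c

  /* Conceptual sketch: a VMA is write-locked exactly when its sequence
   * number matches that of its mm. Field names/types vary by version. */
  static bool sketch_vma_write_locked(struct vm_area_struct *vma)
  {
          return vma->vm_lock_seq == vma->vm_mm->mm_lock_seq;
  }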
766
767Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
768:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
769also increments :c:member:`!mm->mm_lock_seq` via
770:c:func:`!mm_lock_seqcount_end`.
771
772This way, we ensure that, regardless of the VMA's sequence number, a write lock
773is never incorrectly indicated and that when we release an mmap write lock we
774efficiently release **all** VMA write locks contained within the mmap at the
775same time.
776
777Since the mmap write lock is exclusive against others who hold it, the automatic
778release of any VMA locks on its release makes sense, as you would never want to
779keep VMAs locked across entirely separate write operations. It also maintains
780correct lock ordering.
781
782Each time a VMA read lock is acquired, we acquire a read lock on the
783:c:member:`!vma->vm_lock` read/write semaphore and hold it, while checking that
784the sequence count of the VMA does not match that of the mm.
785
786If it does, the read lock fails. If it does not, we hold the lock, excluding
787writers, but permitting other readers, who will also obtain this lock under RCU.
788
789Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
790are also RCU safe, so the whole read lock operation is guaranteed to function
791correctly.
792
793On the write side, we acquire a write lock on the :c:member:`!vma->vm_lock`
794read/write semaphore, before setting the VMA's sequence number under this lock,
795also simultaneously holding the mmap write lock.
796
797This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
798until these are finished and mutual exclusion is achieved.
799
800After setting the VMA's sequence number, the lock is released, avoiding
801complexity with a long-term held write lock.
802
803This clever combination of a read/write semaphore and sequence count allows for
804fast RCU-based per-VMA lock acquisition (especially on page fault, though
805utilised elsewhere) with minimal complexity around lock ordering.
806
807mmap write lock downgrading
808---------------------------
809
810When an mmap write lock is held one has exclusive access to resources within the
811mmap (with the usual caveats about requiring VMA write locks to avoid races with
812tasks holding VMA read locks).
813
814It is then possible to **downgrade** from a write lock to a read lock via
815:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
816implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
817importantly does not relinquish the mmap lock while downgrading, therefore
818keeping the locked virtual address space stable.
819
820An interesting consequence of this is that downgraded locks are exclusive
821against any other task possessing a downgraded lock (since a racing task would
822have to acquire a write lock first to downgrade it, and the downgraded lock
823prevents a new write lock from being obtained until the original lock is
824released).
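
As a usage sketch (in the style of teardown operations which have historically
downgraded in order to perform expensive cleanup without blocking readers):

.. code-block:: c

  /* Sketch: destructive work under the write lock, cleanup under the
   * downgraded (read) lock. */
  mmap_write_lock(mm);
  /* ... detach VMAs, write-locking each via vma_start_write() ... */
  mmap_write_downgrade(mm); /* releases all VMA write locks too */
  /* ... expensive cleanup, tolerating concurrent readers ... */
  mmap_read_unlock(mm);     /* the downgraded lock is a read lock */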
825
826For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
827another showing which locks exclude the others:
828
829.. list-table:: Lock exclusivity
830   :widths: 5 5 5 5
831   :header-rows: 1
832   :stub-columns: 1
833
834   * -
835     - R
836     - D
837     - W
838   * - R
839     - N
840     - N
841     - Y
842   * - D
843     - N
844     - Y
845     - Y
846   * - W
847     - Y
848     - Y
849     - Y
850
851Here a Y indicates the locks in the matching row/column are mutually exclusive,
852and N indicates that they are not.
853
854Stack expansion
855---------------
856
Stack expansion throws up additional complexities in that we cannot permit
there to be racing page faults; as a result we invoke :c:func:`!vma_start_write`
to prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.
860