1// Copyright 2019 The Go Authors. All rights reserved.
2// Use of this source code is governed by a BSD-style
3// license that can be found in the LICENSE file.
4
5// Scavenging free pages.
6//
7// This file implements scavenging (the release of physical pages backing mapped
8// memory) of free and unused pages in the heap as a way to deal with page-level
9// fragmentation and reduce the RSS of Go applications.
10//
11// Scavenging in Go happens on two fronts: there's the background
12// (asynchronous) scavenger and the allocation-time (synchronous) scavenger.
13//
14// The former happens on a goroutine much like the background sweeper which is
15// soft-capped at using scavengePercent of the mutator's time, based on
16// order-of-magnitude estimates of the costs of scavenging. The latter happens
17// when allocating pages from the heap.
18//
19// The scavenger's primary goal is to bring the estimated heap RSS of the
20// application down to a goal.
21//
22// Before we consider what this looks like, we need to split the world into two
// halves: one in which a memory limit is not set, and one in which it is.
24//
25// For the former, the goal is defined as:
26//   (retainExtraPercent+100) / 100 * (heapGoal / lastHeapGoal) * lastHeapInUse
27//
28// Essentially, we wish to have the application's RSS track the heap goal, but
29// the heap goal is defined in terms of bytes of objects, rather than pages like
// RSS. As a result, we need to account for fragmentation internal to
31// spans. heapGoal / lastHeapGoal defines the ratio between the current heap goal
32// and the last heap goal, which tells us by how much the heap is growing and
33// shrinking. We estimate what the heap will grow to in terms of pages by taking
34// this ratio and multiplying it by heapInUse at the end of the last GC, which
35// allows us to account for this additional fragmentation. Note that this
36// procedure makes the assumption that the degree of fragmentation won't change
37// dramatically over the next GC cycle. Overestimating the amount of
38// fragmentation simply results in higher memory use, which will be accounted
// for by the next pacing update. Underestimating the fragmentation, however,
40// could lead to performance degradation. Handling this case is not within the
41// scope of the scavenger. Situations where the amount of fragmentation balloons
42// over the course of a single GC cycle should be considered pathologies,
43// flagged as bugs, and fixed appropriately.
44//
45// An additional factor of retainExtraPercent is added as a buffer to help ensure
46// that there's more unscavenged memory to allocate out of, since each allocation
47// out of scavenged memory incurs a potentially expensive page fault.
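//
// To make this concrete with illustrative numbers (not drawn from any real
// program): with retainExtraPercent = 10, heapGoal = 120 MiB, lastHeapGoal =
// 100 MiB, and lastHeapInUse = 110 MiB, the goal works out to
//
//	(10+100)/100 * (120/100) * 110 MiB = 1.1 * 1.2 * 110 MiB ≈ 145 MiB
//
// of retained heap memory before the background scavenger considers itself done.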
48//
49// If a memory limit is set, then we wish to pick a scavenge goal that maintains
50// that memory limit. For that, we look at total memory that has been committed
51// (memstats.mappedReady) and try to bring that down below the limit. In this case,
52// we want to give buffer space in the *opposite* direction. When the application
53// is close to the limit, we want to make sure we push harder to keep it under, so
// by targeting below the memory limit, we ensure that the background scavenger is
55// giving the situation the urgency it deserves.
56//
57// In this case, the goal is defined as:
58//    (100-reduceExtraPercent) / 100 * memoryLimit
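//
// For example (illustrative): with reduceExtraPercent = 5 and a memory limit of
// 8 GiB, the scavenger works to keep total mapped-ready memory below about
// 0.95 * 8 GiB = 7.6 GiB.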
59//
// We compute both of these goals, and check whether either of them has been met.
61// The background scavenger continues operating as long as either one of the goals
62// has not been met.
63//
64// The goals are updated after each GC.
65//
66// Synchronous scavenging happens for one of two reasons: if an allocation would
67// exceed the memory limit or whenever the heap grows in size, for some
68// definition of heap-growth. The intuition behind this second reason is that the
69// application had to grow the heap because existing fragments were not sufficiently
70// large to satisfy a page-level memory allocation, so we scavenge those fragments
71// eagerly to offset the growth in RSS that results.
72//
73// Lastly, not all pages are available for scavenging at all times and in all cases.
74// The background scavenger and heap-growth scavenger only release memory in chunks
75// that have not been densely-allocated for at least 1 full GC cycle. The reason
// behind this is the likelihood of reuse: the Go heap is allocated in a first-fit order
77// and by the end of the GC mark phase, the heap tends to be densely packed. Releasing
78// memory in these densely packed chunks while they're being packed is counter-productive,
79// and worse, it breaks up huge pages on systems that support them. The scavenger (invoked
80// during memory allocation) further ensures that chunks it identifies as "dense" are
81// immediately eligible for being backed by huge pages. Note that for the most part these
82// density heuristics are best-effort heuristics. It's totally possible (but unlikely)
83// that a chunk that just became dense is scavenged in the case of a race between memory
84// allocation and scavenging.
85//
86// When synchronously scavenging for the memory limit or for debug.FreeOSMemory, these
87// "dense" packing heuristics are ignored (in other words, scavenging is "forced") because
88// in these scenarios returning memory to the OS is more important than keeping CPU
89// overheads low.
90
91package runtime
92
93import (
94	"internal/goos"
95	"internal/runtime/atomic"
96	"runtime/internal/sys"
97	"unsafe"
98)
99
100const (
101	// The background scavenger is paced according to these parameters.
102	//
103	// scavengePercent represents the portion of mutator time we're willing
104	// to spend on scavenging in percent.
105	scavengePercent = 1 // 1%
106
107	// retainExtraPercent represents the amount of memory over the heap goal
108	// that the scavenger should keep as a buffer space for the allocator.
109	// This constant is used when we do not have a memory limit set.
110	//
111	// The purpose of maintaining this overhead is to have a greater pool of
112	// unscavenged memory available for allocation (since using scavenged memory
113	// incurs an additional cost), to account for heap fragmentation and
114	// the ever-changing layout of the heap.
115	retainExtraPercent = 10
116
117	// reduceExtraPercent represents the amount of memory under the limit
118	// that the scavenger should target. For example, 5 means we target 95%
119	// of the limit.
120	//
121	// The purpose of shooting lower than the limit is to ensure that, once
122	// close to the limit, the scavenger is working hard to maintain it. If
123	// we have a memory limit set but are far away from it, there's no harm
124	// in leaving up to 100-retainExtraPercent live, and it's more efficient
125	// anyway, for the same reasons that retainExtraPercent exists.
126	reduceExtraPercent = 5
127
128	// maxPagesPerPhysPage is the maximum number of supported runtime pages per
129	// physical page, based on maxPhysPageSize.
130	maxPagesPerPhysPage = maxPhysPageSize / pageSize
131
132	// scavengeCostRatio is the approximate ratio between the costs of using previously
133	// scavenged memory and scavenging memory.
134	//
135	// For most systems the cost of scavenging greatly outweighs the costs
136	// associated with using scavenged memory, making this constant 0. On other systems
137	// (especially ones where "sysUsed" is not just a no-op) this cost is non-trivial.
138	//
	// This ratio is used as part of a multiplicative factor to help the scavenger account
140	// for the additional costs of using scavenged memory in its pacing.
141	scavengeCostRatio = 0.7 * (goos.IsDarwin + goos.IsIos)
142
	// scavChunkHiOccFrac indicates the fraction of pages that need to be allocated
144	// in the chunk in a single GC cycle for it to be considered high density.
145	scavChunkHiOccFrac  = 0.96875
146	scavChunkHiOccPages = uint16(scavChunkHiOccFrac * pallocChunkPages)
147)
148
149// heapRetained returns an estimate of the current heap RSS.
150func heapRetained() uint64 {
151	return gcController.heapInUse.load() + gcController.heapFree.load()
152}
153
154// gcPaceScavenger updates the scavenger's pacing, particularly
155// its rate and RSS goal. For this, it requires the current heapGoal,
156// and the heapGoal for the previous GC cycle.
157//
158// The RSS goal is based on the current heap goal with a small overhead
159// to accommodate non-determinism in the allocator.
160//
161// The pacing is based on scavengePageRate, which applies to both regular and
162// huge pages. See that constant for more information.
163//
164// Must be called whenever GC pacing is updated.
165//
166// mheap_.lock must be held or the world must be stopped.
167func gcPaceScavenger(memoryLimit int64, heapGoal, lastHeapGoal uint64) {
168	assertWorldStoppedOrLockHeld(&mheap_.lock)
169
170	// As described at the top of this file, there are two scavenge goals here: one
171	// for gcPercent and one for memoryLimit. Let's handle the latter first because
172	// it's simpler.
173
174	// We want to target retaining (100-reduceExtraPercent)% of the heap.
175	memoryLimitGoal := uint64(float64(memoryLimit) * (1 - reduceExtraPercent/100.0))
176
177	// mappedReady is comparable to memoryLimit, and represents how much total memory
178	// the Go runtime has committed now (estimated).
179	mappedReady := gcController.mappedReady.Load()
180
	// If we're already below the goal, indicate that we don't need the background
	// scavenger for the memory limit. This may seem worrisome at first, but note
183	// that the allocator will assist the background scavenger in the face of a memory
184	// limit, so we'll be safe even if we stop the scavenger when we shouldn't have.
185	if mappedReady <= memoryLimitGoal {
186		scavenge.memoryLimitGoal.Store(^uint64(0))
187	} else {
188		scavenge.memoryLimitGoal.Store(memoryLimitGoal)
189	}
190
191	// Now handle the gcPercent goal.
192
193	// If we're called before the first GC completed, disable scavenging.
194	// We never scavenge before the 2nd GC cycle anyway (we don't have enough
195	// information about the heap yet) so this is fine, and avoids a fault
196	// or garbage data later.
197	if lastHeapGoal == 0 {
198		scavenge.gcPercentGoal.Store(^uint64(0))
199		return
200	}
201	// Compute our scavenging goal.
202	goalRatio := float64(heapGoal) / float64(lastHeapGoal)
203	gcPercentGoal := uint64(float64(memstats.lastHeapInUse) * goalRatio)
	// Add retainExtraPercent overhead to gcPercentGoal. This calculation
205	// looks strange but the purpose is to arrive at an integer division
206	// (e.g. if retainExtraPercent = 12.5, then we get a divisor of 8)
207	// that also avoids the overflow from a multiplication.
208	gcPercentGoal += gcPercentGoal / (1.0 / (retainExtraPercent / 100.0))
209	// Align it to a physical page boundary to make the following calculations
210	// a bit more exact.
211	gcPercentGoal = (gcPercentGoal + uint64(physPageSize) - 1) &^ (uint64(physPageSize) - 1)
212
213	// Represents where we are now in the heap's contribution to RSS in bytes.
214	//
215	// Guaranteed to always be a multiple of physPageSize on systems where
216	// physPageSize <= pageSize since we map new heap memory at a size larger than
	// any physPageSize and release memory in multiples of the physPageSize.
218	//
219	// However, certain functions recategorize heap memory as other stats (e.g.
220	// stacks) and this happens in multiples of pageSize, so on systems
221	// where physPageSize > pageSize the calculations below will not be exact.
222	// Generally this is OK since we'll be off by at most one regular
223	// physical page.
224	heapRetainedNow := heapRetained()
225
226	// If we're already below our goal, or within one page of our goal, then indicate
227	// that we don't need the background scavenger for maintaining a memory overhead
228	// proportional to the heap goal.
229	if heapRetainedNow <= gcPercentGoal || heapRetainedNow-gcPercentGoal < uint64(physPageSize) {
230		scavenge.gcPercentGoal.Store(^uint64(0))
231	} else {
232		scavenge.gcPercentGoal.Store(gcPercentGoal)
233	}
234}
235
236var scavenge struct {
237	// gcPercentGoal is the amount of retained heap memory (measured by
238	// heapRetained) that the runtime will try to maintain by returning
239	// memory to the OS. This goal is derived from gcController.gcPercent
240	// by choosing to retain enough memory to allocate heap memory up to
241	// the heap goal.
242	gcPercentGoal atomic.Uint64
243
	// memoryLimitGoal is the amount of memory retained by the runtime
	// (measured by gcController.mappedReady) that the runtime will try to
246	// maintain by returning memory to the OS. This goal is derived from
247	// gcController.memoryLimit by choosing to target the memory limit or
248	// some lower target to keep the scavenger working.
249	memoryLimitGoal atomic.Uint64
250
251	// assistTime is the time spent by the allocator scavenging in the last GC cycle.
252	//
253	// This is reset once a GC cycle ends.
254	assistTime atomic.Int64
255
256	// backgroundTime is the time spent by the background scavenger in the last GC cycle.
257	//
258	// This is reset once a GC cycle ends.
259	backgroundTime atomic.Int64
260}
261
262const (
	// It doesn't really matter what value we start at, but it can't be zero, because
264	// that'll cause divide-by-zero issues. Pick something conservative which we'll
265	// also use as a fallback.
266	startingScavSleepRatio = 0.001
267
268	// Spend at least 1 ms scavenging, otherwise the corresponding
269	// sleep time to maintain our desired utilization is too low to
270	// be reliable.
271	minScavWorkTime = 1e6
272)
273
274// Sleep/wait state of the background scavenger.
275var scavenger scavengerState
276
277type scavengerState struct {
278	// lock protects all fields below.
279	lock mutex
280
281	// g is the goroutine the scavenger is bound to.
282	g *g
283
284	// timer is the timer used for the scavenger to sleep.
285	timer *timer
286
287	// sysmonWake signals to sysmon that it should wake the scavenger.
288	sysmonWake atomic.Uint32
289
290	// parked is whether or not the scavenger is parked.
291	parked bool
292
293	// printControllerReset instructs printScavTrace to signal that
294	// the controller was reset.
295	printControllerReset bool
296
297	// targetCPUFraction is the target CPU overhead for the scavenger.
298	targetCPUFraction float64
299
300	// sleepRatio is the ratio of time spent doing scavenging work to
301	// time spent sleeping. This is used to decide how long the scavenger
302	// should sleep for in between batches of work. It is set by
303	// critSleepController in order to maintain a CPU overhead of
304	// targetCPUFraction.
305	//
306	// Lower means more sleep, higher means more aggressive scavenging.
307	sleepRatio float64
308
309	// sleepController controls sleepRatio.
310	//
311	// See sleepRatio for more details.
312	sleepController piController
313
314	// controllerCooldown is the time left in nanoseconds during which we avoid
315	// using the controller and we hold sleepRatio at a conservative
316	// value. Used if the controller's assumptions fail to hold.
317	controllerCooldown int64
318
319	// sleepStub is a stub used for testing to avoid actually having
320	// the scavenger sleep.
321	//
	// Unlike the other stubs, this is not populated if left nil.
323	// Instead, it is called when non-nil because any valid implementation
324	// of this function basically requires closing over this scavenger
325	// state, and allocating a closure is not allowed in the runtime as
326	// a matter of policy.
327	sleepStub func(n int64) int64
328
329	// scavenge is a function that scavenges n bytes of memory.
330	// Returns how many bytes of memory it actually scavenged, as
331	// well as the time it took in nanoseconds. Usually mheap.pages.scavenge
332	// with nanotime called around it, but stubbed out for testing.
333	// Like mheap.pages.scavenge, if it scavenges less than n bytes of
334	// memory, the caller may assume the heap is exhausted of scavengable
335	// memory for now.
336	//
337	// If this is nil, it is populated with the real thing in init.
338	scavenge func(n uintptr) (uintptr, int64)
339
340	// shouldStop is a callback called in the work loop and provides a
341	// point that can force the scavenger to stop early, for example because
342	// the scavenge policy dictates too much has been scavenged already.
343	//
344	// If this is nil, it is populated with the real thing in init.
345	shouldStop func() bool
346
347	// gomaxprocs returns the current value of gomaxprocs. Stub for testing.
348	//
349	// If this is nil, it is populated with the real thing in init.
350	gomaxprocs func() int32
351}
352
// init initializes a scavenger state and wires it to the current G.
354//
355// Must be called from a regular goroutine that can allocate.
356func (s *scavengerState) init() {
357	if s.g != nil {
358		throw("scavenger state is already wired")
359	}
360	lockInit(&s.lock, lockRankScavenge)
361	s.g = getg()
362
363	s.timer = new(timer)
364	f := func(s any, _ uintptr, _ int64) {
365		s.(*scavengerState).wake()
366	}
367	s.timer.init(f, s)
368
369	// input: fraction of CPU time actually used.
370	// setpoint: ideal CPU fraction.
371	// output: ratio of time worked to time slept (determines sleep time).
372	//
373	// The output of this controller is somewhat indirect to what we actually
374	// want to achieve: how much time to sleep for. The reason for this definition
375	// is to ensure that the controller's outputs have a direct relationship with
376	// its inputs (as opposed to an inverse relationship), making it somewhat
377	// easier to reason about for tuning purposes.
378	s.sleepController = piController{
379		// Tuned loosely via Ziegler-Nichols process.
380		kp: 0.3375,
381		ti: 3.2e6,
382		tt: 1e9, // 1 second reset time.
383
384		// These ranges seem wide, but we want to give the controller plenty of
385		// room to hunt for the optimal value.
386		min: 0.001,  // 1:1000
387		max: 1000.0, // 1000:1
388	}
389	s.sleepRatio = startingScavSleepRatio
390
391	// Install real functions if stubs aren't present.
392	if s.scavenge == nil {
393		s.scavenge = func(n uintptr) (uintptr, int64) {
394			start := nanotime()
395			r := mheap_.pages.scavenge(n, nil, false)
396			end := nanotime()
397			if start >= end {
398				return r, 0
399			}
400			scavenge.backgroundTime.Add(end - start)
401			return r, end - start
402		}
403	}
404	if s.shouldStop == nil {
405		s.shouldStop = func() bool {
			// If background scavenging is disabled or if there's no work to do, just stop.
407			return heapRetained() <= scavenge.gcPercentGoal.Load() &&
408				gcController.mappedReady.Load() <= scavenge.memoryLimitGoal.Load()
409		}
410	}
411	if s.gomaxprocs == nil {
412		s.gomaxprocs = func() int32 {
413			return gomaxprocs
414		}
415	}
416}
417
418// park parks the scavenger goroutine.
419func (s *scavengerState) park() {
420	lock(&s.lock)
421	if getg() != s.g {
422		throw("tried to park scavenger from another goroutine")
423	}
424	s.parked = true
425	goparkunlock(&s.lock, waitReasonGCScavengeWait, traceBlockSystemGoroutine, 2)
426}
427
428// ready signals to sysmon that the scavenger should be awoken.
429func (s *scavengerState) ready() {
430	s.sysmonWake.Store(1)
431}
432
433// wake immediately unparks the scavenger if necessary.
434//
435// Safe to run without a P.
436func (s *scavengerState) wake() {
437	lock(&s.lock)
438	if s.parked {
439		// Unset sysmonWake, since the scavenger is now being awoken.
440		s.sysmonWake.Store(0)
441
442		// s.parked is unset to prevent a double wake-up.
443		s.parked = false
444
445		// Ready the goroutine by injecting it. We use injectglist instead
446		// of ready or goready in order to allow us to run this function
447		// without a P. injectglist also avoids placing the goroutine in
448		// the current P's runnext slot, which is desirable to prevent
449		// the scavenger from interfering with user goroutine scheduling
450		// too much.
451		var list gList
452		list.push(s.g)
453		injectglist(&list)
454	}
455	unlock(&s.lock)
456}
457
458// sleep puts the scavenger to sleep based on the amount of time that it worked
459// in nanoseconds.
460//
461// Note that this function should only be called by the scavenger.
462//
463// The scavenger may be woken up earlier by a pacing change, and it may not go
464// to sleep at all if there's a pending pacing change.
465func (s *scavengerState) sleep(worked float64) {
466	lock(&s.lock)
467	if getg() != s.g {
468		throw("tried to sleep scavenger from another goroutine")
469	}
470
471	if worked < minScavWorkTime {
472		// This means there wasn't enough work to actually fill up minScavWorkTime.
473		// That's fine; we shouldn't try to do anything with this information
		// because it's going to result in a short enough sleep request that things
475		// will get messy. Just assume we did at least this much work.
476		// All this means is that we'll sleep longer than we otherwise would have.
477		worked = minScavWorkTime
478	}
479
480	// Multiply the critical time by 1 + the ratio of the costs of using
481	// scavenged memory vs. scavenging memory. This forces us to pay down
482	// the cost of reusing this memory eagerly by sleeping for a longer period
483	// of time and scavenging less frequently. More concretely, we avoid situations
484	// where we end up scavenging so often that we hurt allocation performance
485	// because of the additional overheads of using scavenged memory.
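	//
	// For example, on darwin and ios scavengeCostRatio is 0.7 (see its definition
	// above), so 1 ms of measured work is treated as 1.7 ms; on platforms where the
	// ratio is 0, this line is a no-op.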
486	worked *= 1 + scavengeCostRatio
487
488	// sleepTime is the amount of time we're going to sleep, based on the amount
489	// of time we worked, and the sleepRatio.
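	//
	// For example (illustrative): at the conservative startingScavSleepRatio of 0.001,
	// 1 ms of work translates into roughly 1 second of sleep.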
490	sleepTime := int64(worked / s.sleepRatio)
491
492	var slept int64
493	if s.sleepStub == nil {
494		// Set the timer.
495		//
496		// This must happen here instead of inside gopark
497		// because we can't close over any variables without
498		// failing escape analysis.
499		start := nanotime()
500		s.timer.reset(start+sleepTime, 0)
501
502		// Mark ourselves as asleep and go to sleep.
503		s.parked = true
504		goparkunlock(&s.lock, waitReasonSleep, traceBlockSleep, 2)
505
506		// How long we actually slept for.
507		slept = nanotime() - start
508
509		lock(&s.lock)
510		// Stop the timer here because s.wake is unable to do it for us.
511		// We don't really care if we succeed in stopping the timer. One
512		// reason we might fail is that we've already woken up, but the timer
513		// might be in the process of firing on some other P; essentially we're
514		// racing with it. That's totally OK. Double wake-ups are perfectly safe.
515		s.timer.stop()
516		unlock(&s.lock)
517	} else {
518		unlock(&s.lock)
519		slept = s.sleepStub(sleepTime)
520	}
521
522	// Stop here if we're cooling down from the controller.
523	if s.controllerCooldown > 0 {
524		// worked and slept aren't exact measures of time, but it's OK to be a bit
525		// sloppy here. We're just hoping we're avoiding some transient bad behavior.
526		t := slept + int64(worked)
527		if t > s.controllerCooldown {
528			s.controllerCooldown = 0
529		} else {
530			s.controllerCooldown -= t
531		}
532		return
533	}
534
535	// idealFraction is the ideal % of overall application CPU time that we
536	// spend scavenging.
537	idealFraction := float64(scavengePercent) / 100.0
538
539	// Calculate the CPU time spent.
540	//
541	// This may be slightly inaccurate with respect to GOMAXPROCS, but we're
542	// recomputing this often enough relative to GOMAXPROCS changes in general
543	// (it only changes when the world is stopped, and not during a GC) that
544	// that small inaccuracy is in the noise.
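	//
	// For example (illustrative numbers): 1 ms of work and 99 ms of sleep with
	// GOMAXPROCS = 4 gives 1e6 / ((99e6+1e6)*4) = 0.25% of available CPU time.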
545	cpuFraction := worked / ((float64(slept) + worked) * float64(s.gomaxprocs()))
546
547	// Update the critSleepRatio, adjusting until we reach our ideal fraction.
548	var ok bool
549	s.sleepRatio, ok = s.sleepController.next(cpuFraction, idealFraction, float64(slept)+worked)
550	if !ok {
551		// The core assumption of the controller, that we can get a proportional
552		// response, broke down. This may be transient, so temporarily switch to
553		// sleeping a fixed, conservative amount.
554		s.sleepRatio = startingScavSleepRatio
555		s.controllerCooldown = 5e9 // 5 seconds.
556
557		// Signal the scav trace printer to output this.
558		s.controllerFailed()
559	}
560}
561
562// controllerFailed indicates that the scavenger's scheduling
563// controller failed.
564func (s *scavengerState) controllerFailed() {
565	lock(&s.lock)
566	s.printControllerReset = true
567	unlock(&s.lock)
568}
569
570// run is the body of the main scavenging loop.
571//
572// Returns the number of bytes released and the estimated time spent
573// releasing those bytes.
574//
575// Must be run on the scavenger goroutine.
576func (s *scavengerState) run() (released uintptr, worked float64) {
577	lock(&s.lock)
578	if getg() != s.g {
579		throw("tried to run scavenger from another goroutine")
580	}
581	unlock(&s.lock)
582
583	for worked < minScavWorkTime {
584		// If something from outside tells us to stop early, stop.
585		if s.shouldStop() {
586			break
587		}
588
589		// scavengeQuantum is the amount of memory we try to scavenge
590		// in one go. A smaller value means the scavenger is more responsive
591		// to the scheduler in case of e.g. preemption. A larger value means
592		// that the overheads of scavenging are better amortized, so better
593		// scavenging throughput.
594		//
595		// The current value is chosen assuming a cost of ~10µs/physical page
596		// (this is somewhat pessimistic), which implies a worst-case latency of
597		// about 160µs for 4 KiB physical pages. The current value is biased
598		// toward latency over throughput.
599		const scavengeQuantum = 64 << 10
600
601		// Accumulate the amount of time spent scavenging.
602		r, duration := s.scavenge(scavengeQuantum)
603
		// On some platforms we may see a duration of zero (i.e. end <= start in the
		// scavenge callback) if the time it takes to scavenge memory is less than the
		// minimum granularity of the platform's clock (e.g. Windows) or due to clock bugs.
607		//
608		// In this case, just assume scavenging takes 10 µs per regular physical page
609		// (determined empirically), and conservatively ignore the impact of huge pages
610		// on timing.
611		const approxWorkedNSPerPhysicalPage = 10e3
612		if duration == 0 {
613			worked += approxWorkedNSPerPhysicalPage * float64(r/physPageSize)
614		} else {
615			// TODO(mknyszek): If duration is small compared to worked, it could be
616			// rounded down to zero. Probably not a problem in practice because the
617			// values are all within a few orders of magnitude of each other but maybe
618			// worth worrying about.
619			worked += float64(duration)
620		}
621		released += r
622
623		// scavenge does not return until it either finds the requisite amount of
624		// memory to scavenge, or exhausts the heap. If we haven't found enough
625		// to scavenge, then the heap must be exhausted.
626		if r < scavengeQuantum {
627			break
628		}
629		// When using fake time just do one loop.
630		if faketime != 0 {
631			break
632		}
633	}
634	if released > 0 && released < physPageSize {
635		// If this happens, it means that we may have attempted to release part
636		// of a physical page, but the likely effect of that is that it released
637		// the whole physical page, some of which may have still been in-use.
638		// This could lead to memory corruption. Throw.
639		throw("released less than one physical page of memory")
640	}
641	return
642}
643
644// Background scavenger.
645//
646// The background scavenger maintains the RSS of the application below
647// the line described by the proportional scavenging statistics in
648// the mheap struct.
649func bgscavenge(c chan int) {
650	scavenger.init()
651
652	c <- 1
653	scavenger.park()
654
655	for {
656		released, workTime := scavenger.run()
657		if released == 0 {
658			scavenger.park()
659			continue
660		}
661		mheap_.pages.scav.releasedBg.Add(released)
662		scavenger.sleep(workTime)
663	}
664}
665
666// scavenge scavenges nbytes worth of free pages, starting with the
667// highest address first. Successive calls continue from where it left
668// off until the heap is exhausted. force makes all memory available to
669// scavenge, ignoring huge page heuristics.
670//
671// Returns the amount of memory scavenged in bytes.
672//
673// scavenge always tries to scavenge nbytes worth of memory, and will
674// only fail to do so if the heap is exhausted for now.
675func (p *pageAlloc) scavenge(nbytes uintptr, shouldStop func() bool, force bool) uintptr {
676	released := uintptr(0)
677	for released < nbytes {
678		ci, pageIdx := p.scav.index.find(force)
679		if ci == 0 {
680			break
681		}
682		systemstack(func() {
683			released += p.scavengeOne(ci, pageIdx, nbytes-released)
684		})
685		if shouldStop != nil && shouldStop() {
686			break
687		}
688	}
689	return released
690}
691
692// printScavTrace prints a scavenge trace line to standard error.
693//
694// released should be the amount of memory released since the last time this
695// was called, and forced indicates whether the scavenge was forced by the
696// application.
697//
698// scavenger.lock must be held.
699func printScavTrace(releasedBg, releasedEager uintptr, forced bool) {
700	assertLockHeld(&scavenger.lock)
701
702	printlock()
703	print("scav ",
704		releasedBg>>10, " KiB work (bg), ",
705		releasedEager>>10, " KiB work (eager), ",
706		gcController.heapReleased.load()>>10, " KiB now, ",
707		(gcController.heapInUse.load()*100)/heapRetained(), "% util",
708	)
709	if forced {
710		print(" (forced)")
711	} else if scavenger.printControllerReset {
712		print(" [controller reset]")
713		scavenger.printControllerReset = false
714	}
715	println()
716	printunlock()
717}
718
719// scavengeOne walks over the chunk at chunk index ci and searches for
720// a contiguous run of pages to scavenge. It will try to scavenge
721// at most max bytes at once, but may scavenge more to avoid
722// breaking huge pages. Once it scavenges some memory it returns
723// how much it scavenged in bytes.
724//
725// searchIdx is the page index to start searching from in ci.
726//
727// Returns the number of bytes scavenged.
728//
729// Must run on the systemstack because it acquires p.mheapLock.
730//
731//go:systemstack
732func (p *pageAlloc) scavengeOne(ci chunkIdx, searchIdx uint, max uintptr) uintptr {
733	// Calculate the maximum number of pages to scavenge.
734	//
735	// This should be alignUp(max, pageSize) / pageSize but max can and will
736	// be ^uintptr(0), so we need to be very careful not to overflow here.
737	// Rather than use alignUp, calculate the number of pages rounded down
738	// first, then add back one if necessary.
739	maxPages := max / pageSize
740	if max%pageSize != 0 {
741		maxPages++
742	}
743
744	// Calculate the minimum number of pages we can scavenge.
745	//
746	// Because we can only scavenge whole physical pages, we must
747	// ensure that we scavenge at least minPages each time, aligned
748	// to minPages*pageSize.
749	minPages := physPageSize / pageSize
750	if minPages < 1 {
751		minPages = 1
752	}
753
754	lock(p.mheapLock)
755	if p.summary[len(p.summary)-1][ci].max() >= uint(minPages) {
		// We only bother looking for a candidate if there are at least
757		// minPages free pages at all.
758		base, npages := p.chunkOf(ci).findScavengeCandidate(searchIdx, minPages, maxPages)
759
760		// If we found something, scavenge it and return!
761		if npages != 0 {
762			// Compute the full address for the start of the range.
763			addr := chunkBase(ci) + uintptr(base)*pageSize
764
765			// Mark the range we're about to scavenge as allocated, because
766			// we don't want any allocating goroutines to grab it while
767			// the scavenging is in progress. Be careful here -- just do the
768			// bare minimum to avoid stepping on our own scavenging stats.
769			p.chunkOf(ci).allocRange(base, npages)
770			p.update(addr, uintptr(npages), true, true)
771
772			// With that done, it's safe to unlock.
773			unlock(p.mheapLock)
774
775			if !p.test {
776				// Only perform sys* operations if we're not in a test.
777				// It's dangerous to do so otherwise.
778				sysUnused(unsafe.Pointer(addr), uintptr(npages)*pageSize)
779
780				// Update global accounting only when not in test, otherwise
781				// the runtime's accounting will be wrong.
782				nbytes := int64(npages * pageSize)
783				gcController.heapReleased.add(nbytes)
784				gcController.heapFree.add(-nbytes)
785
786				stats := memstats.heapStats.acquire()
787				atomic.Xaddint64(&stats.committed, -nbytes)
788				atomic.Xaddint64(&stats.released, nbytes)
789				memstats.heapStats.release()
790			}
791
792			// Relock the heap, because now we need to make these pages
			// available for allocation. Free them back to the page allocator.
794			lock(p.mheapLock)
795			if b := (offAddr{addr}); b.lessThan(p.searchAddr) {
796				p.searchAddr = b
797			}
798			p.chunkOf(ci).free(base, npages)
799			p.update(addr, uintptr(npages), true, false)
800
801			// Mark the range as scavenged.
802			p.chunkOf(ci).scavenged.setRange(base, npages)
803			unlock(p.mheapLock)
804
805			return uintptr(npages) * pageSize
806		}
807	}
808	// Mark this chunk as having no free pages.
809	p.scav.index.setEmpty(ci)
810	unlock(p.mheapLock)
811
812	return 0
813}
814
815// fillAligned returns x but with all zeroes in m-aligned
816// groups of m bits set to 1 if any bit in the group is non-zero.
817//
818// For example, fillAligned(0x0100a3, 8) == 0xff00ff.
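// (Here each nonzero byte, 0xa3 and 0x01, is filled to 0xff, while the all-zero
// byte between them stays zero.)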
819//
820// Note that if m == 1, this is a no-op.
821//
822// m must be a power of 2 <= maxPagesPerPhysPage.
823func fillAligned(x uint64, m uint) uint64 {
824	apply := func(x uint64, c uint64) uint64 {
		// The technique used here is derived from
826		// https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
827		// and extended for more than just bytes (like nibbles
828		// and uint16s) by using an appropriate constant.
829		//
830		// To summarize the technique, quoting from that page:
831		// "[It] works by first zeroing the high bits of the [8]
832		// bytes in the word. Subsequently, it adds a number that
833		// will result in an overflow to the high bit of a byte if
834		// any of the low bits were initially set. Next the high
835		// bits of the original word are ORed with these values;
836		// thus, the high bit of a byte is set iff any bit in the
837		// byte was set. Finally, we determine if any of these high
838		// bits are zero by ORing with ones everywhere except the
839		// high bits and inverting the result."
840		return ^((((x & c) + c) | x) | c)
841	}
842	// Transform x to contain a 1 bit at the top of each m-aligned
843	// group of m zero bits.
844	switch m {
845	case 1:
846		return x
847	case 2:
848		x = apply(x, 0x5555555555555555)
849	case 4:
850		x = apply(x, 0x7777777777777777)
851	case 8:
852		x = apply(x, 0x7f7f7f7f7f7f7f7f)
853	case 16:
854		x = apply(x, 0x7fff7fff7fff7fff)
855	case 32:
856		x = apply(x, 0x7fffffff7fffffff)
857	case 64: // == maxPagesPerPhysPage
858		x = apply(x, 0x7fffffffffffffff)
859	default:
860		throw("bad m value")
861	}
	// Now, the top bit of each m-aligned group in x is set iff
863	// that group was all zero in the original x.
864
865	// From each group of m bits subtract 1.
866	// Because we know only the top bits of each
867	// m-aligned group are set, we know this will
868	// set each group to have all the bits set except
869	// the top bit, so just OR with the original
870	// result to set all the bits.
871	return ^((x - (x >> (m - 1))) | x)
872}
873
874// findScavengeCandidate returns a start index and a size for this pallocData
875// segment which represents a contiguous region of free and unscavenged memory.
876//
877// searchIdx indicates the page index within this chunk to start the search, but
878// note that findScavengeCandidate searches backwards through the pallocData. As
879// a result, it will return the highest scavenge candidate in address order.
880//
881// min indicates a hard minimum size and alignment for runs of pages. That is,
882// findScavengeCandidate will not return a region smaller than min pages in size,
883// or that is min pages or greater in size but not aligned to min. min must be
884// a non-zero power of 2 <= maxPagesPerPhysPage.
885//
886// max is a hint for how big of a region is desired. If max >= pallocChunkPages, then
887// findScavengeCandidate effectively returns entire free and unscavenged regions.
888// If max < pallocChunkPages, it may truncate the returned region such that size is
889// max. However, findScavengeCandidate may still return a larger region if, for
890// example, it chooses to preserve huge pages, or if max is not aligned to min (it
891// will round up). That is, even if max is small, the returned size is not guaranteed
892// to be equal to max. max is allowed to be less than min, in which case it is as if
893// max == min.
894func (m *pallocData) findScavengeCandidate(searchIdx uint, minimum, max uintptr) (uint, uint) {
895	if minimum&(minimum-1) != 0 || minimum == 0 {
896		print("runtime: min = ", minimum, "\n")
897		throw("min must be a non-zero power of 2")
898	} else if minimum > maxPagesPerPhysPage {
899		print("runtime: min = ", minimum, "\n")
900		throw("min too large")
901	}
902	// max may not be min-aligned, so we might accidentally truncate to
903	// a max value which causes us to return a non-min-aligned value.
904	// To prevent this, align max up to a multiple of min (which is always
905	// a power of 2). This also prevents max from ever being less than
906	// min, unless it's zero, so handle that explicitly.
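	//
	// For example (illustrative): min = 4 and max = 6 yields max = 8 here, so any
	// truncation below still produces a min-aligned region whose size is a
	// multiple of min.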
907	if max == 0 {
908		max = minimum
909	} else {
910		max = alignUp(max, minimum)
911	}
912
913	i := int(searchIdx / 64)
914	// Start by quickly skipping over blocks of non-free or scavenged pages.
915	for ; i >= 0; i-- {
916		// 1s are scavenged OR non-free => 0s are unscavenged AND free
917		x := fillAligned(m.scavenged[i]|m.pallocBits[i], uint(minimum))
918		if x != ^uint64(0) {
919			break
920		}
921	}
922	if i < 0 {
923		// Failed to find any free/unscavenged pages.
924		return 0, 0
925	}
926	// We have something in the 64-bit chunk at i, but it could
927	// extend further. Loop until we find the extent of it.
928
929	// 1s are scavenged OR non-free => 0s are unscavenged AND free
930	x := fillAligned(m.scavenged[i]|m.pallocBits[i], uint(minimum))
931	z1 := uint(sys.LeadingZeros64(^x))
932	run, end := uint(0), uint(i)*64+(64-z1)
933	if x<<z1 != 0 {
934		// After shifting out z1 bits, we still have 1s,
935		// so the run ends inside this word.
936		run = uint(sys.LeadingZeros64(x << z1))
937	} else {
938		// After shifting out z1 bits, we have no more 1s.
939		// This means the run extends to the bottom of the
940		// word so it may extend into further words.
941		run = 64 - z1
942		for j := i - 1; j >= 0; j-- {
943			x := fillAligned(m.scavenged[j]|m.pallocBits[j], uint(minimum))
944			run += uint(sys.LeadingZeros64(x))
945			if x != 0 {
946				// The run stopped in this word.
947				break
948			}
949		}
950	}
951
952	// Split the run we found if it's larger than max but hold on to
953	// our original length, since we may need it later.
954	size := min(run, uint(max))
955	start := end - size
956
957	// Each huge page is guaranteed to fit in a single palloc chunk.
958	//
959	// TODO(mknyszek): Support larger huge page sizes.
960	// TODO(mknyszek): Consider taking pages-per-huge-page as a parameter
961	// so we can write tests for this.
962	if physHugePageSize > pageSize && physHugePageSize > physPageSize {
963		// We have huge pages, so let's ensure we don't break one by scavenging
964		// over a huge page boundary. If the range [start, start+size) overlaps with
965		// a free-and-unscavenged huge page, we want to grow the region we scavenge
966		// to include that huge page.
967
968		// Compute the huge page boundary above our candidate.
969		pagesPerHugePage := physHugePageSize / pageSize
970		hugePageAbove := uint(alignUp(uintptr(start), pagesPerHugePage))
971
972		// If that boundary is within our current candidate, then we may be breaking
973		// a huge page.
974		if hugePageAbove <= end {
975			// Compute the huge page boundary below our candidate.
976			hugePageBelow := uint(alignDown(uintptr(start), pagesPerHugePage))
977
978			if hugePageBelow >= end-run {
				// We're in danger of breaking apart a huge page: [start, start+size) crosses
				// a huge page boundary, and the huge page boundary below start still lies
				// within the free run we found. Grow the candidate down to that boundary so
				// the entire huge page stays within the range we scavenge.
983				size = size + (start - hugePageBelow)
984				start = hugePageBelow
985			}
986		}
987	}
988	return start, size
989}
990
991// scavengeIndex is a structure for efficiently managing which pageAlloc chunks have
992// memory available to scavenge.
993type scavengeIndex struct {
994	// chunks is a scavChunkData-per-chunk structure that indicates the presence of pages
995	// available for scavenging. Updates to the index are serialized by the pageAlloc lock.
996	//
997	// It tracks chunk occupancy and a generation counter per chunk. If a chunk's occupancy
	// never exceeds scavChunkHiOccPages over the course of a single GC cycle, the chunk
999	// becomes eligible for scavenging on the next cycle. If a chunk ever hits this density
1000	// threshold it immediately becomes unavailable for scavenging in the current cycle as
1001	// well as the next.
1002	//
1003	// [min, max) represents the range of chunks that is safe to access (i.e. will not cause
1004	// a fault). As an optimization minHeapIdx represents the true minimum chunk that has been
1005	// mapped, since min is likely rounded down to include the system page containing minHeapIdx.
1006	//
1007	// For a chunk size of 4 MiB this structure will only use 2 MiB for a 1 TiB contiguous heap.
1008	chunks     []atomicScavChunkData
1009	min, max   atomic.Uintptr
1010	minHeapIdx atomic.Uintptr
1011
1012	// searchAddr* is the maximum address (in the offset address space, so we have a linear
1013	// view of the address space; see mranges.go:offAddr) containing memory available to
1014	// scavenge. It is a hint to the find operation to avoid O(n^2) behavior in repeated lookups.
1015	//
1016	// searchAddr* is always inclusive and should be the base address of the highest runtime
1017	// page available for scavenging.
1018	//
1019	// searchAddrForce is managed by find and free.
1020	// searchAddrBg is managed by find and nextGen.
1021	//
1022	// Normally, find monotonically decreases searchAddr* as it finds no more free pages to
1023	// scavenge. However, mark, when marking a new chunk at an index greater than the current
1024	// searchAddr, sets searchAddr to the *negative* index into chunks of that page. The trick here
1025	// is that concurrent calls to find will fail to monotonically decrease searchAddr*, and so they
1026	// won't barge over new memory becoming available to scavenge. Furthermore, this ensures
1027	// that some future caller of find *must* observe the new high index. That caller
1028	// (or any other racing with it), then makes searchAddr positive before continuing, bringing
1029	// us back to our monotonically decreasing steady-state.
1030	//
1031	// A pageAlloc lock serializes updates between min, max, and searchAddr, so abs(searchAddr)
1032	// is always guaranteed to be >= min and < max (converted to heap addresses).
1033	//
1034	// searchAddrBg is increased only on each new generation and is mainly used by the
1035	// background scavenger and heap-growth scavenging. searchAddrForce is increased continuously
1036	// as memory gets freed and is mainly used by eager memory reclaim such as debug.FreeOSMemory
1037	// and scavenging to maintain the memory limit.
1038	searchAddrBg    atomicOffAddr
1039	searchAddrForce atomicOffAddr
1040
1041	// freeHWM is the highest address (in offset address space) that was freed
1042	// this generation.
1043	freeHWM offAddr
1044
1045	// Generation counter. Updated by nextGen at the end of each mark phase.
1046	gen uint32
1047
1048	// test indicates whether or not we're in a test.
1049	test bool
1050}
1051
1052// init initializes the scavengeIndex.
1053//
1054// Returns the amount added to sysStat.
1055func (s *scavengeIndex) init(test bool, sysStat *sysMemStat) uintptr {
1056	s.searchAddrBg.Clear()
1057	s.searchAddrForce.Clear()
1058	s.freeHWM = minOffAddr
1059	s.test = test
1060	return s.sysInit(test, sysStat)
1061}
1062
// grow updates the index's backing store in response to a heap growth.
1064//
1065// Returns the amount of memory added to sysStat.
1066func (s *scavengeIndex) grow(base, limit uintptr, sysStat *sysMemStat) uintptr {
1067	// Update minHeapIdx. Note that even if there's no mapping work to do,
1068	// we may still have a new, lower minimum heap address.
1069	minHeapIdx := s.minHeapIdx.Load()
1070	if baseIdx := uintptr(chunkIndex(base)); minHeapIdx == 0 || baseIdx < minHeapIdx {
1071		s.minHeapIdx.Store(baseIdx)
1072	}
1073	return s.sysGrow(base, limit, sysStat)
1074}
1075
1076// find returns the highest chunk index that may contain pages available to scavenge.
1077// It also returns an offset to start searching in the highest chunk.
1078func (s *scavengeIndex) find(force bool) (chunkIdx, uint) {
1079	cursor := &s.searchAddrBg
1080	if force {
1081		cursor = &s.searchAddrForce
1082	}
1083	searchAddr, marked := cursor.Load()
1084	if searchAddr == minOffAddr.addr() {
1085		// We got a cleared search addr.
1086		return 0, 0
1087	}
1088
1089	// Starting from searchAddr's chunk, iterate until we find a chunk with pages to scavenge.
1090	gen := s.gen
1091	min := chunkIdx(s.minHeapIdx.Load())
1092	start := chunkIndex(searchAddr)
	// N.B. We'll never map the 0'th chunk, so minHeapIdx keeps this loop from wrapping
	// around: i is unsigned, so decrementing it past zero would wrap to a huge value.
1094	for i := start; i >= min; i-- {
		// Skip over chunks that shouldn't be scavenged right now.
1096		if !s.chunks[i].load().shouldScavenge(gen, force) {
1097			continue
1098		}
1099		// We're still scavenging this chunk.
1100		if i == start {
1101			return i, chunkPageIndex(searchAddr)
1102		}
1103		// Try to reduce searchAddr to newSearchAddr.
1104		newSearchAddr := chunkBase(i) + pallocChunkBytes - pageSize
1105		if marked {
1106			// Attempt to be the first one to decrease the searchAddr
1107			// after an increase. If we fail, that means there was another
1108			// increase, or somebody else got to it before us. Either way,
1109			// it doesn't matter. We may lose some performance having an
1110			// incorrect search address, but it's far more important that
1111			// we don't miss updates.
1112			cursor.StoreUnmark(searchAddr, newSearchAddr)
1113		} else {
1114			// Decrease searchAddr.
1115			cursor.StoreMin(newSearchAddr)
1116		}
1117		return i, pallocChunkPages - 1
1118	}
1119	// Clear searchAddr, because we've exhausted the heap.
1120	cursor.Clear()
1121	return 0, 0
1122}
1123
// alloc updates metadata for chunk at index ci with the fact that
// an allocation of npages occurred.
1130//
1131// alloc may only run concurrently with find.
1132func (s *scavengeIndex) alloc(ci chunkIdx, npages uint) {
1133	sc := s.chunks[ci].load()
1134	sc.alloc(npages, s.gen)
1135	// TODO(mknyszek): Consider eagerly backing memory with huge pages
1136	// here and track whether we believe this chunk is backed by huge pages.
1137	// In the past we've attempted to use sysHugePageCollapse (which uses
	// MADV_COLLAPSE on Linux, and is unsupported elsewhere) for this purpose,
1139	// but that caused performance issues in production environments.
1140	s.chunks[ci].store(sc)
1141}
1142
1143// free updates metadata for chunk at index ci with the fact that
1144// a free of npages occurred.
1145//
1146// free may only run concurrently with find.
1147func (s *scavengeIndex) free(ci chunkIdx, page, npages uint) {
1148	sc := s.chunks[ci].load()
1149	sc.free(npages, s.gen)
1150	s.chunks[ci].store(sc)
1151
1152	// Update scavenge search addresses.
1153	addr := chunkBase(ci) + uintptr(page+npages-1)*pageSize
1154	if s.freeHWM.lessThan(offAddr{addr}) {
1155		s.freeHWM = offAddr{addr}
1156	}
1157	// N.B. Because free is serialized, it's not necessary to do a
1158	// full CAS here. free only ever increases searchAddr, while
1159	// find only ever decreases it. Since we only ever race with
1160	// decreases, even if the value we loaded is stale, the actual
1161	// value will never be larger.
1162	searchAddr, _ := s.searchAddrForce.Load()
1163	if (offAddr{searchAddr}).lessThan(offAddr{addr}) {
1164		s.searchAddrForce.StoreMarked(addr)
1165	}
1166}
1167
1168// nextGen moves the scavenger forward one generation. Must be called
1169// once per GC cycle, but may be called more often to force more memory
1170// to be released.
1171//
1172// nextGen may only run concurrently with find.
1173func (s *scavengeIndex) nextGen() {
1174	s.gen++
1175	searchAddr, _ := s.searchAddrBg.Load()
1176	if (offAddr{searchAddr}).lessThan(s.freeHWM) {
1177		s.searchAddrBg.StoreMarked(s.freeHWM.addr())
1178	}
1179	s.freeHWM = minOffAddr
1180}
1181
1182// setEmpty marks that the scavenger has finished looking at ci
1183// for now to prevent the scavenger from getting stuck looking
1184// at the same chunk.
1185//
1186// setEmpty may only run concurrently with find.
1187func (s *scavengeIndex) setEmpty(ci chunkIdx) {
1188	val := s.chunks[ci].load()
1189	val.setEmpty()
1190	s.chunks[ci].store(val)
1191}
1192
1193// atomicScavChunkData is an atomic wrapper around a scavChunkData
1194// that stores it in its packed form.
1195type atomicScavChunkData struct {
1196	value atomic.Uint64
1197}
1198
1199// load loads and unpacks a scavChunkData.
1200func (sc *atomicScavChunkData) load() scavChunkData {
1201	return unpackScavChunkData(sc.value.Load())
1202}
1203
1204// store packs and writes a new scavChunkData. store must be serialized
1205// with other calls to store.
1206func (sc *atomicScavChunkData) store(ssc scavChunkData) {
1207	sc.value.Store(ssc.pack())
1208}
1209
1210// scavChunkData tracks information about a palloc chunk for
1211// scavenging. It packs well into 64 bits.
1212//
1213// The zero value always represents a valid newly-grown chunk.
1214type scavChunkData struct {
1215	// inUse indicates how many pages in this chunk are currently
1216	// allocated.
1217	//
1218	// Only the first 10 bits are used.
1219	inUse uint16
1220
1221	// lastInUse indicates how many pages in this chunk were allocated
1222	// when we transitioned from gen-1 to gen.
1223	//
1224	// Only the first 10 bits are used.
1225	lastInUse uint16
1226
1227	// gen is the generation counter from a scavengeIndex from the
1228	// last time this scavChunkData was updated.
1229	gen uint32
1230
1231	// scavChunkFlags represents additional flags
1232	//
1233	// Note: only 6 bits are available.
1234	scavChunkFlags
1235}
1236
1237// unpackScavChunkData unpacks a scavChunkData from a uint64.
1238func unpackScavChunkData(sc uint64) scavChunkData {
1239	return scavChunkData{
1240		inUse:          uint16(sc),
1241		lastInUse:      uint16(sc>>16) & scavChunkInUseMask,
1242		gen:            uint32(sc >> 32),
1243		scavChunkFlags: scavChunkFlags(uint8(sc>>(16+logScavChunkInUseMax)) & scavChunkFlagsMask),
1244	}
1245}
1246
1247// pack returns sc packed into a uint64.
1248func (sc scavChunkData) pack() uint64 {
1249	return uint64(sc.inUse) |
1250		(uint64(sc.lastInUse) << 16) |
1251		(uint64(sc.scavChunkFlags) << (16 + logScavChunkInUseMax)) |
1252		(uint64(sc.gen) << 32)
1253}
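
// The packed layout implied by pack and unpackScavChunkData (assuming the 10-bit
// page counts noted on scavChunkData's fields, i.e. logScavChunkInUseMax = 10) is:
//
//	bits  0-15: inUse (only the low logScavChunkInUseMax bits are meaningful)
//	bits 16-25: lastInUse
//	bits 26-31: scavChunkFlags
//	bits 32-63: gen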
1254
1255const (
1256	// scavChunkHasFree indicates whether the chunk has anything left to
1257	// scavenge. This is the opposite of "empty," used elsewhere in this
1258	// file. The reason we say "HasFree" here is so the zero value is
1259	// correct for a newly-grown chunk. (New memory is scavenged.)
1260	scavChunkHasFree scavChunkFlags = 1 << iota
1261
1262	// scavChunkMaxFlags is the maximum number of flags we can have, given how
1263	// a scavChunkData is packed into 8 bytes.
1264	scavChunkMaxFlags  = 6
1265	scavChunkFlagsMask = (1 << scavChunkMaxFlags) - 1
1266
1267	// logScavChunkInUseMax is the number of bits needed to represent the number
1268	// of pages allocated in a single chunk. This is 1 more than log2 of the
1269	// number of pages in the chunk because we need to represent a fully-allocated
1270	// chunk.
1271	logScavChunkInUseMax = logPallocChunkPages + 1
1272	scavChunkInUseMask   = (1 << logScavChunkInUseMax) - 1
1273)
1274
1275// scavChunkFlags is a set of bit-flags for the scavenger for each palloc chunk.
1276type scavChunkFlags uint8
1277
1278// isEmpty returns true if the hasFree flag is unset.
1279func (sc *scavChunkFlags) isEmpty() bool {
1280	return (*sc)&scavChunkHasFree == 0
1281}
1282
1283// setEmpty clears the hasFree flag.
1284func (sc *scavChunkFlags) setEmpty() {
1285	*sc &^= scavChunkHasFree
1286}
1287
1288// setNonEmpty sets the hasFree flag.
1289func (sc *scavChunkFlags) setNonEmpty() {
1290	*sc |= scavChunkHasFree
1291}
1292
1293// shouldScavenge returns true if the corresponding chunk should be interrogated
1294// by the scavenger.
1295func (sc scavChunkData) shouldScavenge(currGen uint32, force bool) bool {
1296	if sc.isEmpty() {
1297		// Nothing to scavenge.
1298		return false
1299	}
1300	if force {
1301		// We're forcing the memory to be scavenged.
1302		return true
1303	}
1304	if sc.gen == currGen {
1305		// In the current generation, if either the current or last generation
1306		// is dense, then skip scavenging. Inverting that, we should scavenge
1307		// if both the current and last generation were not dense.
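		// With the usual 512-page chunks and scavChunkHiOccFrac = 0.96875, that
		// density threshold (scavChunkHiOccPages) works out to 496 of 512 pages.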
1308		return sc.inUse < scavChunkHiOccPages && sc.lastInUse < scavChunkHiOccPages
1309	}
1310	// If we're one or more generations ahead, we know inUse represents the current
1311	// state of the chunk, since otherwise it would've been updated already.
1312	return sc.inUse < scavChunkHiOccPages
1313}
1314
1315// alloc updates sc given that npages were allocated in the corresponding chunk.
1316func (sc *scavChunkData) alloc(npages uint, newGen uint32) {
1317	if uint(sc.inUse)+npages > pallocChunkPages {
1318		print("runtime: inUse=", sc.inUse, " npages=", npages, "\n")
1319		throw("too many pages allocated in chunk?")
1320	}
1321	if sc.gen != newGen {
1322		sc.lastInUse = sc.inUse
1323		sc.gen = newGen
1324	}
1325	sc.inUse += uint16(npages)
1326	if sc.inUse == pallocChunkPages {
1327		// There's nothing for the scavenger to take from here.
1328		sc.setEmpty()
1329	}
1330}
1331
1332// free updates sc given that npages was freed in the corresponding chunk.
1333func (sc *scavChunkData) free(npages uint, newGen uint32) {
1334	if uint(sc.inUse) < npages {
1335		print("runtime: inUse=", sc.inUse, " npages=", npages, "\n")
1336		throw("allocated pages below zero?")
1337	}
1338	if sc.gen != newGen {
1339		sc.lastInUse = sc.inUse
1340		sc.gen = newGen
1341	}
1342	sc.inUse -= uint16(npages)
1343	// The scavenger can no longer be done with this chunk now that
1344	// new memory has been freed into it.
1345	sc.setNonEmpty()
1346}
1347
1348type piController struct {
1349	kp float64 // Proportional constant.
1350	ti float64 // Integral time constant.
1351	tt float64 // Reset time.
1352
1353	min, max float64 // Output boundaries.
1354
1355	// PI controller state.
1356
1357	errIntegral float64 // Integral of the error from t=0 to now.
1358
1359	// Error flags.
1360	errOverflow   bool // Set if errIntegral ever overflowed.
1361	inputOverflow bool // Set if an operation with the input overflowed.
1362}
1363
1364// next provides a new sample to the controller.
1365//
1366// input is the sample, setpoint is the desired point, and period is how much
1367// time (in whatever unit makes the most sense) has passed since the last sample.
1368//
1369// Returns a new value for the variable it's controlling, and whether the operation
1370// completed successfully. One reason this might fail is if error has been growing
1371// in an unbounded manner, to the point of overflow.
1372//
// In the specific case where an error overflow occurs, the errOverflow field will be
1374// set and the rest of the controller's internal state will be fully reset.
1375func (c *piController) next(input, setpoint, period float64) (float64, bool) {
1376	// Compute the raw output value.
1377	prop := c.kp * (setpoint - input)
1378	rawOutput := prop + c.errIntegral
1379
1380	// Clamp rawOutput into output.
1381	output := rawOutput
1382	if isInf(output) || isNaN(output) {
1383		// The input had a large enough magnitude that either it was already
1384		// overflowed, or some operation with it overflowed.
1385		// Set a flag and reset. That's the safest thing to do.
1386		c.reset()
1387		c.inputOverflow = true
1388		return c.min, false
1389	}
1390	if output < c.min {
1391		output = c.min
1392	} else if output > c.max {
1393		output = c.max
1394	}
1395
1396	// Update the controller's state.
1397	if c.ti != 0 && c.tt != 0 {
1398		c.errIntegral += (c.kp*period/c.ti)*(setpoint-input) + (period/c.tt)*(output-rawOutput)
1399		if isInf(c.errIntegral) || isNaN(c.errIntegral) {
1400			// So much error has accumulated that we managed to overflow.
1401			// The assumptions around the controller have likely broken down.
1402			// Set a flag and reset. That's the safest thing to do.
1403			c.reset()
1404			c.errOverflow = true
1405			return c.min, false
1406		}
1407	}
1408	return output, true
1409}
1410
1411// reset resets the controller state, except for controller error flags.
1412func (c *piController) reset() {
1413	c.errIntegral = 0
1414}
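
// As a usage sketch (illustrative only; it mirrors how scavengerState.sleep drives
// sleepController above, with a hypothetical helper and placeholder inputs):
//
//	var c = piController{kp: 0.3375, ti: 3.2e6, tt: 1e9, min: 0.001, max: 1000.0}
//
//	// updateSleepRatio returns the next work:sleep ratio for the scavenger given
//	// the measured CPU fraction and the length of the last work+sleep period.
//	func updateSleepRatio(cpuFraction, period float64) float64 {
//		// scavengePercent/100.0 is the setpoint: the ideal CPU fraction.
//		ratio, ok := c.next(cpuFraction, scavengePercent/100.0, period)
//		if !ok {
//			// The controller overflowed; fall back to a conservative ratio.
//			ratio = startingScavSleepRatio
//		}
//		return ratio
//	}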
1415