xref: /aosp_15_r20/external/leveldb/doc/impl.md (revision 9507f98c5f32dee4b5f9e4a38cd499f3ff5c4490)
1*9507f98cSAndroid Build Coastguard Worker## Files
2*9507f98cSAndroid Build Coastguard Worker
3*9507f98cSAndroid Build Coastguard WorkerThe implementation of leveldb is similar in spirit to the representation of a
4*9507f98cSAndroid Build Coastguard Workersingle [Bigtable tablet (section 5.3)](https://research.google/pubs/pub27898/).
5*9507f98cSAndroid Build Coastguard WorkerHowever the organization of the files that make up the representation is
6*9507f98cSAndroid Build Coastguard Workersomewhat different and is explained below.
7*9507f98cSAndroid Build Coastguard Worker
8*9507f98cSAndroid Build Coastguard WorkerEach database is represented by a set of files stored in a directory. There are
9*9507f98cSAndroid Build Coastguard Workerseveral different types of files as documented below:
10*9507f98cSAndroid Build Coastguard Worker
11*9507f98cSAndroid Build Coastguard Worker### Log files
12*9507f98cSAndroid Build Coastguard Worker
13*9507f98cSAndroid Build Coastguard WorkerA log file (*.log) stores a sequence of recent updates. Each update is appended
14*9507f98cSAndroid Build Coastguard Workerto the current log file. When the log file reaches a pre-determined size
15*9507f98cSAndroid Build Coastguard Worker(approximately 4MB by default), it is converted to a sorted table (see below)
16*9507f98cSAndroid Build Coastguard Workerand a new log file is created for future updates.
17*9507f98cSAndroid Build Coastguard Worker
18*9507f98cSAndroid Build Coastguard WorkerA copy of the current log file is kept in an in-memory structure (the
19*9507f98cSAndroid Build Coastguard Worker`memtable`). This copy is consulted on every read so that read operations
20*9507f98cSAndroid Build Coastguard Workerreflect all logged updates.
21*9507f98cSAndroid Build Coastguard Worker
22*9507f98cSAndroid Build Coastguard Worker## Sorted tables
23*9507f98cSAndroid Build Coastguard Worker
24*9507f98cSAndroid Build Coastguard WorkerA sorted table (*.ldb) stores a sequence of entries sorted by key. Each entry is
25*9507f98cSAndroid Build Coastguard Workereither a value for the key, or a deletion marker for the key. (Deletion markers
26*9507f98cSAndroid Build Coastguard Workerare kept around to hide obsolete values present in older sorted tables).
27*9507f98cSAndroid Build Coastguard Worker
28*9507f98cSAndroid Build Coastguard WorkerThe set of sorted tables are organized into a sequence of levels. The sorted
29*9507f98cSAndroid Build Coastguard Workertable generated from a log file is placed in a special **young** level (also
30*9507f98cSAndroid Build Coastguard Workercalled level-0). When the number of young files exceeds a certain threshold
31*9507f98cSAndroid Build Coastguard Worker(currently four), all of the young files are merged together with all of the
32*9507f98cSAndroid Build Coastguard Workeroverlapping level-1 files to produce a sequence of new level-1 files (we create
33*9507f98cSAndroid Build Coastguard Workera new level-1 file for every 2MB of data.)
34*9507f98cSAndroid Build Coastguard Worker
35*9507f98cSAndroid Build Coastguard WorkerFiles in the young level may contain overlapping keys. However files in other
36*9507f98cSAndroid Build Coastguard Workerlevels have distinct non-overlapping key ranges. Consider level number L where
37*9507f98cSAndroid Build Coastguard WorkerL >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e., 10MB
38*9507f98cSAndroid Build Coastguard Workerfor level-1, 100MB for level-2, ...), one file in level-L, and all of the
39*9507f98cSAndroid Build Coastguard Workeroverlapping files in level-(L+1) are merged to form a set of new files for
40*9507f98cSAndroid Build Coastguard Workerlevel-(L+1). These merges have the effect of gradually migrating new updates
41*9507f98cSAndroid Build Coastguard Workerfrom the young level to the largest level using only bulk reads and writes
42*9507f98cSAndroid Build Coastguard Worker(i.e., minimizing expensive seeks).
43*9507f98cSAndroid Build Coastguard Worker
44*9507f98cSAndroid Build Coastguard Worker### Manifest
45*9507f98cSAndroid Build Coastguard Worker
46*9507f98cSAndroid Build Coastguard WorkerA MANIFEST file lists the set of sorted tables that make up each level, the
47*9507f98cSAndroid Build Coastguard Workercorresponding key ranges, and other important metadata. A new MANIFEST file
48*9507f98cSAndroid Build Coastguard Worker(with a new number embedded in the file name) is created whenever the database
49*9507f98cSAndroid Build Coastguard Workeris reopened. The MANIFEST file is formatted as a log, and changes made to the
50*9507f98cSAndroid Build Coastguard Workerserving state (as files are added or removed) are appended to this log.
51*9507f98cSAndroid Build Coastguard Worker
52*9507f98cSAndroid Build Coastguard Worker### Current
53*9507f98cSAndroid Build Coastguard Worker
54*9507f98cSAndroid Build Coastguard WorkerCURRENT is a simple text file that contains the name of the latest MANIFEST
55*9507f98cSAndroid Build Coastguard Workerfile.
56*9507f98cSAndroid Build Coastguard Worker
57*9507f98cSAndroid Build Coastguard Worker### Info logs
58*9507f98cSAndroid Build Coastguard Worker
59*9507f98cSAndroid Build Coastguard WorkerInformational messages are printed to files named LOG and LOG.old.
60*9507f98cSAndroid Build Coastguard Worker
61*9507f98cSAndroid Build Coastguard Worker### Others
62*9507f98cSAndroid Build Coastguard Worker
63*9507f98cSAndroid Build Coastguard WorkerOther files used for miscellaneous purposes may also be present (LOCK, *.dbtmp).
64*9507f98cSAndroid Build Coastguard Worker
65*9507f98cSAndroid Build Coastguard Worker## Level 0
66*9507f98cSAndroid Build Coastguard Worker
67*9507f98cSAndroid Build Coastguard WorkerWhen the log file grows above a certain size (4MB by default):
68*9507f98cSAndroid Build Coastguard WorkerCreate a brand new memtable and log file and direct future updates here.
69*9507f98cSAndroid Build Coastguard Worker
70*9507f98cSAndroid Build Coastguard WorkerIn the background:
71*9507f98cSAndroid Build Coastguard Worker
72*9507f98cSAndroid Build Coastguard Worker1. Write the contents of the previous memtable to an sstable.
73*9507f98cSAndroid Build Coastguard Worker2. Discard the memtable.
74*9507f98cSAndroid Build Coastguard Worker3. Delete the old log file and the old memtable.
75*9507f98cSAndroid Build Coastguard Worker4. Add the new sstable to the young (level-0) level.
76*9507f98cSAndroid Build Coastguard Worker
77*9507f98cSAndroid Build Coastguard Worker## Compactions
78*9507f98cSAndroid Build Coastguard Worker
79*9507f98cSAndroid Build Coastguard WorkerWhen the size of level L exceeds its limit, we compact it in a background
80*9507f98cSAndroid Build Coastguard Workerthread. The compaction picks a file from level L and all overlapping files from
81*9507f98cSAndroid Build Coastguard Workerthe next level L+1. Note that if a level-L file overlaps only part of a
82*9507f98cSAndroid Build Coastguard Workerlevel-(L+1) file, the entire file at level-(L+1) is used as an input to the
83*9507f98cSAndroid Build Coastguard Workercompaction and will be discarded after the compaction.  Aside: because level-0
84*9507f98cSAndroid Build Coastguard Workeris special (files in it may overlap each other), we treat compactions from
85*9507f98cSAndroid Build Coastguard Workerlevel-0 to level-1 specially: a level-0 compaction may pick more than one
86*9507f98cSAndroid Build Coastguard Workerlevel-0 file in case some of these files overlap each other.
87*9507f98cSAndroid Build Coastguard Worker
88*9507f98cSAndroid Build Coastguard WorkerA compaction merges the contents of the picked files to produce a sequence of
89*9507f98cSAndroid Build Coastguard Workerlevel-(L+1) files. We switch to producing a new level-(L+1) file after the
90*9507f98cSAndroid Build Coastguard Workercurrent output file has reached the target file size (2MB). We also switch to a
91*9507f98cSAndroid Build Coastguard Workernew output file when the key range of the current output file has grown enough
92*9507f98cSAndroid Build Coastguard Workerto overlap more than ten level-(L+2) files.  This last rule ensures that a later
93*9507f98cSAndroid Build Coastguard Workercompaction of a level-(L+1) file will not pick up too much data from
94*9507f98cSAndroid Build Coastguard Workerlevel-(L+2).
95*9507f98cSAndroid Build Coastguard Worker
96*9507f98cSAndroid Build Coastguard WorkerThe old files are discarded and the new files are added to the serving state.
97*9507f98cSAndroid Build Coastguard Worker
98*9507f98cSAndroid Build Coastguard WorkerCompactions for a particular level rotate through the key space. In more detail,
99*9507f98cSAndroid Build Coastguard Workerfor each level L, we remember the ending key of the last compaction at level L.
100*9507f98cSAndroid Build Coastguard WorkerThe next compaction for level L will pick the first file that starts after this
101*9507f98cSAndroid Build Coastguard Workerkey (wrapping around to the beginning of the key space if there is no such
102*9507f98cSAndroid Build Coastguard Workerfile).
103*9507f98cSAndroid Build Coastguard Worker
104*9507f98cSAndroid Build Coastguard WorkerCompactions drop overwritten values. They also drop deletion markers if there
105*9507f98cSAndroid Build Coastguard Workerare no higher numbered levels that contain a file whose range overlaps the
106*9507f98cSAndroid Build Coastguard Workercurrent key.
107*9507f98cSAndroid Build Coastguard Worker
108*9507f98cSAndroid Build Coastguard Worker### Timing
109*9507f98cSAndroid Build Coastguard Worker
110*9507f98cSAndroid Build Coastguard WorkerLevel-0 compactions will read up to four 1MB files from level-0, and at worst
111*9507f98cSAndroid Build Coastguard Workerall the level-1 files (10MB). I.e., we will read 14MB and write 14MB.
112*9507f98cSAndroid Build Coastguard Worker
113*9507f98cSAndroid Build Coastguard WorkerOther than the special level-0 compactions, we will pick one 2MB file from level
114*9507f98cSAndroid Build Coastguard WorkerL. In the worst case, this will overlap ~ 12 files from level L+1 (10 because
115*9507f98cSAndroid Build Coastguard Workerlevel-(L+1) is ten times the size of level-L, and another two at the boundaries
116*9507f98cSAndroid Build Coastguard Workersince the file ranges at level-L will usually not be aligned with the file
117*9507f98cSAndroid Build Coastguard Workerranges at level-L+1). The compaction will therefore read 26MB and write 26MB.
118*9507f98cSAndroid Build Coastguard WorkerAssuming a disk IO rate of 100MB/s (ballpark range for modern drives), the worst
119*9507f98cSAndroid Build Coastguard Workercompaction cost will be approximately 0.5 second.
120*9507f98cSAndroid Build Coastguard Worker
121*9507f98cSAndroid Build Coastguard WorkerIf we throttle the background writing to something small, say 10% of the full
122*9507f98cSAndroid Build Coastguard Worker100MB/s speed, a compaction may take up to 5 seconds. If the user is writing at
123*9507f98cSAndroid Build Coastguard Worker10MB/s, we might build up lots of level-0 files (~50 to hold the 5*10MB). This
124*9507f98cSAndroid Build Coastguard Workermay significantly increase the cost of reads due to the overhead of merging more
125*9507f98cSAndroid Build Coastguard Workerfiles together on every read.
126*9507f98cSAndroid Build Coastguard Worker
127*9507f98cSAndroid Build Coastguard WorkerSolution 1: To reduce this problem, we might want to increase the log switching
128*9507f98cSAndroid Build Coastguard Workerthreshold when the number of level-0 files is large. Though the downside is that
129*9507f98cSAndroid Build Coastguard Workerthe larger this threshold, the more memory we will need to hold the
130*9507f98cSAndroid Build Coastguard Workercorresponding memtable.
131*9507f98cSAndroid Build Coastguard Worker
132*9507f98cSAndroid Build Coastguard WorkerSolution 2: We might want to decrease write rate artificially when the number of
133*9507f98cSAndroid Build Coastguard Workerlevel-0 files goes up.
134*9507f98cSAndroid Build Coastguard Worker
135*9507f98cSAndroid Build Coastguard WorkerSolution 3: We work on reducing the cost of very wide merges. Perhaps most of
136*9507f98cSAndroid Build Coastguard Workerthe level-0 files will have their blocks sitting uncompressed in the cache and
137*9507f98cSAndroid Build Coastguard Workerwe will only need to worry about the O(N) complexity in the merging iterator.
138*9507f98cSAndroid Build Coastguard Worker
139*9507f98cSAndroid Build Coastguard Worker### Number of files
140*9507f98cSAndroid Build Coastguard Worker
141*9507f98cSAndroid Build Coastguard WorkerInstead of always making 2MB files, we could make larger files for larger levels
142*9507f98cSAndroid Build Coastguard Workerto reduce the total file count, though at the expense of more bursty
143*9507f98cSAndroid Build Coastguard Workercompactions.  Alternatively, we could shard the set of files into multiple
144*9507f98cSAndroid Build Coastguard Workerdirectories.
145*9507f98cSAndroid Build Coastguard Worker
146*9507f98cSAndroid Build Coastguard WorkerAn experiment on an ext3 filesystem on Feb 04, 2011 shows the following timings
147*9507f98cSAndroid Build Coastguard Workerto do 100K file opens in directories with varying number of files:
148*9507f98cSAndroid Build Coastguard Worker
149*9507f98cSAndroid Build Coastguard Worker
150*9507f98cSAndroid Build Coastguard Worker| Files in directory | Microseconds to open a file |
151*9507f98cSAndroid Build Coastguard Worker|-------------------:|----------------------------:|
152*9507f98cSAndroid Build Coastguard Worker|               1000 |                           9 |
153*9507f98cSAndroid Build Coastguard Worker|              10000 |                          10 |
154*9507f98cSAndroid Build Coastguard Worker|             100000 |                          16 |
155*9507f98cSAndroid Build Coastguard Worker
156*9507f98cSAndroid Build Coastguard WorkerSo maybe even the sharding is not necessary on modern filesystems?
157*9507f98cSAndroid Build Coastguard Worker
158*9507f98cSAndroid Build Coastguard Worker## Recovery
159*9507f98cSAndroid Build Coastguard Worker
160*9507f98cSAndroid Build Coastguard Worker* Read CURRENT to find name of the latest committed MANIFEST
161*9507f98cSAndroid Build Coastguard Worker* Read the named MANIFEST file
162*9507f98cSAndroid Build Coastguard Worker* Clean up stale files
163*9507f98cSAndroid Build Coastguard Worker* We could open all sstables here, but it is probably better to be lazy...
164*9507f98cSAndroid Build Coastguard Worker* Convert log chunk to a new level-0 sstable
165*9507f98cSAndroid Build Coastguard Worker* Start directing new writes to a new log file with recovered sequence#
166*9507f98cSAndroid Build Coastguard Worker
167*9507f98cSAndroid Build Coastguard Worker## Garbage collection of files
168*9507f98cSAndroid Build Coastguard Worker
169*9507f98cSAndroid Build Coastguard Worker`RemoveObsoleteFiles()` is called at the end of every compaction and at the end
170*9507f98cSAndroid Build Coastguard Workerof recovery. It finds the names of all files in the database. It deletes all log
171*9507f98cSAndroid Build Coastguard Workerfiles that are not the current log file. It deletes all table files that are not
172*9507f98cSAndroid Build Coastguard Workerreferenced from some level and are not the output of an active compaction.
173