1*9507f98cSAndroid Build Coastguard Worker## Files 2*9507f98cSAndroid Build Coastguard Worker 3*9507f98cSAndroid Build Coastguard WorkerThe implementation of leveldb is similar in spirit to the representation of a 4*9507f98cSAndroid Build Coastguard Workersingle [Bigtable tablet (section 5.3)](https://research.google/pubs/pub27898/). 5*9507f98cSAndroid Build Coastguard WorkerHowever the organization of the files that make up the representation is 6*9507f98cSAndroid Build Coastguard Workersomewhat different and is explained below. 7*9507f98cSAndroid Build Coastguard Worker 8*9507f98cSAndroid Build Coastguard WorkerEach database is represented by a set of files stored in a directory. There are 9*9507f98cSAndroid Build Coastguard Workerseveral different types of files as documented below: 10*9507f98cSAndroid Build Coastguard Worker 11*9507f98cSAndroid Build Coastguard Worker### Log files 12*9507f98cSAndroid Build Coastguard Worker 13*9507f98cSAndroid Build Coastguard WorkerA log file (*.log) stores a sequence of recent updates. Each update is appended 14*9507f98cSAndroid Build Coastguard Workerto the current log file. When the log file reaches a pre-determined size 15*9507f98cSAndroid Build Coastguard Worker(approximately 4MB by default), it is converted to a sorted table (see below) 16*9507f98cSAndroid Build Coastguard Workerand a new log file is created for future updates. 17*9507f98cSAndroid Build Coastguard Worker 18*9507f98cSAndroid Build Coastguard WorkerA copy of the current log file is kept in an in-memory structure (the 19*9507f98cSAndroid Build Coastguard Worker`memtable`). This copy is consulted on every read so that read operations 20*9507f98cSAndroid Build Coastguard Workerreflect all logged updates. 21*9507f98cSAndroid Build Coastguard Worker 22*9507f98cSAndroid Build Coastguard Worker## Sorted tables 23*9507f98cSAndroid Build Coastguard Worker 24*9507f98cSAndroid Build Coastguard WorkerA sorted table (*.ldb) stores a sequence of entries sorted by key. Each entry is 25*9507f98cSAndroid Build Coastguard Workereither a value for the key, or a deletion marker for the key. (Deletion markers 26*9507f98cSAndroid Build Coastguard Workerare kept around to hide obsolete values present in older sorted tables). 27*9507f98cSAndroid Build Coastguard Worker 28*9507f98cSAndroid Build Coastguard WorkerThe set of sorted tables are organized into a sequence of levels. The sorted 29*9507f98cSAndroid Build Coastguard Workertable generated from a log file is placed in a special **young** level (also 30*9507f98cSAndroid Build Coastguard Workercalled level-0). When the number of young files exceeds a certain threshold 31*9507f98cSAndroid Build Coastguard Worker(currently four), all of the young files are merged together with all of the 32*9507f98cSAndroid Build Coastguard Workeroverlapping level-1 files to produce a sequence of new level-1 files (we create 33*9507f98cSAndroid Build Coastguard Workera new level-1 file for every 2MB of data.) 34*9507f98cSAndroid Build Coastguard Worker 35*9507f98cSAndroid Build Coastguard WorkerFiles in the young level may contain overlapping keys. However files in other 36*9507f98cSAndroid Build Coastguard Workerlevels have distinct non-overlapping key ranges. Consider level number L where 37*9507f98cSAndroid Build Coastguard WorkerL >= 1. When the combined size of files in level-L exceeds (10^L) MB (i.e., 10MB 38*9507f98cSAndroid Build Coastguard Workerfor level-1, 100MB for level-2, ...), one file in level-L, and all of the 39*9507f98cSAndroid Build Coastguard Workeroverlapping files in level-(L+1) are merged to form a set of new files for 40*9507f98cSAndroid Build Coastguard Workerlevel-(L+1). These merges have the effect of gradually migrating new updates 41*9507f98cSAndroid Build Coastguard Workerfrom the young level to the largest level using only bulk reads and writes 42*9507f98cSAndroid Build Coastguard Worker(i.e., minimizing expensive seeks). 43*9507f98cSAndroid Build Coastguard Worker 44*9507f98cSAndroid Build Coastguard Worker### Manifest 45*9507f98cSAndroid Build Coastguard Worker 46*9507f98cSAndroid Build Coastguard WorkerA MANIFEST file lists the set of sorted tables that make up each level, the 47*9507f98cSAndroid Build Coastguard Workercorresponding key ranges, and other important metadata. A new MANIFEST file 48*9507f98cSAndroid Build Coastguard Worker(with a new number embedded in the file name) is created whenever the database 49*9507f98cSAndroid Build Coastguard Workeris reopened. The MANIFEST file is formatted as a log, and changes made to the 50*9507f98cSAndroid Build Coastguard Workerserving state (as files are added or removed) are appended to this log. 51*9507f98cSAndroid Build Coastguard Worker 52*9507f98cSAndroid Build Coastguard Worker### Current 53*9507f98cSAndroid Build Coastguard Worker 54*9507f98cSAndroid Build Coastguard WorkerCURRENT is a simple text file that contains the name of the latest MANIFEST 55*9507f98cSAndroid Build Coastguard Workerfile. 56*9507f98cSAndroid Build Coastguard Worker 57*9507f98cSAndroid Build Coastguard Worker### Info logs 58*9507f98cSAndroid Build Coastguard Worker 59*9507f98cSAndroid Build Coastguard WorkerInformational messages are printed to files named LOG and LOG.old. 60*9507f98cSAndroid Build Coastguard Worker 61*9507f98cSAndroid Build Coastguard Worker### Others 62*9507f98cSAndroid Build Coastguard Worker 63*9507f98cSAndroid Build Coastguard WorkerOther files used for miscellaneous purposes may also be present (LOCK, *.dbtmp). 64*9507f98cSAndroid Build Coastguard Worker 65*9507f98cSAndroid Build Coastguard Worker## Level 0 66*9507f98cSAndroid Build Coastguard Worker 67*9507f98cSAndroid Build Coastguard WorkerWhen the log file grows above a certain size (4MB by default): 68*9507f98cSAndroid Build Coastguard WorkerCreate a brand new memtable and log file and direct future updates here. 69*9507f98cSAndroid Build Coastguard Worker 70*9507f98cSAndroid Build Coastguard WorkerIn the background: 71*9507f98cSAndroid Build Coastguard Worker 72*9507f98cSAndroid Build Coastguard Worker1. Write the contents of the previous memtable to an sstable. 73*9507f98cSAndroid Build Coastguard Worker2. Discard the memtable. 74*9507f98cSAndroid Build Coastguard Worker3. Delete the old log file and the old memtable. 75*9507f98cSAndroid Build Coastguard Worker4. Add the new sstable to the young (level-0) level. 76*9507f98cSAndroid Build Coastguard Worker 77*9507f98cSAndroid Build Coastguard Worker## Compactions 78*9507f98cSAndroid Build Coastguard Worker 79*9507f98cSAndroid Build Coastguard WorkerWhen the size of level L exceeds its limit, we compact it in a background 80*9507f98cSAndroid Build Coastguard Workerthread. The compaction picks a file from level L and all overlapping files from 81*9507f98cSAndroid Build Coastguard Workerthe next level L+1. Note that if a level-L file overlaps only part of a 82*9507f98cSAndroid Build Coastguard Workerlevel-(L+1) file, the entire file at level-(L+1) is used as an input to the 83*9507f98cSAndroid Build Coastguard Workercompaction and will be discarded after the compaction. Aside: because level-0 84*9507f98cSAndroid Build Coastguard Workeris special (files in it may overlap each other), we treat compactions from 85*9507f98cSAndroid Build Coastguard Workerlevel-0 to level-1 specially: a level-0 compaction may pick more than one 86*9507f98cSAndroid Build Coastguard Workerlevel-0 file in case some of these files overlap each other. 87*9507f98cSAndroid Build Coastguard Worker 88*9507f98cSAndroid Build Coastguard WorkerA compaction merges the contents of the picked files to produce a sequence of 89*9507f98cSAndroid Build Coastguard Workerlevel-(L+1) files. We switch to producing a new level-(L+1) file after the 90*9507f98cSAndroid Build Coastguard Workercurrent output file has reached the target file size (2MB). We also switch to a 91*9507f98cSAndroid Build Coastguard Workernew output file when the key range of the current output file has grown enough 92*9507f98cSAndroid Build Coastguard Workerto overlap more than ten level-(L+2) files. This last rule ensures that a later 93*9507f98cSAndroid Build Coastguard Workercompaction of a level-(L+1) file will not pick up too much data from 94*9507f98cSAndroid Build Coastguard Workerlevel-(L+2). 95*9507f98cSAndroid Build Coastguard Worker 96*9507f98cSAndroid Build Coastguard WorkerThe old files are discarded and the new files are added to the serving state. 97*9507f98cSAndroid Build Coastguard Worker 98*9507f98cSAndroid Build Coastguard WorkerCompactions for a particular level rotate through the key space. In more detail, 99*9507f98cSAndroid Build Coastguard Workerfor each level L, we remember the ending key of the last compaction at level L. 100*9507f98cSAndroid Build Coastguard WorkerThe next compaction for level L will pick the first file that starts after this 101*9507f98cSAndroid Build Coastguard Workerkey (wrapping around to the beginning of the key space if there is no such 102*9507f98cSAndroid Build Coastguard Workerfile). 103*9507f98cSAndroid Build Coastguard Worker 104*9507f98cSAndroid Build Coastguard WorkerCompactions drop overwritten values. They also drop deletion markers if there 105*9507f98cSAndroid Build Coastguard Workerare no higher numbered levels that contain a file whose range overlaps the 106*9507f98cSAndroid Build Coastguard Workercurrent key. 107*9507f98cSAndroid Build Coastguard Worker 108*9507f98cSAndroid Build Coastguard Worker### Timing 109*9507f98cSAndroid Build Coastguard Worker 110*9507f98cSAndroid Build Coastguard WorkerLevel-0 compactions will read up to four 1MB files from level-0, and at worst 111*9507f98cSAndroid Build Coastguard Workerall the level-1 files (10MB). I.e., we will read 14MB and write 14MB. 112*9507f98cSAndroid Build Coastguard Worker 113*9507f98cSAndroid Build Coastguard WorkerOther than the special level-0 compactions, we will pick one 2MB file from level 114*9507f98cSAndroid Build Coastguard WorkerL. In the worst case, this will overlap ~ 12 files from level L+1 (10 because 115*9507f98cSAndroid Build Coastguard Workerlevel-(L+1) is ten times the size of level-L, and another two at the boundaries 116*9507f98cSAndroid Build Coastguard Workersince the file ranges at level-L will usually not be aligned with the file 117*9507f98cSAndroid Build Coastguard Workerranges at level-L+1). The compaction will therefore read 26MB and write 26MB. 118*9507f98cSAndroid Build Coastguard WorkerAssuming a disk IO rate of 100MB/s (ballpark range for modern drives), the worst 119*9507f98cSAndroid Build Coastguard Workercompaction cost will be approximately 0.5 second. 120*9507f98cSAndroid Build Coastguard Worker 121*9507f98cSAndroid Build Coastguard WorkerIf we throttle the background writing to something small, say 10% of the full 122*9507f98cSAndroid Build Coastguard Worker100MB/s speed, a compaction may take up to 5 seconds. If the user is writing at 123*9507f98cSAndroid Build Coastguard Worker10MB/s, we might build up lots of level-0 files (~50 to hold the 5*10MB). This 124*9507f98cSAndroid Build Coastguard Workermay significantly increase the cost of reads due to the overhead of merging more 125*9507f98cSAndroid Build Coastguard Workerfiles together on every read. 126*9507f98cSAndroid Build Coastguard Worker 127*9507f98cSAndroid Build Coastguard WorkerSolution 1: To reduce this problem, we might want to increase the log switching 128*9507f98cSAndroid Build Coastguard Workerthreshold when the number of level-0 files is large. Though the downside is that 129*9507f98cSAndroid Build Coastguard Workerthe larger this threshold, the more memory we will need to hold the 130*9507f98cSAndroid Build Coastguard Workercorresponding memtable. 131*9507f98cSAndroid Build Coastguard Worker 132*9507f98cSAndroid Build Coastguard WorkerSolution 2: We might want to decrease write rate artificially when the number of 133*9507f98cSAndroid Build Coastguard Workerlevel-0 files goes up. 134*9507f98cSAndroid Build Coastguard Worker 135*9507f98cSAndroid Build Coastguard WorkerSolution 3: We work on reducing the cost of very wide merges. Perhaps most of 136*9507f98cSAndroid Build Coastguard Workerthe level-0 files will have their blocks sitting uncompressed in the cache and 137*9507f98cSAndroid Build Coastguard Workerwe will only need to worry about the O(N) complexity in the merging iterator. 138*9507f98cSAndroid Build Coastguard Worker 139*9507f98cSAndroid Build Coastguard Worker### Number of files 140*9507f98cSAndroid Build Coastguard Worker 141*9507f98cSAndroid Build Coastguard WorkerInstead of always making 2MB files, we could make larger files for larger levels 142*9507f98cSAndroid Build Coastguard Workerto reduce the total file count, though at the expense of more bursty 143*9507f98cSAndroid Build Coastguard Workercompactions. Alternatively, we could shard the set of files into multiple 144*9507f98cSAndroid Build Coastguard Workerdirectories. 145*9507f98cSAndroid Build Coastguard Worker 146*9507f98cSAndroid Build Coastguard WorkerAn experiment on an ext3 filesystem on Feb 04, 2011 shows the following timings 147*9507f98cSAndroid Build Coastguard Workerto do 100K file opens in directories with varying number of files: 148*9507f98cSAndroid Build Coastguard Worker 149*9507f98cSAndroid Build Coastguard Worker 150*9507f98cSAndroid Build Coastguard Worker| Files in directory | Microseconds to open a file | 151*9507f98cSAndroid Build Coastguard Worker|-------------------:|----------------------------:| 152*9507f98cSAndroid Build Coastguard Worker| 1000 | 9 | 153*9507f98cSAndroid Build Coastguard Worker| 10000 | 10 | 154*9507f98cSAndroid Build Coastguard Worker| 100000 | 16 | 155*9507f98cSAndroid Build Coastguard Worker 156*9507f98cSAndroid Build Coastguard WorkerSo maybe even the sharding is not necessary on modern filesystems? 157*9507f98cSAndroid Build Coastguard Worker 158*9507f98cSAndroid Build Coastguard Worker## Recovery 159*9507f98cSAndroid Build Coastguard Worker 160*9507f98cSAndroid Build Coastguard Worker* Read CURRENT to find name of the latest committed MANIFEST 161*9507f98cSAndroid Build Coastguard Worker* Read the named MANIFEST file 162*9507f98cSAndroid Build Coastguard Worker* Clean up stale files 163*9507f98cSAndroid Build Coastguard Worker* We could open all sstables here, but it is probably better to be lazy... 164*9507f98cSAndroid Build Coastguard Worker* Convert log chunk to a new level-0 sstable 165*9507f98cSAndroid Build Coastguard Worker* Start directing new writes to a new log file with recovered sequence# 166*9507f98cSAndroid Build Coastguard Worker 167*9507f98cSAndroid Build Coastguard Worker## Garbage collection of files 168*9507f98cSAndroid Build Coastguard Worker 169*9507f98cSAndroid Build Coastguard Worker`RemoveObsoleteFiles()` is called at the end of every compaction and at the end 170*9507f98cSAndroid Build Coastguard Workerof recovery. It finds the names of all files in the database. It deletes all log 171*9507f98cSAndroid Build Coastguard Workerfiles that are not the current log file. It deletes all table files that are not 172*9507f98cSAndroid Build Coastguard Workerreferenced from some level and are not the output of an active compaction. 173