Representing filesystems in databases efficiently with Hierarchical Ordering
Or: A new way to sort strings
The way the we interface with files in a file system and rows in a database are fundamentally not the same.
Let’s observe a typical filesystem structure:
$ ls /
b/
a/
$ ls /a
1
2
$ ls /b
1
Quick ChatGPT, make this a tree!
The tree structure looks something like:
/
├── a
│ ├── 1
│ └── 2
└── b
└── 1
Or in another view, we have:
/
/a
/a/1
/a/2
/b
/b/1
Can you see the problem yet? (hint: it’s right above this sentence)
The problem is that the tree structure, and the file path listing just below it, are totally different ways of sorting. Filesystems walk the tree by level, where as databases walk their B-Trees by order (and are constantly rebalanced).
Side quest: Why S3 is not a filesystem
S3 isn’t just blobs of your files, there’s a key-value database that keeps track of your files as well. And it’s the thing you hit first every time you make a request to S3.
As we saw in the code blocks above, if we listed files in the directory /
, then we get 2 results: a/
and b/
.
However, in S3, directories don’t actually exist, so what we really have is just:
/a/1
/a/2
/b/1
The problem is that if we wanted to treat S3 as a filesystem, and did an ls /
, we’ve potentially just asked our S3 client to make unlimited requests.
S3 uses lexicographical ordering, like pretty much every string-keyed database does.
This means that all children of the first “directory” that would appear will have to be sorted through before getting to the next.
Or in other words, if a/
has 1 million files beneath it, then we have to do 1,000 ListObjectV2
requests before we even see the b/
paths.
If you’re trying to mount S3 as a filesystem, this could result in atrocious performance (among other shortfalls trying to use S3 as a filesystem).
Hierarchical Ordering
Ok, so if you sort paths lexicographically, we can’t ls
our system without potentially causing insane load and client latency. But we also can’t rewrite databases to use “Hierarchical Ordering”, so how do we do it?
By “packing” the strings such that they natrually order hierarchically.
The trick is simple: For every segment of the file path but the last, prepend it with some high-value byte like 0xff
.
If we modify the FoundationDB tuple packing algorithm, we turn the filesystem:
dir
├── a
│ └── 1
├── b
└── 🚨
into:
\x01dir\x00 <- is only one item, so it doesn't lead with a 0xff
\x01\xffdir\x00\x01a\x00
\x01\xffdir\x00\x01b\x00
\x01\xffdir\x00\x01🚨\x00
\x01\xffdir\x00\x01\xffa\x00\x011\x00
Or if we clean it up:
dir
dir/a
dir/b
dir/🚨 <- have to make sure that unicode works!
dir/a/1
I wrote an implementation in my toy KV DB: objectkv/tuple
By putting a high-byte to partition the path segments, we sort the keys in the way they’d be represented in a file system.
Now, if we ask a client to ls
the dir
directory, we can simply look up the KV range []byte(“dir”)
to append([]byte(“dir”), 0xff)
. This range will return all direct children of dir
without recursively listing.
But what if we want to ls -R
? Well, we’ve lost that unfortunately. However, it’s not hard:
Get all the children of your first directory
For all found children, continue selecting ranges
[]byte(child)
toappend([]byte(child), 0xff)
Repeat until you reach the end of all recursive paths.
But how do they “Show prefixes as directories” in the AWS console? It’s an index, likely! With an index, you can have both traditional lexicographical sorting of the file paths, and the hierarchical ordering.
Depending on how frequently you need to drill down into directories, this might be a less efficient approach. However, if you’re building a highly cardial filesystem (e.g. one folder may have many children), this is far more efficient: You can easily prune child paths that are not of interest without listing them. With traditional sorting, you’re forced to see everything.