Building a Tiny Filesystem with FUSE

Lately I have been working on sandboxing, storage, and networking, and most of that has been in and around gVisor. A lot of it keeps coming back to files, which makes sense, since Unix has organized itself around "everything is a file" for over fifty years. Your terminal and random number generator are device files you can open and read (/dev/tty, /dev/urandom), and even network sockets, which are created with their own system call rather than opened by path, are read and written through the same interface afterwards. So I thought a fun experiment would be to dig into how a filesystem actually works, by building a simple one.

We will build it in Rust, in under 200 lines. It has four files: hello.txt, a normal read-only file; time.txt, whose contents are different every time you read it; weather.txt, which takes two seconds to read the first time because the filesystem pretends to download it; and notes.txt, which you can write to, with the contents kept in the program’s memory. None of these files exist on disk, and you do not need much Linux or Rust background to follow along. I am also intentionally glossing over the things that make production filesystems hard, like durability, crash consistency, and concurrent access. Perhaps I will come back to those in part two of this post.

Play with it first

If you have Docker you can try the finished thing right now, on arm or x86.

docker run -it --rm --device /dev/fuse --cap-add SYS_ADMIN shayonj/magicfs

That starts a shell with the filesystem mounted at /magic, and the filesystem logs every request it receives into the same terminal, so you can see what the kernel asks as you run ls and cat. --device /dev/fuse gives the container access to the kernel’s FUSE device, and --cap-add SYS_ADMIN lets it mount. The full source is at github.com/shayonj/magicfs.

How a filesystem works

When you run cat hello.txt (cat prints a file’s contents), cat does not read the disk itself. It calls open() to resolve the path and get back a file descriptor, a small integer handle, then calls read() on that descriptor, and inside the kernel the VFS (Virtual File System) routes those calls to whichever filesystem the path was resolved on, ext4 for a typical disk, tmpfs for /tmp, proc for /proc. The VFS is why every filesystem looks the same to a program. It defines the set of requests a filesystem has to answer, and each filesystem is one implementation of that interface.

%%{init: {"flowchart": {"nodeSpacing": 25, "rankSpacing": 35}}}%%
flowchart TB
    A["cat hello.txt\nopen(), then read()"] --> B["VFS\n(Virtual File System)"]
    B --> C["ext4\n(disk)"]
    B --> D["tmpfs\n(memory)"]
    B --> E["proc\n(kernel state)"]

The requests are small and concrete. Does a file with this name exist in this directory? How big is it, who owns it, is it a directory? What are the bytes between offset 0 and 4096? Store these bytes at this offset and so on.
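Seen from the program’s side, the byte-range question is an ordinary positional read. Here is a minimal sketch of "what are the bytes between offset 0 and 4096", assuming a Unix system. The helper name read_range and the temp file name are made up for this example.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // read_at: positional read (pread) on Unix
use std::path::Path;

/// Read up to `size` bytes starting at `offset`.
/// `read_range` is a hypothetical helper name, not part of any library.
fn read_range(path: &Path, offset: u64, size: usize) -> io::Result<Vec<u8>> {
    let file = File::open(path)?; // open(): resolve the path, get a descriptor
    let mut buf = vec![0u8; size];
    let n = file.read_at(&mut buf, offset)?; // may return fewer bytes than asked
    buf.truncate(n);
    Ok(buf)
}

fn main() -> io::Result<()> {
    // A scratch file on whatever filesystem backs the temp directory.
    let path = std::env::temp_dir().join("read_range_demo.txt");
    std::fs::write(&path, b"hello from a regular file\n")?;

    let bytes = read_range(&path, 0, 4096)?;
    println!("{} bytes: {:?}", bytes.len(), String::from_utf8_lossy(&bytes));
    Ok(())
}
```

The program never says which filesystem it is talking to. The same two calls work whether the path lands on ext4, tmpfs, or the filesystem we are about to write.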

The kernel also barely deals in filenames. Internally it refers to files by inode number. On a disk filesystem an inode is a small record holding a file’s metadata and the location of its data, and a filename is a directory entry pointing at an inode, which is why the same file can appear under two names. A name is resolved through a lookup request that takes a directory and a name and returns an inode, and every request after that is in terms of the inode.

flowchart LR
    subgraph dir["directory entries"]
        D1["hello.txt -> ino 2"]
        D2["hello-link.txt -> ino 2"]
    end
    D1 --> I["inode 2\nsize, owner, permissions,\ntimestamps, where the data lives"]
    D2 --> I
    I --> B["data blocks"]
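You can watch two names share one inode from a program. This is a small sketch assuming a Unix system and a temp directory that supports hard links. The function name link_and_stat and the directory name are invented for the example.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt; // ino() and nlink() on Unix
use std::path::Path;

/// Create a file and a hard link to it inside `dir`, then report
/// (inode of the original, inode of the link, link count).
fn link_and_stat(dir: &Path) -> io::Result<(u64, u64, u64)> {
    let _ = fs::remove_dir_all(dir); // start clean; ignore "not found"
    fs::create_dir_all(dir)?;
    let original = dir.join("hello.txt");
    let link = dir.join("hello-link.txt");

    fs::write(&original, b"same bytes, two names\n")?;
    fs::hard_link(&original, &link)?; // a second directory entry, same inode

    let a = fs::metadata(&original)?;
    let b = fs::metadata(&link)?;
    fs::remove_dir_all(dir)?;
    Ok((a.ino(), b.ino(), a.nlink()))
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join(format!("inode_demo_{}", std::process::id()));
    let (ino_a, ino_b, links) = link_and_stat(&dir)?;
    println!("hello.txt      -> ino {ino_a}");
    println!("hello-link.txt -> ino {ino_b}");
    println!("link count: {links}");
    Ok(())
}
```

Both names should report the same inode number, and the link count is how the filesystem knows when the last name is gone and the inode can be freed.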

You may have come across inode contents before, since stat prints them. Below is stat against the filesystem we are about to build, and everything in it comes from our replies, including the inode number and the 1970 timestamps (I reply with the Unix epoch for all three and never bothered to change it).

$ stat /magic/hello.txt
  File: /magic/hello.txt
  Size: 61        	Blocks: 1          IO Block: 4096   regular file
Device: 0,55	Inode: 2           Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 1970-01-01 00:00:00.000000000 +0000
Modify: 1970-01-01 00:00:00.000000000 +0000
Change: 1970-01-01 00:00:00.000000000 +0000

Then there is caching. Asking the filesystem the same questions over and over is wasteful, so the kernel caches names and attributes on one side, and file contents in the page cache on the other. Most reads of a recently used file never reach the filesystem at all. This will come back to bite us shortly.

Moving the filesystem to userspace

Normally the code answering these requests lives inside the kernel, and the answers come from a disk. Linux also ships FUSE (Filesystem in Userspace), which forwards the requests to a regular program instead. When your program mounts a FUSE filesystem, it gets a connection to the kernel through a special file called /dev/fuse. Operations on that mount, when the kernel cannot answer them from its caches, become request messages with an opcode like LOOKUP, GETATTR, or READ. The full opcode list is in the kernel’s fuse.h, and libfuse’s low-level ops document what each one expects. Your program reads the request, handles it however it wants, and writes a reply back. The application that triggered the request is blocked inside its read() or write() call the whole time, and whatever bytes you reply with are what it gets back. It cannot tell the difference.

flowchart LR
    A[cat hello.txt] --> B[Linux kernel, VFS + FUSE driver]
    B -->|"READ request via /dev/fuse"| C[our Rust program]
    C -->|"here are the bytes"| B
    B --> A
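The loop in the diagram can be modeled without any FUSE at all. The sketch below is a toy: a channel stands in for /dev/fuse, and the Request and Reply types are invented for illustration. The real protocol uses binary headers and has many more opcodes.

```rust
use std::sync::mpsc;
use std::thread;

// Toy stand-ins for FUSE messages; names and fields are illustrative only.
enum Request {
    Lookup { name: String },
    Read { ino: u64, offset: usize, size: usize },
}

#[derive(Debug, PartialEq)]
enum Reply {
    Entry { ino: u64 },
    Data(Vec<u8>),
    Error(&'static str),
}

const HELLO_INO: u64 = 2;
const HELLO_TEXT: &[u8] = b"made up on request\n";

// The userspace side: one request in, one reply out.
fn handle(req: Request) -> Reply {
    match req {
        Request::Lookup { name } if name == "hello.txt" => Reply::Entry { ino: HELLO_INO },
        Request::Lookup { .. } => Reply::Error("ENOENT"),
        Request::Read { ino: HELLO_INO, offset, size } => {
            let start = offset.min(HELLO_TEXT.len());
            let end = start.saturating_add(size).min(HELLO_TEXT.len());
            Reply::Data(HELLO_TEXT[start..end].to_vec())
        }
        Request::Read { .. } => Reply::Error("ENOENT"),
    }
}

fn main() {
    // Each message carries the request and where to send the reply.
    let (to_fs, from_kernel) = mpsc::channel::<(Request, mpsc::Sender<Reply>)>();

    // The "filesystem daemon": read a request, handle it, write the reply back.
    let daemon = thread::spawn(move || {
        for (req, reply_to) in from_kernel {
            let _ = reply_to.send(handle(req));
        }
    });

    // The "application": blocked in recv() until the reply arrives.
    let (reply_tx, reply_rx) = mpsc::channel();
    to_fs.send((Request::Lookup { name: "hello.txt".into() }, reply_tx.clone())).unwrap();
    println!("{:?}", reply_rx.recv().unwrap());
    to_fs.send((Request::Read { ino: HELLO_INO, offset: 0, size: 4096 }, reply_tx)).unwrap();
    println!("{:?}", reply_rx.recv().unwrap());

    drop(to_fs); // closing the channel ends the daemon loop
    daemon.join().unwrap();
}
```

The caller only ever sees the reply, so it has no way to know whether the bytes came from a disk or from a match arm.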

Mounting a remote machine’s files with sshfs, browsing cloud storage as a directory, and parts of how containers see their filesystems all go through this same request and reply loop.

Building it

We will use the fuser crate, which speaks the /dev/fuse protocol and exposes each request type as a method on a Filesystem trait. Every method has a default implementation that replies “not implemented”, so you only implement the requests you care about.

Since the kernel talks in inodes, the core of our filesystem is a static table mapping inode numbers to names. The numbers can be anything we like, so we pick small ones, with 1 reserved for the root directory because FUSE expects that. Linux’s own memory-backed filesystems work the same way. On ext4 the inode number is the index of the file’s slot in the on-disk inode table, but tmpfs keeps its files in RAM and has no table to index, so it hands out numbers from a counter: the first file you create on a tmpfs mount gets inode 2. proc makes its numbers up as well, and stat /proc/uptime even reports a size of 0 because the contents do not exist until you read them.

An inode number has to stay a stable identifier for as long as the filesystem is mounted, and a disk filesystem has more work to do here. For instance, the inode table and directory entries have to land on stable storage along with the data, and stay consistent if the machine loses power mid-write, which is what journaling in ext4 is for. Our files do not survive the process, so everything stays in memory for now.

const ROOT: u64 = 1;
const HELLO: u64 = 2;
const TIME: u64 = 3;
const WEATHER: u64 = 4;
const NOTES: u64 = 5;

const FILES: &[(u64, &str)] = &[
    (HELLO, "hello.txt"),
    (TIME, "time.txt"),
    (WEATHER, "weather.txt"),
    (NOTES, "notes.txt"),
];
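With a table like that, resolving a name is a linear scan. The two helpers below are a sketch of what a lookup handler needs, not the code from the repository. The names ino_for and name_for are made up here.

```rust
const ROOT: u64 = 1;
const HELLO: u64 = 2;
const TIME: u64 = 3;
const WEATHER: u64 = 4;
const NOTES: u64 = 5;

const FILES: &[(u64, &str)] = &[
    (HELLO, "hello.txt"),
    (TIME, "time.txt"),
    (WEATHER, "weather.txt"),
    (NOTES, "notes.txt"),
];

/// Name -> inode. Only the root directory has entries in this filesystem.
fn ino_for(parent: u64, name: &str) -> Option<u64> {
    if parent != ROOT {
        return None;
    }
    FILES.iter().find(|(_, n)| *n == name).map(|(ino, _)| *ino)
}

/// Inode -> name, handy for log lines.
fn name_for(ino: u64) -> Option<&'static str> {
    FILES.iter().find(|(i, _)| *i == ino).map(|(_, n)| *n)
}

fn main() {
    for name in ["hello.txt", "notes.txt", "missing.txt"] {
        match ino_for(ROOT, name) {
            Some(ino) => println!("LOOKUP {name} -> ino={ino}"),
            None => println!("LOOKUP {name} -> ENOENT"),
        }
    }
    println!("ino {WEATHER} is {:?}", name_for(WEATHER));
}
```

A None here is what becomes a "no such file" error for the caller. With four entries a scan is fine, and a larger filesystem would use a map per directory.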

Now, reading a file takes four request types, roughly in the order the kernel sends them.

readdir to list the directory, which is what ls triggers
lookup to ask whether a name exists and get back its inode
getattr for the size, ownership, and type
read for the bytes

There is also an open in between, which we will come back to in the caching section. Most of our filesystem’s behavior sits in one helper that decides what an inode contains at the moment it is asked.

fn contents(&mut self, ino: u64) -> String {
    match ino {
        HELLO => HELLO_TEXT.to_string(),
        TIME => format!(
            "The time right now is {}\n",
            chrono::Local::now().format("%H:%M:%S")
        ),
        WEATHER => {
            if !self.weather_cached {
                eprintln!("[magicfs] weather.txt: first read, simulating a slow download...");
                std::thread::sleep(Duration::from_secs(2));
                eprintln!("[magicfs] weather.txt: downloaded, cached in memory");
                self.weather_cached = true;
            }
            WEATHER_TEXT.to_string()
        }
        _ => String::new(),
    }
}

For these generated files, no content is stored anywhere. When a read request arrives, we build the string right then, hand it back, and forget it, and the next read builds it again. time.txt is generated on every read. weather.txt sleeps two seconds on the first read to simulate a slow fetch, then sets a flag so later reads return immediately. That is the same shape as a filesystem that only fetches data the first time something asks for it, with the network call replaced by a sleep.
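To make "contents are a function of the moment you ask" concrete, here is a standard-library-only sketch of the time.txt case. It is an assumption-laden variant: it formats UTC by hand, where the helper above uses chrono’s local time, and the names hms_utc and time_contents are made up.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Format seconds since the Unix epoch as HH:MM:SS in UTC.
fn hms_utc(secs: u64) -> String {
    let s = secs % 86_400; // seconds into the current day
    format!("{:02}:{:02}:{:02}", s / 3600, (s % 3600) / 60, s % 60)
}

/// Build the file's text for a given instant; nothing is kept between calls.
fn time_contents(now: SystemTime) -> String {
    let secs = now
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0); // a clock set before 1970 falls back to the epoch
    format!("The time right now is {}\n", hms_utc(secs))
}

fn main() {
    // Two "reads" are just two calls; each one consults the clock again.
    print!("{}", time_contents(SystemTime::now()));
    print!("{}", time_contents(SystemTime::now()));
}
```

The text changes on every call, but its length does not, which keeps the size we report in getattr honest.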

read itself is mostly bookkeeping. The kernel asks for a byte range rather than the whole file, so we clamp the range to the content length and return that slice.

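fn read(&mut self, _req: &Request, ino: u64, _fh: u64, offset: i64,
        size: u32, _flags: i32, _lock: Option<u64>, reply: ReplyData) {
    // notes.txt is served from the in-memory buffer; the other files are generated.
    let data = if ino == NOTES {
        self.notes.clone()
    } else {
        self.contents(ino).into_bytes()
    };
    // Clamp the requested range to the data; at or past the end this is an empty slice.
    let start = (offset as usize).min(data.len());
    let end = (start + size as usize).min(data.len());
    reply.data(&data[start..end]);
}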
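The clamping is easy to get subtly wrong, so here it is pulled out into a pure function that can be tested on its own. This is a sketch: clamp_range is a made-up name, and it spells out the out-of-range case explicitly instead of relying on a cast.

```rust
use std::ops::Range;

/// Clamp an (offset, size) read request to a buffer of `len` bytes.
/// Offsets that are negative or past the end yield an empty range at `len`.
fn clamp_range(len: usize, offset: i64, size: u32) -> Range<usize> {
    let start = usize::try_from(offset).unwrap_or(usize::MAX).min(len);
    let end = start.saturating_add(size as usize).min(len);
    start..end
}

fn main() {
    let data = b"Tomorrow: sunny, a high of 22, light wind, and no rain.\n";

    // A large request against a small file returns the whole file.
    let first = clamp_range(data.len(), 0, 131_072);
    println!("offset=0  -> {} bytes", data[first].len());

    // A request starting at the end returns nothing, which signals end of file.
    let second = clamp_range(data.len(), data.len() as i64, 131_072);
    println!("offset={} -> {} bytes", data.len(), data[second].len());
}
```

An empty reply is a valid answer, not an error. It is how the reader learns the file has ended.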

Writing back

Writing is the same protocol in the other direction. When you run echo "remember to buy milk" > /magic/notes.txt, the kernel sends a WRITE request with the bytes and an offset, and whatever we do with them is what saving means in this filesystem. We keep them in a Vec<u8>.

fn write(&mut self, _req: &Request, ino: u64, _fh: u64, offset: i64, data: &[u8],
         _write_flags: u32, _flags: i32, _lock_owner: Option<u64>, reply: ReplyWrite) {
    if ino != NOTES {
        reply.error(libc::EACCES);
        return;
    }
    eprintln!("[magicfs] WRITE notes.txt (ino={ino}) offset={offset} len={}", data.len());
    let offset = offset as usize;
    let end = offset + data.len();
    if self.notes.len() < end {
        self.notes.resize(end, 0);
    }
    self.notes[offset..end].copy_from_slice(data);
    reply.written(data.len() as u32);
}

When a shell overwrites a file with >, it opens the file with a truncate flag, which reaches us as a setattr request setting the size to zero, and we shrink the Vec. Appending with >> arrives as a write at the end of the file, which the offset handling above covers. The other three files report read-only permissions through getattr, which is what ls -l shows, but the kernel only enforces those bits if you mount with the default_permissions option, which we did not. The EACCES (permission denied) reply in write is what actually rejects them.
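The interplay between truncate, overwrite, and append is easier to see on a bare buffer. Below is a small model of the notes buffer with no FUSE involved. The Notes type and its method names are invented for this sketch.

```rust
/// In-memory file contents, modeled on the buffer behind notes.txt.
#[derive(Default)]
struct Notes {
    bytes: Vec<u8>,
}

impl Notes {
    /// WRITE: store `data` at `offset`, zero-filling any gap past the current end.
    fn write_at(&mut self, offset: usize, data: &[u8]) -> usize {
        let end = offset + data.len();
        if self.bytes.len() < end {
            self.bytes.resize(end, 0);
        }
        self.bytes[offset..end].copy_from_slice(data);
        data.len()
    }

    /// SETATTR with a size: shrink or zero-extend to exactly `size` bytes.
    fn set_size(&mut self, size: usize) {
        self.bytes.resize(size, 0);
    }
}

fn main() {
    let mut notes = Notes::default();

    // echo "remember to buy milk" > notes.txt : truncate, then write at 0
    notes.set_size(0);
    notes.write_at(0, b"remember to buy milk\n");

    // echo "and eggs" >> notes.txt : a write at the current end
    let end = notes.bytes.len();
    notes.write_at(end, b"and eggs\n");
    print!("{}", String::from_utf8_lossy(&notes.bytes));

    // echo "hi" > notes.txt : without the truncate, the old tail would remain
    notes.set_size(0);
    notes.write_at(0, b"hi\n");
    print!("{}", String::from_utf8_lossy(&notes.bytes));
}
```

The last case is why the truncate matters. A write handler alone only ever grows or overwrites, so replacing a long note with a short one needs the size change to arrive first.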

Caching, or why the kernel must be told to stop helping

Here is where the caching from earlier bites. Crossing from the kernel into userspace and back on every operation is slow compared to an in-kernel filesystem, so FUSE leans hard on the kernel’s two caches, the one for names and attributes and the page cache for contents, and a filesystem that makes up its contents has to decide deliberately how much of each to allow.

For metadata, every lookup and getattr reply carries a TTL (time-to-live), and until it expires the kernel answers repeat questions itself without calling us. We use a one second TTL, which is fine because our files never disappear, and the only one whose size changes, notes.txt, changes through writes and truncates that the kernel sees itself.
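What the TTL buys and what it costs can be shown with a toy cache. This is a model of the idea only, not the kernel’s implementation. AttrCache, its fields, and the closure standing in for a GETATTR round trip are all invented for the sketch.

```rust
use std::time::{Duration, Instant};

/// Toy TTL cache for one file's size.
struct AttrCache {
    ttl: Duration,
    entry: Option<(u64, Instant)>, // (cached size, moment it stops being trusted)
    round_trips: u32,
}

impl AttrCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entry: None, round_trips: 0 }
    }

    /// Return the size, calling `ask_fs` only when nothing fresh is cached.
    /// `now` is passed in so the behavior is easy to reason about and test.
    fn size(&mut self, now: Instant, ask_fs: impl FnOnce() -> u64) -> u64 {
        if let Some((size, expires_at)) = self.entry {
            if now < expires_at {
                return size; // answered from the cache, no round trip
            }
        }
        let size = ask_fs();
        self.round_trips += 1;
        self.entry = Some((size, now + self.ttl));
        size
    }
}

fn main() {
    let mut cache = AttrCache::new(Duration::from_secs(1));
    let t0 = Instant::now();

    println!("t=0.0s size={}", cache.size(t0, || 21));
    // The file "grew" to 30 bytes, but the cached answer is still trusted.
    println!("t=0.5s size={}", cache.size(t0 + Duration::from_millis(500), || 30));
    // After the TTL expires, the next question goes to the filesystem again.
    println!("t=1.5s size={}", cache.size(t0 + Duration::from_millis(1500), || 30));
    println!("round trips: {}", cache.round_trips);
}
```

The middle call returns the old size. That staleness window is the price of skipping the round trip, and the TTL is how a filesystem chooses how long it is willing to be wrong.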

For contents, the page cache would mean the kernel reads time.txt once and serves the same answer from memory afterwards, and the time would never change. The fix is a flag on open that marks the file’s contents as uncacheable, so every read reaches us.

fn open(&mut self, _req: &Request, _ino: u64, _flags: i32, reply: ReplyOpen) {
    reply.opened(0, fuser::consts::FOPEN_DIRECT_IO);
}

/proc works on the same principle, its contents are generated when you read them rather than served from a cache, and a network filesystem faces the same tradeoff between speed and seeing other machines’ writes promptly. Deciding what the kernel may cache is a large part of the design of a real filesystem.

Watching it run

Inside the container the directory has files in it, with no disk involved.

$ ls /magic
hello.txt  notes.txt  time.txt  weather.txt

The usual filesystem contract holds, you can write a file and read it back.

$ echo "remember to buy milk" > /magic/notes.txt
$ cat /magic/notes.txt
remember to buy milk

Except those bytes never reached a disk. They went into the Vec in our process, and when the program exits they are gone. The other three files take it further, nobody ever wrote them.

$ cat /magic/hello.txt
Hello! I am not a real file. A tiny Rust program made me up.

Here is the time file read twice, a few seconds apart.

$ cat /magic/time.txt
The time right now is 07:00:39

$ cat /magic/time.txt
The time right now is 07:00:42

Same file, same name, different contents, because the contents are computed when the kernel asks. Here is the lazy one, timed.

$ time cat /magic/weather.txt
Tomorrow: sunny, a high of 22, light wind, and no rain.

real    0m2.004s

$ time cat /magic/weather.txt
Tomorrow: sunny, a high of 22, light wind, and no rain.

real    0m0.001s

Two seconds the first time while the filesystem simulates the download, a millisecond after that because it kept a copy. The log shows the requests as they arrive.

[magicfs] READDIR ino=1
[magicfs] LOOKUP notes.txt -> ino=5
[magicfs] WRITE notes.txt (ino=5) offset=0 len=21
[magicfs] READ notes.txt (ino=5) offset=0 size=131072
[magicfs] LOOKUP weather.txt -> ino=4
[magicfs] READ weather.txt (ino=4) offset=0 size=131072
[magicfs] weather.txt: first read, simulating a slow download...
[magicfs] weather.txt: downloaded, cached in memory
[magicfs] READ weather.txt (ino=4) offset=56 size=131072

In the log you can also see that cat asks for 128KB at a time (size=131072) regardless of the file size, and with caching off the kernel passes that request through to us as is. And each cat issues a second READ starting exactly at the end of the file, offset 56 for a 56 byte file, because the only way to know there is nothing left is to ask.
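The second READ falls out of how any reader loops until end of file. Here is a sketch of that loop over an in-memory source, counting the calls. read_all_counting is a made-up helper, and the 128 KiB chunk size mirrors the size seen in the log.

```rust
use std::io::{self, Read};

/// Read `src` to the end in `chunk`-sized requests. Returns the data and the
/// number of read() calls made, including the final one that returns 0.
fn read_all_counting<R: Read>(mut src: R, chunk: usize) -> io::Result<(Vec<u8>, usize)> {
    let mut buf = vec![0u8; chunk];
    let mut out = Vec::new();
    let mut calls = 0;
    loop {
        let n = src.read(&mut buf)?;
        calls += 1;
        if n == 0 {
            break; // zero bytes is the only end-of-file signal
        }
        out.extend_from_slice(&buf[..n]);
    }
    Ok((out, calls))
}

fn main() -> io::Result<()> {
    let weather = b"Tomorrow: sunny, a high of 22, light wind, and no rain.\n";
    let (data, calls) = read_all_counting(&weather[..], 128 * 1024)?;
    println!("{} bytes in {} read() calls", data.len(), calls);
    Ok(())
}
```

A short read does not mean the file is over, so the loop has to ask once more and get zero bytes back before it can stop.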

More than a party trick

A filesystem, then, is whatever answers a fixed set of requests (what is in this directory, what is this file like, what bytes are here, store these bytes), whether that is ext4 in the kernel or two hundred lines of Rust in userspace. The kernel routes the requests and caches the answers unless you tell it not to. Where the file actually lives is up to the implementation.

%%{init: {"flowchart": {"nodeSpacing": 25, "rankSpacing": 35}}}%%
flowchart TB
    A["same kernel requests"] --> B["ext4\ndata blocks on disk"]
    A --> C["tmpfs\npages in RAM"]
    A --> D["magicfs\na Vec and the clock"]
    A --> E["object storage backed fs\nobjects in a bucket, cached locally"]

Take the last one. The pieces for it are already in this post. Reads fetch the object on first access and keep a local copy, which is weather.txt. Writes land in a local buffer and sync to the bucket in the background, which is notes.txt with a second step. A real filesystem also deals with plenty that this one ignores, such as crash safety, concurrent access, and real permission checks, but those are more involved answers to the same requests, not a different model.
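The read half of that design fits in a few lines. The sketch below is hypothetical throughout: ReadThrough is an invented type, and the fetch closure stands in for a GET against a bucket. It leaves out the write-back half, eviction, and error handling.

```rust
use std::collections::HashMap;

/// Read-through cache: fetch an object on first access, serve it locally after.
struct ReadThrough<F: FnMut(&str) -> Vec<u8>> {
    fetch: F, // stand-in for the network call
    local: HashMap<String, Vec<u8>>,
    fetches: u32,
}

impl<F: FnMut(&str) -> Vec<u8>> ReadThrough<F> {
    fn new(fetch: F) -> Self {
        Self { fetch, local: HashMap::new(), fetches: 0 }
    }

    /// First access pays for the fetch; later accesses never leave the process.
    fn read(&mut self, key: &str) -> &[u8] {
        if !self.local.contains_key(key) {
            let bytes = (self.fetch)(key);
            self.fetches += 1;
            self.local.insert(key.to_string(), bytes);
        }
        &self.local[key]
    }
}

fn main() {
    let mut cache = ReadThrough::new(|key: &str| {
        // A real implementation would download the object here.
        format!("contents of {key}\n").into_bytes()
    });

    for _ in 0..3 {
        let len = cache.read("weather.txt").len();
        println!("read {len} bytes");
    }
    println!("fetches: {}", cache.fetches);
}
```

Three reads cost one fetch. Everything hard about the real thing lives in what this leaves out: knowing when the local copy is stale, and what to do when the fetch fails halfway.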

The weather.txt pattern, where data does not exist until first access and is cached afterwards, is demand loading, and it shows up all over systems work. It is how lazy VM snapshot restore works, with memory pages instead of files. In gVisor, sandboxed applications reach their files through a userspace filesystem layer where a separate process answers these same requests on the application’s behalf, over a protocol called LISAFS, and a lot of my day to day is in and around that layer and the virtual kernel in general.

If you want to go deeper, the kernel’s FUSE documentation describes the protocol, libfuse is the canonical C implementation and has good example filesystems, and the fuser docs cover the Rust side. The full code is at github.com/shayonj/magicfs, and it is easy to extend: files that expire after a retention window, a directory that shows different contents per user, a real network fetch behind weather.txt. Every one of those is the same handful of request handlers with a different answer inside. There is also a write-up of an email inbox exposed as a FUSE filesystem, where every message is a file backed by a database row.

We are doing a lot of this kind of work at Tines, on the sandboxing and filesystem layers under our products, and if that sounds interesting, we are hiring.