Parallel Filesystem Scan Caveats

Sometime we may need to scan the filesystem for files and place them into a database.

CREATE TABLE files (
  id INTEGER PRIMARY KEY,
  path TEXT NOT NULL,

  parent INTEGER REFERENCES files(id),

  -- ...Random other stuff...
);

But both filesystem traversal and database update are not atomic:

As we’re traversing the filesystem, something above our current path may be changed. The directory we’re reading may even be gone.
A scan may be superseded by another scan (e.g. inotify), so we need to avoid overwriting newer data.

Parallel scaning

The rule-of-thumb: the invariant we want to keep is that if two concurrent scans intersects, then after both scans are finished, the state of the files in the intersection should be identical as if the older scan never happened. Note that this does not mean that the state of the intersection in the database is given by the newer scan, because a even newer scan may be in progress.

We can refine this invariant to a progressive one: During a scan, all files that’s already discovered by this scan should either be updated according to this scan, or a scan that started later (which might actually reached this file earlier). This formulation suits the top-down nature of filesystem scanning more.

This motivates us to add a scan_id column to the files table, which denotes the scan that the row’s information was generated from. Accompanying this, we also need a scans table to generate monotonically increasing scan IDs, which helps us to selectively update only the rows that are outdated relative to the current scan.

CREATE TABLE scans (
  id INTEGER PRIMARY KEY AUTO_INCREMENT,
);

ALTER TABLE files ADD COLUMN scan_id INTEGER NOT NULL REFERENCES scans(id);

At the start of the scan, we insert a new row into the scans table to generate a new scan ID.

Now, whenever we encounter a file during a scan, we can use a transaction to atomically:

Check if the file exists in our database. If it does not exist, create the row.
Now if it does exist, update it only if the scan_id of the row is less than the current scan ID.

Deletions

Handling deletions are slightly more complicated. For one, we cannot “list deleted files” in a filesystem, so the most naive way is to list all the immediate children of a directory in the database, and diff it with filesystem. Obviously this is extremely racy.

However we can defer the deletion of files until the end of the scan. Add a stale boolean column to the files table, which denotes that during the scan_id scan, this file has not been seen yet.

ALTER TABLE files ADD COLUMN stale BOOLEAN NOT NULL DEFAULT FALSE;

We still need to be careful about parallel scanning, though. More specifically:

Don’t mark files that’s updated by a newer scan.
Don’t delete files that’s marked by a newer scan, because the newer scan may still be in progress and can see this file later.

We can do this by also forcing scan_id to be updated when we mark a file as stale. In some sense, “being stale” is a status for a row, so it should be associated with a scan ID.

During the start of the scan:

UPDATE files SET stale = TRUE, scan_id = cur_scan_id WHERE scan_id < cur_scan_id;

During file update, scan_id can be equal to cur_scan_id, so the bound should change to <=
At the end of the scan, sweep all stale files:
```
DELETE FROM files WHERE stale = TRUE AND scan_id = cur_scan_id;
```
Here, scan_id < cur_scan_id should also be fine, for now. At least until we introduce another new feature…

Partial scan

Partial scan allows for subtree scanning. Adding / updating files in the database is fine, but this creates two complications for deletions:

We cannot easily “get all files in a subtree”. Our database scheme is similar to a real filesystem, where we can only look one layer at a time. So we cannot just “mark everything as stale from the start”.

We can solve this by marking the stale files layer-by-layer. After we’ve updated each directory, we can atomically mark all of its children as stale in the same transaction, and then descend into the children.

Sweeping won’t work as-is because if an entire directory is deleted, we won’t be able to mark its transitive descendants as stale. Luckily (again) for us, this is a perfect use case for cascading deletes.

ALTER TABLE files MODIFY parent INTEGER REFERENCES files(id) ON DELETE CASCADE;

We also cannot just delete everything that’s stale at the end, because there might be stale entries that are outside our subtree. Luckily, we can just change the condition of the sweeping query:

DELETE FROM files WHERE stale = TRUE AND scan_id = cur_scan_id;

Because only files within the subtree will ever have scan_id = cur_scan_id AND stale = true, this protects us both from deleting files outside the subtree, and from deleting files that are updated by a newer scan.

So far so good, right? Not if we add another racy partener: the filesystem.

Concurrent filesystem changes

Consider the following timeline:

Originall, database has /a in it.
Scan 1 started, updated /a. Because now 1 is the newest scan, it successfully updated the row, and continued recursion. It read /a contents, and swapped out immediately after readdir(2) returned.
/a got deleted.
Scan 2 started, marked /a as stale, and eventually deleted it from the database.
Scan 1 resumed, tries to insert /a/b into database, and was hit with a foreign key constraint violation because /a no longer exists in the database.

A similar race may happen if a older scan got paused right after readdir(2), and then some children got deleted, but not the directory. This time, we won’t get a foreign key violation, but some deleted files will be “resurrected” by the awaken older scan.

So the row insertion should check if parent still exists and is under the current scan. We don’t need to check this for existing rows: for existing rows, either it’s updated by the newer scan, or haven’t been reached.

More subtly, even with this check, if ID is reused, a similar race might occur if the scanning does not follow a strictly DFS order (e.g. with a parallel fs walker):

A directory that we visited got deleted by a newer scan.
Current scan allocated the ID again for a file in another directory.

This is a classic ABA problem. To avoid this, recall that AUTO_INCREMENT primary keys has the special property that they are never reused^[1]. So we also needs:

ALTER TABLE files MODIFY id INTEGER PRIMARY KEY AUTO_INCREMENT;

Sum up

The tables:

CREATE TABLE files (
  id INTEGER PRIMARY KEY AUTO_INCREMENT,
  name TEXT NOT NULL,
  scan_id INTEGER NOT NULL REFERENCES scans(id),
  stale BOOLEAN NOT NULL DEFAULT FALSE,
  parent INTEGER REFERENCES files(id) ON DELETE CASCADE -- Null for root

  -- Maybe some unique constraint on (parent, name)
);

CREATE TABLE scans (
  id INTEGER PRIMARY KEY AUTO_INCREMENT
);

Recursion:

/**
 * Returns the ID of current updated file,
 * or none if we should not descend further
 * because the subtree is already covered by a
 * newer scan
 */
fn update(scan: i64, path: &Path, parent: Option<i64>) -> Option<i64> {
    // Serializable transaction: no write between
    // our check for the existance of the row
    // and the insertion / update of the row
    // 
    // See footnote [2]
    // 
    // Also, serialization failure retry loop is omitted
    let mut tx = db.begin_serializable();

    // This get-insert-update can also be turned into a upsert, if the database supports it.
    let cur: Option<_> =
        query!(tx, "
                SELECT id, scan_id FROM files WHERE parent = ? AND name = ?
            ",
            parent,
            path.file_name().unwrap()
        ).fetch_one();

    let Some(cur) = cur else {
        // Not in database yet
        let ret = query!(tx, r#"
            INSERT INTO files (name, parent, scan_id)
            SELECT ?1, ?2, ?3
            WHERE EXISTS (
                SELECT 1 FROM files
                WHERE id = ?2
                AND scan_id <= ?3
            )
            "#,
            path.file_name().unwrap(),
            parent,
            scan
        );

        if ret.affected_rows() == 0 {
            // Parent got updated
            return None;
        }
        tx.commit();
        return Some(ret.last_inserted_id());
    }

    if scan < cur.scan_id {
        // A newer scan is covering this subtree
        // tx aborted
        return None;
    }

    query!(tx, "UPDATE files SET stale = FALSE, scan_id = ? ... WHERE id = ?", scan, cur.id);
    query!(tx, "UPDATE files SET stale = true, scan_id = ? WHERE parent = ? AND scan_id < ?", scan, cur.id, scan);
    tx.commit();
    return Some(cur.id);
}

fn walk(scan: i64, path: &Path, parent: Option<i64>) {
    let Some(id) = update(scan, path, parent) else {
        return;
    }
    if !is_dir(path) { return; }
    for entry in readdir(path) {
        walk(scan, &path.join(entry.name), Some(id));
    }
}

Clean up:

DELETE FROM files WHERE stale = TRUE AND scan_id = cur_scan_id

Two final things:

Be careful with filesystem error. For example, the directory read by the scanner may be concurrently deleted, so readdir and friends will return not found. This should be handled as a missing file (just continue in the loop)
During subtree scans, remember to update the trunk leading to this branch, and gets their ID. ID = NULL is specifically reserved for the top-most level.

The update should not mark the other files in the trunk (the sibling subtrees) as stale, to preserve the condition that only the targeted subtree can have scan_id = cur_scan_id AND stale = TRUE.

Special handling needed if we want to delete the entire partial scan root.

Why?

Toy

[1] This is somewhat database dependent: SQLite uses AUTOINCREMENT which also has this semantics. MySQL can reuse AUTO_INCREMENT primary keys, but only across database restarts, using InnoDB, and <8.0.0.
[2] Actually, we only need the atomicity + isolation for the check then insertion-or-update sequence. This may be done in a upsert operation in the SQL server of your like, with their own non-standard SQL extensions.

All of them have some caveats, but all seems to just work out fine if we only need to return None on all falure cases. Claude Opus 4.8 gives the following snippet for SQLite, which requires the UNIQUE(parent, name) constraint:
```
INSERT INTO files (name, parent, scan_id)
SELECT :name, :parent, :scan
WHERE :parent IS NULL
  OR EXISTS (SELECT 1 FROM files WHERE id = :parent AND scan_id <= :scan)
ON CONFLICT(parent, name) DO UPDATE SET
  scan_id = excluded.scan_id,
  stale   = FALSE
  -- , ...content...
WHERE files.scan_id <= excluded.scan_id
RETURNING id;
```
The stale marking can be done completely standalone.