Sometime we may need to scan the filesystem for files and place them into a database.
CREATE TABLE files (
id INTEGER PRIMARY KEY,
path TEXT NOT NULL,
parent INTEGER REFERENCES files(id),
-- ...Random other stuff...
);
But both filesystem traversal and database update are not atomic:
- As we’re traversing the filesystem, something above our current path may be changed. The directory we’re reading may even be gone.
- A scan may be superseded by another scan (e.g. inotify), so we need to avoid overwriting newer data.
Parallel scaning
The rule-of-thumb: the invariant we want to keep is that if two concurrent scans intersects, then after both scans are finished, the state of the files in the intersection should be identical as if the older scan never happened. Note that this does not mean that the state of the intersection in the database is given by the newer scan, because a even newer scan may be in progress.
We can refine this invariant to a progressive one: During a scan, all files that’s already discovered by this scan should either be updated according to this scan, or a scan that started later (which might actually reached this file earlier). This formulation suits the top-down nature of filesystem scanning more.
This motivates us to add a scan_id column to the files table, which denotes the scan that the row’s information was generated from. Accompanying this, we also need a scans table to generate monotonically increasing scan IDs, which helps us to selectively update only the rows that are outdated relative to the current scan.
CREATE TABLE scans (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
);
ALTER TABLE files ADD COLUMN scan_id INTEGER NOT NULL REFERENCES scans(id);
At the start of the scan, we insert a new row into the scans table to generate a new scan ID.
Now, whenever we encounter a file during a scan, we can use a transaction to atomically:
- Check if the file exists in our database. If it does not exist, create the row.
- Now if it does exist, update it only if the
scan_idof the row is less than the current scan ID.
Deletions
Handling deletions are slightly more complicated. For one, we cannot “list deleted files” in a filesystem, so the most naive way is to list all the immediate children of a directory in the database, and diff it with filesystem. Obviously this is extremely racy.
However we can defer the deletion of files until the end of the scan. Add a stale boolean column to the files table, which denotes that during the scan_id scan, this file has not been seen yet.
ALTER TABLE files ADD COLUMN stale BOOLEAN NOT NULL DEFAULT FALSE;
We still need to be careful about parallel scanning, though. More specifically:
- Don’t mark files that’s updated by a newer scan.
- Don’t delete files that’s marked by a newer scan, because the newer scan may still be in progress and can see this file later.
We can do this by also forcing scan_id to be updated when we mark a file as stale. In some sense, “being stale” is a status for a row, so it should be associated with a scan ID.
-
During the start of the scan:
UPDATE files SET stale = TRUE, scan_id = cur_scan_id WHERE scan_id < cur_scan_id; -
During file update,
scan_idcan be equal to cur_scan_id, so the bound should change to<= -
At the end of the scan, sweep all stale files:
DELETE FROM files WHERE stale = TRUE AND scan_id = cur_scan_id;Here,
scan_id < cur_scan_idshould also be fine, for now. At least until we introduce another new feature…
Partial scan
Partial scan allows for subtree scanning. Adding / updating files in the database is fine, but this creates two complications for deletions:
- We cannot easily “get all files in a subtree”. Our database scheme is similar to a real filesystem, where we can only look one layer at a time. So we cannot just “mark everything as stale from the start”.
We can solve this by marking the stale files layer-by-layer. After we’ve updated each directory, we can atomically mark all of its children as stale in the same transaction, and then descend into the children.
Sweeping won’t work as-is because if an entire directory is deleted, we won’t be able to mark its transitive descendants as stale. Luckily (again) for us, this is a perfect use case for cascading deletes.
ALTER TABLE files MODIFY parent INTEGER REFERENCES files(id) ON DELETE CASCADE;
- We also cannot just delete everything that’s stale at the end, because there might be stale entries that are outside our subtree. Luckily, we can just change the condition of the sweeping query:
DELETE FROM files WHERE stale = TRUE AND scan_id = cur_scan_id;
Because only files within the subtree will ever have scan_id = cur_scan_id AND stale = true, this protects us both from deleting files outside the subtree, and from deleting files that are updated by a newer scan.
So far so good, right? Not if we add another racy partener: the filesystem.
Concurrent filesystem changes
Consider the following timeline:
- Originall, database has
/ain it. - Scan 1 started, updated
/a. Because now 1 is the newest scan, it successfully updated the row, and continued recursion. It read/acontents, and swapped out immediately afterreaddir(2)returned. /agot deleted.- Scan 2 started, marked
/aas stale, and eventually deleted it from the database. - Scan 1 resumed, tries to insert
/a/binto database, and was hit with a foreign key constraint violation because/ano longer exists in the database.
A similar race may happen if a older scan got paused right after readdir(2), and then some children got deleted, but not the directory. This time, we won’t get a foreign key violation, but some deleted files will be “resurrected” by the awaken older scan.
So the row insertion should check if parent still exists and is under the current scan. We don’t need to check this for existing rows: for existing rows, either it’s updated by the newer scan, or haven’t been reached.
More subtly, even with this check, if ID is reused, a similar race might occur if the scanning does not follow a strictly DFS order (e.g. with a parallel fs walker):
- A directory that we visited got deleted by a newer scan.
- Current scan allocated the ID again for a file in another directory.
This is a classic ABA problem. To avoid this, recall that AUTO_INCREMENT primary keys has the special property that they are never reused[1]. So we also needs:
ALTER TABLE files MODIFY id INTEGER PRIMARY KEY AUTO_INCREMENT;
Sum up
The tables:
CREATE TABLE files (
id INTEGER PRIMARY KEY AUTO_INCREMENT,
name TEXT NOT NULL,
scan_id INTEGER NOT NULL REFERENCES scans(id),
stale BOOLEAN NOT NULL DEFAULT FALSE,
parent INTEGER REFERENCES files(id) ON DELETE CASCADE -- Null for root
-- Maybe some unique constraint on (parent, name)
);
CREATE TABLE scans (
id INTEGER PRIMARY KEY AUTO_INCREMENT
);
Recursion:
/**
* Returns the ID of current updated file,
* or none if we should not descend further
* because the subtree is already covered by a
* newer scan
*/
fn update(scan: i64, path: &Path, parent: Option<i64>) -> Option<i64> {
// Serializable transaction: no write between
// our check for the existance of the row
// and the insertion / update of the row
//
// See footnote [2]
//
// Also, serialization failure retry loop is omitted
let mut tx = db.begin_serializable();
// This get-insert-update can also be turned into a upsert, if the database supports it.
let cur: Option<_> =
query!(tx, "
SELECT id, scan_id FROM files WHERE parent = ? AND name = ?
",
parent,
path.file_name().unwrap()
).fetch_one();
let Some(cur) = cur else {
// Not in database yet
let ret = query!(tx, r#"
INSERT INTO files (name, parent, scan_id)
SELECT ?1, ?2, ?3
WHERE EXISTS (
SELECT 1 FROM files
WHERE id = ?2
AND scan_id <= ?3
)
"#,
path.file_name().unwrap(),
parent,
scan
);
if ret.affected_rows() == 0 {
// Parent got updated
return None;
}
tx.commit();
return Some(ret.last_inserted_id());
}
if scan < cur.scan_id {
// A newer scan is covering this subtree
// tx aborted
return None;
}
query!(tx, "UPDATE files SET stale = FALSE, scan_id = ? ... WHERE id = ?", scan, cur.id);
query!(tx, "UPDATE files SET stale = true, scan_id = ? WHERE parent = ? AND scan_id < ?", scan, cur.id, scan);
tx.commit();
return Some(cur.id);
}
fn walk(scan: i64, path: &Path, parent: Option<i64>) {
let Some(id) = update(scan, path, parent) else {
return;
}
if !is_dir(path) { return; }
for entry in readdir(path) {
walk(scan, &path.join(entry.name), Some(id));
}
}
Clean up:
DELETE FROM files WHERE stale = TRUE AND scan_id = cur_scan_id
Two final things:
- Be careful with filesystem error. For example, the directory read by the scanner may be concurrently deleted, so readdir and friends will return not found. This should be handled as a missing file (just
continuein the loop) - During subtree scans, remember to update the trunk leading to this branch, and gets their ID. ID = NULL is specifically reserved for the top-most level.
- The update should not mark the other files in the trunk (the sibling subtrees) as stale, to preserve the condition that only the targeted subtree can have
scan_id = cur_scan_id AND stale = TRUE.
- Special handling needed if we want to delete the entire partial scan root.
Why?
-
[1] This is somewhat database dependent: SQLite uses
AUTOINCREMENTwhich also has this semantics. MySQL can reuseAUTO_INCREMENTprimary keys, but only across database restarts, using InnoDB, and <8.0.0. -
[2] Actually, we only need the atomicity + isolation for the check then insertion-or-update sequence. This may be done in a upsert operation in the SQL server of your like, with their own non-standard SQL extensions.
All of them have some caveats, but all seems to just work out fine if we only need to return
Noneon all falure cases. Claude Opus 4.8 gives the following snippet for SQLite, which requires theUNIQUE(parent, name)constraint:INSERT INTO files (name, parent, scan_id) SELECT :name, :parent, :scan WHERE :parent IS NULL OR EXISTS (SELECT 1 FROM files WHERE id = :parent AND scan_id <= :scan) ON CONFLICT(parent, name) DO UPDATE SET scan_id = excluded.scan_id, stale = FALSE -- , ...content... WHERE files.scan_id <= excluded.scan_id RETURNING id;The stale marking can be done completely standalone.