Building a File Indexer in Rust: A Journey Through Systems Programming Hell

Three months ago, I thought building a file indexer would be a straightforward weekend project. “How hard could it be?” I asked myself, staring at my Systems Programming assignment requirements. Fast-forward to today, and I’ve learned more about Rust’s ownership model, async programming, and the depths of my own patience than I ever thought possible.

The Ambitious Beginning

The project started simple enough: build a command-line tool that could recursively scan directories, index file metadata, and provide fast search capabilities. Think of it as a lightweight alternative to locate or Windows Search, but with more granular control over what gets indexed and how.

My initial requirements were:

  • Recursive directory traversal
  • Metadata extraction (size, modified time, file type)
  • Full-text content indexing for common file types
  • SQLite database for persistence
  • Concurrent processing for performance
  • A clean CLI interface

Coming from Python and Java in my previous coursework, I figured Rust would just be “C++ but safer.” How naive I was.

First Contact with the Borrow Checker

My first attempt at the core indexing logic looked something like this:

fn index_directory(path: &Path, index: &mut FileIndex) -> Result<(), IndexError> {
    for entry in fs::read_dir(path)? {
        let entry = entry?;
        let metadata = entry.metadata()?;
        if metadata.is_dir() {
            index_directory(&entry.path(), index)?; // This seemed reasonable...
        } else {
            let file_info = extract_file_info(&entry.path(), &metadata);
            index.add_file(file_info); // The borrow checker had other plans
        }
    }
    Ok(())
}

The borrow checker immediately shut me down. Apparently, you can’t just pass mutable references around willy-nilly like in other languages. The error messages were… educational:

error[E0502]: cannot borrow `index` as mutable because it is also borrowed as immutable

After spending an embarrassing amount of time on Stack Overflow, I learned about Rc<RefCell<T>> and interior mutability. This led to my second iteration, which worked but felt like I was fighting the language at every step.
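
To make the interior-mutability pattern concrete, here's a minimal, self-contained sketch; the FileInfo and FileIndex types are stripped-down stand-ins for my real ones:

use std::cell::RefCell;
use std::rc::Rc;

struct FileInfo {
    path: std::path::PathBuf,
}

struct FileIndex {
    files: Vec<FileInfo>,
}

fn add_to_shared_index(index: &Rc<RefCell<FileIndex>>, info: FileInfo) {
    // borrow_mut() enforces the borrow rules at runtime instead of compile
    // time: a second active borrow would panic rather than fail to compile.
    index.borrow_mut().files.push(info);
}

fn main() {
    let index = Rc::new(RefCell::new(FileIndex { files: Vec::new() }));
    let handle = Rc::clone(&index); // clones the pointer, not the data
    add_to_shared_index(&handle, FileInfo { path: "Cargo.toml".into() });
    println!("indexed {} files", index.borrow().files.len());
}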

The Async Rabbit Hole

Performance was terrible with my initial synchronous approach. Indexing my home directory (about 50GB, 200k files) took nearly 20 minutes. Clearly, I needed concurrency.

Enter Tokio and async Rust. I thought I understood async programming from JavaScript, but Rust’s approach is… different. My first async attempt was a disaster:

async fn index_file_async(path: PathBuf) -> Result<FileInfo, IndexError> {
    let metadata = tokio::fs::metadata(&path).await?;
    let content = if should_index_content(&path) {
        Some(tokio::fs::read_to_string(&path).await?)
    } else {
        None
    };
    Ok(FileInfo {
        path: path.clone(),
        size: metadata.len(),
        modified: metadata.modified()?,
        content_hash: content.as_ref().map(|c| hash_content(c)),
        indexed_content: content,
    })
}

This looked reasonable until I tried to use it with database operations. A rusqlite Connection isn't Sync, which meant I couldn't simply share one across async tasks. After wrestling with connection pools and Arc<Mutex<Connection>> wrappers, I eventually discovered sqlx and its async SQLite support.
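
For reference, the sqlx version of a write looks roughly like this. It's a sketch that assumes sqlx with its sqlite feature and the Tokio runtime enabled; the table and connection string are purely illustrative:

use sqlx::sqlite::SqlitePoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    // An in-memory database keeps the example self-contained.
    let pool = SqlitePoolOptions::new()
        .max_connections(4)
        .connect("sqlite::memory:")
        .await?;

    sqlx::query("CREATE TABLE files (path TEXT UNIQUE NOT NULL, size INTEGER NOT NULL)")
        .execute(&pool)
        .await?;

    // The pool is cheap to clone and safe to share across spawned tasks,
    // which is exactly what the raw connection wasn't.
    sqlx::query("INSERT INTO files (path, size) VALUES (?1, ?2)")
        .bind("Cargo.toml")
        .bind(1024_i64)
        .execute(&pool)
        .await?;

    Ok(())
}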

The Great Rewrite #1: Channels and Worker Pools

Frustrated with the complexity, I decided to step back and use a more traditional approach with channels and worker threads. This actually worked pretty well:

use std::sync::mpsc;
use std::thread;

struct IndexWorker {
    receiver: mpsc::Receiver<PathBuf>,
    db_sender: mpsc::Sender<FileInfo>,
}

impl IndexWorker {
    fn run(self) {
        while let Ok(path) = self.receiver.recv() {
            match self.process_file(path) {
                Ok(file_info) => {
                    if self.db_sender.send(file_info).is_err() {
                        break; // Database writer has shut down
                    }
                }
                Err(e) => eprintln!("Error processing file: {}", e),
            }
        }
    }
}

I spawned multiple worker threads to process files and a dedicated database writer thread to handle all SQLite operations. This avoided the Send + Sync issues and gave me much better performance—indexing time dropped to about 3 minutes.
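
For completeness, here's roughly how the pieces wire together. It's a simplified, runnable sketch rather than my actual code: process_file only stats the file, and the "database writer" just prints instead of talking to SQLite. Because std's mpsc receivers can't be cloned, each worker owns its own path channel and work is dealt out round-robin:

use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;

struct FileInfo {
    path: PathBuf,
    size: u64,
}

// Stand-in for the real metadata/content extraction.
fn process_file(path: PathBuf) -> std::io::Result<FileInfo> {
    let size = std::fs::metadata(&path)?.len();
    Ok(FileInfo { path, size })
}

fn main() {
    let workers: usize = 4;
    let (db_tx, db_rx) = mpsc::channel::<FileInfo>();

    // One path channel per worker, since a Receiver can't be shared.
    let mut path_senders = Vec::new();
    let mut handles = Vec::new();
    for _ in 0..workers {
        let (tx, rx) = mpsc::channel::<PathBuf>();
        path_senders.push(tx);
        let db_tx = db_tx.clone();
        handles.push(thread::spawn(move || {
            while let Ok(path) = rx.recv() {
                match process_file(path) {
                    Ok(info) => {
                        if db_tx.send(info).is_err() {
                            break; // database writer has shut down
                        }
                    }
                    Err(e) => eprintln!("Error processing file: {}", e),
                }
            }
        }));
    }
    drop(db_tx); // only the workers' clones remain

    // Dedicated writer thread: in the real tool this owns the SQLite
    // connection; here it just prints what it receives.
    let writer = thread::spawn(move || {
        for info in db_rx {
            println!("indexed {} ({} bytes)", info.path.display(), info.size);
        }
    });

    // Deal paths out round-robin.
    for (i, path) in ["Cargo.toml", "src/main.rs"].into_iter().enumerate() {
        let _ = path_senders[i % workers].send(PathBuf::from(path));
    }
    drop(path_senders); // closing the channels lets the workers exit

    for handle in handles {
        let _ = handle.join();
    }
    let _ = writer.join();
}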

Content Indexing: The Text Processing Nightmare

The metadata indexing was working great, but I wanted full-text search capabilities. This meant parsing content from various file types: plain text, PDFs, Word documents, even extracting metadata from images.

For text files, it was straightforward. For everything else… let’s just say I gained a new appreciation for the complexity of document formats. PDFs alone nearly broke me—between password-protected files, corrupted documents, and files with weird encodings, my error handling became increasingly paranoid:

fn extract_pdf_text(path: &Path) -> Result<String, ExtractionError> {
    // pdf_extract::extract_text opens and reads the file itself,
    // so it takes a path rather than a reader
    match pdf_extract::extract_text(path) {
        Ok(text) => Ok(text),
        Err(pdf_extract::OutputError::PdfError(_)) => {
            // Try with a different extraction method
            fallback_pdf_extraction(path)
        }
        Err(e) => {
            eprintln!("PDF extraction failed for {:?}: {}", path, e);
            Ok(String::new()) // Return empty string rather than failing
        }
    }
}

The Database Schema Evolution

My database schema went through several iterations as I learned more about SQLite and query optimization:

Version 1 (Naive):

CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    path TEXT UNIQUE,
    size INTEGER,
    modified_time INTEGER,
    content_hash TEXT,
    indexed_content TEXT
);

Version 3 (Final):

CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    path TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    extension TEXT,
    size INTEGER NOT NULL,
    modified_time INTEGER NOT NULL,
    content_hash TEXT,
    file_type TEXT NOT NULL
);

CREATE TABLE file_content (
    file_id INTEGER PRIMARY KEY,
    content TEXT,
    FOREIGN KEY (file_id) REFERENCES files(id)
);

CREATE VIRTUAL TABLE content_fts USING fts5(
    content,
    content_rowid UNINDEXED
);

CREATE INDEX idx_files_name ON files(name);
CREATE INDEX idx_files_extension ON files(extension);
CREATE INDEX idx_files_size ON files(size);

The separation of content into its own table and the use of FTS5 for full-text search made queries dramatically faster. I also learned the hard way about SQLite’s transaction behavior and the importance of batch operations for performance.
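
To make the batching concrete, here's a minimal rusqlite sketch of the pattern; the two-column table is illustrative, and the real code batches full file records in transactions of 1000:

use rusqlite::{params, Connection, Result};

// Insert a whole batch inside one transaction: per-row autocommits are what
// made the early versions slow, since each commit forces a sync to disk.
fn insert_batch(conn: &mut Connection, batch: &[(String, i64)]) -> Result<()> {
    let tx = conn.transaction()?;
    {
        // The statement borrows the transaction, so drop it before committing.
        let mut stmt = tx.prepare_cached(
            "INSERT OR REPLACE INTO files (path, size) VALUES (?1, ?2)",
        )?;
        for (path, size) in batch {
            stmt.execute(params![path.as_str(), *size])?;
        }
    }
    tx.commit()
}

fn main() -> Result<()> {
    let mut conn = Connection::open_in_memory()?;
    conn.execute(
        "CREATE TABLE files (path TEXT UNIQUE NOT NULL, size INTEGER NOT NULL)",
        [],
    )?;
    insert_batch(
        &mut conn,
        &[("Cargo.toml".into(), 1_024), ("src/main.rs".into(), 4_096)],
    )?;
    Ok(())
}

On the search side, the FTS5 table turns full-text queries into a plain SELECT with WHERE content_fts MATCH ?, joined back to files for the metadata.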

Error Handling: From Panic to Grace

My early error handling was… optimistic. Any I/O error would propagate up and crash the entire indexing process. Not ideal when you’re trying to index a directory with thousands of files, some of which might be locked, corrupted, or on failing drives.

I eventually settled on a layered approach:

  • File-level errors (permission denied, file not found) are logged but don’t stop processing
  • Directory-level errors halt processing of that subtree but continue elsewhere
  • Database errors are more serious and might require stopping the entire operation
  • Critical errors (out of disk space, database corruption) shut everything down

#[derive(Debug)]
enum IndexError {
    IoError(io::Error),
    DatabaseError(rusqlite::Error),
    ContentExtractionError(String),
    CriticalError(String),
}

impl IndexError {
    fn is_recoverable(&self) -> bool {
        match self {
            IndexError::IoError(_) => true,
            IndexError::ContentExtractionError(_) => true,
            IndexError::DatabaseError(_) => false,
            IndexError::CriticalError(_) => false,
        }
    }
}
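
Here's how that classification plays out at the call site, in a trimmed-down, runnable form; index_one is a stand-in that only stats the file, and the enum is cut down to two variants:

use std::io;

#[derive(Debug)]
enum IndexError {
    IoError(io::Error),
    CriticalError(String),
}

impl IndexError {
    fn is_recoverable(&self) -> bool {
        matches!(self, IndexError::IoError(_))
    }
}

// Stand-in for the real per-file work.
fn index_one(path: &str) -> Result<(), IndexError> {
    std::fs::metadata(path).map_err(IndexError::IoError)?;
    Ok(())
}

fn main() -> Result<(), IndexError> {
    for path in ["Cargo.toml", "definitely-missing.txt"] {
        match index_one(path) {
            Ok(()) => println!("indexed {}", path),
            // Recoverable: log it and keep going with the rest of the batch.
            Err(e) if e.is_recoverable() => eprintln!("skipping {}: {:?}", path, e),
            // Unrecoverable: stop the whole run.
            Err(e) => return Err(e),
        }
    }
    Ok(())
}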

Performance Optimization: The Fun Part

Once the core functionality was working, I got obsessed with performance optimization. This was actually the most enjoyable part of the project—there’s something deeply satisfying about watching your code get faster.

  • Initial performance: 200k files in 20 minutes
  • After async/threading: 200k files in 3 minutes
  • After database optimization: 200k files in 90 seconds
  • After smarter file filtering: 150k relevant files in 45 seconds

Key optimizations included:

  • Skip hidden directories and common build artifacts (node_modules, target/, .git/); there's a sketch of this right after the list
  • Batch database insertions in transactions of 1000 records
  • Use memory-mapped files for large text files
  • Implement a simple LRU cache for recently accessed metadata
  • Add progress indicators so users don’t think it’s hung
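
As an example of the first optimization, the directory filter is just a name check applied during traversal. This sketch uses the walkdir crate to keep it short; the same predicate works with a hand-rolled fs::read_dir walk like the one earlier in the post:

use walkdir::{DirEntry, WalkDir};

// Directories we never descend into: hidden dirs plus common build artifacts.
fn should_skip(entry: &DirEntry) -> bool {
    entry.depth() > 0
        && entry.file_type().is_dir()
        && entry
            .file_name()
            .to_str()
            .map(|name| name.starts_with('.') || name == "node_modules" || name == "target")
            .unwrap_or(false)
}

fn main() {
    // filter_entry prunes whole subtrees, so a skipped directory is never read.
    for entry in WalkDir::new(".")
        .into_iter()
        .filter_entry(|e| !should_skip(e))
        .filter_map(Result::ok)
    {
        if entry.file_type().is_file() {
            println!("{}", entry.path().display());
        }
    }
}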

The CLI Interface: Making It User-Friendly

The final piece was building a decent command-line interface. I used clap for argument parsing, which made this relatively painless:

#[derive(Parser)]
#[command(name = "fileindexer")]
#[command(about = "A fast file indexer with full-text search")]
struct Cli {
    #[command(subcommand)]
    command: Commands,
}

#[derive(Subcommand)]
enum Commands {
    Index {
        #[arg(help = "Directory to index")]
        path: PathBuf,
        #[arg(long, help = "Include file content in index")]
        full_text: bool,
    },
    Search {
        #[arg(help = "Search query")]
        query: String,
        #[arg(long, help = "Search in file content")]
        content: bool,
    },
    Stats {
        #[arg(long, help = "Show detailed statistics")]
        detailed: bool,
    },
}
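
With those definitions, clap generates the expected kebab-case flags, so typical invocations look like fileindexer index ~/projects --full-text, fileindexer search "borrow checker" --content, and fileindexer stats --detailed (the paths and queries are just examples).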

I also added colored output using colored and progress bars with indicatif. These quality-of-life improvements made the tool much more pleasant to use during development and testing.

Lessons Learned

This project taught me more about systems programming than any textbook could. Here are the key takeaways:

Rust-specific lessons:

  • The borrow checker is your friend, even when it doesn’t feel like it
  • Result<T, E> and proper error handling make your code more robust
  • Ownership and borrowing prevent entire classes of bugs I didn’t even know I was writing in other languages
  • The ecosystem is fantastic—crates.io has high-quality libraries for almost everything

General programming lessons:

  • Start simple and add complexity gradually
  • Measure performance before optimizing
  • Good error handling is just as important as the happy path
  • User experience matters, even for command-line tools
  • Database schema design has a huge impact on query performance

Project management lessons:

  • Scope creep is real—I added features I didn’t originally plan for
  • Testing on diverse file systems and directory structures is crucial
  • Documentation matters (future me will thank present me)

The Current State

The file indexer is now a reasonably polished tool that I actually use regularly. It can index my home directory in under a minute and provides fast search across both filenames and content. The codebase is about 2,500 lines of Rust, with comprehensive error handling and a test suite that covers the core functionality.

Some statistics from indexing my development machine:

  • 147,000 files indexed
  • 89GB of content processed
  • 2.1GB SQLite database
  • Average search response time: 12ms
  • Memory usage during indexing: ~150MB

What’s Next?

There are several features I’d like to add:

  • Watch mode for real-time index updates using filesystem events
  • Web interface for remote searching
  • Plugin system for custom file type processors
  • Integration with external search tools like ripgrep
  • Distributed indexing for network drives

But for now, I’m calling this project complete. It solved the original problem, taught me a ton about Rust and systems programming, and gave me a tool I actually use.

If you’re a CS student considering a Rust project, I’d highly recommend it. The learning curve is steep, but the language really does help you write better, more reliable code. Just be prepared to spend quality time with the compiler—it has opinions, and it’s usually right.

The complete source code is available on my GitHub.

