Building a File Indexer in Rust: A Journey Through Systems Programming Hell
Three months ago, I thought building a file indexer would be a straightforward weekend project. “How hard could it be?” I asked myself, staring at my Systems Programming assignment requirements. Fast-forward to today, and I’ve learned more about Rust’s ownership model, async programming, and the depths of my own patience than I ever thought possible.
The Ambitious Beginning
The project started simple enough: build a command-line tool that could recursively scan directories, index file metadata, and provide fast search capabilities. Think of it as a lightweight alternative to locate or Windows Search, but with more granular control over what gets indexed and how.
My initial requirements were:
- Recursive directory traversal
- Metadata extraction (size, modified time, file type)
- Full-text content indexing for common file types
- SQLite database for persistence
- Concurrent processing for performance
- A clean CLI interface
Coming from Python and Java in my previous coursework, I figured Rust would just be “C++ but safer.” How naive I was.
First Contact with the Borrow Checker
My first attempt at the core indexing logic looked something like this:
fn index_directory(path: &Path, index: &mut FileIndex) -> Result<(), IndexError> {
    for entry in fs::read_dir(path)? {
        let entry = entry?;
        let metadata = entry.metadata()?;

        if metadata.is_dir() {
            index_directory(&entry.path(), index)?; // This seemed reasonable...
        } else {
            let file_info = extract_file_info(&entry.path(), &metadata);
            index.add_file(file_info); // The borrow checker had other plans
        }
    }
    Ok(())
}
The borrow checker immediately shut me down. Apparently, you can’t just pass mutable references around willy-nilly like in other languages. The error messages were… educational:
error[E0502]: cannot borrow `index` as mutable because it is also borrowed as immutable
After spending an embarrassing amount of time on Stack Overflow, I learned about Rc<RefCell<T>> and interior mutability. This led to my second iteration, which worked but felt like I was fighting the language at every step.
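That second iteration boiled down to something like the sketch below, with FileIndex and FileInfo reduced to simplified stand-ins for the real types:

use std::cell::RefCell;
use std::fs;
use std::path::{Path, PathBuf};
use std::rc::Rc;

// Simplified stand-ins for the real types.
struct FileInfo { path: PathBuf, size: u64 }
struct FileIndex { files: Vec<FileInfo> }

// Sharing the index through Rc<RefCell<...>> lets every level of the
// recursion borrow it mutably at runtime instead of at compile time.
fn index_directory(path: &Path, index: Rc<RefCell<FileIndex>>) -> std::io::Result<()> {
    for entry in fs::read_dir(path)? {
        let entry = entry?;
        let metadata = entry.metadata()?;
        if metadata.is_dir() {
            index_directory(&entry.path(), Rc::clone(&index))?;
        } else {
            index.borrow_mut().files.push(FileInfo {
                path: entry.path(),
                size: metadata.len(),
            });
        }
    }
    Ok(())
}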
The Async Rabbit Hole
Performance was terrible with my initial synchronous approach. Indexing my home directory (about 50GB, 200k files) took nearly 20 minutes. Clearly, I needed concurrency.
Enter Tokio and async Rust. I thought I understood async programming from JavaScript, but Rust’s approach is… different. My first async attempt was a disaster:
async fn index_file_async(path: PathBuf) -> Result<FileInfo, IndexError> {
    let metadata = tokio::fs::metadata(&path).await?;
    let content = if should_index_content(&path) {
        Some(tokio::fs::read_to_string(&path).await?)
    } else {
        None
    };

    Ok(FileInfo {
        path: path.clone(),
        size: metadata.len(),
        modified: metadata.modified()?,
        content_hash: content.as_ref().map(|c| hash_content(c)),
        indexed_content: content,
    })
}
This looked reasonable until I tried to use it with database operations. SQLite connections aren't Send + Sync, which meant I couldn't share them across async tasks. After wrestling with connection pools and Arc<Mutex<Connection>> wrappers, I eventually discovered sqlx and its async SQLite support.
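For a flavour of what that looks like, here's a minimal sketch of an async insert with sqlx, assuming a simplified files(path, size, modified_time) table; FileRow and insert_file are illustrative names rather than my actual code:

use sqlx::SqlitePool;

// Simplified row type for the sketch; the real FileInfo has more fields.
struct FileRow {
    path: String,
    size: i64,
    modified_time: i64, // unix timestamp
}

// With sqlx, the pool itself is Send + Sync, so inserts can be awaited
// from any task without wrapping a connection in Arc<Mutex<...>>.
async fn insert_file(pool: &SqlitePool, row: &FileRow) -> Result<(), sqlx::Error> {
    sqlx::query("INSERT OR REPLACE INTO files (path, size, modified_time) VALUES (?, ?, ?)")
        .bind(row.path.as_str())
        .bind(row.size)
        .bind(row.modified_time)
        .execute(pool)
        .await?;
    Ok(())
}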
The Great Rewrite #1: Channels and Worker Pools
Frustrated with the complexity, I decided to step back and use a more traditional approach with channels and worker threads. This actually worked pretty well:
use std::sync::mpsc;
use std::thread;

struct IndexWorker {
    receiver: mpsc::Receiver<PathBuf>,
    db_sender: mpsc::Sender<FileInfo>,
}

impl IndexWorker {
    fn run(self) {
        while let Ok(path) = self.receiver.recv() {
            match self.process_file(path) {
                Ok(file_info) => {
                    if self.db_sender.send(file_info).is_err() {
                        break; // Database writer has shut down
                    }
                }
                Err(e) => eprintln!("Error processing file: {}", e),
            }
        }
    }
}
I spawned multiple worker threads to process files and a dedicated database writer thread to handle all SQLite operations. This avoided the Send + Sync issues and gave me much better performance: indexing time dropped to about 3 minutes.
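The wiring around those workers looked roughly like the sketch below. It isn't my exact code: process_file stands in for the real metadata and content extraction, and each worker gets its own channel here because a std::sync::mpsc receiver can't be shared between threads:

use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;

fn run_indexing(paths: Vec<PathBuf>, num_workers: usize) {
    // One shared sender into the database writer, one work channel per worker.
    let (db_sender, db_receiver) = mpsc::channel::<FileInfo>();

    // Dedicated writer thread: the only place that touches SQLite.
    let writer = thread::spawn(move || {
        while let Ok(_file_info) = db_receiver.recv() {
            // batch into transactions and insert...
        }
    });

    let mut work_senders = Vec::new();
    let mut workers = Vec::new();
    for _ in 0..num_workers {
        let (tx, rx) = mpsc::channel::<PathBuf>();
        work_senders.push(tx);
        let worker = IndexWorker { receiver: rx, db_sender: db_sender.clone() };
        workers.push(thread::spawn(move || worker.run()));
    }
    drop(db_sender); // the writer exits once every worker has finished

    // Round-robin file paths across the workers.
    for (i, path) in paths.into_iter().enumerate() {
        let _ = work_senders[i % num_workers].send(path);
    }
    drop(work_senders); // closing the channels lets the workers drain and stop

    for w in workers {
        let _ = w.join();
    }
    let _ = writer.join();
}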
Content Indexing: The Text Processing Nightmare
The metadata indexing was working great, but I wanted full-text search capabilities. This meant parsing content from various file types: plain text, PDFs, Word documents, even extracting metadata from images.
For text files, it was straightforward. For everything else… let’s just say I gained a new appreciation for the complexity of document formats. PDFs alone nearly broke me—between password-protected files, corrupted documents, and files with weird encodings, my error handling became increasingly paranoid:
fn extract_pdf_text(path: &Path) -> Result<String, ExtractionError> {
    // pdf_extract reads the file itself, so no manual File/BufReader is needed.
    match pdf_extract::extract_text(path) {
        Ok(text) => Ok(text),
        Err(e) => {
            // Try a different extraction method before giving up entirely.
            if let Ok(text) = fallback_pdf_extraction(path) {
                return Ok(text);
            }
            eprintln!("PDF extraction failed for {:?}: {}", path, e);
            Ok(String::new()) // Index the file without content rather than failing
        }
    }
}
The Database Schema Evolution
My database schema went through several iterations as I learned more about SQLite and query optimization:
Version 1 (Naive):
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    path TEXT UNIQUE,
    size INTEGER,
    modified_time INTEGER,
    content_hash TEXT,
    indexed_content TEXT
);
Version 3 (Final):
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    path TEXT UNIQUE NOT NULL,
    name TEXT NOT NULL,
    extension TEXT,
    size INTEGER NOT NULL,
    modified_time INTEGER NOT NULL,
    content_hash TEXT,
    file_type TEXT NOT NULL
);

CREATE TABLE file_content (
    file_id INTEGER PRIMARY KEY,
    content TEXT,
    FOREIGN KEY (file_id) REFERENCES files(id)
);

CREATE VIRTUAL TABLE content_fts USING fts5(
    content,
    content_rowid UNINDEXED
);

CREATE INDEX idx_files_name ON files(name);
CREATE INDEX idx_files_extension ON files(extension);
CREATE INDEX idx_files_size ON files(size);
The separation of content into its own table and the use of FTS5 for full-text search made queries dramatically faster. I also learned the hard way about SQLite’s transaction behavior and the importance of batch operations for performance.
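For context, a content search against that schema looks roughly like this with rusqlite (search_content is illustrative, and it assumes content_fts.content_rowid stores the matching files.id):

use rusqlite::{params, Connection};

// Full-text search over indexed content, joining back to file paths.
fn search_content(conn: &Connection, query: &str) -> rusqlite::Result<Vec<String>> {
    let mut stmt = conn.prepare(
        "SELECT f.path
         FROM content_fts
         JOIN files f ON f.id = content_fts.content_rowid
         WHERE content_fts MATCH ?1
         ORDER BY rank",
    )?;
    let paths = stmt
        .query_map(params![query], |row| row.get::<_, String>(0))?
        .collect::<Result<Vec<_>, _>>()?;
    Ok(paths)
}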
Error Handling: From Panic to Grace
My early error handling was… optimistic. Any I/O error would propagate up and crash the entire indexing process. Not ideal when you’re trying to index a directory with thousands of files, some of which might be locked, corrupted, or on failing drives.
I eventually settled on a layered approach:
- File-level errors (permission denied, file not found) are logged but don’t stop processing
- Directory-level errors halt processing of that subtree but continue elsewhere
- Database errors are more serious and might require stopping the entire operation
- Critical errors (out of disk space, database corruption) shut everything down
#[derive(Debug)]
enum IndexError {
    IoError(io::Error),
    DatabaseError(rusqlite::Error),
    ContentExtractionError(String),
    CriticalError(String),
}

impl IndexError {
    fn is_recoverable(&self) -> bool {
        match self {
            IndexError::IoError(_) => true,
            IndexError::ContentExtractionError(_) => true,
            IndexError::DatabaseError(_) => false,
            IndexError::CriticalError(_) => false,
        }
    }
}
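In the worker loop, that classification turns into a small dispatch like the following sketch (handle_result and its exact signature are illustrative, not lifted from my code):

use std::sync::mpsc::Sender;

fn handle_result(
    result: Result<FileInfo, IndexError>,
    db_sender: &Sender<FileInfo>,
) -> Result<(), IndexError> {
    match result {
        Ok(file_info) => {
            // A send failure only means the writer thread already shut down.
            let _ = db_sender.send(file_info);
            Ok(())
        }
        // Recoverable errors (I/O, content extraction) are logged; the file is skipped.
        Err(e) if e.is_recoverable() => {
            eprintln!("skipping file: {:?}", e);
            Ok(())
        }
        // Database and critical errors bubble up and stop the run.
        Err(e) => Err(e),
    }
}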
Performance Optimization: The Fun Part
Once the core functionality was working, I got obsessed with performance optimization. This was actually the most enjoyable part of the project—there’s something deeply satisfying about watching your code get faster.
Initial performance: 200k files in 20 minutes
After async/threading: 200k files in 3 minutes
After database optimization: 200k files in 90 seconds
After smarter file filtering: 150k relevant files in 45 seconds
Key optimizations included:
- Skip hidden directories and common build artifacts (node_modules, target/, .git/)
- Batch database insertions in transactions of 1000 records (see the sketch after this list)
- Use memory-mapped files for large text files
- Implement a simple LRU cache for recently accessed metadata
- Add progress indicators so users don’t think it’s hung
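The batching in particular was a big win. Here's a minimal sketch of the idea with rusqlite; FileRecord and the trimmed column list are illustrative, and the real writer thread also feeds the FTS table:

use rusqlite::{params, Connection};

// Simplified record type for the sketch.
struct FileRecord {
    path: String,
    name: String,
    extension: Option<String>,
    size: i64,
    modified_time: i64,
    file_type: String,
}

// Insert a batch of records inside a single transaction instead of
// paying for one implicit transaction per INSERT.
fn write_batch(conn: &mut Connection, batch: &[FileRecord]) -> rusqlite::Result<()> {
    let tx = conn.transaction()?;
    {
        let mut stmt = tx.prepare(
            "INSERT OR REPLACE INTO files (path, name, extension, size, modified_time, file_type)
             VALUES (?1, ?2, ?3, ?4, ?5, ?6)",
        )?;
        for rec in batch {
            stmt.execute(params![
                rec.path,
                rec.name,
                rec.extension,
                rec.size,
                rec.modified_time,
                rec.file_type
            ])?;
        }
    }
    tx.commit()?;
    Ok(())
}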
The CLI Interface: Making It User-Friendly
The final piece was building a decent command-line interface. I used clap for argument parsing, which made this relatively painless:
#[derive(Parser)]
#[command(name = "fileindexer")]
#[command(about = "A fast file indexer with full-text search")]
struct Cli {
    #[command(subcommand)]
    command: Commands,
}

#[derive(Subcommand)]
enum Commands {
    Index {
        #[arg(help = "Directory to index")]
        path: PathBuf,
        #[arg(long, help = "Include file content in index")]
        full_text: bool,
    },
    Search {
        #[arg(help = "Search query")]
        query: String,
        #[arg(long, help = "Search in file content")]
        content: bool,
    },
    Stats {
        #[arg(long, help = "Show detailed statistics")]
        detailed: bool,
    },
}
I also added colored output using colored and progress bars with indicatif. These quality-of-life improvements made the tool much more pleasant to use during development and testing.
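The progress bar part is only a few lines; a rough sketch of how it's driven during indexing (the function and message here are illustrative):

use indicatif::ProgressBar;
use std::path::PathBuf;

fn index_with_progress(paths: &[PathBuf]) {
    let bar = ProgressBar::new(paths.len() as u64);
    for _path in paths {
        // ... metadata/content extraction for _path happens here ...
        bar.inc(1);
    }
    bar.finish_with_message("indexing complete");
}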
Lessons Learned
This project taught me more about systems programming than any textbook could. Here are the key takeaways:
Rust-specific lessons:
- The borrow checker is your friend, even when it doesn’t feel like it
- Result<T, E> and proper error handling make your code more robust
- Ownership and borrowing prevent entire classes of bugs I didn't even know I was writing in other languages
- The ecosystem is fantastic—crates.io has high-quality libraries for almost everything
General programming lessons:
- Start simple and add complexity gradually
- Measure performance before optimizing
- Good error handling is just as important as the happy path
- User experience matters, even for command-line tools
- Database schema design has a huge impact on query performance
Project management lessons:
- Scope creep is real—I added features I didn’t originally plan for
- Testing on diverse file systems and directory structures is crucial
- Documentation matters (future me will thank present me)
The Current State
The file indexer is now a reasonably polished tool that I actually use regularly. It can index my home directory in under a minute and provides fast search across both filenames and content. The codebase is about 2,500 lines of Rust, with comprehensive error handling and a test suite that covers the core functionality.
Some statistics from indexing my development machine:
- 147,000 files indexed
- 89GB of content processed
- 2.1GB SQLite database
- Average search response time: 12ms
- Memory usage during indexing: ~150MB
What’s Next?
There are several features I’d like to add:
- Watch mode for real-time index updates using filesystem events
- Web interface for remote searching
- Plugin system for custom file type processors
- Integration with external search tools like ripgrep
- Distributed indexing for network drives
But for now, I’m calling this project complete. It solved the original problem, taught me a ton about Rust and systems programming, and gave me a tool I actually use.
If you’re a CS student considering a Rust project, I’d highly recommend it. The learning curve is steep, but the language really does help you write better, more reliable code. Just be prepared to spend quality time with the compiler—it has opinions, and it’s usually right.
The complete source code is available on my GitHub.