
File Discovery Workflow

Comprehensive technical guide to the 5-stage AI workflow that identifies and filters relevant files for task execution.


The File Discovery Workflow is a 5-stage AI pipeline that identifies and filters the files in your repository that are relevant to a task before generating prompts or implementation plans. It combines git integration, binary file detection, AI-powered relevance assessment, and path correction to select the most useful context.

Workflow Architecture

The workflow operates as an orchestrated background job system with five distinct stages that execute sequentially. Each stage builds upon the previous stage's output, progressively refining the file selection based on task requirements.

The system uses a distributed job architecture where each stage runs as an independent background job, enabling cancellation, retry logic, and detailed progress tracking. Real-time events are published throughout execution to provide immediate feedback to the user interface.

Key Architecture Features:

  • Event-driven progress reporting with WebSocket-like updates
  • Comprehensive error handling with automatic retry mechanisms
  • Cost tracking and timeout management for AI operations
  • Caching of intermediate results for performance optimization
  • Git integration with fallback to directory traversal
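
As an illustration of this orchestration, the sketch below chains hypothetical stage functions sequentially and retries failures with exponential backoff. The stage names, types, and runWorkflow helper are assumptions for explanation, not the actual internal API.

type StageName =
  | "rootFolderSelection"
  | "regexFileFilter"
  | "aiFileRelevance"
  | "extendedPathFinder"
  | "pathCorrection";

interface StageOutput {
  files: string[];
  intermediateData?: Record<string, unknown>;
}

type StageFn = (input: StageOutput) => Promise<StageOutput>;

// Run the stages in order; each stage receives and refines the previous output.
// Failed stages are retried with exponential backoff up to maxRetries times.
async function runWorkflow(
  stages: [StageName, StageFn][],
  maxRetries = 3
): Promise<StageOutput> {
  let current: StageOutput = { files: [] };
  for (const [name, stage] of stages) {
    for (let attempt = 0; ; attempt++) {
      try {
        current = await stage(current);
        break;
      } catch (err) {
        if (attempt >= maxRetries) {
          throw new Error(`Stage "${name}" failed after ${attempt + 1} attempts: ${err}`);
        }
        await new Promise((r) => setTimeout(r, 2 ** attempt * 1000)); // backoff before retrying
      }
    }
  }
  return current;
}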

5-Stage Workflow Process

Stage 1: Root Folder Selection

Determines the project root directory and validates git repository status. This stage establishes the base directory for all subsequent file operations and configures exclusion patterns.

Technical Details: Uses git detection with fallback to directory validation, applies user-defined exclusion patterns, and establishes the working directory context for the entire workflow.
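
As an illustration, root detection with a git fallback might look like the sketch below, using Node's child_process and fs modules. The helper name and return shape are assumptions, not the actual implementation.

import { execSync } from "node:child_process";
import { existsSync, statSync } from "node:fs";

// Hypothetical helper: prefer the git top-level directory, fall back to the
// user-supplied directory when the path is not inside a git repository.
function resolveProjectRoot(candidate: string): { root: string; isGitRepo: boolean } {
  try {
    const root = execSync("git rev-parse --show-toplevel", {
      cwd: candidate,
      encoding: "utf8",
    }).trim();
    return { root, isGitRepo: true };
  } catch {
    if (!existsSync(candidate) || !statSync(candidate).isDirectory()) {
      throw new Error(`Invalid project directory: ${candidate}`);
    }
    return { root: candidate, isGitRepo: false };
  }
}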

Stage 2: Regex File Filter

Generates intelligent regex patterns based on the task description to perform initial file filtering. This stage combines git ls-files output with binary file detection to create a preliminary file list.

Git Integration: Executes `git ls-files --cached --others --exclude-standard` to respect .gitignore rules while including both tracked and untracked files.
Binary Detection: Filters out files with binary extensions (.jpg, .png, .pdf, .exe, etc.) and uses content analysis to detect binary files by null bytes and non-printable character ratios.
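
A minimal sketch of this listing and binary-detection step is shown below. The extension set and the 30% non-printable threshold are illustrative values, not the exact ones used internally.

import { execSync } from "node:child_process";
import { openSync, readSync, closeSync } from "node:fs";
import { extname } from "node:path";

const BINARY_EXTENSIONS = new Set([".jpg", ".png", ".pdf", ".exe", ".zip"]);

// List tracked and untracked files while respecting .gitignore rules.
function listRepoFiles(root: string): string[] {
  const out = execSync("git ls-files --cached --others --exclude-standard", {
    cwd: root,
    encoding: "utf8",
  });
  return out.split("\n").filter(Boolean);
}

// Content-based check: sample the first 8 KB and look for null bytes or a
// high ratio of non-printable characters.
function looksBinary(path: string): boolean {
  if (BINARY_EXTENSIONS.has(extname(path).toLowerCase())) return true;
  const buf = Buffer.alloc(8192);
  const fd = openSync(path, "r");
  const bytesRead = readSync(fd, buf, 0, buf.length, 0);
  closeSync(fd);
  if (bytesRead === 0) return false;
  let nonPrintable = 0;
  for (let i = 0; i < bytesRead; i++) {
    const b = buf[i];
    if (b === 0) return true;                        // null byte => treat as binary
    if (b < 9 || (b > 13 && b < 32)) nonPrintable++; // control characters
  }
  return nonPrintable / bytesRead > 0.3;             // illustrative threshold
}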

Stage 3: AI File Relevance Assessment

Employs AI models to analyze file content and assess relevance to the specific task description. This stage performs deep content analysis to identify files that are most likely to be useful for the given task.

AI Processing: Uses large language models to evaluate file content against task requirements, with intelligent batching to manage token limits and cost optimization.
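
The batching idea can be sketched as follows. The callModel parameter stands in for the actual AI request, and the batch size of 20 is purely illustrative.

type RelevanceFn = (
  task: string,
  batch: { path: string; snippet: string }[]
) => Promise<{ path: string; relevant: boolean }[]>;

// Group candidate files so each AI request stays within a token budget.
async function assessRelevance(
  task: string,
  files: { path: string; snippet: string }[],
  callModel: RelevanceFn,
  batchSize = 20 // illustrative; real batching would also weigh content length
): Promise<string[]> {
  const relevant: string[] = [];
  for (let i = 0; i < files.length; i += batchSize) {
    const scores = await callModel(task, files.slice(i, i + batchSize));
    relevant.push(...scores.filter((s) => s.relevant).map((s) => s.path));
  }
  return relevant;
}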

Stage 4: Extended Path Finder

Discovers additional relevant files through relationship analysis and dependency tracking. This stage identifies files that might not match initial patterns but are contextually important.

Relationship Analysis: Analyzes import statements, configuration files, and project structure to find related files that enhance the context for the specific task.
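
A simplified sketch of import-based relationship analysis is shown below; it only handles relative JavaScript/TypeScript imports and is an illustration rather than the actual analyzer.

import { readFileSync } from "node:fs";
import { dirname, resolve } from "node:path";

// Collect relative import targets from a source file so related modules can be
// pulled into the task context.
function findRelatedFiles(filePath: string): string[] {
  const source = readFileSync(filePath, "utf8");
  const importPattern = /(?:import\s[^'"]*|require\()\s*['"](\.{1,2}\/[^'"]+)['"]/g;
  const related: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = importPattern.exec(source)) !== null) {
    related.push(resolve(dirname(filePath), match[1])); // resolve relative to the importer
  }
  return related;
}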

Stage 5: Path Correction

Validates and corrects file paths, ensuring all selected files exist and are accessible. This final stage performs consistency checks and path normalization across different operating systems.

Validation Process: Verifies file existence, normalizes path separators, resolves symbolic links, and removes any paths that have become invalid during the workflow execution.
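
A minimal sketch of this validation step, using Node's fs and path modules, might look like the following (the function name is illustrative):

import { existsSync, realpathSync } from "node:fs";
import { normalize, sep, posix } from "node:path";

// Normalize separators, resolve symbolic links, and drop paths that no longer
// exist by the time the workflow finishes.
function correctPaths(paths: string[]): string[] {
  const corrected = new Set<string>();
  for (const p of paths) {
    const normalized = normalize(p).split(sep).join(posix.sep); // consistent "/" separators
    if (!existsSync(normalized)) continue;                      // drop invalid entries
    corrected.add(realpathSync(normalized));                    // resolve symbolic links
  }
  return [...corrected];
}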

Configuration Options

Workflow Configuration

Timeout Management

Configure the maximum execution time for the entire workflow or for individual stages so that a stalled stage cannot hang indefinitely.

timeoutMs: 300000 // 5 minutes default

Exclusion Patterns

Define directories and file patterns to exclude from the discovery process.

excludedPaths: ["node_modules", ".git", "dist", "build"]

Retry Configuration

Set maximum retry attempts for failed stages with exponential backoff.

maxRetries: 3 // Per stage retry limit
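
Taken together, these values feed the startWorkflow call shown in the API Usage Examples below. Only timeoutMs, excludedPaths, and maxRetries appear in this guide; the grouping of maxRetries into the options object is an assumption.

// Example values combining the options above.
const excludedPaths = ["node_modules", ".git", "dist", "build"];
const options = {
  timeoutMs: 300000, // 5-minute cap for the whole workflow
  maxRetries: 3,     // per-stage retry limit (placement in this object is an assumption)
};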

API Usage Examples

Starting a Workflow

const tracker = await WorkflowTracker.startWorkflow(
  sessionId,                                     // active session identifier
  "Add user authentication to the login page",   // task description
  "/path/to/project",                            // project root directory
  ["node_modules", "dist"],                      // excluded paths
  { timeoutMs: 300000 }                          // workflow options
);

Monitoring Progress

tracker.onProgress((state) => {
  console.log(`Stage: ${state.currentStage}`);
  console.log(`Progress: ${state.progressPercentage}%`);
});

tracker.onComplete((results) => {
  console.log(`Selected ${results.selectedFiles.length} files`);
});

Retrieving Results

const results = await tracker.getResults();
const selectedFiles = results.selectedFiles;
const intermediateData = results.intermediateData;
const totalCost = results.totalActualCost;

Performance Considerations

Memory Management

The workflow implements intelligent memory management with file caching (30-second TTL), batch processing (100 files per batch), and automatic cleanup of intermediate data to prevent memory exhaustion.
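
As a sketch of the 30-second file-caching behavior described above, a simple TTL cache could look like this (the class name and lazy eviction strategy are illustrative):

// Minimal TTL cache sketch matching the 30-second file-cache behavior above.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  constructor(private ttlMs = 30_000) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {  // expired entries are evicted lazily on read
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}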

Cost Optimization

AI stages track actual costs from API responses, implement intelligent batching to minimize token usage, and provide cost estimates before execution to help manage expenses.

Performance Monitoring

Built-in performance tracking monitors execution times, memory usage, and throughput metrics, and provides optimization recommendations based on historical data.

Integration Patterns

Desktop Application

The workflow integrates seamlessly with the desktop application through Tauri commands, providing native file system access and event-driven updates via the WorkflowTracker class.

Implementation Plans Integration

Selected files are automatically fed into the Implementation Plans panel, ensuring that plan generation uses the same optimized file context without requiring re-execution of the discovery workflow.

Session Management

Workflow results are cached per session, allowing multiple operations within the same session to reuse the discovered file context, significantly improving performance for iterative development workflows.
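
A per-session reuse pattern might be sketched like this; the map, key, and helper name are assumptions for illustration only.

// Reuse discovered files for repeated operations within the same session
// instead of re-running the discovery workflow.
const sessionResults = new Map<string, { selectedFiles: string[] }>();

async function getOrDiscoverFiles(
  sessionId: string,
  discover: () => Promise<{ selectedFiles: string[] }>
): Promise<{ selectedFiles: string[] }> {
  const cached = sessionResults.get(sessionId);
  if (cached) return cached;
  const result = await discover();
  sessionResults.set(sessionId, result);
  return result;
}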

Error Handling & Troubleshooting

Common Issues

  • Git repository not found: Falls back to directory traversal with standard exclusions
  • Binary file detection: Uses both extension-based and content-based binary detection
  • Token limit exceeded: Implements intelligent batching and provides clear error messages
  • Network timeouts: Automatic retry with exponential backoff for transient failures

Error Categories

  • Validation Errors: Invalid session ID, missing task description, or invalid project directory
  • Workflow Errors: Stage-specific failures with detailed context and retry suggestions
  • Billing Errors: Insufficient credits or payment failures with actionable guidance
  • System Errors: File system access, git command failures, or memory constraints

Debugging Tools

The workflow provides comprehensive logging, performance metrics export, and detailed error context including stage information, retry attempts, and intermediate data for troubleshooting.

Workflow State Management

State Transitions

The workflow progresses through clearly defined states: Created → Running → Paused (optional) → Completed/Failed/Canceled. Each state transition publishes events that can be monitored for real-time updates.
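
The state names below follow the transitions described above; the transition-guard map itself is an illustrative sketch, not the actual state machine.

type WorkflowState = "Created" | "Running" | "Paused" | "Completed" | "Failed" | "Canceled";

// Allowed next states for each current state.
const allowedTransitions: Record<WorkflowState, WorkflowState[]> = {
  Created:   ["Running", "Canceled"],
  Running:   ["Paused", "Completed", "Failed", "Canceled"],
  Paused:    ["Running", "Canceled"],
  Completed: [],
  Failed:    [],
  Canceled:  [],
};

function canTransition(from: WorkflowState, to: WorkflowState): boolean {
  return allowedTransitions[from].includes(to);
}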

Intermediate Data Storage

Each stage stores its output in a structured intermediate data format, including directory tree content, regex patterns, filtered file lists, and path correction results. This data is accessible for debugging and can be used to resume workflows from specific stages.
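
A rough shape for this intermediate data is sketched below; the field names mirror the description above but are assumptions, not the exact schema.

// Field names follow the description above; treat them as illustrative.
interface IntermediateData {
  directoryTree?: string;          // Stage 1: rendered directory tree content
  regexPatterns?: string[];        // Stage 2: generated filter patterns
  filteredFiles?: string[];        // Stages 2-4: progressively refined file lists
  pathCorrections?: { original: string; corrected: string }[]; // Stage 5 results
}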

Event-Driven Updates

The system publishes real-time events for workflow status changes, stage completions, and error conditions. These events enable responsive user interfaces and integration with external monitoring systems.
