File Discovery Workflow
Comprehensive technical guide to the 5-stage AI workflow that identifies and filters relevant files for task execution.
The File Discovery Workflow is a 5-stage AI pipeline that identifies and filters relevant files from your repository before generating prompts or implementation plans. It combines git integration, binary file detection, AI-powered relevance assessment, and path correction to select the most useful context for a given task.
Workflow Architecture
The workflow operates as an orchestrated background job system with five distinct stages that execute sequentially. Each stage builds upon the previous stage's output, progressively refining the file selection based on task requirements.
The system uses a distributed job architecture where each stage runs as an independent background job, enabling cancellation, retry logic, and detailed progress tracking. Real-time events are published throughout execution to provide immediate feedback to the user interface.
Key Architecture Features:
- Event-driven progress reporting with WebSocket-like updates
- Comprehensive error handling with automatic retry mechanisms
- Cost tracking and timeout management for AI operations
- Caching of intermediate results for performance optimization
- Git integration with fallback to directory traversal
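The event-driven progress reporting could be sketched minimally as follows. `ProgressBus` and `StageEvent` are illustrative names, not the actual API:

```typescript
// Illustrative sketch of event-driven stage progress reporting;
// ProgressBus and StageEvent are hypothetical names, not the real API.
type StageStatus = "started" | "completed" | "failed";

interface StageEvent {
  stage: number; // 1..5
  name: string; // e.g. "Regex File Filter"
  status: StageStatus;
}

class ProgressBus {
  private listeners: Array<(e: StageEvent) => void> = [];

  subscribe(listener: (e: StageEvent) => void): void {
    this.listeners.push(listener);
  }

  publish(event: StageEvent): void {
    // Deliver synchronously to every subscriber, in registration order.
    for (const listener of this.listeners) listener(event);
  }
}
```

Each stage would publish a `started` event when it begins and a `completed` or `failed` event when it finishes, which the user interface consumes for real-time updates.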
5-Stage Workflow Process
Stage 1: Root Folder Selection
Determines the project root directory and validates git repository status. This stage establishes the base directory for all subsequent file operations and configures exclusion patterns.
Stage 2: Regex File Filter
Generates intelligent regex patterns based on the task description to perform initial file filtering. This stage combines git ls-files output with binary file detection to create a preliminary file list.
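A minimal sketch of this stage's filtering step, assuming a candidate list (e.g. from `git ls-files`), task-derived regex patterns, and an extension-based binary skip. The function name and extension set are illustrative assumptions:

```typescript
// Sketch of the Stage 2 filter: keep files that match any task-derived
// pattern, skipping likely-binary files by extension. The extension set
// here is illustrative; the real stage also does content-based detection.
const BINARY_EXTENSIONS = new Set([".png", ".jpg", ".gif", ".zip", ".pdf", ".exe"]);

function filterFiles(files: string[], patterns: RegExp[]): string[] {
  return files.filter((file) => {
    const dot = file.lastIndexOf(".");
    const ext = dot >= 0 ? file.slice(dot).toLowerCase() : "";
    if (BINARY_EXTENSIONS.has(ext)) return false; // skip likely-binary files
    return patterns.some((pattern) => pattern.test(file)); // keep on any match
  });
}
```

Note that non-global patterns (no `g` flag) should be used here, since a global `RegExp` keeps state in `lastIndex` across `test` calls.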
Stage 3: AI File Relevance Assessment
Employs AI models to analyze file content and assess relevance to the specific task description. This stage performs deep content analysis to identify files that are most likely to be useful for the given task.
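The batching behavior of this stage might look like the following sketch. The real stage calls an AI model per batch; here `scoreBatch` is a pluggable (hypothetical) scoring function returning a 0..1 relevance score per file, and the threshold and batch size are assumed defaults:

```typescript
// Sketch of batched relevance assessment: score files in fixed-size batches
// and keep those at or above a threshold. scoreBatch stands in for the AI
// model call; names and defaults are illustrative.
function assessRelevance(
  files: string[],
  scoreBatch: (batch: string[]) => Map<string, number>,
  threshold = 0.5,
  batchSize = 100,
): string[] {
  const relevant: string[] = [];
  for (let i = 0; i < files.length; i += batchSize) {
    const scores = scoreBatch(files.slice(i, i + batchSize)); // one model call per batch
    for (const [file, score] of scores) {
      if (score >= threshold) relevant.push(file);
    }
  }
  return relevant;
}
```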
Stage 4: Extended Path Finder
Discovers additional relevant files through relationship analysis and dependency tracking. This stage identifies files that might not match initial patterns but are contextually important.
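One relationship heuristic this stage could use is scanning already-selected sources for relative import targets and adding those files to the candidate set. The regex and function name below are illustrative, not the actual implementation:

```typescript
// Sketch of an import-scanning heuristic: collect relative import targets
// (paths starting with ".") from source text so related files can be added
// to the candidate set. Illustrative only.
function extractRelativeImports(source: string): string[] {
  const importRe = /from\s+['"](\.[^'"]+)['"]/g;
  const targets: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = importRe.exec(source)) !== null) {
    targets.push(match[1]); // e.g. "./utils/session"
  }
  return targets;
}
```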
Stage 5: Path Correction
Validates and corrects file paths, ensuring all selected files exist and are accessible. This final stage performs consistency checks and path normalization across different operating systems.
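The cross-platform normalization step might be sketched as follows: convert backslashes, drop `.` segments, and resolve `..` against prior segments. In a real implementation Node's `path` module would typically handle this; the function here is illustrative:

```typescript
// Sketch of path normalization across operating systems. Converts Windows
// separators, skips "." segments, and lets ".." cancel the previous segment.
function normalizeRelativePath(p: string): string {
  const segments = p.replace(/\\/g, "/").split("/");
  const out: string[] = [];
  for (const segment of segments) {
    if (segment === "" || segment === ".") continue; // skip empty and "." parts
    if (segment === ".." && out.length > 0 && out[out.length - 1] !== "..") {
      out.pop(); // ".." cancels the previous segment
    } else {
      out.push(segment);
    }
  }
  return out.join("/");
}
```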
Configuration Options
Workflow Configuration
Timeout Management
Configure maximum execution time for the entire workflow or individual stages to prevent indefinite hanging.
timeoutMs: 300000 // 5 minutes default
Exclusion Patterns
Define directories and file patterns to exclude from the discovery process.
excludedPaths: ["node_modules", ".git", "dist", "build"]
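One plausible way the exclusion list is applied: a file is excluded when its path equals an entry or lies anywhere beneath it. The matching semantics here are an assumption; the real matcher may also support glob patterns:

```typescript
// Sketch of excludedPaths matching: exclude exact matches and anything
// under an excluded directory. Semantics are assumed, not confirmed.
function isExcluded(relativePath: string, excludedPaths: string[]): boolean {
  return excludedPaths.some(
    (excluded) => relativePath === excluded || relativePath.startsWith(excluded + "/"),
  );
}
```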
Retry Configuration
Set maximum retry attempts for failed stages with exponential backoff.
maxRetries: 3 // Per stage retry limit
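With exponential backoff, the wait between retries doubles on each attempt. A sketch of the delay schedule, assuming a 1-second base delay and a doubling factor (both illustrative defaults):

```typescript
// Sketch of exponential backoff delays for stage retries: 1s, 2s, 4s, ...
// Base delay and doubling factor are illustrative assumptions.
function backoffDelays(maxRetries: number, baseMs = 1000): number[] {
  return Array.from({ length: maxRetries }, (_, attempt) => baseMs * 2 ** attempt);
}
```

With `maxRetries: 3`, a failing stage would be retried after 1s, 2s, and 4s before the workflow reports a permanent failure.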
API Usage Examples
Starting a Workflow
const tracker = await WorkflowTracker.startWorkflow(
  sessionId,
  "Add user authentication to the login page",
  "/path/to/project",
  ["node_modules", "dist"],
  { timeoutMs: 300000 }
);
Monitoring Progress
tracker.onProgress((state) => {
  console.log(`Stage: ${state.currentStage}`);
  console.log(`Progress: ${state.progressPercentage}%`);
});

tracker.onComplete((results) => {
  console.log(`Selected ${results.selectedFiles.length} files`);
});
Retrieving Results
const results = await tracker.getResults();
const selectedFiles = results.selectedFiles;
const intermediateData = results.intermediateData;
const totalCost = results.totalActualCost;
Performance Considerations
Memory Management
The workflow implements intelligent memory management with file caching (30-second TTL), batch processing (100 files per batch), and automatic cleanup of intermediate data to prevent memory exhaustion.
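A minimal sketch of a TTL cache like the 30-second file cache described above. Expired entries are dropped lazily on read; class and method names are illustrative:

```typescript
// Sketch of a TTL cache for file contents (30-second default, matching the
// behavior described above). Expired entries are evicted lazily on read.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(private ttlMs = 30_000) {}

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // lazy eviction of expired entries
      return undefined;
    }
    return entry.value;
  }
}
```

Lazy eviction keeps the implementation simple at the cost of holding expired entries until their next read; a background sweep would be needed if stale entries must be reclaimed promptly.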
Cost Optimization
AI stages track actual costs from API responses, implement intelligent batching to minimize token usage, and provide cost estimates before execution to help manage expenses.
Performance Monitoring
Built-in performance tracking monitors execution times, memory usage, throughput metrics, and provides recommendations for optimization based on historical data analysis.
Integration Patterns
Desktop Application
The workflow integrates seamlessly with the desktop application through Tauri commands, providing native file system access and event-driven updates via the WorkflowTracker class.
Implementation Plans Integration
Selected files are automatically fed into the Implementation Plans panel, ensuring that plan generation uses the same optimized file context without requiring re-execution of the discovery workflow.
Session Management
Workflow results are cached per session, allowing multiple operations within the same session to reuse the discovered file context, significantly improving performance for iterative development workflows.
Error Handling & Troubleshooting
Common Issues
- Git repository not found: Falls back to directory traversal with standard exclusions
- Binary file detection: Uses both extension-based and content-based binary detection
- Token limit exceeded: Implements intelligent batching and provides clear error messages
- Network timeouts: Automatic retry with exponential backoff for transient failures
Error Categories
- Validation Errors: Invalid session ID, missing task description, or invalid project directory
- Workflow Errors: Stage-specific failures with detailed context and retry suggestions
- Billing Errors: Insufficient credits or payment failures with actionable guidance
- System Errors: File system access, git command failures, or memory constraints
Debugging Tools
The workflow provides comprehensive logging, performance metrics export, and detailed error context including stage information, retry attempts, and intermediate data for troubleshooting.
Workflow State Management
State Transitions
The workflow progresses through clearly defined states: Created → Running → Paused (optional) → Completed/Failed/Canceled. Each state transition publishes events that can be monitored for real-time updates.
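The state machine could be sketched as follows. The allowed-transition table is inferred from the description above and may differ from the actual implementation:

```typescript
// Sketch of the workflow state machine. The transition table is inferred
// from the Created → Running → Paused → Completed/Failed/Canceled flow
// and is an assumption, not the confirmed implementation.
type WorkflowState = "Created" | "Running" | "Paused" | "Completed" | "Failed" | "Canceled";

const ALLOWED: Record<WorkflowState, WorkflowState[]> = {
  Created: ["Running", "Canceled"],
  Running: ["Paused", "Completed", "Failed", "Canceled"],
  Paused: ["Running", "Canceled"],
  Completed: [], // terminal
  Failed: [],    // terminal
  Canceled: [],  // terminal
};

function canTransition(from: WorkflowState, to: WorkflowState): boolean {
  return ALLOWED[from].includes(to);
}
```

Guarding every transition through a table like this makes invalid moves (e.g. restarting a completed workflow) fail fast instead of corrupting state.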
Intermediate Data Storage
Each stage stores its output in a structured intermediate data format, including directory tree content, regex patterns, filtered file lists, and path correction results. This data is accessible for debugging and can be used to resume workflows from specific stages.
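A hypothetical shape for this accumulated record, with field names derived from the stage outputs listed above (the actual schema is not confirmed):

```typescript
// Hypothetical shape of per-stage intermediate data; field names are
// illustrative, derived from the stage outputs described in the text.
interface IntermediateData {
  directoryTree?: string;
  regexPatterns?: string[];
  filteredFiles?: string[];
  correctedPaths?: string[];
}

// Each completing stage merges its output into the accumulated record,
// leaving earlier stages' fields intact.
function mergeStageOutput(
  accumulated: IntermediateData,
  stageOutput: Partial<IntermediateData>,
): IntermediateData {
  return { ...accumulated, ...stageOutput };
}
```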
Event-Driven Updates
The system publishes real-time events for workflow status changes, stage completions, and error conditions. These events enable responsive user interfaces and integration with external monitoring systems.