PID File Tracking for Robust Process Management
Problem Statement
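When the Toes host process dies unexpectedly (OOM kill, SIGKILL) while the machine keeps running, child app processes continue running as orphans. (Power loss or a kernel panic stops the children too; under the design below those cases only leave stale PID files.) On restart, Toes has no knowledge of the orphaned processes: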
- Port conflicts: Orphans hold ports, new instances fail to bind
- Resource waste: Orphaned processes keep consuming memory and CPU
- State confusion: App appears "stopped" but is actually running
- Data corruption: Multiple instances may write to same files
Currently, Toes only handles graceful shutdown (SIGTERM/SIGINT). There's no recovery mechanism for ungraceful termination.
Proposed Solution: PID File Tracking
Design
Store PID files in TOES_DIR/pids/:
${TOES_DIR}/pids/
  clock.pid    # Contains: 12345
  todo.pid     # Contains: 12389
  weather.pid  # Contains: 12402
Lifecycle
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ App Start │────▶│ Write PID │────▶│ Running │
└─────────────┘ └─────────────┘ └─────────────┘
│
┌─────────────┐ │
│ Delete PID │◀──────────┘
└─────────────┘ App Exit
On host startup:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Host Init │────▶│ Scan PIDs │────▶│Kill Orphans │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────┐
│Clean Stale │
│ PID Files │
└─────────────┘
Implementation
1. PID Directory Setup
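import { existsSync, mkdirSync, readdirSync, readFileSync, unlinkSync, writeFileSync } from 'fs'
import { join } from 'path'
const PIDS_DIR = join(TOES_DIR, 'pids')
function ensurePidsDir() {
  // recursive: true makes this a no-op when the directory already exists
  mkdirSync(PIDS_DIR, { recursive: true })
}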
2. Write PID on Start
In runApp(), after spawning:
const proc = Bun.spawn(['bun', 'run', 'toes'], { ... })
app.proc = proc
// Write PID file
const pidFile = join(PIDS_DIR, `${dir}.pid`)
writeFileSync(pidFile, String(proc.pid))
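One gap in a direct writeFileSync call is that a host crash in the middle of the write could leave an empty or truncated PID file. The sketch below shows one possible atomic variant: write to a temp file, then rename it into place. The helper name writePidFileAtomic and the temp-file naming scheme are assumptions for illustration, not part of the existing code, and the demo uses a throwaway directory in place of TOES_DIR/pids.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: writes <pidsDir>/<appName>.pid via a temp file and rename.
function writePidFileAtomic(pidsDir: string, appName: string, pid: number): string {
  mkdirSync(pidsDir, { recursive: true });
  const finalPath = join(pidsDir, `${appName}.pid`);
  // The temp name does not end in ".pid", so a startup scan that filters on
  // that suffix skips any leftover temp file.
  const tmpPath = `${finalPath}.${process.pid}.tmp`;
  writeFileSync(tmpPath, String(pid));
  // rename() within one directory replaces the target atomically on POSIX.
  renameSync(tmpPath, finalPath);
  return finalPath;
}

// Usage with a throwaway directory standing in for TOES_DIR/pids
const demoPidsDir = mkdtempSync(join(tmpdir(), "toes-pids-"));
const written = writePidFileAtomic(demoPidsDir, "clock", 12345);
console.log(readFileSync(written, "utf-8")); // prints "12345"
```

Readers of the PID file then see either the previous content or the complete new content, never a partial write.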
3. Delete PID on Exit
In the proc.exited.then() handler:
proc.exited.then(code => {
  // Remove PID file
  const pidFile = join(PIDS_DIR, `${dir}.pid`)
  if (existsSync(pidFile)) {
    unlinkSync(pidFile)
  }
  // ... existing cleanup
})
4. Orphan Cleanup on Startup
New function called during initApps():
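function cleanupOrphanProcesses() {
  ensurePidsDir()
  for (const file of readdirSync(PIDS_DIR)) {
    if (!file.endsWith('.pid')) continue
    // Strip only the trailing extension (replace() would hit the first '.pid')
    const appName = file.slice(0, -'.pid'.length)
    const pidFile = join(PIDS_DIR, file)
    const pid = parseInt(readFileSync(pidFile, 'utf-8').trim(), 10)
    // Reject NaN, 0 and negatives: kill() treats pid <= 0 as a process group
    if (!Number.isInteger(pid) || pid <= 0) {
      // Invalid PID file, remove it
      unlinkSync(pidFile)
      hostLog(`Removed invalid PID file: ${file}`)
      continue
    }
    if (isProcessRunning(pid)) {
      // Orphan found - kill it
      hostLog(`Found orphan process for ${appName} (PID ${pid}), terminating...`)
      try {
        process.kill(pid, 'SIGTERM')
      } catch (e) {
        // Process may have exited between check and kill, or we lack permission
        hostLog(`Failed to kill orphan ${appName}: ${e}`)
      }
      // Give it 5 seconds, then SIGKILL.
      // Note: this escalation is asynchronous, so startup continues before
      // the orphan is confirmed dead and it may still hold its port briefly.
      setTimeout(() => {
        try {
          if (isProcessRunning(pid)) {
            hostLog(`Orphan ${appName} (PID ${pid}) didn't terminate, sending SIGKILL`)
            process.kill(pid, 'SIGKILL')
          }
        } catch (e) {
          hostLog(`Failed to SIGKILL orphan ${appName}: ${e}`)
        }
      }, 5000)
    }
    // Remove the PID file now so a fresh start of the app can write its own
    unlinkSync(pidFile)
  }
}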
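function isProcessRunning(pid: number): boolean {
  try {
    // Sending signal 0 checks if process exists without killing it
    process.kill(pid, 0)
    return true
  } catch (e) {
    // EPERM means the process exists but we may not signal it;
    // any other error (typically ESRCH) means it is gone
    return (e as { code?: string }).code === 'EPERM'
  }
}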
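A PID read from disk arguably deserves stricter validation than a NaN check. On POSIX systems kill() treats 0 and negative values as process-group targets, and parseInt accepts trailing junk such as "12abc". The following is a minimal sketch of a stricter parser; the name parsePid is hypothetical and not part of the existing code.

```typescript
// Hypothetical helper: returns a usable PID, or null if the text is not one.
function parsePid(text: string): number | null {
  const trimmed = text.trim();
  // Digits only: parseInt would accept "12abc" as 12, and "-1" as -1.
  if (!/^[0-9]+$/.test(trimmed)) return null;
  const pid = Number(trimmed);
  // Reject 0 and digit strings too long to be a real PID.
  if (!Number.isSafeInteger(pid) || pid <= 0) return null;
  return pid;
}

console.log(parsePid("12345\n")); // 12345
console.log(parsePid("12abc"));   // null
console.log(parsePid("0"));       // null
console.log(parsePid("-1"));      // null
```

A cleanup loop could call this in place of parseInt and treat a null result the same way as an invalid PID file.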
5. Integration Point
Update initApps():
export function initApps() {
  initPortPool()
  setupShutdownHandlers()
  cleanupOrphanProcesses() // <-- Add here, before discovery
  rotateLogs()
  createAppSymlinks()
  discoverApps()
  runApps()
}
Edge Cases
| Scenario | Handling |
|---|---|
| PID reused by OS | Check if process command matches expected pattern before killing |
| PID file corrupted | Delete invalid files, log warning |
| Multiple Toes instances | Use file locking or instance ID in PID path |
| App renamed while running | Old PID file orphaned; cleanup handles it |
| Permission denied on kill | Log error, continue with other orphans |
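For the "Multiple Toes instances" row, one simple form of file locking is an exclusively created lock file. The sketch below assumes a hypothetical lock file named host.lock and hypothetical helpers acquireHostLock and releaseHostLock; none of these exist in the current code.

```typescript
import { closeSync, mkdtempSync, openSync, unlinkSync, writeSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: tries to take <dir>/host.lock. Returns true on success,
// false if another instance already holds it.
function acquireHostLock(dir: string): boolean {
  const lockPath = join(dir, "host.lock");
  try {
    // "wx" fails with EEXIST if the file already exists, so creation is exclusive.
    const fd = openSync(lockPath, "wx");
    writeSync(fd, String(process.pid));
    closeSync(fd);
    return true;
  } catch (e) {
    if ((e as { code?: string }).code === "EEXIST") return false;
    throw e;
  }
}

// Hypothetical helper: releases the lock on graceful shutdown.
function releaseHostLock(dir: string): void {
  unlinkSync(join(dir, "host.lock"));
}

// Usage with a throwaway directory standing in for TOES_DIR
const demoLockDir = mkdtempSync(join(tmpdir(), "toes-lock-"));
console.log(acquireHostLock(demoLockDir)); // true: first instance wins
console.log(acquireHostLock(demoLockDir)); // false: lock already held
releaseHostLock(demoLockDir);
```

A host that crashes leaves its lock file behind, so a real implementation would likely need to run the same liveness check on the PID stored in the lock before refusing to start.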
Enhanced: Validate Process Identity
To avoid killing an unrelated process that reused the PID:
function isOurProcess(pid: number, appName: string): boolean {
  try {
    // ps works on both macOS and Linux (/proc is Linux-only)
    const result = Bun.spawnSync(['ps', '-p', String(pid), '-o', 'args='])
    const cmd = new TextDecoder().decode(result.stdout).trim()
    // Check if it looks like a Toes app process.
    // Note: appName is not used yet, so this matches any Toes app.
    return cmd.includes('bun') && cmd.includes('toes')
  } catch {
    return false
  }
}
Related Recommendations
1. Store Port in PID File
Extend PID files to include port for faster recovery:
# clock.pid
12345
3001
Or use JSON:
{"pid": 12345, "port": 3001, "started": 1706900000000}
This allows Toes to reclaim the exact port on restart, avoiding port shuffling.
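If the JSON form is adopted, the reader should tolerate both the new records and the plain-integer files written by earlier versions. The sketch below is one way to do that; the PidRecord type and the parsePidRecord name are assumptions based on the JSON example above. It does not handle the two-line "pid then port" variant.

```typescript
// Assumed record shape, taken from the JSON example above.
interface PidRecord {
  pid: number;
  port?: number;
  started?: number;
}

// Hypothetical parser: accepts the JSON form or a bare integer PID.
// Returns null for anything corrupted or out of range.
function parsePidRecord(text: string): PidRecord | null {
  const trimmed = text.trim();
  // Legacy format: the file holds just the PID.
  if (/^[0-9]+$/.test(trimmed)) {
    const legacyPid = Number(trimmed);
    return Number.isSafeInteger(legacyPid) && legacyPid > 0 ? { pid: legacyPid } : null;
  }
  let raw: unknown;
  try {
    raw = JSON.parse(trimmed);
  } catch {
    return null; // corrupted or truncated file
  }
  if (typeof raw !== "object" || raw === null) return null;
  const { pid, port, started } = raw as Record<string, unknown>;
  if (typeof pid !== "number" || !Number.isSafeInteger(pid) || pid <= 0) return null;
  const record: PidRecord = { pid };
  // Optional fields are kept only when they look sane.
  if (typeof port === "number" && Number.isInteger(port) && port > 0 && port < 65536) {
    record.port = port;
  }
  if (typeof started === "number" && Number.isFinite(started)) {
    record.started = started;
  }
  return record;
}

console.log(parsePidRecord('{"pid": 12345, "port": 3001, "started": 1706900000000}'));
console.log(parsePidRecord("12345"));     // legacy file: only pid is set
console.log(parsePidRecord("{not json")); // null
```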
2. Circuit Breaker for Crash Loops
Add crash tracking to prevent infinite restart loops:
interface CrashRecord {
  timestamp: number
  exitCode: number
}
// Store in TOES_DIR/crashes/<app>.json
const CRASH_WINDOW = 3600000 // 1 hour
const MAX_CRASHES = 10
// Records a crash and returns the number of crashes inside the window
function recordCrash(appName: string, exitCode: number): number {
  const file = join(TOES_DIR, 'crashes', `${appName}.json`)
  // Make sure the crashes directory exists before the first write
  mkdirSync(join(TOES_DIR, 'crashes'), { recursive: true })
  const crashes: CrashRecord[] = existsSync(file)
    ? JSON.parse(readFileSync(file, 'utf-8'))
    : []
  // Add new crash
  crashes.push({ timestamp: Date.now(), exitCode })
  // Prune old crashes
  const cutoff = Date.now() - CRASH_WINDOW
  const recent = crashes.filter(c => c.timestamp > cutoff)
  writeFileSync(file, JSON.stringify(recent))
  return recent.length
}
function shouldCircuitBreak(appName: string): boolean {
  const file = join(TOES_DIR, 'crashes', `${appName}.json`)
  if (!existsSync(file)) return false
  const crashes: CrashRecord[] = JSON.parse(readFileSync(file, 'utf-8'))
  const cutoff = Date.now() - CRASH_WINDOW
  const recent = crashes.filter(c => c.timestamp > cutoff)
  return recent.length >= MAX_CRASHES
}
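The window logic appears in both functions and depends on Date.now(), which makes it awkward to test. A possible refactor, sketched here with the hypothetical names recentCrashes and isCircuitOpen, pulls it into pure functions that take the current time as a parameter.

```typescript
interface CrashRecord {
  timestamp: number;
  exitCode: number;
}

const CRASH_WINDOW = 3_600_000; // 1 hour, same value as above
const MAX_CRASHES = 10;

// Keeps only the crashes that fall inside the window ending at `now`.
function recentCrashes(crashes: CrashRecord[], now: number): CrashRecord[] {
  const cutoff = now - CRASH_WINDOW;
  return crashes.filter(c => c.timestamp > cutoff);
}

// True when the app crashed MAX_CRASHES times or more within the window.
function isCircuitOpen(crashes: CrashRecord[], now: number): boolean {
  return recentCrashes(crashes, now).length >= MAX_CRASHES;
}

// Ten crashes one minute apart trip the breaker...
const demoNow = 1_706_900_000_000;
const burst: CrashRecord[] = Array.from({ length: 10 }, (_, i) => ({
  timestamp: demoNow - i * 60_000,
  exitCode: 1,
}));
console.log(isCircuitOpen(burst, demoNow)); // true
// ...but the same ten crashes viewed two hours later do not.
console.log(isCircuitOpen(burst, demoNow + 2 * CRASH_WINDOW)); // false
```

The file-backed functions could then read the JSON, call these helpers with Date.now(), and write the pruned list back.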
3. Track Restart Timer for Cancellation
Store scheduled restart timers on the app object:
export type App = SharedApp & {
  // ... existing fields
  restartTimer?: Timer // <-- Add this
}
Update scheduleRestart():
function scheduleRestart(app: App, dir: string) {
  // Cancel any existing scheduled restart
  if (app.restartTimer) {
    clearTimeout(app.restartTimer)
  }
  // ... existing delay calculation ...
  app.restartTimer = setTimeout(() => {
    app.restartTimer = undefined
    // ... existing restart logic
  }, delay)
}
Update clearTimers():
const clearTimers = (app: App) => {
  // ... existing timer cleanup ...
  if (app.restartTimer) {
    clearTimeout(app.restartTimer)
    app.restartTimer = undefined
  }
}
4. Exit Code Classification
function classifyExit(code: number | null): 'restart' | 'invalid' | 'stop' {
  if (code === null) return 'restart' // Killed by signal
  if (code === 0) return 'stop' // Clean exit
  if (code === 2) return 'invalid' // Bad arguments/config
  if (code >= 128) {
    // Killed by signal (128 + signal number)
    const signal = code - 128
    if (signal === 9) return 'restart' // SIGKILL (OOM?)
    if (signal === 15) return 'stop' // SIGTERM (intentional)
  }
  return 'restart' // Default: try again
}
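When logging a classification, a signal name is usually easier to read than a raw number. The sketch below maps an exit code to a name, following the same 128 + signal number convention the function above relies on. The helper name signalFromExitCode is hypothetical; the lookup table comes from the standard os module.

```typescript
import { constants } from "node:os";

// Hypothetical helper: maps a shell-style exit code (128 + N) to a signal name,
// or returns null when the code does not encode a known signal.
function signalFromExitCode(code: number): string | null {
  if (code <= 128) return null;
  const signalNumber = code - 128;
  // os.constants.signals maps names such as "SIGKILL" to their numbers.
  for (const [name, value] of Object.entries(constants.signals)) {
    if (value === signalNumber) return name;
  }
  return null;
}

console.log(signalFromExitCode(137)); // "SIGKILL" (128 + 9)
console.log(signalFromExitCode(143)); // "SIGTERM" (128 + 15)
console.log(signalFromExitCode(1));   // null: an ordinary error exit
```

Some signal numbers have more than one name on a given platform, in which case this returns whichever name is listed first.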
5. Install Timeout
Wrap bun install with a timeout:
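async function installWithTimeout(cwd: string, timeout = 60000): Promise<boolean> {
  const install = Bun.spawn(['bun', 'install'], {
    cwd,
    stdout: 'pipe',
    stderr: 'pipe'
  })
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeoutPromise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      install.kill()
      reject(new Error('Install timeout'))
    }, timeout)
  })
  try {
    await Promise.race([install.exited, timeoutPromise])
    return install.exitCode === 0
  } catch (e) {
    // Timed out (or the race rejected): treat as a failed install
    return false
  } finally {
    // Clear the timer so it cannot fire after install has already finished
    clearTimeout(timer)
  }
}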
Implementation Priority
| Change | Effort | Impact | Priority |
|---|---|---|---|
| PID file tracking | Medium | High | 1 |
| Orphan cleanup on startup | Medium | High | 1 |
| Track restart timer | Low | Medium | 2 |
| Install timeout | Low | Medium | 2 |
| Circuit breaker | Medium | Medium | 3 |
| Exit code classification | Low | Low | 4 |
| Process identity validation | Medium | Low | 5 |
Testing Checklist
- Host crashes while apps running → orphans cleaned on restart
- App crashes → PID file removed, restart scheduled
- App stopped manually → PID file removed, no restart
- Stale PID file (process gone) → file cleaned up
- PID reused by unrelated process → not killed (with identity check)
- Multiple rapid restarts → circuit breaker triggers
- Rename app while running → handled gracefully
- bun install hangs → times out, app marked failed