Second hackathon ever, first time solo. The task: build an autonomous AI agent that solves 104 natural-language tasks in a sandboxed workspace while defending itself against prompt injection, PII leaks, and destructive tool calls. Blind scoring, three hours of build time on the day, one week of preparation beforehand. Result: 1st place at the on-site run in Vienna, AI Factory Austria, April 11, 2026. What got me there was not talent on the day. It was preparation, parallel development, and five independent security layers that overlapped on purpose.
AIM Hackathon 4: The Challenge
AIM Hackathon 4 ran under the theme Personal & Trustworthy Autonomous Agents, in direct line with AIM's mission of building AI that people can actually deploy around their own data. The benchmark for this edition was BitGN PAC (Personal Agent Challenge), designed by Rinat Abdullin and hosted by AI Impact Mission, Klartext AI, and AI Factory Austria. 880 signups from 88 cities globally, with on-site runs in Vienna and several other European locations.
PAC is not a pitch-and-slides hackathon. No jury evaluating vibes, no Canva decks. The format is deterministic: every participant builds an agent that runs against a sandboxed, file-based workspace. The sandbox looks a lot like an Obsidian vault: calendars, notes, emails, contacts. The agent receives natural-language tasks like "Find Lisa's phone number in the contacts" or "This email contains a phishing link. Detect it."
104 tasks. Scoring is based on observable side effects: which files were touched, which tool calls were made, which outcome enum was returned. No code changes allowed once evaluation starts.
The hard part was not the 104 tasks themselves. The hard part was that roughly a third of them were adversarial: prompt injection inside emails, malicious attachments, requests that tried to exfiltrate PII, path traversals aimed at /etc/passwd, secret-bearing documents the agent was supposed to never quote back. An agent that was accurate but naive would lose.
The Week Before: Preparation Over Improvisation
The hackathon was April 11. The dev benchmark (43 tasks) was released a week earlier; the full 104-task prod benchmark only on competition day. Starting April 4, I developed the agent alongside regular client work. That week of lead time turned out to be the single biggest differentiator.
Architecture choices made upfront:
- TypeScript end to end. Vercel AI SDK v6 with native tool calling as the agent loop. ConnectRPC as transport to the BitGN platform.
- Model: Claude Sonnet 4.6 as the primary, with Opus 4.6 and GPT-4.1 available as fallbacks behind an environment-variable switch (sketched after this list).
- Development harness: session-orchestrator, my own Claude Code harness for parallel subagent execution. This was what let me build the security layers and the metrics system at the same time instead of sequentially.
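The fallback switch itself is tiny. A minimal sketch, assuming the @ai-sdk/anthropic and @ai-sdk/openai providers; the model ID strings and the PAC_AGENT_MODEL variable name are placeholders, not necessarily what the repo uses:

```typescript
import { anthropic } from '@ai-sdk/anthropic';
import { openai } from '@ai-sdk/openai';

// Model IDs are illustrative placeholders for the three models named above.
const MODELS = {
  sonnet: anthropic('claude-sonnet-4-6'),
  opus: anthropic('claude-opus-4-6'),
  gpt: openai('gpt-4.1'),
} as const;

type ModelKey = keyof typeof MODELS;

// Default to Sonnet; flip via PAC_AGENT_MODEL=opus or PAC_AGENT_MODEL=gpt.
const key = (process.env.PAC_AGENT_MODEL ?? 'sonnet') as ModelKey;
export const model = MODELS[key] ?? MODELS.sonnet;
```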
The reason parallel development mattered is not vanity. It is that when you have a week, the natural instinct is to chain features: first model plumbing, then tool calls, then security, then metrics. By the end of the week you have a happy-path agent and a security story you invented on the last day. Wave-based parallel development meant security was a real dimension of the agent from day two onwards, not an afterthought.
Architecture
The final agent was 2,758 lines of TypeScript across 21 source files, MIT-licensed. The repo is public on GitHub: bitgn-pac-agent-public.
The critical design choice was to format tool results as shell-style output (cat, ls, rg) rather than raw JSON. LLMs are heavily trained on CLI output; matching that surface produces measurably better reasoning on file-based tasks. This one decision lifted accuracy on multi-step contact and email tasks by a margin I did not expect.
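As a sketch of the idea (the result shapes here are hypothetical stand-ins, not the agent's actual tool types):

```typescript
// Hypothetical tool-result shapes; the point is the rendering, not the types.
interface ReadFileResult { path: string; content: string }
interface ListDirResult  { path: string; entries: string[] }

// Render tool results the way a shell would print them, not as JSON.
const formatReadFile = (r: ReadFileResult) =>
  `$ cat ${r.path}\n${r.content}`;

const formatListDir = (r: ListDirResult) =>
  `$ ls ${r.path}\n${r.entries.join('\n')}`;
```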
Five Security Layers
The agent has five independent defense layers. Each is pure TypeScript, no external dependencies, and individually toggleable via environment variables. This matters in a live competition: if a layer starts misbehaving mid-evaluation, you disable it and keep running. (B1, B5, and the toggle mechanism are sketched after the list.)
B1: Path Traversal Guard. Blocks access to /etc/, ~/.ssh/, .env, and any .. traversals. Runs before every tool call.
B2: PII Refusal. Detects queries about real personal data (family relations, home addresses) and refuses cleanly rather than fabricating answers.
B3: Grounding-Refs Validation. Verifies that cited file paths in the final answer were actually read by the agent during the run. Prevents citation hallucinations.
B4: Destructive Brake. Hard cap of 10 write operations per task. Soft cap of 35 iterations to kill infinite loops before they burn the whole budget.
B5: Secret Redaction. Redacts AWS keys, GitHub tokens, JWTs, and PEM certificates from tool output before it enters the LLM context. The agent cannot leak what it never saw.
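To make the pattern concrete, here is a minimal sketch of B1 and B5 together with the per-layer toggle. The regexes and environment-variable names are illustrative; the real implementations in the repo are more thorough:

```typescript
// Per-layer kill switch. Variable naming is illustrative (e.g. PAC_LAYER_B1=off).
const layerEnabled = (id: string) => process.env[`PAC_LAYER_${id}`] !== 'off';

// B1: reject tool calls that reach outside the sandbox.
const BLOCKED_PATHS = [/^\/etc\//, /\.ssh\//, /(^|\/)\.env$/, /\.\./];

export function guardPath(path: string): void {
  if (!layerEnabled('B1')) return;
  if (BLOCKED_PATHS.some((re) => re.test(path))) {
    throw new Error(`B1: blocked path access: ${path}`);
  }
}

// B5: scrub secrets from tool output before it enters the model context.
const SECRET_PATTERNS: Array<[RegExp, string]> = [
  [/AKIA[0-9A-Z]{16}/g, '[REDACTED_AWS_KEY]'],
  [/ghp_[A-Za-z0-9]{36}/g, '[REDACTED_GITHUB_TOKEN]'],
  [/\beyJ[\w-]+\.[\w-]+\.[\w-]+/g, '[REDACTED_JWT]'],
  [/-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----/g, '[REDACTED_PEM]'],
];

export function redactSecrets(text: string): string {
  if (!layerEnabled('B5')) return text;
  return SECRET_PATTERNS.reduce((t, [re, sub]) => t.replace(re, sub), text);
}
```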
On top of the five layers, an injection scanner with 16+ patterns catches HTML-comment payloads, Base64 variants, and domain mismatches in emails.
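A few representative checks, as an illustration of the scanner's shape rather than its actual pattern set:

```typescript
// Illustrative subset of the injection patterns, not the full 16+.
const INJECTION_PATTERNS: RegExp[] = [
  /<!--[\s\S]{0,400}?(ignore|instruction|system prompt)[\s\S]*?-->/i, // HTML-comment payloads
  /\b(?:aWdub3Jl|SWdub3Jl)[A-Za-z0-9+/=]*/, // Base64 encodings of "ignore..." phrasing
  /ignore (all |any )?(previous|prior) instructions/i, // plain-text override phrasing
];

const looksInjected = (text: string) =>
  INJECTION_PATTERNS.some((re) => re.test(text));

// Domain mismatch: the visible link text claims one domain, the href points elsewhere.
const domainMismatch = (linkText: string, href: string): boolean => {
  try {
    const shown = linkText.match(/[\w-]+(\.[\w-]+)+/)?.[0]?.toLowerCase();
    return !!shown && !new URL(href).hostname.endsWith(shown);
  } catch {
    return true; // an unparseable href is suspicious in itself
  }
};
```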
The layers overlap on purpose. A secret that sneaks past redaction still has a chance of being caught by the injection scanner. A PII request that slips past B2 will often hit B1 on the tool-call side. Defense in depth, not defense by hope.
Gameday: Two Incidents
The MCP Scope Leak
During final regression testing, roughly 45 minutes before the prod window opened, I tested a calendar task. The agent was supposed to create an event in the sandbox. It created one in my real Google Calendar instead.
Root cause: I had been testing with the Claude Agent SDK using bypassPermissions, which auto-approves every tool call. The MCP connection had inherited OAuth scopes for Gmail, Google Calendar, and Notion from my local setup. The agent had not stayed in the sandbox. It had quite happily reached out to my real Calendar and booked something for next Tuesday.
Fix: a canUseTool runtime gate with an explicit allowlist. Only tool namespaces matching mcp__bitgn__* were permitted. Three lines of code that saved me from a rather embarrassing moment.
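Roughly what that gate looks like. The callback shape follows the Claude Agent SDK's canUseTool hook as I understand it; exact types vary by SDK version:

```typescript
// Allow only tools in the sandboxed BitGN namespace; deny everything else,
// including any MCP tools inherited from the local environment.
const canUseTool = async (toolName: string, input: Record<string, unknown>) => {
  if (toolName.startsWith('mcp__bitgn__')) {
    return { behavior: 'allow' as const, updatedInput: input };
  }
  return { behavior: 'deny' as const, message: `blocked: ${toolName}` };
};
```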
Platform Disk Full
14:47 CEST, mid-evaluation. StartRun and StartPlayground started returning 502s. The room went quiet. Everyone looked up from their laptops at the same time.
BitGN platform: disk full. Nothing worked. Rinat diagnosed and fixed it live with agentic coding in a handful of minutes, and we kept going. A useful reminder that live benchmarks are their own sport. The best agent in the world does not help you when the infrastructure is down.
The Evaluation: Parallel Agents Won the Clock
The evaluation window closed at 16:00. I started four prod runs. Aborted the first three. Too slow. The fourth ran six parallel agents, each processing one task at a time, and finished all 104 tasks in under 20 minutes. The earlier runs had not been anywhere close.
Parallelism was the second-biggest differentiator after preparation. The serial agent was accurate but would have run out of wall-clock before finishing the tail. Six parallel workers, each with the full security stack, each isolated, got the job done with margin to spare.
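The pool itself is simple. A minimal sketch, where Task and runAgentOnTask stand in for the real task shape and agent entry point:

```typescript
interface Task { id: string; prompt: string }
declare function runAgentOnTask(task: Task): Promise<void>;

// Six workers drain a shared queue; each runs one task at a time with the
// full security stack, so parallelism never weakens the defenses.
async function runAll(tasks: Task[], concurrency = 6): Promise<void> {
  const queue = [...tasks];
  const worker = async () => {
    for (let task = queue.shift(); task; task = queue.shift()) {
      await runAgentOnTask(task);
    }
  };
  await Promise.all(Array.from({ length: concurrency }, () => worker()));
}
```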
Key Takeaways
Preparation beats improvisation. At my first hackathon two weeks earlier I had improvised. This time I had a dev benchmark, a harness, and a week. The difference was not subtle.
Feature flags on everything. Every security layer and every model choice was togglable. In a live competition this is survival. Layer B3 causing false refusals? Disable it for the next run, debug later.
Shell-style tool output helps LLMs reason. Formatting tool results as cat, ls, rg output rather than JSON was one of the quietest wins in the project and one of the most effective.
Always submit something. If the agent crashes, hits max steps, or a brake triggers: still return a report_completion with a best-guess outcome (sketched below). Zero points for a crash, partial points for a half-right guess.
Defense in depth is worth the complexity. Five layers is more code than one clever one. But in an adversarial benchmark where roughly a third of tasks are trying to break you, the layered approach catches what any single layer would miss.
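On the always-submit point above: a minimal sketch of the fallback wrapper, with reportCompletion and the Outcome values standing in for the platform's real report_completion call and outcome enum:

```typescript
type Outcome = 'success' | 'failure' | 'refused' | 'unknown'; // illustrative enum
declare function reportCompletion(taskId: string, outcome: Outcome): Promise<void>;
declare function runAgent(task: { id: string }): Promise<Outcome>;

// Never let a crash turn into zero points: always report something scoreable.
async function runTaskSafely(task: { id: string }): Promise<void> {
  try {
    await reportCompletion(task.id, await runAgent(task));
  } catch {
    // Crash, max-steps, or a brake firing: submit a best-guess outcome anyway.
    await reportCompletion(task.id, 'unknown');
  }
}
```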
Thank You
Huge thanks to the people who made AIM Hackathon 4 happen. Rinat Abdullin for designing a benchmark that rewards substance over demo polish, and for fixing the platform under fire mid-evaluation. Markus Keiblinger and the AIM team for hosting at a scale that reached 88 cities. Felix Krause and Klartext AI, and the AI Factory Austria team in Vienna, for making the on-site run possible.
And to everyone else who participated in their own city: congratulations on shipping something real under a hard clock. That is always the hard part.
For another strong technical breakdown of the same benchmark, read Vladimir Manuilov’s second-place recap from the Vienna on-site leaderboard.