On October 4th, 2025, teams gathered at Vienna's Impact Hub for an ambitious challenge provided by AIM - AI Impact Mission: build an AI agent that could answer complex sustainability questions by pulling data from SQL databases, corporate PDF reports, and Wikipedia — all within a single day!
Our team, the Nightwalkers, took home first place with a sophisticated multi-tool agent. As you'll discover in this article, our victory wasn't about building the most features — it was about iterative refinement and knowing which battles to fight.
The Challenge
The hackathon (challenge code) presented a complex task. Questions ranged from simple database queries like:
"What were Austria's CO₂ emissions in 1750?"
to multi-step calculations requiring data from multiple sources:
"What percentage of Switzerland's 2023 total GHG emissions would the combined Scope 1 emissions of GSK, Swisscom, and Erste Bank represent?"
The agent needed to:
- Query SQL databases with climate data spanning centuries
- Extract specific data from corporate sustainability PDFs
- Search Wikipedia for recent climate events
- Perform accurate arithmetic (a notorious weak point for LLMs)
- Track sources for transparency
- Complete all answers within 15 minutes for the final evaluation
Example question format:
{
  "5": {
    "question": "What were the annual total greenhouse gas emissions including land use in tonnes for Austria in 2000?",
    "answer": 76664000.0,
    "answer_type": "float",
    "unit": "tonnes",
    "difficulty": "easy",
    "comment": "DB entry of 'total_ghg' for Austria with 'year' 2000.",
    "sources": [
      {
        "source_name": "owid_co2_data",
        "source_type": "database",
        "page_number": null
      }
    ]
  }
}
What made the challenge particularly brutal was that PDF-based questions formed a significant portion of the test set (roughly 15 of 40 questions), and we had just 9 hours to build a complete solution.
The Winning Strategy: Iterate, Test, Refine, Move On
Phase 1: Master the Database (hours 1–4)
Focus: Get database questions to 75% accuracy before moving on.
We started by building a single MCP server for SQL database access, connecting to Our World in Data's climate datasets:
- annual-co2-emissions-per-country.csv: GHG emissions per country
- owid-energy-data.csv: energy data per country
Building the tool was only 20% of the work; teaching Claude to use it correctly was the real challenge.
Our iterative loop:
- Implement the database MCP server
- Test against all public-set database questions
- Analyze every incorrect answer
- Improve system prompt with new rules
- Repeat until reaching 75% accuracy
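A loop like this can be driven by a small scoring harness. The following is a minimal sketch under stated assumptions, not the hackathon's official scorer: ask_agent is a hypothetical callable that stands in for the agent, the 1% relative tolerance is an assumed threshold, and each question is bucketed by its first listed source only.

```python
import math
from collections import defaultdict


def evaluate(questions, ask_agent, rel_tol=0.01):
    """Score an agent per source type and collect failures for review.

    questions: dict in the example question format (id -> question record).
    ask_agent: hypothetical callable, question text -> numeric answer.
    rel_tol:   assumed relative tolerance for counting an answer as correct.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    failures = []
    for qid, q in questions.items():
        # Assumption: bucket each question by its first listed source.
        source_type = q["sources"][0]["source_type"]
        totals[source_type] += 1
        try:
            got = float(ask_agent(q["question"]))
        except (TypeError, ValueError):
            got = None  # a non-numeric reply counts as a failure
        if got is not None and math.isclose(got, q["answer"], rel_tol=rel_tol):
            correct[source_type] += 1
        else:
            failures.append({"id": qid, "expected": q["answer"], "got": got})
    accuracy = {t: correct[t] / totals[t] for t in totals}
    return accuracy, failures
```

The failures list is the useful part: each entry is a candidate for a new prompt rule, and the per-source accuracy tells you when a threshold such as 75% has been reached.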
Each test run revealed failure patterns:
- "Agent queried co2 but needed total_ghg"
 → Rule: "total greenhouse gas emissions → use total_ghg column"
- "Got 76.664 instead of 76,664,000"
 → Rule: "co2, gas_co2, total_ghg are in MILLION tonnes (×1e6 in SQL)"
- "Used electricity_generation instead of electricity_demand"
 → Rule: "electricity demand → use electricity_demand, NOT generation"
After dozens of iterations and hundreds of prompt refinements, we hit 75% accuracy — only then did we move on.
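The unit rule is easiest to see on a toy table. The sketch below uses an in-memory SQLite database as a stand-in for the real SQL Server instance. The table layout (owid_co2_data with country, year and total_ghg columns) is assumed from the example question, and total_ghg_tonnes is a hypothetical helper, not code from the hackathon repository.

```python
import sqlite3

# Stand-in database: one row, with total_ghg stored in MILLION tonnes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE owid_co2_data (country TEXT, year INTEGER, total_ghg REAL)"
)
conn.execute("INSERT INTO owid_co2_data VALUES ('Austria', 2000, 76.664)")


def total_ghg_tonnes(conn, country, year):
    """Return total_ghg in tonnes, doing the x1e6 conversion inside the SQL."""
    row = conn.execute(
        "SELECT total_ghg * 1e6 FROM owid_co2_data "
        "WHERE country = ? AND year = ?",
        (country, year),
    ).fetchone()
    return None if row is None else row[0]


tonnes = total_ghg_tonnes(conn, "Austria", 2000)
print(f"{tonnes:,.0f}")  # 76,664,000
```

Doing the multiplication in the query means the agent never has to rescale a number itself, which is exactly where the 76.664 vs. 76,664,000 mistake came from.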
Phase 2: Add Wikipedia (hours 4–6)
With the database foundation solid, we built the Wikipedia MCP server using the same process:
- Build basic search and content retrieval
- Test on all Wikipedia questions
- Analyze failures (disambiguation issues, extraction errors)
- Add prompt rules: "search first to find exact article title"
- Test again
This iterative approach paid off: Wikipedia questions reached approximately 70% accuracy relatively quickly, benefiting from the robust reasoning framework built during Phase 1.
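The "search first to find exact article title" rule can be illustrated with a short sketch against Wikipedia's public MediaWiki Action API. It is an assumption that the API is called directly like this; a wrapper library would work just as well. The names http_get_json, find_title, get_intro and lookup are hypothetical, and the fetch parameter exists only so the logic can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def http_get_json(url):
    """Fetch a URL and decode the JSON body."""
    req = urllib.request.Request(
        url, headers={"User-Agent": "hackathon-agent-sketch/0.1"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def find_title(query, fetch=http_get_json):
    """Step 1: full-text search, returning the best-matching article title."""
    params = {"action": "query", "list": "search", "srsearch": query,
              "srlimit": 1, "format": "json"}
    data = fetch(API_URL + "?" + urllib.parse.urlencode(params))
    hits = data.get("query", {}).get("search", [])
    return hits[0]["title"] if hits else None


def get_intro(title, fetch=http_get_json):
    """Step 2: fetch the plain-text intro of the exact title."""
    params = {"action": "query", "prop": "extracts", "exintro": 1,
              "explaintext": 1, "redirects": 1, "titles": title,
              "format": "json"}
    data = fetch(API_URL + "?" + urllib.parse.urlencode(params))
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("extract")
    return None


def lookup(query, fetch=http_get_json):
    """Search first, then retrieve content for the resolved title."""
    title = find_title(query, fetch)
    if title is None:
        return None, None
    return title, get_intro(title, fetch)
```

Resolving the title before fetching content avoids guessing article names, which is where most disambiguation failures start.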
Phase 3: The PDF Headaches (hours 6–9)
This is where time became the enemy. We initially attempted to build our own RAG (Retrieval Augmented Generation) architecture:
- Parse PDFs with Docling
- Build a vector database for semantic search
- Create custom retrieval logic
After two hours of wrestling with PDF parsing edge cases, table extraction, and embedding generation, we made a critical decision: pivot to Ragie, a RAG-as-a-service platform:
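import os
import requests

@mcp.tool()
def retrieve_from_ragie(query: str, top_k: int = 8, rerank: bool = False):
    """Retrieve relevant chunks from Ragie's document index."""
    # Quick integration: rerank stays False by default (no time for reranking)
    result = requests.post(
        "https://api.ragie.ai/retrievals",
        # Assumption: the API key is read from a RAGIE_API_KEY environment variable
        headers={"Authorization": f"Bearer {os.environ['RAGIE_API_KEY']}"},
        json={"query": query, "top_k": top_k, "rerank": rerank},
        timeout=30,
    )
    result.raise_for_status()  # fail loudly on HTTP errors
    return result.json()
The rerank=False setting was necessary due to time constraints: reranking would likely have improved accuracy but added latency we couldn't afford during the 15-minute evaluation window.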
The issue is that there simply wasn't enough time to properly test and tune the Ragie integration. While database and Wikipedia questions had gone through multiple refinement cycles, PDF questions got maybe 2-3 test iterations before the final deadline.
The result? Very poor performance on PDF-related questions.
 
Architecture
Despite the PDF struggles, our architecture demonstrated sophisticated agentic design:
Custom Tools
We built five specialized MCP servers as custom tools for the agent to arrive at correct answers more reliably and efficiently:
- Database MCP server
 Connected directly to Azure SQL Server hosting Our World in Data's comprehensive climate datasets. After extensive testing, the agent learned to craft precise SQL queries, handling tricky details like distinguishing between hydro_electricity vs hydroelectricity.
- Calculator MCP server
 When we noticed Claude's arithmetic producing floating-point errors, we built custom tools to improve the agent's performance:
 - signed_difference(a, b) (for time-series changes where order matters)
 - abs_difference(a, b) (for magnitude comparisons)
 - ratio(numerator, denominator) (avoiding division-by-zero)
 - percentage_change(new, old) (properly handling growth calculations)
 - cagr(end, start, periods) (compound annual growth rate)
- Currency converter MCP server
 Fixed exchange rates for 30+ currencies, essential for questions spanning multiple companies and countries. Using fixed rates rather than live ones keeps the answers verifiable.
- Wikipedia MCP server
 For recent climate events not in historical databases:
 - Search for article titles
 - Retrieve summaries or full content
 - Extract facts like death tolls from heat waves or flood locations
- Ragie MCP server
 The last-minute addition for PDF documents. While not thoroughly tested, it provided semantic search over annual reports from Erste Bank, GSK, Swisscom, RWE, and Shell.
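To make the calculator server concrete, here is a minimal sketch of the five functions as plain Python. The function names and argument orders come from the list above; the bodies are assumptions, in particular raising ValueError on a zero denominator and returning percentage_change in percent rather than as a fraction. In an MCP server each function would additionally be registered with @mcp.tool(), as in the Ragie snippet.

```python
def signed_difference(a: float, b: float) -> float:
    """a - b, keeping the sign (order matters for time-series changes)."""
    return a - b


def abs_difference(a: float, b: float) -> float:
    """Magnitude of the difference, independent of argument order."""
    return abs(a - b)


def ratio(numerator: float, denominator: float) -> float:
    """numerator / denominator, with an explicit error instead of a crash."""
    if denominator == 0:
        raise ValueError("denominator must not be zero")
    return numerator / denominator


def percentage_change(new: float, old: float) -> float:
    """Growth from old to new, in percent (assumed unit)."""
    return ratio(new - old, old) * 100.0


def cagr(end: float, start: float, periods: float) -> float:
    """Compound annual growth rate as a fraction, e.g. 0.1 for 10%."""
    if periods <= 0:
        raise ValueError("periods must be positive")
    if start <= 0 or end < 0:
        raise ValueError("start must be positive and end non-negative")
    return (end / start) ** (1.0 / periods) - 1.0
```

Keeping each operation as its own narrowly scoped tool means the agent chooses an operation by name instead of composing an expression, which makes wrong-order and wrong-formula mistakes easier to spot in the tool-call log.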
The Brain: A Battle-Tested System Prompt
The 5,000-token system prompt wasn't written upfront; it evolved through dozens of test-analyze-improve cycles:
Chain-of-thought reasoning process:
- understand the question
- check constraints (answer_type, unit, comment)
- check comment for ORDER (critical for differences)
- plan your approach
- verify column names if unsure
- execute carefully
- return ONLY the final numeric answer
Critical rules (each born from a real failure):
- Unit conversions: "co2, gas_co2, total_ghg are in MILLION tonnes (multiply by 1e6 IN THE SQL QUERY if question asks for tonnes)"
- Calculator tool usage: "NEVER pass string expressions like '76.664 * 1e6' (do multiplication in SQL first)"
- Difference calculations: "ALWAYS check the comment for explicit calculation order! If comment says 'A - B', use signed_difference(A, B) with EXACT order"
- Erste Bank priority: "ANY question mentioning 'Erste Bank' MUST use retrieve_from_ragie (do NOT fall back to Wikipedia)"
Every rule existed because testing revealed a specific failure mode.
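Prompt rules like the calculator one can also be backed by a check on the tool side. The following is a hedged sketch of that idea, not something the solution is known to contain: require_number and strict_signed_difference are hypothetical names, and the error message is written so that the model can correct itself on the next tool call.

```python
from numbers import Real


def require_number(name, value):
    """Reject anything that is not a plain number, e.g. '76.664 * 1e6'."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, Real):
        raise TypeError(
            f"{name} must be a plain number, got {value!r}; "
            "do the multiplication in the SQL query first"
        )
    return float(value)


def strict_signed_difference(a, b):
    """a - b in EXACT order, refusing string expressions as arguments."""
    return require_number("a", a) - require_number("b", b)
```

With such a guard, a call like strict_signed_difference("76.664 * 1e6", 5) fails immediately with an actionable message instead of silently producing a wrong answer.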
Results
We won the hackathon, but our performance breakdown tells an interesting story:
- Database questions: 75%+ accuracy
- Wikipedia questions: approximately 70% accuracy
- PDF questions: <20% accuracy
The solid foundation on database and Wikipedia questions is what allowed us to win, definitely not the PDF retrieval strategy.
Key Takeaways for Hackathon Success
- Iterate ruthlessly on one thing before moving on
 We didn't touch Wikipedia until we hit 75% on database questions. This discipline prevented half-baked implementations of everything.
- Testing is development
 We spent as much time testing and analyzing failures as writing code. Each incorrect answer became a new prompt rule or architectural decision.
- Know when to pivot
 Building a custom RAG solution would have been technically satisfying, but integrating Ragie in 30 minutes beat spending 3 hours on a half-working custom solution.
- Perfect the basics
 Database and Wikipedia questions had simpler data sources. Mastering these created the conditions for winning the competition.
- Don't make LLMs do math
 Even though Claude has gotten much better at arithmetic, it is still not perfect, as became clear during testing. We built an MCP server that performs the calculations in Python to fix this.
What would we do differently?
In retrospect, we identified two changes for a hypothetical second attempt:
- Prototype PDFs earlier
 Even 30 extra minutes to evaluate Ragie vs. custom RAG would have saved hours of building the wrong thing.
- Set time boxes strictly
 Database testing could have stopped at 70% accuracy, freeing an extra hour for PDF refinement.
Tech Stack
- Claude Sonnet 4.5
- Python
- FastMCP
- SQL Server
- Ragie
- Wikipedia API
Our full solution is available on GitHub. The modular architecture means you can run individual MCP servers independently.
Final Thoughts
This event was a blast and very well organized by AIM! We had so much fun and couldn't stop laughing at certain moments. Before seeing the leaderboard we were actually quite pessimistic because we saw that the agent was struggling a lot with some questions. For this reason the victory was a big surprise.
On a final note, this challenge was also incredibly relevant for the team because the three members are building two sustainability SaaS startups (Scope4 and Susteam), so we were perfectly in tune with the topic. Feel free to say hi to us on LinkedIn.
Congratulations also to the other teams who did an excellent job and contributed to a great atmosphere!