On October 4th, 2025, teams gathered at Vienna's Impact Hub for an ambitious challenge provided by AIM - AI Impact Mission: build an AI agent that could answer complex sustainability questions by pulling data from SQL databases, corporate PDF reports, and Wikipedia — all within a single day!
Our team, the Nightwalkers, took home first place with a sophisticated multi-tool agent. As you'll discover in this article, our victory wasn't about building the most features — it was about iterative refinement and knowing which battles to fight.
The Challenge
The hackathon (challenge code) presented a complex task. Questions ranged from simple database queries like:
"What were Austria's CO₂ emissions in 1750?"
to multi-step calculations requiring data from multiple sources:
"What percentage of Switzerland's 2023 total GHG emissions would the combined Scope 1 emissions of GSK, Swisscom, and Erste Bank represent?"
The agent needed to:
- Query SQL databases with climate data spanning centuries
- Extract specific data from corporate sustainability PDFs
- Search Wikipedia for recent climate events
- Perform accurate arithmetic (a notorious weak point for LLMs)
- Track sources for transparency
- Complete all answers within 15 minutes for the final evaluation
Example question format:
{
  "5": {
    "question": "What were the annual total greenhouse gas emissions including land use in tonnes for Austria in 2000?",
    "answer": 76664000.0,
    "answer_type": "float",
    "unit": "tonnes",
    "difficulty": "easy",
    "comment": "DB entry of 'total_ghg' for Austria with 'year' 2000.",
    "sources": [
      {
        "source_name": "owid_co2_data",
        "source_type": "database",
        "page_number": null
      }
    ]
  }
}
What made the challenge particularly brutal was that PDF-based questions formed a significant portion of the test set (roughly 15 of 40 questions), and we had just nine hours to build a complete solution.
The Winning Strategy: Iterate, Test, Refine, Move On
Phase 1: Master the Database (hours 1–4)
Focus: Get database questions to 75% accuracy before moving on.
We started by building a single MCP server for SQL database access, connecting to Our World in Data's climate datasets:
- annual-co2-emissions-per-country.csv — GHG emissions per country
- owid-energy-data.csv — Energy data per country
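Under the hood this was essentially one SQL-execution tool exposed over MCP. The sketch below is a minimal reconstruction assuming FastMCP and a pyodbc connection to the SQL Server instance; the environment variable, tool name, and row formatting are illustrative rather than our exact code.

import os

import pyodbc
from fastmcp import FastMCP

mcp = FastMCP("climate-db")

# Connection string assumed to be provided via an environment variable.
CONN_STR = os.environ["SQLSERVER_CONN_STR"]

@mcp.tool()
def run_sql(query: str) -> list[dict]:
    """Run a read-only SQL query against the OWID climate tables and return rows as dicts."""
    conn = pyodbc.connect(CONN_STR)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        conn.close()

if __name__ == "__main__":
    mcp.run()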
Building the tool was only 20% of the work; teaching Claude to use it correctly was the real challenge.
Our iterative loop:
- Implement the database MCP server
- Test against all public-set database questions
- Analyze every incorrect answer
- Improve system prompt with new rules
- Repeat until reaching 75% accuracy
Each test run revealed failure patterns:
- "Agent queried
co2but neededtotal_ghg"
→ Rule: "total greenhouse gas emissions → use total_ghg column" - "Got 76.664 instead of 76,664,000"
→ Rule: "co2, gas_co2, total_ghg are in MILLION tonnes (×1e6 in SQL)" - "Used
electricity_generationinstead ofelectricity_demand"
→ Rule: "electricity demand → use electricity_demand, NOT generation"
After dozens of iterations and hundreds of prompt refinements, we hit 75% accuracy — only then did we move on.
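None of this tooling was sophisticated. A harness along the lines below is enough to re-score the public question set after every prompt change; the evaluate_question helper standing in for a call to our agent is hypothetical, as are the file name and tolerance.

import json

def evaluate_question(question: str) -> float:
    """Hypothetical stand-in: send the question to the agent and parse its numeric answer."""
    raise NotImplementedError

def run_eval(path: str = "public_questions.json", rel_tol: float = 0.01) -> None:
    """Score the agent against the public set and print every miss for failure analysis."""
    with open(path) as f:
        questions = json.load(f)
    correct = 0
    for qid, q in questions.items():
        predicted = evaluate_question(q["question"])
        expected = q["answer"]
        # Accept answers within a small relative tolerance of the reference value.
        if abs(predicted - expected) <= rel_tol * max(abs(expected), 1e-9):
            correct += 1
        else:
            print(f"[{qid}] expected {expected}, got {predicted} ({q['comment']})")
    print(f"Accuracy: {correct}/{len(questions)} = {correct / len(questions):.0%}")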
Phase 2: Add Wikipedia (hours 4–6)
With the database foundation solid, we built the Wikipedia MCP server using the same process:
- Build basic search and content retrieval
- Test on all Wikipedia questions
- Analyze failures (disambiguation issues, extraction errors)
- Add prompt rules: "search first to find exact article title"
- Test again
This iterative approach paid off: Wikipedia questions reached approximately 70% accuracy relatively quickly, benefiting from the robust reasoning framework built during Phase 1.
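For reference, a minimal version of the search and content-retrieval tools could look like the sketch below, built on the wikipedia Python package; the tool names and the summary/full-content switch are illustrative, not our exact implementation.

import wikipedia
from fastmcp import FastMCP

mcp = FastMCP("wikipedia")

@mcp.tool()
def search_articles(query: str, limit: int = 5) -> list[str]:
    """Return candidate article titles so the agent can pick the exact one before fetching."""
    return wikipedia.search(query, results=limit)

@mcp.tool()
def get_article(title: str, full: bool = False) -> str:
    """Return the summary (default) or the full plain-text content of a Wikipedia article."""
    if full:
        return wikipedia.page(title, auto_suggest=False).content
    return wikipedia.summary(title, auto_suggest=False)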
Phase 3: The PDF Headaches (hours 6–9)
This is where time became the enemy. We initially attempted to build our own RAG (Retrieval Augmented Generation) architecture:
- Parse PDFs with Docling
- Build a vector database for semantic search
- Create custom retrieval logic
After two hours of wrestling with PDF parsing edge cases, table extraction, and embedding generation, we made a critical decision: pivot to Ragie, a RAG-as-a-service platform:
import os

import requests
from fastmcp import FastMCP

mcp = FastMCP("ragie")

@mcp.tool()
def retrieve_from_ragie(query: str, top_k: int = 8, rerank: bool = False) -> dict:
    """Retrieve relevant chunks from Ragie's document index"""
    # Quick integration (no time for reranking); bearer auth and env var name are assumptions
    result = requests.post(
        "https://api.ragie.ai/retrievals",
        headers={"Authorization": f"Bearer {os.environ['RAGIE_API_KEY']}"},
        json={"query": query, "top_k": top_k, "rerank": False},
    )
    return result.json()
The rerank=False setting was necessary due to time constraints: reranking would have improved accuracy but added latency we couldn't afford during the 15-minute evaluation window.
The issue was that there simply wasn't enough time to properly test and tune the Ragie integration. While database and Wikipedia questions had gone through multiple refinement cycles, PDF questions got maybe two or three test iterations before the final deadline.
The result? Very poor performance on PDF-related questions.
Architecture
Despite the PDF struggles, our architecture demonstrated sophisticated agentic design:
Custom Tools
We built five specialized MCP servers as custom tools so the agent could arrive at correct answers more reliably and efficiently:
- Database MCP server
  Connected directly to an Azure SQL Server instance hosting Our World in Data's comprehensive climate datasets. After extensive testing, the agent learned to craft precise SQL queries, handling tricky details like distinguishing between hydro_electricity and hydroelectricity.
- Calculator MCP server
  When we noticed Claude's arithmetic producing floating-point errors, we built custom tools to improve the agent's performance (a sketch follows this list):
  - signed_difference(a, b) (for time-series changes where order matters)
  - abs_difference(a, b) (for magnitude comparisons)
  - ratio(numerator, denominator) (avoiding division by zero)
  - percentage_change(new, old) (properly handling growth calculations)
  - cagr(end, start, periods) (compound annual growth rate)
- Currency converter MCP server
  Exchange rates for 30+ currencies, essential for questions spanning multiple companies and countries; fixed rates kept the answers verifiable.
- Wikipedia MCP server
  For recent climate events not in historical databases:
  - Search for article titles
  - Retrieve summaries or full content
  - Extract facts like death tolls from heat waves or flood locations
- Ragie MCP server
  The last-minute addition for PDF documents. While not thoroughly tested, it provided semantic search over annual reports from Erste Bank, GSK, Swisscom, RWE, and Shell.
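As a rough illustration, a few of those calculator tools might look like this with FastMCP; returning percentages and raising on invalid input are assumed conventions, not a transcript of our server.

from fastmcp import FastMCP

mcp = FastMCP("calculator")

@mcp.tool()
def signed_difference(a: float, b: float) -> float:
    """Return a - b, preserving sign; order matters for time-series changes."""
    return a - b

@mcp.tool()
def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, refusing to divide by zero."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator

@mcp.tool()
def cagr(end: float, start: float, periods: float) -> float:
    """Compound annual growth rate over the given number of periods, returned as a percentage."""
    if start <= 0 or periods <= 0:
        raise ValueError("start and periods must be positive")
    return ((end / start) ** (1.0 / periods) - 1.0) * 100.0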
The Brain: A Battle-Tested System Prompt
The 5,000-token system prompt wasn't written upfront; it evolved through dozens of test-analyze-improve cycles:
Chain-of-thought reasoning process:
- understand the question
- check constraints (answer_type, unit, comment)
- check comment for ORDER (critical for differences)
- plan your approach
- verify column names if unsure
- execute carefully
- return ONLY the final numeric answer
Critical rules (each born from a real failure):
- Unit conversions: "co2, gas_co2, total_ghg are in MILLION tonnes (multiply by 1e6 IN THE SQL QUERY if question asks for tonnes)"
- Calculator tool usage: "NEVER pass string expressions like '76.664 * 1e6' (do multiplication in SQL first)"
- Difference calculations: "ALWAYS check the comment for explicit calculation order! If comment says 'A - B', use signed_difference(A, B) with EXACT order"
- Erste Bank priority: "ANY question mentioning 'Erste Bank' MUST use retrieve_from_ragie (do NOT fall back to Wikipedia)"
Every rule existed because testing revealed a specific failure mode.
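To make the unit rule concrete, this is the kind of query the prompt pushed the agent towards for the example question above; the table name follows the owid_co2_data source, though the exact SQL the agent generated varied.

# Conversion to tonnes happens inside SQL, so the model never multiplies by hand.
QUERY = """
SELECT total_ghg * 1e6 AS total_ghg_tonnes
FROM owid_co2_data
WHERE country = 'Austria' AND year = 2000
"""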
Results
We won the hackathon, but our performance breakdown tells an interesting story:
- Database questions: 75%+ accuracy
- Wikipedia questions: approximately 70% accuracy
- PDF questions: <20% accuracy
The solid foundation on database and Wikipedia questions is what won it for us, definitely not the PDF retrieval strategy.
Key Takeaways for Hackathon Success
- Iterate ruthlessly on one thing before moving on
  We didn't touch Wikipedia until we hit 75% on database questions. This discipline prevented half-baked implementations of everything.
- Testing is development
  We spent as much time testing and analyzing failures as writing code. Each incorrect answer became a new prompt rule or architectural decision.
- Know when to pivot
  A custom RAG solution would have been technically satisfying, but integrating Ragie in 30 minutes beat spending 3 hours on a half-working custom build.
- Perfect the basics
  Database and Wikipedia questions had simpler data sources. Mastering these created the conditions for winning the competition.
- Don't make LLMs do math
  Even though Claude has gotten much better at arithmetic, it is still not perfect, and this was clear during testing. We built an MCP server for Python-based calculations to fix this issue.
What would we do differently?
In retrospect, we identified two changes for a hypothetical second attempt:
- Prototype PDFs earlier
  Even 30 extra minutes to evaluate Ragie vs. custom RAG would have saved hours of building the wrong thing.
- Set time boxes strictly
  Database testing could have stopped at 70% accuracy, freeing an extra hour for PDF refinement.
Tech Stack
- Claude Sonnet 4.5
- Python
- FastMCP
- SQL Server
- Ragie
- Wikipedia API
Our full solution is available on GitHub. The modular architecture means you can run individual MCP servers independently.
Final Thoughts
This event was a blast and very well organized by AIM! We had so much fun and couldn't stop laughing at certain moments. Before seeing the leaderboard we were actually quite pessimistic, because we could see the agent struggling with some questions, so the victory came as a big surprise.
On a final note, this challenge was also incredibly relevant for the team because the three members are building two sustainability SaaS startups (Scope4 and Susteam) so we were perfectly in tune with the topic. Feel free to say hi to us on LinkedIn:
Congratulations also to the other teams who did an excellent job and contributed to a great atmosphere!