Monolith’s research team spent months testing how leading large language models interact with engineering tools — and how far they can go in reasoning about real physics and data.
This report compares Claude 4 Sonnet, Claude 3.7 Sonnet, Gemini 2.5 Pro, Grok 4, ChatGPT o3, and Kimi-k2-instruct, revealing how each performs inside the Monolith platform across four realistic engineering scenarios.
You’ll learn:
- Which LLMs handle engineering workflows most effectively
- Where general-purpose models fall short on physical reasoning
- How variability and repeatability differ between models
- Why agentic frameworks may be key to the next generation of engineering AI
Our findings show that while most models handle simple automation reliably, they often struggle when genuine engineering intuition is required. The results reveal clear gaps in today's general-purpose models, and clear opportunities for smarter, more tailored AI systems built for engineering.
“All these results were very insightful for the team. In addition to learning about these different LLMs and their capacity to harness the Monolith platform, we also learnt a lot about how to evaluate them, and how to improve the way they are called and the way the platform tools are exposed via the MCP.”
Monolith LLM Benchmarking Report, October 2025