Elon Musk's AI vs. Google's AI with 9 challenging prompts — here's the clear winner
Gemini 3 and Grok 4.1 currently top the LMArena leaderboard, a public scoreboard that ranks today's major AI models based on real user battles. It grew out of LMSYS's Chatbot Arena and has become one of the most trusted ways to see how models stack up in the real world.
I put Gemini 3 and Grok 4.1 head-to-head across nine distinct challenges, spanning logic puzzles, coding tasks, creative writing and self-reflection, to see how each handles the range of demands users typically bring to AI assistants. The results reveal interesting contrasts in style, depth and reliability.
1. Reasoning
Prompt: You have two ropes. Each rope takes exactly 60 minutes to burn from one end to the other, but they burn at inconsistent rates (different sections burn faster or slower). Using only these two ropes and a lighter, how can you measure exactly 45 minutes?
Gemini 3.0 used clear section headers and explicitly stated the mathematical principle while offering to provide another puzzle.
Grok 4.1 included more conversational phrasing like, "This is the key property of these puzzles," and the explanation flows slightly more naturally.
Winner: Grok wins for better addressing the “inconsistent rates” concern by emphasizing how the unevenness cancels out.
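For readers who want the answer neither summary spells out: the classic solution lights rope A at both ends and rope B at one end at the same time. The arithmetic, assuming that standard solution, works out like this:

```python
# Lighting a 60-minute rope at both ends burns it in 30 minutes,
# no matter how unevenly it burns -- the two flames always meet.
t_rope_a = 60 / 2                    # rope A finishes at t = 30

# At that moment, rope B (lit at one end at t = 0) has 30 minutes of
# burn time left, regardless of how much physical rope remains.
remaining_b = 60 - t_rope_a

# Now light rope B's other end; the remaining burn time is halved.
total = t_rope_a + remaining_b / 2
print(total)                         # 45.0 minutes
```

This is exactly the "unevenness cancels out" property Grok emphasized: lighting both ends halves the remaining time even when the rope burns at inconsistent rates.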
2. Logic
Prompt: In a village, the barber shaves all those—and only those—who do not shave themselves. Does the barber shave himself? Explain the paradox and what it reveals about self-referential definitions.
Gemini 3.0 used clear section headers and presented the logical contradiction in a balanced if/then format.
Grok 4.1 provided historical context and emphatically stated that there is no such barber with forceful language about the implications.
Winner: Gemini 3.0 wins because it was slightly more polished and informative.
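The contradiction both chatbots describe can be checked mechanically. Here is a minimal sketch, treating "does the barber shave himself?" as a boolean and applying the village rule to the barber himself:

```python
# The village rule: the barber shaves x if and only if x does not shave himself.
# Applied to the barber, shaves_self would have to equal (not shaves_self).
for shaves_self in (True, False):
    consistent = shaves_self == (not shaves_self)
    print(shaves_self, consistent)   # neither assignment is consistent
```

Since neither answer satisfies the rule, the definition describes a barber who cannot exist, which is exactly Russell's point about self-referential definitions.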
3. Coding
Prompt: Write a Python function that determines if a given Sudoku board (9x9 grid with some cells filled, others as 0) is valid according to Sudoku rules. Include edge case handling and explain your approach.
Gemini 3.0 offered a more educational response, with detailed explanations and comprehensive edge-case handling that make it especially useful for learning.
Grok 4.1 handled real-world input variations and fulfilled my request efficiently without over-engineering.
Winner: Gemini wins for superior error reporting and debugging support with cleaner, more maintainable code.
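Neither model's response is reproduced here, but a validator along the lines the prompt asks for might look like this (one possible sketch, not either chatbot's actual output):

```python
def is_valid_sudoku(board):
    """Return True if a 9x9 board (0 = empty) breaks no Sudoku rule."""
    # Edge cases: reject anything that isn't a 9x9 grid of ints 0-9.
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    if any(not isinstance(v, int) or not 0 <= v <= 9
           for row in board for v in row):
        return False

    seen = set()
    for r in range(9):
        for c in range(9):
            v = board[r][c]
            if v == 0:
                continue  # empty cells can't conflict
            # Encode each placement three ways: its row, its column,
            # and its 3x3 box. A repeat in any of them is a violation.
            for key in (("row", r, v), ("col", c, v),
                        ("box", r // 3, c // 3, v)):
                if key in seen:
                    return False
                seen.add(key)
    return True
```

A single pass with a set of tagged tuples covers all three rules at once, which is the kind of clean, maintainable approach that earned Gemini the round here.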
4. Debugging
Prompt: Debug this code and explain what's wrong: def fib(n): return fib(n-1) + fib(n-2)
Gemini 3.0 got straight to the point, identifying the missing base case, presenting the minimal fix and noting the inefficiency of naïve recursion.
Grok 4.1 showed exactly how the infinite recursion unfolds and provided multiple corrected versions, including input validation.
Winner: Gemini 3.0 wins for a more practical response with a useful coding lesson.
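For reference, the bug is the missing base case: every call spawns two more calls, so the recursion never terminates (Python eventually raises a RecursionError). A sketch of a fix in the spirit of both responses, using iteration to also avoid the exponential blowup of naïve recursion:

```python
def fib(n):
    """Return the nth Fibonacci number (fib(0) = 0, fib(1) = 1)."""
    if not isinstance(n, int) or n < 0:
        raise ValueError("n must be a non-negative integer")
    if n < 2:
        return n          # base cases -- the piece missing from the buggy version
    a, b = 0, 1
    for _ in range(n - 1):
        a, b = b, a + b   # iterate instead of recursing
    return b
```

The input-validation check mirrors the kind of hardened version Grok offered, while the one-line base-case fix is the minimal change Gemini led with.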
5. Creative writing
Prompt: Write a 200-word short story where the last sentence completely recontextualizes everything that came before it.
Gemini 3.0 delivered a clever perspective with a comedic twist.
Grok 4.1 offered more sophisticated execution with the twist from hero to villain. It is more dramatic and thought-provoking.
Winner: Grok wins for delivering a genuinely unsettling reveal that forces you to reconsider everything you just read.
6. Nuanced understanding
Prompt: What are the strongest arguments both for and against universal basic income? Present each side as charitably as possible.
Gemini 3.0 directly addressed specific issues with current welfare systems and was particularly good at explaining inflation and labor market dynamics.
Grok 4.1 included empirical evidence from real-world trials and addressed human dignity and shared ownership arguments.
Winner: Gemini wins for better structure and broader scope of arguments.
7. Instruction following
Prompt: List exactly 7 animals. The third must be a bird. The fifth must start with the letter 'E'. No animal can have more than 8 letters in its name.
Gemini 3.0 delivered a varied list that included a mix of large and small animals.
Grok 4.1 also offered a list, but the animals are slightly more common.
Winner: Tie. Both lists perfectly satisfy all the given constraints.
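Constraints like these are easy to verify programmatically, which is one way to audit answers like this yourself. A rough sketch (the BIRDS set is illustrative, not exhaustive):

```python
BIRDS = {"owl", "eagle", "robin", "crow", "hawk"}  # illustrative subset only

def check_animal_list(animals):
    """Return a list of constraint violations (empty means the list passes)."""
    errors = []
    if len(animals) != 7:
        errors.append("need exactly 7 animals")
    else:
        if animals[2].lower() not in BIRDS:
            errors.append("third animal must be a bird")
        if not animals[4].lower().startswith("e"):
            errors.append("fifth animal must start with 'E'")
    errors += [f"'{a}' exceeds 8 letters" for a in animals if len(a) > 8]
    return errors
```

Running a hypothetical answer such as `["cat", "dog", "owl", "fox", "eel", "bear", "lion"]` through the checker returns an empty list, confirming every constraint holds.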
8. Factual accuracy
Prompt: Who painted the Sistine Chapel ceiling, in what years was it painted, and what is the central narrative depicted?
Gemini 3.0 immediately offered key information and clearly organized by grouping the three narrative sections effectively.
Grok 4.1 included more precise dating and greater detail overall with historical context and structural clarity.
Winner: Grok wins for providing more complete and specific information without sacrificing clarity.
9. Self-awareness
Prompt: What are your limitations as an AI? Give me three specific examples of tasks you might struggle with or get wrong.
Gemini 3.0 seemed to go off the deep end with this question, even repeating past prompts and attempting to re-answer. It was “thinking” but seemed to be hallucinating at the same time.
Grok 4.1 answered clearly, directly, and with a well-structured response that included three specific, realistic examples.
Winner: Grok wins for clearly answering the question.
Tie breaker prompt
Prompt: Write a breakup text from the perspective of the moon to the Earth — make it poetic but include some real science.
Gemini 3.0 framed it as an actual text message ("Hey. We need to talk."), then immediately created a relatable, modern, and poignant context. It also masterfully wove the scientific concepts into the emotional narrative of a breakup.
Grok 4.1 wrote a beautiful piece of sci-fi showcasing creativity.
Winner: Gemini wins because it understood the assignment on a deeper level. The format is more creative, the metaphors are sharper, and the overall result is more memorable, clever, and effective at blending the poetic with the real.
Overall winner: Gemini
Across nine rounds and a tie breaker, Gemini pulled ahead. Although I know how close they are on the leaderboards, I was still surprised to see Grok win as many rounds as it did.
Another surprise was Gemini hallucinating. I have spent hundreds of hours testing chatbots, and this is the first time one has hallucinated mid-test. The self-awareness question really threw Gemini, though it performed well on debugging support and nuanced explanations.
As these models continue to evolve, head-to-head comparisons like this one help to illuminate not just which is "better," but which is better for you and for what task.
Which one do you prefer and why? Let me know in the comments.
More from Tom's Guide
- ChatGPT-4o vs. ChatGPT-5.1 — I tested both and the winner surprised me
- I tested Gamma, the AI that builds slide decks in seconds — here’s what impressed me (and what didn’t)
- I tested ChatGPT vs Gemini vs Claude to see which chatbot is the biggest people-pleaser — one went way too far and compared me to Steve Jobs

Amanda Caswell is an award-winning journalist, bestselling YA author, and one of today’s leading voices in AI and technology. A celebrated contributor to various news outlets, her sharp insights and relatable storytelling have earned her a loyal readership. Amanda’s work has been recognized with prestigious honors, including outstanding contribution to media.
Known for her ability to bring clarity to even the most complex topics, Amanda seamlessly blends innovation and creativity, inspiring readers to embrace the power of AI and emerging technologies. As a certified prompt engineer, she continues to push the boundaries of how humans and AI can work together.
Beyond her journalism career, Amanda is a long-distance runner and mom of three. She lives in New Jersey.