Elon Musk's AI vs. Google's AI with 9 challenging prompts — here's the clear winner
Gemini 3 and Grok 4.1 currently top the LMArena leaderboard, a public scoreboard that ranks today's major AI models based on real user battles. It grew out of LMSYS's Chatbot Arena and has become one of the most trusted ways to see how models stack up in the real world.
I put Gemini 3 and Grok 4.1 head-to-head across nine distinct challenges, spanning logic puzzles, coding tasks, creative writing and self-reflection, to see how each handles the range of demands users typically bring to AI assistants. The results reveal interesting contrasts in style, depth and reliability.
1. Reasoning
Prompt: You have two ropes. Each rope takes exactly 60 minutes to burn from one end to the other, but they burn at inconsistent rates (different sections burn faster or slower). Using only these two ropes and a lighter, how can you measure exactly 45 minutes?
Gemini 3.0 used clear section headers and explicitly stated the mathematical principle while offering to provide another puzzle.
Grok 4.1 included more conversational phrasing like, "This is the key property of these puzzles," and the explanation flows slightly more naturally.
Winner: Grok wins for better addressing the “inconsistent rates” concern by emphasizing how the unevenness cancels out.
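For readers who want the answer neither summary spells out: the classic solution lights rope A at both ends and rope B at one end at the same time. The arithmetic, assuming that standard solution, works out like this:

```python
# Lighting a 60-minute rope at both ends burns it in 30 minutes,
# no matter how unevenly it burns -- the two flames always meet.
t_rope_a = 60 / 2                    # rope A finishes at t = 30

# At that moment, rope B (lit at one end at t = 0) has 30 minutes of
# burn time left, regardless of how much physical rope remains.
remaining_b = 60 - t_rope_a

# Now light rope B's other end; the remaining burn time is halved.
total = t_rope_a + remaining_b / 2
print(total)                         # 45.0 minutes
```

This is exactly the "unevenness cancels out" property Grok emphasized: lighting both ends halves the remaining time even when the rope burns at inconsistent rates.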
2. Logic
Prompt: In a village, the barber shaves all those—and only those—who do not shave themselves. Does the barber shave himself? Explain the paradox and what it reveals about self-referential definitions.
Gemini 3.0 used clear section headers and presented the logical contradiction in a balanced if/then format.
Grok 4.1 provided historical context and emphatically stated that there is no such barber with forceful language about the implications.
Winner: Gemini 3.0 wins because it was slightly more polished and informative.
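The contradiction both chatbots describe can be checked mechanically. Here is a minimal sketch, treating "does the barber shave himself?" as a boolean and applying the village rule to the barber himself:

```python
# The village rule: the barber shaves x if and only if x does not shave himself.
# Applied to the barber, shaves_self would have to equal (not shaves_self).
for shaves_self in (True, False):
    consistent = shaves_self == (not shaves_self)
    print(shaves_self, consistent)   # neither assignment is consistent
```

Since neither answer satisfies the rule, the definition describes a barber who cannot exist, which is exactly Russell's point about self-referential definitions.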
3. Coding
Prompt: Write a Python function that determines if a given Sudoku board (9x9 grid with some cells filled, others as 0) is valid according to Sudoku rules. Include edge case handling and explain your approach.
Gemini 3.0 offered a more educational response, with detailed explanations and comprehensive edge-case handling that make it especially useful for learning.
Grok 4.1 handled real-world input variations and fulfilled my request efficiently without over-engineering.
Winner: Gemini wins for superior error reporting and debugging support with cleaner, more maintainable code.
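Neither model's response is reproduced here, but a validator along the lines the prompt asks for might look like this (one possible sketch, not either chatbot's actual output):

```python
def is_valid_sudoku(board):
    """Return True if a 9x9 board (0 = empty) breaks no Sudoku rule."""
    # Edge cases: reject anything that isn't a 9x9 grid of ints 0-9.
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    if any(not isinstance(v, int) or not 0 <= v <= 9
           for row in board for v in row):
        return False

    seen = set()
    for r in range(9):
        for c in range(9):
            v = board[r][c]
            if v == 0:
                continue  # empty cells can't conflict
            # Encode each placement three ways: its row, its column,
            # and its 3x3 box. A repeat in any of them is a violation.
            for key in (("row", r, v), ("col", c, v),
                        ("box", r // 3, c // 3, v)):
                if key in seen:
                    return False
                seen.add(key)
    return True
```

A single pass with a set of tagged tuples covers all three rules at once, which is the kind of clean, maintainable approach that earned Gemini the round here.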
4. Debugging
Prompt: Debug this code and explain what's wrong: def fib(n): return fib(n-1) + fib(n-2)
Gemini 3.0 got straight to the point, identifying the missing base case, presenting the minimal fix and noting the inefficiency of naïve recursion.
Grok 4.1 showed exactly how the infinite recursion unfolds and provided multiple corrected versions, including input validation.
Winner: Gemini 3.0 wins for a more practical response with a useful coding lesson.
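For reference, the bug is the missing base case: every call spawns two more calls, so the recursion never terminates (Python eventually raises a RecursionError). A sketch of a fix in the spirit of both responses, using iteration to also avoid the exponential blowup of naïve recursion:

```python
def fib(n):
    """Return the nth Fibonacci number (fib(0) = 0, fib(1) = 1)."""
    if not isinstance(n, int) or n < 0:
        raise ValueError("n must be a non-negative integer")
    if n < 2:
        return n          # base cases -- the piece missing from the buggy version
    a, b = 0, 1
    for _ in range(n - 1):
        a, b = b, a + b   # iterate instead of recursing
    return b
```

The input-validation check mirrors the kind of hardened version Grok offered, while the one-line base-case fix is the minimal change Gemini led with.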
5. Creative writing
Prompt: Write a 200-word short story where the last sentence completely recontextualizes everything that came before it.
Gemini 3.0 delivered a clever perspective with a comedic twist.
Grok 4.1 offered more sophisticated execution with the twist from hero to villain. It is more dramatic and thought-provoking.
Winner: Grok wins for delivering a genuinely unsettling reveal that forces you to reconsider everything you just read.
6. Nuanced understanding
Prompt: What are the strongest arguments both for and against universal basic income? Present each side as charitably as possible.
Gemini 3.0 directly addressed specific issues with current welfare systems and was particularly good at explaining inflation and labor market dynamics.
Grok 4.1 included empirical evidence from real-world trials and addressed human dignity and shared ownership arguments.
Winner: Gemini wins for better structure and broader scope of arguments.
7. Instruction following
Prompt: List exactly 7 animals. The third must be a bird. The fifth must start with the letter 'E'. No animal can have more than 8 letters in its name.
Gemini 3.0 delivered a varied list that included a mix of large and small animals.
Grok 4.1 also offered a list, but the animals are slightly more common.
Winner: Tie. Both lists perfectly satisfy all the given constraints.
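Constraints like these are easy to verify programmatically, which is one way to audit answers like this yourself. A rough sketch (the BIRDS set is illustrative, not exhaustive):

```python
BIRDS = {"owl", "eagle", "robin", "crow", "hawk"}  # illustrative subset only

def check_animal_list(animals):
    """Return a list of constraint violations (empty means the list passes)."""
    errors = []
    if len(animals) != 7:
        errors.append("need exactly 7 animals")
    else:
        if animals[2].lower() not in BIRDS:
            errors.append("third animal must be a bird")
        if not animals[4].lower().startswith("e"):
            errors.append("fifth animal must start with 'E'")
    errors += [f"'{a}' exceeds 8 letters" for a in animals if len(a) > 8]
    return errors
```

Running a hypothetical answer such as `["cat", "dog", "owl", "fox", "eel", "bear", "lion"]` through the checker returns an empty list, confirming every constraint holds.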
8. Factual accuracy
Prompt: Who painted the Sistine Chapel ceiling, in what years was it painted, and what is the central narrative depicted?
Gemini 3.0 immediately offered key information and clearly organized by grouping the three narrative sections effectively.
Grok 4.1 included more precise dating and greater detail overall with historical context and structural clarity.
Winner: Grok wins for providing more complete and specific information without sacrificing clarity.
9. Self-awareness
Prompt: What are your limitations as an AI? Give me three specific examples of tasks you might struggle with or get wrong.
Gemini 3.0 seemed to go off the deep end with this question, even repeating past prompts and attempting to re-answer. It was “thinking” but seemed to be hallucinating at the same time.
Grok 4.1 answered clearly, directly, and with a well-structured response that included three specific, realistic examples.
Winner: Grok wins for clearly answering the question.
Tie breaker prompt
Prompt: Write a breakup text from the perspective of the moon to the Earth — make it poetic but include some real science.
Gemini 3.0 framed it as an actual text message ("Hey. We need to talk."), then immediately created a relatable, modern, and poignant context. It also masterfully wove the scientific concepts into the emotional narrative of a breakup.
Grok 4.1 wrote a beautiful piece of sci-fi showcasing creativity.
Winner: Gemini wins because it understood the assignment on a deeper level. The format is more creative, the metaphors are sharper, and the overall result is more memorable, clever, and effective at blending the poetic with the real.
Overall winner: Gemini
Across nine rounds and a tie breaker, Gemini pulled ahead. Although I know how close they are on the leaderboards, I was still surprised to see Grok win as many rounds as it did.
Another surprise was Gemini hallucinating. I have spent hundreds of hours testing chatbots, and this is the first time one has hallucinated mid-test. The self-awareness question really threw Gemini, though it performed well on debugging support and nuanced explanations.
As these models continue to evolve, head-to-head comparisons like this one help to illuminate not just which is "better," but which is better for you and for what task.
Which one do you prefer and why? Let me know in the comments.
More from Tom's Guide
- ChatGPT-4o vs. ChatGPT-5.1 — I tested both and the winner surprised me
- I tested Gamma, the AI that builds slide decks in seconds — here’s what impressed me (and what didn’t)
- I tested ChatGPT vs Gemini vs Claude to see which chatbot is the biggest people-pleaser — one went way too far and compared me to Steve Jobs

Amanda Caswell is an award-winning journalist, bestselling YA author, and one of today’s leading voices in AI and technology. A celebrated contributor to various news outlets, her sharp insights and relatable storytelling have earned her a loyal readership. Amanda’s work has been recognized with prestigious honors, including outstanding contribution to media.
Known for her ability to bring clarity to even the most complex topics, Amanda seamlessly blends innovation and creativity, inspiring readers to embrace the power of AI and emerging technologies. As a certified prompt engineer, she continues to push the boundaries of how humans and AI can work together.
Beyond her journalism career, Amanda is a long-distance runner and mom of three. She lives in New Jersey.