Наталя Хандусенко AI Eng 19 February 2025, 15:55

AI researcher and OpenAI co-founder Andrej Karpathy tested Musk’s Grok 3: here are his conclusions

On February 17, Elon Musk’s startup xAI presented a new chatbot, Grok 3. Andrej Karpathy, OpenAI co-founder and former head of Tesla’s autopilot development department, received early access and tested Musk’s new product. What conclusions did he draw after two hours of using Grok 3?

First, Andrej Karpathy tested the chatbot’s thinking skills: the tasks involved the games The Settlers of Catan and tic-tac-toe, an emoji mystery, the Riemann hypothesis, and more.

«First, Grok 3 clearly has a state-of-the-art thinking model (there’s a ‘Think’ button) and did a great job with my Settlers of Catan question,» the AI researcher wrote on X.

Karpathy used the following prompt: «Create a board game webpage that shows a hexagonal grid, like in the game Settlers of Catan. Number each hexagon of the grid from 1 to N, where N is the total number of hexagons. Make it so that you can change the number of „rings“ using a slider. For example, in Catan, the radius is hexagons. One html page, please.»

The OpenAI co-founder noted that not all models can do this well. For example, o1-pro with a paid subscription of $200 per month can do it, but DeepSeek-R1, Gemini 2.0 Flash Thinking, and Claude cannot cope with it.
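To give a sense of what the prompt asks for, the grid-numbering part can be sketched in a few lines of Python. This is only an illustrative sketch, not Karpathy's or Grok 3's solution: the function names `hex_ring_cells` and `number_hex_grid` are hypothetical, the axial coordinate system is one common choice for hex grids, and the sketch assumes the central hexagon counts as the first ring, so that three rings give the 19 hexes of a standard Catan board. Rendering the page and the slider is left out.

```python
# Illustrative sketch only; names and conventions are assumptions, not from the article.

# Six neighbour directions in axial (q, r) coordinates, in walking order around a ring.
DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def hex_ring_cells(rings: int) -> list[tuple[int, int]]:
    """Return axial coordinates of a hexagonal board, centre first, then ring by ring.

    Assumption: the centre hexagon counts as ring 1, so rings=3 yields 19 cells.
    """
    if rings < 1:
        raise ValueError("rings must be at least 1")
    cells = [(0, 0)]
    for k in range(1, rings):
        # Start k steps from the centre along direction 4, then walk the six sides.
        q, r = DIRECTIONS[4][0] * k, DIRECTIONS[4][1] * k
        for dq, dr in DIRECTIONS:
            for _ in range(k):
                cells.append((q, r))
                q, r = q + dq, r + dr
    return cells


def number_hex_grid(rings: int) -> dict[int, tuple[int, int]]:
    """Number the hexagons from 1 to N, where N is the total number of hexagons."""
    return dict(enumerate(hex_ring_cells(rings), start=1))
```

Under these assumptions the total is N = 1 + 3·(rings − 1)·rings, so moving the slider from 3 to 4 rings would grow the board from 19 to 37 hexagons.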

Meanwhile, Grok 3 was unable to solve the emoji mystery, even after being given clear decoding instructions in the form of Rust code. DeepSeek-R1 did best on this task, partially decoding the message.

Then Grok 3 was given several tic-tac-toe boards to solve, which it did well. But it was unable to generate three «tricky» boards for the game, although o1-pro could not cope with this either.

Next, Karpathy uploaded the GPT-2 paper and asked a series of simple lookup questions, which worked well. Then he asked the model to estimate, without searching, the number of training FLOPs needed to train GPT-2.

«This is difficult because the number of tokens is not specified, so it needs to be partly estimated and partly calculated, which stresses lookup, knowledge, and math. One example is 40 GB of text ≅ 40 B characters ≅ 40 B bytes (assuming ASCII) ≅ 10 B tokens (assuming ~4 bytes/token), with ~10 epochs ≅ 100 B tokens of the training run, with 1.5 B parameters and with 2+4=6 flops/parameter/token, this is 100e9 × 1.5e9 × 6 ≅ 1e21 flops. Both Grok 3 and 4o fail this task, but Grok 3 with Thinking solves it perfectly, while o1 pro (the GPT thinking model) fails,» the AI researcher noted.
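The arithmetic in the quote can be reproduced step by step. The sketch below simply encodes the rough assumptions Karpathy lists (40 GB of ASCII text, about 4 bytes per token, about 10 epochs, 1.5 B parameters, 6 flops per parameter per token); the function name and parameter names are hypothetical, and the result is an order-of-magnitude estimate, not a measured figure.

```python
# Back-of-the-envelope estimate using the assumptions from the quote above.
# Function and parameter names are illustrative, not from the article.

def estimate_gpt2_training_flops(
    dataset_bytes: float = 40e9,             # 40 GB of text, ~1 byte per character (ASCII)
    bytes_per_token: float = 4,              # ~4 bytes per token
    epochs: float = 10,                      # ~10 passes over the data
    parameters: float = 1.5e9,               # 1.5 B parameters
    flops_per_param_per_token: float = 6,    # the quote's 2 + 4 = 6
) -> float:
    tokens_per_epoch = dataset_bytes / bytes_per_token   # ~10 B tokens
    training_tokens = tokens_per_epoch * epochs          # ~100 B tokens
    return training_tokens * parameters * flops_per_param_per_token


flops = estimate_gpt2_training_flops()
print(f"{flops:.1e} flops")  # 9.0e+20, i.e. roughly 1e21
```

The product comes out at about 9e20, which the quote rounds to 1e21; changing any single assumption (for example the number of epochs) scales the estimate proportionally.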

The model kept trying to solve the Riemann hypothesis until Karpathy stopped it. Other models given the same task gave up instantly, simply saying that it is a major unsolved problem.

«The overall impression I got here is that it’s somewhere around the capabilities of o1-pro and ahead of DeepSeek-R1, although of course we need actual, real evaluations,» Karpathy concluded about Grok 3’s thinking ability.

Next, Karpathy tested the DeepSearch function to search for answers on the Internet.

OpenAI recently launched Deep Research, a feature for researching information on the Internet. Grok 3 has a similar feature, called DeepSearch.

Using DeepSearch, Karpathy searched the Internet for answers to several questions. Grok 3 correctly answered questions about Apple’s upcoming launch, why Palantir shares are falling, where season 3 of «The White Lotus» was filmed, and what toothpaste Bryan Johnson uses.

The AI could not find the correct answer to two questions: where the cast members of season 4 of the series «Single’s Inferno» are now, and what program Simon Willison uses to convert speech to text.

Also, Grok 3 doesn’t like to cite X as a source by default, although it will do so if asked. Several times the model cited URLs that do not exist.

«My impression of DeepSearch is that it is about the same as Perplexity DeepResearch (which is great!), but not yet on the level of the recently released OpenAI Deep Research, which still seems more thorough and reliable (although still not perfect),» concluded the OpenAI co-founder.

Grok 3's sense of humor hasn’t improved, but that’s a problem with many AIs. The model is also still too sensitive to «complex ethical issues.»

What conclusion did the AI researcher reach?

Text: Наталя Хандусенко Tags: ai, artificial intelligence, grok 3
