Natalia Khandusenko
19 February 2025, 15:55
AI researcher and OpenAI co-founder Andrej Karpathy tested Musk’s Grok 3: here are his conclusions
On February 17, Elon Musk’s startup xAI presented its new chatbot, Grok 3. OpenAI co-founder and former head of Tesla’s Autopilot development, Andrej Karpathy, received early access and tested Musk’s new product. What conclusions did he draw after two hours of using Grok 3?
First, Andrej Karpathy tested the chatbot’s reasoning skills with tasks involving the games The Settlers of Catan and tic-tac-toe, an «Emoji mystery» puzzle, the Riemann hypothesis, and more.
«First, Grok 3 clearly has a state-of-the-art thinking model (there’s a ‘Think’ button) and did a great job with my Settlers of Catan question,» the AI researcher wrote on X.
Karpathy used the following prompt: «Create a board game webpage that shows a hexagonal grid, like in the game Settlers of Catan. Number each hexagon of the grid from 1 to N, where N is the total number of hexagons. Make it so that you can change the number of ‘rings’ using a slider. For example, in Catan, the radius is 3 hexagons. One html page, please.»
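The grid logic that the prompt asks for can be sketched in a few lines. The Python snippet below is only an illustration, not the page any of the models produced. It assumes axial (q, r) hex coordinates, assumes that the radius counts the center tile as the first ring (so the standard 19-tile Catan board has radius 3), and uses hypothetical function names.

```python
def hex_cells(radius: int) -> list[tuple[int, int]]:
    """Axial (q, r) coordinates of every tile on a hexagonal board.

    Assumption: `radius` counts the center tile as ring 1,
    so radius 3 gives the 19-tile Catan layout.
    """
    if radius < 1:
        raise ValueError("radius must be at least 1")
    k = radius - 1  # number of rings around the center tile
    cells = []
    for q in range(-k, k + 1):
        # Keep only rows where the implied third coordinate stays within k.
        for r in range(max(-k, -q - k), min(k, -q + k) + 1):
            cells.append((q, r))
    return cells


def hex_count(radius: int) -> int:
    """Closed form for N: one center tile plus 6*i tiles in each ring i."""
    k = radius - 1
    return 1 + 3 * k * (k + 1)


def number_hexes(radius: int) -> dict[int, tuple[int, int]]:
    """Number the tiles from 1 to N, as the prompt requests."""
    return {i + 1: cell for i, cell in enumerate(hex_cells(radius))}


if __name__ == "__main__":
    for radius in range(1, 5):  # stands in for the slider values
        print(radius, hex_count(radius))
```

A web page would additionally convert each (q, r) pair to pixel coordinates and redraw the board when the slider changes; that part is omitted here.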
The OpenAI co-founder noted that not all models handle this task well. For example, o1-pro (available with a $200-per-month subscription) can do it, but DeepSeek-R1, Gemini 2.0 Flash Thinking, and Claude cannot.
Meanwhile, Grok 3 was unable to solve the «Emoji mystery», even after being given a clear hint on how to decode it in the form of Rust code. DeepSeek-R1 did best on this task, partially decoding the message.
Grok 3 was then asked to solve several tic-tac-toe boards, which it did well. But it failed to generate three «tricky» boards for the game, although o1-pro could not do this either.
Next, Karpathy uploaded the GPT-2 paper and asked a number of simple lookup questions, which worked well. Then he asked the model to estimate, without searching, the number of training FLOPs needed to train GPT-2.
«This is difficult because the number of tokens is not specified, so it needs to be partly estimated and partly calculated, which stresses lookup, knowledge, and math. One example is 40 GB of text ≅ 40 B characters ≅ 40 B bytes (assuming ASCII) ≅ 10 B tokens (assuming ~4 bytes/token), with ~10 epochs ≅ 100 B tokens of the training run, with 1.5 B parameters and with 2+4=6 flops/parameter/token, this is 100e9 × 1.5e9 × 6 ≅ 1e21 flops. Both Grok 3 and 4o fail this task, but Grok 3 with Thinking solves it perfectly, while o1 pro (the GPT thinking model) fails,» the AI researcher noted.
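The arithmetic in this estimate is easy to reproduce. The short Python sketch below simply multiplies out the figures Karpathy quotes; the function and parameter names are hypothetical, and the default values are his stated assumptions (ASCII text, about 4 bytes per token, about 10 epochs), not measured numbers.

```python
def estimate_training_flops(
    dataset_bytes: float = 40e9,        # 40 GB of text, 1 byte per character (ASCII)
    bytes_per_token: float = 4.0,       # rough tokenizer average assumed in the quote
    epochs: float = 10.0,               # assumed number of passes over the data
    params: float = 1.5e9,              # GPT-2 parameter count used in the quote
    flops_per_param_token: float = 6.0, # 2 (forward) + 4 (backward)
) -> float:
    """Back-of-the-envelope training compute: tokens x parameters x 6."""
    tokens_per_epoch = dataset_bytes / bytes_per_token  # about 10 B tokens
    training_tokens = tokens_per_epoch * epochs         # about 100 B tokens
    return training_tokens * params * flops_per_param_token


if __name__ == "__main__":
    flops = estimate_training_flops()
    print(f"{flops:.1e} FLOPs")  # on the order of 1e21
```

With the default values the product is 100e9 × 1.5e9 × 6 = 9e20, which rounds to the 1e21 FLOPs given in the quote.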
Grok 3 kept trying to solve the Riemann hypothesis until Karpathy stopped the attempt. Other models given the same task gave up instantly, simply saying that it is a major unsolved problem.
«The overall impression I got here is that it’s somewhere around the capabilities of o1-pro and ahead of DeepSeek-R1, although of course we need actual, real evaluations,» Karpathy concluded about Grok 3’s thinking ability.
Next, Karpathy tested the DeepSearch function to search for answers on the Internet.
OpenAI recently launched Deep Research, a tool for researching information on the Internet. Grok 3 has a similar feature called DeepSearch.
Using DeepSearch, Karpathy searched the Internet for answers to several questions. Grok 3 correctly answered questions about Apple’s upcoming launch, why Palantir shares are falling, where «The White Lotus» season 3 was filmed, and what toothpaste Bryan Johnson uses.
The AI could not find the correct answer to two questions: where the cast of season 4 of the series «Single’s Inferno» is now, and what program Simon Willison uses to convert speech to text.
Also, Grok 3 does not like to cite X as a source by default, although you can ask it to. Several times the model cited URLs that do not exist.
«My impression of DeepSearch is that it is about the same as Perplexity DeepResearch (which is great!), but not yet on the level of the recently released OpenAI Deep Research, which still seems more thorough and reliable (although still not perfect),» concluded the OpenAI co-founder.
Grok 3's sense of humor hasn’t improved, but that’s a problem with many AIs. The model is also still too sensitive to «complex ethical issues.»
What conclusion did the AI researcher reach?
AI researcher and OpenAI co-founder Andrej Karpathy
Based on a quick test run of about 2 hours this morning, Grok 3 + Thinking feels about on par with OpenAI’s state-of-the-art models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. That is pretty incredible, considering the team started from scratch about 1 year ago; reaching this level in such a time frame is unprecedented.
Don’t forget the caveats too: models are stochastic and may give slightly different answers each time, and it’s very early days, so we’ll have to wait for a lot more evaluations over the next few days/weeks. The first results in the LM Arena look quite encouraging.
For now, big congratulations to the xAI team. They clearly have tremendous speed and momentum, and I look forward to adding Grok 3 to my «LLM council» and hearing what it thinks going forward.