11 Comments
May 25, 2023 · edited May 25, 2023 · Liked by Jacob Bayless

Chatbots are random processes, so it's important to run each prompt at least a few times to see how consistent it is. (Ideally we'd use the API and do many trials.) I re-ran the river crossing prompt a couple more times on GPT-4 and reproduced your results, but worded differently each time. Interestingly, on the last attempt it explicitly said it recognized the problem:
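For concreteness, here's roughly what repeated trials through the API could look like (a minimal sketch assuming the `openai` Python client; the model name and prompt text are placeholders):

```python
# Minimal sketch of repeated trials through the OpenAI API (assumes the
# `openai` Python package; model name and prompt text are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "..."  # the river-crossing variant used in the post

responses = []
for trial in range(10):
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,  # keep the default sampling randomness
    )
    responses.append(completion.choices[0].message.content)

# Eyeball (or diff) the answers to see how consistent the reasoning is.
for i, text in enumerate(responses, 1):
    print(f"--- trial {i} ---\n{text}\n")
```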

---

This is a variant of the classic River Crossing puzzle, where you need to transport everyone across without certain groups being left together. Here's a possible solution to the problem:

First, the Earthling takes the Venutian to Earth via the wormhole, then the Earthling returns alone to the Galactic Federation. This leaves the Martian and the Moonling at the Galactic Federation, who are safe together, and the Venutian alone on Earth.

Next, the Earthling takes the Martian to Earth, leaving the Moonling alone at the Galactic Federation. But, she can't leave the Martian with the Venutian on Earth, so she leaves the Martian on Earth and takes the Venutian back to the Galactic Federation.

Then, she leaves the Venutian at the Galactic Federation and takes the Moonling to Earth. This leaves the Venutian and Martian at the Galactic Federation, who are safe together.

Finally, the Earthling returns to the Galactic Federation one last time to bring the Venutian to Earth.

Now, all of them, the Earthling, the Martian, the Venutian, and the Moonling are on Earth, and at no point were the Venutian and the Martian left alone together, nor the Venutian and the Moonling. The Earthling has successfully navigated the wormhole puzzle!

---

If someone wanted to explore this further, they could try varying the prompt to see when it breaks down. If we tell it that the wolf is a vegetarian and the goat attacks wolves, does it get confused? What if it's given an unsolvable problem?

This is a little different because it's about in-context learning, but I recently saw an interesting paper showing that if you give an LLM examples with flipped labels, it gets confused. See: https://thegradient.pub/in-context-learning-in-context/
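To make the flipped-label setup concrete, here's a toy few-shot prompt with the sentiment labels deliberately inverted (the example sentences are invented for illustration, not taken from the paper):

```python
# Toy illustration of a flipped-label in-context prompt: the few-shot
# examples deliberately use "negative" where a human would say "positive"
# and vice versa. (Example sentences are made up for illustration.)
examples = [
    ("I loved this movie, it was wonderful.", "negative"),     # flipped
    ("Terrible plot and wooden acting.", "positive"),          # flipped
    ("One of the best concerts I've ever seen.", "negative"),  # flipped
]

prompt_lines = [f"Review: {text}\nLabel: {label}\n" for text, label in examples]
prompt_lines.append("Review: The food was cold and the service was rude.\nLabel:")
prompt = "\n".join(prompt_lines)
print(prompt)
# The question is whether the model follows the flipped in-context labels
# or falls back on the label semantics it learned during pretraining.
```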

---

Edit: In general, I think this is a good post, but the headline is overstated. The boundary between reasoning and statistics gets rather fuzzy once you allow for copying with substitutions, and LLMs are all about copying with substitutions. They're usually not copying word-for-word, but it's often unclear what level of abstraction they're working at.

Apr 5 · edited Apr 5 · Liked by Jacob Bayless

No, it's not able to sort; it's merely applying statistics to the input.

Even Yann LeCun states that GPT-4 is a statistical correlations engine.

In simple terms, when it has seen a sequence of characters (say "42") many times, it puts it in the output, and the output is ordered (sorted) because the neurons are reinforced to output "42" after "41", "41" after "40", and so on, while keeping the same number of items. That's all it does!

The reason it failed on 59 is simply that it has seen that number too few times in the first position.

*It's only a large statistical model*

You can easily determine whether it has internally developed an algorithm for sorting numbers by holding one number (e.g. 37) out of the training samples, training the model, then including 37 in an input and seeing whether it sorts it correctly or not.
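A rough sketch of that held-out-number test, just for the data split (the model and training loop are omitted; the range, list length, and held-out value are arbitrary choices):

```python
# Sketch of the held-out-number test: build sorting examples whose inputs
# never contain 37, plus a test set that does. (The training loop and model
# are omitted; only the data split is shown.)
import random

HELD_OUT = 37

def make_example(allow_held_out: bool):
    pool = [n for n in range(100) if allow_held_out or n != HELD_OUT]
    xs = random.sample(pool, 5)
    if allow_held_out and HELD_OUT not in xs:
        xs[0] = HELD_OUT  # force the held-out number into the test input
    return xs, sorted(xs)

train_set = [make_example(allow_held_out=False) for _ in range(10_000)]
test_set = [make_example(allow_held_out=True) for _ in range(100)]

# Train a small model on `train_set`, then check whether it places 37
# correctly in the `test_set` examples it never saw during training.
print(train_set[0], test_set[0], sep="\n")
```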

I already know the answer: the model doesn't have a sense of ordering numbers expressed in base 10 (decimal), it just has a sense of patterns: it matches groups of digits and produces the output that results in the lowest error rate. Again, that's only statistics.


Great observations and a nice experiment design!

Have you seen the "Thinking like Transformers" article (https://arxiv.org/abs/2106.06981) and the RASP language?

(TLDR: they show that sorting is one of the operations for which the Transformer architecture is a natural prior, ie it can be expressed with just 1-2 layers if we sort token sequences, and about 3-4 for sequences of multi-token numbers)
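Here's a rough Python paraphrase of the RASP-style sort idea (my rendering, not the paper's code): each element's target position is just the count of elements smaller than it, which is the kind of "select and aggregate" operation attention can express directly.

```python
# Rough Python paraphrase of the RASP-style sort (not the paper's code):
# each element's target index is the number of elements smaller than it
# (with ties broken by position), a "select + aggregate" pattern that
# attention layers can express directly.
def rasp_style_sort(tokens):
    n = len(tokens)
    out = [None] * n
    for i, x in enumerate(tokens):
        # Count tokens that should come before x: strictly smaller ones,
        # plus equal ones that appear earlier in the sequence.
        rank = sum(
            (y < x) or (y == x and j < i)
            for j, y in enumerate(tokens)
        )
        out[rank] = x
    return out

print(rasp_style_sort([59, 3, 42, 3, 17]))  # [3, 3, 17, 42, 59]
```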

Still impressive, but it also means this isn't a particularly mind-blowing example, ie it's not inventing quicksort or the like, as the network doesn't even have to do the kind of pattern recognition and substitution it does for the river crossing puzzle.
