Why are LLMs great at some complex tasks while failing on other apparently simple problems?

Jimmy Tidey
3 min read · May 20, 2024


It’s hard to develop intuitions about what LLMs can and can’t do. Why are they successful at some complex tasks only to fail at other apparently simple ones? Research from Princeton University looks at how well ChatGPT performs various carefully selected tasks, connecting its performance back to the way LLMs are trained. The approach is a useful window into an LLM’s internal landscape, and a way to think about where these models might be best applied.

The Princeton paper is called “Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve”. In AI terms, the paper is ancient, published in September 2023, but it caught my eye. To give an idea of where the authors are coming from, they are responding to a paper from Microsoft Research titled “Sparks of Artificial General Intelligence: Early experiments with GPT-4”. The Sparks authors see GPT-4 as showing some signs of human-style intelligence; the Princeton team are, in a polite way, rejecting this claim.

To summarise a very long paper, they argue that LLMs demonstrate limited ability to generalise from the patterns they see in their training data, undermining the claim that LLMs have general intelligence.

One lovely example is ROT13 encoding, which sounds a lot more complex than it is. ROT13 means encoding a sentence by replacing each letter with the one 13 places ahead of it in the alphabet, wrapping around at the end (a → n, b → o, etc.). It has been used in forums since the early days of the internet to hide spoilers: if you want to discuss a film without giving away a plot element to someone who hasn’t seen it, you encode that part. Anyone who has already seen the film, or doesn’t care, can trivially decode the text.
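To make the task concrete, here is a minimal Python sketch of a letter-shift cipher (the function name and structure are my own, not taken from the paper). A shift of 13 gives ROT13; the same function with a shift of 12 gives the ROT12 variant discussed below.

```python
import string

def rot_n(text: str, n: int = 13) -> str:
    """Shift each letter n places forward in the alphabet, wrapping around.

    n=13 gives ROT13; n=12 gives ROT12. Non-letters are left unchanged.
    """
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    table = str.maketrans(
        lower + upper,
        lower[n:] + lower[:n] + upper[n:] + upper[:n],
    )
    return text.translate(table)

print(rot_n("spoiler", 13))              # -> "fcbvyre"
print(rot_n(rot_n("spoiler", 13), 13))   # ROT13 is its own inverse -> "spoiler"
```

Because 13 is exactly half the alphabet, applying ROT13 twice recovers the original text, which is part of why it became the forum convention.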

Can ChatGPT ROT13 encode text? Yes, it’s pretty good at it. ChatGPT has been trained on mountains of text from the internet, where it has almost certainly seen ROT13 applied many times.

Can ChatGPT ROT12 encode (a → m, b → n, etc.)? The Princeton research suggests it is not so good at ROT12. There are far fewer examples of ROT12 encoding in ChatGPT’s training data, so it’s not surprising that it does the task less well. (The authors use a large open corpus of web text to check the relative prevalence of ROT13 and ROT12 on the general web.) Interestingly, ChatGPT is better at ROT encoding when the phrase being encoded is a common one, as opposed to a string of random letters. Again, this reflects the way LLMs draw on patterns in their training data to answer questions.

Another similar task is alphabetical ordering, which the research shows ChatGPT is very good at. On the other hand, it’s not so good at reverse alphabetical ordering, again because this is simply a task that it doesn’t see as much in its training data.
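For reference, the two orderings are equally trivial to compute programmatically; the asymmetry lies in the training data, not in the task itself. A quick sketch (my own example words):

```python
words = ["banana", "cherry", "apple", "date"]

print(sorted(words))                # alphabetical: ['apple', 'banana', 'cherry', 'date']
print(sorted(words, reverse=True))  # reverse alphabetical: ['date', 'cherry', 'banana', 'apple']
```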

For a human, if you understand alphabetisation, or ROT13, then it’s a very small leap to doing reverse alphabetisation or ROT12. For an LLM, this does not appear to be the case. Philosophical questions about how human reasoning compares with LLMs are fascinating, but I’m focusing on more pragmatic issues in this post.

The research suggests two questions you could ask to gauge how good an LLM is likely to be at a task:

— Is the task likely to be common in the training data? For example, ChatGPT might be good at converting Celsius into Fahrenheit, but not at converting lightyears into furlongs.

— Are the values likely to be common in the training data? For example, in the paper the researchers show that ChatGPT is good at converting Celsius into Fahrenheit, but only for common values. It can convert 36 Celsius to Fahrenheit, but if you ask what 11,001 Celsius is in Fahrenheit, performance might not be as good.
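The conversion itself is a fixed formula, F = C × 9/5 + 32, so both values are equally easy for a calculator; the 36 vs 11,001 contrast only matters because one appears in text far more often than the other. A small sketch of the arithmetic:

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Standard conversion: F = C * 9/5 + 32."""
    return c * 9 / 5 + 32

print(celsius_to_fahrenheit(36))      # common value   -> 96.8
print(celsius_to_fahrenheit(11_001))  # uncommon value -> 19833.8
```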

However….

The research shows that GPT-4 is better at the tasks discussed above than GPT-3.5, and from my own testing, GPT-4o can perform many of the tasks that the researchers suggest LLMs struggle with. So it might be that LLMs of sufficient scale do begin to generalise, and that reasoning about the precise training data no longer builds useful intuitions. It might also be that ChatGPT’s RLHF training is improving some of these use cases over time without addressing the underlying weakness in LLMs.

Whichever is the case, the research is a useful perspective on what LLMs evolved from: statistical models that guess the next word based on patterns in their training data.


Jimmy Tidey

PhD on digital systems for collective action and social network analysis. jimmytidey.co.uk/blog