Week notes — 21st May 2024
To recap from the last couple of weeks: I’m interested in how LLMs’ semantic understanding of unstructured text might be used to help users navigate and understand local plans better. In particular, are there ways to go beyond the chatbot model?
I’ll do a post in the future on my specific progress with local plans, but this week I’ll just note some observations about the nuts and bolts.
First of all, ChatGPT-4o: can it help chunk PDFs into text fragments? No, not on the documents I’ve tested. Breaking PDFs up into fragments is the key to a huge number of LLM applications, and right now it feels like the hardest problem in AI. ChatGPT just comes back with gibberish, though it might work on shorter documents. It’s fascinating that this problem isn’t solved; I’ve tried numerous solutions and none is brilliant. Who knew the arcane, bloated PDF format would be such a blocker to progress?
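To make concrete what I mean by chunking, here’s a minimal sketch of the naive baseline, assuming Python with the pypdf library: extract each page’s text, concatenate it, and split it into overlapping fragments. The file name and chunk sizes are illustrative, not what I actually ran, and `extract_text()` is exactly the step that produces gibberish on messy layouts.

```python
# Minimal sketch: extract text from a PDF and split it into overlapping
# chunks. pypdf's extract_text() is where things go wrong on messy
# local-plan PDFs (multi-column layouts, tables, scanned pages).
from pypdf import PdfReader

def chunk_pdf(path: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    reader = PdfReader(path)
    # Concatenate per-page text; this already discards layout information.
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # step forward, keeping some overlap
    return chunks

# Hypothetical file name, for illustration only.
chunks = chunk_pdf("local_plan.pdf")
print(f"{len(chunks)} chunks extracted")
```

Anything multi-column, tabular, or scanned defeats this kind of linear extraction, which is why layout-aware parsers exist, and why their failure to generalise (next point) matters so much.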
On which note, in other testing I’ve observed that solutions that look promising for parsing PDFs are terrible at generalising. Approaches that work for scientific papers or accountancy documents fail on local plans and associated documents. An approach that works on one local plan fails on others.
Right now, I want to get to the creative exploration phase, so I’m going to move to that with a small set of documents that I can parse, rather than spending too long thinking about how to parse PDFs.
In the future it would be lovely to have a representative test set of PDF local plans to evaluate parsing approaches against.
One long-running complaint in AI is that progress in a domain is often demonstrated against a single, quite arbitrary, benchmark. I now see that one reason for this could be the difficulty of creating benchmarks: making test datasets is hard.