Week notes — 12th June 2024

Jimmy Tidey
Jun 12, 2024

This week I have been taking Stanford’s CS224d: Deep Learning for Natural Language Processing course.

It’s taught by Chris Manning, who was working in computational linguistics well before the AI boom. In a field that moves so quickly, it’s nice to get a sense of all the work that went into laying the foundations, and an appreciation of how many dead ends there have been. Are LLMs a cul-de-sac too? Admittedly, it would be a very large and lucrative cul-de-sac.

On the one hand, it feels very unlikely I’ll ever need to implement a language model from the ground up. On the other hand, in an industry suffused with people who have a less than detailed understanding of the technologies they are talking about, I want to be sure that I know what’s actually going on. So that’s why I’m taking the course.

Neighbourhood Plans

Work on neighbourhood plans has slowed a bit because of the course, but… I’ve continued looking at topic modelling.

Note: I’ve found using the week notes as a kind of lab notebook to record what I’ve been up to really gives structure to the research process.

I have around 30 Neighbourhood Plans, each ‘chunked’ into policy sections of around 1000 words maximum. I’ve spent a long time ensuring that the ‘chunks’ are semantically meaningful. This has been helped by the fact that Neighbourhood Plans are highly structured, typically broken into many subheadings, with each subheading connected to a policy.

Each chunk has been vectorised using Ada-002. I focused on topic modelling per policy rather than per Neighbourhood Plan, reasoning that individual policies are likely to show more variation than whole plans, which, at the highest level, all address fairly similar topics. My plan is then to count, for each Neighbourhood Plan, the number of policies about key topics. This should give a sense of which topics a specific Neighbourhood Plan emphasises.
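The counting step could look something like the sketch below. It assumes each policy chunk has already been given a single topic label by whichever approach ends up working; the plan names, topic labels and the `topics_per_plan` helper are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (plan, topic) pairs, one per policy chunk.
# Real labels would come from the topic modelling step.
policy_topics = [
    ("Plan A", "Housing"),
    ("Plan A", "Housing"),
    ("Plan A", "Flood Risk Management"),
    ("Plan B", "Heritage"),
    ("Plan B", "Housing"),
]


def topics_per_plan(pairs):
    """Count how many policies in each plan fall under each topic."""
    counts = defaultdict(Counter)
    for plan, topic in pairs:
        counts[plan][topic] += 1
    return counts


counts = topics_per_plan(policy_topics)
for plan, topic_counts in counts.items():
    # most_common() lists topics from most to least frequent for this plan
    print(plan, topic_counts.most_common())
```

Dividing each count by the number of policies in the plan would make plans of different lengths easier to compare.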

Approach 1 — vector embedding per policy
I tried to assign topics to policies based on their Ada-002 vector embeddings. I’ve tried numerous approaches to dimensionality reduction and clustering (see last week’s post), and have not had meaningful clusters emerge. One reason for this might be that a single vector for a large chunk of text crushes out too much detail. Vector search is quite successful on the policy chunks, so I’m a bit surprised clustering doesn’t work better.
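For readers unfamiliar with the idea, here is a toy illustration of clustering embedding vectors. It is not the pipeline used here: it is a bare-bones k-means on made-up 2D points, whereas real Ada-002 embeddings have many more dimensions and would normally go through a proper library. The `kmeans` function and the data are assumptions for the sake of the example.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Very small k-means: returns a cluster label for each vector."""
    # Deterministic start: use the first k vectors as the initial centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up stand-ins for policy embeddings: two obvious groups.
toy_vectors = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
labels = kmeans(toy_vectors, k=2)
```

On toy data like this the groups separate cleanly; the difficulty described above is that real policy embeddings do not fall into such tidy groups.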

Approach 2 — vector embedding per policy with place names removed
In Approach 1, clusters often formed around the names of the Neighbourhoods. This makes perfect sense: they are, in a sense, the most defining topics in the corpus. I wondered if they were ‘crowding out’ a more useful clustering.

To get Ada-002 embeddings without the place names, I’d have to remove them from the text and revectorise it using OpenAI, which would take a while and cost money. Instead, I removed the place names from the text and used BERTopic for topic modelling. The clusters were not very convincing. If anything, they were less effective than Approach 1, suggesting that the best path might be to completely revectorise with all the place names removed. I have not explored this yet.
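Removing place names before topic modelling can be sketched roughly as below. This assumes a hand-made list of names per plan; the place names and the `strip_place_names` helper are invented for illustration, and a real corpus would probably need extra handling for possessives and abbreviations.

```python
import re


def strip_place_names(text, place_names):
    """Remove each place name (case-insensitive, whole words) from text."""
    if not place_names:
        return text
    # Longest names first, so "Ambridge Green" is removed as a whole
    # rather than leaving "Green" behind after matching "Ambridge".
    ordered = sorted(place_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b",
        re.IGNORECASE,
    )
    stripped = pattern.sub(" ", text)
    # Collapse the gaps left behind by the removed names.
    return re.sub(r"\s+", " ", stripped).strip()


# Hypothetical policy text and place names.
cleaned = strip_place_names(
    "Housing in Ambridge Green must respect Ambridge village character.",
    ["Ambridge", "Ambridge Green"],
)
```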

Approach 3 — LDA
LDA doesn’t use the Ada-002 vector embeddings, and instead uses the frequency of occurrence of words to do the analysis. Again, it did not produce convincing clusters.
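To make ‘frequency of occurrence of words’ concrete, the sketch below builds the kind of word-count table that LDA implementations typically take as input. It does not run LDA itself, and the chunk texts and `bag_of_words` helper are made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(chunks):
    """Return (vocabulary, matrix) where matrix[i][j] is the number of
    times vocabulary[j] appears in chunks[i]."""
    counts = [Counter(re.findall(r"[a-z]+", chunk.lower())) for chunk in chunks]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, matrix


# Hypothetical policy chunks, much shorter than the real ones.
vocabulary, matrix = bag_of_words(
    ["Flood risk and flood defences", "Car parking"]
)
```

Word order is thrown away in this representation, which is one way it differs from the embedding-based approaches above.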

Approach 4 — Make Gemini do it
Neighbourhood Plans fit into Gemini’s context window, so I simply sent the whole document with the prompt:

Please return a list of 20 main topics discussed below. Return the list as a machine readable JSON list.
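Replies to a prompt like this may come back wrapped in extra text or a code fence, so it can help to parse them defensively. The sketch below assumes the reply is already available as a string; the `parse_topic_list` helper and the example reply are hypothetical, and the call to Gemini itself is left out.

```python
import json


def parse_topic_list(reply):
    """Extract a JSON list of topic strings from a model reply."""
    # Take everything from the first "[" to the last "]" so that any
    # surrounding chatter or code fence is ignored.
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON list found in reply")
    topics = json.loads(reply[start:end + 1])
    # Keep only non-empty strings, trimmed of stray whitespace.
    return [t.strip() for t in topics if isinstance(t, str) and t.strip()]


# Made-up reply, including the kind of wrapping a model might add.
fence = "`" * 3
example_reply = (
    "Here are the topics:\n"
    + fence + "json\n"
    + '["Housing", " Flood Risk Management ", ""]\n'
    + fence
)
topics = parse_topic_list(example_reply)
```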

I briefly experimented with clustering the returned topics, but ultimately, Gemini to the rescue again, I just asked Gemini to look through the whole list (about 600 topics) and pick out themes. This is the full extent of my week’s work, and I’ll paste the results below. Seeing this list made me realise that, having worked with Neighbourhood Plans quite a bit, I have a strong intuition about what should be in the list, so it might be a case of manually refining what Gemini has provided. There are some topics I’m interested in (e.g. Community Infrastructure Levy) irrespective of how common they are. ‘Quarrying’ is a total outlier and was not in the list of topics I supplied to Gemini.

Neighbourhood Plan topics

  • Housing
  • Environment
  • Economy
  • Infrastructure
  • Transportation
  • Heritage
  • Community Facilities
  • Public Rights of Way
  • Coastal Enhancements
  • Biodiversity
  • Geodiversity
  • Climate Change Mitigation
  • Economic Development
  • Visitor Accommodation
  • Lighting
  • Utility Infrastructure
  • Design
  • Retail Facilities
  • Green Spaces
  • Sustainable Development
  • Affordable Housing
  • Business Development
  • Tourism
  • Broadband and Mobile Connectivity
  • Landscape and Setting
  • Quarrying
  • Traffic Management
  • Car Parking
  • Conservation Areas
  • Flood Risk Management

Jimmy Tidey

PhD on digital systems for collective action and social network analysis. jimmytidey.co.uk/blog