Show HN: I used NLP to turn UK planning PDFs into a clean CSV(www.kaggle.com)

1 pointby david_s_data7 hours ago2 comments

physicsguy5 hours ago
(a) you're not extracting linked plans in the schema - for anything larger scale than a single house development, this is really common. For e.g. you might have 5-20 different planning applications on the same large(r) site for things like initial plans, planning for the entranceways from the road, planning for commercial signage, etc. etc. etc. and so many planning applications have a link. Even for single household developments this can be the same - planning permission expires after 3 years, so if there's a delay, sometimes subsequent planning applications can be made that are pretty much identical to the previous and usually these applications + approvals are boilerplate and noramlly approved unless there's been a major change to policy.
(b) You're missing a major caveating part which is that many properties have additional restrictions on them for development, and so two properties on the same street might get totally different planning outcomes as a result even for the same set of changes to the building (e.g. replacing windows would require permission on my house but wouldn't to my neighbours!). You thus should determine whether the property is listed or in a conservation area and whether an Article 4 direction applies to the property. These are almost always listed in planning permission documents but not on the forms itself. Often there's a heritage assessment or similar provided by a planning consultant too in the application. The first example in the Kaggle dataset is one of these - if you look at it, the property is in the Portland Estate conservation area and was refused because of a discrepancy in the plans.
(c) Each property has a unique property reference number (UPRN) - this pins it to a specific property and is more specific than postcode. This might be useful.
(d) Goes without saying but the reference number is only unique within a single local authority, so you need the local authority to be named in another column. The format is normally YY/counter so 25/01536 mean's it's the 1536th application in 2025. Some local authorities prefix with the name of the local authority
- david_s_data26 minutes ago
  Incredibly useful, thank you.
david_s_data7 hours ago
Hi everyone. I've been spending a lot of time looking at UK real estate data and realized that the actual valuable stuff (like the specific reasons why a council rejects a planning application) is buried in unstructured PDFs.
I decided to build an extraction pipeline to pull the policy breaches, officer notes and timelines, etc. out of those PDFs and into a clean CSV. I also had to write a quick script to strip out all the exact addresses and names down to the postcode level to avoid GDPR issues.
I just put a 50-row sample of the schema up on Kaggle. Before I burn money on compute to scale this to 10,000+ rows across London, I'd really appreciate a sanity check from anyone who works with spatial or proptech data. Are there any obvious columns or data points I'm completely missing here?