(b) You're missing a major caveat: many properties carry additional restrictions on development, so two properties on the same street can get totally different planning outcomes for the same set of changes to the building (e.g. replacing windows would require permission on my house but wouldn't on my neighbour's!). You should therefore determine whether the property is listed, whether it sits in a conservation area, and whether an Article 4 direction applies. These are almost always noted in the planning decision documents but not on the forms themselves. Often the application also includes a heritage assessment or similar prepared by a planning consultant. The first example in the Kaggle dataset is one of these: if you look at it, the property is in the Portland Estate conservation area and was refused because of a discrepancy in the plans.
(c) Each property has a Unique Property Reference Number (UPRN). This pins it to a specific property and is more precise than a postcode. It might be useful to include.
(d) It goes without saying, but the reference number is only unique within a single local authority, so you need the local authority named in another column. The format is normally YY/counter, so 25/01536 means it's the 1,536th application in 2025. Some local authorities prefix the reference with the name of the local authority.
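As a minimal sketch of parsing that reference format (the optional authority prefix and the two-digit-year pivot are my assumptions, not something every authority follows):

```python
import re

# Parse a planning application reference of the form [AUTHORITY/]YY/NNNNN,
# e.g. "25/01536" -> year 2025, application counter 1536.
# The prefix handling and the century pivot below are assumptions.
REF_RE = re.compile(r"^(?:(?P<authority>[A-Za-z]+)/)?(?P<year>\d{2})/(?P<counter>\d+)$")

def parse_reference(ref):
    """Return a dict of reference parts, or None if the format doesn't match."""
    m = REF_RE.match(ref.strip())
    if m is None:
        return None
    yy = int(m.group("year"))
    return {
        "authority_prefix": m.group("authority"),      # None when absent
        "year": 2000 + yy if yy < 50 else 1900 + yy,   # assumed pivot at '50'
        "counter": int(m.group("counter")),
    }
```

Something like this would also let you validate the reference column before joining across authorities.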
I decided to build an extraction pipeline to pull the policy breaches, officer notes, timelines, etc. out of those PDFs and into a clean CSV. I also had to write a quick script to strip exact addresses and names down to the postcode level to avoid GDPR issues.
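The postcode-level step could be sketched roughly like this (this is my own illustration, not the author's script; the regex is a simplified approximation of UK postcode formats, not the full spec):

```python
import re

# Simplified UK postcode pattern: outward code (e.g. "NW1", "SW1A")
# followed by inward code (e.g. "8XY"). Not exhaustive.
POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")

def to_postcode(address):
    """Reduce a full address string to just its postcode, or None if not found."""
    m = POSTCODE_RE.search(address.upper())
    return m.group(0) if m else None
```

In a real pipeline you'd likely run this per row and drop the original address and applicant-name columns entirely.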
I just put a 50-row sample of the schema up on Kaggle. Before I burn money on compute to scale this to 10,000+ rows across London, I'd really appreciate a sanity check from anyone who works with spatial or proptech data. Are there any obvious columns or data points I'm completely missing here?