I do like the principle of trying to use semantic HTML, but I don't think things like this ever got the kind of mass adoption that would give them staying power. Still, it's a nice nostalgia trip to see it here.
Agree with the mass adoption part. Back then I was so hopeful for what the web might become... Those were the days before social media blew up.
For a while, Google gave you good boy points for including microformats, and they still offer tests and validators [0] to tell you what the crawlers get out of your page. Supposedly microformats would not just give you a better SEO ranking but also help Google connect people to their accounts across sites (a bit like the fediverse), so that you could surface things relevant to a person by searching for that person.
[0] https://developers.google.com/search/docs/appearance/structu...
Google was a driver in practice. Accessibility and better web experiences were important to those involved, but the reality was that people interested in this area were at the bleeding edge. Many sites still used tables for layout, and Flash was still a default option for some in the period when microformats emerged.
People always mention RDF when the semantic web comes up. It's really important to understand where the W3C was in the early 2000s and that RDF was driven by people with an academic bent. No one working with microformats was interested in anything beyond the RDF basics, because the rest was too impractical for web devs to use. Part of this was complexity (OWL, anyone?), but the main part was browser and tool support.
There's nothing wrong with RDF itself; the modern plain-text and JSON serializations are very simple and elegant. Even things like OWL are being reworked now with efforts like SHACL and ShEx (see e.g. https://arxiv.org/abs/2108.06096 for a description of how these relate to the more logical/formal, OWL-centered point of view).
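To give a concrete sense of how lightweight the modern serializations are, here's a minimal sketch using Python's rdflib; the example.org URI and the FOAF terms are just placeholders I picked for illustration:

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import FOAF, RDF

    # Build a tiny graph: one resource, two triples.
    g = Graph()
    alice = URIRef("https://example.org/people/alice")
    g.add((alice, RDF.type, FOAF.Person))
    g.add((alice, FOAF.name, Literal("Alice Example")))

    # The same data in the plain-text (Turtle) and JSON (JSON-LD) serializations.
    # JSON-LD output needs rdflib >= 6, where the serializer is built in.
    print(g.serialize(format="turtle"))
    print(g.serialize(format="json-ld"))

The Turtle output is just a handful of prefix declarations plus the triples, which is about as far from the old RDF/XML experience as you can get.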
Microformats were the alternative: simpler, focused, and fully IE-compatible.
The W3C tried the proper Semantic Web again with RDF, RDFa, and JSON-LD. HTML5 tried Microdata, a compromise between the extensibility of RDF and the simplicity of Microformats, but nothing really took off.
Eventually HTML5 gave up on it and took the position that invisible metadata should be avoided. Page authors (outside the bunch who have Valid XHTML buttons on their pages) tend to implement and maintain only the minimum needed for human visitors, so on the Web invisible markup has a systemic disadvantage. It rarely exists at all, and when it does it can be invalid, out of date, or, most often, SEO spam.
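For what it's worth, this is roughly what that invisible layer looks like from a crawler's side. A small sketch with BeautifulSoup, where the embedded schema.org Product snippet is made up for illustration:

    import json
    from bs4 import BeautifulSoup

    # A made-up page with a JSON-LD island that a human visitor never sees.
    html = """
    <html><head>
    <script type="application/ld+json">
    {"@context": "https://schema.org", "@type": "Product",
     "name": "Bookshelf", "height": "202 cm", "width": "80 cm"}
    </script>
    </head><body><h1>Bookshelf</h1></body></html>
    """

    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script", type="application/ld+json"):
        data = json.loads(tag.string)  # invisible to humans, trusted (or not) by crawlers
        print(data["@type"], data.get("name"))

Nothing in the visible page forces that block to stay accurate, which is exactly the maintenance problem being described.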
The schema.org vocab is still actively maintained; the latest major version came out last March, with the latest minor release in September.
I personally think semweb-related technologies could play a significant and productive role in synthetic data generation, but that's a whole different conversation that is beyond the current era of LLMs.
If you show me an LLM that can take in serialized RDF and perform reasoning on it, I will be surprised. Once the LLM takes in the RDF serialization, it's a dead end for that knowledge: in principle, you can't rely on anything the LLM does with it.
In a world of LLMs, it makes much more sense to put the semweb technologies alongside the training step instead. You create ontologies and generate text from triples, then feed the generated text to the model as training data. This is good because you can tweak and transform the triples-to-text mechanism in all sorts of ways (you can tune the data while retaining its meaning).
It doesn't make much sense to do it now, but if (or when?) training data becomes scarce, converting triples to text might be a viable approach for synthetic data, much more stable than having models themselves generate text.
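A toy sketch of that triples-to-text step, assuming a hand-rolled template per predicate (the triples, templates, and numbers here are purely illustrative):

    # Verbalize triples into natural-language training text.
    TRIPLES = [
        ("Berlin", "capitalOf", "Germany"),
        ("Berlin", "population", "about 3.7 million"),
    ]

    TEMPLATES = {
        "capitalOf": "{s} is the capital of {o}.",
        "population": "{s} has a population of {o}.",
    }

    def verbalize(triples):
        for s, p, o in triples:
            yield TEMPLATES[p].format(s=s, o=o)

    print(" ".join(verbalize(TRIPLES)))
    # -> Berlin is the capital of Germany. Berlin has a population of about 3.7 million.

Swapping or paraphrasing the templates gives you different surface forms for the same triples, which is the "tune the data while retaining its meaning" part.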
It's easy: you just ask the LLM to convert your question to a SPARQL query, then you sanity-check it and run it on your dataset. The RDF input step is just so the LLM knows what your schema looks like in the first place.
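That's only the second half of the loop, but here's a minimal sketch of running such a query against DBpedia with SPARQLWrapper; the query below is hand-written as an example of the kind of thing you'd sanity-check after an LLM produced it:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX dbr: <http://dbpedia.org/resource/>
        SELECT ?capital WHERE { dbr:Germany dbo:capital ?capital . }
    """)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["capital"]["value"])  # -> http://dbpedia.org/resource/Berlin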
Can you put together a quick demonstration using a publicly available model and the DBpedia SPARQL endpoint?
And even before thinking about that, you can simply put the dimensions in the description, which some sites (like Ikea) do, and Google is definitely able to pick up on that; no RDFa was ever needed. As far as I can tell, LLMs can work that out just fine as well.
The problem with the metadata discussion is that if the metadata is actually useful, there is no reason it isn't useful to humans as well. So instead of trying to make humans work for the machine, it is much better to make the machine understand humans.
And there is the fact that people and orgs lie or maintain things poorly, so you cannot trust any open tag/schema any more than the data it is supposed to describe. You end up in a conundrum where your data may be sketchy, but on top of that the schema/tags may be even worse. At the end of the day, if you have to trust one thing, it's got to be the data that is readily visible to everyone, because it's more likely to be accurate and it's what's actually relevant.
To top all of that, tagging your stuff with RDFa just makes it easier for Google and others to parse it and then exploit the data/information without ever sending anyone to your site. If you are Wikipedia, that's mostly fine, but almost anyone else would benefit from receiving the traffic, for a chance at a conversion/transaction.
All these metadata schemes are really an idealized academic endeavor; they may make sense for a few noncommercial things, but the reality is that most websites need to find a way to pay the bills, and making it easier for others to exploit your work is largely self-defeating.
And yes, LLMs didn't even need any of that to vacuum up most of the web and still derive meaning, so at the end of the day it's mostly a waste of time...