An example that shocked me was using an XML translation of C for better vector search. The lack of curly braces made the model return much more relevant code than anything else, including enriching the database with ctags.
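I don't have the original pipeline handy, but a minimal sketch of the kind of rewrite I mean might look like this (the `<block>` tag scheme is just for illustration, and a real version would need to handle braces inside strings and comments):

```python
# Illustrative only: rewrite C brace blocks as explicit XML-ish tags before
# embedding the code for vector search. Naive on purpose; it ignores braces
# that appear inside string literals or comments.
def braces_to_xml(c_source: str) -> str:
    out = []
    for ch in c_source:
        if ch == "{":
            out.append("<block>")
        elif ch == "}":
            out.append("</block>")
        else:
            out.append(ch)
    return "".join(out)

print(braces_to_xml("int add(int a, int b) { return a + b; }"))
# -> int add(int a, int b) <block> return a + b; </block>
```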
>GlyphLang is specifically optimized for how modern LLMs tokenize.
This is extremely dubious. The vocabulary of tokens isn't conserved within model families, let alone across entirely different types of models. The only thing they are all good at is tokenizing English.
That's an absolutely fair point about tokenizer variance - vocabularies do differ between models - but the symbols GlyphLang uses are ASCII characters that tokenize as single tokens across the GPT-4, Claude, and Gemini tokenizers. The optimization isn't model-specific; it targets the common case of "ASCII char = 1 token". I could definitely reword my post though - looking at it more closely, it does read more as "fix-all" rather than "fix-most".
Regardless, I'd genuinely be interested in seeing failure cases. It would be incredibly useful data to see if there are specific patterns where symbol density hurts comprehension.
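For what it's worth, here's a rough sanity check of the "ASCII char = 1 token" claim; it only covers the public OpenAI encodings, since the Claude and Gemini tokenizers aren't available as local libraries:

```python
# Sketch: verify that the bare GlyphLang-style sigils encode as single tokens
# in the public OpenAI encodings. Counts can change once surrounding whitespace
# or adjacent characters are included, so treat this as a spot check, not proof.
import tiktoken

sigils = ["@", "$", ">", "{", "}"]
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    counts = {s: len(enc.encode(s)) for s in sigils}
    print(name, counts)
```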
Sure, you can represent the same code in fewer tokens, but I doubt it'll get those tokens correct as often.
For example see this prompt describing an app: https://textclip.sh/?ask=chatgpt#c=XZTNbts4EMfvfYqpc0kQWpsEc...
The approach with GlyphLang is to make the source code itself token-efficient. When an LLM reads something like `@ GET /users/:id { $ user = query(...) > user }`, that's what gets tokenized (not a decompressed version). The reduced tokenization persists throughout the context window for the entire session.
That said, I don't think they're mutually exclusive. You could use textclip.sh to share GlyphLang snippets and get both benefits.
Please check your idea against tiktokenizer.
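Something like this, for a quick local version of that check (the verbose equivalent below is my own guess at what the route would expand to, not anything from the project):

```python
# Sketch: compare token counts for the GlyphLang-ish route quoted earlier in the
# thread against a hand-written verbose equivalent. Both snippets are illustrative.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o-family encoding

glyph = "@ GET /users/:id { $ user = query(...) > user }"
verbose = (
    'app.get("/users/:id", async (req, res) => {\n'
    "  const user = await query(...);\n"
    "  res.json(user);\n"
    "});"
)

for label, text in (("glyph", glyph), ("verbose", verbose)):
    print(f"{label:8s}{len(enc.encode(text)):4d} tokens")
```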
> In practice, that means more logic fits in context, and sessions stretch longer before hitting limits. The AI maintains a broader view of your codebase throughout.
This is one of those 'intuitions' that I've also had. However, I haven't found any convincing evidence for or against it so far.
In a similar vein, this is why `reflex`[0] intrigues me. IMO their value prop is "LLMs love Python, so let's write entire apps in Python". But again, I haven't seen any hard numbers.
Anyone seen any hard numbers to back this?
Very underbaked but https://github.com/jaggederest/locque
I'm curious whether you optimized for the ability to generate functioning code or just for tokenization compression rate, which LLMs you tokenized for, and what your optimization process was like.
Sniffable Python: useful for Anthropic skill sister scripts, and in general.
https://github.com/SimHacker/moollm/tree/main/skills/sniffab...
It already knows python and javascript and markdown and yaml extremely well, so it requires zero tokens to teach it those languages, and it doesn't need to be taught a new language it's never seen before from the ground up in each prompt.
You are treating token count as the only bottleneck, rather than comprehension fidelity.
Context window management is a real problem, and designing for generation is a good instinct, but you need to design for what LLMs are already good at, not design a new syntax they have to learn.
jaggederest's opposite approach (full English words, locque) is actually more aligned with how LLMs work -- they're trained on English and understand English-like constructs deeply.
noosphr's comment is devastating: "Short symbols cause collisions with other tokens in the LLMs vocabulary." The `@` in `@ GET /users/:id` activates Python decorator associations, shell patterns, email patterns, and more. The semantic noise may outweigh the token savings.
Perl's obsessive fetish for compact syntax, sigils, punctuation, performative TMTOWTDI one-liners, to the point of looking like line noise, is why it's so terribly designed and no longer relevant or interesting for LLM comprehension and generation.
I think the ideal syntax for LLM language understanding and generation are markdown and yaml, with some python, javascript, and preferably typescript thrown in.
As much as I have always preferred JSON to YAML, YAML is inarguably better for LLMs. It beats JSON because it avoids entropy collapse, has less syntax, and leaves more tokens and energy for solving problems instead of parsing and generating syntax! Plus, it has comments, which are a game changer for comprehension, in both directions.
https://x.com/__sunil_kumar_/status/1916926342882594948
>sunil kumar: Changing my model's tool calling interface from JSON to YAML had surprising side effects.
>Entropy collapse is one of the biggest issues with GRPO. I've learned that small changes to one's environment can have massive impacts on performance. Surprisingly, changing from JSON to YAML massively improved generation entropy stability, yielding much stronger performance.
>Forcing a small model to generate properly structured JSON massively constrains the model's ability to search and reason.
YAML Jazz:
https://github.com/SimHacker/moollm/blob/main/skills/yaml-ja...
YAML Jazz: Why Comments Beat Compression
The GlyphLang approach treats token count as THE bottleneck. Wrong. Comprehension fidelity is the bottleneck.
The LLM already knows YAML from training. Zero tokens to teach it. Your novel syntax costs millions of tokens per context window in docs, examples, and corrections.
Why YAML beats JSON for LLMs:
Sunil Kumar (Groundlight AI) switched from JSON to YAML for tool calling and found it "massively improved generation entropy stability."
"Forcing a small model to generate properly structured JSON
massively constrains the model's ability to search and reason."
JSON pain:
- Strict bracket matching {}[]
- Mandatory commas everywhere
- Quote escaping \"
- NO COMMENTS ALLOWED
- Rigid syntax = entropy collapse

YAML wins:
- Indentation IS structure
- Minimal delimiters
- Comments preserved
- Flexible = entropy preserved
The killer feature: comments are data.

  timeout: 30   # generous because API is flaky on Mondays
  retries: 3    # based on observed failure patterns

The LLM reads those comments. Acts on them. JSON strips this context entirely.

On symbol collision: noosphr nails it. Short symbols like `@` activate Python decorators, shell patterns, email patterns simultaneously. The semantic noise may exceed the token savings.
Perl's syntax fetish is why it's irrelevant for LLM generation. Dense punctuation is anti-optimized for how transformers tokenize and reason.
The ideal LLM syntax: markdown, yaml, typescript. Languages it already knows cold.
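If you want to poke at the JSON-vs-YAML point yourself, here's a rough sketch. The config, comments, and numbers are invented for illustration, and the commented YAML is written out by hand the way it would appear in a prompt (YAML libraries drop comments on serialization):

```python
# Sketch: serialize the same (invented) config as JSON and as hand-written YAML
# with comments, then compare token counts. The point isn't that YAML is always
# shorter; it's that the comments ride along as context the model can read.
import json
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

config = {"timeout": 30, "retries": 3, "endpoints": ["/users", "/orders"]}

as_json = json.dumps(config, indent=2)
as_yaml = (
    "timeout: 30   # generous because API is flaky on Mondays\n"
    "retries: 3    # based on observed failure patterns\n"
    "endpoints:\n"
    "  - /users\n"
    "  - /orders\n"
)

for label, text in (("json", as_json), ("yaml", as_yaml)):
    print(f"{label}: {len(enc.encode(text))} tokens")
```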
Additionally, I have two thoughts about it:
1. I think this might be more practical as a transparent layer, so users write in and get back Golang (or whatever the original language was). Essentially, make it something only the model reads/outputs.
2. Longer term, it seems like both NVidia and AMD, along with the companies training/running the models, are focused on driving down cost per token because it's just too damn high. And I personally don't see a world where AI becomes pervasive without a huge drop in cost per token; it's not sustainable for the companies running the models, and end users really can't afford the real costs as they are today. My point being: will this even be necessary in 12-18 months?
I could totally be missing things or lacking the vision of where this could go, but I personally would worry that anything written with this has a very short shelf life.
That's not to say it's not useful in the meantime, or that it's not a cool project. More that if there is a longer-term vision for it, I think it would be worth calling out.