But you're right, with transliteration it's much harder to guess, because the sounds and letter combinations are not typical and the words are unfamiliar. So you just guess a bit and then get corrected when you hear the sound (e.g., in the song).
I’m just guessing here, albeit as someone with linguistic training: toponyms in a given region are typically formed from a limited inventory of words (“topoformants”), possibly extended by a “specific” such as the name of a landowner or a tribe. Speakers growing up in a region will subconsciously learn the typical topoformants and will therefore be able to read at least those without the vowel markings.
Also, don’t forget that Arabic does write the long vowels, through the use of matres lectionis. It’s the very early Semitic inscriptions, from before this device was invented, that I am amazed anyone could read.