67 points by zdw 5 days ago | 9 comments
  • imglorp 5 days ago
    User-defined functions were implemented similarly, as external execs, in early shells. As the script was parsed, functions were dropped into /tmp without their wrappings and then called as external programs. Since they would still reference parameters as $1, $2, etc., it just worked: function bodies and standalone sh scripts had the same interface! Such a clever idea to avoid managing an interpreted call stack in the parent.
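A minimal sketch of that same interface symmetry in a modern POSIX shell (the path /tmp/greet and the body are invented for illustration; this is not the historical shell's actual code):

```shell
# A "function" body saved as a standalone script still reads its
# arguments as $1, $2, ... exactly as it would inside the shell.
cat > /tmp/greet <<'EOF'
#!/bin/sh
echo "hello $1 and $2"
EOF
chmod +x /tmp/greet
/tmp/greet alice bob    # prints: hello alice and bob
```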
  • miohtama 4 days ago
    The linked C source file is an excellent example of ancient C, from when it was still closer to high-level assembly:

    https://www.tuhs.org/cgi-bin/utree.pl?file=V2/cmd/glob.c

    • LeFantome 4 days ago
      I assume that you are referring to the liberal use of “goto”. Of course, “if”, “while”, and even “switch” are also used. Quite the mix.

      Directly calling into system calls (“write”) is interesting.

      • quuxplusone 4 days ago
        write(2) is POSIX. That's not "directly calling into a system call"; it's a normal C API from the POSIX header <unistd.h>.
      • kps 4 days ago
        The modern form of stdio only appears in Seventh Edition Unix.
    • tomtomtom777 4 days ago
      And when buffer overflows were avoided, or at least an attempt was made, by guesstimating a large enough buffer size.
  • ginko 5 days ago
    Why is there a period after etc in the title? Another example of HN's stupid automated title editing?
    • mkl 5 days ago
      Probably the submitter typed it on a phone instead of copy-paste and "etc" got autoincorrected.
  • yjftsjthsd-h 5 days ago
    > PS: I don't know why expanding shell wildcards used a separate program in V6 and earlier, but part of it may have been to keep the shell smaller and more minimal so that it required less memory.

    See, I thought it was a nice separation of concerns and wondered why we lost such a nice approach, until I read:

    > How escaping wildcards works in the V5 and V6 shell is that all characters in commands and arguments are restricted to being seven-bit ASCII. The shell and /etc/glob both use the 8th bit to mark quoted characters, which means that such quoted characters don't match their unquoted versions and won't be seen as wildcards by either the shell

    at which point I suddenly became a fan of ditching it. I do wonder if there's not some better way to factor that functionality out...
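Whatever replaces /etc/glob still has to carry the quoted/unquoted distinction through to matching; V6 used the 8th bit for that, while modern shells keep it in their token structures. A quick sketch of the behaviour being preserved (assumes an empty temporary directory):

```shell
cd "$(mktemp -d)"
touch file1 file2
echo f*      # unquoted: the shell expands it to  file1 file2
echo 'f*'    # quoted: stays a literal  f*
echo f\*     # backslash-escaped: also a literal  f*
```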

    • p_l 5 days ago
      An important thing to remember is that even after the move to the PDP-11, early Unix systems had to deal with 32 kB as the entire space available to a userland program, both code and data (including stack).
      • kjs3 4 days ago
        You mean 32k words, not 32k bytes, right[1]? And AFAIK by V5 or V6, Unix could use split instruction and data if the MMU supported it giving a bit more headroom. But, yeah, memory was very tight, and a lot of very clever tricks were used to get around it.

        [1] Even worse, the top 4kW/8kB was reserved for I/O.

        • p_l 3 days ago
          I meant 32k bytes - the PDP-11 was byte-addressed, not word-addressed. The 64kB address space was split in half between kernel and userland in the so-called "high and low moby" scheme (as it required only minimal logic latching onto a single address line).

          And for I/D split you needed appropriate CPU model.

          The top 8kB "I/O page" is reserved as part of the kernel space, not userspace, so it does not impact the userspace side as much.

          • kjs3 3 days ago
            Ah, I misunderstood your point. And while the PDP-11 was byte addressable, the doco often talked about memory size in words. Carry on.
    • Joker_vD 5 days ago
      Why would I want to factor out some syntactic functionality of one specific (and not very well thought out) shell to reuse, again?

      But if you really insist, you can write your own glob(1) that would invoke glob(3) for you, sure. There is also wordexp(3) although I believe its implementations had security problems for quite some time?

    • hnlmorg 5 days ago
      The way Murex works is that each parameter is first compiled into an AST, and then globbing only works against the unquoted tokens.

      Globbing is also a separate built in, which allows for other types of wildcard matches like regex too. Eg https://murex.rocks/tour.html#filesystem-wildcards-globbing

      So you have the best of both worlds: inline globbing for convenience, and wildcard matching as a function too.

    • BoingBoomTschak 4 days ago
      There's a sane language that never went that route: https://www.tcl.tk/man/tcl9.0/TclCmd/glob.html

      It also ditched another special case recently: the leading ~.

    • eru 5 days ago
      > at which point I suddenly became a fan of ditching it. I do wonder if there's not some better way to factor that functionality out...

      Just use backslash escaping like we do practically everywhere else in the Unix world?

      • rini17 5 days ago
        That's kind of a cure worse than the disease. Just ditch escaping completely.
        • eru 4 days ago
          Why? This is just for communication between the shell and its helper programs; the user wouldn't even see it.

          What do you not like about escaping?

          Of course, for program-to-program communication you can also use different techniques, instead of escaping. Escaping is just the most human-readable and human-producible.

          (As a simple example, to be able to represent all characters in a string, you can either escape quotes like \" or you can prefix the string with its length.

          Computers can work with either convention, but humans will hate you if they have to prefix every string literal with its length and keep that length in sync with the string.)
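The two conventions can be put side by side in a few lines of shell (a sketch; the length-prefixed form here is the "netstring" style):

```shell
s='say "hi"'
# In-band escaping: special characters are marked inside the data
printf 'escaped:  "%s"\n' "$(printf '%s' "$s" | sed 's/"/\\"/g')"
# Out-of-band length prefix: nothing inside the data is special
printf 'prefixed: %d:%s\n' "${#s}" "$s"
```

The second form is trivial for a program to produce but, as said above, miserable for a human to keep in sync by hand.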

          • rini17 4 days ago
            Are you aware that the main issue here is not with string literals, but with glob expansions? Literals are quite easy to check statically as mistakes usually cause havoc with surrounding code syntax. Even so, I avoid nontrivial use of them.

            But expansions and substitutions with escaping are the can of worms.

        • yjftsjthsd-h 4 days ago
          If you completely ditch escaping, how do you handle filenames that contain special characters (in this context, mostly ? and *, but ()[] are also perennial favorites)? And to preempt the most obvious answer: No, you can't just ban them, because existing OSs and filesystems allow them and you need interoperability.
          • rini17 4 days ago
            There are ways, no idea why doing anything here is so reviled.

            Find and xargs can delimit filenames by NUL, which is not allowed in filenames. Best practice in SQL was to abandon parameter escaping completely and pass parameters out of band. For internal representation, use array data structures with length information.

            Actually, would it be that bad, to ban * and ? in filenames? If you accept them in the name of interop, something inevitably breaks later. Better to fail upfront. Many applications do sanitize filenames already and when they need to use binary data as file name, convert it to hex instead. It's a hassle otherwise.
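The NUL-delimiting approach in runnable form (assumes GNU find/xargs and an empty temporary directory):

```shell
cd "$(mktemp -d)"
touch 'weird * name?' plain
# NUL cannot occur in a filename, so no escaping is ever needed
find . -maxdepth 1 -type f -print0 | xargs -0 -n1 echo got:
```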

            • eru 4 days ago
              > Actually, would it be that bad, to ban * and ? in filenames?

              That's possible, if you design your filesystem from scratch.

              But if you take your filesystem as given for now (with its ability to represent all kinds of interesting characters), and just want to design globbing, you have to solve this problem. Otherwise you have a tool that can only handle some files. That's what GNU Make does, btw. Try handling any file or output with whitespace in the name in Make, if you want some frustration.

              Yes, null-termination works for the specific problem of termination. Though if you just use program-to-program communication, you can also prefix your strings with their length.

              > If you accept them in the name of interop, something inevitably breaks later.

              Why? That's only the case when you have legacy software written by less-than-careful people. There's no reason to expect breakage when you are designing new software, just like the people in the article were doing. (Of course, back then they didn't know what they were doing, so we have a lot of breakage historically.)

              But for the very specific purpose of the shell talking to a helper program for globbing, you can control exactly what's happening, including all the encoding and decoding (or escaping and unescaping). So there's no unexpected breakage.

              And btw, you also need to give the human a way to specify a literal * in a filename, too. Not just for communication between programs.

              > Best practice in SQL was to abandon parameters escaping completely and pass them out of band.

              Yes, that's partially because SQL is such a complicated language, and because you are talking about program-to-program communication anyway, so you don't need to be human-friendly there. So communicating them on a separate channel is the simplest thing that covers all cases.

  • timewizard 5 days ago
    Sweet.

    I use xterm.js a lot and have a "shell backbone" that I use to make shell based access to APIs, S3 and other things "cloud." This is essentially how I implement globbing as well. The convenience is that you can run glob by itself to get an idea of exactly what kind of automated nightmare you are about to kick off.

    Anyways.. mine currently has V3 behavior. My shell command exec routine could actually benefit from that hack. What's old is new again?

  • amelius 5 days ago
    Recent versions of Bash don't expand * (et cetera) patterns when there is no match. Although that's sometimes useful, I still feel it's a hack.
    • pwg 4 days ago
      The action to take upon no match is configurable in recent Bash versions.

      The 'failglob' shopt option will cause an error to be generated if a glob matches nothing.

      The 'nullglob' shopt option toggles between no match expanding to an empty string and the traditional default of no match leaving the glob characters untouched.

    • Joker_vD 4 days ago
      That's been around since the original Bourne shell; /etc/glob, from what I can see of its source, would refuse to run the command if the resulting expansion turned out completely empty, and globs with no matches would simply be removed.
      • amelius 4 days ago
        That's not how it works in recent Ubuntu releases. If there is no match, the command runs with the wildcard chars not substituted.

            # echo foo*bar
            foo*bar
        • Joker_vD 4 days ago
          Yes, this current behaviour was introduced by the original Bourne shell and then it stuck for some silly reason or another (it probably has some fringe use cases but they elude me). Thompson's original shell, or rather, /etc/glob, at various versions implemented the mix of behaviours that would later be reintroduced as nullglob and failglob options in Bash.
  • gjvc 5 days ago
    binaries in /etc/ -- i mean __really__
    • NekkoDroid 5 days ago
      Fun fact: the Linux kernel itself actually looks for `/etc/init` before it even looks for `/bin/init`

      https://github.com/torvalds/linux/blob/4a5df37964673effcd9f8...

    • stevekemp 5 days ago
      Even now you'll come across this, for example "/etc/rmt" probably exists, and other tape-related binaries if installed.
    • tedunangst 5 days ago
      Yes, really. That's what /etc was for.
      • gjvc 4 days ago
        I know. I'm saying it's sick. I hate computers.
        • kps 4 days ago
          Why sick? That was the directory for binaries that weren't meant to be run directly — `getty`, `login`, etc.

          Today there's much more software, so some things got moved into finer-grained locations like /libexec and /sbin. That wasn't the case in the /etc/glob era when the entire UNIX system was smaller than today's average web page.

          • gjvc 4 days ago
            and /sbin was full
  • rollcat 5 days ago
    This is php.ini level of madness, and I'm glad it's gone from (semi-)modern shells. A formal (e.g. programming) language should be defined in its entirety by its formal grammar, its semantics by a formal spec, etc. There's barely any good reason to let the system administrator change the logic and semantics of deployed code.

    You could argue that Lisp reader macros also somewhat violate this rule. As a longtime Lisp fan, I dislike reader macros, but I'm more conflicted about macros in general. A good macro system should aim to provide enough context for IDEs and LSPs to aid the developer, but Lisp macros are entirely about just transforming the AST. It's usually just better to evolve the language.

    • JdeBP 5 days ago
      It's not there to give the system administrator flexibility. It's there because early Unix was heavily constrained, and doing things with lots of little overlays (and what was decades later known as "Bernstein chaining") rather than one big program was the way to architect stuff. exit(1), goto(1), and if(1) were all external commands in the Thompson shell.

      * https://v6sh.org

      • rollcat 5 days ago
        I would argue with almost anyone else, that this is a poor design, but...

        Thank you for your perspective, work, and contributions.

        • pwg 4 days ago
          You are likely looking at this design from a modern system perspective.

          But the PDP-11 systems that many of these designs were made upon had a minimum memory size of 4 kB, and the various models had maximum memory sizes smaller than a single JPEG photo in today's world: the PDP-11/45 maxed out at 256 kB, the PDP-11/70 at 4 MB.

          And this was the total memory for everything, the OS, and the users, and the system supported multiple users sharing the same machine at the same time.

          With those resource constraints, the design rules that determine good from poor are radically different than with one of today's systems with multiple Gb of RAM.

          • JdeBP 4 days ago
            Also remember that in the early days Unix did segment swapping, with demand paging only coming in with BSD and the VAX. So there was no paging in just a tiny part of a big executable.
        • hnlmorg 5 days ago
          The other thing to bear in mind is that it’s undergone literally decades of evolution while still being backwards compatible.

          The shells weren’t originally intended to be Turing complete. They were just a job launcher. What you use today would have been unimaginable when these shells were first designed.

          Whereas all other programming languages have had a drastically smaller evolution in comparison and yet still had a worse compatibility story.

          It’s very easy to be critical of the Bourne shell (and compatible shells) because they are archaic by modern standards. But they weren’t written to solve modern problems. It’s like looking at a bicycle and complaining that the designers didn’t build a sports car, while ignoring that the technology didn’t exist and that push bikes are still good enough for millions to use daily.

    • ars 4 days ago
      What in the world is "php.ini level of madness"?

      If you are trying to attack PHP, you are not doing a good job of it, especially because there were good reasons for using a separate program for glob.

      • rollcat 12 hours ago
        I didn't consider the historical context, which several people in this thread provided. I already knew that "/etc" used to literally mean "etcetera" - "whatever doesn't fit elsewhere", but didn't immediately connect the dots that "/etc/glob" was still considered a fixed part of the system, and wasn't meant to be substituted by the administrator.

        I won't argue about PHP. I've dealt with it while there was money to be made from that, and moved on as soon as I had the chance. ¯\_(ツ)_/¯