24 aug 2020 @ justine's web page
One day, while studying old code, I found out that it's possible to encode Windows Portable Executable files as a UNIX Sixth Edition shell script, due to the fact that the Thompson Shell didn't use a shebang line. Once I realized it's possible to create a synthesis of the binary formats being used by Unix, Windows, and MacOS, I couldn't resist the temptation of making it a reality, since it means that high-performance native code can be almost as pain-free as web apps. Here's how it works:
MZqFpD='
BIOS BOOT SECTOR'
exec 7<> $(command -v $0)
printf '\177ELF...LINKER-ENCODED-FREEBSD-HEADER' >&7
exec "$0" "$@"
exec qemu-x86_64 "$0" "$@"
exit 1
REAL MODE...
ELF SEGMENTS...
OPENBSD NOTE...
MACHO HEADERS...
CODE AND DATA...
ZIP DIRECTORY...
I started a project called Cosmopolitan which implements the αcτµαlly pδrταblε εxεcµταblε format. I chose the name because I like the idea of having the freedom to write software without restrictions that transcends traditional boundaries. My goal has been helping C become a build-once run-anywhere language, suitable for greenfield development, while avoiding any assumptions that would prevent software from being shared between tech communities. Here's how simple it is to get started:
gcc -g -O -static -fno-pie -no-pie -mno-red-zone -nostdlib -nostdinc -o hello.com hello.c \
  -Wl,--oformat=binary -Wl,--gc-sections -Wl,-z,max-page-size=0x1000 -fuse-ld=bfd \
  -Wl,-T,ape.lds -include cosmopolitan.h crt.o ape.o cosmopolitan.a
In the above one-liner, we've basically reconfigured the stock compiler on Linux so it outputs binaries that'll run on MacOS, Windows, FreeBSD, and OpenBSD too. They also boot on bare metal, which is a work in progress. Please note this is intended for people who don't care about desktop GUIs, and just want stdio and sockets without devops toil.
Who could have predicted that cross-platform native builds would be this easy? As it turns out, they're surprisingly cheap too. Even with all the magic numbers, win32 utf-8 polyfills, and bios bootloader code, exes still end up being roughly 100x smaller than Go Hello World:
life.com is 12kb (symbols, source)
hello.com is 16kb (symbols, source)
Please note that zsh has a minor backwards compatibility glitch with Thompson Shell, so try sh hello.com rather than running the file directly.
That one thing aside, if it's this easy, why has no one done this before? The best answer I can give is that it requires a minor ABI change, where C preprocessor macros relating to system interfaces need to be symbolic. This is barely an issue, except in cases where the language demands a compile-time constant, such as a case label in a switch statement. If we feel comfortable bending the rules, then the GNU Linker can easily be configured to generate at link time all the PE/Darwin data structures we need, without any special toolchains.
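To make the ABI change concrete, here is a minimal sketch in C. The names kEagain, kEnoent, HostOs, InitMagnums, and DescribeErrno are hypothetical and are not taken from the Cosmopolitan codebase. The sketch only illustrates the consequence described above: once a system constant becomes a variable that is filled in at startup, it can no longer serve as a case label, so an if chain takes its place. The values used are the familiar ones (EAGAIN is 11 on Linux and 35 on BSD-derived systems, ENOENT is 2 on both).

```c
/* Hypothetical stand-ins for symbolic system constants. In a conventional
   libc these would be preprocessor macros with fixed values; here they are
   variables whose values are chosen once the host OS is known. */
static int kEagain;
static int kEnoent;

enum HostOs { HOST_LINUX, HOST_BSD };

/* Fill in the constants for the detected host. The numbers are
   illustrative: EAGAIN is 11 on Linux and 35 on BSD-derived systems,
   while ENOENT is 2 on both. */
void InitMagnums(enum HostOs os) {
  kEnoent = 2;
  kEagain = (os == HOST_LINUX) ? 11 : 35;
}

/* Writing `switch (e) { case kEagain: ... }` would not compile, because
   case labels must be integer constant expressions. An if chain works. */
const char *DescribeErrno(int e) {
  if (e == kEagain) return "EAGAIN";
  if (e == kEnoent) return "ENOENT";
  return "unknown";
}
```

Calling InitMagnums(HOST_BSD) and then DescribeErrno(35) yields "EAGAIN", while the same 35 is reported as "unknown" after InitMagnums(HOST_LINUX). Code that already compares errno with if statements needs no change at all, which is why the restriction rarely bites.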
Single-file executables are nice to have. There are a few cases where static executables depending on system files make sense, e.g. zoneinfo. However, we can't make that assumption if we're building binaries intended to run on multiple distros with Windows support too.
As it turns out, PKZIP was designed to place its magic marker at the end of file, rather than the beginning, so we can synthesize ELF/PE/MachO binaries with ZIP too! I was able to implement this efficiently in the Cosmopolitan codebase using a few lines of linker script, along with a program for incrementally compressing sections.
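The reason a ZIP archive can coexist with other headers is that readers locate it from the back of the file. Below is a minimal sketch, in C, of how a reader might find the end-of-central-directory record by scanning backward for its "PK\5\6" signature. FindZipEocd is a hypothetical helper written for illustration, not code from Cosmopolitan, and it is simplified: a robust reader would also check that the comment length stored in the record is consistent with the file size.

```c
#include <stddef.h>

/* Scan backward for the ZIP end-of-central-directory signature "PK\5\6".
   The record is at least 22 bytes long and may be followed by a comment
   of up to 65535 bytes, so only the tail of the file is searched.
   Returns the offset of the record, or -1 if none is found. */
long FindZipEocd(const unsigned char *buf, size_t size) {
  const size_t kMinRecord = 22;
  const size_t kMaxComment = 65535;
  size_t i, stop;
  if (size < kMinRecord) return -1;
  stop = size > kMinRecord + kMaxComment ? size - kMinRecord - kMaxComment : 0;
  /* i runs from size - 22 down to stop, inclusive. */
  for (i = size - kMinRecord + 1; i-- > stop;) {
    if (buf[i] == 'P' && buf[i + 1] == 'K' && buf[i + 2] == 5 &&
        buf[i + 3] == 6) {
      return (long)i;
    }
  }
  return -1;
}
```

Because the search never looks at the first bytes of the file, whatever sits there (an MZ header, an ELF header, a shell script) is invisible to it.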
It's possible to run unzip -vl executable.com to view its contents. It's also possible on Windows 10 to change the file extension to .zip and then open it in Microsoft's bundled ZIP GUI. Having the flexibility to easily edit assets post-compilation means we can do things like build an interpreter that reflectively loads interpreted sources via zip.
hellojs.com is 312kb (symbols, source)
Cosmopolitan also uses the ZIP format to automate compliance with the GPLv2 [update 2020-12-28: APE is now licensed ISC]. The non-commercial libre build is configured, by default, to embed any source file linked from within the hermetic make mono-repo. That makes binaries roughly 10x larger. For example:
life2.com is 216kb (symbols, source)
hello2.com is 256kb (symbols, source)
Rock musicians have a love-hate relationship with dynamic range compression, since it removes a dimension of complexity from their music, but is necessary in order to sound professional. Bloat might work by the same principles, in which case, zip source file embedding could be a more socially conscious way of wasting resources in order to gain appeal with the non-classical software consumer.
It wasn't until very recently in computing history that a clear shakeout occurred with hardware architectures, which is best evidenced by the TOP 500 list. Outside phones, routers, mainframes, and cars, the consensus surrounding x86 is so strong that I'd compare it to the Tower of Babel. Thanks to Linus Torvalds, we not only have a consensus on architecture, but we've come pretty close to having a consensus on the input output mechanism by which programs communicate with their host machines, via the SYSCALL instruction. He accomplished that by sitting at home in a bathrobe sending emails to huge corporations, getting them to agree to devote their resources to creating something beautifully opposite to the tragedy of the commons.
So I think it's really the best of times to be optimistic about systems engineering. We agree more on sharing things in common than we ever have. There are still outliers like the plans coming out of Apple and Microsoft we hear about in the news, where they've sought to pivot PCs towards ARM. I'm not sure why we need a C-Class Macintosh, since the x86_64 patents should expire this year. Apple could have probably made their own x86 chip without paying royalties. The free/open architecture that we've always dreamed of, might turn out to be the one we're already using.
If a microprocessor architecture consensus finally exists, then I believe we should be focusing on building better tools that help software developers benefit from it. One of the ways I've been focusing on making a contribution in that area is by building a friendlier way to visualize the impact that x86-64 execution has on memory. It should hopefully clarify how αcτµαlly pδrταblε εxεcµταblε works.
You'll notice that execution starts off by treating the Windows PE header as though it were code. For example, the ASCII string "MZqFpD" decodes as pop %r10 ; jno 0x4a ; jo 0x4a and the string "\177ELF" decodes as jg 0x47. It then hops through a mov statement which tells us the program is being run from userspace rather than being booted, and then hops to the entrypoint.
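The jump targets quoted above can be checked by hand. In x86-64, the byte 'M' (0x4D) acts as a REX prefix and 'Z' (0x5A) is a pop, which together give pop %r10. The bytes 'q' (0x71), 'p' (0x70), and \177 (0x7F) are the short conditional jumps jno, jo, and jg, each followed by a one-byte displacement measured from the end of the two-byte instruction. The sketch below does that arithmetic in C. ShortJumpTarget and IsShortJcc are hypothetical helpers written for this illustration; this is not a general x86 decoder.

```c
#include <stdint.h>

/* Target of a two-byte x86 short jump (opcode byte plus signed 8-bit
   displacement) located at `offset`. The displacement is relative to the
   address of the next instruction, which is offset + 2. */
long ShortJumpTarget(long offset, unsigned char rel8) {
  return offset + 2 + (int8_t)rel8;
}

/* Returns 1 if `op` is a short conditional jump (Jcc rel8, 0x70..0x7F). */
int IsShortJcc(unsigned char op) {
  return op >= 0x70 && op <= 0x7f;
}
```

For "MZqFpD", the jno sits at offset 2 with displacement 'F' (0x46) and the jo sits at offset 4 with displacement 'D' (0x44), so both land on 0x4a. For "\177ELF", the jg sits at offset 0 with displacement 'E' (0x45), which lands on 0x47.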
Magic numbers are then unpacked for the host operating system using decentralized sections and the GNU Assembler .sleb128 directive. Low entropy data like UNICODE bit lookup tables will generally be decoded using either a tiny LZ4 decompressor or a tiny run-length decoder, and runtime code morphing can easily be done too.
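For readers unfamiliar with the encoding, .sleb128 emits signed LEB128: each byte carries seven bits of the value, the high bit says whether more bytes follow, and the sign is extended from the final group. Small numbers therefore take only a byte or two, which suits tables of system constants. The following is a minimal sketch of a decoder in C. DecodeSleb128 is a hypothetical name, and this is not Cosmopolitan's actual unpacking code, which is not shown in this article.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one signed LEB128 value from buf[0..size). Stores the value in
   *out and returns the number of bytes consumed, or 0 if the input ends
   before the value is complete. */
size_t DecodeSleb128(const unsigned char *buf, size_t size, int64_t *out) {
  uint64_t result = 0;
  unsigned shift = 0;
  size_t i = 0;
  unsigned char byte;
  do {
    if (i == size) return 0; /* ran out of input */
    byte = buf[i++];
    if (shift < 64) result |= (uint64_t)(byte & 0x7f) << shift;
    shift += 7;
  } while (byte & 0x80); /* high bit set means more bytes follow */
  if (shift < 64 && (byte & 0x40)) {
    result |= ~(uint64_t)0 << shift; /* sign-extend from the last group */
  }
  *out = (int64_t)result;
  return i;
}
```

As a worked case, the three bytes 0xC0 0xBB 0x78 decode to -123456, and the single byte 0x7F decodes to -1.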
Please note that this emulator isn't a requirement. αcτµαlly pδrταblε εxεcµταblεs work fine if you just run them on the shell, the NT command prompt, or boot them from the BIOS. This isn't a JVM. You only use the emulator if you need it. For example, it's helpful to be able to have cool visualizations of how program execution impacts memory.
It'll be nice to know that any normal PC program we write will "just work" on Raspberry Pi and Apple ARM. All we have to do is embed an ARM build of the emulator above within our x86 executables, and have them morph and re-exec appropriately, similar to how Cosmopolitan is already doing with qemu-x86_64, except that this wouldn't need to be installed beforehand. The tradeoff is that, if we do this, binaries will only be 10x smaller than Go's Hello World, instead of 100x smaller. The other tradeoff is the GCC Runtime Exception forbids code morphing, but I already took care of that for you, by rewriting the GNU runtimes.
The most compelling use case for making x86-64-linux-gnu as tiny as possible, with the availability of full emulation, is that it enables normal simple native programs to run everywhere including web browsers by default. Many of the solutions built in this area tend to focus too much on the interfaces that haven't achieved consensus, like GUIs and threads, otherwise they'll just emulate the entire operating system, like Docker or Fabrice Bellard running Windows in browsers. I think we need compatibility glue that just runs programs, ignores the systems, and treats x86_64-linux-gnu as a canonical software encoding.
One of the reasons why I love working with a lot of these old unsexy technologies, is that I want any software work I'm involved in to stand the test of time with minimal toil. Similar to how the Super Mario Bros ROM has managed to survive all these years without needing a GitHub issue tracker.
I believe the best chance we have of doing that, is by gluing together the binary interfaces that've already achieved a decades-long consensus, and ignoring the APIs. For example, here are the magic numbers used by Mac, Linux, BSD, and Windows distros. They're worth seeing at least once in your life, since these numbers underpin the internals of nearly all the computers, servers, and phones you've used.
If we focus on the subset of numbers all systems share in common, and compare it to their common ancestor, Bell System Five, we can see that few things about systems engineering have changed in the last 40 years at the binary level. Magnums are boring. Platforms can't break them without breaking themselves. Few people have proposed visions over the years on why UNIX numerology needs to change.
emulator.com (270k PE+ELF+MachO+ZIP+SH)
tinyemu.com (140k PE+ELF+MachO+ZIP+SH)
life.com (12kb ape symbols)
sha256.elf (3kb x86_64-linux-gnu)
hello.bin (55b x86_64-linux-gnu)
bash hello.com
emulator.com -t life.com
echo hello | emulator.com sha256.elf
SYNOPSIS

  o/tiny/tool/build/emulator.com [-?Hhrstv] [ROM] [ARGS...]

DESCRIPTION

  NexGen32e Userspace Emulator w/ Debugger

FLAGS

  -h
  -?        help
  -v        verbosity
  -s        statistics
  -H        disable highlight
  -t        tui debugger mode
  -r        reactive tui mode
  -b ADDR   push a breakpoint
  -L PATH   log file location

ARGUMENTS

  ROM files can be ELF or a flat αcτµαlly pδrταblε εxεcµταblε.
  It should use x86_64 in accordance with the System Five ABI.
  The SYSCALL ABI is defined as it is written in Linux Kernel.

PERFORMANCE

  1500 MIPS w/ NOP loop
  Over 9000 MIPS w/ SIMD & Algorithms