A File is Just a Bunch of Bytes

Posted: 2023-03-21
Tags: #dotnet, #int-fic, #programming

Okay, yes, that’s a pretty obvious thing to say, but bear with me for a moment.¹

I’ve been procrastinating on the second part of my series exploring the Inform CLI tools by working on a Linux-first (but potentially cross-platform) desktop app for browsing and organizing a collection of interactive fiction games. Aside from just being something I would like to have, this project gives me a reason to explore things that I don’t typically face in my day job. For example, I’ve never written a graphical app for Linux before; I conceived this project partly as a means to gain some familiarity with GTK.

A core function of this app will be to scan the directory where you keep your IF games and extract metadata and cover art from each game. I’m focusing on games for the Z-machine and Glulx virtual machines to begin with, as those make up the bulk of modern parser-based games and consequently are of the most interest to me personally. Since the late 1990s, most new games targeting either of these VMs have been distributed as blorb files, which conveniently encapsulate the Z-code or Glulx bytecode of the game itself alongside any other resources the game might need — notably including said metadata and cover art.

But how to extract that information? Certainly, I’ve written programs that work with files before, but only far more mainstream formats:

Plaintext/XML/JSON: Just use C#’s File.ReadAllText(), coupled with the appropriate deserialization APIs.
PNG/JPEG: Image.FromFile().
XLS/XLSX: Install the NPOI NuGet package.

I even wrote a program once that could read from and write to tar archives; but again there was a package for that, and more recently .NET 7 introduced the new System.Formats.Tar assembly.

The blorb format is more niche. Intuition told me there likely was not a ready-made C#² library³ for me to make use of, and a(n admittedly brief) search did little to disabuse me of that notion.

Being accustomed to such higher-level abstractions, it’s easy to forget that there are layers underneath. I have tended to think of files as opaque, indivisible units, but in reality a file is just a bunch of bytes. And, when you have a standard to guide you, it turns out that working at the byte level is a lot more straightforward than I might have guessed.

The blorb format is an extension of the Interchange File Format proposed by Electronic Arts in 1985. A blorb file contains one FORM data chunk with a newly-defined form type, IFRS (which I assume stands for “interactive fiction resource”). What that means is:

The first four bytes of a blorb file should be the letters F, O, R, M.
The next four should be a 32-bit integer representing the total remaining length of the FORM chunk (after the eight bytes which comprise the FORM identifier and the length value).
The four after that should be the letters I, F, R, S.

Let’s try it out! For testing purposes, I used Emily Short’s Galatea (2000).

byte[] buffer = new byte[12];
using Stream stream = new FileStream(
    "Galatea.zblorb", FileMode.Open, FileAccess.Read, FileShare.ReadWrite);

stream.Read(buffer, 0, 12);
Console.WriteLine(String.Join(", ", buffer));

70, 79, 82, 77, 0, 6, 190, 128, 73, 70, 82, 83

Well, time to break out the ASCII table.

Let’s see… 70 is F… 79 is O… 82 is R… 77 is M…

Oh. Oh shit. It works! (This moment of initial, exhilarating success is approximately when I decided to write this blog post.)

Of course, the program can do the conversion for us:

// Encoding lives in System.Text, so this needs `using System.Text;` at the top of the file.
string chunkType = Encoding.ASCII.GetString(buffer[0..4]);
Console.WriteLine(chunkType);

FORM

Nice! Okay, now the next four bytes should be the length of the chunk.

int chunkLength = BitConverter.ToInt32(buffer[4..8]);
Console.WriteLine(chunkLength);

-2135030272

Hmm. That can’t be right. Going back to the spec:

All numbers are two-byte or four-byte unsigned integers, stored big-endian (most significant byte first.) Character constants such as 'FORM' are stored as four ASCII bytes, in order from left to right.

…and, as it turns out, my processor is little-endian. I have literally never needed to know that before. (Also, I should have been using uint instead of int, although that could only make a difference when dealing with a blorb that was more than 2 GiB in size, which is extremely unlikely.) Let’s try that again.

uint chunkLength = GetUInt(buffer[4..8]);
Console.WriteLine(chunkLength);

uint GetUInt(ReadOnlySpan<byte> bytes)
{
    if (BitConverter.IsLittleEndian)
    {
        byte[] reversed = bytes.ToArray();
        Array.Reverse(reversed);
        return BitConverter.ToUInt32(reversed);
    }
    else
    {
        return BitConverter.ToUInt32(bytes);
    }
}

441984

That looks a lot more reasonable. Theoretically, the FORM chunk should be the entirety of the file, so this should line up with the file size (minus the first 8 bytes).

$ ls -l Galatea.zblorb
-rw-rw-r-- 1 rdnlsmith rdnlsmith 441992 Mar  9 20:37 Galatea.zblorb

And wouldn’t you know it: 441,984 + 8 = 441,992.
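As an aside, the manual byte reversal in GetUInt is not the only way to do this. The sketch below is a hedged alternative, not what the rest of this post uses: it leans on BinaryPrimitives.ReadUInt32BigEndian from System.Buffers.Binary, which reads the bytes most-significant-first regardless of the CPU's byte order. ParseHeader is a hypothetical helper name, and the byte array is simply the 12-byte dump printed earlier.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

// The first 12 bytes of Galatea.zblorb, as printed earlier in this post.
byte[] header = { 70, 79, 82, 77, 0, 6, 190, 128, 73, 70, 82, 83 };

var (id, size, form) = ParseHeader(header);
Console.WriteLine($"{id} {size} {form}"); // FORM 441984 IFRS

// Hypothetical helper: splits a 12-byte IFF header into its three fields.
static (string ChunkId, uint Length, string FormType) ParseHeader(ReadOnlySpan<byte> bytes)
{
    if (bytes.Length < 12)
        throw new ArgumentException("An IFF header is 12 bytes long.", nameof(bytes));

    string chunkId = Encoding.ASCII.GetString(bytes[..4]);
    // Big-endian read: 0*2^24 + 6*2^16 + 190*2^8 + 128 = 441984.
    uint length = BinaryPrimitives.ReadUInt32BigEndian(bytes[4..8]);
    string formType = Encoding.ASCII.GetString(bytes[8..12]);
    return (chunkId, length, formType);
}
```

Because the helper never consults BitConverter.IsLittleEndian, it should behave the same on either kind of processor.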

At this point, I’m sure it won’t surprise you to learn that bytes 9 through 12 are indeed the ASCII representation of the letters IFRS, so I’ll skip ahead a little. Incidentally, the endianness issue doesn’t show up with the text fields because each ASCII character is a single byte. The length value was one number spread across four bytes, so the order in which the bytes are read matters. Byte order is also irrelevant for text encoded as UTF-8 — which will be important later on in this post — but it can be an issue for the less-common UTF-16 and UTF-32 encodings.
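To make the encoding point concrete, here is a small sketch that encodes one non-ASCII character three ways. The choice of 'é' (U+00E9) is arbitrary. UTF-8 defines a single byte sequence for it, while the two UTF-16 variants hold the same 16-bit value in opposite byte orders.

```csharp
using System;
using System.Text;

string text = "\u00E9"; // 'é', U+00E9: outside ASCII, so UTF-8 needs two bytes for it

string utf8 = Convert.ToHexString(Encoding.UTF8.GetBytes(text));
string utf16Le = Convert.ToHexString(Encoding.Unicode.GetBytes(text));
string utf16Be = Convert.ToHexString(Encoding.BigEndianUnicode.GetBytes(text));

Console.WriteLine($"UTF-8:    {utf8}");    // C3A9 (one defined order)
Console.WriteLine($"UTF-16LE: {utf16Le}"); // E900 (low byte first)
Console.WriteLine($"UTF-16BE: {utf16Be}"); // 00E9 (high byte first)
```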

The remainder of the FORM (and thus, the remainder of the file) consists of several more data chunks, each of which begins with a four-byte ASCII identifier and a four-byte length, similar to the FORM itself. The first chunk is an index that lists each of the resources used within the game, as well as their locations in the blorb (as a number of bytes from the beginning of the file). After that, the resource chunks can appear in any order,⁴ intermingled with a few other chunks containing metadata. The metadata chunks aren’t resources used by the game, so they aren’t listed in the index. We can scan the rest of the file to find these, using the length values to skip from one header to the next:

int read;

while (true)
{
    if ((read = stream.Read(buffer, 0, 8)) < 8)
        break;

    chunkType = Encoding.ASCII.GetString(buffer[0..4]);
    chunkLength = GetUInt(buffer[4..8]);
    Console.WriteLine($"{chunkType}: {chunkLength} bytes");

    stream.Seek(chunkLength, SeekOrigin.Current);
}

RIdx: 28 bytes
ZCOD: 266240 bytes
JPEG: 174150 bytes
Fspc: 4 bytes
IFmd: 1518 bytes

RIdx is the resource index. ZCOD is the actual executable Z-code of the game. JPEG is a JPEG image; since there’s only one, it’s a good bet that this is the cover art, but we’d have to check both Fspc (short for “frontispiece”) and the resource index to be certain. Finally, IFmd is the game’s metadata, in iFiction format.

Let’s expand our loop to process each of the sections that we’re interested in. The resource index begins with a four-byte uint representing the number of index entries. After that, each entry is 12 bytes: four characters to identify the type of resource, then four bytes to indicate the (ordinal) resource number (by which it is referenced in the game’s code), then four to give the byte offset.

int read;

while (true)
{
    if ((read = stream.Read(buffer, 0, 8)) < 8)
        break;

    chunkType = Encoding.ASCII.GetString(buffer[0..4]);
    chunkLength = GetUInt(buffer[4..8]);
    Console.WriteLine($"{chunkType}: {chunkLength} bytes");

    if (chunkType == "RIdx")
    {
        if ((read = stream.Read(buffer, 0, 4)) < 4)
            break;

        uint resCount = GetUInt(buffer[0..4]);
        Console.WriteLine($"  Index contains {resCount} resources");

        for (int i = 0; i < resCount; i++)
        {
            if ((read = stream.Read(buffer, 0, 12)) < 12)
                break;

            string resType = Encoding.ASCII.GetString(buffer[0..4]);
            uint resNum = GetUInt(buffer[4..8]);
            uint offset = GetUInt(buffer[8..12]);
            Console.WriteLine($"  {resType} {resNum}: offset {offset}");
        }
    }

RIdx: 28 bytes
  Index contains 2 resources
  Exec 0: offset 48
  Pict 1: offset 266296

Resource #0 is the game itself; the offset corresponds to the beginning of the ZCOD chunk. Resource #1 is an image; the offset corresponds to the beginning of the JPEG chunk. Once we determine the resource number for the cover art — again, almost certainly the lone JPEG chunk — we can use these offsets to skip directly from the beginning of the file to the resource in question. A quick side experiment confirms this:

stream.Seek(48, SeekOrigin.Begin);
stream.Read(buffer, 0, 4);
Console.WriteLine(Encoding.ASCII.GetString(buffer[0..4]));

stream.Seek(266296, SeekOrigin.Begin);
stream.Read(buffer, 0, 4);
Console.WriteLine(Encoding.ASCII.GetString(buffer[0..4]));

ZCOD
JPEG

The frontispiece chunk has a single four-byte field containing the resource number for the game’s cover art. Surprising absolutely no one, the cover is indeed resource #1.

    else if (chunkType == "Fspc")
    {
        if ((read = stream.Read(buffer, 0, 4)) < 4)
            break;

        uint resourceNum = GetUInt(buffer[0..4]);
        Console.WriteLine($"  Resource #{resourceNum}");
    }

ZCOD: 266240 bytes
JPEG: 174150 bytes
Fspc: 4 bytes
  Resource #1
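With the frontispiece number and the index offsets in hand, pulling out the cover image is a seek and two reads. What follows is a rough sketch rather than the app's real code: ReadChunkAt is a hypothetical helper, it assumes .NET 7 or later for Stream.ReadExactly, and it runs against a tiny in-memory stand-in instead of an actual blorb.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Text;

// Hypothetical helper: given a chunk's byte offset (as listed in the resource index),
// returns the chunk's four-character type and its payload.
static (string Type, byte[] Data) ReadChunkAt(Stream blorb, long offset)
{
    Span<byte> header = stackalloc byte[8];
    blorb.Seek(offset, SeekOrigin.Begin);
    blorb.ReadExactly(header); // throws if the stream ends early

    string type = Encoding.ASCII.GetString(header[..4]);
    uint length = BinaryPrimitives.ReadUInt32BigEndian(header[4..]);

    byte[] data = new byte[length];
    blorb.ReadExactly(data);
    return (type, data);
}

// Stand-in data so the sketch runs on its own:
// 4 filler bytes, then a "JPEG" chunk with a 3-byte payload.
byte[] fake =
{
    0, 0, 0, 0,
    (byte)'J', (byte)'P', (byte)'E', (byte)'G',
    0, 0, 0, 3,
    0xFF, 0xD8, 0xFF,
};
using var fakeBlorb = new MemoryStream(fake);

var (kind, payload) = ReadChunkAt(fakeBlorb, 4);
Console.WriteLine($"{kind}: {payload.Length} bytes"); // JPEG: 3 bytes
```

For Galatea, the equivalent call would pass the real file stream and offset 266296, and File.WriteAllBytes could then save the payload to disk.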

Finally, the iFiction metadata is an XML document encoded as UTF-8. Although chunk identifiers must be ASCII-encoded, chunk contents are just undifferentiated bytes to be interpreted by the end application, and the iFiction spec mandates UTF-8.

    else if (chunkType == "IFmd")
    {
        byte[] largeBuf = new byte[chunkLength];
        // Read() may return fewer bytes than requested; ReadExactly() (.NET 7+) keeps reading until the buffer is full.
        stream.ReadExactly(largeBuf);
        string iFiction = Encoding.UTF8.GetString(largeBuf);
        Console.WriteLine(iFiction);
    }
    else
    {
        stream.Seek(chunkLength, SeekOrigin.Current);
    }
}

IFmd: 1518 bytes
<?xml version="1.0" encoding="UTF-8"?>
<ifindex version="1.0" xmlns="http://babel.ifarchive.org/protocol/iFiction/">
	<story>
 <identification>

			<format>zcode</format>
		<ifid>ZCODE-3-040208-2BC1</ifid>
 </identification>

		
		<bibliographic>
			<title>Galatea</title>
			<author>Emily Short</author>
			<language>en-US</language>
			<firstpublished>2000</firstpublished>
...

(Output above truncated for brevity.)
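Printing the XML is only half the job, since the app ultimately wants individual fields. Here is a minimal sketch of one way to get them, assuming LINQ to XML (System.Xml.Linq) is acceptable. The byte array is a stand-in for the IFmd chunk contents, trimmed to fields that appear in the output above.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Stand-in for the bytes of an IFmd chunk, trimmed to fields shown in the output above.
byte[] ifmd = Encoding.UTF8.GetBytes(
    "<?xml version='1.0' encoding='UTF-8'?>" +
    "<ifindex version='1.0' xmlns='http://babel.ifarchive.org/protocol/iFiction/'>" +
    "<story>" +
    "<identification><format>zcode</format><ifid>ZCODE-3-040208-2BC1</ifid></identification>" +
    "<bibliographic><title>Galatea</title><author>Emily Short</author></bibliographic>" +
    "</story></ifindex>");

// Every element inherits the default namespace declared on <ifindex>,
// so lookups have to be qualified with it.
XNamespace ns = "http://babel.ifarchive.org/protocol/iFiction/";

// Loading from a stream lets the XML reader handle the declared encoding itself.
XDocument doc = XDocument.Load(new MemoryStream(ifmd));
var biblio = doc.Root?.Element(ns + "story")?.Element(ns + "bibliographic");

string title = biblio?.Element(ns + "title")?.Value ?? "(unknown)";
string author = biblio?.Element(ns + "author")?.Value ?? "(unknown)";
Console.WriteLine($"{title} by {author}"); // Galatea by Emily Short
```

Forgetting the namespace is the usual stumbling block here: Element("title") without the ns prefix returns null even though the element is plainly in the document.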

And with that, we have successfully pulled everything we needed out of this blorb! As a final test, let’s try running the same code against another blorb, one far larger and more complicated: Counterfeit Monkey (2012), also by Emily Short.

FORM
11314616
IFRS
RIdx: 1648 bytes
  Index contains 137 resources
  Exec 0: offset 1668
  Pict 1: offset 7911912
  Pict 3: offset 8910452
  Pict 4: offset 8911312
  ...
  Pict 134: offset 11243556
  Pict 135: offset 11253122
  Pict 136: offset 11262712
  Data 9998: offset 11275446
GLUL: 7907072 bytes
IFmd: 3143 bytes
<?xml version="1.0" encoding="UTF-8"?>
<ifindex version="1.0" xmlns="http://babel.ifarchive.org/protocol/iFiction/">
    <story>
        <identification>
            <ifid>7B5A779B-4653-43DB-A516-F475DDC12987</ifid>
            <format>glulx</format>
        </identification>
        <bibliographic>
            <title>Counterfeit Monkey</title>
            <author>Emily Short</author>
            <headline>A Removal</headline>
            <genre>Fiction</genre>
            <firstpublished>2021</firstpublished>
    ...
    </story>
</ifindex>

?Fsp: 1660944384 bytes

Well, how about that! It… almost worked. It was all going fine, up until the frontispiece: the chunk identifier is shifted by one byte, and there’s no way the chunk could be 1.5 GiB long (the blorb itself is only 11 MiB). It turns out there was one more important detail that I missed in the blorb spec:

If a chunk has an odd length, it must be followed by a single padding byte whose value is zero. (This padding byte is not included in the chunk length m.) This allows all chunks to be aligned on even byte boundaries.

IFmd, the last chunk to be successfully interpreted, just happened to be 3,143 bytes long. We can fix that with one more change at the end of the while loop:

    else
    {
        stream.Seek(chunkLength, SeekOrigin.Current);
    }

    if (chunkLength % 2 != 0)
        stream.Seek(1, SeekOrigin.Current);
}

    </story>
</ifindex>

Fspc: 4 bytes
  Resource #1
PNG : 998532 bytes
PNG : 851 bytes
PNG : 619 bytes
PNG : 9557 bytes
...
PNG : 9582 bytes
PNG : 9455 bytes
RDes: 3261 bytes
FORM: 39170 bytes

Much better.
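The padding rule can also be checked against the resource index using nothing but arithmetic. This sketch uses a hypothetical helper, NextChunkOffset, and walks from the GLUL chunk to the first picture using the chunk lengths printed above.

```csharp
using System;

// Offset of the chunk that follows one starting at chunkStart:
// 8 header bytes, the payload, and one pad byte if the payload length is odd.
static long NextChunkOffset(long chunkStart, uint chunkLength) =>
    chunkStart + 8 + chunkLength + (chunkLength % 2);

long glul = 1668;                            // Exec 0, from the resource index
long ifmd = NextChunkOffset(glul, 7907072);  // 7908748
long fspc = NextChunkOffset(ifmd, 3143);     // 7911900 (3143 is odd, so one pad byte)
long firstPict = NextChunkOffset(fspc, 4);   // 7911912

Console.WriteLine(firstPict); // 7911912, matching "Pict 1: offset 7911912"
```

Without the pad byte the walk would land on 7911911, one byte short of the offset the index promises, which is the same off-by-one that produced the mangled ?Fsp identifier.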

A few points of interest for this second file:

This is a Glulx game; the Exec resource is a GLUL chunk instead of a ZCOD chunk.
The trailing space in the PNG chunk identifier is not a mistake. All chunk identifiers must be exactly four bytes, so if you want to use fewer than four characters you have to pad it out with spaces.
The top-level FORM has another FORM nested inside it (which, yes, is allowed), corresponding to the Data resource #9998 at the end of the resource index.
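The trailing-space detail matters as soon as code compares chunk identifiers. As a small illustration, here is a hypothetical helper, ExtensionFor, that picks a file extension for an extracted picture chunk. It covers only the two picture types seen in this post, and the ".bin" fallback is an arbitrary choice.

```csharp
using System;

// Hypothetical helper: picks a file extension for an extracted picture chunk.
static string ExtensionFor(string chunkType) => chunkType switch
{
    "PNG " => ".png", // note the trailing space: identifiers are always four bytes
    "JPEG" => ".jpg",
    _ => ".bin",      // anything unrecognized is treated as opaque bytes
};

Console.WriteLine(ExtensionFor("PNG ")); // .png
Console.WriteLine(ExtensionFor("JPEG")); // .jpg
Console.WriteLine(ExtensionFor("PNG"));  // .bin (three characters never match)
```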

I’m sure I’ll need to refine this code further as I incorporate it into my actual program, but I’m pretty satisfied to see it work for two very different files.

This exercise was pretty intimidating at first, starting off with a 9,000-word spec (which, admittedly, I skimmed) and a blinking cursor, but it turned out to be much simpler than I expected — fewer than 100 lines of code to get this far! Given the (relatively) low-level logic involved, it was also incredibly gratifying to implement each successive piece and actually see more or less the output I expected.

Footnotes

¹ XKCD #365
² Even though GTK exposes a C API, I decided to write my app in C#, that being the language with which I am most comfortable. I didn’t want to be struggling with a new frontend framework and an unfamiliar backend language/ecosystem at the same time.
³ The Interactive Fiction Technology Foundation does maintain a command-line utility written in C called babel, which can extract metadata from a wide range of IF game file formats, including blorbs. I probably could compile the blorb-related code from babel into a shared-object file and reference it with extern functions, but I felt like that would have been more painful than starting from scratch in C#.
⁴ Well, they have to appear in an order such that the byte offsets listed in the resource index are correct. However, there is no requirement that the resource numbers (used within the game’s code) correspond to the physical order of the data chunks within the blorb.