A File is Just a Bunch of Bytes
Okay, yes, that’s a pretty obvious thing to say, but bear with me for a moment.[1]
I’ve been procrastinating on the second part of my series exploring the Inform CLI tools by working on a Linux-first (but potentially cross-platform) desktop app for browsing and organizing a collection of interactive fiction games. Aside from just being something I would like to have, this project gives me a reason to explore things that I don’t typically face in my day job. For example, I’ve never written a graphical app for Linux before; I conceived this project partly as a means to gain some familiarity with GTK.
A core function of this app will be to scan the directory where you keep your IF games and extract metadata and cover art from each game. I’m focusing on games for the Z-machine and Glulx virtual machines to begin with, as those make up the bulk of modern parser-based games and consequently are of the most interest to me personally. Since the late 1990s, most new games targeting either of these VMs have been distributed as blorb files, which conveniently encapsulate the Z-code or Glulx bytecode of the game itself alongside any other resources the game might need—notably including said metadata and cover art.
But how to extract that information? Certainly, I’ve written programs that work with files before, but only far more mainstream formats:
- Plaintext/XML/JSON: Just use C#’s File.ReadAllText(), coupled with the appropriate deserialization APIs.
- PNG/JPEG: Image.FromFile().
- XLS/XLSX: Install the NPOI NuGet package.
I even wrote a program once that could read from and write to tar archives; but again there was a package for that, and more recently .NET 7 introduced the new System.Formats.Tar assembly.
The blorb format is more niche. Intuition told me there likely was not a ready-made C#[2] library[3] for me to make use of, and a(n admittedly brief) search did little to disabuse me of that notion.
Being accustomed to such higher-level abstractions, it’s easy to forget that there are layers underneath. I have tended to think of files as opaque, indivisible units, but in reality a file is just a bunch of bytes. And, when you have a standard to guide you, it turns out that working at the byte level is a lot more straightforward than I might have guessed.
The blorb format is an extension of the Interchange File Format proposed by Electronic Arts in 1985. A blorb file contains one FORM data chunk with a newly-defined form type, IFRS (which I assume stands for “interactive fiction resource”). What that means is:
- The first four bytes of a blorb file should be the letters F, O, R, M.
- The next four should be a 32-bit integer representing the total remaining length of the FORM chunk (after the eight bytes which comprise the FORM identifier and the length value).
- The four after that should be the letters I, F, R, S.
Let’s try it out! For testing purposes, I used Emily Short’s Galatea (2000).
// Read the first 12 bytes: the FORM identifier, the length, and the form type.
byte[] buffer = new byte[12];
using Stream stream = new FileStream(
    "Galatea.zblorb", FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
stream.Read(buffer, 0, 12);
Console.WriteLine(String.Join(", ", buffer));
70, 79, 82, 77, 0, 6, 190, 128, 73, 70, 82, 83
Well, time to break out the ASCII table.
Let’s see… 70 is F… 79 is O… 82 is R… 77 is M…
Oh. Oh shit. It works! (This moment of initial, exhilarating success is approximately when I decided to write this blog post.)
Of course, the program can do the conversion for us:
// Encoding lives in System.Text, so this needs "using System.Text;" at the top of the file.
string chunkType = Encoding.ASCII.GetString(buffer[0..4]);
Console.WriteLine(chunkType);
FORM
Nice! Okay, now the next four bytes should be the length of the chunk.
int chunkLength = BitConverter.ToInt32(buffer[4..8]);
Console.WriteLine(chunkLength);
-2135030272
Hmm. That can’t be right. Going back to the spec:
All numbers are two-byte or four-byte unsigned integers, stored big-endian (most significant byte first.) Character constants such as 'FORM' are stored as four ASCII bytes, in order from left to right.
…and, as it turns out, my processor is little-endian. I have literally never needed to know that before. (Also, I should have been using uint instead of int, although that could only make a difference when dealing with a blorb that was more than 2 GiB in size, which is extremely unlikely.) Let’s try that again.
uint chunkLength = GetUInt(buffer[4..8]);
Console.WriteLine(chunkLength);

uint GetUInt(ReadOnlySpan<byte> bytes)
{
    if (BitConverter.IsLittleEndian)
    {
        byte[] reversed = bytes.ToArray();
        Array.Reverse(reversed);
        return BitConverter.ToUInt32(reversed);
    }
    else
    {
        return BitConverter.ToUInt32(bytes);
    }
}
441984
That looks a lot more reasonable. Theoretically, the FORM chunk should be the entirety of the file, so this should line up with the file size (minus the first 8 bytes).
$ ls -l Galatea.zblorb
-rw-rw-r-- 1 rdnlsmith rdnlsmith 441992 Mar 9 20:37 Galatea.zblorb
And wouldn't you know it: 441,984 + 8 = 441,992.
At this point, I’m sure it won’t surprise you to learn that bytes 9 through 12 are indeed the ASCII representation of the letters IFRS, so I’ll skip ahead a little.
Incidentally, the endianness issue doesn’t show up with the text fields because each ASCII character is a single byte. The length value was one number spread across four bytes, so the order in which the bytes are read matters. Byte order is also irrelevant for text encoded as UTF-8 (which will be important later on in this post), but it can be an issue for the less-common UTF-16 and UTF-32 encodings.
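As an aside, the manual byte reversal is not the only way to handle this. A minimal sketch, assuming .NET Core 2.1 or later: the BinaryPrimitives class in System.Buffers.Binary reads an integer in a stated byte order regardless of the host CPU. The lengthBytes array below is just the four length bytes from the Galatea header printed earlier; the variable names are mine.

```csharp
using System;
using System.Buffers.Binary;

// The four length bytes from the Galatea header: 0, 6, 190, 128.
byte[] lengthBytes = { 0, 6, 190, 128 };

// Interprets the bytes as big-endian no matter what the host CPU uses.
uint bigEndian = BinaryPrimitives.ReadUInt32BigEndian(lengthBytes);

// For comparison: the same four bytes interpreted as little-endian.
uint littleEndian = BinaryPrimitives.ReadUInt32LittleEndian(lengthBytes);

Console.WriteLine(bigEndian);     // 441984
Console.WriteLine(littleEndian);  // 2159937024
```

The little-endian value is the unsigned counterpart of the negative number that the first attempt produced: the same bits, read as a uint instead of an int.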
The remainder of the FORM (and thus, the remainder of the file) consists of several more data chunks, each of which begins with a four-byte ASCII identifier and a four-byte length, similar to the FORM itself. The first chunk is an index that lists each of the resources used within the game, as well as their locations in the blorb (as a number of bytes from the beginning of the file). After that, the resource chunks can appear in any order,[4] intermingled with a few other chunks containing metadata. The metadata chunks aren’t resources used by the game, so they aren’t listed in the index. We can scan the rest of the file to find these, using the length values to skip from one header to the next:
int read;
while (true)
{
    if ((read = stream.Read(buffer, 0, 8)) < 8)
        break;
    chunkType = Encoding.ASCII.GetString(buffer[0..4]);
    chunkLength = GetUInt(buffer[4..8]);
    Console.WriteLine($"{chunkType}: {chunkLength} bytes");
    stream.Seek(chunkLength, SeekOrigin.Current);
}
RIdx: 28 bytes
ZCOD: 266240 bytes
JPEG: 174150 bytes
Fspc: 4 bytes
IFmd: 1518 bytes
RIdx is the resource index. ZCOD is the actual executable Z-code of the game. JPEG is a JPEG image; since there’s only one, it’s a good bet that this is the cover art, but we’d have to check both Fspc (that’s “frontispiece”) and the resource index to be certain. Finally, IFmd is the game’s metadata, in iFiction format.
Let’s expand our loop to process each of the sections that we’re interested in. The resource index begins with a four-byte uint representing the number of index entries. After that, each entry is 12 bytes: four characters to identify the type of resource, then four bytes to indicate the (ordinal) resource number (by which it is referenced in the game’s code), then four to give the byte offset.
int read;
while (true)
{
    if ((read = stream.Read(buffer, 0, 8)) < 8)
        break;
    chunkType = Encoding.ASCII.GetString(buffer[0..4]);
    chunkLength = GetUInt(buffer[4..8]);
    Console.WriteLine($"{chunkType}: {chunkLength} bytes");
    if (chunkType == "RIdx")
    {
        if ((read = stream.Read(buffer, 0, 4)) < 4)
            break;
        uint resCount = GetUInt(buffer[0..4]);
        Console.WriteLine($" Index contains {resCount} resources");
        for (int i = 0; i < resCount; i++)
        {
            if ((read = stream.Read(buffer, 0, 12)) < 12)
                break;
            string resType = Encoding.ASCII.GetString(buffer[0..4]);
            uint resNum = GetUInt(buffer[4..8]);
            uint offset = GetUInt(buffer[8..12]);
            Console.WriteLine($" {resType} {resNum}: offset {offset}");
        }
    }
RIdx: 28 bytes
Index contains 2 resources
Exec 0: offset 48
Pict 1: offset 266296
Resource #0 is the game itself; the offset corresponds to the beginning of the ZCOD chunk. Resource #1 is an image; the offset corresponds to the beginning of the JPEG chunk. Once we determine the resource number for the cover art (again, almost certainly the lone JPEG chunk), we can use these offsets to skip directly from the beginning of the file to the resource in question. A quick side experiment confirms this:
stream.Seek(48, SeekOrigin.Begin);
stream.Read(buffer, 0, 4);
Console.WriteLine(Encoding.ASCII.GetString(buffer[0..4]));
stream.Seek(266296, SeekOrigin.Begin);
stream.Read(buffer, 0, 4);
Console.WriteLine(Encoding.ASCII.GetString(buffer[0..4]));
ZCOD
JPEG
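Taking that one step further, here is a rough sketch of how a whole chunk could be pulled out once its offset is known. ReadChunkAt is a hypothetical helper of my own naming, not something from the blorb spec; it assumes .NET 7 or later for Stream.ReadExactly, and it runs against a tiny in-memory stand-in rather than a real blorb so that it is self-contained.

```csharp
using System;
using System.Buffers.Binary;
using System.IO;
using System.Text;

// Hypothetical helper: reads the chunk whose 8-byte header starts at `offset`.
(string Type, byte[] Data) ReadChunkAt(Stream stream, long offset)
{
    stream.Seek(offset, SeekOrigin.Begin);

    byte[] header = new byte[8];
    stream.ReadExactly(header);   // .NET 7+; throws if the stream ends early

    string type = Encoding.ASCII.GetString(header, 0, 4);
    uint length = BinaryPrimitives.ReadUInt32BigEndian(header.AsSpan(4, 4));

    byte[] data = new byte[length];
    stream.ReadExactly(data);
    return (type, data);
}

// Stand-in data: 4 filler bytes, then a "JPEG" chunk holding 3 bytes.
byte[] fake =
{
    1, 2, 3, 4,
    (byte)'J', (byte)'P', (byte)'E', (byte)'G', 0, 0, 0, 3,
    0xFF, 0xD8, 0xFF,
};
using var demo = new MemoryStream(fake);

var (demoType, demoData) = ReadChunkAt(demo, 4);
Console.WriteLine($"{demoType}: {demoData.Length} bytes");  // JPEG: 3 bytes
```

Against the real file, calling the helper with offset 266296 should return the JPEG chunk, and the resulting byte array could then be written out with something like File.WriteAllBytes.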
The frontispiece chunk has a single four-byte field containing the resource number for the game’s cover art. Surprising absolutely no one, the cover is indeed resource #1.
    else if (chunkType == "Fspc")
    {
        if ((read = stream.Read(buffer, 0, 4)) < 4)
            break;
        uint resourceNum = GetUInt(buffer[0..4]);
        Console.WriteLine($" Resource #{resourceNum}");
    }
ZCOD: 266240 bytes
JPEG: 174150 bytes
Fspc: 4 bytes
Resource #1
Finally, the iFiction metadata is an XML document encoded as UTF-8. Although chunk identifiers must be ASCII-encoded, chunk contents are just undifferentiated bytes to be interpreted by the end application, and the iFiction spec mandates UTF-8.
    else if (chunkType == "IFmd")
    {
        byte[] largeBuf = new byte[chunkLength];
        read = stream.Read(largeBuf, 0, (int)chunkLength);
        string iFiction = Encoding.UTF8.GetString(largeBuf);
        Console.WriteLine(iFiction);
    }
    else
    {
        stream.Seek(chunkLength, SeekOrigin.Current);
    }
}
IFmd: 1518 bytes
<?xml version="1.0" encoding="UTF-8"?>
<ifindex version="1.0" xmlns="http://babel.ifarchive.org/protocol/iFiction/">
<story>
<identification>
<format>zcode</format>
<ifid>ZCODE-3-040208-2BC1</ifid>
</identification>
<bibliographic>
<title>Galatea</title>
<author>Emily Short</author>
<language>en-US</language>
<firstpublished>2000</firstpublished>
...
(Output above truncated for brevity.)
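Printing the XML is fine for a test, but the app will eventually need individual fields. As a hedged sketch of one way to do that, the snippet below uses LINQ to XML (System.Xml.Linq) to pull the title and author out of a trimmed-down stand-in for the iFiction string. The element names and namespace are taken from the output above; the variable names and the "(unknown)" fallback are my own assumptions.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// A trimmed-down stand-in for the string decoded from the IFmd chunk.
string sampleXml =
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
    "<ifindex version=\"1.0\" xmlns=\"http://babel.ifarchive.org/protocol/iFiction/\">" +
    "<story><bibliographic>" +
    "<title>Galatea</title><author>Emily Short</author>" +
    "</bibliographic></story></ifindex>";

// Every element lives in the iFiction namespace, so lookups must include it.
XNamespace ns = "http://babel.ifarchive.org/protocol/iFiction/";
var biblio = XDocument.Parse(sampleXml)
    .Descendants(ns + "bibliographic")
    .FirstOrDefault();

// Fall back to a placeholder if an element is missing.
string title = biblio?.Element(ns + "title")?.Value ?? "(unknown)";
string author = biblio?.Element(ns + "author")?.Value ?? "(unknown)";

Console.WriteLine($"{title} by {author}");  // Galatea by Emily Short
```

Forgetting the namespace is the usual stumbling block here: looking up a bare "title" would find nothing, because the default xmlns on ifindex applies to every child element.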
And with that, we have successfully pulled everything we needed out of this blorb! As a final test, let’s try running the same code against another blorb, one far larger and more complicated: Counterfeit Monkey (2012), also by Emily Short.
FORM
11314616
IFRS
RIdx: 1648 bytes
Index contains 137 resources
Exec 0: offset 1668
Pict 1: offset 7911912
Pict 3: offset 8910452
Pict 4: offset 8911312
...
Pict 134: offset 11243556
Pict 135: offset 11253122
Pict 136: offset 11262712
Data 9998: offset 11275446
GLUL: 7907072 bytes
IFmd: 3143 bytes
<?xml version="1.0" encoding="UTF-8"?>
<ifindex version="1.0" xmlns="http://babel.ifarchive.org/protocol/iFiction/">
<story>
<identification>
<ifid>7B5A779B-4653-43DB-A516-F475DDC12987</ifid>
<format>glulx</format>
</identification>
<bibliographic>
<title>Counterfeit Monkey</title>
<author>Emily Short</author>
<headline>A Removal</headline>
<genre>Fiction</genre>
<firstpublished>2021</firstpublished>
...
</story>
</ifindex>
?Fsp: 1660944384 bytes
Well, how about that! It… almost worked. It was all going fine, up until the frontispiece: the chunk identifier is off by one character, and there’s no way it could be roughly 1.5 GiB long (the blorb itself is only about 11 MiB). It turns out there was one more important detail that I missed in the blorb spec:
If a chunk has an odd length, it must be followed by a single padding byte whose value is zero. (This padding byte is not included in the chunk length m.) This allows all chunks to be aligned on even byte boundaries.
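IFmd, the last chunk to be successfully interpreted, just happened to be 3,143 bytes long. That is an odd number, so the padding byte that follows it was read as the first byte of the next header. We can fix that with one more change at the end of the while loop: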
    else
    {
        stream.Seek(chunkLength, SeekOrigin.Current);
    }
    if (chunkLength % 2 != 0)
        stream.Seek(1, SeekOrigin.Current);
}
</story>
</ifindex>
Fspc: 4 bytes
Resource #1
PNG : 998532 bytes
PNG : 851 bytes
PNG : 619 bytes
PNG : 9557 bytes
...
PNG : 9582 bytes
PNG : 9455 bytes
RDes: 3261 bytes
FORM: 39170 bytes
Much better.
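To see the padding rule in isolation, here is a small self-contained sketch that walks a made-up byte sequence containing one odd-length chunk followed by an even-length one. ListChunks and the AAAA/BBBB identifiers are invented for the example, and Stream.ReadAtLeast assumes .NET 7 or later.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: lists (type, length) for each chunk from the current
// position to the end of the stream, skipping the pad byte after odd lengths.
List<(string Type, uint Length)> ListChunks(Stream stream)
{
    var chunks = new List<(string Type, uint Length)>();
    byte[] header = new byte[8];

    // Stop as soon as a full 8-byte header can no longer be read.
    while (stream.ReadAtLeast(header, 8, throwOnEndOfStream: false) == 8)
    {
        string type = Encoding.ASCII.GetString(header, 0, 4);
        uint length = BinaryPrimitives.ReadUInt32BigEndian(header.AsSpan(4, 4));
        chunks.Add((type, length));

        // (length & 1) is 1 for odd lengths and 0 for even ones.
        stream.Seek(length + (length & 1), SeekOrigin.Current);
    }
    return chunks;
}

// Made-up data: "AAAA" holds 3 bytes plus a pad byte, "BBBB" holds 2 bytes.
byte[] fake =
{
    (byte)'A', (byte)'A', (byte)'A', (byte)'A', 0, 0, 0, 3, 9, 9, 9,
    0, // padding byte, not counted in the chunk length
    (byte)'B', (byte)'B', (byte)'B', (byte)'B', 0, 0, 0, 2, 7, 7,
};
using var demo = new MemoryStream(fake);

var found = ListChunks(demo);
foreach (var (type, length) in found)
    Console.WriteLine($"{type}: {length} bytes");
```

Without the extra term in the Seek call, the second header would be read starting from the padding byte, which is the same off-by-one failure seen with Counterfeit Monkey.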
A few points of interest for this second file:
- This is a Glulx game; the Exec resource is a GLUL chunk instead of a ZCOD chunk.
- The trailing space in the PNG chunk identifier is not a mistake. All chunk identifiers must be exactly four bytes, so if you want to use fewer than four characters you have to pad it out with spaces.
- The top-level FORM has another FORM nested inside it (which, yes, is allowed), corresponding to the Data resource #9998 at the end of the resource index.
I’m sure I’ll need to refine this code further as I incorporate it into my actual program, but I’m pretty satisfied to see it work for two very different files.
This exercise was pretty intimidating at first, starting off with a 9,000-word spec (which, admittedly, I skimmed) and a blinking cursor, but it turned out to be much simpler than I expected—fewer than 100 lines of code to get this far! Given the (relatively) low-level logic involved, it was also incredibly gratifying to implement each successive piece and actually see more or less the output I expected.
Footnotes
[2] Even though GTK exposes a C API, I decided to write my app in C#, that being the language with which I am most comfortable. I didn’t want to be struggling with a new frontend framework and an unfamiliar backend language/ecosystem at the same time. ↩︎
[3] The Interactive Fiction Technology Foundation does maintain a command-line utility written in C called babel, which can extract metadata from a wide range of IF game file formats, including blorbs. I probably could compile the blorb-related code from babel into a shared-object file and reference it with extern functions, but I felt like that would have been more painful than starting from scratch in C#. ↩︎
[4] Well, they have to appear in an order such that the byte offsets listed in the resource index are correct. However, there is no requirement that the resource numbers (used within the game’s code) correspond to the physical order of the data chunks within the blorb. ↩︎