Publication: 2025-08-23

Authoring binary data files, using an assembler

Whenever I design or implement a text-based file format, it is quite easy to write a bunch of test files by hand. Likewise, whenever I need to tweak text-based files, it is simply a matter of editing them with a regular text editor.

On the other hand, binary file formats have been quite a chore to manipulate by hand. Either I hope that someone has already grouped a few test files, or I must write a program to generate my binary file in the first place - debugging it in the process... 😐

So, does this prove that text-based file formats are superior? Once you start considering performance, and especially when writing a parser in a language without good string support such as C, binary files suddenly become very attractive!

Wouldn't it be nice to combine the ease of manipulating text-based file formats with the ease of programmatically dealing with binary files?

I've recently remembered that there exists a specialized tool that takes textual files and converts them into binary files... Maybe we can use an assembler to generate these, without even caring about making an actual executable file! 😼

This blog post explains my journey through this idea.



Introduction to NASM

At their core, all assemblers take a textual description and translate it into a binary file, almost 1-to-1. For instance, the line below outputs a sequence of 9 bytes:

db 1, 2, 'H', 'e', 'l', 'l', 'o', 8, 9

Assemblers also provide additional features:

  • The textual description can contain machine code instructions, the assembler having a correspondence table between these instructions and the target instruction set's binary encoding. This is usually the most valuable part of an assembler, but it is entirely irrelevant for our journey! 😇

  • Pseudo-instructions and user-defined macros give more expressive power to us, authors. Mathematical expressions can be used, to avoid hardcoding specific values. This point is the main improvement over a traditional hex editor.

  • "Common" output formats have builtin formatters such that most metadata are automatically written. This is mostly irrelevant for us, as the supported formats are part of a compiler's toolchain... So we'll just ask to output a raw, binary file instead, without metadata. There is however a possible application, to embed the binary files directly inside an executable. More on that at the end of this article!

I have chosen to use NASM (Netwide Assembler), as I find its documentation well-organized, it seems very expressive at first glance, and it is a small standalone program instead of being part of a full toolchain.



Example: DDS image file

As an introduction, let's generate a DDS image file, to familiarize ourselves with the assembler's capabilities.

DDS is a file format to hold images in the context of a 3D graphics pipeline, so in addition to plain 2D images, it supports 3D images, image arrays, mipmaps, etc. In our case, we will just author a simple 2D image of a smiley 😎.

Here is the DDS specification, which is described as plain C structures.

A minimal DDS file is composed of two sections: a header structure containing metadata, and the raw pixels' values.


DDS Header

The header's definition is:

// In Windows terminology, DWORD ("Double-Word") is a 4-byte unsigned integer.

DWORD dwMagic;                            // Always "DDS " ASCII bytes.

struct DDS_HEADER {
    DWORD           dwSize;               // Size in bytes of DDS_HEADER.
    DWORD           dwFlags;              // 0x100F for a simple 2D image
    DWORD           dwHeight;             // Height in pixels.
    DWORD           dwWidth;              // Width in pixels.
    DWORD           dwPitchOrLinearSize;  // Size in bytes of a row of pixels.
    DWORD           dwDepth;              // Unused.
    DWORD           dwMipMapCount;        // Unused.
    DWORD           dwReserved1[11];      // Unused.
    struct DDS_PIXELFORMAT {
        DWORD       dwSize;               // Size in bytes of DDS_PIXELFORMAT.
        DWORD       dwFlags;              // 0x0041 = RGB | ALPHAPIXELS.
        DWORD       dwFourCC;             // Unused.
        DWORD       dwRGBBitCount;        // 32 bits per pixel.
        DWORD       dwRBitMask;           // 0x000000FF for RGBA
        DWORD       dwGBitMask;           // 0x0000FF00 for RGBA
        DWORD       dwBBitMask;           // 0x00FF0000 for RGBA
        DWORD       dwABitMask;           // 0xFF000000 for RGBA
    } ddspf;
    DWORD           dwCaps;               // 0x1000 = DDSCAPS_TEXTURE for simple 2D image.
    DWORD           dwCaps2;              // 0 for simple 2D image.
    DWORD           dwCaps3;              // Unused.
    DWORD           dwCaps4;              // Unused.
    DWORD           dwReserved2;          // Unused.
} header;

This translates pretty much directly to the NASM syntax below. The NASM source is read top to bottom, describing the binary file from beginning to end. (Note that semi-colons ; start comments, which are ignored by the assembler.)

dds_start:

WIDTH equ 15
HEIGHT equ 15

dwMagic:
    db 'D', 'D', 'S', ' '

header:                         ; DDS_HEADER
    dd header_end - header      ; dwSize
    dd 0x0000_100F              ; dwFlags
    dd HEIGHT                   ; dwHeight
    dd WIDTH                    ; dwWidth
    dd WIDTH * 4                ; dwPitch
    dd 0                        ; UNUSED: dwDepth
    dd 0                        ; UNUSED: dwMipMapCount
    dd 11 dup 0                 ; UNUSED: dwReserved1[11]

    ddspf:                      ; DDS_PIXELFORMAT
        dd ddspf_end - ddspf    ; dwSize
        dd 0x0000_0041          ; dwFlags
        dd 0                    ; UNUSED: dwFourCC
        dd 32                   ; dwRGBBitCount
        dd 0x0000_00FF          ; dwRBitMask
        dd 0x0000_FF00          ; dwGBitMask
        dd 0x00FF_0000          ; dwBBitMask
        dd 0xFF00_0000          ; dwABitMask
    ddspf_end:

    dd 0x0000_1000              ; dwCaps
    dd 0                        ; dwCaps2
    dd 0                        ; UNUSED: dwCaps3
    dd 0                        ; UNUSED: dwCaps4
    dd 0                        ; UNUSED: dwReserved2
header_end:

This demonstrates the basic capabilities of an assembler:

  • db makes the assembler output 1-byte integers.
  • dd makes the assembler output 4-byte integers.
  • N dup ... makes the assembler output the same sequence N times.
  • KEY equ N makes the assembler associate the name KEY to the integer N.
  • LABEL: makes the assembler associate the name LABEL to the current file position, that is, the number of bytes which have been output thus far.

Info

Labels are very useful, both for clarifying the file's structure, and to avoid hardcoding arbitrary sizes. For instance, in the previous code sample, the expression ddspf_end - ddspf computes the size in bytes of the DDS_PIXELFORMAT structure. Labels are also valuable for file formats which include a table of contents, such that parsers can directly jump to the file location where the data they want is located (an example being the "Index" table of the KTX2 image file format).
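As a minimal sketch of that table-of-contents idea, the snippet below describes a made-up container: the DEMO magic and every label name are invented for this example, not taken from KTX2 or any real format. Each index entry is an (offset, size) pair computed from labels, so growing a payload never requires fixing numbers by hand.

```nasm
; Hypothetical container layout, invented for illustration only.
file_start:
    db 'D', 'E', 'M', 'O'               ; magic bytes (made up)
    dd index_end - index                ; size in bytes of the index table

index:                                  ; one (offset, size) pair per payload
    dd first - file_start,  first_end - first
    dd second - file_start, second_end - second
index_end:

first:
    db 'first payload'
first_end:

second:
    db 1, 2, 3, 4, 5
second_end:
```

Assembled as a raw binary, the first payload should start at offset 24: 4 bytes of magic, 4 bytes for the index size, then 16 bytes of index entries.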


DDS Pixel data

The emoji I want to display contains three colors; each color is represented by 4 bytes: Red, Green, Blue, Alpha.

Instead of hardcoding a sequence of 900 bytes by hand, the assembler lets us define a symbol for each color; I associate each color with a one-letter name, such that the pixel art is obvious from the source assembly (yes, a single dot . is a valid identifier in NASM 😅).

%define X 0x00, 0x00, 0x00, 0xFF  ; Black
%define . 0xF1, 0xC4, 0x0F, 0xFF  ; Yellow
%define _ 0xFF, 0xFF, 0xFF, 0xFF  ; White

db _, _, _, _, _, X, X, X, X, X, _, _, _, _, _
db _, _, _, X, X, ., ., ., ., ., X, X, _, _, _
db _, _, X, ., ., ., ., ., ., ., ., ., X, _, _
db _, X, ., ., ., ., ., ., ., ., ., ., ., X, _
db _, X, ., ., ., ., ., ., ., ., ., ., ., X, _
db X, X, X, X, X, X, X, X, X, X, X, X, X, X, X
db X, ., X, _, _, X, X, X, X, _, _, X, X, ., X
db X, ., X, _, X, X, X, ., X, _, X, X, X, ., X
db X, ., ., X, X, X, ., ., ., X, X, X, ., ., X
db X, ., ., ., ., ., ., ., ., ., ., ., ., ., X
db _, X, ., ., ., ., ., ., ., ., X, ., ., X, _
db _, X, ., ., ., X, X, X, X, X, ., ., ., X, _
db _, _, X, ., ., ., ., ., ., ., ., ., X, _, _
db _, _, _, X, X, ., ., ., ., ., X, X, _, _, _
db _, _, _, _, _, X, X, X, X, X, _, _, _, _, _

%undef X
%undef .
%undef _

dds_end:

This demonstrates preprocessor capabilities:

  • %define KEY ... makes NASM replace KEY by an arbitrary sequence of text.
  • %undef KEY makes NASM forget the previous %define.
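%define also accepts parameters, which can shorten repetitive color definitions. The sketch below is only an illustration: the GRAY macro is a hypothetical helper, not part of the smiley source.

```nasm
; GRAY(v) is a made-up helper: it expands to an opaque gray pixel (R, G, B, A).
%define GRAY(v) v, v, v, 0xFF

db GRAY(0x00), GRAY(0x80), GRAY(0xFF)   ; three pixels: black, mid-gray, white

%undef GRAY
```

After preprocessing, the db line holds 12 comma-separated byte values, exactly as if they had been typed by hand.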


Result

Let's invoke NASM:

nasm smiley.dds.nasm

This generates smiley.dds:

(Using UTEX.js for loading a DDS into the browser)



NASM cheatsheet

Here is a summary of the NASM pseudo-instructions I find the most useful for authoring binary data; the full documentation remains interesting!

Outputting bytes

(Documentation)

NASM outputs byte sequences using db, dw, dd, dq.

Byte size      Type name      Instruction
1 ( 8 bits)    Byte           db ...
2 (16 bits)    Word           dw ...
4 (32 bits)    Double Word    dd ...
8 (64 bits)    Quad Word      dq ...

  • Instead of repeating the same instruction, you can give it a list of values: dw 1, 2, 3, 4 is equivalent to having 4 dw instructions, outputting 8 bytes.
  • A string argument is output byte by byte (the characters' UTF-8 code units), then padded with zero bytes up to a multiple of the element size. Thus, db 'hello' outputs 5 bytes, whereas dw 'hello' outputs 6 bytes (the 5 characters plus one padding byte), not five 2-byte words.
  • dX N dup ... is equivalent to repeating the same dX ... instruction N times. Thus, db 4 dup 'hello' will output 20 bytes.
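String arguments deserve a closer look, because the number of bytes they produce depends on the pseudo-instruction. The sketch below follows the NASM manual's description of string constants; the __?utf16?__ operator assumes NASM 2.15 or later (older releases spell it __utf16__).

```nasm
db 'hello'                  ; 5 bytes: 68 65 6C 6C 6F
dw 'hello'                  ; 6 bytes: same 5 bytes, zero-padded to a whole word
dd 'hello'                  ; 8 bytes: same 5 bytes, zero-padded to two dwords
dw __?utf16?__('hello')     ; 10 bytes: one little-endian UTF-16 unit per character
```

If a format really stores one 16-bit unit per character, the last form is the one to reach for; plain dw with a string only pads the tail.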

Another handy pseudo-instruction is align N, db 0; this makes NASM output as many 0 bytes as needed such that the file position becomes a multiple of N (N must be a power of two). While this seems arbitrary, this is required by some file formats, in order to make parsers more efficient by having a file format which can directly be used in-memory as a data structure.
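Here is a small sketch of align at work, assuming a raw binary output that starts at position 0. The record layout and label names are invented for the example.

```nasm
record_a:
    db 'A', 'B', 'C'        ; 3 bytes, position is now 3
    align 4, db 0           ; 1 zero byte of padding, position is now 4

record_b:
    dd 0xCAFE_F00D          ; 4 bytes, position is now 8
    db 0x01                 ; 1 byte, position is now 9
    align 8, db 0           ; 7 zero bytes of padding, position is now 16

record_end:
```

The padding is recomputed on every assembly, so inserting a byte into record_a shifts the padding rather than the alignment of record_b.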

Finally, the incbin ... instruction lets you output the bytes fetched from another binary file. This can be useful for authoring archive files which group multiple binary data together.


Preprocessor

(Documentation)

NASM's preprocessor is a way of modifying the syntax to have more readable expressions. The preprocessor is very powerful (much more than the C preprocessor). Here is a partial list of its capabilities:

  • %define KEY ... replaces every subsequent KEY identifier by whatever text ... stands for; the substitution is done before the actual source parsing, so you can deviate from traditional syntax, as long as the resulting expansion is valid NASM syntax.

  • %if... to conditionally remove entire blocks of NASM code; this may be useful while authoring, to quickly switch between e.g. two data payloads.

  • %include ... to insert another NASM source file in the middle of the file.

  • %rep / %endrep / %exitrep make the preprocessor capable of complex loop structures. As an example, you can output the Fibonacci sequence! 😲

  • %!variable reads an environment variable. 🫢
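The Fibonacci loop mentioned above can be sketched as follows, closely following the example given in the NASM manual for %rep. The fib_count name is an addition for illustration, and the preprocessor variables i, j, k are undefined at the end so that they do not leak into the rest of the file.

```nasm
fibonacci:
%assign i 0
%assign j 1
; 100 is only an upper bound; the loop exits earlier through %exitrep.
%rep 100
  ; Stop as soon as the next value no longer fits in 16 bits.
  %if j > 65535
    %exitrep
  %endif
    dw j
  %assign k j + i
  %assign i j
  %assign j k
%endrep

; Number of 2-byte entries emitted by the loop above.
fib_count equ ($ - fibonacci) / 2

%undef i
%undef j
%undef k
```

%assign differs from %define in that it evaluates its argument as a number immediately, which is what makes loop counters possible.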



Embedding data in executable

As a final treat, let's remember that an assembler's first role is to produce an executable program, thus assemblers know the file formats used in the build toolchain. This makes it easy to embed our binary file in an executable, which means it is loaded into memory along with the program, with no need for file I/O. The main issue is for the code to find where in memory our binary data resides.

Currently, the smiley.dds.nasm file structure is:

dds_start:
...
dds_end:

dds_start and dds_end are "local symbols"; this means once the binary content has been generated, these symbols have no more relevance and the assembler drops them. We can ask NASM instead to consider them as "global symbols"; NASM will output metadata regarding them in formats supporting it.

global dds_start
global dds_end

dds_start:
...
dds_end:

Then, we can write a C program that uses these symbols, to dump the binary data into standard output:

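#include <stdio.h>

// Symbols exported by the NASM object file: the first byte of the embedded
// data, and the byte just past its end.
extern unsigned char dds_start[];
extern unsigned char dds_end[];

int main(void)
{
    unsigned dds_size = (unsigned)(dds_end - dds_start);
    printf("DDS file size: %u bytes\n", dds_size);

    // Hex dump: 64 bytes per line, in groups of 4 bytes.
    for (unsigned i = 0; i < dds_size; ++i) {
        if (i % 64 == 0)
            printf("\n%08x:  ", i);
        if (i % 4 == 0)
            printf(" ");
        printf("%02x", dds_start[i]);
    }
    printf("\n");
    return 0;
}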

Finally, NASM can assemble into object files, which can be given to the traditional compiler/linker:

# Example for clang on 64-bit Linux (use -f macho64 on macOS, -f win64 on Windows)
nasm smiley.dds.nasm -o smiley.dds.o   -f elf64
clang smiley_main.c smiley.dds.o -o smiley
./smiley

Output:

DDS file size: 1028 bytes

00000000:   44445320 7c000000 0f100000 0f000000 0f000000 3c000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000040:   00000000 00000000 00000000 20000000 41000000 00000000 20000000 ff000000 00ff0000 0000ff00 000000ff 00100000 00000000 00000000 00000000 00000000
00000080:   ffffffff ffffffff ffffffff ffffffff ffffffff 000000ff 000000ff 000000ff 000000ff 000000ff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
000000c0:   ffffffff ffffffff 000000ff 000000ff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff 000000ff 000000ff ffffffff ffffffff ffffffff ffffffff ffffffff
00000100:   000000ff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff 000000ff ffffffff ffffffff ffffffff 000000ff f1c40fff
00000140:   f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff 000000ff ffffffff ffffffff 000000ff f1c40fff f1c40fff
00000180:   f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff 000000ff ffffffff 000000ff 000000ff 000000ff 000000ff 000000ff
000001c0:   000000ff 000000ff 000000ff 000000ff 000000ff 000000ff 000000ff 000000ff 000000ff 000000ff 000000ff f1c40fff 000000ff ffffffff ffffffff 000000ff
00000200:   000000ff 000000ff 000000ff ffffffff ffffffff 000000ff 000000ff f1c40fff 000000ff 000000ff f1c40fff 000000ff ffffffff 000000ff 000000ff 000000ff
00000240:   f1c40fff 000000ff ffffffff 000000ff 000000ff 000000ff f1c40fff 000000ff 000000ff f1c40fff f1c40fff 000000ff 000000ff 000000ff f1c40fff f1c40fff
00000280:   f1c40fff 000000ff 000000ff 000000ff f1c40fff f1c40fff 000000ff 000000ff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff
000002c0:   f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff 000000ff ffffffff 000000ff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff
00000300:   000000ff f1c40fff f1c40fff 000000ff ffffffff ffffffff 000000ff f1c40fff f1c40fff f1c40fff 000000ff 000000ff 000000ff 000000ff 000000ff f1c40fff
00000340:   f1c40fff f1c40fff 000000ff ffffffff ffffffff ffffffff 000000ff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff
00000380:   000000ff ffffffff ffffffff ffffffff ffffffff ffffffff 000000ff 000000ff f1c40fff f1c40fff f1c40fff f1c40fff f1c40fff 000000ff 000000ff ffffffff
000003c0:   ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff 000000ff 000000ff 000000ff 000000ff 000000ff ffffffff ffffffff ffffffff ffffffff
00000400:   ffffffff

Note

This trick is useful for embedding assets in a game's executable, or more generally whenever you don't want to or can't rely on file I/O to find your data: self-contained demos, embedded programming...
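One possible refinement, sketched here under the assumption of an ELF target such as elf64: without a section directive, NASM places everything in its default .text section, next to executable code. Declaring a read-only data section first keeps the asset with the program's other constants. The db line below is only a stand-in for the real file content.

```nasm
; Assumption: ELF output (-f elf64). Other object formats may use a different
; name for their read-only data section.
section .rodata

global dds_start
global dds_end

dds_start:
    db 'D', 'D', 'S', ' '   ; placeholder for the full DDS content
dds_end:
```

On macOS, C symbols usually carry a leading underscore, so the labels would have to be named _dds_start and _dds_end, or the prefix added with NASM's --gprefix _ option; treat this as something to check against your own toolchain.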


If you find this article useful, feel free to send me an email or share this article. 😊

Thanks for reading, have a nice day!