asmx semi-generic assembler

FOREWORD

Okay, so it's not really generic, just semi-generic. This started from an 8080 assembler I wrote in Turbo Pascal back in my college days when I had a class where I had to write an 8080 emulator. The assembler wasn't part of the class, but no way was I going to hand-assemble code again like I did back in my early TRS-80 days. Then when I started dabbling with ColecoVision and Atari 2600 code, I made a Z-80 and a 6502 assembler from it.

But Turbo Pascal and MS-DOS are way out of style now, so I ported the code to plain standard C. It should run on any Unix-like operating system. It should also run on any operating system that provides Unix-style file and command-line handling. In particular, it runs fine under Mac OS X, which is what I use on my laptop.

Porting it to C wasn't enough, though. I had added some nice features like macros to the 6502 assembler and I wanted them in the Z-80 assembler too. But I didn't want to have to copy and paste code every time I added a new feature. So I turned the code inside out, and made the common code into a gigantic .h file. This made writing an assembler for a new CPU easy enough that I was able to write a 6809 assembler in one day, plus another day for debugging.

Unlike most "generic" assemblers, I make an effort to conform to the standard mnemonics and syntax for a CPU, as you'd find them in the chip manufacturer's documentation. I'm a bit looser on the pseudo-ops, trying to be inclusive whenever possible so that existing code has a better chance of working with fewer changes, especially code written back in the '80s.

This is a two-pass assembler. That means that on the first pass it figures out where all the labels go, then on the second pass it generates code. I know there are popular multi-pass assemblers out there (like DASM for 6502), and they have their own design philosophy. I'm sticking with the philosophy that was used by the old EDTASM assemblers for the TRS-80. There are a few EDTASM-isms that you might notice, if you know what to look for.

But being a two-pass assembler, there are some things you can't do. You can't ORG to a label that hasn't been defined yet, because on the second pass it'll have a value, and your code will go into a different location, and all your labels will be at the wrong address. This is called a "phase error". You also can't use a label that hasn't been defined yet with DS or ALIGN because they affect the current location.
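
As a rough illustration of why an ORG to a forward-referenced label causes a phase error, here is a toy two-pass model in Python. It is not asmx's code: the tuple format, the function names, and the choice to treat an undefined label as 0 on the first pass are all assumptions made for the sketch.

```python
def run_pass(program, symbols):
    """Walk the program once and return the address assigned to each label.

    program is a list of (label, op, arg) tuples; only ORG and DS are modelled.
    symbols holds the label values known from the previous pass (empty on pass 1).
    """
    pc = 0
    found = {}
    for label, op, arg in program:
        if label:
            found[label] = pc
        if op == "ORG":
            # Assumption: an undefined label evaluates to 0 on the first pass.
            pc = symbols.get(arg, 0) if isinstance(arg, str) else arg
        elif op == "DS":
            pc += arg
    return found


def phase_errors(program):
    """Return the labels whose address differs between pass 1 and pass 2."""
    pass1 = run_pass(program, {})
    pass2 = run_pass(program, pass1)
    return sorted(name for name in pass1 if pass1[name] != pass2[name])


good = [("", "ORG", 0x100), ("START", "DS", 2), ("NEXT", "DS", 1)]
bad = [("", "ORG", "LATER"), ("FIRST", "DS", 4),
       ("", "ORG", 0x200), ("LATER", "DS", 1)]

print(phase_errors(good))  # []
print(phase_errors(bad))   # ['FIRST']
```

In the second program, FIRST lands at 0 on the first pass and at 0x200 on the second, which is exactly the mismatch described above.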

Some CPUs like the 6502 and 6809 have different instructions which can provide smaller, faster code based on the size of an operand. To make this work, the assembler keeps an extra flag in the symbol table during the second pass, which tells if the symbol was known yet at this point in the first pass. Then the assembler can know to use the longer form to avoid a phase error. The 6809 assembler syntax uses "<" (force 8 bits) and ">" (force 16 bits) to override this decision. The 6502 assembler can also override this with a ">" before an absolute or absolute-indexed address operand. (Note that this usage is different from "<" and ">" as a high/low byte of a word value.)
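
The decision described above can be sketched in a few lines of Python. This is only a model under stated assumptions: the function name and arguments are made up for illustration, and it assumes a 6502-style zero page (or a 6809 direct page of zero), so "short form" simply means the value fits in one byte.

```python
def operand_bytes(value, known_in_pass1, force=None):
    """Pick a 1-byte or 2-byte address operand.

    force is "<" (8 bits), ">" (16 bits), or None to let the assembler decide.
    """
    if force == "<":
        return 1
    if force == ">":
        return 2
    # A symbol that was still unknown at this point in pass 1 was sized as
    # 16 bits then, so pass 2 must keep the long form to avoid a phase error.
    if known_in_pass1 and 0 <= value <= 0xFF:
        return 1
    return 2


print(operand_bytes(0x80, known_in_pass1=True))   # 1
print(operand_bytes(0x80, known_in_pass1=False))  # 2
```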

Some assemblers can only output code in binary. This might be nice if you're making a video game cartridge ROM, but it's really not very flexible. Intel and Motorola both came up with very nice text file formats which don't require any kind of padding when you do an ORG instruction, and don't require silly "segment" definitions just to keep DS instructions from generating object code. Then, following the Unix philosophy of making tools that can connect to other tools, you can pipe the object code to another utility which makes the ROM image.
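
For readers who have not looked inside an Intel hex file, the following Python sketch builds single records using the standard layout (byte count, 16-bit address, record type, data, checksum). It is based on the general Intel hex format, not on asmx's source, and the function name is an assumption. Because every record carries its own address, an ORG just starts a record at a new address and no padding is needed.

```python
def intel_hex_record(address, data, record_type=0x00):
    """Format one Intel hex record: ':' count address type data checksum."""
    body = bytes([len(data), (address >> 8) & 0xFF, address & 0xFF, record_type])
    body += bytes(data)
    checksum = (-sum(body)) & 0xFF  # two's complement of the byte sum
    return ":" + (body + bytes([checksum])).hex().upper()


print(intel_hex_record(0x0030, [0x02, 0x33, 0x7A]))    # :0300300002337A1E
print(intel_hex_record(0x0000, [], record_type=0x01))  # :00000001FF
```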

Anyhow, it works pretty well for what I want it to do.

- Bruce -


HOW TO BUILD asmx

The standard way to build asmx is using the makefile:

  make

This will create the asmx binary in the src sub-directory. That's it. Now you might want to copy it to your /usr/local/bin or ~/bin directory, but that's your choice.

If you are using a Unix-like OS such as Linux, OS X, or BSD, you can also use:

  make install

This will install the binaries to ~/bin, unless you change the makefile to install it somewhere else. Symbolic links are generated so that each CPU assembler can be used with a separate command.

If you can't use the makefile, the simplest way is this:

  gcc *.c -o asmx

Windows users should install Cygwin as the easiest way to get GCC.


RUNNING IT

Just give it the name of your assembler source file, and whatever options you want.

asmx [options] srcfile

Here are the command line options:

    --                  end of options
    -e                  show errors to screen
    -w                  show warnings to screen
    -l [filename]       make a listing file, default is srcfile.lst
    -o [filename]       make an object file, default is srcfile.hex or srcfile.s9
    -d label[[:]=value] define a label, and assign an optional value
    -s9                 output object file in Motorola S9 format (16-bit address)
    -s19                output object file in Motorola S9 format (16-bit address)
    -s28                output object file in Motorola S9 format (24-bit address)
    -s37                output object file in Motorola S9 format (32-bit address)
    -b [base[-end]]     output object file as binary with optional base/end addresses
    -t [reclen]         output object file in TRSDOS format (implies -C Z80)
    -T [reclen]         output object file as TRS-80 cassette file (implies -C Z80)
    -c                  send object code to stdout
    -C cputype          specify default CPU type (currently 6502)
    -@                  causes the old EDTASM pass/errors messages to be printed

Example:

asmx -l -o -w -e program.asm

This assembles the source file "program.asm", shows warnings and errors to the screen, creates a listing file "program.asm.lst", and puts the object code in an Intel hex format file named "program.asm.hex". (Binary files get named "program.asm.bin", and Motorola S9 files get an extension of .s9, .s19, .s28, or .s37.)

Notes:

The '--' option is needed when you use -l, -o, or -b as the last option on the command line with no parameters, so that they don't try to eat up your source file name. It's really better to just put -l and -o first in the options.

The value in -d must be a number. No expressions are allowed. The valid forms are:

    -d label          defines the label as EQU 0
    -d label=value    defines the label as EQU value
    -d label:=value   defines the label as SET value
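
The three forms can be told apart by looking for ":=" before "=". The Python sketch below shows one way to do that; the function name is hypothetical, and int(text, 0) is only a stand-in for asmx's own number parser, so it accepts Python-style literals rather than everything asmx accepts.

```python
def parse_define(arg):
    """Split a -d argument into (label, kind, value)."""
    if ":=" in arg:
        label, _, text = arg.partition(":=")
        kind = "SET"
    elif "=" in arg:
        label, _, text = arg.partition("=")
        kind = "EQU"
    else:
        return arg, "EQU", 0
    # Assumption: int(text, 0) stands in for asmx's own number parser.
    return label, kind, int(text, 0)


print(parse_define("DEBUG"))        # ('DEBUG', 'EQU', 0)
print(parse_define("BASE=0xC000"))  # ('BASE', 'EQU', 49152)
print(parse_define("COUNT:=5"))     # ('COUNT', 'SET', 5)
```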

By default, object code is written as an Intel hex file unless the -s or -b option is specified.

The value in -b specifies the start address for your binary file. If you are making code for a ROM at address range 0xC000-0xFFFF, use "-b 0xC000-0xFFFF" and the first byte of the object file will be whatever belongs at 0xC000. Anything at a lower address is not written to the file, any gaps are filled with 0xFF, and no bytes past 0xFFFF are written to the file. The object file is not padded to the full address range. Be careful about using large ORG values without an end address, or the resulting binary file could become VERY large.
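
The rules for -b can be modelled as follows. This is a Python sketch of the behavior described above, not asmx's implementation; the function name and the dictionary of address-to-byte pairs are assumptions, as is filling a gap at the very start of the range with 0xFF.

```python
def binary_image(code, base=0, end=None):
    """Build the -b output from a {address: byte} mapping."""
    # Drop anything below base or (when an end is given) above end.
    kept = {a: b for a, b in code.items()
            if a >= base and (end is None or a <= end)}
    if not kept:
        return b""
    top = max(kept)  # the file stops at the last byte actually generated
    return bytes(kept.get(a, 0xFF) for a in range(base, top + 1))


code = {0xBFFF: 0x01, 0xC000: 0xA9, 0xC003: 0x60, 0x10000: 0x02}
print(binary_image(code, 0xC000, 0xFFFF).hex())  # a9ffff60
print(len(binary_image({0x8000: 0x01})))         # 32769
```

The last line shows the warning in practice: one byte at a high ORG with no base address still produces a 32 KB file.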

The -c and -o options are incompatible. Attempting to use both will result in an error. Normal screen output (pass number, total errors, error messages, etc.) always goes to stderr.


EXPRESSIONS

Whenever a value is needed, it goes through the expression evaluator. The expression evaluator will attempt to do the arithmetic needed to get a result.

Unary operations take a single value and do something with it. The supported unary operations are:

    + val        positive of val
    - val        negative of val
    ~ val        bitwise NOT of val
    ! val        logical NOT of val (returns 1 if val is zero, else 0)
    < val        low 8 bits of val
    > val        high 8 bits of val
    ..DEF sym    returns 1 if symbol 'sym' has already been defined
    ..UNDEF sym  returns 1 if symbol 'sym' has not been defined yet
    ( expr )     parentheses for grouping sub-expressions
    [ expr ]     square brackets can be used as parentheses when necessary
    'c'          One or two character constants, equal to the ASCII value
    'cc'         of c or cc. In the two-byte case, the first character is the high byte.
    H(val)       high 8 bits of val; whitespace not allowed before '('
    L(val)       low 8 bits of val; whitespace not allowed before '('

NOTE: with the Z-80, (expr), H(val), and L(val) will likely not work at the start of an expression because of Z-80 operand syntax. Likewise with the 6809, <val and >val may have special meaning at the start of an operand.
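
To make the byte-selection operators and character constants concrete, here is a small Python model. The function names are made up, and taking "high 8 bits" to mean bits 8 to 15 of the value is an assumption that matches 16-bit addresses.

```python
def low(val):
    """Model of "<val" and L(val): the low 8 bits."""
    return val & 0xFF


def high(val):
    """Model of ">val" and H(val): bits 8-15 (assumed for 16-bit values)."""
    return (val >> 8) & 0xFF


def char_const(chars):
    """Model of 'c' and 'cc': the first character ends up in the high byte."""
    value = 0
    for ch in chars:
        value = (value << 8) | ord(ch)
    return value


print(hex(low(0x1234)), hex(high(0x1234)))            # 0x34 0x12
print(hex(char_const("A")), hex(char_const("AB")))    # 0x41 0x4142
```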

Binary operations take two values and do something with them. The supported binary operations are:

    x * y        x multiplied by y
    x / y        x divided by y
    x % y        x modulo y
    x + y        x plus y
    x - y        x minus y
    x << y       x shifted left by y bits
    x >> y       x shifted right by y bits
    x & y        bitwise x AND y
    x | y        bitwise x OR y
    x ^ y        bitwise x XOR y
    x = y        comparison operators, return 1 if condition is true
    x == y       (note that = and == are the same)
    x < y
    x <= y
    x > y
    x >= y
    x && y       logical AND of x and y (returns 1 if x != 0 and y != 0)
    x || y       logical OR of x and y (returns 1 if x != 0 or y != 0)

Numbers:

    .            current program counter
    *
    $
    $nnnn        hexadecimal constant
    nnnnH
    0xnnnn
    nnnn         decimal constant
    nnnnD
    nnnnO        octal constant
    aaabbbA      split-octal constant, where aaa and bbb are
                 the high and low bytes (377377A = 0xFFFF)
    %nnnn        binary constant
    nnnnB

Hexadecimal constants of the form "nnnnH" don't need a leading zero if there is no label defined with that name.
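
The constant formats above can be decoded with a short Python function. This is a sketch for illustration only: the function name is hypothetical, upper-casing the input is an assumption, and it ignores the program-counter symbols and the label lookup that the real assembler has to do for "nnnnH".

```python
def parse_number(text):
    """Convert one numeric constant (no operators) to an int."""
    t = text.upper()
    if t.startswith("0X"):
        return int(t[2:], 16)
    if t.startswith("$"):
        return int(t[1:], 16)
    if t.startswith("%"):
        return int(t[1:], 2)
    body, suffix = t[:-1], t[-1]
    if suffix == "H":
        return int(body, 16)
    if suffix == "O":
        return int(body, 8)
    if suffix == "A":
        # Split octal: the last three digits are the low byte.
        return (int(body[:-3] or "0", 8) << 8) | int(body[-3:], 8)
    if suffix == "B":
        return int(body, 2)
    if suffix == "D":
        return int(body, 10)
    return int(t, 10)


print(hex(parse_number("377377A")))  # 0xffff
print(parse_number("0FFH"), parse_number("%11111111"), parse_number("377O"))  # 255 255 255
```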

Operator precedence:

( ) [ ]
unary operators: + - ~ ! < > ..DEF ..UNDEF
* / %
+ -
< <= > >= = == !=
& && | || ^ << >>

WARNING:
Shifts and AND, OR, and XOR have a lower precedence than the comparison operators! You must use parentheses when combining them with comparison operators!

Example:
Use "(OPTIONS & 3) = 2", not "OPTIONS & 3 = 2". The former checks the lowest two bits of the label OPTIONS, the latter compares "3 = 2" first, which always results in zero.

Also, unary operators have higher precedence, so if X = 255, "<X + 1" is 256, but "<(X + 1)" is 0.
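
Both pitfalls can be checked by spelling out the grouping by hand. The Python lines below do that; Python's own precedence differs from asmx's, so every grouping is written with explicit parentheses, and the value of OPTIONS is just a made-up example.

```python
OPTIONS = 0b0110  # hypothetical label value; low two bits are 10

# "(OPTIONS & 3) = 2": mask first, then compare.
intended = int((OPTIONS & 3) == 2)

# "OPTIONS & 3 = 2": asmx evaluates "3 = 2" first, giving 0, then ANDs.
as_written = OPTIONS & int(3 == 2)

# Unary "<" binds tighter than "+".
X = 255
low_then_add = (X & 0xFF) + 1    # "<X + 1"
add_then_low = (X + 1) & 0xFF    # "<(X + 1)"

print(intended, as_written, low_then_add, add_then_low)  # 1 0 256 0
```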

With the 6809 assembler, a leading "<" or ">" often refers to an addressing mode. If you really want to use the low-byte or high-byte operator, surround the whole thing with parentheses, like "(<LABEL)". This does not apply to immediate mode, so "LDA #<LABEL" will use the low byte of LABEL.

NOTE: ..DEF and ..UNDEF do not work with local labels (the ones that start with '@' or '.').


LABELS AND COMMENTS

Labels must consist of alphanumeric characters or underscores, and must not begin with a digit. Examples are "FOO", "_BAR", and "BAZ_99". Labels are limited to 255 characters. Labels may also contain '$' characters, but must not start with one.

Labels must begin in the first column of the source file when they are declared, and may optionally have a ":" following them. Opcodes with no label must have at least one blank character before them.

Local labels are defined starting with "@" or ".". This glues whatever is after the "@" or "." to the last non-temporary code label defined so far, making a unique label. Examples: "@1", "@99", ".TEMP", and "@LOOP". These can be used until the next non-local label, by using this short form. They appear in the symbol table with a long form of "LABEL@1" or "LABEL.1", but cannot be referenced by this full name. Local labels starting with a "." can also be defined as subroutine local, by using the SUBROUTINE pseudo-op.

Comments may either be started with a "*" as the first non-blank character of a line, or with a ";" in the middle of the line.

Lines after the END pseudo-op are ignored as though they were comments, except for LIST and OPT lines.


PSEUDO-OPS

These are all the opcodes that have nothing to do with the instruction set of the CPU. All pseudo-ops can be preceded with a "." (example: ".BYTE" works the same as "BYTE").

NOTE: All of the data pseudo-ops like DB, DW, and DS have a limit of 1023 bytes of initialized data. (This can be changed in asmx.h if you really need it bigger.)

.6502 / .68HC11 / etc.

The CPU type can be specified this way in addition to the CPU and PROCESSOR pseudo-ops.

ASCIC

Creates a text string preceded by a single byte indicating the length of the string. This is equivalent to a Pascal-style string.
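
As a quick picture of the resulting bytes, this Python sketch builds a length-prefixed string. The function name and the explicit 255-byte check are assumptions for the example; the limit simply follows from the one-byte length field.

```python
def ascic(text):
    """Return the bytes for a length-prefixed (Pascal-style) string."""
    data = text.encode("ascii")
    if len(data) > 255:
        raise ValueError("string too long for a one-byte length prefix")
    return bytes([len(data)]) + data


print(ascic("HELLO"))  # b'\x05HELLO'
```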

ASSERT expr

Generates an error if expr is false (equals zero).

ALIGN

This ensures that the next instruction or data will be located on a power-of-two boundary. The parameter must be a power of two (2, 4, 8, 16, etc.)
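
The usual way to round an address up to a power-of-two boundary is a mask, as in this Python sketch. It shows the general technique and is not taken from asmx's source; the function name is hypothetical.

```python
def align(pc, boundary):
    """Round pc up to the next multiple of boundary (a power of two)."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a power of two")
    return (pc + boundary - 1) & ~(boundary - 1)


print(hex(align(0x1001, 4)))  # 0x1004
print(hex(align(0x1000, 4)))  # 0x1000
```

The power-of-two requirement is what makes the mask trick work: boundary - 1 has all the low bits set, so clearing them rounds down after the add rounds up.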

CPU

This is an alias for PROCESSOR.

DB / BYTE / DC.B / FCB / DEFB / DEFM

Defines one or more constant bytes in the code. You can use as many comma-separated values as you like. Strings use either single or double quotes. Doubled quotes inside a string assemble to a quote character. The backslash ("\") can escape a quote, or it can represent a tab ("\t"), linefeed ("\n"), or carriage return ("\r") character. Hex escapes ("\xFF") are also supported.

DW / WORD / DC.W / FDB

Defines one or more constant 16-bit words in the code, using the native endian-ness of the CPU. With the 6502, Z-80, and 8080, the low byte comes first; with the 6809, the high byte comes first. Quoted text strings are padded to a multiple of two bytes. The data is not aligned to a 2-byte address.

DL / LONG / DC.L

Defines one or more constant 32-bit words in the code, using the native endian-ness of the CPU. With the 6502, Z-80, and 8080, the low byte comes first; with the 6809, the high byte comes first. Quoted text strings are padded to a multiple of four bytes. The data is not aligned to a 4-byte address.

DRW

Define Reverse Word - just like DW, except the bytes are reversed from the current endian setting.
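
The byte orders produced by DW, DL, and DRW can be illustrated with Python's int.to_bytes. The function names below are made up for the example, and "little" and "big" stand for the CPU's native order (little for the 6502, Z-80, and 8080; big for the 6809).

```python
def dw(value, byteorder):
    """16-bit word in the given byte order ("little" or "big")."""
    return (value & 0xFFFF).to_bytes(2, byteorder)


def dl(value, byteorder):
    """32-bit word in the given byte order."""
    return (value & 0xFFFFFFFF).to_bytes(4, byteorder)


def drw(value, byteorder):
    """Like dw, but with the two bytes swapped."""
    return dw(value, byteorder)[::-1]


print(dw(0x1234, "little").hex(), dw(0x1234, "big").hex())  # 3412 1234
print(dl(0x12345678, "little").hex())                       # 78563412
print(drw(0x1234, "little").hex())                          # 1234
```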

DS / RMB / BLKB

Skips a number of bytes, optionally initialized.

Examples:

     DS 5     ; skip 5 bytes (generates no object code)
     DS 6,"*" ; assemble 6 asterisks

Note that no forward-reference values are allowed for the length because this would cause phase errors.

ERROR message

This prints a custom error message.

EVEN

This is an alias for ALIGN 2.

FCC

Motorola's equivalent to DB with a string. Each string starts and ends with the same character. The start/end character must not be alphanumeric or an underscore.

Examples:

     FCC /TEXT/     ; 4 bytes "TEXT"
     FCC \TEXT\     ; 4 bytes "TEXT"

In this assembler, FCC is extended by allowing it to work like DB afterward, only with a different quote character. Also, the string delimiter can be repeated twice inside the string to include the delimiter in the string.

Examples:

     FCC /TEXT//TEXT/    ; 9 bytes "TEXT/TEXT"
     FCC /TEXT/,0        ; 5 bytes "TEXT" followed by a null
     FCC /TEXT/,0,/TEXT/ ; 9 bytes "TEXT", null, "TEXT"
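
The delimiter-doubling rule can be expressed as a small scanner. The Python sketch below is a model of the rule as described here, not asmx's parser; the function name is hypothetical, and it returns the decoded string together with whatever follows the closing delimiter.

```python
def fcc_string(operand):
    """Decode one delimited FCC string; return (text, rest_of_operand)."""
    delim = operand[0]
    out = []
    i = 1
    while i < len(operand):
        if operand[i] == delim:
            if i + 1 < len(operand) and operand[i + 1] == delim:
                out.append(delim)  # doubled delimiter: keep one copy
                i += 2
                continue
            return "".join(out), operand[i + 1:]
        out.append(operand[i])
        i += 1
    raise ValueError("missing closing delimiter")


print(fcc_string("/TEXT//TEXT/"))  # ('TEXT/TEXT', '')
print(fcc_string("/TEXT/,0"))      # ('TEXT', ',0')
```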

There is also a second mode where the length is specified, the text has no quotes, and the text is padded to the specified length with blanks. Be aware that if the text is too short, it will copy more data from your source line, even if you have a comment in the line! However, it will stop copying when it encounters a tab character.

Example:

     FCC 9,TEXT          <- this is 9 bytes "TEXT     "
     FCC 9,TEXT;comm     <- this is 9 bytes "TEXT;comm"
     FCC 9,TEXT;comment  <- this is 9 bytes "TEXT;comm", then an error from "ent"

END

This marks the end of code. After the END statement, all input is ignored except for LIST and OPT lines.

EQU / = / SET / :=

Sets a label to a value. The difference between EQU and SET is that a SET label is allowed to be redefined later in the source code. EQU and '=' are equivalent, and SET and ':=' are equivalent.

HEX

Defines raw hexadecimal data. Individual hex bytes may be separated by spaces.

Examples:

     HEX 123456     ; assembles to hex bytes 12, 34, and 56
     HEX 78 9ABC DE ;  assembles to hex bytes 78, 9A, BC and DE
     HEX 1 2 3 4    ; Error: hexadecimal digits must be in pairs
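
The pairing rule is easy to model: each space-separated group must hold an even number of hex digits. The Python function below is a sketch with a hypothetical name; it takes only the operand text, with any comment already removed.

```python
def hex_bytes(operand):
    """Decode the operand of a HEX line into bytes."""
    out = bytearray()
    for group in operand.split():
        if len(group) % 2:
            raise ValueError("hexadecimal digits must be in pairs")
        out += bytes.fromhex(group)
    return bytes(out)


print(hex_bytes("123456").hex())      # 123456
print(hex_bytes("78 9ABC DE").hex())  # 789abcde
```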

IF expr / ELSE / ELSIF expr / ENDIF

Conditional assembly. Depending on the value in the IF statement, code between it and the next ELSE / ELSIF / ENDIF, and code between an ELSE and an ENDIF, may or may not be assembled.

ELSIF is the same as "ELSE" followed by "IF", only without the need for an extra ENDIF.

Example:

     IF ..undef mode
       ERROR mode not defined!
     ELSIF mode = 1
       JSR mode1
     ELSIF mode = 2
       JSR mode2
     ELSE
       ERROR Invalid value of mode!
     ENDIF

IF statements inside a macro only work inside that macro. When a macro is defined, IF statements are checked for matching ENDIF statements.

INCBIN filename

This inserts the contents of the named binary file into the object code output. The size of the binary file is shown in the listing.

INCLUDE filename

This starts reading source code from the named file. The file is read once in each pass. INCLUDE files can be nested to a maximum of 10 levels. (This can be changed in asmx.c if you really need it bigger.)

LIST / OPT

These set assembler options. Currently, the options are:

    LIST ON / OPT LIST              Turn on listing
    LIST OFF / OPT NOLIST           Turn off listing
    LIST MACRO / OPT MACRO          Turn on macro expansion in listing
    LIST NOMACRO / OPT NOMACRO      Turn off macro expansion in listing
    LIST EXPAND / OPT EXPAND        Turn on data expansion in listing
    LIST NOEXPAND / OPT NOEXPAND    Turn off data expansion in listing
    LIST SYM / OPT SYM              Turn on symbol table in listing
    LIST NOSYM / OPT NOSYM          Turn off symbol table in listing
    LIST TEMP / OPT TEMP            Turn on temp symbols in symbol table listing
    LIST NOTEMP / OPT NOTEMP        Turn off temp symbols in symbol table listing
    OPT EXACT / OPT NOOPT           Turn off assembler-specific optimizations
    OPT NOEXACT / OPT OPT           Turn on assembler-specific optimizations (Z80, 68K)

The default is listing on, macro expansion off, data expansion on, symbol table on, exact off.

MACRO / ENDM

Defines a macro. This macro is used whenever the macro name is used as an opcode. Parameters are defined on the MACRO line, and replace values used inside the macro.

Macro calls can be nested to a maximum of 10 levels. (This can be changed in asmx.c if you really need it bigger.)

Example:

     TWOBYTES  MACRO parm1, parm2     ; start recording the macro
               DB    parm1, parm2
               ENDM                   ; stop recording the macro

     TWOBYTES  1, 2        ; use the macro - expands to "DB 1, 2"

An alternate form with the macro name after MACRO, instead of as a label, is also accepted. A comma after the macro name is optional.

               MACRO plusfive parm
               DB    (parm)+5
               ENDM

When a macro is invoked with insufficient parameters, the remaining parameters are replaced with a null string. It is an error to invoke a macro with too many parameters.

Macro parameters can be inserted without surrounding whitespace by using the '##' concatenation operator.

     TEST      MACRO labl
     labl ## 1 DB    1
     labl ## 2 DB    2
               ENDM

               TEST  HERE ; labl ## 1 gets replaced with "HERE1"
                          ; labl ## 2 gets replaced with "HERE2"

Macro parameters can also be inserted by using the backslash ("\") character. This method also includes a way to access the actual number of macro parameters supplied, and a unique identifier for creating temporary labels.

\0 = number of macro parameters
\1..\9 = nth macro parameter
\? = unique ID per macro invocation (padded with leading zeros to five digits)
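
A rough Python model of the backslash substitution is shown below. The function name and arguments are assumptions, and treating a missing parameter as a null string follows the rule given above for macros invoked with too few parameters.

```python
import re


def expand(line, params, invocation_id):
    """Replace \\0, \\1..\\9 and \\? in one line of a macro body."""
    def replace(match):
        key = match.group(1)
        if key == "?":
            return f"{invocation_id:05d}"  # unique ID, five digits
        if key == "0":
            return str(len(params))        # number of parameters supplied
        n = int(key)
        return params[n - 1] if n <= len(params) else ""

    return re.sub(r"\\([0-9?])", replace, line)


print(expand(r"T\?  DB \1, \2", ["10", "20"], 7))  # T00007  DB 10, 20
print(expand(r"DB \0", ["a", "b", "c"], 1))        # DB 3
```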

NOTE: The line with the ENDM may have a label, and that will be included in the macro definition. However, if you include a backslash escape before the ENDM, the ENDM will not be recognized, and the macro definition will not end. Be careful!

ORG

Sets the origin address of the following code. This defaults to zero at the start of each assembler pass.

PROCESSOR

This selects a specific CPU type to assemble code for. Some assemblers support multiple CPU sub-types. Currently supported CPU types are:

    NONE              No CPU type selected
    1802              RCA 1802
    6502              MOS Technology 6502
    6502U             MOS Technology 6502 with undocumented instructions
    65C02             Rockwell 65C02
    65816 65C816      Western Digital 65C816
    68K 68000         Motorola 68000
    68010             Motorola 68010
    6805 68HC05       Motorola 6805
    68HSC08           Motorola 68HSC08 (6805 variant)
    6809              Motorola 6809
    6309              Hitachi 6309
    6800 6802 6808    Motorola 6800
    6801 6803         Motorola 6801
    6303              Hitachi 6303 (6800 variant)
    6811 68HC11       Motorola 68HC11 variants
    68HC711 68HC811
    68HC99
    68HC16            Motorola 68HC16
    8048              Intel 8048
    8051 8052         Intel 8051 variants
    8031 8032
    8080              Intel 8080
    8085              Intel 8085
    8080Z             Intel 8080 with Z-80 JR and DJNZ opcodes
    8085U             Intel 8085 with undocumented instructions
    Z80               Zilog Z-80
    Z180              Zilog Z-180
    GBZ80             Gameboy Z-80 variant
    Z8085             Intel 8085 with Z-80 mnemonics
    Z8                Zilog Z8
    8008              Intel 8008
    F8                Fairchild F8
    TOM               Atari Jaguar GPU
    JERRY             Atari Jaguar DSP
    ARM               ARM (32-bit little-endian)
    ARM_BE            ARM big-endian
    ARM_LE            ARM little-endian
    THUMB             ARM Thumb (16-bit little-endian)
    THUMB_BE          ARM Thumb big-endian
    THUMB_LE          ARM Thumb little-endian

At the start of each pass, this defaults to the assembler specified in the "-C" command line option, if any, or the assembler type determined from the name of the executable used on the command line. The latter is useful with soft-links when using Unix-type systems. In that case, the default assembler name can be determined by looking at the end of the executable name used to invoke asmx, then selecting that CPU type.

If no default assembler is specified, the DW/WORD and DL/LONG pseudo-ops will generate errors because they do not know which endian order to use.

Opcodes for the selected processor will have priority over generic pseudo-ops. However, assemblers for CPUs which have a "SET" opcode have been specifically designed to pass control to the generic "SET" pseudo-op.

REND

Ends an RORG block. A label in front of REND receives the relocated address + 1 of the last relocated byte in the RORG / REND block.

RORG

Sets the relocated origin address of the following code. Code in the object file still goes to the same addresses that follow the previous ORG, but labels and branches are handled as though the code were being assembled starting at the RORG address.

SEG / RSEG / SEG.U segment

Switches to a new code segment. Code segments are simply different sections of code which get assembled to different addresses. They remember their last location when you switch back to them. If no segment name is specified, the null segment is used.

At the start of each assembler pass, all segment pointers are reset to zero, and the null segment becomes the current segment.

SEG.U is for DASM compatibility. DASM uses SEG.U to indicate an "uninitialized" segment. This is necessary because its DS pseudo-op always generates data even when none is specified. Since the DS pseudo-op in this assembler normally doesn't generate any data, uninitialized segments aren't supported as such.

RSEG is for compatibility with vintage Atari 7800 source code.

SUBROUTINE / SUBR name

This sets the scope for temporary labels beginning with a dot. At the start of each pass, and when this pseudo-op is used with no name specified, temporary labels beginning with a dot use the previous non-temporary label, just like the temporary labels beginning with an '@'.

Example:

       START
       .LABEL  NOP        ; this becomes "START.LABEL"
               SUBROUTINE foo
       .LABEL  NOP        ; this becomes "FOO.LABEL"
               SUBROUTINE bar
       .LABEL  NOP        ; this becomes "BAR.LABEL"
               SUBROUTINE
       LABEL
       .LABEL  NOP        ; this becomes "LABEL.LABEL"
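
The scoping rule in this example can be modelled with a small class. This Python sketch is an illustration only: the class and method names are made up, the upper-casing is inferred from the comments in the example, and it assumes that a new non-temporary label does not cancel a named SUBROUTINE scope, which the text above does not spell out.

```python
class LabelScope:
    """Track which prefix a temporary label gets glued to."""

    def __init__(self):
        self.last_global = ""
        self.subroutine = None

    def set_subroutine(self, name=None):
        """Model of SUBROUTINE; no name returns to the default scope."""
        self.subroutine = name.upper() if name else None

    def define(self, label):
        """Return the full symbol-table name for a label definition."""
        label = label.upper()
        if label.startswith("@"):
            return self.last_global + label
        if label.startswith("."):
            return (self.subroutine or self.last_global) + label
        self.last_global = label
        return label


scope = LabelScope()
scope.define("START")
print(scope.define(".LABEL"))  # START.LABEL
scope.set_subroutine("foo")
print(scope.define(".LABEL"))  # FOO.LABEL
scope.set_subroutine()
scope.define("LABEL")
print(scope.define(".LABEL"))  # LABEL.LABEL
```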

WORDSIZE n

Specifies the CPU's word size in bits. This is for CPUs which do not support byte addressing. If the word size is zero, the native CPU word size is used. Currently only the Jaguar DSP/GPU uses a word size that is not equal to 8.

This is primarily intended for using DS pseudo-ops to create data structure offsets, using WORDSIZE 8.

ZSCII

Creates a compressed text string in the version 1 Infocom format. Otherwise, this works exactly like the DB pseudo-op. Note that this will always generate a multiple of two bytes of data.

WARNING: using a forward-referenced value could cause phase errors!

See http://www.wolldingwacht.de/if/z-spec.html for more information on the compressed text format.

There are also a few CPU-specific pseudo-ops:

SETDP value

With the 6809 assembler, this sets the current value of the direct page register, for determining whether to use direct or extended mode. It defaults to zero at the start of each assembler pass.

.LONGA ON|OFF

With the 65C816 assembler, this sets the data size for immediate instructions that use the A register. .LONGA OFF generates an 8-bit operand, and .LONGA ON generates a 16-bit operand. It defaults to OFF at the start of each assembler pass.

.LONGI ON|OFF

With the 65C816 assembler, this sets the data size for immediate instructions that use the X and Y registers. .LONGI OFF generates an 8-bit operand, and .LONGI ON generates a 16-bit operand. It defaults to OFF at the start of each assembler pass.

SYMBOL TABLE DUMP

The symbol table is dumped at the end of the listing file. Each symbol shows its name, value, and flags. The flags are:

    U  Undefined         this symbol was referenced but never defined
    M  Multiply defined  this symbol was defined more than once with different values (only the first is kept)
    S  SET               this symbol was defined with the SET pseudo-op, or from the -dLABEL:=VALUE command line option
    E  EQU               this symbol was defined with the EQU pseudo-op, or from the -dLABEL=VALUE command line option


CHANGE HISTORY

Version 1.1 changes (April 1995)

(this version was on the original Starpath "Stella Gets a New Brain" CD)

Version 1.2 changes (September 1996)


Version 1.3 changes (December 1996)


Version 1.4 changes (February 2002)


Version 1.5 changes (2004-02-24)


Version 1.6 changes (2004-04-30)


Version 1.7 changes (2004-08-25)


Version 1.7.1 changes (2004-10-20)


Version 1.7.2 changes (2005-08-21)


Version 1.7.3 changes (2006-01-23)


Version 1.7.4 changes (2006-11-09)


Version 1.8 changes (2007-01-11)


Version 2.0.0 changes (2023-10-31)


Version 2.0.1 changes (2024-xx-xx)


Notes: