Assembler patterns

A pattern is a reusable solution to a commonly occurring problem. The following patterns are useful to save memory and execution cycles. Note that many patterns come at the expense of some code-readability.

Register addressing with R0
Setting the cc status bits
Reuse an absolute address
Indirection for repeated branches
Increment or decrement an 8-bit value
Increment a 16-bit memory value
Decrement a 16-bit memory value
Chain and fall-through subroutine calls
Use indirection to access data in other pages
Returning from an interrupt

Where patterns are ‘good’ useful solutions, anti-patterns are ‘bad’ or inefficient solutions that should not be used. Some anti-patterns:

Don’t use instruction lodz,r0
Don’t use register bank 1

Register addressing with R0

These instructions are 1 byte and 2 cycles (1 cycle on the 2650B) and make register zero operate on itself. They have interesting side-effects.

Opcode	Instruction	Effects
`00`	lodz,r0	See separate section below.
`20`	eorz,r0	Clears r0, and clears both condition code bits in PSL. Saves one byte compared to `lodi,r0 0`.
`40`	andz,r0	Do not use. Opcode is used for the HALT instruction.
`60`	iorz,r0	Sets condition code bits to the value of r0, without changing r0.
`80`	addz,r0	Doubles r0 (with carry, if set). Similar to `rrl,r0`.
`a0`	subz,r0	Same as `eorz,r0` (if the WC bit is not set).
`c0`	strz,r0	Do not use. Opcode is used for the NOP instruction.
`e0`	comz,r0	Clears both condition code bits, without changing R0.
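
The side-effects in the table can be checked against a small model. The following Python sketch is an illustration only, not an emulator: it assumes WC is 0, tracks just r0 and the condition code, and all function names are made up for this example.

```python
def cc(value):
    """Condition code for an 8-bit result: 'eq' (00), 'gt' (01) or 'lt' (10)."""
    value &= 0xFF
    if value == 0:
        return "eq"
    return "lt" if value & 0x80 else "gt"

def eorz_r0(r0):
    return 0, "eq"                 # r0 xor r0 is always zero

def iorz_r0(r0):
    return r0, cc(r0)              # r0 or r0 leaves r0 unchanged

def addz_r0(r0):
    result = (r0 + r0) & 0xFF      # assumes WC = 0, so no carry is added
    return result, cc(result)

def comz_r0(r0):
    return r0, "eq"                # r0 always compares equal to itself

for value in (0x00, 0x05, 0x90):
    print(f"{value:02x}", eorz_r0(value), iorz_r0(value), addz_r0(value), comz_r0(value))
```

Each function returns the new r0 and the resulting condition code, so the rows of the table can be compared side by side.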

Setting the cc status bits

The base method for setting the two status bits to a known value is by using CPSL and PPSL. This is inefficient.

* Inefficient
SetEq   cpsl    CC1+CC0     ; 2 bytes   3 cycles

SetGt   cpsl    CC1
        ppsl    CC0         ; 4 bytes   6 cycles 

SetLt   ppsl    CC1
        cpsl    CC0         ; 4 bytes   6 cycles

Note that it is entirely legal to set both CC1 and CC0 to 1, although the 2650 will never do so itself.

A better way to clear the two cc status bits is to use register addressing on register 0.

SetEq   comz,r0             ; 1 byte    2 cycles

SetGt   comz,r0
        ppsl    CC0         ; 3 bytes   5 cycles

SetLt   comz,r0
        ppsl    CC1         ; 3 bytes   5 cycles

If the contents of some register rx do not have to be retained, you can use this:

SetGt   lodi,rx 1           ; 2 bytes   2 cycles

SetLt   lodi,rx -1          ; 2 bytes   2 cycles

To set the status bits to the current value of r1 .. r3:

        comi,rx 0           ; 2 bytes   2 cycles

When COM is 1 (logical compare) the condition code bits will be set to Zero or Positive; Negative can only happen when COM is 0 (arithmetic compare).
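
Why a compare against 0 can only report Negative in arithmetic mode is easiest to see with a small model. This Python sketch is an assumption-laden illustration: `compare` and its `logical` flag are names invented here, and only the 8-bit comparison itself is modelled.

```python
def compare(a, b, logical):
    """Return 'eq', 'gt' or 'lt' for an 8-bit compare of a against b."""
    a &= 0xFF
    b &= 0xFF
    if not logical:
        # arithmetic compare: interpret both bytes as two's complement
        a = a - 256 if a & 0x80 else a
        b = b - 256 if b & 0x80 else b
    if a == b:
        return "eq"
    return "gt" if a > b else "lt"

for value in (0x00, 0x01, 0x7F, 0x80, 0xFF):
    print(f"{value:02x}", compare(value, 0, logical=True), compare(value, 0, logical=False))
```

In logical mode every non-zero byte is greater than 0, so only the arithmetic mode can yield 'lt' for values with the top bit set.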

Reuse an absolute address

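Use a relative indirect instruction to reuse a nearby absolute address. Using relative addressing saves one execution cycle, but the indirection costs an extra two cycles. In effect, you spend one extra execution cycle to save one byte of memory. Note that the address field of a non-branch absolute instruction holds only a 13-bit in-page address (the remaining bits select indirection and indexing), while the indirection reads it as a full 15-bit address, so the trick works as written only for non-indexed, non-indirect instructions that address page 0.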

code    loda,r0  addr
         :
        strr,r0  *code+1    ; 2 bytes   5 cycles

instead of:

code    loda,r0  addr
         :
        stra,r0  addr       ; 3 bytes   4 cycles

Indirection for repeated branches

If execution speed is not a concern, then you can save memory when code makes repeated branches to the same address.

Err     ACON     Error

         : some code
        bcfr,eq  *Err
         : some code
        bcfr,gt  *Err
         : some code
        bcfr,lt  *Err
         : some code
        bctr,un  *Err

The relative branches use two bytes instead of three for an absolute branch (but the indirection costs two extra cycles). If you have more than two branches to the same address you start saving memory.

This pattern can be combined with Reuse An Absolute Address, to save an extra byte.

         : some code
Err     EQU      $+1
        bcfa,eq  Error
         : some code
        bcfr,gt  *Err
         : some code
        bcfr,lt  *Err
         : some code
        bctr,un  *Err
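
The memory trade-off of both variants can be worked out per number of branches. The Python sketch below only adds up instruction sizes (3 bytes for an absolute branch, 2 for a relative one, 2 for the ACON vector); the function names are hypothetical.

```python
ABSOLUTE_BRANCH = 3   # bytes, e.g. bcfa
RELATIVE_BRANCH = 2   # bytes, e.g. bcfr (also when indirect)
ACON_VECTOR = 2       # bytes for the address constant

def direct(n):
    """n absolute branches to the same target."""
    return n * ABSOLUTE_BRANCH

def via_vector(n):
    """One ACON vector plus n relative indirect branches."""
    return ACON_VECTOR + n * RELATIVE_BRANCH

def via_first_branch(n):
    """The first branch is absolute and doubles as the vector."""
    return ABSOLUTE_BRANCH + (n - 1) * RELATIVE_BRANCH

for n in range(1, 6):
    print(n, direct(n), via_vector(n), via_first_branch(n))
```

With two branches the vector variant breaks even, and from three branches onwards it is smaller; reusing the first branch as the vector is always one byte smaller than the separate vector.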

Increment or decrement an 8-bit value

The base method to increment or decrement a register:

        addi,r1 1           ; 2 bytes   2 cycles
        subi,r2 1           ; 2 bytes   2 cycles

This is subject to the state of the WC bit. The CC bits are set to the new register value. In TWIN TOS the following pattern is used.

        birr,r1 $+2         ; 2 bytes   3 cycles
        bdrr,r2 $+2         ; 2 bytes   3 cycles

This takes an extra cycle, but may be useful when the state of the CC bits needs to be preserved, or when the state of WC cannot be assumed. Otherwise a direct addi or subi is faster and more readable.

The base method to increment a value in memory:

        loda,r0 addr        ; 3 bytes   4 cycles
        addi,r0 1           ; 2 bytes   2 cycles
        stra,r0 addr        ; 3 bytes   4 cycles

This only works when WC is set to 0, which is normally the case, or if CAR is known to be 0. Otherwise the counter may inadvertently be increased by two.

The following pattern is used in VHS DOS. It saves one byte of memory without a compute penalty and is independent of the WC bit.

        eorz,r0             ; 1 byte    2 cycles
        adda,r0 addr-1,r0+  ; 3 bytes   4 cycles
        stra,r0 addr        ; 3 bytes   4 cycles

Register r0 is first incremented, and becomes 1. It is then added (as the index register) to addr-1 yielding addr. The contents at that address are added to r0 (which contains 1), to increment the value.

Especially useful if you need the zero in r0 anyway to initialise other registers or memory bytes.

When decrementing, the following pattern is inspired by the “VHS DOS-pattern”.

        eorz,r0
        adda,r0 addr-h'ff',r0-
        stra,r0 addr

Register r0 is first decremented, and becomes -1 / h'ff'. It is then added (as the index register) to addr-h'ff' yielding addr. The contents at that address are added to r0 (which contains -1).
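
The address arithmetic of both patterns can be followed step by step in a small model. This Python sketch makes simplifying assumptions: carry handling is ignored, page wrap-around is not modelled, and `indexed_add`, `ADDR` and the memory contents are made up for the illustration.

```python
def indexed_add(memory, base, step):
    """Model eorz,r0 followed by an indexed adda with auto-increment (+1) or auto-decrement (-1)."""
    r0 = 0                                  # eorz,r0
    r0 = (r0 + step) & 0xFF                 # the index register is adjusted first
    effective = base + r0                   # the index is added as an unsigned byte
    r0 = (r0 + memory[effective]) & 0xFF    # then the addition itself takes place
    return effective, r0

ADDR = 0x0400                               # hypothetical address of the counter
memory = {ADDR: 0x41}

print(indexed_add(memory, ADDR - 1, +1))      # increment pattern
print(indexed_add(memory, ADDR - 0xFF, -1))   # decrement pattern
```

Both calls end up at the same effective address; only the value left in r0 (one more or one less than the stored byte) differs.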

Increment a 16-bit memory value

This base method is quite inefficient with 20 bytes and 26 cycles.

        loda,r0 addr+1      ; 3 bytes   4 cycles
        addi,r0 1           ; 2 bytes   2 cycles
        stra,r0 addr+1      ; 3 bytes   4 cycles
        ppsl    WC          ; 2 bytes   3 cycles
        loda,r0 addr        ; 3 bytes   4 cycles
        addi,r0 0           ; 2 bytes   2 cycles
        stra,r0 addr        ; 3 bytes   4 cycles
        cpsl    WC          ; 2 bytes   3 cycles

It can be done without using the program status word in 18 bytes and 13 cycles most of the time, 23 cycles worst case. This only works when WC is set to 0.

        loda,r0 addr+1      ; 3 bytes   4 cycles
        addi,r0 1           ; 2 bytes   2 cycles
        stra,r0 addr+1      ; 3 bytes   4 cycles
        brnr,r0 skip        ; 2 bytes   3 cycles
        loda,r0 addr        ; 3 bytes   4 cycles
        addi,r0 1           ; 2 bytes   2 cycles
        stra,r0 addr        ; 3 bytes   4 cycles
skip     :

This can be combined with the pattern for incrementing an 8-bit value, reducing it to 16 bytes and 13 cycles (23 cycles worst case).

        eorz,r0             ; 1 byte    2 cycles
        adda,r0 addr,r0+    ; 3 bytes   4 cycles
        stra,r0 addr+1      ; 3 bytes   4 cycles
        brnr,r0 skip        ; 2 bytes   3 cycles
        eorz,r0             ; 1 byte    2 cycles
        adda,r0 addr-1,r0+  ; 3 bytes   4 cycles
        stra,r0 addr        ; 3 bytes   4 cycles
skip     :

If you can use two registers, this neat variant is 16 bytes and 11 cycles (22 cycles worst case). It is also independent of the state of the WC bit.

        loda,r1 addr+1      ; 3 bytes   4 cycles
        birr,r1 skip2       ; 2 bytes   3 cycles
        loda,r0 addr        ; 3 bytes   4 cycles
        birr,r0 skip1       ; 2 bytes   3 cycles
skip1   stra,r0 addr        ; 3 bytes   4 cycles
skip2   stra,r1 addr+1      ; 3 bytes   4 cycles

If no registers are available, avoid the bank-1-antipattern: switching to bank 1 and back again raises the cost to 20 bytes and 28 cycles worst case.
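
The typical and worst-case cycle counts of the two-register variant can be traced with a short model. This Python sketch takes the cycle counts straight from the comments in the listing and assumes a branch costs the same whether or not it is taken; `increment16` is a name chosen for this example.

```python
def increment16(high, low):
    """Return (high, low, cycles) after one pass through the two-register listing."""
    cycles = 4                      # loda,r1 addr+1
    low = (low + 1) & 0xFF          # birr,r1 skip2
    cycles += 3
    if low == 0:                    # branch not taken: the low byte wrapped
        cycles += 4                 # loda,r0 addr
        high = (high + 1) & 0xFF    # birr,r0 skip1
        cycles += 3
        cycles += 4                 # stra,r0 addr
    cycles += 4                     # stra,r1 addr+1
    return high, low, cycles

print(increment16(0x12, 0x34))
print(increment16(0x12, 0xFF))
```

The second call exercises the slow path, which is only taken once in every 256 increments.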

Decrement a 16-bit memory value

This base method uses 20 bytes and 26 cycles.

        loda,r0 addr+1      ; 3 bytes   4 cycles
        subi,r0 1           ; 2 bytes   2 cycles
        stra,r0 addr+1      ; 3 bytes   4 cycles
        ppsl    WC          ; 2 bytes   3 cycles
        loda,r0 addr        ; 3 bytes   4 cycles
        subi,r0 0           ; 2 bytes   2 cycles
        stra,r0 addr        ; 3 bytes   4 cycles
        cpsl    WC          ; 2 bytes   3 cycles

A simple alternative without the With Carry bit uses the same memory but reduces the typical case to 15 cycles (25 cycles worst case):

        loda,r0 addr+1      ; 3 bytes   4 cycles
        subi,r0 1           ; 2 bytes   2 cycles
        stra,r0 addr+1      ; 3 bytes   4 cycles
        comi,r0 h'ff'       ; 2 bytes   2 cycles
        bcfr,eq skip        ; 2 bytes   3 cycles
        loda,r0 addr        ; 3 bytes   4 cycles
        subi,r0 1           ; 2 bytes   2 cycles
        stra,r0 addr        ; 3 bytes   4 cycles
skip     :

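The carry bit can also be used to detect the wrap-around, since a subtraction that borrows clears it. (The overflow bit is not suitable: going from h'00' to h'ff' is not a signed overflow.) Memory usage is still the same with 20 bytes, and the typical case takes 16 cycles (26 cycles worst case):

        loda,r0 addr+1      ; 3 bytes   4 cycles
        subi,r0 1           ; 2 bytes   2 cycles
        stra,r0 addr+1      ; 3 bytes   4 cycles
        tpsl    h'01'       ; 2 bytes   3 cycles   test the carry bit (bit 0 of PSL)
        bctr,eq skip        ; 2 bytes   3 cycles   carry set: no borrow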
        loda,r0 addr        ; 3 bytes   4 cycles
        subi,r0 1           ; 2 bytes   2 cycles
        stra,r0 addr        ; 3 bytes   4 cycles
skip     :

Chain and fall-through subroutine calls

Chain subroutine calls at the end of a subroutine. Instead of ending a subroutine like this:

        : part of first sub
        bsta,un SUB         ; 3 bytes   3 cycles
        retc,un             ; 1 byte    3 cycles

use:

        : part of first sub
        bcta,un SUB ** chain ; 3 bytes  3 cycles

This saves one byte and three cycles. It is good practice to mark this branch with “** chain” as a comment, to warn about the hidden return instruction.

When the subroutine immediately follows the code, you can omit the branch. It is good practice to mark this with “** fall-through” as a comment, to warn about the implicit branch instruction.

        : part of first sub
        ** fall-through

SUB     :start of new sub

Use indirection to access data in other pages

The 2650 divides its 32K address space into four pages of 8K each. While absolute branch instructions can jump to any address in the 32K address space, absolute data instructions such as LODA and ADDA are restricted to their page.

In order to load or store into an address location in another page, use indirection.

        ORG     H'2000'     Code lives in page 1
Other   ACON    H'6100'     vector into page 3
        loda,r0 *Other      Fetch byte from another page
        :
        stra,r0 *Other      Store data to another page

Sometimes you need to access a 16-bit value in another page. For example, your code wants to access the location of the cursor. The Central Data computer stores the cursor position at H'17FE' (high byte) and H'17FF' (low byte). Set this by combining indexing and indirection. Remember: indexing is applied to the result of indirection; adding the index is the last step in determining the effective address.

        ORG     H'2000'     Code lives in page 1
Curs    ACON    H'17FE'     vector to 16-bit value in page 0
* Set the cursor at location H'1234'
        lodi,r0 H'12'
        stra,r0 *Curs       Set the high byte of the cursor
        lodi,r0 H'34'
        lodi,r1 1
        stra,r0 *Curs,r1    Set the low byte of the cursor

Alternatively, you can define two vectors. This saves two cycles, without an increase in memory.

        ORG     H'2000'     Code lives in page 1
Curs    ACON    H'17FE'     vector into 16-bit value in page 0
CursLo  ACON    H'17FF'     vector into low byte
* Set the cursor at location H'1234'
        lodi,r0 H'12'
        stra,r0 *Curs       Set the high byte of the cursor
        lodi,r0 H'34'
        stra,r0 *CursLo     Set the low byte of the cursor
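
Which page an address belongs to, and where an indexed indirect access ends up, can be calculated as follows. This Python sketch is a simplified model: the helper names are hypothetical and page wrap-around during indexing is ignored.

```python
PAGE_SIZE = 0x2000                  # 8K per page, four pages in 32K

def page_of(address):
    """Page number (0..3) of a 15-bit address."""
    return (address & 0x7FFF) // PAGE_SIZE

def same_page(a, b):
    """True when an absolute data instruction at a can reach b directly."""
    return page_of(a) == page_of(b)

def effective_address(vector, index=0):
    """Indexing is applied after the indirection has fetched the vector."""
    return vector + index

print(page_of(0x2000), page_of(0x6100), page_of(0x17FE))
print(same_page(0x2000, 0x17FE))
print(hex(effective_address(0x17FE, 1)))
```

Code at H'2000' and the cursor at H'17FE' are in different pages, which is why the examples go through a vector.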

Note that asm2650 will generate an error when an absolute data instruction accesses data in another page. Most other assemblers will silently generate incorrect code.

Returning from an interrupt

Interrupts can occur anytime between two instructions. In order not to upset the running program, the interrupt handler must ensure that the processor state is not changed in any unexpected way. This applies especially to the lower half of the Program Status Word (PSL), as its bits are set and cleared by the instructions in the interrupt handler. It is therefore essential for the interrupt handler to save the PSL and restore it at the end. The instructions spsl and lpsl can be used for this. Since spsl overwrites register zero and lpsl loads the PSL from it, R0 has to be saved and restored as well.

* Incorrect interrupt handler
Handler	stra,r0	SavR0
	spsl		Stores PSL into R0
	stra,r0	SavPSL
	*
	* handler code here, restore any changes to other registers
	*
	loda,r0	SavPSL
	lpsl		Restores PSL
	loda,r0	SavR0	Changes condition code!
	rete,un

SavPSL	RES	1
SavR0	RES	1

The problem here is that the last loda instruction changes the PSL again. To work around this issue, use the following pattern.

* Correct interrupt handler, 2650 and 2650A
Handler	stra,r0	SavR0
	spsl		Stores PSL into R0
	stra,r0	SavPSL
	*
	* handler code here, restore any changes to other registers
	*
	loda,r0	SavPSL
	lpsl            Restores PSL
	bctr,gt	RetGT
	bctr,lt	RetLT
	*
RetZ	loda,r0	SavR0
	comz,r0         CC = EQ
	rete,un
	*
RetGT	loda,r0	SavR0
	comz,r0
	ppsl	CC0     CC = GT
	rete,un
	*
RetLT	loda,r0	SavR0
	comz,r0
	ppsl	CC1     CC = LT
	rete,un

SavPSL	RES	1
SavR0	RES	1
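
The difference between the two handlers comes down to which instruction touches the condition code last. The Python sketch below models only that aspect; it is not an emulator, and the function names are invented for the illustration.

```python
def cc_of(value):
    """Condition code a load leaves behind for an 8-bit value."""
    value &= 0xFF
    if value == 0:
        return "eq"
    return "lt" if value & 0x80 else "gt"

def naive_return(saved_cc, saved_r0):
    # lpsl restores the condition code, then loda,r0 SavR0 overwrites it
    return cc_of(saved_r0)

def three_way_return(saved_cc, saved_r0):
    # branch on the restored code, reload r0, then rebuild the code
    cc = "eq"                       # comz,r0 after loda,r0 SavR0
    if saved_cc == "gt":
        cc = "gt"                   # ppsl CC0
    elif saved_cc == "lt":
        cc = "lt"                   # ppsl CC1
    return cc

for saved_cc in ("eq", "gt", "lt"):
    for saved_r0 in (0x00, 0x01, 0x80):
        assert three_way_return(saved_cc, saved_r0) == saved_cc

print(naive_return("eq", 0x80))
```

The naive version returns whatever the reloaded r0 implies, while the three-way version always hands back the code that was saved.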

The above works when running from ROM (e.g. from a game cartridge). When running from RAM it is possible to modify the code at runtime:

* Correct interrupt handler for RAM, 2650 and 2650A
Handler	stra,r0	SavR0+1
	spsl		Stores PSL into R0
	stra,r0	SavPSL+1
	*
	* handler code here, restore any changes to other registers
	*
SavR0   lodi,r0 00      Will be overwritten with actual value
        cpsl    h'ff'
SavPSL  ppsl    00      Will be overwritten with actual value
	rete,un

Now compare this to how this is done using the 2650B microprocessor, using the two instructions that were added in this variant.

* Correct interrupt handler, 2650B only
Handler	stpl	SavPSL
	*
	* handler code here, restore any changes to registers
	*
	ldpl	SavPSL
	rete,un

SavPSL	RES	1

Don’t use instruction `lodz,r0`

There seems to be conflicting information on whether the lodz,r0 instruction is legal or not. The hardware manuals on the 2650 microprocessor by Signetics and Philips (even later ones describing the 2650B) are very clear:

When the specified register, r, equals 0, the operation code is changed to 60₁₆ by the assembler. The instruction, 00000000, yields indeterminate results.
Signetics 2650 Microprocessor manual.

Opcode 60 stands for iorz,r0. Arguably both instructions should yield the same result: the contents of r0 are unchanged but the Condition Code bits in the Program Status Word are set to either 00 (zero), 01 (positive) or 10 (negative) according to the value in r0. However, lodz,r0 does not reliably work (“yields indeterminate results”), and iorz,r0 should be used instead.

Problem solved? Well, several programs — including a lot of software written by Central Data — make use of lodz,r0. One conclusion is that apparently they used a different assembler than Signetics, but more importantly: the lodz,r0 instruction appears to work fine in practice. Furthermore, the official manual to the Instructor 50 mentions this:

When the specified register, r, equals 0, the operation code is changed to 60₁₆ (IORZ) by the assembler. However, the processor will execute the instruction 00₁₆ correctly.
Introduction to the Instructor 50 Desktop Computer.

To avoid issues (and discussions) it is best to avoid lodz,r0 and use iorz,r0 instead. The asm2650 assembler issues a warning for it, but does not silently change it to iorz,r0.

Don’t use register bank 1

Normally bank 0 is selected, and instructions operate on R0..R3. A switch to bank 1 is done only at the very beginning of certain subroutines, to avoid modifying R1..R3 in bank 0. With only three registers holding data (R0 is used as a general accumulator), it becomes necessary to store data in memory. One might think that it is a waste not to use 3 scarce registers in bank 1 during normal operations, but bank switching is expensive and often not worth the effort.

For example, scratch space can be used to control a loop like this (14 bytes, 18 cycles):

TEMP    RES 1               ; 1 byte

        eorz,r0             ; 1 byte    2 cycles
        strr,r0 TEMP        ; 2 bytes   3 cycles
Loop     :
         :
        lodr,r0 TEMP        ; 2 bytes   3 cycles
        addi,r0 1           ; 2 bytes   2 cycles
        strr,r0 TEMP        ; 2 bytes   3 cycles
        comi,r0 MAX         ; 2 bytes   2 cycles
        bcfr,eq Loop        ; 2 bytes   3 cycles

This is more efficient than using a register in bank 1. The following anti-pattern uses 16 bytes and 23 cycles:

        eorz,r0             ; 1 byte    2 cycles
        ppsl    RS          ; 2 bytes   3 cycles
        strz,r4             ; 1 byte    2 cycles
        cpsl    RS          ; 2 bytes   3 cycles
Loop     :
         :
        ppsl    RS          ; 2 bytes   3 cycles
        addi,r4 1           ; 2 bytes   2 cycles
        comi,r4 MAX         ; 2 bytes   2 cycles
        cpsl    RS          ; 2 bytes   3 cycles
        bcfr,eq Loop        ; 2 bytes   3 cycles

Perhaps this anti-pattern is a specific example of the more generic anti-pattern of using the xPSL / xPSU instructions. Working with the Program Status Word is expensive.
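
The totals quoted for the two loops can be re-added from the per-line comments. This Python sketch simply sums the (bytes, cycles) pairs in listing order; the list and function names are made up for this example.

```python
# (bytes, cycles) per line of the scratch-space loop, in listing order
scratch = [
    (1, 0),  # scratch byte reserved with RES
    (1, 2),  # eorz,r0
    (2, 3),  # strr,r0 to the scratch byte
    (2, 3),  # lodr,r0 from the scratch byte
    (2, 2),  # addi,r0 1
    (2, 3),  # strr,r0 to the scratch byte
    (2, 2),  # comi,r0 MAX
    (2, 3),  # bcfr,eq Loop
]

# (bytes, cycles) per line of the bank 1 loop, in listing order
bank1 = [
    (1, 2),  # eorz,r0
    (2, 3),  # ppsl RS
    (1, 2),  # strz,r4
    (2, 3),  # cpsl RS
    (2, 3),  # ppsl RS
    (2, 2),  # addi,r4 1
    (2, 2),  # comi,r4 MAX
    (2, 3),  # cpsl RS
    (2, 3),  # bcfr,eq Loop
]

def total(listing):
    """Sum the bytes and cycles of a listing."""
    return sum(b for b, _ in listing), sum(c for _, c in listing)

print(total(scratch), total(bank1))
```

The totals cover the set-up plus a single pass through the loop, which is how the figures in the text are counted.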