pspemu (VII)

ALGORITHMS, D, PSPEMU

April 18, 2010

As promised: after a couple of weeks of rest, I'm back at work. As I said, I'm going to go over the kind of changes I have made in these revisions (r111 -> r124).

In these posts I always leave out a few interesting things. I do a lot of refactoring, often making use of the power of D's CTFE, but I don't usually mention which refactors I did or why.

Hex editor. Relative and pattern-based search

For now I have started experimenting with adding a debugger to the emulator.

I have been looking at how to build a hexadecimal editor component that is comfortable to edit with and can search directly, without having to make dumps. A lot of emulators have hexadecimal viewers, but most of them either don't let you edit or don't let you search, so you end up using external tools. I want to make a component that does all of this, including relative search, the ability to use different encodings, and custom encoding tables.

Another thing I wanted to add is pattern-based search. I don't know of any hexadecimal editor that has this kind of search, which works where relative search fails. For example, the game Valkyrie Profile has a character table that is neither consecutive nor ASCII, so both traditional search and relative search fail. Pattern-based search handles this kind of case. It also helps to find tiles.

If we have the phrase "Welcome back.", with a relative search we should look for something like "elcome" so that it works properly (upper and lower case shouldn't be mixed in this kind of search, since the distance between the two cases in the game's table may differ from ASCII and the search could fail).

What relative search does is convert "elcome" into the list of differences between consecutive characters:

php -r"$s = 'elcome'; $list = array(); for ($n = 1, $l = strlen($s); $n < $l; $n++) $list[] = ord($s[$n]) - ord($s[$n - 1]); print_r($list);"  

Array  
(  
[0] => 7  
[1] => -9  
[2] => 12  
[3] => -2  
[4] => -8  
)

When searching, we don't look for characters but for increments. This works when the text is not ASCII but the character table still keeps the letters in alphabetical order (the usual case).
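As a rough illustration of the idea, here is a minimal D sketch of a relative search. The names `deltas` and `searchRelative` and the sample bytes (the text shifted by 0x20) are made up for the example; they are not taken from the emulator.

```d
import std.stdio;

// Differences between consecutive bytes: "elcome" -> [7, -9, 12, -2, -8].
int[] deltas(const(ubyte)[] text) {
    int[] result;
    foreach (n; 1 .. text.length) {
        result ~= cast(int)text[n] - cast(int)text[n - 1];
    }
    return result;
}

// Returns every offset in data where the byte-to-byte increments match
// those of the search text, whatever the actual byte values are.
size_t[] searchRelative(const(ubyte)[] data, const(ubyte)[] search) {
    size_t[] found;
    if (search.length < 2 || data.length < search.length) return found;

    auto wanted = deltas(search);
    foreach (n; 0 .. data.length - search.length + 1) {
        if (deltas(data[n .. n + search.length]) == wanted) found ~= n;
    }
    return found;
}

void main() {
    // Hypothetical encoding: "elcome" with every byte shifted by 0x20.
    ubyte[] data = [0x00, 0x85, 0x8C, 0x83, 0x8F, 0x8D, 0x85, 0x00];
    writeln(deltas(cast(const(ubyte)[])"elcome"));
    writeln(searchRelative(data, cast(const(ubyte)[])"elcome"));
}
```

Recomputing the deltas for every window keeps the sketch short; a real search would compute the deltas of the whole dump once.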

Pattern-based search looks for repetitions instead. Each byte is replaced by a letter that identifies its first appearance:

"Welcome back"

"abcdefbghidj"

In this case a != b != c != d...: every letter stands for a different byte value, and a repeated letter (b for "e", d for "c") has to match a repeated byte.

Some time ago I tried several implementations. Here is one in D for anyone interested:

import std.stdio, std.file;

// Replaces each byte by the index of its first appearance: "abca" -> [0, 1, 2, 0].
ubyte[] normalize(const(ubyte)[] text) {
 int[0x100] translate = -1;

 auto text2 = new ubyte[text.length];

 int count_symbols = 0;

 foreach (n, cn; text) {
  if (translate[cn] == -1) translate[cn] = count_symbols++;
  text2[n] = cast(ubyte)translate[cn];
 }

 return text2;
}

// Prints every offset of data whose repetition pattern matches the one of search.
void search_pattern(const(ubyte)[] data, const(ubyte)[] search) {
 // Lengths are unsigned, so the subtraction below must not go negative.
 if (data.length < search.length) return;

 auto search_normalized = normalize(search);

 for (size_t n = 0; n <= data.length - search.length; n++) {
  if (normalize(data[n .. n + search.length]) == search_normalized) {
   writefln("%08X found!", n);

   // Shows which byte of the dump corresponds to each searched character.
   foreach (m; 0 .. search.length) {
    writefln("%02X: %s", data[n + m], cast(char)search[m]);
   }
  }
 }
}

void main() {
 search_pattern(cast(const(ubyte)[])read("ramdump.bin"), cast(const(ubyte)[])"you're waiting for someone");
}

It could be optimized: instead of normalizing the whole window, normalize it incrementally while comparing it against the normalized search string, so the comparison can short-circuit at the first mismatch. But this implementation was fast enough for what I wanted, and I didn't want to make it more complex.
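A minimal sketch of that short-circuit variant could look like the following. It is not the code used in the emulator; `matchesPattern` and the sample strings are invented for the example. It keeps two small lookup tables instead of building a normalized copy, and leaves the loop at the first inconsistency.

```d
import std.stdio;

// Checks whether window has the same repetition pattern as search without
// building a normalized copy: it bails out at the first mismatch.
bool matchesPattern(const(ubyte)[] window, const(ubyte)[] search) {
    if (window.length != search.length) return false;

    int[0x100] toWindow = -1; // search byte -> window byte it was paired with
    int[0x100] toSearch = -1; // window byte -> search byte it was paired with

    foreach (n, s; search) {
        immutable w = window[n];
        if (toWindow[s] == -1 && toSearch[w] == -1) {
            // First time both bytes are seen: record the pairing.
            toWindow[s] = w;
            toSearch[w] = s;
        } else if (toWindow[s] != w || toSearch[w] != s) {
            return false; // short-circuit
        }
    }
    return true;
}

void main() {
    auto data = cast(const(ubyte)[])"xxQzRRzQyy";
    auto search = cast(const(ubyte)[])"abccba";
    foreach (n; 0 .. data.length - search.length + 1) {
        if (matchesPattern(data[n .. n + search.length], search)) {
            writefln("%08X found", n);
        }
    }
}
```

Both tables are needed: one alone would accept a window where two different search characters map to the same byte.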

Dynamic recompilation (JIT)

Another thing I have been working on a lot is dynamic recompilation. I had no experience at all in this field. I had thought about it a few times and imagined how it could be implemented, but there is a long way from theory to practice. Fortunately, the ideas I had in mind turned out to be fairly sound.

One of the things I changed was the register that holds the pointer to the table of MIPS registers. I had said it was better to use a register that doesn't need to be preserved, to avoid a PUSH and POP every time a block of generated native code is called. But it turned out that every time the generated code called an external native function, that register had to be saved with PUSH and POP. And generally there were many more of those calls than exits from the generated code. So now I do a single PUSH just before entering the generated code, in the intermediate function I use to call it, and a POP just after it returns. This saves a lot of PUSHes and POPs all over the generated code. I'm now using the EBX register for this.

The other cool improvement is the mapping of instructions from guest to host. Initially I implemented it with an associative array, which has the access cost of a hash table. In code without many dynamic jumps that shouldn't be "too much" of a problem. But the problem comes when there are dynamic jumps everywhere; there I'm not sure the cost is acceptable. So I ended up sacrificing more memory (roughly twice as much) in exchange for constant-time access. If the PSP RAM is 32MB, the memory used would be 64MB. Bearing in mind that any web browser already uses much more than that just by being open, I don't think this is a big deal ;)
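To make the trade-off concrete, here is a small sketch of a flat lookup table with one slot per 4-byte guest instruction. `BlockMap`, `NativeBlock` and the base address 0x08000000 for main RAM are assumptions of the example, not the emulator's actual types. With 4-byte pointers the table takes as much memory as the guest RAM itself, which is where the "twice as much" comes from; on a 64-bit host the slots are twice as large.

```d
import std.stdio;

alias NativeBlock = void function();

struct BlockMap {
    enum uint guestBase = 0x0800_0000;       // assumed start of PSP main RAM
    enum uint guestSize = 32 * 1024 * 1024;  // 32MB of guest RAM

    // One slot per 4-byte MIPS instruction; null means "not compiled yet".
    NativeBlock[] slots;

    static BlockMap create() {
        BlockMap map;
        map.slots = new NativeBlock[guestSize / 4];
        return map;
    }

    // Constant time: a subtraction, a shift and an array access.
    void set(uint pc, NativeBlock block) { slots[(pc - guestBase) >> 2] = block; }
    NativeBlock get(uint pc) { return slots[(pc - guestBase) >> 2]; }
}

void hello() { writeln("running native block"); }

void main() {
    auto map = BlockMap.create();
    map.set(0x0880_0000, &hello);

    auto block = map.get(0x0880_0000);
    if (block !is null) block();
    writeln(map.get(0x0880_0004) is null);
}
```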

I don't recall whether I mentioned that at certain points of execution you need to stop to run callbacks, interrupts, thread switches, breakpoint checks, etc. If I did, I suppose I also said that jumps are a good place for this kind of check: they are instructions that are bound to execute sooner or later, yet they are less frequent than others.

Another thing I'm working on is improving this process. In simple loops with lots of iterations the check can be expensive. And even if this kind of loop will generally end up converted into memset, memcpy and similar calls later on, improving this process should be a significant speedup for all applications.

VirtualAlloc

By using several VirtualAlloc calls we can substantially speed up every access (read or write) to guest memory. This is particularly simple given that, with the HLE emulation I'm doing, games are not going to access DMA.

Normally, on every read or write to guest memory we have to bear in mind that the PSP has 3 physically accessible memory segments: the scratchpad, the graphics memory and the normal RAM. So to access memory we first have to determine which segment is being accessed, check that the access stays inside it, and calculate the corresponding position in host memory. With various optimizations this is more or less fast, but memory access is a very common operation, so any optimization is welcome.

It turns out that Windows has a function, VirtualAlloc (as far as I know Linux has an equivalent called mmap), that lets the current process reserve memory at an arbitrary address. So we can reserve the segments keeping the same gaps between them as in the guest, which makes the conversion between guest and host addresses a single addition.
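The difference between the two translations can be sketched with plain arithmetic. Nothing is actually reserved here: `hostBase` is a made-up number standing for the address at which the segments would have been reserved, and the segment addresses and sizes (other than the 32MB of RAM) are assumptions of the example.

```d
import std.stdio;

struct Segment { uint start; uint size; size_t hostStart; }

// Lookup-based translation: find the segment, check bounds, then offset.
size_t translateBySegment(const Segment[] segments, uint guest) {
    foreach (seg; segments) {
        if (guest >= seg.start && guest - seg.start < seg.size) {
            return seg.hostStart + (guest - seg.start);
        }
    }
    assert(0, "invalid guest address");
}

// Translation when every segment sits at hostBase + its guest address.
size_t translateFlat(size_t hostBase, uint guest) {
    return hostBase + guest;
}

void main() {
    // Hypothetical host base; a real one would come from reserving address space.
    enum size_t hostBase = 0x1000_0000;
    immutable Segment[] segments = [
        Segment(0x0001_0000, 16 * 1024, hostBase + 0x0001_0000),        // scratchpad
        Segment(0x0400_0000, 2 * 1024 * 1024, hostBase + 0x0400_0000),  // graphics memory
        Segment(0x0800_0000, 32 * 1024 * 1024, hostBase + 0x0800_0000), // main RAM
    ];

    foreach (guest; [0x0001_0004u, 0x0400_0100u, 0x0880_0010u]) {
        assert(translateBySegment(segments, guest) == translateFlat(hostBase, guest));
        writefln("%08X -> %X", guest, translateFlat(hostBase, guest));
    }
}
```

The flat version does no bounds check at all; an access outside the reserved ranges simply hits unmapped host memory.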

I made a small demo that filled the whole PSP screen with a color whose value kept increasing. Going from the interpreter to the dynarec took it from 6 to 50fps; with VirtualAlloc it went straight to 60fps. There is still a lot of room for improvement, but step by step :)

Clut (Color LookUp Table)

One of the things still missing was the clut. My initial idea was to do the texture unswizzling and the clut in shaders, but for now I have gone with the typical approach: make a hash for the clut data and another for the texture and, if either of them changes, generate a standard 32-bit texture with the palette applied and unswizzled.
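A minimal sketch of that caching scheme, under several assumptions: the texture is 8-bit indexed, the palette already holds 32-bit colors, unswizzling is left out, and FNV-1a is used only as a stand-in because the post doesn't say which hash is used. `CachedTexture` and `hashBytes` are names invented for the example.

```d
import std.stdio;

// FNV-1a (32-bit), a stand-in hash for the example.
uint hashBytes(const(ubyte)[] data) {
    uint h = 2166136261u;
    foreach (b; data) {
        h ^= b;
        h *= 16777619u;
    }
    return h;
}

struct CachedTexture {
    uint textureHash, clutHash;
    bool valid;
    uint[] rgba;   // decoded 32-bit pixels
    int rebuilds;  // how many times the texture was regenerated

    // Regenerates the 32-bit texture only when the indices or the palette change.
    const(uint)[] get(const(ubyte)[] indices, const(uint)[] clut) {
        immutable th = hashBytes(indices);
        immutable ch = hashBytes(cast(const(ubyte)[])clut);
        if (!valid || th != textureHash || ch != clutHash) {
            rgba = new uint[indices.length];
            foreach (n, index; indices) rgba[n] = clut[index];
            textureHash = th;
            clutHash = ch;
            valid = true;
            rebuilds++;
        }
        return rgba;
    }
}

void main() {
    ubyte[] indices = [0, 1, 1, 2];
    uint[] clut = [0xFF000000, 0xFF0000FF, 0xFF00FF00];

    CachedTexture cache;
    cache.get(indices, clut);
    cache.get(indices, clut);  // nothing changed: the cached pixels are reused
    clut[1] = 0xFFFF0000;      // palette changed: the texture is rebuilt
    auto pixels = cache.get(indices, clut);
    writefln("%08X rebuilds=%s", pixels[1], cache.rebuilds);
}
```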

HLE/KD

I added UIDs with low ids for thread identifiers and file handles. Before, I was passing a pointer into host memory as the UID value. The problems with that are that it is harder to track and that, when I implement savestates, it would cause more problems than advantages.
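The idea can be sketched as a small pool that hands out consecutive integers and maps them back to host objects. `UidPool` and `GuestThread` are hypothetical names for the example, not the emulator's own classes.

```d
import std.stdio;

// Hands out small integer ids for host objects instead of raw host pointers.
struct UidPool(T) {
    private T[int] items;
    private int lastId = 0;

    int alloc(T item) {
        immutable id = ++lastId;
        items[id] = item;
        return id;
    }

    T get(int id) {
        auto p = id in items;
        if (p is null) throw new Exception("invalid UID");
        return *p;
    }

    bool has(int id) { return (id in items) !is null; }
    void free(int id) { items.remove(id); }
}

class GuestThread {
    string name;
    this(string name) { this.name = name; }
}

void main() {
    UidPool!GuestThread threads;
    int a = threads.alloc(new GuestThread("main"));
    int b = threads.alloc(new GuestThread("audio"));
    writeln(a, " ", b, " ", threads.get(b).name);

    threads.free(a);
    writeln(threads.has(a));
}
```

Small ids are easy to read in logs, and they stay meaningful when a savestate is loaded into a process where host pointers have different values.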

I also made several corrections to threads and files. As always, I work with a very superficial knowledge, so I sometimes become aware of an important detail only after having implemented several things in a quite different, incompatible way.

I made a lot of fixes to kernel functions and added lots and lots of unimplemented NIDs.

Audio

Another thing I have been experimenting with in these revisions is audio.

Audio mixing is also something I had never done before. My first thought was to use FMOD, and I did some tests with it. But I wanted to see how it would be done with the Windows API directly, and I also wanted to avoid depending on external libraries at all costs. Since all the data reaching the audio library has the same frequency and only one or two channels, it was a fairly simple process. What worried me a little was resampling, which I have never done either. But mixing audio at the same frequency isn't complicated: it is enough to take the arithmetic mean of the samples of all the channels. There were a few interesting details, but in theory it isn't very complicated. Even so, all I got was a pretty nasty sound, although you could make it out. Maybe there is a buffer underrun in the mixer buffer, or something else, but I wasn't too worried about it: having it half working was enough for now, so I moved on to other things. I will make it sound right in a future version.
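The averaging step on its own can be sketched like this, assuming signed 16-bit samples and buffers that share sample rate and layout; `mixAverage` is a name made up for the example and has nothing to do with the buffering problem described above.

```d
import std.stdio;

// Mixes buffers with the same sample rate by averaging them sample by sample.
// A buffer that is shorter than the others counts as silence once it ends.
short[] mixAverage(const(short[])[] channels) {
    if (channels.length == 0) return null;

    size_t length = 0;
    foreach (c; channels) {
        if (c.length > length) length = c.length;
    }

    auto mixed = new short[length];
    foreach (n; 0 .. length) {
        int sum = 0; // wider than short so the sum cannot overflow
        foreach (c; channels) {
            if (n < c.length) sum += c[n];
        }
        mixed[n] = cast(short)(sum / cast(int)channels.length);
    }
    return mixed;
}

void main() {
    short[] a = [1000, -2000, 3000];
    short[] b = [3000, 2000];
    writeln(mixAverage([a, b]));
}
```

Averaging never clips, at the cost of lowering the overall volume as more channels are added.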

Misc

I added support to take PNG screenshots.

I started implementing partial support for sceIoDevctl and for callbacks.

Instruction table

One of the coolest things about implementing the emulator in D, which other emulators don't have, is the compile-time generation of switches (or tables, if needed) for fast instruction decoding. And from a single table we can generate the decoder for the interpreter, the dynarec and the disassembler. Usually this is much more complex, and emulators end up with lots of duplicated code that makes maintenance much harder.
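As a toy illustration of generating a switch at compile time, the sketch below builds the source of a switch statement from a table with CTFE and pastes it in with a string mixin. It is far simpler than the emulator's generator: the `Op` struct and the two-entry table are invented for the example (J and JAL do have primary opcodes 2 and 3 in MIPS).

```d
import std.conv : to;
import std.stdio;

struct Op { string name; uint opcode; }

// Tiny stand-in table keyed by the 6-bit primary opcode.
enum Op[] ops = [Op("j", 2), Op("jal", 3)];

// Builds the source text of a switch statement; evaluated at compile time.
string generateSwitch(const Op[] ops) {
    string code = "switch (opcode) {\n";
    foreach (op; ops) {
        code ~= "case " ~ to!string(op.opcode) ~ ": return \"" ~ op.name ~ "\";\n";
    }
    code ~= "default: break;\n}\n";
    return code;
}

string nameOf(uint instruction) {
    immutable uint opcode = instruction >> 26; // top 6 bits of the word
    mixin(generateSwitch(ops));
    return "unknown";
}

void main() {
    writeln(nameOf(0x0800_0000)); // opcode 2
    writeln(nameOf(0x0C00_0000)); // opcode 3
    writeln(nameOf(0xFC00_0000)); // not in the table
}
```

The same table could feed a second generator that emits a disassembler or a dynarec dispatcher, which is the point of keeping a single source of truth.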

The table looked something like this:

ID( "mfdr",               VM(0x7000003D, 0xFFE007FF), "%t, %r", ADDR_TYPE_NONE, INSTR_TYPE_PSP ),  
ID( "mfhi",               VM(0x00000010, 0xFFFF07FF), "%d", ADDR_TYPE_NONE, 0 ),  
ID( "mfic",               VM(0x70000024, 0xFFE007FF), "%t, %p", ADDR_TYPE_NONE, INSTR_TYPE_PSP ),  
ID( "mflo",               VM(0x00000012, 0xFFFF07FF), "%d", ADDR_TYPE_NONE, 0 ),  
ID( "movn",               VM(0x0000000B, 0xFC0007FF), "%d, %s, %t", ADDR_TYPE_NONE, INSTR_TYPE_PSP ),

It's pretty centralized. Two integer values: one with the bits that must be set or clear for a word to be that instruction, and the other a mask of the bits that are going to be compared. It is quite practical, and from it you can generate the nested switches or tables you need. The drawbacks: it is written in hexadecimal while the logic is binary. And even in binary, the mask would still be separate, so it is hard to see the part that matters.
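The value/mask test itself is a single AND and a comparison. The sketch below uses three entries copied from the table above and a plain linear scan instead of generated switches; `Entry` and `decode` are names made up for the example.

```d
import std.stdio;

struct Entry { string name; uint value; uint mask; }

// Entries taken from the table above.
immutable Entry[] table = [
    Entry("mfhi", 0x00000010, 0xFFFF07FF),
    Entry("mflo", 0x00000012, 0xFFFF07FF),
    Entry("movn", 0x0000000B, 0xFC0007FF),
];

// An instruction matches when the bits selected by the mask equal the value.
string decode(uint instruction) {
    foreach (entry; table) {
        if ((instruction & entry.mask) == entry.value) return entry.name;
    }
    return "unknown";
}

void main() {
    writeln(decode(0x00001010)); // mfhi with rd = 2
    writeln(decode(0x0085100B)); // movn with rs = 4, rt = 5, rd = 2
    writeln(decode(0xFFFFFFFF));
}
```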

Some time ago I saw Allegrex.isa, which improves on this. Each instruction is associated with a string like:

000001:rs:00001:imm16

With this you can visually determine which bits are being considered, and the different parts of the instruction. I liked it a lot.

So, making use of the wonderful power of D's CTFE, I'm converting the instructions into something like this:

ID( "xor",    VM("000000:rs:rt:rd:00000:100110"), "%d, %s, %t", ADDR_TYPE_NONE, 0 ),  
ID( "xori",   VM("001110:rs:rt:imm16"          ), "%t, %s, %I", ADDR_TYPE_NONE, 0 ),  

// Shift Left/Right Logical/Arithmetic (Variable).
ID( "sll",    VM("000000:00000:rt:rd:sa:000000"), "%d, %t, %a", ADDR_TYPE_NONE, 0 ),  
ID( "sllv",   VM("000000:rs:rt:rd:00000:000100"), "%d, %t, %s", ADDR_TYPE_NONE, 0 ),  
ID( "sra",    VM("000000:00000:rt:rd:sa:000011"), "%d, %t, %a", ADDR_TYPE_NONE, 0 ),

It improves substantially on what I had before, making it much clearer and more maintainable.

VM is a structure with two static opCall overloads, one for the old definition and one for the new. Since it is a pure function, it can be called at compile time without any issue, generating the original value and mask from the input string :) :

import std.string; // format(), used by InstructionDefinition.toString

struct ValueMask {
 string format;  
 uint value, mask;  

 static ValueMask opCall(uint value, uint mask) {  
  ValueMask ret;  
  ret.format = "";  
  ret.value = value;  
  ret.mask  = mask;  
  return ret;  
 }  

 // Two definitions are equal when both value and mask match.
 bool opEquals(const ValueMask that) const {
  return (this.value == that.value) && (this.mask == that.mask);
 }

 static ValueMask opCall(string format) {  
  ValueMask ret;  
  string[] parts;  
  ret.format = format;  
  int start, total;  

  for (int n = 0; n <= format.length; n++) {  
   if ((n == format.length) || (format[n] == ':')) {  
    parts ~= format[start..n];  
    start = n + 1;  
   }  
  }  

  void alloc(uint count) {  
   ret.value <<= count;  
   ret.mask  <<= count;  
   total += count;  
  }  

  void set(uint value, uint mask) {  
   ret.value |= value;  
   ret.mask  |= mask;  
  }  

  foreach (part; parts) {  
   switch (part) {  
    case "rs", "rd", "rt", "sa", "lsb", "msb", "fs", "fd", "ft": alloc(5); break;  
    case "fcond": alloc(4 ); break;  
    case "imm16": alloc(16); break;  
    case "imm26": alloc(26); break;  
    default:  
     if ((part[0] != '0') && (part[0] != '1')) {  
      assert(0, "Unknown identifier");  
     } else {  
      for (int n = 0; n < part.length; n++) {  
       alloc(1);  

       if (part[n] == '0') {  
        set(0, 1);  
       } else if (part[n] == '1') {  
        set(1, 1);  
       } else {
        // Only '0' and '1' are valid inside a literal bit group.
        assert(0, "Invalid bit in pattern");
       }
      }  
     }  

    break;  
   }  
  }  

  assert(total == 32);  

  return ret;  
 }  
}  

struct InstructionDefinition {  
 string  name;  
 ValueMask opcode;  
 string  fmt;  
 uint    addrtype;  
 uint    type;  

 string toString() {  
  return format("InstructionDefinition('%s', %08X, %08X, '%s', %s, %s)", name, opcode.value, opcode.mask, fmt, addrtype, type);  
 }  
}
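To check that the string form really collapses to the old numbers at compile time, here is a compact, self-contained sketch. `Bits` and `parse` are a reduced re-implementation written for this example (they only know a few field names), not the ValueMask code above; the `enum` declarations force CTFE, and the `static assert` lines fail the build if the result is wrong.

```d
import std.stdio;

struct Bits { uint value, mask; }

// Reduced version of the string parser, for illustration only.
Bits parse(string format) {
    Bits ret;
    int total = 0;
    size_t start = 0;

    foreach (n; 0 .. format.length + 1) {
        if (n != format.length && format[n] != ':') continue;
        string part = format[start .. n];
        start = n + 1;

        if (part.length && (part[0] == '0' || part[0] == '1')) {
            // Literal bits: they go into the value and are selected by the mask.
            foreach (c; part) {
                assert(c == '0' || c == '1', "Invalid bit");
                ret.value = (ret.value << 1) | (c == '1');
                ret.mask  = (ret.mask  << 1) | 1;
                total++;
            }
        } else {
            // Named field: make room for it, leaving value and mask bits at 0.
            int width;
            switch (part) {
                case "rs", "rt", "rd", "sa": width = 5; break;
                case "imm16": width = 16; break;
                case "imm26": width = 26; break;
                default: assert(0, "Unknown identifier");
            }
            ret.value <<= width;
            ret.mask  <<= width;
            total += width;
        }
    }

    assert(total == 32, "An instruction must describe exactly 32 bits");
    return ret;
}

// `enum` forces evaluation at compile time.
enum xorBits  = parse("000000:rs:rt:rd:00000:100110");
enum xoriBits = parse("001110:rs:rt:imm16");

static assert(xorBits.value  == 0x00000026 && xorBits.mask  == 0xFC0007FF);
static assert(xoriBits.value == 0x38000000 && xoriBits.mask == 0xFC000000);

void main() {
    writefln("%08X %08X", xorBits.value, xorBits.mask);
}
```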

Magic!

BTW: I'm the most active D contributor of the year on ohloh :D http://www.ohloh.net/languages/32
