dreamcast.wiki - User contributions [en]

SH4 FTRV Optimizations

2026-07-03T05:30:00Z

GyroVorbis:

Without a doubt, the single most computationally powerful instruction on the SuperH4 CPU in the Sega Dreamcast is '''FTRV''', or the '''F'''loating-point '''TR'''ansform '''V'''ector instruction. It is a single instruction which multiplies a 4D vector by the 4x4 matrix held within the back-bank of FPU registers, '''XMTRX'''. This article will teach you how to leverage this god instruction for FP performance gainz and introduce you to several example scenarios that have yielded fantastic gainz within the community.

==Relationship to FIPR==

{| class="wikitable"
|+ Instruction Summaries
|-
! Format !! Function !! Encoding !! Group !! Issue Cycles !! Latency Cycles
|-
| '''fipr''' FVm,FVn || inner_product (FVm, FVn) -> FR[n+3] || 1111nnmm11101101 || FE || 1 || 4/5
|-
| '''ftrv''' XMTRX, FVn || transform_vector(XMTRX, FVn) -> FVn || 1111nn0111111101 || FE || 1 || 5/8
|}

==When to use FTRV==
There are two things to look out for when developing an intuition for when you can leverage FTRV:
# There are 3 or more dot products (ax + by + cz + dw) being calculated back-to-back.
# One of the vectors is held constant while the other vector argument is variable across each.

When you see such a scenario, the first thing that should pop into your head is that you are dealing with a vector x matrix transform, and you can use '''FTRV''' to accelerate it.
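
To make the pattern concrete, here is a minimal portable C++ sketch of the arithmetic involved. It is plain host code rather than SH4 code, and the names <code>Vec4</code>, <code>Mat4</code>, <code>dot4</code>, and <code>transform</code> are hypothetical, chosen only for this illustration. It shows that four dot products sharing one vector argument are the same thing as one 4x4 matrix-by-vector transform:

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>;   // four row vectors (hypothetical layout)

// A single 4D dot product: the operation FIPR performs.
inline float dot4(const Vec4 &a, const Vec4 &b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Four dot products against the same vector v. This is a 4x4 matrix
// multiplied by a 4D vector, which is the operation FTRV performs.
inline Vec4 transform(const Mat4 &rows, const Vec4 &v) {
    return { dot4(rows[0], v), dot4(rows[1], v),
             dot4(rows[2], v), dot4(rows[3], v) };
}
```

Whenever a loop in your own code has the shape of <code>transform</code> above, with <code>v</code> fixed and the rows changing, it is a candidate for being rewritten around FTRV.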

==Real-World Examples==
The following are real-world examples of FTRV-based optimizations used within games and applications for the Sega Dreamcast within the community.
===Vertex Position Transformation===
The first and most obvious use of the FTRV instruction is for doing position transform calculations on the incoming vertex stream, transforming from local to view-space, while submitting vertices to the PowerVR during T&L. This is the first and absolute most crucial area for leveraging FTRV and was its original intended purpose. If you do nothing else with the instruction, bear in mind that the only way to come even remotely close to pushing a considerable volume of polygons on the DC is by properly harnessing the SH4 by using FTRV to transform your vertices.
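
As a rough reference for what the per-vertex work looks like, the following portable C++ sketch transforms a stream of positions by a 4x4 matrix, treating each position as <code>(x, y, z, 1)</code>. It is not SH4 code: the types and function names are hypothetical, and the row-major <code>m[row][col]</code> layout is an assumption made for readability. On the SH4, the matrix would be loaded into XMTRX once, and the body of <code>transformPoint</code> would collapse into a single FTRV per vertex.

```cpp
#include <array>
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, w; };
using Mat4 = std::array<std::array<float, 4>, 4>;  // assumed row-major: m[row][col]

// Transform one position, treated as (x, y, z, 1), by the matrix.
// Each output component is one dot product of a matrix row with the vertex.
inline Vec4 transformPoint(const Mat4 &m, const Vec3 &p) {
    return {
        m[0][0] * p.x + m[0][1] * p.y + m[0][2] * p.z + m[0][3],
        m[1][0] * p.x + m[1][1] * p.y + m[1][2] * p.z + m[1][3],
        m[2][0] * p.x + m[2][1] * p.y + m[2][2] * p.z + m[2][3],
        m[3][0] * p.x + m[3][1] * p.y + m[3][2] * p.z + m[3][3]
    };
}

// Transform a whole vertex stream with the same (constant) matrix.
inline std::vector<Vec4> transformVertices(const Mat4 &m,
                                           const std::vector<Vec3> &in) {
    std::vector<Vec4> out;
    out.reserve(in.size());
    for (const Vec3 &p : in)
        out.push_back(transformPoint(m, p));
    return out;
}
```

The matrix stays constant across the whole stream while the vertex changes on every iteration, which is the same "one argument constant, one variable" pattern described above.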

===Diffuse Lighting===

===Collision and Physics===
===Bounding Sphere vs View Frustum Culling===
The following code snippet is taken from the DCA3 codebase as part of its RenderWare driver back-end for Dreamcast. To check for intersection between a bounding sphere and the view frustum, you must compute the dot product between the center of the bounding sphere and each of the 6 view frustum planes. Once the plane's distance term has been subtracted, each result is the signed distance from the sphere's center to that plane.

The original algorithm was as follows:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;

    // Iterate over each of the 6 frustum planes.
    const FrustumPlane *p = this->frustumPlanes;
    for(int32 i = 0; i < 6; i++){
        // Compute dot product between each plane and the centroid.
        // Your FIPR senses should be tingling, since we've got dot products!
        float32 distance = dot(p->plane, s->center) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE; // No intersection
        if(s->radius > -distance)
            res = SPHEREBOUNDARY; // Intersection

        p++;
    }
    return res;
}
</syntaxhighlight>

Since 4 or more dot products are being taken while one of the vectors (the sphere's center) stays constant across all of them, this is a textbook case for using FTRV to compute 4 of them in parallel.

So we use FIPR to accelerate two of the dot products and FTRV to compute the other 4 at once. The following code uses [https://github.com/gyrovorbis/sh4zam libSH4ZAM] to achieve this:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;
    const FrustumPlane *p = this->frustumPlanes;

    // Use FIPR to accelerate first two dot products independently
    for(unsigned i = 0; i < 2; ++i) {
        float distance = shz_vec3_dot(s->center, p->plane) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE;
        else if(s->radius > -distance)
            res = SPHEREBOUNDARY;

        p++;
    }

    /* Since each plane is a 4D vector, we can load each one as a row vector
       into XMTRX, creating a 4x4 matrix out of the 4 remaining planes. */
    shz_xmtrx_load_4x4_rows(&p[0].plane, &p[1].plane, &p[2].plane, &p[3].plane);

    /* Now we transform our constant vector, the bounding sphere's center, by
       the matrix of 4 plane vectors. This gives us a result vector where the
       value of each component is equal to the dot product between the sphere's
       centroid and the corresponding plane row vector. The W component is set
       to -1 so that each plane's distance term is subtracted as part of the
       same dot product. */
    shz_vec4_t distances = shz_xmtrx_trans_vec4(shz_vec4_t(s->center, -1.0f));

    /* Now we simply iterate over each of our 4 result components to check
       for intersection. */
    for(unsigned i = 0; i < 4; ++i) {
        if(s->radius < distances.elem[i])
            return SPHEREOUTSIDE;
        else if(s->radius > -distances.elem[i])
            res = SPHEREBOUNDARY;
    }

    return res;
}
</syntaxhighlight>
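
For readers who want to check the math on a PC first, the following is a portable C++ sketch of the same arithmetic, with no SH4ZAM and no FTRV involved. It assumes, as the code above does, that each plane stores a normal plus a distance term and that a positive signed distance means the center lies on the outer side of the plane. The type and function names (<code>Plane</code>, <code>Vec3</code>, <code>signedDistances</code>, <code>classify</code>) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

enum SphereResult { SPHEREOUTSIDE, SPHEREBOUNDARY, SPHEREINSIDE };

struct Vec3  { float x, y, z; };
struct Plane { float nx, ny, nz, distance; };  // assumed layout: normal + distance

// Four signed distances at once. Each one is the 4D dot product of a plane
// (nx, ny, nz, distance) with the vector (center.x, center.y, center.z, -1),
// which is what a single matrix-by-vector transform computes.
inline std::array<float, 4> signedDistances(const std::array<Plane, 4> &planes,
                                            const Vec3 &center) {
    const float w = -1.0f;  // makes the dot product subtract the distance term
    std::array<float, 4> out{};
    for (std::size_t i = 0; i < 4; ++i) {
        const Plane &p = planes[i];
        out[i] = p.nx * center.x + p.ny * center.y + p.nz * center.z
               + p.distance * w;
    }
    return out;
}

// Same classification rule as the loops in the code above.
inline SphereResult classify(const std::array<float, 4> &distances,
                             float radius,
                             SphereResult res = SPHEREINSIDE) {
    for (float d : distances) {
        if (radius < d)
            return SPHEREOUTSIDE;
        if (radius > -d)
            res = SPHEREBOUNDARY;
    }
    return res;
}
```

Comparing the output of a sketch like this against the FTRV path for a handful of spheres is a cheap way to catch a transposed matrix or a sign error in the W component.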

===ADPCM Decoding===

Development

2026-01-08T20:39:08Z

GyroVorbis: Added link to SH4ZAM documentation

=== Getting started ===
* [[Getting Started with Dreamcast development]] -- start here!
====Ready-to-use environments====
* [[Codespaces]] (Browser-based development)
* [[Docker images]]
* [[DreamSDK]] (Windows only)

====[[KallistiOS]]====
* Building on Linux, macOS, Windows Subsystem for Linux
** see [[Getting Started with Dreamcast development|''Getting Started with Dreamcast development'']]
* [[Building KOS on Cygwin]]
* [[Building KOS on MinGW/MSYS]]
* [[Building KOS on MinGW-w64/MSYS2]]
* [https://kos-docs.dreamcast.wiki/ KallistiOS Doxygen documentation]

====Other====
* [[Using Ruby for Sega Dreamcast development]] (experimental)
* [[Compiling for Naomi]]

=== Build & test ===
* [[Building your project]]
* [[Emulators]]
* [[Broadband adapter]] / [[LAN adapter]]
** [[Using dcload-ip with Linux]]
** [[Using dcload-ip with Windows Subsystem for Linux|Using dcload-ip with Windows 10]] (via Windows Subsystem for Linux)
* [[Coder's cable]]

=== Environments and IDEs ===
* [[CLion Debugging]]
* [[Visual Studio Code]]

=== Tools & utilities ===
* [[Debugging throught GNU Debugger (GDB) and dcload/dc-tool|Debugging through GNU Debugger (GDB) and dcload/dc-tool]]
* [[Using dcprof]]

=== Releasing your project ===
* Plain files
* Disc image
* Selfboot Inducer package

=== Engines ===
''See'' [[Engine & Library]]

=== General ===
* [[Store Queues]]
* [[Romdisk Swapping]]
* [https://mc.pp.se/dc/hw.html Marcus Comstedt's Dreamcast Hardware Reference]

=== Graphics ===
* [[Texture Formats]]
* [[Graphics APIs]]
* [[Paletted Textures]]
* [[2D Rendering Without PVR]]
* [[Twiddling]]

* PVR
** [[PowerVR Introduction]]
** [[PVR Spritesheets]]
* [[GLdc]]
** [[Drawing 2D sprites using GLdc]]
** [[Drawing 3D shapes using GLdc]]
** [https://hkowsoftware.com/articles/gldc-vertex-formats-from-vec3f-to-fastpath-to-map_buffer/ GLdc Vertex Formats: From vec3f to fastpath to map_buffer]
* Others
** [http://www.numechanix.com/blog/index.php/2015/10/03/20/ Procedural texture]
** [[Notes on fillrate and drawing large textures]]
** [[KMG Textures]]
** [[Loading PNG images as OpenGL textures]]

=== Audio ===

=== Maple ===
* Controller input

=== VMU ===
* [[Save/Load file]]
* [[Show icon]]
* [[Play tone]]

=== Optimization ===
* [[GCC-SH4 tips]]
* [[Fast SH4 Vertex Processing]]
* [[Useful programming tips]]
* [[Efficient usage of the Dreamcast RAM]]
* [[SH4 FIPR Optimizations]]
* [[SH4 FTRV Optimizations]]
* [http://sh4zam.falcogirgis.net SH4ZAM docs]
* Registers
* DMA
* TA
* PVR

=== Website Development ===
*[[Development Resources]]

=== Random Snippets ===
* [[Objdump]]

Engine & Library

2026-01-08T20:35:53Z

GyroVorbis: Added SH4ZAM

This list is an adaptation of the [https://github.com/dreamcastdevs/awesome-dreamcast awesome-dreamcast] repo on GitHub.

==Tutorial==
*[https://dreamcast.wiki Dreamcast.wiki] - A brand new wiki with up-to-date information about the Dreamcast.
*[https://dcemulation.org/index.php?title=Development DCEmu Development Wiki] - Great resource to start with. Somewhat incomplete in certain aspects.
*[https://github.com/dreamcastdevs/dreamcast_tutorial Dreamcast-tutorial GitHub] - A new-ish set of tutorials with code examples. Covers the basics (installing the toolchain, graphics, audio, controller).

==Framework==
*[https://sourceforge.net/projects/cadcdev/ KOS] - The pseudo-OS that's been used by a lot of homebrew and indie projects.
*[https://www.dreamsdk.org DreamSDK] - A multitool environment made for Windows. Maintained by [[User:SiZiOUS|SiZiOUS]].
*[http://wiki.bennugd.org/index.php?title=Bennu_Wiki BennuGD] - A multi-platform engine.
*[https://github.com/FaucetDC/WincastCE WincastCE] - An experimental Windows CE shell (?)
*[https://github.com/DC-SWAT/DreamShell DreamShell] - The popular alternative operating system for loading games and apps from an SD card or IDE drive.
*[[libronin]] - an independent development library created by the DreamSNES team
*[http://sh4zam.falcogirgis.net SH4ZAM] - Fast math library for the Sega Dreamcast's SH4 CPU (included within kos-ports).

==Engine==
*[[Simulant]] - A general-purpose 2D/3D engine in active development.
*[[nuQuake]] - Quake engine by MrNeo240
*[[RADquake]] - Quake engine by Ian Micheal
*[https://github.com/ianmicheal/Dreambor6.0 DreamBOR - unofficial] - OpenBOR Dreamcast port forked and improved by Ian Micheal
*[https://github.com/CaptainDreamcast/DolmexicaInfinite DolmexicaInfinite] - A Mugen-like engine for fighter games
*[[Antiruins]] - Minimal 2D game engine with Lua scripting by Lerabot.

==Graphics==
*[https://gitlab.com/simulant/GLdc GLdc] - An OpenGL 1.2 implementation started by Kazade
*[https://github.com/Kannagi/LMP3D LMP3D] - A multi-platform 3D lib. (''Looks abandoned but might be good for research'')
*[https://github.com/multimediamike/dreamroq DreamROQ] - A ROQ video player. (''Stable, no sound'')

==Audio==
*[https://gitlab.com/simulant/ALdc ALdc] - An OpenAL 1.2 implementation started by Kazade
*[https://github.com/Aurelien34/DreamcastAicaSoundDriver DreamcastAicaSoundDriver] - A hardware-accelerated S3M and SFX AICA driver.

==VMU==
*[https://github.com/Protofall/Crayon-Utilities CrayonUtil] - Mostly tools for VMU icons, but also some texture converters. Made by [https://github.com/Protofall Protofall].

==Utilities==
*[https://github.com/CaptainDreamcast/prism Prism] - CaptainDreamcast's set of utilities for physics, file loading, etc. (''untested'')

==Memory Management==

==Debugging==

==Random==
*[https://github.com/Protofall/Homebrew-Tests Homebrew Tests (Protofall)]

Development

2025-12-12T04:33:49Z

GyroVorbis:


SH4 in Compiler Explorer

2025-08-27T17:31:21Z

GyroVorbis: /* Configuration */

Thanks to the effort of Matt Godbolt (who hilariously enough is a former Dreamcast developer himself), the SuperH GCC toolchain is now available for use with [https://godbolt.org Compiler Explorer], along with all of the SH4-specific compiler flags and options typically used when targeting the Dreamcast. This gives us an invaluable tool for getting quick and immediate feedback on how well a given C or C++ source segment tends to translate into SH4 assembly, offering a little sandbox for testing and optimizing code targeting the Dreamcast.

[[File:GCC Compiler Benchmarks.png|thumb|Benchmarking various GCC versions and flags]]

= Configuration =
To arrive at a configuration mirroring a Dreamcast development environment, first select one of the GCC compiler versions for the SH architecture. Then use the following compiler options as the baseline configuration:
* <code>-ml</code>: compile code for the processor in little-endian mode
* FPU Mode:
** <code>-m4-single</code>: generate code for the SH4 with a floating-point unit that supports both single and double-precision floating point arithmetic that defaults to single-precision mode upon function entry
** <code>-m4-single-only</code>: generate code for the SH4 with a floating-point unit that only supports single-precision floating point arithmetic
* <code>-ffast-math</code>: breaks strict IEEE compliance and allows for faster floating point approximations
* <code>-O3</code>: optimization level 3
* <code>-mfsrra</code>: enables emission of the fsrra instruction for reciprocal square root approximations (not available in GCC 4.7.4)
* <code>-mfsca</code>: enables emission of the fsca instruction for sine and cosine approximations (not available in GCC 4.7.4)
* <code>-matomic-model=soft-imask</code>: enables support for C11 and C++11 atomics by disabling then reenabling interrupts around atomic variable operations
* <code>-mtas</code>: backs C11's <code>atomic_flag</code> type by emitting the <code>tas.b</code> instruction, which atomically tests and sets the flag's value and also purges the cache line it lies within
* <code>-ftls-model=local-exec</code>: enables the model used by KOS for supporting variables declared with the "thread_local" keyword
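
To see what the last three options actually change, it helps to have a small snippet that touches each feature. The following is a minimal sketch to paste into Compiler Explorer; the variable and function names are hypothetical, and it only uses standard C++11 facilities, so it also builds on a PC for comparison. Toggling <code>-mtas</code>, <code>-matomic-model=soft-imask</code>, and <code>-ftls-model=local-exec</code> while watching the generated assembly for these functions should show their effect.

```cpp
#include <atomic>
#include <cassert>

// Affected by -mtas: test_and_set() is the operation tas.b implements.
static std::atomic_flag g_lock = ATOMIC_FLAG_INIT;

// Affected by -matomic-model: a general atomic read-modify-write.
static std::atomic<int> g_counter{0};

// Affected by -ftls-model: one instance of this variable per thread.
static thread_local int t_calls = 0;

// Returns true if the lock was free and is now held by the caller.
bool tryLock() {
    return !g_lock.test_and_set(std::memory_order_acquire);
}

void unlock() {
    g_lock.clear(std::memory_order_release);
}

// Atomically increments the shared counter and returns the new value,
// while also counting calls made by the current thread.
int bump() {
    ++t_calls;
    return g_counter.fetch_add(1, std::memory_order_relaxed) + 1;
}
```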

= Convenience Templates =
The following are pre-configured templates you can use as sample Dreamcast build configurations:
* GCC 4.9.4:
** [https://godbolt.org/z/9MKzeMfMj C11 Hello World]
** [https://godbolt.org/z/qGzoeo4sj C++14 Hello World]
* GCC 9.5.0:
** [https://godbolt.org/z/rvW3s3594 C17 Hello World]
** [https://godbolt.org/z/qYfE5G6Mx C++17 Hello World]
* GCC 12.2.0:
** [https://godbolt.org/z/94TKvxazn C17 Hello World]
** [https://godbolt.org/z/61jqhE3zn C++20 Hello World]
* GCC 13.1.0:
** [https://godbolt.org/z/Kb9bKe8ro C2X Hello World]
** [https://godbolt.org/z/51dv4ePsG C++23 Hello World]
* GCC 13.2.0:
** [https://godbolt.org/z/rafvMdWGb C2X Hello World]
** [https://godbolt.org/z/MeG3rqna7 C++23 Hello World]
* GCC 14.1.0:
** [https://godbolt.org/z/7ha8oj4vd C23 Hello World]
** [https://godbolt.org/z/eW8Yd8sG9 C++26 Hello World]
* GCC 14.2.0:
** [https://godbolt.org/z/a54qTYon4 C23 Hello World]
** [https://godbolt.org/z/5njMrrh8z C++26 Hello World]
* GCC 15.1.0:
** [https://godbolt.org/z/WWsaaEqWW C23 Hello World]
** [https://godbolt.org/z/raP67zWcY C++26 Hello World]
* GCC 15.2.0:
** [https://godbolt.org/z/TGYr875vs C23 Hello World]
** [https://godbolt.org/z/7ET55a9sh C++26 Hello World]

Pre-configured template for ARM GCC 8.5, targeting the ARM7 core of the Dreamcast's AICA sound chip:
* [https://godbolt.org/z/895dWG9qf C Hello World]

= Tips and Notes =
* It has been noted that while <code>-O3</code> is claimed to be the highest optimization level according to recent GCC documentation, some code differences can still be seen under certain circumstances when using <code>-O4</code> and beyond.
* The compiler seems to ignore both <code>-mfsrra</code> and <code>-mfsca</code> without the <code>-ffast-math</code> option.
* With the proper flags enabled for fast-math, the compiler is smart enough to leverage the following from pure C code, almost certainly better than you can do with small intrinsic-style inline ASM calls, provided you're using the proper single-precision versions of any <code><math.h></code> routines:
**
<center>
{|
|+ Compiler-Generated Fast-Math Optimizations
|-
! Assembly Output !! C/C++ Input
|-
| FMAC || <code>z = x * y + z</code>
|-
| FSCA || <code>s = sinf(angle); c = cosf(angle)</code>
|-
| FSRRA || <code>1.0f / sqrtf(x)</code>
|}
</center>
* Unfortunately, the compiler has no knowledge of the following SIMD instructions, even with fast-math, so inline assembly routines (provided by KallistiOS) are necessary for fully leveraging the SH4's FPU when working with vectors and matrices in linear algebra routines:
**
<center>
{|
|+ Manual Inline Assembly Fast-Math Optimizations
|-
! Assembly Output !! C/C++ Input
|-
| FIPR || <code>Vector4 Dot Product</code>
|-
| FTRV || <code>Vector4 * Matrix4x4 Transform</code>
|}
</center>
* Typically smaller code sizes and more tightly optimized code are seen with newer versions of GCC versus the older ones; however, this is not always the case.
* Evidently, even without a branch predictor, the C++20 <code><nowiki>[[likely]]</nowiki></code> and <code><nowiki>[[unlikely]]</nowiki></code> attributes as well as the GCC intrinsic <code>__builtin_expect()</code> can have a fairly profound impact on code generation and optimization for conditionals and branches. More information can be found [https://dcemulation.org/phpBB/viewtopic.php?t=106029 here].
* <code>-fipa-pta</code> allows the compiler to analyze pointer and reference usage beyond the scope of the function currently being compiled, which very often results in decent performance increases at the cost of increased compile times and RAM usage.
* <code>-flto</code> allows GCC to perform optimizations over the entire program and all translation units as a single entity during the linking phase, at the cost of increased compile times and RAM usage. This frequently results in more performant code.
* An in-depth benchmark comparing the run-time performance and compiled binary size output of every toolchain version officially supported by KOS with various optimization levels can be found [https://dcemulation.org/phpBB/viewtopic.php?t=106068 here].
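
Several of the notes above are easiest to verify by pasting a small scratch file into Compiler Explorer and toggling flags. The following is a minimal sketch for that purpose, with hypothetical function names. The first three functions use the C input patterns from the fast-math table, and the last one uses <code>__builtin_expect()</code> as a branch hint. Whether each pattern actually becomes FMAC, FSRRA, or FSCA depends on the compiler version and flags in use, so treat the comments as things to look for rather than guarantees.

```cpp
#include <cassert>
#include <math.h>

// Candidate for FMAC: multiply-accumulate written as plain C.
float mulAdd(float x, float y, float z) {
    return x * y + z;
}

// Candidate for FSRRA (with -mfsrra and -ffast-math): reciprocal square root.
float invSqrt(float x) {
    return 1.0f / sqrtf(x);
}

// Candidate for FSCA (with -mfsca and -ffast-math): sine and cosine of the
// same angle, using the single-precision routines.
void sinCos(float angle, float &s, float &c) {
    s = sinf(angle);
    c = cosf(angle);
}

// Branch hints: both out-of-range cases are marked as unexpected.
// The C++20 [[unlikely]] attribute is the standard way to express the same hint.
int clampIndex(int i, int count) {
    if (__builtin_expect(i < 0, 0))
        return 0;
    if (__builtin_expect(i >= count, 0))
        return count - 1;
    return i;
}
```

Comparing the assembly for <code>invSqrt</code> and <code>sinCos</code> with and without <code>-ffast-math</code> is a quick way to confirm the note above about <code>-mfsrra</code> and <code>-mfsca</code> being ignored without it.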

SH4 in Compiler Explorer

2025-08-27T17:30:34Z

GyroVorbis: Templates for GCC 15.2.0


SH4 in Compiler Explorer

2025-08-27T17:28:06Z

GyroVorbis: Added -mtas flag.


SH4 FTRV Optimizations

2025-08-06T08:32:21Z

GyroVorbis: /* Bounding Sphere vs View Frustum Culling */

Without a doubt, the single most computationally powerful instruction on the SuperH4 CPU in the Sega Dreamcast is '''FTRV''', or the '''F'''loating-point '''TR'''ansform '''V'''ector instruction. It is a single instruction which multiplies a 4D vector by the 4x4 matrix held within the back-bank of FPU registers, '''XMTRX'''. This article will teach you how to leverage this god instruction for FP performance gainz and introduce you to several example scenarios that have yielded fantastic gainz within the community.

==Relationship to FIPR==

{| class="wikitable"
|+ Instruction Summaries
|-
! Format !! Function !! Encoding !! Group !! Issue Cycles !! Latency Cycles
|-
| '''fipr''' FVm,FVn || inner_product (FVm, FVn) -> FR[n+3] || 1111nnmm11101101 || FE || 1 || 4/5
|-
| '''ftrv''' XMTRX, FVn || transform_vector(XMTRX, FVn) -> FVn || 1111nn0111111101 || FE || 1 || 5/8
|}
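
As a rough mental model of the table above, FTRV can be thought of as four FIPR-style inner products against the same input vector. The sketch below is plain, portable C++ and not SH4 code: <code>Vec4</code>, <code>Mat4</code>, <code>fipr_model</code>, <code>ftrv_model</code> and <code>kScale2</code> are hypothetical names, and the way the four vectors map onto the XF0 to XF15 registers is deliberately not modeled.

<syntaxhighlight lang="c++">
#include <cassert>

struct Vec4 { float e[4]; };

// Four Vec4s standing in for XMTRX. In this model, row[i] is the vector
// whose inner product with the input produces component i of the result.
struct Mat4 { Vec4 row[4]; };

// Scalar model of FIPR: a single 4D inner product.
float fipr_model(const Vec4 &a, const Vec4 &b) {
    return a.e[0] * b.e[0] + a.e[1] * b.e[1] + a.e[2] * b.e[2] + a.e[3] * b.e[3];
}

// Scalar model of FTRV: four inner products sharing the same input vector.
Vec4 ftrv_model(const Mat4 &m, const Vec4 &v) {
    return Vec4{{ fipr_model(m.row[0], v), fipr_model(m.row[1], v),
                  fipr_model(m.row[2], v), fipr_model(m.row[3], v) }};
}

// Example matrix: scales x, y and z by 2 and leaves w alone.
const Mat4 kScale2 = {{ {{2.0f, 0.0f, 0.0f, 0.0f}},
                        {{0.0f, 2.0f, 0.0f, 0.0f}},
                        {{0.0f, 0.0f, 2.0f, 0.0f}},
                        {{0.0f, 0.0f, 0.0f, 1.0f}} }};

// Usage: (1, 2, 3, 4) becomes (2, 4, 6, 4).
void ftrv_model_example() {
    const Vec4 r = ftrv_model(kScale2, Vec4{{1.0f, 2.0f, 3.0f, 4.0f}});
    assert(r.e[0] == 2.0f && r.e[1] == 4.0f && r.e[2] == 6.0f && r.e[3] == 4.0f);
}
</syntaxhighlight>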

==When to use FTRV==
There are two things to look out for when developing an intuition for when you can leverage FTRV:
# There are 3 or more dot products (ax + by + cz + dw) being calculated back-to-back.
# One of the vectors is held constant while the other vector argument is variable across each.

When you see such a scenario, the first thing that should pop into your head is that you are dealing with a vector x matrix transform, and you can use '''FTRV''' to accelerate it.
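
Written out, the pattern looks like this. The labels are only illustrative: <math>\mathbf{v}</math> is the vector shared by every dot product, <math>\mathbf{p}_0 \ldots \mathbf{p}_3</math> are the vectors that change from one dot product to the next, and <math>d_i</math> are the results.

<math>
\begin{bmatrix} d_0 \\ d_1 \\ d_2 \\ d_3 \end{bmatrix}
=
\begin{bmatrix} \mathbf{p}_0 \cdot \mathbf{v} \\ \mathbf{p}_1 \cdot \mathbf{v} \\ \mathbf{p}_2 \cdot \mathbf{v} \\ \mathbf{p}_3 \cdot \mathbf{v} \end{bmatrix}
=
\begin{bmatrix}
p_{0x} & p_{0y} & p_{0z} & p_{0w} \\
p_{1x} & p_{1y} & p_{1z} & p_{1w} \\
p_{2x} & p_{2y} & p_{2z} & p_{2w} \\
p_{3x} & p_{3y} & p_{3z} & p_{3w}
\end{bmatrix}
\begin{bmatrix} v_x \\ v_y \\ v_z \\ v_w \end{bmatrix}
</math>

With only 3 dot products, the unused <math>\mathbf{p}_3</math> slot can simply be filled with zeros and <math>d_3</math> ignored.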

==Real-World Examples==
The following are real-world examples of FTRV-based optimizations used within games and applications for the Sega Dreamcast within the community.
===Vertex Position Transformation===
The first and most obvious use of the FTRV instruction is for doing position transform calculations on the incoming vertex stream, transforming from local to view-space, while submitting vertices to the PowerVR during T&L. This is the first and absolute most crucial area for leveraging FTRV and was its original intended purpose. If you do nothing else with the instruction, bear in mind that the only way to come even remotely close to pushing a considerable volume of polygons on the DC is by properly harnessing the SH4 by using FTRV to transform your vertices.
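
For reference, the work being replaced looks roughly like the scalar loop below. This is a portable sketch and not PowerVR submission code: <code>Vec3</code>, <code>Vec4</code>, <code>Mat4</code> (assumed row-major here), <code>transform_position</code>, <code>transform_positions</code> and <code>kTranslate</code> are hypothetical names. Each call to <code>transform_position</code> is the 16 multiplies and 12 adds that a single FTRV performs once the matrix is sitting in XMTRX.

<syntaxhighlight lang="c++">
#include <cassert>
#include <cstddef>

struct Vec3 { float x, y, z; };
struct Vec4 { float x, y, z, w; };

// Hypothetical row-major 4x4 matrix: m[row][column].
struct Mat4 { float m[4][4]; };

// Transforms one position, treating it as the 4D vector (x, y, z, 1).
Vec4 transform_position(const Mat4 &mat, const Vec3 &p) {
    const float (&m)[4][4] = mat.m;
    return Vec4{
        m[0][0] * p.x + m[0][1] * p.y + m[0][2] * p.z + m[0][3],
        m[1][0] * p.x + m[1][1] * p.y + m[1][2] * p.z + m[1][3],
        m[2][0] * p.x + m[2][1] * p.y + m[2][2] * p.z + m[2][3],
        m[3][0] * p.x + m[3][1] * p.y + m[3][2] * p.z + m[3][3]
    };
}

// Transforms a whole vertex stream with the same matrix.
void transform_positions(const Mat4 &mat, const Vec3 *in, Vec4 *out, std::size_t count) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < count; ++i)
        out[i] = transform_position(mat, in[i]);
}

// Example matrix: translation by (10, 20, 30).
const Mat4 kTranslate = {{ {1.0f, 0.0f, 0.0f, 10.0f},
                           {0.0f, 1.0f, 0.0f, 20.0f},
                           {0.0f, 0.0f, 1.0f, 30.0f},
                           {0.0f, 0.0f, 0.0f, 1.0f} }};
</syntaxhighlight>

Because the matrix stays the same for the whole stream while only the vertex changes, this loop is exactly the "constant matrix, variable vector" shape FTRV was designed for.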

===Diffuse Lighting===

===Collision and Physics===
===Bounding Sphere vs View Frustum Culling===
The following code snippet is taken from the DCA3 codebase as part of its RenderWare driver back-end for Dreamcast. To check for intersection between a bounding sphere and the view frustum, you must compute the dot product of the bounding sphere's center with each of the 6 view frustum planes. Each dot product, less the plane's distance term, gives the signed distance from the sphere's center to that plane.

The original algorithm was as follows:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;

    // Iterate over each of the 6 frustum planes.
    const FrustumPlane *p = this->frustumPlanes;
    for(int32 i = 0; i < 6; i++){
        // Compute dot product between each plane and the centroid.
        // Your FIPR senses should be tingling, since we've got dot products!
        float32 distance = dot(p->plane, s->center) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE; // No intersection
        if(s->radius > -distance)
            res = SPHEREBOUNDARY; // Intersection

        p++;
    }
    return res;
}
</syntaxhighlight>

Since 4 or more dot products are taken against the same vector, the bounding sphere's center, this is a textbook case for using FTRV to compute 4 of them in parallel.

So we use FIPR to accelerate two of the dot products and FTRV for the other 4 of them at once. The following code uses [https://github.com/gyrovorbis/sh4zam libSH4ZAM] to achieve this:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;
    const FrustumPlane *p = this->frustumPlanes;

    // Use FIPR to accelerate first two dot products independently
    for(unsigned i = 0; i < 2; ++i) {
        float distance = shz_vec3_dot(s->center, p->plane) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE;
        else if(s->radius > -distance)
            res = SPHEREBOUNDARY;

        p++;
    }

    /* Since each plane is a 4D vector, we can load each one as a column vector
       into XMTRX, creating a 4x4 matrix out of the 4 planes. */
    shz_xmtrx_load_4x4_cols(&p[0].plane, &p[1].plane, &p[2].plane, &p[3].plane);

    /* Now we transform our constant vector, the bounding sphere's center, against
       our 4 plane vectors. This gives us a result vector where the value of each
       component is equal to the dot product between the sphere's centroid and the
       corresponding plane column vector. */
    shz_vec4_t distances = shz_xmtrx_trans_vec4(shz_vec4_t(s->center, -1.0f));

    /* Now we simply iterate over each of our 4 result components to check
       for intersection. */
    for(unsigned i = 0; i < 4; ++i) {
        if(s->radius < distances.elem[i])
            return SPHEREOUTSIDE;
        else if(s->radius > -distances.elem[i])
            res = SPHEREBOUNDARY;
    }

    return res;
}
</syntaxhighlight>
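
The reason a <code>-1.0f</code> goes into the W component of the center is that it folds the "subtract the plane's distance" step into the 4D dot product itself. The portable sketch below checks that identity on the host. <code>Vec3</code>, <code>Plane</code>, <code>distance_dot3</code> and <code>distance_dot4</code> are hypothetical stand-ins, not the DCA3 or libSH4ZAM types, and the plane layout (normal followed by a distance term) is an assumption based on the code above.

<syntaxhighlight lang="c++">
#include <cassert>

struct Vec3 { float x, y, z; };

// Assumed plane layout: normal (x, y, z) followed by a distance term.
struct Plane { float x, y, z, distance; };

// Per-plane form used by the original loop: dot3(normal, center) - distance.
float distance_dot3(const Plane &p, const Vec3 &c) {
    return p.x * c.x + p.y * c.y + p.z * c.z - p.distance;
}

// Homogeneous form: dot4(plane, (center, -1)). The -1 in W multiplies the
// plane's distance term, which is the same as subtracting it.
float distance_dot4(const Plane &p, const Vec3 &c) {
    const float v[4] = { c.x, c.y, c.z, -1.0f };
    return p.x * v[0] + p.y * v[1] + p.z * v[2] + p.distance * v[3];
}

// Usage: a plane facing +Y with distance term 2, and a center at height 5.
void distance_example() {
    const Plane plane = { 0.0f, 1.0f, 0.0f, 2.0f };
    const Vec3 center = { 3.0f, 5.0f, 7.0f };
    assert(distance_dot3(plane, center) == 3.0f);
    assert(distance_dot4(plane, center) == 3.0f);
}
</syntaxhighlight>

Since the 4D form needs no separate subtraction, four planes' worth of it is exactly one matrix transform.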

===ADPCM Decoding===

SH4 FTRV Optimizations

2025-08-06T08:13:07Z

GyroVorbis: /* Bounding Sphere vs View Frustum Culling */

Without a doubt, the single most computationally powerful instruction on the SuperH4 CPU in the Sega Dreamcast is '''FTRV''', or the '''F'''loating-point '''TR'''ansform '''V'''ector instruction. It is a single instruction which multiplies a 4D vector by the 4x4 matrix held within the back-bank of FPU registers, '''XMTRX'''. This article will teach you how to leverage this god instruction for FP performance gainz and introduce you to several example scenarios that have yielded fantastic gainz within the community.

==Relationship to FIPR==

{| class="wikitable"
|+ Instruction Summaries
|-
! Format !! Function !! Encoding !! Group !! Issue Cycles !! Latency Cycles
|-
| '''fipr''' FVm,FVn || inner_product (FVm, FVn) -> FR[n+3] || 1111nnmm11101101 || FE || 1 || 4/5
|-
| '''ftrv''' XMTRX, FVn || transform_vector(XMTRX, FVn) -> FVn || 1111nn0111111101 || FE || 1 || 5/8
|}

==When to use FTRV==
There are two things to look out for when developing an intuition for when you can leverage FTRV:
# There are 3 or more dot products (ax + by + cz + dw) being calculated back-to-back.
# One of the vectors is held constant while the other vector argument is variable across each.

When you see such a scenario, the first thing that should pop into your head is that you are dealing with a vector x matrix transform, and you can use '''FTRV''' to accelerate it.

==Real-World Examples==
The following are real-world examples of FTRV-based optimizations used within games and applications for the Sega Dreamcast within the community.
===Vertex Position Transformation===
The first and most obvious use of the FTRV instruction is for doing position transform calculations on the incoming vertex stream, transforming from local to view-space, while submitting vertices to the PowerVR during T&L. This is the first and absolute most crucial area for leveraging FTRV and was its original intended purpose. If you do nothing else with the instruction, bear in mind that the only way to come even remotely close to pushing a considerable volume of polygons on the DC is by properly harnessing the SH4 by using FTRV to transform your vertices.

===Diffuse Lighting===

===Collision and Physics===
===Bounding Sphere vs View Frustum Culling===
The following code snippet is taken from the DCA3 codebase as part of its RenderWare driver back-end for Dreamcast. To check for intersection between a bounding sphere and the view frustum, you must compute the dot product of the bounding sphere's center with each of the 6 view frustum planes. Each dot product, less the plane's distance term, gives the signed distance from the sphere's center to that plane.

The original algorithm was as follows:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;

    // Iterate over each of the 6 frustum planes.
    const FrustumPlane *p = this->frustumPlanes;
    for(int32 i = 0; i < 6; i++){
        // Compute dot product between each plane and the centroid.
        // Your FIPR senses should be tingling, since we've got dot products!
        float32 distance = dot(p->plane, s->center) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE; // No intersection
        if(s->radius > -distance)
            res = SPHEREBOUNDARY; // Intersection

        p++;
    }
    return res;
}
</syntaxhighlight>

Since 4 or more dot products are taken against the same vector, the bounding sphere's center, this is a textbook case for using FTRV to compute 4 of them in parallel.

So we use FIPR to accelerate two of the dot products and FTRV for the other 4 of them at once. The following code uses [https://github.com/gyrovorbis/sh4zam libSH4ZAM] to achieve this:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;
    const FrustumPlane *p = this->frustumPlanes;

    // Use FIPR to accelerate first two dot products independently
    for(unsigned i = 0; i < 2; ++i) {
        float distance = shz_vec3_dot(s->center, p->plane) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE;
        else if(s->radius > -distance)
            res = SPHEREBOUNDARY;

        p++;
    }

    /* Since each plane is a 4D vector, we can load each one as a column vector
       into XMTRX, creating a 4x4 matrix out of the 4 planes. */
    shz_xmtrx_load_4x4_cols(&p[0].plane, &p[1].plane, &p[2].plane, &p[3].plane);

    /* Now we transform our constant vector, the bounding sphere's center, against
       our 4 plane vectors. This gives us a result vector where the value of each
       component is equal to the dot product between the sphere's centroid and the
       corresponding plane column vector. */
    shz_vec4_t distances = shz_xmtrx_trans_vec4(shz_vec4_t(s->center, -1.0f));

    /* Now we simply iterate over each of our 4 result components to check
       for intersection. */
    for(unsigned i = 0; i < 4; ++i) {
        if(s->radius < distances.elem[i])
            return SPHEREOUTSIDE;
        else if(s->radius > -distances.elem[i])
            res = SPHEREBOUNDARY;
    }

    return res;
}
</syntaxhighlight>

===ADPCM Decoding===

SH4 FTRV Optimizations

2025-08-06T08:11:35Z

GyroVorbis: /* Bounding Sphere vs View Frustum Culling */

Without a doubt, the single most computationally powerful instruction on the SuperH4 CPU in the Sega Dreamcast is '''FTRV''', or the '''F'''loating-point '''TR'''ansform '''V'''ector instruction. It is a single instruction which multiplies a 4D vector by the 4x4 matrix held within the back-bank of FPU registers, '''XMTRX'''. This article will teach you how to leverage this god instruction for FP performance gainz and introduce you to several example scenarios that have yielded fantastic gainz within the community.

==Relationship to FIPR==

{| class="wikitable"
|+ Instruction Summaries
|-
! Format !! Function !! Encoding !! Group !! Issue Cycles !! Latency Cycles
|-
| '''fipr''' FVm,FVn || inner_product (FVm, FVn) -> FR[n+3] || 1111nnmm11101101 || FE || 1 || 4/5
|-
| '''ftrv''' XMTRX, FVn || transform_vector(XMTRX, FVn) -> FVn || 1111nn0111111101 || FE || 1 || 5/8
|}

==When to use FTRV==
There are two things to look out for when developing an intuition for when you can leverage FTRV:
# There are 3 or more dot products (ax + by + cz + dw) being calculated back-to-back.
# One of the vectors is held constant while the other vector argument is variable across each.

When you see such a scenario, the first thing that should pop into your head is that you are dealing with a vector x matrix transform, and you can use '''FTRV''' to accelerate it.

==Real-World Examples==
The following are real-world examples of FTRV-based optimizations used within games and applications for the Sega Dreamcast within the community.
===Vertex Position Transformation===
The first and most obvious use of the FTRV instruction is for doing position transform calculations on the incoming vertex stream, transforming from local to view-space, while submitting vertices to the PowerVR during T&L. This is the first and absolute most crucial area for leveraging FTRV and was its original intended purpose. If you do nothing else with the instruction, bear in mind that the only way to come even remotely close to pushing a considerable volume of polygons on the DC is by properly harnessing the SH4 by using FTRV to transform your vertices.

===Diffuse Lighting===

===Collision and Physics===
===Bounding Sphere vs View Frustum Culling===
The following code snippet is taken from the DCA3 codebase as part of its RenderWare driver back-end for Dreamcast. To check for intersection between a bounding sphere and the view frustum, you must compute the dot product of the bounding sphere's center with each of the 6 view frustum planes. Each dot product, less the plane's distance term, gives the signed distance from the sphere's center to that plane.

The original algorithm was as follows:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;

    // Iterate over each of the 6 frustum planes.
    const FrustumPlane *p = this->frustumPlanes;
    for(int32 i = 0; i < 6; i++){
        // Compute dot product between each plane and the centroid.
        // Your FIPR senses should be tingling, since we've got dot products!
        float32 distance = dot(p->plane, s->center) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE; // No intersection
        if(s->radius > -distance)
            res = SPHEREBOUNDARY; // Intersection

        p++;
    }
    return res;
}
</syntaxhighlight>

Since 4 or more dot products are taken against the same vector, the bounding sphere's center, this is a textbook case for using FTRV to compute 4 of them in parallel.

So we use FIPR to accelerate two of the dot products and FTRV for the other 4 of them at once. The following code uses [https://github.com/gyrovorbis/sh4zam libSH4ZAM] to achieve this:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;
    const FrustumPlane *p = this->frustumPlanes;

    // Use FIPR to accelerate first two dot products independently
    for(unsigned i = 0; i < 2; ++i) {
        float distance = shz_vec3_dot(s->center, p->plane) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE;
        else if(s->radius > -distance)
            res = SPHEREBOUNDARY;

        p++;
    }

    /* Since each plane is a 4D vector, we can load each one as a column vector
       into XMTRX, creating a 4x4 matrix out of the 4 planes. */
    shz_xmtrx_load_4x4_cols(&p[0].plane, &p[1].plane, &p[2].plane, &p[3].plane);

    /* Now we transform our constant vector, the bounding sphere's center, against
       our 4 plane vectors. This gives us a result vector where the value of each
       component is equal to the dot product between the sphere's centroid and the
       corresponding plane column vector. The -1.0f in W subtracts each plane's
       distance term as part of that dot product. */
    shz_vec4_t distances = shz_xmtrx_trans_vec4(shz_vec4_t(s->center, -1.0f));

    /* Now we simply iterate over each of our 4 result components to check
       for intersection. */
    for(unsigned i = 0; i < 4; ++i) {
        if(s->radius < distances.elem[i])
            return SPHEREOUTSIDE;
        else if(s->radius > -distances.elem[i])
            res = SPHEREBOUNDARY;
    }

    return res;
}
</syntaxhighlight>

===ADPCM Decoding===

SH4 FTRV Optimizations

2025-08-06T08:04:52Z

GyroVorbis: /* When to use FTRV */

Without a doubt, the single most computationally powerful instruction on the SuperH4 CPU in the Sega Dreamcast is '''FTRV''', or the '''F'''loating-point '''TR'''ansform '''V'''ector instruction. It is a single instruction which multiplies a 4D vector by the 4x4 matrix held within the back-bank of FPU registers, '''XMTRX'''. This article will teach you how to leverage this god instruction for FP performance gainz and introduce you to several example scenarios that have yielded fantastic gainz within the community.

==Relationship to FIPR==

{| class="wikitable"
|+ Instruction Summaries
|-
! Format !! Function !! Encoding !! Group !! Issue Cycles !! Latency Cycles
|-
| '''fipr''' FVm,FVn || inner_product (FVm, FVn) -> FR[n+3] || 1111nnmm11101101 || FE || 1 || 4/5
|-
| '''ftrv''' XMTRX, FVn || transform_vector(XMTRX, FVn) -> FVn || 1111nn0111111101 || FE || 1 || 5/8
|}

==When to use FTRV==
There are two things to look out for when developing an intuition for when you can leverage FTRV:
# There are 3 or more dot products (ax + by + cz + dw) being calculated back-to-back.
# One of the vectors is held constant while the other vector argument is variable across each.

When you see such a scenario, the first thing that should pop into your head is that you are dealing with a vector x matrix transform, and you can use '''FTRV''' to accelerate it.

==Real-World Examples==
The following are real-world examples of FTRV-based optimizations used within games and applications for the Sega Dreamcast within the community.
===Vertex Position Transformation===
The first and most obvious use of the FTRV instruction is for doing position transform calculations on the incoming vertex stream, transforming from local to view-space, while submitting vertices to the PowerVR during T&L. This is the first and absolute most crucial area for leveraging FTRV and was its original intended purpose. If you do nothing else with the instruction, bear in mind that the only way to come even remotely close to pushing a considerable volume of polygons on the DC is by properly harnessing the SH4 by using FTRV to transform your vertices.

===Diffuse Lighting===

===Collision and Physics===
===Bounding Sphere vs View Frustum Culling===
The following code snippet is taken from the DCA3 codebase as part of its RenderWare driver back-end for Dreamcast. To check for intersection between a bounding sphere and the view frustum, you must compute the dot product of the bounding sphere's center with each of the 6 view frustum planes. Each dot product, less the plane's distance term, gives the signed distance from the sphere's center to that plane.

The original algorithm was as follows:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;

    // Iterate over each of the 6 frustum planes.
    const FrustumPlane *p = this->frustumPlanes;
    for(int32 i = 0; i < 6; i++){
        // Compute dot product between each plane and the centroid.
        // Your FIPR senses should be tingling, since we've got dot products!
        float32 distance = dot(p->plane, s->center) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE; // No intersection
        if(s->radius > -distance)
            res = SPHEREBOUNDARY; // Intersection

        p++;
    }
    return res;
}
</syntaxhighlight>

Since 4 or more dot products are taken against the same vector, the bounding sphere's center, this is a textbook case for using FTRV to compute 4 of them in parallel.

So we use FIPR to accelerate two of the dot products and FTRV for the other 4 of them at once. The following code uses [https://github.com/gyrovorbis/sh4zam libSH4ZAM] to achieve this:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;
    const FrustumPlane *p = this->frustumPlanes;

    // Use FIPR to accelerate first two dot products independently
    for(unsigned i = 0; i < 2; ++i) {
        float distance = shz_vec3_dot(s->center, p->plane) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE;
        else if(s->radius > -distance)
            res = SPHEREBOUNDARY;

        p++;
    }

    /* Since each plane is a 4D vector, we can load each one as a column vector
       into XMTRX, creating a 4x4 matrix out of the 4 planes. */
    shz_xmtrx_load_4x4_cols(&p[0].plane, &p[1].plane, &p[2].plane, &p[3].plane);

    /* Now we transform our constant vector, the bounding sphere's center, against
       our 4 plane vectors. This gives us a result vector where the value of each
       component is equal to the dot product between the sphere's centroid and the
       corresponding plane column vector. The -1.0f in W subtracts each plane's
       distance term as part of that dot product. */
    shz_vec4_t distances = shz_xmtrx_trans_vec4(shz_vec4_t(s->center, -1.0f));

    /* Now we simply iterate over each of our 4 result components to check
       for intersection. */
    for(unsigned i = 0; i < 4; ++i) {
        if(s->radius < distances.elem[i])
            return SPHEREOUTSIDE;
        else if(s->radius > -distances.elem[i])
            res = SPHEREBOUNDARY;
    }

    return res;
}
</syntaxhighlight>

===ADPCM Decoding===

SH4 FTRV Optimizations

2025-08-06T08:04:16Z

GyroVorbis: /* When to use FTRV */

Without a doubt, the single most computationally powerful instruction on the SuperH4 CPU in the Sega Dreamcast is '''FTRV''', or the '''F'''loating-point '''TR'''ansform '''V'''ector instruction. It is a single instruction which multiplies a 4D vector by the 4x4 matrix held within the back-bank of FPU registers, '''XMTRX'''. This article will teach you how to leverage this god instruction for FP performance gainz and introduce you to several example scenarios that have yielded fantastic gainz within the community.

==Relationship to FIPR==

{| class="wikitable"
|+ Instruction Summaries
|-
! Format !! Function !! Encoding !! Group !! Issue Cycles !! Latency Cycles
|-
| '''fipr''' FVm,FVn || inner_product (FVm, FVn) -> FR[n+3] || 1111nnmm11101101 || FE || 1 || 4/5
|-
| '''ftrv''' XMTRX, FVn || transform_vector(XMTRX, FVn) -> FVn || 1111nn0111111101 || FE || 1 || 5/8
|}

==When to use FTRV==
There are two things to look out for when developing an intuition for when you can leverage FTRV:
# There are 3 or more dot products (ax + by + cz + dw) being calculated back-to-back.
# One of the vectors is held constant while the other vector argument is variable.

When you see such a scenario, the first thing that should pop into your head is that you are dealing with a vector x matrix transform, and you can use '''FTRV''' to accelerate it.

==Real-World Examples==
The following are real-world examples of FTRV-based optimizations used within games and applications for the Sega Dreamcast within the community.
===Vertex Position Transformation===
The first and most obvious use of the FTRV instruction is for doing position transform calculations on the incoming vertex stream, transforming from local to view-space, while submitting vertices to the PowerVR during T&L. This is the first and absolute most crucial area for leveraging FTRV and was its original intended purpose. If you do nothing else with the instruction, bear in mind that the only way to come even remotely close to pushing a considerable volume of polygons on the DC is by properly harnessing the SH4 by using FTRV to transform your vertices.

===Diffuse Lighting===

===Collision and Physics===
===Bounding Sphere vs View Frustum Culling===
The following code snippet is taken from the DCA3 codebase as part of its RenderWare driver back-end for Dreamcast. To check for intersection between a bounding sphere and the view frustum, you must compute the dot product of the bounding sphere's center with each of the 6 view frustum planes. Each dot product, less the plane's distance term, gives the signed distance from the sphere's center to that plane.

The original algorithm was as follows:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;

    // Iterate over each of the 6 frustum planes.
    const FrustumPlane *p = this->frustumPlanes;
    for(int32 i = 0; i < 6; i++){
        // Compute dot product between each plane and the centroid.
        // Your FIPR senses should be tingling, since we've got dot products!
        float32 distance = dot(p->plane, s->center) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE; // No intersection
        if(s->radius > -distance)
            res = SPHEREBOUNDARY; // Intersection

        p++;
    }
    return res;
}
</syntaxhighlight>

Since 4 or more dot products are taken against the same vector, the bounding sphere's center, this is a textbook case for using FTRV to compute 4 of them in parallel.

So we use FIPR to accelerate two of the dot products and FTRV for the other 4 of them at once. The following code uses [https://github.com/gyrovorbis/sh4zam libSH4ZAM] to achieve this:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;
    const FrustumPlane *p = this->frustumPlanes;

    // Use FIPR to accelerate first two dot products independently
    for(unsigned i = 0; i < 2; ++i) {
        float distance = shz_vec3_dot(s->center, p->plane) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE;
        else if(s->radius > -distance)
            res = SPHEREBOUNDARY;

        p++;
    }

    /* Since each plane is a 4D vector, we can load each one as a column vector
       into XMTRX, creating a 4x4 matrix out of the 4 planes. */
    shz_xmtrx_load_4x4_cols(&p[0].plane, &p[1].plane, &p[2].plane, &p[3].plane);

    /* Now we transform our constant vector, the bounding sphere's center, against
       our 4 plane vectors. This gives us a result vector where the value of each
       component is equal to the dot product between the sphere's centroid and the
       corresponding plane column vector. The -1.0f in W subtracts each plane's
       distance term as part of that dot product. */
    shz_vec4_t distances = shz_xmtrx_trans_vec4(shz_vec4_t(s->center, -1.0f));

    /* Now we simply iterate over each of our 4 result components to check
       for intersection. */
    for(unsigned i = 0; i < 4; ++i) {
        if(s->radius < distances.elem[i])
            return SPHEREOUTSIDE;
        else if(s->radius > -distances.elem[i])
            res = SPHEREBOUNDARY;
    }

    return res;
}
</syntaxhighlight>

===ADPCM Decoding===

SH4 FTRV Optimizations

2025-08-06T07:57:49Z

GyroVorbis: /* Bounding Sphere vs View Frustum Culling */

Without a doubt, the single most computationally powerful instruction on the SuperH4 CPU in the Sega Dreamcast is '''FTRV''', or the '''F'''loating-point '''TR'''ansform '''V'''ector instruction. It is a single instruction which multiplies a 4D vector by the 4x4 matrix held within the back-bank of FPU registers, '''XMTRX'''. This article will teach you how to leverage this god instruction for FP performance gainz and introduce you to several example scenarios that have yielded fantastic gainz within the community.

==Relationship to FIPR==

{| class="wikitable"
|+ Instruction Summaries
|-
! Format !! Function !! Encoding !! Group !! Issue Cycles !! Latency Cycles
|-
| '''fipr''' FVm,FVn || inner_product (FVm, FVn) -> FR[n+3] || 1111nnmm11101101 || FE || 1 || 4/5
|-
| '''ftrv''' XMTRX, FVn || transform_vector(XMTRX, FVn) -> FVn || 1111nn0111111101 || FE || 1 || 5/8
|}

==When to use FTRV==

==Real-World Examples==
The following are real-world examples of FTRV-based optimizations used within games and applications for the Sega Dreamcast within the community.
===Vertex Position Transformation===
The first and most obvious use of the FTRV instruction is for doing position transform calculations on the incoming vertex stream, transforming from local to view-space, while submitting vertices to the PowerVR during T&L. This is the first and absolute most crucial area for leveraging FTRV and was its original intended purpose. If you do nothing else with the instruction, bear in mind that the only way to come even remotely close to pushing a considerable volume of polygons on the DC is by properly harnessing the SH4 by using FTRV to transform your vertices.

===Diffuse Lighting===

===Collision and Physics===
===Bounding Sphere vs View Frustum Culling===
The following code snippet is taken from the DCA3 codebase as part of its RenderWare driver back-end for Dreamcast. To check for intersection between a bounding sphere and the view frustum, you must compute the dot product of the bounding sphere's center with each of the 6 view frustum planes. Each dot product, less the plane's distance term, gives the signed distance from the sphere's center to that plane.

The original algorithm was as follows:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;

    // Iterate over each of the 6 frustum planes.
    const FrustumPlane *p = this->frustumPlanes;
    for(int32 i = 0; i < 6; i++){
        // Compute dot product between each plane and the centroid.
        // Your FIPR senses should be tingling, since we've got dot products!
        float32 distance = dot(p->plane, s->center) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE; // No intersection
        if(s->radius > -distance)
            res = SPHEREBOUNDARY; // Intersection

        p++;
    }
    return res;
}
</syntaxhighlight>

Since 4 or more dot products are taken against the same vector, the bounding sphere's center, this is a textbook case for using FTRV to compute 4 of them in parallel.

So we use FIPR to accelerate two of the dot products and FTRV for the other 4 of them at once. The following code uses [https://github.com/gyrovorbis/sh4zam libSH4ZAM] to achieve this:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;
    const FrustumPlane *p = this->frustumPlanes;

    // Use FIPR to accelerate first two dot products independently
    for(unsigned i = 0; i < 2; ++i) {
        float distance = shz_vec3_dot(s->center, p->plane) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE;
        else if(s->radius > -distance)
            res = SPHEREBOUNDARY;

        p++;
    }

    /* Since each plane is a 4D vector, we can load each one as a column vector
       into XMTRX, creating a 4x4 matrix out of the 4 planes. */
    shz_xmtrx_load_4x4_cols(&p[0].plane, &p[1].plane, &p[2].plane, &p[3].plane);

    /* Now we transform our constant vector, the bounding sphere's center, against
       our 4 plane vectors. This gives us a result vector where the value of each
       component is equal to the dot product between the sphere's centroid and the
       corresponding plane column vector. The -1.0f in W subtracts each plane's
       distance term as part of that dot product. */
    shz_vec4_t distances = shz_xmtrx_trans_vec4(shz_vec4_t(s->center, -1.0f));

    /* Now we simply iterate over each of our 4 result components to check
       for intersection. */
    for(unsigned i = 0; i < 4; ++i) {
        if(s->radius < distances.elem[i])
            return SPHEREOUTSIDE;
        else if(s->radius > -distances.elem[i])
            res = SPHEREBOUNDARY;
    }

    return res;
}
</syntaxhighlight>

===ADPCM Decoding===

SH4 FTRV Optimizations

2025-08-06T07:54:55Z

GyroVorbis: /* Bounding Sphere vs View Frustum Culling */

Without a doubt, the single most computationally powerful instruction on the SuperH4 CPU in the Sega Dreamcast is '''FTRV''', or the '''F'''loating-point '''TR'''ansform '''V'''ector instruction. It is a single instruction which multiplies a 4D vector by the 4x4 matrix held within the back-bank of FPU registers, '''XMTRX'''. This article will teach you how to leverage this god instruction for FP performance gainz and introduce you to several example scenarios that have yielded fantastic gainz within the community.

==Relationship to FIPR==

{| class="wikitable"
|+ Instruction Summaries
|-
! Format !! Function !! Encoding !! Group !! Issue Cycles !! Latency Cycles
|-
| '''fipr''' FVm,FVn || inner_product (FVm, FVn) -> FR[n+3] || 1111nnmm11101101 || FE || 1 || 4/5
|-
| '''ftrv''' XMTRX, FVn || transform_vector(XMTRX, FVn) -> FVn || 1111nn0111111101 || FE || 1 || 5/8
|}

==When to use FTRV==

==Real-World Examples==
The following are real-world examples of FTRV-based optimizations used within games and applications for the Sega Dreamcast within the community.
===Vertex Position Transformation===
The first and most obvious use of the FTRV instruction is for doing position transform calculations on the incoming vertex stream, transforming from local to view-space, while submitting vertices to the PowerVR during T&L. This is the first and absolute most crucial area for leveraging FTRV and was its original intended purpose. If you do nothing else with the instruction, bear in mind that the only way to come even remotely close to pushing a considerable volume of polygons on the DC is by properly harnessing the SH4 by using FTRV to transform your vertices.

===Diffuse Lighting===

===Collision and Physics===
===Bounding Sphere vs View Frustum Culling===
The following code snippet is taken from the DCA3 codebase as part of its RenderWare driver back-end for Dreamcast. To check for intersection between a bounding sphere and the view frustum, you must compute the dot product of the bounding sphere's center with each of the 6 view frustum planes. Each dot product, less the plane's distance term, gives the signed distance from the sphere's center to that plane.

The original algorithm was as follows:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
    int32 res = SPHEREINSIDE;

    // Iterate over each of the 6 frustum planes.
    const FrustumPlane *p = this->frustumPlanes;
    for(int32 i = 0; i < 6; i++){
        // Compute dot product between each plane and the centroid.
        // Your FIPR senses should be tingling, since we've got dot products!
        float32 distance = dot(p->plane, s->center) - p->plane.distance;

        if(s->radius < distance)
            return SPHEREOUTSIDE; // No intersection
        if(s->radius > -distance)
            res = SPHEREBOUNDARY; // Intersection

        p++;
    }
    return res;
}
</syntaxhighlight>

Since 4 or more dot products are being taken and one of the vectors stays constant across all of them, this is a textbook case for using FTRV to compute 4 of them in parallel.

So we use FIPR to accelerate two of the dot products and FTRV for the other 4 of them at once. The following code uses [https://github.com/gyrovorbis/sh4zam libSH4ZAM] to achieve this:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
int32 res = SPHEREINSIDE;
const FrustumPlane *p = this->frustumPlanes;

// Use FIPR to accelerate first two dot products independently
for(unsigned i = 0; i < 2; ++i) {
float distance = shz_vec4_dot(s->center, p->plane);

if(s->radius < distance)
return SPHEREOUTSIDE;
else if(s->radius > -distance)
res = SPHEREBOUNDARY;
p++;
}

/* Since each plane is a 4D vector, we can load each one as a column vector
into XMTRX, creating a 4x4 matrix out of the 4 planes. */
shz_xmtrx_load_4x4_cols(&p[0].plane, &p[1].plane, &p[2].plane, &p[3].plane);

/* Now we transform our constant vector, the bounding sphere's center, against
our 4 plane vectors. This gives us a result vector where the value of each
component is equal to the dot product between the sphere's centroid and the
corresponding plane column vector. */
shz_vec4_t distances = shz_xmtrx_trans_vec4(s->center);

/* Now we simply iterate over each of our 4 result components to check
for intersection. */
for(unsigned i = 0; i < 4; ++i) {
if(s->radius < distances.elem[i])
return SPHEREOUTSIDE;
else if(s->radius > -distances.elem[i])
res = SPHEREBOUNDARY;
}

return res;
}
</syntaxhighlight>
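
For checking an FTRV path against plain scalar code, a portable reference of the batched 4-plane test can be handy. The sketch below is hypothetical: the <code>Plane</code> and <code>Sphere</code> layouts and the function names are invented for illustration. It also assumes the center is extended with w = -1 so that the 4D dot product folds in the <code>- p->plane.distance</code> subtraction from the original loop. That is one way to do it, not something confirmed about the DCA3 code or libSH4ZAM.

```cpp
#include <array>

// Hypothetical layouts, for illustration only.
struct Plane  { float nx, ny, nz, d; };        // normal + plane distance
struct Sphere { float cx, cy, cz, radius; };
enum SphereTest { SPHERE_OUTSIDE, SPHERE_BOUNDARY, SPHERE_INSIDE };

// Four signed distances at once: result[i] = dot(plane[i], (center, -1)).
// Each result component pairs one plane with the same shared vector.
inline std::array<float, 4> plane_distances4(const Plane *p, const Sphere &s) {
    const float c[4] = { s.cx, s.cy, s.cz, -1.0f };
    std::array<float, 4> out{};
    for (int i = 0; i < 4; ++i)
        out[i] = p[i].nx * c[0] + p[i].ny * c[1] + p[i].nz * c[2] + p[i].d * c[3];
    return out;
}

// Same classification logic as the scalar loop, applied per component.
inline SphereTest classify4(const Plane *p, const Sphere &s) {
    SphereTest res = SPHERE_INSIDE;
    const std::array<float, 4> dist = plane_distances4(p, s);
    for (int i = 0; i < 4; ++i) {
        if (s.radius < dist[i])
            return SPHERE_OUTSIDE;   // entirely beyond plane i
        if (s.radius > -dist[i])
            res = SPHERE_BOUNDARY;   // straddles plane i
    }
    return res;
}
```

Note that each result component is indexed separately in the final loop, one per plane.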

===ADPCM Decoding===

SH4 FTRV Optimizations

2025-08-06T07:53:52Z

GyroVorbis: /* Bounding Sphere vs View Frustum Culling */

Without a doubt, the single most computationally powerful instruction on the SuperH4 CPU in the Sega Dreamcast is '''FTRV''', or the '''F'''loating-point '''TR'''ansform '''V'''ector instruction. It is a single instruction which multiplies a 4D vector by the 4x4 matrix held within the back-bank of FPU registers, '''XMTRX'''. This article will teach you how to leverage this god instruction for FP performance gainz and introduce you to several example scenarios that have yielded fantastic gainz within the community.

==Relationship to FIPR==

{| class="wikitable"
|+ Instruction Summaries
|-
! Format !! Function !! Encoding !! Group !! Issue Cycles !! Latency Cycles
|-
| '''fipr''' FVm,FVn || inner_product (FVm, FVn) -> FR[n+3] || 1111nnmm11101101 || FE || 1 || 4/5
|-
| '''ftrv''' XMTRX, FVn || transform_vector(XMTRX, FVn) -> FVn || 1111nn0111111101 || FE || 1 || 5/8
|}

==When to use FTRV==

==Real-World Examples==
The following are real-world examples of FTRV-based optimizations used within games and applications for the Sega Dreamcast within the community.
===Vertex Position Transformation===
The first and most obvious use of the FTRV instruction is for doing position transform calculations on the incoming vertex stream, transforming from local to view-space, while submitting vertices to the PowerVR during T&L. This is the first and absolute most crucial area for leveraging FTRV and was its original intended purpose. If you do nothing else with the instruction, bear in mind that the only way to come even remotely close to pushing a considerable volume of polygons on the DC is by properly harnessing the SH4 by using FTRV to transform your vertices.

===Diffuse Lighting===

===Collision and Physics===
===Bounding Sphere vs View Frustum Culling===
The following code snippet is taken from the DCA3 codebase as part of its Renderware driver back-end for Dreamcast. To check for intersection between a bounding sphere and the view frustum, you must compute the dot product between the centroid of the bounding sphere against all 6 of the view frustum planes. The result of each dot product represents the sphere's distance from that plane.

The original algorithm was as follows:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
int32 res = SPHEREINSIDE;

// Iterate over each of the 6 frustum planes.
const FrustumPlane *p = this->frustumPlanes;
for(int32 i = 0; i < 6; i++){
// Compute dot product between each plane and the centroid.
// Your FIPR senses should be tingling, since we've got dot products!
float32 distance = dot(p->plane, s->center) - p->plane.distance;

if(s->radius < distance)
return SPHEREOUTSIDE; // No intersection
if(s->radius > -distance)
res = SPHEREBOUNDARY; // Intersection

p++;
}
return res;
}
</syntaxhighlight>

Since we have a scenario where we have 4+ dot products being taken where one of the vectors remains constant for each, this is a textbook case for using FTRV to compute 4 of them in parallel.

So we use FIPR to accelerate two of the dot products and FTRV for the other 4 of them at once. The following code uses [https://github.com/gyrovorbis/sh4zam libSH4ZAM] to achieve this:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
int32 res = SPHEREINSIDE;
const FrustumPlane *p = this->frustumPlanes;

// Use FIPR to accelerate first two dot products independently
for(unsigned i = 0; i < 2; ++i) {
float distance = shz_vec4_dot(s->center, p->plane);

if(s->radius < distance)
return SPHEREOUTSIDE;
else if(s->radius > -distance)
res = SPHEREBOUNDARY;
p++;
}

/* Since each plane is a 4D vector, we can load each one as a column vector
into XMTRX, creating a 4x4 matrix out of the 4 planes. */
shz_xmtrx_load_4x4_cols(&p[0].plane, &p[1].plane, &p[2].plane, &p[3].plane);

/* Now we transform our constant vector, the bounding sphere's center, against
our 4 plane vectors. This gives us a result vector where the value of each
component is equal to the dot product between the sphere's centroid and the
corresponding plane column vector. */
shz_vec4_t distances = shz_xmtrx_trans_vec4(s->center);

/* Now we simply iterate over each of our 4 result components to check
for intersection. */
for(unsigned i = 0; i < 4; ++i) {
if(s->radius < distances.elem[i])
return SPHEREOUTSIDE;
else if(s->radius > -distances.elem[i])
res = SPHEREBOUNDARY;
}

return res;
}
</syntaxhighlight>

===ADPCM Decoding===

SH4 FTRV Optimizations

2025-08-06T07:53:24Z

GyroVorbis: /* Bounding Sphere vs View Frustum Culling */

Without a doubt, the single most computationally powerful instruction on the SuperH4 CPU in the Sega Dreamcast is '''FTRV''', or the '''F'''loating-point '''TR'''ansform '''V'''ector instruction. It is a single instruction which multiplies a 4D vector by the 4x4 matrix held within the back-bank of FPU registers, '''XMTRX'''. This article will teach you how to leverage this god instruction for FP performance gainz and introduce you to several example scenarios that have yielded fantastic gainz within the community.

==Relationship to FIPR==

{| class="wikitable"
|+ Instruction Summaries
|-
! Format !! Function !! Encoding !! Group !! Issue Cycles !! Latency Cycles
|-
| '''fipr''' FVm,FVn || inner_product (FVm, FVn) -> FR[n+3] || 1111nnmm11101101 || FE || 1 || 4/5
|-
| '''ftrv''' XMTRX, FVn || transform_vector(XMTRX, FVn) -> FVn || 1111nn0111111101 || FE || 1 || 5/8
|}

==When to use FTRV==

==Real-World Examples==
The following are real-world examples of FTRV-based optimizations used within games and applications for the Sega Dreamcast within the community.
===Vertex Position Transformation===
The first and most obvious use of the FTRV instruction is for doing position transform calculations on the incoming vertex stream, transforming from local to view-space, while submitting vertices to the PowerVR during T&L. This is the first and absolute most crucial area for leveraging FTRV and was its original intended purpose. If you do nothing else with the instruction, bear in mind that the only way to come even remotely close to pushing a considerable volume of polygons on the DC is by properly harnessing the SH4 by using FTRV to transform your vertices.

===Diffuse Lighting===

===Collision and Physics===
===Bounding Sphere vs View Frustum Culling===
The following code snippet is taken from the DCA3 codebase as part of its Renderware driver back-end for Dreamcast. To check for intersection between a bounding sphere and the view frustum, you must compute the dot product between the centroid of the bounding sphere against all 6 of the view frustum planes. The result of each dot product represents the sphere's distance from that plane.

The original algorithm was as follows:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
int32 res = SPHEREINSIDE;

// Iterate over each of the 6 frustum planes.
const FrustumPlane *p = this->frustumPlanes;
for(int32 i = 0; i < 6; i++){
/* Compute dot product between each plane and the centroid
Your FIPR senses should be tingling, since we've got dot products! */
float32 distance = dot(p->plane, s->center) - p->plane.distance;

if(s->radius < distance)
return SPHEREOUTSIDE; // No intersection
if(s->radius > -distance)
res = SPHEREBOUNDARY; // Intersection

p++;
}
return res;
}
</syntaxhighlight>

Since we have a scenario where we have 4+ dot products being taken where one of the vectors remains constant for each, this is a textbook case for using FTRV to compute 4 of them in parallel.

So we use FIPR to accelerate two of the dot products and FTRV for the other 4 of them at once. The following code uses [https://github.com/gyrovorbis/sh4zam libSH4ZAM] to achieve this:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
int32 res = SPHEREINSIDE;
const FrustumPlane *p = this->frustumPlanes;

// Use FIPR to accelerate first two dot products independently
for(unsigned i = 0; i < 2; ++i) {
float distance = shz_vec4_dot(s->center, p->plane);

if(s->radius < distance)
return SPHEREOUTSIDE;
else if(s->radius > -distance)
res = SPHEREBOUNDARY;
p++;
}

/* Since each plane is a 4D vector, we can load each one as a column vector
into XMTRX, creating a 4x4 matrix out of the 4 planes. */
shz_xmtrx_load_4x4_cols(&p[0].plane, &p[1].plane, &p[2].plane, &p[3].plane);

/* Now we transform our constant vector, the bounding sphere's center, against
our 4 plane vectors. This gives us a result vector where the value of each
component is equal to the dot product between the sphere's centroid and the
corresponding plane column vector. */
shz_vec4_t distances = shz_xmtrx_trans_vec4(s->center);

/* Now we simply iterate over each of our 4 result components to check
for intersection. */
for(unsigned i = 0; i < 4; ++i) {
if(s->radius < distances.elem[i])
return SPHEREOUTSIDE;
else if(s->radius > -distances.elem[i])
res = SPHEREBOUNDARY;
}

return res;
}
</syntaxhighlight>

===ADPCM Decoding===

SH4 FTRV Optimizations

2025-08-06T07:52:37Z

GyroVorbis:

Without a doubt, the single most computationally powerful instruction on the SuperH4 CPU in the Sega Dreamcast is '''FTRV''', or the '''F'''loating-point '''TR'''ansform '''V'''ector instruction. It is a single instruction which multiplies a 4D vector by the 4x4 matrix held within the back-bank of FPU registers, '''XMTRX'''. This article will teach you how to leverage this god instruction for FP performance gainz and introduce you to several example scenarios that have yielded fantastic gainz within the community.

==Relationship to FIPR==

{| class="wikitable"
|+ Instruction Summaries
|-
! Format !! Function !! Encoding !! Group !! Issue Cycles !! Latency Cycles
|-
| '''fipr''' FVm,FVn || inner_product (FVm, FVn) -> FR[n+3] || 1111nnmm11101101 || FE || 1 || 4/5
|-
| '''ftrv''' XMTRX, FVn || transform_vector(XMTRX, FVn) -> FVn || 1111nn0111111101 || FE || 1 || 5/8
|}

==When to use FTRV==

==Real-World Examples==
The following are real-world examples of FTRV-based optimizations used within games and applications for the Sega Dreamcast within the community.
===Vertex Position Transformation===
The first and most obvious use of the FTRV instruction is for doing position transform calculations on the incoming vertex stream, transforming from local to view-space, while submitting vertices to the PowerVR during T&L. This is the first and absolute most crucial area for leveraging FTRV and was its original intended purpose. If you do nothing else with the instruction, bear in mind that the only way to come even remotely close to pushing a considerable volume of polygons on the DC is by properly harnessing the SH4 by using FTRV to transform your vertices.

===Diffuse Lighting===

===Collision and Physics===
===Bounding Sphere vs View Frustum Culling===
The following code snippet is taken from the DCA3 codebase as part of its Renderware driver back-end for Dreamcast. To check for intersection between a bounding sphere and the view frustum, you must compute the dot product between the centroid of the bounding sphere against all 6 of its planes.

The original algorithm was as follows:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
int32 res = SPHEREINSIDE;

// Iterate over each of the 6 frustum planes.
const FrustumPlane *p = this->frustumPlanes;
for(int32 i = 0; i < 6; i++){
/* Compute dot product between each plane and the centroid
Your FIPR senses should be tingling, since we've got dot products! */
float32 distance = dot(p->plane, s->center) - p->plane.distance;

if(s->radius < distance)
return SPHEREOUTSIDE; // No intersection
if(s->radius > -distance)
res = SPHEREBOUNDARY; // Intersection

p++;
}
return res;
}
</syntaxhighlight>

Since we have a scenario where we have 4+ dot products being taken where one of the vectors remains constant for each, this is a textbook case for using FTRV to compute 4 of them in parallel.

So we use FIPR to accelerate two of the dot products and FTRV for the other 4 of them at once. The following code uses [https://github.com/gyrovorbis/sh4zam libSH4ZAM] to achieve this:
<syntaxhighlight lang="c++">
int32 Camera::frustumTestSphere(const Sphere *s) const {
int32 res = SPHEREINSIDE;
const FrustumPlane *p = this->frustumPlanes;

// Use FIPR to accelerate first two dot products independently
for(unsigned i = 0; i < 2; ++i) {
float distance = shz_vec4_dot(s->center, p->plane);

if(s->radius < distance)
return SPHEREOUTSIDE;
else if(s->radius > -distance)
res = SPHEREBOUNDARY;
p++;
}

/* Since each plane is a 4D vector, we can load each one as a column vector
into XMTRX, creating a 4x4 matrix out of the 4 planes. */
shz_xmtrx_load_4x4_cols(&p[0].plane, &p[1].plane, &p[2].plane, &p[3].plane);

/* Now we transform our constant vector, the bounding sphere's center, against
our 4 plane vectors. This gives us a result vector where the value of each
component is equal to the dot product between the sphere's centroid and the
corresponding plane column vector. */
shz_vec4_t distances = shz_xmtrx_trans_vec4(s->center);

/* Now we simply iterate over each of our 4 result components to check
for intersection. */
for(unsigned i = 0; i < 4; ++i) {
if(s->radius < distances.elem[i])
return SPHEREOUTSIDE;
else if(s->radius > -distances.elem[i])
res = SPHEREBOUNDARY;
}

return res;
}
</syntaxhighlight>

===ADPCM Decoding===

SH4 FIPR Optimizations

2025-07-26T14:25:29Z

GyroVorbis:

Yo, guys. At like 1AM @ian micheal got me looking at pl_mpeg's audio decoder to see if I could see any potential gainz... So here is its innermost hottest audio synthesis loop:

<pre>
for (int i = 32; i; --i) {
float u;
u = pl_fipr(d[0], d[1], d[2], d[3], v1[0], v2[0], v1[128], v2[128]);
u += pl_fipr(d[4], d[5], d[6], d[7], v1[256], v2[256], v1[384], v2[384]);
u += pl_fipr(d[8], d[9], d[10], d[11], v1[512], v2[512], v1[640], v2[640]);
u += pl_fipr(d[12], d[13], d[14], d[15], v1[768], v2[768], v1[896], v2[896]);
d += 32;
v1++;
v2++;
*out++ = (short)((int)u >> 16);
}
</pre>
Which... you'd think would be preeeetty efficient, right? 4 back-to-back FIPRs? I mean, it is hella gainzy compared to not using FIPR.

But there are two problems with back-to-back FIPR-y that I wanna teach anyone interested:

1) Very often one of the vector arguments stays constant between FIPR calls, but unfortunately the compiler is too dumb to not reload all 8 registers between calls regardless.
* LUCKILY every argument to these FIPRs is unique so this is not applicable, but... very often that's a perf destroyer.

2) THE COMPILER CANNOT PIPELINE FIPR FOR SHIT.
* VERY applicable here. You know what the ASM looks like for these FIPR calls? Something like this:
<pre>
! load first vector arg into fv0 (nothing wrong with this)
fmov.s @%[d]+, fr0
fmov.s @%[d]+, fr1
fmov.s @%[d]+, fr2
fmov.s @%[d]+, fr3

! load second vector arg into fv4 (nothing wrong with this)
fmov.s @%[v1], fr4
add %[offset], %[v1]
fmov.s @%[v2], fr5
add %[offset], %[v2]
fmov.s @%[v1], fr6
fmov.s @%[v2], fr7

! issue actual FIPR calculation
fipr fv0, fv4

! VERY NEXT INSTRUCTION TRIES TO STORE THE RESULT
fmov.s fr7, @%[result] ! PIPELINE STALL!!!!
</pre>
Now this is very very bad. FIPR has 4-5 cycles of latency, so on every fucking call to FIPR, since the very next instruction tries to use the result before it's been calculated, the entire pipeline must stall waiting for the result... FOR EVERY FIPR CALL.
So you're losing MASSIVE perf benefits there.
The solution? You have to pipeline your FIPRs so that while the previous FIPR call is still calculating, you're loading up and issuing the next FIPR call.
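
Before touching ASM, it helps to have a plain C++ reference for what one iteration computes, so the hand-scheduled version can be checked against it. This is just a sketch: <code>dot4</code> and <code>synth_sample_ref</code> are made-up names, and <code>dot4</code> is a portable stand-in for what a single FIPR computes, not the real <code>pl_fipr</code>. The four products are written as independent values that are only combined at the end, which is the dependency structure the pipelining relies on.

```cpp
// Portable stand-in for a 4-wide inner product (what one fipr computes).
inline float dot4(float a0, float a1, float a2, float a3,
                  float b0, float b1, float b2, float b3) {
    return a0 * b0 + a1 * b1 + a2 * b2 + a3 * b3;
}

// One output sample's worth of work. None of the four dot products
// depends on another, so each one can be issued while the previous
// one is still in flight; the results are only consumed at the end.
inline float synth_sample_ref(const float *d, const float *v1, const float *v2) {
    const float p0 = dot4(d[0],  d[1],  d[2],  d[3],  v1[0],   v2[0],   v1[128], v2[128]);
    const float p1 = dot4(d[4],  d[5],  d[6],  d[7],  v1[256], v2[256], v1[384], v2[384]);
    const float p2 = dot4(d[8],  d[9],  d[10], d[11], v1[512], v2[512], v1[640], v2[640]);
    const float p3 = dot4(d[12], d[13], d[14], d[15], v1[768], v2[768], v1[896], v2[896]);
    // Same summation order as the original u = ...; u += ...; chain.
    return ((p0 + p1) + p2) + p3;
}
```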

So I wrote a new routine that replaces that inner loop body doing manually pipelined FIPR calls... This should be way better:
<pre>
for (int i = 32; i; --i) {
#if 0 // Old FIPR path which didn't pipeline for shit.
float u;
u = pl_fipr(d[0], d[1], d[2], d[3], v1[0], v2[0], v1[128], v2[128]);
u += pl_fipr(d[4], d[5], d[6], d[7], v1[256], v2[256], v1[384], v2[384]);
u += pl_fipr(d[8], d[9], d[10], d[11], v1[512], v2[512], v1[640], v2[640]);
u += pl_fipr(d[12], d[13], d[14], d[15], v1[768], v2[768], v1[896], v2[896]);
#else // New hand-written FIPR path with manual pipelining
float u = shz_pl_inner_loop(d, v1, v2);
#endif
d += 32;
v1++;
v2++;
*out++ = (short)((int)u >> 16);
}
</pre>

Where the new implementation is this inline ASM:
<pre>
__always_inline
float shz_pl_inner_loop(const float *d, const float *v1, const float *v2) {
float fp_scratch[2];
uint32_t int_scratch;

asm volatile(R"(
! Swap to back-bank so we don't need to clobber any FP regs.
frchg

! Load first vector into fv0 for first FIPR.
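xor %[s], %[s]
fmov.s @%[d]+, fr0
add #64, %[s]
fmov.s @%[d]+, fr1
add #64, %[s]
fmov.s @%[d]+, fr2
shll2 %[s] ! s = 512 bytes, the stride between v[0] and v[128]
fmov.s @%[d]+, fr3
add #8, %[r] ! point just past the 2-float scratch buffer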

! Load second vector into fv4 for first FIPR
fmov.s @%[v1], fr4
add %[s], %[v1]
fmov.s @%[v2], fr5
add %[s], %[v2]
fmov.s @%[v1], fr6
add %[s], %[v1]
fmov.s @%[v2], fr7
add %[s], %[v2]

! Issue first FIPR
fipr fv0, fv4
! DO NOT SAVE THE RESULT YET

! Load first vector into fv8 for second FIPR.
fmov.s @%[d]+, fr8
fmov.s @%[d]+, fr9
fmov.s @%[d]+, fr10
fmov.s @%[d]+, fr11

! Load second vector into fv12 for second FIPR.
fmov.s @%[v1], fr12
add %[s], %[v1]
fmov.s @%[v2], fr13
add %[s], %[v2]
fmov.s @%[v1], fr14
add %[s], %[v1]
fmov.s @%[v2], fr15
add %[s], %[v2]

! Issue second FIPR
fipr fv8, fv12
! Store result from FIRST FIPR now that it's ready
fmov.s fr7, @-%[r]

! Load first vector into fv0 for third FIPR
fmov.s @%[d]+, fr0
fmov.s @%[d]+, fr1
fmov.s @%[d]+, fr2
fmov.s @%[d]+, fr3

! Load second vector into fv4 for third FIPR
fmov.s @%[v1], fr4
add %[s], %[v1]
fmov.s @%[v2], fr5
add %[s], %[v2]
fmov.s @%[v1], fr6
add %[s], %[v1]
fmov.s @%[v2], fr7
add %[s], %[v2]

! Issue third FIPR
fipr fv0, fv4
! Store result from SECOND FIPR now that it's ready.
fmov.s fr15, @-%[r]

! Load first vector into fv8 for fourth FIPR
fmov.s @%[d]+, fr8
fmov.s @%[d]+, fr9
fmov.s @%[d]+, fr10
fmov.s @%[d]+, fr11

! Load second vector into fv12 for fourth FIPR
fmov.s @%[v1], fr12
add %[s], %[v1]
fmov.s @%[v2], fr13
add %[s], %[v2]
fmov.s @%[v1], fr14
fmov.s @%[v2], fr15

! Issue fourth FIPR
fipr fv8, fv12

! Add up results from previous FIPRs while we wait
fmov.s @%[r]+, fr0
fmov.s @%[r]+, fr1
fadd fr1, fr0
fadd fr7, fr0
add #-8, %[r]

! Add result from fourth FIPR now that it's ready
fadd fr15, fr0

! Store final result
fmov.s fr0, @%[r]

! Swap back to primary FP register bank
frchg
)"
: [d] "+&r" (d), [v1] "+r" (v1), [v2] "+r" (v2),
[r] "+r" (fp_scratch), [s] "=&r" (int_scratch),
"=m" (*fp_scratch));

return fp_scratch[0];
}
</pre>

SH4 FIPR Optimizations

2025-07-26T14:22:56Z

GyroVorbis:

Yo, guys. At like 1AM @ian micheal got me looking at pl_mpeg's audio decoder to see if I could see any potential gainz... So here is its innermost hottest audio synthesis loop:

<pre>
for (int i = 32; i; --i) {
float u;
u = pl_fipr(d[0], d[1], d[2], d[3], v1[0], v2[0], v1[128], v2[128]);
u += pl_fipr(d[4], d[5], d[6], d[7], v1[256], v2[256], v1[384], v2[384]);
u += pl_fipr(d[8], d[9], d[10], d[11], v1[512], v2[512], v1[640], v2[640]);
u += pl_fipr(d[12], d[13], d[14], d[15], v1[768], v2[768], v1[896], v2[896]);
d += 32;
v1++;
v2++;
*out++ = (short)((int)u >> 16);
}
</pre>
Which... you'd think would be preeeetty efficient, right? 4 back-to-back FIPRs? I mean, it is hella gainzy compared to not using FIPR.

But there are two problems with back-to-back FIPR-y, I wanna teach anyone interested:

1) Very often one of the vector arguments stays constant between FIPR calls, but unfortunately the compiler is too dumb to not reload all 8 registers between calls regardless.
* LUCKILY every argument to these FIPRs is unique so this is not applicable, but... very often that's a perf destroyer.

2) THE COMPILER CANNOT PIPELINE FIPR FOR SHIT.
* VERY applicable here. You know what the ASM looks like for these FIPR calls? Something like this:
<pre>
! load first vector arg into fv0 (nothing wrong with this)
fmov.s @%[d]+, fr0
fmov.s @%[d]+, fr1
fmov.s @%[d]+, fr2
fmov.s @%[d]+, fr3

! load second vector arg into fv4 (nothing wrong with this)
fmov.s @%[v1], fr4
add %[offset], %[v1]
fmov.s @%[v2], fr5
add %[offset], %[v2]
fmov.s @%[v1], fr6
fmov.s @%[v2], fr7

! issue actual FIPR calculation
fipr fv0, fv4

! VERY NEXT INSTRUCTION TRY TO STORE THE RESULT
fmov.s fr7, @%[result] ! PIPELINE STALL!!!!
</pre>
Now this is very very bad. FIPR has 4-5 cycles of latency, so on every fucking call to FIPR, since the very next instruction tries to use the result before it's been calculated, the entire pipeline must stall waiting for the result... FOR EVERY FIPR CALL.
So you're losing MASSIVE perf benefits there.
The solution? You have to pipeline your FIPRs so that while the previous FIPR call is still calculating, you're loading up and issuing the next FIPR call.

So I wrote a new routine that replaces that inner loop body doing manually pipelined FIPR calls... This should be way better:
<pre>
for (int i = 32; i; --i) {
#if 0 // Old FIPR path which didn't pipeline for shit.
float u;
u = pl_fipr(d[0], d[1], d[2], d[3], v1[0], v2[0], v1[128], v2[128]);
u += pl_fipr(d[4], d[5], d[6], d[7], v1[256], v2[256], v1[384], v2[384]);
u += pl_fipr(d[8], d[9], d[10], d[11], v1[512], v2[512], v1[640], v2[640]);
u += pl_fipr(d[12], d[13], d[14], d[15], v1[768], v2[768], v1[896], v2[896]);
#else // New hand-written FIPR path with manual pipelining
float u = shz_pl_inner_loop(d, v1, v2);
#endif
d += 32;
v1++;
v2++;
*out++ = (short)((int)u >> 16);
}
</pre>

Where the new implementation is this inline ASM:
<pre>
__always_inline
float shz_pl_inner_loop(const float *d, const float *v1, const float *v2) {
float fp_scratch[2];
uint32_t int_scratch;

asm volatile(R"(
! Swap to back-bank so we don't need to clobber any FP regs.
frchg

! Load first vector into fv0 for first FIPR.
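xor %[s], %[s]
fmov.s @%[d]+, fr0
add #64, %[s]
fmov.s @%[d]+, fr1
add #64, %[s]
fmov.s @%[d]+, fr2
shll2 %[s] ! s = 512 bytes, the stride between v[0] and v[128]
fmov.s @%[d]+, fr3
add #8, %[r] ! point just past the 2-float scratch buffer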

! Load second vector into fv4 for first FIPR
fmov.s @%[v1], fr4
add %[s], %[v1]
fmov.s @%[v2], fr5
add %[s], %[v2]
fmov.s @%[v1], fr6
add %[s], %[v1]
fmov.s @%[v2], fr7
add %[s], %[v2]

! Issue first FIPR
fipr fv0, fv4
! DO NOT SAVE THE RESULT YET

! Load first vector into fv8 for second FIPR.
fmov.s @%[d]+, fr8
fmov.s @%[d]+, fr9
fmov.s @%[d]+, fr10
fmov.s @%[d]+, fr11

! Load second vector into fv12 for second FIPR.
fmov.s @%[v1], fr12
add %[s], %[v1]
fmov.s @%[v2], fr13
add %[s], %[v2]
fmov.s @%[v1], fr14
add %[s], %[v1]
fmov.s @%[v2], fr15
add %[s], %[v2]

! Issue second FIPR
fipr fv8, fv12
! Store result from FIRST FIPR now that it's ready
fmov.s fr7, @-%[r]

! Load first vector into fv0 for third FIPR
fmov.s @%[d]+, fr0
fmov.s @%[d]+, fr1
fmov.s @%[d]+, fr2
fmov.s @%[d]+, fr3

! Load second vector into fv4 for third FIPR
fmov.s @%[v1], fr4
add %[s], %[v1]
fmov.s @%[v2], fr5
add %[s], %[v2]
fmov.s @%[v1], fr6
add %[s], %[v1]
fmov.s @%[v2], fr7
add %[s], %[v2]

! Issue third FIPR
fipr fv0, fv4
! Store result from SECOND FIPR now that it's ready.
fmov.s fr15, @-%[r]

! Load first vector into fv8 for fourth FIPR
fmov.s @%[d]+, fr8
fmov.s @%[d]+, fr9
fmov.s @%[d]+, fr10
fmov.s @%[d]+, fr11

! Load second vector into fv12 for fourth FIPR
fmov.s @%[v1], fr12
add %[s], %[v1]
fmov.s @%[v2], fr13
add %[s], %[v2]
fmov.s @%[v1], fr14
fmov.s @%[v2], fr15

! Issue fourth FIPR
fipr fv8, fv12

! Add up results from previous FIPRs while we wait
fmov.s @%[r]+, fr0
fmov.s @%[r]+, fr1
fadd fr1, fr0
fadd fr7, fr0
add #-8, %[r]

! Add result from fourth FIPR now that it's ready
fadd fr15, fr0

! Store final result
fmov.s fr0, @%[r]

! Swap back to primary FP register bank
frchg
)"
: [d] "+&r" (d), [v1] "+r" (v1), [v2] "+r" (v2),
[r] "+r" (fp_scratch), [s] "=&r" (int_scratch),
"=m" (*fp_scratch));

return fp_scratch[0];
}
</pre>

SH4 FIPR Optimizations

2025-07-26T14:18:22Z

GyroVorbis: Created page with "Yo, guys. At like 1AM @ian micheal got me looking at pl_mpeg's audio decoder to see if I could see any potential gainz... <pre> for (int i = 32; i; --i) { float u; u = pl_fipr(d[0], d[1], d[2], d[3], v1[0], v2[0], v1[128], v2[128]); u += pl_fipr(d[4], d[5], d[6], d[7], v1[256], v2[256], v1[384], v2[384]); u += pl_fipr(d[8], d[9], d[10], d[11], v1[512], v2[512], v1[640], v2[640]); u += pl_fipr(d[12], d[13], d[14], d[15], v1[768], v2[768], v1[896], v2[..."

Yo, guys. At like 1AM @ian micheal got me looking at pl_mpeg's audio decoder to see if I could see any potential gainz...

<pre>
for (int i = 32; i; --i) {
float u;
u = pl_fipr(d[0], d[1], d[2], d[3], v1[0], v2[0], v1[128], v2[128]);
u += pl_fipr(d[4], d[5], d[6], d[7], v1[256], v2[256], v1[384], v2[384]);
u += pl_fipr(d[8], d[9], d[10], d[11], v1[512], v2[512], v1[640], v2[640]);
u += pl_fipr(d[12], d[13], d[14], d[15], v1[768], v2[768], v1[896], v2[896]);
d += 32;
v1++;
v2++;
*out++ = (short)((int)u >> 16);
}
</pre>
Which... you'd think would be preeeetty efficient, right? 4 back-to-back FIPRs? I mean, it is hella gainzy compared to not using FIPR.

But there are two problems with back-to-back FIPR-y, I wanna teach anyone interested:

1) Very often one of the vector arguments stays constant between FIPR calls, but unfortunately the compiler is too dumb to not reload all 8 registers between calls regardless.
* LUCKILY every argument to these FIPRs is unique so this is not applicable, but... very often that's a perf destroyer.

2) THE COMPILER CANNOT PIPELINE FIPR FOR SHIT.
* VERY applicable here. You know what the ASM looks like for these FIPR calls? Something like this:
<pre>
! load first vector arg into fv0 (nothing wrong with this)
fmov.s @%[d]+, fr0
fmov.s @%[d]+, fr1
fmov.s @%[d]+, fr2
fmov.s @%[d]+, fr3

! load second vector arg into fv4 (nothing wrong with this)
fmov.s @%[v1], fr4
add %[offset], %[v1]
fmov.s @%[v2], fr5
add %[offset], %[v2]
fmov.s @%[v1], fr6
fmov.s @%[v2], fr7

! issue actual FIPR calculation
fipr fv0, fv4

! VERY NEXT INSTRUCTION TRY TO STORE THE RESULT
fmov.s fr7, @%[result] ! PIPELINE STALL!!!!
</pre>
Now this is very very bad. FIPR has 4-5 cycles of latency, so on every fucking call to FIPR, since the very next instruction tries to use the result before it's been calculated, the entire pipeline must stall waiting for the result... FOR EVERY FIPR CALL.
So you're losing MASSIVE perf benefits there.
The solution? You have to pipeline your FIPRs so that while the previous FIPR call is still calculating, you're loading up and issuing the next FIPR call.

So I wrote a new routine that replaces that inner loop body doing manually pipelined FIPR calls... This should be way better:
<pre>
for (int i = 32; i; --i) {
#if 0 // Old FIPR path which didn't pipeline for shit.
float u;
u = pl_fipr(d[0], d[1], d[2], d[3], v1[0], v2[0], v1[128], v2[128]);
u += pl_fipr(d[4], d[5], d[6], d[7], v1[256], v2[256], v1[384], v2[384]);
u += pl_fipr(d[8], d[9], d[10], d[11], v1[512], v2[512], v1[640], v2[640]);
u += pl_fipr(d[12], d[13], d[14], d[15], v1[768], v2[768], v1[896], v2[896]);
#else // New hand-written FIPR path with manual pipelining
float u = shz_pl_inner_loop(d, v1, v2);
#endif
d += 32;
v1++;
v2++;
*out++ = (short)((int)u >> 16);
}
</pre>

Where the new implementation is this inline ASM:
<pre>
__always_inline
float shz_pl_inner_loop(const float *d, const float *v1, const float *v2) {
float fp_scratch[2];
uint32_t int_scratch;

asm volatile(R"(
! Swap to back-bank so we don't need to clobber any FP regs.
frchg

! Load first vector into fv0 for first FIPR.
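xor %[s], %[s]
fmov.s @%[d]+, fr0
add #64, %[s]
fmov.s @%[d]+, fr1
add #64, %[s]
fmov.s @%[d]+, fr2
shll2 %[s] ! s = 512 bytes, the stride between v[0] and v[128]
fmov.s @%[d]+, fr3
add #8, %[r] ! point just past the 2-float scratch buffer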

! Load second vector into fv4 for first FIPR
fmov.s @%[v1], fr4
add %[s], %[v1]
fmov.s @%[v2], fr5
add %[s], %[v2]
fmov.s @%[v1], fr6
add %[s], %[v1]
fmov.s @%[v2], fr7
add %[s], %[v2]

! Issue first FIPR
fipr fv0, fv4
! DO NOT SAVE THE RESULT YET

! Load first vector into fv8 for second FIPR.
fmov.s @%[d]+, fr8
fmov.s @%[d]+, fr9
fmov.s @%[d]+, fr10
fmov.s @%[d]+, fr11

! Load second vector into fv12 for second FIPR.
fmov.s @%[v1], fr12
add %[s], %[v1]
fmov.s @%[v2], fr13
add %[s], %[v2]
fmov.s @%[v1], fr14
add %[s], %[v1]
fmov.s @%[v2], fr15
add %[s], %[v2]

! Issue second FIPR
fipr fv8, fv12
! Store result from FIRST FIPR now that it's ready
fmov.s fr7, @-%[r]

! Load first vector into fv0 for third FIPR
fmov.s @%[d]+, fr0
fmov.s @%[d]+, fr1
fmov.s @%[d]+, fr2
fmov.s @%[d]+, fr3

! Load second vector into fv4 for third FIPR
fmov.s @%[v1], fr4
add %[s], %[v1]
fmov.s @%[v2], fr5
add %[s], %[v2]
fmov.s @%[v1], fr6
add %[s], %[v1]
fmov.s @%[v2], fr7
add %[s], %[v2]

! Issue third FIPR
fipr fv0, fv4
! Store result from SECOND FIPR now that it's ready.
fmov.s fr15, @-%[r]

! Load first vector into fv8 for fourth FIPR
fmov.s @%[d]+, fr8
fmov.s @%[d]+, fr9
fmov.s @%[d]+, fr10
fmov.s @%[d]+, fr11

! Load second vector into fv12 for fourth FIPR
fmov.s @%[v1], fr12
add %[s], %[v1]
fmov.s @%[v2], fr13
add %[s], %[v2]
fmov.s @%[v1], fr14
fmov.s @%[v2], fr15

! Issue fourth FIPR
fipr fv8, fv12

! Add up results from previous FIPRs while we wait
fmov.s @%[r]+, fr0
fmov.s @%[r]+, fr1
fadd fr1, fr0
fadd fr7, fr0
add #-8, %[r]

! Add result from fourth FIPR now that it's ready
fadd fr15, fr0

! Store final result
fmov.s fr0, @%[r]

! Swap back to primary FP register bank
frchg
)"
: [d] "+&r" (d), [v1] "+r" (v1), [v2] "+r" (v2),
[r] "+r" (fp_scratch), [s] "=&r" (int_scratch),
"=m" (*fp_scratch));

return fp_scratch[0];
}
</pre>

SH4 FTRV Optimizations

2025-07-17T17:32:03Z

GyroVorbis: Created page with "Without a doubt, the single most computationally powerful instruction on the SuperH4 CPU in the Sega Dreamcast is '''FTRV''', or the '''F'''loating-point '''TR'''ansform '''V'''ector instruction. It is a single instruction which multiplies a 4D vector by the 4x4 matrix held within the back-bank of FPU registers, '''XMTRX'''. This article will teach you how to leverage this god instruction for FP performance gainz and introduce you to several example scenarios that have y..."

SH4 in Compiler Explorer

2025-04-26T14:17:13Z

GyroVorbis: Added GCC15.1.0 templates

Thanks to the effort of Matt Godbolt (who hilariously enough is a former Dreamcast developer himself), the SuperH GCC toolchain is now available for use with [https://godbolt.org Compiler Explorer], along with all of the SH4-specific compiler flags and options typically used when targeting the Dreamcast. This gives us an invaluable tool for getting quick and immediate feedback on how well a given C or C++ source segment tends to translate into SH4 assembly, offering a little sandbox for testing and optimizing code targeting the Dreamcast.

[[File:GCC Compiler Benchmarks.png|thumb|Benchmarking various GCC versions and flags]]

= Configuration =
To arrive at a configuration mirroring a Dreamcast development environment, first select one of the GCC compiler versions for the SH architecture. Secondly, the following compiler options should be used as the baseline configuration:
* <code>-ml</code>: compile code for the processor in little-endian mode
* FPU Mode:
** <code>-m4-single</code>: generate code for the SH4 with a floating-point unit that supports both single and double-precision floating point arithmetic that defaults to single-precision mode upon function entry
** <code>-m4-single-only</code>: generate code for the SH4 with a floating-point unit that only supports single-precision floating point arithmetic
* <code>-ffast-math</code>: breaks strict IEEE compliance and allows for faster floating point approximations
* <code>-O3</code>: optimization level 3
* <code>-mfsrra</code>: enables emission of the fsrra instruction for reciprocal square root approximations (not available in GCC 4.7.4)
* <code>-mfsca</code>: enables emission of the fsca instruction for sine and cosine approximations (not available in GCC 4.7.4)
* <code>-matomic-model=soft-imask</code>: enables support for C11 and C++11 atomics by disabling then reenabling interrupts around atomic variable operations
* <code>-ftls-model=local-exec</code>: enables the model used by KOS for supporting variables declared with the "thread_local" keyword

= Convenience Templates =
The following are pre-configured templates you can use as sample Dreamcast build configurations:
* GCC 4.9.4:
** [https://godbolt.org/z/9MKzeMfMj C11 Hello World]
** [https://godbolt.org/z/qGzoeo4sj C++14 Hello World]
* GCC 9.5.0:
** [https://godbolt.org/z/rvW3s3594 C17 Hello World]
** [https://godbolt.org/z/qYfE5G6Mx C++17 Hello World]
* GCC 12.2.0:
** [https://godbolt.org/z/94TKvxazn C17 Hello World]
** [https://godbolt.org/z/61jqhE3zn C++20 Hello World]
* GCC 13.1.0:
** [https://godbolt.org/z/Kb9bKe8ro C2X Hello World]
** [https://godbolt.org/z/51dv4ePsG C++23 Hello World]
* GCC 13.2.0:
** [https://godbolt.org/z/rafvMdWGb C2X Hello World]
** [https://godbolt.org/z/MeG3rqna7 C++23 Hello World]
* GCC 14.1.0:
** [https://godbolt.org/z/7ha8oj4vd C23 Hello World]
** [https://godbolt.org/z/eW8Yd8sG9 C++26 Hello World]
* GCC 14.2.0:
** [https://godbolt.org/z/a54qTYon4 C23 Hello World]
** [https://godbolt.org/z/5njMrrh8z C++26 Hello World]
* GCC 15.1.0:
** [https://godbolt.org/z/WWsaaEqWW C23 Hello World]
** [https://godbolt.org/z/raP67zWcY C++26 Hello World]

Pre-configured template for ARM GCC 8.5, targeting the ARM CPU inside the Dreamcast's "AICA" sound chip:
* [https://godbolt.org/z/895dWG9qf C Hello World]

= Tips and Notes =
* It has been noted that while <code>-O3</code> is claimed to be the highest optimization level according to recent GCC documentation, some code differences can still be seen under certain circumstances when using <code>-O4</code> and beyond.
* The compiler seems to ignore both <code>-mfsrra</code> and <code>-mfsca</code> without the <code>-ffast-math</code> option.
* With the proper flags enabled for fast-math, the compiler is smart enough to leverage the following from pure C code, almost certainly better than you can do with small intrinsic-style inline ASM calls, provided you're using the proper single-precision versions of any <code><math.h></code> routines:
**
<center>
{|
|+ Compiler-Generated Fast-Math Optimizations
|-
! Assembly Output !! C/C++ Input
|-
| FMAC || <code>z = x * y + z</code>
|-
| FSCA || <code>s = sinf(angle); c = cosf(angle)</code>
|-
| FSRRA || <code>1.0f / sqrtf(x)</code>
|}
</center>
* Unfortunately, the compiler has no knowledge of the following SIMD instructions, even with fast-math, so inline assembly routines (such as those provided by KallistiOS) are necessary to fully leverage the SH4's FPU when working with vectors and matrices in linear algebra routines:
**
<center>
{|
|+ Manual Inline Assembly Fast-Math Optimizations
|-
! Assembly Output !! C/C++ Input
|-
| FIPR || <code>Vector4 Dot Product</code>
|-
| FTRV || <code>Vector4 * Matrix4x4 Transform</code>
|}
</center>
* Typically smaller code sizes and more tightly optimized code are seen with newer versions of GCC versus the older ones; however, this is not always the case.
* Evidently, even without a branch predictor, the C++20 <code><nowiki>[[likely]]</nowiki></code> and <code><nowiki>[[unlikely]]</nowiki></code> attributes as well as the GCC intrinsic <code>__builtin_expect()</code> can have a fairly profound impact on code generation and optimization for conditionals and branches. More information can be found [https://dcemulation.org/phpBB/viewtopic.php?t=106029 here].
* <code>-fipa-pta</code> allows the compiler to analyze pointer and reference usage beyond the scope of the current compiling function, which very often results in pretty decent performance increases at the cost of increased compile times and RAM usage.
* <code>-flto</code> allows GCC to perform optimizations over the entire program and all translation units as a single entity during the linking phase, for the cost of increased compile times and RAM usage. This frequently results in more performant code.
* An in-depth benchmark comparing the run-time performance and compiled binary size output of every toolchain version officially supported by KOS with various optimization levels can be found [https://dcemulation.org/phpBB/viewtopic.php?t=106068 here].
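
The scalar patterns from the tables above can be pasted into one of the templates as a starting point. This is a minimal sketch, assuming the baseline flags listed under Configuration; the function names are made up for illustration, and whether a given GCC version actually emits FMAC, FSCA, or FSRRA for them should be confirmed in the assembly pane rather than taken for granted:

<syntaxhighlight lang="c">
#include <math.h>

/* Candidate for FMAC: multiply and accumulate. */
float mul_add(float x, float y, float z) {
    return x * y + z;
}

/* Candidate for FSCA: sine and cosine of the same angle. */
void sin_cos(float angle, float *s, float *c) {
    *s = sinf(angle);
    *c = cosf(angle);
}

/* Candidate for FSRRA: reciprocal square root. */
float inv_sqrt(float x) {
    return 1.0f / sqrtf(x);
}

/* Plain C 4D dot product. Per the note above, GCC is not expected to
 * turn this into FIPR, so it serves as a baseline for comparison
 * against an inline assembly version. */
float dot4(const float a[4], const float b[4]) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}
</syntaxhighlight>

Note that the single-precision <code>sinf</code>, <code>cosf</code>, and <code>sqrtf</code> variants are used deliberately, in line with the tip about preferring the single-precision routines.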

VMU peripheral

2025-01-29T15:27:27Z

GyroVorbis:

The VMU peripheral on the [[Maple_bus|Maple Bus]] contains 3 functions: screen, storage, and timer.

'''NOTE:''' Some information here is misleading and/or incomplete

== Storage ==

The "block read" and "block write" commands (0x0B and 0x0C) with storage function code are used to read and write blocks of memory in the storage peripheral.

Normally, there are 256 blocks of memory that make up the entire storage space, and normally each block consists of 512 bytes. That makes a total of 128 KB of memory. These values are configurable according to the Maple Bus spec, but anything other than these values is not practically usable by most games due to many high-level APIs and libraries within the SDKs not adhering properly to the Maple spec and making hardcoded assumptions about the volume information.<ref>''[https://www.dreamcast-talk.com/forum/viewtopic.php?f=5&t=15562&p=171565] TapamN | 400 block VMU on emulator''</ref>

=== Function Definition ===
The function definition may be found in the peripheral's [[Maple_bus#Device_Info_Payload_Structure_.28cmd_0x05.29|device info packet]]. It is necessary to read this word to know how to access blocks of memory on the storage.
{| class="wikitable"
|-
! Byte 0 (LSB) !! Byte 1 !! Byte 2 !! Byte 3 (MSB)
|-
| bit 7: removable (1: true)<br />bit 6: CRC (1: needed)<br />bits 0-5: unused (0) || bits 4-7: number of write accesses per block<br />bits 0-3: number of read accesses per block || number of bytes per block: (x + 1) * 32 bytes || number of partitions: (x + 1)
|}
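
The bit layout above can be unpacked with a few shifts and masks. The following is a minimal sketch rather than code from any SDK: the type and function names are hypothetical, and it assumes the function definition word has already been converted to host order so that "Byte 0 (LSB)" occupies bits 0-7.

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the decoded storage function definition. */
typedef struct {
    bool     removable;        /* byte 0, bit 7 */
    bool     crc_needed;       /* byte 0, bit 6 */
    unsigned reads_per_block;  /* byte 1, bits 0-3 */
    unsigned writes_per_block; /* byte 1, bits 4-7 */
    unsigned bytes_per_block;  /* (byte 2 + 1) * 32 */
    unsigned partitions;       /* byte 3 + 1 */
} vmu_storage_def_t;

/* Assumes `word` is in host order, with byte 0 in bits 0-7. */
vmu_storage_def_t vmu_decode_storage_def(uint32_t word) {
    const uint8_t b0 = (uint8_t)(word & 0xFF);
    const uint8_t b1 = (uint8_t)((word >> 8) & 0xFF);
    const uint8_t b2 = (uint8_t)((word >> 16) & 0xFF);
    const uint8_t b3 = (uint8_t)((word >> 24) & 0xFF);

    vmu_storage_def_t def;
    def.removable        = (b0 & 0x80) != 0;
    def.crc_needed       = (b0 & 0x40) != 0;
    def.reads_per_block  = b1 & 0x0F;
    def.writes_per_block = (b1 >> 4) & 0x0F;
    def.bytes_per_block  = ((unsigned)b2 + 1) * 32;
    def.partitions       = (unsigned)b3 + 1;
    return def;
}
</syntaxhighlight>

As an illustration, the sample word <code>0x000F4100</code> decodes to 512 bytes per block and a single partition, matching the usual layout described above, with 1 read access and 4 write accesses per block.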

=== Get Media Info ===
Execute a [[Maple_bus#Commands|get memory information]] command to retrieve the total size of the media and the block locations of its system, FAT, file information, and save areas. On successful execution, a [[Maple_bus#Commands|data transfer]] packet with the following payload will be returned.

{| class="wikitable"
|-
! Word 0 !! Word 1 !! Word 2 !! Word 3 !! Word 4 !! Word 5 !! Word 6
|-
| [[Maple_bus#Function_Codes|Function code]] 0x00000002 || 2 most sig bytes: total size<br />2 least sig bytes: the partition number of this media || 2 most sig bytes: block number of system area<br />2 least sig bytes: block number of start of FAT area || 2 most sig bytes: number of FAT blocks<br />2 least sig bytes: block number of file information || 2 most sig bytes: number of file info blocks<br />2 least sig bytes: icon number || 2 most sig bytes: block number of save area<br />2 least sig bytes: number of blocks in save area || execution file info
|}
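
Splitting the payload into its fields is a matter of taking the upper and lower halves of each word. A minimal sketch under stated assumptions: the struct and function names are hypothetical, the seven payload words are assumed to already be in host order, and the last word is kept opaque because its layout is not described here.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Hypothetical container mirroring the seven payload words above. */
typedef struct {
    uint32_t function_code;         /* word 0 */
    uint16_t total_size;            /* word 1, upper half */
    uint16_t partition;             /* word 1, lower half */
    uint16_t system_block;          /* word 2, upper half */
    uint16_t fat_block;             /* word 2, lower half */
    uint16_t fat_block_count;       /* word 3, upper half */
    uint16_t file_info_block;       /* word 3, lower half */
    uint16_t file_info_block_count; /* word 4, upper half */
    uint16_t icon;                  /* word 4, lower half */
    uint16_t save_block;            /* word 5, upper half */
    uint16_t save_block_count;      /* word 5, lower half */
    uint32_t exec_file_info;        /* word 6, left undecoded */
} vmu_media_info_t;

static uint16_t hi16(uint32_t w) { return (uint16_t)(w >> 16); }
static uint16_t lo16(uint32_t w) { return (uint16_t)(w & 0xFFFF); }

/* Assumes `words` holds the 7 payload words in host order. */
vmu_media_info_t vmu_parse_media_info(const uint32_t words[7]) {
    vmu_media_info_t info;
    info.function_code         = words[0];
    info.total_size            = hi16(words[1]);
    info.partition             = lo16(words[1]);
    info.system_block          = hi16(words[2]);
    info.fat_block             = lo16(words[2]);
    info.fat_block_count       = hi16(words[3]);
    info.file_info_block       = lo16(words[3]);
    info.file_info_block_count = hi16(words[4]);
    info.icon                  = lo16(words[4]);
    info.save_block            = hi16(words[5]);
    info.save_block_count      = lo16(words[5]);
    info.exec_file_info        = words[6];
    return info;
}
</syntaxhighlight>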

=== Block Read ===
Execute a sequence of [[Maple_bus#Commands|block reads]] to read data from storage. The number of sequences is dependent on the number of read accesses per block defined in the function definition above. On successful execution, a [[Maple_bus#Commands|data transfer]] packet will be returned.

=== Block Write ===
Execute a sequence of [[Maple_bus#Commands|block writes]] to write data to storage. The number of sequences is dependent on the number of write accesses per block defined in the function definition above. On successful execution, an [[Maple_bus#Commands|acknowledge]] packet will be returned. After the final write sequence, execute a [[Maple_bus#Commands|get last error]] command with an incremented phase value to commit the block from RAM to storage.
* A VMU cannot handle successive block write packets spaced less than about 10 ms apart
** After such an error, the storage function acts as if it never received the sequence and never returns a response packet. It may be possible to resend that sequence successfully, but a "get last error" command would be necessary before starting over at the first sequence
** The Dreamcast performs each read/write sequence on the same cadence as controller polling (about every 16 ms)

== Screen ==

The "block write" command (0x0C) with screen function code and 48 data words is used to write monochrome images to the screen. The screen is 48 pixels wide and 32 pixels tall, with one bit per pixel. For each bit in the 48 data words, a value of 1 means the pixel is on (black) and 0 means the pixel is off (white). Data is written from left to right and top to bottom (when holding the VMU in the upright orientation, i.e. with the controller flipped upside down). The most significant bit of the first word sets the pixel at the top left of the screen, so the first word covers the 1st through 32nd pixels of the first row. The two most significant bytes of the second word write to the 33rd through 48th pixels of the first row, and its two least significant bytes write to the 1st through 16th pixels of the second row. This pattern repeats for the rest of the 48 words, as pictured below.

[[File:Dreamcast Screen Words.png|Dreamcast Screen Words]]
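
Because each 48-pixel row spans one and a half words, it is easiest to treat the 48 words as one continuous stream of 1536 bits, most significant bit first. A minimal sketch of that mapping follows; the names are hypothetical, and it only covers the word and bit layout described above, not byte ordering on the wire or the upside-down orientation inside a controller.

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

enum {
    VMU_LCD_WIDTH  = 48, /* pixels per row */
    VMU_LCD_HEIGHT = 32, /* rows */
    VMU_LCD_WORDS  = 48  /* 48 * 32 pixels / 32 bits per word */
};

/* Sets or clears the pixel at (x, y), where (0, 0) is the top left.
 * Out-of-range coordinates are ignored. */
void vmu_lcd_set_pixel(uint32_t words[VMU_LCD_WORDS],
                       unsigned x, unsigned y, bool on) {
    if (x >= VMU_LCD_WIDTH || y >= VMU_LCD_HEIGHT)
        return;

    const unsigned index = y * VMU_LCD_WIDTH + x;  /* position in the bit stream */
    const unsigned word  = index / 32;             /* which data word */
    const uint32_t mask  = UINT32_C(1) << (31 - (index % 32)); /* MSB first */

    if (on)
        words[word] |= mask;
    else
        words[word] &= ~mask;
}
</syntaxhighlight>

With this mapping, pixel (32, 0) lands in the most significant bit of the second word and pixel (0, 1) lands in bit 15 of that same word, which agrees with the row split described above.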

== Timer ==

The timer function allows the host to activate the buzzer on the VMU and get button conditions.

== References ==
<references />

User:GyroVorbis

2025-01-29T15:20:52Z

GyroVorbis:

My name is Falco Girgis.

I am the infamous lead developer of Elysian Shadows, ESTk, and ElysianVMU. I am widely hated for failing to deliver on a successfully funded Kickstarter, but I've actually poured my heart and soul into giving back to this community and contributing to its infrastructure for years now to atone for my sins.

Feel free to come hang out with me on Discord or follow me on Twitter or something. I'm happy to help!

* GitHub: https://github.com/gyrovorbis
* Discord: https://discord.gg/SX2txgr
* Twitter: https://twitter.com/falco_girgis
* LinkedIn: https://www.linkedin.com/feed/

SH4 in Compiler Explorer

2025-01-29T15:18:13Z

GyroVorbis:

Thanks to the effort of Matt Godbolt (who hilariously enough is a former Dreamcast developer himself), the SuperH GCC toolchain is now available for use with [https://godbolt.org Compiler Explorer], along with all of the SH4-specific compiler flags and options typically used when targeting the Dreamcast. This gives us an invaluable tool for getting quick and immediate feedback on how well a given C or C++ source segment tends to translate into SH4 assembly, offering a little sandbox for testing and optimizing code targeting the Dreamcast.

[[File:GCC Compiler Benchmarks.png|thumb|Benchmarking various GCC versions and flags]]

= Configuration =
To arrive at a configuration mirroring a Dreamcast development environment, first select one of the GCC compiler versions for the SH architecture. Secondly, the following compiler options should be used as the baseline configuration:
* <code>-ml</code>: compile code for the processor in little-endian mode
* FPU Mode:
** <code>-m4-single</code>: generate code for the SH4 with a floating-point unit that supports both single and double-precision floating point arithmetic that defaults to single-precision mode upon function entry
** <code>-m4-single-only</code>: generate code for the SH4 with a floating-point unit that only supports single-precision floating point arithmetic
* <code>-ffast-math</code>: breaks strict IEEE compliance and allows for faster floating point approximations
* <code>-O3</code>: optimization level 3
* <code>-mfsrra</code>: enables emission of the fsrra instruction for reciprocal square root approximations (not available in GCC 4.7.4)
* <code>-mfsca</code>: enables emission of the fsca instruction for sine and cosine approximations (not available in GCC 4.7.4)
* <code>-matomic-model=soft-imask</code>: enables support for C11 and C++11 atomics by disabling then reenabling interrupts around atomic variable operations
* <code>-ftls-model=local-exec</code>: enables the model used by KOS for supporting variables declared with the "thread_local" keyword

= Convenience Templates =
The following are pre-configured templates you can use as sample Dreamcast build configurations:
* GCC4.9.4:
** [https://godbolt.org/z/9MKzeMfMj C11 Hello World]
** [https://godbolt.org/z/qGzoeo4sj C++14 Hello World]
* GCC9.5.0:
** [https://godbolt.org/z/rvW3s3594 C17 Hello World]
** [https://godbolt.org/z/qYfE5G6Mx C++17 Hello World]
* GCC12.2.0:
** [https://godbolt.org/z/94TKvxazn C17 Hello World]
** [https://godbolt.org/z/61jqhE3zn C++20 Hello World]
* GCC13.1.0:
** [https://godbolt.org/z/Kb9bKe8ro C2X Hello World]
** [https://godbolt.org/z/51dv4ePsG C++23 Hello World]
* GCC13.2.0:
** [https://godbolt.org/z/rafvMdWGb C2X Hello World]
** [https://godbolt.org/z/MeG3rqna7 C++23 Hello World]
* GCC 14.1.0
** [https://godbolt.org/z/7ha8oj4vd C23 Hello World]
** [https://godbolt.org/z/eW8Yd8sG9 C++26 Hello World]
* GCC 14.2.0
** [https://godbolt.org/z/a54qTYon4 C23 Hello World]
** [https://godbolt.org/z/5njMrrh8z C++26 Hello World]

Pre-configured template for ARM GCC 8.5, targeting the ARM CPU inside the Dreamcast's "AICA" sound chip:
* [https://godbolt.org/z/895dWG9qf C Hello World]

= Tips and Notes =
* It has been noted that while <code>-O3</code> is claimed to be the highest optimization level according to recent GCC documentation, some code differences can still be seen under certain circumstances when using <code>-O4</code> and beyond.
* The compiler seems to ignore both <code>-mfsrra</code> and <code>-mfsca</code> without the <code>-ffast-math</code> option.
* With the proper flags enabled for fast-math, the compiler is smart enough to leverage the following from pure C code, almost certainly better than you can do with small intrinsic-style inline ASM calls, provided you're using the proper single-precision versions of any <code><math.h></code> routines:
**
<center>
{|
|+ Compiler-Generated Fast-Math Optimizations
|-
! Assembly Output !! C/C++ Input
|-
| FMAC || <code>z = x * y + z</code>
|-
| FSCA || <code>s = sinf(angle); c = cosf(angle)</code>
|-
| FSRRA || <code>1.0f / sqrtf(x)</code>
|}
</center>
* Unfortunately, the compiler has no knowledge of the following SIMD instructions, even with fast-math, so inline assembly routines (such as those provided by KallistiOS) are necessary to fully leverage the SH4's FPU when working with vectors and matrices in linear algebra routines:
**
<center>
{|
|+ Manual Inline Assembly Fast-Math Optimizations
|-
! Assembly Output !! C/C++ Input
|-
| FIPR || <code>Vector4 Dot Product</code>
|-
| FTRV || <code>Vector4 * Matrix4x4 Transform</code>
|}
</center>
* Typically smaller code sizes and more tightly optimized code are seen with newer versions of GCC versus the older ones; however, this is not always the case.
* Evidently, even without a branch predictor, the C++20 <code><nowiki>[[likely]]</nowiki></code> and <code><nowiki>[[unlikely]]</nowiki></code> attributes as well as the GCC intrinsic <code>__builtin_expect()</code> can have a fairly profound impact on code generation and optimization for conditionals and branches. More information can be found [https://dcemulation.org/phpBB/viewtopic.php?t=106029 here].
* <code>-fipa-pta</code> allows the compiler to analyze pointer and reference usage beyond the scope of the current compiling function, which very often results in pretty decent performance increases at the cost of increased compile times and RAM usage.
* <code>-flto</code> allows GCC to perform optimizations over the entire program and all translation units as a single entity during the linking phase, for the cost of increased compile times and RAM usage. This frequently results in more performant code.
* An in-depth benchmark comparing the run-time performance and compiled binary size output of every toolchain version officially supported by KOS with various optimization levels can be found [https://dcemulation.org/phpBB/viewtopic.php?t=106068 here].

Pushing Polygons

2025-01-08T04:09:08Z

GyroVorbis: Created Basic skeleton page with index of topics

WIP page for a deep-dive into how to efficiently maximize polygon throughput and optimize T&L for the Sega Dreamcast with the KallistiOS SDK.

= Rendering =
== Vertex Submission ==
* Basic
* DMA
* Direct
* Hybrid
== Mesh Formats ==
* Triangle strips
== Vertex Formats ==
* sprites
* 16-bit UV
* floating-point colors
== Polygon Headers ==
* Caching
= SH4 Math Acceleration =
* vec3f
* matrix_t
* fmath.h
= Cache Management =
* Prefetching
* OCRAM
* OCINDEX

SH4 in Compiler Explorer

2025-01-08T03:42:00Z

GyroVorbis: Created page with "Thanks to the effort of Matt Godbolt (who hilariously enough is a former Dreamcast developer himself), the SuperH GCC toolchain is now available for use with [https://godbolt.org Compiler Explorer], along with all of the SH4-specific compiler flags and options typically used when targeting the Dreamcast. This gives us an invaluable tool for getting quick and immediate feedback on how well a given C or C++ source segment tends to translate into SH4 assembly, offering a li..."

Thanks to the effort of Matt Godbolt (who hilariously enough is a former Dreamcast developer himself), the SuperH GCC toolchain is now available for use with [https://godbolt.org Compiler Explorer], along with all of the SH4-specific compiler flags and options typically used when targeting the Dreamcast. This gives us an invaluable tool for getting quick and immediate feedback on how well a given C or C++ source segment tends to translate into SH4 assembly, offering a little sandbox for testing and optimizing code targeting the Dreamcast.

[[File:GCC Compiler Benchmarks.png|thumb|Benchmarking various GCC versions and flags]]

= Configuration =
To arrive at a configuration mirroring a Dreamcast development environment, first select one of the GCC compiler versions for the SH architecture. Secondly, the following compiler options should be used as the baseline configuration:
* <code>-ml</code>: compile code for the processor in little-endian mode
* FPU Mode:
** <code>-m4-single-only</code>: generate code for the SH4 with a floating-point unit that only supports single-precision floating point arithmetic
** <code>-m4-single</code>: generate code for the SH4 with a floating-point unit that supports both single and double-precision floating point arithmetic that defaults to single-precision mode upon function entry
* <code>-ffast-math</code>: breaks strict IEEE compliance and allows for faster floating point approximations
* <code>-O3</code>: optimization level 3
* <code>-mfsrra</code>: enables emission of the fsrra instruction for reciprocal square root approximations (not available in GCC 4.7.4)
* <code>-mfsca</code>: enables emission of the fsca instruction for sine and cosine approximations (not available in GCC 4.7.4)
* <code>-matomic-model=soft-imask</code>: enables support for C11 and C++11 atomics by disabling then reenabling interrupts around atomic variable operations
* <code>-ftls-model=local-exec</code>: enables the model used by KOS for supporting variables declared with the "thread_local" keyword

= Convenience Templates =
The following are pre-configured templates you can use as sample Dreamcast build configurations:
* GCC4.9.4:
** [https://godbolt.org/z/9MKzeMfMj C11 Hello World]
** [https://godbolt.org/z/qGzoeo4sj C++14 Hello World]
* GCC9.5.0:
** [https://godbolt.org/z/rvW3s3594 C17 Hello World]
** [https://godbolt.org/z/qYfE5G6Mx C++17 Hello World]
* GCC12.2.0:
** [https://godbolt.org/z/94TKvxazn C17 Hello World]
** [https://godbolt.org/z/61jqhE3zn C++20 Hello World]
* GCC13.1.0:
** [https://godbolt.org/z/Kb9bKe8ro C2X Hello World]
** [https://godbolt.org/z/51dv4ePsG C++23 Hello World]
* GCC13.2.0:
** [https://godbolt.org/z/rafvMdWGb C2X Hello World]
** [https://godbolt.org/z/MeG3rqna7 C++23 Hello World]
* GCC 14.1.0
** [https://godbolt.org/z/7ha8oj4vd C23 Hello World]
** [https://godbolt.org/z/eW8Yd8sG9 C++26 Hello World]

Pre-configured template for ARM GCC 8.5, targeting the ARM CPU inside the Dreamcast's "AICA" sound chip:
* [https://godbolt.org/z/895dWG9qf C Hello World]

= Tips and Notes =
* It has been noted that while <code>-O3</code> is claimed to be the highest optimization level according to recent GCC documentation, some code differences can still be seen under certain circumstances when using <code>-O4</code> and beyond.
* The compiler seems to ignore both <code>-mfsrra</code> and <code>-mfsca</code> without the <code>-ffast-math</code> and <code>-m4-single-only</code> options.
* It is highly recommended that C code is written to use <code>-mfsrra</code> (1.0/sqrt(N)) and <code>-mfsca</code> (builtin sin/cos) over using inline assembly directly, as this seems to give the compiler more context for code optimization around these instructions.
* The <code>__builtin_prefetch</code> intrinsic does seem to generate a single "pref" instruction and should be preferred over inline assembly.
* The compiler does not seem smart enough to utilize the FIPR (inner/dot product), FMAC (multiply and accumulate), or FTRV (transform vector) instructions regardless of how embarrassingly vectorizable the supplied C code seems to be, so linear algebra routines are forced to use inline assembly to fully leverage the SH4's SIMD instructions.
* Typically smaller code sizes and more tightly optimized code are seen with newer versions of GCC versus the older ones; however, this is not always the case.
* Evidently, even without a branch predictor, the C++20 <code><nowiki>[[likely]]</nowiki></code> and <code><nowiki>[[unlikely]]</nowiki></code> attributes as well as the GCC intrinsic <code>__builtin_expect()</code> can have a fairly profound impact on code generation and optimization for conditionals and branches. More information can be found [https://dcemulation.org/phpBB/viewtopic.php?t=106029 here].
* <code>-fipa-pta</code> allows the compiler to analyze pointer and reference usage beyond the scope of the current compiling function, which very often results in pretty decent performance increases at the cost of increased compile times and RAM usage.
* <code>-flto</code> allows GCC to perform optimizations over the entire program and all translation units as a single entity during the linking phase, for the cost of increased compile times and RAM usage. This frequently results in more performant code.
* An in-depth benchmark comparing the run-time performance and compiled binary size output of every toolchain version officially supported by KOS with various optimization levels can be found [https://dcemulation.org/phpBB/viewtopic.php?t=106068 here].

DCWiki:Hardware

2024-01-15T17:47:37Z

GyroVorbis: /* Hardware */

== Hardware ==
{| style="width:100%"
! style="width: 50%"|Console and Peripherals
! style="width: 50%"|Modifications and Repair
|-
| style="padding: 5px;vertical-align:top"|
* [[Hardware overview]], [[VMU hardware overview]]
* [[Hardware variations]]
* [[A/V connectivity]]
* [[G2 bus]]
** [[Modem]], [[Broadband adapter]], [[LAN adapter]], [[Karaoke]], [[Zip drive]], [[G2 Terminator|Terminator]]
* [[Maple bus]]
** [[Controller]], [[Keyboard]], [[Mouse]], [[Arcade stick]], [[Twin stick]], [[Race wheel]], [[Light gun]], [[Fishing rod]], [[Maracas]], [[DreamEye webcam]]
** [[Mission Stick]], [[Panther DC]], [[Densha de Go! controller]], [[Dance mat]], [[Pop'n controller]], [[DreamParaPara Sensors]], [[VCD remote]]
** [[VMU]], [[Memory card]], [[Jump pack]], [[Microphone]]
** [[Aftermarket adapters]]
* [[Serial interface]]
** [[Neo Geo Pocket Color link cable]], [[VS cable]], [[MIDI adapter]], [[Coder's cable]], [[Serial SD card adapter]], [[Touchscreen]]
| style="padding: 5px;vertical-align:top"|
{|
* [[Optical drive replacements]]
* [[DCDigital]] HDMI, [[S/PDIF]], [[Internal VGA]], [[MIDI expansion|MIDI]]
* [[BIOS modification]]
* [[Region change]], [[NTSC/PAL mode enforcement]]
* [[Power supply replacement]], [[Fan replacement]]
* [[DreamPi]]
* [[IDE hard drive modification]]
* [[Overclocking]], [[32MB RAM expansion]]
* [[VMU mods]]
 
* [[GD-ROM drive repair]]
* [[PSU repair]]
* Controller board: [[F1 fuse repair|Fuse repair]], [[battery replacement]]
* [[Case whitening]]
|}
|}

DreamParaPara Sensors

2024-01-15T17:46:14Z

GyroVorbis: Created page with "Sensor controller for DreamParaPara"

[[File:DreamParaPara Controller.jpg|thumb|Sensor controller for DreamParaPara]]

File:DreamParaPara Controller.jpg

2024-01-15T17:45:57Z

GyroVorbis: GyroVorbis uploaded a new version of File:DreamParaPara Controller.jpg

Sensor controller for DreamParaPara

File:DreamParaPara Controller.jpg

2024-01-15T17:43:39Z

GyroVorbis:

Sensor controller for DreamParaPara

Performance 3rd Party VMU

2023-12-05T03:01:24Z

GyroVorbis:

[[File:Performance VMU.jpg|thumb|Performance 3rd Party VMU]]
Contrary to popular belief, there is actually one known third-party Visual Memory Unit: the ultra-rare "Performance" VMU. While it lacks the full standalone GAME mode featured in the first-party VMUs, preventing it from playing minigames, it does provide equivalent functionality while plugged into the controller and even has built-in firmware offering several features when used in standalone mode.

=== Features ===
Maple Functionality (when connected to the controller):
* LCD Display Screen
* Buzzer Tone Generation
* 8-bit FAT Filesystem

Standalone Functionality (when battery-powered):
* Calendar

Unknown Support:
* Can VMU Buttons be accessed while plugged into the controller as with first-party VMUs?

=== Packaging ===
[[File:Front Box of Performance VMU.jpg|thumb|left|Front Packaging of the Performance VMU]]
[[File:Performance VMU Box Rear.jpg|thumb|center|Rear Packaging of the Performance VMU]]

=== Hardware ===
[[File:Performance VMU Insides.png|thumb|center|Disassembled Performance VMU]]

=== Bugs/Extras ===
[[File:Performance VMU Calendar.jpg|thumb|center|Performance VMU Calendar]]

File:Performance VMU Calendar.jpg

2023-12-05T03:00:54Z

GyroVorbis:

Performance VMU Calendar

File:Performance VMU Insides.png

2023-12-05T02:59:06Z

GyroVorbis:

Disassembled Performance VMU

File:Performance VMU Box Rear.jpg

2023-12-05T02:53:42Z

GyroVorbis:

Rear Packaging of the Performance VMU

File:Performance VMU Box Front.jpg

2023-12-05T02:53:26Z

GyroVorbis: GyroVorbis moved page File:Performance VMU Box Front.jpg to File:Performance VMU Box Rear.jpg: Misnamed

#REDIRECT [[File:Performance VMU Box Rear.jpg]]

File:Performance VMU Box Rear.jpg

2023-12-05T02:53:26Z

GyroVorbis: GyroVorbis moved page File:Performance VMU Box Front.jpg to File:Performance VMU Box Rear.jpg: Misnamed

Front Packaging of the Performance VMU

File:Front Box of Performance VMU.jpg

2023-12-05T02:51:46Z

GyroVorbis:

Front Packaging of the Performance VMU

File:Performance VMU Box Rear.jpg

2023-12-05T02:49:33Z

GyroVorbis:

Front Packaging of the Performance VMU

Performance 3rd Party VMU

2023-12-05T02:47:49Z

GyroVorbis: Created page with "Performance 3rd Party VMU Despite popular belief, there is actually one known third-party Visual Memory Unit: the ultra-rare "Performance" VMU. While it lacks the full standalone GAME mode featured in the first-party VMUs, preventing it from playing minigames, it does provide equivalent functionality while plugged into the controller and even has a builtin firmware offering several features when used in standalone mode. === Features ==..."

File:Performance VMU.jpg

2023-12-05T02:35:21Z

GyroVorbis:

Performance 3rd Party VMU

Useful programming tips

2023-10-14T08:21:14Z

GyroVorbis: I'm sorry, but the way this is worded is just not quite correct. He's talking about aspects of the processor, but he's making it sound like modern software PARADIGMS and practices don't apply to DC, when in reality we have C++20 and even all of the async, concurrent crap used in multicore working just fine.

The Dreamcast's CPU, model SH7091, is virtually identical to the Renesas SH7750 series of SH4 CPUs. As such, anything that would normally apply to the SuperH-4 architecture applies here. Given that the SH4 is a processor from 1998, many hardware features that we have grown accustomed to on more recent x86 and ARM64 processors either do not apply or behave much more primitively on the SH4. For example, SH4 does not have branch prediction, speculative execution, or multiple cores, but it does have a 64-bit floating point unit,<ref>It's predominantly used for single-precision operations: it ''can'' do doubles, but that doesn't mean it's a great idea!</ref> a couple of 128-bit vector operations on 4x packed 32-bit floats,<ref>See fipr, ftrv: http://www.shared-ptr.com/sh_insns.html</ref> a memory management unit (MMU), and a direct memory access controller (DMAC).

In truth, at a very basic level the SH4 architecture is fundamentally not that different from these other, more mainstream architectures (in fact, ARM Thumb is based on SuperH<ref>https://lwn.net/Articles/647636/</ref>), so programming on the SH4 does not require much in the way of "re-learning" how to do things, especially since the Dreamcast uses it in little endian mode exclusively. Mainly, SH4 programming just requires paying a lot more attention to things that modern architectures have made very convenient, like data alignment, cache management, and pipelining.

The following page is a collection of programming tips and tricks to help with optimizing programs to make full use of the SH4 CPU. '''''This is not meant to be a substitute for reading the SH7750 series hardware and software manuals,''''' rather it should be seen more as an additional reference based on experiences working with the chip (and, in fact, certain terms and hardware-specific concepts assume familiarity with those manuals). Both manuals, "SH7750, SH7750S, SH7750R Group User's Manual: Hardware" and "SH-4 Software Manual," can be downloaded from the "Documents" section of any SH7750 series processor's product page on Renesas's website: https://www.renesas.com/eu/en/products/microcontrollers-microprocessors/superh/sh7750/sh7750r.html, and the SH4 C ABI on STMicroelectronics's website, "RM0197: SH-4 generic and C specific application binary interface," is incredibly handy, too--search for RM0197: https://www.st.com/content/st_com/en.html.

A very convenient SuperH assembly reference can be found here, as well: http://www.shared-ptr.com/sh_insns.html.

This page refers to the documents as follows:
* SH7750 Hardware Manual: "SH7750, SH7750S, SH7750R Group User's Manual: Hardware"
* SH7750 Software Manual: "SH-4 Software Manual"
* SH4 C ABI: "RM0197: SH-4 generic and C specific application binary interface"

== Alignment ==

(Refer to: SH7750 Hardware Manual, Section 5 "Exceptions")

The SH4 lives and dies by alignment, and very strictly requires data to be aligned according to its type in memory. Crashes will otherwise ensue.

Take the following example, which defines a packed structure aligned to 4 bytes:

<syntaxhighlight lang="c">
typedef struct __attribute__ ((packed, aligned(4))) {
    unsigned char id[4];
    unsigned int address;
    unsigned int size;
    unsigned char data[]; // Flexible array member
} command_t;
</syntaxhighlight>

Accessing data from this struct is simple: each 32-bit field needs just a single 4-byte access. This works because the struct is aligned to 4 bytes.

Doing this:
<syntaxhighlight lang="c">
unsigned int cmd_addr = ntohl(command->address);
unsigned int cmd_size = ntohl(command->size);
</syntaxhighlight>

Produces this output (GCC 9.2.0):
<syntaxhighlight lang="objdump">02e4 862F  mov.l   r8,@-r15
02e6 4365  mov     r4,r5
02e8 962F  mov.l   r9,@-r15
02ea 0C75  add     #12,r5
02ec 224F  sts.l   pr,@-r15
02ee 4159  mov.l   @(4,r4),r9
02f0 4256  mov.l   @(8,r4),r6
02f2 12D0  mov.l   .L71,r0
02f4 9869  swap.b  r9,r9
02f6 6866  swap.b  r6,r6
02f8 11D4  mov.l   .L72,r4
02fa 6966  swap.w  r6,r6
02fc 9969  swap.w  r9,r9
02fe 9869  swap.b  r9,r9
0300 6868  swap.b  r6,r8</syntaxhighlight>

But what if it weren't aligned to 4 bytes? Just this:
<syntaxhighlight lang="c">
typedef struct __attribute__ ((packed)) {
    unsigned char id[4];
    unsigned int address;
    unsigned int size;
    unsigned char data[]; // Flexible array member
} command_t;
</syntaxhighlight>

Accessing the data looks like this, in that case (GCC 9.2.0):
<syntaxhighlight lang="objdump">03f0 862F  mov.l   r8,@-r15
03f2 4365  mov     r4,r5
03f4 962F  mov.l   r9,@-r15
03f6 0C75  add     #12,r5
03f8 224F  sts.l   pr,@-r15
03fa 4484  mov.b   @(4,r4),r0
03fc 0C63  extu.b  r0,r3
03fe 4584  mov.b   @(5,r4),r0
0400 0C61  extu.b  r0,r1
0402 4684  mov.b   @(6,r4),r0
0404 1861  swap.b  r1,r1
0406 0C60  extu.b  r0,r0
0408 3B21  or      r3,r1
040a 2840  shll16  r0
040c 0B21  or      r0,r1
040e 4784  mov.b   @(7,r4),r0
0410 2840  shll16  r0
0412 1840  shll8   r0
0414 1B20  or      r1,r0
0416 0869  swap.b  r0,r9
0418 4884  mov.b   @(8,r4),r0
041a 9969  swap.w  r9,r9
041c 0C63  extu.b  r0,r3
041e 4984  mov.b   @(9,r4),r0
0420 9869  swap.b  r9,r9
0422 0C62  extu.b  r0,r2
0424 4A84  mov.b   @(10,r4),r0
0426 2862  swap.b  r2,r2
0428 0C60  extu.b  r0,r0
042a 3B22  or      r3,r2
042c 2840  shll16  r0
042e 0B22  or      r0,r2
0430 4B84  mov.b   @(11,r4),r0
0432 12D4  mov.l   .L73,r4
0434 2840  shll16  r0
0436 1840  shll8   r0
0438 2B20  or      r2,r0
043a 0860  swap.b  r0,r0
043c 0961  swap.w  r0,r1</syntaxhighlight>

All of this is just from this simple operation:
<syntaxhighlight lang="c">
unsigned int cmd_addr = ntohl(command->address);
unsigned int cmd_size = ntohl(command->size);
</syntaxhighlight>

What's going on here is that GCC is avoiding the address alignment crash that would occur from accessing 1-byte-aligned data with a 4-byte load. Because struct packing drops the alignment to 1 byte, GCC has to assume that any field may be misaligned, so it builds each unsigned 4-byte integer from 1-byte accesses using the following process:
<syntaxhighlight lang="asm">
! byte 0: mov.b, extu.b (zero-extend)
! byte 1: mov.b, extu.b, swap.b (shift left by 8), or
! byte 2: mov.b, extu.b, shll16, or
! byte 3: mov.b, shll16 + shll8 (shift left by 24), or
</syntaxhighlight>

(Note: all the swap instructions come from ntohl, as this code is from a network driver that needs to byte swap data after receiving a network transmission.)

Considering that 1x mov.b takes the same amount of time as 1x mov.l, plus all the other operations that must be done to build the 4-byte data out of 1-byte accesses, it's easy to see how big the performance hit from mismanaging alignment can be!
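
When data really can arrive unaligned (for example, a field in the middle of a received packet), the byte-by-byte assembly that GCC generated above can also be written out explicitly. The following is a minimal sketch, not code from the driver discussed here: the helper names <code>is_aligned</code> and <code>load_u32_le_bytes</code> are hypothetical, and the loader assumes the bytes are stored in little-endian order.

<syntaxhighlight lang="c">
#include <stdint.h>

/* Hypothetical helper: nonzero if ptr is aligned to 'align' bytes.
 * 'align' must be a power of two. */
static inline int is_aligned(const void *ptr, uintptr_t align)
{
    return ((uintptr_t)ptr & (align - 1)) == 0;
}

/* Hypothetical helper: build a 32-bit value from four single-byte loads,
 * mirroring the mov.b / zero-extend / shift / or sequence shown above.
 * Safe at any address because it never performs a 4-byte access. */
static inline uint32_t load_u32_le_bytes(const unsigned char *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
</syntaxhighlight>

A check like <code>is_aligned(ptr, 4)</code> can be used in debug builds to catch pointers that are about to be dereferenced as 32-bit values, while the byte loader is reserved for the places where misalignment is genuinely unavoidable.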

== Cache Management ==

(Refer to: SH7750 Hardware Manual, Section 4 "Caches" and SH7750 Software Manual, Section 9 "Instruction Descriptions")

Unlike modern processors, where caches are several megabytes in size and can therefore hold entire programs, the SH4 in the Dreamcast only has a 16kB data cache and 8kB instruction cache. Consequently, cache management is very important in order to achieve maximum performance. As is always true of cache optimization, write-back memory mode is required to make much use of it (it's used everywhere by default in [[KallistiOS]] and enabled for P0/U0/P3--but not P1--in [[DreamHAL]]'s startup file).

Half of the data cache can also be used as a form of high-speed RAM (referred to as OCRAM), but in most cases programs should stick to using the full cache size for cache purposes. The SH4 uses a direct-mapped cache, meaning that every address maps to exactly one cache entry and addresses that are 16kB apart compete for that same entry (half that distance when used in OCRAM mode). Cache thrashing can therefore happen when, for example, copying data from a source address to a destination address that is an integer multiple of the cache size away from it (e.g. source = address offset 8 and destination = address 16kB + 8).
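
The collision described above can be checked with a little arithmetic. This sketch assumes the full 16kB operand cache with 32-byte cache lines (512 entries) and ignores the OCRAM and index modes described in the hardware manual; the function names are made up for illustration.

<syntaxhighlight lang="c">
#include <stdint.h>

#define OC_LINE_SIZE   32u                        /* bytes per cache line */
#define OC_SIZE        (16u * 1024u)              /* full data cache, no OCRAM */
#define OC_NUM_ENTRIES (OC_SIZE / OC_LINE_SIZE)   /* 512 entries */

/* Hypothetical helper: which cache entry an address maps to. */
static inline uint32_t oc_entry_index(uint32_t addr)
{
    return (addr / OC_LINE_SIZE) % OC_NUM_ENTRIES;
}

/* Hypothetical helper: nonzero if two addresses sit in different memory
 * lines that fight over the same cache entry. */
static inline int oc_entries_collide(uint32_t a, uint32_t b)
{
    return oc_entry_index(a) == oc_entry_index(b)
        && (a / OC_LINE_SIZE) != (b / OC_LINE_SIZE);
}
</syntaxhighlight>

With these definitions, <code>oc_entries_collide(8, 16 * 1024 + 8)</code> is true, matching the example in the text, while offsetting the destination by one additional cache line (32 bytes) avoids the conflict.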

The SH4 provides the following instructions in addition to the two 32-byte "store queues" (SQs) to make efficient use of the cache:
* '''movca.l:''' Store register data to cache, if there's a cache miss just allocate a cache block and write to it without first reading that cache block from memory
* '''ocbp:''' Purge cache block; write back cache block and invalidate it
* '''ocbi:''' Invalidate cache block without writing it back
* '''ocbwb:''' Write cache block back to external memory, and keep it in the cache
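
As an illustration of how these instructions are typically driven from C, here is a rough sketch that writes back every cache line covering a buffer using '''ocbwb''', for example before handing the buffer to a DMA transfer. It assumes GCC-style inline assembly, 32-byte cache lines, and that the compiler defines <code>__SH4__</code> when targeting the SH4; the function names are hypothetical, and on other targets the instruction is compiled out so that the loop logic can still be tested on a PC.

<syntaxhighlight lang="c">
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u  /* assumed SH4 cache line size in bytes */

/* Write one cache line back to memory (no-op when not building for SH4). */
static inline void ocbwb_line(const void *line)
{
#if defined(__SH4__)
    __asm__ __volatile__("ocbwb @%0" : : "r"(line) : "memory");
#else
    (void)line;
#endif
}

/* Hypothetical helper: write back all cache lines overlapping
 * [start, start + len). Returns the number of lines visited. */
static size_t writeback_range(const void *start, size_t len)
{
    uintptr_t p   = (uintptr_t)start & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)start + len;
    size_t lines  = 0;

    for (; p < end; p += CACHE_LINE) {
        ocbwb_line((const void *)p);
        lines++;
    }
    return lines;
}
</syntaxhighlight>

Swapping '''ocbwb''' for '''ocbp''' or '''ocbi''' in the same loop gives the purge and invalidate variants. Note that a buffer which does not start on a 32-byte boundary touches one more line than its length alone would suggest, which is another reason to keep DMA buffers cache-line aligned.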

== C Function Register Allocation ==

(Refer to: SH4 C ABI)

(Note: when using GCC 9.x at various optimization levels, like -O3, it tries its best to coalesce output code into this format wherever it can. Of course, if GCC is able to inline a function, parameter-passing becomes a moot point.)

The SH4 C ABI specifies that 4 integers (r4-r7) and 8 floats (fr4-fr11) can be passed in registers as function call arguments, and that r0-r3 and fr0-fr3 are also call-clobbered. Passing arguments in registers means that functions can take 4 integers and 8 floats without forcing arguments to be pushed on the stack, saving the cycle penalties that would otherwise occur from stack pushes and associated memory accesses. The call-clobbering of r0-r3 and fr0-fr3 means that those can be used as 4 integer local variables and 4 float variables, as well. Additionally, any of these registers not used for parameters can be repurposed as local variables, so if one only needs to pass in 4 floats to a function, one can then define 4 more local variable floats on top of the 4 we get from fr0-fr3 and they will just use the unused registers.
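
For instance, a function taking exactly eight floats fits entirely within fr4-fr11. The sketch below is only an illustration (the function name is made up), and whether the arguments really stay in registers depends on building with a hardware floating-point ABI, such as GCC's <code>-m4-single</code> family of options, and on the function not being inlined.

<syntaxhighlight lang="c">
/* Eight float parameters: under the SH4 C ABI these are expected to arrive
 * in fr4-fr11, so nothing needs to be pushed onto the stack.
 * A ninth float parameter would spill to the stack instead. */
float dot4(float ax, float ay, float az, float aw,
           float bx, float by, float bz, float bw)
{
    /* Any temporaries can live in the call-clobbered fr0-fr3. */
    return ax * bx + ay * by + az * bz + aw * bw;
}
</syntaxhighlight>

Inspecting the generated assembly (for example with objdump) is the surest way to confirm that no stack traffic was introduced for a given signature.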

== Pipelining and Instruction-Level Parallelism ==

(Refer to: SH7750 Hardware Manual, Section 8 "Pipelining")

This section is really only relevant when writing assembly. If you write code in a high-level language like C/C++, compilers [try to] take care of this for you and there isn't much you can do about it.

Because of the SH4's dual-issue superscalar design, the CPU preloads two instructions at once, and under the right circumstances these instructions can be executed in parallel. The SH4 architecture organizes various instructions into "instruction groups," and parallel execution primarily occurs when two instructions of different groups are issued together. There are a variety of special cases to this rule of thumb, however, and more advanced code can be structured to take advantage of these properties.

For example, if the two instructions are of different groups but have a dependency chain, the second instruction will stall into the next cycle, and there is also the fact that CO group instructions do not parallelize with anything. Conversely, there are special cases like 0-cycle instructions that can execute in parallel ''despite'' having dependency chains (e.g. a "mov Rn, Rm" followed by an "add #imm8, Rm"), and MT group instructions that can parallelize with other MT group instructions (unless there's a non-special-case dependency chain).
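
The basic pairing rule can be modeled in a few lines of C. This is a simplified sketch that encodes only the group rule of thumb described above, using the group names from the hardware manual (MT, EX, BR, LS, FE, CO); it deliberately ignores dependency stalls and the 0-cycle special cases, and the type and function names are hypothetical.

<syntaxhighlight lang="c">
/* Instruction groups as named in the SH7750 hardware manual. */
typedef enum { GRP_MT, GRP_EX, GRP_BR, GRP_LS, GRP_FE, GRP_CO } sh4_group_t;

/* Hypothetical helper: nonzero if two adjacent instructions are candidates
 * for parallel issue based on their groups alone. */
static int can_dual_issue(sh4_group_t first, sh4_group_t second)
{
    if (first == GRP_CO || second == GRP_CO)
        return 0;               /* CO never pairs with anything */
    if (first == GRP_MT && second == GRP_MT)
        return 1;               /* MT is the one group that pairs with itself */
    return first != second;     /* otherwise the groups must differ */
}
</syntaxhighlight>

A check like this is handy when hand-scheduling a loop: walking the instruction list pairwise quickly shows where two instructions of the same group (for example two LS loads) are wasting an issue slot.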

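Appropriate usage of instruction-level parallelism is the only way to achieve >200 MIPS (millions of instructions per second) on a 200MHz SH4: a single instruction per cycle tops out at 200 MIPS, while filling both issue slots every cycle gives a theoretical peak of 2 x 200 = 400 MIPS.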

== References ==

File:VMU Dev Lesson 1.png

2023-07-28T11:55:04Z

GyroVorbis:

Screenshot from Marble_Grainite's first VMU Dev Lesson

VMU emulators

2023-07-13T11:27:27Z

GyroVorbis: /* ElysianVMU */

VMU Emulators are software applications that allow you to run and enjoy VMU ROMs, games, applications, and animations without needing the actual VMU hardware.

=ElysianVMU=
ElysianVMU is a cross-platform feature-rich emulator being developed alongside the Elysian Shadows Toolkit, with the goal of bringing Dreamcast-exclusive VMU content to all platforms supported by their engine. The team has decided to release the emulator to the Dreamcast scene. It serves as a gaming platform, VMU filesystem manager, and even includes development tools targeted at helping developers write custom VMU software.
* Official Page : [https://evmu.elysianshadows.com https://evmu.elysianshadows.com]
* Core Source Code : [https://github.com/gyrovorbis/libevmu GitHub]
* Documentation : [http://vmu.elysianshadows.com/index.html http://vmu.elysianshadows.com/index.html]
* Developed by : [[User:GyroVorbis|Falco Girgis]], Elysian Shadows Team
* Status : Active
* Compatibility: Very Good
* Platform(s) :
** Full Support: Windows, MacOS, Linux, Web, PSP, Raspberry Pi
** Partial Support: GameCube
** Future Support: iOS, Android, Dreamcast
* Features
** Supported File Formats:
*** ROM Images: .VMI/.VMS, .DCI
*** Flash Images: .DCM, .VMU
*** Other: .BIN (bios), .LCD (VMU Animator)
** Emulation
*** Save/Load State
*** Japanese + US Bios Support
*** Accurate Audio
*** Gamepad/Joystick Support
*** Analog Stick Support
*** Fullscreen and Pixel-Perfect Scaling Modes
*** Physically Accurate LCD Emulation (pixel ghosting, emulated grayscale)
*** Fast-Forward
*** Low-battery Emulation
*** Serial Communications
**** VMU-to-DC (using Maple over TCP/IP)
***** ESTk/Elysian Shadows Engine support (full)
***** DC Emulator support (WIP, pending someone willing to collaborate)
**** VMU-to-VMU
***** TCP/IP (partial support, WIP)
***** Serial/GPIO pins (Raspberry Pi only, WIP)
** Filesystem Tools
*** File Filesystem Manager
*** VMU Animator File Playback
*** VMU Icon Ripping
*** VMU EyeCatch Ripping
*** Framebuffer Screenshot Capture
*** Record to Animated GIF
*** Convert/Export between file formats
*** Modify Volume Icon + Color
*** Jet Set (Grind) Radio Custom Graffiti Tool (WIP)
*** Lock/Unlock Extra Blocks
*** Defragmenter
*** File Checking/Repair/Debugging
** Developer Tools
*** Frame-by-Frame Execution
*** Real-Time Memory Browser and Hex Editor (RAM/Flash)
*** Invalid Hardware Operation/Warning Log
*** Buzzer Tool (for audio composition/debugging) (WIP)

=SoftVMS=
The original VMU emulator, written by the man who helped to reverse engineer the platform and kickstart the homebrew scene.
* Official Page : [https://www.zophar.net/consoles/dreamcast/vms/softvms.html https://www.zophar.net/consoles/dreamcast/vms/softvms.html]
* Developer : Marcus Comstedt
* Status : Inactive
* Compatibility: Good
* Platform(s) : Windows

=VeMUlator PRO=
* Google Play : [https://apkhome.net/vemulator-pro-dreamcast-vmu-emulator-0-7/]
* Developer : MJaoune Software
* Compatibility: Unknown
* Platform(s) : Android

=Visual Memory Emulator=
* Google Play : [https://play.google.com/store/apps/details?id=com.nuritsubushi.vmemu]
* Developer : Kum
* Compatibility : Unknown
* Platform(s) : Android

=DirectVMS=
* Developer : [http://www.fallenrealm.com/directvms/index.html Fallen Realm]
* Platform(s) : Windows
* Latest release: 1.8, 09/24/2000
* Download [http://www.dcemulation.org/files/pcemu/DirectVMUexec.zip Binary], [http://www.dcemulation.org/files/pcemu/DirectVMUsource.zip Source Code]

File:Dead or alive2.gif

2023-06-23T18:28:09Z

GyroVorbis:

Fancy Effects from Dead or Alive 2

Development

2023-05-26T20:13:13Z

GyroVorbis:

=== Getting started ===
* [[Getting Started with Dreamcast development]] -- start here!
====Ready-to-use environments====
* [[Docker images]]
* [[DreamSDK]] (Windows only)
====[[Building the required toolchains for Sega Dreamcast development]]====

====[[KallistiOS]]====
* [[Building KOS on Linux mint (or Ubuntu)]]
* [[Building KOS under Windows Subsystem for Linux (Windows 10 only)]]
* [[Building KOS on macOS]]
* [[Building KOS on Cygwin]]
* [[Building KOS on MinGW/MSYS]]
* [[Building KOS on MinGW-w64/MSYS2]]
* [https://kos-docs.dreamcast.wiki/ KallistiOS Doxygen documentation]

====Other====
* [[Using Ruby for Sega Dreamcast development]] (experimental)

=== Build & test ===
* [[Building your project]]
* [[Emulators]]
* [[Broadband adapter]] / [[LAN adapter]]
** [[Using dcload-ip with Linux]]
** [[Using dcload-ip with Windows Subsystem for Linux|Using dcload-ip with Windows 10]] (via Windows Subsystem for Linux)
* [[Coder's cable]]

=== Environments and IDEs ===
* [[Qt Creator Dreamcast Development Environment]]
* [[CLion Debugging]]
* [[Visual Studio Code Debugging]]

=== Tools & utilities ===
* [[Debugging throught GNU Debugger (GDB) and dcload/dc-tool]]
* [[Using dcprof]]

=== Releasing your project ===
* Plain files
* Disc image
* Selfboot Inducer package

=== Engines ===
* [[Simulant]]
** [[Windows WSL2 Setup]]
** [[Generate profiling data]]

=== General ===
* [[Filesystem]]
* [[Romdisk Swapping]]
* [https://mc.pp.se/dc/hw.html Marcus Comstedt's Dreamcast Hardware Reference]

=== Graphics ===
* [[Texture Formats]]
* [[Graphics APIs]]
* [[Paletted Textures]]
* [[2D Rendering Without PVR]]
* [[Twiddling]]

* PVR
** [[PowerVR Introduction]]
** [[PVR Spritesheets]]
* [[GLdc]]
** [[Drawing 2D sprites using GLdc]]
** [[Drawing 3D shapes using GLdc]]
** [https://hkowsoftware.com/articles/gldc-vertex-formats-from-vec3f-to-fastpath-to-map_buffer/ GLdc Vertex Formats: From vec3f to fastpath to map_buffer]
* Others
** [http://www.numechanix.com/blog/index.php/2015/10/03/20/ Procedural texture]
** [[Notes on fillrate and drawing large textures]]
** [[KMG Textures]]
** [[Loading PNG images as OpenGL textures]]

=== Audio ===
* [[Playing SFX]]
* [[Streaming audio]]

=== Maple ===
* Controller input

=== VMU ===
* [[File Types]]
* [[Save/Load file]]
* [[Show icon]]
* [[Play tone]]
* [[VMU_development|Game Development]]

=== Optimization ===
* [[GCC-SH4 tips]]
* [[SH4 in Compiler Explorer]]
* [[Fast SH4 Vertex Processing]]
* [[Useful programming tips]]
* [[Efficient usage of the Dreamcast RAM]]
* Registers
* DMA
* TA
* PVR
=== Website Development ===
*[[Development Resources]]

=== Random Snippets ===
* [[Objdump]]

User:GyroVorbis

2023-05-26T01:30:53Z

GyroVorbis:

My name is Falco Girgis.

I am the infamous lead developer of Elysian Shadows, ESTk, and ElysianVMU. I am widely hated for being late on Kickstarter, but THE PROJECT IS MY LIFE, AND I'M WORKING ON IT and fully intend to make it up to all of you. :)

I've recently begun to atone for my sins by giving back to the Dreamcast community as a KOS developer and writing content here for this wiki. Feel free to come hang out with me on Discord or follow me on Twitter or something. I'm happy to help!

* GitHub: https://github.com/gyrovorbis
* Discord: https://discord.gg/SX2txgr
* Twitter: https://twitter.com/falco_girgis
* LinkedIn: https://www.linkedin.com/feed/
