i386 (x86-32) machine code, 8 bytes (9B for unsigned)
amd64 (x86-64) machine code, 9 bytes (10B for unsigned, or 14B 13B for 64b integers signed or unsigned)
Inputs are 32bit non-zero signed integers in eax and ecx. Output in eax.
## 32bit code, signed integers: eax, ecx
08048420 <gcd0>:
8048420: 99 cdq ; shorter than xor edx,edx
8048421: f7 f9 idiv ecx
8048423: 92 xchg edx,eax ; there's a one-byte encoding for xchg eax,r32. So this is shorter but slower than a mov
8048424: 91 xchg ecx,eax ; eax = divisor(from ecx), ecx = remainder(from edx), edx = quotient(from eax) which we discard
8048425: 41 inc ecx ; saves 1B vs. test/jnz in 32bit mode
8048426: e2 f8 loop 8048420 <gcd0>
; result in eax: gcd(a,0) = a
Mike Shlanta's answer was the starting point for this. My loop does the same thing as his, but for signed integers because cdq is one byter shorter than xor edx,edx. And yes, it does work correctly with one or both inputs negative. Mike's version will run faster and take less space in the uop cache (xchg is 3 uops on Intel CPUs, and loop is really slow on most CPUs), but this version wins at machine-code size.
edit: oops, I didn't notice the question required unsigned 32bit. Going back to xor edx,edx instead of cdq would cost one byte. div is the same size as idiv, and everything else can stay the same (xchg for data movement and inc/loop still work.)
Interestingly, for 64bit operand-size (rax and rcx), signed and unsigned versions are the same size. The signed version needs a REX prefix for cqo, so it's 2B, but the unsigned version can still use 2B xor edx,edx.
In 64bit code, inc ecx is 2B: the single-byte inc r32 and dec r32 opcodes were repurposed as REX prefixes. inc/loop doesn't save any code-size in 64bit mode, so you might as well test/jnz. Operating on 64bit integers adds another one byte per instruction in REX prefixes, except for loop or jnz. It's possible for the remainder to have all zeros in the low 32b (e.g. gcd((2^32), (2^32 + 1))), so we need to test the whole rcx and can't save a byte with test ecx,ecx. However, the slower jrcxz insn is only 2B:
## 64bit code, unsigned 64 integers: rax, rcx
0000000000400610 <gcd_u64>:
400610: 31 d2 xor edx,edx ; same length as cqo
400612: 48 f7 f1 div rcx
400615: 48 92 xchg rdx,rax
400617: 48 91 xchg rcx,rax
400619: e3 02 jrcxz 40061d <gcd_u64_end> ; saves 1B over test rcx,rcx (3B) / jnz (2B)
40061b: eb f3 jmp 400610 <gcd_u64>
000000000040061d <gcd_u64_end>:
## 0xD = 13 bytes of code
## result in rax: gcd(a,0) = a
Full runnable test program including a main that runs printf("...", gcd(atoi(argv[1]), atoi(argv[2])) ); source and asm output on the Godbolt Compiler Explorer, for the 32 and 64b versions. Tested and working for 32bit (-m32), 64bit (-m64), and the x32 ABI (-mx32).
GNU C inline asm source for the 32bit version (compile with gcc -m32 -masm=intel)
int gcd(int a, int b) {
asm (// ".intel_syntax noprefix\n"
".p2align 4\n" // align to make size-counting easier
"gcd0: cdq\n\t" // sign extend eax into edx:eax. One byte shorter than xor edx,edx
" idiv ecx\n"
" xchg eax, edx\n" // there's a one-byte encoding for xchg eax,r32. So this is shorter but slower than a mov
" xchg eax, ecx\n" // eax = divisor(ecx), ecx = remainder(edx), edx = garbage that we will clear later
" inc ecx\n" // saves 1B vs. test/jnz in 32bit mode, none in 64b mode
" loop gcd0\n"
"gcd0_end:\n"
: /* outputs */ "+a" (a), "+c"(b)
: /* inputs */ // given as read-write outputs
: /* clobbers */ "edx"
);
return a;
}
Normally I'd write a whole function in asm, but GNU C inline asm seems to be the best way to include a snippet which can have in/outputs in whatever regs we choose. As you can see, GNU C inline asm syntax makes asm ugly and noisy. It's also a really difficult way to learn asm.
It would actually compile and work in .att_syntax noprefix mode, because all the insns used are either single/no operand or xchg. Not really a useful observation.