.section #gk110_builtin_code
// DIV U32
//
// UNR recurrence (q = a / b):
// look for z such that 2^32 - b <= b * z < 2^32
// then q - 1 <= (a * z) / 2^32 <= q
//
// INPUT: $r0: dividend, $r1: divisor
// OUTPUT: $r0: result, $r1: modulus
// CLOBBER: $r2 - $r3, $p0 - $p1
// SIZE: 22 / 14 * 8 bytes
//
gk110_div_u32:
   sched 0x28 0x04 0x28 0x04 0x28 0x28 0x28
   bfind u32 $r2 $r1
   xor b32 $r2 $r2 0x1f
   mov b32 $r3 0x1
   shl b32 $r2 $r3 clamp $r2
   cvt u32 $r1 neg u32 $r1
   mul $r3 u32 $r1 u32 $r2
   add $r2 (mul high u32 $r2 u32 $r3) $r2
   sched 0x28 0x28 0x28 0x28 0x28 0x28 0x28
   mul $r3 u32 $r1 u32 $r2
   add $r2 (mul high u32 $r2 u32 $r3) $r2
   mul $r3 u32 $r1 u32 $r2
   add $r2 (mul high u32 $r2 u32 $r3) $r2
   mul $r3 u32 $r1 u32 $r2
   add $r2 (mul high u32 $r2 u32 $r3) $r2
   mul $r3 u32 $r1 u32 $r2
   sched 0x04 0x28 0x04 0x28 0x28 0x2c 0x04
   add $r2 (mul high u32 $r2 u32 $r3) $r2
   mov b32 $r3 $r0
   mul high $r0 u32 $r0 u32 $r2
   cvt u32 $r2 neg u32 $r1
   add $r1 (mul u32 $r1 u32 $r0) $r3
   set $p0 0x1 ge u32 $r1 $r2
   $p0 sub b32 $r1 $r1 $r2
   sched 0x28 0x2c 0x04 0x20 0x2e 0x28 0x20
   $p0 add b32 $r0 $r0 0x1
   $p0 set $p0 0x1 ge u32 $r1 $r2
   $p0 sub b32 $r1 $r1 $r2
   $p0 add b32 $r0 $r0 0x1
   ret

// DIV S32, like DIV U32 after taking ABS(inputs)
//
// INPUT: $r0: dividend, $r1: divisor
// OUTPUT: $r0: result, $r1: modulus
// CLOBBER: $r2 - $r3, $p0 - $p3
//
gk110_div_s32:
   // $p2: dividend is negative; $p3: input signs differ
   set $p2 0x1 lt s32 $r0 0x0
   set $p3 0x1 lt s32 $r1 0x0 xor $p2
   sched 0x20 0x28 0x28 0x04 0x28 0x04 0x28
   cvt s32 $r0 abs s32 $r0
   cvt s32 $r1 abs s32 $r1
   bfind u32 $r2 $r1
   xor b32 $r2 $r2 0x1f
   mov b32 $r3 0x1
   shl b32 $r2 $r3 clamp $r2
   cvt u32 $r1 neg u32 $r1
   sched 0x28 0x28 0x28 0x28 0x28 0x28 0x28
   mul $r3 u32 $r1 u32 $r2
   add $r2 (mul high u32 $r2 u32 $r3) $r2
   mul $r3 u32 $r1 u32 $r2
   add $r2 (mul high u32 $r2 u32 $r3) $r2
   mul $r3 u32 $r1 u32 $r2
   add $r2 (mul high u32 $r2 u32 $r3) $r2
   mul $r3 u32 $r1 u32 $r2
   sched 0x28 0x28 0x04 0x28 0x04 0x28 0x28
   add $r2 (mul high u32 $r2 u32 $r3) $r2
   mul $r3 u32 $r1 u32 $r2
   add $r2 (mul high u32 $r2 u32 $r3) $r2
   mov b32 $r3 $r0
   mul high $r0 u32 $r0 u32 $r2
   cvt u32 $r2 neg u32 $r1
   add $r1 (mul u32 $r1 u32 $r0) $r3
   sched 0x2c 0x04 0x28 0x2c 0x04 0x28 0x20
   set $p0 0x1 ge u32 $r1 $r2
   $p0 sub b32 $r1 $r1 $r2
   $p0 add b32 $r0 $r0 0x1
   $p0 set $p0 0x1 ge u32 $r1 $r2
   $p0 sub b32 $r1 $r1 $r2
   $p0 add b32 $r0 $r0 0x1
   // Negate the result if $p3 is set, and the modulus if $p2 is set
   $p3 cvt s32 $r0 neg s32 $r0
   sched 0x04 0x2e 0x28 0x04 0x28 0x28 0x28
   $p2 cvt s32 $r1 neg s32 $r1
   ret

// RCP F64
//
// INPUT: $r0d
// OUTPUT: $r0d
// CLOBBER: $r2 - $r9, $p0
//
// The core of the RCP and RSQ implementations is the Newton-Raphson step,
// which is used to find successively better approximations from an imprecise
// initial value (single precision rcp in RCP and rsqrt64h in RSQ).
//
gk110_rcp_f64:
   // Step 1: classify input according to exponent and value, and calculate
   // result for 0/inf/nan. $r2 holds the exponent value, which starts at
   // bit 52 (bit 20 of the upper half) and is 11 bits in length
   ext u32 $r2 $r1 0xb14
   add b32 $r3 $r2 0xffffffff
   joinat #rcp_rejoin
   // We want to check whether the exponent is 0 or 0x7ff (i.e. NaN, inf,
   // denorm, or 0).
   // Do this by subtracting 1 from the exponent, which will
   // mean that it's > 0x7fd in those cases when doing unsigned comparison
   set b32 $p0 0x1 gt u32 $r3 0x7fd
   // $r3: 0 for norms, 0x36 for denorms, -1 for others
   mov b32 $r3 0x0
   sched 0x2f 0x04 0x2d 0x2b 0x2f 0x28 0x28
   join (not $p0) nop
   // Process all special values: NaN, inf, denorm, 0
   mov b32 $r3 0xffffffff
   // A number is NaN if its abs value is greater than or unordered with inf
   set $p0 0x1 gtu f64 abs $r0d 0x7ff0000000000000
   (not $p0) bra #rcp_inf_or_denorm_or_zero
   // NaN -> NaN, the next line sets the "quiet" bit of the result.
   // This behavior is seen both on the CPU and in the blob
   join or b32 $r1 $r1 0x80000
rcp_inf_or_denorm_or_zero:
   and b32 $r4 $r1 0x7ff00000
   // Other values with a nonzero exponent field should be inf
   set b32 $p0 0x1 eq s32 $r4 0x0
   sched 0x2b 0x04 0x2f 0x2d 0x2b 0x2f 0x20
   $p0 bra #rcp_denorm_or_zero
   // +/-Inf -> +/-0
   xor b32 $r1 $r1 0x7ff00000
   join mov b32 $r0 0x0
rcp_denorm_or_zero:
   set $p0 0x1 gtu f64 abs $r0d 0x0
   $p0 bra #rcp_denorm
   // +/-0 -> +/-Inf
   join or b32 $r1 $r1 0x7ff00000
rcp_denorm:
   // non-0 denorms: multiply with 2^54 (the 0x36 in $r3), join with norms
   mul rn f64 $r0d $r0d 0x4350000000000000
   sched 0x2f 0x28 0x2b 0x28 0x28 0x04 0x28
   join mov b32 $r3 0x36
rcp_rejoin:
   // All numbers with -1 in $r3 have their result ready in $r0d, so return
   // them; the others need further calculation
   set b32 $p0 0x1 lt s32 $r3 0x0
   $p0 bra #rcp_end
   // Step 2: Before the real calculation goes on, renormalize the values to
   // range [1, 2) by setting the exponent field to 0x3ff (the exponent of 1),
   // with the result in $r6d. The exponent will be recovered later.
   ext u32 $r2 $r1 0xb14
   and b32 $r7 $r1 0x800fffff
   add b32 $r7 $r7 0x3ff00000
   mov b32 $r6 $r0
   sched 0x2b 0x04 0x28 0x28 0x2a 0x2b 0x2e
   // Step 3: Convert new value to float (no overflow will occur due to step
   // 2), calculate rcp and do newton-raphson step once
   cvt rz f32 $r5 f64 $r6d
   rcp f32 $r4 $r5
   mov b32 $r0 0xbf800000
   fma rn f32 $r5 $r4 $r5 $r0
   fma rn f32 $r0 neg $r4 $r5 $r4
   // Step 4: convert result $r0 back to double, do newton-raphson steps
   cvt f64 $r0d f32 $r0
   cvt f64 $r6d f64 neg $r6d
   sched 0x2e 0x29 0x29 0x29 0x29 0x29 0x29
   cvt f64 $r8d f32 0x3f800000
   // 4 Newton-Raphson Steps, tmp in $r4d, result in $r0d
   // The formula used here (and above) is:
   //     RCP_{n + 1} = 2 * RCP_{n} - x * RCP_{n} * RCP_{n}
   // The following code uses 2 FMAs for each step, and it basically
   // looks like:
   //     tmp = -src * RCP_{n} + 1
   //     RCP_{n + 1} = RCP_{n} * tmp + RCP_{n}
   fma rn f64 $r4d $r6d $r0d $r8d
   fma rn f64 $r0d $r0d $r4d $r0d
   fma rn f64 $r4d $r6d $r0d $r8d
   fma rn f64 $r0d $r0d $r4d $r0d
   fma rn f64 $r4d $r6d $r0d $r8d
   fma rn f64 $r0d $r0d $r4d $r0d
   sched 0x29 0x20 0x28 0x28 0x28 0x28 0x28
   fma rn f64 $r4d $r6d $r0d $r8d
   fma rn f64 $r0d $r0d $r4d $r0d
   // Step 5: Exponent recovery and final processing
   // The exponent is recovered by adding back the delta we applied to the
   // exponent. Suppose we want to calculate rcp(x), but we have rcp(cx), then
   //     rcp(x) = c * rcp(cx)
   // The delta in exponent comes from two sources:
   //   1) The renormalization in step 2.
   //      The delta is:
   //      0x3ff - $r2
   //   2) (For the denorm input) The 2^54 we multiplied at rcp_denorm, stored
   //      in $r3
   // These 2 sources are calculated in the first two lines below, and then
   // added to the exponent extracted from the result above.
   // Note that after processing, the new exponent may be >= 0x7ff (inf)
   // or <= 0 (denorm). Those cases are handled separately below
   subr b32 $r2 $r2 0x3ff
   add b32 $r4 $r2 $r3
   ext u32 $r3 $r1 0xb14
   // New exponent in $r3
   add b32 $r3 $r3 $r4
   add b32 $r2 $r3 0xffffffff
   sched 0x28 0x2b 0x28 0x2b 0x28 0x28 0x2b
   // (exponent-1) < 0x7fe (unsigned) means the result is in norm range
   // (same logic as in step 1)
   set b32 $p0 0x1 lt u32 $r2 0x7fe
   (not $p0) bra #rcp_result_inf_or_denorm
   // Norms: convert exponents back and return
   shl b32 $r4 $r4 clamp 0x14
   add b32 $r1 $r4 $r1
   bra #rcp_end
rcp_result_inf_or_denorm:
   // New exponent >= 0x7ff means that result is inf
   set b32 $p0 0x1 ge s32 $r3 0x7ff
   (not $p0) bra #rcp_result_denorm
   sched 0x20 0x25 0x28 0x2b 0x23 0x25 0x2f
   // Infinity
   and b32 $r1 $r1 0x80000000
   mov b32 $r0 0x0
   add b32 $r1 $r1 0x7ff00000
   bra #rcp_end
rcp_result_denorm:
   // Denorm result comes from huge input. The greatest possible fp64, i.e.
   // 0x7fefffffffffffff's rcp is 0x0004000000000000, 1/4 of the smallest
   // normal value. Other rcp result should be greater than that. If we
   // set the exponent field to 1, we can recover the result by multiplying
   // it with 1/2 or 1/4. 1/2 is used if the "exponent" $r3 is 0, otherwise
   // 1/4 ($r3 should be -1 then). This is quite tricky but greatly simplifies
   // the logic here.
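   // Worked example (a sketch, assuming the Newton-Raphson steps return
   // exactly 1.0 for a renormalized input of 1.0): for x = 2^1023 the input
   // exponent field is 0x7fe, so the delta from step 2 is
   // 0x3ff - 0x7fe = -0x3ff. The result 1.0 has exponent field 0x3ff, which
   // makes the new "exponent" 0x3ff - 0x3ff = 0. Setting the exponent field
   // to 1 gives 2^-1022, and multiplying by 1/2 yields 2^-1023 = rcp(x).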
   set b32 $p0 0x1 ne u32 $r3 0x0
   and b32 $r1 $r1 0x800fffff
   // 0x3e800000: 1/4
   $p0 cvt f64 $r6d f32 0x3e800000
   sched 0x2f 0x28 0x2c 0x2e 0x2a 0x20 0x27
   // 0x3f000000: 1/2
   (not $p0) cvt f64 $r6d f32 0x3f000000
   add b32 $r1 $r1 0x00100000
   mul rn f64 $r0d $r0d $r6d
rcp_end:
   ret

// RSQ F64
//
// INPUT: $r0d
// OUTPUT: $r0d
// CLOBBER: $r2 - $r9, $p0 - $p1
//
gk110_rsq_f64:
   // Before getting initial result rsqrt64h, two special cases should be
   // handled first.
   // 1. NaN: set the highest bit in mantissa so it'll be surely recognized
   //    as NaN in rsqrt64h
   set $p0 0x1 gtu f64 abs $r0d 0x7ff0000000000000
   $p0 or b32 $r1 $r1 0x00080000
   and b32 $r2 $r1 0x7fffffff
   sched 0x27 0x20 0x28 0x2c 0x25 0x28 0x28
   // 2. denorms and small normal values: using their original value will
   //    lose precision either at rsqrt64h or the first step in newton-raphson
   //    steps below. Take 2 as a threshold in exponent field, and multiply
   //    by 2^54 if the exponent is smaller or equal (the result is multiplied
   //    by 2^27 at the end to recover)
   ext u32 $r3 $r1 0xb14
   set b32 $p1 0x1 le u32 $r3 0x2
   or b32 $r2 $r0 $r2
   $p1 mul rn f64 $r0d $r0d 0x4350000000000000
   rsqrt64h f32 $r5 $r1
   // rsqrt64h will give correct result for 0/inf/nan, the following logic
   // checks whether the input is one of those (exponent is 0x7ff or all 0
   // except for the sign bit)
   set b32 $r6 ne u32 $r3 0x7ff
   and b32 $r2 $r2 $r6
   sched 0x28 0x2b 0x20 0x27 0x28 0x2e 0x28
   set b32 $p0 0x1 ne u32 $r2 0x0
   $p0 bra #rsq_norm
   // For 0/inf/nan, make sure the sign bit agrees with input and return
   and b32 $r1 $r1 0x80000000
   mov b32 $r0 0x0
   or b32 $r1 $r1 $r5
   ret
rsq_norm:
   // For others, do 4
   // Newton-Raphson steps with the formula:
   //     RSQ_{n + 1} = RSQ_{n} * (1.5 - 0.5 * x * RSQ_{n} * RSQ_{n})
   // In the code below, each step is written as:
   //     tmp1 = 0.5 * x * RSQ_{n}
   //     tmp2 = -RSQ_{n} * tmp1 + 0.5
   //     RSQ_{n + 1} = RSQ_{n} * tmp2 + RSQ_{n}
   mov b32 $r4 0x0
   sched 0x2f 0x29 0x29 0x29 0x29 0x29 0x29
   // 0x3f000000: 1/2
   cvt f64 $r8d f32 0x3f000000
   mul rn f64 $r2d $r0d $r8d
   mul rn f64 $r0d $r2d $r4d
   fma rn f64 $r6d neg $r4d $r0d $r8d
   fma rn f64 $r4d $r4d $r6d $r4d
   mul rn f64 $r0d $r2d $r4d
   fma rn f64 $r6d neg $r4d $r0d $r8d
   sched 0x29 0x29 0x29 0x29 0x29 0x29 0x29
   fma rn f64 $r4d $r4d $r6d $r4d
   mul rn f64 $r0d $r2d $r4d
   fma rn f64 $r6d neg $r4d $r0d $r8d
   fma rn f64 $r4d $r4d $r6d $r4d
   mul rn f64 $r0d $r2d $r4d
   fma rn f64 $r6d neg $r4d $r0d $r8d
   fma rn f64 $r4d $r4d $r6d $r4d
   sched 0x29 0x20 0x28 0x2e 0x00 0x00 0x00
   // Multiply the result by 2^27 for small inputs
   // to recover
   $p1 mul rn f64 $r4d $r4d 0x41a0000000000000
   mov b32 $r1 $r5
   mov b32 $r0 $r4
   ret

.section #gk110_builtin_offsets
.b64 #gk110_div_u32
.b64 #gk110_div_s32
.b64 #gk110_rcp_f64
.b64 #gk110_rsq_f64