Publication: 2024-07-27
Interpreting decimal strings into floats (part 3)
In the previous blog posts, we converted a decimal string into a tuple \((s, m, p)\) such that the approximated value \(s \times m \times 2^p\) is precise up to 61 bits. In this last blog post, I will describe the final step: encoding our tuple as a 32-bit or 64-bit floating-point number. At the end of the article, I also share some implementation results.
The encoding of floats
The 32-bit and 64-bit floats our CPUs manipulate are defined by the IEEE754 standard (which costs about $100 to access... but fortunately, Wikipedia has extensive documentation on the format). Floats have three different forms:
| | 32-bit float | 64-bit float |
|---|---|---|
| Subnormal | \(s \times m' \times 2^{-149}\) | \(s \times m' \times 2^{-1074}\) |
| Normal | \(s \times \left(1 + \frac{m'}{2^{23}}\right) \times 2^{p' - 127}\) | \(s \times \left(1 + \frac{m'}{2^{52}}\right) \times 2^{p' - 1023}\) |
| Special | \(\{-\infty, +\infty, \texttt{NaN}\}\) | \(\{-\infty, +\infty, \texttt{NaN}\}\) |
| \(s\) range | \(\{-1, +1\}\) (1 bit) | \(\{-1, +1\}\) (1 bit) |
| \(m'\) range | \([0 ; 2^{23}-1]\) (23 bits) | \([0 ; 2^{52}-1]\) (52 bits) |
| \(p'\) range | \([1 ; 2^8 - 2]\) (8 bits) | \([1 ; 2^{11} - 2]\) (11 bits) |
The binary encoding is the concatenation of \((s, p', m')\), from most-significant to least-significant bits. The form is encoded using the extra values available in \(p'\): the subnormal form uses \(p' = 0\), and special values use \(p' = 2^8 - 1\) for 32-bit floats and \(p' = 2^{11} - 1\) for 64-bit floats. Zero is encoded as a subnormal number with \(m' = 0\).
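To make this layout concrete, here is a small standalone program (mine, not part of the parser) that extracts the three fields of a 32-bit float:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 1.0f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); // Reinterpret the bits safely.

    uint32_t s  = bits >> 31;          // 1 sign bit
    uint32_t pp = (bits >> 23) & 0xFF; // 8 exponent bits (p')
    uint32_t mp = bits & 0x7FFFFF;     // 23 mantissa bits (m')

    // For 1.0f, prints: s=0 p'=127 m'=0, i.e. the normal form (1 + 0) * 2^(127-127).
    printf("s=%u p'=%u m'=%u\n", s, pp, mp);
    return 0;
}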
To encode a value, we must first determine which form to use. Each form has minimal and maximal numbers which it can encode.
| | 32-bit float | 64-bit float |
|---|---|---|
| Minimal subnormal | \(\pm 0\) | \(\pm 0\) |
| Maximal subnormal | \(\pm (1 - 2^{-23}) \times 2^{-126}\) | \(\pm (1 - 2^{-52}) \times 2^{-1022}\) |
| Threshold subnormal-to-normal | \(\pm (1 - 2^{-24}) \times 2^{-126}\) | \(\pm (1 - 2^{-53}) \times 2^{-1022}\) |
| Minimal normal | \(\pm 2^{-126}\) | \(\pm 2^{-1022}\) |
| Maximal normal | \(\pm (2 - 2^{-23}) \times 2^{127}\) | \(\pm (2 - 2^{-52}) \times 2^{1023}\) |
| Threshold normal-to-infinity | \(\pm (2 - 2^{-24}) \times 2^{127}\) | \(\pm (2 - 2^{-53}) \times 2^{1023}\) |
| Infinity | \(\pm \infty\) | \(\pm \infty\) |
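The thresholds are the midpoints at which rounding switches from one form to the next. For example, the 32-bit subnormal-to-normal threshold sits exactly halfway between the maximal subnormal and the minimal normal:

\[ \frac{(1 - 2^{-23}) \times 2^{-126} + 2^{-126}}{2} = \left(1 - 2^{-24}\right) \times 2^{-126} \]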
Thus, we need to compare the absolute value of our number \(m \times 2^p\) to these thresholds to determine the form to use.
Comparing two numbers expressed as (m,p)
First, let's express the thresholds \(T\) as tuples \((m_T, p_T)\), such that \(T = m_T \times 2^{p_T}\) and \(2^{63} \le m_T < 2^{64}\):
| Threshold | Expression | \(T = m_T \times 2^{p_T}\) |
|---|---|---|
| 32-bit subnormal-to-normal | \((1 - 2^{-24}) \times 2^{-126}\) | \((2^{64} - 2^{40}) \times 2^{-190}\) |
| 64-bit subnormal-to-normal | \((1 - 2^{-53}) \times 2^{-1022}\) | \((2^{64} - 2^{11}) \times 2^{-1086}\) |
| 32-bit normal-to-infinity | \((2 - 2^{-24}) \times 2^{127}\) | \((2^{64} - 2^{39}) \times 2^{64}\) |
| 64-bit normal-to-infinity | \((2 - 2^{-53}) \times 2^{1023}\) | \((2^{64} - 2^{10}) \times 2^{960}\) |
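Each conversion simply renormalizes the mantissa into \([2^{63} ; 2^{64})\) by shifting powers of two around; for example, \((1 - 2^{-24}) \times 2^{-126} = (2^{64} - 2^{40}) \times 2^{-64} \times 2^{-126} = (2^{64} - 2^{40}) \times 2^{-190}\).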
In the previous blog post, we already found an approximation of the resulting number, expressed as \(N' = m \times 2^p\) with \(2^{63} \le m < 2^{64}\). Since \(N'\) and the thresholds \(T\) share the same normalization, the exponent dominates the comparison, and we can distinguish five cases:

- if \(p < p_T\), then \(N' < T\);
- if \(p > p_T\), then \(N' > T\);
- if \(p = p_T\) and \(m < m_T\), then \(N' < T\);
- if \(p = p_T\) and \(m = m_T\), then \(N' = T\);
- if \(p = p_T\) and \(m > m_T\), then \(N' > T\).

So we can compare \(N'\) and \(T\) by first comparing \(p\) and \(p_T\), and only on equality comparing \(m\) and \(m_T\), each comparison being a simple integer operation.
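In C, such a comparison could look like the following sketch (the function name and signature are mine, not the original code):

#include <stdint.h>

// Compare N' = m * 2^p with T = mT * 2^pT, where both mantissas are
// normalized into [2^63 ; 2^64). Returns -1, 0 or +1.
static int jvComparePow2(uint64_t m, int64_t p, uint64_t mT, int64_t pT)
{
    if (p != pT) {
        return (p < pT) ? -1 : +1; // The exponent dominates.
    }
    if (m != mT) {
        return (m < mT) ? -1 : +1;
    }
    return 0;
}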
Encoding (s,m,p) into the floats
Depending on the result of the comparisons with the thresholds described above, we encode the number using the subnormal, normal, or infinite form. We need to find \((m', p')\) such that the corresponding expression in the first table of this article equals \(m \times 2^p\) (the sign \(s\) is simply passed through).
| Form | Expression | Solution |
|---|---|---|
| 32-bit subnormal | \(m' \times 2^{-149}\) | \(m' = \frac{m}{2^{-p-149}}\), \(p' = 0\) (subnormal) |
| 64-bit subnormal | \(m' \times 2^{-1074}\) | \(m' = \frac{m}{2^{-p-1074}}\), \(p' = 0\) (subnormal) |
| 32-bit normal | \(\left(1 + \frac{m'}{2^{23}}\right) \times 2^{p' - 127}\) | \(m' = \frac{m}{2^{40}} - 2^{23}\), \(p' = p + 190\) |
| 64-bit normal | \(\left(1 + \frac{m'}{2^{52}}\right) \times 2^{p' - 1023}\) | \(m' = \frac{m}{2^{11}} - 2^{52}\), \(p' = p + 1086\) |
| 32-bit infinity | \(\infty\) | \(m' = 0\), \(p' = 255\) (infinity) |
| 64-bit infinity | \(\infty\) | \(m' = 0\), \(p' = 2047\) (infinity) |
Note that in both subnormal cases, \(-p-149\) and \(-p-1074\) are positive quantities.
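As a sanity check, consider encoding the value \(1.0\) as a 32-bit float: its normalized tuple is \(m = 2^{63}\), \(p = -63\) (indeed, \(2^{63} \times 2^{-63} = 1\)). It falls in the normal range, so \(m' = \frac{2^{63}}{2^{40}} - 2^{23} = 0\) and \(p' = -63 + 190 = 127\), which assembles into the well-known encoding 0x3F800000.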
A division by a power of two, translated to integer arithmetic, can be implemented as a bitwise right shift. The operation is lossy: the result must be rounded, and this rounding needs to be accounted for. The code in C becomes:
// Compute m <-- m / 2^r, rounding half away from zero.
static void _jvDivPow2(uint64_t* m, int64_t r)
{
    if (r >= 65) {
        // The quotient is below 1/2: it rounds to 0.
        *m = 0;
        return;
    }
    if (r == 0) {
        return;
    }
    // Shift by r-1 bits, then use the remaining low bit to round.
    *m >>= r - 1;
    *m += 1; // Rounding
    *m >>= 1;
}
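A few examples of the rounding behavior (a quick check of mine, assuming the function above is in scope and <assert.h> is included):

uint64_t m;
m = 5; _jvDivPow2(&m, 1); assert(m == 3); // 5/2 = 2.5  rounds away to 3
m = 7; _jvDivPow2(&m, 2); assert(m == 2); // 7/4 = 1.75 rounds to 2
m = 4; _jvDivPow2(&m, 3); assert(m == 1); // 4/8 = 0.5  rounds away to 1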
One edge case exists: if \(m \ge 2^{64} - 2^{r-1}\), the rounded result is \(2^{64-r}\), which needs one more bit than expected by the formulas. For instance, with \(r = 40\) and \(m = 2^{64} - 2^{39}\), the exact quotient is \(2^{24} - \frac{1}{2}\), which rounds to \(2^{24}\): a 25-bit value where the normal form expects at most 24. This results in a specific case in the caller code.
The encoding of 32-bit floats then becomes the following code, where POW2(n) denotes the 64-bit integer constant \(2^n\). The logic for 64-bit floats is the same; only the numeric constants differ.
const uint64_t OVERFLOW_SUBNORMAL_M = (UINT64_MAX - POW2(40) + 1);
const uint64_t OVERFLOW_NORMAL_M = (UINT64_MAX - POW2(39) + 1);
uint32_t val = 0;
if (p < -190 || (p == -190 && m <= OVERFLOW_SUBNORMAL_M)) {
    // Subnormal form: find m'*2^-149 = m*2^p
    // -> m' = m * 2^(p+149) (but p <= -190, so this is a right shift)
    _jvDivPow2(&m, -p - 149);
    val = (uint32_t)m;
}
else if (p < 64 || (p == 64 && m <= OVERFLOW_NORMAL_M)) {
    // Normal form: find (1 + m'/2^23)*2^(p'-127) = m*2^p
    _jvDivPow2(&m, 40);
    if (m >= POW2(24)) {
        // Cannot encode as-is (25 bits): use the next exponent p+1.
        p += 1;
        m >>= 1;
    }
    m -= POW2(23); // Drop the implicit leading 1.
    p += 190;      // Biased exponent: p' = p + 190.
    val = (uint32_t)m;
    val |= (uint32_t)p << 23;
}
else {
    // Beyond the normal-to-infinity threshold: encode infinity.
    val = (uint32_t)255 << 23;
}
val |= (uint32_t)(bNegative) << 31; // Sign bit.
*(uint32_t*)pDest = val;
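To see the whole encoding at work, here is a self-contained test program; the wrapper function jvEncodeF32 and the POW2 definition are my own reconstruction for illustration, not taken from the original source:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define POW2(n) (1ULL << (n)) // Assumed definition of the POW2 macro.

// Same helper as earlier in the article.
static void _jvDivPow2(uint64_t* m, int64_t r)
{
    if (r >= 65) { *m = 0; return; }
    if (r == 0)  { return; }
    *m >>= r - 1;
    *m += 1;
    *m >>= 1;
}

// The 32-bit encoding fragment above, wrapped into a function.
static void jvEncodeF32(uint64_t m, int64_t p, int bNegative, void* pDest)
{
    const uint64_t OVERFLOW_SUBNORMAL_M = (UINT64_MAX - POW2(40) + 1);
    const uint64_t OVERFLOW_NORMAL_M = (UINT64_MAX - POW2(39) + 1);
    uint32_t val = 0;
    if (p < -190 || (p == -190 && m <= OVERFLOW_SUBNORMAL_M)) {
        _jvDivPow2(&m, -p - 149);
        val = (uint32_t)m;
    }
    else if (p < 64 || (p == 64 && m <= OVERFLOW_NORMAL_M)) {
        _jvDivPow2(&m, 40);
        if (m >= POW2(24)) { p += 1; m >>= 1; }
        m -= POW2(23);
        p += 190;
        val = (uint32_t)m | ((uint32_t)p << 23);
    }
    else {
        val = (uint32_t)255 << 23;
    }
    val |= (uint32_t)bNegative << 31;
    memcpy(pDest, &val, sizeof val);
}

int main(void)
{
    uint32_t bits;
    float f;

    jvEncodeF32(POW2(63), -63, 0, &bits); // 2^63 * 2^-63 = 1.0
    memcpy(&f, &bits, sizeof f);
    printf("0x%08X = %g\n", (unsigned)bits, f); // expect 0x3F800000 = 1

    jvEncodeF32(POW2(63) + POW2(62), -62, 1, &bits); // 1.5 * 2^63 * 2^-62 = 3.0, negated
    memcpy(&f, &bits, sizeof f);
    printf("0x%08X = %g\n", (unsigned)bits, f); // expect 0xC0400000 = -3
    return 0;
}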
Implementation results
For my tests, the test data collected by Nigel Tao has been of tremendous help: it is a suite of plain-text files aggregating the test suites of multiple open-source projects that implement float parsers: google-wuffs, rapidjson, freetype, etc. My test program's source code is available online. Here are the total results:
total_nb_f32_correct = 5295186
total_nb_f32_1ulp_away = 4821
total_nb_f32_bad = 0
total_nb_f64_correct = 3340643
total_nb_f64_1ulp_away = 1959364
total_nb_f64_bad = 0
Each test case counts towards one of three categories:

- `correct` if the result is bitwise identical to the expected result.
- `1ulp_away` if the result differs only in the least significant bit (ULP = Unit in the Last Place).
- `bad` if the result differs by more than the least significant bit.
I am satisfied with the zero `bad` results, which means my implementation is usable in most contexts. The `1ulp_away` errors are unfortunate, but since I do not implement arbitrary precision, I expected they could occur. For instance, my implementation rounds ties away from zero, while the default behavior suggested by IEEE754, and seemingly used in the test suites, is to round ties to even. In my implementation, it is hard to know whether we have exactly a tie, because the least-significant bits of the 64-bit intermediate result \(m\) are not exact.
In terms of performance, I compared the same test program with my implementation replaced by a dummy function that does nothing, and by an adapter around MSVC's strtof/strtod implementation. The dummy function shows the cost of the pure I/O code. I ran the tests on my Windows 11 laptop with an AMD Ryzen 5 7520U CPU. The numbers in the table are the best timings out of 5 consecutive runs. In each run, \(10\ 600\ 014\) floats are parsed (half into 32-bit, half into 64-bit).
| | Dummy (Debug) | jvParseFloat (Debug) | strtof/d (Debug) | Dummy (Release) | jvParseFloat (Release) | strtof/d (Release) |
|---|---|---|---|---|---|---|
| Total run time | 7.11 s | 13.76 s | 20.75 s | 1.98 s | 2.47 s | 5.11 s |
| Run time relative to dummy | 0 s | 6.65 s | 13.64 s | 0 s | 0.49 s | 3.13 s |
| Average time per float parsed | 0 µs/call | 0.63 µs/call | 1.29 µs/call | 0 µs/call | 0.05 µs/call | 0.29 µs/call |
My implementation is noticeably faster than strtof/strtod. The comparison is a bit unfair, though, given that strtof cannot afford to be 1 ULP away; whether that trade-off is acceptable depends on whether the user needs exact results.
Measuring run time on debug builds may be surprising, but debug performance matters during the development of big projects such as video games, so I tried not to make it prohibitive.
Edit 31/07: I ran the same benchmark on fast_float, which is used by GCC and WebKit. I obtain:
- Debug: 0.8 µs/call (total run time = 15.70 s)
- Release: 0.05 µs/call (total run time = 2.48 s)
So fast_float has similar Release performance and slightly worse Debug performance, but it is always exact. fast_float weighs 10.8 KiB, though, mainly because it stores all powers of 5 from \(5^{-342}\) to \(5^{308}\). Finally, fast_float is a C++11 codebase, while mine is C99.
Conclusion
This blog series is now finished: I have achieved a satisfying implementation for the common problem of parsing strings into floating-point numbers, and I have enough confidence in it to use it where justified.
My implementation is available in the public domain. If you find it useful, feel free to send me an email. 😊
Thanks for reading, have a nice day!