Bug 1
=====

basic_ostream_formatter and fmt::streamed are locale-dependent
--------------------------------------------------------------

`basic_ostream_formatter` and `fmt::streamed` leak the currently active global C++ locale to `operator<<`, because the locale imbued in a `basic_ostream` by default is a default-constructed locale (which is a copy of the active global locale). See . A correct `basic_ostream` `operator<<` is required to look at the stream locale in order to decide how it should format things.

This conflicts with the documentation, which claims that {fmt} is locale-independent by default unless an explicit locale is provided and/or the `L` format specifier is used.

In order to always achieve locale-independence, {fmt} should thus `imbue(std::locale::classic())` if no explicit locale is provided.

Test case: :
```
#include <fmt/format.h>
#include <fmt/ostream.h>
#include <iostream>
#include <locale>

// Prints the name of the locale that is imbued in the stream it is written to.
struct inspect_locale {
	friend std::ostream & operator<<(std::ostream & os, const inspect_locale &) {
		os << os.getloc().name();
		return os;
	}
};

template <>
struct fmt::formatter<inspect_locale> : fmt::ostream_formatter { };

int main() {
	{
		std::cout << fmt::format("{}", inspect_locale{}) << std::endl;
	}
	{
		std::locale::global(std::locale("en_US.UTF-8"));
		std::cout << fmt::format("{}", inspect_locale{}) << std::endl;
	}
	return 0;
}
```

Bug 2
=====

{fmt} and/or ostream_formatter/fmt::streamed are very confused about text encoding
----------------------------------------------------------------------------------

In the default recommended use of {fmt}, which means using `char` as the character type, `ostream_formatter` is confused about text encoding. It copies characters written to the `ostream` verbatim into the output buffer. However, that mixes up text encodings. {fmt} documents itself as using "`char`/UTF-8" (see ), but `operator<<` needs to look at the locale that is imbued in the `ostream` and then use that locale's narrow character encoding. In the default case (when the process did not set a locale, e.g.
via `std::locale::global(std::locale(""))`), this will be the `std::locale::classic()` or `"C"` locale, which in the general case does not use UTF-8 encoding. If the process has selected the system-configured locale (or any other locale), the encoding can be basically anything (and in particular it is not UTF-8 in almost all Windows applications).

Test case: :
```
#include <fmt/format.h>
#include <fmt/ostream.h>
#include <cwchar>
#include <iomanip>
#include <iostream>
#include <locale>
#include <string>
#include <vector>

struct lol_type {
	friend std::ostream & operator<<(std::ostream & os, const lol_type &) {
		// convert a wide string to the narrow encoding of the stream locale
		const std::codecvt<wchar_t, char, std::mbstate_t> & conv = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(os.getloc());
		const std::wstring in = L"löl";
		std::vector<char> tmp((conv.max_length() * in.length()) + 1, '\0');
		std::string out;
		{
			const wchar_t * from_next = nullptr;
			char * to_next = nullptr;
			std::mbstate_t state{};
			from_next = in.data();
			to_next = tmp.data();
			// skip codecvt error handling, not relevant to demonstrate the problem
			conv.out(state, in.data(), in.data() + in.length(), from_next, tmp.data(), tmp.data() + tmp.size(), to_next);
			out = std::string(tmp.data(), to_next);
		}
		/* os << out; */
		/* // with en_US.ISO-8859-1 locale, this would be */
		os << "l\xf6l";
		return os;
	}
};

template <>
struct fmt::formatter<lol_type> : fmt::ostream_formatter { };

int main() {
	// compiler explorer does not provide that locale
	/* std::locale::global(std::locale("en_US.ISO-8859-1")); */
	std::string s = fmt::format("{} {}", lol_type{}, "löl");
	for (const auto c : s) {
		// output is not in any valid encoding
		std::cout << std::hex << std::setfill('0') << std::setw(2) << static_cast<int>(static_cast<unsigned char>(c)) << " ";
	}
	std::cout << std::endl;
	return 0;
}
```

Possible ways forward:

1. Imbue a locale which uses the intended (as documented) UTF-8 encoding. However, I doubt that this is possible to do portably in practice, as the locale names are platform-dependent and {fmt} does not know which ones are available. There might not even be any UTF-8 locale available at all on a given system. Providing only a {fmt}-tailored `codecvt` facet might also work.
2. Imbue the `std::locale::classic()` locale (when no explicit locale was given) and transcode the output to UTF-8 (which might imply a roundtrip through the wide locale encoding). This has the disadvantage of limiting the available character set to what the `"C"` locale supports.
3. Leave the locale alone and transcode from whatever is the currently active locale's narrow encoding. This, however, also has the disadvantage of limiting the available character set to what is representable in the locale's narrow character set (which might be less than full Unicode).
4. Imbue the `std::locale::classic()` locale (when no explicit locale was given) and statically check for `operator<<` availability for `basic_ostream` using the Unicode character types `char8_t`/`char16_t`/`char32_t` or the wide locale character set via `wchar_t`, use those to format the value, and transcode the output. This requires defining an order of preference among the character types.
5. `fmt/chrono.h` appears to already try to handle encoding, so maybe at the very least the same treatment should be brought to `ostream_formatter`/`fmt::streamed`. I have not reviewed whether what `chrono.h` does is sane/correct.
6. Remove the claim that {fmt} is "`char`/UTF-8" from the documentation, and keep all Mojibake support.
7. Remove `fmt::ostream_formatter`/`fmt::streamed` and pretend *{fmt} had always been at war with ostream*.

I strongly prefer options 4 or 6.

Now, this is not at all easy to fix, and IMHO it shows the very fundamental design flaw in {fmt} of using a kitchen-sink character type as the recommended default and trying to enforce its encoding by documentation only.
This encoding confusion bug would very likely have been caught during development if {fmt} had chosen to use an encoding-type-safe character type (like `char8_t` or `char32_t`) as its default/recommended character type (test case: :
```
#include <fmt/format.h>
#include <fmt/ostream.h>
#include <fmt/xchar.h>
#include <iostream>
#include <string>

struct compiletime_encoding_fuckup_detection_demonstrator {
	friend std::ostream & operator<<(std::ostream & os, const compiletime_encoding_fuckup_detection_demonstrator &) {
		os << "l\xf6l";
		return os;
	}
	friend std::wostream & operator<<(std::wostream & os, const compiletime_encoding_fuckup_detection_demonstrator &) {
		os << L"l\U000000f6l";
		return os;
	}
};

template <typename Char>
struct fmt::formatter<compiletime_encoding_fuckup_detection_demonstrator, Char> : fmt::basic_ostream_formatter<Char> { };

int main() {
	std::string yolostring = fmt::format("{}", compiletime_encoding_fuckup_detection_demonstrator{});
	std::wstring widestring = fmt::format(L"{}", compiletime_encoding_fuckup_detection_demonstrator{});
	// the following do not compile: there is no operator<< for these character types
	//std::u32string utf32string = fmt::format(U"{}", compiletime_encoding_fuckup_detection_demonstrator{});
	//std::u16string utf16string = fmt::format(u"{}", compiletime_encoding_fuckup_detection_demonstrator{});
	//std::u8string utf8string = fmt::format(u8"{}", compiletime_encoding_fuckup_detection_demonstrator{});
	return 0;
}
```
).

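For reference, the `imbue(std::locale::classic())` suggestion from Bug 1 (which is also the first half of options 2 and 4 above) can be sketched with the standard library alone. This is a minimal illustration and not {fmt}'s implementation: `stream_with_classic_locale`, `grouping_punct` and `demonstrate_locale_leak` are hypothetical names, and the custom `numpunct` facet only stands in for a non-`"C"` global locale so that the example does not depend on which named locales are installed.

```cpp
#include <cassert>
#include <locale>
#include <ostream>
#include <sstream>
#include <string>

// Same probe type as in the Bug 1 test case: prints the stream locale's name.
struct inspect_locale {
    friend std::ostream& operator<<(std::ostream& os, const inspect_locale&) {
        return os << os.getloc().name();
    }
};

// Hypothetical helper (not part of {fmt}): streams a value into a string
// through a stream that is explicitly pinned to the classic "C" locale.
template <typename T>
std::string stream_with_classic_locale(const T& value) {
    std::ostringstream os;
    // A default-constructed stream copies the global locale;
    // imbue() overrides that for this stream only.
    os.imbue(std::locale::classic());
    os << value;
    return os.str();
}

// Assumption for the demo: a facet that groups digits, standing in for
// whatever non-"C" locale the application may have made global.
struct grouping_punct : std::numpunct<char> {
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// With a non-classic global locale, a default stream picks it up,
// while the pinned stream does not.
inline void demonstrate_locale_leak() {
    const std::locale previous =
        std::locale::global(std::locale(std::locale::classic(), new grouping_punct));

    std::ostringstream leaky;  // copies the global locale on construction
    leaky << 1234567;
    assert(leaky.str() == "1,234,567");

    assert(stream_with_classic_locale(1234567) == "1234567");
    assert(stream_with_classic_locale(inspect_locale{}) == "C");

    std::locale::global(previous);  // restore the previous global locale
}
```

Calling `demonstrate_locale_leak()` from `main` exercises both paths. Where exactly the equivalent `imbue` call would belong inside `ostream_formatter` is not shown here and would need to be checked against the {fmt} sources.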
Bug 3
=====

The L"lölcálê" bug
------------------

```
manx@appendix:~/tmp$ cat fmtlöl1.cpp && g++ -std=c++20 -O3 -Wall -Wextra -Wpedantic fmtlöl1.cpp -lfmt -o fmtlöl1 && echo "" && ./fmtlöl1
#include <fmt/format.h>
#include <fmt/xchar.h>
#include <locale>
int main() {
	fmt::print(L"{}\n", L"löl");
	std::locale::global(std::locale(""));
	fmt::print(L"{}\n", L"löl");
	return 0;
}

l?l
lol
manx@appendix:~/tmp$ cat fmtlöl2.cpp && g++ -std=c++20 -O3 -Wall -Wextra -Wpedantic fmtlöl2.cpp -lfmt -o fmtlöl2 && echo "" && ./fmtlöl2
#include <fmt/format.h>
#include <fmt/xchar.h>
int main() {
	fmt::print("{}\n", "löl");
	fmt::print(L"{}\n", L"löl");
	return 0;
}

löl
terminate called after throwing an instance of 'std::system_error'
  what():  cannot write to file: Success
Aborted
manx@appendix:~/tmp$ cat fmtlöl3.cpp && g++ -std=c++20 -O3 -Wall -Wextra -Wpedantic fmtlöl3.cpp -lfmt -o fmtlöl3 && echo "" && ./fmtlöl3
#include <fmt/format.h>
#include <fmt/xchar.h>
int main() {
	fmt::print(L"{}\n", L"löl");
	fmt::print("{}\n", "löl");
	return 0;
}

l?l
terminate called after throwing an instance of 'std::system_error'
  what():  cannot write to file: Success
Aborted
manx@appendix:~/tmp$ cat fmtlöl4.cpp && g++ -std=c++20 -O3 -Wall -Wextra -Wpedantic fmtlöl4.cpp -lfmt -o fmtlöl4 && echo "" && ./fmtlöl4
#include <fmt/format.h>
#include <fmt/xchar.h>
#include <locale>
int main() {
	std::locale::global(std::locale(""));
	fmt::print(L"{}\n", L"löl");
	return 0;
}

löl
manx@appendix:~/tmp$ cat fmtlöl5.cpp && g++ -std=c++20 -O3 -Wall -Wextra -Wpedantic fmtlöl5.cpp -lfmt -o fmtlöl5 && echo "" && ./fmtlöl5
#include <fmt/format.h>
#include <fmt/xchar.h>
#include <locale>
int main() {
	std::locale::global(std::locale(""));
	fmt::print(L"{}\n", L"löl");
	fmt::print("{}\n", "löl");
	return 0;
}

löl
terminate called after throwing an instance of 'std::system_error'
  what():  cannot write to file: No such file or directory
Aborted
manx@appendix:~/tmp$ cat löl.cpp && g++ -std=c++20 -O3 -Wall -Wextra -Wpedantic löl.cpp -o löl && echo "" && ./löl
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <string>
int main() {
	std::wstring lol = L"löl";
	for (const auto wc : lol) {
		static_assert(sizeof(wchar_t) == 4);
		std::cout << std::hex << std::setfill('0') << std::setw(8) << static_cast<std::uint32_t>(wc) << " ";
	}
	std::cout << std::endl;
	return 0;
}

0000006c 000000f6 0000006c
manx@appendix:~/tmp$ set | grep '^LANG'
LANG=en_US.UTF-8
LANGUAGE=en_US:en
manx@appendix:~/tmp$ pkg-config --modversion fmt
8.1.1
manx@appendix:~/tmp$ gcc --version
gcc (Debian 12.1.0-8) 12.1.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

manx@appendix:~/tmp$ lsb_release --id --release
Distributor ID: Debian
Release:        testing
manx@appendix:~/tmp$
```

¿LÖL? 🎆 ¯\\\_(ツ)\_/¯ 😂