Examples
View UTF-32 to UTF-16
This example shows how an array of UTF-32 encoded code points can be converted to UTF-16 using the C++ 20 ranges interface.
#include <algorithm>
#include <array>
#include <iostream>
#include <iterator>
#include <vector>
#include "icubaby/icubaby.hpp"
// The ICUBABY_HAVE_RANGES and ICUBABY_HAVE_CONCEPTS macros are true if the
// corresponding features are available in both the compiler and standard
// library.
#if ICUBABY_HAVE_RANGES && ICUBABY_HAVE_CONCEPTS
int main () {
// The code points to be converted. Here just a single U+1F600 GRINNING FACE
// emoji but obviously the array could contain many more code points.
auto const input = std::array{char32_t{0x1F600}};
// Take the 'input' container and send it lazily through a UTF-32 to UTF-16
// transcoder.
auto r = input | icubaby::views::transcode<char32_t, char16_t>;
// A vector to contain the output UTF-16 code units.
std::vector<char16_t> out;
// Copy the transcoded code units from the lazy range to the output vector.
std::ranges::copy (r, std::back_inserter (out));
}
#else
int main () {
std::cout << "Sorry, icubaby C++ 20 ranges aren't supported by your build.\n";
}
#endif // ICUBABY_HAVE_RANGES && ICUBABY_HAVE_CONCEPTS
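To make the expected contents of the output vector concrete, here is a minimal sketch of the standard UTF-16 surrogate pair arithmetic for a code point above U+FFFF. The helper name to_surrogate_pair is hypothetical and is not part of icubaby; it only illustrates the encoding rule, not how the library implements it.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper (not part of icubaby): splits a supplementary-plane
// code point (U+10000..U+10FFFF) into its UTF-16 high and low surrogates.
inline std::array<char16_t, 2> to_surrogate_pair (char32_t code_point) {
  assert (code_point >= 0x10000 && code_point <= 0x10FFFF);
  // Subtracting 0x10000 leaves a 20-bit value.
  char32_t const offset = code_point - 0x10000;
  // The top 10 bits go in the high surrogate, the bottom 10 in the low one.
  return {static_cast<char16_t> (0xD800 + (offset >> 10)),
          static_cast<char16_t> (0xDC00 + (offset & 0x3FF))};
}
```

For U+1F600 GRINNING FACE this yields the two code units 0xD83D and 0xDE00, so a correct UTF-32 to UTF-16 conversion of the input above should produce two code units.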
Iterator and Algorithm
The icubaby::iterator class offers a familiar output iterator for using a transcoder. Each code unit from the input encoding is written to the iterator, which in turn writes the output encoding to a second iterator. This enables the use of standard algorithms such as std::copy() with the library.
#include <iostream>
#include <iterator>
#include <string_view>
#include <vector>
#include "icubaby/icubaby.hpp"
namespace {
// Dump the number of code units and code points within the supplied container.
template <typename Container>
void describe (Container const& container, std::string_view const encoding) {
auto const code_units = container.size ();
auto const code_points =
icubaby::length (std::begin (container), std::end (container));
std::cout << encoding << " is " << code_units << ' '
<< (code_units == 1U ? "code unit" : "code units") << " and "
<< code_points << ' '
<< (code_points == 1U ? "code point" : "code points") << '\n';
}
} // end anonymous namespace
int main () {
// The input: we start with a vector of UTF-8 code units. In this case a
// single U+1F600 GRINNING FACE code point. Note that we use icubaby::char8
// here for compatibility with C++ 17. If you always use C++ 20 or later, you
// can bypass this type and simply use char8_t.
auto const input = std::vector{
static_cast<icubaby::char8> (0xF0), static_cast<icubaby::char8> (0x9F),
static_cast<icubaby::char8> (0x98), static_cast<icubaby::char8> (0x80)};
describe (input, "UTF-8");
// A second vector which will contain the UTF-16 output.
std::vector<char16_t> output;
// Instantiate a transcoder which can convert from UTF-8 to UTF-16.
icubaby::t8_16 transcoder;
// Now an output iterator based on the t8_16 transcoder which can consume
// UTF-8 input and convert it to UTF-16. It will then emit the results to a
// second output iterator (`inserter`) which appends those code units to the
// `output` vector.
auto inserter = std::back_inserter (output);
auto output_it = icubaby::iterator{&transcoder, inserter};
// Loop through the input, assigning each UTF-8 code unit to the `output_it`
// output iterator created on the previous line.
for (auto code_unit : input) {
*(output_it++) = code_unit;
}
// Tell the transcoder that the input has been completely processed.
(void)transcoder.end_cp (output_it);
describe (output, "UTF-16");
std::cout << "Input " << (transcoder.well_formed () ? "was" : "was not")
<< " well formed\n";
}
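The difference between code units and code points reported by describe() can be illustrated with a short standalone sketch. It counts code points in UTF-8 by skipping continuation bytes (those of the form 10xxxxxx). The function count_code_points is a hypothetical helper for illustration only; it assumes well-formed input, performs no validation, and is not a description of how icubaby::length is implemented.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of icubaby): counts the code points in
// well-formed UTF-8 by counting every byte that is not a continuation byte.
inline std::size_t count_code_points (std::vector<unsigned char> const& utf8) {
  std::size_t count = 0;
  for (unsigned char const code_unit : utf8) {
    // Continuation bytes have the bit pattern 10xxxxxx.
    if ((code_unit & 0xC0U) != 0x80U) {
      ++count;
    }
  }
  return count;
}
```

Applied to the four bytes 0xF0 0x9F 0x98 0x80 used above, this counts one code point from four code units.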
Bytes to UTF-8
This code converts an array of bytes containing the string “Hello World” in UTF-16 BE, with an initial byte order mark, first to UTF-8 and then to a sequence of std::uint_least8_t values. Finally, we copy these values to std::cout.
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <iterator>
#include "icubaby/icubaby.hpp"
// The ICUBABY_HAVE_RANGES and ICUBABY_HAVE_CONCEPTS macros are true if the
// corresponding features are available in both the compiler and standard
// library.
#if ICUBABY_HAVE_RANGES && ICUBABY_HAVE_CONCEPTS
int main () {
// The bytes to be converted. An array here, but this could obviously come
// from any source such as user input, a file, or a network endpoint. Note
// that the icubaby transcoder deals with a single byte at a time so we don't
// need to have the entire input available at any time.
static std::array const input{
std::byte{0xFE}, std::byte{0xFF}, std::byte{0x00}, std::byte{'H'},
std::byte{0x00}, std::byte{'e'}, std::byte{0x00}, std::byte{'l'},
std::byte{0x00}, std::byte{'l'}, std::byte{0x00}, std::byte{'o'},
std::byte{0x00}, std::byte{' '}, std::byte{0x00}, std::byte{'W'},
std::byte{0x00}, std::byte{'o'}, std::byte{0x00}, std::byte{'r'},
std::byte{0x00}, std::byte{'l'}, std::byte{0x00}, std::byte{'d'},
};
// A pipeline where the input array is converted from a series of bytes to a
// stream of UTF-8 code units and then finally to std::uint_least8_t for
// display to the user.
auto range = input | icubaby::views::transcode<std::byte, char8_t> |
std::views::transform ([] (char8_t code_unit) {
return static_cast<std::uint_least8_t> (code_unit);
});
// Copy the elements of range directly to `std::cout`.
(void)std::ranges::copy (
range, std::ostream_iterator<std::uint_least8_t> (std::cout));
}
#else
int main () {
std::cout << "Sorry, icubaby C++ 20 ranges aren't supported by your build.\n";
}
#endif // ICUBABY_HAVE_RANGES && ICUBABY_HAVE_CONCEPTS
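The input above begins with the bytes 0xFE 0xFF, the UTF-16 big endian byte order mark. As a rough illustration of how a byte-oriented decoder could choose an input encoding from such a prefix, here is a sketch of byte order mark detection. The encoding enumeration and the detect_bom function are hypothetical names introduced for this sketch; this is not icubaby's implementation, and it makes no claim about what the library does when no byte order mark is present.

```cpp
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical names for illustration only (not part of icubaby).
enum class encoding { unknown, utf8, utf16be, utf16le, utf32be, utf32le };

// Returns the encoding suggested by a leading byte order mark, if any.
inline encoding detect_bom (std::vector<std::byte> const& data) {
  // Returns true if 'data' begins with the given byte values.
  auto const starts_with = [&data] (std::initializer_list<unsigned char> bom) {
    if (data.size () < bom.size ()) {
      return false;
    }
    std::size_t index = 0;
    for (unsigned char const expected : bom) {
      if (data[index++] != std::byte{expected}) {
        return false;
      }
    }
    return true;
  };
  // The UTF-32 LE mark starts with the UTF-16 LE mark, so test UTF-32 first.
  if (starts_with ({0x00, 0x00, 0xFE, 0xFF})) { return encoding::utf32be; }
  if (starts_with ({0xFF, 0xFE, 0x00, 0x00})) { return encoding::utf32le; }
  if (starts_with ({0xEF, 0xBB, 0xBF})) { return encoding::utf8; }
  if (starts_with ({0xFE, 0xFF})) { return encoding::utf16be; }
  if (starts_with ({0xFF, 0xFE})) { return encoding::utf16le; }
  return encoding::unknown;
}
```

The order of the checks matters: testing the four-byte marks before the two-byte marks avoids misreading UTF-32 LE input as UTF-16 LE.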
Manual Bytes to UTF-8
This code shows how icubaby makes it straightforward to convert a byte array to a sequence of Unicode code units by passing one byte at a time to a transcoder instance. We take the bytes making up the string “Hello World” expressed in big endian UTF-16 (with a byte order mark) and convert them to UTF-8, which is written directly to std::cout.
#include <array>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <vector>
#include "icubaby/icubaby.hpp"
int main () {
// The bytes to be converted. An array here, but this could obviously come
// from any source such as user input, a file, or a network endpoint. Note
// that the icubaby transcoder deals with a single byte at a time so we don't
// need to have the entire input available at any time.
static std::array const input{
std::byte{0xFE}, std::byte{0xFF}, std::byte{0x00}, std::byte{'H'},
std::byte{0x00}, std::byte{'e'}, std::byte{0x00}, std::byte{'l'},
std::byte{0x00}, std::byte{'l'}, std::byte{0x00}, std::byte{'o'},
std::byte{0x00}, std::byte{' '}, std::byte{0x00}, std::byte{'W'},
std::byte{0x00}, std::byte{'o'}, std::byte{0x00}, std::byte{'r'},
std::byte{0x00}, std::byte{'l'}, std::byte{0x00}, std::byte{'d'},
std::byte{0x00}, std::byte{'\n'}};
// A vector to contain the UTF-8 output.
std::vector<icubaby::char8> output;
// An output iterator that will append each UTF-8 code unit to the `output`
// vector.
auto out_it = std::back_inserter (output);
// The transcoder instance. We consume bytes (indicating that the transcoder
// must decide on the input encoding) and emit icubaby::char8 (UTF-8).
icubaby::transcoder<std::byte, icubaby::char8> transcode;
// Call the transcoder for each source byte. Output goes to the `out_it`
// output iterator.
for (auto b : input) {
out_it = transcode (b, out_it);
}
// Tell the transcoder that it should have received a complete code point.
// This always happens at the end of the input.
(void)transcode.end_cp (out_it);
// Write the output to the console. This example sticks to the ASCII subset of
// code points, so this should work on most terminals!
for (auto c : output) {
std::cout << static_cast<char> (c);
}
}
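The comments above note that the transcoder handles a single byte at a time, so the whole input never needs to be available at once. The following sketch shows the general idea of such incremental processing for big endian UTF-16: a small state machine that remembers the first byte of a pair and emits a code unit when the second arrives. The class utf16be_assembler is a hypothetical illustration and does not describe icubaby's internals.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical illustration (not icubaby's implementation): assembles
// big endian UTF-16 code units from bytes supplied one at a time.
class utf16be_assembler {
public:
  // Feeds one byte. Returns a code unit once two bytes have been seen, or
  // std::nullopt while still waiting for the second byte of a pair.
  std::optional<char16_t> push (std::byte b) {
    auto const value = std::to_integer<unsigned> (b);
    if (!have_high_) {
      high_ = value;
      have_high_ = true;
      return std::nullopt;
    }
    have_high_ = false;
    return static_cast<char16_t> ((high_ << 8U) | value);
  }

  // True if the input stopped part-way through a code unit.
  bool partial () const { return have_high_; }

private:
  unsigned high_ = 0;       // The most significant byte of the pending pair.
  bool have_high_ = false;  // Set when the first byte of a pair was received.
};
```

Feeding the bytes 0x00 and 'H' in turn produces nothing for the first call and the code unit u'H' for the second. A check such as partial() plays a role loosely similar to the end-of-input call in the example above: it is the point at which truncated input can be noticed.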