Vicare Scheme: stdlib bytevector strings

5.2.9 Operations on strings

This section describes procedures that convert between strings and bytevectors containing Unicode encodings of those strings. When decoding bytevectors, encoding errors are handled as with the replace semantics of textual I/O: If an invalid or incomplete character encoding is encountered, then the replacement character U+FFFD is appended to the string being generated, an appropriate number of bytes are ignored, and decoding continues with the following bytes.

Procedure: string->utf8 string

Return a newly allocated (unless empty) bytevector that contains the UTF-8 encoding of the given string.
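
As a minimal sketch of the encoding, assuming an R6RS top-level program and using illustrative variable names that are not part of the library:

```scheme
(import (rnrs))

;; Each ASCII character encodes to a single byte.
(define ascii-bytes (string->utf8 "abc"))
;; ⇒ #vu8(97 98 99)

;; U+03BB (Greek small letter lambda) needs two bytes in UTF-8.
(define lambda-bytes (string->utf8 "\x3BB;"))
;; ⇒ #vu8(#xCE #xBB)
```

The length of the result is a count of bytes, not of characters, so it can exceed the length of the string.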

Procedure: string->utf16 string

Procedure: string->utf16 string endianness

If endianness is specified, it must be the symbol big or the symbol little. The string->utf16 procedure returns a newly allocated (unless empty) bytevector that contains the UTF-16BE or UTF-16LE encoding of the given string (with no byte-order mark). If endianness is not specified or is big, then UTF-16BE is used. If endianness is little, then UTF-16LE is used.
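
The effect of the endianness argument can be sketched as follows, assuming an R6RS top-level program; the variable names are illustrative only:

```scheme
(import (rnrs))

;; No endianness given: big endian is used.
(define be-bytes (string->utf16 "A"))
;; ⇒ #vu8(#x00 #x41)

;; Explicit little endian: the two bytes of each code unit are swapped.
(define le-bytes (string->utf16 "A" (endianness little)))
;; ⇒ #vu8(#x41 #x00)

;; U+1D11E lies outside the Basic Multilingual Plane, so it is
;; encoded as the surrogate pair #xD834 #xDD1E.
(define clef-bytes (string->utf16 "\x1D11E;" (endianness big)))
;; ⇒ #vu8(#xD8 #x34 #xDD #x1E)
```

In none of these cases is a byte-order mark prepended to the output.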

Procedure: string->utf32 string

Procedure: string->utf32 string endianness

If endianness is specified, it must be the symbol big or the symbol little. The string->utf32 procedure returns a newly allocated (unless empty) bytevector that contains the UTF-32BE or UTF-32LE encoding of the given string (with no byte-order mark). If endianness is not specified or is big, then UTF-32BE is used. If endianness is little, then UTF-32LE is used.
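
A short sketch of the same idea for UTF-32, assuming an R6RS top-level program and illustrative variable names:

```scheme
(import (rnrs))

;; Every character occupies exactly four bytes.
(define be-bytes (string->utf32 "A"))
;; ⇒ #vu8(#x00 #x00 #x00 #x41)

(define le-bytes (string->utf32 "A" (endianness little)))
;; ⇒ #vu8(#x41 #x00 #x00 #x00)

;; Characters outside the Basic Multilingual Plane need no
;; surrogates: U+1D11E is stored directly as one 32-bit value.
(define clef-bytes (string->utf32 "\x1D11E;" (endianness big)))
;; ⇒ #vu8(#x00 #x01 #xD1 #x1E)
```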

Procedure: utf8->string bytevector

Procedure: utf8->string bytevector handling-mode

Return a newly allocated (unless empty) string whose character sequence is encoded by the given bytevector.

As a Vicare extension: the optional argument handling-mode must be a symbol representing an error handling mode, as validated by error-handling-mode (see error-handling-mode); when not given, it defaults to ‘raise’.
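
The following sketch shows decoding with and without the optional argument. It assumes that the extended procedure is available from the (vicare) library and that the quoted symbol replace is an accepted handling mode; the variable names are illustrative only.

```scheme
(import (vicare))

;; Plain decoding of valid UTF-8.
(define decoded (utf8->string '#vu8(97 98 99)))
;; ⇒ "abc"

;; Encoding and decoding are inverses for any valid string.
(define round-trip (utf8->string (string->utf8 "\x3BB;x")))
;; ⇒ "\x3BB;x"

;; #xFF can never appear in well-formed UTF-8.  With the replace
;; mode requested explicitly, the semantics described at the start
;; of this section suggest that U+FFFD is substituted for the bad
;; byte and decoding continues with #x62.
(define repaired (utf8->string '#vu8(#x61 #xFF #x62) 'replace))
```

Passing the mode explicitly makes the intended behavior on malformed input visible at the call site instead of leaving it to the default.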

Procedure: utf16->string bytevector endianness

Procedure: utf16->string bytevector endianness endianness-mandatory

Procedure: utf16->string bytevector endianness endianness-mandatory handling-mode

The argument endianness must be the symbol big or the symbol little.

The utf16->string procedure returns a newly allocated (unless empty) string whose character sequence is encoded by the given bytevector.

bytevector is decoded according to UTF-16, UTF-16BE, UTF-16LE, or a fourth encoding scheme that differs from all three of those as follows: If endianness-mandatory is absent or #f, utf16->string determines the endianness according to a UTF-16 Byte Order Mark (BOM) at the beginning of bytevector if a BOM is present; in this case, the BOM is not decoded as a character. Also in this case, if no UTF-16 BOM is present, endianness specifies the endianness of the encoding. If endianness-mandatory is a true value, endianness specifies the endianness of the encoding, and any UTF-16 BOM in the encoding is decoded as a regular character.

NOTE A UTF-16 BOM is either a sequence of bytes #xFE, #xFF specifying big and UTF-16BE, or #xFF, #xFE specifying little and UTF-16LE.

(utf16->string '#vu8(#xAA #xBB) (endianness big))
⇒ "\xAABB;"
(utf16->string '#vu8(#xAA #xBB) (endianness little))
⇒ "\xBBAA;"

;;In all the following tests: the endianness argument is
;;ignored; the BOM is processed; an empty string is generated.

;;Big endian BOM.
(utf16->string '#vu8(#xFE #xFF) (endianness big)    #f)
⇒ ""
(utf16->string '#vu8(#xFE #xFF) (endianness little) #f)
⇒ ""
;;Little endian BOM.
(utf16->string '#vu8(#xFF #xFE) (endianness big)    #f)
⇒ ""
(utf16->string '#vu8(#xFF #xFE) (endianness little) #f)
⇒ ""

;;In all the following tests: the endianness argument is
;;ignored; the BOM is processed; a string of 1 character is
;;generated.

;;Big endian BOM.
(utf16->string '#vu8(#xFE #xFF #xAA #xBB) (endianness big)    #f)
⇒ "\xAABB;"
(utf16->string '#vu8(#xFE #xFF #xAA #xBB) (endianness little) #f)
⇒ "\xAABB;"
;;Little endian BOM.
(utf16->string '#vu8(#xFF #xFE #xAA #xBB) (endianness big)    #f)
⇒ "\xBBAA;"
(utf16->string '#vu8(#xFF #xFE #xAA #xBB) (endianness little) #f)
⇒ "\xBBAA;"
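
The tests above all pass #f for endianness-mandatory. As a complementary sketch, assuming an R6RS top-level program and illustrative variable names, a true value makes the BOM bytes part of the decoded text:

```scheme
(import (rnrs))

;; endianness-mandatory is #t: the leading #xFE #xFF is decoded as
;; the character U+FEFF, then #x00 #x41 is decoded as "A".
(define bom-kept
  (utf16->string '#vu8(#xFE #xFF #x00 #x41) (endianness big) #t))
;; ⇒ "\xFEFF;A"

;; Same bytes with endianness-mandatory #f: the BOM is consumed and
;; only "A" remains.
(define bom-stripped
  (utf16->string '#vu8(#xFE #xFF #x00 #x41) (endianness big) #f))
;; ⇒ "A"
```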

As a Vicare extension: the optional argument handling-mode must be a symbol representing an error handling mode, as validated by error-handling-mode (see error-handling-mode); when not given, it defaults to ‘raise’.

Procedure: utf32->string bytevector endianness

Procedure: utf32->string bytevector endianness endianness-mandatory

Procedure: utf32->string bytevector endianness endianness-mandatory handling-mode

endianness must be the symbol big or the symbol little.

The utf32->string procedure returns a newly allocated (unless empty) string whose character sequence is encoded by the given bytevector.

bytevector is decoded according to UTF-32, UTF-32BE, UTF-32LE, or a fourth encoding scheme that differs from all three of those as follows: If endianness-mandatory is absent or #f, utf32->string determines the endianness according to a UTF-32 Byte Order Mark (BOM) at the beginning of bytevector if a BOM is present; in this case, the BOM is not decoded as a character. Also in this case, if no UTF-32 BOM is present, endianness specifies the endianness of the encoding. If endianness-mandatory is a true value, endianness specifies the endianness of the encoding, and any UTF-32 BOM in the encoding is decoded as a regular character.

NOTE A UTF-32 BOM is either a sequence of bytes #x00, #x00, #xFE, #xFF specifying big and UTF-32BE, or #xFF, #xFE, #x00, #x00 specifying little and UTF-32LE.
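
The rules above can be sketched as follows, assuming an R6RS top-level program; the variable names are illustrative only:

```scheme
(import (rnrs))

;; No BOM: the endianness argument selects the byte order.
(define plain
  (utf32->string '#vu8(#x00 #x00 #x00 #x41) (endianness big)))
;; ⇒ "A"

;; Little endian BOM with endianness-mandatory #f: the BOM overrides
;; the endianness argument and is not decoded as a character.
(define bom-wins
  (utf32->string '#vu8(#xFF #xFE #x00 #x00 #x41 #x00 #x00 #x00)
                 (endianness big) #f))
;; ⇒ "A"

;; Big endian BOM with endianness-mandatory #t: the BOM is decoded
;; as the regular character U+FEFF.
(define bom-kept
  (utf32->string '#vu8(#x00 #x00 #xFE #xFF #x00 #x00 #x00 #x41)
                 (endianness big) #t))
;; ⇒ "\xFEFF;A"
```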

As a Vicare extension: the optional argument handling-mode must be a symbol representing an error handling mode, as validated by error-handling-mode (see error-handling-mode); when not given, it defaults to ‘raise’.