6 Characters and Strings

Character and string literals follow the C syntax, with special characters escaped by a backslash (\):

Chars: 'a' '%' '\n'
Strings: "test" "line with new-line\n"

"""
Multi line

text block
"""

Multiline text blocks were introduced in Java 15.

6.1 Characters

Wrapper class Character encapsulates a single character. Like all wrapper classes, it is immutable. It also provides a set of static utility methods that take the character as an argument:

isLetter()
isDigit()
isSpaceChar()
toUpperCase()
toLowerCase()
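
As a minimal sketch of how these utility methods can be used, the following example calls them on a few sample characters. The class name CharacterDemo and the helper countLetters are hypothetical names introduced only for illustration.

```java
class CharacterDemo {
    // Hypothetical helper: counts the letters in a string using Character.isLetter(char).
    static int countLetters(String text) {
        int count = 0;
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(Character.isLetter('a'));     // true
        System.out.println(Character.isDigit('7'));      // true
        System.out.println(Character.isSpaceChar(' '));  // true
        System.out.println(Character.toUpperCase('a'));  // A
        System.out.println(Character.toLowerCase('Q'));  // q
        System.out.println(countLetters("abc 123"));     // 3
    }
}
```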

6.1.1 Character sets

A character set is a set of characters, each defined in terms of an abstract idea (e.g., the lowercase letter i of the Latin alphabet) and the corresponding bit (and byte) sequence representation (e.g. $\mathtt{01101001}_2$ or 0x69).

6.1.1.1 ASCII

ASCII (American Standard Code for Information Interchange) is the most widely adopted character set. The original version uses a 7-bit mapping and is also standardized as ECMA-6 (“7-bit coded character set” 1991).

It maps a set of common symbols, the numerical digits, and the Latin uppercase and lowercase letters.

It is often used in an extended 8-bit format, standardized by ISO/IEC 8859 and ECMA-94 (“8-bit single-byte coded graphic character sets - Latin alphabets No. 1 to No. 4” 1986). The 8-bit versions mainly add accented letters and additional symbols.

6.1.1.2 Unicode

Unicode (“Unicode Specification,” n.d.) is a standard that assigns a unique code to every character in any language. It has several parts: the Core Specification gives the general principles, the Code Charts show representative glyphs for all the Unicode characters, the Annexes supply detailed normative information, and the Character Database provides normative and informative data for implementers.

The basic concepts defined by the standard are:

Character: the abstract concept e.g. LATIN SMALL LETTER I
Glyph: the graphical representation of a character, e.g. the same letter i drawn in different typefaces
Font: a collection of glyphs
Codepoint: the numeric representation of a character
- represented with U+ followed by the hexadecimal code e.g. U+0069 for 'i'
- included in the range U+0000 to U+10FFFF (21 bits)
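
The relation between a character and its codepoint can be inspected directly in Java. The sketch below is only an illustration: the class name CodePointDemo and the helper toUPlus are hypothetical, while codePointAt, Character.getName and Character.MAX_CODE_POINT belong to the standard library.

```java
class CodePointDemo {
    // Hypothetical helper: formats a codepoint in the U+XXXX notation.
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        int cp = "i".codePointAt(0);
        System.out.println(toUPlus(cp));                        // U+0069
        System.out.println(Character.getName(cp));              // LATIN SMALL LETTER I
        System.out.println(toUPlus(Character.MAX_CODE_POINT));  // U+10FFFF
    }
}
```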

The encoding is the mapping from a code point to a byte sequence; decoding is the inverse. The Unicode standard defines several alternative encodings:

UTF-32 is a fixed-width encoding that uses 32 bits per code point; since a code point needs at most 21 bits, and often just 8, it has a large memory overhead
UTF-16 is a variable-width encoding; it represents:
- codepoints from U+0000 to U+D7FF and from U+E000 to U+FFFF on 16 bits (2 bytes), and
- codepoints from U+10000 to U+10FFFF on 32 bits (4 bytes), as a pair of 16-bit surrogates taken from the reserved range U+D800 to U+DFFF
UTF-8 is a variable-width encoding; it represents:
- codepoints U+0000 to U+007F as single bytes, i.e. it is ASCII transparent,
- the remaining code points as multi-byte sequences in which every byte has the high bit (0x80) set. Most non-ideographic codepoints take 1 or 2 bytes, e.g. U+00E8, the character ‘è’, is mapped to two bytes: 0xC3 0xA8.
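
The byte sequences produced by the different encodings can be observed with String.getBytes(). The following is a minimal sketch: the class EncodingDemo and the helper hex are hypothetical, and the codepoint U+1F600 is chosen here only as an example of a character outside the 16-bit range.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hypothetical helper: renders a byte array as space-separated hexadecimal values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String e = "\u00E8"; // the character 'è'
        System.out.println(hex(e.getBytes(StandardCharsets.UTF_8)));       // 0xC3 0xA8
        System.out.println(hex(e.getBytes(StandardCharsets.UTF_16BE)));    // 0x00 0xE8

        // A codepoint above U+FFFF needs a surrogate pair in UTF-16 and 4 bytes in UTF-8.
        String face = new String(Character.toChars(0x1F600));
        System.out.println(hex(face.getBytes(StandardCharsets.UTF_16BE))); // 0xD8 0x3D 0xDE 0x00
        System.out.println(hex(face.getBytes(StandardCharsets.UTF_8)));    // 0xF0 0x9F 0x98 0x80
    }
}
```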

6.1.2 Class Charset

Class Charset allows handling different charsets and encodings in addition to the default Unicode; charsets are used when reading and writing text.

It provides a few static methods to manage the charsets available in the system:

defaultCharset(): returns the object corresponding to the default charset
forName(..): returns the corresponding charset
availableCharsets(): returns a map of all charsets by name
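
A possible use of these static methods is sketched below. The class CharsetLookupDemo and the helper isAvailable are hypothetical; the output of defaultCharset() depends on the platform and JVM configuration, so no expected value is shown for it.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

class CharsetLookupDemo {
    // Hypothetical helper: tells whether the running JVM provides a charset with this canonical name.
    static boolean isAvailable(String name) {
        SortedMap<String, Charset> all = Charset.availableCharsets();
        return all.containsKey(name);
    }

    public static void main(String[] args) {
        // Platform dependent value
        System.out.println(Charset.defaultCharset());

        // Look up a charset by name and print its canonical name
        Charset latin1 = Charset.forName("ISO-8859-1");
        System.out.println(latin1.name());        // ISO-8859-1

        System.out.println(isAvailable("UTF-8")); // true
    }
}
```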

The predefined charsets available in any Java installation are:

US-ASCII: 7-bit ASCII, a.k.a. ISO646-US, ECMA-6 (“7-bit coded character set” 1991)
ISO-8859-1: 8-bit single-byte ISO Latin No. 1, a.k.a. ISO-LATIN-1 (“8-bit single-byte coded graphic character sets - Latin alphabets No. 1 to No. 4” 1986)
UTF-8: 8-bit multi-byte UCS Transformation Format
UTF-16BE: 16-bit UCS Transformation Format, big-endian
UTF-16LE: 16-bit UCS Transformation Format, little-endian
UTF-16: 16-bit UCS Transformation Format, with byte-order mark

The Charset class provides two main methods:

ByteBuffer encode(CharBuffer): encodes a sequence of chars into a sequence of bytes
CharBuffer decode(ByteBuffer): decodes a sequence of bytes into a sequence of chars
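
The two methods can be combined into a round trip, as in the following sketch. The class EncodeDecodeDemo and its helpers roundTrip and encodedLength are hypothetical names used only for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDecodeDemo {
    // Hypothetical helper: encodes the text with the given charset and decodes it back.
    static String roundTrip(String text, Charset charset) {
        ByteBuffer bytes = charset.encode(CharBuffer.wrap(text));
        CharBuffer chars = charset.decode(bytes);
        return chars.toString();
    }

    // Hypothetical helper: number of bytes produced by encoding the text.
    static int encodedLength(String text, Charset charset) {
        return charset.encode(CharBuffer.wrap(text)).remaining();
    }

    public static void main(String[] args) {
        String text = "caf\u00E8"; // "cafè"
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));          // 5
        System.out.println(encodedLength(text, StandardCharsets.ISO_8859_1));     // 4
    }
}
```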

The encoding and decoding are performed by encoder and decoder objects, which can be created through factory methods:

newDecoder()
newEncoder()

The decoder and encoder objects have an internal state, e.g. awaiting next byte of a multi-byte representation.

Using a decoding scheme to decode a string encoded with a different scheme may lead to an encoding mismatch. For instance, character ‘è’ has Unicode codepoint U+00E8, which UTF-8 maps to two bytes: 0xC3 0xA8; ISO-8859-1 decoding interprets this sequence as two distinct characters, ‘Ã¨’. Conversely, ‘è’ in ISO-8859-1 is represented as the single byte 0xE8, which on its own is an invalid sequence in UTF-8 (usually rendered as the replacement character �).
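
The mismatch can be reproduced with a few lines of code. This is a sketch under the assumption that strings are decoded with the String(byte[], Charset) constructor, which replaces malformed input with the replacement character; the class MismatchDemo and its methods are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class MismatchDemo {
    // Encodes with UTF-8 but (wrongly) decodes with ISO-8859-1.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encodes with ISO-8859-1 but (wrongly) decodes with UTF-8.
    static String latin1ReadAsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String e = "\u00E8"; // the character 'è'
        // Two characters, U+00C3 and U+00A8, instead of one
        System.out.println(utf8ReadAsLatin1(e).length());            // 2
        // The lone byte 0xE8 is malformed UTF-8 and becomes U+FFFD
        System.out.println(latin1ReadAsUtf8(e).equals("\uFFFD"));    // true
    }
}
```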

6.2 Strings

There is no primitive string type; instead, three classes represent strings:

Class String: the immutable, i.e. non-modifiable, version
Classes StringBuffer and StringBuilder: the mutable, i.e. modifiable, versions

String s = new String("literal");
StringBuffer sbuf = new StringBuffer("literal");
StringBuilder sb = new StringBuilder("literal");

6.2.1 Class String

Java overloads the operator + for strings. It is used to concatenate two strings, e.g. "This is " + "a concatenation". It is important to remember that strings are immutable; therefore applying the operator + creates a new string object holding the result of the concatenation.

Operator + also works with other types: everything is automatically converted to a string representation, using the toString() method for objects or the default representation for primitive types:

System.out.println("pi = " + 3.14);
System.out.println("x = " + x);
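
The automatic conversion can be sketched as follows. The class ConcatDemo, its nested class Point, and the method describe are hypothetical and serve only to show that + relies on toString(); the last two lines illustrate that + is evaluated left to right, so the position of the first string operand matters.

```java
class ConcatDemo {
    // Hypothetical class, defined only to show that + calls toString().
    static class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public String toString() {
            return "(" + x + ", " + y + ")";
        }
    }

    static String describe(Object o) {
        return "value = " + o; // o.toString() is invoked implicitly
    }

    public static void main(String[] args) {
        System.out.println(describe(new Point(1, 2))); // value = (1, 2)
        System.out.println("pi = " + 3.14);            // pi = 3.14
        System.out.println("sum = " + 1 + 2);          // sum = 12
        System.out.println(1 + 2 + " = sum");          // 3 = sum
    }
}
```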

The main string methods are:

int length(): returns the string length
boolean equals(Object o): compares the contents of two strings
String toUpperCase(): converts the string to upper case
String toLowerCase(): converts the string to lower case
String concat(String str): creates a concatenation with the given string
int compareTo(String str): compares to another string, returning:
- < 0 : if this string precedes the other
- == 0 : if this string equals the other
- > 0 : if this string follows the other
String substring(int startIndex): returns the substring from startIndex to the end, e.g. "Human".substring(2) -> "man"
String substring(int start, int end): start included, end excluded, e.g. "Greatest".substring(0,5) -> "Great"
int indexOf(String str): returns the index of the first occurrence of str, or -1 if there is none
int lastIndexOf(String str): returns the index of the last occurrence of str, or -1 if there is none
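
A few of these methods are combined in the sketch below. The class StringMethodsDemo and the helper extension are hypothetical and are included only to show the methods working together.

```java
class StringMethodsDemo {
    // Hypothetical helper: returns the lower-case file extension, or "" if there is none.
    static String extension(String fileName) {
        int dot = fileName.lastIndexOf(".");
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase();
    }

    public static void main(String[] args) {
        System.out.println("Hello".length());                   // 5
        System.out.println("Hello".concat(" World"));           // Hello World
        System.out.println("Human".substring(2));               // man
        System.out.println("Greatest".substring(0, 5));         // Great
        System.out.println("Hello".indexOf("l"));               // 2
        System.out.println("Hello".lastIndexOf("l"));           // 3
        System.out.println("apple".compareTo("banana") < 0);    // true
        System.out.println(extension("Report.final.PDF"));      // pdf
    }
}
```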

Example:

String h = "Hello";
String w = "World";
String hw = "Hello World";
String h_w = h + " " + w;
hw.equals(h_w)  // -> true
hw == h_w       // -> false

In addition, String provides the static method:

String valueOf(..): converts a value into a String; overloads are defined for all primitive types.

6.2.2 Formatting

It is possible to use a format syntax similar to C printf() through two alternatives:

static String format(String fmt, ...): a static method that builds a string using the given format string
String formatted(...): an instance method, available since Java 15, that uses the string it is called on as the format string

Example formatting:

String answer = String.format("%d", 42);
answer = "%d".formatted(42);

Format essentials:

%[arg_index$][flags][width][.prec]conversion

arguments are taken in order unless arg_index is provided; arg_index starts at 1
flags can be:
- -: left justified
- +: include sign
- 0: 0 padding
- (: Negative in parenthesis
width indicates the minimum width
prec defines the maximum width for strings or the number of decimal digits for floating-point values
conversion can be:
- b boolean
- s string
- d integer
- f decimal
- e scientific
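
The format elements above can be tried out with the following sketch. The class FormatDemo and the helper fmt are hypothetical; the helper passes Locale.ROOT explicitly, an assumption made here so that the decimal separator does not depend on the default locale of the machine.

```java
import java.util.Locale;

class FormatDemo {
    // Hypothetical helper: formats with a fixed locale for reproducible output.
    static String fmt(String pattern, Object... args) {
        return String.format(Locale.ROOT, pattern, args);
    }

    public static void main(String[] args) {
        System.out.println(fmt("%-8s|%8.2f", "tea", 3.5)); // tea     |    3.50
        System.out.println(fmt("%05d", 42));               // 00042
        System.out.println(fmt("%+d", 42));                // +42
        System.out.println(fmt("%(d", -42));               // (42)
        System.out.println(fmt("%2$s %1$s", "a", "b"));    // b a
        System.out.println(fmt("%.3s", "abcdef"));         // abc
        System.out.println(fmt("%b", true));               // true
        System.out.println(fmt("%e", 12345.678));          // 1.234568e+04
    }
}
```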

6.2.3 StringBuilder and StringBuffer

The classes StringBuilder and StringBuffer are method-level compatible. They represent a mutable string of characters and provide operations that modify its content. Both can be converted to the corresponding String using the method toString(). The difference is that StringBuilder, unlike StringBuffer, is not thread safe; this makes it more efficient, i.e. ~30% faster.

The main methods are:

append(String str): appends str at the end of the string
insert(int offset, String str): inserts str starting at position offset
delete(int start, int end): deletes the characters from start (included) to end (excluded)
reverse(): reverses the sequence of characters

They all return a StringBuffer/StringBuilder enabling method chaining.
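
Method chaining can be sketched as follows; the class BuilderDemo and the method build are hypothetical, and the comments show the content of the builder after each step.

```java
class BuilderDemo {
    static String build() {
        return new StringBuilder("World")
                .insert(0, "Hello ")  // "Hello World"
                .append("!")          // "Hello World!"
                .delete(0, 6)         // "World!"
                .reverse()            // "!dlroW"
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(build()); // !dlroW
    }
}
```

Since the two classes are method-level compatible, replacing StringBuilder with StringBuffer in this sketch should give the same result.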

6.2.4 Performance

The three alternative representations of strings exhibit very different performance behaviors.

Let us consider a very simple use case where many concatenations have to be performed to build the final string.

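// Version 1: String and operator +
String s = "";
for (int i = 0; i < N; ++i) {
    s += i;
}

// Version 2: StringBuffer (thread safe)
StringBuffer sbuf = new StringBuffer();
for (int i = 0; i < N; ++i) {
    sbuf.append(i);
}

// Version 3: StringBuilder (not thread safe)
StringBuilder sb = new StringBuilder();
for (int i = 0; i < N; ++i) {
    sb.append(i);
}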

The three code fragments above, executed with N=100000, yield the following elapsed times and memory usage:

Version	Elapsed time	Memory used
String	1.3 s	500.0 MB
StringBuffer	2.9 ms	1.2 MB
StringBuilder	2.2 ms	0.8 MB

In addition to time performance, it is important to remember that + instantiates a new object on each concatenation; thus memory performance is also significantly worse.
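
A rough way to reproduce this kind of measurement is sketched below. It is not the harness used to obtain the figures above: the class ConcatBenchmark and its methods are hypothetical, a single timed run with System.nanoTime() ignores JIT warm-up and garbage collection, and the absolute numbers will differ across machines and JVM versions.

```java
import java.util.function.IntFunction;

class ConcatBenchmark {
    static String withString(int n) {
        String s = "";
        for (int i = 0; i < n; ++i) {
            s += i; // creates a new String object at every iteration
        }
        return s;
    }

    static String withBuffer(int n) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < n; ++i) {
            sb.append(i);
        }
        return sb.toString();
    }

    static String withBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; ++i) {
            sb.append(i);
        }
        return sb.toString();
    }

    // Returns the elapsed time of a single run, in milliseconds.
    static double elapsedMillis(IntFunction<String> version, int n) {
        long start = System.nanoTime();
        version.apply(n);
        return (System.nanoTime() - start) / 1e6;
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.out.println("String:        " + elapsedMillis(ConcatBenchmark::withString, n) + " ms");
        System.out.println("StringBuffer:  " + elapsedMillis(ConcatBenchmark::withBuffer, n) + " ms");
        System.out.println("StringBuilder: " + elapsedMillis(ConcatBenchmark::withBuilder, n) + " ms");
    }
}
```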

The huge advantage in using String is that it leads to much simpler code, faster to write and easier to understand.

Note

As general programming advice: start writing your string manipulation using String and operator +, since this makes the code faster to write and easier to understand. Later, if the code has relevant performance issues, refactor it to use StringBuilder or StringBuffer, which are method-compatible with each other. Use the latter only if thread safety is required.

6.2.5 String Pooling

Class String maintains a private static pool of distinct strings. The pool is managed through the method intern() which, when called:

checks if any string in the pool equals() the string it is called on
if it finds one, that string is returned
otherwise it adds the string to the pool and returns it

For each string literal, the effect is that of a call to intern(), so that a single copy of the string with that specific content is kept. This process is called string interning. In practice, the code:

String ss1 = "Hello!";

Behaves as if it were written as:

String ss1 = (new String(new char[]{'H', 'e', 'l', 'l', 'o', '!'}) ).intern();

On the first occurrence of a literal, the string is created and added to the pool. Upon later occurrences of the same literal, intern() returns a reference to the single copy already in the pool.
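
The effect of the pool can be observed by comparing references with ==, as in the sketch below. The class InternDemo and its methods are hypothetical; comparing strings with == is done here only to inspect object identity and is not a recommended way to compare contents.

```java
class InternDemo {
    // Two occurrences of the same literal refer to the same pooled object.
    static boolean literalsShared() {
        String a = "Hello!";
        String b = "Hello!";
        return a == b;
    }

    // A string built at run time is a distinct object until it is interned.
    static boolean builtIsDistinct() {
        String built = new String(new char[]{'H', 'e', 'l', 'l', 'o', '!'});
        return built != "Hello!";
    }

    static boolean internedMatchesLiteral() {
        String built = new String(new char[]{'H', 'e', 'l', 'l', 'o', '!'});
        return built.intern() == "Hello!";
    }

    public static void main(String[] args) {
        System.out.println(literalsShared());         // true
        System.out.println(builtIsDistinct());        // true
        System.out.println(internedMatchesLiteral()); // true
    }
}
```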

6.3 Wrap-up

Java characters are stored internally using a 16-bit Unicode encoding (UTF-16)
Conversion to/from streams of bytes is managed by Charset objects
String is an immutable representation of strings
StringBuilder and StringBuffer are mutable alternatives, significantly more efficient for string manipulation