6 Characters and Strings
Character and string literals follow the C syntax; special characters are escaped with a backslash (\):
Chars:
'a'
'%'
'\n'
Strings:
"test"
"line with new-line\n"
"""
Multi line
text block
"""
Multiline text blocks were introduced in Java 15.
6.1 Characters
Wrapper class Character encapsulates a single character. It is immutable like all wrapper classes. It provides a set of utility methods for characters:
- isLetter()
- isDigit()
- isSpaceChar()
- toUpperCase()
- toLowerCase()
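As a minimal sketch of how these utilities are used: in the standard library they are static methods of Character that take the character as an argument. The class name CharacterDemo and the helper describe() are hypothetical names chosen for this illustration.

```java
class CharacterDemo {

    // Builds a one-line description of a character using the static
    // utility methods of the Character wrapper class.
    static String describe(char c) {
        return "letter=" + Character.isLetter(c)
             + " digit=" + Character.isDigit(c)
             + " space=" + Character.isSpaceChar(c)
             + " upper=" + Character.toUpperCase(c)
             + " lower=" + Character.toLowerCase(c);
    }

    public static void main(String[] args) {
        System.out.println(describe('a')); // letter=true digit=false space=false upper=A lower=a
        System.out.println(describe('7')); // letter=false digit=true space=false upper=7 lower=7
        System.out.println(describe(' ')); // only space=true
    }
}
```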
6.1.1 Character sets
A character set defines a set of characters, each understood as an abstract idea (e.g., the lowercase letter i of the Latin alphabet), together with the corresponding bit (and byte sequence) representation (e.g. \(\mathtt{01101001}_2\) or 0x69).
6.1.1.1 ASCII
ASCII (American Standard Code for Information Interchange) is the most widely adopted character set. The original version uses a 7-bit mapping and is also standardized as ECMA-6 (“7-bit coded character set” 1991).
It maps a set of common symbols, the numerical digits, and the Latin uppercase and lowercase letters.
It is often used in an extended 8-bit format, standardized by ISO/IEC 8859 and ECMA-94 (“8-bit single-byte coded graphic character sets - Latin alphabets No. 1 to No. 4” 1986). The 8-bit versions mainly add accented letters and additional symbols.
6.1.1.2 Unicode
Unicode (“Unicode Specification,” n.d.) is a standard that assigns a unique code to every character in any language. It has several parts: the Core Specification gives the general principles, the Code Charts show representative glyphs for all the Unicode characters, the Annexes supply detailed normative information, and the Character Database provides normative and informative data for implementers.
The basic concepts defined by the standard are:
- Character: the abstract concept, e.g. LATIN SMALL LETTER I
- Glyph: the graphical representation of a character, e.g. i
- Font: a collection of glyphs
- Codepoint: the numeric representation of a character
  - represented with U+ followed by the hexadecimal code, e.g. U+0069 for 'i'
  - included in the range U+0000 to U+10FFFF (21 bits)
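The relation between a character, its name, and its code point can be inspected from Java. The following is a small sketch; the class CodePointDemo and the helper toUPlus() are hypothetical names introduced only for this example.

```java
class CodePointDemo {

    // Formats a code point in the U+XXXX notation used by the Unicode standard.
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        int cp = "i".codePointAt(0);                               // numeric code point of 'i'
        System.out.println(toUPlus(cp));                           // U+0069
        System.out.println(Character.getName(cp));                 // LATIN SMALL LETTER I
        System.out.println(toUPlus(Character.MAX_CODE_POINT));     // U+10FFFF
    }
}
```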
Encoding is the mapping from a code point to a byte sequence; decoding is the inverse. The Unicode standard defines several alternative encoding (and decoding) options:
- UTF-32 is a fixed-width encoding that uses 32 bits per code point; since a code point needs at most 21 bits, and often just 8, it has a large memory overhead
- UTF-16 is a variable-width encoding; it represents:
  - codepoints from U+0000 to U+D7FF and from U+E000 to U+FFFF on 16 bits (2 bytes), and
  - codepoints from U+10000 to U+10FFFF on 32 bits (4 bytes)
- UTF-8 is a variable-width encoding:
  - codepoints U+0000 to U+007F are mapped directly to single bytes, i.e. it is ASCII transparent,
  - for the remaining code points, the high bit (0x80) marks the bytes of a multi-byte character. Most non-ideographic codepoints are represented on 1 or 2 bytes, e.g. U+00E8, representing character ‘è’, is mapped to two bytes: 0xC3 0xA8.
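The different widths of the encodings can be observed by encoding a few code points and printing the resulting bytes. This is a sketch under the assumption that the standard charsets in StandardCharsets are used; EncodingDemo and hex() are hypothetical names, and U+1F600 is just one example of a code point above U+FFFF.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {

    // Encodes the text with the given charset and renders the bytes in hexadecimal.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eGrave = "\u00E8";                                 // U+00E8
        String beyondBmp = new String(Character.toChars(0x1F600)); // above U+FFFF

        System.out.println(hex("i", StandardCharsets.UTF_8));          // 69
        System.out.println(hex(eGrave, StandardCharsets.UTF_8));       // C3 A8
        System.out.println(hex(eGrave, StandardCharsets.UTF_16BE));    // 00 E8
        System.out.println(hex(beyondBmp, StandardCharsets.UTF_16BE)); // D8 3D DE 00
        System.out.println(hex(beyondBmp, StandardCharsets.UTF_8));    // F0 9F 98 80
    }
}
```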
6.1.2 Class Charset
Class Charset allows handling charsets and encodings other than the default Unicode one; charsets are used when reading and writing text.
It provides a few static methods to manage the charsets available in the system:
- defaultCharset(): returns the object corresponding to the default charset
- forName(..): returns the charset with the given name
- availableCharsets(): returns a map of all charsets by name
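A short sketch of the three static methods in use. The class CharsetLookupDemo and the helper canonicalName() are hypothetical; the printed values depend on the platform, so no output is shown.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

class CharsetLookupDemo {

    // Looks a charset up by name (names are case-insensitive) and
    // returns its canonical name.
    static String canonicalName(String name) {
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        Charset platformDefault = Charset.defaultCharset();
        SortedMap<String, Charset> all = Charset.availableCharsets();

        System.out.println("default charset: " + platformDefault.name());
        System.out.println("canonical name of utf-8: " + canonicalName("utf-8"));
        System.out.println("number of available charsets: " + all.size());
    }
}
```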
The predefined charsets available in any Java installation are:
- US-ASCII: 7-bit ASCII, a.k.a. ISO646-US, ECMA-6 (“7-bit coded character set” 1991)
- ISO-8859-1: 8-bit single byte ISO Latin No. 1, a.k.a. ISO-LATIN-1 (“8-bit single-byte coded graphic character sets - Latin alphabets No. 1 to No. 4” 1986)
- UTF-8: 8-bit multi byte UCS Transformation Format
- UTF-16BE: 16-bit UCS Transformation Format, big-endian
- UTF-16LE: 16-bit UCS Transformation Format, little-endian
- UTF-16: 16-bit UCS Transformation Format, with byte-order mark
The Charset class provides two main methods:
- ByteBuffer encode(CharBuffer): encodes a sequence of chars into a sequence of bytes
- CharBuffer decode(ByteBuffer): decodes a sequence of bytes into a sequence of chars
Encoding and decoding are performed through encoder and decoder objects that can be created through factory methods:
- newDecoder()
- newEncoder()
The decoder and encoder objects have an internal state, e.g. awaiting next byte of a multi-byte representation.
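The sketch below shows the two convenience methods of Charset and an explicitly created decoder. It relies on the documented default behaviour of a new CharsetDecoder, which reports malformed input with an exception, whereas Charset.decode() replaces it. The names EncodeDecodeDemo, roundTrip() and isWellFormed() are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

class EncodeDecodeDemo {

    // Encodes and then decodes the text with the same charset.
    static String roundTrip(String text, Charset charset) {
        ByteBuffer bytes = charset.encode(CharBuffer.wrap(text));
        CharBuffer chars = charset.decode(bytes);
        return chars.toString();
    }

    // Uses a fresh decoder (with fresh internal state) to check whether
    // the bytes form a valid sequence in the given charset.
    static boolean isWellFormed(byte[] data, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder();
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("perch\u00E8", StandardCharsets.UTF_8));
        // 0xC3 0xA8 is a complete two-byte UTF-8 sequence
        System.out.println(isWellFormed(new byte[] {(byte) 0xC3, (byte) 0xA8}, StandardCharsets.UTF_8));
        // 0xE8 alone starts a multi-byte sequence that never completes
        System.out.println(isWellFormed(new byte[] {(byte) 0xE8}, StandardCharsets.UTF_8));
    }
}
```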
Decoding a byte sequence with a scheme different from the one used to encode it leads to an encoding mismatch. For instance, character ‘è’ has Unicode codepoint U+00E8, which UTF-8 maps to two bytes: 0xC3 0xA8, while ISO-8859-1 decoding interprets this sequence as two distinct characters, ‘Ã¨’. Vice versa, ‘è’ in ISO-8859-1 is represented as 0xE8, which on its own is an invalid byte sequence in UTF-8 (usually rendered as the replacement character �).
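The mismatch can be reproduced in a few lines by encoding with one charset and decoding with the other. This is only an illustrative sketch; MismatchDemo and its two helpers are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class MismatchDemo {

    // Encodes with UTF-8 but decodes the bytes as ISO-8859-1.
    static String utf8ReadAsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encodes with ISO-8859-1 but decodes the bytes as UTF-8.
    static String latin1ReadAsUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String eGrave = "\u00E8";
        // Two characters, U+00C3 and U+00A8, instead of one
        System.out.println(utf8ReadAsLatin1(eGrave));
        // The invalid byte is replaced by U+FFFD, the replacement character
        System.out.println(latin1ReadAsUtf8(eGrave));
    }
}
```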
6.2 Strings
There is no primitive string type; three classes represent strings:
- Class String, the immutable, non-modifiable version
- Classes StringBuffer and StringBuilder, the mutable, modifiable versions
String s = new String("literal");
StringBuffer sb = new StringBuffer("literal");
6.2.1 Class String
Java overloads the operator + for strings. It is used to concatenate two strings, e.g. "This is " + "a concatenation". It is important to remember that strings are immutable, therefore applying the operator + creates a new string object with the result of the concatenation.
Operator + also works with other types: every operand is automatically converted to a string representation, using the toString() method of objects or the default representation of primitive types:
System.out.println("pi = " + 3.14);
System.out.println("x = " + x);
The main string methods are:
- int length(): returns the string length
- boolean equals(Object o): compares the contents of two strings
- String toUpperCase(): converts the string to upper case
- String toLowerCase(): converts the string to lower case
- String concat(String str): creates a concatenation with the given string
- int compareTo(String str): compares to another string, returning:
  - < 0 : if this string precedes the other
  - == 0 : if this string equals the other
  - > 0 : if this string follows the other
- String substring(int startIndex): returns the substring from startIndex to the end, e.g. "Human".substring(2) -> "man"
- String substring(int start, int end): char at start included, char at end excluded, e.g. "Greatest".substring(0,5) -> "Great"
- int indexOf(String str): returns the index of the first occurrence of str
- int lastIndexOf(String str): returns the index of the last occurrence of str
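A few of these methods combined in a small sketch. StringMethodsDemo and capitalize() are hypothetical names used only for illustration; note that every call returns a new String and leaves the original untouched.

```java
class StringMethodsDemo {

    // Returns the text with the first letter in upper case and the
    // remaining letters in lower case.
    static String capitalize(String s) {
        if (s.isEmpty()) {
            return s;
        }
        return s.substring(0, 1).toUpperCase() + s.substring(1).toLowerCase();
    }

    public static void main(String[] args) {
        String s = "Greatest";
        System.out.println(s.length());                       // 8
        System.out.println(s.substring(0, 5));                // Great
        System.out.println(s.indexOf("t"));                   // 4
        System.out.println(s.lastIndexOf("t"));               // 7
        System.out.println("apple".compareTo("banana") < 0);  // true
        System.out.println("Hello".concat(" World"));         // Hello World
        System.out.println(capitalize("hELLO"));              // Hello
    }
}
```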
Example:
String h = "Hello";
String w = "World";
String hw = "Hello World";
String h_w = h + " " + w;
hw.equals(h_w) // -> true
hw == h_w // -> false
In addition, String provides the static method:
- String valueOf(..): converts a value of any primitive type into a String; overloads are defined for all primitive types.
6.2.2 Formatting
It is possible to use a format syntax similar to C printf() through two alternatives:
- static String format(String fmt, ...): a static method that builds a string using the format string fmt
- String formatted(…): an instance method that uses the string it is called on as the format (available since Java 15)
Example formatting:
String answer = String.format("%d", 42);
answer = "%d".formatted(42);
Format essentials:
%[arg_index$][flags][width][.prec]conversion
- arguments are positional unless arg_index is provided; it starts at 1
- flags can be:
  - '-': left justified
  - '+': include sign
  - '0': zero padding
  - '(': negative numbers in parentheses
- width indicates the min width
- prec defines the max width or the number of decimal digits for floats
- conversion can be:
  - b: boolean
  - s: string
  - d: integer
  - f: decimal
  - e: scientific
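The format elements listed above can be tried out as follows. This sketch passes Locale.ROOT explicitly, an assumption made here so that the decimal separator does not depend on the platform locale; FormatDemo and fmt() are hypothetical names.

```java
import java.util.Locale;

class FormatDemo {

    // Thin wrapper around String.format with a fixed locale, so the
    // output is the same on every platform.
    static String fmt(String pattern, Object... args) {
        return String.format(Locale.ROOT, pattern, args);
    }

    public static void main(String[] args) {
        System.out.println(fmt("%08.3f", 3.14159));               // 0003.142
        System.out.println(fmt("%-5d|", 42));                     // 42   |
        System.out.println(fmt("%+d", 42));                       // +42
        System.out.println(fmt("%(d", -42));                      // (42)
        System.out.println(fmt("%2$s %1$s", "World", "Hello"));   // Hello World
        System.out.println(fmt("%.3s", "Greatest"));              // Gre
        System.out.println(fmt("%b", true));                      // true
        System.out.println(fmt("%e", 12345.678));                 // 1.234568e+04
    }
}
```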
6.2.3 StringBuilder and StringBuffer
The classes StringBuilder and StringBuffer are method-level compatible. They represent a mutable string of characters and provide operations that modify the content. They can be converted to the corresponding String using the method toString(). The difference is that StringBuilder is neither thread-safe nor reentrant, which makes it more efficient, i.e. roughly 30% faster.
The main methods are:
- append(String str): appends str at the end of the string
- insert(int offset, String str): inserts str starting from the offset position
- delete(int start, int end): deletes the characters from start to end (excluded)
- reverse(): reverses the sequence of characters
They all return the StringBuffer/StringBuilder itself, enabling method chaining.
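Method chaining with the four methods above might look like the following sketch (BuilderDemo and build() are hypothetical names). The same code compiles unchanged if StringBuilder is replaced by StringBuffer.

```java
class BuilderDemo {

    // Applies insert, append, delete and reverse in a single chain;
    // each call modifies the same builder and returns it.
    static String build() {
        StringBuilder sb = new StringBuilder("World");
        sb.insert(0, "Hello ")   // Hello World
          .append("!!")          // Hello World!!
          .delete(11, 12)        // Hello World!
          .reverse();            // !dlroW olleH
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```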
6.2.4 Performance
The three alternative representations of strings exhibit very different performance behaviors.
Let us consider a very simple use case where many concatenations have to be performed to build the final string.
Using String:
String s = "";
for (int i = 0; i < N; ++i) {
    s += i;
}
Using StringBuffer:
StringBuffer sb = new StringBuffer();
for (int i = 0; i < N; ++i) {
    sb.append(i);
}
Using StringBuilder:
StringBuilder sb = new StringBuilder();
for (int i = 0; i < N; ++i) {
    sb.append(i);
}
The three code fragments above, executed with N=100000, yield the following elapsed times:
| Version | Elapsed time | Memory used |
|---|---|---|
| String | 1.3 s | 500.0 MB |
| StringBuffer | 2.9 ms | 1.2 MB |
| StringBuilder | 2.2 ms | 0.8 MB |
In addition to the time performance, it is important to remember that + instantiates a new object on each concatenation, so memory performance is also significantly worse.
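A rough way to observe the difference is to time the three variants with System.nanoTime(). This is only a sketch, not the measurement setup behind the table: ConcatBenchmark and its methods are hypothetical names, the value of n is an arbitrary choice, and a single un-warmed run gives indicative numbers that vary between machines and JVMs.

```java
class ConcatBenchmark {

    static String withString(int n) {
        String s = "";
        for (int i = 0; i < n; ++i) {
            s += i;               // creates a new String at every iteration
        }
        return s;
    }

    static String withBuffer(int n) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < n; ++i) {
            sb.append(i);         // modifies the buffer in place (synchronized)
        }
        return sb.toString();
    }

    static String withBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; ++i) {
            sb.append(i);         // modifies the builder in place
        }
        return sb.toString();
    }

    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    static long millis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        final int n = 20_000;
        System.out.println("String:        " + millis(() -> withString(n)) + " ms");
        System.out.println("StringBuffer:  " + millis(() -> withBuffer(n)) + " ms");
        System.out.println("StringBuilder: " + millis(() -> withBuilder(n)) + " ms");
    }
}
```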
The huge advantage in using String is that it leads to much simpler code, faster to write and easier to understand.
As general programming advice: start writing your string manipulation using String and operator +; this makes the code faster to write and easier to understand. Later, if the code has relevant performance issues, refactor it to use StringBuilder or StringBuffer, which are method-compatible with each other. Use the latter only if thread safety is required.
6.2.5 String Pooling
Class String maintains a private static pool of distinct strings. The pool is accessed through the method intern() which, when called on a string:
- checks if any string in the pool equals() that string
- if it finds one, that pooled string is returned
- otherwise it adds the string to the pool and returns it
String literals are interned automatically, so that a single copy of the string with that specific content is kept. This process is called string interning. In practice, the code:
String ss1 = "Hello!";
behaves like:
String ss1 = (new String(new char[]{'H', 'e', 'l', 'l', 'o', '!'})).intern();
On the first occurrence of a literal, the string is created and added to the pool. Upon later occurrences of the same literal, a reference to the single copy already in the pool is returned.
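The effect of pooling is visible when references are compared with ==. The following sketch uses the hypothetical names InternDemo and compare() and returns the outcome of four comparisons.

```java
class InternDemo {

    // Compares references (==) and contents (equals) of literal and
    // explicitly constructed strings.
    static boolean[] compare() {
        String literal1 = "Hello!";
        String literal2 = "Hello!";
        String built = new String(new char[] {'H', 'e', 'l', 'l', 'o', '!'});
        return new boolean[] {
            literal1 == literal2,        // true: both refer to the pooled copy
            literal1 == built,           // false: distinct object, same content
            literal1 == built.intern(),  // true: intern() returns the pooled copy
            literal1.equals(built)       // true: contents are equal
        };
    }

    public static void main(String[] args) {
        for (boolean result : compare()) {
            System.out.println(result);
        }
    }
}
```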
6.3 Wrap-up
- Java characters are stored internally using a 16-bit Unicode encoding (UTF-16)
- Conversion to/from streams of bytes is managed by Charset objects
- String is an immutable representation of strings
- StringBuilder and StringBuffer are mutable alternatives, significantly more efficient for string manipulation