Java 9: Compact Strings

Java 9 brings a new, improved string, which in most cases, will greatly reduce String memory consumption.

String memory consumption

The value of each String is internally contained in a char[] array. Each character is two bytes, sixteen bits. As this is UTF-16, it allows even representation of all the special characters. The problem is, that the vast majority of the strings in applications can be expressed by just one byte using ISO-8859-1/Latin-1 as they contain no special characters. If such strings could be represented with just one byte per character, that would mean only half of the memory would be used to store the String's characters. Of course, that does not mean that the whole memory consumed by a String object is now only 50% of the original. Not all the memory allocated by strings is used to store the characters. There is some additional data such as cached hash code or padding.

String instances are stored on the heap. Quite a big portion of the heap memory is actually consumed by Strings. According to some studies, Strings usually occupy as much as 25% of the heap memory. Making String twice as small would mean not only a significant memory consumption reduction but also a substantial reduction of Garbage Collection overhead.

Java 6 Compressed Strings

The string memory consumption issue is not new. It has been discussed for quite some time already. In fact, in Java 6, a new feature was introduced to address this issue - Compressed Strings.

The idea was - instead of using char[] array for the internal representation an Object could be used. If necessary, two bytes per character would still be used assigning char[] array to that object. If not, one byte per character is sufficient and byte[] array can be used.

This was an optional, experimental feature, which could be enabled on demand using a -XX flag. However, there were various issues with this feature, which Aleksey Shipilev expressed in his Q&A for InfoQ:

UseCompressedStrings feature was rather conservative: while distinguishing between char[] and byte[] case, and trying to compress the char[] into byte[] on String construction, it done most String operations on char[], which required to unpack the String. Therefore, it benefited only a special type of workloads, where most strings are compressible (so compression does not go to waste), and only a limited amount of known String operations are performed on them (so no unpacking is needed). In great many workloads, enabling -XX:+UseCompressedStrings was a pessimization.

UseCompressedStrings implementation was basically an optional feature that maintained a completely distinct String implementation in alt-rt.jar, which was loaded once the VM option is supplied. Optional features are harder to test, since they double the number of option combinations to try.

Because of that, Compressed Strings support was later removed and is no longer available.

Java 9 Compact Strings

While the implementation of Compressed Strings was flawed in many ways, the main idea was still valid. The implementation was just not solid enough. In Java 9, a new feature was introduced as a replacement of Compressed Strings - Compact Strings.

Instead of having char[] array, String is now represented as byte[] array. Depending on which characters it contains, it will either use UTF-16 or Latin-1, that is - either one or two bytes per character. There is a new field inside the String class - coder, which indicates which variant is used. Unlike Compressed Strings, this feature is enabled by default. If necessary (in a case where there are mainly UTF-16 Strings used), it can still be disabled by -XX:-CompactStrings.

The change does not affect any public interfaces of String or any other related classes. Many of the classes were reworked to support the new String representation, such as StringBuffer or StringBuilder.

Performance Impact

Unlike Compressed Strings, the new solution does not contain any String repacking, thus should be much more performant. In addition to significant memory footprint reduction, it should provide a performance gain when processing 1-byte Strings as there is much less data to be processed. Garbage Collection overhead will be reduced as well. Processing of 2-byte string does mean a minor performance hit because there is some additional logic for handling both cases for Strings. But overall, performance should be improved as 2-byte Strings should represent just a minority of all String instances. In case there are performance issues in situations, where the majority of Strings are 2-byte, the feature can always be disabled.

Conclusion

Java 9 introduces a new feature, which does greatly reduce the memory footprint of String. It is backward compatible, no public interfaces were changed. If required, it can be disabled by an -XX flag. It is a successor to the Compressed Strings from Java 6.