Facets with some surrogate pairs can't be loaded

Description

Loading facet values that contain certain code points encoded as UTF-16 surrogate pairs fails with this exception from TermStringList: "Values need to be added in ascending order."

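Bobo expects all facet string values to be loaded in Java string order (String.compareTo), which is UTF-16 code unit order. That order is inconsistent with Unicode code point order and with UTF-8 byte order (used by Lucene, and consistent with code point order). The two orders disagree whenever a supplementary character, whose surrogate code units fall in 0xD800-0xDFFF, is compared with a BMP character in the range U+E000-U+FFFF. Bobo should use the same ordering as Lucene.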
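The mismatch can be reproduced with plain JDK calls. The sketch below is illustrative only: the class and method names are hypothetical and not part of Bobo or Lucene, and the two sample values were picked just because they sit on either side of the surrogate range.

```java
// Illustrative only: names here are hypothetical, not Bobo or Lucene APIs.
final class OrderingMismatchDemo {
    // U+FB01 (LATIN SMALL LIGATURE FI): a single UTF-16 code unit, 0xFB01.
    static final String BMP_HIGH = "\uFB01";
    // U+10000: encoded in UTF-16 as the surrogate pair 0xD800 0xDC00.
    static final String SUPPLEMENTARY = "\uD800\uDC00";

    /** Sign of String.compareTo, which compares UTF-16 code units. */
    static int codeUnitOrder(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    /** Sign of comparing the first code point of each (non-empty) string. */
    static int firstCodePointOrder(String a, String b) {
        return Integer.signum(Integer.compare(a.codePointAt(0), b.codePointAt(0)));
    }

    public static void main(String[] args) {
        // Code unit order: 0xFB01 > 0xD800, so U+FB01 sorts after U+10000.
        System.out.println(codeUnitOrder(BMP_HIGH, SUPPLEMENTARY));
        // Code point order: 0xFB01 < 0x10000, so U+FB01 sorts before U+10000.
        System.out.println(firstCodePointOrder(BMP_HIGH, SUPPLEMENTARY));
    }
}
```

A term list that Lucene returns in code point order can therefore look out of order to a check based on String.compareTo.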

The root cause is probably Java's String class, which was originally written for UCS-2 and didn't need to support surrogate pairs; String.compareTo still compares char (UTF-16 code unit) values.

Workaround: use a Comparator that is consistent with Unicode code point order. ICU4J can be used (MIT-style license): https://ssl.icu-project.org/repos/icu/icu/trunk/license.html
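For reference, a code point comparator can also be sketched with the JDK alone, assuming no extra dependency is wanted. CodePointComparator is a hypothetical name, not a Bobo or ICU4J class, and how Bobo would plug it in is not shown here.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper: orders strings by Unicode code point, not by UTF-16 code unit.
final class CodePointComparator implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            // codePointAt combines a valid surrogate pair into one code point.
            // An unpaired surrogate is returned as its own code unit value.
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(j);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            i += Character.charCount(cpA);
            j += Character.charCount(cpB);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        String[] values = {"\uD800\uDC00", "\uFB01", "abc"};
        Arrays.sort(values, new CodePointComparator());
        // Resulting order by code point: "abc", U+FB01, U+10000.
        for (String v : values) {
            System.out.println(Integer.toHexString(v.codePointAt(0)));
        }
    }
}
```

For well-formed strings this order matches UTF-8 byte order. Strings with unpaired surrogates are an edge case that this sketch does not try to reconcile with Lucene.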

Possible optimization: use byte[] instead of String in the string facet structure?
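If values were held as UTF-8 byte[], they would have to be compared as unsigned bytes to match code point order. A minimal sketch of that comparison follows; Utf8ByteOrder is a hypothetical name, and nothing here reflects Bobo's actual facet structure.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: compares UTF-8 encoded values byte by byte, unsigned.
final class Utf8ByteOrder {
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            // Java bytes are signed; mask to 0..255 so 0xC3 sorts after 0x61.
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        // Common prefix: the shorter array sorts first.
        return a.length - b.length;
    }
}
```

With this comparison, U+FB01 (bytes EF AC 81) sorts before U+10000 (bytes F0 90 80 80), in agreement with code point order.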

Environment

None

Status

Assignee

Matt Wheeler

Reporter

Matt Wheeler

Labels

None

Fix versions

Priority

Major