Package de.hsh.graja.util
Class CharsetUtils
java.lang.Object
de.hsh.graja.util.CharsetUtils
-
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
Modifier and Type    Method    Description
static Charset    detectCharacterEncoding8bit(byte[] buffer, String language, boolean preferMacRomanOnTie, StringBuilder log)    Tries to detect ISO-8859-1, x-MacRoman or windows-1252.
static boolean    isIso88591(String csn)
static boolean    isIso88591(Charset cs)
static boolean    isMacRoman(String csn)
static boolean    isMacRoman(Charset cs)
static boolean    isWindows1252(String csn)
static boolean    isWindows1252(Charset cs)
-
Field Details
-
WINDOWS_1252_CN
-
MAC_ROMAN_CN
-
ISO_8859_1_CN
-
-
Constructor Details
-
CharsetUtils
public CharsetUtils()
-
-
Method Details
-
isIso88591
-
isIso88591
-
isWindows1252
-
isWindows1252
-
isMacRoman
-
isMacRoman
-
detectCharacterEncoding8bit
public static Charset detectCharacterEncoding8bit(byte[] buffer, String language, boolean preferMacRomanOnTie, StringBuilder log)
Tries to detect ISO-8859-1, x-MacRoman or windows-1252. This method is a heavily modified version of the original written by Guillaume Laforge and Markus Döring: https://github.com/gbif/gbif-common/blob/master/src/main/java/org/gbif/utils/file/CharsetDetection.java
- Parameters:
buffer - the byte array whose encoding is to be detected
language - the ISO 639-1 code of the language of the text in the buffer, used to guess 8-bit encodings. Currently supported codes are en, de, es and it; any other code is silently mapped to en. For the code list, see https://www.loc.gov/standards/iso639-2/php/code_list.php
preferMacRomanOnTie - if true and there is a tie between x-MacRoman and one of the other encodings, x-MacRoman is returned; if false, the other encoding is returned
log - optional buffer for log messages; may be null if no log messages are needed
- Returns:
the detected charset
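The language hint and the tie-break flag matter because all three candidate charsets assign a character to almost every byte value, so the bytes alone rarely settle the question. The following is a minimal sketch of that ambiguity using only the JDK. It is not part of CharsetUtils and does not reproduce its detection logic. The class name EightBitAmbiguityDemo and the helper decode are made up for illustration, and the sketch assumes the running JDK ships windows-1252 (x-MacRoman is checked at runtime because it is an extended charset).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EightBitAmbiguityDemo {

    /** Decodes one byte (given as 0..255) with the given charset. Hypothetical helper. */
    public static String decode(int unsignedByte, Charset charset) {
        return new String(new byte[] { (byte) unsignedByte }, charset);
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        // Assumption: windows-1252 is available (true for mainstream JDKs).
        Charset cp1252 = Charset.forName("windows-1252");
        // x-MacRoman is an extended charset, so check before using it.
        Charset macRoman = Charset.isSupported("x-MacRoman")
                ? Charset.forName("x-MacRoman")
                : null;

        // 0x8A: a C1 control character in ISO-8859-1, 'Š' in windows-1252, 'ä' in x-MacRoman.
        // 0xE4: 'ä' in ISO-8859-1 and windows-1252, '‰' in x-MacRoman.
        int[] samples = { 0x8A, 0xE4 };

        for (int b : samples) {
            String mac = (macRoman == null)
                    ? "n/a"
                    : String.format("U+%04X", decode(b, macRoman).codePointAt(0));
            // Print code points rather than glyphs to stay independent of the console encoding.
            System.out.printf("0x%02X  ISO-8859-1=U+%04X  windows-1252=U+%04X  x-MacRoman=%s%n",
                    b,
                    decode(b, latin1).codePointAt(0),
                    decode(b, cp1252).codePointAt(0),
                    mac);
        }
    }
}
```

A German text containing the byte 0xE4 reads as plausible under ISO-8859-1 and windows-1252, while one containing 0x8A reads as plausible under x-MacRoman. With the library on the classpath, a call based on the signature above would look like CharsetUtils.detectCharacterEncoding8bit(bytes, "de", false, new StringBuilder()).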
-