Class CharsetUtils

java.lang.Object
de.hsh.graja.util.CharsetUtils

public class CharsetUtils extends Object
  • Field Details

  • Constructor Details

    • CharsetUtils

      public CharsetUtils()
  • Method Details

    • isIso88591

      public static boolean isIso88591(Charset cs)
    • isIso88591

      public static boolean isIso88591(String csn)
    • isWindows1252

      public static boolean isWindows1252(Charset cs)
    • isWindows1252

      public static boolean isWindows1252(String csn)
    • isMacRoman

      public static boolean isMacRoman(Charset cs)
    • isMacRoman

      public static boolean isMacRoman(String csn)
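      The six predicates above are not documented in the source. Presumably each one reports whether its argument denotes the named charset, with the String overloads taking a charset name. The following is a minimal sketch under that assumption, not the actual Graja implementation. The class name CharsetNameCheckSketch is hypothetical, and the alias resolution through Charset.forName is an illustrative choice.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical sketch; the real CharsetUtils may be implemented differently.
final class CharsetNameCheckSketch {

    /** Stand-in for isIso88591(Charset): compares against the JDK's ISO-8859-1 charset. */
    static boolean isIso88591(Charset cs) {
        // Charset.equals compares canonical names and returns false for null.
        return StandardCharsets.ISO_8859_1.equals(cs);
    }

    /** Stand-in for isIso88591(String): resolves aliases such as "latin1" before comparing. */
    static boolean isIso88591(String csn) {
        if (csn == null) {
            return false;
        }
        try {
            return isIso88591(Charset.forName(csn));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Malformed or unknown names are treated as "not ISO-8859-1".
            return false;
        }
    }
}
```

      Resolving the name first means that aliases and canonical names are treated alike. The same pattern would apply to windows-1252 and x-MacRoman, using Charset.forName to obtain the reference charset.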
    • detectCharacterEncoding8bit

      public static Charset detectCharacterEncoding8bit(byte[] buffer, String language, boolean preferMacRomanOnTie, StringBuilder log)
      Tries to detect whether the buffer is encoded as ISO-8859-1, x-MacRoman or windows-1252. This method is a heavily modified version of the original written by Guillaume Laforge and Markus Döring: https://github.com/gbif/gbif-common/blob/master/src/main/java/org/gbif/utils/file/CharsetDetection.java
      Parameters:
      buffer - byte array whose encoding is to be detected
      language - ISO 639-1 code of the language of the text in the buffer, used to guess 8-bit encodings. Currently supported codes are en, de, es and it. Other codes are silently mapped to en. For the code list, see https://www.loc.gov/standards/iso639-2/php/code_list.php
      preferMacRomanOnTie - if there is a tie between x-MacRoman and one of the other encodings and this parameter is true, then x-MacRoman will be returned. If this parameter is false, the other encoding will be returned.
      log - optional buffer for log messages. May be null if no log messages are needed.
      Returns:
      the detected charset
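
      The reason detection needs a language hint and a tie-break rule is that most byte values are valid in all three encodings, so the bytes alone cannot settle the question. The self-contained sketch below does not call CharsetUtils. It only decodes the same bytes with each candidate charset to show the ambiguity. The class name EightBitAmbiguityDemo is hypothetical, and the sketch assumes a JDK that ships the windows-1252 and x-MacRoman charsets.

```java
import java.nio.charset.Charset;

// Illustrative only: shows why 8-bit encodings are ambiguous without context.
final class EightBitAmbiguityDemo {

    /** Decodes the given bytes with the named charset. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 'ä' is byte 0xE4 in ISO-8859-1 and windows-1252, but byte 0x8A in x-MacRoman.
        byte[] latinBytes = {'M', (byte) 0xE4, 'r', 'z'};
        byte[] macBytes = {'M', (byte) 0x8A, 'r', 'z'};

        for (String name : new String[] {"ISO-8859-1", "windows-1252", "x-MacRoman"}) {
            // Every charset decodes both arrays without error; only one result
            // per array looks like a plausible German word.
            System.out.println(name + ": " + decode(latinBytes, name)
                    + " / " + decode(macBytes, name));
        }
    }
}
```

      Each decoding succeeds, so a detector has to judge which result is plausible text in the given language. When two candidates score equally, a flag such as preferMacRomanOnTie decides between them.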