java.lang.Object

de.hsh.graja.util.CharsetUtils

public class CharsetUtils extends Object

Field Summary

Fields

Modifier and Type

Field

Description

static final String

ISO_8859_1_CN

static final String

MAC_ROMAN_CN

static final String

WINDOWS_1252_CN
Constructor Summary

Constructors

Constructor

Description

CharsetUtils()
Method Summary

Modifier and Type

Method

Description

static Charset

detectCharacterEncoding8bit(byte[] buffer, String language, boolean preferMacRomanOnTie, StringBuilder log)

Tries to detect ISO-8859-1, x-MacRoman or windows-1252.

static boolean

isIso88591(String csn)

static boolean

isIso88591(Charset cs)

static boolean

isMacRoman(String csn)

static boolean

isMacRoman(Charset cs)

static boolean

isWindows1252(String csn)

static boolean

isWindows1252(Charset cs)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- WINDOWS_1252_CN
  
  public static final String WINDOWS_1252_CN
  See Also:
  
  Constant Field Values
- MAC_ROMAN_CN
  
  public static final String MAC_ROMAN_CN
  See Also:
  
  Constant Field Values
- ISO_8859_1_CN
  
  public static final String ISO_8859_1_CN
Constructor Details
- CharsetUtils
  
  public CharsetUtils()
Method Details
- isIso88591
  
  public static boolean isIso88591(Charset cs)
- isIso88591
  
  public static boolean isIso88591(String csn)
- isWindows1252
  
  public static boolean isWindows1252(Charset cs)
- isWindows1252
  
  public static boolean isWindows1252(String csn)
- isMacRoman
  
  public static boolean isMacRoman(Charset cs)
- isMacRoman
  
  public static boolean isMacRoman(String csn)
- detectCharacterEncoding8bit
  
  public static Charset detectCharacterEncoding8bit(byte[] buffer, String language, boolean preferMacRomanOnTie, StringBuilder log)
  
  Tries to detect ISO-8859-1, x-MacRoman or windows-1252. This method is a heavily modified version of the original written by Guillaume Laforge and Markus Döring: https://github.com/gbif/gbif-common/blob/master/src/main/java/org/gbif/utils/file/CharsetDetection.java
  
  Parameters:
  
  buffer - byte array whose character encoding is to be detected
  
  language - ISO 639-1 code of the language of the text in the buffer, used to guess 8-bit encodings. Currently supported codes are en, de, es and it; any other code is silently mapped to en. For the list of codes, see https://www.loc.gov/standards/iso639-2/php/code_list.php
  
  preferMacRomanOnTie - if there is a tie between x-MacRoman and one of the other encodings and this parameter is true, then x-MacRoman will be returned. If this parameter is false, the other encoding will be returned.
  
  log - optional buffer for log messages. May be null if no log messages are needed.
  
  Returns:
  
  the detected charset
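
The need for a language hint and a tie-breaker can be illustrated with the JDK alone: any byte sequence is decodable in all three candidate encodings, so decoding never fails and only the plausibility of the resulting text can separate them. The following is a minimal sketch that does not call CharsetUtils; the class name EightBitAmbiguityDemo and the method decodeWithCandidates are hypothetical names used for illustration only. It assumes the running JVM may or may not ship x-MacRoman (in many JDKs it is part of the optional jdk.charsets module), so each charset is checked with Charset.isSupported first.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class, not part of de.hsh.graja.util.
class EightBitAmbiguityDemo {

    /** Decodes the same bytes with every candidate charset this JVM supports. */
    static Map<String, String> decodeWithCandidates(byte[] bytes) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : new String[] {"ISO-8859-1", "windows-1252", "x-MacRoman"}) {
            // x-MacRoman is not guaranteed on every JVM, so skip unsupported names.
            if (Charset.isSupported(name)) {
                result.put(name, new String(bytes, Charset.forName(name)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 0x80 is the euro sign in windows-1252 but a control character in ISO-8859-1;
        // 0xFC and 0xDF are u-umlaut and sharp s in both of those encodings,
        // while x-MacRoman maps all three bytes to different characters.
        byte[] bytes = {(byte) 0x80, ' ', 'G', 'r', (byte) 0xFC, (byte) 0xDF, 'e'};
        decodeWithCandidates(bytes)
                .forEach((charset, text) -> System.out.println(charset + " -> " + text));
    }
}
```

Every supported charset yields a string without error, which is why a detector has to score the candidates against what is likely for the given language and needs a rule such as preferMacRomanOnTie when the scores are equal.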
