Class StringNormalizer

  • All Implemented Interfaces:
    Serializable, Cloneable, Iterable<StringNormalizer.NormalizerRule>, Collection<StringNormalizer.NormalizerRule>, List<StringNormalizer.NormalizerRule>, RandomAccess

    public class StringNormalizer
    extends ArrayList<StringNormalizer.NormalizerRule>
    This class represents a programmable string normalization engine that converts strings into a canonical form, for example before comparing them for equality. A normalizer is a list of zero or more rules (transformations). The normalize(String) method applies the entire list of rules to a given string.

    For example, you can build a string normalizer that replaces all sequences of one or more whitespace characters by a single space character, trims any leading or trailing space, and converts a string to lower case. This class provides a number of predefined transformations in the StringNormalizer.StandardRule enumeration. Some examples:

      // An "identity" transformation that does nothing:
      StringNormalizer norm1 = new StringNormalizer();
      // norm1.normalize(...) returns its argument unchanged
    
      // A "lower case" normalizer:
      StringNormalizer norm2 = new StringNormalizer(
          StringNormalizer.StandardRule.IGNORE_CAPITALIZATION);
      // norm2.normalize(...) returns a lower case version of its argument
    
      // A normalizer that ignores both capitalization and punctuation:
      StringNormalizer norm3 = new StringNormalizer(
          StringNormalizer.StandardRule.IGNORE_CAPITALIZATION,
          StringNormalizer.StandardRule.IGNORE_PUNCTUATION);
    
      // A "standard" normalizer:
      StringNormalizer norm4 = new StringNormalizer(true);
      // norm4.normalize(...) returns its argument with all punctuation
      // characters removed, all letters converted to lower case, all
      // whitespace sequences replaced by single spaces, all MS-DOS or
      // Mac line terminators replaced by "\n"'s, and all leading and
      // trailing whitespace removed.
      

    Note that a string normalizer containing multiple rules applies them in order (i.e., in the order added, which is the List order of this class). Because each rule operates on the output of the previous one, the result can depend on the order in which the rules are added.
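
    The effect of rule order can be illustrated with a minimal, self-contained sketch. It does not use StringNormalizer itself: the class RuleOrderSketch, the two rule constants, and the apply method are hypothetical stand-ins built on java.util.function.UnaryOperator, and the regular expressions are assumptions about what "ignore punctuation" and "collapse whitespace" rules might do.

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class RuleOrderSketch {
    // Hypothetical stand-ins for normalizer rules; these are NOT the
    // library's own StandardRule constants.
    static final UnaryOperator<String> STRIP_PUNCTUATION =
            s -> s.replaceAll("\\p{Punct}", "");
    static final UnaryOperator<String> COLLAPSE_WHITESPACE =
            s -> s.replaceAll("\\s+", " ");

    // Applies each rule in list order, feeding each result to the next rule.
    static String apply(List<UnaryOperator<String>> rules, String input) {
        String result = input;
        for (UnaryOperator<String> rule : rules) {
            result = rule.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "a , b";

        // Punctuation removed first, then the resulting double space collapsed.
        System.out.println("["
                + apply(List.of(STRIP_PUNCTUATION, COLLAPSE_WHITESPACE), input)
                + "]"); // prints "[a b]"

        // Whitespace collapsed first (no change), then punctuation removed,
        // which leaves two adjacent spaces behind.
        System.out.println("["
                + apply(List.of(COLLAPSE_WHITESPACE, STRIP_PUNCTUATION), input)
                + "]"); // prints "[a  b]"
    }
}
```

    In this sketch the same two rules yield different strings depending on their order, so a rule that can create whitespace runs would normally be placed before the rule that collapses them.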

    See Also:
    Serialized Form
    • Constructor Detail

      • StringNormalizer

        public StringNormalizer()
        Creates a new StringNormalizer object containing no rules (the "identity" normalizer).
      • StringNormalizer

        public StringNormalizer(boolean useStandardRules)
        Creates a new StringNormalizer object, optionally containing the standard set of rules. The standard set consists of all rules in StringNormalizer.StandardRule except the OPT_* rules.
        Parameters:
        useStandardRules - If true, the set of standard (non-OPT_*) rules will be used. If false, an "identity" normalizer will be produced instead.
      • StringNormalizer

        public StringNormalizer(StringNormalizer.StandardRule... rules)
        Creates a new StringNormalizer object containing the given set of rules.
        Parameters:
        rules - a (variable-length) comma-separated sequence of rules to add
      • StringNormalizer

        public StringNormalizer(StringNormalizer.NormalizerRule... rules)
        Creates a new StringNormalizer object containing the given set of rules.
        Parameters:
        rules - a (variable-length) comma-separated sequence of rules to add
      • StringNormalizer

        public StringNormalizer(Collection<? extends StringNormalizer.NormalizerRule> rules)
        Creates a new StringNormalizer object containing the given set of rules.
        Parameters:
        rules - a collection of rules to add (for example another StringNormalizer, or any other kind of collection)