Class UrlCodec


  • public class UrlCodec
    extends Object
    Codecs for the various URL parts. Unlike URLCodec this is focused on Strings and thus the decoder can leave unknown characters untouched: "ä%C3%A4" is decoded to "ää" instead of "?ä" as URLCodec.decode(String) would do.
    • Field Detail

      • PART_URL_SAFECHARS

        protected static final String PART_URL_SAFECHARS
        The characters which can always appear in any URL without being encoded: the "unreserved" chars. Unfortunately there are different recommendations about encoding $!*'(), so we exclude them. Possibly the "extra" chars !*'(), could be included. We exclude ~ since it was declared unsafe in
        See Also:
        Constant Field Values
      • URLSAFE

        public static final UrlCodec URLSAFE
        Codec quoting everything other than the chars which are safe in every part of the URL.
      • AUTHORITY

        public static final UrlCodec AUTHORITY
        Codec for the authority of a URL.
      • OPAQUE

        public static final UrlCodec OPAQUE
        Codec for opaque URLs that are not parsed. Contains all unreserved, reserved and extra characters.
      • PAT_ENCODED_CHARACTERS

        protected static final Pattern PAT_ENCODED_CHARACTERS
        Matches one or several percent encoded bytes.
      • PAT_INVALID_ENCODED_CHARACTER

        protected static final Pattern PAT_INVALID_ENCODED_CHARACTER
        Matches a percent sign followed by something that's not a hexadecimally encoded byte.
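        A minimal sketch of what two such patterns might look like, using only java.util.regex. The regular expressions and the class name PercentPatternSketch are assumptions for illustration; the actual values of PAT_ENCODED_CHARACTERS and PAT_INVALID_ENCODED_CHARACTER are not shown on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PercentPatternSketch {
    // Assumed shape: one or more consecutive %XX groups.
    static final Pattern ENCODED_RUN = Pattern.compile("(?:%[0-9a-fA-F]{2})+");

    // Assumed shape: a '%' that is not followed by two hexadecimal digits.
    static final Pattern INVALID_PERCENT = Pattern.compile("%(?![0-9a-fA-F]{2})");

    /** Returns the first run of percent-encoded bytes, or null if there is none. */
    static String firstEncodedRun(String s) {
        Matcher m = ENCODED_RUN.matcher(s);
        return m.find() ? m.group() : null;
    }

    /** True if the string contains a '%' that does not start a %XX group. */
    static boolean hasInvalidPercent(String s) {
        return INVALID_PERCENT.matcher(s).find();
    }
}
```

        Note that "an%effect" contains no invalid percent sign under these patterns, because "%ef" happens to be a well-formed hexadecimal byte.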
      • INVALID_CHARACTER_MARKER

        protected static final String INVALID_CHARACTER_MARKER
        "\ufffd" is inserted whenever something could not be decoded, or sometimes when it's encoded - see encode(String).
        See Also:
        Constant Field Values
      • charset

        protected final Charset charset
      • admissibleCharacters

        protected final String admissibleCharacters
      • validationRegex

        protected final Pattern validationRegex
        Matches an arbitrarily long sequence of admissible chars and percent encodings.
      • invalidCharacterMarkerForEncoding

        protected transient String invalidCharacterMarkerForEncoding
    • Constructor Detail

      • UrlCodec

        public UrlCodec​(@NotNull
                        @NotNull String admissibleCharacters,
                        @NotNull
                        @NotNull Charset charset)
                 throws IllegalArgumentException,
                        PatternSyntaxException
        Initializes the Codec with a range of admissible characters.
        Parameters:
        admissibleCharacters - all characters that remain untouched when encoding. Can contain ranges like a-z, as in simple regex character classes. (Thus, - has to be the first or last character if it needs to be included.) Obviously, the quoting character '%' always has to be admissible.
        charset - the charset needed for the decoder.
        Throws:
        IllegalArgumentException - if the admissibleCharacters don't contain '%'
        PatternSyntaxException - if the admissibleCharacters are not a well formed character class
    • Method Detail

      • charsToEncode

        protected String charsToEncode​(String admissibleCharacters)
        Hook to calculate the set of characters to encode from the admissibleCharacters.
      • encode

        @Nullable
        public @Nullable String encode​(@Nullable
                                       @Nullable String encoded)
        Encodes all characters which are not admissible as percent-encodings with respect to the given charset. Characters that are not in the charset are silently encoded as a replacement character, which is either "\ufffd" or '?' if one of these is admissible, or otherwise the encoding of "\ufffd" for the charset (which might be an encoded '?').
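        The general idea of percent-encoding everything outside an admissible character class can be sketched with the JDK alone. This is not the implementation of UrlCodec: the class PercentEncodeSketch and its method are hypothetical, and the replacement handling is simplified, since String.getBytes(Charset) substitutes the charset's default replacement bytes (an encoded '?' for ISO-8859-1) for characters it cannot map.

```java
import java.nio.charset.Charset;
import java.util.regex.Pattern;

final class PercentEncodeSketch {
    /**
     * Percent-encodes every code point that does not match the character class
     * built from admissibleCharacters (for example "a-zA-Z0-9").
     */
    static String encode(String input, String admissibleCharacters, Charset charset) {
        if (input == null) {
            return null;
        }
        Pattern admissible = Pattern.compile("[" + admissibleCharacters + "]");
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (admissible.matcher(ch).matches()) {
                out.append(ch); // admissible characters stay untouched
            } else {
                // getBytes replaces unmappable characters with the charset's
                // default replacement bytes instead of throwing.
                for (byte b : ch.getBytes(charset)) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
        });
        return out.toString();
    }
}
```

        With admissible characters "a-zA-Z0-9" and UTF-8, the input "a b/\u00e4" becomes "a%20b%2F%C3%A4"; with ISO-8859-1, a euro sign (not in that charset) becomes "%3F".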
      • decode

        @Nullable
        public @Nullable String decode​(@Nullable
                                       @Nullable String encoded)
        Decodes percent-encoded characters in the string, never throwing exceptions: if an undecodable character is encountered, it is replaced with the replacement character "\ufffd". The only exception we make here is that a % sign without a hexadecimal number is passed through unchanged, so that this can be used to preventively decode strings that might be encoded. That is not 100% safe, though, since there might be something that merely looks like a percent-encoded character: e.g. "an%effect" will be decoded to "an\ufffdfect".
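        The lenient behaviour described above can be approximated with plain JDK classes. The following is a sketch under assumptions, not the code of UrlCodec: the class LenientDecodeSketch, its pattern and its method signature are hypothetical. It decodes each run of %XX groups as bytes in the given charset and copies everything else, including a lone '%', unchanged.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LenientDecodeSketch {
    // Assumed pattern: a run of one or more %XX groups.
    private static final Pattern ENCODED_RUN = Pattern.compile("(?:%[0-9a-fA-F]{2})+");

    static String decode(String encoded, Charset charset) {
        if (encoded == null) {
            return null;
        }
        Matcher m = ENCODED_RUN.matcher(encoded);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between encoded runs untouched (non-ASCII, lone '%', ...).
            out.append(encoded, last, m.start());
            String run = m.group();
            byte[] bytes = new byte[run.length() / 3];
            for (int i = 0; i < bytes.length; i++) {
                bytes[i] = (byte) Integer.parseInt(run.substring(3 * i + 1, 3 * i + 3), 16);
            }
            // new String(byte[], Charset) never throws; malformed input is
            // replaced by the charset's replacement string (U+FFFD for UTF-8).
            out.append(new String(bytes, charset));
            last = m.end();
        }
        out.append(encoded, last, encoded.length());
        return out.toString();
    }
}
```

        With UTF-8, this sketch turns "\u00e4%C3%A4" into "\u00e4\u00e4", leaves "100%" unchanged, and turns "an%effect" into "an\ufffdfect", because the single byte 0xEF is not a complete UTF-8 sequence.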
      • encode

        @Nullable
        protected @Nullable String encode​(@Nullable
                                          @Nullable String encoded,
                                          boolean doThrow)
      • encodePostprocess

        protected void encodePostprocess​(StringBuffer out)
        Hook for finalizing encoding.
      • getInvalidCharacterMarkerForEncoding

        protected String getInvalidCharacterMarkerForEncoding()
        To mark characters that could not properly be encoded, we use "\ufffd" or ? if one of these is admissible, or "\ufffd" encoded if that belongs to the charset, or ? encoded if it's not.
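        The selection order described above could look roughly like this. The class InvalidMarkerSketch and the method chooseMarker are hypothetical names, and the sketch assumes that admissibility is tested with a regex character class built from the admissible characters.

```java
import java.nio.charset.Charset;
import java.util.regex.Pattern;

final class InvalidMarkerSketch {
    static String chooseMarker(String admissibleCharacters, Charset charset) {
        Pattern admissible = Pattern.compile("[" + admissibleCharacters + "]");
        // 1. Use the marker unencoded if it is admissible.
        if (admissible.matcher("\ufffd").matches()) {
            return "\ufffd";
        }
        // 2. Otherwise use '?' unencoded if that is admissible.
        if (admissible.matcher("?").matches()) {
            return "?";
        }
        // 3. Otherwise percent-encode U+FFFD if the charset has it, else '?'.
        String toEncode = charset.newEncoder().canEncode('\ufffd') ? "\ufffd" : "?";
        StringBuilder out = new StringBuilder();
        for (byte b : toEncode.getBytes(charset)) {
            out.append('%').append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }
}
```

        For admissible characters "a-z", this yields "%EF%BF%BD" with UTF-8 and "%3F" with ISO-8859-1, which has no U+FFFD.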
      • decodeValidated

        @Nullable
        public @Nullable String decodeValidated​(@Nullable
                                                @Nullable String encoded)
                                         throws IllegalArgumentException
        Decodes percent-encoded characters in the string, but throws an IllegalArgumentException if the input string is invalid: if it contains an unencoded quoting character % (recognizable because it is not followed by a 2-digit hexadecimal number), or if a percent-encoded sequence does not encode a character in the charset.
        Throws:
        IllegalArgumentException - if encoded is not a validly encoded String
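        A strict variant can be sketched with a CharsetDecoder configured to report errors instead of replacing them. The class StrictDecodeSketch and its helper are hypothetical, and the sketch only covers the two failure cases named above; it does not check whether the unencoded characters are admissible.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictDecodeSketch {
    static String decodeValidated(String encoded, Charset charset) {
        if (encoded == null) {
            return null;
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < encoded.length()) {
            if (encoded.charAt(i) != '%') {
                out.append(encoded.charAt(i++));
                continue;
            }
            // Collect the whole run of %XX groups so multi-byte characters decode together.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            while (i < encoded.length() && encoded.charAt(i) == '%') {
                if (i + 2 >= encoded.length()) {
                    throw new IllegalArgumentException("Truncated percent-encoding in: " + encoded);
                }
                bytes.write(hex(encoded.charAt(i + 1), encoded) * 16 + hex(encoded.charAt(i + 2), encoded));
                i += 3;
            }
            try {
                // REPORT makes the decoder throw instead of inserting U+FFFD.
                out.append(charset.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes.toByteArray())));
            } catch (CharacterCodingException e) {
                throw new IllegalArgumentException("Not decodable as " + charset + ": " + encoded, e);
            }
        }
        return out.toString();
    }

    /** Value of one ASCII hexadecimal digit; rejects anything else. */
    private static int hex(char c, String context) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        throw new IllegalArgumentException("Invalid percent-encoding in: " + context);
    }
}
```

        With UTF-8, "%C3%A4" decodes to "\u00e4", while both "100%" and "an%effect" are rejected with an IllegalArgumentException in this sketch.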
      • decodePreprocess

        protected String decodePreprocess​(String encoded)
        Hook to preprocess something about to be decoded.
      • unhex

        protected byte unhex​(char c)
      • isValid

        public boolean isValid​(@Nullable
                               @Nullable String encoded)
        Verifies that the given String is encoded: all characters are admissible and % is always followed by a hexadecimal number.
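        A check of this kind can be expressed as a single regular expression that alternates between admissible characters and %XX groups. The sketch below is an assumption about the general shape, not the actual validationRegex: the class ValidationSketch is hypothetical, the example character class deliberately does not contain '%', and returning false for null is an arbitrary choice made here.

```java
import java.util.regex.Pattern;

final class ValidationSketch {
    /** Builds a pattern matching any sequence of admissible chars and %XX groups. */
    static Pattern validationRegex(String admissibleCharacters) {
        return Pattern.compile("(?:[" + admissibleCharacters + "]|%[0-9a-fA-F]{2})*");
    }

    static boolean isValid(String encoded, String admissibleCharacters) {
        // Assumption of this sketch: null is treated as not valid.
        return encoded != null
                && validationRegex(admissibleCharacters).matcher(encoded).matches();
    }
}
```

        With the character class "a-z0-9", "a%20b" is valid, whereas "a b" (inadmissible space) and "100%" (percent sign without hexadecimal digits) are not.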