regex - How to do a complex negative lookbehind to split tokens in Java?
I have a number of lines in EDIFACT format that need to be tokenized on +. However, according to the EDIFACT spec, characters can be escaped with ?. For example: ?? stands for ?, ?+ stands for +, and ?: stands for :. An escaped ?+ is part of the field and should hence not be considered a delimiter.
I used a negative lookbehind to deal with +'s that are preceded by ?:
String delimiter = "\\+";
String[] tokens = data.split("(?<!\\?)" + delimiter);
This splits:
a+b+c into a, b, c
a?+b+c into a?+b, c
However, it fails when the ?? escape sequence is involved:
a??+b+c yields 2 tokens: a??+b, c
whereas it should yield 3 tokens: a??, b, c (here ?? is an escaped ?, so the + that follows is a real delimiter).
On the other hand, a???+b+c should yield 2 tokens: a???+b, c
Is there a way to accomplish this using a negative lookbehind?
Here's a runnable test to play around with, if you wish.
import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        assertTokens("a+b+c", "a", "b", "c");
        assertTokens("a?+b+c", "a?+b", "c");
        assertTokens("a??+b+c", "a??", "b", "c");
        assertTokens("a???+b+c", "a???+b", "c");
    }

    private static void assertTokens(String data, String... expectedTokens) {
        String delimiter = "\\+";
        String[] tokens = data.split("(?<!\\?)" + delimiter);
        if (!Arrays.deepEquals(tokens, expectedTokens)) {
            throw new IllegalStateException("not equals " + data);
        }
    }
}
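Rather than splitting, tokenization is easier using matching. In your case, for split to work you'd have to use a variable-length lookbehind, which Java doesn't support: the number of ? characters before a + is unbounded, and whether the + is escaped depends on whether that count is odd or even.
Try the following regex: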
(?:[^+:?]++|\?.)+
(I've used a possessive quantifier (++) purely as an optimization, to avoid useless backtracking.)
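As a rough illustration of the matching approach, here is a minimal sketch that collects tokens with Matcher.find() instead of split(). The class name EdifactTokenizer and the method tokenize are hypothetical names chosen for this example. Note that the regex above treats both + and : as delimiters, and that the tokens come back raw, with escape sequences left in place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class EdifactTokenizer {
    // One token: runs of ordinary characters, or '?' followed by the escaped character.
    private static final Pattern TOKEN = Pattern.compile("(?:[^+:?]++|\\?.)+");

    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(data);
        while (m.find()) {
            tokens.add(m.group()); // raw token, escape sequences are not decoded
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a+b+c", "a?+b+c", "a??+b+c", "a???+b+c"}) {
            System.out.println(s + " -> " + tokenize(s));
        }
    }
}
```

With this pattern, empty fields are skipped: a++b gives only a and b.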
If you want to match empty tokens as well (a++b yielding a, the empty string, and b), the regex gets more complicated:
(?:[^+:?\r\n]++|\?.)+|(?<=[+:]|^)(?=[+:]|$)
which means:
either match the same as above (I've added \r\n to the character class so that newlines don't match),
or match an empty string that is:
  preceded by a token delimiter or the start of a line, and
  followed by a token delimiter or the end of a line.
I've added the m (multiline) flag for the alternative to work, meaning ^ and $ match at the start and end of each line.
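In Java, the m flag corresponds to Pattern.MULTILINE. The following is a minimal sketch of the empty-token variant under that assumption. The class name EdifactFieldTokenizer and the method tokenize are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class EdifactFieldTokenizer {
    // Either a non-empty token, or an empty position enclosed by delimiters / line boundaries.
    private static final Pattern TOKEN = Pattern.compile(
            "(?:[^+:?\\r\\n]++|\\?.)+|(?<=[+:]|^)(?=[+:]|$)",
            Pattern.MULTILINE); // ^ and $ match at the start and end of each line

    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(data);
        while (m.find()) {
            tokens.add(m.group()); // may be the empty string for an empty field
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a++b"));    // three tokens, the middle one empty
        System.out.println(tokenize("a??+b+c")); // same tokens as the simpler regex
    }
}
```

One caveat that appears to apply: the lookbehind inspects only a single character, so a token that ends in an escaped delimiter and is directly followed by a real delimiter (for example a?++b) would seem to produce a spurious empty token. Input like that may need extra handling.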
java regex tokenize regex-negation