regex - How to do a complex negative lookbehind to split tokens in Java? -
I have a number of lines in EDIFACT format that need to be tokenized on +. However, according to the EDIFACT spec, characters can be escaped with ?. For example: ?? means ?, ?+ means +, and ?: means :. An escaped ?+ is part of the field and should hence not be considered a delimiter.
I used a negative lookbehind to deal with +'s that are preceded by a ?:
    String delimiter = "\\+";
    String[] tokens = data.split("(?<!\\?)" + delimiter);
This splits
a+b+c into a, b, and c
a?+b+c into a?+b and c
However, it fails when the ?? escape sequence is involved:
a??+b+c yields 2 tokens: a??+b and c
whereas it should yield 3 tokens: a?, b, and c
On the other hand, a???+b+c should yield 2 tokens: a???+b and c
Is there a way to accomplish this using a negative lookbehind?
Here's a runnable test to play around with if you wish.
    import java.util.Arrays;

    public class Main {
        public static void main(String[] args) {
            assertTokens("a+b+c", "a", "b", "c");
            assertTokens("a?+b+c", "a?+b", "c");
            assertTokens("a??+b+c", "a??", "b", "c");
            assertTokens("a???+b+c", "a???+b", "c");
        }

        // Splits data on unescaped '+' and compares the result with the expected tokens.
        private static void assertTokens(String data, String... expectedTokens) {
            String delimiter = "\\+";
            String[] tokens = data.split("(?<!\\?)" + delimiter);
            if (!Arrays.deepEquals(tokens, expectedTokens)) {
                throw new IllegalStateException("not equals " + data);
            }
        }
    }
Rather than splitting, tokenization is easier using matching. In this case, for split to work you'd have to use a variable-length lookbehind, which Java doesn't support.
Try the following regex:
(?:[^+:?]++|\?.)+
(I've used the possessive quantifier (++) purely as an optimization, to avoid useless backtracking.)
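As a minimal sketch of how that regex could be used from Java, the matches can be collected with Matcher.find() instead of calling split. The class name EdifactTokenizer and the unescape helper are illustrative assumptions, not part of the original answer. Note that the character class also treats : as a delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
final class EdifactTokenizer {
    // A token is one or more of: a run of characters that are neither
    // delimiters (+ :) nor the escape character (?), or an escape pair
    // ("?" followed by any single character).
    private static final Pattern TOKEN = Pattern.compile("(?:[^+:?]++|\\?.)+");

    // Collects every non-empty token in the order it appears.
    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(data);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Hypothetical helper: removes each escape character and keeps the
    // character it protects, so "??" becomes "?" and "?+" becomes "+".
    static String unescape(String token) {
        return token.replaceAll("\\?(.)", "$1");
    }

    public static void main(String[] args) {
        for (String line : new String[] {"a+b+c", "a?+b+c", "a??+b+c", "a???+b+c"}) {
            System.out.println(line + " -> " + tokenize(line));
        }
    }
}
```

The tokens come back still escaped (a?? rather than a?), which matches the expectations in the question's test; unescape can be applied afterwards if the decoded field values are needed.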
If you want to match empty tokens as well (a++b yielding a, an empty string, and b), the regex gets more complicated:
(?:[^+:?\r\n]++|\?.)+|(?<=[+:]|^)(?=[+:]|$)
Which means:
either match the same as above (I've added \r\n to the character class so that newlines don't match), or match an empty string that is preceded by a token delimiter or the start of a line, and followed by a token delimiter or the end of a line. I've added the m (multiline) flag for the second alternative to work, meaning ^ and $ match the start and end of each line.
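A rough sketch of the empty-token variant in Java follows, with the m flag expressed as Pattern.MULTILINE. The class name EdifactEmptyTokens is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
final class EdifactEmptyTokens {
    // First alternative: a non-empty token, as before, but never spanning a newline.
    // Second alternative: an empty string sitting between two delimiters,
    // or between a delimiter and a line boundary.
    private static final Pattern TOKEN = Pattern.compile(
            "(?:[^+:?\\r\\n]++|\\?.)+|(?<=[+:]|^)(?=[+:]|$)",
            Pattern.MULTILINE); // ^ and $ match at the start and end of each line

    // Collects tokens in order, including empty ones.
    static List<String> tokenize(String data) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(data);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a++b"));
        System.out.println(tokenize("a?+b++c"));
    }
}
```

One caveat worth checking against real data: the lookbehind inspects only a single character, so it cannot tell an escaped delimiter from a real one. An input such as a?++b, where a token ends in ?+, appears to produce an extra empty token.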