I think this is worth highlighting, because I’ve seen so many cases where programmers “parse” text using Java tools like StringTokenizer or split() with a set of punctuation characters.
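To make the problem concrete, here is a minimal sketch of the kind of hand-rolled tokenizing I mean. The class name, the method name and the delimiter set are all made up for illustration; real code in the wild varies, but the weakness is the same: any character the author did not think of stays glued to a word.

```java
import java.util.Arrays;
import java.util.List;

final class NaiveSplitSketch {
    // Hypothetical delimiter set: whitespace plus a few common punctuation marks.
    static final String DELIMITERS = "[\\s.,;:!?]+";

    // Splits on the delimiter set above; anything not listed is treated as part of a word.
    static List<String> naiveWords(String text) {
        return Arrays.asList(text.trim().split(DELIMITERS));
    }

    public static void main(String[] args) {
        // \u271D is the dagger-like cross character; it is not in DELIMITERS,
        // so it stays attached to "cases".
        System.out.println(naiveWords("corner cases\u271D and other languages."));
    }
}
```

The dagger is not in the delimiter list, so the second token comes back as "cases✝" rather than "cases". Adding it to the list fixes this one input and nothing else.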
Java already has a built-in, locale-aware way to get sentences from text, and words from sentences: BreakIterator.
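A minimal sketch of how this can look with java.text.BreakIterator follows. The class and helper names are my own, and the choice to keep only tokens containing a letter or digit is an assumption about what counts as a "word"; the word iterator itself also reports spaces and punctuation as separate segments.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakIteratorSketch {

    // Splits text into sentences using the sentence-boundary rules for the given locale.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    // Splits a sentence into words. Segments without any letter or digit
    // (spaces, punctuation) are skipped; that filter is an assumption of this sketch.
    static List<String> words(String sentence, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(sentence);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = sentence.substring(start, end);
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String sentence : sentences("Hello there. How are you?", Locale.US)) {
            System.out.println(sentence + " -> " + words(sentence, Locale.US));
        }
    }
}
```

Passing a different Locale to the factory methods is all it takes to switch boundary rules, which is the part a hand-written splitter rarely gets right.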
Anything you write yourself to parse text will likely miss corner cases✝ and be unprepared for other languages.
Since BreakIterator does the job, isn’t difficult to use, and has been around since JDK 1.2, why not use it?
✝ Odd punctuation like this, for example, when reading words.