Mini-Pattern/Code: Tokenize In Place

Quickly tokenize a string without copying, much like strtok(), but usable in a multi-threaded environment. Commonly used for parsing communications protocols. Only use when you don't need to keep the original string intact.

Motivation

There are a few recurring tasks in string parsing:

  • break apart a string on a character or string separator
  • extract one line from a buffer
  • check if a string contains a token

The C function strtok() is only useful for the first one. It also keeps its position in hidden static state, so you need to use strtok_r() if two threads could possibly be tokenizing at the same time.

Extracting lines from a buffer can be tricky. Especially when communicating with external resources, "\n" as a line ending just doesn't cut it. Most Internet protocols specify carriage return and linefeed (CRLF) endings, while others might use just CR or LF. Some especially heinous servers will mix conventions, as I observed when writing an IRC client.

Checking a string for a token is easy, and it belongs in the same place as tokenizing methods.

For these reasons I rolled my own C++ tokenizer class, presented here.

Implementation

I won't present the full text of the class here, so download the source code to use the class in your own projects. I am releasing this under the very lenient BSD license, which allows use in commercial products.

The main principle is extremely simple: scan for the separator, and stick a null over it. Return a pointer to the first string and remember the start of the remaining string, as illustrated here:

[Figure: Tokenize illustration]

Note that "current" is a pointer maintained in the Tokenizer class.

Here's an implementation of this, using a character separator:

char* Tokenizer::nextToken(const char sepChar)
{
    if (_length == 0)
        return 0;

    char* end    = &_current[_length - 1];
    char* marker = _current;

    /* Scan up to sepChar or the end of the buffer. The bounds check
       comes first so we never read past the end. */
    while (_current <= end && *_current != sepChar)
        _current++;

    /* _current is either at the separator or one past the end */
    if (_current <= end)
        *_current++ = 0; /* Write a null over the separator and advance */

    /* Update length */
    _length -= (_current - marker);

    return marker;
}

nextToken() will return a string up to but not including the sepChar, or up to the end of the string if the sepChar was not in the string. Null is returned if no data was left in the tokenizer.
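To make the behavior above concrete, here is a minimal, self-contained sketch. MiniTokenizer and splitInPlace() are hypothetical names invented for this example, not part of the downloadable class, and the sketch assumes the caller passes a writable, null-terminated buffer. It follows the same scan-and-overwrite idea, with the end pointer placed one past the last byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the article's Tokenizer class; the real class
// has more methods and may differ in detail.
class MiniTokenizer {
public:
    // The buffer must be writable and must outlive every token returned.
    MiniTokenizer(char* buffer, std::size_t length)
        : _current(buffer), _length(length) {
        assert(buffer != nullptr);
    }

    // Returns the next token, or nullptr once no data is left.
    char* nextToken(char sepChar) {
        if (_length == 0)
            return nullptr;

        char* end    = _current + _length;   // one past the last byte
        char* marker = _current;

        // Bounds check first, then compare against the separator.
        while (_current < end && *_current != sepChar)
            ++_current;

        // Overwrite the separator (if one was found) and step past it.
        if (_current < end)
            *_current++ = '\0';

        _length -= static_cast<std::size_t>(_current - marker);
        return marker;
    }

private:
    char*       _current;
    std::size_t _length;
};

// Tokenizes buffer in place and copies the tokens out only so that the
// result is easy to inspect; real callers would use the pointers directly.
std::vector<std::string> splitInPlace(char* buffer, char sepChar) {
    MiniTokenizer tok(buffer, std::strlen(buffer));
    std::vector<std::string> out;
    while (char* token = tok.nextToken(sepChar))
        out.emplace_back(token);
    return out;
}
```

Calling splitInPlace() on a writable array holding "NICK alice extra" with ' ' as the separator yields three tokens, and afterwards the array itself reads as just "NICK" because the first space has been overwritten. Unlike strtok(), adjacent separators in this sketch produce empty tokens rather than being collapsed.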

Using a string as a separator is just about as easy. I won't list the code here, but it's in the Tokenizer class. Note that similar methods could have been built on strtok_r(), although it treats its delimiter argument as a set of characters and skips empty tokens. At the time I wrote this code I was building on a platform lacking that function.
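Since the string-separator version is not listed, the following is only a guess at how such a method might look, not the code from the Tokenizer class. Cursor and nextTokenStr() are hypothetical names; Cursor stands in for the class's _current and _length members, and std::search is used to find the separator within the bounded buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical cursor standing in for the Tokenizer's _current/_length.
struct Cursor {
    char*       current;
    std::size_t length;
};

// Returns the next token delimited by the whole string sep, or nullptr
// once no data is left. The buffer is modified in place.
char* nextTokenStr(Cursor& c, const char* sep) {
    assert(sep != nullptr && *sep != '\0');   // an empty separator makes no sense
    if (c.length == 0)
        return nullptr;

    const std::size_t sepLen = std::strlen(sep);
    char* end    = c.current + c.length;      // one past the last byte
    char* marker = c.current;

    // Look for the first full occurrence of sep inside [marker, end).
    char* hit = std::search(marker, end, sep, sep + sepLen);

    if (hit != end) {
        *hit = '\0';                 // terminate the token at the separator
        c.current = hit + sepLen;    // resume after the whole separator
    } else {
        c.current = end;             // no separator: the rest is the token
    }

    c.length -= static_cast<std::size_t>(c.current - marker);
    return marker;
}
```

Only the first byte of the separator needs to be overwritten, because the cursor skips the remaining separator bytes before the next call.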

Detecting end of line is as follows:

char* Tokenizer::nextLine()
{
    if (_length == 0)
        return 0;

    char* end = &_current[_length - 1];
    char* marker = _current;
    bool  crEnding = false;

    /* Scan up to a line ending or the end of the buffer (bounds check first) */
    while (_current <= end && *_current != kLineEndingCR && *_current != kLineEndingLF)
        _current++;

    /* _current is either at a line ending or one past the end */
    if (_current <= end)
    {
        crEnding = (*_current == kLineEndingCR);
        *_current++ = 0;
    }

    /* Special case CR/LF combos */
    if (crEnding && _current <= end && *_current == kLineEndingLF)
        _current++;      /* Skip the linefeed */

    _length -= (_current - marker);

    return marker;
}

This will handle CR, LF, and CRLF line endings.
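As a quick check of the mixed-convention case, here is a self-contained sketch of the same line-splitting logic as a free function. LineCursor, nextLine() as a free function, and splitLines() are hypothetical names for this example, and the values '\r' and '\n' for kLineEndingCR and kLineEndingLF are assumed, since the article does not show their definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed values; the article's constants are not shown.
const char kLineEndingCR = '\r';
const char kLineEndingLF = '\n';

// Hypothetical cursor standing in for the Tokenizer's _current/_length.
struct LineCursor {
    char*       current;
    std::size_t length;
};

// Returns the next line without its ending, or nullptr once no data is left.
char* nextLine(LineCursor& c) {
    if (c.length == 0)
        return nullptr;

    char* end    = c.current + c.length;   // one past the last byte
    char* marker = c.current;

    while (c.current < end &&
           *c.current != kLineEndingCR && *c.current != kLineEndingLF)
        ++c.current;

    if (c.current < end) {
        const bool crEnding = (*c.current == kLineEndingCR);
        *c.current++ = '\0';               // overwrite the CR or LF
        // Treat CR immediately followed by LF as a single line ending.
        if (crEnding && c.current < end && *c.current == kLineEndingLF)
            ++c.current;
    }

    c.length -= static_cast<std::size_t>(c.current - marker);
    return marker;
}

// Collects the lines into strings only to make the result easy to inspect.
std::vector<std::string> splitLines(char* buffer, std::size_t length) {
    assert(buffer != nullptr);
    LineCursor c{buffer, length};
    std::vector<std::string> lines;
    while (char* line = nextLine(c))
        lines.emplace_back(line);
    return lines;
}
```

Given a buffer that mixes all three conventions, such as "PING a\r\nNOTICE b\nJOIN c\rQUIT", this sketch returns four lines. A blank line between two line endings comes back as an empty string rather than being skipped.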

Consequences

The advantages to Tokenize In Place are:

  • Fast parsing with little memory overhead (data is not copied).
  • As part of the Tokenizer class, includes smart EOL detection and "has token?" methods.

The disadvantages:

  • Original buffer is modified during parsing.

Download Code

tokenizer.zip