In the first two posts, I introduced the basic concepts and technical details of string pools. Today I will show you how we implemented the string pool, integrated it into our SmartInspect Console and why the effort was worth it.

Implementation

I already wrote in the previous posting that I’m a big fan of hash tables. They have a great constant-time asymptotic run-time performance and consume only a reasonable amount of memory at the same time. A very rare combination. At first look, it’s not that obvious how you can achieve constant-time lookups, inserts and deletes on a set of items. Let me explain how this works. A hash table, in its simplest form, usually consists of:

  • A table / array
    The array is responsible for holding the items. The items are usually key/value pairs. For our string pool, we are only interested in the keys and can therefore ignore the values. The hash table can keep this array fixed in size or dynamically resize it depending on the current load factor (i.e. the ratio of the used array elements and the capacity of the array).
  • A hash function
    The hash function is a function which calculates a value, the hash, for a given item. This hash is then mapped directly to the index in the array.
  • Collision lists
    Since a hash table has only limited space but usually deals with an unlimited set of possible items, it cannot have a simple 1:1 relationship between items and the indices in the array. It may happen that two different items are mapped to the same index in the array. This is called a collision. Collisions are handled by allowing more than one item per array element, usually in the form of collision lists but sometimes also with collision trees or even inner hash tables.

For our string pool, we chose a configurable but fixed-size array and standard collision lists as its secondary storage. We decided against a variable-size array to avoid occasional large memory moves and the need for re-indexing the items. It also helped to keep the implementation simple. Picking a sufficiently-large array and, more importantly, a good hash function was therefore crucial:

“A good hash function is essential for good hash table performance. A poor choice of a hash function is likely to lead to clustering, in which probability of keys mapping to the same hash bucket (i.e. a collision) is significantly greater than would be expected from a random function. A nonzero collision probability is inevitable in any hash implementation, but usually the number of operations required to resolve a collision scales linearly with the number of keys mapping to the same bucket, so excess collisions will degrade performance significantly.”

I personally can recommend the djb2 hash function. We have used it with great success in several occasions over the years. It’s fast, easy to implement and provides a good hash distribution:

function HashOf(const S: String): Cardinal;
begin
  Result := 5381;
  for I := 1 to Length(S) do
  begin
    Result := ((Result shl 5) + Result) + Ord(S[I]);
  end;
end;

Side note: I also experimented with other hash functions such as the Bob Jenkins hash or several SBox variants. I found that, although these functions are said to provide a better hash distribution as djb2, the djb2-based string pool outperformed all other string pool variants. In several tests with our particular string pool implementation and varying inputs and array sizes, the djb2 solution sometimes even won by a factor of two or three. Not sure why this happens actually. I will investigate this, maybe that’s something for another article.

So, when putting all these pieces together and combining it with the basic string pool algorithm I gave in the previous post, interning a string now roughly looks like this:

function InternString(const S: String): String;
var
  LIndex: Integer;
  P: ...
begin
  { Hash the string and map it to the array }
  LIndex := IndexOf(S);

  { Lookup the record at the given array index }
  P := Find(LIndex);
  if not Assigned(P) then
  begin
    { Allocate a new record and add it to the table }
    ...
  end;

  { Return a reference to our interned string }
  Result := P.Value;
end;

For the threading related aspects of the string pool, we opted for the thread-local solution. This may not be right decision for all applications but makes perfect sense for the Console. Because its string input and distribution are heavily thread-local as well, spreading the string pool over all application threads wouldn’t bring many benefits. It could possibly even decrease the overall application performance. We added support for thread-locality directly to the string pool by storing the internal hash table in thread-local-storage.

Integration

Since we chose the thread-local option, we needed to add one string pool to each application thread which calls InternString for the various strings it processes and loads. The Console uses a multi-threaded approach for its log clients, so we integrated the string pool into the main thread and into each of the listener classes for the clients (TcpListener and PipeListener for handling dynamic threads and clients via TCP/IP and named pipes, respectively). The integration consists of just a few lines of code for each thread pool instance to dynamically initialize and shutdown the string pool for the particular thread (required for setting and cleaning up the hash table array). The string pools are either cleared when a dynamic client thread exits, or when a user clears the log in the main thread.

Our string pool is plugged right into the reading and loading process of log packets. Every time a string property is read from a log stream, we route the result through InternString. It basically works as follows:

function TLogEntry.ReadFromStream(const AStream: TStream);
var
  LHostNameLength: Integer;
begin
  LHostNameLength := ...

  ...

  { Read the host name from the stream and get the equivalent string
    from our string pool }
  FHostName := InternString(ReadUtf8String(AStream, LHostNameLength));

  ...
end;

The following screenshot of the Console illustrates duplicates that are avoided with this technique. There are several additional strings we intern that aren’t obvious from the screenshot, but it should give you a good overview nonetheless (for example, the watches toolbox on the lower left corner of the screenshot displays only the latest name/value pair for a given watch, but the names and values can be logged and stored multiple times):

SmartInspect Console with interned strings highlighted

Results

Besides improving the memory consumption of the Console, one design goal of the string pool and integration was to keep the performance penalty of the changes to a minimum. We therefore tested the hash table with several different array sizes and chose a size that offers both a good overall performance and uses a reasonable amount of memory. It turned out that a relative large array with 216 elements would be a good fit.

The following table shows some benchmarks we conducted on my current development machine (an Intel 3GHz Core 2 Duo). Each row lists the interning of one million strings with varying contents. The time column specifies the time spent for the loop to intern the strings. The memory column specifies the memory consumption of the interned strings compared to the consumed memory without a string pool (not including the size of the fixed-size array).

Time Memory
Random, length 1 to 100 702 ms 103.47%
Random, fixed length of 25 639 ms 106.89%
Numbers, all the same 187 ms 0.01%
Numbers, 10% different 203 ms 11.45%
Numbers, 50% different 249 ms 56.43%
Numbers, 90% different 297 ms 102.71%
Numbers, no duplicates 312 ms 112.67%

As you can see, performance and memory consumption improve if you have more duplicates in your data. This is expected since more duplicates result in a smaller load factor of the hash table and less memory overhead. This also means that the memory consumption increases if you have less duplicates (which then leads to a decrease in performance because of the collisions and resulting overhead).

It’s therefore important to decide which strings should be interned. If a string has a small probability of being duplicated somewhere in your application or thread, it’s likely that interning such a string does more harm than good. Interning a few additional strings isn’t a lot of overhead of course, but if you unnecessarily intern thousands of thousands of strings, this may negatively influence your application performance and even slightly increase your memory usage. So the rule should be to only intern strings that are likely to be duplicated.

I think we found a very good trade-off for the Console. We expect that only 30-40% of strings in a typical log file are different, which means a reduction of memory consumption of around 50% for strings. But even in the worst case (a log file with no duplicates whatsoever), the overhead is not really noticeable. We are very pleased with these results and are happy that we took the time to implement and integrate the string pool into the SmartInspect Console. Optimizing and reducing the memory consumption of the Console means that customers are now able to open and analyze even larger log files, and we are already planning additional improvements to reduce the memory consumption even further.

Related posts:

  1. Optimizing Memory Consumption with String Pools: Part II
  2. Optimizing Memory Consumption with String Pools: Part I

Found this article useful? Make sure to subscribe to the No bug left behind feed or via email and don't miss our future articles about software quality, performance, usability and related topics. This blog also features the regular Software Quality Digest with links to relevant articles, discussions and other resources.

3 Responses

  1. Eric Says:

    Your hash function has the particularity of emphasing the last character in a string, with the shl 5, especially the last 8 characters of the string. Those 8 characters will be present almost literally in the hash if they are mostly numbers, delimiters and uppercase letters.

    For experimentation’s sake, you could try reversing the loop (taking the characters in the hash from last to first), thus emphasing the first characters, if I guessed right, your hash performance should decrease. If not, it’s open season for other explanations :)

  2. Giel Says:

    Thanks for the 3 articles, Tobias :-)

  3. Tobias Gurock Says:

    Eric:

    “For experimentation’s sake, you could try reversing the loop (taking the characters in the hash from last to first), thus emphasing the first characters .. ”

    Will do. As written, I also experimented with other hash functions and will continue to do so. Maybe I will also revisit the collision lists/fixed-size array solution. I’m thinking of turning the collision lists into collision trees and testing with dynamically-sized arrays.

    Giel: You are welcome. :-)