Optimizing Memory Consumption with String Pools: Part I

By Tobias Gurock, February 9th, 2009

While working on the latest version of SmartInspect, we brainstormed on how to reduce the memory consumption of the SmartInspect Console. The background for this is that we recently got one or two reports of Out Of Memory exceptions when loading really huge log files into the Console (huge begins at several 100MB depending on the installed amount of memory). While there’s a good solution for this problem (splitting log files, either by enabling log file rotation in the logging library or by using our splitting tool), it would be nicer if the Console could handle bigger log files by itself.

Before I go into details explaining what we’ve already done and what we are hoping to do in the future to improve the memory behavior of the Console, let me first give you some background information about SmartInspect log files and how they are represented in memory. The format of a SmartInspect log file is a simple binary format which is optimized for performance and size. It’s basically a sequence of packets each consisting of a simple packet header and packet body. The header describes its type and size and the packet body contains the actual logging data (if you are interested in all the gory details, see SmartInspect Log Formats and Protocols). While it’s not a super flexible file format, it can be written and read extremely efficiently and stores the data in a very compact way. It’s so compact that the Console needs around three to four times the memory of the log file size.

Here’s why. In a log file, the data is stored sequentially and nearly every byte of data in the log file on disk maps to one byte of real information. There is practically no overhead involved. This is completely different from the memory representation in the Console where:

  1. log packets are strongly-typed objects with the usual overhead associated with them
  2. strings and attached log entry data are accessed indirectly through pointers
  3. packets have several additional object fields to implement some Console features efficiently
  4. the packets require some additional management structures such as lists and queues

We’ve already reduced the memory usage of the Console by implementing several lazy loading mechanisms and by choosing smaller data types wherever possible (2 instead of 4 byte integers). We also drastically improved the log file loading process by avoiding temporary memory peeks. However, the real issue which we haven’t adressed yet are strings. One big problem is that strings are stored in log files with the UTF-8 Unicode encoding, but use the standard Windows UTF-16 encoding in the Console. This results in using twice as much as memory as needed in most cases, plus the additional management overhead that is always associated with strings. Here’s an illustration of how a string looks like in a SmartInspect log file and how the same string is represented in the Console’s memory (the Console is written in Delphi, but a string has a similar or even worse overhead in Java or .NET):

delphi_strings

If you have a string variable somewhere in a Delphi application, it’s basically just a pointer to its first character similar to char pointers in C (actually, its a pointer to the first Unicode code point, but let’s ignore the whole surrogate thing for a moment). Prepended to this array there’s a header record storing the reference count and the string length. Compared to how a string is stored in the log file, the memory representation has the overhead of the pointer, the reference count, an additional null character for C/Windows API compability and the different string encoding. All combined, a string represented in memory uses more than twice as much as space. Here’s certainly some room for improvement.

An obvious and quite effective solution would be to change the string encoding and use UTF-8 the same way it’s used in log files. But this would have a negative impact on performance as we would have to convert a string to UTF-16 every time we want to display it (all Windows APIs expect strings to be encoded with UTF-16). We’ve tested this and while it looked like a viable solution, we didn’t implement it because we came up with a better idea: a string pool.

Log files have the property that, although they usually contain thousands of strings, a large percentage of these strings are actually the same. Besides the actual data passed to log methods (such as log messages, variable names and values), the libraries also collect some context information such as the host name, application name and session name (sessions can be thought of as log categories). These strings never or rarely change inside a log, so we end up with thousands and thousands of copies of the same strings. Now the idea is to implement some kind of abstraction, the string pool, that makes it easy to avoid duplicate strings by storing them in a central location and reusing them. Implemented correctly, this has the potential of reducing the memory usage in the Console by so much that it uses less space than required for an (uncompressed) log file on disk.

The idea is actually quite simple and the concept is nothing new (it’s usually called string interning and is supported out-of-the-box by several modern programming environments, including Java and .NET):

“In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool. String interning is supported by some modern object-oriented programming languages, including Python, Ruby (with its symbols), Java and .NET languages.”

Although string interning/pooling is pretty much a known secret, we haven’t seen it used with Delphi yet and it’s a great solution for a not so untypical problem. Although the Console might be a special case and would benefit more from string interning than other applications, it’s a very useful concept every developer should be familiar with and add to his toolbox. That’s why you shouldn’t miss part II and part III of this blog series which will introduce the technical details of string pools, explain the several different usage scenarios you can optimize for and give a detailed implementation guide.

Related posts:

  1. Optimizing Memory Consumption with String Pools: Part III
  2. Optimizing Memory Consumption with String Pools: Part II

Found this article useful? Make sure to subscribe to the No bug left behind feed or via email and don't miss our future articles about software quality, performance, usability and related topics. This blog also features the regular Software Quality Digest with links to relevant articles, discussions and other resources.

7 Responses

  1. Thomas Mueller Says:

    Yes, I know, everybody could say that, but I implemented something like this about 10 years ago in Delphi. The problem was similar: A internal list of applications, publishers and filenames gathered over a population of several 100 computers. There were pretty many duplicates in that data and storing duplicate strings only once reduced memory requirements significantly.

    Of course I had never heard about “string interning” back then. ;-)

    (I must be getting old, because I start sounding like all these old guys when I started into computing: “Been there, done that, nothing new under the sun.”)

  2. Tobias Gurock Says:

    Thanks for your feedback, Thomas. Yeah, it’s a pretty common scenario I guess. It’s too bad Delphi doesn’t have built-in support for string interning. Ideally, Delphi would provide a simple InternString method allowing you to selectively intern some strings depending on your data.

  3. Thomas Mueller Says:

    If I remember correctly, I used a simple sorted TStringList back then, which reduced memory consumption dramatically while still giving decent performance.

    something like this:

    procedure InternString(var _s: string);
    var
    Idx: integer;
    begin
    if InternedStrings.Find(_s, Idx) then
    _s := InternedStrings[Idx]
    else
    InternedStrings.Insert(Idx, _s);
    end;

    Where InternedStrings is the global TStringList in question. Not quite rocket science, is it? ;-)

    A hashing list would probably perform even better.

    The problem with this approach is, that it will never free the strings. A solution could be to periodically scan the list and remove every string that has a reference count of 1. Of course, in your case, it is simpler: You can just clear the list when the user closes the log file. Also, when the list gets larger, the performance for inserting a string gets worse.

    Btw: I am glad to hear you are working on reducing the memory requirement of the console. That gives me hope that I will some day be able to load my log files without splittig them. They usually are huge. The only real solution for this problem of course is to not load the whole file into memory at all, but I guess you already knew that. I ran into memory constrictions myself just a few weeks ago, it felt like a return to the bad days of DOS programming…

  4. Patrick van Logchem Says:

    Same story here – I did this kind of thing years ago. The solution back then was rather nice if I may say so:

    For each addition to the pool, we first searched the pool to prevent duplicates. When a string was already present, we returned that string, instead of the one being added (so the reference-count of the former was increased, while the latter was ultimately cleared).

    The pool started out as one single array, containing 16 unsorted strings. (Searching in this array required a check at each element, but on average this array was half-full, so it only took 8 compares on average.)
    As soon as the unsorted array ran out of space, we sorted these strings, and moved the sorted array aside. (Sorted arrays can be searched more quickly, by using a binary search).

    As soon as the unsorted array filled up again, we did something extra : Not only did we sort the 16 strings, but now that we had 2 sorted arrays of equal size, we merge-sorted them into 1 array of size 32, and cleaned out the sorted size-16 array.

    When the next fill-up of the unsorted array occurred, we could again move the newly sorted size-16 array aside, as there was no size-16 array in use anymore. But this wasn’t possible on the next fill-up anymore, as we had 2 sorted size-16 arrays and 1 size-32 array… I’m sure you can see a pattern arising here : We just merged all those into one single size-64 array!

    This pattern continued for ever larger powers-of-2, meaning that presence-checks became a bit more expensive when the pool grew, as we had to do a binary-search in an ever increasing number of sorted, power-of-2 sized, arrays before we could conclude a string was not yet present. (A hit resulted in an early exit ofcourse.)
    It’s not as bad as it sounds thou – I believe we never went past 25 levels or so.

    For speed and fragmentation-prevention, our implementation didn’t actually clear out the arrays – we kept all arrays sized to their alotted capacity, but instead used an Empty-flag per array to see which arrays where inactive and as such shouldn’t be searched.

    Anyway, as I said – this was years ago. Since then we managed to dispence of the pool altogether, as we off-loaded most of the strings – which is quite the memory-saver ;-)

  5. Tobias Gurock Says:

    Thomas: We implemented it in a very similar way (we are using the hash-based approach), except that we also needed to consider the usage from multiple threads. We currently plan to intern all string properties of a log file. We suspect that, in a typical log file, you have at least 60-70% duplicates, so this should bring a lot. First tests have shown that the string pool (and the other improvements which are already included in the 3.2 beta) drastically reduce the memory usage of the Console.

    Patrick: Indeed a nice solution. :-) A disadvantage I see is the latency/increased response time of the string pool operation when you are doing your sort/resize/merge operation. However, I’m sure it’s just a theoretical issue and not really noticeable in practice (depending on the amount of managed strings).

  6. John Says:

    Unless I misunderstood, there is a much straightforward solution: don’t convert the data structures in memory, work on the binary, use virtual lists and virtual components, meaning that you don’t need to have strings in memory for more than what is visible on screen.

    This is an even older solution, and it can handle data of practically unlimited size. The code required is usually dead simple (replace fields in whatever object structures by methods that fetch the actual data).
    As for performance, let the windows file cache do the job for you.

  7. Tobias Gurock Says:

    Unfortunately, it’s not as easy as it first looks. For example, one problem is that logs can come from several destinations, not just log files. The Console can also accept log packets from a TCP or named pipe connection. So, we would need some way of storing these streams in temporary files which is not exactly rocket-science but would certainly require some time to get right.

    Another problem is the performance. If you are dealing with log data in the range of multiple 100MB, adding a new view (which is, essentially, a new tab for the log data with a filter) would be way too slow (as this requires going through the entire list of log packets and checking if the filter allows/denies each packet). There are several other cases like this in the Console.

    And there are even more problems, such as updating (we allow adding comments to the log, for instance) or deleting data from the log.

    Coincidentally, we are experimenting with the very same idea in a slightly different form. I think the string problem is largely solved by the string pool. The next problem deals with the attached data of log packets (log packets can transmit additional data such as images, hex dumps or database query results etc.). To avoid storing this data in memory, we thought about adding some kind of file-based database support to the Console (for example, with Sqlite). A file database would simplify delete and update operations and should provide a better performance than direct file manipulations.