Optimizing Memory Consumption with String Pools: Part II
By Tobias Gurock, February 17th, 2009
As promised in the first part of this series, today I will delve a bit more into the technical details of string pools.
From an abstract point of view, a string pool can be seen as a simple black box which, on a given input string, returns a string with the same value. That’s all you can expect from a string pool. In particular, this means that it provides no real upfront information about the relationship between the memory of the input string and the memory of the output string. It may return the same string object (or, to be more precise, a reference to the same memory location) as passed in but may also do something completely different such as returning a different existing string object or even allocating a new string. It only guarantees that will you receive a content-wise equivalent string:
![]()
And that’s the whole point of a string pool. By giving this minimal guarantee and letting you worry about the memory of the input string (in languages without garbage collection at least), it can drastically optimize the string management and allocation. In the best case (memory-wise), you may end up with an application with no duplicate strings whatsoever. Let’s see how a string pool can do this.
On a given string:
- it checks whether a string with the same value is already interned
- if so, it returns a reference to this interned string, thus reusing an existing string and memory area
- otherwise, it stores the passed string and returns the reference to it
That’s the whole magic if you can call it like that. There may be small variations here and there but that’s the usual way you would implement a string pool.
In languages with manual memory management, a string pool may give the additional guarantee of always returning a string that is unrelated to the input string (from a memory perspective). This drastically simplifies the memory management for your input string because you can always free it safely after you got your equivalent string from your string pool. A string pool may provide this guarantee by always creating a copy of the given string in step three instead of returning the original string reference:
“Software written to minimize memory usage can conserve memory by keeping all allocated strings in a hash table. If an already existing string is found a pointer to that string is returned; otherwise, a new string is allocated and added to the hash table.” (emphasize mine)
That’s all there is to it?
In an ideal world, the presented properties and guarantees of a string pool should be sufficient to make use of it. In real-world applications, though, you unfortunately need some additional tidbits of information. As we all know, abstractions are leaky and the string pool is no exception. In particular, we need to know:
- whether it can be safely used from multiple threads
- which factors influence its performance and behavior
Threading and thread-safety
Let’s begin with the threading issue. From a threading perspective, a string pool can basically be modeled in one of the following ways:
- Thread-unsafe
The simplest form of a string pool doesn’t pay any attention to things like threading, locking and race conditions. Although such a thread-unsafe implementation may look useless for a data structure which is apparently best used in an application-wide fashion, it has a few very interesting properties: it is easy to develop, scales perfectly and contains no threading-related overhead. - Thread-safe with locks
The most obvious implementation for a string pool with safe support for multi-threaded access. However, the most obvious solution is not always the best one. On the pro side, we have a straightforward and not too difficult implementation, and the support for thread-independence. On the con side, however, we have a huge performance and scalability bottleneck when used massively from multiple threads. - Thread-safe without locks
Also a thread-safe but much more scalable and unobtrusive solution that uses a lock-free data structure underneath. Lock-free string pools can safely be used from multiple threads, are reasonable fast and sufficiently scalable. The problem is that they are very difficult to implement correctly without introducing obscure hidden bugs, something even experts cannot get right every time.
So, if you want to use a single string pool from multiple threads at the same time, you are essentially stuck with the most difficult option that is really easy to mess up. In this case, you can hopefully reuse some existing and proven string pool implementation and don’t need to write your own (I suspect, or hope, that String.Intern for .NET and String.intern for Java are implemented that way). But what if you wouldn’t need the support for an application-wide string pool? In this case, we may take a look at an additional fourth option:
- Thread-local
Inherently thread-safe by managing a different string set for each thread. This string pool type has the same positive properties as the thread-unsafe string pool: it’s fast and reasonable easy to implement. Managing string sets on a per-thread basis may, for instance, be implemented by adding some kind of thread-locale-storage support to the thread-unsafe approach.
Thread-local string pools are the perfect alternative if your string distribution is heavily thread-local as well or if you do not have access to a thread-safe lock-free string pool. Otherwise, the thread-safe lock-free option should be your string pool of choice.
Performance and scalability
As we have seen, the supported thread model is not only an important factor for the thread-safety of a string pool, but also influences its performance and scalability. In addition to this threading related performance, it’s also important to know the expected runtime performance from a string pool in isolation (i.e. without taking threading related aspects into consideration). In particular, we need to know whether interning a string is a constant-time operation or somehow depends on the number of previously interned strings. You definitely want to avoid the latter as it’s getting slower and slower the more strings you add to the pool (logarithmic scale may be acceptable depending on your amount of strings, but linear or even worse isn’t):

If you recall from the introduction of this post, a string pool has two internal operations for finding and adding string entries. These two operations determine the overall performance of the string pool and if one of the operations is slow, this means that the whole interning is slow. So, essentially, we need a data structure which can represent both operations efficiently and in constant time. One data structure which meets these requirements is the hash table:
“Hash tables support the efficient lookup, insertion and deletion of elements in constant time on average (O(1) in the number of accessed memory cells); this bound is independent of the number of elements stored [...]. These operations are all performed in amortized constant time, which makes maintaining and accessing a huge hash table very efficient.”
A hash table is something like the Swiss knife for a developer. If you need a data structure with an excellent average performance for inserts, deletes and lookups, the hash table is definitely the way to go.
In the last part of this series, I will explain how a hash table is usually build and what to consider when designing a hash table for a string pool. I will also reveal which thread model we chose for our own string pool implementation for the SmartInspect Console (Delphi doesn’t have a built-in string pool), how we implemented it and why integrating it into the Console was well worth it.
Related posts:
- Optimizing Memory Consumption with String Pools: Part III
- Optimizing Memory Consumption with String Pools: Part I
Found this article useful? Make sure to subscribe to the No bug left behind feed or via email and don't miss our future articles about software quality, performance, usability and related topics. This blog also features the regular Software Quality Digest with links to relevant articles, discussions and other resources.

February 17th, 2009 at 16:17
Interesting idea, made me think. Usually I handle huge amount of data by implementing virtual controls, and have tailored data structures that handle storage and lookup.
Maybe a string-pool is a too specialized optimization approach (special form of compression), because if your kind of data and access patterns change it will not be flexible enough and be difficult to adapt. What happens for example if your logs entries have much less similarities or only match partially? You would have to split them up and match the parts, or just really reduce the amount of strings you hold in memory if compression is not enough.
A lightweight database could be an easier solution since you get inbuilt compression, lookup/filter, insert/update for many scenarios. For display you would just fetch what you really need to show, i.e. a fraction of the data. Also concurrency problems (threads) would be dealt with automatically.
“only guarantees that will you receive a linguistically equivalent string”
I wouldn’t say “linguistically equivalent”, since it implies your string pool has some linguistic knowledge or is a rule based system or whatever. Maybe use content vs. reference(to content), instead.
February 17th, 2009 at 17:51
Hello Maël,
We are also experimenting with the idea of using a database for the data. However, we currently do not plan to use it for the individual strings or as a replacement for our in-memory log objects but rather for the optional data which can be attached to log entries (such as images, stack traces etc.).
A more general approach sounds useful as well but would probably be much more difficult to get right. I’m also not sure about the performance of this solution. I suspect that the performance difference would be noticeable when compared with the current in-memory solution, something that we would like to avoid.
Thanks for the hint with the wording. I’ve changed it to ‘content-wise equivalent’. Should make things clearer.
February 17th, 2009 at 22:00
I’m going to assume that although I found this post through DelphiFeeds, that this is intented to be more general than Delphi.
After all, Delphi is already very efficent when working with strings (reference counted, copy on write semantics).
Scenario one: You make a copy of a string through a := b;
This does not create an additional memory load, rather both strings point at the same memory buffer. If you try to change the contents of A or B, it only then creates a copy and then modifies it. (if only 1 variable points at the memory buffer it skips that, and modifies the buffer directly)
Scenario two: You build lots of strings on the fly.
By creating a string, you are already paying the cost of creating the unique string, memory allocation is the lesser, secondard cost. Besides which, chances are if you build it on the fly you then discard it almsot as quickly. If it goes as quickly as it comes, why waste memory storing it longer?
Also, why would you want to bloat a hash table with temporary strings?
Neither scenario really recommends an complicated, permanent storage mechanism.
Incidently, there is already a fairly simple storage mechanism in Delphi you can already use if you really want to do this.
TStringList.
First, it isn’t limited in the way of a hash table, secondly it can be quicksorted and binary searched.
Code like this should do the trick:
// Let’s assume that MyStringCache is a TStringList and already instantiated, and its case sensitity flag is set to the prefered state (defaults to case insensitive).
Function GetCachedString(Const SourceString : String) : String;
Var
At : Integer;
Begin
At := MyStringCache.IndexOf(SourceString);
If At=-1 Then
Begin
MyStringCache.Add(SourceString);
MyStringCache.Sort;
MyStringCache.Sorted := True;
Result := SourceString;
End
Else
Begin
Result := MyStringCache[At];
End;
End;
There you go. Flexible mechanism for adding strings to a cache for reuse, binary searchable and no size limitation on contents. If you want it to be threadsafe, add either a critical section or a so called “lock free” lock.
Incidently, if you wanted to attempt this in any other langauge, I would recommend ANY structure other than a hash table. A BTree would be pretty efficent. Easily searched, easily implemented, no limitations on contents size, you can keep performance optimal through tree design.
Hash tables have a fixed size, cost the full memory load even empty, perform goes down quickly if the table gets even remotely close to full, your hash routine has a bias, or your input strings bias the hash. They are a very poor choice for almost every problem you could point them at. In the old days when storage and cpu performance was so limited that saying you only wanted to store 100 or 200 items was an acceptable trade of for each – such is no longer the case.
My opinion anyways.
February 18th, 2009 at 11:30
Hello Xepol,
The string pool is only used for strings that are stored somewhat permanently in the Console. We obviously do not use it for temporary strings. In the Console, the permanent strings are usually created by converting a binary stream of UTF-8 data (from a file, TCP/IP or named pipe stream) and are then compared to the strings in the string pool.
If there’s a string in the string pool with the same value, we keep a reference to this string and dispose our newly created string. Otherwise we store the newly created string in the string pool.
I know that Delphi strings are reference counted (that’s the whole concept behind our string pool implementation).
I still think that the hash table is the perfect data structure for string pools. Hash tables have several advantages over trees or lists. Implemented correctly, lookup and add of a hash table are blazingly fast and are usually O(1) operations. This is different with trees or lists where the best you can achieve is a logarithmic performance (and by the way, if you sort your list every time you add a new string to your string pool as in your example, you get a much much worse performance).
Hash tables are also usually not size-limited. If you like, you can dynamically grow (and shrink) your table depending on the current load factor. But even if you use a fixed-size hash table, there’s nothing to worry about if you implement your collision lists as a balanced sorted tree.
February 18th, 2009 at 21:15
To avoid contention in multi-thread environment, you should have two pools, global and one for each thread. When a thread starts, it takes a snapshot of the global pool. When doing a look-up, use the thread pool only hence no lock needed. Once in a while or when a thread is about to shut down, sync it with the global one
Cheers
February 19th, 2009 at 10:49
Good idea. For the Console, we ended up with a strict thread-local solution. The string creation and distribution is heavily thread-local as well, so this makes sense for the specific behavior of the Console. In applications with no thread-local behavior, your solution sounds like a good alternative to thread-safe application-wide string pools.
February 19th, 2009 at 18:49
Tobias -> Years later, still doing learning myself.
Setting sorted on a TStringlist to true means that it does a binary search when you do an IndexOf or find. What I did not realize is that it also does a binary search for insertion location before adding as well, so the list does not need to be constantly resorted.
So setting the sorted property of the list to true before you add any strings will completely eliminate the need to keep resorting (time to go through some of my old code!)
So there ya go, no need to keep resorting AND fast searches. While it is true that binary searches have a run time of O(log2(N)) – you have to get a very, very large data set before that number amounts to much (you would run out of ram and hard drive space before the actualy run time was significant). If you use a binary search to locate an insert point, insertion time is supposed to be a maximum of O(Log2(N)+N), but since moving pointers is really just a single block memory move, you could easily consider it O(Log2(N)+3) as a maximum.
And again, hash tables are not blazingly fast unless they are mostly empty. The moment you have 2 items that generate identical hashes, the whole system falls apart into not just a linear search for that item, but for other items whose hash slots you have to steal.
The problem with hash tables is that their performance is not only predictable somwhere between O(1) and O(n) for both searchs and inserts.
And remember: once the hash table is full, insertions fail once they reach their articifical limit? Binary trees and lists with binary searching really only fail once all available storage is exhausted.
Incidently, based on your described usage, unless your source data has a lot of similarities, it sounds like you are wasting a lot of memory storing something that isn’t going to be around for long. Are you sure its worth the effort you are going to and the memory&cpu costs you incure?
February 19th, 2009 at 21:01
Xepol,
“So setting the sorted property of the list to true before you add any strings will completely eliminate the need to keep resorting (time to go through some of my old code!”
Yep, I know, that’s why I was wondering why you are sorting the list every time you add a new entry. I guess we all agree that doing so is not the fastest solution.
You are probably right that a logarithmic scale is probably not much different from a (theoretically) constant time operation with the expected amount of data and on our computers nowadays. I think that both options are fine for that purpose. I’m also aware of the limitations/problems of hash tables (such as the difficulty of choosing a hash function or the linear search problem with standard collision lists).
However, I think both problems can be avoided/minimized if you
1. choose your hash function wisely, ideally with taking the expected data into consideration
2. keep the load factor of the hash table at an acceptable level (with a really good hash function, I think many implementations go up to a load factor of .75, but no more by default)
3. by using a smarter algorithm than a linear search in the collision lists by using either a binary search on a sorted list as in your example or by using a balanced sorted tree. In this case, you essentially combine the pros of all solutions while keeping the cons of the hash table to a minimum.
“Incidently, based on your described usage, unless your source data has a lot of similarities, it sounds like you are wasting a lot of memory storing something that isn’t going to be around for long. Are you sure its worth the effort you are going to and the memory&cpu costs you incure?”
Yes, absolutely. We expect that up to 60-70% of the strings in a typical log file are duplicates (I wrote in the first part which strings are included in a log file and why we expect this distribution). And once the strings are loaded/created, they normally exist for a long life time (sometimes even for the entire lifetime of the application run). We do not use the string pool for temporary strings but only for those strings which are kept in memory for a while and are very likely to be duplicated throughout the application (such as hostnames or the name of the instrumented application which are included in every log entry). This should result in a good trade-off between the size of the hash table and memory savings.
February 21st, 2009 at 22:49
@ Xepol : Don’t underestimate the cost of copying data around! On average, each insert has to move half of the elements one position aside. This amounts to N/2 memory reads and N/2 writes – that’s quite some bus-traffic, especially when N is getting large!
Also, every time the allocated capacity of the list is depleted, a whole new block of memory has to be allocated and all contents need to be copied over again. Now, if TStringList would use the doubling method (google this up for yourself) this cost could be averaged over all elements, making it constant cost on average – but AFAIK this is not the case.
Also, the scientific community is still very actively inventing new hashing algorithms, like : Lock-free extensible hash tables (Shalev, Shavit), HAT-tries (Askitis, Sinha), cache-conscious algorithms (Zobel), Cuckoo hashing, etc. These improve hashing by upping the load-factor (using multiple hash-functions and/or bins of buckets) and lowering the access time (by -among other things- taking current CPU architectures into the equation).
Bottom line is : For every application, there’s a data-structure/algorithm well (I won’t say ‘best’) suited for the job. Just make sure you pick the right one!