As I’ve explained in the first posting of this small blog series, we have been looking for an ideal web development stack for a new installable product. We looked at the most popular development stacks, platforms and frameworks before eventually deciding which one to use. I will try to summarize the pros and cons of the stacks we evaluated and explain how we’ve come to the final decision.

Photo of books about web development

We aren’t experts in all the different frameworks and platforms we evaluated and there are probably some details we got wrong. Feel free to correct us by leaving a comment. So without further ado, here are the pros and cons of the different development stacks in regards to installable web applications:

  • Java (including Grails and Spring)

    Java and the various frameworks such as Spring and the dynamic counterpart Groovy combined with the Grails framework qualify in many ways to be the perfect web development stacks for installable applications. The Java platform is very mature and has very good support for cross-platform and cross-database development. If you use frameworks such as Spring, you also benefit from the MVC pattern and from enterprise features and libraries. Java also isn’t going anywhere and it’s safe to say that Java will be available for the next 10+ years and probably a lot longer. Unicode support is also excellent, which is rare for cross-platform development stacks.

    However, one of the biggest drawbacks of this platform also made us decide against using Java: deploying Java web applications is complicated and scares the hell out of us. We wouldn’t feel comfortable having to support deployment scenarios with all the different available Java application servers out there, which you need to run a Java web application. The need for an application server such as Tomcat or JBoss is a serious disadvantage, especially if your customers aren’t experienced with installing and configuring such a solution.

  • Ruby on Rails

    Ruby on Rails is a modern and productive web development stack and framework. Using Ruby as the programming language of choice, it’s best known for the MVC architecture and the well-integrated ActiveRecord object-relational mapping system for database access. Ruby on Rails also emphasizes the convention over configuration concept, making it very easy to get started developing web applications. Rails also comes with reasonable support for multiple database servers, but is usually used with MySQL and PostgreSQL.

    Recent versions of Rails also improved the support for Unicode and UTF-8 encoding, making it more suitable for applications that are used by users from non-western countries. Rails is usually deployed by configuring a Mongrel cluster (which is easier than it sounds) and using a front-end web server such as Apache, nginx or lighthttp for static content serving and load balancing the different Mongrel processes. Although setting up a run-time environment for Rails applications is still a lot easier than for Java web applications, it can still be problematic for customers who have no experience with Rails and/or Unix environments.

    This leads us to the big problem with Rails: the support for Windows servers is miserable. Because most Rails applications are developed, tested and deployed in Unix environments, either for hosted web applications or in-house intranet applications, supporting Windows servers isn’t an important issue for the Rails team or developer community. However, if you are trying to sell an installable web application to typical companies, Windows servers are still extremely popular for internal application and workgroup servers. And if your potential customers are used to running Windows servers, you better make sure that your solution works flawlessly in this environment.

  • ASP.NET

    ASP.NET is a good choice for many installable web applications. It’s reasonable easy to deploy an ASP.NET web application on a Windows Server with IIS and most companies (and potential customers) already have such a server running as an internal application server. The Unicode and localization support is also top-notch and if this is one of your most important requirements, you should definitely consider going with this development stack.

    Another big plus for ASP.NET is the strong developer community with countless of blogs, books, websites and forums. The rich offering of user interface controls, libraries and development tools for the .NET platform in general and ASP.NET in particular by third-party vendors is also unmatched. However, ASP.NET also has its problems. ASP.NET Web Forms really encourages mixing application logic with the user interface code, which can result in badly designed applications and hard-to-maintain code. The abstraction in ASP.NET Web Forms that’s supposed to make web development easier for Windows desktop developers makes the development stack hard to debug and overcomplicates things.

    Experienced ASP.NET developers try to workaround these limitations by using patterns such as MVP or MVC. Especially Microsoft’s new ASP.NET MVC framework is an interesting alternative to Web Forms, as it encourages the separation of business logic, data access/validation and user interface code and makes it easier to build a maintainable application. This leaves us with the biggest drawback of ASP.NET: the lack of cross-platform support. Sure, there’s Mono, an open source .NET implementation that also supports ASP.NET. But while it’s definitely a nice option to have, it’s not widely used and I personally would not consider it to develop cross-platform web applications at this point.

  • PHP

    PHP is arguably the most used web development programming language, despite it’s many technical limitations and problems. The main reason PHP has become so popular is that it’s extremely easy to use and that it’s supported by virtually all web hosting companies. Developing an PHP application without using a decent application framework will most likely lead to an unmaintainable code base, however. The language itself, although explicitly designed for web development, doesn’t come with a (MVC) framework and doesn’t help with separation of concern in any way.

    One of the technical limitations is the lack of Unicode support. While most development stacks have quite good Unicode support by now, PHP’s support for non-western characters is still very limited and if you plan to support most server configurations and multiple databases, it’s basically non-existent.

    Luckily, there are quite a few available PHP frameworks to remove some of the mentioned limitations such as the Zend framework, CakePHP or CodeIgniter. Using one of these frameworks makes it a lot easier to write maintainable applications and also provides you with often needed functionality not found in PHP itself, such as a database abstraction layer.

    Of course, PHP also has its good sides. Applications written in PHP are very easy to deploy and work on most operating systems and server configurations. Even Microsoft is working hard to improve the PHP deployment experience on Windows and IIS, as many popular web applications such as WordPress are written in PHP. The great cross-platform, cross-database support combined with the ease of deployment and available frameworks make PHP a good web development stack, even though it’s not the best technical choice.

Having evaluated the above mentioned frameworks and having looked at the pros and cons of each development stack, we have come to the conclusion that PHP together with a lightweight framework such as CodeIgniter is the best choice for an installable web development stack, at least today. Yes, from a technical standpoint, PHP is less than great. The ability to easily deploy and run a PHP application on most server configurations, however, easily outweighs its limitations. One drawback that really made us think twice before deciding to go with PHP was the lack of proper Unicode support. Unicode support was originally a really important requirement to us, but we decided that we can do without Unicode support for a few more years until this problem will hopefully solve itself (with better Unicode support in PHP).

You might also be interested in our reasons to choose CodeIginiter as the PHP framework for our application. The great thing about CodeIgniter is that it’s a really simple framework. Although CodeIgniter only offers the most basic features and you have to implement advanced capabilities such as locale or authentication support yourself, this simplicity makes it really easy to understand the entire code base in a very short amount of time. So even in the unlikely case that we needed to fork this framework because the authors stopped supporting it, we could easily improve and add features to the existing code ourselves, something that is not really feasible with most other frameworks and stacks out there as they are often too complex.

Some months into development we are still happy with our choice. The application is coming along nicely and PHP and CodeIngiter have proven to be a solid development stack for an installable web application so far (of course, the real challenges come when customers deploy our application in many different hosting environments). Our choice of a development stack would most likely be different if we were building a hosted solution where we could control every little aspect of the server environment. But because our customers will have to deploy and host our application themselves, PHP and CodeIgniter are a great choice for us.

Now before you start your next big thing, make sure to consider all your options, as you cannot easily change platforms once you’ve invested a lot of time and resources.

Besides improving our flagship product SmartInspect, we’ve started working on a completely new product a few months ago. We aren’t ready to announce what it is just yet, but we will hopefully be able to share some early screenshots and more information in the coming weeks on our company blog. Our next product will be delivered as an installable web application. Similar to applications such as FogBugz or WordPress, customers will be able to download our application and install and host it directly on their servers. We decided to go with an installable version of the product first (instead of offering it as a hosted solution), because we believe that our target market still prefers to host their sensitive data themselves.

One of the first decisions we had to make regarding the product was to choose a development stack. This sounds easier than it really is: there are quite a few requirements that a development stack and platform has to meet in order make it suitable for an installable product. When you develop a hosted solution, you don’t have to care so much about things like ease of deployment, cross-platform support or support for multiple databases. Other requirements such as Unicode support are also a lot easier to implement if you can control every little detail of the run-time environment. So these are our main requirements for an ideal development stack:

  • Ease of deployment
    Because our customers have to install the application themselves, it must be very easy to deploy. The application will of course come with a convenient installer to setup a database, configure the initial settings and so on. However, a platform with complex dependencies should be avoided as it would cause endless support headaches and would make it much harder for customers to evaluate our product.

  • MVC framework
    We have been using different frameworks and patterns for web development over the years and we strongly believe in using the MVC pattern for building web applications. Our ideal web development stack comes with a solid and clean MVC framework to make it easier to build a maintainable application.

  • Cross-platform
    We fully expect that 80%+ of our customers will install our web application on a Windows server (those are still extremely popular for internal application and workgroup servers). However, to make it easier for us to offer hosted trials (or even fully hosted application instances), we also need the application to run on Linux machines. Linux support is also handy for customers with distributed teams who want to host our application on cheap dedicated Linux servers on the net.

  • Multiple databases
    Similar to the cross-platform requirement, our development stack of choice should support multiple databases. Many companies already have a preference for a certain database (because it’s already installed, maintained and backuped etc), so having easy-to-use support for multiple databases is a big plus. We initially plan to support MS SQL and MySQL only, but we might add support for additional databases such as PostreSQL in the future.

  • Support for Unicode and localization
    In an ideal world, every application would have great Unicode and localization support. In reality, though, many applications (and web frameworks) still have a lot of problems with non-Western characters, languages and locales. To make our application suitable for an international audience, good Unicode and localization support in a web framework is key.

  • Safe future and strong community
    Of course, the ideal development stack and platform has a strong and clear future. However, it’s not easy to tell which technologies have a safe future. Open platforms with a strong community certainly have better chances to survive in case the owner or original organization behind it loses interest.

I can already reveal that none of the evaluated platforms and frameworks met all our requirements. Some platforms make it really easy to deploy applications and have excellent Unicode support, but fail the cross-platform requirement at the same time. Other frameworks and stacks provide great cross-platform support, but have really bad support for Unicode.

If you are planning to build a web application yourself, make sure to read the upcoming second part of this series to learn about the pros and cons of the available development stacks and platforms (if you haven’t subscribed to this blog yet, here’s the feed). I will also reveal which development stack we have chosen and explain the reasons behind the decision.

In the first two posts, I introduced the basic concepts and technical details of string pools. Today I will show you how we implemented the string pool, integrated it into our SmartInspect Console and why the effort was worth it.

Implementation

I already wrote in the previous posting that I’m a big fan of hash tables. They have a great constant-time asymptotic run-time performance and consume only a reasonable amount of memory at the same time. A very rare combination. At first look, it’s not that obvious how you can achieve constant-time lookups, inserts and deletes on a set of items. Let me explain how this works. A hash table, in its simplest form, usually consists of:

  • A table / array
    The array is responsible for holding the items. The items are usually key/value pairs. For our string pool, we are only interested in the keys and can therefore ignore the values. The hash table can keep this array fixed in size or dynamically resize it depending on the current load factor (i.e. the ratio of the used array elements and the capacity of the array).
  • A hash function
    The hash function is a function which calculates a value, the hash, for a given item. This hash is then mapped directly to the index in the array.
  • Collision lists
    Since a hash table has only limited space but usually deals with an unlimited set of possible items, it cannot have a simple 1:1 relationship between items and the indices in the array. It may happen that two different items are mapped to the same index in the array. This is called a collision. Collisions are handled by allowing more than one item per array element, usually in the form of collision lists but sometimes also with collision trees or even inner hash tables.

For our string pool, we chose a configurable but fixed-size array and standard collision lists as its secondary storage. We decided against a variable-size array to avoid occasional large memory moves and the need for re-indexing the items. It also helped to keep the implementation simple. Picking a sufficiently-large array and, more importantly, a good hash function was therefore crucial:

“A good hash function is essential for good hash table performance. A poor choice of a hash function is likely to lead to clustering, in which probability of keys mapping to the same hash bucket (i.e. a collision) is significantly greater than would be expected from a random function. A nonzero collision probability is inevitable in any hash implementation, but usually the number of operations required to resolve a collision scales linearly with the number of keys mapping to the same bucket, so excess collisions will degrade performance significantly.”

I personally can recommend the djb2 hash function. We have used it with great success in several occasions over the years. It’s fast, easy to implement and provides a good hash distribution:

function HashOf(const S: String): Cardinal;
begin
  Result := 5381;
  for I := 1 to Length(S) do
  begin
    Result := ((Result shl 5) + Result) + Ord(S[I]);
  end;
end;

Side note: I also experimented with other hash functions such as the Bob Jenkins hash or several SBox variants. I found that, although these functions are said to provide a better hash distribution as djb2, the djb2-based string pool outperformed all other string pool variants. In several tests with our particular string pool implementation and varying inputs and array sizes, the djb2 solution sometimes even won by a factor of two or three. Not sure why this happens actually. I will investigate this, maybe that’s something for another article.

So, when putting all these pieces together and combining it with the basic string pool algorithm I gave in the previous post, interning a string now roughly looks like this:

function InternString(const S: String): String;
var
  LIndex: Integer;
  P: ...
begin
  { Hash the string and map it to the array }
  LIndex := IndexOf(S);

  { Lookup the record at the given array index }
  P := Find(LIndex);
  if not Assigned(P) then
  begin
    { Allocate a new record and add it to the table }
    ...
  end;

  { Return a reference to our interned string }
  Result := P.Value;
end;

For the threading related aspects of the string pool, we opted for the thread-local solution. This may not be right decision for all applications but makes perfect sense for the Console. Because its string input and distribution are heavily thread-local as well, spreading the string pool over all application threads wouldn’t bring many benefits. It could possibly even decrease the overall application performance. We added support for thread-locality directly to the string pool by storing the internal hash table in thread-local-storage.

Integration

Since we chose the thread-local option, we needed to add one string pool to each application thread which calls InternString for the various strings it processes and loads. The Console uses a multi-threaded approach for its log clients, so we integrated the string pool into the main thread and into each of the listener classes for the clients (TcpListener and PipeListener for handling dynamic threads and clients via TCP/IP and named pipes, respectively). The integration consists of just a few lines of code for each thread pool instance to dynamically initialize and shutdown the string pool for the particular thread (required for setting and cleaning up the hash table array). The string pools are either cleared when a dynamic client thread exits, or when a user clears the log in the main thread.

Our string pool is plugged right into the reading and loading process of log packets. Every time a string property is read from a log stream, we route the result through InternString. It basically works as follows:

function TLogEntry.ReadFromStream(const AStream: TStream);
var
  LHostNameLength: Integer;
begin
  LHostNameLength := ...

  ...

  { Read the host name from the stream and get the equivalent string
    from our string pool }
  FHostName := InternString(ReadUtf8String(AStream, LHostNameLength));

  ...
end;

The following screenshot of the Console illustrates duplicates that are avoided with this technique. There are several additional strings we intern that aren’t obvious from the screenshot, but it should give you a good overview nonetheless (for example, the watches toolbox on the lower left corner of the screenshot displays only the latest name/value pair for a given watch, but the names and values can be logged and stored multiple times):

SmartInspect Console with interned strings highlighted

Results

Besides improving the memory consumption of the Console, one design goal of the string pool and integration was to keep the performance penalty of the changes to a minimum. We therefore tested the hash table with several different array sizes and chose a size that offers both a good overall performance and uses a reasonable amount of memory. It turned out that a relative large array with 216 elements would be a good fit.

The following table shows some benchmarks we conducted on my current development machine (an Intel 3GHz Core 2 Duo). Each row lists the interning of one million strings with varying contents. The time column specifies the time spent for the loop to intern the strings. The memory column specifies the memory consumption of the interned strings compared to the consumed memory without a string pool (not including the size of the fixed-size array).

Time Memory
Random, length 1 to 100 702 ms 103.47%
Random, fixed length of 25 639 ms 106.89%
Numbers, all the same 187 ms 0.01%
Numbers, 10% different 203 ms 11.45%
Numbers, 50% different 249 ms 56.43%
Numbers, 90% different 297 ms 102.71%
Numbers, no duplicates 312 ms 112.67%

As you can see, performance and memory consumption improve if you have more duplicates in your data. This is expected since more duplicates result in a smaller load factor of the hash table and less memory overhead. This also means that the memory consumption increases if you have less duplicates (which then leads to a decrease in performance because of the collisions and resulting overhead).

It’s therefore important to decide which strings should be interned. If a string has a small probability of being duplicated somewhere in your application or thread, it’s likely that interning such a string does more harm than good. Interning a few additional strings isn’t a lot of overhead of course, but if you unnecessarily intern thousands of thousands of strings, this may negatively influence your application performance and even slightly increase your memory usage. So the rule should be to only intern strings that are likely to be duplicated.

I think we found a very good trade-off for the Console. We expect that only 30-40% of strings in a typical log file are different, which means a reduction of memory consumption of around 50% for strings. But even in the worst case (a log file with no duplicates whatsoever), the overhead is not really noticeable. We are very pleased with these results and are happy that we took the time to implement and integrate the string pool into the SmartInspect Console. Optimizing and reducing the memory consumption of the Console means that customers are now able to open and analyze even larger log files, and we are already planning additional improvements to reduce the memory consumption even further.

Optimizing Memory Consumption with String Pools: Part II

By Tobias Gurock, February 17th, 2009

As promised in the first part of this series, today I will delve a bit more into the technical details of string pools.

From an abstract point of view, a string pool can be seen as a simple black box which, on a given input string, returns a string with the same value. That’s all you can expect from a string pool. In particular, this means that it provides no real upfront information about the relationship between the memory of the input string and the memory of the output string. It may return the same string object (or, to be more precise, a reference to the same memory location) as passed in but may also do something completely different such as returning a different existing string object or even allocating a new string. It only guarantees that will you receive a content-wise equivalent string:

Image of String Pool

And that’s the whole point of a string pool. By giving this minimal guarantee and letting you worry about the memory of the input string (in languages without garbage collection at least), it can drastically optimize the string management and allocation. In the best case (memory-wise), you may end up with an application with no duplicate strings whatsoever. Let’s see how a string pool can do this.

On a given string:

  1. it checks whether a string with the same value is already interned
  2. if so, it returns a reference to this interned string, thus reusing an existing string and memory area
  3. otherwise, it stores the passed string and returns the reference to it

That’s the whole magic if you can call it like that. There may be small variations here and there but that’s the usual way you would implement a string pool.

In languages with manual memory management, a string pool may give the additional guarantee of always returning a string that is unrelated to the input string (from a memory perspective). This drastically simplifies the memory management for your input string because you can always free it safely after you got your equivalent string from your string pool. A string pool may provide this guarantee by always creating a copy of the given string in step three instead of returning the original string reference:

“Software written to minimize memory usage can conserve memory by keeping all allocated strings in a hash table. If an already existing string is found a pointer to that string is returned; otherwise, a new string is allocated and added to the hash table.” (emphasize mine)

That’s all there is to it?

In an ideal world, the presented properties and guarantees of a string pool should be sufficient to make use of it. In real-world applications, though, you unfortunately need some additional tidbits of information. As we all know, abstractions are leaky and the string pool is no exception. In particular, we need to know:

  • whether it can be safely used from multiple threads
  • which factors influence its performance and behavior

Threading and thread-safety

Let’s begin with the threading issue. From a threading perspective, a string pool can basically be modeled in one of the following ways:

  • Thread-unsafe
    The simplest form of a string pool doesn’t pay any attention to things like threading, locking and race conditions. Although such a thread-unsafe implementation may look useless for a data structure which is apparently best used in an application-wide fashion, it has a few very interesting properties: it is easy to develop, scales perfectly and contains no threading-related overhead.
  • Thread-safe with locks
    The most obvious implementation for a string pool with safe support for multi-threaded access. However, the most obvious solution is not always the best one. On the pro side, we have a straightforward and not too difficult implementation, and the support for thread-independence. On the con side, however, we have a huge performance and scalability bottleneck when used massively from multiple threads.
  • Thread-safe without locks
    Also a thread-safe but much more scalable and unobtrusive solution that uses a lock-free data structure underneath. Lock-free string pools can safely be used from multiple threads, are reasonable fast and sufficiently scalable. The problem is that they are very difficult to implement correctly without introducing obscure hidden bugs, something even experts cannot get right every time.

So, if you want to use a single string pool from multiple threads at the same time, you are essentially stuck with the most difficult option that is really easy to mess up. In this case, you can hopefully reuse some existing and proven string pool implementation and don’t need to write your own (I suspect, or hope, that String.Intern for .NET and String.intern for Java are implemented that way). But what if you wouldn’t need the support for an application-wide string pool? In this case, we may take a look at an additional fourth option:

  • Thread-local
    Inherently thread-safe by managing a different string set for each thread. This string pool type has the same positive properties as the thread-unsafe string pool: it’s fast and reasonable easy to implement. Managing string sets on a per-thread basis may, for instance, be implemented by adding some kind of thread-locale-storage support to the thread-unsafe approach.

Thread-local string pools are the perfect alternative if your string distribution is heavily thread-local as well or if you do not have access to a thread-safe lock-free string pool. Otherwise, the thread-safe lock-free option should be your string pool of choice.

Performance and scalability

As we have seen, the supported thread model is not only an important factor for the thread-safety of a string pool, but also influences its performance and scalability. In addition to this threading related performance, it’s also important to know the expected runtime performance from a string pool in isolation (i.e. without taking threading related aspects into consideration). In particular, we need to know whether interning a string is a constant-time operation or somehow depends on the number of previously interned strings. You definitely want to avoid the latter as it’s getting slower and slower the more strings you add to the pool (logarithmic scale may be acceptable depending on your amount of strings, but linear or even worse isn’t):

string_pool_performance

If you recall from the introduction of this post, a string pool has two internal operations for finding and adding string entries. These two operations determine the overall performance of the string pool and if one of the operations is slow, this means that the whole interning is slow. So, essentially, we need a data structure which can represent both operations efficiently and in constant time. One data structure which meets these requirements is the hash table:

“Hash tables support the efficient lookup, insertion and deletion of elements in constant time on average (O(1) in the number of accessed memory cells); this bound is independent of the number of elements stored [...]. These operations are all performed in amortized constant time, which makes maintaining and accessing a huge hash table very efficient.”

A hash table is something like the Swiss knife for a developer. If you need a data structure with an excellent average performance for inserts, deletes and lookups, the hash table is definitely the way to go.

In the last part of this series, I will explain how a hash table is usually build and what to consider when designing a hash table for a string pool. I will also reveal which thread model we chose for our own string pool implementation for the SmartInspect Console (Delphi doesn’t have a built-in string pool), how we implemented it and why integrating it into the Console was well worth it.

Optimizing Memory Consumption with String Pools: Part I

By Tobias Gurock, February 9th, 2009

While working on the latest version of SmartInspect, we brainstormed on how to reduce the memory consumption of the SmartInspect Console. The background for this is that we recently got one or two reports of Out Of Memory exceptions when loading really huge log files into the Console (huge begins at several 100MB depending on the installed amount of memory). While there’s a good solution for this problem (splitting log files, either by enabling log file rotation in the logging library or by using our splitting tool), it would be nicer if the Console could handle bigger log files by itself.

Before I go into details explaining what we’ve already done and what we are hoping to do in the future to improve the memory behavior of the Console, let me first give you some background information about SmartInspect log files and how they are represented in memory. The format of a SmartInspect log file is a simple binary format which is optimized for performance and size. It’s basically a sequence of packets each consisting of a simple packet header and packet body. The header describes its type and size and the packet body contains the actual logging data (if you are interested in all the gory details, see SmartInspect Log Formats and Protocols). While it’s not a super flexible file format, it can be written and read extremely efficiently and stores the data in a very compact way. It’s so compact that the Console needs around three to four times the memory of the log file size.

Here’s why. In a log file, the data is stored sequentially and nearly every byte of data in the log file on disk maps to one byte of real information. There is practically no overhead involved. This is completely different from the memory representation in the Console where:

  1. log packets are strongly-typed objects with the usual overhead associated with them
  2. strings and attached log entry data are accessed indirectly through pointers
  3. packets have several additional object fields to implement some Console features efficiently
  4. the packets require some additional management structures such as lists and queues

We’ve already reduced the memory usage of the Console by implementing several lazy loading mechanisms and by choosing smaller data types wherever possible (2 instead of 4 byte integers). We also drastically improved the log file loading process by avoiding temporary memory peeks. However, the real issue which we haven’t adressed yet are strings. One big problem is that strings are stored in log files with the UTF-8 Unicode encoding, but use the standard Windows UTF-16 encoding in the Console. This results in using twice as much as memory as needed in most cases, plus the additional management overhead that is always associated with strings. Here’s an illustration of how a string looks like in a SmartInspect log file and how the same string is represented in the Console’s memory (the Console is written in Delphi, but a string has a similar or even worse overhead in Java or .NET):

delphi_strings

If you have a string variable somewhere in a Delphi application, it’s basically just a pointer to its first character similar to char pointers in C (actually, its a pointer to the first Unicode code point, but let’s ignore the whole surrogate thing for a moment). Prepended to this array there’s a header record storing the reference count and the string length. Compared to how a string is stored in the log file, the memory representation has the overhead of the pointer, the reference count, an additional null character for C/Windows API compability and the different string encoding. All combined, a string represented in memory uses more than twice as much as space. Here’s certainly some room for improvement.

An obvious and quite effective solution would be to change the string encoding and use UTF-8 the same way it’s used in log files. But this would have a negative impact on performance as we would have to convert a string to UTF-16 every time we want to display it (all Windows APIs expect strings to be encoded with UTF-16). We’ve tested this and while it looked like a viable solution, we didn’t implement it because we came up with a better idea: a string pool.

Log files have the property that, although they usually contain thousands of strings, a large percentage of these strings are actually the same. Besides the actual data passed to log methods (such as log messages, variable names and values), the libraries also collect some context information such as the host name, application name and session name (sessions can be thought of as log categories). These strings never or rarely change inside a log, so we end up with thousands and thousands of copies of the same strings. Now the idea is to implement some kind of abstraction, the string pool, that makes it easy to avoid duplicate strings by storing them in a central location and reusing them. Implemented correctly, this has the potential of reducing the memory usage in the Console by so much that it uses less space than required for an (uncompressed) log file on disk.

The idea is actually quite simple and the concept is nothing new (it’s usually called string interning and is supported out-of-the-box by several modern programming environments, including Java and .NET):

“In computer science, string interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool. String interning is supported by some modern object-oriented programming languages, including Python, Ruby (with its symbols), Java and .NET languages.”

Although string interning/pooling is pretty much a known secret, we haven’t seen it used with Delphi yet and it’s a great solution for a not so untypical problem. Although the Console might be a special case and would benefit more from string interning than other applications, it’s a very useful concept every developer should be familiar with and add to his toolbox. That’s why you shouldn’t miss part II and part III of this blog series which will introduce the technical details of string pools, explain the several different usage scenarios you can optimize for and give a detailed implementation guide.