<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Optimizing Memory Consumption with String Pools: Part I</title>
	<atom:link href="http://nobugleftbehind.com/optimizing-memory-consumption-with-string-pools-part-i/feed/" rel="self" type="application/rss+xml" />
	<link>http://nobugleftbehind.com/optimizing-memory-consumption-with-string-pools-part-i/</link>
	<description>Software quality, testing and programming.</description>
	<lastBuildDate>Fri, 24 Apr 2009 06:04:59 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Tobias Gurock</title>
		<link>http://nobugleftbehind.com/optimizing-memory-consumption-with-string-pools-part-i/comment-page-1/#comment-12</link>
		<dc:creator>Tobias Gurock</dc:creator>
		<pubDate>Tue, 10 Feb 2009 16:29:59 +0000</pubDate>
		<guid isPermaLink="false">http://nobugleftbehind.com/?p=33#comment-12</guid>
		<description>Unfortunately, it&#039;s not as easy as it first looks. For example, one problem is that logs can come from several sources, not just log files. The Console can also accept log packets from a TCP or named pipe connection. So, we would need some way of storing these streams in temporary files, which is not exactly rocket science but would certainly require some time to get right.

Another problem is performance. If you are dealing with several hundred MB of log data, adding a new view (which is, essentially, a new tab for the log data with a filter) would be way too slow (as this requires going through the entire list of log packets and checking whether the filter allows or denies each packet). There are several other cases like this in the Console.

And there are even more problems, such as updating (we allow adding comments to the log, for instance) or deleting data from the log.

Coincidentally, we are experimenting with the very same idea in a slightly different form. I think the string problem is largely solved by the string pool. The next problem deals with the attached data of log packets (log packets can transmit additional data such as images, hex dumps or database query results). To avoid storing this data in memory, we thought about adding some kind of file-based database support to the Console (for example, with SQLite). A file database would simplify delete and update operations and should provide better performance than direct file manipulation.</description>
		<content:encoded><![CDATA[<p>Unfortunately, it&#8217;s not as easy as it first looks. For example, one problem is that logs can come from several sources, not just log files. The Console can also accept log packets from a TCP or named pipe connection. So, we would need some way of storing these streams in temporary files, which is not exactly rocket science but would certainly require some time to get right.</p>
<p>Another problem is performance. If you are dealing with several hundred MB of log data, adding a new view (which is, essentially, a new tab for the log data with a filter) would be way too slow (as this requires going through the entire list of log packets and checking whether the filter allows or denies each packet). There are several other cases like this in the Console.</p>
<p>And there are even more problems, such as updating (we allow adding comments to the log, for instance) or deleting data from the log.</p>
<p>Coincidentally, we are experimenting with the very same idea in a slightly different form. I think the string problem is largely solved by the string pool. The next problem deals with the attached data of log packets (log packets can transmit additional data such as images, hex dumps or database query results). To avoid storing this data in memory, we thought about adding some kind of file-based database support to the Console (for example, with SQLite). A file database would simplify delete and update operations and should provide better performance than direct file manipulation.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John</title>
		<link>http://nobugleftbehind.com/optimizing-memory-consumption-with-string-pools-part-i/comment-page-1/#comment-11</link>
		<dc:creator>John</dc:creator>
		<pubDate>Tue, 10 Feb 2009 15:42:39 +0000</pubDate>
		<guid isPermaLink="false">http://nobugleftbehind.com/?p=33#comment-11</guid>
		<description>Unless I misunderstood, there is a much more straightforward solution: don&#039;t convert the data into in-memory structures. Work on the binary file directly and use virtual lists and virtual components, so that you never need more strings in memory than are visible on screen.

This is an even older solution, and it can handle data of practically unlimited size. The code required is usually dead simple (replace the fields in your object structures with methods that fetch the actual data).
As for performance, let the Windows file cache do the job for you.</description>
		<content:encoded><![CDATA[<p>Unless I misunderstood, there is a much more straightforward solution: don&#8217;t convert the data into in-memory structures. Work on the binary file directly and use virtual lists and virtual components, so that you never need more strings in memory than are visible on screen.</p>
<p>This is an even older solution, and it can handle data of practically unlimited size. The code required is usually dead simple (replace the fields in your object structures with methods that fetch the actual data).<br />
As for performance, let the Windows file cache do the job for you.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tobias Gurock</title>
		<link>http://nobugleftbehind.com/optimizing-memory-consumption-with-string-pools-part-i/comment-page-1/#comment-10</link>
		<dc:creator>Tobias Gurock</dc:creator>
		<pubDate>Tue, 10 Feb 2009 12:48:22 +0000</pubDate>
		<guid isPermaLink="false">http://nobugleftbehind.com/?p=33#comment-10</guid>
		<description>Thomas: We implemented it in a very similar way (we are using the hash-based approach), except that we also needed to account for usage from multiple threads. We currently plan to intern all string properties of a log file. We suspect that a typical log file contains at least 60-70% duplicates, so this should help a lot. First tests have shown that the string pool, together with the other improvements already included in the 3.2 beta, drastically reduces the memory usage of the Console.

Patrick: Indeed a nice solution. :-) A disadvantage I see is the increased latency of a string pool operation whenever it triggers one of your sort/resize/merge steps. However, I&#039;m sure it&#039;s just a theoretical issue and not really noticeable in practice (depending on the number of managed strings).</description>
		<content:encoded><![CDATA[<p>Thomas: We implemented it in a very similar way (we are using the hash-based approach), except that we also needed to account for usage from multiple threads. We currently plan to intern all string properties of a log file. We suspect that a typical log file contains at least 60-70% duplicates, so this should help a lot. First tests have shown that the string pool, together with the other improvements already included in the 3.2 beta, drastically reduces the memory usage of the Console.</p>
<p>Patrick: Indeed a nice solution. <img src='http://nobugleftbehind.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />  A disadvantage I see is the increased latency of a string pool operation whenever it triggers one of your sort/resize/merge steps. However, I&#8217;m sure it&#8217;s just a theoretical issue and not really noticeable in practice (depending on the number of managed strings).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Patrick van Logchem</title>
		<link>http://nobugleftbehind.com/optimizing-memory-consumption-with-string-pools-part-i/comment-page-1/#comment-9</link>
		<dc:creator>Patrick van Logchem</dc:creator>
		<pubDate>Mon, 09 Feb 2009 23:32:30 +0000</pubDate>
		<guid isPermaLink="false">http://nobugleftbehind.com/?p=33#comment-9</guid>
		<description>Same story here - I did this kind of thing years ago. The solution back then was rather nice if I may say so:

For each addition to the pool, we first searched the pool to prevent duplicates. When a string was already present, we returned that string, instead of the one being added (so the reference-count of the former was increased, while the latter was ultimately cleared).

The pool started out as a single unsorted array with room for 16 strings. (Searching this array required a check of each element, but on average the array was half-full, so it only took about 8 compares on average.)
As soon as the unsorted array ran out of space, we sorted these strings and moved the sorted array aside. (Sorted arrays can be searched more quickly by using a binary search.)

As soon as the unsorted array filled up again, we did something extra: not only did we sort the 16 strings, but now that we had 2 sorted arrays of equal size, we merge-sorted them into 1 array of size 32 and cleaned out the sorted size-16 array.

When the next fill-up of the unsorted array occurred, we could again move the newly sorted size-16 array aside, as there was no size-16 array in use anymore. But this wasn&#039;t possible on the fill-up after that, as we then had 2 sorted size-16 arrays and 1 size-32 array... I&#039;m sure you can see a pattern emerging here: we just merged all of those into one single size-64 array!

This pattern continued for ever larger powers of 2, meaning that presence checks became a bit more expensive as the pool grew, since we had to do a binary search in an ever-increasing number of sorted, power-of-2-sized arrays before we could conclude a string was not yet present. (A hit resulted in an early exit, of course.)
It&#039;s not as bad as it sounds though - I believe we never went past 25 levels or so.

For speed and fragmentation prevention, our implementation didn&#039;t actually clear out the arrays - we kept all arrays sized to their allotted capacity and instead used an Empty flag per array to mark which arrays were inactive and as such shouldn&#039;t be searched.

Anyway, as I said - this was years ago. Since then we managed to dispense with the pool altogether, as we off-loaded most of the strings - which is quite the memory-saver ;-)</description>
		<content:encoded><![CDATA[<p>Same story here &#8211; I did this kind of thing years ago. The solution back then was rather nice if I may say so:</p>
<p>For each addition to the pool, we first searched the pool to prevent duplicates. When a string was already present, we returned that string, instead of the one being added (so the reference-count of the former was increased, while the latter was ultimately cleared).</p>
<p>The pool started out as a single unsorted array with room for 16 strings. (Searching this array required a check of each element, but on average the array was half-full, so it only took about 8 compares on average.)<br />
As soon as the unsorted array ran out of space, we sorted these strings and moved the sorted array aside. (Sorted arrays can be searched more quickly by using a binary search.)</p>
<p>As soon as the unsorted array filled up again, we did something extra: not only did we sort the 16 strings, but now that we had 2 sorted arrays of equal size, we merge-sorted them into 1 array of size 32 and cleaned out the sorted size-16 array.</p>
<p>When the next fill-up of the unsorted array occurred, we could again move the newly sorted size-16 array aside, as there was no size-16 array in use anymore. But this wasn&#8217;t possible on the fill-up after that, as we then had 2 sorted size-16 arrays and 1 size-32 array&#8230; I&#8217;m sure you can see a pattern emerging here: we just merged all of those into one single size-64 array!</p>
<p>This pattern continued for ever larger powers of 2, meaning that presence checks became a bit more expensive as the pool grew, since we had to do a binary search in an ever-increasing number of sorted, power-of-2-sized arrays before we could conclude a string was not yet present. (A hit resulted in an early exit, of course.)<br />
It&#8217;s not as bad as it sounds though &#8211; I believe we never went past 25 levels or so.</p>
<p>For speed and fragmentation prevention, our implementation didn&#8217;t actually clear out the arrays &#8211; we kept all arrays sized to their allotted capacity and instead used an Empty flag per array to mark which arrays were inactive and as such shouldn&#8217;t be searched.</p>
<p>Anyway, as I said &#8211; this was years ago. Since then we managed to dispense with the pool altogether, as we off-loaded most of the strings &#8211; which is quite the memory-saver <img src='http://nobugleftbehind.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Thomas Mueller</title>
		<link>http://nobugleftbehind.com/optimizing-memory-consumption-with-string-pools-part-i/comment-page-1/#comment-8</link>
		<dc:creator>Thomas Mueller</dc:creator>
		<pubDate>Mon, 09 Feb 2009 21:56:58 +0000</pubDate>
		<guid isPermaLink="false">http://nobugleftbehind.com/?p=33#comment-8</guid>
		<description>If I remember correctly, I used a simple sorted TStringList back then, which reduced memory consumption dramatically while still giving decent performance.

Something like this:

procedure InternString(var _s: string);
var
  Idx: integer;
begin
  if InternedStrings.Find(_s, Idx) then
    _s := InternedStrings[Idx]
  else
    InternedStrings.Insert(Idx, _s);
end;

Where InternedStrings is the global TStringList in question. Not quite rocket science, is it? ;-)

A hashing list would probably perform even better.

The problem with this approach is that it never frees the strings. A solution could be to periodically scan the list and remove every string that has a reference count of 1. Of course, in your case, it is simpler: you can just clear the list when the user closes the log file. Also, as the list grows, inserting a string gets slower.

Btw: I am glad to hear you are working on reducing the memory requirements of the Console. That gives me hope that I will some day be able to load my log files without splitting them. They are usually huge. The only real solution for this problem, of course, is to not load the whole file into memory at all, but I guess you already knew that. I ran into memory constraints myself just a few weeks ago; it felt like a return to the bad days of DOS programming...</description>
		<content:encoded><![CDATA[<p>If I remember correctly, I used a simple sorted TStringList back then, which reduced memory consumption dramatically while still giving decent performance.</p>
<p>Something like this:</p>
<p>procedure InternString(var _s: string);<br />
var<br />
  Idx: integer;<br />
begin<br />
  if InternedStrings.Find(_s, Idx) then<br />
    _s := InternedStrings[Idx]<br />
  else<br />
    InternedStrings.Insert(Idx, _s);<br />
end;</p>
<p>Where InternedStrings is the global TStringList in question. Not quite rocket science, is it? <img src='http://nobugleftbehind.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>A hashing list would probably perform even better.</p>
<p>The problem with this approach is that it never frees the strings. A solution could be to periodically scan the list and remove every string that has a reference count of 1. Of course, in your case, it is simpler: you can just clear the list when the user closes the log file. Also, as the list grows, inserting a string gets slower.</p>
<p>Btw: I am glad to hear you are working on reducing the memory requirements of the Console. That gives me hope that I will some day be able to load my log files without splitting them. They are usually huge. The only real solution for this problem, of course, is to not load the whole file into memory at all, but I guess you already knew that. I ran into memory constraints myself just a few weeks ago; it felt like a return to the bad days of DOS programming&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tobias Gurock</title>
		<link>http://nobugleftbehind.com/optimizing-memory-consumption-with-string-pools-part-i/comment-page-1/#comment-7</link>
		<dc:creator>Tobias Gurock</dc:creator>
		<pubDate>Mon, 09 Feb 2009 20:31:53 +0000</pubDate>
		<guid isPermaLink="false">http://nobugleftbehind.com/?p=33#comment-7</guid>
		<description>Thanks for your feedback, Thomas. Yeah, it&#039;s a pretty common scenario I guess. It&#039;s too bad Delphi doesn&#039;t have built-in support for string interning. Ideally, Delphi would provide a simple InternString method allowing you to selectively intern some strings depending on your data.</description>
		<content:encoded><![CDATA[<p>Thanks for your feedback, Thomas. Yeah, it&#8217;s a pretty common scenario I guess. It&#8217;s too bad Delphi doesn&#8217;t have built-in support for string interning. Ideally, Delphi would provide a simple InternString method allowing you to selectively intern some strings depending on your data.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Thomas Mueller</title>
		<link>http://nobugleftbehind.com/optimizing-memory-consumption-with-string-pools-part-i/comment-page-1/#comment-6</link>
		<dc:creator>Thomas Mueller</dc:creator>
		<pubDate>Mon, 09 Feb 2009 20:09:27 +0000</pubDate>
		<guid isPermaLink="false">http://nobugleftbehind.com/?p=33#comment-6</guid>
		<description>Yes, I know, everybody could say that, but I implemented something like this about 10 years ago in Delphi. The problem was similar: an internal list of applications, publishers and filenames gathered from a population of several hundred computers. There were quite a lot of duplicates in that data, and storing duplicate strings only once reduced memory requirements significantly.

Of course I had never heard about &quot;string interning&quot; back then. ;-)

(I must be getting old, because I am starting to sound like all those old guys from when I started in computing: &quot;Been there, done that, nothing new under the sun.&quot;)</description>
		<content:encoded><![CDATA[<p>Yes, I know, everybody could say that, but I implemented something like this about 10 years ago in Delphi. The problem was similar: an internal list of applications, publishers and filenames gathered from a population of several hundred computers. There were quite a lot of duplicates in that data, and storing duplicate strings only once reduced memory requirements significantly.</p>
<p>Of course I had never heard about &#8220;string interning&#8221; back then. <img src='http://nobugleftbehind.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>(I must be getting old, because I am starting to sound like all those old guys from when I started in computing: &#8220;Been there, done that, nothing new under the sun.&#8221;)</p>
]]></content:encoded>
	</item>
</channel>
</rss>
