<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Safety Fork</title>
	<atom:link href="http://safetyfork.net/feed/" rel="self" type="application/rss+xml" />
	<link>http://safetyfork.net</link>
	<description>Harmless...but useless.</description>
	<lastBuildDate>Thu, 05 Apr 2012 03:46:26 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Desaturation Optimization</title>
		<link>http://safetyfork.net/2012/04/04/desaturation-optimization/</link>
		<comments>http://safetyfork.net/2012/04/04/desaturation-optimization/#comments</comments>
		<pubDate>Thu, 05 Apr 2012 03:45:40 +0000</pubDate>
		<dc:creator>comradexavier</dc:creator>
				<category><![CDATA[Optimization]]></category>

		<guid isPermaLink="false">http://safetyfork.net/?p=371</guid>
		<description><![CDATA[Recently while doing some user interface work, I wanted to present grayscale versions of a few images provided by the user at run-time. Win32 GDI does not provide a function to desaturate a bitmap, so I wrote my own. void float_desaturate(DWORD *p, UINT cb) { auto end = p + (cb / 4); for ( [...]]]></description>
			<content:encoded><![CDATA[<p>Recently while doing some user interface work, I wanted to present grayscale versions of a few images provided by the user at run-time. Win32 GDI does not provide a function to desaturate a bitmap, so I wrote my own.</p>
<pre>
void float_desaturate(DWORD *p, UINT cb)
{
    auto end = p + (cb / 4);
    for ( ; p < end; ++p)
    {
        DWORD rgba = *p;
</pre>
<p>To get a grayscale value for each pixel, I multiply each of its color channels by a fractional weight.</p>
<pre>
        float b = 0.11f * ((0x00FF0000 &#038; rgba) >> 16);
        float g = 0.59f * ((0x0000FF00 &#038; rgba) >> 8);
        float r = 0.30f *  (0x000000FF &#038; rgba);
</pre>
<p>The grayscale intensity of the pixel is the sum of the weighted color values. It's converted from a floating-point number to an integer, clamped in case a rounding error has pushed it past the 8-bit channel width in the image, and copied into each of the color channels. The alpha channel is copied directly from the source pixel, since desaturating the color does not change it.</p>
<pre>
        DWORD f = min(0xFF, static_cast&lt;DWORD&gt;(b + g + r));
        *p = (0xFF000000 &#038; rgba) | (f &lt;&lt; 16) | (f &lt;&lt; 8) | f;
    }
}
</pre>
<p>I tested the speed of this function on the largest bitmap I had handy, which happened to be the large version of <a href="http://xkcd.com/802/" title="xkcd's online communities map">xkcd's online communities map</a>, a 3072x3571 pixel, 4MB PNG file. In a Windows 7 VM on my 2009 Mac Pro, this algorithm takes an average of 94ms to convert the image to grayscale. That's probably fast enough for my purposes, since I only needed to convert a few, much smaller bitmaps.</p>
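<p>The post does not say how the averages were measured. As a minimal sketch of one way to do it, assuming a C++11 standard library that provides &lt;chrono&gt; (average_ms is a hypothetical helper name, not something from the original program):</p>

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` `runs` times and returns the mean
// wall-clock time per run, in milliseconds.
template <typename Work>
double average_ms(Work work, int runs)
{
    assert(runs > 0);
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    for (int i = 0; i < runs; ++i)
        work();
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;

    return elapsed.count() / runs;
}
```

<p>Passing a lambda that calls one of the desaturation functions on an in-memory copy of the bitmap would give a per-call average comparable to the figures quoted here. The functions do the same amount of work on an image that is already gray, so repeated runs on the same buffer should still be representative.</p>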
<hr />
<p>Still, it bothered me that I was converting every pixel to three floating-point numbers, only to do hardly any math on them and convert them right back into an integer result. Therefore, I rewrote the function to use fixed-point integer math.</p>
<pre>
void int_desaturate(DWORD *p, UINT cb)
{
    auto end = p + (cb / 4);
    for ( ; p < end; ++p)
    {
        DWORD rgba = *p;
</pre>
<p>There are two things to note about computing the fractional color values in this version. First, the constants have been converted from decimal fractions to integers that hold the color weights as 8-bit binary fractions (0x1C/256, 0x97/256 and 0x4D/256), rounded so that they add up to exactly 0x100.</p>
<p>Second, the color values are shifted into bits 8-15, so that they can be treated as 16-bit fixed-point numbers with 8 integer bits and 8 fractional bits.</p>
<pre>
        DWORD b = 0x1C * ((0x00FF0000 &#038; rgba) >> 8);
        DWORD g = 0x97 *  (0x0000FF00 &#038; rgba);
        DWORD r = 0x4D * ((0x000000FF &#038; rgba) &lt;&lt; 8);
</pre>
<p>Multiplying fixed-point numbers in this format produces 32-bit numbers with 16 integer bits and 16 fractional bits, so the sum has to be shifted right by 16 to recover the integer part. One nice effect of using integer math this way is that the fractions add up to exactly 0x100, which guarantees that the sum will not be larger than 0xFF, so I can skip the range check on the result.</p>
<pre>
        DWORD f = (b + g + r) >> 16;
        *p = (0xFF000000 &#038; rgba) | (f &lt;&lt; 16) | (f &lt;&lt; 8) | f;
    }
}
</pre>
<p>On the same image, the fixed-point function runs in an average of 42ms, which is more than twice as fast! This is the version I used in my program.</p>
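<p>The claim that the range check can be skipped is easy to check on a single pixel. The following is a portable sketch rather than the code from my program: it assumes std::uint32_t in place of DWORD, and the names gray_fixed, kWeightB, kWeightG and kWeightR are made up for illustration.</p>

```cpp
#include <cassert>
#include <cstdint>

// The weights from int_desaturate, as 8-bit binary fractions.
constexpr std::uint32_t kWeightB = 0x1C; //  28/256, roughly 0.11
constexpr std::uint32_t kWeightG = 0x97; // 151/256, roughly 0.59
constexpr std::uint32_t kWeightR = 0x4D; //  77/256, roughly 0.30

// The three weights add up to exactly 1.0 in this format.
static_assert(kWeightB + kWeightG + kWeightR == 0x100, "weights must sum to 1.0");

// Gray level of one pixel, using the same channel layout as int_desaturate.
inline std::uint32_t gray_fixed(std::uint32_t rgba)
{
    // Each channel is moved into bits 8-15 before it is multiplied.
    const std::uint32_t b = kWeightB * ((0x00FF0000u & rgba) >> 8);
    const std::uint32_t g = kWeightG *  (0x0000FF00u & rgba);
    const std::uint32_t r = kWeightR * ((0x000000FFu & rgba) << 8);

    const std::uint32_t gray = (b + g + r) >> 16;
    assert(gray <= 0xFF); // documents the invariant; never fires
    return gray;
}
```

<p>The brightest possible input, 0x00FFFFFF, comes out as exactly 0xFF, and the alpha byte never contributes to the sum.</p>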
<hr />
<p>However, I did write one more version. As an experiment, I used compiler intrinsics for SSE to compute four pixels each time through the loop. I hadn't written any SSE code before, so I may not have stumbled upon the most efficient way to do it.</p>
<pre>
void sse_desaturate(DWORD *p, UINT cb)
{
</pre>
<p>Each SSE register is 16 bytes, which is enough to hold four complete pixels. For most of this algorithm, I used the instructions that treat a register as eight 16-bit integers, so that I can use the fixed-point integer math I did before.</p>
<p>I start by setting up two constants. <em>zeroes</em> is self-explanatory, and <em>fractions</em> holds the same set of fractions I used before.</p>
<pre>
    __m128i zeroes = _mm_setzero_si128();
    __m128i fractions = _mm_set_epi16(0x0000, 0x001C, 0x0097, 0x004D, 0x0000, 0x001C, 0x0097, 0x004D);

    auto q = reinterpret_cast<__m128i*>(p);
    auto end = q + (cb / sizeof(__m128i));
    for (; q < end; ++q)
    {
        __m128i packed = _mm_load_si128(q);
</pre>
<p>I load four pixels of data into <em>packed</em>. Since I want to be able to do 16-bit multiplication, I have to 'unpack' them and do the multiplication two pixels at a time. _mm_unpacklo_epi8 and _mm_unpackhi_epi8 interleave bytes from each of their two arguments. I use <em>zeroes</em> as the first argument, so that <em>unpackedlo</em> and <em>unpackedhi</em> contain the color values in bits 8-15 of each 16-bit segment.</p>
<p>_mm_mulhi_epu16 multiplies each pair of unsigned 16-bit integers and keeps the high 16 bits of the 32-bit product. The low 16 bits are discarded. This is okay, since they are the fractional part in my fixed-point scheme, but it is less precise to discard them before summing, so this function might produce results that are a little darker than the non-SSE version. Note that <em>fractions</em> is set up so that the alpha channel is multiplied by zero; that way it is not included in the sum.</p>
<pre>
        __m128i unpackedlo = _mm_unpacklo_epi8(zeroes, packed);
        __m128i productlo = _mm_mulhi_epu16(fractions, unpackedlo);

        __m128i unpackedhi = _mm_unpackhi_epi8(zeroes, packed);
        __m128i producthi = _mm_mulhi_epu16(fractions, unpackedhi);
</pre>
<p>_mm_hadd_epi16 (an SSSE3 intrinsic) adds the adjacent pairs of 16-bit integers in its operands. Since I want to add three adjacent integers, I have to call it twice. Technically, this sums four adjacent integers, but the multiplication in the previous step has conveniently left a zero where the alpha channel was.</p>
<pre>
        __m128i sum = _mm_hadd_epi16(productlo, producthi);
        sum = _mm_hadd_epi16(sum, zeroes);

        __declspec(align(16)) struct { unsigned __int16 sum0, sum1, sum2, sum3, z4, z5, z6, z7; } sums;
        _mm_store_si128(reinterpret_cast<__m128i*>(&#038;sums), sum);
</pre>
<p>Up through SSE2 there is no instruction that arbitrarily shuffles bytes around (SSSE3 adds _mm_shuffle_epi8, which this version does not use), so the code below duplicates the grayscale value into all of the channels and fills in the original alpha with ordinary scalar code. I suspect this is the slowest part of this version of the function.</p>
<pre>
        auto p0 = reinterpret_cast<DWORD*>(q);
        auto p1 = p0 + 1;
        auto p2 = p0 + 2;
        auto p3 = p0 + 3;
        *p0 = construct_desaturated(*p0, sums.sum0);
        *p1 = construct_desaturated(*p1, sums.sum1);
        *p2 = construct_desaturated(*p2, sums.sum2);
        *p3 = construct_desaturated(*p3, sums.sum3);
    }
}

static inline DWORD construct_desaturated(DWORD original, unsigned __int16 sum)
{
    return (0xFF000000 &#038; original) | (sum << 16) | (sum << 8) | sum;
}
</pre>
<p>I timed this final version of the function as taking an average of 27ms, so it is definitely faster than the others, but there are complications. _mm_load_si128 requires that the data be aligned on a 16-byte boundary, and this function requires that the number of pixels in the image be a multiple of four. (I left out the checks for brevity.) Since my program's performance isn't limited by the speed of desaturating bitmaps, it's not worth the extra complexity to use the SSE version.</p>
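<p>The "a little darker" effect can be reproduced without any SSE code. The following scalar sketch imitates the arithmetic only, not the intrinsics themselves; mulhi_u16, gray_truncate_then_sum, gray_sum_then_truncate and max_darkening are hypothetical names, and std::uint32_t stands in for DWORD.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Scalar stand-in for one 16-bit lane of _mm_mulhi_epu16:
// the high 16 bits of an unsigned 16x16-bit product.
inline std::uint32_t mulhi_u16(std::uint16_t a, std::uint16_t b)
{
    return (static_cast<std::uint32_t>(a) * b) >> 16;
}

// Places one channel byte in bits 8-15, as the unpack step does.
inline std::uint16_t channel_hi(std::uint32_t rgba, int shift)
{
    return static_cast<std::uint16_t>(((rgba >> shift) & 0xFFu) << 8);
}

// SSE-style: every product is truncated before the three are added.
inline std::uint32_t gray_truncate_then_sum(std::uint32_t rgba)
{
    return mulhi_u16(0x004D, channel_hi(rgba, 0))
         + mulhi_u16(0x0097, channel_hi(rgba, 8))
         + mulhi_u16(0x001C, channel_hi(rgba, 16));
}

// int_desaturate-style: full products are added, then truncated once.
inline std::uint32_t gray_sum_then_truncate(std::uint32_t rgba)
{
    const std::uint32_t sum = 0x4Du * channel_hi(rgba, 0)
                            + 0x97u * channel_hi(rgba, 8)
                            + 0x1Cu * channel_hi(rgba, 16);
    return sum >> 16;
}

// Largest difference between the two over a coarse grid of colors.
inline std::uint32_t max_darkening()
{
    std::uint32_t worst = 0;
    for (std::uint32_t r = 0; r < 256; r += 5)
        for (std::uint32_t g = 0; g < 256; g += 5)
            for (std::uint32_t b = 0; b < 256; b += 5)
            {
                const std::uint32_t rgba = (b << 16) | (g << 8) | r;
                const std::uint32_t exact = gray_sum_then_truncate(rgba);
                const std::uint32_t approx = gray_truncate_then_sum(rgba);
                assert(approx <= exact); // truncating early never brightens
                worst = std::max(worst, exact - approx);
            }
    return worst;
}
```

<p>Pure white is the simplest example: truncating each product first gives 76 + 150 + 27 = 253, while summing first gives 255. Since each of the three truncations loses less than one unit, the two results can never differ by more than two levels.</p>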
]]></content:encoded>
			<wfw:commentRss>http://safetyfork.net/2012/04/04/desaturation-optimization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>C4996</title>
		<link>http://safetyfork.net/2012/03/21/c4996/</link>
		<comments>http://safetyfork.net/2012/03/21/c4996/#comments</comments>
		<pubDate>Thu, 22 Mar 2012 01:39:30 +0000</pubDate>
		<dc:creator>comradexavier</dc:creator>
				<category><![CDATA[Rants]]></category>

		<guid isPermaLink="false">http://safetyfork.net/?p=361</guid>
		<description><![CDATA[C4996 is a warning that Microsoft&#8217;s C++ compiler issues when compiling an invocation of a function that has been deprecated. In my case, it objected to compiling a function that looked something like this (from memory, so there are probably mistakes as I don&#8217;t have Visual Studio handy to check): LPSAFEARRAY SafeArrayFromVector(const vector&lt;int&gt; &#038;v) { [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://msdn.microsoft.com/en-us/library/ttcz0bys(v=vs.100).aspx" title="C4996">C4996</a> is a warning that Microsoft&#8217;s C++ compiler issues when compiling an invocation of a function that has been deprecated. In my case, it objected to compiling a function that looked something like this (from memory, so there are probably mistakes as I don&#8217;t have Visual Studio handy to check):</p>
<p><code>
<pre>
LPSAFEARRAY SafeArrayFromVector(const vector&lt;int&gt; &#038;v)
{
    SAFEARRAYBOUND b;
    b.lLBound = 0;
    b.cElements = v.size();

    long *p;
    LPSAFEARRAY pArray = SafeArrayCreate(VT_I4, 1, &#038;b);
    SafeArrayAccessData(pArray, reinterpret_cast&lt;void **&gt;(&#038;p));
    std::copy(v.begin(), v.end(), p);
    SafeArrayUnaccessData(pArray);

    return pArray;
}
</pre>
<p></code></p>
<p>In particular, Microsoft&#8217;s STL implementation objects to my use of p as an output iterator when calling std::copy(), because p is not a <a href="http://msdn.microsoft.com/en-us/library/aa985965(v=vs.100).aspx" title="Checked Iterators">checked iterator</a>. The warning appears to be a part of Microsoft&#8217;s campaign against buffer overruns. I wouldn&#8217;t normally object to a compiler warning that points out the possibility of a buffer overrun, but I do object to compiler warnings that can&#8217;t be silenced.</p>
<p>Since I know that I allocated an array large enough to hold the range I want to copy, I first tried the compiler&#8217;s syntax for suppressing a warning (according to the documentation, the #pragma takes effect at the opening brace of the next function):</p>
<p><code>
<pre>
#pragma warning(disable: 4996)
LPSAFEARRAY SafeArrayFromVector(const vector&lt;int&gt; &#038;v)
{
    ...
}
</pre>
<p></code></p>
<p>This didn&#8217;t work, because the actual site of the warning is where std::copy() calls another internal function checked_copy(). So I looked further and found MSDN documentation for a stdext::unchecked_copy() function, only to find that it had been removed in the version of the STL shipped with Visual Studio 2010.</p>
<p>The only option mentioned in the MSDN documentation is a compiler switch to turn off all warnings for functions deprecated as being potentially insecure, including warnings I want, in case other programmers on the project think they should still be calling strcpy().</p>
<p>Since I&#8217;m not willing to accept the spurious C4996 warning, I gave up and rewrote this: <code>
<pre>    std::copy(v.begin(), v.end(), p);</pre>
<p></code>as this:<code>
<pre>    memcpy(p, v.data(), sizeof(vector&lt;int&gt;::value_type) * v.size());</pre>
<p></code> which is half as concise and twice as error-prone.</p>
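<p>One way to take some of the risk out of the memcpy() version would be to wrap it in a small helper that states its assumptions. This is only a sketch: copy_to_buffer is a hypothetical function, not part of the STL or the Windows SDK, and it assumes a C++11 compiler.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper: copies every element of `src` into the raw buffer
// `dest`, which the caller says has room for `capacity` elements.
// Returns the number of elements copied.
template <typename T>
std::size_t copy_to_buffer(const std::vector<T> &src, T *dest, std::size_t capacity)
{
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy is only valid for trivially copyable types");
    assert(src.size() <= capacity); // debug-build guard against overruns

    if (src.empty())
        return 0; // avoids handing memcpy a possibly null data() pointer

    std::memcpy(dest, src.data(), sizeof(T) * src.size());
    return src.size();
}
```

<p>Because the template parameter is deduced from both arguments, the destination pointer has to have the vector's element type. With vector&lt;int&gt; that would mean declaring p as int * rather than long * (both are 32 bits wide with Microsoft's compiler).</p>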
<p>The flexibility of using pointers as iterators in STL algorithms is one of the few things I actually like about the STL. So thanks, Microsoft, for making STL programming a little more annoying.</p>
]]></content:encoded>
			<wfw:commentRss>http://safetyfork.net/2012/03/21/c4996/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

