Archive for the ‘Programming’ Category

Some programming languages really encourage using UNIX timestamps for working with dates. PHP is a good example of such a language. Functions like date, strtotime, strftime are used all the time. Most people don’t realize that timestamps in general can’t really be used for calculations though. The problem is that most countries use daylight saving time, which means that two times a year the local timezone changes. This nicely breaks the assumption that every day has 24 hours. It doesn’t. Sometimes it has 23 or 25 hours.

This StackOverflow answer is a nice example of the problem:

$mondayStr = "last monday";
if (date('N') !== '1') {  // it's not Monday today
    $mondayStr .= " last week";
}

$monday = strtotime($mondayStr);
echo date('r', $monday);    // Mon, 22 Feb 2010 00:00:00 +1000

$sunday = $monday + 86400 * 7 - 1;
echo date('r', $sunday);    // Sun, 28 Feb 2010 23:59:59 +1000

The code seems logical. Get the timestamp of the last Monday, add 60 * 60 * 24 * 7 – 1 seconds and you have the end of Sunday. Works fine most of the time. Although, if the $monday happens to be Mon, 22 Mar 2010 00:00:00, the date that was supposed to be Sun, 28 Mar 2010 23:59:59 will actually be Mon, 29 Mar 2010 00:59:59. Why? Because 28 March 2010 has only 23 hours.

Never use timestamps to do calendar calculations. It’s hard to get it right. If you really have to, at least use GMT timestamps.

Dealing with file names in a cross-platform application is not easy. A question about using a file name in QString to create a new TagLib file came up on the TagLib development mailing list yesterday. The original problem was not related to Unicode, but after fixing one C++ issue, it ended up there. So, what was wrong?

QString represents an Unicode string. That is, an array of Unicode code-points. The issue is that on most UN*X platforms, filesystems are not aware of Unicode. File names are stored as an array of bytes. The filesystems don’t care how are the bytes interpreted, but if applications want to display non-ASCII characters properly, they need to decode the bytes into Unicode. Since the filesystem itself doesn’t know the encoding, it’s necessary to look for the information somewhere else.

The user’s locale is probably the first place to look. If the user uses some encoding for input/output, it’s expectable that they use the same encoding for file names. This doesn’t always have to be the case, so GNOME for example uses a special environment variable named G_FILENAME_ENCODING. The problem is that all these solutions work globally for all filesystems. What if the main filesystem uses UTF-8 for everything, but the media player on which I sometimes upload files from Windows uses a different encoding? There is no way to tell applications that it should use CP-1250 for /media/disk-1 and UTF-8 for everything else.

That’s not everything though. Seeing broken characters is not nice, but not a blocking problem either. What if the application can’t even read or write such files? That’s a much larger issue. If the application is using Unicode to store file names, but it can’t properly decode/decode the name, it won’t be able to access the file. The obvious solution is to ignore Unicode and just use byte arrays. This would work fine on UN*X, but new problems will show up if you are trying to write a cross-platform application. To be able to access all files on Windows, you have to do the exact opposite. You have to work with Unicode. On Mac you also have to work with Unicode, but it’s even more interesting, because the filesystem will do Unicode normalization for you. There is no solution that works in all cases on all platforms.

To summarize the situation:

  • File names on UN*X are byte arrays. You don’t know their encoding, you can only guess. It’s safest to not treat them as Unicode. If you want to treat them as Unicode, use functions like QFile::decodeName() or g_filename_from_utf8() to do the guessing for you.
  • File names on Windows are in Unicode. You can work with them using UTF-16.
  • File names on Mac are in normalized Unicode. You can work with them using UTF-8, but you can’t just save any Unicode. The filesystem will normalize it to NFD for you.

It’s sad to say, but I think this is one area where Windows is the nicest platform to deal with.

Since I’ve started using Qt, I loved the “implicit sharing” concept it uses for it’s strings and container types. It become so much easier to pass these data around. I wasn’t aware that some STL implementations have copy-on-write semantics for strings as well. When I saw some recommendations for std::string on Stack Overflow, I’ve decided to check out the implementation in GCC and discovered that it indeed does some reference counting.

So the next step was comparing the implementations. I wrote a little program today to check how QString and std::wstring compare in terms of copy-on-write performance. Since QChar is 2 bytes and wchar_t is 4 bytes on my machine, it wouldn’t be completely fair comparison, so I’ve included also std::string. The results were quite surprising for me. STL does almost always better, but for some reason I wasn’t able to make not dereference the string on read-only operations.

wchar_t* QString std::wstring std::string
Read 0 ms 0 ms 2143 ms 2224 ms
Write 0 ms 5588 ms 2621 ms 2570 ms
Copy 1618 ms 601 ms 116 ms 117 ms
Copy + read - 601 ms 6161 ms 5079 ms
Copy + write - 11036 ms 6822 ms 6843 ms
Copy + append - 5801 ms 4650 ms 3482 ms

The table shows times in milliseconds for 10000000 repeated operations on a 200-character long string.

  • Read” just reads the string one character at a time, using the default [] operator. I’ve tried hard to find a cheaper way to do this for STL strings, but I failed (I wasn’t interested in using s.data() and then working with the primitive array, I wanted to work with the object directly).
  • Write” writes to all characters of the string, again one character at a time.
  • Copy” assigns one string to another, using the default = operator for string classes and memcpy for wchar_t*.
  • Copy + read” is the same as “Copy“, followed by “Read” performed on the copy.
  • Copy + write” is the same as “Copy“, followed by “Write” performed on the copy.
  • Copy + append” is again the same as “Copy“, followed by appending a short string to the copy.

I guess I should note that this wasn’t meant to be a generic benchmark of the string classes. I just wanted to know performance details about the copy-on-write implementations in them. The conclusion for me is that the STL strings in GCC are better than I always thought, but the fact that they dereference the data on read-only operations is not very nice.

While writing an SQL script to upgrade the MusicBrainz database for the last release, I needed a way to generate new UUIDs from SQL. PostgreSQL has a native UUID data type and a contrib module for generating UUIDs since version 8.3, but this wouldn’t help me, because I needed it to work with at least version 8.1. I had this idea to write PL/pgSQL functions to generate UUIDs, so I skimmer over the RFC 4122 that documents them and found out that it isn’t actually that hard.

MusicBrainz uses random-based UUIDs (version 4) for all it’s new IDs, so the first idea was to implement the same. I know I can’t use this code in the end, because I need a good pseudo-random number generator, but I couldn’t resist to write it anyway. Messing with bits in high-level languages is always fun :) Here is the result (because of the use of the random() function, don’t use the code for anything serious):

CREATE OR REPLACE FUNCTION generate_uuid_v4() RETURNS uuid
    AS $$
DECLARE
    value VARCHAR(36);
BEGIN
    value =          lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    value = value || lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    value = value || lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    value = value || lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    value = value || '-';
    value = value || lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    value = value || lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    value = value || '-';
    value = value || lpad((to_hex((ceil(random() * 255)::int & 15) | 64)), 2, '0');
    value = value || lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    value = value || '-';
    value = value || lpad((to_hex((ceil(random() * 255)::int & 63) | 128)), 2, '0');
    value = value || lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    value = value || '-';
    value = value || lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    value = value || lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    value = value || lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    value = value || lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    value = value || lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    value = value || lpad(to_hex(ceil(random() * 255)::int), 2, '0');
    RETURN value::uuid;
END;
$$ LANGUAGE 'plpgsql';

It turned out that we need deterministic IDs to be generated from the script, so V4 was out of question. That was good, because we would need a better PNRG for the final version.

The next idea was to create the URL on which the new rows will be server and generate name-based UUIDs using the URL namespace. The idea is to concatenate a namespace and a name, calculate a cryptographic hash of the result, and use it’s bits to generate the UUID.  There are two options for hashing, either MD5 (version 3) or SHA-1 (version 5). SHA-1 is preferred by the RFC, but PostgreSQL only has a built-in function for MD5, so the decision for us was easy. The code doesn’t depend any random numbers, so it’s good enough to use in production.

CREATE OR REPLACE FUNCTION from_hex(t text) RETURNS integer
    AS $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN EXECUTE 'SELECT x'''||t||'''::integer AS hex' LOOP
        RETURN r.hex;
    END LOOP;
END
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION generate_uuid_v3(namespace varchar, name varchar) RETURNS uuid
    AS $$
DECLARE
    value varchar(36);
    bytes varchar;
BEGIN
    bytes = md5(decode(namespace, 'hex') || decode(name, 'escape'));
    value = substr(bytes, 1+0, 8);
    value = value || '-';
    value = value || substr(bytes, 1+2*4, 4);
    value = value || '-';
    value = value || lpad(to_hex((from_hex(substr(bytes, 1+2*6, 2)) & 15) | 48), 2, '0');
    value = value || substr(bytes, 1+2*7, 2);
    value = value || '-';
    value = value || lpad(to_hex((from_hex(substr(bytes, 1+2*8, 2)) & 63) | 128), 2, '0');
    value = value || substr(bytes, 1+2*9, 2);
    value = value || '-';
    value = value || substr(bytes, 1+2*10, 12);
    return value::uuid;
END;
$$ LANGUAGE 'plpgsql' IMMUTABLE STRICT;

This code should be easy enough to modify to generate UUIDv5, if you have a way to calculate SHA-1 hashes. To use the function, you need to pass it a namespace and a name. The namespace itself is a UUID, it can be anything, but there are a few well-known options:

  • URL
    '6ba7b8119dad11d180b400c04fd430c8'
  • DNS (fully-qualified domain name)
    '6ba7b8109dad11d180b400c04fd430c8'
  • ISO OID
    '6ba7b8129dad11d180b400c04fd430c8'
  • X.500 DN (in DER or a text output format)
    '6ba7b814-9dad-11d1-80b4-00c04fd430c8'

The URL one is probably the most useful. So, to generate UUIDv3 for http://www.example.com/foo/1, you can use the following:

SELECT generate_uuid_v3('6ba7b8119dad11d180b400c04fd430c8', 'http://www.example.com/foo/1');