OS X Localisation: incremental genstrings and UTF-8 files

by Steve

I ran into a couple of interesting issues while writing the first pass of text for the user-visible strings in a Cocoa app I’m working on (painfully slowly, as I learn the nuances of the environment), and I thought I’d share them. The script I ended up with is included in full further down.

The basic principle for text localisation on OS X is the same as on most systems: you externalise your user-visible strings into string tables and reference them by key in code, in this case using NSLocalizedString. Apple provide a tool called ‘genstrings’ which extracts all of these keys into a template strings file called Localizable.strings, which you can then populate per language. Localised files are kept in folders called en.lproj, fr.lproj and so on, and helpfully the one matching the user’s language is picked up by default at run-time. So far, so good.
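To make the file format concrete, here is a rough sketch of what a couple of genstrings entries look like and how they can be picked apart in Python. The sample text and the `parse_entries` helper are made up for illustration and are not part of genstrings; the pattern is the same one the script further down uses to recognise a translation line.

```python
import re

# Illustrative sample of genstrings output: a comment taken from the
# NSLocalizedString call, then a "key" = "value"; pair. In a fresh template
# the value is simply a copy of the key.
SAMPLE = u'''/* Title of the save button */
"Save" = "Save";

/* Shown when a file cannot be opened */
"Could not open %@" = "Could not open %@";
'''

# Same shape as the translation pattern used by the script in this post.
ENTRY = re.compile(r'^"(.+)" = "(.+)";$')

def parse_entries(text):
    """Return {key: value} for every "key" = "value"; line in text."""
    entries = {}
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            key, value = match.groups()
            entries[key] = value
    return entries

entries = parse_entries(SAMPLE)
```

Comment lines and blank lines are simply skipped here; the full script keeps the comments so they can be written back out alongside each entry.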

There are a couple of practical issues though. Firstly, genstrings always overwrites its output file, so you can’t use it incrementally to add new strings once you’ve already translated the previous set, which is bound to be the normal case for most developers. Luckily I found a nice little Python script which solves this problem by merging the results into your existing files. I’ve added a custom target with a Run Script step to my Xcode project which uses a modified version of this script (see below) to update my strings files whenever I need to.
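The merge rule itself is simple enough to sketch with plain dictionaries. This is only an illustration of the idea, assuming each strings file has already been reduced to a key/value table; `merge_strings` and the sample data are hypothetical names, and the real script also carries the comments along.

```python
def merge_strings(old, new):
    """Merge two {key: value} tables of localised strings.

    Keys come from `new` (the fresh genstrings output), so strings that
    were removed from the code disappear. Values come from `old` wherever
    a translation already exists, so earlier work is not overwritten.
    """
    merged = {}
    for key, default_value in new.items():
        merged[key] = old.get(key, default_value)
    return merged

# Existing French translations, and a fresh template where "Quit" has been
# removed from the code and "Open" has been added.
old = {u'Save': u'Enregistrer', u'Quit': u'Quitter'}
new = {u'Save': u'Save', u'Open': u'Open'}
merged = merge_strings(old, new)
```

In this example `merged` keeps the existing translation for "Save", gains the untranslated "Open", and drops "Quit" because the code no longer references it.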

The second problem is that genstrings creates UTF-16 encoded files, and there’s no way to alter this. The problem with UTF-16 is that neither Mercurial nor Git likes it very much: both systems’ text/binary detection will classify these files as binary, meaning you lose the ability to diff and merge them in any useful way. It’s not a deal-breaker, but it’s inconvenient. Couple that with the fact that OS X will quite happily use UTF-8 encoded .strings files directly at run-time (although iPhone will not), and it seemed like something I should resolve. For convenience during development, I modified the Python script (as shown below) to convert the output of genstrings to UTF-8 via iconv, so the files always get picked up as text in source control. If you’re deploying on iPhone, it’s trivial to write a small build script that calls iconv again to convert back to UTF-16 for deployment.
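If you would rather not shell out to iconv for the conversion back, the same step can be sketched in Python. This is a minimal illustration, not a tested build script: `convert_encoding`, `contains_nul` and the temporary file names are hypothetical. The NUL-byte check is there because, as far as I understand it, that is roughly what both Git and Mercurial look for when deciding that a file is binary, and UTF-16 text is full of NUL bytes.

```python
import io
import os
import tempfile

def convert_encoding(src, dst, from_enc, to_enc):
    """Re-encode the text file src into dst (newlines left untouched)."""
    with io.open(src, 'r', encoding=from_enc, newline='') as f:
        text = f.read()
    with io.open(dst, 'w', encoding=to_enc, newline='') as f:
        f.write(text)

def contains_nul(path):
    """True if the file holds a NUL byte, a common 'is it binary?' test."""
    with io.open(path, 'rb') as f:
        return b'\x00' in f.read()

# Demonstration on throwaway files (paths are illustrative only).
workdir = tempfile.mkdtemp()
utf8_path = os.path.join(workdir, 'Localizable.utf8.strings')
utf16_path = os.path.join(workdir, 'Localizable.strings')
original_text = u'/* Greeting */\n"Hello" = "Bonjour";\n'

with io.open(utf8_path, 'w', encoding='utf-8', newline='') as f:
    f.write(original_text)

# UTF-8 for source control, UTF-16 for deployment.
convert_encoding(utf8_path, utf16_path, 'utf-8', 'utf-16')

with io.open(utf16_path, 'r', encoding='utf-16', newline='') as f:
    round_tripped = f.read()
```

Python’s `utf-16` codec writes a byte-order mark at the start of the file, which is also what iconv expects to find when reading with `-f UTF-16`.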

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Localize.py - Incremental localization on Xcode projects
# João Moreno 2009
# http://joaomoreno.com/

# Modified by Steve Streeting 2010 http://www.stevestreeting.com
# Changes
# - Use .strings files encoded as UTF-8
#   This is useful because Mercurial and Git treat UTF-16 as binary and can't 
#   diff/merge them. For use on iPhone you can run an iconv script during build to 
#   convert back to UTF-16 (Mac OS X will happily use UTF-8 .strings files).
# - Clean up .old and .new files once we're done

from sys import argv
from codecs import open
from re import compile
from copy import copy
import os

re_translation = compile(r'^"(.+)" = "(.+)";$')
re_comment_single = compile(r'^/\*.*\*/$')
re_comment_start = compile(r'^/\*.*$')
re_comment_end = compile(r'^.*\*/$')

class LocalizedString():
    def __init__(self, comments, translation):
        self.comments, self.translation = comments, translation
        self.key, self.value = re_translation.match(self.translation).groups()

    def __unicode__(self):
        return u'%s%s\n' % (u''.join(self.comments), self.translation)

class LocalizedFile():
    def __init__(self, fname=None, auto_read=False):
        self.fname = fname
        self.strings = []
        self.strings_d = {}

        if auto_read:
            self.read_from_file(fname)

    def read_from_file(self, fname=None):
        fname = self.fname if fname is None else fname
        try:
            f = open(fname, encoding='utf_8', mode='r')
        except IOError:
            # Parenthesised print works under both Python 2 and Python 3.
            print('File %s does not exist.' % fname)
            exit(-1)
        
        line = f.readline()
        while line:
            comments = [line]

            if not re_comment_single.match(line):
                while line and not re_comment_end.match(line):
                    line = f.readline()
                    comments.append(line)
            
            line = f.readline()
            if line and re_translation.match(line):
                translation = line
            else:
                raise Exception('invalid file')
            
            line = f.readline()
            while line and line == u'\n':
                line = f.readline()

            string = LocalizedString(comments, translation)
            self.strings.append(string)
            self.strings_d[string.key] = string

        f.close()

    def save_to_file(self, fname=None):
        fname = self.fname if fname is None else fname
        try:
            f = open(fname, encoding='utf_8', mode='w')
        except IOError:
            # Parenthesised print works under both Python 2 and Python 3.
            print('Couldn\'t open file %s.' % fname)
            exit(-1)

        for string in self.strings:
            f.write(string.__unicode__())

        f.close()

    def merge_with(self, new):
        merged = LocalizedFile()

        for string in new.strings:
            if string.key in self.strings_d:
                new_string = copy(self.strings_d[string.key])
                new_string.comments = string.comments
                string = new_string

            merged.strings.append(string)
            merged.strings_d[string.key] = string

        return merged

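def merge(merged_fname, old_fname, new_fname):
    try:
        old = LocalizedFile(old_fname, auto_read=True)
        new = LocalizedFile(new_fname, auto_read=True)
        merged = old.merge_with(new)
        merged.save_to_file(merged_fname)
    except Exception as error:
        # Report the cause, then put the previous translations back so that a
        # failed merge does not leave the raw genstrings output in their place.
        print('Error: input files have invalid format (%s).' % error)
        if os.path.isfile(old_fname):
            os.rename(old_fname, merged_fname)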


STRINGS_FILE = 'Localizable.strings'

def localize(path):
    # Check each candidate relative to `path`, not the current directory.
    languages = [name for name in os.listdir(path) if name.endswith('.lproj') and os.path.isdir(os.path.join(path, name))]
    
    for language in languages:
        original = merged = language + os.path.sep + STRINGS_FILE
        old = original + '.old'
        new = original + '.new'
    
        if os.path.isfile(original):
            os.rename(original, old)
            os.system('genstrings -q -o "%s" `find . -name "*.m"`' % language)
            os.system('iconv -f UTF-16 -t UTF-8 "%s" > "%s"' % (original, new))
            merge(merged, old, new)
        else:
            os.system('genstrings -q -o "%s" `find . -name "*.m"`' % language)
            os.rename(original, old)
            os.system('iconv -f UTF-16 -t UTF-8 "%s" > "%s"' % (old, original))
        
        if os.path.isfile(old):
            os.remove(old)
        if os.path.isfile(new):
            os.remove(new)

if __name__ == '__main__':
    localize(os.getcwd())

Hopefully this is useful to someone else! I’m still very much learning on the Mac development side, so if there’s something I haven’t considered, please let me know.