The one line of code I am going to present here is one of the most complex lines of code I have ever written. The goal was to import StackOverflow’s questions and answers into MongoDB for further analysis. You can find the whole dump of StackOverflow in XML format here. Once you unpack it, it takes 8 lines of code to load it into MongoDB:

from pymongo.mongo_client import MongoClient
import xml.etree.ElementTree as etree
if __name__ == '__main__':
    db = MongoClient('localhost', 27017)['so']
    for event, elem in etree.iterparse('/home/kokan/Posts.xml', events=('end',)):
        if elem.tag != 'row': continue
        db.entries.insert_one(elem.attrib)
        elem.clear()

And this is literally the whole program!
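
If you want to try the parsing half without a MongoDB server or the full dump, here is a minimal sketch of the same streaming loop run against a tiny in-memory document. SAMPLE_XML and the iter_rows helper are made up for illustration; only the row tag and the iterparse call come from the program above.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical miniature stand-in for Posts.xml; the real dump has many more attributes.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Title="First question" />
  <row Id="2" PostTypeId="2" ParentId="1" />
</posts>
"""

def iter_rows(source):
    """Yield a plain dict of attributes for every <row> element in source."""
    for event, elem in etree.iterparse(source, events=('end',)):
        if elem.tag != 'row':
            continue
        yield dict(elem.attrib)  # copy first, because clear() empties elem.attrib
        elem.clear()             # drop the element's contents once we are done with it

rows = list(iter_rows(io.BytesIO(SAMPLE_XML)))
print(rows)
```

Every value in the printed dictionaries is still a string, which is exactly what ends up in MongoDB with the short version of the script.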

However, you might notice that all fields end up as strings in MongoDB. Somebody else might not care and just live with this, but I have OCD and I just couldn’t let that happen. So I started looking at all the attributes in the XML and figuring out their types. It turns out we have strings, integers, dates and even one list (the attribute “Tags”, which is in the format “<html><css><css3><internet-explorer-7>”). My first reaction was to add code like this:

for key,value in elem.attrib.items():
    if key == 'Id':
        elem.attrib[key] = int(value)
    elif key == 'CreationDate':
        elem.attrib[key] = dateutil.parser.parse(value + 'Z')
    elif key == 'Body':
        pass # this is already string
    ...
    else:
        print('Unknown key %s with value %s' % (key, value))

You can see where this is going… So I wanted a way to apply preprocessing logic to any given key, casting its value from a string to its real type. Another requirement was not to miss any key, i.e. I should have a list of all known keys, so that if a new key pops up, I can examine it and decide which type it is before rerunning the script. Here is my end result, a typed import in 23 lines of code (not counting the imports, which are the same as above plus an import of dateutil.parser):

INTEGER_KEYS = ('Id', 'ParentId', 'LastEditorUserId', 'OwnerUserId', 'PostTypeId', 'ViewCount', 'Score', 'AcceptedAnswerId', 'AnswerCount', 'CommentCount', 'FavoriteCount')
STRING_KEYS = ('Title', 'LastEditorDisplayName', 'Body', 'OwnerDisplayName')
DATE_KEYS = ('CommunityOwnedDate', 'LastActivityDate', 'LastEditDate', 'CreationDate', 'ClosedDate')
LIST_KEYS = ('Tags',)
 
def warning_nonexistant_key(key, value):
    print('Unknown key %s with value %s' % (key, value))
    return value
 
PREPROCESSOR = {
    INTEGER_KEYS: lambda k,v: int(v),
    STRING_KEYS: lambda k,v: v,
    DATE_KEYS: lambda k,v: dateutil.parser.parse(v + 'Z'),
    LIST_KEYS: lambda k,v: v[1:-1].split('><'),
    '': warning_nonexistant_key 
}
 
if __name__ == '__main__':
    db = MongoClient('localhost', 27017)['so']
    for event, elem in etree.iterparse('/home/kokan/Posts.xml', events=('end',)):
        if elem.tag != 'row': continue
        db.entries.insert_one(dict([key, PREPROCESSOR[next((key_type for key_type in PREPROCESSOR if key in key_type), '')](key, value)] for key,value in elem.attrib.items()))
        elem.clear()

Brief explanation: I created a dictionary PREPROCESSOR whose keys are tuples of all XML attribute names of a given type, and whose values are lambda functions that know how to cast a value from a string to that type. The key line here is line 22, the insert call inside the loop. For each XML attribute, it looks for the attribute name in each tuple that serves as a key of PREPROCESSOR, and if it finds it, it runs the matching preprocessor lambda. If it doesn’t find it, it falls back to the entry under the empty-string key, which prints a warning and returns the unmodified value (as a string). There is so much in this line: generator expressions (the lazy cousins of list comprehensions), dictionaries, tuples, lambdas and a couple of awesome built-in functions. If we unwrap it, it looks something like this:

entry = {}
for key,value in elem.attrib.items():
    found_key_type = ''
    for key_type in PREPROCESSOR.keys():
        if key in key_type:
            found_key_type = key_type
            break  # stop at the first match, like next() does
    cast_function = PREPROCESSOR[found_key_type]
    entry[key] = cast_function(key, value)
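
To see the lookup in isolation, here is a minimal sketch that runs the same next(..., '') dispatch on one hand-written row, with no MongoDB involved. The sample values, the shortened key tuples, the unknown_keys list and the convert_row helper are all made up for illustration. To keep it standard-library only, datetime.strptime with an assumed timestamp format stands in for dateutil.parser.parse, so treat the date handling as an approximation of the real script.

```python
from datetime import datetime, timezone

unknown_keys = []  # hypothetical helper list: remembers keys that matched no tuple

def warn_unknown_key(key, value):
    unknown_keys.append(key)
    return value  # leave the value as a string

def parse_date(key, value):
    # Assumes timestamps look like 2008-07-31T21:42:52.667 and are in UTC.
    parsed = datetime.strptime(value, '%Y-%m-%dT%H:%M:%S.%f')
    return parsed.replace(tzinfo=timezone.utc)

PREPROCESSOR = {
    ('Id', 'Score'): lambda k, v: int(v),
    ('Title',): lambda k, v: v,
    ('CreationDate',): parse_date,
    ('Tags',): lambda k, v: v[1:-1].split('><'),
    '': warn_unknown_key,  # fallback entry, selected by the default of next()
}

def convert_row(attrib):
    """Apply the converter whose key tuple contains each attribute name."""
    return {
        key: PREPROCESSOR[next((kt for kt in PREPROCESSOR if key in kt), '')](key, value)
        for key, value in attrib.items()
    }

row = {'Id': '4', 'Score': '-2', 'Title': 'Hi',
       'CreationDate': '2008-07-31T21:42:52.667',
       'Tags': '<html><css><css3>', 'Mystery': '42'}
entry = convert_row(row)
print(entry)
print('Unknown keys:', unknown_keys)
```

Collecting unknown keys in a list instead of only printing them is one way to meet the "don't miss any key" requirement: after a run, anything left in unknown_keys is a candidate for one of the typed tuples.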

Don’t get me wrong, I would never write a line of code like that in production, nor would I encourage others to do so. But this was fun, it was a one-time script, and I wanted to push my (and Python’s) limits. And it turned out pretty cool, admit it :)

The whole source code, if you are interested, is here.