sql

Sqlalchemy: merge on duplicates, primary key, unique and python class

For a pythonista an amateur python code writer like me to work with sql, especially mysql there used to be very limited choices and I remember the days I used MySQLdb module with raw SQL statement string all over the place in my python code. My code looked so tedious that I had to quit my job at yahoo to stay away from it. OK that wasn’t the reason why I quit, but the code I wrote for that tool to query bugzilla database was pretty lame indeed. I even had a class dedicated to generate sql statements based on different parameters passed in, and as expected I spent shit load of time maintaining and refactoring that class. Alright, in all fairness I have to admit that I shouldn’t attribute my horrible experiences solely to MySQLdb, it is a fantastic base data access layer implementation between python and mysql, lots of amazing tools/modules are based on it, my unpleasantness was actually almost entirely due to my bad sense of design pattern and refactoring.

So anyways, as history moves on and human nature of life keeps getting better and better, people, great and smart people created SqlAlchemy. The great thing (maybe just for me) is that it comes with data abstraction layer ORM. and suddenly I can access my data in an elegant, object-oriented way. I like it so much that I can dedicate a whole blog just talk about how it helped me become a better person and thus changed my life completely, Not.

But like all other powerful things SqlAlchemy has its Achilles’ Heel…… well, I’m not going to talk about that here, LOL. What I want to talk about is just something I always feel inconvenient and really wish they could make it part of the SqlAlchmey library. So the here is the task I’m facing more often than not, in a nutshell: So I have two tables users and addresses, and users have a foreign key constrain to addresses, which means multiple users can share one address, but not vise versa. So the my db will look like this:

1
2
3
4
5
6
7
8
9
10
        user = Table('user', metadata,
            Column('id', Integer, primary_key = True),
            Column('name', String(9)),
            Column('addr_id', Integer, ForeignKey('addr.id'))
        )
 
        addr = Table('addr', metadata,
            Column('id', Integer, primary_key = True),
            Column('addr_info', String(10), unique=True)
        )

and of course my table classes will look like this:

1
2
3
4
5
6
7
8
9
10
11
12
class User(object):
    def __init__(self, name=None):
        self.name = name
 
    def __repr__(self):
        return ','.join([self.name,
                         self.addr_id])
class Addr(object):
    def __init__(self, addr_info=None):
        self.addr_info = addr_info
    def __repr__(self):
        return self.addr_info

and of course the mappers:

1
2
mapper(User, user, properties={'addr': relation(Addr)})
mapper(Addr, addr)

now I’m reading my data from a file, with each line describing a resident and an address, my code is supposed to read this file and put them into the database. The logic is fairly simple here, just parse the file, initialize my data class instance with data, then put them into the database.

However, the insertion process needs special attention here: if the address you parsed already exist in the db, you don’t want to insert another row in the address table to create a duplicate record, you really just want to update your foreign key and only insert your new resident here to avoid redundant insertion in the address table. The operation sounds very generic and I was so certain that SqlAlchemy would have something magical I can use to make this “update if exist” happen automatically. And Session.merge() at some point was so close to the pony in my dream until I found it only checks on primary key (see discussion here). So my first and brutal force solution was always do a query first to query out the address id of existing record (None if not) and then assign the id explicitly to the foreign key for resident. Then my first cut looks like this:

1
2
3
4
5
6
7
        user, addr = self.parser.parse_data(line)
        known_addr = self.session.query(Addr).filter(Addr.addr_info==addr.addr_info).all()
        if known_addr:
            addr = self.session.merge(known_addr[0])
        else:
            self.session.add(addr)
        user.addr = addr

While this is essentially what needs to be done on logical level (unfortunately there are no magical APIs to handle this situation from SqlAlchemy, due to performance reasons, more details here), it really makes the code long and hard to maintain. without putting this logic into some code block, almost every data insertion would require this kind of query/validation first. So what would be the best way to refactor this?

the first thing came up in my mind was to put this in my utility/helper class for database, the function could be shortened like this:

1
2
3
        user, addr = self.parser.parse_data(line)
        addr = db_helper.sync_with_db(addr)    #all query_check_merge biz goes here
        user.addr = addr

I’m not really in favor of this solution because I still need to remember to call the helper function from my utility/helper class after I parse my data. plus if I have more tables like address with slightly different logic I would ended up writing functions for each of them in my db_helper, and I have to remember to call all of them as well, it could get very ugly with a super long list of function calling.
Generally speaking, the duplication check is really the business of a table itself since different table has different column names, types and requirements built around them, so it feels natural to build this logic into the table class. Is there a way to embed this into my table class so the duplication check is performed when my instance is formed? There are some smart solutions I found online and my favorite one is to utilize the __new__ class method to check the duplication during class instantiation. Follow this recipe I could make my case work as simple as the following:

1
2
3
4
5
6
7
8
9
class Addr(object):
    def __init__(self, addr_info):
        self.addr_info =addr_info
 
mapper(Addr, addr)
Addr = unique_constructor(Session, 
            lambda addr_info:addr_info, 
            lambda query, name:query.filter(Addr.addr_info==addr_info)
)(Addr)

with declarative this could be simpler even more, as indicated on the recipe page above.

after that the call of

1
Addr('blah')

will grant me a instance representing either an existing row or a new record. No more clumsy query_first_and_then_add business.

I was trying to customized it to make it look simpler and only work the class I created, until I realized how silly that is, it’s like changing the super key to 1000 doors to make it a normal key to only open 1. I’m glad I stopped myself before I face another door and start to miss the super key.

but to keep the record of my stupidity and help me learn, here is my super hacky class, it passed my test of 3 line of data so I’m sure it is useful to someone out there, LOL:

1
2
3
4
5
6
7
8
9
10
11
class Addr(object):
    def __init__(self, addr_info):
        self.addr_info = addr_info
    def __repr__(self):
        return self.addr_info or " "
    def __new__(cls, addr=None):
        known_addr = sessionmaker()().query(cls).filter(Addr.addr_info==addr).all()
        if known_addr:
            return known_addr[0]
        else:
            return object.__new__(cls)
EmailDiggFacebookDeliciousStumbleUponTwitterTumblrGoogle GmailBlogger PostGoogle ReaderSina WeiboBox.netEvernoteFriendFeedGoogle BookmarksHotmailLiveJournalLinkedInPrintPrintFriendlyRedditSlashdotWordPressShare

Tags: , , ,

Saturday, June 4th, 2011 python, sql, tech No Comments