Why Compromise? MemSQL Outperforms NoSQL Solutions Again and Again
So, it probably isn’t much of a surprise to those of you that follow me on twitter that I am huge fan of the in memory database memSQL. There are a lot of awesome reasons why memSQL is crazy fast, which I’ll get to later, for why I have grown to love this database but let’s get started with my latest job…
I have 30 million results pages from Google searches, meaning 300 million entries for a URL, Domain, Subdomain, Keyword and Ranking. You can easily imagine a giant spreadsheet with this data in it. The row might look something like this in the spreadsheet…
1 | http://www.thegooglecache.com/ | www.thegooglecache.com | thegooglecache.com | google cache | 1
My first job is simple – given any URL, Subdomain or Domain, find all the keywords for which it ranks.
I have a couple of options here.
First, I could go with my personal favorite, grep+ramdisk solution. This essentially involves partitioning the file to many smaller files, distributing them based on their keys and using code to know where each file is based on the leading characters of the keys or…
Second, drop it into memSQL the way you would in a traditional database. Have a table for keywords, urls, subdomains, domains and another for rankings. Use joins to and proper indexes to try and speed things up.
1. Let me be clear, the first method is pretty time intensive. It took hours to get the data in just the right format and putting it onto a tmpfs ramdisk was at least a little concerning for durability in the future. It is kind of crazy, but each line of the data set would look something like xIDnsdf|23nasfl8@asdjo890a4#afju2n$23oav89hd%2 – each being a representative fixed-length hash of either the url, subdomain, domain, keyword and finally the raw ranking. This allowed me to use a much faster regular expression that the multi-threaded grep would be able to handle more quickly. Ultimately, I was able to get it to look up and return the keys – not the values, but at least the keys – in about .44 second I could retrieve the keys for the keyword and the rank for any domain, url, or subdomain. I was pretty damn please with myself.
2. Now, on the same machine, I dropped the data into memSQL. Same dual quad core xeon server, same RAM, same everything. Of course, I got to use adminer to set up my tables exactly how I wanted and the whole set up of the database and table structure took just a few minutes.
I then imported all the data just as you would without any special weird hashing, etc. Watching memSQL import data is pretty rad. I don’t have a bench mark for it but it feels like filling a bathtub with a firehose. It is pretty amazing.
And then I wrote the first query…
“SELECT kid,r FROM rankings RIGHT JOIN domains ON domains.id=rankings.did WHERE domains.domain=’wikipedia.org’”;
Easy enough. Enter.
Damn, 4 seconds.
But here is the magic. MemSQL essentially turns that query into an optimized procedure. I hit enter again and .12 seconds.
We have only just begun
Now that I plan on moving to a memSQL cluster, I expect to see even greater speeds. I want to nail down the 0.00 time which is often attainable with memSQL queries. But what is really exciting to me about memSQL is not just the speed.
I can get speed out of LevelDB. I can get speed out of Redis. I can get speed out of a lot of things. But what I can’t get out of them are flexibility.
That same database can now be used to do nearly every query I could imagine. I don’t need custom data structures for each particular type of query I will make.
Why I Love MemSQL
- I can get amazing results out of commodity developers (like myself)
- Nearly all my legacy mySQL code works out-of-the-box
- It essentially trains on my queries to speed up results beyond just being in RAM
Anyway, enough with the rave. Back to optimizing.