Office of Secret Intelligence

Who taught you to be a spy, fucking Gallagher!?!

Handling Increased Load

On top of being submitted to Postgres Weekly, my Rails 4 + Postgres trees tutorial was also submitted to Ruby Weekly, so there has been a pretty drastic increase in load as of late, which is great!

I'm using Pingdom and Google Analytics/AdSense to gauge where people are coming from and what kind of load this site is dealing with. While the traffic isn't super substantial, I've tried to make some improvements to keep page load times down around a few seconds (there have been spikes of a few minutes in the past, probably due to an influx of readers).

What surprised me was a) how much caching even small database calls helps out in the long run, and b) that the bottleneck STILL isn't at the database level, but mostly around serving assets. I've compressed the hell out of my CSS and JavaScript, but I'm not yet using progressive JPEGs or setting appropriate caching levels (browser caching, etc.) for images.

I mention images because it's been a little more difficult to flat-out cache them like I normally would other assets: they're served from S3 with an expire time attached to the URL, and if an image URL is cached beyond that expire time, the image won't display properly.

One of the things I've been looking at is progressive JPEG compression, where I convert everything that's not a GIF to a JPEG, strip out a lot of the profiles, and compress things down a bit (quality can be reduced significantly, especially in thumbnails, without a noticeable loss) to reduce image size and thus allow for better response times and lower bandwidth. Some of this is detailed in the documentation for CarrierWave, the image uploader I use.
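A sketch of what that processing might look like with CarrierWave's MiniMagick integration (the process name "compress" and the quality value are my own guesses, not anything from this site's actual uploader):

```ruby
class ImageUploader < CarrierWave::Uploader::Base
  include CarrierWave::MiniMagick

  process :compress

  # Strip embedded profiles/EXIF, recompress, and write a
  # progressive ("Plane"-interlaced) JPEG. Quality 80 is a guess;
  # thumbnails could likely go lower.
  def compress
    manipulate! do |img|
      img.combine_options do |c|
        c.strip
        c.quality '80'
        c.interlace 'Plane'
      end
      img
    end
  end
end
```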

Another option, after optimization, is to start using CloudFront. I think the cost would be negligible, and it would pretty much handle all of the static asset caching, making anything on my side unnecessary beyond Rails fragment and action caching. It's fascinating to begin to see the "macro" level at which performance optimizations are needed in web development.

I'm used to doing a lot of under-the-hood, behind-the-scenes API work that involves optimizing how data is processed in one form or another, but not necessarily "let's make this entire page smaller so the request returns faster and only needs to be performed once every few minutes."

More on this soon!

Caching, Caching, Caching and More Caching

So I updated the site code to do some more caching.  Things are going quite a bit faster between the Varnish ESI caching and the Redis caching for fragments and such.

I still need to figure out how to get the gallery pages' images caching properly, but as long as the main page and reply trees are cached, the Rails app itself will bear less of a burden.

Treeify 0.03 Release

Treeify 0.03 has been released.  I'll put something together that generates a reasonable changelog in the future, but for now, the biggest changes are:

  1. An actual README, so you can sort of figure out how to use it.
  2. A new method called "descendent_tree", which returns an array of hashes in a nested, tree-like format that's mighty handy for passing to a Rails view or for serializing to JSON and traversing with JavaScript.

All tests pass, which is good enough for me for now.  I'll test this new version with the Kodiak build running this site soon enough, and hopefully it won't break too much.

Thanks to GitHub user espen for opening the issue to create a README and motivating me to clean things up and get a reasonable release out.

Look, I Got Featured!

A kind soul submitted Quick and Dirty How To - Trees in SQL + Postgres + Rails 4 to Postgres Weekly.  Whoever you are, thanks!

In that same vein, I have an issue open on Treeify to add a README and some examples, so I'll be writing up a post about how to use it within the next day or so.

Quick and Dirty How To - Trees in SQL + Postgres + Rails 4

Intro

I'm going to give a quick rundown on how I implemented threaded comments in this app.

I'm using Postgres 9.4 (recursive queries have been available since 8.4), and Rails 4.

The Query

I wrote a little gem called Treeify, and (right now) it just gives us a little wrapper around some recursive SQL queries.  Here's the main method we're concerned with:

def tree_sql(instance)
  # Recursive CTE: seed with the given row, then repeatedly pull in
  # rows whose parent_id matches something already in the CTE,
  # accumulating the ancestry of each row in the `path` array.
  "WITH RECURSIVE cte (id, path) AS (
     SELECT id, array[id] AS path
     FROM #{table_name}
     WHERE id = #{instance.id}
     UNION ALL
     SELECT #{table_name}.id, cte.path || #{table_name}.id
     FROM #{table_name}
     JOIN cte ON #{table_name}.parent_id = cte.id
   )"
end

This generates some SQL that ends up looking like this:

SELECT "posts".* FROM "posts"
WHERE posts.id IN (
  WITH RECURSIVE cte (id, path) AS (
    SELECT id, array[id] AS path
    FROM posts
    WHERE id = 7
    UNION ALL
    SELECT posts.id, cte.path || posts.id
    FROM posts
    JOIN cte ON posts.parent_id = cte.id
  )
  SELECT id FROM cte
  ORDER BY path
)
ORDER BY posts.id

This does alright performance-wise, although I'd much rather not have the "IN" portion there and do a JOIN or something instead, as I believe that would be faster, but I digress.

So, moving on, we have a method called "descendents" which grabs all the descendents for a given post:

def descendents
  self_and_descendents - [self]
end

self_and_descendents simply grabs the whole tree; descendents just removes the root of the tree.  This gives us our tree of descendents, which ends up looking something like this (after a little bit of serialization - we'll get to that):

[{"id"=>20,
"title"=>"RE: testing",
"body"=>"<p>asfsafasd</p>",
"parent_id"=>7,
"user_id"=>1,
"created_at"=>Thu, 02 Oct 2014 20:04:45 UTC +00:00,
"updated_at"=>Thu, 02 Oct 2014 20:04:45 UTC +00:00,
"category_id"=>1,
"tsv"=>"'asfsafasd':3 're':1 'test':2",
"slug"=>"re-testing"},
{"id"=>21,
"title"=>"RE: testing",
"body"=>"<p>poop</p>",
"parent_id"=>7,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:17 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:17 UTC +00:00,
"category_id"=>1,
"tsv"=>"'poop':3 're':1 'test':2",
"slug"=>"re-testing-4d35d96b-1c8b-4749-bf4b-052af7baf3cf"},
{"id"=>22,
"title"=>"RE: RE: testing",
"body"=>"<p>poop fart</p>",
"parent_id"=>21,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:28 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:28 UTC +00:00,
"category_id"=>1,
"tsv"=>"'fart':5 'poop':4 're':1,2 'test':3",
"slug"=>"re-re-testing"},
{"id"=>23,
"title"=>"RE: RE: RE: testing",
"body"=>"<p>poop and fart</p>",
"parent_id"=>22,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:40 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:40 UTC +00:00,
"category_id"=>1,
"tsv"=>"'fart':7 'poop':5 're':1,2,3 'test':4",
"slug"=>"re-re-re-testing"}]

Cool!  Our whole tree in one query.
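(The subtraction in descendents above is just plain Array#-; a toy version with symbols standing in for ActiveRecord objects:)

```ruby
# self_and_descendents returns the whole tree, root included;
# subtracting [self] leaves only the children.
self_and_descendents = [:root, :child_a, :child_b]
descendents = self_and_descendents - [:root]
```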

But, it's not a tree, it's just a flat array of hashes.  We need a tree or it will look really weird when we display it.  Let's fix that.

Let's create a method in our model called "build_tree".  We can pass it the results of our descendents method, which I do like so:

def reply_tree
  # give build_tree an array of hashes with
  # the AR objects serialized into a hash
  build_tree(descendents.to_a.map(&:serializable_hash))
end

This just turns our descendents data into an array of serializable hashes, which could be turned into JSON or otherwise manipulated more easily.  Here's build_tree itself:

def build_tree(data)
  # turn our AoH into a hash where we've mapped the ID column
  # to the rest of the hash + a comments array for nested comments
  nested_hash = Hash[data.map { |e| [e['id'], e.merge('comments' => [])] }]

  # if we have a parent ID, grab all the comments
  # associated with that parent and push them into the comments array
  nested_hash.each do |id, item|
    nested_hash[id]['name'] = item['user_id'] ? User.find(item['user_id']).name : "Anonymous"
    parent = nested_hash[item['parent_id']]
    parent['comments'] << item if parent
  end

  # return the values of our nested hash, ie our actual comment hash data;
  # reject any descendents whose parent ID already exists in the main hash
  # so we don't get orphaned descendents listed as their own comment
  nested_hash.reject { |id, item|
    nested_hash.has_key? item['parent_id']
  }.values
end

Let's walk through this a little bit.

First, we want to turn our array of hashes into a nested hash, since we are dealing with tree data.

nested_hash = Hash[data.map{|e| [e['id'], e.merge('comments' => [])]}]

This casts the data variable (our array of hashes) as a hash, mapping each ID to the original hash (the comment data itself) merged with a new key called "comments" that's assigned an empty array.  This sets us up for our nested comments.

At this point, our data structure looks like this: 

{20=>
{"id"=>20,
"title"=>"RE: testing",
"body"=>"<p>asfsafasd</p>",
"parent_id"=>7,
"user_id"=>1,
"created_at"=>Thu, 02 Oct 2014 20:04:45 UTC +00:00,
"updated_at"=>Thu, 02 Oct 2014 20:04:45 UTC +00:00,
"category_id"=>1,
"tsv"=>"'asfsafasd':3 're':1 'test':2",
"slug"=>"re-testing",
"comments"=>[]},
21=>
{"id"=>21,
"title"=>"RE: testing",
"body"=>"<p>poop</p>",
"parent_id"=>7,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:17 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:17 UTC +00:00,
"category_id"=>1,
"tsv"=>"'poop':3 're':1 'test':2",
"slug"=>"re-testing-4d35d96b-1c8b-4749-bf4b-052af7baf3cf",
"comments"=>[]},
22=>
{"id"=>22,
"title"=>"RE: RE: testing",
"body"=>"<p>poop fart</p>",
"parent_id"=>21,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:28 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:28 UTC +00:00,
"category_id"=>1,
"tsv"=>"'fart':5 'poop':4 're':1,2 'test':3",
"slug"=>"re-re-testing",
"comments"=>[]},
23=>
{"id"=>23,
"title"=>"RE: RE: RE: testing",
"body"=>"<p>poop and fart</p>",
"parent_id"=>22,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:40 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:40 UTC +00:00,
"category_id"=>1,
"tsv"=>"'fart':7 'poop':5 're':1,2,3 'test':4",
"slug"=>"re-re-re-testing",
"comments"=>[]}}

As you can see, and as I mentioned earlier, we have a hash with each comment's ID as the key and the actual comment data as the value.
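That mapping step can be reproduced in isolation with a couple of made-up rows (only 'id' and 'parent_id' matter here):

```ruby
# Two hypothetical comment rows.
data = [
  { 'id' => 21, 'parent_id' => 7  },
  { 'id' => 22, 'parent_id' => 21 }
]

# Key each row by its ID and merge in an empty 'comments' array.
nested_hash = Hash[data.map { |e| [e['id'], e.merge('comments' => [])] }]
```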

Next step, we want to load up the sub-comments.

nested_hash.each do |id, item|
  nested_hash[id]['name'] = item['user_id'] ? User.find(item['user_id']).name : "Anonymous"
  parent = nested_hash[item['parent_id']]
  parent['comments'] << item if parent
end

This traverses the hash, checking whether each node has a parent ID that matches an ID in the hash, and if so pushes that node's data into the parent's 'comments' array (it also fills in each comment's author name along the way).

This is what it ends up looking like:

{20=>
{"id"=>20,
"title"=>"RE: testing",
"body"=>"<p>asfsafasd</p>",
"parent_id"=>7,
"user_id"=>1,
"created_at"=>Thu, 02 Oct 2014 20:04:45 UTC +00:00,
"updated_at"=>Thu, 02 Oct 2014 20:04:45 UTC +00:00,
"category_id"=>1,
"tsv"=>"'asfsafasd':3 're':1 'test':2",
"slug"=>"re-testing",
"comments"=>[],
"name"=>"Devin"},
21=>
{"id"=>21,
"title"=>"RE: testing",
"body"=>"<p>poop</p>",
"parent_id"=>7,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:17 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:17 UTC +00:00,
"category_id"=>1,
"tsv"=>"'poop':3 're':1 'test':2",
"slug"=>"re-testing-4d35d96b-1c8b-4749-bf4b-052af7baf3cf",
"comments"=>
[{"id"=>22,
"title"=>"RE: RE: testing",
"body"=>"<p>poop fart</p>",
"parent_id"=>21,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:28 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:28 UTC +00:00,
"category_id"=>1,
"tsv"=>"'fart':5 'poop':4 're':1,2 'test':3",
"slug"=>"re-re-testing",
"comments"=>
[{"id"=>23,
"title"=>"RE: RE: RE: testing",
"body"=>"<p>poop and fart</p>",
"parent_id"=>22,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:40 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:40 UTC +00:00,
"category_id"=>1,
"tsv"=>"'fart':7 'poop':5 're':1,2,3 'test':4",
"slug"=>"re-re-re-testing",
"comments"=>[],
"name"=>"Devin"}],
"name"=>"Devin"}],
"name"=>"Devin"},
22=>
{"id"=>22,
"title"=>"RE: RE: testing",
"body"=>"<p>poop fart</p>",
"parent_id"=>21,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:28 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:28 UTC +00:00,
"category_id"=>1,
"tsv"=>"'fart':5 'poop':4 're':1,2 'test':3",
"slug"=>"re-re-testing",
"comments"=>
[{"id"=>23,
"title"=>"RE: RE: RE: testing",
"body"=>"<p>poop and fart</p>",
"parent_id"=>22,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:40 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:40 UTC +00:00,
"category_id"=>1,
"tsv"=>"'fart':7 'poop':5 're':1,2,3 'test':4",
"slug"=>"re-re-re-testing",
"comments"=>[],
"name"=>"Devin"}],
"name"=>"Devin"},
23=>
{"id"=>23,
"title"=>"RE: RE: RE: testing",
"body"=>"<p>poop and fart</p>",
"parent_id"=>22,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:40 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:40 UTC +00:00,
"category_id"=>1,
"tsv"=>"'fart':7 'poop':5 're':1,2,3 'test':4",
"slug"=>"re-re-re-testing",
"comments"=>[],
"name"=>"Devin"}}

We now have populated sub-comments.
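In isolation, that linking pass looks like this (same hypothetical two-row data as before, with the User lookup omitted):

```ruby
# Rows already keyed by ID, each with an empty 'comments' array.
nested_hash = {
  21 => { 'id' => 21, 'parent_id' => 7,  'comments' => [] },
  22 => { 'id' => 22, 'parent_id' => 21, 'comments' => [] }
}

# Push each comment into its parent's 'comments' array
# whenever the parent is present in the hash.
nested_hash.each do |_id, item|
  parent = nested_hash[item['parent_id']]
  parent['comments'] << item if parent
end
```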

The final step is to make sure sub-comments are only displayed in their respective array.

nested_hash.reject { |id, item|
  nested_hash.has_key? item['parent_id']
}.values

Iterate over the hash, rejecting anything that has a parent_id that exists in the top-most level of the hash, and return the values of the "good" keys.

Giving us:

[{"id"=>20,
"title"=>"RE: testing",
"body"=>"<p>asfsafasd</p>",
"parent_id"=>7,
"user_id"=>1,
"created_at"=>Thu, 02 Oct 2014 20:04:45 UTC +00:00,
"updated_at"=>Thu, 02 Oct 2014 20:04:45 UTC +00:00,
"category_id"=>1,
"tsv"=>"'asfsafasd':3 're':1 'test':2",
"slug"=>"re-testing",
"comments"=>[],
"name"=>"Devin"},
{"id"=>21,
"title"=>"RE: testing",
"body"=>"<p>poop</p>",
"parent_id"=>7,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:17 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:17 UTC +00:00,
"category_id"=>1,
"tsv"=>"'poop':3 're':1 'test':2",
"slug"=>"re-testing-4d35d96b-1c8b-4749-bf4b-052af7baf3cf",
"comments"=>
[{"id"=>22,
"title"=>"RE: RE: testing",
"body"=>"<p>poop fart</p>",
"parent_id"=>21,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:28 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:28 UTC +00:00,
"category_id"=>1,
"tsv"=>"'fart':5 'poop':4 're':1,2 'test':3",
"slug"=>"re-re-testing",
"comments"=>
[{"id"=>23,
"title"=>"RE: RE: RE: testing",
"body"=>"<p>poop and fart</p>",
"parent_id"=>22,
"user_id"=>1,
"created_at"=>Fri, 03 Oct 2014 02:01:40 UTC +00:00,
"updated_at"=>Fri, 03 Oct 2014 02:01:40 UTC +00:00,
"category_id"=>1,
"tsv"=>"'fart':7 'poop':5 're':1,2,3 'test':4",
"slug"=>"re-re-re-testing",
"comments"=>[],
"name"=>"Devin"}],
"name"=>"Devin"}],
"name"=>"Devin"}]

...a nice tree-like structure we can iterate over in whatever we choose for a view.  Disregard the extra "name"=>".." bits; I'm still working out how best to retrieve author data, and am currently using a hacky and ugly method to do so.
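One way to clean that up would be to batch the name lookup instead of calling User.find once per comment (a classic N+1).  This is just a sketch: names_for is a hypothetical helper, and the USERS hash stands in for the users table.

```ruby
# Stand-in for the users table; in Rails this hash could be built in one
# query with something like Hash[User.where(id: ids).pluck(:id, :name)].
USERS = { 1 => 'Devin' }.freeze

# Hypothetical helper: map every user_id in the rows to a display name.
def names_for(rows)
  ids = rows.map { |r| r['user_id'] }.compact.uniq
  ids.each_with_object({}) { |id, h| h[id] = USERS.fetch(id, 'Anonymous') }
end

names = names_for([{ 'user_id' => 1 }, { 'user_id' => nil }])
```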

That's all for now.  Hopefully this sheds some light on this sort of thing.  Some improvements right off the bat would be to move the nested tree construction into the Treeify gem, and to make the SQL less clunky so we can mold it a little more and fetch associated data (like author info) more easily.
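For reference, the whole hash-to-tree transformation runs fine on plain hashes outside of Rails.  Here it is end to end, with hypothetical rows and the name lookup left out:

```ruby
def build_tree(data)
  # 1) key every row by ID and add an empty 'comments' array
  nested_hash = Hash[data.map { |e| [e['id'], e.merge('comments' => [])] }]
  # 2) attach each row to its parent when the parent is in the set
  nested_hash.each do |_id, item|
    parent = nested_hash[item['parent_id']]
    parent['comments'] << item if parent
  end
  # 3) keep only the roots: rows whose parent is outside the set
  nested_hash.reject { |_id, item| nested_hash.key?(item['parent_id']) }.values
end

tree = build_tree([
  { 'id' => 20, 'parent_id' => 7  },
  { 'id' => 21, 'parent_id' => 7  },
  { 'id' => 22, 'parent_id' => 21 }
])
# tree has two roots (20 and 21), with 22 nested under 21
```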