Forum Moderators: phranque
Background
You have a hierchical tree of categories, with items in them. The categories and items are in two database tables. items have a categoryid, and a category has a parentid, which is another category (or NULL if the category is the root of the tree). So, the category table defines a tree of nested categories, and the items table defines items in one of these tree nodes.
When displaying categories, you like to show a count of how many items are in it. But not just the count of items in the current category - you show how many items are in this category and also in all the subcategories within this one.
for example:
stuff (20)
- widgets (15)
- - blue widgets (5)
- - - *****
- - green widgets (3)
- - - ***
- - red widgets (7)
- - - ****
- - - red square widgets (3)
- - - - ***
- grommets (5)
- - ****
- - blue grommets(1)
- - *
In my beautiful ASCII rendering, a * is an item inside the category rendered just above it
The "blue grommets" category has 1 item in it. The "grommets" category has 4 items in it, but also the one from the blue subcategory which makes 5. The "widgets" category has no items in it, but it has three subcategories (and 1 sub-sub-category) which altogether contain 15 items.
As a programmer, you found that recursing through the entire database to count items was too costly, especially since it's done on every page view. So intelligently, you decide to denormalize this data, and store an itemcount column with the recursive count for each category.
You created a recursive function that runs once, takes about 30 seconds (because your database is pretty big), and populates the itemcount for every category. This worked wonderfully.
Problem
When a user moves an item from one category to another, this throws off the numbers. For instance, if I were to move a blue widget into the blue grommet category, my tree would look like this:
stuff (20)
- widgets (14)
- - blue widgets (4)
- - - ****
- - green widgets (3)
- - - ***
- - red widgets (7)
- - - ****
- - - red square widgets (3)
- - - - ***
- grommets (6)
- - ****
- - blue grommets(2)
- - **
numbers in red are ones affected by this change. Note that the root node - "stuff" - is not affected by this change.
If I move a red square widget into the red widgets category, the change only affects one category:
stuff (20)
- widgets (15)
- - blue widgets (5)
- - - *****
- - green widgets (3)
- - - ***
- - red widgets (7)
- - - *****
- - - red square widgets (2)
- - - - **
- grommets (5)
- - ****
- - blue grommets(1)
- - *
If a user deletes an item, then this decrements every category in the ancestry, right up to the "stuff" root. Similarly, if a user adds a new item, that increments the category's parent, grandparent, and so on up the ancestry right up to the root.
The challenge is to write a function that can, as efficiently as possible, update the denormalized itemcount data for categories that are affected. This function would be triggered by users adding an item, deleting an item, or editing an item in a way that changes its categoryid.
Assume that it is not practical to run the original recursive function through the entire tree.
Please do not present your solution with a code dump... describe the process in prose. Extra bonus points for any solutions that are in haiku or iambic pentameter.
- © 2009 httpwebwitch enterprises
In case the limerick is not clear - just a small function to update the total count any time any changes are made to an "item" is all that's needed. I have a similar system in place: we track inventory, and the three states are in stock, pending, and sold. Cust. orders an item, it gets moved to pending; stock is untouched, but what displays publicly is stock minus pending. When the item is shipped, pending and stock are decremented by quantity, sold is incremented by quantity. These are, similar your system, stored in the products "master" stock, pending, and sold fields for quick access and low overhead, rather than futzing about with the inventory tables.
moving from one node into another
affects the nodes above it in this way
if the nodes share a common ancestor,
common node's count of children doesn't change.
put ancestry of nodes in two arrays
remove any nodes they have in common
these are nodes that will not be affected
what you're left with are the ones that will change
in each element of the "from" array,
decrement the count of it by just one
in each element of the "to" array,
increment its count likewise by just one
what you end up with is a nice function
that gladly accepts two parameters
the first is the node the item's leaving
the second is node it's going to
if you are adding a brand new item
you leave that first parameter as null
it decrements no nodes because it has
no nodes to decrement so it gives up
the added item goes into a node
which shares no common ancestry with null
its ancestors are all incremented
by 1up all the way up to the root
if you are deleting an old item
leave the second parameter as null
just as before there is no ancestry
and so all its ancestors subtract one.
in this way you can update your treenodes
and keep the item counts denormalized
it's quite efficient for the database
and flexible enough to handle change
[edited by: httpwebwitch at 8:31 pm (utc) on Sep. 1, 2009]
function recalculate_itemcounts($from,$to){
$fromarr = get_ancestry($from); // returns an array
$toarr = get_ancestry($to); // returns an array
$f = array_diff($fromarr,$toarr);
$t = array_diff($toarr,$fromarr);
foreach($f as $v){
$query = "UPDATE categories SET itemcount = itemcount - 1 WHERE categoryid = ".$v;
mysql_query($query);
}
foreach($t as $v){
$query = "UPDATE categories SET itemcount = itemcount + 1 WHERE categoryid = ".$v;
mysql_query($query);
}
}
Well, IMO, if the database is so huge you can't do a recursion on a page view basis, the server must be pretty well equipped. It couldn't take more than a couple of seconds to do a recursion which would be acceptable if it was done only when an item was changed.
In fact, you are already doing a recursive lookup to get your ancestry tree.
Then if you perform your recursion on the index it should execute instantly.
Using this method I got a database containing over 1,000,000 rows to search in less than 2 seconds, on a machine with a very low spec that I just use for messing around on.