Forum Moderators: open

Message Too Old, No Replies

Query Optimization

         

trevordixon

2:40 am on Nov 16, 2006 (gmt 0)

10+ Year Member



I'm working with four tables right now:

citylistings
id of person
city they're listed in
state they're listed in

appraisers
id of person
name

appraiserinfo
id of person
email, address, phone, etc.

localareas
city
latitude/longitude

I need a query that does the following:

  1. Get a list of ids from the citylistings table where the city is some specified value.
  2. For each id, look up that person's information in the appraisers table and then the appraiserinfo table.
  3. Get the latitude and longitude of the person's zipcode from the localareas table.

Here's the query I came up with:


SELECT citylistings.zipid, appraisers.name, appraiserinfo.company, appraiserinfo.address, appraiserinfo.city, appraiserinfo.state, appraiserinfo.zip, appraiserinfo.phone, appraiserinfo.fax, acos(sin(localareas.latitude*0.01745329252)*sin(40.4192*0.01745329252) + cos(localareas.latitude*0.01745329252)*cos(40.4192*0.01745329252)*cos(localareas.longitude*0.01745329252 - -111.565*0.01745329252))*3958.75586 AS distance
FROM citylistings, appraisers, appraiserinfo, localareas
WHERE citylistings.city = 'American Fork' AND citylistings.st = 'UT' AND appraisers.zipid = citylistings.zipid AND appraiserinfo.zipid = citylistings.zipid AND localareas.zipcode = appraiserinfo.zip AND localareas.ordinal = 1
ORDER BY appraiserinfo.linkhighlighted DESC, distance;

The query returns the right results, but it takes about three minutes! Here's another query that the PEAR DB_DataObject library sort of "generated" for me:


SELECT *, citylistings.id as listing_id , citylistings.zipid as listing_zipid , citylistings.st as listing_st , citylistings.city as listing_city , citylistings.timelisted as listing_timelisted
, acos(sin(localareas.latitude*0.01745329252)*sin(40.4192*0.01745329252) + cos(localareas.latitude*0.01745329252)*cos(40.4192*0.01745329252)*cos(localareas.longitude*0.01745329252 - -111.565*0.01745329252))*3958.75586 AS distance
FROM appraisers
INNER JOIN zipappraisers.citylistings ON zipappraisers.citylistings.zipid=appraisers.zipid
INNER JOIN zipappraisers.appraiserinfo ON zipappraisers.appraiserinfo.zipid=appraisers.zipid
INNER JOIN zipappraisers.localareas ON zipappraisers.localareas.zipcode=appraiserinfo.zip
WHERE ( localareas.ordinal = 1 ) AND ( ( ( citylistings.st = 'UT' ) AND ( citylistings.city = 'American Fork' )) ) AND ( ( ( appraiserinfo.licensestate = 'UT' ) ) )
ORDER BY appraiserinfo.linkhighlighted DESC, distance

That one takes four seconds which is a great improvement, but it shouldn't take that long. How can I write my query so it does things in the most efficient order?

jtara

7:33 am on Nov 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why are you doing that crazy math in your SQL?

Presumably, you are calling this from a programming language. Do the math in the programming language!

Why do you have seperate appraisers and apprasierinfo tables? Can an appraiser have multiple appraiserinfo entries? If not, combine the tables.

You do have indices on the appropriate attributes, right?

FalseDawn

4:17 pm on Nov 16, 2006 (gmt 0)

10+ Year Member




Why are you doing that crazy math in your SQL?

It looks like something I did years ago, to calculate distances between 2 points with latitude and longitude. There may be a more efficient way to do it, but doing it in the SQL is not necessarily a bad thing, and in some cases, necessary.


You do have indices on the appropriate attributes, right?

Yeah - make sure you have indexes on all joined fields.
Also, make (citylingings.st, citylistings.city) an index

How much data is in the tables? Are you disk swapping?

I'm not sure why the first version took so much longer, as it's the same query with old-style joins. Guess the optimizer didn't like it.

jtara

7:42 pm on Nov 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In theory, the distance calculation should only be done on the set of results, which will probably be small.

We don't know what database system is being used here, but MySQL isn't known for having a great optimizer. I'd still lift that calculation from the SQL.

It just strikes me that it's pretty unconventional to do this much math in a SQL statement. So, it's likely that it's a use case that hasn't gotten much attention from the authors of the database system. It's a good idea to just avoid unconventional use cases - to do so is just asking for trouble. Do it when you have to, but avoid it when you can.

Either first create a temporary table, or read the result into the program without first sorting by distance. Do your distance calculation in the program, either storing it in the temporary result table or sorting in-memory. My assumption is that this is a small result set, so either way will be practical.

FalseDawn

8:03 pm on Nov 16, 2006 (gmt 0)

10+ Year Member



Agreed - there's more than 1 way to skin a cat, but without further details, it's hard to make a concrete recommendation.

LifeinAsia

9:58 pm on Nov 16, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Since he's sorting on distance, the math needs to be done in the SQL code. Alternatively, he could do the math outside the code and re-sort.

How many rows do you have in each table?

I'll ask the obvious question- do you have good indexes on your tables? That by itelf will usually speed things drastically.

Also, remember that char/varchar searches are a lot slower than integer searches. So it would also speed things up if you could add a CityID field to citylistings and localareas and search/join on that instead of city.

coopster

11:18 pm on Nov 20, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I have written distance calculators before and kept the calculations in the SQL -- it is blazing fast. Mathematical processing usually is. MySQL can be optimized just like any other database. Some concepts to remember:
  1. Index your tables
    Yes, already mentioned a couple of times but worth mentioning again. This is often the biggest issue.
  2. Limit the number of rows
    The order of the tables in the FROM clause is important; JOIN on tables likely to be most discriminating first, where result sets get minimized. Less data to process = better performance so process your optimal restrictions first.

Other thoughts:
Rethink your table structure
Can you ask your customers for a zip code to plop into the

citylistings
table? If so, add zip code to the lat/long file and link on that instead. A 5-digit numeric entry is going to be a lot less bytes than a 30-character city name, especially if it is of type VARYING CHARACTER (VARCHAR).

Why two tables for

appraiser
information?
Seems these two tables could and should be combined. I don't see where the data would require any type of separation and therefore multiple-table management processes? One less table to JOIN as well.

Just some food for thought ...

aspdaddy

6:35 pm on Nov 21, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That distance calculator isnt right is it?

What use is the birds eye view? This is where mathematical calculation lets you down, its just not sensible :)

In the UK you can buy the pivot table with the real info (distance by road) its based on postcodes and sectors and you just import it into your database, there must be something similar for the US?

These are the slow bits:

citylistings.st = 'UT'
citylistings.city = 'American Fork' appraiserinfo.licensestate = 'UT'

Because you are restricting by text you query does a a table scan of a big table, 3 times! , you really need to change it to this:

citylistings.st_id =1
citylistings.city_id =2
appraiserinfo.licensestate_id =1

By adding City and state lookup tables to the model. Now the lookups are tiny and properly indexed.

Can you create / compile UDF for the calculation ? It might speed it up a bit?

FalseDawn

7:01 pm on Nov 21, 2006 (gmt 0)

10+ Year Member




That distance calculator isnt right is it?

It depends on what it's for!

For example, I did one where it was exactly right, because it was calculating "damage radius" from a bomb blast for insurance purposes.
A shockwave does not follow roads! :-)

jtara

7:40 pm on Nov 21, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Because you are restricting by text you query does a a table scan of a big table, 3 times!

That's not true. As long as "city" is indexed, there's no reason for a table scan. Granted, it would be better to use an integer code value, but text fields benefit from indexing as well as integer fields. The only time an exhaustive scan is necessary is if you do a search that isn't anchored to the first character. (e.g. a wild-card search that doesn't specify the first character, a substring search not starting with the first character, etc.)

trevordixon

6:02 pm on Dec 2, 2006 (gmt 0)

10+ Year Member



I applied appropriate indices and the query execution time fell from 120 seconds to .07 seconds! I've learned the value of proper indexing.

coopster

2:32 am on Dec 3, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Proper indexing is often an overlooked step and has huge performance gains, as you now know. Nice job, glad you got it sorted.