Forum Moderators: open

Message Too Old, No Replies

MySQL - sort entire book into top 1000 keywords

mysql, keywords, sort

         

trex2

11:33 am on Nov 16, 2007 (gmt 0)

10+ Year Member



Hello,

I have an online bible in mysql and I would like a script to automatically scan all of the words in the entire book, count and sort them by most popular (like the top 1000 or all the words sorted), and use this info on webpage or export to Excel csv. Is this possible? The book won't change, so it can be a one-time process.

MySQL/PHP output Example:

Top 1000 keywords
------------------
the - Appears 28,023 times
and - Appears 24,000 times
God - Appears 4,118 times
...
Moses - Appears 784 times
...
whatever - appears x times

Any ideas?

phranque

1:52 pm on Nov 16, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



it would probably help to put a fulltext index on that column.

trex2

2:39 pm on Nov 16, 2007 (gmt 0)

10+ Year Member



Thanks, I did that, but I don't know how that helps. Can you be more specific. I looked at several websites on fulltext indexing and it seems to be more about matching against a specific term and getting a count for that, rather than getting a count for all words in the entire database. I can currently manually go in and get a count for each word (example above), but I don't know how to automate it so it scans each word and prioritizes it. I hope that makes sense. For example, if I did what I want for this paragraph of text I'm now writing it might be something like this in .csv Excel format:

Count / Keyword
--------------
7 I
6 for
6 that
5 know
2 Thanks
1 manually
1 prioritizes

etc (inaccurate, but just gives an idea of what I hope to accomplish). I would think this would be a common thing needed by folks, but I'm still searching...

phranque

3:28 pm on Nov 16, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you could certainly do it in perl.
or even unix shell programming.
i stickied you a link that shows how to do what you are trying to do...

trex2

12:31 am on Nov 17, 2007 (gmt 0)

10+ Year Member



I can't seem to figure that out. Any other ideas to solve via PHP/MySQL?

phranque

1:21 am on Nov 17, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



what operating system are you using?

trex2

2:31 am on Nov 17, 2007 (gmt 0)

10+ Year Member



I have Windows XP.

trex2

2:34 am on Nov 17, 2007 (gmt 0)

10+ Year Member



You probably mean the host though- which I don't know what the OS is. It's just a basic hosting package with mysql/php/cpanel, etc, so I assume its on a Unix machine.

phranque

3:10 am on Nov 17, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i've never written php so you will have to fill in the blanks, but here is the process:
- get all your text into a text string
- fold the string to lower case
- compress all white space to single blanks
- split the string on blanks into an array of words
- use array_count_values

jtara

4:40 am on Nov 17, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't know why you would want to do this with MySQL. Do you need to save a database with the results?

trex2

6:02 am on Nov 17, 2007 (gmt 0)

10+ Year Member



It really doesn't matter how and what technologies are used to come to this result, as long as it's in a spreadsheet- if there's a way to calculate this in Excel, for example, then great, as long as I can get instructions on how to do it.

Has anyone ever done this before?

phranque

8:56 am on Nov 17, 2007 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i think there is a line in that book about teaching a man to fish...

jtara

7:22 pm on Nov 17, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, we've hit one of the Catch-22's of WebmasterWorld.

There's a single-line solution floating-around using a few standard Linux commands. It's not QUITE as simple as it might seem, as picking out "words" properly requires a bit of complex regular-expression magic.

However, I can't link to the blog that gives the solution, and I don't want to give the solution here without attribution.

So, search is your friend.

It uses sort, sed, and uniq, maybe awk.

trex2

10:03 pm on Nov 17, 2007 (gmt 0)

10+ Year Member



The closest I've found is a shell command to sort and rank keywords from a log file. I wonder how that can be tweaked to work for a mysql database.

trex2

7:27 am on Nov 19, 2007 (gmt 0)

10+ Year Member



I would think that keyword ranking would be fairly common. Has anyone done this with MySQL or Excel data?

jtara

4:33 pm on Nov 19, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why not dump your MySQL data to a flat file first?

After all, it's a book. It's most natural storage form is a flat file. Shouldn't be difficult to get it out of MySQL and into a file.

Once you have it in a file, there's all sorts of lexical analysis software that you can run on it.

I think you are beating yourself up by trying to do everything within MySQL. It just isn't the most suitable platform for this type of analysis.