Forum Moderators: open
I have an online bible in mysql and I would like a script to automatically scan all of the words in the entire book, count and sort them by most popular (like the top 1000 or all the words sorted), and use this info on webpage or export to Excel csv. Is this possible? The book won't change, so it can be a one-time process.
MySQL/PHP output Example:
Top 1000 keywords
------------------
the - Appears 28,023 times
and - Appears 24,000 times
God - Appears 4,118 times
...
Moses - Appears 784 times
...
whatever - appears x times
Any ideas?
Count / Keyword
--------------
7 I
6 for
6 that
5 know
2 Thanks
1 manually
1 prioritizes
etc (inaccurate, but just gives an idea of what I hope to accomplish). I would think this would be a common thing needed by folks, but I'm still searching...
There's a single-line solution floating-around using a few standard Linux commands. It's not QUITE as simple as it might seem, as picking out "words" properly requires a bit of complex regular-expression magic.
However, I can't link to the blog that gives the solution, and I don't want to give the solution here without attribution.
So, search is your friend.
It uses sort, sed, and uniq, maybe awk.
After all, it's a book. It's most natural storage form is a flat file. Shouldn't be difficult to get it out of MySQL and into a file.
Once you have it in a file, there's all sorts of lexical analysis software that you can run on it.
I think you are beating yourself up by trying to do everything within MySQL. It just isn't the most suitable platform for this type of analysis.