Forum Moderators: coopster & phranque

Converting 1000 .doc files to .txt every day

Using Perl or PHP on a Linux server

Jaunty Edward

9:52 pm on Jan 16, 2007 (gmt 0)

Hi,

I think I have tried every combination of keywords on Google but could not find a solution to this task. I have to store the content of .doc files in a MySQL table, and for that I need to write a PHP function that can read a file and return its text.

I would prefer to use a Linux server, as the entire application is in PHP and MySQL, but if I can't find a way to do it there then I might consider Windows.

This means the classes that use the COM component are out. I did try some of them, but the problem is that they act like a macro that opens Word and saves the file in txt format, which in my opinion will take a lot of processing power and will certainly require Word to be installed on the server.

I did some more research and found a small program called Antiword which, if installed on the server, lets us convert .doc files to txt from the command line.

I am not very happy with that because I don't know how I would run a command from PHP. I think I will have to use system_exec() (not sure).
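For what it's worth, the Antiword route mentioned above could be sketched as a small shell function that converts a whole directory in one go. This is only a rough sketch, assuming antiword is installed and on the PATH and that it prints the extracted text to standard output. The function name convert_docs and the DOC_CONVERTER variable are made up for illustration and are not part of Antiword.

```shell
#!/bin/sh
# Sketch: convert every .doc file in a source directory to a .txt file.
# DOC_CONVERTER is a hypothetical override variable; it defaults to antiword,
# which is assumed to write the extracted text to standard output.

convert_docs() {
    src_dir=$1
    out_dir=$2
    converter=${DOC_CONVERTER:-antiword}
    failed=0

    mkdir -p "$out_dir" || return 1

    for doc in "$src_dir"/*.doc; do
        # When nothing matches, the pattern stays literal, so skip it.
        [ -f "$doc" ] || continue
        name=$(basename "$doc" .doc)
        if ! "$converter" "$doc" > "$out_dir/$name.txt"; then
            echo "failed: $doc" >&2
            # Do not leave a partial or empty output file behind.
            rm -f "$out_dir/$name.txt"
            failed=$((failed + 1))
        fi
    done

    # Return non-zero when at least one file could not be converted.
    [ "$failed" -eq 0 ]
}
```

It would be called as convert_docs /path/to/docs /path/to/txt (both paths are placeholders), for example from a daily cron job. As for starting a command from PHP, the built-in functions for that are exec(), shell_exec() and system(), with escapeshellarg() to quote file names safely.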

I am still looking for a better solution. I am surprised, as I feel this should be a common need for many applications, but I have not been able to find anything on the net.

Now I know why a lot of people hate Microsoft.

Thanks
Bye

phranque

11:28 pm on Jan 16, 2007 (gmt 0)

You might try using the Win32::Word::Writer Perl module, available on CPAN.
Essentially, you could use it to read in a Word doc and save it as a text file.
Not sure how well supported it is...

perl_diver

11:35 pm on Jan 16, 2007 (gmt 0)

Word documents are 'rtf' formatted, not 'doc' formatted, but maybe that's what he meant. 'doc' is a terrible file format that is not supported by any Perl modules I know of.