PDF vs. MS Word

Is it better to use .pdf files or Word .doc files on a website?

         

logoex

8:57 pm on Nov 30, 2006 (gmt 0)

10+ Year Member



Hello, I am updating a client site of mine for a specialized surgery center. They have been giving us loads of content in the form of research articles, studies, etc. as Word documents. They update these a few times per month.

Does anyone recommend linking to the Word documents themselves, or converting them to PDF files? Or even converting them to HTML? Which format is better for SEs to spider?

thanks,

coopster

9:22 pm on Nov 30, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Welcome to WebmasterWorld, logoex.

I'll throw my $0.02 worth in ;)

Outside of SEs, you must also consider your target audience once they have discovered the resource. Are they going to have Microsoft Word or an MS Word reader program available? Many folks out there do not. Most will have a PDF reader, but all of them will obviously have an HTML reader (web browser). Convert to HTML, the language of the web, and you'll cover all bets.

logoex

9:58 pm on Nov 30, 2006 (gmt 0)

10+ Year Member



Thanks for the input. I think HTML would be best, but for the longer articles (I just received one that was 55 pages) I should stick with PDF, so users can print more easily if need be. Good point about the drawbacks of using a file format such as Word.

phranque

10:14 am on Dec 1, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you should also consider whether you want to have the source of your documents available.
word docs can be edited, but not pdfs.
SE's won't be able to do much with either.
i have worked on a site which provides links to all 3 formats of equivalent documents.
if you have extremely large html documents you might consider providing a package in zipped format.

DataG

12:24 pm on Dec 1, 2006 (gmt 0)

10+ Year Member



If you want to protect your client's intellectual property you're better off creating PDF Documents - it's more difficult to steal their content when you encrypt them.

PDF is also the format of choice when it comes to creating eBooks on the Internet - hardly anybody is creating .exe eBooks any more, they are all in PDF format so you don't need to worry about people not knowing how to read a PDF.

Also, Google can index the contents of PDF Documents.

Sean K.

kaled

12:48 pm on Dec 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The simplest solution is to convert to .rtf (rich text format).

All formatting will remain (almost) entirely unchanged. This is a smaller file format than .doc and is very widely supported - I think even Macs and Linux recognise it these days (but you should check that).

Kaled.

kostis

1:54 pm on Dec 1, 2006 (gmt 0)

10+ Year Member



Hi

I have a similar concern with some large tutorials I have produced, which are full of pictures and tables.

I initially used PDF, but then I realised that this format may be very usable, but it is not search engine optimised.

Only the 1st page of the PDF file is indexed.

Thus, if you want this content to be fully indexed it has to be in HTML.

K

encyclo

2:15 pm on Dec 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would never use the Microsoft Word format for a public document. There are several serious issues when using .doc files:

- the format is a closed, binary one whose longevity is not assured. Over time, changes in the Word format mean that older documents don't display correctly in newer versions.

- not all users have Microsoft Word (or the version of Word that you are using), and alternative programs may not be able to read the file

- with Word you can usually backtrack through the corrections and changes made to a document, so the end user has the final version plus an editing history. This can range from awkward to embarrassing, depending on the content!

You should choose the most appropriate open format that suits your needs. The three best options (as suggested by others above) are PDF, HTML or RTF. Each has its advantages and disadvantages.

PDF is best for large documents that you want your users to download but not edit, and it preserves the fonts, formatting and images across all platforms. HTML is easiest for compatibility, but is best for online viewing. RTF is great if you want the user to be able to download then copy or edit the contents.

Receptional

2:41 pm on Dec 1, 2006 (gmt 0)



So - my bit...

Have you noticed recently how PDF files are taking longer to show onscreen? The latest PDF reader seems to insist on the whole file being loaded before displaying anything of use to the reader.

This goes against the whole purpose of the PDF format imho.

So - I'd still go with PDF, but ask (or force) your readers to right click and save to their hard drive.

Adobe really should remember their audience.

DataG

2:51 pm on Dec 1, 2006 (gmt 0)

10+ Year Member



If you want your browser to load each page as needed, the PDF must be linearized.

It's a shame the 'Fast Web View' / Linearized option is not always switched on by default in all PDF Converters.

In Acrobat Reader you can check if your PDF is linearized or not by opening the PDF, then from the File Menu select "Document Properties" and it tells you in there.
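
For checking a batch of files without opening each one in a reader, here is a minimal Python sketch. It relies on the fact that a linearized PDF starts with a small dictionary containing a /Linearized key within the first 1024 bytes of the file. The function names are made up for this example, and the check is only a heuristic: a file that was edited after being linearized may still carry the key.

```python
import re

# A linearized PDF's first object holds a dictionary with a /Linearized
# entry, and that dictionary sits within the first 1024 bytes of the file.
_LINEARIZED_KEY = re.compile(rb"/Linearized\s*[\d.]")


def looks_linearized(head: bytes) -> bool:
    """Heuristic check on the first bytes of a file (hypothetical helper)."""
    head = head[:1024]
    return b"%PDF-" in head and _LINEARIZED_KEY.search(head) is not None


def file_looks_linearized(path: str) -> bool:
    """Read just the start of the file at `path` and apply the check."""
    with open(path, "rb") as fh:
        return looks_linearized(fh.read(1024))
```

Calling file_looks_linearized("article.pdf") (the filename is a placeholder) returns True when the marker is present and False otherwise.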

justablink

3:08 pm on Dec 1, 2006 (gmt 0)

10+ Year Member



Do what is best for your audience/visitors. If the audience reads the articles via their browser, go HTML. If they typically download articles to read later, go PDF. IMHO, I would go HTML, and give the option to the reader to download a PDF.

Let the reader know in advance that it is a PDF file - too many people put a link up for a pdf file without specifying it is a PDF - this annoys most people.
--justablink
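
One way to make sure every PDF link is labelled in advance is to generate the link markup instead of typing it by hand. The Python sketch below is only an illustration: the function names, the label wording and the example path are assumptions, not anything prescribed in this thread.

```python
from html import escape


def human_size(num_bytes: int) -> str:
    """Format a byte count as a short, readable size."""
    if num_bytes < 1024:
        return f"{num_bytes} bytes"
    size = float(num_bytes)
    unit = "bytes"
    for unit in ("KB", "MB", "GB"):
        size /= 1024
        if size < 1024:
            break
    return f"{size:.1f} {unit}"


def pdf_link(href: str, title: str, num_bytes: int) -> str:
    """Return an <a> element whose visible text says it is a PDF and how big."""
    label = f"{title} (PDF, {human_size(num_bytes)})"
    return (
        f'<a href="{escape(href, quote=True)}" type="application/pdf">'
        f"{escape(label)}</a>"
    )
```

For example, pdf_link("/docs/study.pdf", "Knee study", 2621440) produces a link whose text reads "Knee study (PDF, 2.5 MB)".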

encyclo

3:13 pm on Dec 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you want your browser to load each page as needed, the PDF must be linearized.

Thanks for that info DataG, I didn't know that! This explains a few things, I'm off now to go and optimize a bunch of PDFs... ;)

Receptional

3:15 pm on Dec 1, 2006 (gmt 0)



If you want your browser to load each page as needed, the PDF must be linearized.

Not sure what that means, but I'll go and see if I can fix my problem. What if the original file was PDF'd in the year 2000 and is password protected - can I still "linearize" it?

mattur

3:20 pm on Dec 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think html would be best, but for the longer articles (just received one that was 55 pages) I should stick with pdf, and users can print easier if need be.

One way of handling this is to write a brief overview/excerpt/table of contents/etc. of the PDF contents and publish this as an HTML page, with the PDF linked on it (i.e. one HTML page for each PDF).

This way you get a nice search engine-optimised HTML page, with the PDF available for people who want to read more - and you don't spend all your time converting massive PDFs into HTML pages ;)
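
The one-HTML-page-per-PDF idea could be scripted along these lines. This is a rough Python sketch under assumptions: the page layout, the function name and the example values are invented here, and a real site would wrap the result in its own template.

```python
from html import escape


def summary_page(title: str, summary: str, pdf_href: str) -> str:
    """Build a small standalone HTML page that summarises one PDF and links to it."""
    return "\n".join([
        "<!DOCTYPE html>",
        "<html>",
        f"<head><title>{escape(title)}</title></head>",
        "<body>",
        f"<h1>{escape(title)}</h1>",
        f"<p>{escape(summary)}</p>",
        # Tell the reader up front that the full article is a PDF.
        f'<p><a href="{escape(pdf_href, quote=True)}">Download the full article (PDF)</a></p>',
        "</body>",
        "</html>",
    ])
```

Each article then needs only a title, a short overview and the path to its PDF, which is far less work than converting a 55-page document.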

logoex

5:24 am on Dec 7, 2006 (gmt 0)

10+ Year Member



Thanks for all the advice... I appreciate the good points made on this topic

vincevincevince

5:39 am on Dec 7, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I will share my method here:-

I keep documents as HTML (no menus or logos) and then modify them automatically as follows:

a) For HTML - adding menus, logos etc. on-the-fly
b) For PDF - change certain HTML to LaTeX codes and strip other tags, then pdflatex it (install tetex-latex on *nix)
c) For email - wordwrap to a width of 72, use custom replacements for headers (e.g. H2 -> upper case with a row of stars below, H3 -> upper case only etc.), strip out all remaining HTML
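
As an illustration of step c), here is a minimal Python sketch of the email conversion. It is not the poster's actual script: it assumes the source HTML uses only h2, h3 and p blocks, ignores everything else, and the function name is made up.

```python
import re
import textwrap
from html import unescape

_TAG = re.compile(r"<[^>]+>")
# Only these three block types are handled in this sketch.
_BLOCK = re.compile(r"<(h2|h3|p)\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def html_to_email_text(html: str, width: int = 72) -> str:
    """Turn simple HTML into plain text wrapped at `width` columns."""
    blocks = []
    for match in _BLOCK.finditer(html):
        tag = match.group(1).lower()
        # Strip remaining inline tags, decode entities, collapse whitespace.
        text = " ".join(unescape(_TAG.sub("", match.group(2))).split())
        if not text:
            continue
        if tag == "h2":
            text = text.upper()
            blocks.append(f"{text}\n{'*' * len(text)}")  # row of stars below
        elif tag == "h3":
            blocks.append(text.upper())  # upper case only
        else:
            blocks.append(textwrap.fill(text, width=width))
    return "\n\n".join(blocks) + "\n"
```

Headings are kept out of the wrapping step so the row of stars always matches the heading length.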

DataG

8:54 am on Dec 7, 2006 (gmt 0)

10+ Year Member



To linearize your PDF, open it in Acrobat Professional and select Advanced -> PDF Optimizer from the menu.

In the dialog box that is displayed, click on "Clean Up" and tick the "Optimize the PDF for Fast Web View" checkbox.

Basically, what linearization does is put the internal PDF structure in linear order, so the browser can just load the PDF data in sequence without having to parse the entire PDF structure first.

Sean K.
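
For anyone without Acrobat Professional, the open-source qpdf command-line tool has a --linearize option that writes a linearized copy of a file. The Python sketch below simply wraps that command. It assumes qpdf is installed and on the PATH; the function names and file names are placeholders.

```python
import shutil
import subprocess
from pathlib import Path


def linearize_command(src: Path, dst: Path) -> list[str]:
    """Build the qpdf command line that writes a linearized copy of src to dst."""
    return ["qpdf", "--linearize", str(src), str(dst)]


def linearize(src: Path, dst: Path) -> None:
    """Run qpdf; raises if qpdf is missing or exits with a non-zero status."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf is not installed or not on PATH")
    subprocess.run(linearize_command(src, dst), check=True)
```

Running linearize(Path("in.pdf"), Path("out.pdf")) leaves the original untouched and writes the web-optimised copy alongside it.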