Forum Moderators: not2easy


PDF and SEO

         

toplisek

10:01 am on Mar 16, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you place a clickable link in a PDF, for example https://www.example.com, it is known that search engines will read it and it counts for SEO. But what happens if I place https://www.example.com as plain text, without a link? Is the text detected as a link, or am I missing something? We all know different techniques, but is there an official explanation of how a plain-text URL inside a PDF is treated?




[edited by: not2easy at 1:27 pm (utc) on Mar 16, 2021]
[edit reason] please use example.com for domains [/edit]

not2easy

1:51 pm on Mar 16, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Google can read and index .pdf files the same as it reads .html pages so yes, they see the text within a .pdf file the same as if it were text on a page. A URL is understood as a file location address with or without the surrounding anchor hyperlink coding.

I am not sure I understand how robots reading text validates SEO in any way, but I may be misunderstanding that part, sorry.

lammert

5:35 pm on Mar 16, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



PDF files are more difficult for search engines to parse than HTML files. This is because characters are not neatly grouped into words separated by spaces, periods, etc., and there is no proper hierarchy of items as in HTML or XML. Therefore, search engines often need OCR techniques to figure out what the content is and how the content parts relate to each other. They may therefore not recognize a URL properly if it is only displayed as text.

To increase the chance a URL is properly recognized inside a PDF file, always encode it as a clickable link.
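
As a rough illustration of why a plain-text URL is less reliable than a real link, here is a minimal Python sketch. It assumes the PDF text has already been extracted into a string; the regex and the `find_urls` helper are hypothetical and are not how any search engine actually works.

```python
import re

# Hypothetical pattern: an http(s) URL runs until the next whitespace.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")

def find_urls(text: str) -> list[str]:
    """Return URL-like strings found in plain text, minus trailing punctuation."""
    return [match.rstrip(".,;:)") for match in URL_PATTERN.findall(text)]

# Text extracted cleanly: the URL survives as one token.
clean = "See https://www.example.com/pdf/filename.pdf for details."
# Text extracted badly: a line break splits the URL in two.
broken = "See https://www.example.com/pdf/\nfilename.pdf for details."

print(find_urls(clean))
print(find_urls(broken))
```

In the second case only the part before the line break is picked up, which is the kind of failure a clickable link annotation avoids.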

not2easy

6:11 pm on Mar 16, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Thank you lammert. Search engines other than Google likely do not have the capability to parse .pdf files, but Google expressly states that they can and do read and index .pdf files. My old link to that is dead and wants me to re-search, but they now include it here: [support.google.com...]

I see that is a separate (although related) page from where I landed on my re-search. The information about how their bots treat various kinds of files/scripts/resources is here: [developers.google.com...]

toplisek

7:44 pm on Mar 17, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thank you for the message. While investigating PDFs, I found that some white papers are not meant to increase indexing; their content is published without SEO in mind. So their authors prefer not to place any links and mask the URLs instead.

not2easy

8:04 pm on Mar 17, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



You can allow crawling while disallowing indexing if that is the goal. I am not sure what the goal is so I'm not making suggestions here.

Do you want the pdf to be visited and indexed OR visited and not indexed OR not visited and not read by (compliant) bots? The mention of SEO made me think you would like to have the pdf file read and indexed but now it seems I have guessed wrong.

If the idea is to avoid indexing, the simple thing would be to disallow crawling of *.pdf files or (if the .pdf file is in a separate directory such as https://www.example.com/pdf/filename.pdf) you could allow crawling or disallow crawling and use X-Robots for the /pdf/ directory to set your preferences for all files in that directory.



There's some older discussion about X-Robots here if it is unfamiliar: [webmasterworld.com...]
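
For anyone unfamiliar with the header, a minimal Python sketch of the idea follows. It assumes all PDFs live under /pdf/ and none of them should be indexed; `extra_headers` is a hypothetical helper, and on a real site this is normally set in the web server configuration rather than in application code.

```python
def extra_headers(path: str) -> dict[str, str]:
    """Return extra response headers for a request path (hypothetical helper)."""
    # Assumption: every PDF under /pdf/ should stay out of the index.
    if path.startswith("/pdf/") and path.lower().endswith(".pdf"):
        return {"X-Robots-Tag": "noindex"}
    return {}

print(extra_headers("/pdf/filename.pdf"))
print(extra_headers("/index.html"))
```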

phranque

8:53 pm on Mar 17, 2021 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



...disallow crawling and use X-Robots for the /pdf/ directory to set your preferences for all files in that directory.

if you disallow crawling the bot won't make the request and will never see the X-Robots-Tag HTTP Response header.
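
A small sketch of that point, using Python's standard `urllib.robotparser`. The robots.txt content and the "ExampleBot" user agent are made up for illustration; a compliant bot performs a check like this before it requests anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the /pdf/ directory for all bots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdf/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://www.example.com/pdf/filename.pdf"
if parser.can_fetch("ExampleBot", url):
    print("fetch allowed: response headers such as X-Robots-Tag would be seen")
else:
    print("fetch disallowed: no request is made, so no headers are ever seen")
```

So a noindex header only has an effect when the same URL is left crawlable.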

toplisek

5:21 pm on Mar 27, 2021 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sorry for the confusion. The PDF file is outside our authority. When we were preparing to publish content, I checked PDFs and thought about links. If a URL appears without an actual link, can search engines still see it? Images cannot be read. Your suggestions also gave me another point: if you disallow crawling, the bot won't make the request and will never see the X-Robots-Tag HTTP response header.

I cannot force search engines to see such a PDF, as it is outside our control.