System: The following message was spliced on to this thread from:
http://www.webmasterworld.com/search_engine_spiders/4236027.htm [webmasterworld.com] by incredibill - 7:46 am on Nov 29, 2010
(PST -8)
This posting is in part the result of a report at nodpi.org and partly a follow-on from a discussion in the WebmasterWorld UK & Ireland Search Engines forum at [webmasterworld.com...]
TalkTalk, who have many aliases (significant one here is Opal), employ the spider to check whether a web page has a virus or not. This would be OK if it were checking on the client machine but it's instead intercepting the communication using Deep Packet Inspection, which is illegal in many parts of the world including Europe (UK is currently being sued by EU Justice over the BT Phorm trials and other privacy issues).
In TalkTalk's Q&A:
"7. Will only customers who sign up to Network Security have the websites they visit scanned?"
"We are scanning all the websites our customer base as a whole visits, in complete anonymity. You have to opt-into the Virus Alerts product itself, so if you don't want the warnings while you browse you don't have to enable the service, or if you activate Virus Alerts, you can switch it off again at any time afterwards."
What this means in practice is that TalkTalk visits EVERY web page any of their users visit, regardless of whether the facility is opted into or not (but see below). I do not know if robots.txt is obeyed (see below).
I have just conducted an experiment with my brother, who has a TalkTalk account. SOME pages he visited were also visited by the bot about 30 seconds later (what good is that to someone visiting a new site?!) but not ALL pages. Javascript/CSS/images were not fetched by the bot, but this may be due to its receiving a 403.
The good news: it checked robots.txt first. What I don't know is whether it obeyed it, since I do not have specific disallows (or allows). The observed sequence of requests was:
robots.txt
home page (first one brother visited)
robots.txt
2nd visited page
4th visited page
robots.txt
8th visited page
10th page visited
robots.txt
11th page visited
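For anyone wanting to repeat this experiment against their own logs, the pairing of visitor hit and bot hit could be checked with something like the Python sketch below. It is only an illustration under stated assumptions: the hits are assumed to be already parsed into (timestamp, ip, path) tuples, the function name find_followups and the 60-second window are hypothetical choices, and the network is simply the Opal range quoted later in this post expressed as a CIDR block.

```python
from datetime import datetime, timedelta
from ipaddress import ip_address, ip_network

# Opal range quoted in this post (62.24.128.0 - 62.24.255.255) as a CIDR block.
BOT_NETWORK = ip_network("62.24.128.0/17")


def find_followups(hits, visitor_ip, window_seconds=60):
    """Return (path, delay_in_seconds) for bot hits that follow a visitor hit.

    hits: iterable of (timestamp, ip, path) tuples, in any order.
    A bot hit counts as a follow-up when it comes from BOT_NETWORK, asks for
    the same path as the visitor, and arrives within window_seconds afterwards.
    """
    window = timedelta(seconds=window_seconds)
    visitor = ip_address(visitor_ip)
    ordered = sorted(hits, key=lambda hit: hit[0])
    followups = []
    for i, (seen_at, ip, path) in enumerate(ordered):
        if ip_address(ip) != visitor:
            continue
        # Only look at later hits that are still inside the time window.
        for bot_at, bot_ip, bot_path in ordered[i + 1:]:
            if bot_at - seen_at > window:
                break
            bot_addr = ip_address(bot_ip)
            if bot_path == path and bot_addr != visitor and bot_addr in BOT_NETWORK:
                followups.append((path, (bot_at - seen_at).total_seconds()))
                break
    return followups
```

Pages that never appear in the result are the ones the bot skipped (or fetched outside the window), which is the "SOME but not ALL" pattern described above.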
As far as I'm aware there is no restriction on visiting webmail sites etc., although, since the bot is purported to be based on Phorm, it may or may not be able to read SSL (although it may know at least the original URL).
The bot was trialled in June this year - again illegally and without their customers being aware of it. It was at that time that I discovered the bot for myself, without knowing its function, and blocked it (see below). It appears the system went live on 26th Nov.
I have seen reports that the bot was engineered in China from the original Phorm DPI bot and that pages (or at least results) are forwarded to China. TalkTalk claim that no personal information is retained. See forum on nodpi.org for further details.
There is no immediate indication that adverts or other "personalised" information is to be targeted at Opal customers, but with DPI it's always possible.
===========
(Adapted from my posting on the previously identified WebmasterWorld forum)
At the start of this month (from 2nd Nov, but it may have begun any time before Nov) all hits were from ChinaCache USA, but from 26th Nov they all seem to be from Opal, on an IP range I partially blocked back in June when I had around 300 hits in 15 days on two IPs and another couple of thousand on a few others. Knowing what I do now, at least some of that was obviously the experimental period of the bot around June/July.
ChinaCache North America: 69.28.48.0 - 69.28.63.255
Owned by Beijing Blue I. T. Technologies Co Ltd
Admin in US by Citynet.
Opal Telecommunications full range: 62.24.128.0 - 62.24.255.255
(bots appearing with about half-dozen IPs in groups of 2 to 5)
Within that range I've also had multiple bot-like hits from web marketing company Global Media Applications Ltd (G-MAPPS-UK) on 62.24.226.0/23, as well as many other "bots" going back to before April across several sub-ranges. The range is now completely blocked.
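The two start-end ranges quoted above can be turned into CIDR blocks and tested against an incoming address with Python's standard ipaddress module. This is a minimal sketch, not a description of how the block is actually implemented on my server; the names QUOTED_RANGES, to_cidr and is_blocked are made up for the example.

```python
from ipaddress import ip_address, summarize_address_range

# Start/end pairs exactly as quoted in the post.
QUOTED_RANGES = {
    "ChinaCache North America": ("69.28.48.0", "69.28.63.255"),
    "Opal Telecommunications": ("62.24.128.0", "62.24.255.255"),
}


def to_cidr(start, end):
    """Collapse an inclusive start-end range into the fewest CIDR blocks."""
    return list(summarize_address_range(ip_address(start), ip_address(end)))


# Flat list of networks covering every quoted range.
BLOCKED = [
    net
    for start, end in QUOTED_RANGES.values()
    for net in to_cidr(start, end)
]


def is_blocked(ip):
    """True when ip falls inside any of the quoted ranges."""
    addr = ip_address(ip)
    return any(addr in net for net in BLOCKED)
```

The ranges collapse to 69.28.48.0/20 and 62.24.128.0/17, so an address such as 62.24.226.5 (inside the G-MAPPS-UK sub-range) is caught by the wider Opal block.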
I am now seeing hits from ChinaCache with the UA:
page_test_larbin2.6.3@unspecified.mail
Obviously these are being rejected with 403s.
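How the 403 gets issued depends on the server setup, which isn't described here. Purely as an illustration, the same rule could be written as a small Python WSGI wrapper that matches on a fragment of the UA string quoted above. The names BLOCKED_UA_FRAGMENTS, is_blocked_agent and block_agents are hypothetical, and matching on "page_test_larbin" rather than the full string is an assumption made so that other version numbers are caught too.

```python
# Lower-case fragments of User-Agent strings to refuse (assumed for the example).
BLOCKED_UA_FRAGMENTS = ("page_test_larbin",)


def is_blocked_agent(user_agent):
    """True when the User-Agent contains any blocked fragment (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(fragment in ua for fragment in BLOCKED_UA_FRAGMENTS)


def block_agents(app):
    """WSGI middleware: answer 403 for blocked agents, otherwise pass through."""
    def wrapper(environ, start_response):
        if is_blocked_agent(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return wrapper
```

A UA string is trivially changed by whoever runs the bot, so a check like this only makes sense alongside the IP range block rather than instead of it.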
Does anyone else have information on this Opal bot, such as: is it worth putting a block in robots.txt (and if so WHAT?); and has anyone with a TalkTalk/etc account been blocked a) because of a web page virus or b) because a 403 was issued by the site?