At first I thought this was the work of bing.com haters, but it seems the HTTP referrer spam promoting Bing comes from IPs belonging to Microsoft itself. The Apache logs don’t lie:
65.55.104.61 – Referrer: http://www.bing.com/search?q=month (using IE 6)
65.55.104.64 – Referrer: http://www.bing.com/search?q=month (IE6)
… and so on, 30 more times over a single weekend
Now try a whois lookup on these IP addresses, like so.
I noticed this two weeks after Bing launched, and I’m not sure how legitimate this is on Microsoft’s part.
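For anyone who wants to check their own logs, here is a minimal sketch of how such hits could be counted. It assumes the Apache combined log format; the `count_referrer_hits` helper and the sample lines (timestamps, paths, sizes and the 192.0.2.10 visitor) are made up for illustration, and only the two Microsoft IPs and the referrer URL come from the entries above.

```python
import re
from collections import Counter

# Combined Log Format:
# ip ident user [time] "request" status size "referrer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_referrer_hits(lines, needle="bing.com/search"):
    """Count requests per client IP whose referrer contains `needle`."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and needle in match.group("referrer"):
            hits[match.group("ip")] += 1
    return hits


# Illustrative log lines only: the timestamps and sizes are placeholders.
SAMPLE = [
    '65.55.104.61 - - [01/Jan/2000:00:00:00 +0000] "GET / HTTP/1.1" 200 1024 '
    '"http://www.bing.com/search?q=month" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '65.55.104.64 - - [01/Jan/2000:00:05:00 +0000] "GET / HTTP/1.1" 200 1024 '
    '"http://www.bing.com/search?q=month" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '65.55.104.61 - - [01/Jan/2000:00:10:00 +0000] "GET / HTTP/1.1" 200 1024 '
    '"http://www.bing.com/search?q=month" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '192.0.2.10 - - [01/Jan/2000:00:15:00 +0000] "GET /about HTTP/1.1" 200 2048 '
    '"-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for ip, count in count_referrer_hits(SAMPLE).most_common():
        print(ip, count)
```

Pointing the function at a real file (for example `count_referrer_hits(open("access.log"))`, with whatever path your server uses) gives a list of IPs worth feeding to whois.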
65.55.104.61, bing, bing.com, http referrer spam, microsoft, search engines, spam
I’ve been thinking about this for a long time: how do you index the content of so many sites and figure out which part of the page source is the one people are actually interested in? I’ve run into this challenge a few times already, once in the old, now-dead Octora search engine (luckily it was an RSS search engine), and again for Pingoat, while experimenting with a new feature two years ago.
Well, if this paper (PDF), which approaches the problem through substring amplification, had been available at that time, things would have been a bit easier. The study is by Daisuke Ikeda, Yasuhiro Yamada, and Sachio Hirokawa, and is titled “Formulation of Template Discovery Problem and an Algorithm using Substring Amplification”.
In simple terms, the paper explains how a template pattern can be discovered with a linear-time algorithm.
I’m mentioning this paper because I’m sure there are others out there trying to figure out how to extract the content from a web page by removing the elements that are specific to the page’s template. It also contains an example of how the algorithm can be applied to a set of pages sharing the same layout.
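To make the idea concrete, here is a rough sketch of the general problem, not of the paper’s algorithm. It swaps in a much cruder heuristic: any line that shows up in every page of a set is treated as template and dropped. The `strip_shared_lines` function and the sample pages are hypothetical, and real pages would need something smarter than line-by-line comparison.

```python
from collections import Counter


def strip_shared_lines(pages, threshold=1.0):
    """Drop lines that occur in at least `threshold` (a fraction) of the pages.

    This is a naive line-frequency heuristic, NOT substring amplification.
    """
    pages_containing = Counter()
    for page in pages:
        # Use a set so each distinct line is counted once per page.
        pages_containing.update(
            {line.strip() for line in page.splitlines() if line.strip()}
        )

    cutoff = threshold * len(pages)
    stripped = []
    for page in pages:
        kept = [
            line.strip()
            for line in page.splitlines()
            if line.strip() and pages_containing[line.strip()] < cutoff
        ]
        stripped.append("\n".join(kept))
    return stripped


# Two made-up pages sharing the same layout.
PAGES = [
    "<html>\n<div id='nav'>Home | About</div>\n<p>First post body</p>\n</html>",
    "<html>\n<div id='nav'>Home | About</div>\n<p>Second post body</p>\n</html>",
]

if __name__ == "__main__":
    for content in strip_shared_lines(PAGES):
        print(content)
```

With the default threshold only lines present in every page are removed, so the navigation and the `<html>` wrapper disappear and the per-page paragraphs stay.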
algorithms, Daisuke Ikeda, indexing, programming, Sachio Hirokawa, search engines, string processing, substring amplification, template discovery problem, template extraction, Yasuhiro Yamada
Loopthing is now a big site, with over 150,000 pages waiting to be indexed. To get them indexed as fast as possible, a sitemap of sitemaps (a sitemap index) has been referenced from the service’s robots.txt file.
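A minimal sketch of what such a setup could look like, assuming the standard sitemaps.org index format and the `Sitemap:` line in robots.txt. The example.com URLs and the function names are placeholders, not Loopthing’s actual files. Since the protocol caps a single sitemap at 50,000 URLs, 150,000 pages need at least three sitemap files plus the index.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemap_urls):
    """Return a sitemap index document (as a string) listing the given sitemaps."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(root, encoding="unicode")


def robots_txt_line(index_url):
    """Return the robots.txt line that points crawlers at the index."""
    return f"Sitemap: {index_url}"


# Placeholder URLs: three sitemap files to cover roughly 150,000 pages.
SITEMAPS = [f"https://example.com/sitemap-{n}.xml" for n in range(1, 4)]

if __name__ == "__main__":
    print(build_sitemap_index(SITEMAPS))
    print(robots_txt_line("https://example.com/sitemap-index.xml"))
```

The index itself lists no pages; it only points to the individual sitemap files, which keeps robots.txt down to a single line.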
I’m going to gather feedback on how Google and Yahoo! manage to process all these feeds, and maybe compile a technical report early next year.
The whole process is also being monitored through Google’s Webmaster Tools service and Yahoo’s Site Explorer.
Indexing a whole site by jumping from link to link is very difficult, even for today’s high-performance spiders. The sitemaps.org website was put together with help from the major search engine players (Google, Microsoft and Yahoo!) to address this issue.
Sitemaps let webmasters tell search engines where their pages are located. The end result is that indexing becomes a much faster process.
Some reading on sitemaps:
- sitemaps.org – the sitemaps standards home page (ironically, it doesn’t have a sitemap)
- sitemaps on Wikipedia (syndicated information from everywhere, condensed into a single page)
bing, google, indexing, microsoft, search engines, SEO, sitemaps, yahoo