
I’ve been thinking about this for a long time: how do you index the content of so many sites and figure out which part of the page source is the one people are actually interested in? I’ve met this challenge a few times already: once in the old, now-dead Octora search engine (luckily it was an RSS search engine), and again for Pingoat, while experimenting with a new feature two years ago.

Well, if this paper (PDF), which approaches the problem using substring amplification, had been available at that time, things would have been a bit easier. The study is by Daisuke Ikeda, Yasuhiro Yamada, and Sachio Hirokawa, and is titled “Formulation of Template Discovery Problem and an Algorithm using Substring Amplification”.

In simple terms, the paper explains how a template pattern can be discovered with a linear-time algorithm.

I’m mentioning this paper because I’m sure there are others out there trying to figure out how to extract the content from a web page by removing the elements that are specific to the page’s template. The paper also contains an example of how the algorithm can be applied to a set of pages sharing the same layout.
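
To make the idea of "removing the template" concrete, here is a minimal sketch. It is not the paper’s substring amplification algorithm; it swaps in a much cruder line-frequency heuristic, where any line that appears on every page of a set is assumed to be template. The function name `strip_template_lines`, the `threshold` parameter, and the sample pages are all made up for illustration.

```python
from collections import Counter


def strip_template_lines(pages, threshold=1.0):
    """Remove lines that occur on at least `threshold` (a fraction) of the pages.

    Crude stand-in for template discovery: lines shared by (almost) every
    page are assumed to be layout, everything else is assumed to be content.
    """
    line_counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        distinct = {line.strip() for line in page.splitlines() if line.strip()}
        line_counts.update(distinct)

    min_pages = threshold * len(pages)
    template = {line for line, count in line_counts.items() if count >= min_pages}

    cleaned = []
    for page in pages:
        kept = [
            line.strip()
            for line in page.splitlines()
            if line.strip() and line.strip() not in template
        ]
        cleaned.append("\n".join(kept))
    return cleaned


# Two hypothetical pages sharing the same layout.
pages = [
    "<div id='nav'>Home</div>\n<p>Post about sitemaps</p>\n<div id='footer'>(c) blog</div>",
    "<div id='nav'>Home</div>\n<p>Post about MySQL indexes</p>\n<div id='footer'>(c) blog</div>",
]

for content in strip_template_lines(pages):
    print(content)
```

A heuristic like this breaks as soon as the template is not split into identical lines, which is exactly the kind of case where a substring-based approach such as the one in the paper should do better.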


Loopthing is now a big site: there are over 150,000 pages waiting to be indexed. To get them indexed as fast as possible, a sitemap of sitemaps (a sitemap index) has been referenced from the service’s robots.txt file.
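
A sitemap-of-sitemaps setup can be sketched roughly as follows. This is only an illustration, not Loopthing’s actual code: the `example.com` URLs, the file names, and the helper names are assumptions. What does come from the sitemaps.org protocol is the XML namespace, the `sitemapindex`/`sitemap`/`loc` elements, the limit of 50,000 URLs per sitemap file, and the `Sitemap:` line in robots.txt.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50000  # per-file limit from the sitemaps.org protocol


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Split the full URL list into sitemap-sized pieces."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap(urls):
    """Render one <urlset> sitemap file for a chunk of page URLs."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
    )


def build_sitemap_index(sitemap_urls):
    """Render the <sitemapindex> file that points at every sitemap file."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}</sitemapindex>\n'
    )


# Hypothetical site with 150,000 pages.
base = "https://www.example.com"
page_urls = [f"{base}/page/{n}" for n in range(150000)]

chunks = chunk(page_urls)
sitemap_files = [build_sitemap(c) for c in chunks]
sitemap_urls = [f"{base}/sitemap-{i + 1}.xml" for i in range(len(chunks))]
index_xml = build_sitemap_index(sitemap_urls)

# The single line that goes into robots.txt to advertise the index.
robots_line = f"Sitemap: {base}/sitemap-index.xml"

print(len(chunks), "sitemap files")
print(robots_line)
```

With 150,000 pages and 50,000 URLs per file, this yields three sitemap files plus one index, and the crawler only needs the one `Sitemap:` line to find all of them.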

I’m going to gather feedback on how Google and Yahoo! manage to process all these sitemap files, and maybe compile a technical report early next year.

The whole process is also monitored through Google’s Webmaster Tools service and Yahoo!’s Site Explorer.

Indexing a whole site by jumping from link to link is very difficult, even for today’s high-performance spiders. The sitemaps.org website was put together with help from the major search engine players (Google, Microsoft and Yahoo!) to address this issue.

The end result is that webmasters can tell search engines where their pages are located, which makes indexing a much faster process.

Some reading on sitemaps:

  • sitemaps.org – the sitemaps standards home page (ironically, it doesn’t have a sitemap)
  • sitemaps on wikipedia (syndicated information from everywhere, condensed in a single page)

I learned my lesson today while working on a major MySQL-based feature for Loopthing. Everything was planned and coded carefully, yet one issue slipped through into production, where things obviously didn’t go right.

The end result was a slow or non-responsive application. I never thought it could be the database system until I checked the system’s usage statistics and realized that mysqld was using over 190% of the CPU.

This seemed crazy and inexplicable, but what I’m about to reveal has less to do with database optimization practices than with design practices. Here’s what happened:

I am a true believer that premature optimization is the root of all evil (Donald Knuth), having learned it the hard way. The problem this time was the opposite: not optimizing enough. What I did was put together my table structure, set up the main indexes, and nothing more.

I built my queries around this basic structure and tested them on small amounts of data, but then forgot to fine-tune things and check where additional indexes were needed. This step was completely forgotten in development, and one day everything was migrated into production.

When production activity started growing, things changed, and the ugly face of poorly optimized queries and missing indexes was revealed to us. Geesh.

Now, if your CPU is cycling like crazy, it’s time to have another look at your database structure; maybe you’ve “missed” some very important things you always knew you had to be careful not to miss.
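
Here is a small sketch of that "have another look" step: checking whether a query uses an index before and after adding one. To keep it runnable anywhere it uses Python’s built-in `sqlite3` module as a stand-in for MySQL, and the `messages` table, its columns, and the index name are invented for the example. On MySQL the same check would be done with `EXPLAIN SELECT ...` and `CREATE INDEX ... ON ...`, though the plan output looks different.

```python
import sqlite3

# In-memory SQLite database standing in for the real MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO messages (user_id, body) VALUES (?, ?)",
    [(i % 100, f"message {i}") for i in range(1000)],
)

# A typical lookup that filters on a column other than the primary key.
QUERY = "SELECT * FROM messages WHERE user_id = ?"


def query_plan(connection, sql, params):
    """Return the planner's description of how it would run the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


# Without an index on user_id the planner has to scan the whole table.
plan_before = query_plan(conn, QUERY, (42,))

# Add the index that was forgotten, then ask for the plan again.
conn.execute("CREATE INDEX idx_messages_user_id ON messages (user_id)")
plan_after = query_plan(conn, QUERY, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

On a table this small the difference is invisible, which is exactly why the problem only shows up once production data starts growing. Running a check like this against every frequent query before going live is cheap.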

On the other hand, there’s an upside to today’s issue: it saved me useless hours of premature optimization and benchmarking, LOL :D


I was just wondering the other day how long it takes to get a blog post indexed by Google. The answer is somewhere between 5 and 8 minutes. I think it can be lower than this, but I wasn’t able to confirm that during my test.

Keep in mind, though, that how quickly your blog gets indexed may depend on a few things: how often you post, the presence of RSS feeds, whether you’re using Feedburner (which belongs to Google), and possibly the blogging platform (e.g. blogger.com/blogspot posts may be indexed faster than WordPress ones, since blogger.com/blogspot belongs to Google).

On the other hand, from my personal experience, I don’t think the number of backlinks influences how often Google’s indexing spider visits your blog.
