I’ve been thinking about this for a long time: how would you index the content of so many sites and figure out which part of the page source is actually the part people are interested in? I’ve met this challenge a few times already, once with the old, now-dead Octora search engine (luckily it was an RSS search engine), and again with Pingoat, while experimenting with a new feature two years ago.
Well, if this paper (PDF), which approaches the problem using substring amplification, had been available at that time, things would have been a bit easier. The study is by Daisuke Ikeda, Yasuhiro Yamada, and Sachio Hirokawa, and is titled Formulation of Template Discovery Problem and an Algorithm using Substring Amplification.
In simple words, the paper explains how a template pattern can be discovered with a linear-time algorithm.
I’m mentioning this paper because I’m sure there are others out there trying to figure out how to extract the content from a web page by removing the elements that are specific to the page’s template. It also contains an example of how the algorithm can be applied to a set of pages sharing the same layout.
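To give a rough feel for the general idea of stripping a shared template, here is a tiny Python sketch. To be clear, this is not the substring amplification algorithm from the paper. It is a much cruder line-level stand-in I am assuming for illustration: any line that shows up in every page of the set is treated as template and dropped. The function name `strip_shared_lines`, the `min_share` parameter, and the sample pages are all made up for this example.

```python
from collections import Counter


def strip_shared_lines(pages, min_share=1.0):
    """Drop lines that occur in at least `min_share` of the pages.

    Naive illustration only: works on whole lines, not on substrings,
    and is not the algorithm described in the paper.
    """
    split_pages = [[line.strip() for line in page.splitlines()] for page in pages]

    # Count in how many pages each distinct line appears (once per page).
    seen_in = Counter()
    for lines in split_pages:
        seen_in.update(set(lines))

    # Lines shared by enough pages are assumed to belong to the template.
    threshold = min_share * len(pages)
    template = {line for line, count in seen_in.items() if count >= threshold}

    # Keep only the non-empty lines that are not part of the template.
    return [
        [line for line in lines if line and line not in template]
        for lines in split_pages
    ]


# Hypothetical pages sharing the same layout.
pages = [
    "<html>\n<div id='nav'>Home | About</div>\n<p>First post about RSS.</p>\n</html>",
    "<html>\n<div id='nav'>Home | About</div>\n<p>Second post about pings.</p>\n</html>",
]
content = strip_shared_lines(pages)
```

With these two sample pages, only the `<p>` line of each page survives. A real page set is messier (templates rarely line up on line boundaries), which is exactly why a substring-based approach like the one in the paper is more interesting.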



