Whenever you run into the situation that you have got used to a command line tool and all of a sudden need to apply it to a large amount of files over a Hadoop cluster without having any clue of writing distributed programs ToMaR will be your friend.
The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving where long-term preservation is of direct relevance for different task areas, like harvesting, storage, and access.
Web pages are getting more complex than ever. Thus, identifying different elements from web pages, such as main content, menus, user comments, advertising among others, becomes difficult. Web page segmentation refers to the process of dividing a Web page into visually and semantically coherent segments called Blocks or Segments.