Parse the target URL's HTML for detailed info (image, description, keywords, etc.); cleansing; classification.


We need a configurable mechanism for scraping HTML content from the target URL.
  • Default meta tags such as og:image, og:description, and og:title provide the minimum details.
  • For certain websites, we can get better-quality images by targeting specific tags with jQuery-like expressions. For this we need a way to store configuration at the website level and at the menu level.
  • There should be a fallback mechanism for identifying the information: if a specific configuration does not match, fall back to a more general one. The expressions therefore need a priority order that determines which one is tried first.
  • Global expressions (used across all websites, as we do now for the meta tags)
  • Website/domain-level expressions (apply only to that particular domain)
  • Menu-level expressions (apply only to that menu)
  • URL-pattern-based expressions, at the website level and at the menu level
  • All of the above scopes have a priority, from most to least specific: very very specific => very specific => specific => menu level => website level => global.
    **** Along with the above, we could also have a cleansing mechanism; for example, numeric text in titles can be removed from the collection.
  • This configuration can also be managed at the global / website / menu level.
    *** Think about writing classification rules based on the URL, title, and description. Language detection could be handled in the same manner.
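
A minimal sketch of the scoped fallback idea, to make the priority order concrete. Everything named here is an assumption for illustration: the `Rule` fields, the four scope ranks (with URL-pattern rules treated as the most specific), the `hires:image` property, and `example.com`. To stay within the standard library, the sketch looks up meta properties instead of evaluating real jQuery-like expressions; a real version would need a selector library at that step. The cleansing helper shows only the numeric-text case mentioned above.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional

# Hypothetical scope ranks: a higher number is more specific and is tried first.
GLOBAL, WEBSITE, MENU, URL_PATTERN = 0, 1, 2, 3


@dataclass
class Rule:
    scope: int                        # one of the ranks above
    field: str                        # e.g. "image", "title"
    meta_property: str                # stand-in for a jQuery-like expression
    domain: Optional[str] = None      # set for website-level rules and narrower
    menu: Optional[str] = None        # set for menu-level rules
    url_regex: Optional[str] = None   # set for URL-pattern rules

    def applies(self, domain: str, menu: str, url: str) -> bool:
        """A rule applies when every constraint it declares matches."""
        if self.domain is not None and self.domain != domain:
            return False
        if self.menu is not None and self.menu != menu:
            return False
        if self.url_regex is not None and not re.search(self.url_regex, url):
            return False
        return True


class MetaCollector(HTMLParser):
    """Collects <meta property=... content=...> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            key = attr.get("property") or attr.get("name")
            if key and attr.get("content"):
                self.meta.setdefault(key, attr["content"])


def extract(html, rules, field, domain, menu, url):
    """Try applicable rules from most to least specific; return the first hit."""
    collector = MetaCollector()
    collector.feed(html)
    candidates = [r for r in rules if r.field == field and r.applies(domain, menu, url)]
    for rule in sorted(candidates, key=lambda r: r.scope, reverse=True):
        value = collector.meta.get(rule.meta_property)
        if value:
            return value
    return None


def cleanse_title(title):
    """Example cleansing rule: drop standalone numeric tokens from a title."""
    without_numbers = re.sub(r"\b\d+\b", "", title)
    return re.sub(r"\s+", " ", without_numbers).strip()


RULES = [
    Rule(GLOBAL, "image", "og:image"),
    Rule(GLOBAL, "title", "og:title"),
    Rule(WEBSITE, "image", "hires:image", domain="example.com"),  # hypothetical tag
]

PAGE = (
    '<html><head>'
    '<meta property="og:image" content="small.jpg">'
    '<meta property="og:title" content="Top 10 news 2024">'
    '</head></html>'
)

url = "https://example.com/news/1"
# The website-level rule finds nothing on this page, so the global rule is used.
print(extract(PAGE, RULES, "image", "example.com", "news", url))                 # small.jpg
print(cleanse_title(extract(PAGE, RULES, "title", "example.com", "news", url)))  # Top news
```

Sorting by a single numeric rank keeps the fallback order explicit and easy to extend. Cleansing and classification rules could probably reuse the same `Rule.applies` scoping, since they are meant to be managed at the same levels.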