Can I control file types crawled by content type

I would like to crawl file servers at a different level of service than WSS sites. Case in point: .html and .htm are file types that are indexed in my farm because they are allowed file types. When I crawl a file server, however, I don't want them indexed. Is there a way to do that? It would be nice to still capture the name and location of .html or .htm files, but I don't want my indexer to spend the time busting them apart on file servers.

Thanks

Chris Fields




  • vej

    I have a similar problem with crawl rules.
    I need to exclude everything under http://server/ but include only some URLs, for example http://server/Pages/
    I created two crawl rules, http://server/* to Exclude all content and http://server/Pages/* to Include, and tried re-ordering them, but nothing changes: MOSS doesn't crawl anything. I need only http://server/Pages/* to be crawled.

  • Wildert

    Chris,

    You can achieve these results using Crawl Rules.

    Example:

    You have a file share \\server\sharedfolder that you want to crawl, but you want to exclude *.htm files.

    1. You add a new content source of type 'File Shares' that maps to this folder.
    2. From Search Settings -> Crawl Rules you add two rules
      1. \\server\sharedfolder\*.htm - choose Exclude all items in this path
      2. \\server\sharedfolder\* - choose Include all items in this path
    3. The rules should be shown in the order above (if not reorder using the order selector)
    4. Test the rules using the text box and sample file names, e.g. \\server\sharedfolder\mydoc.doc will highlight rule 2 and show it as included, while \\server\sharedfolder\myhtml.htm will match rule 1 and be excluded.

    Then rerun a full crawl of the file share. Once it completes, view the crawl log: it shows all of the documents that were included and highlights those that were excluded, with the message 'Deleted by the gatherer (This item was deleted because it was excluded by a crawl rule.)'
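
To see why the order in step 3 matters, here is a rough Python sketch of "first matching rule wins" evaluation. It is not how MOSS is implemented: the `RULES` list and the `evaluate` helper are hypothetical names made up for illustration, and it assumes simple case-insensitive wildcard matching where `*` matches any run of characters.

```python
from fnmatch import fnmatchcase

# Ordered rules mirroring the two rules in the steps above (hypothetical model).
RULES = [
    (r"\\server\sharedfolder\*.htm", "exclude"),
    (r"\\server\sharedfolder\*", "include"),
]

def evaluate(path, rules=RULES):
    """Return (rule_number, action) for the first rule whose pattern matches."""
    for number, (pattern, action) in enumerate(rules, start=1):
        # Lowercase both sides so the comparison ignores case.
        if fnmatchcase(path.lower(), pattern.lower()):
            return number, action
    return None, "no match"

for sample in (r"\\server\sharedfolder\mydoc.doc",
               r"\\server\sharedfolder\myhtml.htm"):
    print(sample, evaluate(sample))
```

In this model, swapping the two rules makes the catch-all Include rule match first, so the *.htm Exclude rule never fires. Note also that a file ending in .html is not caught by the *.htm pattern here, so under the same assumption it would need its own Exclude rule; the test box in step 4 is the place to confirm what MOSS actually does.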

    Andrew


