Problem crawling with a hierarchical custom protocol handler

I have developed a hierarchical custom protocol handler that I'm integrating with Windows Search on Vista. The handler has an IUrlAccessor and IFilter implementation that enumerates the contents of a directory by returning the search URL of each "file" in the directory as a text chunk, with the STAT_CHUNK structure specifying the GathererPropset and the PID_GTHR_DIRLINK propid.

My understanding is that the indexer should take these URLs and queue them for indexing, in which case I should see subsequent calls to CreateAccessor with those URLs. Instead, I'm seeing the indexer call CreateAccessor once to get the directory accessor and then read the dirlink text chunks, but it never makes the expected additional calls to CreateAccessor to index these links.
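
To make the expected sequence concrete, here is a minimal, portable sketch of the dirlink enumeration described above. It is not the handler's actual code: the types are simplified stand-ins for STAT_CHUNK and the HRESULT constants from the Windows filter headers, DirLinkFilter and ChunkInfo are hypothetical names, and the property set GUID and propid are the values that appear in the log later in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-ins for HRESULT values that a real filter gets from the Windows headers.
using Hr = std::uint32_t;
constexpr Hr kOk                 = 0x00000000u; // S_OK
constexpr Hr kFilterSLastText    = 0x00041709u; // FILTER_S_LAST_TEXT
constexpr Hr kFilterENoMoreText  = 0x80041701u; // FILTER_E_NO_MORE_TEXT
constexpr Hr kFilterEEndOfChunks = 0x80041700u; // FILTER_E_END_OF_CHUNKS

// Stand-in for GUID.
struct Guid {
    std::uint32_t d1;
    std::uint16_t d2, d3;
    std::uint8_t  d4[8];
};

// Property set and propid copied from the log in this thread.
constexpr Guid kGathererPropset = {0x0b63e343, 0x9ccc, 0x11d0,
                                   {0xbc, 0xdb, 0x00, 0x80, 0x5f, 0xcc, 0xce, 0x04}};
constexpr std::uint32_t kPidGthrDirlink = 2;
constexpr std::uint32_t kChunkText = 1; // CHUNK_TEXT

// Hypothetical, reduced version of STAT_CHUNK: only the fields discussed here.
struct ChunkInfo {
    std::uint32_t idChunk;
    std::uint32_t flags;
    Guid          propset;
    std::uint32_t propid;
};

// Hypothetical filter model: one text chunk per child URL of the "directory".
class DirLinkFilter {
public:
    explicit DirLinkFilter(std::vector<std::wstring> childUrls)
        : urls_(std::move(childUrls)) {}

    // Announces the next dirlink chunk, or reports that none are left.
    Hr GetChunk(ChunkInfo& out) {
        if (next_ >= urls_.size()) return kFilterEEndOfChunks;
        current_ = next_++;
        textPending_ = true;
        out = ChunkInfo{static_cast<std::uint32_t>(current_ + 1), kChunkText,
                        kGathererPropset, kPidGthrDirlink};
        return kOk;
    }

    // Returns the URL of the current chunk exactly once.
    Hr GetText(std::wstring& out) {
        if (!textPending_) return kFilterENoMoreText;
        out = urls_[current_];
        textPending_ = false;
        return kFilterSLastText;
    }

private:
    std::vector<std::wstring> urls_;
    std::size_t next_ = 0;
    std::size_t current_ = 0;
    bool textPending_ = false;
};
```

Each GetChunk call announces one link and the following GetText call returns its URL, after which GetChunk reports the end of chunks. Chunk ids start at 1 here as a choice of the sketch; the log in this thread shows the first chunk with id 0.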

Is there something I'm missing? One thing that concerns me is that I could not find the GathererPropset GUID defined in the current Windows SDK headers and had to pull the file from the SharePoint SDK. This leads me to suspect that Vista uses a different propset and/or propid for directory links, but I cannot find any documentation on what that might be.

Any info is much appreciated.

David





  • Parker Lewis

    Martin,

    WDS 2.6.6 will eventually be released, but it is still in development. There is a bit of a branch between WDS 3.0 and WDS 2.6.x. To explain, our long-term solution for WDS will be WDS 3.x (indexer running as a service). Currently, however, WDS 3.0 does not provide some of the things needed by our enterprise partners (group policy support, administration information, etc.). We are continuing to provide support and updates to 2.6.x until 3.0 offers all of the functionality that our enterprise partners have come to expect. As a result of this, the development schedules of 3.0 and 2.6.x are not closely linked.

    There have been several updates to the 3.0 SDK information. You can find the 3.0 info here (note the WDS 2.6.x and 3.0 nodes in the left nav pane):

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/shellcc/platform/shell/winshell.asp&frame=true

    The Windows header and IDL files can be downloaded on the Windows SDK here:

    http://msdn.microsoft.com/windowsvista/downloads/products/getthebeta/default.aspx

    Paul Nystrom - MSFT



  • TomJ72

    Ivaylo,

    • No, my intent is not to redirect the crawler to an HTTP source. I had added the GetRedirectedURL implementation in an attempt to hide the search URL ("XXXXXX://etc...") from the UI, which was showing up when performing a search on Vista RC1. With RC2 I'm not seeing this in the UI, so I just took it back out.
    • I tried the URL format you suggested and that has helped: the indexer is now crawling at least the first dirlink URL, but it crashes afterwards. So far I have not found a problem in my code that might cause this, but I'm still looking into it.
    • Having an extension is problematic, since the information from this custom store is not a document per se, just some text I want to index. I am currently using a custom IFilter implementation for both the "directory" and the "documents".

    What I am trying to accomplish with the content I'm indexing from my custom store is the following:

    1. Provide the indexer with a fairly small set of keywords and phrases to index for each "document"
    2. Provide a display name for each "document" to be shown to the user in the search results instead of the search URL, which is not user-readable.
    3. Provide a URL for each "document" to be used with the display name in the search results UI to provide the user a link they can click on to get to the content.

    I have not found documentation that explains how to do this, so I have some questions:

    • For #1, do I need to provide an IPersist implementation on the filter to provide the text to index, or can I simply return the text via the GetChunk/GetText IFilter methods? If I can simply use GetChunk, what propset and id do I need to use?
    • For #2 I am currently providing a text chunk with a propset of PSGUID_QUERY and property name of "dav:displayname", based on some sample code I found. I doubt this is correct, and would like to find what property I should use.
    • For #3 I am currently providing a text chunk with a propset of PSGUID_QUERY and property id = 9, again based on some sample code.

    Thanks,

    DavidJ



  • © Ţĩмό Şąļσмāĸ

    David,

    • Do you really want to redirect the crawler to an HTTP source, or are you just trying to provide an alternative Path? In the latter case you shouldn't be using IUrlAccessor::GetRedirectedURL and you should simply return PRTH_E_NOT_REDIRECTED from it. There is another way to override the Path.
    • I would also not use the triple slash in the URLs. It looks redundant, but try this format: xxxxxx://xxxxxx/your-custom-id-guid-whatever/etc/etc.extension. In this case SEARCH_ROOT will be defined as xxxxxx://xxxxxx
    • Having the extension for documents and other non-folder items as part of the URL will make things much simpler down the road.

    Ivaylo


  • Midnight Conjurer

    Paul,

    Thanks for the clarification on the URL syntax. I have gotten further by adding a dummy site name into the URL.

    The explanation you provided for event 3036 makes sense for a protocol handler that implements GetFileName, in which case the filter daemon will try to directly access files and might therefore run into problems accessing a directory. However my protocol handler implements BindToFilter, not GetFileName, and it accesses a custom store which has no access control restrictions and to which filesystem-specific settings like FANCI are not applicable. Furthermore, I am sometimes finding this warning in the log before my handler's ISearchProtocol::Init() method is even called, so the service is apparently determining it cannot access the content source before it has even tried. Luckily I am not completely blocked by this problem.

    Now that I have gotten further along, I have a new problem. I intended to have my IFilter implementation return 3 pieces of information to the filter daemon for each document: 1) keywords to index, 2) A display name to show the user in search UI, and 3) A URL to where the content can be found, such as file:///somepath or http://somesite/somepath, which the search UI will use to show a link for the item. These properties are returned via the GetChunk/GetText methods.

    I have successfully managed to get keywords indexed. (Yay!) But I have not been successful with #2 and #3.

    For the display name I've tried returning text for these properties:

    DEFINE_PROPERTYKEY(PKEY_ItemName, 0x6B8DA074, 0x3B5C, 0x43BC, 0x88, 0x6F, 0x0A, 0x2C, 0xDC, 0xE0, 0x0B, 0x6F, 100);
    DEFINE_PROPERTYKEY(PKEY_ItemNameDisplay, 0xB725F130, 0x47EF, 0x101A, 0xA5, 0xF1, 0x02, 0x60, 0x8C, 0x9E, 0xEB, 0xAC, 10);

    Neither one works. Instead of displaying the text associated with the property, it always shows the last component of the search URL. For example, if the URL used in the search was, "xxxxxxx://xxx/12345", it shows "12345".

    For the URL, I've tried the following properties:

    DEFINE_PROPERTYKEY(PKEY_Link_TargetUrl, 0x5CBF2787, 0x48CF, 0x4208, 0xB9, 0x0E, 0xEE, 0x5E, 0x5D, 0x42, 0x02, 0x94, 2);
    DEFINE_PROPERTYKEY(PKEY_ItemUrl, 0x49691C90, 0x7E17, 0x101A, 0xA9, 0x1C, 0x08, 0x00, 0x2B, 0x2E, 0xCD, 0xA9, 9);

    Returning the TargetUrl property has no effect on the UI - the item shows up without any link. Returning the ItemUrl property actually causes the search indexer to crash, so I have not been able to find out if it would affect the display of the item in the search UI.

    Can you find out if it is indeed possible to do what I'm trying to accomplish, and if so, how do I do it? I am hoping that I can do this by simply returning the right property values to the filter daemon, to avoid having to develop a shell extension.

    Thanks,

    DavidJ



  • tabdalla

    David,

    In the vast majority of ways WDS 3.0 = Vista Search. I say the vast majority of ways because of the following:

    The engine and SDK for the two products are essentially the same.

    The UI for the two products is completely different.

    As such, they are on different release schedules.

    Do the thoughts in the post above help at all? If not, we can look deeper for you.

    Paul Nystrom - MSFT



  • Lollo79

    David,

    Some time ago I developed a PH for WDS 2.6, and I fed the URLs with an IFilter using CHUNK_TEXT, IID_GathererPropset. The PH never started crawling by itself, and I wrote two messages here regarding this:

    1. http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=368842&SiteID=1
    2. http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=609405&SiteID=1

    Paul Nystrom replied to 2.) on 09-26-2006:

    After a fair amount of investigation, I found that the fact that indexing does not start automatically for 3rd party protocol handlers is a bug in our product. It has been assigned and is due to be fixed with the release of WDS 2.6.6. I'm sorry you are running into this issue, but help is on the way.

    • Will WDS 2.6.6 ever be released? We already have WDS 3 Beta 2!
    • My PH prototype no longer works with WDS3 because there are different interfaces that I do not know how to use.
    • I am awaiting the WDS3 SDK.

    Martin


  • Rischa

    So does WDS 3.0 == Vista WDS, or are they considered separate releases? I would like some clarification on what does and does not work in Vista, since that is what I'm targeting.

    ~ David



  • Tanny

    David,

    Below are some comments in line with your questions. These come from one of our SDK developers.

    - If I use a URL such as "xxxxxxx://someguid/" with two slashes instead of three, then the indexer will initialize the handler but never call CreateAccessor at all. What are the syntax rules for these URLs?

    Protocol://site/path/file

    If site is LocalHost, you can just leave it empty. So:

    Protocol://LocalHost/path/file is the same as protocol:///path/file
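
    As an illustration of the rule above, the following sketch splits a URL into protocol, site, and path and fills in LocalHost when the site is empty. It is only a model of the stated syntax, not the indexer's own parser; ParseCrawlUrl and CrawlUrl are hypothetical names.

```cpp
#include <string>

// Hypothetical holder for the three parts of Protocol://site/path/file.
struct CrawlUrl {
    std::string protocol;
    std::string site;
    std::string path;
};

// Splits a URL of the form Protocol://site/path. An empty site is treated
// as LocalHost, following the rule quoted above. Returns false when the URL
// has no "://" separator or no path after the site.
bool ParseCrawlUrl(const std::string& url, CrawlUrl& out) {
    const std::string separator = "://";
    const std::string::size_type schemeEnd = url.find(separator);
    if (schemeEnd == std::string::npos || schemeEnd == 0) return false;

    const std::string::size_type siteStart = schemeEnd + separator.size();
    const std::string::size_type pathStart = url.find('/', siteStart);
    if (pathStart == std::string::npos) return false;

    out.protocol = url.substr(0, schemeEnd);
    out.site = url.substr(siteStart, pathStart - siteStart);
    if (out.site.empty()) out.site = "LocalHost";
    out.path = url.substr(pathStart);
    return true;
}
```

    Read this way, "xxxxxxx://someguid/" puts the GUID in the site position with an empty path, whereas "xxxxxxx:///someguid/" has an implicit LocalHost site and the GUID as the first path segment.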

    - I'm seeing the following event appear in the windows search event log:

    Event 3036:

    The content source <xxxxxxx:///{a99e7280-2d36-4133-a9b9-55d2b8d12e38}/> cannot be accessed.

    Context: Windows Application, SystemIndex Catalog

    Details:

    The specified address was excluded from the index. The site path rules may have to be modified to include this address. (0x80040d07)

    That can happen if the Service doesn't have permission to access that directory or if the FANCI bit is set (i.e., you chose not to allow fast indexing in the folder properties). It can also be returned if the Protocol Handler (PH) doesn't allow access to that folder.

    My *first* guess here is that for some reason the PH isn't able to access that directory/container.
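
    For readers unfamiliar with the term, the FANCI bit is the FILE_ATTRIBUTE_NOT_CONTENT_INDEXED file attribute. A minimal check against an attribute mask might look like the sketch below; IsFanciSet is a hypothetical helper, and on a real system the mask would come from a call such as GetFileAttributesW.

```cpp
#include <cstdint>

// Value of FILE_ATTRIBUTE_NOT_CONTENT_INDEXED from the Windows headers,
// repeated here so the sketch compiles without them.
constexpr std::uint32_t kFileAttributeNotContentIndexed = 0x00002000u;

// Returns true when the attribute mask marks the item as excluded from
// content indexing.
constexpr bool IsFanciSet(std::uint32_t fileAttributes) {
    return (fileAttributes & kFileAttributeNotContentIndexed) != 0;
}
```

    This only applies to file system locations; whether it has any bearing on a custom store is a separate question.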

    Do these help at all?

    Paul Nystrom - MSFT



  • woodheadz

    Hi Ivaylo,

    Seemed worth a try, so I switched the code to CHUNK_VALUE instead of CHUNK_TEXT and returned the dirlinks as BSTR PROPVARIANTs. However, the results were the same.

    Regards,

    David



  • xshua

    David,

    1. Your custom IFilter will be sufficient. It doesn't have to implement any IPersistXYZ. You can generate a new GUID and use it as your guidPropSet with any number as a propid. WDS will index every custom property that comes out of GetChunk/GetText. This seems to work for WDS, but for SharePoint you will need to register your custom properties and specify whether they should be indexed.
    2. "dav:displayname" might work for SharePoint, but for WDS you need "System.ItemName". You can get the guidPropSet/propid from "<system_drive>:\Documents and Settings\All Users\Application Data\Microsoft\Search\Data\Config\schema.txt".
    3. How does this work for you?

    Ivaylo

    P.S.

    RE:#1

    I have never used this, but System.Search.Contents a.k.a {B725F130-47EF-101A-A5F1-02608C9EEBAC}/19 seems like what you were looking for.
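
    The {GUID}/propid notation used above can be produced from a property key definition with a few lines of code, which may help when matching DEFINE_PROPERTYKEY values against another listing. This is a sketch only: PropKey is a simplified stand-in for PROPERTYKEY and FormatPropKey is a hypothetical helper.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Simplified stand-in for PROPERTYKEY: the GUID fields followed by the propid,
// in the same order as the DEFINE_PROPERTYKEY arguments.
struct PropKey {
    std::uint32_t d1;
    std::uint16_t d2, d3;
    std::uint8_t  d4[8];
    std::uint32_t pid;
};

// Formats a key as {XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX}/pid.
std::string FormatPropKey(const PropKey& k) {
    auto u = [](std::uint32_t v) { return static_cast<unsigned>(v); };
    char buf[64];
    std::snprintf(buf, sizeof buf,
                  "{%08X-%04X-%04X-%02X%02X-%02X%02X%02X%02X%02X%02X}/%u",
                  u(k.d1), u(k.d2), u(k.d3),
                  u(k.d4[0]), u(k.d4[1]), u(k.d4[2]), u(k.d4[3]),
                  u(k.d4[4]), u(k.d4[5]), u(k.d4[6]), u(k.d4[7]),
                  u(k.pid));
    return std::string(buf);
}
```

    Applied to the PKEY_ItemNameDisplay definition quoted elsewhere in this thread, this yields {B725F130-47EF-101A-A5F1-02608C9EEBAC}/10, which shares its GUID with the System.Search.Contents key above and differs only in the propid.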

     


  • Jehan Badshah

    David,

    I'm going to refer this to one of our developers to see if I can get an answer for you. I hope to be back with additional information in the near future.

    Paul Nystrom - MSFT



  • Thiru_

    Another clue I found is that if I go into the control panel's Indexing Options and deselect all content other than my handler's search root (which is listed in the UI) the indexing service fails to index and puts this message into the event log:

    "The update for the index cannot be started because the specified content sources were not configured for updates. Add at least one content source."

    This indicates there may be some config step I've missed to register a custom content source along with my protocol handler. I've found some mention of adding content sources in the SharePoint literature, but nothing for WDS or Vista search. Does anyone have an answer to this?

    ~ David



  • Niehls

    Paul,

    Here is some more information on what I'm seeing.

    My protocol handler includes the following code to register the search root and scope rules, which is for now part of the DLL self-registration code:

    // get ISearchCrawlScopeManager
    CComPtr<ISearchCrawlScopeManager> crawlMgr;

    GetCrawlScopeMgr(&crawlMgr);

    // create the search root

    CComPtr<IUnknown> rootUnk;
    _HRCALL(rootUnk.CoCreateInstance(CLSID_CSearchRoot));

    CComQIPtr<ISearchRoot> searchRoot = rootUnk;

    if (!searchRoot)
        throw E_FAIL;

    _HRCALL(searchRoot->put_RootURL(SEARCH_ROOT));
    _HRCALL(searchRoot->put_IsHierarchical(TRUE));
    _HRCALL(searchRoot->put_ProvidesNotifications(FALSE));
    _HRCALL(searchRoot->put_UseNotificationsOnly(FALSE));
    _HRCALL(searchRoot->put_FollowDirectories(TRUE));

    // add the search root

    _HRCALL(crawlMgr->AddRoot(searchRoot));

    // add the default search scope

    _HRCALL(crawlMgr->AddDefaultScopeRule(SEARCH_ROOT, TRUE, FF_INDEXCOMPLEXURLS));

    // add a hierarchical scope

    _HRCALL(crawlMgr->AddHierarchicalScope(SEARCH_ROOT, TRUE, FALSE, FALSE));

    // save the changes

    _HRCALL(crawlMgr->SaveAll());

    This works, in that it completes without any errors and reasonable looking values appear in the registry afterward. (Don't mind the "throw" statement, there is some exception handling wrapping this code higher in the call stack.) I also have a registry script that adds a value under HKLM/Software/Microsoft/Windows Search/ProtocolHandlers, which appears to work fine.

    I registered my DLL with SEARCH_ROOT = "xxxxxxx:///someguid/", where the guid refers to a "directory" within my custom store. After registration I rebuilt the index from Indexing Options in control panel and the indexer initialized the handler, got the URL Accessor, got the Filter, and read the text chunks. These text chunks are dirlinks that refer to additional URLs the indexer should crawl. I expected the indexer to create new URL accessors for these dirlinks and crawl them for content, but it never does. Here is an excerpt from a logfile showing the calls into my code from the indexer and the data that was returned:

    10/12/2006 22:18:54:0113 DEBUG: CHandler::Init exit, hr=0x00000000
    10/12/2006 22:18:54:0300 DEBUG: CHandler::CreateAccessor entry
    10/12/2006 22:18:57:0503 DEBUG: CHandler::CreateAccessor exit, hr=0x00000000
    10/12/2006 22:18:57:0660 DEBUG: CUrlAccessor::IsDirectory entry
    10/12/2006 22:18:57:0816 DEBUG: CUrlAccessor::IsDirectory exit, hr=0x00000000
    10/12/2006 22:18:57:0972 DEBUG: CUrlAccessor::GetSize entry
    10/12/2006 22:18:58:0128 DEBUG: *pullSize = 0
    10/12/2006 22:18:58:0285 DEBUG: CUrlAccessor::GetSize exit, hr=0x00000000
    10/12/2006 22:18:58:0441 DEBUG: CUrlAccessor::GetLastModified entry
    10/12/2006 22:18:58:0613 DEBUG: pftLastModified = 0000000000000000
    10/12/2006 22:18:58:0769 DEBUG: CUrlAccessor::GetLastModified exit, hr=0x80004001
    10/12/2006 22:18:58:0925 DEBUG: CUrlAccessor::GetRedirectedURL entry
    10/12/2006 22:18:59:0081 DEBUG: *pdwLength = 26, wszRedirectedURL =
    http://xxxxxxxxxxxxxxxxxxx
    10/12/2006 22:18:59:0238 DEBUG: CUrlAccessor::GetRedirectedURL exit, hr=0x00000000
    10/12/2006 22:18:59:0394 DEBUG: CUrlAccessor::BindToFilter entry
    10/12/2006 22:18:59:0550 DEBUG: CUrlAccessor::BindToFilter exit, hr=0x00000000
    10/12/2006 22:18:59:0706 DEBUG: CUrlAccessor::IsDirectory entry
    10/12/2006 22:18:59:0863 DEBUG: CUrlAccessor::IsDirectory exit, hr=0x00000000
    10/12/2006 22:19:00:0019 DEBUG: CUrlAccessor::Init entry
    10/12/2006 22:19:00:0175 DEBUG: CUrlAccessor::Init exit, hr=0x00000000
    10/12/2006 22:19:00:0331 DEBUG: CUrlAccessor::GetChunk entry
    10/12/2006 22:19:00:0488 DEBUG: idChunk = 0
    10/12/2006 22:19:00:0644 DEBUG: breakType = 2
    10/12/2006 22:19:00:0800 DEBUG: flags = 1
    10/12/2006 22:19:00:0972 DEBUG: locale = 1033
    10/12/2006 22:19:01:0128 DEBUG: attribute.guidPropset = 0b63e343-9ccc-11d0-bcdb-00805fccce04
    10/12/2006 22:19:01:0285 DEBUG: attribute.psProperty.ulKind = 1
    10/12/2006 22:19:01:0441 DEBUG: attribute.psProperty.propid = 2
    10/12/2006 22:19:01:0613 DEBUG: CUrlAccessor::GetChunk exit, hr=0x00000000
    10/12/2006 22:19:01:0769 DEBUG: CUrlAccessor::GetText entry
    10/12/2006 22:19:02:0050 DEBUG: pcwcBuffer = 55, awcBuffer = xxxxxxx:///{a99e7280-2d36-4133-a9b9-55d2b8d12e38}/12345
    10/12/2006 22:19:02:0269 DEBUG: CUrlAccessor::GetText exit, hr=0x00041709
    10/12/2006 22:19:02:0472 DEBUG: CUrlAccessor::GetChunk entry
    10/12/2006 22:19:02:0628 DEBUG: idChunk = 1
    10/12/2006 22:19:02:0785 DEBUG: breakType = 2
    10/12/2006 22:19:02:0941 DEBUG: flags = 1
    10/12/2006 22:19:03:0097 DEBUG: locale = 1033
    10/12/2006 22:19:03:0253 DEBUG: attribute.guidPropset = 0b63e343-9ccc-11d0-bcdb-00805fccce04
    10/12/2006 22:19:03:0410 DEBUG: attribute.psProperty.ulKind = 1
    10/12/2006 22:19:03:0628 DEBUG: attribute.psProperty.propid = 2
    10/12/2006 22:19:03:0785 DEBUG: CUrlAccessor::GetChunk exit, hr=0x00000000
    10/12/2006 22:19:03:0863 DEBUG: CUrlAccessor::GetText entry
    10/12/2006 22:19:04:0019 DEBUG: pcwcBuffer = 55, awcBuffer = xxxxxxx:///{a99e7280-2d36-4133-a9b9-55d2b8d12e38}/67890
    10/12/2006 22:19:04:0175 DEBUG: CUrlAccessor::GetText exit, hr=0x00041709
    10/12/2006 22:19:04:0331 DEBUG: CUrlAccessor::GetChunk entry
    10/12/2006 22:19:04:0488 DEBUG: CUrlAccessor::GetChunk CUrlAccessor::GetChunk reached end of chunks
    10/12/2006 22:19:04:0706 DEBUG: CUrlAccessor::GetChunk exit, hr=0x80041700
    10/12/2006 22:19:04:0863 DEBUG: CUrlAccessor::IsDirectory entry
    10/12/2006 22:19:05:0019 DEBUG: CUrlAccessor::IsDirectory exit, hr=0x00000000
    10/12/2006 22:19:05:0175 DEBUG: CUrlAccessor::~CUrlAccessor entry
    10/12/2006 22:19:05:0331 DEBUG: CUrlAccessor::~CUrlAccessor exit
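
    For anyone reading the log above, the non-zero hr values correspond to well-known COM and IFilter status codes. The lookup below is a small reading aid, not part of the handler; DescribeHresult and Succeeded are hypothetical helpers, and the entry for 0x80040d07 simply repeats the text of event 3036 rather than a symbolic name.

```cpp
#include <cstdint>
#include <string>

// Maps the status codes that appear in this thread to a readable label.
std::string DescribeHresult(std::uint32_t hr) {
    switch (hr) {
        case 0x00000000u: return "S_OK";
        case 0x80004001u: return "E_NOTIMPL";
        case 0x00041709u: return "FILTER_S_LAST_TEXT";
        case 0x80041700u: return "FILTER_E_END_OF_CHUNKS";
        case 0x80040d07u: return "address excluded from the index (event 3036 text)";
        default:          return "unknown";
    }
}

// Mirrors the SUCCEEDED macro: a clear high bit means success, which includes
// informational codes such as FILTER_S_LAST_TEXT.
bool Succeeded(std::uint32_t hr) {
    return (hr & 0x80000000u) == 0;
}
```

    Read with this table, both GetText calls return a success code (last text in the chunk), the final GetChunk reports the normal end of chunks, and GetLastModified reports that it is not implemented.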

    I have also noticed some strange behavior such as:

    - If I use a URL such as "xxxxxxx://someguid/" with two slashes instead of three, then the indexer will initialize the handler but never call CreateAccessor at all. What are the syntax rules for these URLs?

    - I'm seeing the following event appear in the windows search event log:

    Event 3036:
    The content source <xxxxxxx:///{a99e7280-2d36-4133-a9b9-55d2b8d12e38}/> cannot be accessed.

    Context: Windows Application, SystemIndex Catalog

    Details:
     The specified address was excluded from the index. The site path rules may have to be modified to include this address.   (0x80040d07)

    Would it be possible to hook me up directly with a dev contact within the search team to get through this? I'm getting tired of banging my head on the wall.

    Regards,

    DavidJ

     



  • Jackobolo

    David,

    I'm not sure this will help, but can you try using value chunks instead of text chunks?

    Ivaylo

