I would like to be able to take 1 file with entries, like:
COMPARE FILE:
_________________
siteone.com
sitetwo.com
sitethree.com
_________________
And search a file for matches. But it seems that if I loop through each entry (siteone.com, sitetwo.com, etc.), it will take forever.
How can I search the contents of files to see whether any of these strings appear in them? It needs to be quick because it will be searching tons of files. I could check each entry in a loop, but if there are 2000 entries in the compare file and each compare takes 2 seconds, that would be over an hour per file (2000 x 2 s = 4000 s). I am just throwing numbers out there, so I am sure it is quicker than that, but if I am handling 100 files a minute or more, it would be too slow.
Thanks in advance for your advice...
Michael C. Gates

Searching Text or Unicode strings
nhaas
Mr. Gates, :-)
On b-trees, try http://en.wikipedia.org/wiki/B-tree if you want a brief overview. This matters more with a large number of search strings. I would look for an existing implementation you can plug into, either from a third party or from a set such as PowerCollections or some other set of generic classes.
You probably don't want an array here. For example, if you're looking for "viagra" and "vitamin", you want a storage structure that lets you say that after a 'v' and an 'i', you can take either an 'a' or a 't'. You need to link letter by letter into a network so you can follow whatever paths are required. An array might work for the set of possible matches, as you'll need to revisit these each time and either add to or remove from the list. Actually, a doubly linked list might be better for this, to allow a quick insert or delete. (Sorry, just thinking out loud.)
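As a rough illustration of that letter-by-letter structure (usually called a trie or prefix tree), here is a minimal sketch. It is written in Python only for brevity; the names build_trie and END are made up for this example, and the same shape carries over to nested dictionaries in C#.

```python
END = "$end"  # hypothetical marker key; it is longer than one character, so it can never clash with a letter


def build_trie(phrases):
    """Build a character trie: each node is a dict mapping one character to a child node."""
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            # Reuse the child for this character if it exists, otherwise create it.
            node = node.setdefault(ch, {})
        node[END] = phrase  # remember which phrase finishes at this node
    return root


trie = build_trie(["viagra", "vitamin"])
print(sorted(trie["v"]["i"]))  # ['a', 't']
```

After 'v' and 'i' the node offers exactly the two branches described above, so both phrases share their common prefix.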
As to your question: no, a 20 MB file should not be a problem. The whole point of the Stream abstraction is to let you view a chunk at a time. So you don't want to read an entire file into memory; you want to use the standard methods on the Stream class (or a StreamReader wrapped around it) to read a character at a time, and let the underlying Stream implementation cache and optimize on your behalf.
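To make the chunk-at-a-time idea concrete, here is a small sketch, in Python for brevity rather than C#. The helper name iter_chars and the chunk size are assumptions for illustration; the point is only that memory use is bounded by the chunk size, not by the file size.

```python
import io


def iter_chars(stream, chunk_size=64 * 1024):
    """Yield characters one at a time while reading the stream in fixed-size chunks."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty string means end of stream
            break
        yield from chunk


# io.StringIO stands in for a large file here; with a real file you would
# pass open(path, "r", encoding="utf-8") instead.
sample = io.StringIO("buy vitamins at siteone.com")
count = sum(1 for _ in iter_chars(sample, chunk_size=8))
print(count == len("buy vitamins at siteone.com"))  # True
```

The caller sees a plain sequence of characters, while the file is only ever held in memory one chunk at a time.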
By the way, if you're looking for a nice spam filter, try http://spambayes.sourceforge.net/. There is an Outlook plug-in available, which I have used quite successfully.
Erik
tornin2
Thanks Erik,
OK, that sounds right. I am not sure what you mean by a b-tree, but I will look that up. I am sure it is some type of array. I will want to keep it in memory for each use, and only reload the tree if an entry is added. I can handle the logic for that. So I guess what I should do is stream the file, looking through one character at a time, checking for a match against all entries in the b-tree, then keep going until the end of the file. That makes sense. I guess I will use the FileStream object to open the file unless it is already in a string.
One problem: if the file is 20 MB, would it crash trying to read that 20 MB into memory?
Thanks for your help...
Michael C. Gates
JoeHand
hi,
Your question is not clear. You told us the result you want, but you didn't tell us how you are searching this file for text.
best regards
JawKnee
hi,
I am not sure exactly what you should do either; there are many approaches you could follow.
First of all, you will have to load the message using a StreamReader, and you will also need the words you are searching for in an array or a similar collection.
You read the string, then iterate through the array, and you can compare strings in several ways:
you can use the String class, e.g. if (mystring.Contains("keyword")),
or you can use System.Text.RegularExpressions.Regex to find matches.
There is also an algorithm called "Levenshtein edit distance", which can be used to give you a percentage of similarity between two strings. You can google for it:
http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=AMSA,AMSA:2006-28,AMSA:en&q=C%23+Edit+Distance+algorithm
Here is an article about it:
http://www.devarticles.com/c/a/Development-Cycles/How-to-Strike-a-Match/
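For reference, a minimal sketch of Levenshtein edit distance, written in Python for brevity. The similarity helper is an assumption for illustration: dividing by the longer string's length is one common way to turn the distance into a percentage, not the only one.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Hypothetical helper: similarity as a percentage of the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("viagra", "v1agra"))   # 1
```

This catches deliberate misspellings such as "v1agra", but it compares two whole strings, so it is better suited to checking individual words than to scanning an entire message.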
hope this helps
itsmytime_24
If this is correct, then the way to do this is to prepare your search strings in a way that makes searching all of them simultaneously against each file possible. I would suggest you create a b-tree of your search phrases broken down by character (one character per tree node). You can then search your files by reading through a letter at a time, and match it up against your tree.
While searching, you can then maintain a list of possible "hits" as references into the tree, and with each new letter verify that each "hit" is still valid. This approach allows you to read each file once, and verify which if any search phrases have a hit. I think there is a design pattern for this, but I don't recall the name offhand.
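A minimal sketch of that single-pass scan, in Python for brevity. The names build_trie, find_phrases and END are made up for this example. The per-character structure is usually called a trie, and the overall approach is closely related to the Aho-Corasick algorithm, which adds failure links so that no list of live hits is needed.

```python
END = "$end"  # hypothetical marker key; longer than one character, so it never clashes with a letter


def build_trie(phrases):
    """Build a character trie: each node is a dict mapping one character to a child node."""
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[END] = phrase
    return root


def find_phrases(chars, trie):
    """Scan chars once and return the set of phrases that occur anywhere in the input."""
    found = set()
    active = []  # trie nodes for partial matches that are still alive
    for ch in chars:
        # A new match may start at any position, so the root is always a candidate.
        candidates = active + [trie]
        active = []
        for node in candidates:
            child = node.get(ch)
            if child is None:
                continue  # this partial match is no longer valid
            if END in child:
                found.add(child[END])  # a complete phrase ends here
            active.append(child)
    return found


trie = build_trie(["siteone.com", "viagra", "vitamin"])
print(sorted(find_phrases("cheap viagra at siteone.com", trie)))  # ['siteone.com', 'viagra']
```

The input is read once regardless of how many phrases are loaded; the work per character depends on how many partial matches are alive, which is bounded by the longest phrase. Matching here is case-sensitive, so lower-casing both the phrases and the input would be a sensible addition.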
Hope that helps,
Erik
Mitch5713
Sorry about that...
I want to search a file to see if there is matching criteria. So I want to take an email message and compare it with the entries in a comparison file to see if there are matches. It is for spam. So if an email message comes in, I want to check it to see if there is a match in the comparison list. So if there is a link to a certain web site, or certain criteria like "viagra", I want to know. I am not sure how to explain it really well since it is late and my brain is fried for the day.
My first idea is to just open the file into a string and loop through each entry in the criteria file to see if it is in the string. Sort of like the instr() function in VB. But I am sure that is going to be way too slow, and it wouldn't handle email files that are large.
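For comparison, that first idea looks roughly like this (a sketch in Python for brevity; naive_matches is a made-up name, and the `in` operator plays the role of instr()):

```python
def naive_matches(message, entries):
    """Return every entry that occurs somewhere in message, checking one entry at a time."""
    return [entry for entry in entries if entry in message]


entries = ["siteone.com", "sitetwo.com", "sitethree.com"]
print(naive_matches("visit sitetwo.com now", entries))  # ['sitetwo.com']
```

Each entry triggers its own scan of the message, so the cost grows with the number of entries times the message length, and the whole message has to be in memory first.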
Thanks,
Michael C. Gates