I’ve been working pretty hard on ROMscrape, but the frustrating part of that is that it’s a lot of waiting: Assess - Tweak - Run - Wait - Assess - Tweak - Run, etc.
I am happy to report, though, that my tweaks have become increasingly minor and I’ve seen fairly positive results so far with my current approach.
What I’ve figured out is that for my search to work well in PHP and on this particular server, I need to keep it’s comparisons to no more than around 3000 at a time. I can do a few of those in a row, so I can still search very large sets. The challenge lies in organizing the data set so I can pass a query first to a list of 3000 or so indexes, select one or two, then search each of those indexes. Doing this successfully means creating indexes that accurately correspond to a set of similar data, and that similarity in this regard is relevant to the types of queries I’ll be passing to it.
I don’t want to go into my whole solution, but the bottom line is that I’ve got a program running on my home computer that is generating the best set of 3000 indexes that it can come up with. By my calculation, this requires doing around 1.5 million compare-and-sort operations. That sounds like a lot, but I’ve got it optimized to the point that it can do 100 in about 15 seconds. When I left it this morning, it had 1 million left to go, so I think that means it will be done some time tomorrow.
Working on this has been really interesting, and I’ve learned a great deal about the kinds of logic that must go into big search engines like Google. Assuming they work on systems at least remotely similar to what I’ve come up with, these algorithms probably have plenty of influence from humans making value judgments, despite their apparent mechanical objectivity. I know Google has emphatically denied manipulating their indexes to “hide” certain sites or objectionable material (the infamous “jews” case being a rare exception), but even the algorithm must operate on some values and assumptions that are pre-determined by human choices.
In my case, ROMscrape’s indexer is flexible enough that I could easily make several tweaks that would dramatically change its findings. Deciding which metric for determining similarity is the most obvious one, but even deciding how to store similarity information (I’ve attempted what I think is a “fuzzy” approach, incidentally) might make a big difference down the line. For example, one scale I use arrives at a calculation along a hexadecimal scale of 0 - F, thereby retaining 16 levels of similarity for that slot. If I wanted to, I could double the digits and store that information on a scale of 0 - FF (0 - 255). This would be a finer degree of similarity, but it would cost more in terms of computation. And until I get a result set and try some searching on it, I won’t be able to tell if the two-digit method will be worth it.
At any rate, I’ve had several starts and stops (and restarts) on this, but I feel confident enough in the present indexer that I’m working in earnest on chapter 5 now. Hopefully it will turn out well and I’ll be able to post some good news soon.