years. And it led to some interesting breakthroughs on its own.

But, in any event, the issues about language... It turns out that there is a nuance. Taking English speakers as an example, we might know more or less intuitively or internally the definitions of, say, about 100,000 words. Depending on your specialty and what you do for a living at the time, technically that might be slightly larger or slightly less. But any given English-speaker may only use 11,000 or 12,000 words in any given week. And the 11,000 or 12,000 words is not static from week to week to week. It shifts.

So if we start thinking about this in terms of set theory and fuzzy set theory, which is part of the programming (and I wasn't really into the programming of it all), then you start getting into the idea of: "Well, why, from week to week to week, do some of the words within our basic set fall out and why are they replaced by others?"

And that was my premise: "Oh well, that's occurring because of something that we are picking up as human antennae, walking around vibrating on the planet and also picking up information just because we're here."

KC: But what do you mean, you weren't interested in the computer modelling side of it? Or are you saying someone else took care of that?

CH: No, no. I did all of that. I was fascinated by the math in the language and so on. I'm a programmer. That's basically where I came from. I programmed for a software company. It's like Microsoft. I wrote software for phone companies, worked on some very complex stuff, worked for GEC Marconi and very large companies, those kinds of things, almost exclusively in the software realm. Eventually it rose up to the point where I was working on algorithms and computer theory, as opposed to actual software, over the course of... I don't know how many years, 15, 20 years or something. I got to the point where the software component of it became less and I was getting down into the deep-sea secrets, if you will.

Tracking significant correlations

KC: I would say maybe the philosophical side of it began to draw you more.

CH: Sure. And basically, I developed some software that goes on out and eats large chunks of the Internet. It reads public domain stuff off forums and other areas, and sometimes strays into chat groups. It's not very deterministic and it follows links. So sometimes when we set it off, we don't really know where it's going to end up going in terms of what text it's going to eat. And that's part of the whole thrill of it all, if you will.

There's a serendipitous approach to this because we tell the software, called spiders, to sit on this server, open up this one web page and find any key words that we tell it to out of this list to start with: "If you find those words, read back a certain number of words and read forward a certain number of words, copy that, do some stuff with it, and if you find any links in there, well, have a go at it. Go follow those and do the same thing down there."

So it would go and eat some net and move and read more web pages and keep going and going and going. And I think we've got a 256 limit on how deep it can go, in links, before it has to unwind and come back and go on to the next stage. It can get some huge amounts of text out of here, on the order of, usually, about 90 million leads. And a lead is a construct that we use, where we have 2,048 bytes fore and aft, if you will, of the key word that it found. It constitutes a lead, but it also brings back the context of where it found that (in other words, if it's in a gardening forum, if it's dating, car repair, whatever) and some other information, these kinds of things.

KC: This is something Bill and I had been discussing, asking each other whether or not you actually were feeding it key words that you were looking for.

CH: That part of the process is unique and I don't want to go too deep into it because it actually is the real key to the thing, I think, and it's a trade secret. We do have a seed list of 300,000 forums to begin hunting in. But, no, it is not deterministic in the sense of data-mining where we say: "Go on out and count the number of times you run across, you know, 'tyre' or 'wall' or 'bridge' or something." It doesn't work that way.

Basically, what it's doing is this. Here's a long column of what we call context. These contexts can be thought of as the name for a larger group of words. You might give it 30,000 of these names to start with. One of them might be "forward" or "energy", and we tell it: "Okay, take the word 'energy' out of this long list of 30,000 words, go over and read the entire context that we've got associated with it, and store that in your memory." And that itself might be 30,000 or 40,000 words. "Then go over to this website and see what you can match out of that in the following manner." Make sense?

Bill Ryan (BR): What that tells me is that, instead, what you're doing is looking for significant correlations. Is that a better way of looking at it?

CH: Correct. We don't actually even look at the words, the words themselves. The spiders and so forth are in a much more deterministic software language called "C", and some Perl script. Most of the processing is done by Prolog. But the Perl script will go through and do a match
48 • NEXUS www.nexusmagazine.com DECEMBER 2008 – JANUARY 2009
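As a rough illustration of the spider and "lead" mechanics CH describes in this interview, the following Python sketch may help. It is not his software, which he calls a trade secret and says is written in C, Perl and Prolog. The only figures taken from the interview are the 2,048 bytes kept fore and aft of a key word and the 256-link depth limit. Everything else is an assumption made for illustration: the function names `extract_leads` and `crawl`, the in-memory `pages` dictionary standing in for real web pages, the case-insensitive matching, and the depth-first traversal with an explicit stack.

```python
LEAD_RADIUS = 2048  # bytes kept before and after a key word (figure from the interview)
MAX_DEPTH = 256     # how many links deep the spider may go (figure from the interview)


def extract_leads(text, keywords, context=None, radius=LEAD_RADIUS):
    """Return one lead per key word hit: the hit plus `radius` bytes on each side.

    `context` is a hypothetical label for where the text was found,
    for example "gardening forum".
    """
    data = text.encode("utf-8")
    haystack = data.lower()  # bytes.lower() only folds ASCII, so offsets stay aligned
    leads = []
    for word in keywords:
        needle = word.encode("utf-8").lower()
        if not needle:
            continue
        start = haystack.find(needle)
        while start != -1:
            lo = max(0, start - radius)
            hi = min(len(data), start + len(needle) + radius)
            leads.append({
                "keyword": word,
                "context": context,
                # A slice may cut a multi-byte character in half; drop the fragment.
                "text": data[lo:hi].decode("utf-8", errors="ignore"),
            })
            start = haystack.find(needle, start + 1)
    return leads


def crawl(pages, seed, keywords, max_depth=MAX_DEPTH):
    """Depth-first walk over `pages`, collecting leads from every page reached.

    `pages` is a hypothetical stand-in for the web:
    url -> (context_label, page_text, list_of_linked_urls).
    """
    seen = {seed}
    stack = [(seed, 0)]
    leads = []
    while stack:
        url, depth = stack.pop()
        if url not in pages:
            continue  # dead link: nothing to read
        label, text, links = pages[url]
        leads.extend(extract_leads(text, keywords, context=label))
        if depth < max_depth:  # at the limit the spider "unwinds" instead of following links
            for link in links:
                if link not in seen:
                    seen.add(link)
                    stack.append((link, depth + 1))
    return leads
```

An explicit stack is used rather than recursion so that a 256-link chain does not depend on the interpreter's recursion limit. A real spider would fetch pages over the network and parse links out of the HTML, which this sketch leaves out.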