It's an understatement to say that there's a lot of data available on
the internet. For many people, just keeping up with email is too much.
Google helps, but if you don't know something exists, you won't
know to look for it. Favorite message boards are pretty efficient, if
you can find someplace you like, and even then you gotta deal with
"local politics".
Even if you spend all of your time sifting through the data, you'll still miss stuff. There are some alternatives out there, but they mostly suck.
There's been a lot of emphasis lately on human-sorted data. Things like del.icio.us, digg, and others, where the content is submitted and rated by the users. The highest rated stories (by number of links, votes, etc.) get to the front page. This doesn't suck, but it is quite susceptible to spamming. Automate a few hundred accounts, and you can push your stuff to the top.
The same thing happens with Google, actually, and they're continually updating their algorithms to keep out various forms of "link farms", websites which exist only to link to other websites, or forum spamming, where accounts are created just to create links to various websites.
Yeah, that's why that crap keeps showing up. That's why Google will let you sign up for Google Mail, but only via cellphone - to keep out the spammers.
Spamming, in other words, is more than an email problem.
Well, we SlaveDogs don't put up with spam. We've got mean and nasty mechanisms in place to keep most of it out. One of the layers includes what's called "Bayesian filtering." This is a statistical method of determining whether something is spam or not. Essentially, it has to be "trained" by the user. We could go on and on with specifics, but that's not the point.
The point is that our mail is sorted by our criteria, not that of others. If you followed that link on filtering, then you might've noticed that it discusses using the theorem for sorting documents, not just spam. What we'd like to see is a search that uses that kind of technology. It wouldn't be impossibly difficult, but it would be quite challenging. There's not only the question of indexing and storing the results in a form conducive to this kind of search; lexical analyzers also need big hardware.
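For the curious, here's roughly what "trained by the user" could look like when it's pointed at documents instead of mail. This is only a minimal sketch of a naive Bayes classifier, not what our filters or anybody's search engine actually run. The class name InterestFilter, the "keep" and "spam" labels, the tokenizer, and the add-one smoothing are all our own assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class InterestFilter:
    """Hypothetical two-label naive Bayes filter: 'keep' vs. 'spam'."""

    LABELS = ("keep", "spam")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Record one document the user has marked as 'keep' or 'spam'."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label):
        """Log of prior times word likelihoods, with add-one smoothing."""
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        vocab = len(set(self.word_counts["keep"]) | set(self.word_counts["spam"]))
        total_docs = sum(self.doc_counts.values())
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        for tok in tokens:
            # The extra +1 in the denominator reserves a slot for unseen words.
            score += math.log((counts[tok] + 1) / (total_words + vocab + 1))
        return score

    def keep_probability(self, text):
        """Estimated probability that the user wants to read this text."""
        tokens = tokenize(text)
        diff = self._log_score(tokens, "spam") - self._log_score(tokens, "keep")
        diff = max(min(diff, 700.0), -700.0)  # keep math.exp from overflowing
        return 1.0 / (1.0 + math.exp(diff))

    def rank(self, documents):
        """Sort documents so the ones most likely to interest the user come first."""
        return sorted(documents, key=self.keep_probability, reverse=True)


if __name__ == "__main__":
    f = InterestFilter()
    f.train("picosatellite launch telemetry antenna", "keep")
    f.train("satellite orbit tracking radio", "keep")
    f.train("cheap pills buy now", "spam")
    f.train("win money casino buy now", "spam")
    for doc in f.rank(["buy cheap pills", "satellite launch"]):
        print(round(f.keep_probability(doc), 3), doc)
```

The rank method is the part that matters here: the same scores that throw out junk mail can reorder a list of search results by what this particular user has said they care about. A real system would need far more training data and a smarter tokenizer than this toy.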
We suspect that the Google toolbar will eventually include something like this. The CPU problem goes away if you move it to the client. We think it is entirely likely that they use this technology in their AdWords, given their pretty good accuracy.
To look at it another way, everything we're not interested in reading is "spam" - not just the stuff that comes through email, or posted without permission, or link farmed, or whatever. Given the sheer quantity of data available, and the increasing importance of sifting it, it only makes sense that we should use more technology to work on the problem. We're big fans of human brainpower, and attempting to derive results from given behavior, but the results have been less than satisfactory.
When is someone gonna put all those cycles to use to actually help us, instead of finding ways of generating more crap?