[dspam-users] problems detecting spam containing certain characters in the tokens?

From: Aaron Wolfe <aawolfe@gmail.com>
Date: Thu Dec 01 2005 - 09:47:36 EST

Hello,

I'm running dspam 3.6 with a MySQL 4.1 backend. I have one user who
receives a large amount of non-English spam, full of characters that
I am not sure my PC can even represent correctly. They look like this
to me:

网站建设 - 专业快速客户过1000家
or
*应对海关稽查及进出口关务技巧*
or
无效的肾病患者速来退款
etc...

dspam is very accurate for most of my users (and for some types of
spam that this user gets), but despite constant training on 20-30
messages like this a day for several weeks, dspam seems to miss a lot
of the spam containing these "weird" characters. dspam probably
catches over 70% of them, but enough get through to annoy the user.

My question is: could it be that either dspam is interpreting or MySQL
is storing these characters incorrectly, or in a way that decreases
the algorithm's effectiveness? Maybe one of the underlying libraries
on my system is at fault? If so, could I improve accuracy by changing
a field type, library, etc.? I'm a little lost on how to research
this, so any thoughts are appreciated.
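
To make the storage concern concrete, here is a rough Python sketch of
the two failure modes I have in mind. It is only an illustration under
assumptions: it does not touch dspam or MySQL, the function names
lossy_store and misread are made up for this example, and it assumes
the text is UTF-8 (the actual spam may well use another encoding such
as GB2312).

```python
# Illustration only: these helpers are hypothetical and do not reflect
# how dspam or MySQL actually handle tokens.

def lossy_store(token, charset="latin-1"):
    """Mimic a conversion into a charset that cannot represent the
    characters: every unencodable character is replaced with '?'."""
    return token.encode(charset, errors="replace").decode(charset)


def misread(token):
    """Mimic UTF-8 bytes being read back as if they were latin-1.
    The result looks like garbage, but no information is lost."""
    return token.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    a = "网站建设"
    b = "专业快速"

    # Lossy conversion: two different tokens become indistinguishable,
    # which would leave a statistical filter nothing to learn from.
    print(lossy_store(a) == lossy_store(b))

    # Misinterpretation alone: tokens look wrong but stay distinct,
    # and the original text can still be recovered.
    print(misread(a) != misread(b))
    print(misread(a).encode("latin-1").decode("utf-8") == a)
```

If something like the first case is happening anywhere between the
message and the token table, I would expect training to have little
effect on these messages; the second case looks ugly but should still
train consistently.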

-Aaron
Received on Thu Dec 1 09:49:29 2005
