Hi list, hi Jonathan,
since I couldn't believe that dspam performs so poorly here, I did some tests and
it seems that dspam 3.6.1 doesn't learn correctly. The tests were done on a
Gentoo Linux system (kernel 2.6.10 SMP, gcc 3.4.4 + spp-3.4.4-1.0 +
pie-8.7.8, glibc-2.3.5 with nptl). I disabled the MySQL backend and used the
db4 backend to make sure MySQL isn't the problem here. I slightly modified
the train.pl script (changed REPORTING_WINDOW to 9999 so that it reports a
complete summary at the end and replaced --deliver=stdout with
--deliver=innocent,spam --stdout so that it works with dspam 3.4.9). Then I
trained dspam by using train.pl with 499 spams from 20050311_spam_2.tar.bz2
and 499 hams from 20030228_easy_ham_2.tar.bz2 of the SpamAssassin publiccorpus
archive. The results are confusing:
First test using dspam 3.4.9:
~ # dspam --version
DSPAM Anti-Spam Suite 3.4.9 (agent/library)
... text stripped ...
Configuration parameters: --prefix=/usr --host=i686-pc-linux-gnu
--mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share
--sysconfdir=/etc --localstatedir=/var/lib --build=i686-pc-linux-gnu
--enable-long-username --enable-domain-scale
--with-dspam-home=/etc/mail/dspam --with-dspam-mode=4755
--with-dspam-owner=dspam --with-dspam-group=dspam
--sysconfdir=/etc/mail/dspam --with-logdir=/var/log/dspam
--with-storage-driver=libdb4_drv
~ # ./train.pl andy publiccorpus
Training publiccorpus/nonspam / publiccorpus/spam corpora...
... FP/SM messages stripped ...
Spam Correct : 488
Spam Missed : 9
Nonspam Correct: 497
Nonspam Missed : 0
andy TS: 488 TI: 497 SM: 9 IM: 0 SC: 0 IC: 0
SR: 98.19% IR: 100.00% OR: 99.09%
Looks pretty good... The 9 missed spams occurred very early, so I'd say that
after 50 spams, dspam was able to detect 100% of the spam.
Second test using dspam 3.6.1:
~ # dspam --version
DSPAM Anti-Spam Suite 3.6.1 (agent/library)
... text stripped ...
Configuration parameters: --prefix=/usr --host=i686-pc-linux-gnu
--mandir=/usr/share/man --infodir=/usr/share/info --datadir=/usr/share
--sysconfdir=/etc --localstatedir=/var/lib --build=i686-pc-linux-gnu
--enable-long-username --with-delivery-agent=/usr/bin/procmail
--enable-domain-scale --with-dspam-home=/var/spool/dspam
--sysconfdir=/etc/mail/dspam --with-storage-driver=libdb4_drv
~ # ./train.pl andy publiccorpus
Training publiccorpus/nonspam / publiccorpus/spam corpora...
... FP/SM messages stripped ...
Spam Correct : 0
Spam Missed : 497
Nonspam Correct: 497
Nonspam Missed : 0
andy TP: 0 TN: 497 FP: 0 FN: 497 SC: 0 IC: 0
SR: 0.00% IR: 100.00% OR: 50.00%
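For reference, the SR/IR/OR lines in both summaries are consistent with simple ratios over the reported counts. The sketch below is not taken from train.pl; it assumes SR is spam caught over all spam, IR is nonspam passed over all nonspam, and OR is all correct decisions over all messages. The function name `rates` is hypothetical.

```python
def rates(spam_correct, spam_missed, ham_correct, ham_missed):
    """Return (SR, IR, OR) as percentages rounded to two decimals.

    Assumed definitions (not confirmed against train.pl):
      SR = spam correct / all spam
      IR = nonspam correct / all nonspam
      OR = all correct / all messages
    """
    spam_total = spam_correct + spam_missed
    ham_total = ham_correct + ham_missed
    sr = 100.0 * spam_correct / spam_total
    ir = 100.0 * ham_correct / ham_total
    overall = 100.0 * (spam_correct + ham_correct) / (spam_total + ham_total)
    return round(sr, 2), round(ir, 2), round(overall, 2)


# Counts from the 3.4.9 run, then from the 3.6.1 run.
print(rates(488, 9, 497, 0))   # (98.19, 100.0, 99.09)
print(rates(0, 497, 497, 0))   # (0.0, 100.0, 50.0)
```

Under these assumed definitions, the 50.00% overall rate in the second run is simply what results when every message is classified as nonspam on a corpus that is half spam.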
What's up here? Everything was classified as non-spam (just like on my live
dspam installation). Both tests were run with the same train.pl script, the same
spam/nonspam messages and the same settings in dspam.conf (TrainingMode teft,
Algorithm graham burton, PValue graham), and on the same system with the same
libraries etc. The only difference is the dspam version. Of course, I
also made sure that all dictionary data was properly wiped between tests so
that every test began with a completely empty dictionary.
Any ideas?
regards,
Andreas Neuhaus