Re: [dspam-users] Very bad accuracy with 'markov' algorithm in 3.6.0

From: Guillaume Laurent <glaurent@telegraph-road.org>
Date: Tue Nov 01 2005 - 05:54:37 EST

(Sorry to get back to this so late.)

On Sunday 23 October 2005 14:52, Jonathan Zdziarski wrote:
> are you saving it with the signature, or the original message?

With the signature. There was indeed a flaw in my retraining: after
retraining, I was re-checking with the processed message (i.e. with the dspam
headers and signature) rather than with the original one, hence the
conflicting results.

Nevertheless, I can confirm that retraining with Markov *does* sometimes take
several tries to pass through. I rewrote your train.pl in Ruby, and enhanced
the retraining of errors so it would re-process the original message and
check if the retraining changes the classification. In most cases, it does,
but there definitely are some where you have to retrain several times before
dspam "gets it".

I'm attaching my script for reference; see the retrain_error method. It takes:
- a filename pointing to the original message (which may contain some dspam
headers, but these are filtered out by formail)
- the processed message (an array of strings)
- the expected class of the message (spam or innocent)

It retrains the processed message, then reprocesses the original message,
checks whether the dspam result matches the expected one, and loops until the
expected result is obtained or a maximum number of retries is reached.
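
The loop described above can be sketched roughly as follows. This is an
illustration in Python, not the attached Ruby script (which is not reproduced
here). The names retrain_until_correct, retrain, classify and max_retries are
assumptions made for the sketch: retrain and classify are hypothetical
callables standing in for whatever actually invokes dspam, and the default of
5 retries is arbitrary.

```python
from typing import Callable, Sequence


def retrain_until_correct(
    original_path: str,
    processed_msg: Sequence[str],
    expected: str,
    retrain: Callable[[Sequence[str], str], None],
    classify: Callable[[str], str],
    max_retries: int = 5,
) -> int:
    """Retrain until the original message gets the expected class.

    retrain(processed_msg, expected) is a hypothetical hook that feeds the
    processed message (with dspam headers and signature) back as an error.
    classify(original_path) is a hypothetical hook that re-processes the
    original message and returns its class, e.g. "spam" or "innocent".

    Returns the number of retrainings that were needed, or -1 if the
    expected class was still not reached after max_retries attempts.
    """
    for attempt in range(1, max_retries + 1):
        # Retrain with the processed message, which carries the signature.
        retrain(processed_msg, expected)
        # Re-check with the original message, not the processed one.
        if classify(original_path) == expected:
            return attempt
    return -1
```

Re-checking against the original file rather than the processed message is
the point of the design: checking the processed copy is what produced the
conflicting results mentioned earlier.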

> we need to see some debug output to see what's going on.

Here's the relevant part in dspam/system.log for a message which took 2
retrainings before being classified correctly:

1130840228 S "Service Téléspectateur ARTE"
<Telespectateurs@arte-tv.com> 436740a4216567818312239 William Karel
revient sur ARTE 0.061850 glaurent Quarantined
<21664745.1128539965070.JavaMail.SYSTEM@cubitus>
1130840228 F "Service Téléspectateur ARTE"
<Telespectateurs@arte-tv.com> 436740a4216567818312239 William Karel
revient sur ARTE 0.081029 glaurent Retrained
<21664745.1128539965070.JavaMail.SYSTEM@cubitus>
1130840228 S "Service Téléspectateur ARTE"
<Telespectateurs@arte-tv.com> 436740a4216601804284693 William Karel
revient sur ARTE 0.061997 glaurent Quarantined
<21664745.1128539965070.JavaMail.SYSTEM@cubitus>
1130840228 F "Service Téléspectateur ARTE"
<Telespectateurs@arte-tv.com> 436740a4216567818312239 William Karel
revient sur ARTE 0.082768 glaurent Retrained
<21664745.1128539965070.JavaMail.SYSTEM@cubitus>
1130840228 S "Service Téléspectateur ARTE"
<Telespectateurs@arte-tv.com> 436740a4216641192918619 William Karel
revient sur ARTE 0.060288 glaurent Quarantined
<21664745.1128539965070.JavaMail.SYSTEM@cubitus>
1130840228 F "Service Téléspectateur ARTE"
<Telespectateurs@arte-tv.com> 436740a4216567818312239 William Karel
revient sur ARTE 0.080868 glaurent Retrained
<21664745.1128539965070.JavaMail.SYSTEM@cubitus>
1130840228 I "Service Téléspectateur ARTE"
<Telespectateurs@arte-tv.com> 436740a4216681101020035 William Karel
revient sur ARTE 0.057561 glaurent Delivered
<21664745.1128539965070.JavaMail.SYSTEM@cubitus>
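
To read the sequence of events out of an excerpt like the one above, a small
sketch along these lines may help. It is written in Python and rests on
assumptions drawn only from how the records look in this mail: each record
starts with a 10-digit timestamp followed by a one-letter code, records are
wrapped over several lines, and the status word is the second-to-last
whitespace-separated token (just before the Message-ID). The names
RECORD_START and summarize are made up for the example.

```python
import re

# Assumption: a record starts with a 10-digit timestamp and a one-letter code.
RECORD_START = re.compile(r"^\d{10} [A-Z] ")


def summarize(lines):
    """Return a list of (code, status) pairs, one per wrapped log record."""
    records, current = [], []
    for line in lines:
        # A new timestamp line closes the record collected so far.
        if RECORD_START.match(line) and current:
            records.append(" ".join(current))
            current = []
        if line.strip():
            current.append(line.strip())
    if current:
        records.append(" ".join(current))

    summary = []
    for rec in records:
        tokens = rec.split()
        # tokens[1] is the one-letter code; tokens[-2] is the status word,
        # assuming the Message-ID is the last token of the record.
        summary.append((tokens[1], tokens[-2]))
    return summary
```

Fed the lines of the excerpt, this gives one (code, status) pair per record
in log order, which makes it easy to count how many retrainings preceded the
final result.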

-- 
Guillaume.
http://www.telegraph-road.org

Received on Tue Nov 1 05:56:21 2005
