Feeds ham OR spam messages to train the {@link BayesianAnalysis} mailet.
The new token frequencies will be stored in a JDBC database.
Sample configuration:
<processor name="root">
<mailet match="RecipientIs=not.spam@thisdomain.com" class="BayesianAnalysisFeeder">
<repositoryPath> db://maildb </repositoryPath>
<feedType>ham</feedType>
<!--
Set this to the maximum message size (in bytes) that a message may have
to be analyzed (default is 100000).
-->
<maxSize>100000</maxSize>
</mailet>
<mailet match="RecipientIs=spam@thisdomain.com" class="BayesianAnalysisFeeder">
<repositoryPath> db://maildb </repositoryPath>
<feedType>spam</feedType>
<!--
Set this to the maximum message size (in bytes) that a message may have
to be analyzed (default is 100000).
-->
<maxSize>100000</maxSize>
</mailet>
</processor>
With this configuration, the recipient email address indicates whether a message
sent to the server is ham or spam.
Send good messages (ham) to the email address "not.spam@thisdomain.com",
and send spam messages to the email address "spam@thisdomain.com",
to pump them into the corresponding feeder.
During training, the Bayesian database tables are updated to reflect
the new data.
Once processing is complete, the mail is destroyed (ghosted).
The correct approach is to send the original ham/spam message as an attachment
to another message sent to the feeder; all the headers of the enveloping message
will be removed and only the original message's tokens will be analyzed.
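As an illustration of the attachment approach, here is a minimal sketch that builds the raw text of such an enveloping message using only the JDK. It assumes the original is attached as a message/rfc822 part, which is how mail clients commonly implement "forward as attachment". The class TrainingEnvelope, the method wrap, and the subject line are hypothetical and not part of the mailet.

```java
import java.util.UUID;

/** Hypothetical helper: wraps an original message as a message/rfc822 attachment. */
final class TrainingEnvelope {
    private static final String CRLF = "\r\n";

    /**
     * Builds the raw text of a multipart/mixed message addressed to the feeder,
     * carrying the original message (headers and body) as its only part.
     */
    static String wrap(String from, String feederAddress, String originalMessage) {
        // Pick a random boundary; retry in the unlikely case it occurs in the payload.
        String boundary;
        do {
            boundary = "feeder-" + UUID.randomUUID();
        } while (originalMessage.contains(boundary));

        return "From: " + from + CRLF
            + "To: " + feederAddress + CRLF
            + "Subject: training sample" + CRLF
            + "MIME-Version: 1.0" + CRLF
            + "Content-Type: multipart/mixed; boundary=\"" + boundary + "\"" + CRLF
            + CRLF
            + "--" + boundary + CRLF
            + "Content-Type: message/rfc822" + CRLF
            + "Content-Disposition: attachment" + CRLF
            + CRLF
            + originalMessage + CRLF
            + "--" + boundary + "--" + CRLF;
    }
}
```

The returned text could then be handed to any SMTP client, with the recipient set to "spam@thisdomain.com" or "not.spam@thisdomain.com" as in the sample configuration.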
After a training session, the frequency Corpus used by BayesianAnalysis
must be rebuilt from the database, in order to take advantage of the new token frequencies.
Every 10 minutes a special thread in the BayesianAnalysis mailet will check if any
change was made to the database, and rebuild the corpus if necessary.
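The periodic check can be pictured with the following sketch. It is not the BayesianAnalysis implementation: the class CorpusReloader, the idea of comparing a "last update" value read from the database, and the use of a ScheduledExecutorService are all assumptions made for illustration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a periodic "rebuild only if changed" check. */
final class CorpusReloader {
    private final LongSupplier lastDatabaseUpdate; // e.g. a timestamp read from the database
    private final Runnable rebuildCorpus;          // reloads the token frequencies
    private long lastSeenUpdate;

    CorpusReloader(LongSupplier lastDatabaseUpdate, Runnable rebuildCorpus) {
        this.lastDatabaseUpdate = lastDatabaseUpdate;
        this.rebuildCorpus = rebuildCorpus;
        this.lastSeenUpdate = lastDatabaseUpdate.getAsLong();
    }

    /** Rebuilds the corpus only when the database changed since the last check. */
    synchronized boolean checkOnce() {
        long current = lastDatabaseUpdate.getAsLong();
        if (current == lastSeenUpdate) {
            return false; // nothing new, keep the corpus in memory as it is
        }
        rebuildCorpus.run();
        lastSeenUpdate = current;
        return true;
    }

    /** Runs checkOnce() repeatedly on a daemon thread, with a fixed delay between runs. */
    ScheduledExecutorService start(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "corpus-reloader");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleWithFixedDelay(this::checkOnce, period, period, unit);
        return scheduler;
    }
}
```

With this sketch, calling start(10, TimeUnit.MINUTES) would mirror the 10-minute interval described above.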
Only one message at a time is scanned (the database update activity is synchronized)
in order to limit database locking,
as thousands of rows may be updated for a single message fed.
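The one-message-at-a-time behaviour can be sketched as follows. This is an in-memory stand-in rather than the mailet's JDBC code: the class TokenStore, its maps, and the whitespace tokenizer are hypothetical and only illustrate how a synchronized update serializes feeding.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the ham and spam token tables. */
final class TokenStore {
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Map<String, Integer> spamCounts = new HashMap<>();

    /**
     * Adds one message's tokens to the ham or spam counts. The method is
     * synchronized, so concurrent callers are served one message at a time.
     */
    synchronized void feed(String messageText, boolean spam) {
        Map<String, Integer> counts = spam ? spamCounts : hamCounts;
        // Naive whitespace tokenizer, for illustration only.
        for (String token : messageText.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    /** Returns how often a token has been seen in ham or in spam so far. */
    synchronized int count(String token, boolean spam) {
        return (spam ? spamCounts : hamCounts).getOrDefault(token, 0);
    }
}
```

Because a single message can touch one entry per distinct token, holding one lock for the whole message keeps each update consistent, at the cost of feeding throughput.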