File: BayesianAnalysisFeeder.java
Doc: API Doc
Category: Apache James 2.3.1
Size: 11789
Date: Fri Jan 12 12:56:28 GMT 2007
Package: org.apache.james.transport.mailets

BayesianAnalysisFeeder

public class BayesianAnalysisFeeder extends org.apache.mailet.GenericMailet

Feeds ham or spam messages to train the BayesianAnalysis mailet.

The new token frequencies will be stored in a JDBC database.

Sample configuration:


<processor name="root">

<mailet match="RecipientIs=not.spam@thisdomain.com" class="BayesianAnalysisFeeder">
<repositoryPath> db://maildb </repositoryPath>
<feedType>ham</feedType>
<!--
Set this to the maximum message size (in bytes) that a message may have
to be analyzed (default is 100000).
-->
<maxSize>100000</maxSize>
</mailet>

<mailet match="RecipientIs=spam@thisdomain.com" class="BayesianAnalysisFeeder">
<repositoryPath> db://maildb </repositoryPath>
<feedType>spam</feedType>
<!--
Set this to the maximum message size (in bytes) that a message may have
to be analyzed (default is 100000).
-->
<maxSize>100000</maxSize>
</mailet>

</processor>

With this configuration, users train the filter by sending messages to the server; the recipient address indicates whether a message is ham or spam.

Using the example above, send good messages (ham) to "not.spam@thisdomain.com" and send spam messages to "spam@thisdomain.com". Each address routes its messages to the feeder configured with the matching feedType.

The Bayesian database tables are updated during training to reflect the new data.
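
As a rough illustration of what updating token frequencies means, the sketch below counts the tokens of one message into in-memory maps. The class name `TokenCounts`, the `\W+` tokenizer, and the use of maps in place of database tables are assumptions made for illustration only; the real tokenizing and JDBC storage are done by BayesianAnalyzer and JDBCBayesianAnalyzer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the ham and spam token tables.
class TokenCounts {
    final Map<String, Integer> ham = new HashMap<>();
    final Map<String, Integer> spam = new HashMap<>();

    // Adds one occurrence per token to the ham or spam counts.
    void feed(String text, boolean isSpam) {
        Map<String, Integer> target = isSpam ? spam : ham;
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty()) {
                target.merge(token, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        TokenCounts counts = new TokenCounts();
        counts.feed("cheap pills cheap offer", true);
        counts.feed("meeting notes for the offer", false);
        System.out.println("spam: " + counts.spam);
        System.out.println("ham: " + counts.ham);
    }
}
```

Even a short message touches one counter per distinct token, which is why a single fed message can update many rows.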

At the end the mail will be destroyed (ghosted).

The correct approach is to send the original ham/spam message as an attachment to another message sent to the feeder; all the headers of the enveloping message will be removed and only the original message's tokens will be analyzed.

After a training session, the frequency Corpus used by BayesianAnalysis must be rebuilt from the database, in order to take advantage of the new token frequencies. Every 10 minutes a special thread in the BayesianAnalysis mailet will check if any change was made to the database, and rebuild the corpus if necessary.
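
The periodic check described above can be sketched as a comparison of two timestamps. This is a minimal sketch, not the BayesianAnalysis implementation: the class `CorpusReloader`, its fields, and the use of a `ScheduledExecutorService` in place of the mailet's own thread are assumptions; only the 10-minute interval and the idea of recording the last database update time come from the text.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical reloader: rebuilds the corpus only when training changed the database.
class CorpusReloader {
    private final AtomicLong lastDatabaseUpdate = new AtomicLong(0);
    private long lastCorpusLoad = 0;
    private int rebuilds = 0;

    // Called by the feeder after a successful commit.
    void touchLastDatabaseUpdateTime(long now) {
        lastDatabaseUpdate.set(now);
    }

    // Rebuilds if the database was updated after the last corpus load.
    synchronized boolean checkAndRebuild(long now) {
        if (lastDatabaseUpdate.get() <= lastCorpusLoad) {
            return false; // nothing new since the last load
        }
        rebuilds++;           // stand-in for reloading token frequencies
        lastCorpusLoad = now;
        return true;
    }

    synchronized int rebuildCount() {
        return rebuilds;
    }

    // Schedules the check every 10 minutes on the given scheduler.
    ScheduledFuture<?> start(ScheduledExecutorService scheduler) {
        return scheduler.scheduleWithFixedDelay(
                () -> checkAndRebuild(System.currentTimeMillis()),
                10, 10, TimeUnit.MINUTES);
    }
}
```

Passing the clock value in as a parameter keeps the check deterministic, so it can be exercised without waiting for the timer.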

Only one message at a time is scanned (the database update activity is synchronized) in order to avoid too much database locking, as thousands of rows may be updated just for one message fed.
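
The serialization described above relies on every feeder instance synchronizing on the same lock object. The following sketch shows that pattern in isolation; `SerializedFeeder` and its counters are hypothetical, and the `DATABASE_LOCK` field here only plays the role of `JDBCBayesianAnalyzer.DATABASE_LOCK`.

```java
// Hypothetical feeder that lets only one "database update" run at a time.
class SerializedFeeder {
    // Shared by all instances, like JDBCBayesianAnalyzer.DATABASE_LOCK.
    static final Object DATABASE_LOCK = new Object();

    // Counters are only touched while holding DATABASE_LOCK.
    private int active = 0;
    private int maxActive = 0;
    private int fed = 0;

    void feed(String message) {
        synchronized (DATABASE_LOCK) {
            active++;
            maxActive = Math.max(maxActive, active);
            fed++; // stand-in for the row updates of one message
            active--;
        }
    }

    int fedCount() {
        synchronized (DATABASE_LOCK) {
            return fed;
        }
    }

    int maxConcurrentUpdates() {
        synchronized (DATABASE_LOCK) {
            return maxActive;
        }
    }

    // Feeds messages from several threads and waits for all of them.
    void feedConcurrently(int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    feed("message");
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }
}
```

The trade-off is throughput: training is serialized across the whole JVM, which is acceptable because feeding is an occasional activity rather than part of normal mail flow.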

See Also: BayesianAnalysis, org.apache.james.util.BayesianAnalyzer, org.apache.james.util.JDBCBayesianAnalyzer
Version: CVS $Revision: $ $Date: $
Since: 2.3.0

Fields Summary
private final org.apache.james.util.JDBCUtil theJDBCUtil
    The JDBCUtil helper class.
private org.apache.james.util.JDBCBayesianAnalyzer analyzer
    The JDBCBayesianAnalyzer class that does all the work.
private org.apache.avalon.excalibur.datasource.DataSourceComponent datasource
private String repositoryPath
private String feedType
private int maxSize
    Holds value of property maxSize.

Constructors Summary

Methods Summary
private void clearAllHeaders(javax.mail.internet.MimeMessage message)

        Enumeration headers = message.getAllHeaders();
        
        while (headers.hasMoreElements()) {
            Header header = (Header) headers.nextElement();
            try {
                message.removeHeader(header.getName());
            } catch (javax.mail.MessagingException me) {
                // Best effort: skip a header that cannot be removed.
            }
        }
        message.saveChanges();
    
public java.lang.String getMailetInfo()
Return a string describing this mailet.

Returns: a string describing this mailet

        return "BayesianAnalysisFeeder Mailet";
    
public int getMaxSize()
Getter for property maxSize.

Returns: value of property maxSize.

        return this.maxSize;
    
public void init()
Mailet initialization routine.

Throws: MessagingException if a problem arises

        repositoryPath = getInitParameter("repositoryPath");
        
        if (repositoryPath == null) {
            throw new MessagingException("repositoryPath is null");
        }
        
        feedType = getInitParameter("feedType");
        if (feedType == null) {
            throw new MessagingException("feedType is null");
        }
        
        String maxSizeParam = getInitParameter("maxSize");
        if (maxSizeParam != null) {
            setMaxSize(Integer.parseInt(maxSizeParam));
        }
        log("maxSize: " + getMaxSize());
        
        initDb();
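
Note that init() hands the raw maxSize parameter straight to Integer.parseInt, so a malformed value raises an unchecked NumberFormatException instead of the declared MessagingException. A more tolerant variant might look like the sketch below. The class `MaxSizeParam` is hypothetical, and falling back to the documented default of 100000 on bad or non-positive input is an assumption of this sketch, not the mailet's behavior.

```java
// Hypothetical helper: parses the maxSize init parameter with a fallback.
class MaxSizeParam {
    // Default taken from the sample configuration comment.
    static final int DEFAULT_MAX_SIZE = 100000;

    static int parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_MAX_SIZE;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            // Assumption: a non-positive limit is treated as a configuration error.
            return value > 0 ? value : DEFAULT_MAX_SIZE;
        } catch (NumberFormatException e) {
            return DEFAULT_MAX_SIZE;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(" 250000 "));
        System.out.println(parse("not-a-number"));
    }
}
```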
        
    
private void initDb()

        
        try {
            ServiceManager serviceManager = (ServiceManager) getMailetContext().getAttribute(Constants.AVALON_COMPONENT_MANAGER);
            
            // Get the DataSourceSelector block
            DataSourceSelector datasources = (DataSourceSelector) serviceManager.lookup(DataSourceSelector.ROLE);
            
            // Get the data-source required.
            int stindex =   repositoryPath.indexOf("://") + 3;
            
            String datasourceName = repositoryPath.substring(stindex);
            
            datasource = (DataSourceComponent) datasources.select(datasourceName);
        } catch (Exception e) {
            throw new MessagingException("Can't get datasource", e);
        }
        
        try {
            analyzer.initSqlQueries(datasource.getConnection(), getMailetContext());
        } catch (Exception e) {
            throw new MessagingException("Exception initializing queries", e);
        }        
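
initDb() derives the datasource name by taking everything after "://" in repositoryPath, so "db://maildb" selects the datasource "maildb". If the separator is missing, indexOf returns -1 and the substring silently starts at index 2. The sketch below isolates that parsing step and rejects such input; the class `RepositoryPaths`, the trimming of surrounding whitespace, and the exceptions thrown are assumptions for illustration.

```java
// Hypothetical helper: extracts the datasource name from a path like "db://maildb".
class RepositoryPaths {
    static String datasourceName(String repositoryPath) {
        String path = repositoryPath.trim();
        int separator = path.indexOf("://");
        if (separator < 0) {
            throw new IllegalArgumentException("Missing '://' in repositoryPath: " + path);
        }
        String name = path.substring(separator + 3);
        if (name.isEmpty()) {
            throw new IllegalArgumentException("Empty datasource name in repositoryPath: " + path);
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(datasourceName(" db://maildb "));
    }
}
```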
        
    
public void service(org.apache.mailet.Mail mail)
Scans the mail and updates the token frequencies in the database. The method is synchronized in order to avoid too much database locking, as thousands of rows may be updated just for one message fed.

Parameters: mail - The Mail message to be scanned.

        boolean dbUpdated = false;
        
        mail.setState(Mail.GHOST);
        
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        
        Connection conn = null;
        
        try {
            
            MimeMessage message = mail.getMessage();
            
            String messageId = message.getMessageID();
            
            if (message.getSize() > getMaxSize()) {
                log(messageId + " Feeding HAM/SPAM ignored because message size > " + getMaxSize() + ": " + message.getSize());
                return;
            }
            
            clearAllHeaders(message);
            
            message.writeTo(baos);
            
            BufferedReader br = new BufferedReader(new StringReader(baos.toString()));
                
            // this is synchronized to avoid concurrent update of the corpus
            synchronized(JDBCBayesianAnalyzer.DATABASE_LOCK) {
                
                conn = datasource.getConnection();
                
                if (conn.getAutoCommit()) {
                    conn.setAutoCommit(false);
                }
                
                dbUpdated = true;
                
                //Clear out any existing word/counts etc..
                analyzer.clear();
                
                if ("ham".equalsIgnoreCase(feedType)) {
                    log(messageId + " Feeding HAM");
                    //Process the stream as ham (not spam).
                    analyzer.addHam(br);
                    
                    //Update storage statistics.
                    analyzer.updateHamTokens(conn);
                } else {
                    log(messageId + " Feeding SPAM");
                    //Process the stream as spam.
                    analyzer.addSpam(br);
                    
                    //Update storage statistics.
                    analyzer.updateSpamTokens(conn);
                }
                
                //Commit our changes if necessary.
                if (conn != null && dbUpdated && !conn.getAutoCommit()) {
                    conn.commit();
                    dbUpdated = false;
                    log(messageId + " Training ended successfully");
                    JDBCBayesianAnalyzer.touchLastDatabaseUpdateTime();
                }
                
            }
            
        } catch (java.sql.SQLException se) {
            log("SQLException: "
                    + se.getMessage());
        } catch (java.io.IOException ioe) {
            log("IOException: "
                    + ioe.getMessage());
        } catch (javax.mail.MessagingException me) {
            log("MessagingException: "
                    + me.getMessage());
        } finally {
            //Rollback our changes if necessary.
            try {
                if (conn != null && dbUpdated && !conn.getAutoCommit()) {
                    conn.rollback();
                    dbUpdated = false;
                }
            } catch (Exception e) {}
            theJDBCUtil.closeJDBCConnection(conn);
        }
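
The commit-or-rollback bookkeeping in service() can be reduced to the following pattern: disable auto-commit, mark the transaction as pending, and roll back in a finally block if the commit was never reached. This is a simplified sketch rather than the mailet's code. The `Update` callback and the recording proxy are hypothetical helpers added so the example runs without a database; only `setAutoCommit`, `commit`, and `rollback` on `java.sql.Connection` are real API calls.

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class TransactionSketch {
    // Hypothetical callback standing in for calls such as analyzer.updateHamTokens(conn).
    interface Update {
        void apply(Connection conn) throws SQLException;
    }

    // Runs the update in one transaction; returns true if it was committed.
    static boolean runInTransaction(Connection conn, Update update) {
        boolean pending = false;
        try {
            conn.setAutoCommit(false);
            pending = true;
            update.apply(conn);
            conn.commit();
            pending = false;
            return true;
        } catch (SQLException e) {
            return false; // the mailet logs the exception message at this point
        } finally {
            if (pending) {
                try {
                    conn.rollback();
                } catch (SQLException ignored) {
                    // nothing more can be done with this connection
                }
            }
        }
    }

    // Test double: a Connection proxy that only records which methods were called.
    static Connection recordingConnection(List<String> calls) {
        return (Connection) Proxy.newProxyInstance(
                TransactionSketch.class.getClassLoader(),
                new Class<?>[] {Connection.class},
                (proxy, method, args) -> {
                    calls.add(method.getName());
                    return method.getReturnType() == boolean.class ? Boolean.FALSE : null;
                });
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        runInTransaction(recordingConnection(calls), conn -> {
            throw new SQLException("simulated failure");
        });
        System.out.println(calls);
    }
}
```

Tracking a `pending` flag, as the mailet does with `dbUpdated`, keeps the finally block from rolling back a transaction that has already been committed.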
    
public void setMaxSize(int maxSize)
Setter for property maxSize.

Parameters: maxSize - New value of property maxSize.


        this.maxSize = maxSize;