Gmail Spam Analyzer

4tm.biz , php , spam , gmail
Jose Luis Canciani (josecanciani at Twitter)

[Image: gmailspamanalitics.jpg]

Too many email addresses redirecting to your Gmail account? Watching your Spam folder get bigger (and more boring) every time? That was my situation.

So I decided to write some scripts (thanks to libgmailer) and now I can spend my time reading a newspaper or going out ;)

Over the last few years I've collected a lot of email addresses: one for each company I've worked for, one for MSN, one for Gmail, one for Yahoo, and the list goes on. I finally decided to forward all my email addresses to a single Gmail box. There's something about the conversation view that no other client has yet... Of course, this move had the side effect of multiplying the spam I receive in the Gmail box. The spam filter does a good job, but I think some reports can be useful.

Several years ago I came up with the idea of using a domain name I own to create a separate email address for each website I sign up to and don't trust much. So every address I create looks something like thesiteurl@mydomain.com. I have all the domain's addresses redirected to my account. When I see that one of those addresses is receiving too much spam, I simply block it and that's all.
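
As a small illustration of that naming scheme, here is a minimal sketch that derives the alias from a site URL. The function name alias_for_site and the rule of dropping a leading "www." are my own assumptions, not part of the original scripts, and it presumes the domain forwards every address to one inbox.

```php
<?php
// Hypothetical helper: build a per-site alias such as "example.com@mydomain.com".
// Assumes the domain is set up to forward every address to a single inbox.
function alias_for_site(string $siteUrl, string $domain): string
{
    // parse_url() needs a scheme to find the host, so add one if it is missing
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $siteUrl)) {
        $siteUrl = 'http://' . $siteUrl;
    }
    $host = strtolower((string) parse_url($siteUrl, PHP_URL_HOST));
    // Drop a leading "www." so www.example.com and example.com share one alias
    $host = preg_replace('/^www\./', '', $host);
    return $host . '@' . $domain;
}

echo alias_for_site('https://www.example.com/signup', 'mydomain.com'), "\n";
```

With an alias like this, the recipient address stored for each spam message says which site leaked or sold the address.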

Anyway, it is still a pain in the a** to go through my spam folder every day to check this stuff. So I decided that a script should be doing it for me.

Basically, the script connects to Gmail, reads the spam folder, and saves certain data from each email to a database table, which I can later review from a web page with some cool effects. On the right you can see a screenshot of the reports. I've added some graphics to make it a little more interesting ;)

The script is written in PHP. It uses libgmailer (from the gmail-lite project). It scans the folder for unread conversations, and when it finds one it reads each email in it and saves the recipient, the sender and the subject for later analysis. The data is inserted into a database table using ADOdb.

Let's see a bit of the collect.php code.

This script should be run periodically from a cron job (every hour should be fine). First, we have to create the database table where the script will save the data it collects:

 

CREATE TABLE spam_occurance (
  message_id VARCHAR(30) NOT NULL ,
  ts INT(11) NOT NULL ,
  recv_email VARCHAR(60) NOT NULL ,
  subj VARCHAR(200) NOT NULL ,
  from_email VARCHAR(60) NOT NULL ,
  PRIMARY KEY (message_id),
  INDEX (recv_email,ts),
  INDEX (ts,recv_email)
);

 

Then we modify the config.inc.php and specify a database connection URI to access that table.
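
For orientation, a configuration file along these lines might look like the sketch below. The variable names are borrowed from the code excerpts in this post ($myemail, $pwd, $tz, $db_uri); whether the shipped config.inc.php uses exactly these names is an assumption, and every value is a placeholder. The URI follows ADOdb's driver://user:password@host/database form.

```php
<?php
// config.inc.php (sketch): every value below is a placeholder.

// Gmail login used by collect.php
$myemail = 'you@gmail.com';
$pwd     = 'your-password';
$tz      = 0; // time zone offset; the exact unit expected is an assumption

// ADOdb connection URI: driver://user:password@host/database
$db_uri = 'mysql://spamuser:secret@localhost/spamdb';
```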

Now let's review some interesting parts of the collect code:

First we connect to Gmail. Apparently the library saves the login cookies in a session variable so that it doesn't have to re-authenticate every time it runs, although I don't think that works when running from the command line:

 

 $gm = new GMailer();
 $gm->setLoginInfo($myemail, $pwd, $tz);

 

Next, we get the conversations in the spam box and loop over the unread ones to get the actual emails:

 

$gm->fetchBox(GM_STANDARD, "spam", 0);
$snapshot = $gm->getSnapshot(GM_STANDARD);
if ($snapshot) {
    debug('Total # of conversations in Spam folder = ' . $snapshot->box_total.$nl.$nl);
    foreach ($snapshot->box as $conv) {
        // we will only inspect unread messages for better performance
        if ($conv['is_read']==1) {
            debug('Conversation "'.strip_tags($conv['subj']).'" (id: '.$conv['id'].')'.$nl);

            // get the messages in the conversation
            $q = "search=spam&view=cv&th=".$conv['id'];
            $gm->fetch($q);
            $snapshot2 = $gm->getSnapshot(GM_STANDARD | GM_LABEL| GM_QUERY| GM_CONVERSATION);
            ....

 

So now we have the real emails in the conversation in the $snapshot2 object. We only have to collect that email info, verify that it wasn't already analyzed/inserted (could happen when a new mail enters an already read conversation) and finally insert the data in the table:

foreach ($snapshot2->conv as $msg) {
    debug(' From: '.$msg['sender_email'].' ID: '.$msg['id'].' was sent to ');
    foreach ($msg['recv_email'] as $recv_email){
        // extract the bare address (note: eregi() was removed in PHP 7)
        $email = array();
        eregi($regex, $recv_email, $email);
        debug($email[0].' ');
        // skip messages already stored on a previous run; since message_id
        // is the primary key, only the first recipient of a message is saved
        $sql = 'select 1 from spam_occurance where message_id = ?';
        $rs = $db->execute($sql,array($msg['id']));
        if ($rs->EOF) {
            $sql =  'insert into spam_occurance '.
                    '(message_id,ts,recv_email,subj,from_email) '.
                    ' values (?,?,?,?,?)';
            $db->execute($sql, array( $msg['id'],time(),$email[0],
                                    strip_tags($conv['subj']),$msg['sender_email']));
            debug('INSERTED');
        } else {
            debug('SKIPPED');
        }
    }
    debug($nl);
}
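
The eregi() function used for the address extraction no longer exists in PHP 7 and later. On a current PHP version, that step could be done with preg_match() instead, roughly as in this sketch. The helper name extract_email and the pattern are my own assumptions (the script's $regex is not shown here), and lowercasing the result is an extra choice so that the same address always groups together.

```php
<?php
// Hypothetical replacement for the eregi() call: pull the bare address out of
// a recipient string. The pattern is a deliberately loose assumption, not the
// $regex used by the original script.
function extract_email(string $recipient): ?string
{
    $pattern = '/[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/i';
    if (preg_match($pattern, $recipient, $matches) === 1) {
        return strtolower($matches[0]);
    }
    return null; // no address-like substring found
}

var_dump(extract_email('"Some Site" <SomeSite@mydomain.com>'));
var_dump(extract_email('undisclosed-recipients'));
```

Returning null when nothing matches lets the caller decide whether to skip the message or store a placeholder.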

 

Data is now in the database and ready to be seen by the report script.
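
To give an idea of what can be done with the collected rows, here is a minimal sketch of a per-address count. It is not taken from reports.php: the function name count_by_recipient and the sample rows are made up for illustration, and the same result could come straight from the database with the GROUP BY query shown in the comment.

```php
<?php
// Equivalent SQL, for reference (table and columns as defined earlier):
//   SELECT recv_email, COUNT(*) AS hits FROM spam_occurance
//   GROUP BY recv_email ORDER BY hits DESC;

// Count spam messages per receiving address, busiest address first.
function count_by_recipient(array $rows): array
{
    $counts = array_count_values(array_column($rows, 'recv_email'));
    arsort($counts); // sort by count, descending, keeping the address keys
    return $counts;
}

// Made-up sample rows shaped like spam_occurance records
$rows = [
    ['message_id' => 'a1', 'recv_email' => 'shop@mydomain.com'],
    ['message_id' => 'a2', 'recv_email' => 'forum@mydomain.com'],
    ['message_id' => 'a3', 'recv_email' => 'shop@mydomain.com'],
];

print_r(count_by_recipient($rows));
```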

The reports.php script is a little more complicated (with the AJAX stuff), so just go ahead and download it to take a look. Anyway, the powerful part was the use of libgmailer. It's a very good library that I can imagine using in a lot of ways!

You can get the source code here: 4TM Open Source Tools


Comments


* Anonymous -
Congratulations on your program. However, it only fetches the last 100 spam messages. I quickly checked your code as well as the libgmailer code and could not find where or why it fetches only a limited number of mails. Any idea?
Regards
* Anonymous -
I guess that happens because libgmailer works by parsing Gmail web pages, and each page (when you are viewing a folder) displays only 100 messages at a time. I don't know if there is a fix for that; you should ask the libgmailer developers.
I would recommend running the program more often (I have it in a crontab that runs every couple of hours). If you need to analyze the spam you already have, and you have more than 100 messages, you will have to manually delete the viewed messages each time it runs. Anyway, since the program takes the system time and not the message time, all your mails will appear to have been received on the same day. You can try to change that in the script. Since I run it several times a day, that wasn't a problem for me.
I'm glad you like the script!
Regards,
Jose.
* Anonymous -
Thanks for your quick answer. I have found how to fetch all mails, with this function:

$gm->fetchBox(GM_STANDARD, "spam", $i);

Set $i to the index of the first email you want, then get the snapshot:
$snapshot = $gm->getSnapshot(GM_STANDARD);

A simple loop from $i=0 to $snapshot->box_total with $i incremented by the page view count does the full fetch.

Regards
* Anonymous -
Damn, my account was disabled for 24 hours because of "unusual usage"...
Maybe it would be a good idea to insert some random waits between mail fetches...
* Anonymous -
Hi Jose! Thanks for the clever little script! I got it all set up and imported some spam from gmail using collect.php, but when I go to reports.php it gives me:

Parse error: syntax error, unexpected '{' in /home/****/public_html/gmail/reports.php on line 70


Funny thing is, I haven't edited the PHP scripts at all, just uploaded the .tar.gz etc.

Here's line 70-72 in the script:

try {
include('adodb/adodb.inc.php');
$db = NewADOConnection($db_uri);


Do you have any idea what the problem is? We have PHP 4.4.4 with apache 1.3.37. Thanks!
* Anonymous -
Thanks for the tips Oliver!
Ronnie, the script was written for PHP 5. The "try {} catch {}" blocks will not work in PHP 4.x.
You can try commenting them out. You will lose the error-catching part, but you should still be able to use it.
Regards,
Jose.
* Anonymous -
Hi Jose, thanks for the tip -- Putting a comment block /* ..... */ around the try {} ... catch {} lines allowed me to get it working, thanks! I didn't see any mention that it was a PHP5 script only on your blog, perhaps you could add that mention or offer a PHP4 and a PHP5 version (or have checking in your script to look at the PHP version, which is pretty easy to do :) )
* Anonymous -
Well, reports.php does load. I'll put this on a different PHP5 server if that'll make the script happy :)

In PHP 4.x I just get "Loading..." indefinitely after clicking the Update button on either the daily, monthly, or yearly reports.
* Anonymous -
The script does not seem to be working. Is there an update available? It really seems handy...
* Anonymous -
It's been a long time since I last updated it. I have it running on my server without problems, although I'm not using the reports page. Can you specify what's wrong? Please read the comments too for tips. Regards.
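
The paging loop and the random waits suggested in the comments above could be combined roughly as in the sketch below. It is only an illustration under stated assumptions: fetch_all and $fetchPage are hypothetical names, the wait bounds are guesses, and the demo uses a fake data source instead of libgmailer. With the real library, the callable would wrap the fetchBox() and getSnapshot() calls quoted above.

```php
<?php
// Walk a mailbox page by page, pausing a random interval between requests.
// $fetchPage is a hypothetical callable: given an offset it returns
// ['total' => int, 'items' => array].
function fetch_all(callable $fetchPage, int $pageSize, int $minWait = 2, int $maxWait = 10): array
{
    $all = [];
    $offset = 0;
    do {
        $page = $fetchPage($offset);
        $all = array_merge($all, $page['items']);
        $offset += $pageSize;
        // stop when the reported total is reached or a page comes back empty
        $more = $offset < $page['total'] && count($page['items']) > 0;
        if ($more) {
            sleep(random_int($minWait, $maxWait)); // wait bounds are guesses
        }
    } while ($more);
    return $all;
}

// Stand-in data source: 250 fake conversation ids served 100 at a time
$fake = range(1, 250);
$fetcher = function (int $offset) use ($fake): array {
    return ['total' => count($fake), 'items' => array_slice($fake, $offset, 100)];
};

// Waits are set to zero here so the demo finishes immediately
echo count(fetch_all($fetcher, 100, 0, 0)), "\n";
```

Whether a few seconds between requests is enough to avoid the "unusual usage" lockout is unknown, so treat the defaults as a starting point.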


 


2010 Copyright © 4TM - all rights reserved

www.4tm.biz