I am writing a PHP script that is run through a pipe every time an email is sent to a specific address on my server. Ultimately, my plan is to use this for a future website that bridges SMS messages and web-based instant messaging. As you can imagine, I need to save the important variables from the piped script that processes incoming messages into a database, where I can then manipulate the data further.
Just for testing purposes, I have made it so my script takes the sender, subject, message, and headers of all incoming emails and sends all of these variables in a message to another one of my email addresses using the mail function (instead of saving the variables to the database). This is my pipe script here:
#!/usr/bin/php -q
<?php
// read from stdin
$fd = fopen("php://stdin", "r");
$email = "";
while (!feof($fd)) {
    $email .= fread($fd, 1024);
}
fclose($fd);
// handle email
$lines = explode("\n", $email);
// empty vars
$from = "";
$subject = "";
$headers = "";
$message = "";
$splittingheaders = true;
for ($i=0; $i < count($lines); $i++) {
    if ($splittingheaders) {
        // this is a header
        $headers .= $lines[$i]."\n";
        // look out for special headers
        if (preg_match("/^Subject: (.*)/", $lines[$i], $matches)) {
            $subject = $matches[1];
        }
        if (preg_match("/^From: (.*)/", $lines[$i], $matches)) {
            $from = $matches[1];
        }
    } else {
        // not a header, but message
        $message .= $lines[$i]."\n";
    }
    if (trim($lines[$i])=="") {
        // empty line, header section has ended
        $splittingheaders = false;
    }
}
mail("send_results_here@example.com", "Pipe Script Results", "From: $from Subject: $subject Headers: $headers Message: $message", "From: some_email@example.com\n");
return NULL;
?>
In most cases this works fine. However, whenever I send an email through the pipe from an MSN email account (most likely all Microsoft email accounts will do the same), my $message variable shows up as:
This is a multi-part message in MIME format.
------=_NextPart_000_0105_01C8D23F.DE1CA150
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
This is the actual message
------=_NextPart_000_0105_01C8D23F.DE1CA150
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type =
content=3Dtext/html;charset=3Diso-8859-1>
<STYLE></STYLE>
<META content=3D"MSHTML 6.00.6000.16681" name=3DGENERATOR></HEAD>
<BODY id=3DMailContainerBody=20
style=3D"PADDING-LEFT: 10px; FONT-WEIGHT: normal; FONT-SIZE: 10pt; =
COLOR: #000000; BORDER-TOP-STYLE: none; PADDING-TOP: 15px; FONT-STYLE: =
normal; FONT-FAMILY: Verdana; BORDER-RIGHT-STYLE: none; =
BORDER-LEFT-STYLE: none; TEXT-DECORATION: none; BORDER-BOTTOM-STYLE: =
none"=20
leftMargin=3D0 topMargin=3D0 acc_role=3D"text" CanvasTabStop=3D"true"=20
name=3D"Compose message area"><!--[gte IE 5]><?xml:namespace =
prefix=3D"v" /><?xml:namespace prefix=3D"o" /><![endif]-->
<DIV>This is the actual message</DIV></BODY></HTML>
------=_NextPart_000_0105_01C8D23F.DE1CA150--
This will not look very friendly if it goes into the database as the $message variable. Notice that it still includes many headers. If all emails did this, I could probably figure out a way to separate the headers from the message correctly, but not all emails do, and I do not know which ones do and which ones don't. My question is: Is there any way to separate the email headers and message completely so that, in the above message for example, I would only get "This is the actual message" as my $message variable?
And if this cannot be done effectively, would it be possible to somehow tap into a specific variable in the email that I would be able to use to open the email separately through the imap_open function to read the actual message through the imap_fetchbody function, since this function will give me the correct message body?
Any replies would be greatly appreciated.
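For reference, the header/body split that the pipe script does line by line can also be sketched as a single function. This is only a rough sketch, not a tested replacement: split_email() is a made-up name, and it assumes the header section ends at the first empty line and that header lines continued on the next line start with a space or tab (as the charset= lines in the sample above appear to).

```php
<?php
// Hypothetical helper (not part of the original script): split a raw
// message into an array of headers and a body string.
function split_email($raw)
{
    // Normalise CRLF and bare CR line endings to LF.
    $raw = str_replace(array("\r\n", "\r"), "\n", $raw);

    // Assumption: the header section ends at the first empty line.
    $parts = explode("\n\n", $raw, 2);
    $headerText = $parts[0];
    $body = isset($parts[1]) ? $parts[1] : "";

    // Unfold continuation lines: a line break followed by spaces or tabs
    // continues the previous header line.
    $headerText = preg_replace('/\n[ \t]+/', ' ', $headerText);

    $headers = array();
    foreach (explode("\n", $headerText) as $line) {
        $pos = strpos($line, ':');
        if ($pos === false) {
            continue; // no colon, so not a "Name: value" line
        }
        // Header names are stored lowercased; a repeated header
        // (e.g. Received) overwrites the earlier value in this sketch.
        $name = strtolower(trim(substr($line, 0, $pos)));
        $headers[$name] = trim(substr($line, $pos + 1));
    }
    return array($headers, $body);
}
```

With this, list($headers, $message) = split_email($email); would give $headers['from'] and $headers['subject'] without the per-line regular expressions, and a boundary parameter that sits on a continuation line ends up on the same line as its Content-Type header.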
You could probably pass it through imap/pop3 [php.net], although there's a bit of a learning curve to it.
I think the easier way to do it will be to look at the header that specifies the boundary (_NextPart_000... here), then use that as a parameter to explode(). Then look through the resulting array for the text/plain section.
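A rough sketch of that explode() idea, in case it helps. The function name extract_plain_text() is made up for illustration, and the sketch assumes the boundary parameter appears somewhere in the top-level headers (quoted or unquoted) and that each part separates its own headers from its body with an empty line. It does not handle nested multipart messages or decode the transfer encoding.

```php
<?php
// Hypothetical helper: return the body of the first text/plain part of a
// multipart message, or the trimmed body if no boundary is declared.
function extract_plain_text($headers, $body)
{
    // Work with LF line endings only.
    $headers = str_replace("\r\n", "\n", $headers);
    $body = str_replace("\r\n", "\n", $body);

    // The boundary value may be quoted or unquoted.
    if (!preg_match('/boundary\s*=\s*(?:"([^"]+)"|([^\s;]+))/i', $headers, $m)) {
        return trim($body);
    }
    $boundary = $m[1] !== '' ? $m[1] : $m[2];

    // Each part is introduced by "--" followed by the boundary string.
    $parts = explode("--" . $boundary, $body);
    array_shift($parts); // text before the first delimiter is not a part

    foreach ($parts as $part) {
        if (substr($part, 0, 2) === "--") {
            break; // "--boundary--" marks the end of the message
        }
        // Part headers and part body are separated by the first empty line.
        $pieces = explode("\n\n", ltrim($part, "\n"), 2);
        if (count($pieces) < 2) {
            continue;
        }
        if (preg_match('/^Content-Type:\s*text\/plain/im', $pieces[0])) {
            return trim($pieces[1]);
        }
    }
    // No text/plain part found: fall back to the whole body.
    return trim($body);
}
```

Called as extract_plain_text($headers, $message) with the two variables from the pipe script, this should skip the text/html part and the part headers in one go, though the result may still be quoted-printable or base64 encoded.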
I don't believe that particular structure is limited to MSN, I think it's just been sent as both plain and html, which I can do from non-MSN accounts from my email client. Along those lines, though, a caveat: you may want to look at an email sent from an AOL account; those have some differences to them necessitating some extra coding.
No, I haven't completely figured this out yet, still working on it.
>>I don't believe that particular structure is limited to MSN, I think it's just been sent as both plain and html, which I can do from non-MSN accounts from my email client. Along those lines, though, a caveat: you may want to look at an email sent from an AOL account; those have some differences to them necessitating some extra coding.
I tried passing an email from my AIM account through the pipe, assuming AIM messages will come out the same as AOL messages, and it is a little different, but not terribly different. Here is my message from my AIM account:
----------MB_8CAA20BE4E0C032_B48_66DA_webmail-nf06.sim.aol.com
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="us-ascii"
test
----------MB_8CAA20BE4E0C032_B48_66DA_webmail-nf06.sim.aol.com
Content-Transfer-Encoding: 7bit
Content-Type: text/html; charset="us-ascii"
test<div id='u8CAA20BE4BF6272-B48-33EF' class='aol_ad_footer'><FONT style="color: black; font: normal 10pt ARIAL, SAN-SERIF;"><HR style="MARGIN-TOP: 10px"><A title="http://toolbar.aol.com/moviefone/download.html?ncid=aolcmp00050000000011" href="http://toolbar.aol.com/moviefone/download.html?ncid=aolcmp00050000000011" target="_blank">Get the Moviefone Toolbar</A>. Showtimes, theaters, movie news, & more!</FONT> </div>
----------MB_8CAA20BE4E0C032_B48_66DA_webmail-nf06.sim.aol.com--
It seems the similarity is in the leading dashes (----------) on the lines that separate the plain text version of the message from the HTML version. I also found that messages from both MSN and AIM use the "boundary" parameter in the headers to specify the string used for separation (the separator lines in the message are this string prefixed with two extra dashes). For MSN, it was:
boundary="----=_NextPart_000_0105_01C8D23F.DE1CA150"
and AIM's was:
boundary="--------MB_8CAA20BE4E0C032_B48_66DA_webmail-nf06.sim.aol.com"
Using the boundary parameter, should I be confident that explode() could be used to parse the email to retrieve the plain text version of the message for every message that contains this anomaly?
I am assuming something like this will work:
preg_match("/boundary=\".*?\"/i", $headers, $boundary);
// $boundary stays empty when the headers contain no boundary parameter
$boundaryfulltext = isset($boundary[0]) ? $boundary[0] : "";
if ($boundaryfulltext!="")
{
    $find = array("/boundary=\"/i", "/\"/i");
    $boundarytext = preg_replace($find, "", $boundaryfulltext);
    $splitmessage = explode("--" . $boundarytext, $message);
    $actualmessage = trim($splitmessage[1]);
}
else
{
    $actualmessage = trim($message);
}
This works for both MSN and AIM email accounts. The only problem is that the final $actualmessage still contains some headers, because each part between the boundary lines carries its own headers. MSN seems to have the most:
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Actual message
And here is AIM's:
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="us-ascii"
Actual message
So now that it seems to be narrowed down a little more, my next question is: Is it possible to parse these remaining and somewhat inconsistent headers completely out of the message?
Also, can I be confident that this boundary format is used in all messages that the original script does not parse completely? And if anyone happens to know of any other email accounts that have abnormal message layouts, which accounts are they and what kind of layouts do they have?
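preg_match("/boundary=\".*?\"/i", $headers, $boundary);
// $boundary stays empty when the headers contain no boundary parameter
$boundaryfulltext = isset($boundary[0]) ? $boundary[0] : "";
if ($boundaryfulltext!="")
{
    $find = array("/boundary=\"/i", "/\"/i");
    $boundarytext = preg_replace($find, "", $boundaryfulltext);
    $splitmessage = explode("--" . $boundarytext, $message);
    $fullmessage = ltrim($splitmessage[1]);
    // everything from the first empty line on is the body of this part
    preg_match('/\n\n(.*)/is', $fullmessage, $splitmore);
    if (substr(ltrim($splitmore[0]), 0, 2)=="--")
    {
        // body starts with "--": keep the leading line breaks so the
        // cleanup below removes it
        $actualmessage = $splitmore[0];
    }
    else
    {
        $actualmessage = ltrim($splitmore[0]);
    }
}
else
{
    $actualmessage = ltrim($message);
}
// remove everything from the first line starting with "--", and
// everything from an "=3D" at the end of a line onwards
$clean = array("/\n--.*/is", "/=3D\n.*/s");
$cleanmessage = trim(preg_replace($clean, "", $actualmessage));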
This time, it outputs the final message as $cleanmessage. From what I have tried so far, this seems to work at least with MSN, Yahoo, and AOL email accounts and with Nextel and AT&T text messaging. It also works with AOL email accounts when the message is sent from a mobile phone (which, strangely, produces slightly different output from an email sent from a computer on the same account). AOL seemed to add an equals sign right after the message, and if the message was sent from a mobile phone, it would append =3D. This script takes care of that problem and also removes any banners at the end of the message that start with a line break followed by "--", for cleaner entry into a database.
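A side note on the stray equals signs: "=3D" is how quoted-printable encodes a literal "=", and a lone "=" at the end of a line is a quoted-printable soft line break, so these may simply be a sign that the part was sent with Content-Transfer-Encoding: quoted-printable (as in the MSN sample earlier). Under that assumption, decoding the part could be an alternative to stripping the characters with a regular expression. A minimal sketch, where decode_part_body() is a made-up name and $encoding is assumed to hold the value of the part's Content-Transfer-Encoding header:

```php
<?php
// Hypothetical helper: decode a part body according to the value of its
// Content-Transfer-Encoding header.
function decode_part_body($body, $encoding)
{
    switch (strtolower(trim($encoding))) {
        case 'quoted-printable':
            // Turns "=3D" back into "=" and removes soft line breaks.
            return quoted_printable_decode($body);
        case 'base64':
            return base64_decode($body);
        default:
            // 7bit, 8bit and binary need no decoding.
            return $body;
    }
}

// Usage: "=3D" decodes to a literal equals sign.
echo decode_part_body("x=3D1", "quoted-printable"), "\n"; // prints "x=1"
```

This only helps when the encoding header of the part is known, so it would have to run after the part headers have been separated from the part body.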
But again, I do not know if this will work for every email account, so I assume it is just trial and error at this point. If there seems to be anything wrong with my code, or if I am missing something, replies are still appreciated.