$cmd = escapeshellcmd('perl -wT ' .
escapeshellarg("/path/to/myscript.pl") . ' ' .
[various args, including the name of the temporary file]);
$perlresult = shell_exec($cmd);
use strict;
use warnings;
use Encode;
use HTML::Scrubber;    # strips HTML tags
use HTML::Entities;    # converts HTML entities
use Getopt::Long;
# [...code omitted...]
# READ ALL INPUT INTO A SINGLE STRING
# SO THAT SUBSEQUENT SEARCHES FOR TAGS
# CAN SUCCEED EVEN IF OPENING AND CLOSING
# TAGS ARE ON DIFFERENT LINES.
my $intext;    # the entire text in one scalar string
while (<>)     # THIS READS THE TEXT FROM THE TEMP FILE
{
    $intext .= lc($_);    # lower case all
}
# ---- FILTER THE INPUT TEXT
# I don't know if the next line's "do-nothing" decoding
# will actually accomplish anything. Its intended purpose
# is to turn all unsupported (i.e. illegally encoded) chars into
# legal supported ones even if the resulting output is garbage.
Encode::from_to($intext, "cp1252", "cp1252", Encode::FB_DEFAULT);
# Must strip out any embedded PHP code
# *before* passing the text to Scrubber,
# whose processing does not strip PHP,
# but does make PHP tags subsequently unfindable,
# while it preserves their potentially malicious contents in the text.
# TODO: also strip ASP and what other code?
$intext =~ s/\<\?(php)?.*?\?\>/ /sig;
# ---- STRIP OUT ALL HTML TAGS, COMMENTS, JAVASCRIPT.
my $scrubber = HTML::Scrubber->new;
$intext = $scrubber->scrub($intext);
# ---- DECODE ENTITIES.
# THE SCRIPT'S PROCESSING NEEDS THE ACTUAL CHARS, NOT "&#39;" etc.
$intext = decode_entities($intext);
# CHANGE CONTROL (0-31) AND SPACE CHARS TO SINGLE SPACE.
$intext =~ s/[[:space:][:cntrl:]]+/ /gi;
# THIS ALTERNATIVE ADDS TESTS FOR ANY REMAINING < AND >
# I AM NOT SURE WHETHER THIS IS NECESSARY.
$intext =~ s/[[:space:][:cntrl:]\<\>]+/ /gi;
# AT THIS POINT, $intext CONTAINS ONLY THE READABLE TEXT FROM THE
# WEB PAGE (IF THAT'S WHAT IT WAS), WITH ALL TAGS REMOVED.
My main concern (at least the one I'm aware of) is whether it's possible for the unscrubbed text in the temporary file to contain any kind of exploit that could subvert or hijack the Perl <> operator while it reads the file, or subvert or corrupt HTML::Scrubber's processing of the text as it strips the tags.
When writing application programs, any string that might contain any of these special characters must be properly escaped before the string is used as a data value in an SQL statement that is sent to the MySQL server. You can do this in two ways:
* Process the string with a function that escapes the special characters. In a C program, you can use the mysql_real_escape_string() C API function to escape characters. See Section 20.8.3.53, "mysql_real_escape_string()" [dev.mysql.com]. The Perl DBI interface provides a quote method to convert special characters to the proper escape sequences. See Section 20.10, "MySQL Perl API" [dev.mysql.com]. Other language interfaces may provide a similar capability.
* As an alternative to explicitly escaping special characters, many MySQL APIs provide a placeholder capability that enables you to insert special markers into a statement string, and then bind data values to them when you issue the statement. In this case, the API takes care of escaping special characters in the values for you.
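
For what it's worth, the placeholder approach in the second bullet might look roughly like the sketch below. It uses Python's built-in sqlite3 module instead of MySQL or Perl DBI, purely so that it runs without a database server. The table name `pages`, the column `body`, and the helper `store_text` are made-up names for illustration, not anything from the script above.

```python
import sqlite3


def store_text(conn, page_text):
    """Insert page_text through a bound placeholder instead of string concatenation.

    Hypothetical helper: the 'pages' table and 'body' column are assumptions.
    """
    # "?" is sqlite3's placeholder marker. The driver passes the value
    # separately from the SQL text, so quotes in page_text are never parsed as SQL.
    cur = conn.execute("INSERT INTO pages (body) VALUES (?)", (page_text,))
    return cur.lastrowid


# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, body TEXT)")

# Text that would break a naively concatenated INSERT statement.
hostile = "it's a test'); DROP TABLE pages; --"
row_id = store_text(conn, hostile)

# Read the value back, again using a placeholder for the id.
stored = conn.execute("SELECT body FROM pages WHERE id = ?", (row_id,)).fetchone()[0]
```

The value comes back byte-for-byte as it went in, and the table is still there. In Perl DBI the same idea is a `?` in the string passed to `prepare`, with the values handed to `execute`.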
[edited by: phranque at 2:03 pm (utc) on Nov 4, 2010]