Perl security: how to scrub/handle input data?

I'm preparing to go live with my first web application that uses Perl, and would like to know whether I've overlooked any security precautions. The server is Linux/Apache/PHP/Perl.

Although its functionality is different, the interface is similar to the W3C HTML validation "by direct input" form: the user selects various options and then pastes a large text block, which can optionally be the source code of a web page, into an HTML textarea.

JavaScript truncates the text block, if necessary, to its maximum allowed length, and the text and options are submitted to the server as POST data.

The PHP receiver script validates all the submitted option values. If any value is invalid (e.g. non-digit characters in a numeric value), it substitutes a legal default value. It then truncates the text block to size again, in case the JavaScript didn't do so.

PHP then writes the text block (in its raw unscrubbed state) to a temporary file with a random name (that is never revealed to the user) in a directory that is blocked from web access (.htaccess: deny from all).

PHP then invokes my Perl script to scrub and perform its processing on the text that's in the temporary file:

$cmd = escapeshellcmd('perl -wT ' .
    escapeshellarg("/path/to/myscript.pl") . ' ' .
    [various args, including the name of the temporary file]);
$perlresult = shell_exec($cmd);

At this point, the text block in the temporary file may contain, and sometimes certainly will contain, malicious iframes, JavaScript, PHP, SQL injection exploits, and any other type of malicious code. Here is what Perl does with it:

It again interprets the option values and restricts them to legal values. Then...

use strict;
use warnings;
use Encode;
use HTML::Scrubber;   # strips HTML tags
use HTML::Entities;   # converts HTML entities
use Getopt::Long;

# [...code omitted...]

# READ ALL INPUT INTO A SINGLE STRING
# SO THAT SUBSEQUENT SEARCHES FOR TAGS
# CAN SUCCEED EVEN IF OPENING AND CLOSING
# TAGS ARE ON DIFFERENT LINES.

my $intext;   # the entire text in one scalar string
while (<>)    # THIS READS THE TEXT FROM THE TEMP FILE
{
    $intext .= lc($_);   # lower case all
}

# ---- FILTER THE INPUT TEXT 
# I don't know if the next line's "do-nothing" decoding 
# will actually accomplish anything. Its intended purpose 
# is to turn all unsupported (i.e. illegally encoded) chars into 
# legal supported ones even if the resulting output is garbage.

Encode::from_to($intext, "cp1252", "cp1252", Encode::FB_DEFAULT);

# Must strip out any embedded PHP code 
# *before* passing the text to Scrubber,
# whose processing does not strip PHP,
# but does make PHP tags subsequently unfindable,
# while it preserves their potentially malicious contents in the text. 
# TODO: also strip ASP and what other code?

$intext =~ s/\<\?(php)?.*?\?\>/ /sig;

# ---- STRIP OUT ALL HTML TAGS, COMMENTS, JAVASCRIPT. 

my $scrubber = HTML::Scrubber->new;
$intext = $scrubber->scrub($intext); 

# ---- DECODE ENTITIES. 
# THE SCRIPT'S PROCESSING NEEDS THE ACTUAL CHARS, NOT "&apos;" etc.

$intext = decode_entities($intext);

# CHANGE CONTROL (0-31) AND SPACE CHARS TO SINGLE SPACE. 
$intext =~ s/[[:space:][:cntrl:]]+/ /gi;

# THIS ALTERNATIVE ADDS TESTS FOR ANY REMAINING < AND >
# I AM NOT SURE WHETHER THIS IS NECESSARY.
$intext =~ s/[[:space:][:cntrl:]\<\>]+/ /gi;

# AT THIS POINT, $intext CONTAINS ONLY THE READABLE TEXT FROM THE 
# WEB PAGE (IF THAT'S WHAT IT WAS), WITH ALL TAGS REMOVED.
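
The option re-check that runs before this listing is omitted above. A minimal sketch of that kind of check, using only the core Getopt::Long module, might look like the following. The option name `--maxwords`, its default of 100, and the five-digit limit are all hypothetical and stand in for whatever options the real script accepts.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical option: --maxwords, a positive integer, default 100.
sub parse_options {
    my @args = @_;
    my %opt  = (maxwords => 100);
    my $raw;

    # Accept the value as a plain string so bad input cannot abort parsing.
    GetOptionsFromArray(\@args, 'maxwords=s' => \$raw);

    # Keep the value only if it is 1 to 5 digits with no leading zero.
    # Using the regex capture also untaints the value under -T.
    if (defined $raw && $raw =~ /\A([1-9][0-9]{0,4})\z/) {
        $opt{maxwords} = $1;
    }

    # @args now holds whatever was not an option, e.g. the temp file name.
    return (\%opt, \@args);
}
```

Anything that fails the pattern falls back to the default, which mirrors what the PHP receiver is described as doing.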

In the remaining code, Perl processes the text and prints its summary output. PHP receives that output and places the report on the page in an HTML textarea. Currently, PHP does not do any entity conversion; I haven't yet determined whether that's necessary for textarea use. PHP then deletes the temp file.
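
On the textarea question: a textarea's content ends at the first `</textarea>`, and character references inside it are still interpreted, so report text that contains `<` or `&` can alter the page. Because the listing decodes entities after scrubbing, such characters can reappear in the text. A minimal sketch of escaping the report before output, assuming the escaping is done on the Perl side (the function name `escape_for_html` is hypothetical):

```perl
use strict;
use warnings;

# Replace the five characters that are significant in HTML text and
# attribute values with character references.
sub escape_for_html {
    my ($text) = @_;
    my %ent = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&#39;',
    );
    $text =~ s/([&<>"'])/$ent{$1}/g;
    return $text;
}

# prints: &lt;/textarea&gt;&lt;script&gt;alert(1)&lt;/script&gt;
print escape_for_html('</textarea><script>alert(1)</script>'), "\n";
```

The same effect could be had with `encode_entities` from HTML::Entities, which the script already loads, or with `htmlspecialchars` on the PHP side; escaping in exactly one place avoids double encoding.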

My main concern (at least the one I'm aware of) is whether it's possible for the unscrubbed text in the temporary file to contain any kind of exploit that could subvert or hijack the Perl <> operator while it reads the file, or subvert or corrupt HTML::Scrubber's processing of the text as it strips the tags.
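
For context on the `<>` part of that concern: `<>` treats what it reads as plain data, but it opens each remaining `@ARGV` entry with two-argument `open` semantics, so a file name that starts with `>` or begins or ends with `|` would be interpreted rather than opened literally. Here the name is generated by PHP, so that should not arise, but a sketch of reading the file with three-argument `open` removes the dependency on the name being well formed. The function name `slurp_file` is hypothetical.

```perl
use strict;
use warnings;

# Read a whole file into one string. Three-argument open treats $path
# purely as a file name; pipe and redirection characters have no effect.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<:raw', $path)
        or die "Cannot open input file: $!";
    local $/;                # undefined record separator: read everything
    my $text = <$fh>;
    close($fh);
    return defined $text ? $text : '';   # empty file yields ''
}
```

In the script this would replace the `while (<>)` loop, for example `my $intext = lc slurp_file($ARGV[0]);` once Getopt::Long has removed the options from `@ARGV`. On Perl 5.22 and later, the `<<>>` operator is another way to get three-argument behavior.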

Hopefully, if my code is any good, it can serve as an example to help someone else. If it needs fixing, that information can help anyone who reads what's wrong with it. Thank you to anyone who's willing to look at it.

Submitted data will contain malware

SteveWh
