Forum Moderators: coopster & phranque

Message Too Old, No Replies

Large Text Lists

How would you do it?

         

rocknbil

12:13 am on Oct 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Two email lists in plain text files, one over a million records, the other from 1000 to 3000 records. Find all addresses that are in both lists and make a third list.

Sounds simple huh? It really is, but try to get it to do this in under an hour is the big hoop to jump. I would prefer to just stuff them in a temp mysql DB but it's not available on this project.

Just pushing them on an array and comparing won't work, it's too huge. Right now I'm fiddling with List::Compare using an intersection, but for some reason I think it may be malfunctioning when I try to read in the large list.

I'm down to reading in a "chunk" of the larger list, comparing the chunk against the smaller list, then storing matches and errors incrementally. Still slow.

How would you approach something like this?

phranque

12:50 am on Oct 10, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i would use standard *nix commands but you can essentially do this in perl if you like.

sort -u big_file.txt >big_file_sorted_uniques.txt
sort -u small_file.txt >small_file_sorted_uniques.txt
diff -u big_file_sorted_uniques.txt small_file_sorted_uniques.txtĤgrep -v '^[-+]'

[edit]fix pattern[/edit]

rocknbil

1:55 am on Oct 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Doesn't diff just compare the files and displays what's different? I played around with it today, seemed like it was up the wrong tree, but I could be missing something. :-)

Thinking grep might give me the speed we're after though, that's next on the block. In the meantime I DID build a mysql alternate, the whole thing runs in about 20 minutes.

3 billion comparisons: check each email address in a file of 3000 lines against a file of 1 million. Eak!

rocknbil

5:06 am on Oct 10, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, I have two solutions on the board.

In the first I import the files into mySQL and just do a single query. :-) About 8 minutes.

The other I store the smaller file in an array (about 3000 lines) and do a `grep` on the larger file for each item in the array.

For a Pathologically Eclectic Rubbish Lister I would think Perl would have more mechanisms for huge files. In all my years I've just never dealt with text files this large. Guess I always found a better way!

phranque

5:28 am on Oct 10, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Doesn't diff just compare the files and displays what's different?

diff with -u option uses a unified output format in which all non-matching lines would begin with '+' or '-'.
pipe that into grep with -v option which inverts the match, returning non-matching items, which would therefore be common to both files.

no promises - all i can say is it's worth a try.

phranque

6:02 am on Oct 10, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



For a Pathologically Eclectic Rubbish Lister I would think Perl would have more mechanisms for huge files.

what perl does have is a mechanism for executing system commands and a philosophy of reusing instead of reimplementing.

rocknbil

12:58 am on Oct 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The customer went with a mySQL approach, but my second version indeed used grep. It was pretty fast too . . . . something like

(store smaller list in array @arr)


foreach $v (@arr) {
undef($result);
$result = `grep $v $wdir/$target_file`;
if ($result) { $matches++; &log_matches($v); }
}

Then used wc to get the line count. I guess I've been spoiled by mySQL. :-(

rocknbil

10:18 pm on Oct 15, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, for posterity's sake . . . I spend quite a bit of time searching for approaches to this issue, most of which involved the open (FILE) syntax we all know and love. I played with various mySQL - only approaches, I went down Phranque's road and tried some Unix commands, most of which had minor problems (mostly when you grep stuff into a perl array with these kinds of numbers not all the data comes out the other side . . . ) but the largest turned out to be speed.

I "guestimated" comparing a one million lines file against a second 1 million lines file to take about 12 hours when commands were nested in a program (which is probably due to my lack of proficiency in Unix commands.)

Using a mySQL-only solution - insert into two slim tables, apply different selects on them (joined and not joined-) - I thought would be the fastest, added up to about 50 hours! I say guestimate because I only let these run for half an hour and did the math.

I refused to give up and just say "that's as good as it gets." What I wound up doing was one insert of the first table (error-checking the syntax on the fly,) which tooks 3-5 minutes. Once that table's in, I didn't waste any time on inserting the second. I just opened the second file, cleaned the email up, then dropped it into a select statement

select email from table1 where email='$email'

Which doesn't force you to go through the entire list. At regular intervals, I popped in a value that indicated progress through the second list. As each line was brought up to bat, I also performed error checks for invalid email syntax, and dropped the malcontents into an error log table (one for each list.)

It turned out that this billion computation comparison takes around . . . .

45 minutes! :-)

That's on a Centos single processor machine. Probably would do better on a dual.