This may be the only time I can concentrate hard enough to write down an explanation of what happened. Here goes.

I have some notes on how this went, but it has been 1.1 years or so. I couldn't just save a copy of the shell session in which the thing ran, because the results included thousands and thousands and thousands of lines of output. I wish I had kept an exact transcript of this work, but after I figured out how to do it, it seemed pointless to write it all down, since I knew how to do it. Well, that's patently silly. But I'm pretty/highly/really sure it goes like this, because I jotted down some records.

1. Run the ukrainedl.pl script to download 60000+ files. I think I kept them in a big tarball called "ukraine2002-prunedfn.tar.gz" in case you ever want it. The download ran for about 2 and 1/2 days.

2. Get rid of all the returned files that are empty. They are empty because no district corresponds to the request that we made. Do that with the "find" command, checking for files that have a small number of bytes. They are easy to spot if you look at all the files. I think the empty ones are smaller than 2900 bytes (on a Linux file system).

Then you are left with html files, one for each valid district. Their names start with pf and have ugly crap in them. I tidied up the file names by cutting off WEBPROC at the beginning of all of them. I think I used the rename-perl program to do that. I attached it. It's from the Unix Power Tools book. Usage is like

    for i in WEB*; do rename-perl s/WEBPROC// $i; done

After that, the file names were reduced to big ugly things like

    pf7335=3&pf7145=17.html

Here "3" and "17" are district and precinct numbers. We will iterate over those numbers in the scripts below.

My recollection is that the file names had some bogus characters in them. These would frustrate scripts. The one that I thought was real trouble was ?, which was in every filename. So was "&". Those made me mad.
As far as I recall, there was some escaping required in the command, as in

    for i in *; do rename-perl s/.*[\?]// $i ; done

The backslash escaping is required in order to stop the shell from trying to interpret the character ?. At least I think so. If it doesn't work, try without the \. My notes also say I ripped out the character & from the filenames, by replacing \? with \&.

The file ukraine2002-prunedfn.tar.gz already has had this cleaning done to it.

3. Split the pf*.html files into their constituent tables. Like so:

    for i in pf*; do csplit -s -k -f $i. $i "/ "final/$i"; done

So now in the "final" directory, we have the files stripped of html.

Then put the stripped info into a "dataset" format. This was the real pinnacle of wonder for me. There were 2 phases.

The first part, as I recall, was on tables 04 and 05, for the parties running. The file "masterScript" is, I believe, the one that iterates over all of those files, scans for the numbers, and outputs them as raw text. That generated the files that Erik then inspected. There were a handful of precincts in which the numbers were bogus; Erik spotted them and we found that the person who entered the data had goofed. But we looked at the originals and fixed it.

The file masterScript78 is the one that goes through files 07 and 08. That was the information on the candidates running and their votes. I think. It's all Ukrainian to me :)

After that, I recall Erik wanted the names of the candidates in a set, and I can't remember exactly why, but my recollection is that the script called "namescript8" did that.

When you run these scripts, they dump output to stdout, and you can redirect it into files.

Well, I'm glad we had this time, together.

pj

Burt Monroe wrote:

> Geez, my scripts go on for pages. Must be nice to be competent. Wouldn't mind seeing the "frobbing" script, too, out of curiosity's sake, if it's to hand.
>
> At 04:25 PM 09/09/2004, you wrote:
>
>> Dear Burt
>>
>> Supposing you have perl and wget, this will make all of your automated downloading dreams come true. If you want data from Ukraine, that is.
>> That downloads the html files, and then they have to be frobbed like hell to distill down to the usable stuff. There are Perl scripts for that too, somewhere :)
>>
>>
>> #!/usr/bin/perl -w
>>
>> my $ndist = 226;
>> my $maxp = 351;
>>
>>
>> for (my $i=1; $i < $ndist; $i++) {
>>     for (my $j=1; $j < $maxp; $j++)
>>     {
>>         my $filename = join("","http://195.230.157.53/pls/vd2002/WEBPROC215V?kodvib=400&pf7335=",$i,"&pf7145=",$j);
>>         `wget "$filename"`;
>>     }
>> }
>>
>>
>> --
>> Paul E. Johnson                       email: pauljohn@ku.edu
>> Dept. of Political Science            http://lark.cc.ku.edu/~pauljohn
>> 1541 Lilac Lane, Rm 504
>> University of Kansas                  Office: (785) 864-9086
>> Lawrence, Kansas 66044-3177           FAX: (785) 864-5700


masterScript78

#!/usr/bin/perl

#my $fn;

# for (my $ind1 =1; $ind1 < 226; $ind1++){
#     for (my $ind2 =1; $ind2 < 353; $ind2++) {

for (my $ind1 =1; $ind1 < 226; $ind1++){

    my $foutname = join("","district",$ind1);
    open(DISTOUT, "> final/$foutname") or die "no file out";
    flock (DISTOUT, 2);    # 2 = LOCK_EX

    for (my $ind2 =1; $ind2 < 353; $ind2++) {
        my $base1 = "pf7335=";
        my $base2 = "pf7145=";
        # print "$base1";
        # print "$base2";
        my $fn1 = join('',$base1,$ind1,$base2,$ind2,".07");
        my $fn2 = join('',$base1,$ind1,$base2,$ind2,".08");
        my $foutname = join("","district",$ind1);

        if(-e $fn1){
            print DISTOUT "$ind1 $ind2 ";
            &process07 ( $fn1, \*DISTOUT );
            &process08 ( $fn2, \*DISTOUT );
            print DISTOUT "\n";
        }else {
            # print "file did not exist\n";
            # print "$fn1 \n";
        }
    }
    flock(DISTOUT, 8);     # 8 = LOCK_UN
    close (DISTOUT);
}

sub process07 (){
    my $fn = $_[0];
    my $DISTOUT = $_[1];
    my $dataexists = open
        (DATA,"$fn");
    #if (!$dataexists) { return 0; }
    my $junk;
    my $i = 1;
    # $i is bumped on each line that contains digits;
    # the number is printed on every second such line.
    while (<DATA>) {
        my $input = $_;
        $input =~ s/\ //g;
        if ($input =~ /(\d+)/s) {
            $i++;
            if ( $i % 2 ) {
                print $DISTOUT $1 . " ";
            }
        }
    }
    close(DATA);
}

sub process08 ( ){
    my $fn = $_[0];
    my $DISTOUT = $_[1];
    my $dataexists = open (DATA,"$fn");
    #if (!$dataexists) {return 0; }
    my $junk;
    # skip the first four lines of the file
    $junk = <DATA>;
    $junk = <DATA>;
    $junk = <DATA>;
    $junk = <DATA>;
    my $i = 1;
    while (<DATA>) {
        my $input = $_;
        $input =~ s/\ //g;
        if ($input =~ /(\d+)/s) {
            $i++;
            if ( $i % 2 ) {
                my $val = $1;
                $val =~ s/\ //g;
                print $DISTOUT $val . " ";
            }
        }
    }
    close (DATA);
    print "\n";
}
exit;


masterScript

#!/usr/bin/perl

#my $fn;

for (my $ind1 =1; $ind1 < 226; $ind1++){
    for (my $ind2 =1; $ind2 < 353; $ind2++) {
        # $fn = "WEBPROC215V\?kodvib=400\&pf7335=" . "$ind1" . "\&pf7145=" . "$ind2";
        # $fn = "WEBPROC215V\?kodvib=400\&pf7335=9\&pf7145=62";
        my $base1 = "WEBPROC215V\?kodvib=400\&pf7335=";
        my $base2 = "\&pf7145=";
        # print "$base1";
        # print "$base2";
        my $fn1 = join('',$base1,$ind1,$base2,$ind2,".04");
        my $fn2 = join('',$base1,$ind1,$base2,$ind2,".05");

        if(-e $fn1){
            print "$ind1 $ind2 ";
            &process04 ( $fn1 );
            &process05 ( $fn2 );
        }else {
            # print "file did not exist\n";
            # print "$fn1 \n";
        }
    }
}

sub process04 (){
    my $fn = $_[0];
    my $dataexists = open (DATA,"$fn");
    #if (!$dataexists) { return 0; }
    my $junk;
    my $i = 1;
    while (<DATA>) {
        my $input = $_;
        $input =~ s/\ //g;
        if ($input =~ /(\d+)/s) {
            $i++;
            if ( $i % 2 ) {
                print $1 . " ";
            }
        }
    }
    close(DATA);
}

sub process05 ( ){
    my $fn = $_[0];
    my $dataexists = open (DATA,"$fn");
    #if (!$dataexists) {return 0; }
    my $junk;
    # skip the first four lines of the file
    $junk = <DATA>;
    $junk = <DATA>;
    $junk = <DATA>;
    $junk = <DATA>;
    my $i = 1;
    while (<DATA>) {
        my $input = $_;
        $input =~ s/\ //g;
        if ($input =~ /(\d+)/s) {
            $i++;
            if ( $i % 2 ) {
                my $val = $1;
                $val =~ s/\ //g;
                print $val .
                    " ";
            }
        }
    }
    close (DATA);
    print "\n";
}
exit;


namescript8

#!/usr/bin/perl

#my $fn;

# for (my $ind1 =1; $ind1 < 226; $ind1++){
#     for (my $ind2 =1; $ind2 < 353; $ind2++) {

my $foutname = join("","names");
open(DISTOUT, "> $foutname") or die "no file out";
flock (DISTOUT, 2);    # 2 = LOCK_EX

for (my $ind1 =1; $ind1 < 226; $ind1++){
    # for (my $ind2 =1; $ind2 < 353; $ind2++) {
    my $ind2 = 1;
    my $base1 = "pf7335=";
    my $base2 = "pf7145=";
    # print "$base1";
    # print "$base2";
    # my $fn1 = join('',$base1,$ind1,$base2,$ind2,".07");
    my $fn2 = join('',$base1,$ind1,$base2,$ind2,".08");
    # print $fn2 . "\n";

    # look for the first precinct in this district that has an .08 file
    my $count = 0;
    while (! -e $fn2 && $count < 400){
        $count ++;
        $ind2++;
        $fn2 = join('',$base1,$ind1,$base2,$ind2,".08");
    }

    if(-e $fn2){
        print "$ind1 $ind2 ";
        &process08 ( $fn2, \*DISTOUT );
        print "\n \n";
    }else {
        print "$ind1 $ind2 missing \n";
        # print "$fn1 \n";
    }
}

flock(DISTOUT, 8);     # 8 = LOCK_UN
close (DISTOUT);

sub process08 ( ){
    my $fn = $_[0];
    my $DISTOUT = $_[1];
    my $dataexists = open (DATA,"$fn");
    if (!$dataexists) {print "no 08 here"; return 0; }

    my $junk;
    # skip the first six lines of the file
    $junk = <DATA>;
    $junk = <DATA>;
    $junk = <DATA>;
    $junk = <DATA>;
    $junk = <DATA>;
    $junk = <DATA>;
    my $lagin=0;
    my $i = 1;
    my @myArray=();
    while (<DATA>) {
        my $input = $_;
        #$input =~ s/\ //g;
        if ($input =~ /(\d+)/s) {
            $i++;
            if ( $i % 2 ) {
                my $val = $1;
                $val =~ s/\ //g;
                # print "dig" . $val . " ";
            }
        }
        if ($i % 2 == 0){
            # if ($i == 2*$lagin){ #can't because candidates dont count up
            if (1){
                my $newin = $input;
                $newin =~ s/\n \r \j \f//;
                $newin =~ /(\w+)/s;
                chop ($newin);
                push @myArray, $newin;
            }
        }
        $lagin = $input;
    }
    close (DATA);
    pop(@myArray); #throw away that last one;
    my $lastval = pop(@myArray);
    # $ind1 and $ind2 are lexical to the main loop, so they are empty here
    if ($lastval != 8) {print "major error pruning $ind1 $ind2 ";}
    foreach $value (@myArray){
        print $value . "\t" ;
    }
}
exit;


stripHTML

#! /bin/sh
# remove SGML tags from a file.
sed -e 's/<[^<>]*>//g' -e 's/^[^<>]*>//' -e 's/<[^<>]*$//' $1


rename-perl

#!/usr/bin/perl
#
# rename script examples from lwall:
#       rename 's/\.orig$//' *.orig
#       rename 'y/A-Z/a-z/ unless /^Make/' *
#       rename '$_ .= ".bad"' *.f
#       rename 'print "$_: "; s/foo/bar/ if <stdin> =~ /^y/i' *

$op = shift;
for (@ARGV) {
    $was = $_;
    eval $op;
    die $@ if $@;
    rename($was,$_) unless $was eq $_;
}
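
The pruning in step 2 is described above only in words, so here is a minimal sketch of how it might look with "find". It is not the command from the original notes. It takes the recalled 2900-byte cutoff at face value, and the scratch directory and the two stand-in file names are made up purely so the sketch can run on its own.

```shell
#!/bin/sh
# Sketch of step 2: delete downloaded pages too small to hold any data.
# The 2900-byte threshold is the recollection from the text, not a verified value.
set -eu

demo=$(mktemp -d)        # hypothetical scratch directory for the demonstration
cd "$demo"

# Two stand-in files: a tiny "empty" response and one big enough to be a real page.
printf 'no such district' > 'pf7335=1pf7145=999.html'
dd if=/dev/zero of='pf7335=1pf7145=1.html' bs=4000 count=1 2>/dev/null

# -size -2900c matches regular files smaller than 2900 bytes.
# List the candidates first, then remove them.
find . -type f -name 'pf*' -size -2900c -print
find . -type f -name 'pf*' -size -2900c -exec rm -f {} +

# Whatever is left should be the pages that carry data.
ls
```

Running the listing form before the deleting form is a cheap way to check the cutoff against a real download directory, since the threshold here is only a recollection.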