Sunday, April 20, 2008

stripping html

I wanted to download this and convert it to a flat text file and a basic HTML file. Here's the script to do it: it runs the page through html2text, skips everything up to and including the "ACKNOWLEDGEMENTS" line, then removes any line whose first word is "continue...". Finally it strips off the first and last lines (which are blank).

html2text -nobs ch1-a.html | awk '{if(p==1) if($1 != "continue...") print $0; if($1 == "ACKNOWLEDGEMENTS") p=1;}' | sed '1d;$d' >> ch1-a.txt
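
To see what the awk and sed stages do without needing html2text, here is a small sketch that feeds them some made-up lines. The function name filter_chapter and the sample text are my own inventions for illustration; the awk and sed commands are the same as in the one-liner.

```shell
# Hypothetical helper: the same awk and sed stages as the one-liner, reading stdin.
filter_chapter() {
    # Print lines only after ACKNOWLEDGEMENTS has been seen, skipping any line
    # whose first word is "continue...", then drop the first and last lines.
    awk '{if(p==1) if($1 != "continue...") print $0; if($1 == "ACKNOWLEDGEMENTS") p=1;}' | sed '1d;$d'
}

# Made-up sample standing in for html2text output.
printf '%s\n' 'Title page' 'ACKNOWLEDGEMENTS' '' 'First paragraph.' \
    'continue... next page' 'Second paragraph.' '' | filter_chapter
# prints:
# First paragraph.
# Second paragraph.
```

Note that the ACKNOWLEDGEMENTS line itself is not printed, because the flag is only set after the print test has run for that line.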


Here's the complete version, which iterates over all the chNUMBER-* files in a directory, creates a single text file for each chapter, then converts that to HTML:

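#!/bin/bash

# Chapters are numbered 1 to 24; each one is split across several chNUMBER-* files.
for ((chapnum=1; chapnum<=24; chapnum++)); do
    for i in ./ch"$chapnum"-*; do
        # Convert to text, keep only what follows ACKNOWLEDGEMENTS, drop "continue..." lines,
        # trim the first and last lines, and append to this chapter's text file.
        html2text -nobs "$i" | awk '{if(p==1) if($1 != "continue...") print $0; if($1 == "ACKNOWLEDGEMENTS") p=1;}' | sed '1d;$d' >> "$chapnum.txt"
    done
    # Convert the assembled chapter text to basic HTML.
    txt2html "$chapnum.txt" > "$chapnum.html"
done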
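
One thing worth checking is how the ch$chapnum-* pattern groups the files: the hyphen after the number keeps chapter 1 from also picking up chapter 10's parts. The sketch below tries this in a scratch directory. The helper name chapter_parts and the file names are made up for the demonstration, and the -e test is an addition that skips the unexpanded pattern when a chapter has no files.

```shell
# Hypothetical helper: list the part files for one chapter, one name per line.
# $1 is the directory, $2 is the chapter number.
chapter_parts() {
    for f in "$1"/ch"$2"-*; do
        [ -e "$f" ] || continue      # pattern matched nothing, so skip it
        printf '%s\n' "${f##*/}"     # print the file name without the directory
    done
}

# Made-up file names in a scratch directory.
tmp=$(mktemp -d)
touch "$tmp/ch1-a.html" "$tmp/ch1-b.html" "$tmp/ch10-a.html"
chapter_parts "$tmp" 1     # lists ch1-a.html and ch1-b.html, not ch10-a.html
chapter_parts "$tmp" 2     # prints nothing
rm -r "$tmp"
```

Also note that the script appends with >>, so running it a second time in the same directory adds every chapter's text again; deleting the old NUMBER.txt files first avoids that.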
