linuxjunk: stripping html

Sunday, April 20, 2008

stripping html

I wanted to download this and convert it to a flat text file and a basic HTML file. Here's the script to do it: it runs each page through html2text, skips everything up to and including the "ACKNOWLEDGEMENTS" line, and drops any line whose first word is "continue...". It then strips off the first and last lines (which are blank).

html2text -nobs ch1-a.html | awk '{if(p==1) if($1 != "continue...") print $0; if($1 == "ACKNOWLEDGEMENTS") p=1;}' | sed '1d;$d' >> ch1-a.txt
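
To see what the awk and sed stages do without needing html2text, here is a small sketch that feeds them a made-up page. The function name filter_chapter and the sample lines are my own assumptions for illustration; only the awk and sed commands come from the one-liner above.

```shell
#!/bin/bash

# Hypothetical wrapper around the awk and sed stages; reads text on stdin.
filter_chapter() {
    # p becomes 1 once the ACKNOWLEDGEMENTS line has been seen; from then on
    # every line is printed unless its first word is "continue...".
    awk '{if(p==1) if($1 != "continue...") print $0; if($1 == "ACKNOWLEDGEMENTS") p=1;}' |
        sed '1d;$d'   # delete the first and last line of what is left
}

# Made-up input standing in for the output of html2text.
printf 'site header\nACKNOWLEDGEMENTS\n\nline one\ncontinue... next page\nline two\n\n' |
    filter_chapter
# prints:
# line one
# line two
```

Note that the ACKNOWLEDGEMENTS line itself is not printed, because p is only set after the print test has already run for that line.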

Here's the complete version, which iterates over all the chNUMBER-* files in a directory, creates a single text file for each chapter, then converts that to HTML:

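#!/bin/bash

# Chapters are numbered 1 to 24; each one is split across several ch<N>-* files.
for ((chapnum=1; chapnum<=24; chapnum++)); do
        for i in ./ch"$chapnum"-*
        do
                # Keep the text after "ACKNOWLEDGEMENTS", drop "continue..." lines,
                # then delete the first and last (blank) lines of each part.
                # Note: >> appends, so remove old .txt files before re-running.
                html2text -nobs "$i" | awk '{if(p==1) if($1 != "continue...") print $0; if($1 == "ACKNOWLEDGEMENTS") p=1;}' | sed '1d;$d' >> "$chapnum.txt"
        done
        txt2html "$chapnum.txt" > "$chapnum.html"
done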
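
The loop above hardcodes 24 chapters. If the chapter count isn't known in advance, one possible variation is to read the chapter numbers from the filenames instead. This is only a sketch: the helper name list_chapters is hypothetical, and it assumes the files are named ch<number>-<something> as in the script above.

```shell
#!/bin/bash

# Hypothetical helper: print the distinct chapter numbers found in a
# directory (default: current directory), one per line, in numeric order.
list_chapters() {
    local dir=${1:-.} f name
    for f in "$dir"/ch[0-9]*-*; do
        [ -e "$f" ] || continue   # the glob matched nothing
        name=${f##*/}             # strip the directory part
        name=${name#ch}           # strip the leading "ch"
        echo "${name%%-*}"        # keep what comes before the first "-"
    done | sort -n | uniq
}
```

The outer loop could then be written as for chapnum in $(list_chapters .); do ... done, leaving the rest of the script unchanged.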
