wrapping with awk
Jul. 31st, 2011 10:09 am
A while ago I posted about unwrapping the lines inside paragraphs of a text file.
"Wrapping" lines usually (though not always) means breaking lines so that long lines don't run off the right-hand side of the display. One of the great breakthroughs of writing web pages in HTML is that it automatically wraps text to the window.
It has become fairly common for a lot of text files to be pre-wrapped to a certain line length (often 79 characters, inherited from the old 80-character-wide display screens from twenty years ago). All the ebooks from Project Gutenberg are wrapped that way. I like to read a lot of their books on my little Palm computer, which has a fairly narrow display. The Palm is perfectly happy to wrap long lines on its display, so the pre-wrapped text presents a problem. The lines end up looking something like this:
Norah's home was on a big station in the north of Victoria--so large that you could almost, in her own phrase, "ride all day and never see any one you didn't want to see"; which was a great advantage in Norah's eyes. Not that Billabong Station ever seemed to the little girl a place that you needed to praise in any way. It occupied so very modest a position as the loveliest part of the world!

The homestead was built on a gentle rise that sloped gradually away on every side; in front to the wide plain, dotted with huge gum trees and great grey box groves, and at the back, after you had passed through the well-kept vegetable garden and orchard, to a long lagoon, bordered with trees and fringed with tall bulrushes and waving reeds.
(The first two paragraphs of "A Little Bush Maid" by Mary Grant Bruce.)
The Palm doesn't need lines to be pre-wrapped. In order to read them more easily I prefer to unwrap them, simply preserving the blank lines that separate paragraphs. To this end, some months ago I wrote this little sed program:
sed ':a; /^$/!N; /\n$/!s/\n/\ /; ta'
It is astoundingly opaque. Even after writing it I need the comments I wrote at the time to help me understand how it works.
I finally got around to learning enough awk to try the same thing with it. In contrast to the sed program, the awk one is delightfully lucid:
awk 'BEGIN {RS=""} {gsub("\n", " "); print $0 "\n"}'
First, the BEGIN block sets the record separator (RS) to an empty string so that awk will read in a whole paragraph as a single record. RS has to be set there, before any input is read; assigning it inside the main block takes effect only after the first line has already been read as a record of its own. Normally the record separator is the newline character (\n) that ends each line. Next, it globally replaces (gsub) all newline characters with spaces. Finally it prints out each record (paragraph) and adds an extra newline character to the one that the print statement normally adds. Incredibly simple and easy to read.
Yay!
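For anyone who wants to try the awk approach on a small input, here is a minimal sketch. The function name unwrap and the sample text are made up for illustration. The sketch sets RS in a BEGIN block so that paragraph mode is already in effect when the first record is read.

```shell
#!/bin/sh
# unwrap (hypothetical helper name): join the lines of each paragraph,
# keeping one blank line between paragraphs.
unwrap() {
  # RS="" switches awk to paragraph mode: records are separated by blank lines.
  awk 'BEGIN { RS = "" } { gsub("\n", " "); print $0 "\n" }'
}

# Made-up sample: two pre-wrapped paragraphs separated by a blank line.
sample='first line of paragraph one
second line of paragraph one

first line of paragraph two
second line of paragraph two'

printf '%s\n' "$sample" | unwrap
```

Each paragraph should come out as one long line followed by a blank line. To convert a real file, something like `unwrap < book.txt > book-unwrapped.txt` should do, where both file names are placeholders.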