smart wrap of paragraphs
Sep. 12th, 2010 07:28 pm
Another Edit: One tiny extra change I added was to make the joining of lines conditional so that it was only done if the current line is non-blank.
Edit: Holy cow! I woke up at 3:30am, and was unable to get back to sleep because I had come up with a simpler way to do this. I've now stuck most of the original post behind a cut tag and present the much better way to do paragraph wrapping in sed below.
Woo hoo!!!
I have finally managed to work out a small command that has stymied me for an embarrassingly long time.
A quick backgrounder: Years ago I bought a copy of a truly marvelous text editor called TextPad. Since I've moved to Linux I have had to use TextPad inside Wine, which is really annoying. I'd like to drop TextPad and use a native Linux text editor, but I've yet to find one that comes even close to its capabilities. One of the major failings of all Linux text editors that I've found is the lack of an intelligent text wrap. That is, something that removes line-endings from lines within paragraphs but keeps line endings on the blank lines that separate paragraphs.
Okay, the new, improved way:
sed ':a; /^$/!N; /\n$/!s/\n/\ /; ta' input.txt >output.txt
Just look at how small that is. No need to double blank lines or insert an arbitrary string into the text to mark the ends of non-blank lines. This little gem sees the newline character itself. And I worked out how to eliminate the two "-e" options too.
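To try the command without creating any files, here is a small sketch that feeds it a made-up two-paragraph sample through a pipe. The sample text and the variable names (sample, wrapped) are only for illustration, and it assumes GNU sed, as found on most Linux systems.

```shell
# Made-up sample: two paragraphs separated by one blank line.
sample='The quick brown fox
jumps over the lazy dog.

A second paragraph
split over
three lines.'

# Same sed program as above, reading from a pipe instead of input.txt.
wrapped=$(printf '%s\n' "$sample" | sed ':a; /^$/!N; /\n$/!s/\n/\ /; ta')

# Each paragraph should now be a single line, with the blank line kept.
printf '%s\n' "$wrapped"
```

Each paragraph comes out as one long line, and the blank line between them survives.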
How it works:
We start with a label (:a) that marks the beginning of a loop.
The semicolon (;) simply separates sed commands.
The first thing we do is to begin with a condition (/^$/!) which means "if the current line is not blank" ('^' stands for the beginning of a line, '$' stands for the end of a line, and '!' means 'not'). So, if the current line is not blank then use the "N" command to append the next line to this one.
Then we set a condition (/\n$/!) that lets the rest of the loop work only if a newline (\n) is NOT (!) at the end of line ($).
This is tricky and is the insight that wouldn't let me get back to sleep. Easiest to see the logic if you follow its two possibilities: If the line being appended has text on it then the newline character at the end of the current line will be wedged between the text on this one and the next one. That is, it won't be at the end. However, if the next line being appended is an empty line then the only thing that will be appended will be a newline... at the end of the current line. Neat, huh?
If the newline was not at the end of the line (the next line was not a blank line) then we simply substitute (s) the newline (\n) with a space (\ ).
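The last bit (ta) closes the loop: "t" jumps back to the label (a) only if the substitution just before it actually replaced something, so once no newline gets replaced the loop ends.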
So this loop ripples through a paragraph replacing newline characters with spaces until it finds a blank line, which it leaves alone. It then drops out of the loop and runs the whole command again on the next line, which is probably the start of another paragraph and gets the same treatment. This whips through the whole file in a fraction of a second.
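The "newline at the end or not" trick described above can be checked directly. This is just a sketch, assuming GNU sed: it uses sed's "l" command, which prints the pattern space with newlines shown as \n and the end marked by $. The variable names (mid, end) are made up for the example.

```shell
# Case 1: the appended line has text, so the newline sits in the middle.
mid=$(printf 'one\ntwo\n' | sed -n 'N; l')

# Case 2: the appended line is blank, so the newline is the last character.
end=$(printf 'one\n\n' | sed -n 'N; l')

printf '%s\n%s\n' "$mid" "$end"
# prints:
#   one\ntwo$
#   one\n$
```

In the second case the \n sits right before the $, which is exactly what the /\n$/ condition tests for.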
So my final comments are even more true of this revised version:
Yay! Hard to believe that this was so damn hard to work out. It appears pretty easy, looking at it now. Still... 20/20 hindsight and all that.
Here is the old, clumsy way I did it:
sed 's/^$/\n/' $1 | sed '/^$/!s/$/@eol@/' | sed -e :a -e '/@eol@$/N; s/@eol@\n/\ /; ta' >$2
Putting this in a file called "wrap" lets me use it as a command like this:
wrap input.txt output.txt
This takes a text file called, in this example, "input.txt", wraps the paragraphs in it, and sends the result out to create another file called "output.txt".
The "$1" and "$2" in the above script are placeholders for the first and second filenames handed to the "wrap" command.
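For anyone wanting to try it, here is a sketch of what that "wrap" file could look like. The pipeline is the one above; the shebang line, the usage comment, and the quotes around $1 and $2 (so that filenames with spaces survive) are assumptions added for this example, not part of the original script. It assumes GNU sed.

```shell
# Create the "wrap" file containing the three-stage pipeline from above.
# The shebang and the quoting of $1 and $2 are additions for this sketch.
cat > wrap <<'EOF'
#!/bin/sh
# Usage: wrap INPUT OUTPUT
sed 's/^$/\n/' "$1" | sed '/^$/!s/$/@eol@/' | sed -e :a -e '/@eol@$/N; s/@eol@\n/\ /; ta' > "$2"
EOF

# Make it executable so it can be run as a command.
chmod +x wrap
```

From the same directory it can then be run as ./wrap input.txt output.txt, or as plain wrap once the file is somewhere on the PATH. Tracing it by hand under GNU sed suggests two quirks: each joined paragraph ends with a trailing space, and a final paragraph that is not followed by a blank line keeps its "@eol@" marker.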
How does it work?
The first part of the script:
sed 's/^$/\n/' $1
This invokes sed on the input file ($1) to do a substitute (s), searching for empty lines (^$), literally nothing between the beginning and end (because "^" stands for the beginning of a line and "$" stands for the end of a line) and replacing that nothing with a newline, making it now two newlines. This is to fix something that happens later.
Then there is a pipe (|) to take the output of this first part and pipe it into the next part.
The second part:
sed '/^$/!s/$/@eol@/'
This looks for anything that is not an empty line (/^$/!) -- the "!" means "not" -- and substitutes (s) at the end of the line ($) the letters "@eol@", which I chose to be extremely unlikely to appear in any text.
The output of this is piped (|) to the next part.
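To make the first two stages concrete, here is a sketch that runs just those two sed commands on a tiny made-up sample and keeps the intermediate result. The sample text and the variable name (marked) are only for illustration; it assumes GNU sed, which understands \n in the replacement.

```shell
# Made-up sample: two short paragraphs separated by one blank line.
marked=$(printf 'a\nb\n\nc\nd\n' | sed 's/^$/\n/' | sed '/^$/!s/$/@eol@/')

# Show what the third stage will receive.
printf '%s\n' "$marked"
# prints:
#   a@eol@
#   b@eol@
#   (empty line)
#   (empty line)
#   c@eol@
#   d@eol@
```

The single blank line has been doubled, and every non-blank line now carries the marker.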
The third bit is the really cool part:
sed -e :a -e '/@eol@$/N; s/@eol@\n/\ /; ta'
The option "-e" tells sed to execute this immediate command. Usually it is not needed, but here it is useful for tacking some sed commands together in a mini-script inside a single sed command. The first "-e" option simply sets a label (:a) for the beginning of a loop. Then the second "-e" option introduces the real meat. It looks for lines that end with the "@eol@" text and appends (N) the next line to them. Normally sed can't see newline characters because they are beyond the end of the line, but the join puts one smack in the middle of a line where it can see it. Now it substitutes (s) the "@eol@" and newline character (\n) with a space (\ ). (I am not sure why I need to use a backslash to keep the space character; it didn't work properly when I didn't.)
Now it reaches the end of the loop (ta) and goes back to the beginning of the loop (:a) to try looking for another "@eol@", so that it can join and replace again, and so on till it hits a blank line, when it uses up one of the extra blank lines I added in the very first command. At this point the loop exits because the blank line doesn't have the "@eol@" at its end.
This whole thing gets done on every line through the whole input file.
Lastly, the result gets directed out to the second file ($2) and we are done.