miriam_e | formatting text (Reply)

Here's a nice trick for Linux users.

When outputting text to the terminal, most commands don't do smart word wrapping. Text is generally much easier to read, though, if words are NOT broken at the ends of lines. Here is a way to give any text output smart word wrapping.

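In your .bashrc add this line:

alias ww='fmt -s -w "$(tput cols)" -'

It uses the fmt command to wordwrap text according to how many columns wide (-w) the terminal window is (tput cols). The -s option prevents it joining short lines, so it only splits long lines. The final '-' makes it take its input from the standard input. The single quotes matter: inside double quotes, $(tput cols) would run only once, when .bashrc is read, freezing the width at whatever the terminal was at that moment. With single quotes it runs each time ww is used, so a resized window is picked up.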
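To see what -s and -w do without depending on the current window size, here is a small sketch. The function name ww_demo and the fixed width of 20 are made up for illustration, and it assumes GNU fmt.

```shell
# ww_demo is a hypothetical stand-in for the ww alias. It takes the width
# as an optional argument and falls back to the terminal width.
ww_demo() {
    fmt -s -w "${1:-$(tput cols)}" -
}

# A long line is split between words, keeping every output line
# within 20 columns.
printf '%s\n' 'one two three four five six seven eight' | ww_demo 20

# Short lines are left alone, because -s stops fmt from joining them.
printf 'short\nlines\n' | ww_demo 20
```

A function is used here instead of an alias only because aliases are not expanded in non-interactive scripts by default; for everyday use at the prompt the alias does the same job.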

Now, when you output any text to the terminal just pipe it through ww and it will display as smart-wordwrapped, for example:

cat test.txt | ww

Don't use it to output to a file though, unless you want the extra linefeeds inserted into your file for some reason.
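One way to guard against that is to wrap only when the output really is a terminal. This is a sketch, not part of the original alias; the name ww_tty is made up, and it relies on the standard test [ -t 1 ], which is true when standard output is a terminal.

```shell
# ww_tty is a hypothetical variant of ww: it wraps text only when
# standard output is a terminal, and passes it through untouched
# when the output goes to a file or another pipe.
ww_tty() {
    if [ -t 1 ]; then
        fmt -s -w "$(tput cols)" -
    else
        cat
    fi
}

# On a terminal this is wrapped; redirected to a file it is left as-is.
printf 'some text to display\n' | ww_tty
```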

A big shortcoming of the fmt command is that it can't remove all the line breaks inside paragraphs, if you want to undo its formatting.

I solve this using a surprisingly complicated-looking sed command:

sed -z -r 's/[ \t]*\n[ \t]*/\n/g; s/([^\n])\n([^\n])/\1 \2/g'

It's simpler than it looks though.

The -z option (a GNU sed extension) makes sed treat zero bytes, rather than newlines, as end-of-line markers. Since most text files don't contain any zero bytes, sed sees the entire file as a single line. This lets us treat newline characters like any other character.
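A quick way to see the difference, assuming GNU sed, is to try replacing newlines with a visible character with and without -z:

```shell
# Without -z, sed works one line at a time and the newline is never part
# of the text being searched, so this substitution changes nothing.
printf 'a\nb\n' | sed 's/\n/|/g'

# With -z the whole input is one "line", so each newline can be matched.
printf 'a\nb\n' | sed -z 's/\n/|/g'    # prints a|b| (no trailing newline)
```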

The -r option (also spelled -E) lets sed use extended regular expressions. This simplifies things because it means a lot of regular expression characters, such as the parentheses used here, don't have to be prefixed with the '\' escape character.
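As a small illustration (assuming GNU sed), here is the same swap of two characters written both ways:

```shell
# Basic regular expressions (the default): grouping parentheses
# have to be escaped with backslashes.
printf 'ab\n' | sed 's/\(a\)\(b\)/\2\1/'    # prints ba

# Extended regular expressions (-r): plain parentheses do the grouping.
printf 'ab\n' | sed -r 's/(a)(b)/\2\1/'     # prints ba
```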

The sed command breaks down into 2 instructions. They're both substitute commands of the form s/search/replace/g
The 's' means substitute. The text to be searched for and the text to replace it are both surrounded by '/' characters. And the 'g' means global replace, so every occurrence is replaced.

The first command is:
s/[ \t]*\n[ \t]*/\n/g
It searches for any run of spaces and/or tabs [ \t]* that immediately precedes or follows a newline character and deletes it, leaving just the newline character \n. The asterisk means any number (including none) of the preceding pattern [ \t]. Note that this also strips any indentation at the start of a line.
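Here is that first command on its own, as a sketch assuming GNU sed. The function name strip_edges is made up for the example.

```shell
# strip_edges is a hypothetical wrapper around the first substitution:
# it removes spaces and tabs on either side of every newline.
strip_edges() {
    sed -z -r 's/[ \t]*\n[ \t]*/\n/g'
}

# Input: "foo" followed by two spaces, then a tab-indented "bar".
# Output: "foo" and "bar" on separate lines with the stray whitespace gone.
printf 'foo  \n\tbar\n' | strip_edges
```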

The second command does the real work:
s/([^\n])\n([^\n])/\1 \2/g
It is a bit weird. It marks some patterns with parentheses () so that sed can make use of what they match. If the first character inside square brackets is '^' then the set is negated, so [^\n] means any character that is not a newline character. So this search pattern means a newline character with an ordinary (non-newline) character on either side of it.

The replace pattern is a lot simpler. \1 puts back the character matched by the first pair of parentheses, the newline is replaced by a space, and \2 puts back the character matched by the second pair.

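So this replaces with a space every lone newline at the end of a line, but doesn't touch paragraph ends that have multiple newline characters. One edge case: because each match consumes the character after the newline, a line that is only one character long gets joined on one side only.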
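Putting both substitutions together, here is the whole command wrapped in a function and run on a small sample. The function name unwrap and the sample text are made up for the example; it assumes GNU sed.

```shell
# unwrap is a hypothetical name for the full sed command from this post.
unwrap() {
    sed -z -r 's/[ \t]*\n[ \t]*/\n/g; s/([^\n])\n([^\n])/\1 \2/g'
}

# Two wrapped paragraphs separated by a blank line.
printf 'first line\nsecond line\n\nnew para\nmore\n' | unwrap
# Output:
# first line second line
#
# new para more
```

The blank line between paragraphs survives because neither of its two newlines has an ordinary character on both sides.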

It took me quite a while to work out what is a pretty simple command. I tried a number of different languages. In the end sed rules them all, despite its dense-looking form.

I should rewrite it to let it ignore short lines, such as those in poetry... but that gets a bit complicated. I would have to find lines that are shorter than "normal" and leave them alone. What is "normal" or average line length for the file? How much shorter qualifies it as verse? Tricky.
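As a very rough starting point, here is a sketch that sidesteps the "what is normal?" question by using a fixed threshold instead of working one out from the file. It is not the sed command above: it uses awk, the name unwrap_long and the default threshold of 60 characters are arbitrary assumptions, and it does not strip stray spaces or tabs at line ends.

```shell
# unwrap_long is a hypothetical helper: a line is joined to the next one
# only if it is at least MIN characters long (first argument, default 60)
# and the next line is not blank. Shorter lines keep their line break.
unwrap_long() {
    awk -v min="${1:-60}" '
        NR > 1 {
            # Decide what goes between the previous line and this one.
            if (length(prev) >= min + 0 && $0 != "")
                printf " "
            else
                printf "\n"
        }
        { printf "%s", $0; prev = $0 }
        END { if (NR > 0) printf "\n" }
    '
}

# With a threshold of 15, the long first line is joined to the next,
# while the short verse-like lines are left alone.
printf 'aaaaaaaaaa bbbbbbbbbb\ncccc\n\nshort\nverse\n' | unwrap_long 15
```

A fixed threshold is the weak point, of course: it would have to be tuned per file, which is exactly the tricky part.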


Miriam's Dreamwidth blog


May 2025
