I was pointed to a book called "Mastering Regular Expressions".
Well, this was my very quick and short response.
Note, '\{0,s\}' is my proposal for shortest match in BRE.
Date: Mon, 21 Apr 2008 18:54:47 +0100
From: "Oleg Verych"
To: "sed users"
Subject: more on design of some UNIX tools (Re: gsed man pages; custom sed news; `sed` in the wild.)
> > Please, note word "change", not "design". Change is simple
>
> A simple design is something you can see from the design document, which is
> my and your's last e-mails. A simple change is something you can see from
> the code, so I won't believe that until I see the patch. :-P
I've just looked at that "Mastering RE" book. And it's very sad that the
author didn't start from BRE and `sed` usage, i.e. basic things.
'^(Subject|Date):' example is a mix of meat and flies.
sed -n '
/^[FS]/{
/From:/p
/Subject:/p
}'
I wonder which is faster, less memory hungry, more flexible and more readable
in the end (think of all headers). Or simply, what is the weighted average?
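To make the comparison concrete, here is a minimal sketch that runs both styles on the same input. The sample headers and the function names `bre_pick` and `ere_pick` are made up for illustration, and the header tests are anchored with `^`, which the script above does not do. It only shows that the two forms select the same lines; it measures nothing about speed or memory.

```shell
# Sample headers; the addresses and text are invented for this sketch.
headers='From: alice@example.org
To: bob@example.org
Subject: hello
Date: Mon, 21 Apr 2008 18:54:47 +0100
Status: RO'

# BRE + sed: a cheap first-character test guards the per-header tests.
bre_pick() {
    sed -n '
/^[FS]/{
/^From:/p
/^Subject:/p
}'
}

# ERE alternation, the egrep style the book starts from.
ere_pick() {
    grep -E '^(From|Subject):'
}

printf '%s\n' "$headers" | bre_pick
printf '%s\n' "$headers" | ere_pick
```

Adding another header to the `sed` form is one more line inside the braces, and each line can carry its own command instead of `p`.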
,-- Aha, page 122:
|Even in a script, efficiency is also context-dependent. For example,
|with an NFA, something long like
|^-(display|geometry|cemap|…|quick24|random|raw)$ to check command-line
|arguments is inefficient because of all that alternation,
`--
...no comments...
,-- match IP address --
|^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$
|--
|Using Perl's \d as a shorthand for [0-9] , this becomes the more-readable*
|^\d+\.\d+\.\d+\.\d+$ , but it still allows things that aren't IP addresses,
|--
|To enforce that each number is three digits long, you could use
|^\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d$
|--
|but now we are too specific. We still need to allow one- and two-digit numbers
|(as in 1.2.3.4). If the tool's regex flavor provides the {min, max} notation,
|you can use ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
`--
Right, classic BRE is the answer! I wonder how old the BRE syntax is by now!
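For reference, the {min,max} notation from the quote is the BRE interval `\{m,n\}`, so the last expression carries over to plain `grep` directly. A minimal sketch, where the function name `is_dotted_quad` is hypothetical:

```shell
# BRE interval expression: one to three digits per field, written \{1,3\}.
is_dotted_quad() {
    printf '%s\n' "$1" |
        grep -q '^[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}$'
}

is_dotted_quad 1.2.3.4 && echo "1.2.3.4: match"
is_dotted_quad 1234.5.6.7 || echo "1234.5.6.7: no match"
# Only the shape is checked, not the 0-255 range of each field.
is_dotted_quad 999.999.999.999 && echo "999.999.999.999: match (shape only)"
```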
Anyway, starting from `egrep` and `perl` isn't right, even from the POV of
presenting Basic RE first and then proving it too naive/basic/stupid/whatever,
not the other way around.
Or gee, it is even mister Wall who envisioned lazy (non-greedy) RE and
thus implemented "ungreedy" quantifiers. Yet somehow
"very difficult (or even impossible)" RE things were done for years with buggy
proprietary UNIX tools.
Man, every C program with multiple inline C comments to match is an
example of all these bone-headed constant RE setups. Cool `perl` did
everything, but has it done shell? It even has ASCII its own way, e.g. '\n'...
IMHO \{0,s\} seems a perfect example of BRE development: not reinventing
wheels, not breaking everything. This book, Mastering Regular Expressions, is
about anything but `sed` and its power of text processing by RE, and not even
about the flexibility `sed` has beyond just RE -- simple logic glue to match
*and* process.
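For context, \{0,s\} is only a proposal and no existing `sed` is assumed to implement it. With today's BRE, a shortest match for C comments has to be spelled out as "anything that does not close the comment". A minimal sketch for comments that start and end on the same line; the function names are hypothetical:

```shell
# Greedy attempt: .* runs to the LAST */ on the line and eats the code between.
strip_greedy() {
    sed 's|/\*.*\*/||g'
}

# Shortest match without a lazy quantifier:
#   /\*                      comment opener
#   [^*]*\*\{1,\}            text up to a run of one or more stars
#   \([^/*][^*]*\*\{1,\}\)*  the run was not followed by /, so keep going
#   /                        the closing slash
strip_c_comments() {
    sed 's|/\*[^*]*\*\{1,\}\([^/*][^*]*\*\{1,\}\)*/||g'
}

line='a = 1; /* first */ b = 2; /* second ** note */ c = 3;'
printf '%s\n' "$line" | strip_greedy
printf '%s\n' "$line" | strip_c_comments
```

The greedy form drops `b = 2;` together with both comments; the second form removes only the comments themselves.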
But UNIX "a tool for a job" has had its "marvelous" designs.
In early shells it was normal to use `expr` to count the length of a shell
variable or to do simple arithmetic:
# prints the number of characters in variable a
# (perfect result of mathematical study of RE in the 60s applied to shell)
expr "$a" : '.*'
# the * must be escaped, or the shell expands it as a glob
expr 2 + 3 \* 10
This tool with its syntax, which in shell requires quoting nearly everywhere,
is a good example of failed design. The shell soon fixed all that with ${} and
$(()), but not by initial design. Thus, maybe using `perl` in the book is more
natural for readers who are programmers, and the readers are only
programmers...
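A side-by-side sketch of the old and the fixed forms, assuming a POSIX shell; the value of `a` is arbitrary:

```shell
a='hello world'

# Old style: external process, regex match used as a length counter.
expr "$a" : '.*'
# Old style arithmetic: every operator is a separate word, * must be escaped.
expr 2 + 3 \* 10

# Shell expansions doing the same work with no external process.
echo "${#a}"
echo "$((2 + 3 * 10))"
```

Both pairs print the same numbers; the difference is the quoting burden and the extra process for each `expr` call.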
The amount of shell functionality in the "custom sed" proposal shows that sed
and shell both have incomplete designs. Something needs to be redesigned using
proven basic experience and wisdom. Even /bin/sh itself has had no development
since the late 80s. They talk about compatibility of scripts and source code.
They just don't know what `sed` really is, and what it can do.
So, it's not another `perl` for sure! It is something to look to in the bright
future of using and transforming text, which in open source is source
code...
Source code without whitespace damage, with a clear coding style that makes
composing patterns (RE) easy, and which is thus easy to change, maintain, port,
etc. `perl` and Emacs have failed to do so for decades, and yet `sed` is
known just for its 's' command.
--
http://kerneltrap.org/blog/olecom
http://kernelnewbies.org/olecom
sed 'sed && sh + olecom = love' << ''
-o--=O`C
#oo'L O
<___=E M