login
Header Space

 
 

philosophically about sed and text processing

April 21, 2008 - 2:40pm
Submitted by olecom on April 21, 2008 - 2:40pm.

I was pointed out to book called "Mastering Regular Expressions".
Well, this was my very quick and small response.
Note, '\{0,s\}' is my proposition for shortest match in BRE.

Date: Mon, 21 Apr 2008 18:54:47 +0100
From: "Oleg Verych"
To: "sed users"
Subject: more on design of some UNIX tools (Re: gsed man pages; custom sed news; `sed` in the wild.)

> > Please, note word "change", not "design". Change is simple
>
>  A simple design is something you can see from the design document, which is
> my and your's last e-mails.  A simple change is something you can see from
> the code, so I won't believe that until I see the patch. :-P

I've just looked that "Mastering RE" book. And it's very sad, that
author didn't start
from BRE and `sed` usage, i.e basic things.

 '^(Subject|Date):' example is a mix of meat and flies.
sed -n '
/^[FS]/{
/From:/p
/Subject:/p
}'

I wonder, what is faster, less memory hungry, more flexible and readable in
the end (think all headers). Or just, what is weighted average?

,-- Aha, page 122:
|Even in a script, efficiency is also context-dependent. For example,
with an NFA,
|something long like |^-(display|geometry|cemap|=BC|quick24|random|raw)$ to
|check command-line arguments is inefficient because of all that alternation,
`--
...no comments...

,-- match IP address --
|^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$
|--
|Using Perl's \d as a shorthand for [0-9] , this becomes the more-readable*
|^\d+\.\d+\.\d+\.\d+$ , but it still allows things that aren't IP addresses,
|--
|To enforce that each number is three digits long, you could use
|^\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d$
|--
|but now we are too specific. We still need to allow one- and two-digit numbers
|(as in 1.2.3.4). If the tool's regex flavor provides the {min, max} notation,
|you can use ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
`--
Right, classic BRE is the answer! I wonder about BRE syntax date now!


Anyway, the start from `egrep` and `perl` isn't right, even from POV of having
Basic RE and then to prove, they are too naive/basic/stupid/whatever, no
other way around.

Or gee, it is even mister Wall, who have envisioned lazy (non-greedy) RE and
thus implemented "ungreedy" quantifiers. Yet somehow
"very difficult (or even impossible)" RE things were done for years with buggy
proprietary UNIX tools.

Man, every C program with multiple inline C comments to match is an
example of all this bone-headed constant RE setups. Cool `perl` did
everything, have it done shell? It even has ASCII its own way, e.g. '\n'...

IMHO \{0,s\} seems perfect example of BRE development, not inventing wheels,
breaking everything. This book Mastering Regular Expressions, is all, but about
`sed` and its power of text processing by RE, not even about those flexibilities
`sed` has above just RE -- simple logic glue to match *and* process.

But UNIX "a tool for a job" have had it's "marvelous" designs.
On early shells it was normal to count length of a shell variable, to do simple
arithmetic with `expr` :

# prints the number characters in variable a
expr "$a" : '.*' # (perfect result of mathematical study of RE in 60s
applied to shell)
expr 2 + 3 * 10

this tool with its syntax, which in shell requires quoting nearly everywhere,
is a good example of failed design. shell soon fixed all that with ${}
$(()), but
not by initial design. Thus, maybe using `perl` in book is more natural for
readers who are programmers and readers are only programmers...

Many shell functionality in "custom sed" proposal shows, that sed+shell both
have incomplete designs. Something needs to be redesigned using proved
basic experience and wisdom. Even /bin/sh itself has no development since
late 80s. They talk about compatibility of the scripts, source code. They just
don't know what `sed` really is, and what it can do.

So, it's not another `perl` for sure! It is something to look in
bright future of
using and transforming text, which in open source is source code.......

Source code without whitespace damage, with clear coding style to make
pattern (RE) composing easy, thus which is easy to change, maintain, port
etc., etc... `perl` and Emacs failed to do so for decades and yet `sed` is
known just by 's' command.
--
http://kerneltrap.org/blog/olecom
http://kernelnewbies.org/olecom
sed 'sed && sh + olecom =3D love' << ''
-o--=3DO`C
 #oo'L O
<___=3DE M

started from: gsed man pages; custom sed news; `sed` in the wild

April 21, 2008 - 2:54pm
Date: Sun, 20 Apr 2008 05:14:26 +0100
From: "Oleg Verych"
To: "sed users"
Subject: gsed man pages; custom sed news; `sed` in the wild.

== GNU sed functionality documentation ==

When i've reported some bugs[0] with GNU sed recently, i've screen some
old ones.Very little work required to close all those man-page related stuff.
Many users without `info` journeys can know much more features implemented
there.

[0] http://bugs.debian.org/475464 475474
   Bug#475464: sed: lost cycle in address range with last-line-address operation
   Bug#475474: sed: performance anomaly of /^$/

== custom sed news ==

Making short RE match was outlined in custom sed[1]
,--
|with Perl-style minimal pattern matching:
|           *?   minimal match of *
|           +?   minimal match of +
|           ??   minimal match of ?
`--
I don't know if this was implemented somewhere. Anyway, this approach seem
to be very bad. BRE or ERE syntax is already too twisted, i haven't even see a
reliable highlighting for it. E.g. colored expand [2] didn't do \( \) matched,
GNU Emacs 21, i use, hasn't any so called fontlocking for `sed` or BRE.

What i'm proposing[3] is S/// command, which will have shortest RE match.
I don't know if glibc regexp based implementation of GNU sed can have it, but
ordinary `sed` must have shortest match in context (line) addresses with RE.
To scan more there, is quite waste of time and energy. Thus same short
functionality flag can be used in new S/// command.

[1] http://www.ptug.org/sed/custom_sed.html
[2] http://sed.sourceforge.net/local/scripts/expand.sed.html
[3] http://kernelnewbies.org/olecom/sed-and-sh++#sed

== `sed` in the wild ==

I also don't know if posts like this one are OK for mostly Q&A forums, but let
me share some of my experience and scripts. While i didn't get any feedback
about text processing WRT coding style, this time i want to show some really
simple yet very useful for C programmers stuff.

Real security guys have hard time maintaining and porting big patches to ever
growing number of changes and versions of 2.6 Linux. I've started to look at
PaX patch[4] to see, what i can do for more automated and simple porting of it.

[4] http://kerneltrap.org/node/15996

Any comments are welcome. TIA.

== last, but not least ==
[2] http://sed.sourceforge.net/local/scripts/expand.sed.html

i think, i've made more optimized version of this one[5]. Don't claim,
i'm a guru,
just a small funny experience. Evaluation is welcome.

[5] ftp://flower.upol.cz/crazy/expand.sed.sh
-- 
sed 'sed && sh + olecom = love' << ''
-o--=O`C
 #oo'L O
<___=E M

discussion about what backtrack is

April 21, 2008 - 2:58pm
Date: Mon, 21 Apr 2008 09:11:51 +0200

|| Matching /^".*"/ against "ABCDEFGHIJKLMNOPQRSTUVWXYZ" require 1
>  >  don't know yet what backtrack is, especially in practice.
>
>
> taking slightly modified Paolo's example:
>
>  consider matching /.*"/ against "ABCDEFGHIJKLMNOPQRSTUVWXYZ,

OK. But it may be too different.

>  it first checks if " matches the .*, it does,

Really?

'.*' means zero or more, so i'd say the first step is to see if _"_ matches _"_,
match is successful, save position/state, next(). Here the shortest match can be
applied, if no next() is used.

>  than it checks if "A match the .*, it does, than it checks if "AB matches .*,
>  it does ..... than it
>  checks if "ABCDEFGHIJKLMNOPQRSTUVWXYZ it does, than it reaches the
>  end, oops, next thing to do is check if "ABCDEFGHIJKLMNOPQRSTUVWXYZ
>  matches .*", it doesn't than it checks if "ABCDEFGHIJKLMNOPQRSTUVWXY
>  matches the .*", it doesn't ... the proces of going back char by char
>  is called backtracking.

I see it in general now, thanks.
______

i gladly explain my language and meaning; asking humans is good

April 21, 2008 - 3:02pm
Date: Mon, 21 Apr 2008 14:58:28 +0200

Paolo Bonzini:
>  For example, I have no clue what you meant here:

Alright, it's always OK, if somebody just asks.

> and it can be changed in simple matcher

"the change" is any simple one, like short RE match or nested
addresses proposed earlier.

"simple matcher" i mean BRE implementation in small and simple
LANG=C (i.e. no i18n, internationalization):
* `minised`
  <http://www.google.com/search?q=minised>
* `expr` in classic `Ash`
  <ftp://flower.upol.cz/dts/Ash0000/history/ash/bltin/regexp.c>

>  to be a flexible, useful variable,

short match is a high-level concept of (basic) regular expressions,
thus is fundamental constant, which has problems, e.g. using inverse
matching for simple one-character short match: ',[^,]', and no
simple way to multiple ones.

for another fundamental RE thing:
"the earliest match wins" (quote from mastering RE)

`sed` has flags in s///here: 1,2,999 (thanks, GNU), g (i wonder where
'g' comes from :)

thus, whole my rant here is about more flexibility in `sed`
(just in `sed`, not glibc and scary RE matchers there). perl-like RE
is not an options, as it was noted in the first post.

>  to "fit perfectly" to more places.

this quote from double-quoted books (sed&awk and mastering RE)
is sarcasm.

>  ???

Hope this helps.
-- 
sed 'sed && sh + olecom = love'  <<  ''
-o--=O`C
 #oo'L O
<___=E M

just simple example:[Re: Sed cut within line]

April 22, 2008 - 8:41pm
To: sed users
Date: Wed, Apr 23, 2008 at 12:59 AM

Tim Chase @ Wed, Apr 23, 2008 at 12:39 AM:

> > I am trying to get SED to keep the first three characters in a line,
>  > and discard all others.
>
>  You should be able to use
>
>   sed 's/^\(...\).*/\1/'
>
olecom@flower:$ echo 123uio | sed 's-^\(...\).*-\1-'
123

more funny examples (:

olecom@flower:$ echo 123uio | sed -n 's-...-&\n-;P'
123
olecom@flower:$ echo 123uio | sed 's-...-&\n-;P;d'
123

Which is faster?

,-- man perl:
|The Perl motto is "There's more than one way to do it."  Divining how
|many more is left as an exercise to the reader.
`--

ho-ho-ho! ha-ha-ha!

Yea, i can have many ways to laugh at it. Anyway.

And answer is!

== input ==

root@flower:/tmp# ls -lh /tmp/func.def-0#7-
-rw-r----- 1 olecom root 12M 2008-03-19 02:53 /tmp/func.def-0#7-

== output ==

root@flower:/tmp# time nice -n -19 sed 's-...-&\n-;P;d' /tmp/func.def-0#7- >/dev/null

real 0m0.189s
user 0m0.176s
sys  0m0.008s
root@flower:/tmp# time nice -n -19 sed 's-...-&\n-;P;d' /tmp/func.def-0#7-  >/dev/null

real 0m0.189s
user 0m0.184s
sys  0m0.004s
root@flower:/tmp# time nice -n -19 sed 's-^\(...\).*-\1-' /tmp/func.def-0#7- >/dev/null

real 0m3.086s
user 0m3.072s
sys  0m0.008s
root@flower:/tmp# time nice -n -19 sed 's-^\(...\).*-\1-' /tmp/func.def-0#7- >/dev/null

real 0m3.120s
user 0m3.112s
sys  0m0.004s
root@flower:/tmp#

Wow, nice trick.

April 23, 2008 - 10:20am

From: Paolo Bonzini
Subject: Re: which is faste? (Re: Sed cut within line)

Oleg Verych wrote:
>> Which is faster?

Wow, nice trick.

Paolo

some more fun

May 19, 2008 - 7:34am

kid+bit+sed http://mid.gmane.org/20080519104809.GJ21428@flower.upol.cz

and <20080519114731.GK21428@flower.upol.cz>:

Forgot one thing.

> == Let strange quark decay ==
>
> $time sed -n '/:Version: /s`.*/\(.*\):Version: \([^:]*:\)*`\1_`p' <sed.input >here2
>
> real    0m0.284s
> user    0m0.268s
> sys     0m0.020s
> $diff -u here here2 | head
> $
>
> So, nearly one order of magnitude?

$time sed -n '/:Version: /s`.*/\(.*\):Version: \([^:]*:\)*`\1_`p' <sed.input >here2

real    0m0.283s
user    0m0.268s
sys     0m0.012s
$
$time sed -n '/:Version: /s`.*/``p' <sed.input | sed 's`\(.*\):Version: \([^:]*:\)*`\1_`' >here3

real    0m0.142s
user    0m0.188s
sys     0m0.012s
$
$time sed -n '/:Version: /s`.*/``p' <sed.input | sed 's`\(.*\):Version: \([^:]*:\)*`\1_`' >here3

real    0m0.142s
user    0m0.184s
sys     0m0.008s
$
$diff -u here3 here2
$

Linuxthreads in past TODO or yet TO DO NPTL for faster ERE in glibc?

I think, making processes in kernel light, new scheduler, splice(), etc,
etc, is better way, rather than inventing userspace-multi-tasking.

I wonder, who needs it anyway, except big closed enterprise...
____

optimizing `gcc` cult

May 21, 2008 - 8:41am

quoting myself:

optimizing `gcc` cult.

Actually that is NIH:
http://article.gmane.org/gmane.linux.kernel/274882

Sometimes lkml is not a waste of time; its archives (thanks to gmane, fsck vger) are the waste of knowledge...

new addition:

http://article.gmane.org/gmane.linux.kernel/683674
<alpine.LFD.1.10.0805201752230.3081@woody.linux-foundation.org>

So, ideas of text-processing aware super-macro language aren't without base. Now situation is just a manual handcraft with cult of GNU C Compiler, spaghetti coding and kernel-devs, who, unlike Linus, don't look into asm...

I hope i'll finish my IDE to ease and make it fun to see what asm gcc generates and when it does it better or not (among all others things that developer just skips, due to dumb monkey job, inflexibility, no user interface).

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
speck-geostationary