Confusion with grep & locale?

Discussion:

Helge Oldach

2021-08-20 09:03:26 UTC

Hi all,

I'm confused about FreeBSD's behaviour with respect to locales
and grep. Specifically, case sensitivity does not seem to be handled
consistently when grepping character ranges. It looks to me like 11 and
13 are not behaving consistently; however, I'm unclear why.

# uname -a
FreeBSD 11STABLE 11.4-STABLE FreeBSD 11.4-STABLE #1059 r368289M: Thu Dec 3 01:48:30 UTC 2020 ***@XXX amd64
# export LANG=en_US.ISO8859-1
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
# export LANG=C
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
# export LANG=en_US.UTF-8
# (echo bla; echo Bla) | grep '[A-Z]'
bla
Bla
#

# uname -a
FreeBSD 13STABLE 13.0-STABLE FreeBSD 13.0-STABLE #49 stable/13-n246779-64085efb677-dirty: Mon Aug 16 08:42:53 CEST 2021 ***@XXX amd64
# export LANG=en_US.ISO8859-1
# (echo bla; echo Bla) | grep '[A-Z]'
bla
Bla
# export LANG=C
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
# export LANG=en_US.UTF-8
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
#

For comparison, a Linux RHEL box delivers the expected results:

# uname -a
Linux rhel.local 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
# export LANG=en_US.ISO8859-1
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
# export LANG=C
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
# export LANG=en_US.UTF-8
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
#

There is nothing special in the environment, specifically no LC_xxx nor
MM_CHARSET in either case.

Any guidance is appreciated... Thanks!

Kind regards
Helge

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Eugene Grosbein

2021-08-20 09:18:40 UTC

FreeBSD 11 uses GNU grep by default, but newer versions switched to using bsdgrep as grep.

Helge Oldach

2021-08-20 09:33:52 UTC

Hi Eugene,

Post by Eugene Grosbein

Post by Helge Oldach
I'm confused about FreeBSD's behaviour with respect to locales
and grep. Specifically, case sensitivity does not seem to be handled
consistently when grepping character ranges. It looks to me like 11 and
13 are not behaving consistently; however, I'm unclear why.

FreeBSD 11 uses GNU grep by default, but newer versions switched to using bsdgrep as grep.

Thanks, that might explain the 11 oddity. However 13 is also exposing
strange behaviour (note the ISO8859 case):

# export LANG=en_US.ISO8859-1
# (echo bla; echo Bla) | grep '[A-Z]'
bla
Bla
# export LANG=C
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
# export LANG=en_US.UTF-8
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
#

Kind regards
Helge

parv/freebsd

2021-08-20 09:36:25 UTC

On Thu, Aug 19, 2021 at 11:04 PM Helge Oldach wrote:
...

Post by Helge Oldach
# uname -a
FreeBSD 13STABLE 13.0-STABLE FreeBSD 13.0-STABLE #49
stable/13-n246779-64085efb677-dirty: Mon Aug 16 08:42:53 CEST 2021
# export LANG=en_US.ISO8859-1
# (echo bla; echo Bla) | grep '[A-Z]'
bla
Bla
# export LANG=C
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
# export LANG=en_US.UTF-8
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
#
# uname -a
Linux rhel.local 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST
2019 x86_64 x86_64 x86_64 GNU/Linux
# export LANG=en_US.ISO8859-1
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
# export LANG=C
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
# export LANG=en_US.UTF-8
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
#
There is nothing special in the environment, specifically no LC_xxx nor
MM_CHARSET in either case.
Any guidance is appreciated... Thanks!

Please file a PR, if one does not already exist, about FreeBSD grep(1)
producing unexpected results under some locale(s).

If desired, there are workarounds that avoid FreeBSD grep built with the
base regex(3) library:
- compile base grep with the gnugrep library from ports;

- or use gnugrep (installed as /usr/local/bin/grep ;-<), ack,
the_silver_searcher, among others, until regex(3) is fixed (which does not
look like it will happen by the 13.1 release).

- parv

Stefan Ehmann

2021-08-20 10:04:26 UTC

It's not necessarily a bug, see here:
<https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html>

"[...] that the 2008 standard had changed the definition of ranges, such that
outside the "C" and "POSIX" locales, the meaning of range expressions was
undefined."

Stefan Esser

2021-08-20 12:47:11 UTC

Post by Helge Oldach
Hi all,
I'm confused about FreeBSD's behaviour with respect to locales
and grep. Specifically, case sensitivity does not seem to be handled
consistently when grepping character ranges. It looks to me like 11 and
13 are not behaving consistently; however, I'm unclear why.
# uname -a
# export LANG=en_US.ISO8859-1
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
# export LANG=C
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
# export LANG=en_US.UTF-8
# (echo bla; echo Bla) | grep '[A-Z]'
bla
Bla

This is not unexpected, since the default collating sequence for many UTF-8
locales is to have lower case letters precede their upper case versions in
the sequence, i.e.: "aAbBcC..."

https://developer.mimer.com/services/sql-unicode-collation-charts/

Here is a collation chart for English:

https://download.mimer.com/pub/developer/charts/english.htm

But POSIX makes no guarantees for locales other than POSIX or C.

Post by Helge Oldach
# uname -a
# export LANG=en_US.ISO8859-1
# (echo bla; echo Bla) | grep '[A-Z]'
bla
Bla

This one is unexpected, the upper case should be a range of its own
and should not include any lower case letters.

Post by Helge Oldach
# export LANG=C
# (echo bla; echo Bla) | grep '[A-Z]'
Bla

Correct.

Post by Helge Oldach
# export LANG=en_US.UTF-8
# (echo bla; echo Bla) | grep '[A-Z]'
Bla

Here I had expected the result you got with en_US.ISO8859-1 ...

Post by Helge Oldach
# uname -a
Linux rhel.local 3.10.0-1062.9.1.el7.x86_64 #1 SMP Mon Dec 2 08:31:54 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
# export LANG=en_US.ISO8859-1
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
# export LANG=C
# (echo bla; echo Bla) | grep '[A-Z]'
Bla
# export LANG=en_US.UTF-8
# (echo bla; echo Bla) | grep '[A-Z]'
Bla

Seems that this version uses a POSIX style collating sequence for UTF-8.
It would be interesting to test with ranges that contain accented
characters or German Umlaut characters.
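
A rough sketch of how such a test could be scripted follows. The helper name probe_range and the list of locales are illustrative assumptions, not something taken from this thread. The samples here are plain ASCII; accented samples would have to be passed in the encoding of the locale under test, and a locale that is not installed will likely behave like C, so check `locale -a` first.

```shell
#!/bin/sh
# probe_range LOCALE PATTERN SAMPLE...
# Prints the samples (one per line) that grep matches against PATTERN
# when run under LOCALE. Hypothetical helper, for illustration only.
probe_range() {
    probe_locale=$1
    probe_pattern=$2
    shift 2
    printf '%s\n' "$@" | LC_ALL=$probe_locale grep -e "$probe_pattern"
}

# Compare how each locale treats the range [A-Z].
for loc in C en_US.ISO8859-1 en_US.UTF-8; do
    printf '%s:' "$loc"
    for hit in $(probe_range "$loc" '[A-Z]' bla Bla); do
        printf ' %s' "$hit"
    done
    printf '\n'
done
```

Further samples can be appended after the pattern, e.g. `probe_range de_DE.UTF-8 '[a-z]' o SAMPLE`, where SAMPLE is the accented character typed in that locale's encoding.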

Post by Helge Oldach
There is nothing special in the environment, specifically no LC_xxx nor
MM_CHARSET in either case.

LANG defines LC_COLLATE, unless overridden.
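
The lookup order behind that remark can be sketched in plain sh. This only illustrates the POSIX precedence (LC_ALL first, then LC_COLLATE, then LANG); effective_collate is a made-up helper name, and the final fallback to "C" is an assumption, since POSIX leaves the default locale implementation-defined.

```shell
#!/bin/sh
# effective_collate: print the locale name that would govern collation,
# following the POSIX precedence LC_ALL > LC_COLLATE > LANG.
# Hypothetical helper; empty variables are treated like unset ones.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: the implementation default is the C locale.
        printf 'C\n'
    fi
}

effective_collate
```

On a real system, `locale` (with no arguments) reports the values the C library actually resolved, which is the authoritative check.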

Post by Helge Oldach
Any guidance is appreciated... Thanks!

Definitely a bug in the definition of the collating sequences.

And I have just verified that de_DE.ISO8859-1 wrongly considers "ö"
to be within [a-z], while de_DE.UTF-8 does not (but should).

Seems that the correct collating sequences for ISO8859-1 and UTF-8 are
each assigned to the other one.

Some platforms have switched to the POSIX style collating sequence
to support the traditional use of [A-Z] for [[:upper:]], since a lot of shell
scripts have been written with that assumption for decades.

BTW, character classes work for your examples and more:

# (echo bla; echo Bla) | LANG=en_US.ISO8859-1 grep '[[:upper:]]'
Bla
# (echo bla; echo Bla) | LANG=en_US.UTF-8 grep '[[:upper:]]'
Bla

# (echo "o"; echo "ö") | LANG=de_DE.ISO8859-1 grep '[[:lower:]]'
o
# (echo "o"; echo "ö") | LANG=de_DE.UTF-8 grep '[[:lower:]]'
o
ö

Regards, Stefan

Warner Losh

2021-08-20 15:09:03 UTC

Post by Stefan Esser
But POSIX makes no guarantees for locales other than POSIX or C.

OK, thanks for the explanation. That clarifies a lot for me. Although
it's not really POLA. :-)
Thanks a lot also to Stefan Ehmann for the pointer to gawk oddities.

Post by Stefan Esser

Post by Helge Oldach
# export LANG=en_US.ISO8859-1
# (echo bla; echo Bla) | grep '[A-Z]'
bla
Bla

This one is unexpected, the upper case should be a range of its own
and should not include any lower case letters.

Post by Helge Oldach
# export LANG=en_US.UTF-8
# (echo bla; echo Bla) | grep '[A-Z]'
Bla

Here I had expected the result you got with en_US.ISO8859-1 ...
Definitely a bug in the definition of the collating sequences.
And I have just verified that de_DE.ISO8859-1 wrongly considers "ö"
to be within [a-z], while de_DE.UTF-8 does not (but should).
Seems that the correct collating sequences for ISO8859-1 and UTF-8 are
each assigned to the other one.

PR 257972 raised.

I've looked at that, and I don't think it's a bug, since POSIX says it's
undefined behavior.

Post by Stefan Esser

Post by Helge Oldach
There is nothing special in the environment, specifically no LC_xxx nor
MM_CHARSET in either case.

LANG defines LC_COLLATE, unless overridden.

Indeed. I just explicitly mentioned *no* LC_xxx to clarify that it's not
overridden. :-)

Post by Stefan Esser
BTW, character classes work for your examples and more:

Certainly they do. But they are harder to type... :-)

I think that A-Za-z is undefined, but :letter: is well defined. Most shell
scripts use the 'C' locale for this very reason.
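
A minimal sketch of that habit, assuming a POSIX sh; the function name upper_lines is made up for illustration. Setting LC_ALL on the grep invocation alone leaves the locale of the rest of the script untouched.

```shell
#!/bin/sh
# upper_lines: pass through only those stdin lines that contain an
# ASCII upper case letter. LC_ALL=C applies to this grep call only,
# so [A-Z] is evaluated in the C locale whatever the caller's LANG is.
upper_lines() {
    LC_ALL=C grep '[A-Z]'
}

(echo bla; echo Bla) | upper_lines
```

In the C locale the range covers exactly the ASCII letters A through Z, so the pipeline above prints only "Bla".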

Warner