I am trying to parse a list of command-line arguments (contained in the special "$@" Bash variable) that shall be passed on to the Linux ls binary.

The arguments may optionally contain keyword arguments like -h or --color or --hide 'pattern' and they may optionally contain a directory or file path (absolute /home, relative ./bin or implicitly relative Documents) at any position.

Is there a way to extract only the positional path argument from the list, if any is contained?

If it is not reasonably possible to rule out all edge cases, can we at least achieve the goal and extract the path argument whenever it is given as last argument on the line?

In case it helps for context, I am going to process the path argument further with awk in the next step to modify the output of ls -alF by appending further information to each line (in relation to "show git branch in ls -l" on Ask Ubuntu). Here I have the problem that if I run e.g. ls -alF /home, I have entries about ./ and ../ in the output, which awk will then process as relative paths to the current working directory and not to the specified /home.


The actual script (Bash function) can be found in the linked answer above on Ask Ubuntu. Here is a minimal version that skips calling the external git command and parsing its output:

lg (){ 
    ls -alF "$@" | 
    awk '{
        # Split into columns 1-8 and 9+ (file name):
        match($0, /^(\S+\s+){8}(.+)$/, f);
        # do something with the file name using an external command
        # result is stored in b, e.g. "(master)" or an empty string
        b=f[2];
        print $0, b
    }'
}

Example output (of the original function that includes the external call):

$ lg
total 48 
drwxrwxr-x 12 bytecommander bytecommander 4096 Aug 26 14:48 ./ 
drwxr-xr-x 74 bytecommander bytecommander 4096 Aug 26 15:30 ../ 
drwxrwxr-x  6 bytecommander bytecommander 4096 Aug 26 14:43 git1/ (master)
drwxrwxr-x  7 bytecommander bytecommander 4096 Aug 26 14:42 git2/ (develop)
drwxrwxr-x  4 bytecommander bytecommander 4096 Aug 26 14:45 no-git/ 
-rw-rw-r--  1 bytecommander bytecommander    0 Aug 26 14:42 regular-file 

Now if my current working directory is e.g. git1 from above (which is a repository) and I run it like lg /home, the ./ entry in the output of ls will correctly correspond to /home, but the repository information I add is incorrectly the one of the current working directory, i.e. git1. Same for the parent directory entry ../

12
  • What are you actually trying to do? Working with the output of ls is a bad idea to start with. Commented Aug 26, 2017 at 19:59
  • @chepner I think my final goals are pretty well described in the last paragraph. What I want to achieve is to take the output of ls -alF as it is and extend it by appending some git meta information to all directories listed in the output. I have a working solution for that (see link), which just currently has the problem that the ./.. entries are treated relative to the working directory and not to the optionally specified location argument. And yes, I know that parsing ls is not the best thing to do, but the alternative is to rewrite ls, which I want to do even less. Commented Aug 26, 2017 at 20:03
  • Are you maybe trying to write a script that's called like myscript -alF /home and internally calls ls "$@" and you want to then also inside the script call awk '...' /home? If so that awk call would just be awk '...' "${!#}" in bash - is that what you want? If not then clarify your question with a concrete example script plus sample input and expected output, i.e. a minimal reproducible example. Commented Aug 27, 2017 at 13:24
  • @EdMorton I copied the stuff from my linked Ask Ubuntu answer that contained the code. Does that meet your requirements, or do you need anything else for clarification? Commented Aug 27, 2017 at 13:34
  • @EdMorton Oh, now I see. Thank you for that hint. Commented Aug 27, 2017 at 14:32

4 Answers 4

1

If you are ok with perl then I suggest to use the Getopt::Long package to parse the arguments. Here's a sample:

#!/usr/bin/perl
use Getopt::Long;

my @args = @ARGV;

GetOptions(
  'a'            => \$opt{'a'},
  'all'          => \$opt{'all'},
  'A'            => \$opt{'A'},
  'almost-all'   => \$opt{'almost-all'},
  'author'       => \$opt{'author'},
  'b'            => \$opt{'b'},
  'escape'       => \$opt{'escape'},
  'block-size=s' => \$opt{'block-size'},
  # ...
  'version'      => \$opt{'version'}
)
or die "Invalid options: @args\n";

my @fileargs = @ARGV;
print "ls args: @args\n";
print "file args: @fileargs\n";

For example perl script.pl -a --author file1 file2 will print ls args: -a --author file1 file2 file args: file1 file2. You may add further code to process the file arguments.

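As for the relative path name problem: I suggest running find fileordir -maxdepth 1 instead of ls fileordir (GNU find requires the starting path to come before -maxdepth). Every entry is then printed with fileordir as its prefix, so you get absolute paths whenever fileordir itself is absolute.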

2 Comments

Can I get only the positional path argument and simply ignore anything else, without configuring all possible parameters somewhere first?
Well you could throw everything away that's starting with - since ls knows only arguments in the form -(\w|-\w+(=\w+)?), but wouldn't that be a bit of a hack?
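A minimal sketch of that "throw away everything starting with -" idea, in case it helps. The function name first_path_arg is made up for illustration, and the approach is knowingly incomplete: an option that takes a separate value (e.g. -I PATTERN or -w COLS in GNU ls) would have its value mistaken for the path.

```shell
# first_path_arg (hypothetical helper): print the first argument that does
# not look like an option. Everything after a literal "--" counts as a path.
first_path_arg() {
  end_of_opts=
  for arg in "$@"; do
    if [ -z "$end_of_opts" ]; then
      case "$arg" in
        --)  end_of_opts=1; continue ;;   # end-of-options marker
        -?*) continue ;;                  # looks like an option: skip it
      esac
    fi
    printf '%s\n' "$arg"
    return 0
  done
  return 1   # no positional argument found
}
```

Called as first_path_arg "$@" inside the wrapper, it returns a non-zero status when no path was given, so the caller can fall back to the current directory.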
1

My best guess so far is you're writing a script "myscript" which you want to call as:

myscript -<ls-args> <path>

and inside myscript do:

ls "$@"
awk 'stuff' "${!#}"

If that's not what you're looking for then edit your question to clarify.

2 Comments

And what does "${!#}" do here?
$# is the number of args passed to myscript so ${!#} is the value of the last arg passed to myscript.
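For shells without bash's indirect expansion, the same "last argument" lookup can be sketched as a plain loop. The name last_arg is only illustrative; in bash the result should match "${!#}" whenever at least one argument is given.

```shell
# last_arg (hypothetical helper): print the last positional argument.
# In bash the shorthand for the same value is "${!#}".
last_arg() {
  [ "$#" -gt 0 ] || return 1       # nothing to report without arguments
  for last in "$@"; do :; done     # after the loop, $last holds the final one
  printf '%s\n' "$last"
}
```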
0

My other answer describes a solution that recognizes all six date formats from ls except when GNU ls specifies a custom format. This answer is longer but has no dependencies beyond ls and git, and it doesn't itemize the locale's months; instead, it adds a mark after the date.

This only works well with GNU's ls --time-style=+FORMAT, though it mostly works with BSD's ls -D FORMAT, which does not have a way to distinguish between old (≥ 6mo) and recent (< 6mo) formats (so we assume everything is old). Other ls implementations, including BusyBox, do not work since they do not have a way to specify alternate time formats.

Like my other answer, this is accomplished using POSIX shell without any bashisms, but it also works in bash, zsh, etc.

lg2() {
  local CLICOLOR_FORCE opt next= color= opt_reset=yes n='
' time="${TIME_STYLE:-%b %e  %Y$n%b %e %H:%M}" mark="<$$#${RANDOM}#$$>"
  if [ "$ls_ver" != GNU ] && [ "$ls_ver" != BSD ]; then
     if command ls -ld --time-style=+%j / >/dev/null 2>&1; then ls_ver=GNU
     elif command ls -ldD %j / >/dev/null 2>&1; then ls_ver=BSD
     else echo 'This only supports GNU & BSD `ls`' >&2; return 2
     fi
  fi

  if [ -z "$CLICOLOR_FORCE" ] && [ -t 1 ]; then CLICOLOR_FORCE=1; fi

  for opt in "$@"; do
    if [ "$opt_reset" = yes ]; then set --; opt_reset=; fi
    if [ "$next" = 1 ]; then    # BSD-style -D
      time="$opt" next=
      continue
    fi
    case "$opt_reset$opt$n$ls_ver" in
      --"$n"* )                                 opt_reset=$$; set -- "$@" -- ;;
      --color=auto"$n"* | --colour=auto"$n"* )  : no-op since this is default ;;
      --color=n* | --colour=n* )                CLICOLOR_FORCE= ;;
      --color* | --colour* )                    CLICOLOR_FORCE=1 ;;
      --full-time"$n"BSD )                      time="%F %T %z" ;;
      --full-time"$n"GNU )                      time="%F %T.%N %z" ;;
      --time-style=[pf]*ull-iso"$n"BSD )        time="%F %T %z" ;;
      --time-style=[pf]*ull-iso"$n"GNU )        time="%F %T.%N %z" ;;
      --time-style=+?*"$n"[BG]?? )              time="${opt#*+}" ;;
      -D"$n"BSD )                               next=1 ;;
      -D?*"$n"BSD )                             time="${opt#-D}" ;;
      --time-style=[pl]*ocale"$n"BSD )          time="%b %e %Y" ;;
      --time-style=[pl]*ocale"$n"GNU )          time="%b %e  %Y$n%b %e %H:%M" ;;
      --time-style=[pl]*ong-iso"$n"[BG]?? )     time="%F %H:%M" ;;
      --time-style=[pi]*so"$n"BSD )             time="%F %H:%M" ;;
      --time-style=[pi]*so"$n"GNU )             time="%F $n%m-%d %H:%M" ;;
      * )                                       set -- "$@" "$opt" ;;
    esac
  done

  case "$time@$ls_ver" in
    *"$n"*"$n"*@??? | *"$n"*@BSD )
      echo "Extra line break(s) in time style" >&2; return 2 ;;
    *"$n"* )    time="${time%$n*}$mark$n${time##*$n}$mark" ;;
    * )         time="$time$mark" ;;
  esac

  if [ "$ls_ver" = GNU ]; then
    if [ -n "$CLICOLOR_FORCE" ]; then set -- "$@" --color=always; fi
    set -- "$@" --time-style=+"$time"
  else
    set -- "$@" -D "$time"
  fi

  ls -alF $color "$@" |awk -v q="'" -v mark="$mark" '{
    if (NF == 2) { print; next }
    if (/^[dl].*\/$/) { # directory or link to a directory
      if (!file_pos) {  # calculate this once
        file_pos = match($0, mark) + 1 + RLENGTH
      }
      cmd = substr($0, file_pos, length($0) - file_pos)   # the directory name
      branch = ""
      if (/^l/) {
        l = sub(/\033\[[0-9:;]*m -> .*/, "", cmd) # remove colored link target
        if (!l) sub(/ -> .*/, "", cmd)  # remove uncolored link target
      }
      gsub(/\033\[[0-9:;]*m/, "", cmd)  # remove color codes
      gsub(q, q "\"" q "\"" q, cmd)   # shell-escape each apostrophe in the directory name
      cmd = sprintf("cd %s && git rev-parse --abbrev-ref HEAD", q cmd q)
      cmd | getline branch
      if (branch) { $0 = sprintf("%s (%s)", $0, branch) }
    }
    gsub(mark, "")
    print
  }' 2>/dev/null

}

This populates a global $ls_ver variable to denote the ls implementation (which I don't expect to change within a shell session), saving it either as GNU or BSD depending on what is found. It does this by asking for a directory listing of just the root directory (no children, nice and fast). The GNU form will cause BSD to say ls: illegal option -- - (if we didn't direct that to /dev/null) and exit with code 1. The BSD form will cause GNU to say ls: cannot access '%j': No such file or directory and exit with code 2. BusyBox will complain about either --time-style=+%j or else -D as unrecognized or invalid options, exiting with code 1.

Then we set colors using the BSD convention with the assumption that auto-colors are desired (see my other answer).

The for loop runs through the options, seeking -- (which stops option parsing) or else a color or time format cue. Anything else is passed back to the $@ array, which we clear on the first line of the loop in its first iteration (so we can rebuild it).
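The "clear $@ on the first iteration, then rebuild it" trick works because the for loop expands its word list once, before the body runs. A stripped-down sketch that only drops color options (drop_color_opts is a made-up name, not part of lg2):

```shell
# drop_color_opts (hypothetical helper): rebuild "$@" without --color options
# and print what is left, one argument per line.
drop_color_opts() {
  reset=yes
  for opt in "$@"; do                      # list is expanded before set -- runs
    if [ "$reset" = yes ]; then set --; reset=; fi
    case "$opt" in
      --color* | --colour* ) ;;            # swallow the option
      * ) set -- "$@" "$opt" ;;            # keep everything else, in order
    esac
  done
  if [ "$#" -gt 0 ]; then printf '%s\n' "$@"; fi
}
```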

The case stanza parses options (unless $opt_reset has been set by -- in a prior iteration, at which point nothing will match). I've also loaded the implementation ($ls_ver) in there after a line break, as BSD and GNU have some differences here. Notably, -D means something else in GNU ls, BSD's strftime lacks %N support, and BSD ls -D does not support the old\nrecent syntax needed to mimic POSIX ls, so for BSD we always go with the 6+mo form.

After that, we have a second case to police the use of line breaks (at most 1 for GNU, at most 0 for BSD) and we add marks after each date to harvest later.

GNU needs an argument to trigger its colors. Both have their own way to add the time format.

The rest is pretty much the same as my other answer, with slightly different substr extraction and the extra removal of the mark at the end.
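To illustrate how the mark pins down where the file name starts, here is a small sketch on a canned line. It is an assumption-laden simplification: it uses index() for a literal search rather than the match() call in lg2, and name_after_mark is a hypothetical helper name.

```shell
# name_after_mark (hypothetical helper): $1 is the mark, $2 is one line of
# marked ls output. Prints the name that follows the mark, minus the
# trailing slash added by ls -F.
name_after_mark() {
  printf '%s\n' "$2" | awk -v mark="$1" '{
    pos = index($0, mark)                       # literal search for the mark
    if (!pos) next                              # unmarked line: nothing to do
    name = substr($0, pos + length(mark) + 1)   # skip the mark and one space
    sub(/\/$/, "", name)                        # drop the trailing slash
    print name
  }'
}
```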

0

Parsing ls is very tricky and should generally be avoided.

That said, I've done it here and there. I've found that it is best to key on the date because its position before the file name is quite reliable.

This answer uses portable POSIX shell but also works in bash, zsh, etc.

Parsing the time

This does get difficult when supporting locales beyond POSIX/C and en_US.UTF-8. I've decided to answer this question on Hard Mode™ by implementing a solution for all locales.

POSIX ls has two locale-specific date formats, which the POSIX ls spec defines as:

The field shall contain the appropriate date and timestamp of when the file was last modified. In the POSIX locale, the field shall be the equivalent of the output of the following date command:

date "+%b %e %H:%M"

if the file has been modified in the last six months, or:

date "+%b %e %Y"

That %b is the locale's "abbreviated month". Let's extract that with locale:

$ locale abmon
Jan;Feb;Mar;Apr;May;Jun;Jul;Aug;Sep;Oct;Nov;Dec
$ LC_TIME=es_MX.UTF-8 locale abmon
ene;feb;mar;abr;may;jun;jul;ago;sep;oct;nov;dic
$ LC_ALL=ga_IE locale abmon
Ean;Feabh;Márta;Aib;Beal;Meith;Iúil;Lún;MFómh;DFómh;Samh;Noll

This demo was run with GNU (libc) locale, which gives us a nice list of the abbreviated months (abmon). I've also shown the Mexican Spanish months and the Irish Gaelic months, keyed by $LC_TIME (which overrides $LANG) and $LC_ALL (which overrides all other POSIX Internationalization Variables). Gaelic offers a good example of something that uses Unicode and has variable width, which means we can't just match the month name with a regex like [A-Z][a-z][a-z].

However, this usage of locale does not work on BSD or other POSIX systems. The best way to get this from BSD is with the verbose locale abmon_1 abmon_2 … abmon_12 (which outputs one abbreviated month per line). There is no way to do this with POSIX locale (and BusyBox doesn't even provide a locale applet), so I'll use date -d as a fallback (warning: that requires GNU or BusyBox date!)

Here's how I extract the current locale's list of abbreviated months as portably as possible:

_set_time_re() {
  local re="$(locale abmon 2>/dev/null |tr ';' '|')"
  if [ -z "$re" ]; then  # BSD locale isn't as flexible. Ask for each month:
    abmon12x() { n=0; while [ $((n+=1)) -le 12 ]; do echo abmon_$n; done; }
    re="$(locale $(abmon12x) |xargs |tr ' ' '|')"
  fi
  if [ -z "$re" ]; then  # embedded systems can lack locale yet support date -d
    re= p= n=1
    while d=$(date -d 99-$n-15 +%b); do re="$re$p$d" p='|' n=$((n+1)); done
  fi 2>/dev/null

  # at this point, $re in the POSIX/C or en_US.UTF-8 locale is:
  # Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec

  re=" ((${re:-[^ 0-9][^ 0-9][^ 0-9]+}) +|[01][0-9]-)" # fallback: non-sp/num
  re="$re""[0-9]?[0-9] +((1[89]|2[01])[0-9][0-9]|[0-2]?[0-9]:[0-5][0-9]) +"
  re="$re| (1[89]|2[01])[0-9][0-9]-[01][0-9]-[0-3][0-9]( [0-2][0-9]:[0-5][0-9]"
  re="$re(:[0-5][0-9]([.][0-9]+ [-+][0-9][0-9][0-9][0-9])?)?)? "

  echo "$re"
}

(This uses four lines because I'm a stickler for 80-column views.)

If we were using bash, zsh, or similar shells, we could simplify the BSD locale query to one line with re="$(locale abmon_{1..12} |xargs |tr ' ' '|')" but I like to stay POSIX-compliant.

I've used ${variable:-fallback} syntax to add a fallback in case we haven't yet obtained an abmon list. This simply says "three or more non-space non-number characters", which at least prevents matching file sizes (which is most commonly the previous column). It will not match e.g. ja_JP.UTF-8, whose abmon values are 1月 to 12月 (spaces and numbers!). Some locales have rather long abbreviations (or don't abbreviate at all). For example, November in Iraqi Arabic (ar_IQ.UTF-8) is تشرين الثاني.
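As a quick illustration of that ${variable:-fallback} behavior (month_re is a hypothetical wrapper, used here only to make the two cases visible):

```shell
# month_re (hypothetical helper): print the month alternation passed as $1,
# or the generic "three or more non-space, non-digit characters" pattern
# when $1 is empty or missing.
month_re() {
  re="$1"
  printf '%s\n' "${re:-[^ 0-9][^ 0-9][^ 0-9]+}"
}
```

The :- form treats an empty string the same as an unset variable, which is what we want when every locale query has failed and left $re blank.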

The extra "" prevents zsh from interpreting $re[0-9] as an array reference.

Given en_US.UTF-8 or C or POSIX, the value of $re (full explanation on Regex101.) is now:

 ((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +|[01][0-9]-)[0-9]?[0-9] +((1[89]|2[01])[0-9][0-9]|[0-2]?[0-9]:[0-5][0-9]) +| (1[89]|2[01])[0-9][0-9]-[01][0-9]-[0-3][0-9]( [0-2][0-9]:[0-5][0-9](:[0-5][0-9]([.][0-9]+ [-+][0-9][0-9][0-9][0-9])?)?)? 

This complex regex additionally matches the TIME_STYLE/--time-style options introduced by GNU ls 4.1.1 in 2003¹: full-iso (aka --full-time), iso, locale, and long-iso as well as the +%F %T customization, which simply adds :%S to the end of long-iso. (If you intercept the --time-style option, you can wrap it with code that marks things, radically simplifying this effort, but I'm not going into that complexity here.) Ignoring other customizations, this becomes:

TIME_STYLE                        Code                                      strftime
locale (POSIX, recent, w=11+)     Mmm D HH:MM                               +%b %e %H:%M
locale (POSIX, 6+mo old, w=11+)   Mmm D YYYY                                +%b %e %Y
full-iso (--full-time, w=35)      YYYY-MM-DD HH:MM:SS.NNNNNNNNN (+|-)ZZZZ   +%F %T.%N %z
long-iso (w=16)                   YYYY-MM-DD HH:MM                          +%F %H:%M
iso (recent, w=11)                MM-DD HH:MM                               +%m-%d %H:%M
iso (6+mo old, w=11)              YYYY-MM-DD                                +%F
custom (w=*)                      (see man date or man strftime)            +…

Parsing ls

lg() {
  local re="$(_set_time_re)" color=
  if [ -t 1 ]; then color='--color=always'; fi

  ls -alF $color "$@" |awk -v q="'" -v re="$re" '{
    if (NF == 2) { print; next }
    if (/^[dl].*\/$/) { # directory or link to a directory
      if (!file_pos && match($0, re)) {  # calculate this once
        file_pos = RSTART + RLENGTH
      }
      branch = ""
      cmd = substr($0, file_pos, length($0) - file_pos)   # the directory name
      if (/^l/) {
        l = sub(/\033\[[0-9:;]*m -> .*/, "", cmd) # remove colored link target
        if (!l) sub(/ -> .*/, "", cmd)  # remove uncolored link target
      }
      gsub(/\033\[[0-9:;]*m/, "", cmd)  # remove color codes
      gsub(q, q "\"" q "\"" q, cmd)   # shell-escape each apostrophe in the directory name
      cmd = sprintf("cd %s && git rev-parse --abbrev-ref HEAD", q cmd q)
      cmd | getline branch
      if (branch) { $0 = sprintf("%s (%s)", $0, branch) }
    }
    print
  }' 2>/dev/null
}
$ lg
total 48 
drwxrwxr-x 12 bytecommander bytecommander 4096 Jul  9 14:48 ./
drwxr-xr-x 74 bytecommander bytecommander 4096 Aug 26  2017 ../
drwxrwxr-x  6 bytecommander bytecommander 4096 Aug 26  2017 git1/ (master)
drwxrwxr-x  7 bytecommander bytecommander 4096 Aug 26  2017 git2/ (develop)
drwxrwxr-x  4 bytecommander bytecommander 4096 Aug 26  2017 no-git/
-rw-rw-r--  1 bytecommander bytecommander    0 Aug 26  2017 regular-file

After defining $re as noted in the two code blocks above this one, there's one more piece of housekeeping to do: color support (colors improve legibility!). Since we're piping through awk, the standard output of ls is a pipe rather than a terminal, and therefore --color=auto won't produce colors. That's okay: this code recreates the same logic by checking whether the function's own standard output is a terminal. Since this function simply hands all parameters to ls, you can override that with --color=none. When colors are enabled, this code has to remove the color codes from harvested directory names (not their displayed versions!) in order to run them through git to get the branch name.

awk isn't terribly good at external commands that require quotes, so this code stores an apostrophe in q, escapes each apostrophe in the directory name, and then wraps the name in a pair of q characters. That should handle directory names containing spaces (I believe this handles all varieties of characters in directory names except line breaks; don't use those!). Using -v VAR=VALUE is also the best way to pass in our regex (ENVIRON["re"] would require us to export re, which isn't great in a sourced function).
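One portable way to make an arbitrary name safe inside single quotes is to replace each apostrophe with the sequence apostrophe, double-quoted apostrophe, apostrophe. The sketch below shows that idiom on its own; shell_quote is a hypothetical helper, and the replacement string deliberately contains no backslash or ampersand, since those are special in awk's gsub.

```shell
# shell_quote (hypothetical helper): print $1 wrapped in single quotes so a
# shell can read it back verbatim, even if it contains apostrophes or spaces.
shell_quote() {
  printf '%s\n' "$1" | awk -v q="'" '{
    gsub(q, q "\"" q "\"" q)   # close the quote, add a quoted apostrophe, reopen
    print q $0 q
  }'
}
```

The result can be pasted into a command string such as "cd " followed by the quoted name, which is the situation the awk code is in when it builds cmd.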

The first condition of our stanza, if (NF == 2) { print; next }, keys on the first line (total 48), simply recognizing it has nothing to parse, printing it, and moving on to the next line of input.

The second condition keys on directories and links to directories (taking advantage of ls -F having added a trailing slash). Since ls columns are all lined up, we only need to calculate the file_pos once. It locates the date field and, when found, saves file_pos as the character after the end (RSTART is the beginning of the match and RLENGTH is its length, so their sum is the position of the first character after the match. Note that awk is 1-indexed).
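Here is the RSTART + RLENGTH arithmetic on its own, as a sketch. The regex is a deliberately simplified stand-in that only covers the two POSIX-locale date forms with English month abbreviations (it is not the full $re from above), and after_date is a made-up name.

```shell
# after_date (hypothetical helper): print whatever follows the date column
# in one line of "ls -l" output passed as $1.
after_date() {
  printf '%s\n' "$1" | awk \
    -v re=' [A-Z][a-z][a-z] +[0-9]?[0-9] +([0-9][0-9][0-9][0-9]|[0-9]?[0-9]:[0-9][0-9]) +' '{
    if (match($0, re)) {
      file_pos = RSTART + RLENGTH    # first character after the date field
      print substr($0, file_pos)
    }
  }'
}
```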

We set branch to be blank (rather than inheriting its last value), then we go about getting the directory name (everything from file_pos to one before the end of the line, so we can omit the trailing slash from ls -F). If it was a symbolic link, we need to pull out the target. This is pretty easy when we have colors in place since we can key on the color code and then remove the subsequent -> and the rest of the line. The l variable stores the number of substitutions we made (either zero or one), so when it was zero, there are no colors and we have to hope the link's name doesn't contain -> in it.

We then remove the color codes and escape apostrophes as noted earlier. Now we set the git command to check the branch (I used the old method over git 2.22's git branch --show-current for compatibility), saving the output in the branch variable. If it's not empty (and not 0), we add a space and the parenthesized branch name to the end of the line.
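The branch lookup itself can be tried outside awk. This is only a sketch of the same cd-then-git command; branch_of is a hypothetical name, and it prints nothing when the directory is missing or is not inside a repository.

```shell
# branch_of (hypothetical helper): print the current git branch of directory
# $1, or nothing if the lookup fails.
branch_of() {
  # The subshell keeps the caller's working directory untouched.
  ( cd "$1" 2>/dev/null && git rev-parse --abbrev-ref HEAD 2>/dev/null )
}
```

Because the directory is passed explicitly, a call such as branch_of /home/./ is resolved against /home rather than against the current working directory, which is the behavior the question is after for the ./ and ../ entries.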
