Parsing ls is very tricky and should generally be avoided.
That said, I've done it here and there. I've found that it is best to key on the date because its position before the file name is quite reliable.
This answer uses portable POSIX shell but also works in bash, zsh, etc.
Parsing the time
This does get difficult when supporting locales beyond POSIX/C and en_US.UTF-8. I've decided to answer this question on Hard Mode™ by implementing a solution for all locales.
POSIX ls has two locale-specific date formats, which the POSIX ls spec defines as:
The field shall contain the appropriate date and timestamp of when the file was last modified. In the POSIX locale, the field shall be the equivalent of the output of the following date command:
date "+%b %e %H:%M"
if the file has been modified in the last six months, or:
date "+%b %e %Y"
That %b is the locale's "abbreviated month". Let's extract that with locale:
$ locale abmon
Jan;Feb;Mar;Apr;May;Jun;Jul;Aug;Sep;Oct;Nov;Dec
$ LC_TIME=es_MX.UTF-8 locale abmon
ene;feb;mar;abr;may;jun;jul;ago;sep;oct;nov;dic
$ LC_ALL=ga_IE locale abmon
Ean;Feabh;Márta;Aib;Beal;Meith;Iúil;Lún;MFómh;DFómh;Samh;Noll
This demo was run with GNU (libc) locale, which gives us a nice list of the abbreviated months (abmon). I've also shown the Mexican Spanish months and the Irish Gaelic months, keyed by $LC_TIME (which overrides $LANG) and $LC_ALL (which overrides all other POSIX Internationalization Variables). Gaelic offers a good example of something that uses Unicode and has variable width, which means we can't just match the month name with a regex like [A-Z][a-z][a-z].
However, this usage of locale does not work on BSD or other POSIX systems. The best way to get this from BSD is with the verbose locale abmon_1 abmon_2 … abmon_12 (which outputs one abbreviated month per line). There is no way to do this with POSIX locale (and BusyBox doesn't even provide a locale applet), so I'll use date -d as a fallback (warning: that requires GNU or BusyBox date!)
Here's how I extract the current locale's list of abbreviated months as portably as possible:
_set_time_re() {
  local re="$(locale abmon 2>/dev/null |tr ';' '|')"
  if [ -z "$re" ]; then # BSD locale isn't as flexible. Ask for each month:
    abmon12x() { n=0; while [ $((n+=1)) -le 12 ]; do echo abmon_$n; done; }
    re="$(locale $(abmon12x) 2>/dev/null |xargs |tr ' ' '|')"
  fi
  if [ -z "$re" ]; then # embedded systems can lack locale yet support date -d
    re= p= n=1
    while d=$(date -d 99-$n-15 +%b); do re="$re$p$d" p='|' n=$((n+1)); done
  fi 2>/dev/null
  # at this point, $re in the POSIX/C or en_US.UTF-8 locale is:
  # Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec
  re=" ((${re:-[^ 0-9][^ 0-9][^ 0-9]+}) +|[01][0-9]-)" # fallback: non-sp/num
  re="$re""[0-9]?[0-9] +((1[89]|2[01])[0-9][0-9]|[0-2]?[0-9]:[0-5][0-9]) +"
  re="$re| (1[89]|2[01])[0-9][0-9]-[01][0-9]-[0-3][0-9]( [0-2][0-9]:[0-5][0-9]"
  re="$re(:[0-5][0-9]([.][0-9]+ [-+][0-9][0-9][0-9][0-9])?)?)? "
  echo "$re"
}
(This uses four lines because I'm a stickler for 80-column views.)
If we were using bash, zsh, or similar shells, we could simplify the BSD locale query to one line with re="$(locale abmon_{1..12} |xargs |tr ' ' '|')" but I like to stay POSIX-compliant.
I've used ${variable:-fallback} syntax to add a fallback in case we haven't yet obtained an abmon list. This simply says "three or more non-space non-number characters", which at least prevents matching file sizes (which is most commonly the previous column). It will not match e.g. ja_JP.UTF-8, whose abmon values are 1月 to 12月 (spaces and numbers!). Some locales have rather long abbreviations (or don't abbreviate at all). For example, November in Iraqi Arabic (ar_IQ.UTF-8) is تشرين الثاني.
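As a minimal sketch of that fallback behavior: the expansion yields the month list when one was found and the generic pattern otherwise. The helper name month_group is hypothetical and exists only for this illustration.

```sh
# month_group is a hypothetical helper used only for illustration.
# It wraps either the given month list or the generic fallback in parentheses.
month_group() {
  re="$1"
  printf '(%s)\n' "${re:-[^ 0-9][^ 0-9][^ 0-9]+}"
}

month_group 'Jan|Feb|Mar'   # a month list was found, so it is used as-is
month_group ''              # nothing was found, so the fallback is used
```

The first call prints (Jan|Feb|Mar) and the second prints ([^ 0-9][^ 0-9][^ 0-9]+).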
The extra "" prevents zsh from interpreting $re[0-9] as an array reference.
Given en_US.UTF-8 or C or POSIX, the value of $re (full explanation on Regex101.) is now:
((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +|[01][0-9]-)[0-9]?[0-9] +((1[89]|2[01])[0-9][0-9]|[0-2]?[0-9]:[0-5][0-9]) +| (1[89]|2[01])[0-9][0-9]-[01][0-9]-[0-3][0-9]( [0-2][0-9]:[0-5][0-9](:[0-5][0-9]([.][0-9]+ [-+][0-9][0-9][0-9][0-9])?)?)?
This complex regex additionally matches the TIME_STYLE/--time-style options introduced by GNU ls 4.1.1 in 2003¹: full-iso (aka --full-time), iso, locale, and long-iso as well as the +%F %T customization, which simply adds :%S to the end of long-iso. (If you intercept the --time-style option, you can wrap it with a code that marks things, radically simplifying this effort, but I'm not going into that complexity here.) Ignoring other customizations, this becomes:
| TIME_STYLE | Code | strftime |
|---|---|---|
| locale (POSIX, recent, w=11+) | Mmm D HH:MM | +%b %e %H:%M |
| locale (POSIX, 6+mo old, w=11+) | Mmm D YYYY | +%b %e %Y |
| full-iso (--full-time, w=35) | YYYY-MM-DD HH:MM:SS.NNNNNNNNN ±ZZZZ | +%F %T.%N %z |
| long-iso (w=16) | YYYY-MM-DD HH:MM | +%F %H:%M |
| iso (recent, w=11) | MM-DD HH:MM | +%m-%d %H:%M |
| iso (6+mo old, w=11) | YYYY-MM-DD | +%F |
| custom (w=*) | (see man date or man strftime) | +… |
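To sanity-check the en_US/C regex against several of these formats, here is a small sketch. It hard-codes the value of $re shown earlier rather than calling _set_time_re, and name_after_date is a hypothetical helper. The sample lines are made up for the test.

```sh
# The en_US/C value of $re, split across lines for readability.
# Each alternative starts with a space and the whole pattern ends with one.
re=' ((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +|[01][0-9]-)'
re="$re"'[0-9]?[0-9] +((1[89]|2[01])[0-9][0-9]|[0-2]?[0-9]:[0-5][0-9]) +'
re="$re"'| (1[89]|2[01])[0-9][0-9]-[01][0-9]-[0-3][0-9]( [0-2][0-9]:[0-5][0-9]'
re="$re"'(:[0-5][0-9]([.][0-9]+ [-+][0-9][0-9][0-9][0-9])?)?)? '

# name_after_date (hypothetical helper): for each line of ls -l style input,
# print whatever follows the date field.
name_after_date() {
  awk -v re="$re" 'match($0, re) { print substr($0, RSTART + RLENGTH) }'
}

printf '%s\n' \
  'drwxrwxr-x 12 user user 4096 Jul  9 14:48 ./' \
  'drwxrwxr-x  6 user user 4096 Aug 26  2017 git1/' \
  '-rw-rw-r--  1 user user    0 2017-08-26 10:15 long-iso-file' \
  '-rw-rw-r--  1 user user    0 08-26 10:15 iso-file' \
  '-rw-rw-r--  1 user user    0 2017-08-26 10:15:30.123456789 +0200 full-iso-file' |
  name_after_date
```

Each line should come back as just its name column (./, git1/, long-iso-file, iso-file, full-iso-file), which is the property the function below depends on.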
Parsing ls
lg() {
  local re="$(_set_time_re)" color=
  if [ -t 1 ]; then color='--color=always'; fi
  ls -alF $color "$@" |awk -v q="'" -v re="$re" '{
    if (NF == 2) { print; next }
    if (/^[dl].*\/$/) { # directory or link to a directory
      if (!file_pos && match($0, re)) { # calculate this once
        file_pos = RSTART + RLENGTH
      }
      branch = ""
      cmd = substr($0, file_pos, length($0) - file_pos) # the directory name
      if (/^l/) {
        l = sub(/\033\[[0-9:;]*m -> .*/, "", cmd) # remove colored link target
        if (!l) sub(/ -> .*/, "", cmd) # remove uncolored link target
      }
      gsub(/\033\[[0-9:;]*m/, "", cmd) # remove color codes
      gsub(q, q "\"" q "\"" q, cmd) # escape all apostrophes in the directory name
      cmd = sprintf("cd %s && git rev-parse --abbrev-ref HEAD", q cmd q)
      cmd | getline branch
      close(cmd) # release the pipe before the next directory
      if (branch) { $0 = sprintf("%s (%s)", $0, branch) }
    }
    print
  }' 2>/dev/null
}
$ lg
total 48
drwxrwxr-x 12 bytecommander bytecommander 4096 Jul  9 14:48 ./
drwxr-xr-x 74 bytecommander bytecommander 4096 Aug 26  2017 ../
drwxrwxr-x  6 bytecommander bytecommander 4096 Aug 26  2017 git1/ (master)
drwxrwxr-x  7 bytecommander bytecommander 4096 Aug 26  2017 git2/ (develop)
drwxrwxr-x  4 bytecommander bytecommander 4096 Aug 26  2017 no-git/
-rw-rw-r--  1 bytecommander bytecommander    0 Aug 26  2017 regular-file
After defining $re as noted in the two code blocks above this one, there's one more piece of housekeeping to do: color support (colors improve legibility!). Since we're piping through awk, the standard output of ls is a pipe rather than a terminal, and therefore --color=auto won't work. That's okay: this code recreates that same logic by default. Since this function simply hands all of its parameters to ls, you can override that with --color=none. When colors are enabled, this code has to remove the color codes from harvested directory names (not their displayed versions!) in order to run them through git to get the branch name.
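The two color-related pieces can be tried in isolation. This is only a sketch: color_flag and strip_colors are hypothetical helper names, and the colored sample is hand-written rather than captured from ls.

```sh
# color_flag (hypothetical helper): print --color=always only when standard
# output is a terminal, mirroring what --color=auto would have decided.
color_flag() {
  if [ -t 1 ]; then echo '--color=always'; fi
}

# strip_colors (hypothetical helper): remove SGR color sequences such as
# ESC[01;34m and ESC[0m from each line of standard input.
strip_colors() {
  awk '{ gsub(/\033\[[0-9:;]*m/, ""); print }'
}

color_flag | cat                                   # piped, so nothing is printed
printf '\033[01;34mgit1\033[0m/\n' | strip_colors  # prints: git1/
```

Running color_flag directly in an interactive terminal prints --color=always, while piping it (as above) prints nothing. That is the same terminal-versus-pipe distinction described in the paragraph above.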
awk isn't terribly good at external commands that require quotes, so this code stores an apostrophe in q, escapes each apostrophe in the directory name (explanation), and wraps the result in single quotes, which should handle directory names containing spaces (I believe this handles all varieties of characters in directory names except line breaks; don't use those!). Using -v VAR=VALUE is also the best way to pass in our regex (ENVIRON["re"] would require us to export re, which isn't great in a sourced function).
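One portable way to make an apostrophe survive single-quoting is to close the quoted string, add a double-quoted apostrophe, and reopen the string. The sketch below shows that approach under the assumption that awk runs its commands through a POSIX sh. The helper quote_demo is hypothetical and uses printf in place of the git command.

```sh
# quote_demo NAME (hypothetical helper): let awk build a single-quoted shell
# argument from NAME, run a command with it, and print what the shell saw.
quote_demo() {
  printf '%s\n' "$1" | awk -v q="'" '{
    name = $0
    # close the quote, add a double-quoted apostrophe, reopen the quote
    gsub(q, q "\"" q "\"" q, name)
    cmd = "printf %s " q name q
    cmd | getline out
    close(cmd)
    print out
  }'
}

quote_demo "it's a dir"   # prints: it's a dir
quote_demo 'a & b'        # prints: a & b
```

The replacement text contains neither a backslash nor an ampersand, so it avoids the differences in how awk implementations treat those characters in gsub replacements.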
The first condition of our stanza, if (NF == 2) { print; next }, keys on the first line (total 48), simply recognizing it has nothing to parse, printing it, and moving on to the next line of input.
The second condition keys on directories and links to directories (taking advantage of ls -F having added a trailing slash). Since ls columns are all lined up, we only need to calculate file_pos once. It locates the date field and, when found, saves file_pos as the position of the first character after it (RSTART is the position where the match begins and RLENGTH is the length of the match, so their sum is the first character after the match; note that awk is 1-indexed).
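The RSTART + RLENGTH arithmetic is easy to check on a toy string. This sketch uses a simple stand-in pattern instead of the date regex, and after_match is a hypothetical helper.

```sh
# after_match (hypothetical helper): report where the pattern " digits "
# matched and what follows it. awk string positions start at 1.
after_match() {
  awk 'match($0, / [0-9]+ /) {
    printf "RSTART=%d RLENGTH=%d rest=%s\n",
      RSTART, RLENGTH, substr($0, RSTART + RLENGTH)
  }'
}

echo 'abc 123 def' | after_match   # prints: RSTART=4 RLENGTH=5 rest=def
```

The match covers positions 4 through 8 (space, three digits, space), so position 9 is where the remainder begins.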
We set branch to be blank (rather than inheriting its last value), then we go about getting the directory name (everything from file_pos to one before the end of the line, so we can omit the trailing slash from ls -F). If it was a symbolic link, we need to strip off the target. This is pretty easy when we have colors in place since we can key on the color code and then remove the subsequent -> and the rest of the line. The l variable stores the number of substitutions we made (either zero or one), so when it is zero, there are no colors and we have to hope the link's name doesn't itself contain ->.
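The link handling can be exercised separately from ls. In this sketch, link_name is a hypothetical helper that receives only the name column, and the colored sample is hand-written, so real ls output may place its color codes differently.

```sh
# link_name (hypothetical helper): reduce the name column of a symlinked
# directory entry, as printed by ls -F, to the bare link name.
link_name() {
  awk '{
    name = substr($0, 1, length($0) - 1)  # drop the trailing slash from ls -F
    if (!sub(/\033\[[0-9:;]*m -> .*/, "", name)) {  # colored: cut at the code before the arrow
      sub(/ -> .*/, "", name)                       # uncolored: cut at the arrow itself
    }
    gsub(/\033\[[0-9:;]*m/, "", name)     # remove any remaining color codes
    print name
  }'
}

printf '%s\n' 'docs -> /srv/docs/' | link_name
printf '\033[01;36mdocs\033[0m -> \033[01;34m/srv/docs\033[0m/\n' | link_name
```

Both calls print docs. A link whose own name contains " -> " would be cut short in the uncolored case, which is the caveat mentioned above.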
We then remove the color codes and escape apostrophes as noted earlier. Now we set the git command to check the branch (I used the old method over git 2.22's git branch --show-current for compatibility), saving the output in the branch variable. If it's not empty (and not 0), we add a space and the parenthesized branch name to the end of the line.