Linux: SED Deep-dive
I didn't know about sed, did you?
intro
i consider myself good with linux…
I mean - I have my Linux+ cert, my daily driver is Arch (obligatory: i use arch btw), I manage a handful of servers that use different flavors of Debian, and I work in IT so I’m surrounded by Linux…
I think I use Linux more than Windows and my work gave me a Windows laptop.
but i don’t know much about sed
So let’s change that. sed is one of those commands that everyone seems to use but no one really knows how to use. It’s one of those elusive commands that sits somewhere between “basic” and “power user”.
Today, I am going to do a deep dive into sed and how it works and, hopefully, learn something along the way.
there are two versions…
… of sed. There is sed(1) and sed(1p).
For those who aren’t aware, the number in the man page entry is like a chapter. (1) just means that it’s a “User Command” [1].
sed(1) is the GNU version of sed and is primarily what I’ll be focusing on here [2].
sed(1p) is the POSIX version, which will mostly be left out [3]. The GNU version is just an extension of the POSIX version in most cases, and I’ll try to point out when something is GNU-only.
the basics
what is sed?
Well, sed is short for stream editor. It’s just a fancy search-and-replace. It’s pretty simple on the surface, really; you shove stuff in the input stream and it just edits them and spits them out [2, Sec. “Description”].
let’s look at the syntax
Just a quick note: all long-form flags (--flags) are GNU-only.
sed’s syntax is simple enough: it’s just sed [options] '[script]' [input file] [4, Sec. 2.2]. If no input file is provided, it’ll read from STDIN. Linux commands love their flags and sed is no exception [2, Sec. “Description”; 4, Sec. 2.2]:
- -n, --quiet, --silent: suppresses the automatic printing of the pattern space at the end of every cycle.
- --debug: prints the script in canonical form and annotates the program execution.
- -e [script], --expression=[script]: adds this script to the commands that are to be executed.
- -f [script file], --file=[script file]: adds the commands in the file to the commands that are to be executed.
- -i[suffix], --in-place[=suffix]: edits the files in-place. If a suffix is provided, it’ll make a backup first. This is a GNU/BSD extension, not POSIX, but in BSD a suffix argument is required (BSD’s -i '' is the equivalent of GNU’s bare -i).
- --follow-symlinks: only has an effect when used with -i. If the file being edited is a symlink, it will follow the symlink and edit the final destination in-place. The default behavior is to break the symlink and edit the provided link in-place, leaving the final destination alone.
- -l [N], --line-length=[N]: changes the number of characters before the l command forces a line to wrap. A value of 0 means to never wrap long lines. The default value is 70. This flag is GNU-only; POSIX can still use sed’s l command.
- --posix: disables all of the GNU features to allow for portable scripts to work on GNU and POSIX systems.
- -E, -r, --regexp-extended: use extended regular expressions in scripts. For portable scripts, use -E to ensure compatibility with POSIX.
- -s, --separate: treats each input file separately instead of as one long continuous stream. This is GNU-only.
- --sandbox: disables the e, r, and w commands (not to be confused with the flags/options). e executes a shell command, r appends the text read from a file, w writes the pattern space to a file. This is GNU-only.
- -u, --unbuffered: loads as little data as possible and flushes output buffers more often. Use this when the input stream is continuous, like a log file that is still being written. This is GNU-only.
- -z, --null-data, --zero-terminated: treats the input as a set of lines that each end with a NUL byte instead of a newline, and writes the output the same way. This is GNU-only.
- --help: HEEEEELLLLLLPPPPPPP MEEEEEEEEEEEEEE
- --version: hopefully the largest number possible… if not, you need to update your packages.
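To make a few of those flags less abstract, here is a small sketch. It assumes GNU sed and a scratch file made with mktemp; the sample words and the variable names are made up for illustration.

```shell
# Scratch file so nothing real gets edited (hypothetical sample data).
tmp=$(mktemp)
printf '%s\n' alpha beta gamma > "$tmp"

# -n turns off the automatic print, so only the explicit p command prints.
only_second=$(sed -n '2p' "$tmp")

# Two -e scripts are joined into one script and run in order.
two_scripts=$(sed -e 's/alpha/ALPHA/' -e 's/gamma/GAMMA/' "$tmp")

# -E switches to extended regex, so + and () need no backslashes.
extended=$(sed -E 's/(al)+pha/X/' "$tmp")

# -i.bak edits the file in place and keeps the original as "$tmp.bak".
sed -i.bak 's/beta/BETA/' "$tmp"
edited=$(cat "$tmp")
backup=$(cat "$tmp.bak")

printf '%s\n' "$only_second" "$two_scripts" "$extended" "$edited"
rm -f "$tmp" "$tmp.bak"
```

Attaching the suffix directly to -i (as in -i.bak) is the form that should behave the same on GNU and BSD.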
script/command syntax
You know that lovely little block in the syntax that’s inside of single-quotes? The one that looks like '[script]' or '[commands]', depending on the source?
That’s where the magic happens…
But first, two rules to keep in mind [4, Sec. 2.1 Sec. 3.1]:
- All commands/scripts are concatenated. Let’s say you have a bunch of -e or -f options: starting from the first flag on the left, each script is added to the END of the final list of commands to run.
- All commands are run in order from top to bottom, even if the same or similar command is present multiple times.
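A quick way to see why the order matters: here is a tiny sketch (the cat/dog/bird words are just placeholder data) where the same two substitutions give different answers depending on which -e comes first.

```shell
# Each command sees the result of the one before it.
first_order=$(echo 'cat' | sed -e 's/cat/dog/' -e 's/dog/bird/')
second_order=$(echo 'cat' | sed -e 's/dog/bird/' -e 's/cat/dog/')

echo "$first_order"    # cat -> dog -> bird
echo "$second_order"   # nothing to turn into bird yet, then cat -> dog
```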
Just as a quick overview: it looks like '[addr][command][option]'.
- [addr] is the line address. It’s just a line number or lines that match a specific regex. No biggy… /s
- [command] is a single-character sed command. For example, d is the delete command.
- [option]: some commands take an extra option.
So, if we wanted to append some text after the 5th line we would have:
- [addr] would be 5
- [command] would be a
- [option] is the text to add, let’s say “ImHereNow”.
That makes our sed command: '5a ImHereNow'…
And that makes the full CLI command: sed -i '5a ImHereNow' input.txt.
So, this is input.txt before the command:
1
2
3
4
5
6
7
We used the -i flag to modify the file in-place. Notice there’s no -n here: -n suppresses the automatic printing of every line, so pairing it with -i would leave nothing in the file except the appended text…
This is input.txt now:
1
2
3
4
5
ImHereNow
6
7
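If you want to try that without touching a real file, here is a minimal sketch. It assumes GNU sed (the one-line a form and a bare -i are GNU behavior) and uses a throwaway mktemp file in place of input.txt. It also shows what -n does to this command, since -n only lets explicitly produced output through.

```shell
# Scratch copy of the example file (tmp is just a name for this sketch).
tmp=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 > "$tmp"

# Without -i, sed only prints the result and leaves the file alone.
preview=$(sed '5a ImHereNow' "$tmp")

# With -n, the normal lines are never printed, so only the appended text is left.
with_n=$(sed -n '5a ImHereNow' "$tmp")

# -i writes the result back into the file.
sed -i '5a ImHereNow' "$tmp"
result=$(cat "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Running the command without -i first is a cheap way to preview an edit before committing to it.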
a closer look at sed commands
line addresses
Since sed commands can start with a line address, it’s only natural we optionally cover line addresses first…
Commands can start with an [addr]. This is the line number(s) that the command applies to [4, Sec 3.1 Sec. 4]. It can be:
- a single value (e.g. '30d', which executes the d command on line 30)
- a range (e.g. '30,35d', for lines 30 through 35, inclusive)
- a line matching a regex (e.g. '/^foo/d', which operates on lines starting with the string foo).
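Here is a small sketch of those three address types side by side. The six input words are made up, and d is used for all three so the effect is easy to spot.

```shell
# Hypothetical six-line input.
input=$(printf '%s\n' one two foo four foobar six)

# Single line number: delete line 2.
single=$(printf '%s\n' "$input" | sed '2d')

# Range: delete lines 2 through 4, inclusive.
range=$(printf '%s\n' "$input" | sed '2,4d')

# Regex: delete every line that starts with foo.
regex=$(printf '%s\n' "$input" | sed '/^foo/d')

printf '%s\n---\n' "$single" "$range" "$regex"
```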
Here are some fun facts [4, Sec 4.4]:
- If you use an empty regex for the line address (e.g. //d), it will reuse whatever the last regex was.
- When you specify a range, it is a beginning line addr and an ending line addr… and either one can be a regex, e.g. '4,/^foo/d', which runs from line 4 through the next line that begins with foo. A range that ends on a regex can never be a single line: the end regex is only checked starting on the line after the range begins, so if line 4 itself starts with foo, the range keeps going until it finds another line that starts with foo.
- If the range you specify ends on a line number that is backwards (e.g. '4,2d') or only contains one line (e.g. '4,4d'), then it will only operate on that first line (line 4).
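Those edge cases are easier to believe when you see them, so here is a short sketch. The seven-line input is made up and deliberately has foo on lines 4 and 6.

```shell
# Hypothetical input: foo sits on line 4 and again on line 6.
lines=$(printf '%s\n' a b c foo e foo g)

# The end regex is only checked from line 5 onward, so the range is lines 4-6.
mixed=$(printf '%s\n' "$lines" | sed '4,/^foo/d')

# A backwards range collapses to its first line, so only line 4 is deleted.
backwards=$(printf '%s\n' "$lines" | sed '4,2d')

# The empty regex in s// reuses the last regex, which is ^foo from the address.
reuse=$(printf '%s\n' "$lines" | sed '/^foo/s//bar/')

printf '%s\n---\n' "$mixed" "$backwards" "$reuse"
```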
Here are some GNU extensions that those POSIX-losers don’t get [4, Sec 4.2, Sec 4.4 Sec. 4.5 Sec. 7.9]:
- 0 can be used to start a range if you want to end on a regex match that might be on line 1. For example: '0,/^foo/d' will start at the beginning of the file and if the very first line starts with foo, it’ll end there.
- The range can do math. If you specify a range like '4,+3d', then it will operate on lines 4, 5, 6, 7… and that works for regex as well: '/^foo/,+2d', which will start on the line that begins with foo and operate on two additional lines.
- The range can do more math. This one is a bit goofy, not gonna lie. GNU might have lost the plot here. If you specify a range like '7,~5d', it will start at line 7 and end at the next line number that’s a multiple of 5… so lines 7, 8, 9, 10… Oh and the beginning can be regex as well: '/^foo/,~13d'
- You can specify a line-step selection. If you use the line address '2~3d', the d command will operate on lines 2, 5, 8, 11… Basically, first~step means every step-th line, starting at line first. 0 also works here: '0~2d' would be lines 2, 4, 6, 8…
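Here is a sketch of those GNU-only address forms. It assumes GNU sed plus the seq and paste utilities, and one_line is just a hypothetical helper to squash the output onto a single line.

```shell
# Helper (made up for this sketch): join all lines with spaces.
one_line() { paste -sd ' ' -; }

# addr,+N : line 4 plus the next three lines.
plus=$(seq 1 12 | sed -n '4,+3p' | one_line)

# addr,~N : from line 7 up to the next multiple of 5.
multiple=$(seq 1 12 | sed -n '7,~5p' | one_line)

# first~step : every 3rd line starting at line 2.
step=$(seq 1 12 | sed -n '2~3p' | one_line)

# 0~step : behaves as if first were equal to step.
zero_step=$(seq 1 12 | sed -n '0~4p' | one_line)

# 0,/re/ can end on line 1; 1,/re/ has to look for the regex from line 2 on.
zero_start=$(printf '%s\n' foo bar foo baz | sed '0,/^foo/d' | one_line)
one_start=$(printf '%s\n' foo bar foo baz | sed '1,/^foo/d' | one_line)

printf '%s\n' "$plus" "$multiple" "$step" "$zero_step" "$zero_start" "$one_start"
```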
Just because we mentioned 0, let’s cover the only other time you’ll see it [4, Sec. 4.5 Sec. 7.9]:
- With the command r, which queues the contents of a file to be written to the output. Meaning if you do 3r test.txt, the contents of test.txt will be output directly after the 3rd line… 0r test.txt will prepend test.txt to the beginning of the output, before anything else, including line 1.
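A minimal sketch of r with an ordinary line address, using a throwaway mktemp file in place of test.txt. The 0r form is left out on purpose: as far as I can tell it depends on how recent your GNU sed is, so treat it as something to check on your own system.

```shell
# Throwaway file standing in for test.txt.
extra=$(mktemp)
printf 'inserted\n' > "$extra"

# The file contents are written out right after line 3.
out=$(printf '%s\n' 1 2 3 4 | sed "3r $extra")

printf '%s\n' "$out"
rm -f "$extra"
```

The script is in double quotes here so the shell can expand the file name; everything after r up to the end of the line is taken as the file name.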
deep breath now, the actual command part…
All of that stuff about line addresses is optional… Just a little prefix. They aren’t even the commands. So, let’s actually talk about the powerful part of sed.
let’s start with the number of lines they can manage…
Not every command can take multiple line addresses, so let’s break down which commands can operate on which addresses [2, Sec. “Command Synopsis”]:
Zero-address commands
These commands don’t take an address parameter and it doesn’t really make sense in their context.
- : [label]: this is a “go-to” marker and is used by the b, t, and T commands.
- # [text]: this is a line comment so we can annotate what different commands/parts of a script do.
- }: oddly enough, the man page lists this as a command. It’s just a closing bracket for a block of commands.
Zero- or One- address commands
This group can take either zero or one address. They just can’t take a range.
- =: it just prints the current line number to the output.
- a [\] [text]: appends text after the current line. Any embedded newlines in the text must be preceded by a backslash. POSIX requires a backslash followed by a literal line break right after the a, so your script would take up that extra line. GNU lets the text start on the same line as the command, making your script a little neater.
- i [\] [text]: inserts text before the current line. The same backslash rules apply as for a: POSIX wants the backslash and line break, GNU lets you keep it all on one line.
- q [code]: immediately exits the sed script. If not suppressed, it will print the pattern space first, then return the exit code. POSIX doesn’t support an exit code.
- Q [code]: immediately exits the sed script without printing and returns the exit code. This is GNU-only.
- r [file]: appends the contents of a file to the output.
- R [file]: appends a single line from a file to the output; calling it again will continue with the next line from the file. This is GNU-only.
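Here is a small sketch of a few of these in action. It assumes GNU sed (for the one-line a and i forms and for the exit code on q), and the three colors are just sample data.

```shell
data=$(printf '%s\n' red green blue)

# = prints the line number of the matching line.
numbered=$(printf '%s\n' "$data" | sed -n '/green/=')

# a adds text after line 2, i adds text before it.
appended=$(printf '%s\n' "$data" | sed '2a after green')
inserted=$(printf '%s\n' "$data" | sed '2i before green')

# q stops after line 2; lines 1 and 2 are still printed.
quit_early=$(printf '%s\n' "$data" | sed '2q')

# GNU lets q carry an exit code, captured here without aborting the script.
status=0
printf '%s\n' "$data" | sed -n '2q5' || status=$?

printf '%s\n' "$numbered" "$appended" "$inserted" "$quit_early" "$status"
```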
Multi-address commands
And finally, here are the commands that can take a range of addresses…
- {: this is for a command block. You can specify some addresses and apply them to the whole block… also you need to end the block with } as mentioned above.
- b [label]: branch to label. If you omit the label, it just branches to the end of the script.
- c [\] [text]: replaces the selected lines with text. Just like a and i, any embedded newlines must be preceded by a backslash, and the same POSIX vs. GNU difference applies: POSIX requires a backslash followed by a literal line break, GNU lets the text start on the same line.
- d: deletes the pattern space and starts the next cycle.
- D: if the pattern space contains no newline, it starts a normal new cycle just like the d command. Otherwise, it deletes text in the pattern space up to the first newline, and restarts the cycle with the resulting pattern space (without reading a new line of input).
- h, H: copy the pattern space to the hold space. Use h to replace what’s stored, use H to append to the end of what’s stored. The pattern space is replaced with the next input line every cycle, but the hold space survives, so this is like putting the result in your pocket to save it across cycles.
- g, G: copy the hold space to the pattern space. Use g to replace what is in the pattern space, use G to append to the end of the pattern space. This is how you recall a result from the hold space… like taking it out of your pocket.
- l: lists out the current line in a “visually unambiguous” form.
- l [width]: same as above, but breaks it at width characters. This is a GNU extension.
- n: prints the pattern space (if auto-print is not suppressed), then replaces whatever is in the pattern space with the next line from the input stream. If there is no more input, sed will exit.
- N: adds a newline to the pattern space, then appends the next line from the input stream to the pattern space. If there is no more input, sed will exit.
- p: prints the current pattern space.
- P: prints up to the first embedded newline of the current pattern space.
- s/regexp/replacement/: attempts to match regexp against the pattern space and replace it with replacement. You can also use other delimiters, like | and _. We’re gonna look at this command a bit deeper later…
- t [label]: if a substitute (s///) was successful since the last input line was read (or since the last t or T command), branch to label. No label means branch to the end. This is like if true:
- T [label]: the opposite of t. If no substitution was successful, branch to label. This is like if false:. This is GNU-only.
- w [filename]: writes the current pattern space to filename.
- W [filename]: writes just the first line of the current pattern space to filename. This is GNU-only.
- x: swaps the contents of the hold and pattern spaces.
- y/source/dest/: transliterates the characters in the pattern space from source to the corresponding character in dest. Basically y/0123/abcd/ is a search-and-replace specifically for 0=a, 1=b, 2=c, 3=d.
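That is a lot of single letters, so here is a sketch that exercises a handful of them. It assumes GNU sed for the one-line c form; the inputs are made up, and the label name again is arbitrary.

```shell
# N: pull the next line in, then turn the embedded newline into a comma.
joined=$(printf '%s\n' a b c d | sed 'N;s/\n/,/')

# G and h: keep stacking lines in the hold space, print on the last line.
reversed=$(printf '%s\n' 1 2 3 | sed -n '1!G;h;$p')

# y: character-for-character transliteration.
translit=$(echo '0123' | sed 'y/0123/abcd/')

# : and t: loop back to the label for as long as the substitution succeeds.
looped=$(echo 'a---b' | sed -e ':again' -e 's/--/-/' -e 't again')

# { }: apply several commands to the same address range.
block=$(printf '%s\n' 1 2 3 4 | sed -n '2,3{s/^/> /;p;}')

# c: replace line 2 with new text.
changed=$(printf '%s\n' 1 2 3 | sed '2c replaced')

printf '%s\n---\n' "$joined" "$reversed" "$translit" "$looped" "$block" "$changed"
```

The hold-space one-liner is the classic "poor man's tac": every line gets the previously saved lines appended after it, and the whole pile is saved again.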
let’s talk about s
There is one sed command that is insanely powerful and equally complex… s.
Most of this info is described in [4, Sec 3.3] as well.
So, as I mentioned above, the basic syntax for the ‘substitute’ command is 's/regexp/replacement/'. The first field is your pattern to match and whatever is matched is replaced with the second field. That syntax can be delimited with /, |, _, or any other single-byte character, allowing a lot of flexibility depending on the data you are working with. One example is if you’re looking for specific file paths: with the -E flag (so that + means “one or more”), your search can be 's|/home/[^/]+/Downloads/.*\.txt|replacement|'.
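To show why the alternate delimiters are worth it, here is a short sketch with a made-up path (alice and /srv/files are placeholders). The first two commands do exactly the same thing; the third runs the file-path pattern with -E so the + works as a quantifier.

```shell
path='/home/alice/Downloads/notes.txt'

# With / as the delimiter, every slash in the pattern has to be escaped.
slashes=$(echo "$path" | sed 's/\/home\/alice/\/srv\/files/')

# With | as the delimiter, the same substitution reads cleanly.
pipes=$(echo "$path" | sed 's|/home/alice|/srv/files|')

# -E makes + mean "one or more" without a backslash.
generic=$(echo "$path" | sed -E 's|/home/[^/]+/Downloads/.*\.txt|replacement|')

printf '%s\n' "$slashes" "$pipes" "$generic"
```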
Search Flags
The first thing we should cover is the search flags that s supports. I’ve been saying the basic syntax is 's/regexp/replacement/' but it is missing the flag, which might look like 's/regexp/replacement/g'. That g is a search flag and you can have zero or more of them.
- g: stands for “global” and it means it will replace every match it finds.
- [n]: this would only replace the nth match. So 3 will look for the third match and only replace that one.
- p: if there was a substitution made, it’ll print the new pattern space.
- e: this allows the output of a shell command to be piped into the pattern space. If a substitution was made, the command that is now in the pattern space is executed and the pattern space is replaced with its output. If the command to be executed contains a NUL character, then the results are undefined. This is a GNU extension.
- w [filename]: if a substitution was made, write the result to a named file. In GNU systems, you can also give it /dev/stderr, which writes to standard error, and /dev/stdout, which writes to standard out.
- I, i: this makes the regex match case-insensitive. This is also just for GNU users.
- M, m: this flag allows the regex to match in “multiline” mode, instead of just single-line mode… Also GNU-only.
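Here is a small sketch comparing a few of those flags on the same input. The cat/dog line is made up, and the I flag assumes GNU sed.

```shell
line='cat Cat cat cat'

# No flag: only the first match is replaced.
first_only=$(echo "$line" | sed 's/cat/dog/')

# g: every (case-sensitive) match is replaced.
global=$(echo "$line" | sed 's/cat/dog/g')

# 3: only the third match is replaced; "Cat" does not count as a match.
third=$(echo "$line" | sed 's/cat/dog/3')

# g plus I (GNU): every match, ignoring case.
nocase=$(echo "$line" | sed 's/cat/dog/gI')

# p together with -n: print only the lines where a substitution happened.
printed=$(printf '%s\n' apple banana | sed -n 's/an/AN/p')

printf '%s\n' "$first_only" "$global" "$third" "$nocase" "$printed"
```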
So, there are some caveats to some of those search flags, so we should touch on those now:
The first thing worth mentioning is that every flag applies its effect once and the order doesn’t really matter, except for p and e, specifically. When you use ep, it will make the substitution, execute the resulting command, replace the pattern space with the command’s output, and THEN print the new pattern space - this is what most people are expecting. When you use pe, it will make the substitution, print the pattern space (the command that is about to run), then execute the command and store the output in the pattern space. Following that logic, you can probably see how pe is useful for testing and how you might use pep.
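Here is a cautious sketch of the e flag, assuming GNU sed with command execution available (so no --sandbox). The input line is itself a tiny shell command, which is what gets run after the substitution. Only e and ep are shown, since those are the combinations I am confident about.

```shell
# The line "echo hello" becomes "echo world", which is then executed;
# its output replaces the pattern space and gets auto-printed.
evaluated=$(echo 'echo hello' | sed 's/hello/world/e')

# With -n and ep: substitute, execute, then print the command's output.
eval_then_print=$(echo 'echo hello' | sed -n 's/hello/world/ep')

printf '%s\n' "$evaluated" "$eval_then_print"
```

Because e runs whatever ends up in the pattern space, it is worth being very sure about your input before pointing it at real data.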
The second thing I want to mention is a few rules for the m/M flag:
- In addition to their normal behavior, ^ and $ also match the empty string after a newline and the empty string before a newline, respectively.
- \` and \' will strictly match the beginning and end of the entire pattern space (ignoring embedded newlines).
- In this mode, the . doesn’t match newlines.
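The multiline rules only matter once the pattern space actually holds more than one line, so this sketch (GNU sed assumed) uses N to load two lines first. The a/b input is just sample data.

```shell
# After N, the pattern space is "a", a newline, then "b".

# Without M, ^ only matches at the very start, so "b" is untouched.
without_m=$(printf 'a\nb\n' | sed 'N;s/^b/X/')

# With M, ^ also matches right after the embedded newline.
with_m=$(printf 'a\nb\n' | sed 'N;s/^b/X/M')

# With M and g, ^ matches at the start of both lines...
every_line=$(printf 'a\nb\n' | sed 'N;s/^[ab]/X/gM')

# ...while \` still only matches the start of the whole pattern space.
buffer_start=$(printf 'a\nb\n' | sed 'N;s/\`[ab]/X/gM')

printf '%s\n---\n' "$without_m" "$with_m" "$every_line" "$buffer_start"
```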
Capture Groups
Just like other powerful regex-search-and-replace tools, you can capture portions of the search and paste them in the replacement. sed uses \( and \) to capture (or just () if you use the -E flag) and \1 through \9 to paste. There is also a special symbol (&) that refers to the entire matched portion of the pattern space.
For example, if you are looking at a list of API endpoints, like:
http://10.37.45.2/api/v1/get/data
https://10.22.120.73/users/online
https://192.168.11.250/report
http://172.20.19.50/upload/pics
And you wanted to just return a list of the IPs, you could use something like:
sed -nE 's|^https?://([0-9.]{7,15})/[[:alnum:]_/]+$|"The IP is \1"|ip' /path/to/api_list.txt
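As a smaller companion to that command, here is a sketch using two of the endpoints from the list above. It shows two numbered groups being swapped around, plus & standing in for the whole match; the variable names are made up.

```shell
urls='http://10.37.45.2/api/v1/get/data
https://192.168.11.250/report'

# \1 is the protocol, \2 is the IP; the replacement reorders them.
swapped=$(printf '%s\n' "$urls" | sed -E 's|^(https?)://([0-9.]+)/.*$|\2 uses \1|')

# & is whatever the regex matched, here the first run of digits and dots.
wrapped=$(printf '%s\n' "$urls" | sed -E 's|[0-9.]+|<&>|')

printf '%s\n' "$swapped" "$wrapped"
```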
Modify output with special sequences
Also like other powerful regex-search-and-replace tools, you can also modify the output of the capture groups:
- \L: all following characters in the current replacement will be lowercase until a \U or a \E is found (or the replacement ends).
- \U: all following characters in the current replacement will be uppercase until a \L or a \E is found (or the replacement ends).
- \E: stops the case conversion that was started by \L or \U.
- \l: makes the next character lowercase.
- \u: makes the next character uppercase.
When you use the g flag to search-and-replace multiple occurrences, \L and \U do NOT carry over from one match to the next - once the replacement for the current match is done, the case conversion “resets”.
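Here is a sketch of the case-conversion sequences, assuming GNU sed since they are GNU extensions. The inputs are made up; the last command is the a-b- example from the GNU manual, which shows that a conversion started in one match does not leak into the next one.

```shell
# \U...\E uppercases the first word, \u uppercases one letter of the second.
upper_first=$(echo 'hello world' | sed -E 's/([a-z]+) ([a-z]+)/\U\1\E \u\2/')

# With g, \u is applied fresh to every match.
title=$(echo 'foo bar baz' | sed -E 's/[a-z]+/\u&/g')

# \L lowercases everything that follows in the replacement.
lower=$(echo 'SHOUT' | sed 's/.*/\L&/')

# First match: \1 is empty, so \u has nothing to change and does not
# carry over to the x that the second match adds.
no_carry=$(echo 'a-b-' | sed 's/\(b\?\)-/x\u\1/g')

printf '%s\n' "$upper_first" "$title" "$lower" "$no_carry"
```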