Hi everyone!
I need some help with string manipulation using sed and regex. I spent a whole day on trial and error and searching the web for a solution, but it’s way beyond my capabilities, so maybe there are some sed/regex gurus here who are willing to give me a helping hand!
From everything I gathered around the web, it seems to be a rather complicated regex and sed substitution. Here we go!
What am I trying to achieve?
I have a lot of markdown guides that I want to host on a self-hosted, Forgejo-based git instance. However, classic markdown links are not the same as the ones on GitHub/Forgejo…
Convert the following string:
[Some text](#Header%20Linking%20MARKDOWN.md)
Into
[Some text](#header-linking-markdown.md)
As you can see, these are the requirements:
- Pattern: []() - Only edit what’s between the parentheses
- Replace space (%20) with -
- Everything as lowercase
- Links are sometimes in nested parentheses
  - e.g. (look here []())
- Do not change a line that begins with https (external links)
While everything is probably a bit complex as a whole, the trickiest part is probably the nested parentheses :/
What I tried
The furthest I got was the following:
sed -Ei 's|\(([^\)]+)\)|\L&|g' test3.md #make everything between parentheses lowercase
sed -i '/https/ ! s/%20/-/g' test3.md #change every %20 occurrence to -
These sed/regex substitutions are what I put together while roaming the web, but they have a lot of flaws and don’t work with nested parentheses. Also, this would change every %20 occurrence in the file, not only those inside links.
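To make those flaws concrete, here is a small sketch that runs the same two substitutions on a single string instead of on test3.md. It assumes GNU sed (the \L escape is a GNU extension), and the sample line is made up for illustration.

```shell
#!/usr/bin/env bash
# Assumes GNU sed (-E for extended regex, \L to lowercase the replacement).
# Hypothetical sample: one link, one stray %20 in prose, one plain parenthetical.
sample='[Some text](#Header%20Linking%20MARKDOWN.md) and plain some%20Text (Not A Link)'

# The same two substitutions as above, piped instead of edited in place.
attempt=$(printf '%s\n' "$sample" \
  | sed -E 's|\(([^\)]+)\)|\L&|g' \
  | sed '/https/ ! s/%20/-/g')

printf '%s\n' "$attempt"
# prints: [Some text](#header-linking-markdown.md) and plain some-Text (not a link)
```

The link comes out as wanted, but the %20 in the prose and the unrelated "(Not A Link)" are changed as well, which is the over-reach described above.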
The closest solution I found on Stack Overflow looks similar, but I wasn’t able to make it fit my needs. My lack of regex/sed understanding makes it impossible for me to adapt it to my requirements.
I would appreciate any help, even if a change of tool is needed. However, I’m more interested in the learning process, so a script or CLI alternative is very much appreciated :) Actually, any help is appreciated :D!
Thanks in advance.
Thank you, thank you very much for taking the time to help me out here! I really appreciate your full breakdown and complete explanation! I haven’t tried it out yet, but skimming through your post I’m sure it will work out!
However, I forgot to mention something:
This is only true for links within the same file. If I link to another file, it looks something like this:
[Why SVT-AV1 over AOM?](readme.md#Why%20SVT-AV1%20over%20AOM?)
I can try to wrap my head around it and find a solution by myself; with your well-written breakdown I’m sure I can try something out. But if you think it will be too complex for my limited knowledge, feel free to adjust :).
Do you mind if I ping you if I’m not able to solve the issue?
Thanks again!!! 👍
Feel free to ping me if you want some help! I’d say I’m intermediate with regex, and I’m happy to help where I can.
Regarding the other file, you could pretty easily modify the command I gave you to adapt it to the example you gave. There are two approaches you could take.
This is focused on the first regex in the command. The second should work unmodified on the other files if they follow the same pattern.
Here’s the original chunk:
s|]\(#.+\)|\L&|
In the new example given, the # is preceded by readme.md. The easy modification is just to insert readme\.md before the # in the expression, adding the \ before the . to escape the metacharacter and match the actual period character, like so:
s|]\(readme\.md#.+\)|\L&|
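As a quick check of the readme\.md variant, the sketch below pipes the two example links from this thread through that expression. It assumes GNU sed, since \L is a GNU extension; the variable names are only for illustration.

```shell
#!/usr/bin/env bash
# Assumes GNU sed: -r enables extended regex, \L lowercases the replacement.
other_file='[Why SVT-AV1 over AOM?](readme.md#Why%20SVT-AV1%20over%20AOM?)'
same_file='[Some text](#Header%20Linking%20MARKDOWN.md)'

# Only links whose target starts with the literal text "readme.md#" are matched.
lowered=$(printf '%s\n' "$other_file" | sed -r 's|]\(readme\.md#.+\)|\L&|')
untouched=$(printf '%s\n' "$same_file" | sed -r 's|]\(readme\.md#.+\)|\L&|')

printf '%s\n' "$lowered"    # prints: [Why SVT-AV1 over AOM?](readme.md#why%20svt-av1%20over%20aom?)
printf '%s\n' "$untouched"  # prints: [Some text](#Header%20Linking%20MARKDOWN.md)
```

The same-file link is left alone here because this variant requires readme.md in front of the #.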
However, if you have other files that have similar, but different patterns, like (faq.md#%20link%20text), and so on, you can make the expression more universal by using the .* metacharacter sequence. This is similar to the .+ metacharacter sequence, with one difference. The + indicates one or more times, while the * indicates zero or more times. So by using .* before the # you can likely use this on all the files if they follow the two pattern examples you gave.
If that will work, this would be the expression:
s|]\(.*#.+\)|\L&|
What this expression does is:
Find a closing bracket followed by an opening parenthesis, followed by any sequence of characters (including no characters at all) up to a pound/hash symbol, followed by one or more characters up to a closing parenthesis, and convert that entire matched string to lowercase. Because .* and .+ are greedy, the match extends to the last closing parenthesis on the line.
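The difference between .* and .+ in front of the #, and the greedy matching, can be seen in a short sketch. It assumes GNU sed; the helper names with_star and with_plus and the two-link sample line are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumes GNU sed: -r enables extended regex, \L lowercases the replacement.
with_star() { sed -r 's|]\(.*#.+\)|\L&|'; }   # zero or more characters before '#'
with_plus() { sed -r 's|]\(.+#.+\)|\L&|'; }   # at least one character before '#'

same_file='[Some text](#Header%20Linking%20MARKDOWN.md)'
other_file='[Why SVT-AV1 over AOM?](readme.md#Why%20SVT-AV1%20over%20AOM?)'
two_links='[A](#One%20X) Mid TEXT [B](#Two%20Y)'   # hypothetical line with two links

star_same=$(printf '%s\n' "$same_file" | with_star)
plus_same=$(printf '%s\n' "$same_file" | with_plus)
star_other=$(printf '%s\n' "$other_file" | with_star)
star_two=$(printf '%s\n' "$two_links" | with_star)

printf '%s\n' "$star_same"   # prints: [Some text](#header%20linking%20markdown.md)
printf '%s\n' "$plus_same"   # prints: [Some text](#Header%20Linking%20MARKDOWN.md)
printf '%s\n' "$star_other"  # prints: [Why SVT-AV1 over AOM?](readme.md#why%20svt-av1%20over%20aom?)
printf '%s\n' "$star_two"    # prints: [A](#one%20x) mid text [b](#two%20y)
```

The .+ variant misses the same-file link because nothing sits between ( and #. The last line shows the greedy side effect: on a line with two links, the text between them is lowercased too.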
And with that modified expression, this would be the full command:
sed -ri 's|]\(.*#.+\)|\L&|; s|%20|-|g' /path/to/somefile
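To try the full command without touching a real guide, a sketch like the following runs it against a throwaway file, using the expression with .* in front of the #. It assumes GNU sed and mktemp are available; the file contents are sample lines, not real data.

```shell
#!/usr/bin/env bash
# Assumes GNU sed (-r, -i, \L) and mktemp; works on a temporary copy only.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
[Some text](#Header%20Linking%20MARKDOWN.md)
[Why SVT-AV1 over AOM?](readme.md#Why%20SVT-AV1%20over%20AOM?)
Plain some%20Text stays mixed case
EOF

# First lowercase the link target, then turn every %20 into a hyphen.
sed -ri 's|]\(.*#.+\)|\L&|; s|%20|-|g' "$tmp"

result=$(cat "$tmp")
rm -f "$tmp"
printf '%s\n' "$result"
# prints:
# [Some text](#header-linking-markdown.md)
# [Why SVT-AV1 over AOM?](readme.md#why-svt-av1-over-aom?)
# Plain some-Text stays mixed case
```

Note that the third line is changed as well: the second substitution replaces %20 anywhere on a line, not only inside links.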
Edit: grammar
Edit 2: added the full modified command.
Haha, we cross-replied!
.* did the trick and removes the need for my additional s|]\(.+#.+\) to include that pattern from my last reply!
Last question:
/https/ ! s|%20|-| changes all occurrences of %20 in the whole file, except on lines that contain https. Is there any way to change only the occurrences that appear in the markdown link pattern []()?
e.g. replace in
[Some text](some%20text.md)
but not in
Hello I'm just some%20place holder text
Thanks again for your easy-to-read and very informative walk-through! 🤩
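One possible way to limit the %20 replacement to link targets is a small sed loop that only replaces a %20 sitting between ]( and the next closing parenthesis, and repeats until none is left. This is only a sketch under assumptions: GNU sed, link targets that contain no ) of their own (so nested parentheses inside a target are not handled), and no special treatment of https links. The helper name fix_links is made up.

```shell
#!/usr/bin/env bash
# Assumes GNU sed with -r. The label/branch pair (:a ... ta) repeats the
# substitution for as long as it keeps finding a %20 inside a link target.
fix_links() {
  sed -r -e ':a' -e 's/(\]\([^)]*)%20/\1-/' -e 'ta'
}

link=$(printf '%s\n' '[Some text](some%20text.md)' | fix_links)
multi=$(printf '%s\n' '[a](x%20y%20z.md)' | fix_links)   # hypothetical link with two %20
prose=$(printf '%s\n' "Hello I'm just some%20place holder text" | fix_links)

printf '%s\n' "$link"   # prints: [Some text](some-text.md)
printf '%s\n' "$multi"  # prints: [a](x-y-z.md)
printf '%s\n' "$prose"  # prints: Hello I'm just some%20place holder text
```

Each pass replaces one %20, so the t command jumps back to the label until the substitution stops matching. Text outside the []() pattern is never touched because the expression has to start at ](.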