Hi everyone !

I’m in need of some assistance with string manipulation using sed and regex. I spent a whole day on trial & error and looking around the web for a solution, but it’s way over my capabilities. Maybe there are some sed/regex gurus here who are willing to give me a helping hand !

With everything I gathered around the web, it seems to be a rather complicated regex and sed substitution, so here we go !

What am I trying to achieve?

I have a lot of markdown guides I want to host on a self-hosted, Forgejo-based git instance. However, the classic markdown links are not the same as the ones on GitHub/Forgejo…

Convert the following string:

[Some text](#Header%20Linking%20MARKDOWN.md)

Into

[Some text](#header-linking-markdown.md)

As you can see, these are the requirements:

  • Pattern: [Some text](#link%20to%20header.md)
  • Only edit what’s between parentheses
  • Replace space (%20) with -
  • Everything as lowercase
  • Links are sometimes in nested parentheses
    • e.g. (look here [Some text](#link%20to%20header.md))
  • Do not change a line that begins with https (external links)

While everything is probably a bit complex as a whole, the trickiest part is probably the nested parentheses :/

What I tried

The furthest I got was the following:

sed -Ei 's|\(([^\)]+)\)|\L&|g' test3.md #make everything between parentheses lowercase

sed -i '/https/ ! s/%20/-/g' test3.md #change every %20 occurrence to -

These sed/regex substitutions are what I put together while roaming the web, but they have a lot of flaws and don’t work with nested parentheses. Also, this would change every %20 occurrence in the file.

The closest solution I found on stackoverflow looks similar, but I wasn’t able to fit it to my needs. Actually, my lack of regex/sed understanding makes it impossible to adapt it to my requirements.
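To make the nested-parentheses problem concrete, here is a small sketch that runs the first substitution on a single line instead of a file. The sample line (with uppercase letters in the link text) is made up for illustration, and `\L` assumes GNU sed. The match starts at the outer opening parenthesis, so the link text gets lowercased too:

```shell
# Hypothetical sample line: a link nested inside another pair of parentheses.
line='(look here [Some Text](#Link%20To%20Header.md))'

# Same expression as the first command, but reading stdin (no -i).
result=$(printf '%s\n' "$line" | sed -E 's|\(([^\)]+)\)|\L&|g')

# The match runs from the outer "(" to the first ")", so "[Some Text]" is
# lowercased along with the link target.
printf '%s\n' "$result"
```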


I would appreciate any help, even if a change of tool is needed. However, I’m more interested in the learning process, so a script or CLI alternative is very appreciated :) Actually, any help is appreciated :D !

Thanks in advance.

  • N0x0n@lemmy.mlOP
    1 day ago

    Thank you, thank you very much for taking the time to help me out here ! I really appreciate your full breakdown and complete walkthrough ! I haven’t tried it out yet, but skimming through your post I’m sure it will work out !

    However, I forgot to mention something:

    The goal of this expression is to find markdown links, and to ignore https links. In your post you indicate the markdown links all start with a # symbol, so we don’t have to explicitly ignore the https as much as we just have to match all links starting with #.

    This is only true for links within the same file. If I link to another file, it looks something like this:

    [Why SVT-AV1 over AOM?](readme.md#Why%20SVT-AV1%20over%20AOM?)
    

    I can try to wrap my head around it and find a solution by myself; with your well-written breakdown I’m sure I can try something out. But if you think it will be too complex for my limited knowledge, feel free to adjust :).

    Do you mind if I ping you if I’m not able to solve the issue?

    Thanks again !!! 👍

    • harsh3466@lemmy.ml
      1 day ago

      Feel free to ping me if you want some help! I’d say I’m intermediate with regex, and I’m happy to help where I can.

      Regarding the other file, you could pretty easily modify the command I gave you to adapt to the example you gave. There are two approaches you could take.

      This is focused on the first regex in the command. The second should work unmodified on the other files if they follow the same pattern.

      Here’s the original chunk:

      s|]\(#.+\)|\L&|

      In the new example given, the # is preceded by readme.md. The easy modification is just to insert readme\.md before the # in the expression, adding the \ before the . to escape the metacharacter and match the actual period character, like so:

      s|]\(readme\.md#.+\)|\L&|

      However, if you have other files that have similar, but different patterns, like (faq.md#%20link%20text), and so on, you can make the expression more universal by using the .* metacharacter sequence. This is similar to the .+ metacharacter sequence, with one difference. The + indicates one or more times, while the * indicates zero or more times. So by using .* before the # you can likely use this on all the files if they follow the two pattern examples you gave.
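A quick way to see the difference between `.+` and `.*` is to test both against two targets, one starting directly with `#` and one with a file name in front. This is only a sketch; the two sample strings are made up for illustration:

```shell
# Two hypothetical link targets: one starts with '#', one has a file name first.
sample='(#Top)
(faq.md#Top)'

# .+ needs at least one character between "(" and "#".
plus_matches=$(printf '%s\n' "$sample" | sed -E -n '/\(.+#/p')

# .* also accepts nothing at all between "(" and "#".
star_matches=$(printf '%s\n' "$sample" | sed -E -n '/\(.*#/p')

printf 'with .+ :\n%s\n' "$plus_matches"
printf 'with .* :\n%s\n' "$star_matches"
```

With `.+` only the `(faq.md#Top)` line is printed, while `.*` prints both lines.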

      If that will work, this would be the expression:

      s|]\(.*#.+\)|\L&|

      What this expression does is:

      Find a closing bracket followed by an opening parenthesis, followed by any sequence of characters (including no characters at all) until finding a pound/hash symbol, then one or more characters until finding a closing parenthesis, and convert that entire matched string to lowercase.

      And with that modified expression, this would be the full command:

      sed -ri 's|]\(.*#.+\)|\L&|; s|%20|-|g' /path/to/somefile
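Before running anything with -i on real files, the `.*` variant of the first expression can be tried on a few sample lines through stdin. This is a sketch under a few assumptions: GNU sed (for `\L`), and an example.com link that is made up purely to show what happens to external URLs:

```shell
# Two links from the thread plus a hypothetical external link for comparison.
input='[Some text](#Header%20Linking%20MARKDOWN.md)
[Why SVT-AV1 over AOM?](readme.md#Why%20SVT-AV1%20over%20AOM?)
[Example](https://example.com/Some%20Page)'

# Same two substitutions, reading stdin instead of editing a file in place.
output=$(printf '%s\n' "$input" | sed -E 's|]\(.*#.+\)|\L&|; s|%20|-|g')
printf '%s\n' "$output"
```

The first two lines come out as `(#header-linking-markdown.md)` and `(readme.md#why-svt-av1-over-aom?)`. The external link keeps its case because it contains no `#`, but its %20 is still replaced, since the second substitution applies to every line.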
      

      Edit: grammar

      Edit 2: added the full modified command.

      • N0x0n@lemmy.mlOP
        1 day ago

        Haha we cross-replied !

        .* did the trick and replaces the additional s|]\(.+#.+\) I had added to cover that pattern from my last reply !

        Last question: /https/ ! s|%20|-| changes every occurrence of %20 in the whole file, except on lines containing https. Is there any way to change only the occurrences that appear in the markdown link pattern []()?

        e.g. replace in [Some text](some%20text.md) but not in Hello I'm just some%20place holder text ?
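One possible way to limit both changes to link targets is sketched below. It is not the command from the replies above, just an illustration under some assumptions: GNU sed (for `\L`), link targets without spaces or parentheses, and local links that never contain a colon. Excluding the colon is what keeps `https://` targets untouched. The sample lines, including the example.com URL, are made up:

```shell
input='(look here [Some text](#Link%20To%20Header.md))
Hello I am just some%20place holder text
[Docs](https://example.com/Some%20Page)
[Why SVT-AV1 over AOM?](readme.md#Why%20SVT-AV1%20over%20AOM?)'

output=$(printf '%s\n' "$input" | sed -E '
  # Lowercase only a link target: "](" up to the next ")" with no colon,
  # parenthesis or whitespace in between (\L is a GNU sed extension).
  s/\]\([^()[:space:]:]*\)/\L&/g
  # Replace one %20 inside such a target, then loop until none is left.
  :again
  s/(\]\([^()[:space:]:]*)%20/\1-/
  t again
')
printf '%s\n' "$output"
```

Because the pattern always starts at `](`, the plain-text %20 and the outer parentheses are left alone, and the link text in square brackets keeps its case.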

        Thanks again for your easy to read and very informative walk through ! 🤩