1

How can I get just a part of XML node text?

I have this piece of XML:

  <CorpusLink>../Metadata/A_short_autobiography_of_Herculino_Alves.xml</CorpusLink>
  <CorpusLink >../Metadata/Wordlist_and_phrases_-_modifiers.xml</CorpusLink>
  <CorpusLink >../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml</CorpusLink>
  <CorpusLink >../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml</CorpusLink>
  <CorpusLink >../desano-silva-0151/Metadata/Wordlist_and_phrases_parts_of_a_tree.xml</CorpusLink>
  <CorpusLink >../desano-silva-0151/Metadata/Wordlist_and_phrases_.xml</CorpusLink>

I need to extract only this piece of text in each one:

../Metadata

../desano-silva-0151/Metadata

I have this code :

$j = 0
$TrgContent.METATRANSCRIPT.Corpus.CorpusLink | ForEach-Object {
[String]$_.'#text'= % {$alltext[$j] + "xml" $j++}}

But it gives me all the text:

../Metadata/A_short_autobiography_of_Herculino_Alves.xml

../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml

Thanks in advance for any help.

1 Answer 1

1

To achieve what you have asked. I think we have two main steps here:

  1. Extract the content of XML nodes.
  2. Trim the content and take what you need only.

I'm not really familiar with your existing scripts so I will explain all two steps here. The first step is optional to you.

Extract content of XML nodes

My example XML document:

<Corpus>
    <CorpusLink>../Metadata/A_short_autobiography_of_Herculino_Alves.xml</CorpusLink>
    <CorpusLink>../Metadata/Wordlist_and_phrases_-_modifiers.xml</CorpusLink>
    <CorpusLink>../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml</CorpusLink>
    <CorpusLink>../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml</CorpusLink>
    <CorpusLink>../desano-silva-0151/Metadata/Wordlist_and_phrases_parts_of_a_tree.xml</CorpusLink>
    <CorpusLink>../desano-silva-0151/Metadata/Wordlist_and_phrases_.xml</CorpusLink>
</Corpus>

PS script to get the content:

[xml] $XmlDocument = Get-Content D:\Path_To_Your_File
$XmlDocument.Corpus.CorpusLink # Content of the nodes you need

Trim the content

There are many methods but I think I will go with regex. Simply loop through all the contents and run the regex.

$XmlDocument2.Corpus.CorpusLink | Foreach-Object {
    if ($_ -match "\.\.\/.*?\/") {
        $Matches.Values
    }    
}

About the regex, it matches any character except for line terminators between ..\ and /:

\.\.  # Escape for 2 dots `..`
\/    # Escapefor slash `/`
.*?   # Takes any character except for line terminators in between other listed characters (above and below)
\/    # Escape for slash `/`

I imply the structure of these strings is stable like that, hence the regex.

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks a lot for your hiper explanation and example.Now it´s working perfectly.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.