Suppose I wish to pass a URL like so to httr::GET():

https://www.uniprot.org/uniprot/?query=name%3A"dna+methyltransferase"

How would I go about getting the quoted portion of this string (i.e., "dna+methyltransferase") passed as input correctly? My input URL string is stored as follows, and passing it directly does not work as the escaped double quotes are not being evaluated:

> urlinp <- "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\""
> status_code(GET(urlinp))
# [1] 400

The one idea I had was to use capture.output() with cat() to try and pass the (parsed) string, but that didn't work either:

> status_code(GET(capture.output(cat(urlinp))))
[1] 400

I frankly don't know how to do this. Googling did not really help (or I was searching with inappropriate terms). Any pointers would be much appreciated.

Edit: updated context below.

So, I basically have a small function that takes two strings, SoughtProtein and SoughtTaxon, as inputs and builds a URL query (?) out of them, as shown below.

UniProtQueryConstructor <- function(SoughtProtein = NULL, SoughtTaxon = NULL){

  #Function constants
  tmpUniProtBaseURL <- "https://www.uniprot.org/uniprot/"
  tmpUniProtURLRetFormat <- "&format=tab"

  #Formatting steps below
  if(!is.null(SoughtProtein)){


    #If the protein name has more than one word (e.g., "DNA methyltransferase"), enclose that string in double quotes

    if(stringr::str_detect(SoughtProtein, "\\s")){

      #Lowercasing the string, and replacing punctuation and whitespace with "+"
      innertmpProtName <- stringr::str_replace_all(paste0(tolower(SoughtProtein)), stringr::regex("[[:punct:]\\s]+"), "+")

      #Enclosing the multi-word string in double quotes
      innertmpProtName <- paste0('\"', innertmpProtName, '\"')

      #Writing it to a temporary variable that will be passed on for final URL assembly
      tmpProtName <- paste0("name%3A", innertmpProtName)

    } else{

      #Else condition is a simple case, since there is no multi-word string to be dealt with

      tmpProtName <- paste0("name%3A", stringr::str_replace_all(paste0(tolower(SoughtProtein)), stringr::regex("[[:punct:]\\s]+"), "+"))

    }

  } else{ 

    #Else assign an empty string to the protein name if there is no user input

    tmpProtName <- ""

  }

  #Input string prep for taxon selection
  if(!is.null(SoughtTaxon)){

    tmpTaxon <- paste0("taxonomy%3A", stringr::str_replace_all(paste0(tolower(SoughtTaxon)), stringr::regex("[[:punct:]\\s]+"), "+"))

  } else{

    tmpTaxon <- ""

  }


  #Combining user inputs into one single vector of query terms
  tmpInpTermList <- c(tmpProtName, tmpTaxon)


  #Preparing query string
  tmpAssembledUniProtQuery <- paste0("?query=", paste(tmpInpTermList[which(nchar(tmpInpTermList) > 0)], sep = "", collapse = "+AND+"))


  #Full query URL
  tmpFullUniProtSearchURL <- paste0(tmpUniProtBaseURL, tmpAssembledUniProtQuery, tmpUniProtURLRetFormat)

  return(tmpFullUniProtSearchURL)
}

#Test case below

TestSearch <- UniProtQueryConstructor(SoughtProtein = "DNA methyltransferase", SoughtTaxon = "Eukaryota")

#Double quotes within the string not dealt with properly.
TestSearch

# [1] "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\"+AND+taxonomy%3Aeukaryota&format=tab"

The problem is that this function needs to handle inputs that contain more than one word separated by a space (e.g. "DNA methyltransferase") by enclosing them in double quotes within the query string, as follows:

query=name%3A"dna+methyltransferase"

And this is where I'm running into my problem, in that I'm unable to have the escaped double quotes show up properly (as can be seen in the sample output).
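For what it is worth, the backslashes in that printed output are only how print() displays a double quote inside a string; they are not stored in the string itself. A minimal sketch of the difference, using a shortened version of the query string (the variable names a and b are just for illustration):

```r
# The same string written with two different outer quote styles
a <- "name%3A\"dna+methyltransferase\""
b <- 'name%3A"dna+methyltransferase"'

identical(a, b)               # TRUE: both hold exactly the same characters
grepl("\\", a, fixed = TRUE)  # FALSE: no backslash is stored in the string

print(a)       # display form, with the inner quotes shown escaped
writeLines(a)  # the raw characters that a function such as GET() receives
```

So the choice of outer quotes does not change what gets passed on, and the 400 is more likely caused by the unencoded quote character in the URL than by R's escaping.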

I wrote this update just as the multiple answers with URLencode() arrived. I think the proposed solutions solve the problem at hand (of parsing the string properly), and also slightly alleviate the problem at large (of me being terrible at writing code; I learned something new today!).

  • Wrap the inner string in double quotes and the outer string in single quotes, or vice versa. Commented Nov 23, 2019 at 19:32
  • @camille that works, but it's absolutely necessary that the inner string is enclosed in double quotes. How can I force the outer string to be enclosed in single quotes? (It's not being generated manually as in my toy example above, so I apologize in advance if more context is necessary here.) Commented Nov 23, 2019 at 20:32
  • In that case, could you add an example of how the URL gets generated? If you replace the quotation marks with their percent-encoded equivalents (I believe there is a function that does this for you), you'll get the URL "uniprot.org/uniprot/?query=name%3A%22dna+methyltransferase%22", which gets a status code of 200 when I call GET on it. Commented Nov 23, 2019 at 20:39
  • @camille I updated the OP and I also accepted your answer since that solves the problem at hand. Commented Nov 23, 2019 at 21:11

2 Answers

3

I tried to find posts that covered this already, but there's a little detail here that threw me off. You can use utils::URLencode to encode the URL so that the quotation marks are replaced with their percent-encoded equivalents.

URLencode has an argument repeated, which defaults to false:

repeated—logical: should apparently already-encoded URLs be encoded again?

An ‘apparently already-encoded URL’ is one containing %xx for two hexadecimal digits.

Your URL already has one piece encoded with %3A, the encoded version of :; because an encoded substring already exists, no further encoding is done by default. Instead, set repeated = TRUE, and the quotation marks get encoded as well:

library(httr)

urlinp <- 'https://www.uniprot.org/uniprot/?query=name%3A"dna+methyltransferase"'

URLencode(urlinp, repeated = FALSE)
#> [1] "https://www.uniprot.org/uniprot/?query=name%3A\"dna+methyltransferase\""
URLencode(urlinp, repeated = TRUE)
#> [1] "https://www.uniprot.org/uniprot/?query=name%253A%22dna+methyltransferase%22"

status_code(GET(URLencode(urlinp, repeated = TRUE)))
#> [1] 200
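
One side effect visible above is that repeated = TRUE also re-encodes the % of the existing %3A, giving %253A. A possible way to avoid that, assuming the URL contains only well-formed %xx escapes, is to decode first with utils::URLdecode() and then encode once. This is only a sketch; the names decoded and encoded are illustrative:

```r
urlinp <- 'https://www.uniprot.org/uniprot/?query=name%3A"dna+methyltransferase"'

# Step 1: undo the existing percent-encoding, so %3A becomes a literal ":"
decoded <- utils::URLdecode(urlinp)

# Step 2: encode once; no %xx sequence is left, so the default
# repeated = FALSE no longer returns the URL unchanged
encoded <- utils::URLencode(decoded)
encoded
#> [1] "https://www.uniprot.org/uniprot/?query=name:%22dna+methyltransferase%22"
```

With the default reserved = FALSE, the colon is left as it is and only the quotation marks are percent-encoded.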

1

Let's do a few things to handle this:

  1. As @camille rightly points out, it is way easier to wrap this in single quotes.
  2. And while we're at it, let's replace the "%3A" in the URL template with the colon that it represents.
  3. Now, let's use URLencode. This will deal with the quotes, the colon and anything else for us.

Then we're all set.

library(httr)
# Correct sample format for URL
# https://www.uniprot.org/uniprot/?query=name%3A%22dna+methyltransferase%22&sort=score
query_url <- 'https://www.uniprot.org/uniprot/?query=name:"dna+methyltransferase"' 
encoded_url <- URLencode(query_url)
resp <- httr::GET(encoded_url)
status_code(resp)
#> [1] 200

Created on 2019-11-23 by the reprex package (v0.3.0)
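
When the URL is generated rather than typed by hand, the same three steps can be applied inside the generating function. The following is a rough sketch in base R only, not the OP's actual function: build_uniprot_url() and to_term() are hypothetical names, and the query layout (name:, taxonomy:, +AND+, &format=tab) is simply copied from the question.

```r
# Hypothetical helper: build the query with a literal ":" and literal quotes,
# then let URLencode() handle the percent-encoding in a single pass.
build_uniprot_url <- function(protein = NULL, taxon = NULL) {
  base_url <- "https://www.uniprot.org/uniprot/"

  # Lowercase, and collapse runs of punctuation/whitespace into "+"
  to_term <- function(x) gsub("[[:punct:][:space:]]+", "+", tolower(x))

  terms <- character(0)
  if (!is.null(protein)) {
    name <- to_term(protein)
    # Multi-word names are wrapped in double quotes
    if (grepl("\\s", protein)) name <- paste0('"', name, '"')
    terms <- c(terms, paste0("name:", name))
  }
  if (!is.null(taxon)) {
    terms <- c(terms, paste0("taxonomy:", to_term(taxon)))
  }

  raw_url <- paste0(base_url, "?query=",
                    paste(terms, collapse = "+AND+"), "&format=tab")
  utils::URLencode(raw_url)
}

build_uniprot_url("DNA methyltransferase", "Eukaryota")
#> [1] "https://www.uniprot.org/uniprot/?query=name:%22dna+methyltransferase%22+AND+taxonomy:eukaryota&format=tab"
```

Because nothing is percent-encoded before URLencode() runs, the repeated argument never comes into play here.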

2 Comments

How would you decode the %3A programmatically? Right now you've done that manually, but the OP says they're generating URL, not hard-coding them
Thank you for the contribution. Although I've accepted @camille's answer (since it completely solved my problem), I saw your answer first, and it put me on the right track.
