I am developing a Nextflow-based pipeline. It has two processes, one that downloads files and one that runs an R script on them, given below:

process templateExample {
    publishDir "data_analysis_files", mode: 'copy'

    output:
    path "*_gex.csv", emit: count_files

    script:
    '''
    "download_files.sh"
    '''
}

process read_count_p {
    publishDir "results", mode: 'copy'

    input:
    path count_files

    output:
    path "result.txt"

    """
    Rscript read_count.R ${count_files}
    """
}

workflow {
    templateExample()
    read_count_p(templateExample.out.count_files)
}

The scripts download_files.sh and read_count.R are both in the bin folder. The problem is that when I run Nextflow, it finds and executes the Bash script download_files.sh from the bin folder, but not the R script read_count.R. The Bash script, the R script, and the error are given below.

#!/bin/bash

# Define the URLs of the files to download
urls=(
    "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3832nnn/GSM3832735/suppl/GSM3832735_wt_naive_gex.csv.gz"
    "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3832nnn/GSM3832736/suppl/GSM3832736_wt_naive_adt.csv.gz"
    "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3832nnn/GSM3832737/suppl/GSM3832737_wt_tumor_gex.csv.gz"
    "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3832nnn/GSM3832738/suppl/GSM3832738_wt_tumor_adt.csv.gz" 
    "https://zenodo.org/records/5511975/files/negative_cDC1_relative_signatures.csv?download=1"
    "https://zenodo.org/records/5511975/files/positive_cDC1_relative_signatures.csv?download=1"
    "https://github.com/SIgN-Bioinformatics/sgCMAP_R_Scripts/blob/main/sgCMAP_R_Scripts/sgCMAP-internal.R"
    "https://github.com/SIgN-Bioinformatics/sgCMAP_R_Scripts/blob/main/sgCMAP_R_Scripts/sgCMAP_score.R"
    )


# Download each file using wget
for url in "${urls[@]}"; do
    wget "$url"
done

# Unzip each downloaded file using gunzip
for file in *.gz;do
    gunzip "$file"
done

The R script is

#!/user/bin/R
args <- commandArgs(trailingOnly = TRUE)
print(args[0])
my_vec <- c(args[0],args[1],args[0],class(args),args[2])
write.table(my_vec,"result1.txt")

And the error is given below,

acheema@acri-AS-1124US-TNRP:~$ nextflow run single_cell.nf
N E X T F L O W  ~  version 23.10.1
    Launching `single_cell.nf` [soggy_sanger] DSL2 - revision: f55ed68615
    executor >  local (2)
[8d/2e0586] process > templateExample [100%] 1 of 1 ✔
[6d/17dc6a] process > read_count_p    [100%] 1 of 1, failed: 1 ✘
    ERROR ~ Error executing process > 'read_count_p'

Caused by:
     Process `read_count_p` terminated with an error exit status (2)

Command executed:

     Rscript read_count.R GSM3832735_wt_naive_gex.csv GSM3832737_wt_tumor_gex.csv

Command exit status:
      2

Command output:
     Fatal error: cannot open file 'read_count.R': No such file or directory

Command error:
      Fatal error: cannot open file 'read_count.R': No such file or directory

Work dir:
      /home/acheema/work/6d/17dc6ad0908c96df730a0f7c28c428

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

-- Check '.nextflow.log' file for details

The .nextflow.log is given below,

acheema@acri-AS-1124US-TNRP:~$ cat .nextflow.log
May-08 15:18:54.580 [main] DEBUG nextflow.cli.Launcher - $> nextflow run single_cell.nf
May-08 15:18:54.712 [main] INFO  nextflow.cli.CmdRun - N E X T F L O W  ~  version 23.10.1
May-08 15:18:54.734 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; embedded=false; plugins-dir=/home/acheema/.nextflow/plugins; core-plugins: [email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected],[email protected]
May-08 15:18:54.743 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
May-08 15:18:54.744 [main] INFO  o.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
May-08 15:18:54.747 [main] INFO  org.pf4j.DefaultPluginManager - PF4J version 3.4.1 in 'deployment' mode
May-08 15:18:54.757 [main] INFO  org.pf4j.AbstractPluginManager - No plugins
May-08 15:18:54.817 [main] DEBUG nextflow.cli.CmdRun - Applied DSL=2 from script declararion
May-08 15:18:54.832 [main] INFO  nextflow.cli.CmdRun - Launching `single_cell.nf` [soggy_sanger] DSL2 - revision: f55ed68615
May-08 15:18:54.833 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins default=[]
May-08 15:18:54.833 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins resolved requirement=[]
May-08 15:18:54.840 [main] DEBUG n.secret.LocalSecretsProvider - Secrets store: /home/acheema/.nextflow/secrets/store.json
May-08 15:18:54.846 [main] DEBUG nextflow.secret.SecretsLoader - Discovered secrets providers: [nextflow.secret.LocalSecretsProvider@783ec989] - activable => nextflow.secret.LocalSecretsProvider@783ec989
May-08 15:18:54.899 [main] DEBUG nextflow.Session - Session UUID: 34564b50-df93-4baa-8861-cba8231186f4
May-08 15:18:54.900 [main] DEBUG nextflow.Session - Run name: soggy_sanger
May-08 15:18:54.901 [main] DEBUG nextflow.Session - Executor pool size: 128
May-08 15:18:54.908 [main] DEBUG nextflow.file.FilePorter - File porter settings maxRetries=3; maxTransfers=50; pollTimeout=null
May-08 15:18:54.911 [main] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'FileTransfer' minSize=10; maxSize=384; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
May-08 15:18:54.938 [main] DEBUG nextflow.cli.CmdRun -
  Version: 23.10.1 build 5891
  Created: 12-01-2024 22:01 UTC (18:01 ADT)
  System: Linux 5.4.0-150-generic
  Runtime: Groovy 3.0.19 on OpenJDK 64-Bit Server VM 11.0.19+7-post-Ubuntu-0ubuntu118.04.1
  Encoding: UTF-8 (ANSI_X3.4-1968)
  Process: 18550@acri-AS-1124US-TNRP [127.0.1.1]
  CPUs: 128 - Mem: 1007.8 GB (709.6 GB) - Swap: 2 GB (2 GB)
May-08 15:18:54.958 [main] DEBUG nextflow.Session - Work-dir: /home/acheema/work [ext2/ext3]
May-08 15:18:55.011 [main] DEBUG nextflow.executor.ExecutorFactory - Extension executors providers=[]
May-08 15:18:55.023 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
May-08 15:18:55.057 [main] DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.DefaultCacheFactory
May-08 15:18:55.066 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 129; maxThreads: 1000
May-08 15:18:55.114 [main] DEBUG nextflow.Session - Session start
May-08 15:18:55.644 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
May-08 15:18:55.705 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: null
May-08 15:18:55.705 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'local'
May-08 15:18:55.710 [main] DEBUG nextflow.executor.Executor - [warm up] executor > local
May-08 15:18:55.714 [main] DEBUG n.processor.LocalPollingMonitor - Creating local task monitor for executor 'local' > cpus=128; memory=1007.8 GB; capacity=128; pollInterval=100ms; dumpInterval=5m
May-08 15:18:55.716 [main] DEBUG n.processor.TaskPollingMonitor - >>> barrier register (monitor: local)
May-08 15:18:55.821 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: null
May-08 15:18:55.821 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'local'
May-08 15:18:55.828 [main] DEBUG nextflow.Session - Workflow process names [dsl2]: templateExample, read_count_p
May-08 15:18:55.828 [main] DEBUG nextflow.Session - Igniting dataflow network (2)
May-08 15:18:55.829 [main] DEBUG nextflow.processor.TaskProcessor - Starting process > templateExample
May-08 15:18:55.830 [main] DEBUG nextflow.processor.TaskProcessor - Starting process > read_count_p
May-08 15:18:55.831 [main] DEBUG nextflow.script.ScriptRunner - Parsed script files:
  Script_1e152ad49ae18340: /home/acheema/single_cell.nf
May-08 15:18:55.831 [main] DEBUG nextflow.script.ScriptRunner - > Awaiting termination
May-08 15:18:55.831 [main] DEBUG nextflow.Session - Session await
May-08 15:18:55.991 [Task submitter] DEBUG n.executor.local.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
May-08 15:18:55.995 [Task submitter] INFO  nextflow.Session - [8d/2e0586] Submitted process > templateExample
May-08 15:19:07.473 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 1; name: templateExample; status: COMPLETED; exit: 0; error: -; workDir: /home/acheema/work/8d/2e0586013131bee894e6322a38edf7]
May-08 15:19:07.504 [Task monitor] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'PublishDir' minSize=10; maxSize=384; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
May-08 15:19:07.537 [Task submitter] DEBUG n.executor.local.LocalTaskHandler - Launch cmd line: /bin/bash -ue .command.run
May-08 15:19:07.538 [Task submitter] INFO  nextflow.Session - [6d/17dc6a] Submitted process > read_count_p
May-08 15:19:07.610 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 2; name: read_count_p; status: COMPLETED; exit: 2; error: -; workDir: /home/acheema/work/6d/17dc6ad0908c96df730a0f7c28c428]
May-08 15:19:07.618 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=read_count_p; work-dir=/home/acheema/work/6d/17dc6ad0908c96df730a0f7c28c428
  error [nextflow.exception.ProcessFailedException]: Process `read_count_p` terminated with an error exit status (2)
May-08 15:19:07.632 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'read_count_p'

Caused by:
  Process `read_count_p` terminated with an error exit status (2)

Command executed:

  Rscript read_count.R GSM3832735_wt_naive_gex.csv GSM3832737_wt_tumor_gex.csv

Command exit status:
  2

Command output:
  Fatal error: cannot open file 'read_count.R': No such file or directory

Command error:
  Fatal error: cannot open file 'read_count.R': No such file or directory

Work dir:
  /home/acheema/work/6d/17dc6ad0908c96df730a0f7c28c428

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
May-08 15:19:07.635 [main] DEBUG nextflow.Session - Session await > all processes finished
May-08 15:19:07.638 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Process `read_count_p` terminated with an error exit status (2)
May-08 15:19:07.654 [main] DEBUG nextflow.Session - Session await > all barriers passed
May-08 15:19:07.655 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: local) - terminating tasks monitor poll loop
May-08 15:19:07.667 [main] DEBUG n.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=1; failedCount=1; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=11.4s; failedDuration=41ms; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=1; peakCpus=1; peakMemory=0; ]
May-08 15:19:07.856 [main] DEBUG nextflow.cache.CacheDB - Closing CacheDB done
May-08 15:19:07.879 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye

But when I give the absolute path to the R script then it works fine.

script:
"""
Rscript /home/acheema/bin/read_count.R ${count_files}
"""

Now it works fine as given below,

acheema@acri-AS-1124US-TNRP:~$ nextflow run single_cell.nf
N E X T F L O W  ~  version 23.10.1
Launching `single_cell.nf` [astonishing_lorenz] DSL2 - revision: f279637f1a
executor >  local (2)
[26/99ed30] process > templateExample [100%] 1 of 1 ✔
[35/db989a] process > read_count_p    [100%] 1 of 1 ✔

Is there a way for the R script to be found and read from the bin folder? I have tried the solutions suggested here, but they did not work.

3 Answers

3

The way this is typically done with Nextflow, unless you want to put the whole R script into your script block, is to set the shebang to #!/usr/bin/env Rscript. For that to work, your R script needs to be in the bin directory of the pipeline directory (which in your case seems to be ~) and must be executable (i.e. chmod +x). Then your script block would look like this:

"""
read_count.R ${count_files}
"""

Are you certain that your R binary is in /user/bin (not /usr)?
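
The shebang and permission changes described above can be sketched in shell as follows. This is only an illustration under stated assumptions: it works on a throwaway copy in a temporary directory (a placeholder for the pipeline's own bin directory), and the R body is shortened to a stand-in rather than the asker's full script.

```shell
# Throwaway location standing in for the pipeline's bin/ directory (assumption).
bin_dir=$(mktemp -d)/bin
mkdir -p "$bin_dir"
script="$bin_dir/read_count.R"

# Recreate a script that starts with the shebang from the question.
# The R body is a shortened placeholder, not the original script.
cat > "$script" <<'EOF'
#!/user/bin/R
args <- commandArgs(trailingOnly = TRUE)
print(args)
EOF

# Replace the first line with an env-based shebang (avoids non-portable `sed -i`).
sed '1s|^#!.*$|#!/usr/bin/env Rscript|' "$script" > "$script.tmp"
mv "$script.tmp" "$script"

# Set the executable bit so the script can be called by name.
chmod +x "$script"

shebang=$(head -n 1 "$script")
echo "shebang: $shebang"
[ -x "$script" ] && echo "executable: yes"

# Optional check: is Rscript itself reachable on this machine?
rscript_path=$(command -v Rscript || echo "not found on PATH")
echo "Rscript: $rscript_path"
```

On a real pipeline the same sed and chmod steps would be applied to bin/read_count.R directly; the last check only reports whether env would be able to locate Rscript.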


4 Comments

The shebang line is almost certainly wrong, but it also doesn’t matter because OP’s invocation never looks at it; it’s just a regular comment. And there’s no bin directory involved in any of this. Rscript looks for the script in the current working directory, and OP didn’t stage the script file.
nextflow will automatically add bin in the pipeline directory to PATH nextflow.io/docs/latest/…
Yes, but the Rscript invocation does not look up the script name in PATH.
Yeah, hence my suggestion to fix the shebang; I see your point though: with Rscript scriptname.R, adding the file to bin will not solve the issue.
0

This might have something to do with file permissions, but it's hard to say since the first process works.

What I do is pass the R script in through a channel and treat it like any other input file. A further benefit is that you can add a file-exists check that throws an error before the pipeline starts, rather than halfway through, if the R script is missing.

Also, I would just paste the contents of download_files.sh into the script block of the process. It's what Nextflow was designed for. The same goes for the R script, but that would be more annoying to change, so I'll leave it.

process templateExample {
  publishDir "data_analysis_files", mode:'copy'     

  output:
  path "*_gex.csv" , emit: count_files        
  
  script:
  """
  #!/bin/bash
  
  # Define the URLs of the files to download
  urls=(
      "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3832nnn/GSM3832735/suppl/GSM3832735_wt_naive_gex.csv.gz"
      "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3832nnn/GSM3832736/suppl/GSM3832736_wt_naive_adt.csv.gz"
      "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3832nnn/GSM3832737/suppl/GSM3832737_wt_tumor_gex.csv.gz"
      "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM3832nnn/GSM3832738/suppl/GSM3832738_wt_tumor_adt.csv.gz" 
      "https://zenodo.org/records/5511975/files/negative_cDC1_relative_signatures.csv?download=1"
      "https://zenodo.org/records/5511975/files/positive_cDC1_relative_signatures.csv?download=1"
      "https://github.com/SIgN-Bioinformatics/sgCMAP_R_Scripts/blob/main/sgCMAP_R_Scripts/sgCMAP-internal.R"
      "https://github.com/SIgN-Bioinformatics/sgCMAP_R_Scripts/blob/main/sgCMAP_R_Scripts/sgCMAP_score.R"
      )
  
  
  # Dollar signs below are backslash-escaped so that Nextflow leaves them for Bash
  # Download each file using wget
  for url in "\${urls[@]}"; do
      wget "\$url"
  done
  
  # Unzip each downloaded file using gunzip
  for file in *.gz; do
      gunzip "\$file"
  done
  """
}


process read_count_p {
  publishDir "results",mode:'copy'

  input:
  path count_files
  path read_counts_rscript

  output:
  path "result.txt"

  """
  Rscript ${read_counts_rscript} ${count_files}
  """
}


 workflow {
   templateExample()
   read_count_p(templateExample.out.count_files, read_counts_rscript )
 }

And add the following channel where you create your channels (params.read_counts_rscript holds the path to the R script):

Channel
   .fromPath(params.read_counts_rscript)
   .ifEmpty { error "No merging Rscript supplied: ${params.read_counts_rscript}" }
   .set { read_counts_rscript }

EDIT: Didn't update the workflow declaration with the new Rscript channel.

4 Comments

Since it works when passed the absolute path this is definitely not related to file permissions. OP simply didn’t stage the script file into the work directory.
This is exactly the error you get when calling a script from the bin folder when you don't have execution permissions for that file. @KonradRudolph
@Pallie Only if you try to execute the script directly. Not if you are passing it to another command as OP does. The code used by OP does not need to give execute permissions to the R script file.
Yes you're absolutely right @KonradRudolph, I misunderstood the situation.
0

As far as Nextflow is concerned, read_count.R is an input into the read_count_p process. You therefore need to declare it as such; otherwise the file won’t be present in the process working directory, and Rscript cannot find it. Nextflow executes each process inside its own custom working directory, and it ensures that all input files are present inside that directory (and that output files are copied out of it).

So either declare read_count.R as an input of the process and pass the path to the script when invoking read_count_p, or replace Rscript read_count.R with read_count.R, set the PATH environment variable so that it includes the directory containing read_count.R, and fix its shebang line as mentioned in the other answer.

Alternatively, you can also copy the contents of the R script directly into the script block of the process.
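
The difference between these options can be reproduced outside Nextflow with a small shell sketch. Everything in it is a stand-in chosen for illustration: tool.sh plays the role of read_count.R, cat plays the role of any program that is handed a file name as an argument (as Rscript is), and the work directory imitates a task directory. The symlink step mirrors how Nextflow stages declared path inputs by default.

```shell
# Hypothetical layout: bin/ holds the script, work/ imitates a task directory.
root=$(mktemp -d)
mkdir -p "$root/bin" "$root/work"
printf '#!/bin/sh\necho "tool ran"\n' > "$root/bin/tool.sh"
chmod +x "$root/bin/tool.sh"
cd "$root/work"

# Case 1: a program is given a bare file name that was never staged.
# The name is resolved against the current directory, so it is not found.
if cat tool.sh > /dev/null 2>&1; then unstaged=found; else unstaged=missing; fi

# Case 2: the file is staged into the working directory first (symlink,
# as Nextflow does for declared inputs by default). Now the name resolves.
ln -s "$root/bin/tool.sh" tool.sh
if cat tool.sh > /dev/null 2>&1; then staged=found; else staged=missing; fi
rm tool.sh

# Case 3: the script is called as a command. The shell searches PATH for it
# and the shebang line selects the interpreter.
PATH="$root/bin:$PATH"
by_path=$(tool.sh)

echo "file argument, not staged: $unstaged"
echo "file argument, staged:     $staged"
echo "bare command via PATH:     $by_path"
```

Case 1 corresponds to the failing Rscript read_count.R call, case 2 to declaring the script as a process input, and case 3 to calling read_count.R directly once it is executable and on the PATH.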
