
I have a Snakemake pipeline (https://github.com/V-Varga/SPOT-BGC/tree/main) in which I generate input and output file names for various intermediate steps using wildcards that refer back to file and file-path names in a Pandas dataframe. The full Snakefile is available in the repository, so I will include only the most relevant parts here.

I use both input functions (lambda wildcards) and expand() in rule inputs. I recently realized that this structure somehow leads to Snakemake not always recognizing outputs as required inputs for downstream rules, despite their being defined that way. For example, the outputs of rule bbnorm_pe and rule bbnorm_se are defined as inputs for rule filt_100k:

# rule bbnorm_pe normalizes PE reads
rule bbnorm_pe:
    input:
        mapped_fq1 = lambda wildcards: fq_R1_names_df.loc[wildcards.cohort_mapped_with_sample, 'Location_NonHuman'],
        mapped_fq2 = lambda wildcards: fq_R2_names_df.loc[wildcards.cohort_mapped_with_sample, 'Location_NonHuman']
    output:
        input_kmers = "results/DataNonHuman/BBNorm_Reads/{cohort_mapped_with_sample}_NON-human_map_input_kmers.png",
        output_kmers = "results/DataNonHuman/BBNorm_Reads/{cohort_mapped_with_sample}_NON-human_map_output_kmers.png",
        normalized_fq1 = "results/DataNonHuman/BBNorm_Reads/{cohort_mapped_with_sample}_norm.1.fq",
        normalized_fq2 = "results/DataNonHuman/BBNorm_Reads/{cohort_mapped_with_sample}_norm.2.fq"
    threads: config['threads_bbnorm']
    resources:
        mem_mb = config['memory_maximum']
    params:
        memory_alloc = config['bbnorm_memory']
    log:
        "logs/BBnorm/{cohort_mapped_with_sample}_read_mapping_log.txt"
    shell:
        """
        apptainer exec workflow/containers/bbtools.sif /bbmap/bbnorm.sh {params.memory_alloc} \
        in={input.mapped_fq1} in2={input.mapped_fq2} \
        out={output.normalized_fq1} out2={output.normalized_fq2} \
        target=80 min=3 threads={threads} \
        hist={output.input_kmers} histout={output.output_kmers}
        """

# rule bbnorm_se normalizes SE reads
rule bbnorm_se:
    input:
        mapped_fqse = lambda wildcards: fq_SE_names_df.loc[wildcards.cohort_with_sample, 'Location_NonHuman']
    output:
        input_kmers = "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_NON-human_map_input_kmers.png",
        output_kmers = "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_NON-human_map_output_kmers.png",
        normalized_fqse = "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.SE.fq"
    threads: config['threads_bbnorm']
    resources:
        mem_mb = config['memory_maximum']
    params:
        memory_alloc = config['bbnorm_memory']
    log:
        "logs/BBnorm/{cohort_with_sample}_read_mapping_log.txt"
    shell:
        """
        apptainer exec workflow/containers/bbtools.sif /bbmap/bbnorm.sh {params.memory_alloc} \
        in={input.mapped_fqse} out={output.normalized_fqse} threads={threads} \
        target=80 min=3 hist={output.input_kmers} histout={output.output_kmers}
        """


# filter down to files with 100k reads
rule filt_100k:
    input:
        expand("results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.1.fq",
            cohort_with_sample = fq_R1_names_df.CohortSample),
        expand("results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.2.fq",
            cohort_with_sample = fq_R1_names_df.CohortSample),
        expand("results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.SE.fq",
            cohort_with_sample = fq_SE_names_df.CohortSample),
        filtering_script = 'workflow/scripts/snakemake_100k_filt.sh',
    log:
        filt_100k_log = 'logs/100k_filt.txt'
    output:
        completion_100k = 'logs/completion/100k_filt__COMPLETE.txt'
    shell:
        'bash {input.filtering_script}'
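For context, each of the lambda inputs above resolves a wildcard value to a concrete path by indexing into the dataframe. Roughly, this is what happens for a single job (toy dataframe and values here, not my real table):

```python
import pandas as pd
from types import SimpleNamespace

# Toy stand-in for fq_R1_names_df: indexed by CohortSample, one path column
fq_R1_names_df = pd.DataFrame(
    {
        "CohortSample": ["CohortA/S1"],
        "Location_NonHuman": ["results/DataNonHuman/NonHumanOG/CohortA/S1_NON-human_map.1.fq"],
    }
).set_index("CohortSample", drop=False)

# What Snakemake does when it calls the lambda for one concrete job:
wildcards = SimpleNamespace(cohort_mapped_with_sample="CohortA/S1")
path = fq_R1_names_df.loc[wildcards.cohort_mapped_with_sample, "Location_NonHuman"]
print(path)  # results/DataNonHuman/NonHumanOG/CohortA/S1_NON-human_map.1.fq
```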

Based on the --dry-run printout, Snakemake does not seem to recognize that the outputs of rule bbnorm_pe and rule bbnorm_se are required to exist as inputs for rule filt_100k.

I have tried using touch() to create a completion flag (see my earlier question here: Snakemake wildcard issues using touch() to avoid premature pipeline progression), but this did not resolve the issue.

The pipeline has run successfully before, but I suspect that this was simply due to luck: the datasets I used before were much smaller and not nearly as deeply sequenced as the dataset I am trying to run the pipeline on now. The lag resulting from the greater sequencing depth is probably what is now causing Snakemake not to proceed in the intended order.

The dataframe file looks like this:

Cohort  Sample  CohortSample    CohortSampleSample      Location_Raw    FileBase_Raw    CohortBase_Raw  Location_Trim  FileBase_Trim    CohortBase_Trim Location_NonHuman       FileBase_NonHuman       CohortBase_NonHuman     Location_Norm  FileBase_Norm    CohortBase_Norm Location_100k   FileBase_100k   CohortBase_100k AssemblySample_Location AssemblySample_FileBase AssemblySample_CohortBase       FiltAssemblySample_Location     FiltAssemblySample_FileBase     FiltAssemblySample_CohortBase   TaxaSample_Location     TaxaSample_FileBase     TaxaSample_CohortBase   GECCOSample_Location    GECCOSample_FileBase    GECCO_CohortBase        AntiSMASH_Location      AntiSMASH_Base  AntiSMASH_CohortBase    ReadNum
CapeTown_PRJNA767784    SRR16503423     CapeTown_PRJNA767784/SRR16503423        CapeTown_PRJNA767784/SRR16503423/SRR16503423    resources/RawData/CapeTown_PRJNA767784/SRR16503423/SRR16503423_1.fastq  SRR16503423_1   CapeTown_PRJNA767784/SRR16503423_1      results/Trimmomatic/CapeTown_PRJNA767784/SRR16503423.1.fastq    SRR16503423.1   CapeTown_PRJNA767784/SRR16503423.1      results/DataNonHuman/NonHumanOG/CapeTown_PRJNA767784/SRR16503423_NON-human_map.1.fq     SRR16503423_NON-human_map.1     CapeTown_PRJNA767784/SRR16503423_NON-human_map.1        results/DataNonHuman/BBNorm_Reads/CapeTown_PRJNA767784/SRR16503423_norm.1.fq    SRR16503423_norm.1      CapeTown_PRJNA767784/SRR16503423_norm.1 results/DataNonHuman/100k_Filt/CapeTown_PRJNA767784/SRR16503423_norm.1.fq       SRR16503423_norm.1      CapeTown_PRJNA767784/SRR16503423_norm.1 results/Assembly/PerSample/CapeTown_PRJNA767784/SRR16503423/SRR16503423_scaffolds.fasta  SRR16503423_scaffolds   CapeTown_PRJNA767784/SRR16503423/SRR16503423_scaffolds  results/AssemblyNonHuman/PerSample/CapeTown_PRJNA767784/SRR16503423/SRR16503423_assemblySample_NON-human_map.fasta      SRR16503423_assemblySample_NON-human_map        CapeTown_PRJNA767784/SRR16503423/SRR16503423_assemblySample_NON-human_map       results/Taxonomy/PerSample/CapeTown_PRJNA767784/SRR16503423/SRR16503423__SampleTaxa.txt SRR16503423__SampleTaxa CapeTown_PRJNA767784/SRR16503423/SRR16503423__SampleTaxa        results/BGC_Prediction/GECCO_Results/PerSample/CapeTown_PRJNA767784/SRR16503423/SRR16503423_contigs.clusters.gff        SRR16503423_contigs.clusters    CapeTown_PRJNA767784/SRR16503423/SRR16503423_contigs.clusters   results/BGC_Prediction/AntiSMASH_Results/PerSample/CapeTown_PRJNA767784/SRR16503423/SRR16503423_persample_AntiSMASH.json        SRR16503423_persample_AntiSMASH CapeTown_PRJNA767784/SRR16503423/SRR16503423_persample_AntiSMASH        1

And is imported into a Pandas dataframe in Snakemake like so:

# dataframe with the file basenames as index
filebases_df = (pd.read_csv(config['targets_per_sample'], sep='\t').set_index('FileBase_Raw', drop=False))
# dataframe with cohort-categorized samples IDs
cohort_filebases_df = (pd.read_csv(config['targets_per_sample'], sep='\t').set_index('CohortBase_Raw', drop=False))
# cohort sample index
cohort_sample_df = (pd.read_csv(config['targets_per_sample'], sep='\t').set_index('CohortSample', drop=False))
# filtered dataframes:
# forward reads
fq_R1_names_df = cohort_sample_df[cohort_sample_df['ReadNum'] == '1']
# reverse reads
fq_R2_names_df = cohort_sample_df[cohort_sample_df['ReadNum'] == '2']
# SE reads
fq_SE_names_df = cohort_sample_df[cohort_sample_df['ReadNum'] == 'SE']
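The filtering logic above can be sanity-checked in isolation, with a toy two-row table standing in for the real TSV. One caveat worth checking (noted in the comment) is the inferred dtype of the ReadNum column:

```python
import io
import pandas as pd

# Toy stand-in for the targets TSV: one PE forward read and one SE read
tsv = io.StringIO(
    "CohortSample\tLocation_NonHuman\tReadNum\n"
    "CohortA/S1\tresults/DataNonHuman/NonHumanOG/CohortA/S1.1.fq\t1\n"
    "CohortA/S2\tresults/DataNonHuman/NonHumanOG/CohortA/S2.SE.fq\tSE\n"
)
cohort_sample_df = pd.read_csv(tsv, sep="\t").set_index("CohortSample", drop=False)

# With mixed values like 1 and SE in one column, pandas keeps the column as
# strings, so comparing against '1' works; in an all-numeric column the dtype
# would be inferred as int, and the same string comparison would silently
# match nothing, leaving the filtered dataframe empty.
fq_R1_names_df = cohort_sample_df[cohort_sample_df["ReadNum"] == "1"]
fq_SE_names_df = cohort_sample_df[cohort_sample_df["ReadNum"] == "SE"]
print(len(fq_R1_names_df), len(fq_SE_names_df))  # expect 1 1
```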

When I do a dry-run up to the bottleneck step, rule filt_100k (the aggregator rule that is executing prematurely), the printout shows that the inputs of rule filt_100k are not being registered as required for the job to run, despite being included in the input fields of both rule filt_100k and rule all.

If I change rule all to only include the BBnorm step outputs, like so:

rule all:
    input:
        expand('results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.1.fq',
            cohort_with_sample = fq_R1_names_df.CohortSample),
        expand('results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.2.fq',
            cohort_with_sample = fq_R1_names_df.CohortSample),
        expand('results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.SE.fq',
            cohort_with_sample = fq_SE_names_df.CohortSample)

And then do dry-runs with Snakemake:

(snakemake) snakemake --dry-run --verbose --until bbnorm_pe
Wildcard constraints in inputs are ignored. (rule: trim_files_pe)
Wildcard constraints in inputs are ignored. (rule: trim_files_pe)
Wildcard constraints in inputs are ignored. (rule: trim_files_se)
host: [REMOVED]
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
Job stats:
job      count
-----  -------
total        0

Job stats:
job      count
-----  -------
total        0

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
(snakemake) snakemake --dry-run --verbose --until filt_100k
Wildcard constraints in inputs are ignored. (rule: trim_files_pe)
Wildcard constraints in inputs are ignored. (rule: trim_files_pe)
Wildcard constraints in inputs are ignored. (rule: trim_files_se)
host: [REMOVED]
Building DAG of jobs...
logs/completion/100k_filt__COMPLETE.txt: True 0
shared_storage_local_copies: True
remote_exec: False
Job stats:
job          count
---------  -------
filt_100k        1
total            1


[Thu Oct 16 14:46:37 2025]
rule filt_100k:
    input: workflow/scripts/snakemake_100k_filt.sh
    output: logs/completion/100k_filt__COMPLETE.txt
    log: logs/100k_filt.txt
    jobid: 0
    reason: Missing output files: logs/completion/100k_filt__COMPLETE.txt
    resources: tmpdir=<TBD>
Handle temp files for job filt_100k
Job stats:
job          count
---------  -------
filt_100k        1
total            1

Reasons:
    (check individual jobs above for details)
    output files have to be generated:
        filt_100k
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

To me, this looks like Snakemake is not properly registering the BBnorm outputs as required for pipeline execution. I suspect that avoiding the dataframe would make this pipeline much cleaner, but the need to maintain the data structure in nested per-cohort and per-sample directories (rather than pooling them) makes this complicated:

├── resources
│   ├── RawData
│   │     ├── {COHORT_ID}
│   │     │    └── {SAMPLE_ID}
│   │     │        ├── {SAMPLE_1.fastq}
│   │     │        └── {SAMPLE_2.fastq}
│   │     └── {COHORT_ID}
│   │         └── {SAMPLE_ID}
│   │             └── {SAMPLE.fastq}

So my question then becomes: How can I fix my wildcard usage to ensure that Snakemake understands that these files are created via intermediate rules, and must exist before the pipeline proceeds to the next rule?

Is the issue perhaps with the way Snakemake resolves the expand() function? If so, how should I reformat the rules to avoid using expand()? Or is there something fundamental I am missing about how lambda wildcards vs. expand() work in Snakemake?

I appreciate any comments or suggestions you may have!

1 Answer

Your rules use different wildcard patterns:

  • bbnorm_pe uses: {cohort_mapped_with_sample}

  • bbnorm_se uses: {cohort_with_sample}

  • filt_100k expects: {cohort_with_sample}

But in your dry-run, filt_100k lists only workflow/scripts/snakemake_100k_filt.sh as input; none of the normalized fastq files appear. This is because the expand() calls are silently failing to match any actual files.
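expand() is essentially string substitution over a list of values, so nothing warns you when that list is empty or when the values do not line up with what an upstream rule actually writes. A minimal mimic (plain Python; expand_like is a hypothetical stand-in, not the Snakemake API) shows the failure mode:

```python
def expand_like(pattern, **wildcards):
    # Minimal single-wildcard mimic of snakemake.io.expand
    (name, values), = wildcards.items()
    return [pattern.format(**{name: v}) for v in values]

# Normal case: one concrete path per dataframe entry
paths = expand_like(
    "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.1.fq",
    cohort_with_sample=["CohortA/S1", "CohortA/S2"],
)
print(paths)

# Failure mode: an empty value list (e.g. an accidentally empty dataframe
# column) makes the rule's input list silently empty, so the rule appears
# to need nothing beyond its other inputs.
print(expand_like("results/{s}_norm.1.fq", s=[]))  # -> []
```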

When you use expand() with a dataframe column, Snakemake tries to generate file paths. However:

  1. Your bbnorm_pe outputs use {cohort_mapped_with_sample} but your filt_100k input expects {cohort_with_sample}

  2. The column names in your dataframe index also matter: you're indexing by CohortSample but using different wildcard names in the file patterns

You need to standardize your wildcard names across all rules.

# For PE reads - change to cohort_with_sample
rule bbnorm_pe:
    input:
        mapped_fq1 = lambda wildcards: fq_R1_names_df.loc[wildcards.cohort_with_sample, 'Location_NonHuman'],
        mapped_fq2 = lambda wildcards: fq_R2_names_df.loc[wildcards.cohort_with_sample, 'Location_NonHuman']
    output:
        input_kmers = "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_NON-human_map_input_kmers.png",
        output_kmers = "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_NON-human_map_output_kmers.png",
        normalized_fq1 = "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.1.fq",
        normalized_fq2 = "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.2.fq"
    threads: config['threads_bbnorm']
    resources:
        mem_mb = config['memory_maximum']
    params:
        memory_alloc = config['bbnorm_memory']
    log:
        "logs/BBnorm/{cohort_with_sample}_read_mapping_log.txt"
    shell:
        """
        apptainer exec workflow/containers/bbtools.sif /bbmap/bbnorm.sh {params.memory_alloc} \
        in={input.mapped_fq1} in2={input.mapped_fq2} \
        out={output.normalized_fq1} out2={output.normalized_fq2} \
        target=80 min=3 threads={threads} \
        hist={output.input_kmers} histout={output.output_kmers}
        """

# SE reads already uses cohort_with_sample - no change needed
rule bbnorm_se:
    input:
        mapped_fqse = lambda wildcards: fq_SE_names_df.loc[wildcards.cohort_with_sample, 'Location_NonHuman']
    output:
        input_kmers = "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_NON-human_map_input_kmers.png",
        output_kmers = "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_NON-human_map_output_kmers.png",
        normalized_fqse = "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.SE.fq"
    threads: config['threads_bbnorm']
    resources:
        mem_mb = config['memory_maximum']
    params:
        memory_alloc = config['bbnorm_memory']
    log:
        "logs/BBnorm/{cohort_with_sample}_read_mapping_log.txt"
    shell:
        """
        apptainer exec workflow/containers/bbtools.sif /bbmap/bbnorm.sh {params.memory_alloc} \
        in={input.mapped_fqse} out={output.normalized_fqse} threads={threads} \
        target=80 min=3 hist={output.input_kmers} histout={output.output_kmers}
        """

The other issue is that you're using .CohortSample (the column) while your outputs use {cohort_with_sample} as the wildcard. These should match:

rule filt_100k:
    input:
        r1_files = expand("results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.1.fq",
            cohort_with_sample = fq_R1_names_df.index),  # Use .index instead of .CohortSample
        r2_files = expand("results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.2.fq",
            cohort_with_sample = fq_R2_names_df.index),
        se_files = expand("results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.SE.fq",
            cohort_with_sample = fq_SE_names_df.index),
        filtering_script = 'workflow/scripts/snakemake_100k_filt.sh',
    log:
        filt_100k_log = 'logs/100k_filt.txt'
    output:
        completion_100k = 'logs/completion/100k_filt__COMPLETE.txt'
    shell:
        'bash {input.filtering_script}'

1 Comment

Thank you very much for pointing this out! I guess I was treating the wildcards more like variables, and wasn't consistent enough. It'll take a bit to sort this all out, but I'm really glad to have a way forward!
