I have a Snakemake pipeline (https://github.com/V-Varga/SPOT-BGC/tree/main) in which I generate input and output file names for various intermediate steps using wildcards that refer back to file names and paths stored in a Pandas dataframe. The full Snakefile is available in the repository, so I will only include the most relevant parts here.
I use both lambda functions over wildcards and expand() in rule inputs. I recently realized that this structure somehow leads to Snakemake not always recognizing the outputs of one rule as required inputs of a downstream rule, even though they are defined that way. For example, the outputs of rules bbnorm_pe and bbnorm_se are defined as inputs of rule filt_100k:
# rule bbnorm_pe normalizes PE reads
rule bbnorm_pe:
    input:
        mapped_fq1 = lambda wildcards: fq_R1_names_df.loc[wildcards.cohort_mapped_with_sample, 'Location_NonHuman'],
        mapped_fq2 = lambda wildcards: fq_R2_names_df.loc[wildcards.cohort_mapped_with_sample, 'Location_NonHuman']
    output:
        input_kmers = "results/DataNonHuman/BBNorm_Reads/{cohort_mapped_with_sample}_NON-human_map_input_kmers.png",
        output_kmers = "results/DataNonHuman/BBNorm_Reads/{cohort_mapped_with_sample}_NON-human_map_output_kmers.png",
        normalized_fq1 = "results/DataNonHuman/BBNorm_Reads/{cohort_mapped_with_sample}_norm.1.fq",
        normalized_fq2 = "results/DataNonHuman/BBNorm_Reads/{cohort_mapped_with_sample}_norm.2.fq"
    threads: config['threads_bbnorm']
    resources:
        mem_mb = config['memory_maximum']
    params:
        memory_alloc = config['bbnorm_memory']
    log:
        "logs/BBnorm/{cohort_mapped_with_sample}_read_mapping_log.txt"
    shell:
        """
        apptainer exec workflow/containers/bbtools.sif /bbmap/bbnorm.sh {params.memory_alloc} \
            in={input.mapped_fq1} in2={input.mapped_fq2} \
            out={output.normalized_fq1} out2={output.normalized_fq2} \
            target=80 min=3 threads={threads} \
            hist={output.input_kmers} histout={output.output_kmers}
        """
# rule bbnorm_se normalizes SE reads
rule bbnorm_se:
    input:
        mapped_fqse = lambda wildcards: fq_SE_names_df.loc[wildcards.cohort_with_sample, 'Location_NonHuman']
    output:
        input_kmers = "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_NON-human_map_input_kmers.png",
        output_kmers = "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_NON-human_map_output_kmers.png",
        normalized_fqse = "results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.SE.fq"
    threads: config['threads_bbnorm']
    resources:
        mem_mb = config['memory_maximum']
    params:
        memory_alloc = config['bbnorm_memory']
    log:
        "logs/BBnorm/{cohort_with_sample}_read_mapping_log.txt"
    shell:
        """
        apptainer exec workflow/containers/bbtools.sif /bbmap/bbnorm.sh {params.memory_alloc} \
            in={input.mapped_fqse} out={output.normalized_fqse} threads={threads} \
            target=80 min=3 hist={output.input_kmers} histout={output.output_kmers}
        """
# filter down to files with 100k reads
rule filt_100k:
    input:
        expand("results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.1.fq",
               cohort_with_sample = fq_R1_names_df.CohortSample),
        expand("results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.2.fq",
               cohort_with_sample = fq_R1_names_df.CohortSample),
        expand("results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.SE.fq",
               cohort_with_sample = fq_SE_names_df.CohortSample),
        filtering_script = 'workflow/scripts/snakemake_100k_filt.sh',
    log:
        filt_100k_log = 'logs/100k_filt.txt'
    output:
        completion_100k = 'logs/completion/100k_filt__COMPLETE.txt'
    shell:
        'bash {input.filtering_script}'
Based on the --dry-run printout, Snakemake does not seem to recognize that the outputs of rules bbnorm_pe and bbnorm_se are actually required to exist as inputs of rule filt_100k.
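To illustrate how I understand the lambda inputs to resolve, here is a stand-alone sketch with a dummy dataframe and a dummy wildcards object (not my real table; Snakemake passes a proper Wildcards object, but attribute access works the same way):

```python
from types import SimpleNamespace

import pandas as pd

# Dummy stand-in for fq_R1_names_df, indexed by CohortSample as in the Snakefile.
fq_R1_names_df = pd.DataFrame({
    "CohortSample": ["CohortA/Sample1"],
    "Location_NonHuman": ["results/DataNonHuman/NonHumanOG/CohortA/Sample1_NON-human_map.1.fq"],
}).set_index("CohortSample", drop=False)

# Snakemake calls the input lambda once per job, with that job's wildcard values;
# here a SimpleNamespace stands in for the wildcards object.
wildcards = SimpleNamespace(cohort_mapped_with_sample="CohortA/Sample1")
mapped_fq1 = fq_R1_names_df.loc[wildcards.cohort_mapped_with_sample, "Location_NonHuman"]
print(mapped_fq1)  # the per-sample input path looked up from the dataframe
```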
I have tried using touch() to create a completion flag (see my earlier question: Snakemake wildcard issues using touch() to avoid premature pipeline progression), but this did not resolve the issue.
The pipeline has run successfully before, but I suspect this was simply luck: the datasets I used previously were much smaller and far less deeply sequenced than the one I am running now. The longer runtimes caused by the sequencing depth are probably what now expose the fact that Snakemake is not proceeding in the intended order.
The dataframe file looks like this:
Cohort Sample CohortSample CohortSampleSample Location_Raw FileBase_Raw CohortBase_Raw Location_Trim FileBase_Trim CohortBase_Trim Location_NonHuman FileBase_NonHuman CohortBase_NonHuman Location_Norm FileBase_Norm CohortBase_Norm Location_100k FileBase_100k CohortBase_100k AssemblySample_Location AssemblySample_FileBase AssemblySample_CohortBase FiltAssemblySample_Location FiltAssemblySample_FileBase FiltAssemblySample_CohortBase TaxaSample_Location TaxaSample_FileBase TaxaSample_CohortBase GECCOSample_Location GECCOSample_FileBase GECCO_CohortBase AntiSMASH_Location AntiSMASH_Base AntiSMASH_CohortBase ReadNum
CapeTown_PRJNA767784 SRR16503423 CapeTown_PRJNA767784/SRR16503423 CapeTown_PRJNA767784/SRR16503423/SRR16503423 resources/RawData/CapeTown_PRJNA767784/SRR16503423/SRR16503423_1.fastq SRR16503423_1 CapeTown_PRJNA767784/SRR16503423_1 results/Trimmomatic/CapeTown_PRJNA767784/SRR16503423.1.fastq SRR16503423.1 CapeTown_PRJNA767784/SRR16503423.1 results/DataNonHuman/NonHumanOG/CapeTown_PRJNA767784/SRR16503423_NON-human_map.1.fq SRR16503423_NON-human_map.1 CapeTown_PRJNA767784/SRR16503423_NON-human_map.1 results/DataNonHuman/BBNorm_Reads/CapeTown_PRJNA767784/SRR16503423_norm.1.fq SRR16503423_norm.1 CapeTown_PRJNA767784/SRR16503423_norm.1 results/DataNonHuman/100k_Filt/CapeTown_PRJNA767784/SRR16503423_norm.1.fq SRR16503423_norm.1 CapeTown_PRJNA767784/SRR16503423_norm.1 results/Assembly/PerSample/CapeTown_PRJNA767784/SRR16503423/SRR16503423_scaffolds.fasta SRR16503423_scaffolds CapeTown_PRJNA767784/SRR16503423/SRR16503423_scaffolds results/AssemblyNonHuman/PerSample/CapeTown_PRJNA767784/SRR16503423/SRR16503423_assemblySample_NON-human_map.fasta SRR16503423_assemblySample_NON-human_map CapeTown_PRJNA767784/SRR16503423/SRR16503423_assemblySample_NON-human_map results/Taxonomy/PerSample/CapeTown_PRJNA767784/SRR16503423/SRR16503423__SampleTaxa.txt SRR16503423__SampleTaxa CapeTown_PRJNA767784/SRR16503423/SRR16503423__SampleTaxa results/BGC_Prediction/GECCO_Results/PerSample/CapeTown_PRJNA767784/SRR16503423/SRR16503423_contigs.clusters.gff SRR16503423_contigs.clusters CapeTown_PRJNA767784/SRR16503423/SRR16503423_contigs.clusters results/BGC_Prediction/AntiSMASH_Results/PerSample/CapeTown_PRJNA767784/SRR16503423/SRR16503423_persample_AntiSMASH.json SRR16503423_persample_AntiSMASH CapeTown_PRJNA767784/SRR16503423/SRR16503423_persample_AntiSMASH 1
It is imported into Pandas dataframes in the Snakefile like so:
# dataframe with the file basenames as index
filebases_df = (pd.read_csv(config['targets_per_sample'], sep='\t').set_index('FileBase_Raw', drop=False))
# dataframe with cohort-categorized sample IDs
cohort_filebases_df = (pd.read_csv(config['targets_per_sample'], sep='\t').set_index('CohortBase_Raw', drop=False))
# cohort sample index
cohort_sample_df = (pd.read_csv(config['targets_per_sample'], sep='\t').set_index('CohortSample', drop=False))
# filtered dataframes:
# forward reads
fq_R1_names_df = cohort_sample_df[cohort_sample_df['ReadNum'] == '1']
# reverse reads
fq_R2_names_df = cohort_sample_df[cohort_sample_df['ReadNum'] == '2']
# SE reads
fq_SE_names_df = cohort_sample_df[cohort_sample_df['ReadNum'] == 'SE']
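As a sanity check on this filtering pattern, here is a self-contained version with a toy table standing in for my real TSV (only the CohortSample and ReadNum columns). Note that I force ReadNum to str here: if pandas were to infer the column as integer, the string comparisons above would silently return empty frames, and any expand() over them would produce empty input lists.

```python
import io

import pandas as pd

# Toy stand-in for the targets TSV (column names match the real file).
tsv = (
    "CohortSample\tReadNum\n"
    "CohortA/Sample1\t1\n"
    "CohortA/Sample1\t2\n"
    "CohortB/Sample2\tSE\n"
)
# dtype={'ReadNum': str} guards against pandas inferring ReadNum as an integer,
# which would make the string comparisons below match nothing.
cohort_sample_df = (pd.read_csv(io.StringIO(tsv), sep="\t", dtype={"ReadNum": str})
                    .set_index("CohortSample", drop=False))

fq_R1_names_df = cohort_sample_df[cohort_sample_df["ReadNum"] == "1"]
fq_SE_names_df = cohort_sample_df[cohort_sample_df["ReadNum"] == "SE"]

# If either frame is empty, the corresponding expand() calls yield no inputs at all.
print(len(fq_R1_names_df), len(fq_SE_names_df))  # 1 1
```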
When I do a dry-run up to the bottleneck step, rule filt_100k (the aggregator rule that executes prematurely), the printout shows that the read inputs of rule filt_100k are not registered as required for the job to run, despite being listed in the input sections of both rule filt_100k and rule all.
If I change rule all to only include the BBnorm step outputs, like so:
rule all:
    input:
        expand('results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.1.fq',
               cohort_with_sample = fq_R1_names_df.CohortSample),
        expand('results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.2.fq',
               cohort_with_sample = fq_R1_names_df.CohortSample),
        expand('results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.SE.fq',
               cohort_with_sample = fq_SE_names_df.CohortSample)
And then do dry-runs with Snakemake:
(snakemake) snakemake --dry-run --verbose --until bbnorm_pe
Wildcard constraints in inputs are ignored. (rule: trim_files_pe)
Wildcard constraints in inputs are ignored. (rule: trim_files_pe)
Wildcard constraints in inputs are ignored. (rule: trim_files_se)
host: [REMOVED]
Building DAG of jobs...
shared_storage_local_copies: True
remote_exec: False
Job stats:
job count
----- -------
total 0
Job stats:
job count
----- -------
total 0
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
(snakemake) snakemake --dry-run --verbose --until filt_100k
Wildcard constraints in inputs are ignored. (rule: trim_files_pe)
Wildcard constraints in inputs are ignored. (rule: trim_files_pe)
Wildcard constraints in inputs are ignored. (rule: trim_files_se)
host: [REMOVED]
Building DAG of jobs...
logs/completion/100k_filt__COMPLETE.txt: True 0
shared_storage_local_copies: True
remote_exec: False
Job stats:
job count
--------- -------
filt_100k 1
total 1
[Thu Oct 16 14:46:37 2025]
rule filt_100k:
input: workflow/scripts/snakemake_100k_filt.sh
output: logs/completion/100k_filt__COMPLETE.txt
log: logs/100k_filt.txt
jobid: 0
reason: Missing output files: logs/completion/100k_filt__COMPLETE.txt
resources: tmpdir=<TBD>
Handle temp files for job filt_100k
Job stats:
job count
--------- -------
filt_100k 1
total 1
Reasons:
(check individual jobs above for details)
output files have to be generated:
filt_100k
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
To me, this looks like Snakemake is not properly registering the BBnorm outputs as required for pipeline execution. I suspect that avoiding the dataframe lookups would make this pipeline much cleaner, but the fact that I have to maintain the data in nested per-cohort and per-sample directories (rather than pooling them) makes this complicated:
├── resources
│ ├── RawData
│ │ ├── {COHORT_ID}
│ │ │ └── {SAMPLE_ID}
│ │ │ ├── {SAMPLE_1.fastq}
│ │ │ └── {SAMPLE_2.fastq}
│ │ └── {COHORT_ID}
│ │ └── {SAMPLE_ID}
│ │ └── {SAMPLE.fastq}
So my question then becomes: How can I fix my wildcard usage to ensure that Snakemake understands that these files are created via intermediate rules, and must exist before the pipeline proceeds to the next rule?
Is the issue perhaps with the way Snakemake resolves the expand() function? If so, how should I reformat the rules to avoid using expand()? Or is there something fundamental I am missing about how lambda-based inputs vs. expand() inputs work in Snakemake?
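For reference, my mental model of expand() is sketched below as a simplified pure-Python stand-in (not Snakemake's actual implementation): it is evaluated once, at parse time, formatting the pattern over every combination of the supplied values. In particular, an empty value list silently yields an empty file list, which would explain a rule appearing to have no read inputs at all.

```python
from itertools import product


def expand_like(pattern, **wildcards):
    """Simplified stand-in for snakemake.io.expand(): format the pattern
    once per combination of the supplied wildcard value lists."""
    keys = list(wildcards)
    return [pattern.format(**dict(zip(keys, combo)))
            for combo in product(*(wildcards[k] for k in keys))]


files = expand_like("results/DataNonHuman/BBNorm_Reads/{cohort_with_sample}_norm.1.fq",
                    cohort_with_sample=["CohortA/Sample1"])
print(files)

# An empty value list yields an empty file list -- the rule then has no such inputs:
print(expand_like("results/{s}_norm.1.fq", s=[]))  # []
```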
I appreciate any comments or suggestions you may have!