At some point, you may have a situation where you need a single sample attribute (or column) to populate several different pipeline arguments with different values. In other words, the value of a given attribute may imply values for other attributes. It would be nice if you didn’t have to enumerate all of these secondary, implied attributes, and could instead just infer them from the value of the original attribute. For example, if my organism attribute is
human, this implies a few other secondary attributes (which may be project-specific): For one project, I want to set
hs. Of course, I could just define columns called
macs_genome_size, but these would be invariant, so it feels inefficient and unweildy; and then, changing the aligned genome would require changing the sample annotation sheet (every sample, in fact). You can do this with looper, of course, but a better way would be handle these things at the project level.
As a more elegant alternative, Looper offers a
project_config section called
implied_columns. Instead of hard-coding
macs_genome_size in the sample annotation sheet, you can simply specify that the attribute
organism implies additional attribute-value pairs (which may vary by sample based on the value of the
organism attribute). This lets you specify the genome, transcriptome, genome size, and other similar variables all in your project configuration file.
To do this, just add an
implied_columns section to your project_config.yaml file. Example:
implied_columns: organism: human: genome: "hg38" macs_genome_size: "hs" mouse: genome: "mm10" macs_genome_size: "mm"
There are 3 layers in the
implied_columns hierarchy. The first layer, (sub-values under
organism), are primary columns from which new attributes will rely. The second layer (here,
mouse) are possible values your samples may take in the primary column. The third layer (
macs_genome_size) are the key-value pair of new, implied columns for any samples with the required value for that primary column. In this example, any samples with organism set to “human” will automatically also have attributes for genome (hg38) and for macs_genome_size (hs). Any samples with organism set to “mouse” will have the corresponding values. A sample with organism set to
frog would lack attributes for
macs_genome_size, since those columns are not implied by
This system essentially lets you set global, species-level attributes at the project level instead of duplicating that information for every sample that belongs to a species. Even better, it’s generic – so you can do this for any subdivision of samples (just replace
organism with whatever you like). This makes your project more portable and does a better job conceptually with separating sample attributes from project attributes; after all, a reference genome assembly is really not an inherent property of a sample, but of a sample in respect to a particular project or alignment.