Understanding Microbes, and Life: June 2013

Had to take a hiatus again for an interview and whatnot. I hope it went well!
So I am continuing with the QIIME struggle. Today I am running the split_libraries.py command.
For my Lake sequences I typed in:

split_libraries.py -m Lakes_Map_reverse_primer1.txt -f lakes454.fna -q lakes454.qual -b hamming_8 -o split_library_output_revprimers/ -z truncate_only

This command takes the Lakes 454 FASTA sequences (each with a barcode plus forward and reverse primers) and writes them out to a file with just the sequences and their names. The output folder gives you 3 files. The first is the .fna output file with the named sequences. The log file is a summary of the run. The histogram file tells you how many sequences fall into each length bin before and after processing. Look at where the majority of sequences sit after processing and choose a range around that peak. Then run split_libraries.py again specifying a minimum and maximum length.
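Here is a rough sketch of the "choose a range around the peak" step. It assumes the after-processing counts have already been copied by hand from the histogram file into a Python dict. The function name `pick_length_window`, the 80% target and the example numbers are all made up for illustration, not taken from my data. It starts at the tallest bin and grows outward until the window holds the chosen fraction of reads.

```python
def pick_length_window(counts, target_fraction=0.8):
    """Grow a window around the tallest length bin until it holds
    at least target_fraction of all reads. Returns (min_len, max_len)."""
    lengths = sorted(counts)
    total = sum(counts.values())
    # Start at the bin with the most reads (the peak).
    peak = max(lengths, key=lambda n: counts[n])
    lo = hi = lengths.index(peak)
    covered = counts[peak]
    while covered < target_fraction * total:
        # -1 marks "no more bins on this side".
        left = counts[lengths[lo - 1]] if lo > 0 else -1
        right = counts[lengths[hi + 1]] if hi < len(lengths) - 1 else -1
        if left < 0 and right < 0:
            break
        # Extend toward whichever neighbouring bin holds more reads.
        if left >= right:
            lo -= 1
            covered += left
        else:
            hi += 1
            covered += right
    return lengths[lo], lengths[hi]


# Hypothetical bin counts: bin start length -> number of reads.
example_counts = {380: 50, 400: 200, 420: 900, 440: 2500,
                  460: 3000, 480: 1200, 500: 150}
print(pick_length_window(example_counts))  # (440, 480)
```

The two numbers it returns are bin starts, so the upper bound for the rerun should probably be pushed up to the end of that last bin.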

The number of sequences written should never be more than the number of input sequences, and it is usually less, because some reads fail the quality and length filters.
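A minimal sketch for checking this yourself: count the header lines (the ones starting with ">") in the input and output FASTA files and compare. The function name `count_fasta_records` is my own, and the demo uses a tiny throwaway file rather than real data.

```python
import os
import tempfile


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Demo on a throwaway file with three made-up records.
with tempfile.NamedTemporaryFile("w", suffix=".fna", delete=False) as tmp:
    tmp.write(">read1\nACGTACGT\n>read2\nGGCCAATT\nACGT\n>read3\nTTTT\n")
print(count_fasta_records(tmp.name))  # 3
os.remove(tmp.name)

# With real files it would look something like this (the output file
# name is an assumption, so check what is actually in the output folder):
# n_in = count_fasta_records("lakes454.fna")
# n_out = count_fasta_records("split_library_output_revprimers/seqs.fna")
# print(n_in, n_out, n_out <= n_in)
```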

The -b hamming_8 part of the command specifies the barcode type: Hamming barcodes that are 8 bp long.
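To show what "Hamming" refers to: the Hamming distance between two equal-length barcodes is the number of positions where they differ. The sketch below assigns an observed barcode to the nearest known one if it is within one mismatch. This is a simplified nearest-match illustration, not the decoder QIIME actually uses, and the barcodes and function names are made up.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must be the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_barcode(observed, known_barcodes, max_mismatches=1):
    """Return the known barcode nearest to `observed`, or None if even
    the nearest one has more than max_mismatches differences."""
    best = min(known_barcodes, key=lambda bc: hamming_distance(observed, bc))
    if hamming_distance(observed, best) <= max_mismatches:
        return best
    return None


# Hypothetical 8 bp barcodes.
known = ["AACCGGTT", "TTGGCCAA", "ACGTACGT"]
print(closest_barcode("AACCGGTA", known))  # AACCGGTT (one mismatch)
print(closest_barcode("GGGGGGGG", known))  # None (too far from all three)
```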

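The -z part controls what happens with the reverse primer. With truncate_only, the reverse primer (and anything after it) is taken off if it is detected, and reads where it is not detected are still written as they are.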
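A minimal sketch of the truncate-if-found idea, assuming exact matching only and assuming the primer is given in the orientation in which it appears in the read. The function name and sequences are hypothetical.

```python
def truncate_at_reverse_primer(read, primer_in_read):
    """If the primer is found, cut the read where the primer starts.
    If not, return the read unchanged. Also reports whether it was found."""
    position = read.find(primer_in_read)
    if position == -1:
        return read, False
    return read[:position], True


# Made-up read with a made-up primer "GGCCAA" near the end.
print(truncate_at_reverse_primer("ACGTACGTTTGGCCAAGGCATCAT", "GGCCAA"))
# ('ACGTACGTTT', True)
print(truncate_at_reverse_primer("ACGTACGTTTACGT", "GGCCAA"))
# ('ACGTACGTTTACGT', False)
```

The second case is the "identifiable barcode but no detectable reverse primer" situation: the read is kept, just not trimmed.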

The script takes 15 to 20 minutes to run on my slow-ass computer with the defaults of 200 nt minimum and 1000 nt maximum per read. The tighter this range, the less noise from non-specific amplification you will have.

The forward (27F) and reverse (519R) primers should be amplifying a product around 492 bp long.
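The 492 comes from simple subtraction of the positions in the primer names. This is a back-of-the-envelope sketch: it treats the numbers in the names as positions along the gene and ignores exactly where each primer starts and ends, so it is only a rough estimate. The function name is made up.

```python
def expected_amplicon_length(forward_position, reverse_position):
    """Rough product size from the positions in the primer names."""
    return reverse_position - forward_position


print(expected_amplicon_length(27, 519))  # 492
```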

QUESTIONS I have:

The split libraries output gives me a log file and I have questions about it. Why are there sequences with an identifiable barcode but no detectable reverse primer? Why is there a list of “Total valid barcodes that are not in mapping file”? Should these be in the final written .fna file, or are they just sequences that could be barcodes that the script picks up?

I reran this on the original files (notice the new output folder name) to narrow the length range:

split_libraries.py -m Lakes_Map_reverse_primer1.txt -f lakes454.fna -q lakes454.qual -b hamming_8 -l 410 -L 490 -o split_library_output_revprimers_Run2/ -z truncate_only

The -l lets me set a minimum read length in base pairs, and -L lets me set a maximum. I chose this range using the histograms.txt file generated by the first run. I based my range on where the peak was (the length with the most sequences), and went far enough either side of that peak to keep a big fraction of the total sequences. But I was still pretty stringent. This gave me a file with the quality-controlled sequences.

Enough for today!

Wednesday, June 5, 2013

Moving forward with sequence analysis.