Had to take a hiatus again for an interview and such-what. I hope it went well!
So I am continuing with the QIIME struggle. Today I am running the split_libraries.py command.
I typed in for my Lake sequences:
split_libraries.py -m Lakes_Map_reverse_primer1.txt -f lakes454.fna -q lakes454.qual -b hamming_8 -o split_library_output_revprimers/ -z truncate_only
This command takes the Lakes 454 fasta sequences (which still carry the barcode and the forward and reverse primers), assigns each read to its sample, and writes them to a file with just the sequences and their names. The output folder gives you 3 files. The first is the .fna output file with the sequences and names mentioned. The log file is a summary of the run. The histogram file tells you how many sequences fall at each length before and after processing. Look at where the majority of sequences are after processing and choose a range around that peak. Then run split_libraries.py again specifying a minimum and maximum length.
The number of written sequences should never be more than the number of input sequences, since reads only ever get filtered out.
The -b hamming_8 part of the command specifies the barcode type, here Hamming barcodes that are 8 bp long.
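To get my head around what "matching an 8 bp barcode" means, here is a minimal Python sketch. It is not QIIME's own decoder: it just takes the first 8 bases of a read and looks for the closest barcode by counting mismatches. The sample names, barcodes, and the `max_mismatches` cutoff are all made up for illustration.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_barcode(observed, known_barcodes, max_mismatches=1):
    """Return the sample whose barcode is uniquely closest to `observed`.

    Returns None when nothing is within `max_mismatches` or when two
    barcodes tie for closest. This is a plain nearest-match rule, an
    assumption for illustration rather than QIIME's actual behavior.
    """
    scored = sorted(
        (hamming_distance(observed, barcode), sample)
        for sample, barcode in known_barcodes.items()
    )
    best_distance, best_sample = scored[0]
    if best_distance > max_mismatches:
        return None
    if len(scored) > 1 and scored[1][0] == best_distance:
        return None  # ambiguous: two barcodes are equally close
    return best_sample


# Hypothetical sample names and 8 bp barcodes (not from my mapping file).
barcodes = {"LakeA": "AACCGGTT", "LakeB": "TTGGCCAA"}

read = "AACCGGTAACGTACGTACGT"   # made-up read; first 8 bases are the barcode
sample = closest_barcode(read[:8], barcodes)
print(sample)
```

A barcode that is a perfectly good 8 bp sequence but is too far from everything in `barcodes` comes back as None here, which is roughly how I picture reads whose barcode does not belong to any of my samples.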
The -z truncate_only part indicates that you want the reverse primer (and anything after it) taken off if it is detected. Reads where it is not detected are left as they are.
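Here is a small Python sketch of the truncate-if-found idea, as I understand it. It assumes the primer is given exactly as it appears in the read and uses exact string matching only, so it is a simplification and not how QIIME itself searches (no mismatches, no degenerate bases). The function name and the sequences are invented for the example.

```python
def truncate_at_reverse_primer(read, primer_in_read):
    """Cut the read where the reverse primer starts.

    Returns (sequence, found). If the primer is not found, the read is
    returned unchanged and found is False.
    Assumption: `primer_in_read` is already oriented as it appears in
    the read, and only exact matches count.
    """
    position = read.find(primer_in_read)
    if position == -1:
        return read, False
    return read[:position], True


# Made-up sequences for illustration.
primer = "TTGGCC"
with_primer = "ACGTACGTTTGGCCAAGGG"
without_primer = "ACGTACGTAAAA"

print(truncate_at_reverse_primer(with_primer, primer))
print(truncate_at_reverse_primer(without_primer, primer))
```

The second read keeps its full length and is only flagged as "primer not found", which is the kind of read I would expect to be counted in the log without being thrown away.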
The script takes 15-20 minutes to run on my slow ass computer with the defaults of 200 nt minimum and 1000 nt maximum per read. The tighter this range, the less noise from non-specific amplification you will keep.
The forward (27F) and reverse (519R) primers should be amplifying a product around 492 bp long (519 - 27 = 492).
QUESTIONS I have:
The split libraries output gives me a log file and I have
questions about it. Why are there sequences with an identifiable barcode but no
detectable reverse primer? Why is there a list of “Total valid barcodes that
are not in mapping file”? Should these be in the final written .fna file, or
are they just sequences that could be barcodes that the script picks up?
I reran this (on the original files; notice the change in the name of the output folder) to narrow the length range:
split_libraries.py -m Lakes_Map_reverse_primer1.txt -f lakes454.fna -q lakes454.qual -b hamming_8 -l 410 -L 490 -o split_library_output_revprimers_Run2/ -z truncate_only
The -l lets me set a minimum read length in base pairs, and -L lets me set a maximum. I chose this range using the histograms.txt file generated by the first run. I based my range on where the peak length was (the length with the most sequences), and widened it around that peak to keep a big fraction of the total sequences. But I was still pretty stringent. This gave me a file with the quality-controlled sequences.
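I picked my range by eye, but the same idea can be sketched in Python. This is only a rough illustration: the counts below are made up (not my real histogram), the function does not read histograms.txt itself, and the 80% target is an arbitrary assumption. It starts at the peak bin and keeps adding whichever neighboring bin has more reads until the window holds the target fraction.

```python
def choose_length_range(counts, target_fraction=0.8):
    """Grow a window around the peak length bin.

    `counts` maps a length bin to the number of reads in it. Returns
    (lowest_bin, highest_bin, fraction_covered). The greedy "add the
    bigger neighbor" rule is an assumption made for this sketch.
    """
    bins = sorted(counts)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("histogram is empty")

    peak = max(range(len(bins)), key=lambda i: counts[bins[i]])
    low = high = peak
    covered = counts[bins[peak]]

    while covered / total < target_fraction and (low > 0 or high < len(bins) - 1):
        # -1 marks "no bin on this side", so the other side always wins.
        left = counts[bins[low - 1]] if low > 0 else -1
        right = counts[bins[high + 1]] if high < len(bins) - 1 else -1
        if left >= right:
            low -= 1
            covered += left
        else:
            high += 1
            covered += right

    return bins[low], bins[high], covered / total


# Hypothetical post-processing counts per length bin.
example_counts = {
    400: 100, 410: 400, 420: 1500, 430: 4000,
    440: 2500, 450: 1000, 460: 400, 470: 100,
}

low_bin, high_bin, fraction = choose_length_range(example_counts)
print(low_bin, high_bin, fraction)
```

The bins returned are bin labels, so the actual -l and -L values still need a sanity check against how wide each bin in the real histogram is.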
Enough for today!