Tuesday, May 28, 2013

Bioinformatics or Bust!

Ok, so that was more than a day’s hiatus I took from the QIIME tutorial, but I had a hooding ceremony to attend! My folks just left Sunday, it was a great visit :)

Today I also sent my dissertation to my committee… and immediately afterward discovered more formatting errors, after battling formatting issues for a week. LeSigh.

Back to the QIIME tutorial using the VirtualBox and Ubuntu interface. Today I learned the following:

1. How to convert a .sff file to a .fna file in fasta format via QIIME (http://qiime.org/scripts/process_sff.html). It’s easy. Basically, in your terminal window, make sure you’re in the directory (cd) where the .sff file is. Then use the command

process_sff.py -i lakes454.sff

That will convert the input file (-i) named lakes454.sff, which is what 454 pyrosequencing gives you, into two outputs: lakes454.fna and lakes454.qual. The .fna holds all the 454 sequences in fasta format. The .qual file tells you about the quality of the bases/sequences. A full description can be found under “Quality Scores” here: http://qiime.org/tutorials/tutorial.html
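For a quick sanity check on the .fna output, a minimal Python sketch like the one below can count the sequences and report their lengths. It is not part of QIIME; the function name read_fasta and the two example records are made up for illustration, and it assumes an ordinary fasta layout (a ">" header line followed by one or more sequence lines).

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a fasta-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records standing in for real 454 reads.
example = io.StringIO(">read1 length=8\nACGTACGT\n>read2 length=6\nTTGACA\n")
records = list(read_fasta(example))

print(len(records), "sequences")
for name, seq in records:
    print(name.split()[0], len(seq))
```

To try it on real data, swap the io.StringIO example for something like open("lakes454.fna").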

Also, to get info about a script (.py), type in
process_sff.py -h

This will bring up the help file. It tells you all the inputs and outputs you will get. You can also type in
process_sff.py

by itself and get general info about what this script does.

2. Making a mapping file. BY HAND. Yes, I’ve been given the sample name, barcode sequences, forward primer, and reverse primer in a .txt file… but not in the format required by QIIME. So, to do it by hand, I am following the instructions here:

It’s not horrible… okay yes it was. I named the file Lakes_Map1.txt. I started it in Excel with each heading as a separate column, then copied and pasted the sample IDs and barcodes into the appropriate columns, etc. I tried to save it as a tab-delimited .txt file, and Excel added double quotation marks to the .txt file.

I deleted all the double quotation marks by hand in the .txt file itself (gotta be a better way to do this, but all I could find online were Excel macros and I don’t use those yet) and saved it again.
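One possible alternative to deleting the quotes by hand is a few lines of Python. This is only a sketch: the function name strip_excel_quotes and the example rows (sample ID, barcode, description) are invented for illustration, and it assumes the file is tab-delimited with Excel-style double quotes around some or all fields.

```python
import csv
import io


def strip_excel_quotes(in_handle, out_handle):
    """Copy a tab-delimited file, dropping the quotes Excel wrapped around fields."""
    # csv.reader understands the quoting, so each field comes back unquoted.
    reader = csv.reader(in_handle, delimiter="\t", quotechar='"')
    for row in reader:
        # Write the fields back out joined by tabs, with no quoting at all.
        out_handle.write("\t".join(row) + "\n")


# Made-up two-line file standing in for what Excel saved.
quoted = io.StringIO(
    '"#SampleID"\t"BarcodeSequence"\t"Description"\n'
    '"Lake.1"\t"AGCACGAGCCTA"\t"example sample"\n'
)
cleaned = io.StringIO()
strip_excel_quotes(quoted, cleaned)
print(cleaned.getvalue())
```

On real files, the two io.StringIO objects would be replaced by open("Lakes_Map1.txt", newline="") for reading and a second file opened for writing under a different (hypothetical) name such as Lakes_Map1_clean.txt, so the original is not overwritten.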
I then checked my mapping file to see if there were any formatting errors in it using the command:

check_id_map.py -m Lakes_Map1.txt -o Lakes_output

This generated a new folder entitled Lakes_output with some files in it. The .html file tells you where the errors are, and the _corrected.txt file tries to correct them for you. I deleted the …. that were inserted where the errors in my file were, resaved the file as Lakes_Map1.txt (deleting all the old ones), and reran the check_id_map.py command. This time there were no errors. Yay!
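Before running check_id_map.py, a rough pre-check of the header line can catch the most obvious problems, like leftover quotes. The sketch below is not what check_id_map.py does internally; the function name check_mapping_header is made up, and the rules it encodes (first column #SampleID, BarcodeSequence and LinkerPrimerSequence present, Description last) are my understanding of the QIIME mapping file format, so treat them as assumptions and let check_id_map.py have the final word.

```python
def check_mapping_header(header_line):
    """Return a list of problems found in a tab-delimited mapping file header."""
    problems = []
    fields = header_line.rstrip("\n").split("\t")

    # Assumed rule: the first column header is exactly "#SampleID".
    if fields[0] != "#SampleID":
        problems.append("first column should be #SampleID, found %r" % fields[0])

    # Assumed rule: these two columns must be present somewhere.
    for required in ("BarcodeSequence", "LinkerPrimerSequence"):
        if required not in fields:
            problems.append("missing column %s" % required)

    # Assumed rule: the last column header is "Description".
    if fields[-1] != "Description":
        problems.append("last column should be Description, found %r" % fields[-1])

    # Stray quotes or padding usually mean a spreadsheet export mangled the file.
    if any('"' in f or f != f.strip() for f in fields):
        problems.append("header contains quotes or extra whitespace")

    return problems


good = "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription\n"
bad = '"#SampleID"\tBarcodeSequence\tLinkerPrimerSequence\tDescription\n'
print(check_mapping_header(good))
print(check_mapping_header(bad))
```

The header line of a real file could be read with something like open("Lakes_Map1.txt").readline().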

So this is as far as I got today. Not bad! I’m also using the QIIME tutorial files they give to try the new commands first and then using the Lakes data given to me by a co-worker to try these things on REAL data. I feel that’s the only way I am going to learn this process. Today’s bioinformatics “workout” took about 2.25 hours with all the errors and doing stuff by hand.

Signing off for today. Time for a real physical workout.

2 comments:

  1. Once you have a .csv with extra quotes and stuff, here's how I would handle it:
    1. Open the .csv in a text editor, such as Notepad in Windows.
    2. Use the "Find and replace" command to Find: "" and replace with nothing.
    3. Tell it to "Replace All". (ctrl+z will undo if it yields unexpected results.)

    No macro required. :)
