Extracting lines from txt file with Python


I'm downloading mtDNA records from NCBI and trying to extract lines from them using Python. The lines I'm trying to extract either start with or contain certain keywords such as 'haplotype', 'nationality' or 'locality'. I've tried the following code:

    import re

    infile = open('sequence.txt', 'r')   # open in file 'infilename' to read
    outfile = open('results.txt', 'a')   # open out file 'outfilename' to write

    for line in infile:
        if re.findall("(.*)haplogroup(.*)", line):
            outfile.write(line)
            outfile.write(infile.readline())

    infile.close()
    outfile.close()

The output here only contains the first line containing 'haplogroup', for example, and not the following lines from the infile:

                 /haplogroup="t2b20" 

I've also tried the following:

    keep_phrases = ["accession", "haplogroup"]

    for line in infile:
        for phrase in keep_phrases:
            if phrase in line:
                outfile.write(line)
                outfile.write(infile.readline())

But this doesn't give me all of the lines containing accession and haplogroup.

line.startswith works, but I can't use it for lines where the word is in the middle of the line.

Could anyone give me an example piece of code that prints the following line to the output because it contains 'locality':

/note="origin_locality:wales" 

Any other advice on how I can extract lines containing certain words is also appreciated.

Edit:

                 /haplogroup="l2a1l2a"
                 /note="ethnicity:ashkenazic jewish;
                 origin_locality:poland: warsaw; origin_coordinates:52.21 n
                 21.05 e"
                 /note="taa stop codon completed addition of 3'
                 residues mrna"
                 /note="codons recognized: ucn"

In this case, using Peter's code, the first 3 lines are written to the outfile, but not the line containing 21.05 e". How can I make an exception for /note=" and copy all of the lines up to the second set of quotation marks, without copying the /note lines containing /note="taa or /note="codons?

Edit2:

This is my current solution, which is working for me.

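    # keep_phrases and remove_phrases (lists of strings) must be defined beforehand
    stuff_to_write = []
    multiline = False  # True while inside a /note value that spans several lines
    with open('sequences.txt') as f:
        for line in f.readlines():
            if any(phrase in line for phrase in keep_phrases) or multiline:
                do_not_write = False
                # a quotation mark on a continuation line closes the multi-line value
                if multiline and line.count('"') >= 1:
                    multiline = False
                if 'note' in line:
                    if any(phrase in line.split('note')[1] for phrase in remove_phrases):
                        do_not_write = True
                    elif line.count('"') < 2:
                        # opening quote without a closing one: value continues
                        multiline = True
                if not do_not_write:
                    stuff_to_write.append(line)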
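A different way to approach the multi-line /note problem is to merge each wrapped value into one logical record first and filter afterwards. The following is only a rough sketch, not the code from the question or the answer: merge_qualifiers and is_wanted are hypothetical helper names, and it assumes that a wrapped value opens on a line with exactly one quotation mark and closes on the next line that contains one (values with embedded quotation marks would break it).

```python
def merge_qualifiers(lines):
    """Yield one stripped string per logical line, joining wrapped quoted values."""
    buffer = None  # holds a value whose closing quote has not been seen yet
    for raw in lines:
        text = raw.strip()
        if buffer is not None:
            buffer += ' ' + text
            if '"' in text:          # assumed: this quote closes the value
                yield buffer
                buffer = None
        elif text.count('"') == 1:   # opening quote only: value continues
            buffer = text
        else:
            yield text
    if buffer is not None:           # unterminated value at end of input
        yield buffer


def is_wanted(record, keep_phrases, remove_phrases):
    """Keep records with a keep phrase, except /note records with a remove phrase."""
    if not any(phrase in record for phrase in keep_phrases):
        return False
    if record.startswith('/note=') and any(p in record for p in remove_phrases):
        return False
    return True


# Sample taken from the excerpt in the question
sample = [
    '                 /haplogroup="l2a1l2a"\n',
    '                 /note="ethnicity:ashkenazic jewish;\n',
    '                 origin_locality:poland: warsaw; origin_coordinates:52.21 n\n',
    '                 21.05 e"\n',
    '                 /note="taa stop codon completed addition of 3\'\n',
    '                 residues mrna"\n',
    '                 /note="codons recognized: ucn"\n',
]
keep_phrases = ["accession", "haplogroup", "locality"]
remove_phrases = ['codon', 'taa']

kept = [r for r in merge_qualifiers(sample)
        if is_wanted(r, keep_phrases, remove_phrases)]
for record in kept:
    print(record)
```

With this approach each kept /note comes out as a single line, so the original line breaks inside the value are not preserved.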

The code below will search the file for matching phrases and write those lines to a new file, provided that whatever comes after "note" doesn't match anything in remove_phrases.

It will read the input line by line to check whether anything matches the words in keep_phrases, store the matching lines in a list, then write them to a new file on separate lines. Unless you need to write to the new file line by line as matches are found, this should be a lot faster, since everything is written at the same time.

If you don't want it to be case sensitive, change any(phrase in line to any(phrase.lower() in line.lower().
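As a small illustration of the case-insensitive check, here is a sketch that wraps it in a helper. The name matches_any and the sample record line are made up for this example; lowering the line once avoids repeating the work for every phrase.

```python
def matches_any(line, phrases):
    """Return True if any phrase occurs in line, ignoring case."""
    lowered = line.lower()
    return any(phrase.lower() in lowered for phrase in phrases)


# Made-up sample lines, only for illustration
print(matches_any('ACCESSION   XX000000', ['accession', 'haplogroup']))
print(matches_any('SOURCE      mitochondrion', ['accession', 'haplogroup']))
```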

    keep_phrases = ["accession", "haplogroup", "locality"]
    remove_phrases = ['codon', 'taa']

    stuff_to_write = []
    with open('c:/a.txt') as f:
        for line in f.readlines():
            if any(phrase in line for phrase in keep_phrases):
                do_not_write = False
                if 'note' in line:
                    # only the text after "note" is checked against remove_phrases
                    if any(phrase in line.split('note')[1] for phrase in remove_phrases):
                        do_not_write = True
                if not do_not_write:
                    stuff_to_write.append(line)

    with open('c:/b.txt', 'w') as f:
        # each stored line already ends with a newline, so write them unchanged
        f.writelines(stuff_to_write)
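If matches do need to be written as they are found, for example because the input is very large, the same filter can be applied while streaming. This is only a sketch under that assumption: copy_matching_lines is a hypothetical function name, and the paths are supplied by the caller.

```python
def copy_matching_lines(in_path, out_path, keep_phrases, remove_phrases):
    """Copy matching lines from in_path to out_path; return how many were written."""
    written = 0
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:  # iterates lazily, one line at a time
            if not any(phrase in line for phrase in keep_phrases):
                continue
            # skip /note lines whose text after "note" contains a remove phrase
            if 'note' in line and any(
                    phrase in line.split('note', 1)[1] for phrase in remove_phrases):
                continue
            dst.write(line)  # the line keeps its own trailing newline
            written += 1
    return written
```

It could be called as copy_matching_lines('c:/a.txt', 'c:/b.txt', keep_phrases, remove_phrases), using the two lists defined above.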
