Extracting lines from a txt file with Python
I'm downloading mtDNA records from NCBI and trying to extract lines from them using Python. The lines I'm trying to extract either start with or contain certain keywords such as 'haplotype', 'nationality' or 'locality'. I've tried the following code:
    import re

    infile = open('sequence.txt', 'r')   # open in file 'infilename' to read
    outfile = open('results.txt', 'a')   # open out file 'outfilename' to write
    for line in infile:
        if re.findall("(.*)haplogroup(.*)", line):
            outfile.write(line)
            outfile.write(infile.readline())
    infile.close()
    outfile.close()
The output here contains only the first line containing 'haplogroup' and not the following line from the infile, for example:
/haplogroup="t2b20"
I've also tried the following:
    keep_phrases = ["accession", "haplogroup"]
    for line in infile:
        for phrase in keep_phrases:
            if phrase in line:
                outfile.write(line)
                outfile.write(infile.readline())
But this doesn't give me all of the lines containing accession and haplogroup.
line.startswith works, but I can't use it for lines where the word is in the middle of the line.
Could anyone give me an example piece of code that prints the following line to the output because it contains 'locality':
/note="origin_locality:wales"
Any other advice on how I can extract lines containing certain words is also appreciated.
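A minimal sketch of the difference between the two tests, using the two sample lines quoted in this question as in-memory data (the helper name matching_lines is only for illustration). The substring test with in finds a keyword anywhere in a line, while startswith only matches at the very beginning. Note also that calling infile.readline() inside a "for line in infile" loop consumes the next line, so that line is never checked against the keywords, which is a likely reason the attempts above miss matches.

```python
def matching_lines(lines, keywords):
    """Return the lines that contain any keyword, anywhere in the line."""
    return [line for line in lines if any(word in line for word in keywords)]


# The two sample lines quoted above, held in memory instead of a file.
lines = [
    '/haplogroup="t2b20"\n',
    '/note="origin_locality:wales"\n',
]

found = matching_lines(lines, ["haplogroup", "locality"])
starts = [line for line in lines if line.startswith("locality")]

print(found)   # both lines: each contains one of the keywords
print(starts)  # empty list: no line begins with "locality"
```

With a real file, the same function could be called on the open file object, since iterating over it yields one line at a time.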
Edit:
    /haplogroup="l2a1l2a"
    /note="ethnicity:ashkenazic jewish; origin_locality:poland: warsaw; origin_coordinates:52.21 n 21.05 e"
    /note="taa stop codon completed addition of 3' residues mrna"
    /note="codons recognized: ucn"
In this case, using Peter's code, the first 3 lines are written to the outfile, but not the line containing 21.05 e" (in the file, the first /note value wraps over several lines). How can I make an exception for /note=" and copy all of the lines until the second set of quotation marks, without copying the /note lines containing /note="taa or /note="codons?
Edit2:
This is my current solution, which is working for me.
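    # keep_phrases and remove_phrases are the lists defined in the answer's code
    stuff_to_write = []
    multiline = False  # True while inside a quoted value that has not been closed yet
    with open('sequences.txt') as f:
        for line in f.readlines():
            if any(phrase in line for phrase in keep_phrases) or multiline:
                do_not_write = False
                if multiline and line.count('"') >= 1:
                    multiline = False  # the closing quotation mark was found
                if 'note' in line:
                    if any(phrase in line.split('note')[1] for phrase in remove_phrases):
                        do_not_write = True
                    elif line.count('"') < 2:
                        multiline = True  # the value continues on the next line
                if not do_not_write:
                    stuff_to_write.append(line)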
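The same logic can be wrapped in a function and tried on in-memory data, which makes the multi-line handling easier to check. This is only a sketch: the function name filter_record is hypothetical, and the places where the first /note value wraps onto new lines in the sample are an assumption, since the exact line breaks of the original record are not shown here.

```python
def filter_record(lines, keep_phrases, remove_phrases):
    """Keep lines with a keep phrase, plus the continuation lines of a kept note."""
    kept = []
    multiline = False  # True while inside a quoted value that has not closed yet
    for line in lines:
        if any(p in line for p in keep_phrases) or multiline:
            skip = False
            if multiline and line.count('"') >= 1:
                multiline = False  # closing quotation mark reached
            if 'note' in line:
                if any(p in line.split('note')[1] for p in remove_phrases):
                    skip = True
                elif line.count('"') < 2:
                    multiline = True  # value continues on the next line
            if not skip:
                kept.append(line)
    return kept


# Sample based on the record in the edit; the wrap points are assumed.
sample = [
    '/haplogroup="l2a1l2a"\n',
    '/note="ethnicity:ashkenazic jewish; origin_locality:poland:\n',
    'warsaw; origin_coordinates:52.21 n\n',
    '21.05 e"\n',
    '/note="taa stop codon completed addition of 3\' residues mrna"\n',
    '/note="codons recognized: ucn"\n',
]

result = filter_record(
    sample,
    keep_phrases=["accession", "haplogroup", "locality"],
    remove_phrases=["codon", "taa"],
)
print("".join(result), end="")
```

On this sample the first four lines are kept, including the continuation line ending in 21.05 e", while the two remaining /note lines are dropped because they contain none of the keep phrases.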
This will search a file for matching phrases and write those lines to a new file, assuming anything after "note" doesn't match anything in remove_phrases.
It will read the input line by line to check whether a line matches any of the words in keep_phrases, store the matching lines in a list, and then write them to a new file on separate lines. Unless you need to write the new file line by line as matches are found, this should be a lot faster, since everything is written at the same time.
If you don't want it to be case sensitive, change any(phrase in line to any(phrase.lower() in line.lower().
    keep_phrases = ["accession", "haplogroup", "locality"]
    remove_phrases = ['codon', 'taa']
    stuff_to_write = []

    with open('c:/a.txt') as f:
        for line in f.readlines():
            if any(phrase in line for phrase in keep_phrases):
                do_not_write = False
                if 'note' in line:
                    # skip notes whose text contains an unwanted phrase
                    if any(phrase in line.split('note')[1] for phrase in remove_phrases):
                        do_not_write = True
                if not do_not_write:
                    stuff_to_write.append(line)

    with open('c:/b.txt', 'w') as f:
        # each stored line still ends with its own newline, so join without
        # a separator to avoid blank lines in the output
        f.write(''.join(stuff_to_write))
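As a rough sketch of the case-insensitive variant mentioned above, the comparison can be done on lowercased copies while the original lines are the ones kept. The function name filter_lines, the ignore_case parameter and the sample lines (including the third one, which is made up to exercise remove_phrases) are assumptions for illustration only.

```python
def filter_lines(lines, keep_phrases, remove_phrases, ignore_case=True):
    """Return lines containing a keep phrase, minus notes with a remove phrase."""
    if ignore_case:
        keep_phrases = [p.lower() for p in keep_phrases]
        remove_phrases = [p.lower() for p in remove_phrases]
    kept = []
    for line in lines:
        # Compare against a lowercased copy, but store the original line.
        text = line.lower() if ignore_case else line
        if not any(p in text for p in keep_phrases):
            continue
        if 'note' in text:
            after_note = text.split('note', 1)[1]
            if any(p in after_note for p in remove_phrases):
                continue
        kept.append(line)
    return kept


# Hypothetical mixed-case sample lines.
sample = [
    '/Haplogroup="T2b20"\n',
    '/note="origin_locality:Wales"\n',
    '/note="TAA stop codon; locality unknown"\n',
    '/note="codons recognized: UCN"\n',
]
keep = ["accession", "haplogroup", "locality"]
remove = ["codon", "taa"]

insensitive = filter_lines(sample, keep, remove)
sensitive = filter_lines(sample, keep, remove, ignore_case=False)
print(insensitive)  # the first two lines
print(sensitive)    # only the origin_locality line
```

The case-sensitive call misses /Haplogroup because of the capital H, which is the situation the lower() change is meant to handle.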