Perl - sloppy matching of hash keys?
I'm comparing two lists of genes with the aim of finding the genes that overlap between the two lists.
At the moment, I store the gene names as hash keys for both lists (blast1, blast2), and then find the keys (genes) that exist in both hashes:
Input 1:
xloc_000157_6.21019:12.8196,_change:1.04564,_p:0.04915,_q:0.999592	99.66	gi|475392713|dbj|ab759708.1|_xenopus_laevis_phyhd_mrna_for_phytanoyl-coa_dioxygenase_like_protein,_complete_cds
xloc_000159_636.025:343.104,_change:-0.890436,_p:0.00575,_q:0.999592	99.47	gi|9909981|emb|aj278067.1|_xenopus_laevis_mrna_for_putative_xirg_protein
xloc_000561_31.1018:14.9273,_change:-1.05905,_p:0.0073,_q:0.999592	91.57	gi|165973401|ref|nm_001113689.1|_xenopus_(silurana)_tropicalis_cytokine_inducible_sh2-containing_protein_(cish),_mrna
Assign the 1st gene list:
    my $input1 = $ARGV[0];
    open my $blast1, '<', $input1 or die $!;

    my $results1 = 0;
    my (@blast1id, @blast1_info, @percent_id, @split);
    while (<$blast1>) {
        chomp;
        @split = split /\t/;           # columns: info, percent identity, gene ID
        push @blast1_info, $split[0];
        push @percent_id,  $split[1];
        push @blast1id,    $split[2];
        $results1++;
    }

    print "$results1 blast hits in '$input1'\n";

    # Map each gene ID to a list of [info, percent identity] pairs
    my %blast1;
    push @{ $blast1{ $blast1id[$_] } }, [ $blast1_info[$_], $percent_id[$_] ] for 0 .. $#blast1id;
Input 2:
xloc_000561_31.1018:14.9273,_change:-1.05905,_p:0.0073,_q:0.999592	91.57	gi|165973401|ref|nm_001113689.1|_xenopus_(silurana)_tropicalis_cytokine_inducible_sh2-containing_protein_(cish),_mrna
xloc_000679_57.3461:29.2637,_change:-0.970585,_p:0.03645,_q:0.999592	85.13	gi|51704135|gb|bc081195.1|_xenopus_laevis_hypothetical_protein_loc446937,_mrna_(cdna_clone_image:6640116),_partial_cds
xloc_000766_10.699:6.33756,_change:-0.755473,_p:0.0384,_q:0.999592	99.04	gi|195972824|ref|nm_001130940.1|_xenopus_laevis_interleukin_6_signal_transducer_(gp130,_oncostatin_m_receptor)_(il6st),_mrna
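Assign the 2nd gene list:
    my $input2 = $ARGV[1];
    open my $blast2, '<', $input2 or die $!;

    my $results2 = 0;
    # @percent_id2 is kept separate so list 1's percent identities are not mixed in
    my (@blast2id, @blast2_info, @percent_id2);
    while (<$blast2>) {
        chomp;
        @split = split /\t/;           # columns: info, percent identity, gene ID
        push @blast2_info, $split[0];
        push @percent_id2, $split[1];
        push @blast2id,    $split[2];
        $results2++;
    }

    print "$results2 blast hits in '$input2'\n";

    # Map each gene ID to a list of [info, percent identity] pairs
    my %blast2;
    push @{ $blast2{ $blast2id[$_] } }, [ $blast2_info[$_], $percent_id2[$_] ] for 0 .. $#blast2id;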
Find the keys (genes) that exist in both hashes:
    my $intersect_count = 0;
    my (@intersect, @intersecting_list);
    foreach my $key (sort keys %blast1) {
        if (exists $blast1{$key} && $blast2{$key}) {
            $intersect_count++;
            my ($hit1, $percent_id1, $hit2, $percent_id2);
            foreach my $part1 (@{ $blast1{$key} }) {
                ($hit1, $percent_id1) = @$part1;
            }
            foreach my $part2 (@{ $blast2{$key} }) {
                ($hit2, $percent_id2) = @$part2;
            }
            push @intersect, "$key\tc1:$hit1 [$percent_id1]\tc2:$hit2 [$percent_id2]\n";
            push @intersecting_list, "$key";
        }
    }
The above code finds one gene that's present in both lists:
gi|165973401|ref|nm_001113689.1|_xenopus_(silurana)_tropicalis_cytokine_inducible_sh2-containing_protein_(cish),_mrna
My question is: how can I adapt this so that genes with similar names are also included in the output? For example, I want to see:
gi|186928837|ref|nm_005982.3|_homo_sapiens_six_homeobox_1_(six1),_mrna
find a match with:
gi|154142326|ref|nm_001100275.1|_xenopus_(silurana)_tropicalis_six_homeobox_1_(six1),_mrna
Any suggestions?
There are two strategies you can use:

1. Extract the actual key you want to use, which can then be matched exactly.
Some parts of the original key may be of no use: remove them. Depending on your input, you may also want to apply Unicode normalization and case folding.
In your case, a common key for

gi|186928837|ref|nm_005982.3|_homo_sapiens_six_homeobox_1_(six1),_mrna
gi|154142326|ref|nm_001100275.1|_xenopus_(silurana)_tropicalis_six_homeobox_1_(six1),_mrna

could be something like

gi|ref|nm_00|_six_homeobox_1_(six1),_mrna

2. Do away with hashes, and calculate a similarity index between all possible pairs of records. To get an idea of such indices, you may want to look at the Levenshtein edit distance. You can then treat records whose similarity falls within certain bounds as a match. This is considerably more expensive, but may yield better results.
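A minimal sketch of the second strategy, assuming that plain Levenshtein distance over the whole ID string is a good enough similarity measure for this data (it may well not be). The helper names levenshtein and similarity and the 0.6 cut-off are illustrative assumptions, not part of the original code. A CPAN module such as Text::Levenshtein could replace the hand-written distance function.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Levenshtein edit distance, dynamic programming with two rows.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,            # deletion
                $cur[$j - 1] + 1,         # insertion
                $prev[$j - 1] + $cost,    # substitution (or match)
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Similarity index in [0, 1]: 1 means identical strings.
sub similarity {
    my ($s, $t) = @_;
    my $longest = max(length($s), length($t));
    return 1 if $longest == 0;
    return 1 - levenshtein($s, $t) / $longest;
}

# Example IDs taken from the question.
my @list1 = (
    'gi|186928837|ref|nm_005982.3|_homo_sapiens_six_homeobox_1_(six1),_mrna',
    'gi|9909981|emb|aj278067.1|_xenopus_laevis_mrna_for_putative_xirg_protein',
);
my @list2 = (
    'gi|154142326|ref|nm_001100275.1|_xenopus_(silurana)_tropicalis_six_homeobox_1_(six1),_mrna',
);

my $threshold = 0.6;    # hypothetical cut-off; tune it against real data

# Compare every record in list 1 with every record in list 2.
for my $id1 (@list1) {
    for my $id2 (@list2) {
        my $score = similarity($id1, $id2);
        printf "%.2f\t%s\t%s\n", $score, $id1, $id2 if $score >= $threshold;
    }
}
```

Every record in one list is compared with every record in the other, and each comparison costs time proportional to the product of the two string lengths. That is where the extra expense comes from.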
I do not know your problem domain, so I can't make more specific suggestions.
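With that caveat, the first strategy (extracting a comparable key) might look roughly like the sketch below. The helper name normalize_key and the hard-coded species list are assumptions made for illustration. The key it produces is one possible variant, not the exact key suggested above, and real data would probably need a more careful rule for what to strip.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a BLAST hit ID to the part describing the gene.
sub normalize_key {
    my ($id) = @_;
    my $key = lc $id;                   # case folding
    $key =~ s/^(?:[^|]*\|){4}//;        # drop "gi|<number>|<db>|<accession>|"
    # Drop a leading species name (assumed list, extend as needed).
    $key =~ s/^_(?:homo_sapiens|xenopus_laevis|xenopus_\(silurana\)_tropicalis)//;
    return $key;
}

# Example IDs taken from the question.
my @list1 = ('gi|186928837|ref|nm_005982.3|_homo_sapiens_six_homeobox_1_(six1),_mrna');
my @list2 = ('gi|154142326|ref|nm_001100275.1|_xenopus_(silurana)_tropicalis_six_homeobox_1_(six1),_mrna');

# Index list 1 by normalized key, keeping the original IDs.
my %seen1;
push @{ $seen1{ normalize_key($_) } }, $_ for @list1;

# Look up each list 2 ID by its normalized key.
for my $id (@list2) {
    my $key = normalize_key($id);
    next unless exists $seen1{$key};
    print "$key\n  list 1: $_\n  list 2: $id\n" for @{ $seen1{$key} };
}
```

Because the normalized keys are compared exactly, the hash-based intersection from the question can be reused unchanged once both hashes are keyed on normalize_key($id) instead of the raw ID.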
There are also some problems with your code for finding the hits. It looks like it should be equivalent to this:
    my $intersect_count = 0;
    my (@intersect, @intersecting_list);
    for my $key (sort keys %blast1) {
        if (exists $blast2{$key}) {
            $intersect_count++;
            # Only the last hit stored for each key is used
            my ($hit1, $percent_id1) = @{ $blast1{$key}[-1] };
            my ($hit2, $percent_id2) = @{ $blast2{$key}[-1] };
            push @intersect, "$key\tc1:$hit1 [$percent_id1]\tc2:$hit2 [$percent_id2]\n";
            push @intersecting_list, $key;
        }
    }
Differences:

- The condition

    exists $blast1{$key} && $blast2{$key}

is parsed as

    exists($blast1{$key}) && $blast2{$key}

and is silly, because we already know that $blast1{$key} exists: the key was fetched via keys %blast1.

- When looping over an array and assigning each item to a variable, the variable will retain the value of the last item.

    my $y; for my $x (@xs) { $y = $x }

is equivalent to, but less efficient than,

    my $y = $xs[-1];
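A small sketch to check the second point, using a made-up %blast1 entry with two hits for one key (the key name and values are invented for illustration):

```perl
use strict;
use warnings;

# Hypothetical data: one gene ID with two stored [info, percent identity] hits.
my %blast1 = (
    gene_x => [ [ 'hit_a', 99.66 ], [ 'hit_b', 91.57 ] ],
);

# Loop form: each iteration overwrites the previous values.
my ($hit_loop, $pct_loop);
for my $part (@{ $blast1{gene_x} }) {
    ($hit_loop, $pct_loop) = @$part;
}

# Direct form: take the last element of the list.
my ($hit_last, $pct_last) = @{ $blast1{gene_x}[-1] };

print "loop: $hit_loop [$pct_loop]\n";
print "last: $hit_last [$pct_last]\n";
```

Both forms end up with hit_b, so any earlier hits for the same key are silently dropped. If all hits matter, the output line would need to be built inside the loop instead.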