Perl - sloppy matching of hash keys?


I'm comparing two lists of genes with the aim of finding the genes that overlap between the two lists.

At the moment, I store the names of the genes as hash keys for both lists (%blast1, %blast2), and then find the keys (genes) that exist in both hashes:

Input 1:

xloc_000157_6.21019:12.8196,_change:1.04564,_p:0.04915,_q:0.999592      99.66   gi|475392713|dbj|ab759708.1|_xenopus_laevis_phyhd_mrna_for_phytanoyl-coa_dioxygenase_like_protein,_complete_cds
xloc_000159_636.025:343.104,_change:-0.890436,_p:0.00575,_q:0.999592    99.47   gi|9909981|emb|aj278067.1|_xenopus_laevis_mrna_for_putative_xirg_protein
xloc_000561_31.1018:14.9273,_change:-1.05905,_p:0.0073,_q:0.999592      91.57   gi|165973401|ref|nm_001113689.1|_xenopus_(silurana)_tropicalis_cytokine_inducible_sh2-containing_protein_(cish),_mrna

Assign the first gene list:

my $input1 = $ARGV[0];
open my $blast1, '<', $input1 or die $!;
my $results1 = 0;
my (@blast1id, @blast1_info, @percent_id, @split);
while (<$blast1>) {
    chomp;
    @split = split('\t');
    push @blast1_info, $split[0];
    push @percent_id, $split[1];
    push @blast1id, $split[2];
    $results1++;
}
print "$results1 blast hits in '$input1'\n";
# Map each hit ID (third column) to a list of [info, percent identity] pairs.
my %blast1;
push @{ $blast1{ $blast1id[$_] } }, [ $blast1_info[$_], $percent_id[$_] ] for 0 .. $#blast1id;

Input 2:

xloc_000561_31.1018:14.9273,_change:-1.05905,_p:0.0073,_q:0.999592      91.57   gi|165973401|ref|nm_001113689.1|_xenopus_(silurana)_tropicalis_cytokine_inducible_sh2-containing_protein_(cish),_mrna
xloc_000679_57.3461:29.2637,_change:-0.970585,_p:0.03645,_q:0.999592    85.13   gi|51704135|gb|bc081195.1|_xenopus_laevis_hypothetical_protein_loc446937,_mrna_(cdna_clone_image:6640116),_partial_cds
xloc_000766_10.699:6.33756,_change:-0.755473,_p:0.0384,_q:0.999592      99.04   gi|195972824|ref|nm_001130940.1|_xenopus_laevis_interleukin_6_signal_transducer_(gp130,_oncostatin_m_receptor)_(il6st),_mrna

Assign the second gene list:

my $input2 = $ARGV[1];
open my $blast2, '<', $input2 or die $!;
my $results2 = 0;
# Redeclaring @percent_id gives a fresh, empty array for the second file.
my (@blast2id, @blast2_info, @percent_id);
while (<$blast2>) {
    chomp;
    @split = split('\t');
    push @blast2_info, $split[0];
    push @percent_id, $split[1];
    push @blast2id, $split[2];
    $results2++;
}
print "$results2 blast hits in '$input2'\n";
my %blast2;
push @{ $blast2{ $blast2id[$_] } }, [ $blast2_info[$_], $percent_id[$_] ] for 0 .. $#blast2id;

Find the keys (genes) that exist in both hashes:

my $intersect_count = 0;
my (@intersect, @intersecting_list);
my ($hit1, $percent_id1, $hit2, $percent_id2);
for my $key (sort keys %blast1) {
    if (exists $blast1{$key} && $blast2{$key}) {
        $intersect_count++;
        for my $part1 (@{ $blast1{$key} }) {
            ($hit1, $percent_id1) = @$part1;
        }
        for my $part2 (@{ $blast2{$key} }) {
            ($hit2, $percent_id2) = @$part2;
        }
        push @intersect, "$key\tc1:$hit1 [$percent_id1]\tc2:$hit2 [$percent_id2]\n";
        push @intersecting_list, "$key";
    }
}

The above code finds the one gene that is present in both lists:

gi|165973401|ref|nm_001113689.1|_xenopus_(silurana)_tropicalis_cytokine_inducible_sh2-containing_protein_(cish),_mrna 

My question is: how can I adapt this so that genes with similar names are also included in the output? For example, I want to see:

gi|186928837|ref|nm_005982.3|_homo_sapiens_six_homeobox_1_(six1),_mrna 

found as a match with:

gi|154142326|ref|nm_001100275.1|_xenopus_(silurana)_tropicalis_six_homeobox_1_(six1),_mrna 

Any suggestions?

There are two strategies you can use:

  1. Extract the actual key you want to use, so that it can be matched exactly.

    Some parts of the original key may be of no use; remove them. Depending on your input, you may also want to apply Unicode normalization and perform case folding.

    In your case, the common key for

    gi|186928837|ref|nm_005982.3|_homo_sapiens_six_homeobox_1_(six1),_mrna
    gi|154142326|ref|nm_001100275.1|_xenopus_(silurana)_tropicalis_six_homeobox_1_(six1),_mrna

    could look like

    gi|ref|nm_00|_six_homeobox_1_(six1),_mrna
  2. Do away with hashes, and calculate a similarity index between all possible pairs of records. For an idea of such indices, you may want to look at the Levenshtein edit distance. You can then treat all other records within certain bounds as a match. This is considerably more expensive, because every record in one list has to be compared with every record in the other, but it may yield better results.
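As a rough illustration of the first strategy, the sketch below reduces each ID to its free-text description and strips the species name, so both example IDs collapse to the same hash key. The helper name normalize_key and the assumed ID layout (pipe-separated fields followed by genus, optional subgenus in parentheses, and species) are assumptions based only on the sample lines in the question, and the resulting key is one possible choice, not necessarily the one suggested above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a BLAST hit ID into a species-independent key.
# Assumes the layout "gi|<number>|<db>|<accession>|_<genus>_<species>_<description>"
# seen in the sample data; other layouts need a different pattern.
sub normalize_key {
    my ($id) = @_;
    my $desc = lc( ( split /\|/, $id )[-1] );    # keep only the last field, case-folded
    $desc =~ s/^_//;                             # drop the leading underscore
    # Strip genus, optional "(subgenus)", and species.
    $desc =~ s/^[a-z]+_(?:\([a-z]+\)_)?[a-z]+_//;
    return $desc;
}

my @ids = (
    'gi|186928837|ref|nm_005982.3|_homo_sapiens_six_homeobox_1_(six1),_mrna',
    'gi|154142326|ref|nm_001100275.1|_xenopus_(silurana)_tropicalis_six_homeobox_1_(six1),_mrna',
);

# Both lines print: six_homeobox_1_(six1),_mrna
print normalize_key($_), "\n" for @ids;
```

With such a helper, the hashes would be keyed on normalize_key($split[2]) instead of the raw third column, and the existing intersection loop could stay as it is.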

I do not know your problem domain, so I can't make more specific suggestions.
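For the second strategy, a minimal pure-Perl sketch of the Levenshtein edit distance follows, together with an all-pairs comparison. The sample keys and the $max_distance threshold are made up for illustration; a sensible bound depends on the real data.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Edit distance by dynamic programming, keeping only two rows of the table.
sub levenshtein {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my ($m, $n) = (scalar @s, scalar @t);
    my @prev = (0 .. $n);    # distances from the empty prefix of $s
    for my $i (1 .. $m) {
        my @curr = ($i);
        for my $j (1 .. $n) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,            # deletion
                $curr[$j - 1] + 1,        # insertion
                $prev[$j - 1] + $cost,    # substitution (or match)
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Hypothetical sample keys, already reduced to the gene description.
my @list1 = ('six_homeobox_1_(six1),_mrna');
my @list2 = (
    'six_homeobox_1_(six1),_mrna',
    'six_homeobox_2_(six2),_mrna',
    'interleukin_6_signal_transducer_(il6st),_mrna',
);
my $max_distance = 3;    # assumed threshold; tune it for real data

# Compare every record of one list with every record of the other.
my @matches;
for my $x (@list1) {
    for my $y (@list2) {
        my $d = levenshtein($x, $y);
        push @matches, [ $x, $y, $d ] if $d <= $max_distance;
    }
}
print join("\t", @$_), "\n" for @matches;
```

Note that with this threshold the six1 key also pairs with the six2 key (distance 2), which shows the trade-off: a loose bound catches more spelling variants but also pairs entries that are merely similarly named.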


There are also some problems with your code when finding the hits. It looks like it should be equivalent to this:

my $intersect_count = 0;
for my $key (sort keys %blast1) {
    if (exists $blast2{$key}) {
        $intersect_count++;
        my ($hit1, $percent_id1) = @{ $blast1{$key}[-1] };
        my ($hit2, $percent_id2) = @{ $blast2{$key}[-1] };
        push @intersect, "$key\tc1:$hit1 [$percent_id1]\tc2:$hit2 [$percent_id2]\n";
        push @intersecting_list, $key;
    }
}

Differences:

  1. exists $blast1{$key} && $blast2{$key} is parsed as exists($blast1{$key}) && $blast2{$key}, which is silly, because we already know that $blast1{$key} exists: we fetched the key via keys!
  2. When looping over an array and assigning each item to a variable, the variable will retain the value of the last item. So my $y; for my $x (@xs) { $y = $x } is equivalent to, but less efficient than, my $y = $xs[-1].
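Both points can be checked with a small stand-alone sketch. The array @xs and the hash %h are made-up sample data, not part of the original script.

```perl
use strict;
use warnings;

my @xs = (3, 1, 4);

# The loop overwrites $from_loop on every pass, so only the last item survives.
my $from_loop;
for my $x (@xs) { $from_loop = $x }
my $direct = $xs[-1];
print "loop: $from_loop, direct: $direct\n";    # both are 4

# exists tests for the key itself; a plain lookup tests whether the value is true.
my %h = (present => 0);
my $key_exists = exists $h{present} ? "yes" : "no";
my $value_true = $h{present}        ? "yes" : "no";
print "exists: $key_exists, true: $value_true\n";    # exists: yes, true: no
```

In the original script the hash values are array references, which are always true, so the plain lookup happened to behave like exists; testing with exists states the intent directly.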
