perl - sloppy matching of hash keys?

- February 15, 2015

I'm comparing two lists of genes with the aim of finding the genes that overlap between the two lists.

At the moment, I store the gene names as hash keys for both lists (blast1, blast2), and then find the keys (genes) that exist in both hashes:

input 1:

    xloc_000157_6.21019:12.8196,_change:1.04564,_p:0.04915,_q:0.999592      99.66   gi|475392713|dbj|ab759708.1|_xenopus_laevis_phyhd_mrna_for_phytanoyl-coa_dioxygenase_like_protein,_complete_cds
    xloc_000159_636.025:343.104,_change:-0.890436,_p:0.00575,_q:0.999592    99.47   gi|9909981|emb|aj278067.1|_xenopus_laevis_mrna_for_putative_xirg_protein
    xloc_000561_31.1018:14.9273,_change:-1.05905,_p:0.0073,_q:0.999592      91.57   gi|165973401|ref|nm_001113689.1|_xenopus_(silurana)_tropicalis_cytokine_inducible_sh2-containing_protein_(cish),_mrna

assign 1st gene list...

    my $input1 = $ARGV[0];
    open my $blast1, '<', $input1 or die $!;

    my $results1 = 0;
    my (@blast1id, @blast1_info, @percent_id, @split);
    while (<$blast1>) {
        chomp;
        @split = split('\t');            # columns: query info, percent identity, subject ID
        push @blast1_info, $split[0];
        push @percent_id, $split[1];
        push @blast1id, $split[2];
        $results1++;
    }
    print "$results1 blast hits in '$input1'\n";

    # Group the hits by subject ID: each key holds a list of [info, percent identity] pairs
    push @{ $blast1{$blast1id[$_]} }, [ $blast1_info[$_], $percent_id[$_] ] for 0 .. $#blast1id;

input 2:

    xloc_000561_31.1018:14.9273,_change:-1.05905,_p:0.0073,_q:0.999592      91.57   gi|165973401|ref|nm_001113689.1|_xenopus_(silurana)_tropicalis_cytokine_inducible_sh2-containing_protein_(cish),_mrna
    xloc_000679_57.3461:29.2637,_change:-0.970585,_p:0.03645,_q:0.999592    85.13   gi|51704135|gb|bc081195.1|_xenopus_laevis_hypothetical_protein_loc446937,_mrna_(cdna_clone_image:6640116),_partial_cds
    xloc_000766_10.699:6.33756,_change:-0.755473,_p:0.0384,_q:0.999592      99.04   gi|195972824|ref|nm_001130940.1|_xenopus_laevis_interleukin_6_signal_transducer_(gp130,_oncostatin_m_receptor)_(il6st),_mrna

assign 2nd gene list

    my $input2 = $ARGV[1];
    open my $blast2, '<', $input2 or die $!;

    my $results2 = 0;
    # Fresh arrays for the second list; @split is reused from the first block
    my (@blast2id, @blast2_info, @percent_id);
    while (<$blast2>) {
        chomp;
        @split = split('\t');
        push @blast2_info, $split[0];
        push @percent_id, $split[1];
        push @blast2id, $split[2];
        $results2++;
    }
    print "$results2 blast hits in '$input2'\n";

    push @{ $blast2{$blast2id[$_]} }, [ $blast2_info[$_], $percent_id[$_] ] for 0 .. $#blast2id;

Find the keys (genes) that exist in both hashes:

    my $intersect_count = 0;
    for my $key (sort keys %blast1) {
        if (exists $blast1{$key} && $blast2{$key}) {
            $intersect_count++;
            for my $part1 (@{ $blast1{$key} }) {
                ($hit1, $percent_id1) = @$part1;
            }
            for my $part2 (@{ $blast2{$key} }) {
                ($hit2, $percent_id2) = @$part2;
            }
            push @intersect, "$key\tc1:$hit1 [$percent_id1]\tc2:$hit2 [$percent_id2]\n";
            push @intersecting_list, "$key";
        }
    }

The above code finds the one gene that is present in both lists:

gi|165973401|ref|nm_001113689.1|_xenopus_(silurana)_tropicalis_cytokine_inducible_sh2-containing_protein_(cish),_mrna

My question is: how can I adapt this so that genes with similar names are also included in the output? For example, I want to see:

gi|186928837|ref|nm_005982.3|_homo_sapiens_six_homeobox_1_(six1),_mrna

find a match with:

gi|154142326|ref|nm_001100275.1|_xenopus_(silurana)_tropicalis_six_homeobox_1_(six1),_mrna

Any suggestions?

There are two strategies you can use.

The first is to extract the actual key you want to use, which is then matched exactly.

Some parts of the original key may be of no use, so remove them. Depending on your input, you may also want to apply Unicode normalization and perform case folding.

In your case, a common key for
```
gi|186928837|ref|nm_005982.3|_homo_sapiens_six_homeobox_1_(six1),_mrna
gi|154142326|ref|nm_001100275.1|_xenopus_(silurana)_tropicalis_six_homeobox_1_(six1),_mrna
```
could look like
```
gi|ref|nm_00|_six_homeobox_1_(six1),_mrna 
```
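As a rough illustration of the first strategy, here is a minimal sketch of one possible normalization. It goes a step further than the key shown above and drops the ID fields entirely, keeping only the description with the species name removed. The `normalize_key` function and the `@species` list are hypothetical names introduced for this example, and the list would need to be extended for real data.

```perl
use strict;
use warnings;

# Hypothetical list of species prefixes to strip; extend it for your data.
my @species = (
    'homo_sapiens',
    'xenopus_(silurana)_tropicalis',
    'xenopus_laevis',
);

# Reduce a BLAST subject ID to a key that ignores accession numbers and species.
sub normalize_key {
    my ($id) = @_;
    my $desc = lc $id;                   # simple case folding
    $desc =~ s/^.*\|//;                  # drop everything up to the last '|'
    $desc =~ s/^_+//;                    # drop leading underscores
    for my $sp (@species) {
        last if $desc =~ s/^\Q$sp\E_//;  # strip the first matching species prefix
    }
    return $desc;
}

my $human   = 'gi|186928837|ref|nm_005982.3|_homo_sapiens_six_homeobox_1_(six1),_mrna';
my $xenopus = 'gi|154142326|ref|nm_001100275.1|_xenopus_(silurana)_tropicalis_six_homeobox_1_(six1),_mrna';

print normalize_key($human), "\n";    # six_homeobox_1_(six1),_mrna
print normalize_key($xenopus), "\n";  # six_homeobox_1_(six1),_mrna
```

Using `normalize_key($split[2])` as the hash key instead of the raw ID would let the existing intersection loop work unchanged, at the cost of merging records that differ only in accession or species.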
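The second is to do away with the hashes, and calculate a similarity index between all possible pairs of records. To get an idea of such indices, you may want to look at the Levenshtein edit distance. You can then treat all other records within certain bounds as a match. This is considerably more expensive, but may yield better results.

I do not know your problem domain, so I can't make more specific suggestions.

There are also some problems with your code when finding the hits. It looks like it should be equivalent to this: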
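To make the edit-distance idea concrete, here is a minimal pure-Perl sketch of the Levenshtein distance using the usual dynamic-programming table, reduced to two rows. The `levenshtein` and `is_similar` functions and the ratio threshold are assumptions made for this example, not something from the question; CPAN modules offer faster implementations.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance: minimum number of single-character insertions,
# deletions and substitutions needed to turn $s into $t.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);         # distances from the empty prefix of $s
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,           # deletion
                $cur[$j - 1] + 1,        # insertion
                $prev[$j - 1] + $cost,   # substitution (or match)
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical rule: two names match if the distance is at most
# $max_ratio times the length of the longer name.
sub is_similar {
    my ($s, $t, $max_ratio) = @_;
    my $longer = length($s) > length($t) ? length($s) : length($t);
    return 1 if $longer == 0;
    return levenshtein($s, $t) / $longer <= $max_ratio;
}

print levenshtein('kitten', 'sitting'), "\n";   # prints 3
```

Comparing every record in one list against every record in the other costs one distance computation per pair, and each computation is proportional to the product of the two string lengths, which is where the extra expense comes from.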

    my $intersect_count = 0;
    for my $key (sort keys %blast1) {
        if (exists $blast2{$key}) {
            $intersect_count++;
            # Only the last hit stored under each key is reported
            my ($hit1, $percent_id1) = @{ $blast1{$key}[-1] };
            my ($hit2, $percent_id2) = @{ $blast2{$key}[-1] };
            push @intersect, "$key\tc1:$hit1 [$percent_id1]\tc2:$hit2 [$percent_id2]\n";
            push @intersecting_list, $key;
        }
    }

Differences:

exists $blast1{$key} && $blast2{$key} is parsed as exists($blast1{$key}) && $blast2{$key}, which is silly, because we know that $blast1{$key} exists: we fetched it via keys!
When looping over an array and assigning each item to a variable, the variable will retain the value of the last item. my $y; for my $x (@xs) { $y = $x } is equivalent to, but less efficient than, my $y = $xs[-1].
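Both points can be checked with a small standalone script. The hashes and values below are made up for illustration. In the question's data every stored value is an array reference, which is always true, so the two tests happen to agree there; the difference only shows when a key exists with a false value.

```perl
use strict;
use warnings;

my %blast1 = (gene_a => 1, gene_b => 1);
my %blast2 = (gene_a => 0);    # key exists, but its value is false

# Parsed as exists($blast1{gene_a}) && $blast2{gene_a}:
# the second operand tests the value, not the existence of the key.
my $by_value  = (exists $blast1{gene_a} && $blast2{gene_a}) ? 1 : 0;
my $by_exists = (exists $blast2{gene_a}) ? 1 : 0;
print "by_value=$by_value by_exists=$by_exists\n";   # by_value=0 by_exists=1

# Overwriting a variable on every iteration leaves the last element in it.
my @xs = (10, 20, 30);
my $y;
for my $x (@xs) { $y = $x }
print "loop=$y direct=$xs[-1]\n";                    # loop=30 direct=30
```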
