Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capture uploaded allele correctly for VCF input #1744

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
6 changes: 3 additions & 3 deletions modules/Bio/EnsEMBL/VEP/InputBuffer.pm
Original file line number Diff line number Diff line change
Expand Up @@ -257,8 +257,7 @@ sub next {
die($error_msg);
}

$self->split_variants() if $self->{minimal};

$self->split_variants() if $self->{minimal} ;
return $buffer;
}

Expand Down Expand Up @@ -664,7 +663,8 @@ sub split_variants {
$first->{original_allele_string} = $original_vf->{allele_string};
$first->{original_start} = $original_vf->{start};
$first->{original_end} = $original_vf->{end};
$first->{minimised} = 1
$first->{minimised} = 1;
$first->{first} = 1;
}

push @tmp, $new_vf;
Expand Down
6 changes: 4 additions & 2 deletions modules/Bio/EnsEMBL/VEP/OutputFactory.pm
Original file line number Diff line number Diff line change
Expand Up @@ -1296,7 +1296,8 @@ sub VariationFeatureOverlapAllele_to_output_hash {

# reference allele
$hash->{REF_ALLELE} = $vf->ref_allele_string if $self->{show_ref_allele};


# Capture uploaded allele
$hash->{UPLOADED_ALLELE} = ($vf->{original_allele_string} || $vf->{allele_string} || $vf->{class_SO_term} || "" ) if $self->param('uploaded_allele');

# picked?
Expand Down Expand Up @@ -2193,7 +2194,8 @@ sub rejoin_variants_in_InputBuffer {
foreach my $vf(@{$buffer->buffer}) {

# reset original one
if(defined($vf->{original_allele_string})) {
# check $vf->{first} to get first $vf in split_variants
if(defined($vf->{first})) {
nakib103 marked this conversation as resolved.
Show resolved Hide resolved

# do consequence stuff
$self->get_all_output_hashes_by_VariationFeature($vf);
Expand Down
33 changes: 26 additions & 7 deletions modules/Bio/EnsEMBL/VEP/Parser.pm
Original file line number Diff line number Diff line change
Expand Up @@ -851,13 +851,24 @@ sub post_process_vfs {
$vf->seq_region_start($vf->{start});
$vf->seq_region_end($vf->{end});

# Checks if the allele string is insertion or/and deletion
# Checks if the allele string is non-minimised insertion or/and deletion
my $is_sv = ref($vf) eq 'Bio::EnsEMBL::Variation::StructuralVariationFeature';
if(!$is_sv && defined($vf->{allele_string}) && $vf->{allele_string} =~ /\//){
my $is_indel = 0;
my ($ref_allele_string,$alt_allele_string) = split(/\//, $vf->{allele_string});
$is_indel = 1 unless length($ref_allele_string) == length($alt_allele_string) or $vf->{allele_string} =~ /-/;
$vf = ${$self->minimise_alleles([$vf])}[0] if $is_indel;
my $is_non_minimised_indel = 0;
# For VCF input, the allele_string is trimmed to remove anchor base, so we capture original allele

my $original_allele_string = ($vf->{nontrimmed_allele_string} || $vf->{allele_string});
my @alleles = split(/\//, $original_allele_string);
my $ref_allele_string = shift @alleles;
my $alt_allele_count;

foreach my $alt(@alleles) {
if (length($ref_allele_string) != length($alt) or $original_allele_string =~ /^-/){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@likhitha-surapaneni , it seems we are checking only if ref allele is - with $original_allele_string =~ /^-/ unlike for any allele like before. Was it intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$original_allele_string contains both ref_allele and alt_alleles in the same string separated by "/"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, but with ^ you are only checking - allele for ref and missing deletions?

$is_non_minimised_indel = 1;
last;
}
}
$vf = ${$self->minimise_alleles([$vf])}[0] if $is_non_minimised_indel;
}
}
return $vfs;
Expand Down Expand Up @@ -950,7 +961,15 @@ sub minimise_alleles {

nakib103 marked this conversation as resolved.
Show resolved Hide resolved
# skip VFs with more than one alt
# they get taken care of later by split_variants/rejoin_variants
if(!$vf->{allele_string} || $vf->{allele_string} =~ /.+\/.+\/.+/ || $vf->{allele_string} !~ /.+\/.+/) {
if(!$vf->{allele_string} || $vf->{allele_string} !~ /.+\/.+/) {
push @return, $vf;
}

elsif($vf->{allele_string} =~ /.+\/.+\/.+/)
{
# Updating a flag to minimise multi-allelic variants in split_variants/rejoin_variants
$vf->{minimised} = 1;
nakib103 marked this conversation as resolved.
Show resolved Hide resolved
Comment on lines +970 to +971
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-allelic is not getting minimised for default format. For example - 1 961320 961324 GCAGG/GCA/GCAG +

But in the output still getting MINIMISED=1, (without the PR they are also not minimised but there is no MINIMISED=1).

$vf->{original_allele_string} = $vf->{nontrimmed_allele_string} || $vf->{allele_string};
push @return, $vf;
}

Expand All @@ -976,7 +995,7 @@ sub minimise_alleles {
$new_vf->{end} = $end;
$new_vf->{seq_region_start} = $start;
$new_vf->{seq_region_end} = $end;
$new_vf->{original_allele_string} = $vf->{allele_string};
$new_vf->{original_allele_string} = $vf->{nontrimmed_allele_string} || $vf->{allele_string};
$new_vf->{original_start} = $vf->{start};
$new_vf->{original_end} = $vf->{end};
$new_vf->{minimised} = 1;
Expand Down
1 change: 1 addition & 0 deletions modules/Bio/EnsEMBL/VEP/Parser/VCF.pm
Original file line number Diff line number Diff line change
Expand Up @@ -340,6 +340,7 @@ sub create_VariationFeatures {
start => $start,
end => $end,
allele_string => $non_variant ? $ref : $ref.'/'.join('/', @$alts),
nontrimmed_allele_string => $non_variant ? $original_alleles[0] : join("/",@original_alleles),
strand => 1,
map_weight => 1,
adaptor => $self->get_adaptor('variation', 'VariationFeature'),
Expand Down
10 changes: 10 additions & 0 deletions t/InputBuffer.t
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,7 @@ is_deeply($vfs->[0], bless( {
'variation_name' => 'rs142513484',
'map_weight' => 1,
'allele_string' => 'C/T',
'nontrimmed_allele_string' => 'C/T',
'end' => 25585733,
'start' => 25585733,
'seq_region_end' => 25585733,
Expand All @@ -90,6 +91,7 @@ is_deeply($vfs->[0], bless( {
'variation_name' => 'rs148490508',
'map_weight' => 1,
'allele_string' => 'A/G',
'nontrimmed_allele_string' => 'A/G',
'end' => 25592911,
'start' => 25592911,
'seq_region_end' => 25592911,
Expand Down Expand Up @@ -418,7 +420,9 @@ is_deeply($ib->buffer, [
'chr' => '1',
'minimised' => 1,
'original_allele_string' => 'CAGAAGAAAG/TAGAAGAAAG/C',
'nontrimmed_allele_string' => 'CAGAAGAAAG/TAGAAGAAAG/C',
'original_end' => 10,
'first' => 1,
'end' => 1,
'original_start' => 1,
'strand' => 1,
Expand All @@ -439,7 +443,10 @@ is_deeply($ib->buffer, [
'variation_name' => '.',
'alt_allele' => '-',
'map_weight' => 1,
'minimised' => 1,
'allele_string' => 'AGAAGAAAG/-',
'original_allele_string' => 'CAGAAGAAAG/TAGAAGAAAG/C',
'nontrimmed_allele_string' => 'CAGAAGAAAG/TAGAAGAAAG/C',
'start' => 2,
'seq_region_start' => 2,
'seq_region_end' => 10,
Expand Down Expand Up @@ -470,7 +477,10 @@ is_deeply(
'strand' => 1,
'variation_name' => '.',
'map_weight' => 1,
'minimised' => 1,
'allele_string' => 'CAG/TAG/T',
'original_allele_string' => 'CAG/TAG/T',
'nontrimmed_allele_string' => 'CAG/TAG/T',
'end' => 3,
'start' => 1,
'seq_region_start' => 1,
Expand Down
4 changes: 3 additions & 1 deletion t/OutputFactory.t
Original file line number Diff line number Diff line change
Expand Up @@ -1678,7 +1678,9 @@ $ib = get_annotated_buffer({

is(scalar @{$ib->buffer}, 2, 'minimal - expanded count');
is($ib->buffer->[0]->allele_string, 'C/T', 'minimal - expanded first allele string');

# print("Before rejoin\n");
# use Data::Dumper;
# print(Dumper($ib->buffer));
Comment on lines +1681 to +1683
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we remove the comments?

$of->rejoin_variants_in_InputBuffer($ib);

is(scalar @{$ib->buffer}, 1, 'minimal - rejoined count');
Expand Down
2 changes: 1 addition & 1 deletion t/OutputFactory_JSON.t
Original file line number Diff line number Diff line change
Expand Up @@ -532,7 +532,7 @@ SKIP: {
'cdna_start' => 2347,
'transcript_id' => 'NM_000484.3',
'gene_id' => '351',
'uploaded_allele' => '-/A',
'uploaded_allele' => 'G/GA',
'cds_start' => 2147,
'protein_start' => 716,
'refseq_match' => [
Expand Down
4 changes: 4 additions & 0 deletions t/Parser_Region.t
Original file line number Diff line number Diff line change
Expand Up @@ -127,9 +127,13 @@ is_deeply($vf, bless( {
'strand' => '1',
'variation_name' => '21:25587759-25587758:1/A',
'map_weight' => 1,
'minimised' => 1,
'allele_string' => '-/A',
'original_allele_string' => '-/A',
'end' => '25587758',
'start' => '25587759',
'original_start' => '25587759',
'original_end' => '25587758',
'seq_region_end' => '25587758',
'seq_region_start' => '25587759',
'_line' => ['21:25587759-25587758:1/A']
Expand Down
Loading