MSQT User Manual

MSQT is a "Multiple SNP Query Tool".

This query tool is a front-end to our PostgreSQL SNP databases. The user can query these databases directly using "MSQT/E". For the main tasks, however, we have prepared scripts, each accessible through a specialized webpage. Since different studies used different subsets of ecotypes and different genomic loci, meaningful comparisons can only be made within one dataset. Hence, the user has to choose which dataset (database schema) to use.
Please note that there are two different modes of action, depending on the type of data. For some organisms a published genome sequence exists. This sequence serves as a reference organized in linkage groups, and positions are absolute (e.g. dataset "nordborg_96" from Arabidopsis thaliana). Here SNPs are queried by "chromosome" and a "position range". Without a reference sequence, the data basically consist of a collection of fragments sequenced from different ecotypes/strains. Here SNPs are queried by "FragID" and positions are relative (e.g. arabis_alpina). "Positions Name" has the form "FragID-position".

MSQT/SBE ("SNPs Between Ecotypes")

Input

Here it is mandatory that the user specifies two groups of ecotypes in two select boxes, "1st ecotype set" and "2nd ecotype set", and either the "chromosome" and a position range or "FragID" and "Position Name". A mouse-click on "compute SNPs" starts the program. Optionally, the user can specify a third ecotype group by selecting ecotypes in "3rd ecotype set". These ecotypes have no influence on the selection of SNPs, but sequence changes in any ecotype of the third group will later also be shown in the ADF output.
Example: You want SNPs between Col-0 and Ler-1. For them you want to develop assays which you later also want to use for genotyping of Ws-2, but you do not care whether Ws-2 has the Col-0 or Ler-1 allele; so you select Col-0 in group 1, Ler-1 in group 2 (or vice versa) and select Ws-2 in group 3.

Selecting ecotypes

Each group can contain one or several (up to all) ecotypes. Ecotypes are selected by mouse-click; several ecotypes are chosen by holding down the shift or ctrl/apple keys. The program will only return meaningful results if the chosen ecotype groups are mutually exclusive.

Choosing parameters

Note that - for data where a published genome sequence exists - any computation runs on ONE chromosome at a time. To query the entire genome, the user will have to compute the SNPs for each chromosome (1-5) separately. In addition, the user can restrict the program to a chromosomal region by setting "Start pos" and "End pos". This will reduce computation time and will make the output more concise. Setting "Start pos" and "End pos" to 0 (default) will query the entire chromosome.

Meaningful "Start pos" and "End pos" for Arabidopsis thaliana (e.g. "nordborg_96"):

Chromosome 1: 0..30432563
Chromosome 2: 0..19705359
Chromosome 3: 0..23470805
Chromosome 4: 0..18585042
Chromosome 5: 0..26992728
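
The rules above (one chromosome per run, 0/0 meaning "entire chromosome") can be sketched in a few lines of Python. This is only an illustration and not part of MSQT: the names CHROMOSOME_END, resolve_range and whole_genome_ranges are hypothetical, and the treatment of a range where only one of the two values is 0 is an assumption.

```python
# Chromosome end positions as listed above for Arabidopsis thaliana ("nordborg_96").
CHROMOSOME_END = {1: 30432563, 2: 19705359, 3: 23470805, 4: 18585042, 5: 26992728}


def resolve_range(chromosome, start_pos=0, end_pos=0):
    """Return the (start, end) range a query would cover.

    Hypothetical helper: start_pos == end_pos == 0 is read as
    "entire chromosome", mirroring the default described above.
    Any other combination must be an ordered range inside the chromosome
    (an assumption; the web form may be more lenient).
    """
    if chromosome not in CHROMOSOME_END:
        raise ValueError(f"unknown chromosome: {chromosome}")
    last = CHROMOSOME_END[chromosome]
    if start_pos == 0 and end_pos == 0:
        return (0, last)
    if not 0 <= start_pos <= end_pos <= last:
        raise ValueError(f"range {start_pos}..{end_pos} is outside 0..{last}")
    return (start_pos, end_pos)


def whole_genome_ranges():
    """One (chromosome, start, end) tuple per chromosome, since each run covers ONE chromosome."""
    return [(chrom, *resolve_range(chrom)) for chrom in sorted(CHROMOSOME_END)]
```

Calling whole_genome_ranges() yields the five separate runs needed to cover the entire genome.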

Output

Besides returning the user's input-parameters on which the results are based, the program will return all DISTINCT changes between the two ecotype groups. "Distinct changes" are SNPs and indels that can be used to distinguish between the groups. That means: ALL of group one have allele1 and ALL of group two have allele2.
The output has the form of a table with several columns:

Chr Type Pos Allele1 Allele2 FragID

  • Chr - the chromosome
  • Type - a type of change; can be a "SNP" or an "indel"
  • Pos - the position of the change. This is a calculated position: FragID plus the offset within that alignment, where the first alignment column has offset 0 (see the example below).
  • Allele1 - allele 1st ecotype set
  • Allele2 - allele 2nd ecotype set
  • FragID - the name of the alignment file, which is a position. It is the absolute position of the first nucleotide of the alignment in the Arabidopsis genome (TIGR version5).

  • An allele at a SNP position can be any of the IUPAC bases A, C, G, T, R, Y, K, M, S, W, B, D, H, V, N and "-". However, the decision whether alleles are distinct is based only on A, C, G, T and "-". At the position of a distinct SNP, all other alleles found (R, Y, K, M, S, W, B, D, H, V, N) are returned as well. (The concept is "in dubio pro reo": the most likely reason for such an allele is that one of the ecotypes is heterozygous at that position or that the sequencing failed. It is up to the user whether to use such a SNP or not.)

    Example

    The alignment (FragID=12345) looks like this:

    >Van-0
    ATTTGGTGATRATGATTTCGCTCGCTAGCTGATG
    >Bur-0
    ATTTGGTGATGATGATTTCACTCGCTAGCTGATG
    >Cvi-0
    ATTTGGTGATAATGATTTCGCTCGCTTGCTG--G
    >Bay-0
    NNNNNNNNNNNNNNNNNTCACTCGCTTGCTG--G

    A query of Van-0 AND Bur-0 against Cvi-0 AND Bay-0 will produce this output:

    Chr  Type   Pos    Allele1  Allele2  FragID
    1    SNP    12355  G,R      A,N      12345
    1    SNP    12371  A        T        12345
    1    indel  12376  A T      - -      12345

    A query of Van-0 AND Cvi-0 against Bur-0 AND Bay-0 will produce this output:

    Chr  Type   Pos    Allele1  Allele2  FragID
    1    SNP    12355  A,R      G,N      12345
    1    SNP    12364  G        A        12345
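
The rule for "distinct changes" can be illustrated with a minimal Python sketch applied to the example alignment. It is not the MSQT implementation: the function name distinct_changes, the sorted order of the returned alleles and the 0-based offset are assumptions made for illustration. Unlike the table above, the sketch reports every gap column as its own "indel" row instead of merging the two-base indel into one row.

```python
UNAMBIGUOUS = set("ACGT-")  # only these decide whether two groups are distinct

EXAMPLE = {
    "Van-0": "ATTTGGTGATRATGATTTCGCTCGCTAGCTGATG",
    "Bur-0": "ATTTGGTGATGATGATTTCACTCGCTAGCTGATG",
    "Cvi-0": "ATTTGGTGATAATGATTTCGCTCGCTTGCTG--G",
    "Bay-0": "N" * 17 + "TCACTCGCTTGCTG--G",
}


def distinct_changes(alignment, group1, group2, frag_id, chrom=1):
    """Return rows (chr, type, pos, allele1, allele2, frag_id) for one alignment."""
    length = len(alignment[group1[0]])
    rows = []
    for offset in range(length):
        seen1 = {alignment[name][offset] for name in group1}
        seen2 = {alignment[name][offset] for name in group2}
        clear1 = seen1 & UNAMBIGUOUS
        clear2 = seen2 & UNAMBIGUOUS
        # Distinct: ALL of group one share one unambiguous allele,
        # ALL of group two share another one.
        if len(clear1) != 1 or len(clear2) != 1 or clear1 == clear2:
            continue
        kind = "indel" if "-" in (clear1 | clear2) else "SNP"
        # Ambiguous alleles (R, N, ...) found at a distinct position are returned too.
        rows.append((chrom, kind, frag_id + offset,
                     ",".join(sorted(seen1)), ",".join(sorted(seen2)), frag_id))
    return rows


for row in distinct_changes(EXAMPLE, ["Van-0", "Bur-0"], ["Cvi-0", "Bay-0"], 12345):
    print(*row)
```

Swapping the groups to Van-0 and Cvi-0 against Bur-0 and Bay-0 reproduces the two SNP rows of the second query.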

    Additional Features

  • View alignment: A mouse-click on "alignment" will open a pop-up window and show the original alignment from which the data had been parsed. For this feature we use "alivie", developed by Joffrey Fitz. Remark: Since our SNPs have been parsed out of alignments the results can only be as good as the original alignment. To make this transparent, the user is invited to take a look at the alignments him/herself.
  • Assay development format: A mouse-click on "assay_development_format" will send a request to our database via the MSQT/ADF program which will return the corresponding ADF output: a line of sequence for this particular SNP (details see below). Note that this is only valid for the ecotypes previously queried. We recommend extracting the lines of interest by copy and paste into a text file.

    MSQT/ADF ("Assay Development Formatter")

    The idea of this program is to assist the user in developing the SNP detection assays. A SNP detection assay usually involves the placing of primers in conserved regions either for PCR amplification or actual SNP detection. The output is supposed to assist the user in choosing a specific distinct SNP and to indicate where to place the primers by having all other changes between ecotypes in group 1, 2 and 3 in this fragment also annotated.

    Input

    On the MSQT/ADF page the user is required to enter two ecotype sets (a third set is optional), a chromosome, and a SNP position. Alternatively (and this is probably the most frequent use), you can follow the link from the output of the SBE page; the program will then automatically run ADF with the ecotypes used in SBE and the corresponding SNP.

    Output

    The output of this program is a one-line representation of the conserved bases as well as the differences between the ecotype groups the user has chosen. It resembles a consensus between the ecotypes. This line of sequence is supposed to enable researchers to place primers and probes for SNP detection assays. It will look different (with respect to where the changes are) depending on the ecotype set(s) chosen. In each line ONE distinct SNP is annotated in square brackets, [allele1/allele2], where allele1 is the allele of the first and allele2 the allele of the second ecotype group. All other changes between the groups are annotated in curly brackets and separated by commas. Changes in ecotypes selected in group 3 can of course only be annotated in curly brackets. One special case is a SINGLE letter in curly brackets: this indicates that both groups have the same base at this position (the base shown), but that it differs from the reference sequence ('target').
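
To make the bracket notation concrete, here is a small Python sketch that builds such a line from a made-up alignment. It is a sketch under stated assumptions, not the ADF program itself: the function names observed and adf_line, the ecotype names A1/B1/C1, the sequences, the sorted order inside curly brackets and the decision to ignore "N" are all hypothetical.

```python
def observed(alignment, ecotypes, offset):
    """Bases seen at one column; "N" (no data) is ignored, which is an assumption."""
    return {alignment[name][offset] for name in ecotypes} - {"N"}


def adf_line(alignment, target, group1, group2, group3, snp_offset):
    """Build an ADF-like line for one fragment (illustrative only)."""
    everyone = list(group1) + list(group2) + list(group3)
    pieces = []
    for offset, ref_base in enumerate(target):
        if offset == snp_offset:
            # The ONE distinct SNP: order matters, [allele1/allele2].
            allele1 = ",".join(sorted(observed(alignment, group1, offset)))
            allele2 = ",".join(sorted(observed(alignment, group2, offset)))
            pieces.append(f"[{allele1}/{allele2}]")
            continue
        seen = observed(alignment, everyone, offset)
        if not seen or seen == {ref_base}:
            pieces.append(ref_base)  # conserved base (or no data at all)
        else:
            # Several letters: a change between or within groups.
            # A single letter: all ecotypes agree but differ from the target.
            pieces.append("{" + ",".join(sorted(seen)) + "}")
    return "".join(pieces)


# Hypothetical data: group 1 = A1, group 2 = B1, group 3 = C1.
TARGET = "ACGTACGTA"
ALIGNMENT = {"A1": "ACGTACCTA", "B1": "ATGTGCCTA", "C1": "ACGTACCTT"}
print(adf_line(ALIGNMENT, TARGET, ["A1"], ["B1"], ["C1"], snp_offset=4))
```

In this toy case the line reads A{C,T}GT[A/G]C{C}T{A,T}: the distinct SNP is [A/G], {C} marks a base shared by all ecotypes but different from the target, and {A,T} stems from the group 3 ecotype only.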

    MSQT/E ("Expert" mode)

    This is the expert mode; here the user is invited to query our database with his/her own SQL statements (SELECT). As of today we are not restricting your selects, so be prepared that your browser may time-out or even crash if your query takes too long or produces too lengthy outputs.

    Database Structure

    Each dataset is organized in tables (ecotypes, snps, target_sequences, etc.) which are wrapped in a schema in a PostgreSQL database.
    For Magnus Nordborg's dataset the name of the schema is "nordborg_96", and therefore the tables are accessible as:

  • nordborg_96.ecotypes
  • nordborg_96.snps
  • nordborg_96.target_sequences
  • nordborg_96.sniped

    nordborg_96.snps

    This table contains one entry for each ecotype at every position where there are 2 or more alleles other than "N". Each entry has chromosome, position, ecotype, replacement, frag_id, allele_freq.

    Datatypes of the ".snps"-table:

    chromosome   int4
    position     int4
    ecotype      varchar(8)
    replacement  char(1)
    frag_id      int4
    allele_freq  float4

    msqt=> SELECT * FROM nordborg_96.snps WHERE chromosome=1 ORDER BY position LIMIT 10;
     chromosome | position | ecotype | replacement | frag_id | allele_freq
    ------------+----------+---------+-------------+---------+-------------
              1 |    29291 | target  | G           |   29215 |         0.6
              1 |    29291 | Ag-0    | G           |   29215 |         0.6
              1 |    29291 | An-1    | N           |   29215 |        0.23
              1 |    29291 | Bay-0   | G           |   29215 |         0.6
              1 |    29291 | Bil-5   | N           |   29215 |        0.23
              1 |    29291 | Bil-7   | A           |   29215 |        0.18
              1 |    29291 | Bor-1   | N           |   29215 |        0.23
              1 |    29291 | Bor-4   | N           |   29215 |        0.23
              1 |    29291 | Br-0    | G           |   29215 |         0.6
              1 |    29291 | Bur-0   | A           |   29215 |        0.18
    (10 rows)

    nordborg_96.target_sequences

    As said, the SNPs are parsed from multiple alignments of several ecotypes, and the table "target_sequences" has one entry for each position from all alignments, showing the target base (ref_base). This is the base of the reference genotype plus indels. That means: in the case of the Nordborg dataset, the published Arabidopsis (Col-0) sequence served as "target". However, sequence that is not present in Col-0 but present in other ecotypes (== deletions in the reference) of course needs to be included and gets a position assigned. At those positions "target" gets a dash ("-") introduced. One consequence is that a position assignment is sometimes not exactly the same as in the published Col-0 sequence (in particular fragments, downstream of a deletion in the reference). So in case you ever want to look up a SNP in the published genome sequence, you are advised to use neighboring sequence rather than a position as a search term.

    Datatypes of the ".target_sequences"-table:

    chromosome  int4
    position    int4
    ref_base    char(1)
    frag_id     int4

    msqt=> SELECT * FROM nordborg_96.target_sequences WHERE chromosome=4 ORDER BY position LIMIT 10;
     chromosome | position | ref_base | frag_id
    ------------+----------+----------+---------
              4 |    48286 | A        |   48286
              4 |    48287 | A        |   48286
              4 |    48288 | T        |   48286
              4 |    48289 | T        |   48286
              4 |    48290 | C        |   48286
              4 |    48291 | C        |   48286
              4 |    48292 | A        |   48286
              4 |    48293 | G        |   48286
              4 |    48294 | A        |   48286
              4 |    48295 | G        |   48286
    (10 rows)
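
The position shift caused by dashes in "target" can be sketched as follows. This is an illustration only and assumes that a position equals frag_id plus the offset within the fragment, and that every dash in the target row of that fragment shifts all downstream positions by one. The function name published_position and the example fragment are hypothetical; searching with neighboring sequence remains the safer way to locate a SNP in the published genome.

```python
def published_position(frag_id, target_row, position):
    """Estimate the coordinate in the published sequence for an MSQT position.

    target_row is the "target" line of one fragment, including dashes.
    Returns None if the position itself is a dash in the target, because
    that base does not exist in the published sequence.
    """
    offset = position - frag_id
    if not 0 <= offset < len(target_row):
        raise ValueError("position lies outside this fragment")
    if target_row[offset] == "-":
        return None
    # Each dash upstream of the position shifted it by one base.
    gaps_upstream = target_row[:offset].count("-")
    return position - gaps_upstream


# Hypothetical fragment starting at 1000 with a two-base deletion in the target.
print(published_position(1000, "ACG--TA", 1005))
```

In this made-up fragment, position 1005 sits two dashes downstream of the fragment start and therefore maps to 1003, while positions 1003 and 1004 have no counterpart in the published sequence.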

    nordborg_96.ecotypes

    This table simply contains all ecotype names in that dataset in one column.

    Datatypes of the ".ecotypes"-table:

    ecotype  varchar(8)

    msqt=> SELECT * FROM nordborg_96.ecotypes LIMIT 10;
     ecotype
    ---------
     Ag-0
     An-1
     Bay-0
     Bil-5
     Bil-7
     Bor-1
     Bor-4
     Br-0
     Bur-0
     C24
    (10 rows)

    MSQT/SNIPED! (an "Expert" mode wrapper for a precompiled SNP table "SNIPED!")

    For studies involving many different combinations of accessions, or to map in previously uncharacterized accessions, it may be more appropriate to select SNP markers by particular allele frequencies (or within a range of allele frequencies) and to take all polymorphism information into account when producing the assay development format (ADF). This is what the "sniped" table is for. The table "*.snps" (see above) already contains the allele frequency for each SNP. To produce the "*.sniped" table, the "*.snps" table is parsed and, for each chromosome and position, the two unambiguous alleles (A, T, G or C) with allele frequencies closest to 0.5 and the ecotypes carrying these two alleles are extracted. Subsequently the ADF output (see above) for each position is produced, taking the particular ecotype groups (for each position!) as input. This results in a big table with an entry for every SNP in the dataset. We reduce the number of SNPs by requiring that each SNP is spaced by at least 3 bases from any neighboring SNP. This excludes all indels and very closely spaced SNPs. The user can now restrict the list further by setting lower bounds for the sum of the two allele frequencies and upper bounds for the difference between them.
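
The allele selection described above can be sketched in Python as follows. This is a minimal illustration, not the script that builds the table: the function names pick_allele_pair and passes_bounds, the example frequencies, the rounding to two decimals and the order in which the two alleles are reported are assumptions.

```python
def pick_allele_pair(freqs):
    """Pick the two unambiguous alleles whose frequencies are closest to 0.5.

    freqs maps each allele observed at one chromosome/position to its
    allele frequency, e.g. {"G": 0.55, "A": 0.30, "N": 0.10}.
    Returns None if fewer than two unambiguous alleles are present.
    """
    clear = [(allele, f) for allele, f in freqs.items() if allele in {"A", "C", "G", "T"}]
    if len(clear) < 2:
        return None
    clear.sort(key=lambda item: abs(item[1] - 0.5))
    (allele1, freq1), (allele2, freq2) = clear[:2]
    return {
        "allele1": allele1, "allele_freq1": freq1,
        "allele2": allele2, "allele_freq2": freq2,
        "freq_sum": round(freq1 + freq2, 2),
        "freq_diff": round(abs(freq1 - freq2), 2),
    }


def passes_bounds(entry, min_sum, max_diff):
    """Lower bound on the frequency sum, upper bound on the difference."""
    return entry["freq_sum"] > min_sum and entry["freq_diff"] < max_diff


# Hypothetical frequencies at one position.
entry = pick_allele_pair({"G": 0.55, "A": 0.30, "T": 0.05, "N": 0.10})
print(entry["allele1"], entry["allele2"], passes_bounds(entry, 0.8, 0.5))
```

A high freq_sum means that few ecotypes carry a third allele or missing data, and a low freq_diff means that both alleles are about equally common, which is what makes a marker useful across many accessions.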

    Datatypes of the "sniped"-table:

    chrom                   int2
    fragid                  int4
    position                int4
    summary                 text
    allele1                 varchar(1)
    allele_freq1            float4
    allele2                 varchar(1)
    allele_freq2            float4
    freq_sum                float4
    freq_diff               float4
    neighborhood_left_len   int4
    neighborhood_right_len  int4
    ecot1                   text
    ecot2                   text
    adf                     text

    nordborg_96.sniped

    msqt=> SELECT chrom, position, summary, allele1, allele2, freq_sum, freq_diff FROM nordborg_96.sniped LIMIT 10;
     chrom | position | summary               | allele1 | allele2 | freq_sum | freq_diff
    -------+----------+-----------------------+---------+---------+----------+-----------
         1 |    29757 | |<--39-[G/A]-6-->|    | G       | A       |     0.84 |      0.36
         1 |    29291 | |<--82-[G/A]-53-->|   | G       | A       |     0.76 |      0.42
         1 |    29451 | |<--105-[G/A]-67-->|  | G       | A       |     0.78 |      0.68
         1 |    29345 | |<--53-[G/A]-105-->|  | G       | A       |     0.75 |      0.71
         1 |    29713 | |<--18-[C/T]-3-->|    | C       | T       |     0.85 |      0.77
         1 |    29519 | |<--67-[A/G]-174-->|  | A       | G       |      0.8 |      0.78
         1 |   112907 | |<--133-[G/A]-116-->| | G       | A       |     0.97 |      0.01
         1 |   197787 | |<--162-[A/G]-126-->| | A       | G       |      0.8 |      0.04
         1 |   197964 | |<--7-[C/T]-66-->|    | C       | T       |     0.78 |      0.06
         1 |   198031 | |<--66-[T/A]-17-->|   | T       | A       |     0.24 |      0.18
    (10 rows)

    You can get all entries by typing:

    msqt=> SELECT * FROM nordborg_96.sniped;

    The user can query the sniped table from the MSQT/E interface by simply using standard SQL statements. One can think of all sorts of sophisticated queries; for example:

    SELECT * FROM nordborg_96.sniped WHERE
      ( (freq_diff < 0.5) AND (freq_sum > 0.8) )
      AND ( (neighborhood_left_len > 20) OR (neighborhood_right_len > 20) )
      AND ( (ecot1 LIKE '%Lz-0%') OR (ecot2 LIKE '%Lz-0%') )
    ORDER BY chrom, fragid ;

    This will query the sniped table for all SNPs whose frequency measures are in the given range and that have more than 20 bases without a neighboring SNP on at least ONE side (which might be needed for a SNP-detection oligo). In addition it requires that the ecotype "Lz-0" has one of the 2 alleles documented. In case you require free space on BOTH sides of the SNP, you simply replace the "OR" by "AND"; e.g.

    ( (neighborhood_left_len > 35) AND (neighborhood_right_len > 35) )

    If you want to restrict your query further, you just add more conditions. For example, if you only want SNPs from chromosome 1, you add

    AND chrom = '1'

    Obviously all conditions have to be placed before the "ORDER BY" clause. Another sophisticated example: the following query should return all SNPs that distinguish Est-1, Kin-0, Nd-1, Mr-0 and Van-0 from Columbia, where each SNP has more than 24 neighboring base pairs without a SNP on at least one side.

    SELECT * FROM nordborg_96.sniped WHERE
      ( (neighborhood_left_len > 24) OR (neighborhood_right_len > 24) )
      AND (
        ( (ecot1 LIKE '%Est-1%')
          AND (ecot1 LIKE '%Kin-0%')
          AND (ecot1 LIKE '%Nd-1%')
          AND (ecot1 LIKE '%Mr-0%')
          AND (ecot1 LIKE '%Van-0%')
          AND ecot2 LIKE '%Col-0%' )
        OR
        ( (ecot2 LIKE '%Est-1%')
          AND (ecot2 LIKE '%Kin-0%')
          AND (ecot2 LIKE '%Nd-1%')
          AND (ecot2 LIKE '%Mr-0%')
          AND (ecot2 LIKE '%Van-0%')
          AND ecot1 LIKE '%Col-0%' )
      )
    ORDER BY chrom, fragid

    If you also want the SNPs for each group of four of the five accessions against Columbia, you simply run the query 5 more times, each time commenting out the 2 corresponding AND clauses. Comments in SQL are introduced by two dashes ("--"). You can concatenate the results from all those queries into a text file simply by copy and paste after each run. In case you are now wondering how to get rid of the duplicates, try in a shell:

    sort textfile | uniq > new_textfile
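
Instead of commenting out clauses by hand, the five additional queries can also be generated. The following Python sketch only builds the SQL text, which you would then paste into MSQT/E. The function names (like_all, build_query, queries_for_subsets, merge_unique) are hypothetical, and merge_unique simply mirrors the "sort | uniq" step for result lines you have already collected.

```python
from itertools import combinations


def like_all(column, ecotypes):
    """AND-chain of LIKE conditions: every ecotype must be listed in column."""
    return " AND ".join(f"({column} LIKE '%{name}%')" for name in ecotypes)


def build_query(accessions, reference="Col-0", min_len=24, schema="nordborg_96"):
    """SQL text for SNPs that separate all accessions from the reference."""
    one_way = f"({like_all('ecot1', accessions)} AND (ecot2 LIKE '%{reference}%'))"
    other_way = f"({like_all('ecot2', accessions)} AND (ecot1 LIKE '%{reference}%'))"
    return (
        f"SELECT * FROM {schema}.sniped WHERE "
        f"((neighborhood_left_len > {min_len}) OR (neighborhood_right_len > {min_len})) "
        f"AND ({one_way} OR {other_way}) "
        "ORDER BY chrom, fragid"
    )


def queries_for_subsets(accessions, size):
    """One query per subset of the given size, e.g. each group of four out of five."""
    return [build_query(subset) for subset in combinations(accessions, size)]


def merge_unique(result_sets):
    """Concatenate result lines from several runs and drop duplicates (like sort | uniq)."""
    return sorted({line for lines in result_sets for line in lines})


FIVE = ["Est-1", "Kin-0", "Nd-1", "Mr-0", "Van-0"]
for query in queries_for_subsets(FIVE, 4):
    print(query)
```

Because the ecotype names are pasted directly into the SQL text, this sketch should only be used with names you typed yourself.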

    FAQ

    Question: Why is the order of the alleles in the ADF output for some SNPs mixed up, e.g. [-/T] vs. {T,-} at the same position?
    Answer: This is a feature rather than a bug. Note the differences: distinct SNPs are shown in square brackets and are separated by a slash. Here the order is important: it is [allele1/allele2]. All other changes in that fragment are annotated in curly brackets and separated by commas, simply showing all observed bases at that position. These of course also include changes WITHIN an ecotype group; therefore we cannot assign an allele order.


    Question: Why do some SNP positions not correspond to the absolute positions of that base in the Arabidopsis genome (TIGR version 5/6)?
    Answer: That's a consequence of deletions in Columbia within the alignment upstream of the position you are looking at. Our data are based on alignments and Columbia is treated as any other ecotype. If Columbia has a deletion, the positions in the alignment get shifted by the indel size.

    Question: Why is the "scrolling" in the alignment viewer "alivie" so slow?
    Answer: Well, this -so called- rendering is done on your end. You are most likely hampered by a slow internet connection and/or a slow computer.

    Question: Why are indels in ADF output not concatenated?
    Answer: This is on purpose; each position within the indel is treated as a SNP on its own.

    Question: Why do I sometimes not get any result when I click on "Compute SNPs", although I did select 2 ecotype sets?
    Answer: You most likely wanted to use the SBE-program but you did use the ADF-program; make sure it says MSQT/SBE in the title and try again. If you actually intended to use the ADF program directly then you probably forgot to supply a "position". This needs to be the position of a known SNP.

    Question: Why do I sometimes get an error when I click on assay_development_format?
    Answer: You most likely clicked on the assay_development_format link associated with an "indel" and not a "SNP". Yes, this is a known issue and we are trying to fix it.

    Question: Is ADF, the Assay Development Formatter, an alignment reconstruction tool?
    Answer: No.

    Question: Is the MSQT program available for download?
    Answer: As of May 27th, 2006: yes and no. We have prepared all installation scripts needed to quickly install MSQT on Linux, BSD and Mac OS X, and we are testing them right now. Prerequisites are Perl, BioPerl 1.5, Apache and PostgreSQL. If you know what these are, do not hesitate to contact us and we will let you check out your copy from our svn. But be prepared to be an installation script tester ;-).

    About

    For Credits and Copyright notices please see about.html
