Phen-Gen

Frequently asked questions

  1. There was an error with my job submission. How do I debug?
  2. Which human genome reference version should I use to generate my VCF?
  3. I don’t know the disease status of the individual(s). Is unknown disease status allowed?
  4. How can I modify my VCF(s)?
  5. Does Phen-Gen work with multiple families in a single run?
  6. Does Phen-Gen work with unrelated samples?
  7. While uploading the input files on the web server, I am getting the error message “incorrect (pedigree) file extension”. Can I save the file as .ped.txt? Do I have to create the .ped extension?
  8. How do I prepare the PED and VCF files?
  9. I ran Phen-Gen for 15 families and based on the results, most of the false top-scoring genes contain common frameshift indels. How can I address this?

  1. There was an error with my job submission. How do I debug?
    • Read the Instructions page and check that the file formats and extensions conform to Phen-Gen’s requirements.
    • Ensure that filenames do not contain any spaces.
    • If you are using the standalone version, view the log files. To examine the intermediate files (which are deleted automatically), delete or comment out the unlink command in the following scripts:
      • Line 105 of combinedprobability.pl
      • Line 148 of phen-gen.pl
      • Line 49 of phenotype.pl
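
Because the line numbers listed above may differ between releases of the standalone scripts, it can be safer to confirm that a line really contains the unlink command before commenting it out. The following is a minimal sketch using standard sed and grep; the function name comment_out_unlink is hypothetical, and a backup copy with a .bak suffix is kept next to the edited script.

```shell
# Comment out one line of a Perl script, but only if that line contains "unlink".
# usage: comment_out_unlink script.pl line_number
comment_out_unlink() {
    if sed -n "${2}p" "$1" | grep -q 'unlink'; then
        # Prefix the line with "# " (a Perl comment); the original is saved as script.pl.bak
        sed -i.bak "${2}s/^/# /" "$1"
    else
        echo "line $2 of $1 does not contain 'unlink'; leaving the file unchanged" >&2
        return 1
    fi
}

# Example (run from the directory that holds the scripts):
#   comment_out_unlink combinedprobability.pl 105
#   comment_out_unlink phen-gen.pl 148
#   comment_out_unlink phenotype.pl 49
```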

  2. Which human genome reference version should I use to generate my VCF?
    Our databases are based on the NCBI build 37 reference genome. Please check that the VCF has been generated using the same version. If you are not sure, check the VCF header and the chromosome naming convention. For instance, b37 names the chromosomal sequences “1” to “22”, “X” and “Y”, whereas hg19 uses “chr1” to “chr22”, “chrX” and “chrY”.
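
One quick way to see which naming convention a VCF uses is to list the distinct values in its first column. The following is a minimal sketch using standard Unix tools; the function name vcf_chrom_names is hypothetical, and it assumes an uncompressed, tab-delimited VCF.

```shell
# List the distinct chromosome names used in the data lines of a VCF.
# usage: vcf_chrom_names family1.vcf
vcf_chrom_names() {
    # Skip header lines (starting with '#'), keep column 1 (CHROM), remove duplicates
    grep -v '^#' "$1" | cut -f1 | sort -u
}
```

Names such as 1, 2 or X point to the b37 convention, whereas chr1, chr2 or chrX point to hg19.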

  3. I don’t know the disease status of the individual(s). Is unknown disease status allowed?
    Phen-Gen does not allow for unknown disease status. We suggest starting with the best estimate of the disease status (affected or unaffected). Since variant alleles seen in healthy controls are discarded (see ‘Incorporating pedigree information’ in Supplementary material for details), it is highly recommended that the analysis be repeated using the alternate disease status as well.

  4. How can I modify my VCF(s)?
    Users can use GATK or VCFtools to perform quality control on the variant calls, combine VCFs and/or extract samples.

  5. Does Phen-Gen work with multiple families in a single run?
    Currently, Phen-Gen only accepts 1 family per VCF and PED. Hence, users should split the samples and generate the VCF and PED files for each family separately.
    For example, to compare the results for 3 families, run Phen-Gen on each family and then identify commonality among the top genes (see examples in FAQ #8).
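
If the pedigree records for several families are currently stored in one PED file, the per-family PED files can be produced by splitting on the Family ID column. The following is a minimal sketch, assuming a tab-delimited PED file without a header line; the function name split_ped_by_family and the file names are hypothetical. The VCF still has to be split separately, for example with GATK or VCFtools.

```shell
# Write one PED file per family, named after the Family ID in column 1.
# usage: split_ped_by_family all_families.ped output_dir
split_ped_by_family() {
    mkdir -p "$2"
    # Each record goes to <output_dir>/<Family ID>.ped
    awk -F'\t' -v dir="$2" '{ print > (dir "/" $1 ".ped") }' "$1"
}

# Example:
#   split_ped_by_family all_families.ped per_family
```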

  6. Does Phen-Gen work with unrelated samples?
    Within a single run, Phen-Gen only works with related samples.
    For example, to compare the results for 3 unrelated individuals with similar disease symptoms, split the samples and generate the VCF and PED files for each individual separately (see FAQ #5 and examples in FAQ #8 as well).

  7. While uploading the input files on the web server, I am getting the error message “incorrect (pedigree) file extension”. Can I save the file as .ped.txt? Do I have to create the .ped extension?
    The pedigree file (and other input files) should conform to the respective file extensions and formats as specified on the Instructions page. The pedigree file should be saved as .ped and not .ped.txt.
    For example, for the file family1.ped.txt, execute the following:
    Unix command: mv family1.ped.txt family1.ped
    Windows command: rename family1.ped.txt family1.ped

  8. How do I prepare the PED and VCF files?
    The PED file is a tab-delimited text file with the following 6 mandatory columns:

    Column 1 Family ID
    Column 2 Individual ID
    Column 3 Paternal ID
    Column 4 Maternal ID
    Column 5 Sex (1=male, 2=female)
    Column 6 Phenotype/Disease status (1=unaffected, 2=affected)

    • For single samples where parental information is not provided, the Parental IDs in the PED file can be designated as ‘unknown’.
    • Unknown disease status and unknown sex are not permitted in Phen-Gen.
    • The sample names should be identical between the VCF and the PED file.
    • Phen-Gen relies only on the first 6 columns of the PED file. Any additional columns, for example the genotype information in PED file, will be ignored. Phen-Gen relies on the VCF file for the genotype information.
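
The requirements above can be checked before submitting a job. The following is a minimal sketch using standard Unix tools; the function names check_ped and compare_samples are hypothetical, and it assumes uncompressed, tab-delimited files in which the VCF sample names correspond to the Individual ID (column 2) of the PED file.

```shell
# Report PED lines with fewer than 6 columns or with a sex/disease status other than 1 or 2.
# Prints nothing and returns 0 when every line passes.
# usage: check_ped family.ped
check_ped() {
    awk -F'\t' '
        NF < 6 {
            printf "line %d: expected at least 6 tab-separated columns, found %d\n", NR, NF
            bad = 1
            next
        }
        $5 != "1" && $5 != "2" { printf "line %d: sex must be 1 or 2, found %s\n", NR, $5; bad = 1 }
        $6 != "1" && $6 != "2" { printf "line %d: disease status must be 1 or 2, found %s\n", NR, $6; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

# Print the sample names that appear in only one of the two files.
# No output means the VCF and the PED file list the same samples.
# usage: compare_samples family.vcf family.ped
compare_samples() {
    {
        # Sample names are columns 10 onwards of the #CHROM header line
        awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }' "$1" | sort -u
        # Individual IDs are column 2 of the PED file
        cut -f2 "$2" | sort -u
    } | sort | uniq -u
}
```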

    Analyzing unrelated individuals:
    To analyze 2 (or more) unrelated individuals with similar disease symptoms, run Phen-Gen independently for each sample:

    sample1.vcf

    sample1.ped

    sample2.vcf

    sample2.ped

    Phen-Gen will not work if you combine the 2 unrelated individuals in the following manner:

    samples.vcf

    samples.ped

    Analyzing a trio:
    If the samples are from the same family, they should be merged in a single VCF file. The relation between these individuals should be defined in the PED file.

    family.vcf

    family.ped

    To analyze multiple families, run Phen-Gen independently for each family.

  9. I ran Phen-Gen for 15 families and based on the results, most of the false top-scoring genes contain common frameshift indels. How can I address this?
    The results of Phen-Gen depend on the quality of the input data. Some recommendations/heuristics that have worked well for us (and other users) are:
      1. Follow the best-practice guidelines for the variant caller of choice.
      2. If an in-house curated dataset is available, it can be used to discard variants commonly seen in healthy individuals (see ‘Performance using real dataset’ in Supplementary material).
      3. Variant calls for indels, in general, tend to be more error-prone than SNPs. Furthermore, some sequencing platforms are more prone to these errors. In the absence of an in-house dataset from the same platform, another heuristic which may be considered is to discard known frameshift indels (those with reference SNP cluster IDs, or rs IDs). The following Unix command removes indels with rs IDs while keeping the header lines and all other variants. Note that a multi-allelic ALT field (for example “A,T”) is longer than one character and is therefore treated as an indel by this command.
        awk -F'\t' '/^#/ || !((length($4)>1 || length($5)>1) && $3~/^rs/)' filename.vcf > filename.filtered.vcf