ParseCNV Copy Number
Variant Call Association Software
Linux/Mac/Windows with
Cygwin (command line)
Windows Executable
(graphical user interface)
Decompress command: gunzip ParseCNV20.tgz;
tar -xvf ParseCNV20.tar
Run Command: perl ParseCNV.pl Cases.rawcnv
Controls.rawcnv Cases.fam ChrSnp0Pos.map
On Windows, unzip, and launch by
double-clicking ParseCNV.exe (with camel icon).
Linux
bash environment is best, but on Windows Cygwin will also work.
Having
trouble? Bugs? Email Joseph Glessner:
Or post
on Google Group: https://groups.google.com/forum/#!forum/parsecnv
Multiple
CNV Detection Comparison and De Novo CNV Calls Annotation |
Multiple
Testing Correction |
ParseCNV takes CNV calls as input and creates probe
based statistics for CNV occurrence in (cases and controls, families, or
population with quantitative trait) then calls CNVRs based on neighboring SNPs
of similar significance. CNV calls may be from aCGH, SNP array, Exome
Sequencing, or Whole Genome Sequencing.
A brief introduction: DNA base call GWAS deals with 3
possible states at each SNP [AA,AB,BB] where A and B=[A,T,C,G] and AB.
CNV association deals with 5 possible states at each
SNP [0,1,2,3,4] where 2 is typically high frequency as the normal diploid one
maternal one paternal copy expectation. To leverage 3 state association, we
divide out deletion [0,1,2] and duplication [2,3,4] association.
Pre and Post association quality control metrics are important
to limit bias in overly stringent Pre association quality control.
CNVs are also called Copy Number Abberations,
Changes, Differences or Polymorphisms (CNA, CNC, CND or CNP) depending on the
discipline, context and the level of specificity. CNVs are the popularly
assessed subset of Structural Variations, which include insertions and
inversions.
See simple example input files: Cases.rawcnv Controls.rawcnv
Cases.fam ChrSnp0Pos.map IDToPath.txt and try to run them as a test of the
software on your system. If you have trouble with your input files, create
similarly simple versions to quickly run the software and isolate the problem.
Run Command: perl
ParseCNV.pl Cases.rawcnv Controls.rawcnv Cases.fam ChrSnp0Pos.map (optional
--out <prefix>, --build <hg18>, --idToPath <file.txt>,
--includePed, --probesBafLrr <file.txt>, --batch <file.txt>,
--mergePVar <1>, --mergeDist <1000000>, --tdt, --maxPInclusion
<0.05>, --includeAllRedFlags)
Dependencies: bash, R,
and plink. Try which bash, which R, and which plink and if any returns command
not found instead of path, try: "module avail" and "module load
R" otherwise download and install these dependencies then do "export
PATH=/YourDirectoryPath/R-3.1.2/bin:$PATH".
Read and cite the paper:
New Major Features
(full version log at bottom):
NEW Feature! 2015-05-14 --includePed now writes
directly to a Plink .bed file to save disk space
NEW Feature! 2014-11-03 ParseCNV_GUI_Windows:
Executable Minimized Memory for big data but a bit slower
NEW Feature! 2014-02-27 MakeRedFlagPlots.pl: derive
continuous confidence score based on redFlagReasons column in ParseCNV
--includeAllRedFlags
NEW Feature! 2014-02-22 PQN_Concord_Inh.pl: multiple
CNV detection algorithm comparison and de
novo, transmitted, and untransmitted CNV call based annotation
NEW Feature! 2013-12-13 Graphical Interface for
ParseCNV.pl using wxPerl and Tk ParseCNV_GUI.pl
NEW Feature! 2013-11-25 --build bosTau7 (Cow 1-29 X)
--build susScr3 (Pig 1-18 XY) now supported in addition to hg18 hg19
NEW Feature! 2013-11-21 Brief column .stats main
folder output, ParseCNV_QC.pl QC Thresholds may be specified, basic non-human
species support
NEW Feature! 2012-11-27 Sample quality control
metrics and sample remove report now automated in ParseCNV_QC.pl
NEW Feature! 2012-11-25 Gene based association
approach now available in ParseCNV_GeneBased.pl
NEW Feature! 2012-11-20 BAF/LRR plots of two samples
with associated CNV and one sample with CN=2 with flanking diploid data for
each CNVR with --idToPath
NEW Feature! 2012-08-25 Image created with BAF and
LRR of specific association regions across contributing cases as part of
--idToPath option.
NEW Feature! 2012-04-09 Plink p values can now be
imported into ParseCNV using InsertPlinkPvalues.pl for CNVR boundary calling.
NEW Feature! 2012-02-06 UCSC review features are now
automatically scored.
CNV Calls
CNV calls (.rawcnv) can
be generated by any algorithm but should be in the PennCNV format:
CNVCallChromosome:Start-StopBoundaries NumProbes
Size(BasePairs) CNState SampleID StartSNP EndSNP
Format conversion scripts for popular algorithms (QuantiSNP, XHMM, and GenomeStrip) are included. Typically this reformatting is relatively easy in Excel. Calls for a given sample should be sorted together in subsequent rows. CNState values: state1,cn=0 state2,cn=1 state5,cn=3 state6,cn=4. These states are also referred to as homozygous deletion, hemizygous deletion, hemizygous duplication, and homozygous duplication respectively in literature. Be careful there is no blank line at end of file. PennCNV states [1,2,5,6] correspond to CN [0,1,3,4].
For example:
Òchr1:156783977-156788016 numsnp=6
length=4,040
state2,cn=1 sample1.txt startsnp=rs16840314 endsnp=rs10489835Ó
Alternatively: Òchr1:156783977-156788016 numsnp=6
length=4,040
state2,cn=1 sample1.txt startsnp=rs16840314 endsnp=rs10489835 conf=20.659Ó
There are separate files for the study subjects
(cases and related parents or siblings if available) and control subjects.
If there are no
control subjects as in a family based or quantitative trait association study,
use the file NullControl_ForQuantitativeTrait.rawcnv for the Controls.rawcnv
input.
If you want to filter a .rawcnv with both cases and
controls into separate input files you can use: GeneRef/RemoveSpecificBarcodes-Hash_Example_DefFile_ColRemKeep.pl
(FilterCNV.pl)
state3,cn=2 on chrX is male duplication so should be
recoded state5,cn=3
state4,cn=2 means copy neutral LOH i.e. ROH
BED (4 columns: chr start stop id) TO RAWCNV: awk
'{print "chr"$1":"$2"-"$3"\tnumsnp=1\tlength="$3-$2"\tstate2,cn=1\t"$4"\tstartsnp="$1"_"$2"\tendsnp="$1"_"$3}'
human_g1k_v37.mask.36.bed > human_g1k_v37.mask.36.rawcnv
RAWCNV TO Plink CNV: sed 's/:/\t/' Cases_wConf.rawcnv
| sed 's/-/\t/' | sed 's/chr//' | sed 's/state.\,cn\=//'| sed 's/conf=//' | sed
's/numsnp=//' | awk '{print "0 "$7" "$1"
"$2" "$3" "$6" "$10" "$4}'
PennCNV: compile_pfb.pl creates a .pfb and cal_gc_snp.pl
creates a .gcmodel for your specific array. Use the same version of array for
cases and controls.
Please make sure you use --log (or
2>>MySamples.log) and --conf command line arguments for quality
assessment. Then:
grep summary MySamples.log > MySamples_QC.txt
awk Ô{print $5}Õ MySamples.rawcnv | uniq --count > MySamples_CountCNVs.txt
If you want ChrX and LOH(copy neutral LOH/ROH) calls,
you have to run separately --chrx and --loh --hmm hh550.hmm
Merging fragmented calls with clean_cnv.pl is useful,
especially if you are interested in large CNVs: perl clean_cnv.pl combineseg file.rawcnv
-signalfile file.pfb -out file_clean.rawcnv
PFB gives universal definition for snp to chr pos and
snp set used
detect_cnv.pl --trio after detect_cnv.pl --test is
strongly recommended if you have trios to limit the de novo rate.
Review LRR BAF of the parents for false negative CNVs
by copying child CNVs and adding parent IDs and state3,cn=2 to run
visualize_cnv.pl the run: convert Mom.JPG Dad.JPG +append Parents.JPG; convert
Parents.JPG Child.JPG -append Trio.JPG
.signalfile (.baflrr) .pfb
.hmm .gcmodel |
PennCNV ---------------> |
.rawcnv .log .fam .map |
ParseCNV ---------------> |
.cnvr .stats |
PennCNV-ExomeSeq: PennCNV Copy Number Variant Call
Detection in Exome Sequencing Software http://penncnvexomeseq.sourceforge.net
QuantiSNP: You have to download from here. The various other websites do not work. Make sure your files are: ÒSNPName Chromosome
Position LRR BAFÓ the header is not read and if you switch LRR and BAF no error
is given.
Plink: Run --missing to get call rates
(or from GenomeStudio/GenotypingConsole), --genome to check relatedness (high
PI_HAT), --mendel to verify families, --check-sex to verify gender, .mds (--genome
then --read-genome --cluster --mds-plot) or Eigenstrat smartpca to create
.pca.evec for ethnicity/stratification.
GenomeSTRiP: Always use the reference metadata bundles and
download using wget NOT drag and drop (which gives a truncated file):
wget ftp://ftp.broadinstitute.org/pub/svtoolkit/reference_metadata_bundles/1000G_phase1_12May2015.tar.gz
wget ftp://ftp.broadinstitute.org/pub/svtoolkit/reference_metadata_bundles/1000G_phase3_12May2015.tar.gz
.fastq .fasta |
BWA-MEM --------------> -M |
.bam Ref Align CN2 Ploidy Gender |
GenomeSTRiP ------------------> (SVPreprocess, SVDiscovery, SVGenotyper,
CNVDiscoveryPipeline NOT SVAltAlign) |
test1.genotypes.vcf cnv_stage12/gs_cnv.genotypes.vcf.gz |
ConvertVCFToPennCNV_GenomeStrip.pl ---------------------------------------------------------->
|
.rawcnv |
ParseCNV -------------> |
.cnvr |
If you only have
PED/BED genotype file, you need to use the Illumina Genome Studio Project with the Idats from
your samples loaded and export BafLrr files. Use column chooser to show B
allele Freq and Log R Ratio along with SNP Name, Chr, and Position. Use the
PennCNV packaged script kcolumn.pl to convert the all samples text file into
single sample text files. See detailed instructions here: http://penncnv.openbioinformatics.org/en/latest/user-guide/input/
If you are using Affymetrix, see here:
http://penncnv.openbioinformatics.org/en/latest/user-guide/affy/ Once you have
generated a PennCNV .rawcnv file (using the --conf option), you can convert into
a Plink CNV file using a simple bash command: sed 's/:/\t/' Cases_wConf.rawcnv
| sed 's/-/\t/' | sed 's/chr//' | sed 's/state.\,cn\=//'| sed 's/conf=//' | sed
's/numsnp=//' | awk '{print "0 "$7" "$1"
"$2" "$3" "$6" "$10" "$4}' or use
ParseCNV which uses PennCNV .rawcnv as input directly. See Plink CNV format
specification here: http://pngu.mgh.harvard.edu/~purcell/plink/cnv.shtml
Otherwise, it is very hard to make any meaningful
inference of CNVs from a SNP genotype PED file. We can infer CNV by intensity (LRR)
alone, but not by genotype alone.
The best you could do is detect runs of homozygosity
using plink:
plink --ped file.ped --map file.map --homozyg
http://pngu.mgh.harvard.edu/~purcell/plink/ibdibs.shtml#homo
This would flag regions, which could be deletions
(with intensity drop) or simply runs of homozygosity (which do not have the
intensity drop). You can convert the plink ROH output into a plink CNV input
file using Excel or awk. You can just put Type as 1 to presume they are mostly
single copy deletions.
You could similarly look for "runs of
no-calls" which would flag regions of homozygous deletion or duplication,
again non-differentiatable without the intensity data.
This is the ambiguity you would have to accept given
the limited data available.
Some papers say you can impute CNVs from SNP data,
but I am skeptical and at best would only detect mostly common CNVs, which are
typically of less interest.
FAM
The family or FAM file has the family relationships
in the study subjects to track inheritance of CNVs. Tab Delimited:
FamID
IndividualID(InCNVCallFile) FatherID MotherID Gender(1=male,2=female) Affected(1=unaffected,2=affected,-9=exclude).
IndividualID and Affected columns are required others
can be zeroed if not applicable.
For example: Ò1 1.txt 0 0 1 2Ó
If you have Cases.rawcnv and they are just unrelated
cases, without any families (unrelated), or quantitative traits you can just
run: awk '{print $5}' Cases.rawcnv | uniq | sed 's/^/0\t/' | sed
's/$/\t0\t0\t0\t2/' > Cases.fam
MAP
The SNP definition file is four columns:
Chromosome ProbeID
PositionCentiMorgan PositionBasePair
similar to Plink map file (PositionCentiMorgan is
fine just zeroed out). Sorted by position and chromosome Sorted.map created
automatically which should be used with the resulting ped. Map could be an
intersection or union of chip types if study is across platforms. Be sure that
all SNPs used for CNV calling are provided.
For example: Ò1 rs1739822 0 16196820Ó
If you are using
sequencing data or are uncertain about the full set of probes, a blank or
incomplete file can be used as the map input and the map will be dynamically
determined by the start and end positions (breakpoints) of the CNVs observed in
the .rawcnv files.
The full .map chr snp pos information should be available
on the Illumina/Affymetrix website, baflrr "signal intensity files"
PennCNV inputs, in Illumina GenomeStudio / Affymetrix Genotyping Console, or PennCNV
.pfb file. You can use the linux awk command to rearrange the columns if
needed.
Chr X should be X, not 23 as in some plink processed
files.
For example if you have the PennCNV .pfb run: awk
'{print $2"\t"$1"\t0\t"$3}' ChipX.pfb > ChipX.map
If you have a .vcf run: head -1
DATA.PCA_normalized.filtered.sample_zscores.RD.txt | sed 's/\s/\n/g' | grep -v
Matrix | sed 's/:/\t/' | awk '{print
$1"\t"$1"_"$2"\t0\t"$2}' > file.map
OR awk '{print $1" "$2}' file.vcf | sed
's/^chr//' | awk '{print $1"\t"$1"_"$2"\t0\t"$2}'
| grep -v ^# > file.map
ParseCNV will automatically define probes missing in
the map according to the .rawcnv definitions but this should only be utilized
for a small number of probes to properly represent the actual data resolution.
CHR_POS TO MAP: awk '{print
$1"\t"$1"\t0\t"$1}' b | sed 's/_/\t/g' | awk '{print
$1"\t"$3"_"$4"\t0\t"$7}' | sort -n -k 4,4 | sort -k
1,1 -s > b_Sorted.map
Download Test Data 785 cases, 1110 controls,
561,308 probes
Out of Memory:
Try running per chr by
Ògrep chr1: file.rawcnvÓ if runs out of memory indicated by percentage updating
getting cutoff.
grep chr1: Cases.rawcnv > Cases.rawcnv_chr1
grep chr1: Controls.rawcnv > Controls.rawcnv_chr1
grep -P "^1\s" ChrSnp0Pos.map > ChrSnp0Pos.map_chr1 OR sed 's/^/chr/'
ChrSnp0Pos.map | awk '{print
$1":\t"$2"\t"$3"\t"$4}' | grep chr1: | sed
's/chr//' | sed 's/://' > ChrSnp0Pos.map_chr1
Cases.fam no edit needed
cat SSC_ParentsControlOnly_PCGC_chr1_Recluster_CNVR_ALL_ReviewedCNVRs.txt
outputs when done
--splitByChr --cluster --sub bsub --clusterHeader
MGH_Header.txt
--splitByRange --cluster --sub bsub --clusterHeader
MGH_Header.txt
ParseCNV_BigData_LowMemory_Slower.pl
--out output
file prefix.
--idToPath <file.txt> Additional
SampleID to BafLrrFilePath file for Specific BAF LRR access of all associated
samples for associated CNVRs. The input file should contain 2 columns: IDs as
in column 5 of .rawcnv and full path to baf lrr file (signal intensity). In many
cases, the 2 columns may be the same.
--probesBafLrr <file.txt> If Probe ID not first
column of BAF LRR file, provide probe order reference file.
--includePed write .ped file for usage in PLINK
--build <hg18> Specify
the genome build used in CNV calling and .map file. Uses corresponding build
files in GeneRef folder. See README.txt end for build specific download
commands (example below).
--batch <file.txt> A
subset of samples which form a discernible batch to monitor the contributions
of such as a different chip type/version, processing lab, sample type/source,
or reagent batch.
--mergePVar <1> Change
the default of merging probes with p-values no more than one power of ten
variation.
--mergeDist <1000000> Change the default of merging
probes with proximal distance no more than 1000000 base pairs (1MB). The
default is good for 500k but for 2.5M since 5x more markers, use 1/5 of 1M MB =
200,000.
--tdt Specify
a family based study with trios to allow TDT deletion and duplication p-value
to drive CNVR boundary determination rather than case:control.
--maxPInclusion <0.05> The maximum p value of probe based statistics to
include in CNVR determination and thus reported associations.
--includeAllRedFlags Include
all red flag values in RedFlagReasons regardless of extreme values
--keepTemp Keep
temp folder files (mainly for Verbose.stats currently dependency of
InsertPlinkPValues.pl)
--splitByChr Split
run by chromosome for large projects that run out of memory
--splitByRange Split
run by equal sized ranges for large projects that run out of memory
--cluster option
requires a parallel computing cluster you can qsub to (speeds up split by Chr
run)
--lowRange <0> low
index for range specific run if only interested in specific locus or for large
memory intensive projects
--highRange <nProbes> high index for range specific
run if only interested in specific locus or for large memory intensive projects
--oneSidedP calculate
one sided P assuming only enriched in case CNVs are of interest (default two
sided)
--permuteP <1000> calculate
permuted P providing the number of permutations (takes longer time
computationally)
Explanation |
Category |
|
ParseCNV_GeneBased.pl |
Gene based association approach counting any overlap of
gene or gene exons together (collapsing association strategy) |
Collapsing Approach Association Alternative to
ParseCNV.pl |
Polishgc5BaseForScanRegion_KeepOnlyChrLines_5120Increments.pl |
Make GC file format for scan_region.pl for non-human
species in 5120 increments to be similar to the human definition |
definition utility |
PolishRefGeneForScanRegion_KeepOnlyChrLines.pl |
Make RefGene file format for scan_region.pl for
non-human species keeping only "Chr Number" entries |
definition utility |
AvgGCFromScanRegion.pl |
Averages GC |
dependency of AnnotateCNV |
CountAvgScanRegion.pl |
Counts DGV and SegDups and Averages GC as Part of
AnnotateCNV.txt Commands |
dependency of AnnotateCNV |
1_A.baflrr |
Example BafLrr SignalFile |
example file |
2_A.baflrr |
Example BafLrr SignalFile |
example file |
3_A.baflrr |
Example BafLrr SignalFile |
example file |
4_A.baflrr |
Example BafLrr SignalFile |
example file |
5_A.baflrr |
Example BafLrr SignalFile |
example file |
Batch.txt |
Example Batch input file |
example file |
Cases.fam |
Example Fam File |
example file |
Cases.rawcnv |
Example Cases rawcnv file |
example file |
ChrSnp0Pos.map |
Example Map File |
example file |
Controls.rawcnv |
Example Controls rawcnv file |
example file |
IDToPath.txt |
Example rawcnv column 5 ID matched to BafLrr SignalFile
path (in many cases will be the same) |
example file |
NullControl_ForQuantitativeTrait.rawcnv |
Empty Controls rawcnv file if running quantitative trait
where all subjects are in Cases.rawcnv |
example file |
probesBafLrr.txt |
Example probesBafLrr input file in case BafLrr
SignalFiles do not have SNP ID column but are just in a standard order |
example file |
WeightTEST.pheno |
Example Quantitative Trait Phenotype File |
example file |
GeneRef |
Genome definition files from UCSC Tables |
folder definition files |
ArrayConcordance_InheritanceDenovo_AnnotateCNVCalls |
PQN_Concord_Inh.pl: multiple CNV detection algorithm
comparison and de novo, transmitted, and untransmitted CNV call based
annotation |
folder denovo multiple algorithms |
PerlModules |
Perl Modules and scripts used within the main folder
scripts |
folder perl modules |
ParseCNV_GUI_Tk.pl |
Graphical user interface for ParseCNV.pl using perl Tk |
GUI ParseCNV.pl |
ParseCNV_GUI_Wx.pl |
Graphical user interface for ParseCNV.pl using perl Wx |
GUI ParseCNV.pl |
ConvertQuantiSNPToPennCNV.pl |
Convert QuantiSNP format CNV Calls to PennCNV format |
input conversion |
ConvertVCFToPennCNV_GenomeStrip.pl |
Convert GenomeStrip VCF format CNV Calls to PennCNV
format |
input conversion |
ConvertVCFToPennCNV_XHMM.pl |
Convert XHMM VCF format CNV Calls to PennCNV format |
input conversion |
ParseCNV_QC.pl |
Sample Quality Control metrics plotting and sample
remove report based on automatic or manual thresholds |
input utility |
TestConcordance.pl |
compares pairs of CNV calls |
input utility |
FilterCNV.pl |
Handy utility to keep or remove a certain list of
samples from a specified column, such as QC filtering the .rawcnv file |
input utility |
AnnotateCNV.txt |
Annotates many of the genomic feature red flags onto
.rawcnv calls |
output utility |
CountGenes.pl |
Counts the number of values in a comma separated list
such as genes |
output utility |
FindSnpsInRange.pl |
Query a genomic region of interest in the ParseCNV
.stats file |
output utility |
GetSnpsInSnpsRange.pl |
Query Multiple genomic regions of interest in the
ParseCNV .stats file |
output utility |
InsertPlinkPvalues.pl |
Use Plink or other software p-values to define ParseCNV
CNVRs and RedFlag Annotations |
output utility |
MakeRedFlagPlots.pl |
Derive continuous confidence score based on
redFlagReasons column in ParseCNV --includeAllRedFlags |
output utility |
PercentSamples.pl |
Showing contributing
counts of batches in IDsCasesDel/IDsCasesDup column. |
output utility |
wgetUcscPdfs.sh |
Pulls UCSC Track high-quality images for review of
significant CNVRs |
output utility |
ParseCNV.pl |
THE MAIN SCRIPT. |
ParseCNV.pl |
extremevalues_2.2.tar.gz |
Used by ParseCNV_QC R module to find extreme values
(needs work, perhaps substitute median absolute deviation) |
perl module (could move into folder) |
ParseCNV_Denovo.pl |
Reformat files for de novo calling (probably not needed) |
probably not needed |
PennCNV_MultSpaceToTab.pl |
Used by ParseCNV_Denovo (probably not needed) |
probably not needed |
README.txt |
Standard Readme file, most pertinent updated information
will be on the website |
README |
Download hg19/hg38 definition files
by pasting these commands (Included in standard download now):
cd GeneRef
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/kgXref.txt.gz
wget
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/bioCycPathway.txt.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/keggPathway.txt.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/keggMapDesc.txt.gz
wget
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cgapBiocPathway.txt.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cgapAlias.txt.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cgapBiocDesc.txt.gz
gunzip knownGene.txt.gz; mv knownGene.txt hg19_knownGene.txt
gunzip kgXref.txt.gz; mv kgXref.txt hg19_kgXref.txt
gunzip bioCycPathway.txt.gz; mv bioCycPathway.txt hg19_bioCycPathway.txt
gunzip keggPathway.txt.gz; mv keggPathway.txt hg19_keggPathway.txt
gunzip keggMapDesc.txt.gz; mv keggMapDesc.txt hg19_keggMapDesc.txt
gunzip cgapBiocPathway.txt.gz; mv cgapBiocPathway.txt hg19_cgapBiocPathway.txt
gunzip cgapAlias.txt.gz; mv cgapAlias.txt hg19_cgapAlias.txt
gunzip cgapBiocDesc.txt.gz; mv cgapBiocDesc.txt hg19_cgapBiocDesc.txt
perl MakeExonScanRegionDefFile.pl hg19_knownGene.txt
perl MakeExonScanRegionDefFile_wGene.pl hg19_knownGene.txt
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz
gunzip cytoBand.txt.gz; mv cytoBand.txt hg19_cytoBand.txt
awk -F"\t" '{print
$1$4"\t"$1"\t+\t"$2"\t"$3"\t \t \t \t \t \t
\t "}' hg19_cytoBand.txt | sed 's/chr//' >
hg19_cytoBand_SimFormat_AllCol.sorted
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/genomicSuperDups.txt.gz
gunzip genomicSuperDups.txt.gz
sort -k 2,2 -k 3,3n genomicSuperDups.txt > hg19_genomicSuperDups.sorted
awk '{print $27"\t"$2"\t"$7"\t"$3"\t"$4"\t
\t \t \t \t \t \t "}' hg19_genomicSuperDups.sorted >
hg19_genomicSuperDups_SimFormat_AllCol.sorted
rm genomicSuperDups.txt
rm hg19_genomicSuperDups.sorted
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/dgvMerged.txt.gz
gunzip dgvMerged.txt.gz; mv dgvMerged.txt hg19_dgv.txt
awk -F"\t" '{print $12"\t"$2"\t"$7"\t"$3"\t"$4"\t
\t \t \t \t \t \t "}' hg19_dgv.txt> hg19_dgv_SimFormat_AllCol.sorted
rm hg19_dgv.txt
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gc5Base.txt.gz ## wget http://hgdownload.cse.ucsc.edu/gbdb/hg38/bbi/gc5BaseBw/gc5Base.bw
gunzip gc5Base.txt.gz
awk -F"\t" '{print
$13/$12"\t"$2"\t+\t"$3"\t"$4"\t \t \t \t \t
\t \t "}' gc5Base.txt> hg19_gc5Base_SimFormat_AllCol.sorted
rm gc5Base.txt
hg19_somaticRearrangement_SimFormat_AllCol.sorted now
provided in download if error arises at that step.
Was: dgv.txt.gz, now:dgvMerged.txt.gz and dgvSupporting.txt.gz
Quantitative Trait Association
¤
run
ParseCNV with --includePed and --out <myPrefix> (Cases.rawcnv can have
all samples and Controls.rawcnv can just be a blank file with no blank line at
end NullControl_ForQuantitativeTrait.rawcnv)
¤
run
plink --ped CNVDEL.ped --map myMap_UseWithPed.map --make-bed --out CNVDEL_Bed
and again with CNVDUP.ped (this will compress and allow .fam to be easily
edited, just make sure you keep the sorted order of .fam the same)
¤
run
plink --bfile CNVDEL_Bed --pheno file.pheno --assoc --adjust --all-pheno --out
CNVDEL_Bed_Qassoc --allow-no-sex and again with CNVDUP_Bed (see here and here)
¤
perl
InsertPlinkPvalues.pl --del <Plink deletion association result file
.qassoc> --dup <Plink duplication association result file .qassoc>
--outPrefix <ParseCNV out Prefix from first run with --includePed>
¤
Note
many SNPs do not have any CNV frequency with ÒNAÓ p-values or low/rare CNV
frequency.
¤
Be sure your plink input
and output files are all in the ParseCNV folder. The .pheno file should
be: FID IID QuantitativeTrait
¤
InsertPlinkPvalues.pl
will then generate the CNVR reports for quantitative trait association instead
of ParseCNV.pl which generates CNVR reports for Case-Control and TDT.
¤
Example:
¤
Cases.rawcnv
in this case contains all samples in your study since all subjects have a
continuous trait like weight and are not easily separated into simple case and
control cohorts.
¤
perl
ParseCNV.pl Cases.rawcnv NullControl_ForQuantitativeTrait.rawcnv Cases.fam
ChrSnp0Pos.map --includePed --out TEST
¤
plink
--ped TESTCNVDEL.ped --map TESTChrSnp0PosSorted_UseWithPed.map --pheno
WeightTEST.pheno --out DelWeightTEST --noweb --assoc
¤
plink
--ped TESTCNVDUP.ped --map TESTChrSnp0PosSorted_UseWithPed.map --pheno
WeightTEST.pheno --out DupWeightTEST --noweb --assoc
¤
perl
InsertPlinkPvalues.pl --del DelWeightTEST.qassoc --dup DupWeightTEST.qassoc
--outPrefix TEST --out TESTInsert
¤
To
correct for population variation, include Ò--covar myMDSResult.mdsÓ or Ò--covar
myPCAResult.pca.evecÓ to the plink command as detailed below. (Note
–assoc and --tdt do not work with --covar so use --linear or --logistic
instead.)
¤
If you want to use other P-value
generating tools such as GEMMA update the header to be ÒSNPÓ, ÒBETAÓ or ÒTÓ or ÒORÓ,
and ÒPÓ, just note InsertPlinkPvalues dependencies
¤
REQUIRED DEPENDENCY:
¤
TESTCNV_Verbose.stats
¤
CAN MAKE FILES:
¤
vim TESTParseCNV.log containing
ÒRunning Command: perl ParseCNV.pl Cases.rawcnv Controls.rawcnv Cases.fam
ChrSnp0Pos.map --includePed --out TESTÓ
¤
vim DelWeightTEST.log
¤
vim DupWeightTEST.log
Population Stratification Correction
¤
Use
genotype .ped in Plink to create .mds result (--genome then --read-genome
--cluster --mds-plot 3 or 10) or in Eigenstrat smartpca to create .pca.evec
result.
¤
run
ParseCNV with –includePed (which now creates Individual-major
bed/bim/fam)
¤
run
plink --bed plinkDel.bed --bim myMap_UseWithPed.bim --fam ALLFAM.fam
--make-bed --out CNVDEL and again with --bim plinkDup.bed --out CNVDUP changed
(which creates SNP-major bed/bim/fam)
¤
run
plink --bfile CNVDEL --covar myMDSResult.mds --logistic --adjust --allow-no-sex
and again with CNVDUP
¤
egrep
ÔSNP|ADDÕ MDS.assoc.logistic > MDS.assocADDlogistic (important to name this
way without extra Ò.Ó so Plink .log found, If you are using Affy ÒSNP_Ó IDs
choose another word unique to the header)
¤
InsertPlinkPvalues.pl
--del <Plink deletion association result file> --dup <Plink
duplication association result file> --outPrefix <ParseCNV out Prefix from
first run with --includePed>
¤ I recommend checking for CaseControl
PCA overlap -> Majority Cluster (typically Caucasians) ensuring overlap
-> Logistic -> then separately do African and Asian analysis since
ethnicity false positives may occur for rare CNVs and direction of effect may
differ
¤ I support one quantitative trait
specified in the .pheno. I do not currently support QuantitativeTrait in the
.fam affected column or multiple quantitative traits in the .pheno.
¤ You can also use LMM,
which is agrued to be the best stratification correction method for rare
variants such as CNVs
¤ Gemma has this command
to generate the relatedness/kinship matrix: ./gemma -bfile
/work/1a/SCZ/chr1CNVDEL -gk 1 -o chr1CNVDELGemmaKinship
¤ Then you run Gemma again
to do the actual p-value calculations: ./gemma
-bfile /work/1a/SCZ/chr1CNVDEL -k ./output/chr1CNVDELGemmaKinship.cXX.txt -lmm
4 -maf 0.000000005 -o chr1CNVDELGemma
Collapsing approach used to combine non-overlapping
CNVs which all overlap the same gene.
Gene based association approach now available in
ParseCNV_GeneBased.pl
perl
ParseCNV_GeneBased.pl Cases.rawcnv Controls.rawcnv
sort by OR>1 for Case enriched significant Genes.
The gene based approach will give many more potential
hits because it is more flexible and sensitive. Not as many features are there
but it is useable. It should work in a similar way for METAL with all necessary
input columns.
The gene-based association collapses CNVs based on
overlapping any sub-segment of the gene rather than the same sub-segment as in
the case of a CNVR defined by ParseCNV. Thus one case may overlap exon 1 of
gene x while another case may overlap exon 2 of gene x and yield significance
in a gene-based association but not in a traditional CNVR SNP-based statistics
strict overlap association.
http://www.nature.com/nature/journal/v459/n7246/extref/nature07953-s1.pdf
Supplementary Figure 5 and p37 lower panel
Burden
See Burden.txt
output from ParseCNV.pl
We need to filter the CNV calls by PennCNV provided
[numsnp, length, and conf], run PennCNV clean_cnv.pl --fraction 0.8 to merge
fragmented calls, genomic features [seg dups, DGV, Telo/Centromere/GC/somatic
rearrangements], array features[sparse coverage/low AB frequency in dups],
cohort features[population frequency]. Finally do some spot confirmation by
BAF/LRR review and PCR validation.
As always, check what chip versions make up the data
and if the full probe set was used.
The total length burden doesn't take into account the
number of CNVs or the number of genes overlapped by those CNVs.
I have tested bosTau7 Cow and susScr3 Pig
specifically to work. If you have a different species or build you can try
replacing ÒbosTau7Ó in the commands below. Mouse,
chimp, and fly should work. You may need to run the UCSC LiftOver web tool if
definition files are not available in your build version.
Segmental Duplications annotations available through
UWash Eichler website: http://eichlerlab.gs.washington.edu/database.html has: Mouse, chimp, rat, dog, fly, worm, monkey, fugu, chicken, cow, and stickleback.
Database of Genomic Variants (DGV) only has healthy
human control CNVs.
dbVar is similar for more species ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/ has: Bos_taurus cow, Canis_lupus gray wolf, Danio_rerio Zebrafish, Drosophila_melanogaster fly, Equus_caballus Horse,
Homo_sapiens Human, Macaca_mulatta Rhesus macaque, Mus_musculus Mouse, Pan_troglodytes chimpanzee, Sorghum_bicolor grass grain, and Sus_scrofa Pig.
Download bosTau7 Cow definition
files by pasting these commands:
cd GeneRef
wget http://hgdownload.cse.ucsc.edu/goldenPath/bosTau7/database/refGene.txt.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/bosTau7/database/refLink.txt.gz
###wget http://hgdownload.cse.ucsc.edu/goldenPath/bosTau7/database/gc5BaseBw.txt.gz
wget http://hgdownload.cse.ucsc.edu/gbdb/bosTau7/bbi/gc5Base.bw
###Use
hg19 Pathways files by querying TOUPPER of all genes
wget http://hgdownload.cse.ucsc.edu/goldenPath/bosTau7/database/cytoBandIdeo.txt.gz
wget http://cowparalogy.gs.washington.edu/data/cow4GenomicSuperDupgenomicSuperDup.tab ###NoHeader SeeHeader in
http://paralogy.gs.washington.edu/build37/data/GRCh37GenomicSuperDup.tab
wget
ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Bos_taurus/by_assembly/Btau_4.6.1/gvf/Btau_4.6.1.remap.all.germline.ucsc.gvf.gz
gunzip refGene.txt.gz; mv refGene.txt
bosTau7_refGene.txt
gunzip refLink.txt.gz; mv refLink.txt
bosTau7_refLink.txt
###gunzip gc5BaseBw.txt.gz; mv gc5BaseBw.txt
bosTau7_gc5BaseBw.txt
perl PolishRefGeneForScanRegion_KeepOnlyChrLines.pl
bosTau7_refGene.txt ### to delete non ÒchrÓ lines
perl MakeExonScanRegionDefFile_refGene.pl
bosTau7_refGene.txt_OnlyChr.txt
perl MakeExonScanRegionDefFile_wGene_refGene.pl
bosTau7_refGene.txt_OnlyChr.txt
#cp PerlModules/scan_region.pl
.
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph
chmod 777 bigWigToBedGraph
###
These steps will take considerable compute time
wget http://hgdownload.cse.ucsc.edu/gbdb/bosTau7/bbi/gc5Base.bw
mv gc5Base.bw
bosTau7_gc5Base.bw
./bigWigToBedGraph bosTau7_gc5Base.bw
bosTau7_gc5Base.bedGraph
perl Polishgc5BaseForScanRegion_KeepOnlyChrLines_5120Increments.pl
bosTau7_gc5Base.bedGraph
###
Illegal division by zero at
Polishgc5BaseForScanRegion_KeepOnlyChrLines_5120Increments.pl line 30,
<SAMPSHEET> line 358024431.# so no Y
awk -F"\t"
'{print $4"\t"$1"\t+\t"$2"\t"$3"\t \t \t \t
\t \t \t "}' bosTau7_gc5Base.bedGraph_OnlyChr_5120Intervals.txt >
bosTau7_gc5Base_SimFormat_AllCol.sorted ### Now unique values, rand() was added
before to $4 GC to allow duplicates in scan region since original 5 base pair
interval values are only 20,40,60,80
###perl Polishgc5BaseForScanRegion_KeepOnlyChrLines.pl
bosTau7_gc5Base_SimFormat_AllCol.sorted_randDec
rm bosTau7_gc5Base.bw; rm bosTau7_gc5Base.bedGraph; rm bosTau7_gc5Base.bedGraph_OnlyChr_5120Intervals.txt
gunzip cytoBandIdeo.txt.gz; mv cytoBandIdeo.txt bosTau7_cytoBandIdeo.txt
awk -F"\t" '{print
$1$4"\t"$1"\t+\t"$2"\t"$3"\t \t \t \t \t \t
\t "}' bosTau7_cytoBandIdeo.txt | sed 's/chr//' > bosTau7_cytoBand_SimFormat_AllCol.sorted
mv cow4GenomicSuperDupgenomicSuperDup.tab bosTau7_cow4GenomicSuperDupgenomicSuperDup.tab
sort -k 1,1 -k 2,2n
bosTau7_cow4GenomicSuperDupgenomicSuperDup.tab >
bosTau7_cow4GenomicSuperDupgenomicSuperDup.tab.sorted #Sort By Chr Pos
awk '{print
$7Ó:Ó$8"\t"$1"\t"$6"\t"$2"\t"$3"\t
\t \t \t \t \t \t "}'
bosTau7_cow4GenomicSuperDupgenomicSuperDup.tab.sorted >
bosTau7_genomicSuperDups_SimFormat_AllCol.sorted
gunzip Btau_4.6.1.remap.all.germline.ucsc.gvf.gz; mv Btau_4.6.1.remap.all.germline.ucsc.gvf bosTau7_Btau_4.6.1.remap.all.germline.ucsc.gvf
grep -v ^#
bosTau7_Btau_4.6.1.remap.all.germline.ucsc.gvf | sed 's/ID=//' | sed
's/;Name/\t/' | sed 's/copy_number_/CN/Õ | sed 's/variation/var/Õ | awk '{print
$9Ó:Ó$3"\t"$1"\t+\t"$4"\t"$5"\t \t \t \t \t
\t \t "}' > bosTau7_dgv_SimFormat_AllCol.sorted
Download susScr3 Pig definition
files by pasting these commands:
cd GeneRef
wget http://hgdownload.cse.ucsc.edu/goldenPath/susScr3/database/refGene.txt.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/susScr3/database/refLink.txt.gz
###wget http://hgdownload.cse.ucsc.edu/goldenPath/susScr3/database/gc5BaseBw.txt.gz###
just has path below
wget http://hgdownload.cse.ucsc.edu/gbdb/susScr3/bbi/gc5Base.bw
###Use hg19 Pathways files by
querying TOUPPER
of all genes
wget http://hgdownload.cse.ucsc.edu/goldenPath/susScr3/database/cytoBandIdeo.txt.gz
wget
http://www.nature.com/nature/journal/v491/n7424/extref/nature11622-s3.xls ### xls2csv and xls2txt could work but just include in downloads <susScr3_PigSegDups_GroenenMAetalNature2012SupTable16.txt>
wget
ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Sus_scrofa/by_assembly/Sscrofa3/gvf/Sscrofa3.submitted.all.germline.ucsc.gvf.gz
### < ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Sus_scrofa/by_assembly/Sscrofa10.2/gvf/Sscrofa10.2.remap.all.germline.ucsc.gvf.gz
>
gunzip refGene.txt.gz; mv refGene.txt
susScr3_refGene.txt
gunzip refLink.txt.gz; mv refLink.txt
susScr3_refLink.txt
###gunzip gc5BaseBw.txt.gz; mv gc5BaseBw.txt
susScr3_gc5BaseBw.txt
perl PolishRefGeneForScanRegion_KeepOnlyChrLines.pl
susScr3_refGene.txt ### to delete chromosome ÒGL892101-1Ó...ÓGL896661-1Ó and
ÒJH118404-1Ó...ÓJH118996-1Ó (or add ÒchrÓ but input not accepted)
perl MakeExonScanRegionDefFile_refGene.pl
susScr3_refGene.txt_OnlyChr.txt
perl MakeExonScanRegionDefFile_wGene_refGene.pl
susScr3_refGene.txt_OnlyChr.txt
#cp PerlModules/scan_region.pl
.
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph
chmod 777 bigWigToBedGraph
###
These steps will take considerable compute time
wget http://hgdownload.cse.ucsc.edu/gbdb/susScr3/bbi/gc5Base.bw
mv gc5Base.bw
susScr3_gc5Base.bw
./bigWigToBedGraph susScr3_gc5Base.bw
susScr3_gc5Base.bedGraph
perl Polishgc5BaseForScanRegion_KeepOnlyChrLines_5120Increments.pl
susScr3_gc5Base.bedGraph
awk -F"\t"
'{print $4"\t"$1"\t+\t"$2"\t"$3"\t \t \t \t
\t \t \t "}' susScr3_gc5Base.bedGraph_OnlyChr_5120Intervals.txt >
susScr3_gc5Base_SimFormat_AllCol.sorted ### Now unique values, rand() was added
before to $4 GC to allow duplicates in scan region since original 5 base pair
interval values are only 20,40,60,80
###perl Polishgc5BaseForScanRegion_KeepOnlyChrLines.pl
susScr3_gc5Base_SimFormat_AllCol.sorted_randDec
rm susScr3_gc5Base.bw; rm susScr3_gc5Base.bedGraph; rm susScr3_gc5Base.bedGraph_OnlyChr_5120Intervals.txt
gunzip cytoBandIdeo.txt.gz; mv cytoBandIdeo.txt susScr3_cytoBandIdeo.txt
awk -F"\t" '{print
$1$4"\t"$1"\t+\t"$2"\t"$3"\t \t \t \t \t \t
\t "}' susScr3_cytoBandIdeo.txt | sed 's/chr//' |
grep chr > susScr3_cytoBand_SimFormat_AllCol.sorted
grep -v ^# susScr3_PigSegDups_GroenenMAetalNature2012SupTable16.txt
| sort -k 1,1 -k 2,2n > susScr3_PigSegDups_GroenenMAetalNature2012SupTable16.txt.sorted #Sort By Chr Pos
awk '{print
$1Ó:Ó$2"\tchr"$1"\t+\t"$2"\t"$3"\t \t \t \t
\t \t \t "}' susScr3_PigSegDups_GroenenMAetalNature2012SupTable16.txt.sorted >
susScr3_genomicSuperDups_SimFormat_AllCol.sorted
gunzip Sscrofa3.submitted.all.germline.ucsc.gvf.gz; mv Sscrofa3.submitted.all.germline.ucsc.gvf
susScr3_Sscrofa3.submitted.all.germline.ucsc.gvf
grep -v ^# susScr3_Sscrofa3.submitted.all.germline.ucsc.gvf
| sed 's/ID=//' | sed 's/;Name/\t/' | sed 's/copy_number_/CN/Õ | sed
's/variation/var/Õ | sort -k 1,1 -k 4,4n | awk '{print
$9Ó:Ó$3"\t"$1"\t+\t"$4"\t"$5"\t \t \t \t \t
\t \t "}' > susScr3_dgv_SimFormat_AllCol.sorted
Download dm6 Fly definition files by
pasting these commands:
cd GeneRef
wget
http://hgdownload.cse.ucsc.edu/goldenPath/dm6/database/refGene.txt.gz
wget http://hgdownload.cse.ucsc.edu/goldenPath/dm6/database/refLink.txt.gz
###wget
http://hgdownload.cse.ucsc.edu/goldenPath/dm6/database/gc5BaseBw.txt.gz
wget
http://hgdownload.cse.ucsc.edu/gbdb/dm6/bbi/gc5BaseBw/gc5Base.bw
###Use hg19 Pathways files by querying
TOUPPER of all genes
wget
http://hgdownload.cse.ucsc.edu/goldenPath/dm6/database/cytoBandIdeo.txt.gz
wget
http://humanparalogy.gs.washington.edu/dm3/data/dm3genomicSuperDup.tab ###Needs
Liftover as Instructed Below###NoHeader SeeHeader in http://paralogy.gs.washington.edu/build37/data/GRCh37GenomicSuperDup.tab
wget
ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Drosophila_melanogaster/by_assembly/Release_6_plus_ISO1_MT/gvf/Release_6_plus_ISO1_MT.remap.all.germline.ucsc.gvf.gz
gunzip refGene.txt.gz; mv refGene.txt
dm6_refGene.txt
gunzip refLink.txt.gz; mv refLink.txt
dm6_refLink.txt
###gunzip gc5BaseBw.txt.gz; mv gc5BaseBw.txt
dm6_gc5BaseBw.txt
cp
../PolishRefGeneForScanRegion_KeepOnlyChrLines.pl .
perl
PolishRefGeneForScanRegion_KeepOnlyChrLines.pl dm6_refGene.txt ### to delete
non ÒchrÓ lines
perl MakeExonScanRegionDefFile_refGene.pl
dm6_refGene.txt_OnlyChr.txt
perl
MakeExonScanRegionDefFile_wGene_refGene.pl dm6_refGene.txt_OnlyChr.txt
#cp PerlModules/scan_region.pl .
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph
chmod 777 bigWigToBedGraph
### These steps will take considerable
compute time
mv gc5Base.bw dm6_gc5Base.bw
./bigWigToBedGraph dm6_gc5Base.bw
dm6_gc5Base.bedGraph
cp
../Polishgc5BaseForScanRegion_KeepOnlyChrLines_5120Increments.pl .
perl
Polishgc5BaseForScanRegion_KeepOnlyChrLines_5120Increments.pl
dm6_gc5Base.bedGraph
### Illegal division by zero at
Polishgc5BaseForScanRegion_KeepOnlyChrLines_5120Increments.pl line 30,
<SAMPSHEET> line 358024431.# so no Y
awk -F"\t" '{print
$4"\t"$1"\t+\t"$2"\t"$3"\t \t \t \t \t \t \t
"}' dm6_gc5Base.bedGraph_OnlyChr_5120Intervals.txt >
dm6_gc5Base_SimFormat_AllCol.sorted ### Now unique values, rand() was added
before to $4 GC to allow duplicates in scan region since original 5 base pair
interval values are only 20,40,60,80
###perl
Polishgc5BaseForScanRegion_KeepOnlyChrLines.pl
dm6_gc5Base_SimFormat_AllCol.sorted_randDec
rm dm6_gc5Base.bw; rm dm6_gc5Base.bedGraph;
rm dm6_gc5Base.bedGraph_OnlyChr_5120Intervals.txt
gunzip
cytoBandIdeo.txt.gz; mv cytoBandIdeo.txt dm6_cytoBandIdeo.txt
awk -F"\t" '{print
$1$4"\t"$1"\t+\t"$2"\t"$3"\t \t \t \t \t \t
\t "}' dm6_cytoBandIdeo.txt | sed 's/chr//' >
dm6_cytoBand_SimFormat_AllCol.sorted
awk '{print
$1"\t"$2"\t"$3"\t"$6"#"$7"#"$8}'
dm3genomicSuperDup.tab > dm3genomicSuperDup.tab_ForLiftover.txt
### Go to
https://genome.ucsc.edu/cgi-bin/hgLiftOver Original Assembly: dm3 New Assembly:
dm6 Click Choose File Click Upload File Click View Conversions
mv hglft_genome_*.bed dm6_genomicSuperDup.tab
sort -k 1,1 -k 2,2n dm6_genomicSuperDup.tab |
sed 's/#/\t/g' > dm6_genomicSuperDup.tab.sorted #Sort By Chr Pos
awk '{print
$5":"$6"\t"$1"\t"$4"\t"$2"\t"$3"\t
\t \t \t \t \t \t "}' dm6_genomicSuperDup.tab.sorted >
dm6_genomicSuperDups_SimFormat_AllCol.sorted
gunzip Release_6_plus_ISO1_MT.remap.all.germline.ucsc.gvf.gz;
mv Release_6_plus_ISO1_MT.remap.all.germline.ucsc.gvf
dm6_Release_6_plus_ISO1_MT.remap.all.germline.ucsc.gvf
grep -v ^#
dm6_Release_6_plus_ISO1_MT.remap.all.germline.ucsc.gvf | sed 's/ID=//' | sed 's/;Name/\t/'
| sed 's/copy_number_/CN/' | sed 's/variation/var/' | awk '{print
$9":"$3"\t"$1"\t+\t"$4"\t"$5"\t \t
\t \t \t \t \t "}' > dm6_dgv_SimFormat_AllCol.sorted
cp hg19_bioCycPathway.txt
dm6_bioCycPathway.txt
cp hg19_cgapAlias.txt dm6_cgapAlias.txt
cp hg19_cgapBiocDesc.txt dm6_cgapBiocDesc.txt
cp hg19_cgapBiocPathway.txt
dm6_cgapBiocPathway.txt
cp hg19_keggMapDesc.txt dm6_keggMapDesc.txt
cp hg19_keggPathway.txt dm6_keggPathway.txt
Download canFam2 Dog definition
files by pasting these commands:
cd GeneRef
wget
http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/database/refGene.txt.gz
wget
http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/database/refLink.txt.gz
wget
http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/database/gc5Base.txt.gz
###Use hg19 Pathways files by querying
TOUPPER of all genes
###No Cytobands! wget
http://hgdownload.cse.ucsc.edu/goldenPath/canFam2/database/cytoBandIdeo.txt.gz
wget
http://humanparalogy.gs.washington.edu/canFam2seqdup/canFam2_wgac
wget ftp://ftp.ncbi.nlm.nih.gov/pub/dbVar/data/Canis_lupus/by_assembly/CanFam2.0/gvf/CanFam2.0.submitted.all.germline.ucsc.gvf.gz
gunzip refGene.txt.gz; mv refGene.txt
canFam2_refGene.txt
gunzip refLink.txt.gz; mv refLink.txt
canFam2_refLink.txt
###gunzip gc5BaseBw.txt.gz; mv gc5BaseBw.txt canFam2_gc5BaseBw.txt
cp
../PolishRefGeneForScanRegion_KeepOnlyChrLines.pl .
perl
PolishRefGeneForScanRegion_KeepOnlyChrLines.pl canFam2_refGene.txt ### to
delete non ÒchrÓ lines
perl MakeExonScanRegionDefFile_refGene.pl
canFam2_refGene.txt_OnlyChr.txt
perl
MakeExonScanRegionDefFile_wGene_refGene.pl canFam2_refGene.txt_OnlyChr.txt
#cp PerlModules/scan_region.pl .
gunzip gc5Base.txt.gz; mv gc5Base.txt
canFam2_gc5Base.txt
awk -F"\t" '{print $1"\t"$2"\t+\t"$3"\t"$4"\t
\t \t \t \t \t \t "}' canFam2_gc5Base.txt >
canFam2_gc5Base_SimFormat_AllCol.sorted ### Now unique values, rand() was added
before to $4 GC to allow duplicates in scan region since original 5 base pair
interval values are only 20,40,60,80
###perl
Polishgc5BaseForScanRegion_KeepOnlyChrLines.pl canFam2_gc5Base_SimFormat_AllCol.sorted_randDec
rm canFam2_gc5Base.txt
sort -k 1,1 -k 2,2n canFam2_wgac | grep -v chrom
> canFam2_genomicSuperDup.tab.sorted #Sort By Chr Pos
awk '{print $4":"$5"\t"$1"\t+\t"$2"\t"$3"\t
\t \t \t \t \t \t "}' canFam2_genomicSuperDup.tab.sorted >
canFam2_genomicSuperDups_SimFormat_AllCol.sorted
gunzip
CanFam2.0.submitted.all.germline.ucsc.gvf.gz; mv
CanFam2.0.submitted.all.germline.ucsc.gvf
canFam2_CanFam2.0.submitted.all.germline.ucsc.gvf
grep -v ^# canFam2_CanFam2.0.submitted.all.germline.ucsc.gvf
| sed 's/ID=//' | sed 's/;Name/\t/' | sed 's/copy_number_/CN/' | sed
's/variation/var/' | awk '{print
$9":"$3"\t"$1"\t+\t"$4"\t"$5"\t \t
\t \t \t \t \t "}' > canFam2_dgv_SimFormat_AllCol.sorted
cp hg19_bioCycPathway.txt canFam2_bioCycPathway.txt
cp hg19_cgapAlias.txt canFam2_cgapAlias.txt
cp hg19_cgapBiocDesc.txt canFam2_cgapBiocDesc.txt
cp hg19_cgapBiocPathway.txt canFam2_cgapBiocPathway.txt
cp hg19_keggMapDesc.txt canFam2_keggMapDesc.txt
cp hg19_keggPathway.txt canFam2_keggPathway.txt
cp hg19_cytoBand_SimFormat_AllCol.sorted
canFam2_cytoBand_SimFormat_AllCol.sorted
cp hg19_cytoBand.txt canFam2_cytoBandIdeo.txt
TestConcordance.pl compares pairs of CNV calls.
The pair can be PennCNV vs. Nexus of the same sample,
Mother vs. Child, Tumor vs. Blood, Transplant Donor vs.
Recipient, technical replicate 1 vs. technical replicate 2.
The input is calls in PennCNV format with confidence
column and Algorithm column, sorted by ID and algorithm. You also need to provide
a map file for all probes being considered. Make sure the map is sorted by
position and chromosome (sort -n -k 4,4 file.map | sort -s > file_Sorted.map).
Multiple CNV Detection Comparison
and De Novo CNV Calls Annotation
ArrayConcordance_InheritanceDenovo_AnnotateCNVCalls
Folder Scripts
COMMAND:
perl PQN_Concord_Inh.pl Cases_Denovo.rawcnv
Cases_Denovo_Q.rawcnv Cases_Denovo_N.rawcnv Cases_Denovo.fam ChrSnp0Pos.map
--cluster
You need a parallel compute cluster you can qsub to
use --cluster option, otherwise run without this option to run in serial
fashion.
INPUTS:
Make sure to enter a blank line at the end.
Cases_Denovo.rawcnv
chr1:1-3
numsnp=3 length=3
state1,cn=0 Mom.baflrr startsnp=rs1 endsnp=rs3 conf=10 PennCNV
chr1:4-7
numsnp=4 length=4
state5,cn=3 Dad.baflrr startsnp=rs4 endsnp=rs7 conf=10 PennCNV
chr1:1-3
numsnp=3 length=3
state1,cn=0 Pro.baflrr startsnp=rs1 endsnp=rs3 conf=10 PennCNV
chr1:4-7
numsnp=4 length=4
state5,cn=3 Pro.baflrr startsnp=rs4 endsnp=rs7 conf=10 PennCNV
chr1:8-10 numsnp=3 length=3 state5,cn=3 Pro.baflrr startsnp=rs8
endsnp=rs10 conf=10 PennCNV
Cases_Denovo_Q.rawcnv
chr1:1-3
numsnp=3 length=3
state1,cn=0 Mom.baflrr startsnp=rs1 endsnp=rs3 conf=10 QuantiSNP
chr1:4-7
numsnp=4 length=4 state5,cn=3
Dad.baflrr startsnp=rs4 endsnp=rs7 conf=10 QuantiSNP
chr1:1-3
numsnp=3 length=3
state1,cn=0 Pro.baflrr startsnp=rs1 endsnp=rs3 conf=10 QuantiSNP
chr1:4-7
numsnp=4 length=4
state5,cn=3 Pro.baflrr startsnp=rs4 endsnp=rs7 conf=10 QuantiSNP
chr1:8-9
numsnp=2 length=2
state5,cn=3 Pro.baflrr startsnp=rs8 endsnp=rs9 conf=10 QuantiSNP
Cases_Denovo_N.rawcnv
chr1:1-3
numsnp=3 length=3
state1,cn=0 Mom.baflrr startsnp=rs1 endsnp=rs3 conf=10 Nexus
chr1:4-7
numsnp=4 length=4
state5,cn=3 Dad.baflrr startsnp=rs4 endsnp=rs7 conf=10 Nexus
chr1:1-3
numsnp=3 length=3
state1,cn=0 Pro.baflrr startsnp=rs1 endsnp=rs3 conf=10 Nexus
chr1:4-7
numsnp=4 length=4
state5,cn=3 Pro.baflrr startsnp=rs4 endsnp=rs7 conf=10 Nexus
chr1:8-9
numsnp=2 length=2 state5,cn=3
Pro.baflrr startsnp=rs8 endsnp=rs9 conf=10 Nexus
Cases_Denovo.fam
FID IID FatID MotID Gend Aff
1 Mom.baflrr 0 0 0 1
1 Dad.baflrr 0 0 0 1
1 Pro.baflrr Dad.baflrr Mom.baflrr 0 2
ChrSnp0Pos.map
Chr Snp Cm Pos
1 rs1 0 1
1 rs2 0 2
1 rs3 0 3
1 rs4 0 4
1 rs5 0 5
1 rs6 0 6
1 rs7 0 7
1 rs8 0 8
1 rs9 0 9
1 rs10 0 10
OUTPUTS:
The
program does not currently generate new overlapping
segments consensus CNV calls from multiple algorithm comparison for further
analysis. I have used PennCNV calls that had good overlap in QuantiSNP rather
than generating an overlapping segments consensus call. Many CNV calling
algorithms are conservative in the boundary calls, especially if you did not
run the PennCNV clean_cnv.pl script to merge fragmented CNVs.
Make sure you look at either PvsQ or QvsP CallMatching and MatchSummary, looking at both will be confusing. For the QvsP output, the first 9 columns refer to the QuantiSNP calls. The next 6 columns refer to the overlap profile of the QuantiSNP call with the PennCNV call. The next 9 columns starting with RefCallMatchingCN_CNVBoundariesChrStartStop refer to the overlapping PennCNV call with the same CN contributing to the ÒhitsnpÓ column. The next 9 columns starting with RefCallDiffCN_CNVBoundariesChrStartStop refer to the overlapping PennCNV call with a different CN contributing to the ÒhitDiffCNÓ column.
The columns with numbers in the QvsP.txtCNVMatchSummary.txt file are the hitDiffCN mismatches. For example, 01 is the count of QuaniSNP calls that were CN=0 but PennCNV called as CN=1.
PennCNV_ConcordInh_NoROH.txt
We see full overlap of sample level rawcnv calls of
the three algorithms PennCNV, QuanitSNP, and Nexus with reference to the
PennCNV calls (indicated by ÒvsPÓ), except PennCNV chr1:8-10 where the other
algorithms called chr1:8-9 but this
still is a consistently detected CNV.
We see full inheritance reporting with inheritance
from mother chr1:1-3, inheritance from father chr1:4-7, and de novo chr1:8-10
for the proband CNV calls. We see inheritance to proband chr1:1-3 for the
mother CNV call. We see inheritance to proband chr1:4-7 for the father CNV
call. In the case of a CNV overlapping but with a different CN state between
family members, similar columns are reported with an additional column
specifying the different CN state. Certainly CN 3 and 4 seems the most typical
permissible different CN to allow as a match but others may indicate abnormal
noise patterns.
CNV Boundaries Chr StartStop
Hg19 |
numsnp |
Length (bp) |
CN |
ChipID |
startsnp |
endsnp |
confidence |
algorithm |
50% PennCNV Overlap |
Freq PennCNV Overlap |
50% QuantiSNP Overlap |
Freq QuantiSNP Overlap |
50% Nexus Overlap |
Freq Nexus Overlap |
BlindedID |
Inherit From Mother 50% Overlap |
Freq Inherit From Mother Overlap |
Inherit From Father 50% Overlap |
Freq Inherit From Father Overlap |
Inherit To Proband 50% Overlap |
Freq Inherit To Proband Overlap |
Inherit From Mother 50% Overlap
Diff CN |
Freq Inherit From Mother Overlap
Diff CN |
Inherit From Mother Diff CN |
Inherit From Father 50% Overlap
Diff CN |
Freq Inherit From Father Overlap
Diff CN |
Inherit From Father Diff CN |
Inherit To Proband 50% Overlap
Diff CN |
Freq Inherit To Proband Overlap
Diff CN |
Inherit To Proband Diff CN |
MFP Trio State (D=Diploid,
2=LOH, N=NA) |
Denovo Call |
Gene |
Distance |
chr1:1-3 |
3 |
3 |
0 |
Pro.baflrr |
rs1 |
rs3 |
10 |
PennCNV |
Y |
1 |
Y |
1 |
Y |
1 |
Pro.baflrr |
Y |
1 |
N |
0 |
na |
na |
N |
0 |
. |
N |
0 |
. |
na |
na |
na |
0D0 |
NotDenovo |
DDX11L1,DDX11L9 |
11871 |
chr1:4-7 |
4 |
4 |
3 |
Pro.baflrr |
rs4 |
rs7 |
10 |
PennCNV |
Y |
1 |
Y |
1 |
Y |
1 |
Pro.baflrr |
N |
0 |
Y |
1 |
na |
na |
N |
0 |
. |
N |
0 |
. |
na |
na |
na |
D33 |
NotDenovo |
DDX11L1,DDX11L9 |
11867 |
chr1:8-10 |
3 |
3 |
3 |
Pro.baflrr |
rs8 |
rs10 |
10 |
PennCNV |
Y |
1 |
Y |
0.667 |
Y |
0.667 |
Pro.baflrr |
N |
0 |
N |
0 |
na |
na |
N |
0 |
. |
N |
0 |
. |
na |
na |
na |
DD3 |
Denovo |
DDX11L1,DDX11L9 |
11864 |
chr1:1-3 |
3 |
3 |
0 |
Mom.baflrr |
rs1 |
rs3 |
10 |
PennCNV |
Y |
1 |
Y |
1 |
Y |
1 |
Mom.baflrr |
na |
na |
na |
na |
Y |
1 |
na |
na |
na |
na |
na |
na |
N |
0 |
. |
NNN |
NotDenovo |
DDX11L1,DDX11L9 |
11871 |
chr1:4-7 |
4 |
4 |
3 |
Dad.baflrr |
rs4 |
rs7 |
10 |
PennCNV |
Y |
1 |
Y |
1 |
Y |
1 |
Dad.baflrr |
na |
na |
na |
na |
Y |
1 |
na |
na |
na |
na |
na |
na |
N |
0 |
. |
NNN |
NotDenovo |
DDX11L1,DDX11L9 |
11867 |
changing Cases_Denovo.rawcnv
chr1:1-3
numsnp=3 length=3
state1,cn=0 Mom.baflrr startsnp=rs1 endsnp=rs3 conf=10 PennCNV
to
chr1:1-3
numsnp=3 length=3
state2,cn=1 Mom.baflrr startsnp=rs1 endsnp=rs3 conf=10 PennCNV
and reviewing the output PennCNV_ConcordInh_NoROH.txt
50%QuantiSNPOverlap and 50%NexusOverlap become N
since the CN state is different.
InheritFromMother50%Overlap becomes N and
InheritFromMother50%OverlapDiffCN becomes Y with InheritFromMotherDiffCN 1 for
the proband CNV call.
InheritToProband50%Overlap becomes N and
InheritToProband50%OverlapDiffCN becomes Y with InheritToProbandDiffCN 0 for
the mother CNV call.
CNV Boundaries Chr
StartStop Hg19 |
numsnp |
Length (bp) |
CN |
ChipID |
startsnp |
endsnp |
confidence |
algorithm |
50% PennCNV Overlap |
Freq PennCNV Overlap |
50% QuantiSNP Overlap |
Freq QuantiSNP Overlap |
50% Nexus Overlap |
Freq Nexus Overlap |
BlindedID |
Inherit From Mother 50%
Overlap |
Freq Inherit From Mother
Overlap |
Inherit From Father 50%
Overlap |
Freq Inherit From Father
Overlap |
Inherit To Proband 50%
Overlap |
Freq Inherit To Proband
Overlap |
Inherit From Mother 50%
Overlap Diff CN |
Freq Inherit From Mother
Overlap Diff CN |
Inherit From Mother Diff
CN |
Inherit From Father 50%
Overlap Diff CN |
Freq Inherit From Father
Overlap Diff CN |
Inherit From Father Diff
CN |
Inherit To Proband 50%
Overlap Diff CN |
Freq Inherit To Proband
Overlap Diff CN |
Inherit To Proband Diff CN |
MFP Trio State (D=Diploid,
2=LOH, N=NA) |
Denovo Call |
Gene |
Distance |
chr1:1-3 |
3 |
3 |
0 |
Pro.baflrr |
rs1 |
rs3 |
10 |
PennCNV |
Y |
1 |
Y |
1 |
Y |
1 |
Pro.baflrr |
N |
0 |
N |
0 |
na |
na |
Y |
1 |
1 |
N |
0 |
. |
na |
na |
na |
1D0 |
NotDenovo |
DDX11L1,DDX11L9 |
11871 |
chr1:4-7 |
4 |
4 |
3 |
Pro.baflrr |
rs4 |
rs7 |
10 |
PennCNV |
Y |
1 |
Y |
1 |
Y |
1 |
Pro.baflrr |
N |
0 |
Y |
1 |
na |
na |
N |
0 |
. |
N |
0 |
. |
na |
na |
na |
D33 |
NotDenovo |
DDX11L1,DDX11L9 |
11867 |
chr1:8-10 |
3 |
3 |
3 |
Pro.baflrr |
rs8 |
rs10 |
10 |
PennCNV |
Y |
1 |
Y |
0.667 |
Y |
0.667 |
Pro.baflrr |
N |
0 |
N |
0 |
na |
na |
N |
0 |
. |
N |
0 |
. |
na |
na |
na |
DD3 |
Denovo |
DDX11L1,DDX11L9 |
11864 |
chr1:1-3 |
3 |
3 |
1 |
Mom.baflrr |
rs1 |
rs3 |
10 |
PennCNV |
Y |
1 |
N |
0 |
N |
0 |
Mom.baflrr |
na |
na |
na |
na |
N |
0 |
na |
na |
na |
na |
na |
na |
Y |
1 |
0 |
NNN |
NotDenovo |
DDX11L1,DDX11L9 |
11871 |
chr1:4-7 |
4 |
4 |
3 |
Dad.baflrr |
rs4 |
rs7 |
10 |
PennCNV |
Y |
1 |
Y |
1 |
Y |
1 |
Dad.baflrr |
na |
na |
na |
na |
Y |
1 |
na |
na |
na |
na |
na |
na |
N |
0 |
. |
NNN |
NotDenovo |
DDX11L1,DDX11L9 |
11867 |
PennCNV_ConcordInh_NoROH_Summary.txt
Sample based counts
ChipID |
BlindedID |
Total Calls |
PennCNV Overlap |
QuantiSNP Overlap |
Nexus Overlap |
Three Overlap |
Two Overlap |
One Overlap |
Inherit From Mother |
Inherit From Father |
Inherit To Proband |
Denovo |
Inherit From Mother And Father
Same CN |
Inherit From Mother And Father
Diff CN |
MFP=110 |
MFP=334 |
Three Overlap And Inh |
Two Overlap And Inh |
One Overlap And Inh |
Three Overlap And Inh Under
Fiftykb |
Three Overlap And Inh Fifty
To Two Hundred kb |
Three Overlap And Inh Above
Two Hundred kb |
Two Overlap And Inh Under Fifty
kb |
Two Overlap And Inh Fifty To
Two Hundred kb |
Two Overlap And Inh Above Two
Hundred kb |
One Overlap And Inh Under Fifty
kb |
One Overlap And Inh Fifty To
Two Hundred kb |
One Overlap And Inh Above Two
Hundred kb |
Family Status |
Pro.baflrr |
Pro.baflrr |
3 |
3 |
3 |
3 |
3 |
0 |
0 |
1 |
1 |
0 |
1 |
0 |
0 |
0 |
0 |
2 |
0 |
0 |
2 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
trio |
Mom.baflrr |
Mom.baflrr |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
trio |
Dad.baflrr |
Dad.baflrr |
1 |
1 |
1 |
1 |
1 |
0 |
0 |
0 |
0 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
trio |
|
|
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
duo |
|
|
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
single |
De novo Prioritization
Trio Recall PennCNV
At least 2 algorithms call de novo (P, Q, N) (do not
exclude parent CNV called by 1 algorithm that is inherited to child CNV called
by 2 algorithms as more de novos will result)
Parental Origin low p (if enough informative markers)
Greater than 5 SNPs
Low/absent untransmitted rates
Low/absent Control (DGV, SSC Healthy Trios, CHOP
Control, Framingham)
BAF/LRR Review (full trio in case FN Parent)
ddPCR Validation
For each study:
ParseCNV --includePed
Plink --covar .pca.evec
PennCNV scan_region SNP based .stats results cohort A→closest
SNP results cohort B
METAL to generate meta-analysis p-value SNP based
InsertPlinkPValues Input is general tab delimited P, SNP, and one of
BETA/T/OR header present.
You
can run for each study to get study specific red flag confidence scores or run
ParseCNV and Plink again on the combined data to use as input for combined
study red flags
ParseCNV generates the output *CNV_Brief.stats which
is the SNP based CNV statistics which can be used for meta-analysis using
METAL. Since you have different arrays, the intersection SNP set will be low so
I would use the gene name as the marker rather than the SNP name. Using PennCNV
scan_region.pl with input chrX:start-stop for each SNP will annotate the
closest gene to each SNP. Then sort -g in unix to sort by p-value so best
(lowest) p-value is first since that is the one METAL will use.
run ParseCNV for each study to generate the p-values
for each study separately.
You need to rearrange *CNV_Brief.stats into a
scan_region.pl input using:
awk '{print "chr"$2":"$3"-"$3}'
CNV_Brief.stats | sed '1d'
or more specifically with all the other columns:
awk '{print
"chr"$2":"$3"-"$3"\t"$1"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10"\t"$11}'
CNV_Brief.stats | sed '1d'
scan_region.pl wants no header (sed '1d') while METAL
needs the header.
The analysis looks like this:
./metal
SEPARATOR
TAB
MARKER Gene
EFFECT log(ORDel)
PVAL DelTwoTailed
PROCESS File1_ForScanRegion_GeneExon_SortDelP.txt
PROCESS File2_ForScanRegion_GeneExon_SortDelP.txt
....
ANALYZE
and again for Dups:
EFFECT log(ORDup)
PVAL DupTwoTailed
PROCESS File1_ForScanRegion_GeneExon_SortDupP.txt
PROCESS File2_ForScanRegion_GeneExon_SortDupP.txt
....
ANALYZE
Convert CNV Calls Formats from
Different Algorithms
PennCNV is the motivating input. The PennCNV input is
very human readable, widely used and informative. CNV .vcf is widely used in
sequencing applications, but not human readable. Therefore, the idea is to
convert CNVs called by different algorithms into PennCNV format with a final
column stating the algorithm for tracking. In most cases, the CNV calls file
columns and coding can be edited relatively easily in Excel. The main exception
is BirdSuite and GenomeStrip .vcf matrix formats. For these common yet very
different formats, use ConvertVCFToPennCNV.pl. For the commonly used QuantiSNP
conversion use ConvertQuantiSNPToPennCNV.pl.
PennCNV, QuantiSNP, Nexus, IlluminaGenomeStudio,
TCGA, AgilentGenomicWorkbench/Cytogenomics, AffymetrixGenotypingConsole,
BirdSuite, CGHscape, NimbleScan, R-GADA, Conifer, XHMM, Plink .cnv file
cnvPartition, GenomeStrip .vcf DEL DUP INS INV CNV
Similar settings should be used between algorithms to
ensure comparability such as using GC base content correction.
Nexus ÒMin RegionÓ should be used not ÒChromosome
RegionÓ since that is the average position between probes. AgilentGenomicWorkbench/Cytogenomics
has the sample ID listed only once as a header of the CNV calls subsections.
ConvertQuantiSNPToPennCNV.pl
ConvertVCFToPennCNV_XHMM.pl
ConvertVCFToPennCNV_GenomeStrip.pl
Consider CallRate, LRR_SD, GCWF, CountCNV, PCA, and
PI_HAT as the most important sample quality metrics. CallRate and LRR_SD are
widely used and critical to evaluate. The problem is these metrics are derived
from various sources, even if you use SNP arrays with PennCNV, only LRR_SD and
GCWF are provided in the PennCNV .log. CountCNV is automatically figured out by
the .rawcnv files. CallRate, PCA, and PI_HAT are based on genotyping A/T/C/G,
not CNVs. For CallRate, you can provide Plink --missing .imiss, Illumina LIMs
Project Detail Report .csv, or Illumina Genome Studio Samples Table. PCA
population stratification components can be Eigenstrat smartPCA .pca.evec or
Plink .mds. PI_HAT is from Plink --genome .genome file. PCA is not
automatically used to create IDs to remove, which would be done for a study
where almost all samples are tightly clustered and a few outliers are then
removed. Typically the PCA shows admixture and various ethnicities and
population stratification correction without sample removal based on PCA is the
best approach.
For Illumina 550k data, key data quality metric
thresholds we have observed are: call rate > 98%, SD LRR < 0.3, |GCWF|
< 0.05, and count CNV< 100.
For Affymetrix 6.0 data, these measures include: call
rate > 96%, SD LRR < 0.35, |GCWF| < 0.02, and count CNV < 80.
Inputs: (all are recommended but not
required to run ParseCNV_QC.pl)
PennCNV .log
PennCNV .rawcnv
Plink --missing .imiss / Illumina LIMs Project Detail Report .csv /
Illumina Genome Studio Samples Table / General format: SampleID CallRate
Eigenstrat smartPCA .pca.evec / Plink .mds
Plink --genome .genome
perl ParseCNV_QC.pl --log PennCNV_Omni25.log --rawcnv
file.rawcnv --callrate PCGC_Omni25-8v1_BED_Miss.imiss --popstrat file.pca.evec
--related PCGC_Omni25-8v1_BED_GenomeAll.genome
Outputs:
See QC_Plot.pdf sample QC distributions with outliers
to remove in blue and corresponding QC_RemoveIDs.txt with ChipIDs failing the
various QC metrics. You may delete some rows from this file if you want a more
aggressive ChipID/Sample inclusion or if the threshold does not look correct on
the plot. Remove thresholds for each quality metric are determined by the R
package extremevalues function getOutliers. You can also use exclusions based
on other PennCNV log metrics provided in QC_Plot_2.pdf and QC_RemoveIDs_LRRmean_BAFmean_BAFSD_BAFDRIFT_WF.txt.
Then run this code found in the GeneRef folder to
create a "clean" .rawcnv file to use for ParseCNV association
perl FilterCNV.pl QC_RemoveIDs.txt CNVCalls.rawcnv 5
remove
UPDATE: The R getOutliers is not very dynamic in
selecting outliers based on a variety of possible data quality metric
distributions. This typically results in too many samples being excluded.
Again, look for linear mode for inclusion and exponential mode for exclusion.
This can be easily done in R or Excel. Once an appropriate threshold is
determined from reviewing the first run distributions or from a static know
threshold for consistency, ParseCNV_QC.pl can be run again specifying
thresholds to drive the plots and sample remove report.
Run once with no
thresholds specified and review automatically determined thresholds based on
outliers in your dataset.
Then you may
specify some adjusted threshold for a given quality metric based on your review
of the plots.
For example add: --qccallrate 0.98 --qclrrsd 0.2 --qcgcwf
0.005 –qcnumcnv 100
popstrat is just for plotting, no exclusion is
currently supported based on popstrat.
Example: running ParseCNV_QC again after reviewing
plots. See 146 vs. 49 samples excluded based on automatic vs. manual quality
metric thresholds (see bottom of plot Òunique fail any QCÓ):
perl ParseCNV_QC_3_ProvideFilterThresholds.pl --log
PennCNV_Omni1.log --rawcnv file.rawcnv --callrate PCGC_Omni25-8v1_BED_Miss.imiss
--popstrat file.pca.evec
perl ParseCNV_QC_3_ProvideFilterThresholds.pl --log
PennCNV_Omni1.log --rawcnv file.rawcnv --callrate
PCGC_Omni25-8v1_BED_Miss.imiss --popstrat file.pca.evec --qcnumcnv 700 --qcgcwf 0.005 --qclrrsd
0.2 --qccallrate 0.98
Bimodal metrics indicate batches, typically of different
chip versions
CNV Call Quality Control:
Removing CNV calls may introduce significant bias and
there are a limited number of metrics to make an informed decision. Probably
the best metric is the confidence score based on the cumulative probability
from the HMM. ParseCNV annotates average numsnp, length, and confidence of each
potential CNVR association which captures this quality/confidence issue with
less bias than upfront CNV call quality control.
Step by Step Processing Description:
CNV calls are mapped into SNP based statistics which are
then merged into CNVRs based on proximity (1Mb) and similar significance (power
of ten P value) of neighboring SNPs.
The most significant subregion is presented in the
case of multiple significant proximal CNVRs.
Deletion and Duplication p values are then calculated
by pooling 0 and 1 copy for deletion and 3 and 4 for duplication.
The runtime log lists the processing
steps:
scan_range2: Takes SNP based statistics and
collapses into CNVRs based on distance 1Mb and p-value one power of ten in
negative log.
scan_region: Takes CNVR and annotates nearest
UCSC gene (more inclusive than RefSeq genes) and proximal distance until gene
boundary (including introns, 0 is direct overlap). Definition files can be
updated or different builds used by downloading from UCSC Tables.
scan_DescPathway: Takes CNVR with gene and gives full
text description of gene name and pathway (many not_found in current version of
definition file).
scan_rangeJustPos2: Takes SNP based statistics and
gives SNP indexes based only on distance 1MB to limit redundancy of small
regions with different significance levels.
CNVToPed: --includePed since large file. Del
and Dup Ped files created for more statistics in Plink. In the Del Ped,
state1cn=0 → 1 1 and state2cn=1 → 1 2 other → 2 2. In the Dup
Ped, state5cn=3 → 1 2 state6cn=4 → 1 1 other → 2 2. This definition
closely resembles the CNV state frequencies (1 1=rare, 1 2=common, 2 2=very
common). Note many probes will be always diploid (no CNV calls) so NA will
result in Plink. Sorted.map created automatically should be used with the
resulting ped.
vlookup SNP to
Region ID: Add
column Region ID.
countBarcodeOCCURENCE_V2: Count Region ID occurrences.
vlookup Count Sig
Regions: Add
column Count Region ID.
Sort by p-value: Sort from Chr-Pos to low->high p-value.
Data-filter-advanced-unique
records only:
Include the first occurrence of each Region ID to filter out many significant
regions close together, keeping the lowest p-value occurrence.
Create UCSC Custom
Track For Review:
UCSCGenomeBrowser-Select Assembly-add custom tracks-Browse-Submit
NUMLines: Adds Tab delimited Index Column to
space delimited CNV calls (.rawcnv) files to access recorded contributing calls
to each association for back referencing trackability
ComponentDelCallsCases,ComponentDupCallsCases,ComponentDelCallsControls,ComponentDupCallsControls
PercentSamples-AverageNumsnpLenComponentCalls: Adds AverageNumsnps AverageLength
CNVRange for cases and controls for screening purposes
SpecificBafLrrAccessMany: Assumes all BafLrr files have
ProbeID as first column and have same probe sorted error, but can be different
from map. If Probe ID not first column, provide probe order reference file
--probesBafLrr <file.txt>. Needs additional SampleID to BafLrrFilePath file
specified on command line by --idToPath <file.txt>. Highly recommended if
BafLrr files available. Takes CNVRs and contributing sample IDs and creates
files with BAF and LRR values for each CNVR with all samples and a master file
with all CNVRs and signals in all samples with CNV contributing to association
in BafLrr Folder. Used for review of specific association region across many
samples.
Main output: CNVR_ALL_ReviewedCNVRs.txt Note results sorted by deletion p are
followed by results sorted by duplication p (See CNVType column).
Copy and paste text into
EXCEL to review output! Do not try viewing in shell terminal except perhaps
*_brief.txt.
For
big data, cells may overflow to the next line since the total number of
characters that a cell can contain is 32,767 characters. The Sample
ID columns are usually the problem. You will notice this problem if the first
column is not all chr:start-stop values. To fix this, add a column with an
index, sort by the chr:start-stop column, delete rows not starting with ÒchrÓ
at the top and bottom, and sort by the index. Full details will not be present
for very common CNVs but these are typically not of interest anyway. Also
consider making Sample IDs a simple index number rather than a lengthy text
string. You can use vlookup in Excel to replace sample IDs with index sample
IDs in your input .rawcnv file.
To view output files in OpenOfficeCalc (Linux vs Dos
endlines could cause blank lines):
RightClick-OpenWith-OpenOffice.orgCalc-TextImport=Defaults
If blank lines:
SelectRowsNotHeader-Data-Filter-StandardFilter-Value=Empty-RightClick-Delete
Rows-Data-Filter-RemoveFilter
Output File Names |
Description |
CNVR_ALL_ReviewedCNVRs.txt |
50 column significant case
enriched CNVRs with Red Flag annotation for quality filtering |
CNVR_ALL_ReviewedCNVRs_brief.txt |
10 column significant case
enriched CNVRs with Red Flag annotation for quality filtering |
CNVR_ALL_ReviewedCNVRs_verbose.txt |
100 column significant case
enriched CNVRs with Red Flag annotation for quality filtering (need for TDT
details) |
CNVR_ALL_ReviewedCNVRs_ControlEnriched.txt |
50 column significant control
enriched CNVRs with Red Flag annotation for quality filtering |
CNVR_ALL_ReviewedCNVRs_brief_ControlEnriched.txt |
10 column significant control
enriched CNVRs with Red Flag annotation for quality filtering |
CNV_UCSC_bedfile_ForCustomTrackReview.txt |
.rawcnv significant locus
visualization of individual sample CNVs genome.ucsc.edu Be sure you select
correct genome assembly→click manage custom tracks→Browse for
UCSC bedfile→Paste CNVR coordinates in Genome Browser search box |
CNV_UCSC_bedfile_ForCustomTrackReview_Probes.txt |
Probe coverage significant locus
visualization |
CNV_ContributingCalls.txt |
CNVRs contributing calls full
boundaries as in the .rawcnv input |
CNV_brief.stats |
Probe based CNV association
statistics file |
Burden.txt |
Cases and controls IDs, and
TotalDelLengths, and TotalDupLengths |
ParseCNV.log |
ParseCNV log |
ChrSnp0PosSorted_UseWithPed.map |
Map Sorted to use with CNV ped |
Deletion and Duplication significant CNVRs with
counts in cases and control and p-value. Sample IDs, CN states, CNV calls
contributing to each association and many other statistics and tracking data
are provided (100 fields in total verbose output and 40 highly informative
fields in brief output).
UCSC custom track bed file with CNV calls.
CNV_ContributingCalls.txt: CNV contributing calls for
each CNVR
CNVDEL.ped CNVDUP.ped: .ped files for further
statistics deletion and duplication are generated for use in Plink.
CNVR_ALL_ReviewedCNVRs: All significant CNVRs with CNVType,
redFlagCount, and redFlagReasons.
Nomenclature:
CNVR: CNVRegion intersection region of overlapping CNVs with significant
enrichment in cases vs. controls if overlap (Primary importance, refined CNV
association region)
CNVRange: CNVRange union region of overlapping CNVs
(Typical meaning of ÒCNVRÓ in literature)
People are often confused by the small range and
value of the first 2 columns: CNVR and CountSNPs. This is the most significant
overlapping region, not the entire span of contributing CNVs. CNVRange and
AvgNumSnps columns reflect the entire span of contributing CNVs. CNVs at a
particular locus do not have precisely the same boundaries, especially rare
CNVs (see paper Figure 3).
temp/CNV_Verbose.stats full SNP based statistics for
specific locus queries not among the significant loci or for querying
replication based on TagSNP (i.e. GeneName → PositionRange → SNP
IDs in PositionRange → egrep SNP IDs from temp/CNV_Verbose.stats →
CNVFrequencyInGene). A cleaner way is: perl GeneRef/RemoveSpecificBarcodes-Hash_Example_DefFile_ColRemKeep.pl SNPIDsToReplicate.txt temp/CNV_Verbose.stats
1 keep
(FilterCNV.pl)
For querying replication based on genomic coordinate
range use FindSnpsInRange.pl StatsFile Chr Start Stop (much better than grep
cytoband since not specific enough).
AllRes.pdf: Image created with BAF and LRR of
specific association regions across contributing cases with --idToPath option
If you get the error: Ò-bash: convert: command not
foundÓ run: Òsudo apt-get install imagemagickÓ. This command combines the BAF
LRR plots into a single image to scroll through CNVRs sorted by significance.
If you do not have administrative privileges to do sudo:
wget http://www.imagemagick.org/download/ImageMagick.tar.gz
tar
xvzf ImageMagick.tar.gz
cd
ImageMagick-6.8.9-7/
./configure
make
DESTDIR=~/ImageMagick
make install
export PATH=$PATH:~/ImageMagick/usr/local/bin/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/ImageMagick/usr/local/lib/
Alternatively, the images can be flipped through in
Powerpoint using Insert-PhotoAlbum or Open Office Impress Add On
If you have a local R install due to lack of
administrator privileges on your system, you need to provide the full path to
the R executable such as "/work/1a/joe/R/R-3.1.2/bin/R". This path
needs to replace "R" in instances of "R CMD" in these 3
scripts:
ParseCNV.pl
MakeRedFlagPlots.pl
PerlModules/visualize_cnv.pl
5x10-4 is a conservative bar for CNV
genome-wide significance surviving multiple testing correction based on
analysis of Illumina and Affymetrix genome-wide SNP arrays. The typical bar of
5x10-8 used in GWAS is not appropriate for CNV considering:
¤
the
number of probes with a nominal frequency of CNV occurrence (only probes with
some CNV detected are informative)
¤
the
number of probes with enrichment in cases vs. controls and vice versa (evidence
of more case enriched loci than control enriched loci)
¤
probes
with less than 1% population frequency of CNV (optionally)
¤
the
number of CNVRs (multiple probes are needed to detect a single CNV and should
not count separately for multiple testing correction)
2x10-5 according to permutation studies.
UCSC Track Review for spurious association. (UCSCGenomeBrowser-Select
Assembly-add custom tracks-Browse-Submit).
Recommended UCSC tracks:
Header:Mapping and Sequencing Tracks-Item: GC Percent (full) (exclude if low or high across window)
Header:Variation and Repeats-Item: DGV Struct Var (pack)
and Item:Segmental Dups (pack) (exclude if 10+)
Header:Genes and Gene Prediction Tracks-Item: UCSC
Genes (pack) (See redFlagCount and redFlagReasons columns in ÒCNVR_ALL_ReviewedCNVRs.txtÓ
for automated scoring of relevant features originally visually scored).
Header:Phenotype and Literature-Item: ISCA, DECIPHER
BAF/LRR Review of specific association region across
many samples using SpecificBafLrrAccess output AllRes.txt or CNVR specific
file.
BAF/LRR Review of whole chromosome data of sample
with associated CNV and SNP clusters using Illumina Genome Studio Genome Viewer
or Affymetrix Genotyping Console Browser.
qPCR/ddPCR Wet Lab Review for confirmation of true
positive and potentially true negatives.
These are done in order of increasing effort per
locus but the number of loci will be filtered down by each step.
Red flag count <= 3 may be considered high
confidence results. More important red flags to fail a CNVR as low confidence
include: AvgProbes<5, DgvEntries>10, PenMaxP>0.5 and high frequency,
SegDups>10, Recurrent, FreqInflated,
AvgConf<10, ABFreqLow.
Standard Filter for high significant and confident
results for SpecificBafLrr Visual: Sort RF <=3, delP<5x10^-4 and
ORDup>1 OR dupP<5x10^-4 and ORDup>1, (on exon)
RedFlag |
Default Report
Threshold |
Explanation |
SegDups (count, max, avg) |
>10, >0.98 max Fraction Matching |
Many segmental duplications (i.e., nearly
identical DNA segments), representing genomic segments that are difficult to
uniquely hybridize probes to, which could underlie false positive CNV
detection. Segmental
Duplications inform CNV breakpoints if flanking (include) and noisy regions
if overlapping (exclude). |
DgvEntries |
>10 |
Overlapping multiple Database of Genomic
Variants (DGV) entries, representing CNV signals observed in ÒhealthyÓ
individuals, suggesting that a potential association result in the study at
hand may be false. |
TeloCentro |
any overlap |
Residing at centromere and telomere proximal
regions as they often have sparse probe coverage and only have a single
flanking diploid reference to base CNV calls. |
AvgGC |
31>GC>60 |
Harboring high or low GC content regions that
bias probe hybridization kinetics even after GC model correction is done by
CNV calling algorithms, producing false CNV calling and biasing the result. |
AvgProbes |
<10 |
CNVs captured with low average number of
probes, contributing to association with low confidence. If an association
depends on a preponderance of small CNVs, the likelihood of false positive is
high. |
Recurrent |
any overlap |
Locus frequently found in multiple studies
such as TCR, Ig, HLA, and OR genes. TCRs undergo somatic rearrangement due to
VDJ recombination causing inter-individual differences in the clonality of
T-cell populations and thus are not true CNVs, necessitating exclusion. |
PopFreq |
>0.01 |
CNV regions with high population frequency
(for rare CNV focused studies) indicate that probe clustering is likely
biased due to a high percentage of samples with CNV used in clustering
definition thus biasing CNV detection. |
PenMaxP_Freq_HighFreq |
PenMaxP >0.5, Freq >0.5, HighFreq >0.05 |
CNV peninsula of common CNV (sparse probe
coverage and nearby high frequency CNV) indicates that within the range of
contributing CNV boundaries there is a non-significant (p>0.5) p-value
which is notably different from the CNVR association typically due to random
extension of common CNVs to neighboring sparse or noisy probes. PenMaxP is the worst
p-value in the span of CNV calls contributing to the significant CNVR. Freq
is the frequency of this PenMaxP worst p-value. HighFreq is the frequency any
non-nominally significant p-value (P>0.05). |
FreqInflated |
>0.5 sids at this locus have
>(maxInflatedSampleCount-2) occurrences in all significant results |
The same inflated sample driving multiple CNV
association signals. Certain samples have many noisy CNV calls arising in
rare regions despite upfront sample quality filtering. |
Sparse |
>50kb |
A large gap in probe coverage exists within
the CNV calls indicating uncertainty in the continuity of a single CNV event,
typically due to dense clusters of copy number (intensity only) probes with
large intervening gaps. |
ABFreq |
<1% values (0.1,0.4) or (0.6,0.9) |
For duplications, AB banding of BAF at 0.33
and 0.66 for CN=3 or 0.25 and 0.75 for CN=4 are very important observations
given the relatively modest gain in intensity observed in duplications. |
AvgConf |
<10 |
The HMM confidence score in PennCNV is a
superior indication of CNV call confidence compared to numsnps and length in
studies comparing de novo vs. inherited CNV calls, giving an indication of
the strength of the CNV signal or aggregate difference in probability between
the called CN and the next highest probability CN. Other CNV calling
algorithms give different range confidence scores or lower values might mean
more confidence (i.e. call p value) so threshold may need modification. It is
recommended to be in .rawcnv file as column 8 i.e. Òconf=20.659Ó but not
required. |
AvgLength |
<10kb |
A classical confidence scoring parameter is
the length of the CNV. If the CNV is too small, it is submicroscopic and even
if many probes are tightly clustered, bias of local DNA regions and probe
overlap make confidence difficult. |
If you did the case:control (default), you want to
look at ÒDelTwoTailÓ and ÒDupTwoTailÓ.
If you did the --tdt, you want to look at ÒTDTDelÓ
and ÒTDTDupÓ for TDT P-value and ÒNormParDelÓ and ÒNormParDupÓ for the count de novo.
Sequencing
Significant CNVR
Output Fields Description:
Column |
Description |
CNVR |
CNV Region of greatest significance and
overlap coordinates |
CountSNPs |
The number of probes available in the CNVR for
this dataset In this case, contributing individual CNV calls may be larger |
SNP |
Tag SNP for ease and clarity of reporting and
replication |
DelTwoTailed |
Two Tailed Fisher's Exact P-value based on the
contingency table Cases Del/Cases Diploid/Controls Del/Controls Diploid as
listed separately |
DupTwoTailed |
Two Tailed Fisher's Exact P-value based on the
contingency table Cases Dup/Cases Diploid/Controls Dup/Controls Diploid as
listed separately |
ORDel |
The Odds Ratio for deletion. |
ORDup |
The Odds Ratio for duplication. |
Cases Del |
The number of cases with a deletion detected
in this region by PennCNV |
Cases Diploid |
The number of cases without a deletion or
duplication detected in this region by PennCNV |
Control Del |
The number of controls with a deletion
detected in this region by PennCNV |
Control Diploid |
The number of controls without a deletion or
duplication detected in this region by PennCNV |
Cases Dup |
The number of cases with a duplication
detected in this region by PennCNV |
Cases Diploid |
The number of cases without a deletion or
duplication detected in this region by PennCNV |
Control Dup |
The number of controls with a duplication
detected in this region by PennCNV |
Control Diploid |
The number of controls without a deletion or
duplication detected in this region by PennCNV |
IDsCasesDel |
The sample IDs of cases corresponding to the
Cases Del column for clinical data lookup. To convert to list in Excel:
Data-TextToColumns-Delimited-Space then Copy-PasteSpecial-Transpose |
IDsCasesDup |
The sample IDs of cases corresponding to the
Cases Dup column for clinical data lookup. To convert to list in Excel:
Data-TextToColumns-Delimited-Space then Copy-PasteSpecial-Transpose |
StatesCasesDel |
CN states listed corresponding to IDsCasesDel
(1(CN=0)/2(CN=1)) |
StatesCasesDup |
CN states listed corresponding to IDsCasesDup
(5(CN=3)/6(CN=4)) |
TotalStatesCases(1) |
The number of cases in Cases Del with a
homozygous deletion or both copies lost |
TotalStatesCases(2) |
The number of cases in Cases Del with a
hemizygous deletion or one copy lost |
TotalStatesCases(5) |
The number of cases in Cases Dup with a
hemizygous duplication or one copy gained |
TotalStatesCases(6) |
The number of cases in Cases Dup with a
homozygous duplication or two copies gained |
IDsDelControl |
The sample IDs of controls corresponding to
the Control Del column for clinical data lookup. |
IDsDupControl |
The sample IDs of controls corresponding to
the Control Dup column for clinical data lookup. |
StatesDelControl |
CN states listed corresponding to
IDsDelControl (1(CN=0)/2(CN=1)) |
StatesDupControl |
CN states listed corresponding to
IDsDupControl (5(CN=3)/6(CN=4)) |
TotalStates(1) |
The number of Controls in Controls Del with a
homozygous deletion or both copies lost |
TotalStates(2) |
The number of Controls in Controls Del with a
hemizygous deletion or one copy lost |
TotalStates(5) |
The number of Controls in Controls Dup with a
hemizygous duplication or one copy gained |
TotalStates(6) |
The number of Controls in Controls Dup with a
homozygous duplication or two copies gained |
ALLTwoTailed |
All CNV states considered together p |
ORALL |
All CNV states considered together OR |
ZeroTwoTailed |
Only CN=0 CNV state considered together p |
ORZero |
Only CN=0 CNV state considered together OR |
OneTwoTailed |
Only CN=1 CNV state considered together p |
OROne |
Only CN=1 CNV state considered together OR |
ThreeTwoTailed |
Only CN=3 CNV state considered together p |
ORThree |
Only CN=3 CNV state considered together OR |
FourTwoTailed |
Only CN=4 CNV state considered together p |
ORFour |
Only CN=4 CNV state considered together OR |
Gene |
The closest proximal gene based on UCSC Genes
which includes both RefSeq Genes and Hypothetical Gene transcripts |
Distance |
The distance from the CNVR to the closest
proximal gene annotated. If the value is 0, the CNVR resides directly on the
gene. |
Description |
The gene description delimited by
"/" for multiple gene transcripts or multiple genes listed |
Pathway |
Annotated pathway membership of Gene with
reference compiled from Gene Ontology database, BioCarta database and the
KEGG database (definition files in GeneRef folder) |
AverageNumsnpsCaseDel |
The average numsnp of CNV calls contributing
to Case Del CNVR. Allows for much more informative CNV size (confidence)
filtering post-hoc. |
AverageLengthCaseDel |
The average length of CNV calls contributing
to Case Del CNVR. Allows for much more informative CNV size (confidence)
filtering post-hoc. |
CNVRangeCaseDel |
Alternative larger CNV Range Case Del definition compared to
minimal common overlap definition of CNVR |
AverageNumsnpsControlDel |
The average numsnp of CNV calls contributing
to Control Del CNVR. Allows for much more informative CNV size (confidence)
filtering post-hoc. |
AverageLengthControlDel |
The average length of CNV calls contributing
to Control Del CNVR. Allows for much more informative CNV size (confidence)
filtering post-hoc. |
CNVRangeControlDel |
Alternative larger CNV Range Control Del definition compared to
minimal common overlap definition of CNVR |
CNVType |
Deletion or duplication CNVR Significant in
combined report |
Cytoband |
Cytoband genomic landmark designations |
redFlagCount |
Count red flag from association review of 9
(see text, briefly: SegDups, DGV, Centro/Telo, GC, ProbeCount, PopFreq, Peninsula,
Inflated) |
redFlagReasons |
The failing metrics for association review and
their values |
Reformat CNV Calls
from Different Algorithms/Sources into Standard PennCNV .rawcnv Format:
CNV Calling Algorithms Formats:
PennCNV
chr1:248740572-248795110 numsnp=20 length=54,539 state1,cn=0 7500809055_R02C01 startsnp=kgp8531759 endsnp=kgp6234332 conf=82.372 PennCNV
QuantiSNP
Sample Name Chromosome Start
Position (bp) End
Position (bp) Start
Probe ID End
Probe ID Length
(bp) No.
Probes Copy
Number Max.
Log BF Log
BF: State 0 Log
BF: State 1 Log
BF: State 2 Log
BF: State 3 Log
BF: State 4 Log
BF: State 5 Log
BF: State 6
7500809055_R02C01 1 248740572 248795110 kgp8531759 kgp6234332 54539 19 0 137.858 137.858 -9.05444 0 1.18944 -7.74257 -9.26705 -44.1485
Nexus
Sample Chromosome Region Event Length Cytoband % of CNV Overlap Probe Median %
Heterozygous Probes Min Size Min Region Max Size Max Region Locus
IDs Call Pvalue
4363511659_R02C01 chr3:191,219,380-191,224,532 Homozygous Copy Loss 5153 q28 84.14208075 -4.471399784 40 5
GenomeStudio
SampleID BookmarkType Chr Start End Size Author CreatedDate Value Comment
K069 [735] CNV
Bin: Min 0 To Max 0.5 1 17548878 17549783 905 2011-07-13
“÷¤’Ýã 2:15:11 0 CNV
Confidence: 177.519
TCGA
barcode chromosome start stop num.mark seg.mean
TCGA-13-0891-01A-01D-0403-02 Y 22008576 22014412 -11.7734
GenomicWorkbench (also header with colons and footer with equals)
AberrationNo Chr Cytoband Start Stop #Probes Amplification Deletion pval Gene Names &