Introduction
Goal
The aim is to:
- Get familiar with motif analysis of ChIP-seq data.
- Learn de novo motif discovery methods:
  - Motif discovery with peak-motifs
  - Differential analysis
  - Random controls
Retrieving sequences from your peaks
1 - Peak dataset: ER factor in human MCF7 cells
Theodorou et al. published a ChIP-seq experiment in the MCF7 cell line (breast cancer) to identify genomic locations bound by the transcription factor estrogen receptor alpha (ESR1) (PMID:23172872). A particularity of this system is that it is inducible with the hormone oestradiol (E2). The authors were interested in evaluating the role of another transcription factor, GATA3. They thus performed the following experiments (with E2 induction):
- inhibiting the GATA3 factor by small interference RNA (files with prefix siGATA in the Table below)
- control (files with prefix siNT)
Description of the peaks on chr7:
| Peak BED file | Number of peaks | Sum of peak sizes (bp) |
|---|---|---|
| siGATA_ER_E2_r3_MACS_PeakSplitter.bed | 7,041 | 1,476,646 |
| siNT_ER_E2_r3_MACS_PeakSplitter.bed | 6,991 | 1,801,622 |
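The two summary columns of this table can be recomputed from any BED file. The following is a minimal sketch, assuming a standard BED layout (chromosome, 0-based start, exclusive end in the first three columns); the function name `summarize_bed` and the toy coordinates are illustrative and not part of RSAT.

```python
from typing import Iterable, Tuple


def summarize_bed(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of peaks, sum of peak sizes in bp) for BED-formatted lines."""
    n_peaks = 0
    total_bp = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional header lines allowed in BED files.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        # BED starts are 0-based and ends are exclusive, so size = end - start.
        n_peaks += 1
        total_bp += end - start
    return n_peaks, total_bp


# Toy input (made-up coordinates, not taken from the tutorial files).
example = ["chr7\t100\t350\tpeak_1", "chr7\t1000\t1200\tpeak_2"]
print(summarize_bed(example))  # (2, 450)
```

To apply it to a real file, pass an open file handle instead of the list, for example `summarize_bed(open("siGATA_ER_E2_r3_MACS_PeakSplitter.bed"))`.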
2 - Fetch sequences from a bed file
- In a web browser window, open the RSAT (Roscoff) or RSAT (Marseille) web page.
- In the menu (left side), click on the NGS-ChIP-seq drop-down menu and select the tool fetch-sequences from UCSC.
- Select the genome of interest, in this case: human, assembly hg19.
- Genomic coordinates can be provided in 3 alternative ways:
  - Paste the content of the BED file (not very convenient for large peak sets).
  - Specify a URL. This option is generally suitable for importing peaks from a Galaxy server or any other Web site.
  - Upload a file from your computer.
- Leave all other parameters unchanged and click on the button “GO”.
- The program returns 3 links to files: the input BED file (coordinates), the corresponding sequences (FASTA file), and a log text file that contains information on the execution of the program.
- Click on the link of the log file, and look for the number of sequences retrieved. The log contains the line "; sequences 7041": all sequences were retrieved.
- The FASTA file contains the sequences. Download it (right click, save as...) to your computer to keep a copy.
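The check made by reading the log (were all peaks retrieved?) can also be done locally once both files are saved. This is a minimal sketch, assuming only that the FASTA file has one `>` header line per sequence and that the BED file has one peak per data line; the helper names and the toy file contents are illustrative.

```python
import io


def count_fasta_records(handle):
    """Count sequences in a FASTA stream (one '>' header line per sequence)."""
    return sum(1 for line in handle if line.startswith(">"))


def count_bed_records(handle):
    """Count data lines in a BED stream, ignoring blank and header lines."""
    count = 0
    for line in handle:
        stripped = line.strip()
        if stripped and not stripped.startswith(("#", "track", "browser")):
            count += 1
    return count


# Toy in-memory files (made-up content, not the tutorial data).
bed = io.StringIO("chr7\t100\t110\nchr7\t200\t212\n")
fasta = io.StringIO(">seq_1\nACGTACGTAC\n>seq_2\nACGTACGTACGT\n")

n_bed, n_fasta = count_bed_records(bed), count_fasta_records(fasta)
print(n_bed, n_fasta, "all retrieved" if n_bed == n_fasta else "some sequences missing")
```

With the downloaded files, replace the `io.StringIO` objects with `open(...)` handles on the BED and FASTA files.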
Discovering motifs from peak sequences
1 - Getting to know peak-motifs
- Thomas-Chollier, M., Herrmann, C., Defrance, M., Sand, O., Thieffry, D. and van Helden, J. (2011). RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Research, doi:10.1093/nar/gkr1104, 9. [Paper]
- Thomas-Chollier M, Darbo E, Herrmann C, Defrance M, Thieffry D, van Helden J. (2012). A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nat Protoc 7(8): 1551-1568. [Paper]
- From the previous result page, click on the button peak-motifs at the bottom of the page to automatically transfer the sequences to this tool. Note that peak-motifs is also accessible from the left menu, in the NGS ChIP-seq menu.
- A new page appears, displaying a form.
- The default peak-motifs web form only displays the essential options. There are only two mandatory parameters:
  a. The title box, which you will set as siGATA_ER_E2_r3.
  b. The sequences, which have been automatically passed from fetch-sequences. Alternatively, sequences can be pasted in the available box, input from a URL, or uploaded as a file from your computer.
- We could launch the analysis like this, but we will now modify some of the advanced options in order to fine-tune the analysis according to your data set.
- Open the "Reduce peak sequences" title, and make sure the "Cut peak sequences: +/- " option is set to 0 (we wish to analyse our full dataset)
- Open the “Motif Discovery parameters” title, and check the oligomer sizes 6 and 7 (but not 8). Check "Discover over-represented spaced word pairs [dyad-analysis]"
- Under “Compare discovered motifs with databases”, we'll keep "JASPAR core vertebrates" as the studied organism is human.
- Under “Locate motifs and export predicted sites as custom UCSC tracks”, select "Peak coordinates specified in fasta headers of the test sequence file (Galaxy format)".
- You can indicate your email address in order to receive notification of the task submission and completion. This is particularly useful because the full analysis may take some time for very large datasets.
- Click “GO”. As soon as the query has been launched, you should receive an email confirming the task submission and providing a link to the future result page.
- The Web page also displays a link. You can already click on this link. The report will be progressively updated during the processing of the workflow.
The peak-motifs output is formed by the following parts:
- Sequence composition: The distribution of sequence lengths provides a useful way to detect outlier peaks (i.e., exceptionally long peaks that may ‘dilute’ the motif signal) or irregular length distributions resulting from problems during the peak-calling procedure. Nucleotide and dinucleotide compositions are computed and displayed in the form of heat maps and positional profiles.
- Motif discovery: The workflow combines four word-based pattern-discovery algorithms that rely on two complementary criteria (overrepresentation and positional bias) to detect exceptional words (oligonucleotides) and spaced pairs of words (dyads). Significant words are used as seeds to build probabilistic descriptions of motifs (position-specific scoring matrices), indicating residue variability at each position of the motif, represented as logos.
- Motif comparisons: Discovered motifs are compared with one or several public databases of annotated motifs to predict associated transcription factors. Results of this comparison are also displayed as multiple motif alignments (click on the link indicated in the red box) to highlight matches with several annotated motifs (e.g., factors belonging to the same family, composite motifs bound by protein complexes). Motif comparison is performed against vertebrate transcription factor binding motifs from the JASPAR database.
- Binding site predictions: Sequences are scanned with the discovered motifs to locate binding sites, and their positioning within peaks is analyzed (coverage, positional distribution along peaks).

- At the top of the report, in the gray box, click on "small summary".
- Do we discover significant motifs?
- Are these motifs biologically relevant? In particular, did the program discover motifs related to ER or GATA3?
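To make the idea of word-based discovery more concrete, the sketch below shows only the counting step that such approaches start from: every oligonucleotide of size k is counted on both strands by merging each word with its reverse complement. It is a toy illustration under that assumption, not the algorithms or the statistics used by peak-motifs; the function names are illustrative.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(sequences, k):
    """Count k-mers, merging each word with its reverse complement."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            # Skip words containing N or any other non-ACGT symbol.
            if set(word) <= set("ACGT"):
                # Use the alphabetically smaller of the two strands as the key.
                counts[min(word, reverse_complement(word))] += 1
    return counts


# Toy sequence (made up): AC and its reverse complement GT are pooled.
print(count_kmers(["ACGTAC"], 2).most_common())
```

Deciding which words are "exceptional" additionally requires a background model and a significance test, which this sketch deliberately leaves out.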
2 - Differential analysis
As we have two conditions (siGATA, siNT), we would like to find if some motifs are found in one dataset, but not the other. We are thus going to perform differential motif analysis.
- Use fetch-sequences to obtain the peak sequences for siNT (you should already have the peak sequences for siGATA condition). Save the two sequence files on your computer, since we will need to upload both of them separately for the next step.
- Run peak-motifs in differential analysis mode, using siGATA (GATA3 knock-down) as peak sequences and siNT as control sequences. Use the title siGATA_vs_siNT_r3.
- The result with swapped datasets (siNT as test, siGATA as control) is available here.
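The principle of a differential analysis (which words are more frequent in the test set than in the control set?) can be sketched as a log-ratio of word frequencies. This is a simplified illustration with a pseudocount, assuming plain single-strand k-mer counts; it is not the scoring used by peak-motifs, and all names and sequences below are made up.

```python
import math
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer (single strand) in a list of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def log2_enrichment(test_seqs, control_seqs, k, pseudocount=1.0):
    """Return {word: log2(frequency in test / frequency in control)}."""
    test = kmer_counts(test_seqs, k)
    ctrl = kmer_counts(control_seqs, k)
    words = set(test) | set(ctrl)
    # The pseudocount avoids division by zero for words absent from one set.
    test_total = sum(test.values()) + pseudocount * len(words)
    ctrl_total = sum(ctrl.values()) + pseudocount * len(words)
    return {
        w: math.log2(((test[w] + pseudocount) / test_total)
                     / ((ctrl[w] + pseudocount) / ctrl_total))
        for w in words
    }


# Toy data: GATA is enriched in the test set, ACGT in the control set.
scores = log2_enrichment(["GATAGATA"], ["ACGTACGT"], k=4)
for word, score in sorted(scores.items(), key=lambda item: -item[1])[:3]:
    print(word, round(score, 2))
```

Swapping the two arguments reverses the sign of every score, which mirrors running the analysis with the datasets swapped.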
3 - Negative controls
As in experimental biology, we perform negative controls for motif analyses. RSAT has multiple options to create datasets that can serve as controls. Here, we will use "random genome fragments", to select random regions of the same number and size as the siGATA peak set.
- In the left RSAT menu, click on build control sets and choose random genome fragments
- Use the siGATA FASTA sequence file (retrieved earlier with fetch-sequences) as "template"
- Choose Local RSAT Organism => Homo sapiens
- As output, choose FASTA sequences
- Click "GO" to run the program.
- Save the FASTA result on your machine
- Run peak-motifs with this file as input, with the same options as above and the title random-genome fragments-siGATA_template.
- Do you expect to find significant motifs? Do you obtain significant motifs?
Pre-calculated results are available here.
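The idea behind this control (random regions, same number and same sizes as the template) can be sketched as follows. This is a simplified illustration, not the RSAT implementation: it assumes that drawing a chromosome proportionally to its size and then a uniform start position is acceptable, it does not avoid overlaps or unsequenced regions, and the chromosome sizes are toy values rather than hg19 sizes.

```python
import random


def random_fragments(template_lengths, chrom_sizes, seed=0):
    """Draw one random (chrom, start, end) interval per template length."""
    rng = random.Random(seed)  # fixed seed so the control set is reproducible
    fragments = []
    for length in template_lengths:
        # Only chromosomes long enough to hold the fragment are eligible.
        eligible = [c for c, size in chrom_sizes.items() if size >= length]
        if not eligible:
            raise ValueError(f"no chromosome can hold a fragment of {length} bp")
        # Pick a chromosome with probability proportional to its size.
        weights = [chrom_sizes[c] for c in eligible]
        chrom = rng.choices(eligible, weights=weights)[0]
        start = rng.randint(0, chrom_sizes[chrom] - length)  # 0-based start
        fragments.append((chrom, start, start + length))
    return fragments


# Hypothetical chromosome sizes and template peak lengths.
toy_sizes = {"chrA": 10_000, "chrB": 5_000}
template = [250, 200, 400]
controls = random_fragments(template, toy_sizes)
for chrom, start, end in controls:
    print(chrom, start, end, sep="\t")  # BED-like output
```

Because the control regions keep the length distribution of the template, any motif found in them reflects the method and the genome composition rather than the ChIP-seq signal.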
Visualizing the sites in the context of genome annotations
1 - Load predicted binding sites into UCSC browser
Peak-motifs can prepare BED files that can be directly visualized in the UCSC browser (or in IGV).
- At the bottom of the report, click on the UCSC icon in "Motif locations (sites)".
- This will open UCSC browser, and load the tracks of the peaks (coordinates only, not the shape) + one track per found motif.
- Search for the location chr7:75,680,888-75,681,629 (located on chr7).
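When checking which predicted sites fall in a browser window, note that positions typed in the UCSC search box are 1-based and inclusive, whereas BED intervals are 0-based with an exclusive end. The following minimal sketch parses a position string and tests overlap under that convention; the function names and the site coordinates are illustrative.

```python
import re


def parse_ucsc_position(position):
    """Parse 'chr7:75,680,888-75,681,629' into (chrom, start, end), 1-based inclusive."""
    match = re.fullmatch(r"([^:\s]+):([\d,]+)-([\d,]+)", position.strip())
    if match is None:
        raise ValueError(f"not a UCSC position: {position!r}")
    chrom, start, end = match.groups()
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


def overlaps(bed_interval, position):
    """True if a BED interval (0-based, end-exclusive) overlaps a UCSC position."""
    chrom, start, end = position
    bed_chrom, bed_start, bed_end = bed_interval
    # The 1-based inclusive window [start, end] equals the 0-based
    # half-open window [start - 1, end).
    return bed_chrom == chrom and bed_start < end and bed_end > start - 1


window = parse_ucsc_position("chr7:75,680,888-75,681,629")
print(window)

# Made-up site coordinates, for illustration only.
site = ("chr7", 75681000, 75681010)
print(overlaps(site, window))
```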