Bedparse tutorial
Hi, thanks for your interest in bedparse!
The following is a short tutorial that will guide you through the functionality of bedparse. You can find the example.bed file in this repo under docs/example.bed.
This file contains 6 human transcript models from Gencode. The first three are non-coding transcripts (i.e. fields 7 and 8 contain the same coordinate), whereas the last three are coding (i.e. fields 7 and 8 hold the thickStart and thickEnd, the start and end of the CDS).
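The coding/non-coding rule can be checked directly on a BED12 line. The following is a minimal pure-Python sketch and does not use bedparse itself; the names `Bed12`, `parse_bed12` and `is_coding` are hypothetical helpers introduced only for illustration. The two input lines are copied from the `cds` and `5pUTR` outputs shown later in this tutorial.

```python
from typing import List, NamedTuple


class Bed12(NamedTuple):
    """Hypothetical container for the 12 BED fields (not part of bedparse)."""
    chrom: str
    start: int
    end: int
    name: str
    score: str
    strand: str
    thick_start: int   # field 7
    thick_end: int     # field 8
    rgb: str
    block_count: int
    block_sizes: List[int]
    block_starts: List[int]


def parse_bed12(line: str) -> Bed12:
    """Split a whitespace-separated BED12 line into typed fields."""
    f = line.split()

    def ints(csv: str) -> List[int]:
        # BED block lists may end with a trailing comma, so skip empty items.
        return [int(x) for x in csv.split(",") if x]

    return Bed12(f[0], int(f[1]), int(f[2]), f[3], f[4], f[5],
                 int(f[6]), int(f[7]), f[8], int(f[9]),
                 ints(f[10]), ints(f[11]))


def is_coding(tx: Bed12) -> bool:
    """A transcript is treated as coding when fields 7 and 8 differ."""
    return tx.thick_start != tx.thick_end


cds_line = ("chr1 925941 935793 ENST00000437963.5 0 + 925941 935793 0 4 "
            "72,182,51,22, 0,4213,5097,9830,")
utr_line = ("chr1 925149 925941 ENST00000437963.5 0 + 925149 925149 0 2 "
            "40,20, 0,772,")

print(is_coding(parse_bed12(cds_line)))  # True
print(is_coding(parse_bed12(utr_line)))  # False
```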
Extracting the promoters
The bedparse promoter command reports the promoter of each transcript, defined as a user-specified interval around the TSS.
For example, we can extract promoters consisting of 1000bp upstream and 500bp downstream of the TSS:
$ bedparse promoter --up 1000 --down 500 example.bed
chr1 10868 12368 ENST00000456328.2
chr1 11009 12509 ENST00000450305.2
chr1 29070 30570 ENST00000488147.1
chr1 922927 924427 ENST00000420190.6
chr1 924149 925649 ENST00000437963.5
chr1 924737 926237 ENST00000342066.7
Note how the TSS (and as a consequence the promoter) depends on the strand: for transcripts on the negative strand the TSS is the end coordinate, i.e. column 3. The --unstranded option allows you to override this behaviour and report promoters as an interval around column 2, thus disregarding the strand.
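The strand logic described above can be sketched in a few lines of plain Python. This is an illustration of the arithmetic only, not bedparse's implementation: `promoter_interval` is a hypothetical function, and clamping negative starts to 0 is an assumption of this sketch. The transcript coordinates are read off the exon list shown in the BED12 to BED6 section.

```python
def promoter_interval(start, end, strand, up=500, down=500, stranded=True):
    """Return (promoter_start, promoter_end) around the TSS.

    On the + strand the TSS is `start` (column 2); on the - strand it is
    `end` (column 3) and "upstream" points towards higher coordinates.
    With stranded=False the interval is always built around column 2.
    """
    if stranded and strand == "-":
        tss = end
        lo, hi = tss - down, tss + up
    else:
        tss = start
        lo, hi = tss - up, tss + down
    # Assumption of this sketch: never return a negative coordinate.
    return max(0, lo), hi


# ENST00000456328.2 (+ strand, 11868-14409)
print(promoter_interval(11868, 14409, "+", up=1000, down=500))  # (10868, 12368)
# ENST00000488147.1 (- strand, 14403-29570)
print(promoter_interval(14403, 29570, "-", up=1000, down=500))  # (29070, 30570)
# Same transcript, ignoring the strand
print(promoter_interval(14403, 29570, "-", up=1000, down=500, stranded=False))  # (13403, 14903)
```

The first two results match the first and third lines of the command output above.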
Extracting 5’ or 3’ UTRs
The UTRs are defined in a BED file as the region between the start (column 2) and the thickStart (column 7) for the 5’ UTR, and between the thickEnd (column 8) and the end (column 3) for the 3’ UTR. These rules are reversed for transcripts on the - strand, and bedparse takes care of this automatically. bedparse also correctly handles UTRs that span multiple exons: in these cases it recomputes all exon starts and exon lengths and sets columns 11 and 12 accordingly.
$ bedparse 5pUTR example.bed
chr1 923927 924431 ENST00000420190.6 0 + 923927 923927 0 1 504, 0,
chr1 925149 925941 ENST00000437963.5 0 + 925149 925149 0 2 40,20, 0,772,
chr1 925737 925941 ENST00000342066.7 0 + 925737 925737 0 2 63,20, 0,184,
As the output above shows, UTRs are only reported for coding transcripts.
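To make the strand rule concrete, the sketch below picks the genomic window of a UTR from the BED columns. It is plain Python and independent of bedparse; `utr_window` is a hypothetical name, and returning `None` for non-coding transcripts or empty windows is an assumption of this sketch. It only selects the window and does not recompute exon blocks.

```python
def utr_window(start, end, thick_start, thick_end, strand, which):
    """Return (utr_start, utr_end) for the 5' (which=5) or 3' (which=3) UTR.

    Returns None when the transcript is non-coding (thickStart == thickEnd)
    or when the requested UTR has zero length.
    """
    if which not in (5, 3):
        raise ValueError("which must be 5 or 3")
    if thick_start == thick_end:
        return None
    # On the + strand the 5' UTR is on the left (start..thickStart);
    # on the - strand the sides are swapped.
    five_prime_on_left = strand != "-"
    use_left = (which == 5) == five_prime_on_left
    lo, hi = (start, thick_start) if use_left else (thick_end, end)
    return (lo, hi) if hi > lo else None


# ENST00000420190.6: chr1 923927-939291, thickStart 924431, thickEnd 939291
print(utr_window(923927, 939291, 924431, 939291, "+", 5))  # (923927, 924431)
print(utr_window(923927, 939291, 924431, 939291, "+", 3))  # None
# The same coordinates on a hypothetical - strand transcript
print(utr_window(923927, 939291, 924431, 939291, "-", 3))  # (923927, 924431)
```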
Extracting the CDS
To extract the CDS of the coding transcripts in the BED file, use the bedparse cds command:
$ bedparse cds example.bed
chr1 924431 939291 ENST00000420190.6 0 + 924431 939291 0 7 517,92,182,51,125,90,17, 0,1490,5723,6607,11340,14608,14843,
chr1 925941 935793 ENST00000437963.5 0 + 925941 935793 0 4 72,182,51,22, 0,4213,5097,9830,
chr1 925941 944153 ENST00000342066.7 0 + 925941 944153 0 13 72,182,51,125,90,186,163,116,79,500,125,111,246, 0,4213,5097,9830,13098,13333,15202,16194,16468,16617,17311,17756,17966,
Note how non-coding transcripts are not reported (because by definition they don’t have a CDS). Also, note how the number of exons (column 10) and exon lengths and starts (columns 11 and 12) have been readjusted to reflect the fact that the transcripts have “lost” the UTRs. To visualise this operation you can save the output of the command above to a new text file and upload it as a custom track in the UCSC genome browser: you’ll see that the new transcripts only correspond to the thick portion of the original Gencode transcripts.
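The readjustment of columns 10 to 12 amounts to clipping every exon to the thickStart..thickEnd window. Here is a rough, bedparse-independent sketch of that step; `clip_blocks` is a hypothetical helper, and the input blocks for ENST00000437963.5 are derived from the exon coordinates listed in the BED12 to BED6 section of this tutorial.

```python
def clip_blocks(tx_start, sizes, starts, lo, hi):
    """Clip exon blocks to the half-open window [lo, hi).

    Returns (new_sizes, new_starts), with starts relative to `lo`.
    This assumes `lo` falls inside an exon (true for a thickStart),
    so that `lo` is also the start of the clipped transcript.
    """
    new_sizes, new_starts = [], []
    for size, rel in zip(sizes, starts):
        exon_start = tx_start + rel
        exon_end = exon_start + size
        s, e = max(exon_start, lo), min(exon_end, hi)
        if e > s:  # keep only exons that overlap the window
            new_sizes.append(e - s)
            new_starts.append(s - lo)
    return new_sizes, new_starts


# ENST00000437963.5: transcript starts at 925149, CDS is 925941-935793
sizes = [40, 92, 182, 51, 22]
starts = [0, 772, 5005, 5889, 10622]
cds_sizes, cds_starts = clip_blocks(925149, sizes, starts, 925941, 935793)
print(len(cds_sizes), cds_sizes, cds_starts)
# 4 [72, 182, 51, 22] [0, 4213, 5097, 9830]
```

The result agrees with columns 10 to 12 of the second output line above.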
Extracting introns
In a BED file introns are implicitly defined as the genomic regions between exons. The bedparse introns command creates new “artificial” transcripts that correspond to the introns of the original transcripts:
$ bedparse introns example.bed
chr1 12227 13220 ENST00000456328.2 0 + 12227 12227 0 2 385,499, 0,494,
chr1 12057 13452 ENST00000450305.2 0 + 12057 12057 0 5 121,385,277,168,78, 0,170,640,995,1317,
chr1 14501 29533 ENST00000488147.1 0 - 14501 14501 0 10 503,757,659,92,177,237,172,206,6371,4642, 0,537,1446,2264,2554,2867,3241,3560,3865,10390,
chr1 924948 939274 ENST00000420190.6 0 + 924948 924948 0 6 973,4141,702,4682,3143,145, 0,1065,5388,6141,10948,14181,
chr1 925189 935771 ENST00000437963.5 0 + 925189 925189 0 4 732,4141,702,4682, 0,824,5147,5900,
chr1 925800 943907 ENST00000342066.7 0 + 925800 925800 0 13 121,4141,702,4682,3143,145,1683,829,158,70,194,320,99, 0,213,4536,5289,10096,13329,13660,15506,16451,16688,17258,17577,18008,
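Deriving introns from the exon blocks can be sketched as follows. This is a plain-Python illustration rather than bedparse's own code; `intron_blocks` is a hypothetical name, and the blocks for ENST00000456328.2 are derived from the exon coordinates listed in the BED12 to BED6 section.

```python
def intron_blocks(tx_start, sizes, starts):
    """Return (start, end, intron_sizes, intron_starts) or None.

    Introns are the gaps between consecutive exons; the returned starts
    are relative to the start of the first intron.
    """
    exons = [(tx_start + rel, tx_start + rel + size)
             for size, rel in zip(sizes, starts)]
    gaps = [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]
    if not gaps:
        return None  # single-exon transcript: no introns
    first = gaps[0][0]
    return (first, gaps[-1][1],
            [e - s for s, e in gaps],
            [s - first for s, _ in gaps])


# ENST00000456328.2: exons 11868-12227, 12612-12721, 13220-14409
print(intron_blocks(11868, [359, 109, 1189], [0, 744, 1352]))
# (12227, 13220, [385, 499], [0, 494])
```

The result matches the first line of the output above.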
Convert BED12 to BED6
It’s often convenient to convert a BED12 file into BED6, where each exon appears on its own line. This is easily done with bedparse bed12tobed6:
$ bedparse bed12tobed6 --appendExN example.bed
chr1 11868 12227 ENST00000456328.2_Exon001 0 +
chr1 12612 12721 ENST00000456328.2_Exon002 0 +
chr1 13220 14409 ENST00000456328.2_Exon003 0 +
chr1 12009 12057 ENST00000450305.2_Exon001 0 +
chr1 12178 12227 ENST00000450305.2_Exon002 0 +
chr1 12612 12697 ENST00000450305.2_Exon003 0 +
chr1 12974 13052 ENST00000450305.2_Exon004 0 +
chr1 13220 13374 ENST00000450305.2_Exon005 0 +
chr1 13452 13670 ENST00000450305.2_Exon006 0 +
chr1 14403 14501 ENST00000488147.1_Exon001 0 -
chr1 15004 15038 ENST00000488147.1_Exon002 0 -
chr1 15795 15947 ENST00000488147.1_Exon003 0 -
chr1 16606 16765 ENST00000488147.1_Exon004 0 -
chr1 16857 17055 ENST00000488147.1_Exon005 0 -
chr1 17232 17368 ENST00000488147.1_Exon006 0 -
chr1 17605 17742 ENST00000488147.1_Exon007 0 -
chr1 17914 18061 ENST00000488147.1_Exon008 0 -
chr1 18267 18366 ENST00000488147.1_Exon009 0 -
chr1 24737 24891 ENST00000488147.1_Exon010 0 -
chr1 29533 29570 ENST00000488147.1_Exon011 0 -
chr1 923927 924948 ENST00000420190.6_Exon001 0 +
chr1 925921 926013 ENST00000420190.6_Exon002 0 +
chr1 930154 930336 ENST00000420190.6_Exon003 0 +
chr1 931038 931089 ENST00000420190.6_Exon004 0 +
chr1 935771 935896 ENST00000420190.6_Exon005 0 +
chr1 939039 939129 ENST00000420190.6_Exon006 0 +
chr1 939274 939291 ENST00000420190.6_Exon007 0 +
chr1 925149 925189 ENST00000437963.5_Exon001 0 +
chr1 925921 926013 ENST00000437963.5_Exon002 0 +
chr1 930154 930336 ENST00000437963.5_Exon003 0 +
chr1 931038 931089 ENST00000437963.5_Exon004 0 +
chr1 935771 935793 ENST00000437963.5_Exon005 0 +
chr1 925737 925800 ENST00000342066.7_Exon001 0 +
chr1 925921 926013 ENST00000342066.7_Exon002 0 +
chr1 930154 930336 ENST00000342066.7_Exon003 0 +
chr1 931038 931089 ENST00000342066.7_Exon004 0 +
chr1 935771 935896 ENST00000342066.7_Exon005 0 +
chr1 939039 939129 ENST00000342066.7_Exon006 0 +
chr1 939274 939460 ENST00000342066.7_Exon007 0 +
chr1 941143 941306 ENST00000342066.7_Exon008 0 +
chr1 942135 942251 ENST00000342066.7_Exon009 0 +
chr1 942409 942488 ENST00000342066.7_Exon010 0 +
chr1 942558 943058 ENST00000342066.7_Exon011 0 +
chr1 943252 943377 ENST00000342066.7_Exon012 0 +
chr1 943697 943808 ENST00000342066.7_Exon013 0 +
chr1 943907 944575 ENST00000342066.7_Exon014 0 +
The optional flag --appendExN appends _ExonNNN to each transcript name, where NNN is the zero-padded exon number.
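The conversion itself is a simple expansion of columns 11 and 12 into one row per exon. The sketch below shows the idea in plain Python without using bedparse; `bed12_to_bed6` and its `append_exon_number` parameter are hypothetical names, and numbering exons in genomic order is an assumption based on the output above.

```python
def bed12_to_bed6(chrom, tx_start, name, score, strand, sizes, starts,
                  append_exon_number=False):
    """Expand BED12 blocks into a list of BED6 tuples, one per exon."""
    rows = []
    for i, (size, rel) in enumerate(zip(sizes, starts), start=1):
        exon_start = tx_start + rel
        # Exon numbers are zero-padded to three digits, e.g. _Exon001.
        exon_name = f"{name}_Exon{i:03d}" if append_exon_number else name
        rows.append((chrom, exon_start, exon_start + size,
                     exon_name, score, strand))
    return rows


# ENST00000456328.2, with blocks derived from the exon list above
for row in bed12_to_bed6("chr1", 11868, "ENST00000456328.2", "0", "+",
                         [359, 109, 1189], [0, 744, 1352],
                         append_exon_number=True):
    print(*row, sep="\t")
```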
APIs
Bedparse can also be imported as a Python module. The API documentation contains detailed information on the bedline class and its methods. The following is a simple example of how to use it:
In [1]: from bedparse import bedline
In [2]: l = bedline(['chr1', 1000, 2000, 'Tx1', '0', '+'])
In [3]: prom = l.promoter()
In [4]: prom.print()
chr1 500 1500 Tx1
In [5]: prom.pprint()
['chr1', 500, 1500, 'Tx1']
In [6]: ens_prom = prom.translateChr(assembly="hg38", target="ens")
In [7]: ens_prom.print()
1 500 1500 Tx1