data wrangling for
computational genomics


ngsutilsj is a Java port of the NGSUtils toolkit. It reimplements the most commonly used tools from NGSUtils, with some additions. It is also a library, with utility classes for use in other NGS-related software (such as cgsplice).

Java was chosen for its ease of installation and its speed relative to the Python NGSUtils. Faster processing of gzip-compressed files was a major reason for the port. This version has also been optimized for high-memory HPC clusters and for streaming data analysis (to minimize disk I/O).


ngsutilsj is distributed as a self-executing fat JAR file, so installation requires only a working copy of Java and the ngsutilsj file. Unlike other JAR-based NGS packages, ngsutilsj includes a shell script shim that makes it executable like a traditional Unix program. To install it, copy the ngsutilsj file to a directory in your $PATH.
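The copy step described above can be sketched as a small shell helper. This is only an illustration, not part of ngsutilsj: the function name install_ngsutilsj and the default destination ~/bin are assumptions, and any directory already on your $PATH works equally well.

```shell
# Hypothetical helper (not shipped with ngsutilsj):
#   install_ngsutilsj SRC [DEST_DIR]
# SRC is the downloaded ngsutilsj file; DEST_DIR defaults to ~/bin (an assumption).
install_ngsutilsj() {
    src="$1"
    dest_dir="${2:-$HOME/bin}"

    # Stop early if the downloaded file is missing.
    if [ ! -f "$src" ]; then
        echo "error: $src not found" >&2
        return 1
    fi

    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/ngsutilsj" || return 1

    # The file can only be run directly if it is marked executable.
    chmod +x "$dest_dir/ngsutilsj" || return 1

    # Warn when the destination is not on $PATH.
    case ":$PATH:" in
        *":$dest_dir:"*) ;;
        *) echo "note: add $dest_dir to your PATH" >&2 ;;
    esac
}
```

With the downloaded file in the current directory, calling install_ngsutilsj ./ngsutilsj copies it into place. Running ngsutilsj help afterwards is a quick way to confirm that both the shim and Java are working.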

Commands available

There are many commands available. For information on an individual command, run ngsutilsj help cmd, where cmd is the name of the command.

ngsutilsj - Data wrangling for NGS

Usage: ngsutilsj cmd [options]

Available commands:
  bam-basecall     - For a BAM file, output the basecalls (ACGTN) at each genomic position.
  bam-best         - With reads mapped to two bam references, determine which reads mapped best to each
  bam-bins         - Quickly count the number of reads that fall into bins (bins assigned based on 5' end of the first read)
  bam-check        - Checks a BAM file to make sure it is valid
  bam-concat       - Concatenates BAM files (handles @RG, @PG)
  bam-count        - Counts the number of reads for genes (GTF), within a BED region, or by bins (--gtf, --bed, or --bins required)
  bam-coverage*    - Scans an aligned BAM file and calculates the number of reads covering each base
  bam-discord      - Extract all discordant reads from a BAM file
  bam-dups         - Flags or removes duplicate reads
  bam-expressed    - For a BAM file, output all regions with significant coverage in BED format.
  bam-filter       - Filters out reads based upon various criteria
  bam-readgroup    - Add a read group ID to each read in a BAM file
  bam-sample*      - Create a whitelist of read names sampled randomly from a file
  bam-split        - Split a BAM file into smaller files
  bam-stats        - Stats about a BAM file and the library orientation
  bam-tobed*       - Writes read positions to a BED6 file
  bam-tobedgraph   - Calculate coverage for an aligned BAM file in BedGraph format.
  bam-tofastq      - Export the read sequences from a BAM file in FASTQ format

  bed-clean        - Cleans score entries to be an integer
  bed-count        - Given reference and query BED files, count the number of query regions that are contained within each reference region
  bed-nearest      - Given reference and query BED files, for each query region, find the nearest reference region
  bed-reduce       - Merge overlapping BED regions
  bed-resize       - Resize BED regions (extend or shrink)
  bed-tobed3       - Convert a BED3+ file to a strict BED3 file
  bed-tobed6       - Convert a BED6+ file to a strict BED6 file
  bed-tofasta      - Extract FASTA sequences based on BED coordinates

  fasta-filter     - Filter out sequences from a FASTA file
  fasta-gc         - Determine the GC% for a given region or bins
  fasta-genreads*  - Generate mock reads from a reference FASTA file
  fasta-mask       - Mask regions of a FASTA reference
  fasta-names      - Display sequence names from a FASTA file
  fasta-split      - Split a FASTA file into a new file for each sequence present
  fasta-subseq     - Extract subsequences from a FASTA file (optionally, indexed)
  fasta-tag        - Add prefix/suffix to FASTA sequence names
  fasta-wrap       - Change the sequence wrapping length of a FASTA file

  fastq-barcode    - Given Illumina 1.8+ naming, find the lane/barcodes included
  fastq-check      - Verify a FASTQ single, paired, or interleaved file(s)
  fastq-demux      - Splits a FASTQ file based on lane/barcode values
  fastq-filter     - Filters reads from a FASTQ file.
  fastq-merge      - Merges two FASTQ files (R1/R2) into one interleaved file.
  fastq-separate   - Splits an interleaved FASTQ file by read number.
  fastq-sort       - Sorts a FASTQ file
  fastq-split      - Splits a FASTQ file into smaller files
  fastq-stats      - Statistics about a FASTQ file
  fastq-tobam      - Converts a FASTQ file (or two paired files) into an unmapped BAM file
  fastq-tofasta    - Convert FASTQ sequences to FASTA format

  gtf-export       - Export gene annotations from a GTF file as BED regions
  gtf-geneinfo*    - Calculate information about genes (based on GTF model)

  annotate-gtf     - Annotate GTF gene regions (for tab-delimited text, BED, or BAM input)
  annotate-repeat* - Calculates RepeatMasker annotations

  vcf-annotate     - Annotate a VCF file
  vcf-chrfix       - Changes the reference (chrom) format (Ensembl/UCSC)
  vcf-export       - Export information from a VCF file
  vcf-filter       - Filter a VCF file
  vcf-tobed        - Export allele positions from a VCF file to BED format

  help             - Help for a specific command
  license          - Show the license

* = experimental command