ngsutilsj
ngsutilsj is an updated Java port of the NGSUtils toolkit. This version is largely a port of the most commonly used tools from NGSUtils, with some additions. It is also a library, with utility classes for use in other NGS-related software (such as cgsplice).
Java was chosen for its ease of installation and its speed relative to the Python NGSUtils. Faster processing of gzip-compressed files was a major reason for the rewrite. This version has also been optimized for high-memory HPC clusters and for streaming data analysis (to minimize disk I/O).
Installation
ngsutilsj is distributed as a self-executing fat-JAR file, so all you need for installation is
a working copy of Java and the ngsutilsj file. Unlike other JAR-based NGS packages, ngsutilsj includes a
shell script shim that makes it executable like a traditional Unix program. To install the program,
copy the ngsutilsj file to a directory in your $PATH.
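The copy step can be sketched as a small shell function. This is only an illustration: install_ngsutilsj is a hypothetical helper name, not something shipped with ngsutilsj, and the destination directory is whatever directory on your $PATH you prefer.

```shell
# Sketch: copy the ngsutilsj file into a directory on $PATH and mark it executable.
# install_ngsutilsj is a hypothetical helper, not part of ngsutilsj itself.
install_ngsutilsj() {
    src="$1"        # path to the downloaded ngsutilsj file
    dest_dir="$2"   # a directory that is (or will be) on your $PATH

    if [ ! -f "$src" ]; then
        echo "ngsutilsj file not found: $src" >&2
        return 1
    fi

    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/ngsutilsj" || return 1
    chmod +x "$dest_dir/ngsutilsj" || return 1

    # Warn if Java is missing, since a working copy of Java is required.
    if ! command -v java >/dev/null 2>&1; then
        echo "warning: java not found on PATH" >&2
    fi

    # Remind the user if the destination is not yet on $PATH.
    case ":$PATH:" in
        *":$dest_dir:"*) ;;
        *) echo "note: add $dest_dir to your PATH" >&2 ;;
    esac
}
```

For example, install_ngsutilsj ./ngsutilsj "$HOME/bin" would place the file in a personal bin directory (the paths here are assumptions; adjust them to your system).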
Commands available
There are many commands available. For information on an individual command, run ngsutilsj help command, where command is one of the names listed below.
ngsutilsj - Data wrangling for NGS
---------------------------------------
Usage: ngsutilsj cmd [options]
Available commands:
[bam]
bam-basecall - For a BAM file, output the basecalls (ACGTN) at each genomic position.
bam-best - With reads mapped to two bam references, determine which reads mapped best to each
bam-bins - Quickly count the number of reads that fall into bins (bins assigned based on 5' end of the first read)
bam-check - Checks a BAM file to make sure it is valid
bam-clean - Cleans a BAM file from common errors
bam-concat - Concatenates BAM files (handles @RG, @PG)
bam-count - Counts the number of reads for genes (GTF), within a BED region, or by bins (--gtf, --bed, or --bins required)
bam-coverage* - Scans an aligned BAM file and calculates the number of reads covering each base
bam-discord - Extract all discordant reads from a BAM file
bam-dups - Flags or removes duplicate reads
bam-expressed - For a BAM file, output all regions with significant coverage in BED format.
bam-extract* - Extract reads from a BAM file using either VCF or BED file coordinates
bam-filter - Filters out reads based upon various criteria
bam-phase* - Given a BAM and VCF file, split the BAM file into smaller phased files.
bam-pir* - For a BAM file, extract the phase-informative reads
bam-readgroup - Add a read group ID to each read in a BAM file
bam-refcount - Only count the number of reads aligned to each reference (only R1 is counted)
bam-removeclipping* - Removes clipped bases (soft) from BAM file reads
bam-sample* - Create a list of read names sampled randomly from a file
bam-softclip* - Calculate the amount of soft-clipping at each position across a genome. Output in bedGraph format.
bam-sort - Sort a BAM file using HTSJDK ordering
bam-split - Split a BAM file into smaller files
bam-stats - Stats about a BAM file and the library orientation
bam-tobed* - Writes read positions to a BED6 file
bam-tobedgraph - Calculate coverage for an aligned BAM file in BedGraph format.
bam-tobedpe* - Writes read positions to a BEDPE file
bam-tofasta - For a BAM file, output the basecalls (ACGTN) at each genomic position.
bam-tofastq - Export the read sequences from a BAM file in FASTQ format
bam-varcall* - For a BAM file, call variants from a reference genome (germline, tumor-only, or tumor/normal).
bam-wps* - For each location in the genome, calculate a window positioning score (WPS)
[bed]
bed-clean - Cleans score entries to be an integer
bed-count - Given reference and query BED files, count the number of query regions that are contained within each reference region
bed-merge* - Given two or more (sorted) BED files, combine the BED annotations into one output BED file.
bed-nearest - Given reference and query BED files, for each query region, find the nearest reference region
bed-reduce - Merge overlapping BED regions
bed-resize - Resize BED regions (extend or shrink)
bed-sort - Sort BED file (by coordinate or name)
bed-stats - Summary statistics for a BED file
bed-tobed3 - Convert a BED3+ file to a strict BED3 file
bed-tobed6 - Convert a BED6+ file to a strict BED6 file
bed-tobedgraph - Convert a BED file to a coverage BedGraph file
bed-tobedpe - Combine two name-sorted BED files to BEDPE format
bed-tofasta - Extract FASTA sequences based on BED coordinates
bedpe-tobed - Convert a BEDPE file to a BED file
[fasta]
fasta-bins - For an indexed FASTA file, calculate bins and write them to a BED file
fasta-filter - Filter out sequences from a FASTA file
fasta-gc - Determine the GC% for a given region or bins (DNA)
fasta-genreads* - Generate mock reads from a reference FASTA file (DNA)
fasta-grep - Find subsequences (exact match) in a FASTA file
fasta-mask - Mask regions of a FASTA reference
fasta-motif - Scan a FASTA file for matches to a motif (DNA only)
fasta-names - Display sequence names from a FASTA file
fasta-random - Generate random DNA sequences
fasta-revcomp* - Reverse complement the sequences in a FASTA file (DNA)
fasta-split - Split a FASTA file into a new file for each sequence or a number of sequences
fasta-subseq - Extract subsequences from a FASTA file (optionally, indexed)
fasta-tag - Add prefix/suffix to FASTA sequence names
fasta-tri - Determine the trinucleotide counts for a genome (DNA)
fasta-wrap - Change the sequence wrapping length of a FASTA file
[fastq]
fastq-barcode - Given Illumina 1.8+ naming, find the lane/barcodes included
fastq-check - Verify a FASTQ single, paired, or interleaved file(s)
fastq-demux - Splits a FASTQ file based on lane/barcode values
fastq-filter - Filters reads from a FASTQ file.
fastq-merge - Merges two FASTQ files (R1/R2) into one interleaved file.
fastq-overlap - For paired FASTQ files, attempt to find overlapping reads
fastq-remix* - Remix one or more different FASTQ files in different sampling ratios
fastq-separate - Splits an interleaved FASTQ file by read number.
fastq-sort - Sorts a FASTQ file
fastq-split - Splits a FASTQ file into smaller files
fastq-stats - Statistics about a FASTQ file
fastq-tobam - Converts a FASTQ file (or two paired files) into an unmapped BAM file
fastq-tofasta - Convert FASTQ sequences to FASTA format
[gtf]
gtf-export - Export gene annotations from a GTF file as BED regions
gtf-geneinfo* - Calculate information about genes (based on GTF model)
gtf-tofasta - Export transcript/protein sequences as FASTA files
[annotation]
annotate-gtf - Annotate GTF gene regions (for tab-delimited text, BED, or BAM input)
annotate-repeat* - Calculates Repeat masker annotations
tab-annotate - Annotate a tab-delimited file (Tabix-indexed)
tabix - Query a tabix file
tabix-concat - Re-combine split tabix files (natural sorting by filename)
tabix-split - Splits a tabix file by ref/chrom
tdf-join - Annotate a tab-delimited file (not-tabix indexed)
[vcf]
vcf-annotate - Annotate a VCF file
vcf-bedcount - For a given BED file, count the number of variants present in the BED region
vcf-check - Validate a VCF file
vcf-chrfix - Changes the reference (chrom) format (Ensembl/UCSC)
vcf-clearfilter - Remove a filter from a VCF file
vcf-concat - Concatenate VCF files that have different variants but the same samples.
vcf-concat-n - Concatenate VCF files that have different variants but the same samples.
vcf-consensus* - For a set of VCF files, extract consensus variants (SNV and SV)
vcf-count - For each variant in a VCF file, count the number of ref and alt alleles in a BAM file
vcf-effect* - For variants, calculate the effect of each variant on its gene
vcf-export - Export information from a VCF file as a tab-delimited file
vcf-filter - Filter a VCF file
vcf-header-info - Extract annotation/named fields from a VCF file
vcf-merge - Combine multiple VCF files that have different annotations, but the same variants.
vcf-peptide* - For variants, extract new peptides (SNV or indel, in coding regions)
vcf-refbuild - Verify the reference/build for a VCF file
vcf-remove-flags - Replace all INFO flags with a comma separated list
vcf-rename - Change the names of samples
vcf-reorder - Reorders samples in a VCF file
vcf-sample-export - Write sample FORMAT values to a tab-delimited file, with one sample per line
vcf-samples - Output the sample names in a VCF file
vcf-stats - Summary statistics about a VCF file
vcf-strip - Remove all annotation and sample information but keep output in VCF format
vcf-svtofasta - Extract SV flanking sequences and write them to a FASTA file.
vcf-tobed - Export allele positions from a VCF file to BED format
vcf-tobedpe - Convert a SV VCF file to BEDPE format
vcf-tocount - Convert a VCF to a count file using the AD (or RO/AO) format field
vcf-tstv - Calculate a Ts/TV ratio for SNVs
[help]
help - Help for a specific command
license - Show the license
version - Show program version
* = experimental command
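To look up any of the commands listed above, the help form described earlier (ngsutilsj help followed by the command name) can be wrapped in a small function that also reports when the program is not installed. This is a minimal sketch: ngs_help is a hypothetical wrapper name, and only the ngsutilsj help invocation itself comes from this README.

```shell
# Sketch: show help for one ngsutilsj command, with a clear message
# when the program is not on $PATH. ngs_help is a hypothetical wrapper.
ngs_help() {
    if [ -z "$1" ]; then
        echo "usage: ngs_help <command>" >&2
        return 2
    fi
    if ! command -v ngsutilsj >/dev/null 2>&1; then
        echo "ngsutilsj not found on PATH" >&2
        return 127
    fi
    # Same invocation as running "ngsutilsj help <command>" by hand.
    ngsutilsj help "$1"
}
```

For example, ngs_help bam-filter is equivalent to running ngsutilsj help bam-filter directly.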