dxda

v0.6.2
Published: Apr 19, 2024 License: Apache-2.0 Imports: 28 Imported by: 2

README

dx-download-agent

CLI tool to manage the download of large quantities of files from DNAnexus

NOTE: This is an early version of this tool and is undergoing testing in a variety of settings. Please contact DNAnexus if you are interested in seeing if this tool is appropriate for your application.

Quick Start

To get started with dx-download-agent, download the latest pre-compiled binary from the release page. The download agent takes a manifest file as input:

  • manifest_file: A BZ2-compressed JSON manifest file that describes, at minimum, the following information for a download. For example:
{
  "project-AAAA": [
    {
      "id": "file-XXXX",
      "name": "foo",
      "folder": "/path/to",
      "parts": {
        "1": { "size": 10, "md5": "49302323" },
        "2": { "size": 5,  "md5": "39239329" }
      }
    },
    "..."
  ],
  "project-BBBB": [ "..." ]
}

To start a download process, first generate a DNAnexus API token that remains valid for the entire period over which you plan to download the files. Store it in the following environment variable:

export DX_API_TOKEN=<INSERT API TOKEN HERE>

If no API token is provided, the download agent falls back to the ~/.dnanexus_config/environment.json file, which is also used by the dx-toolkit.

To start the download:

dx-download-agent download exome_bams_manifest.json.bz2
Obtained token using environment
Creating manifest database manifest.json.bz2.stats.db
Required disk space = 1.2TB, available = 3.6TB
Logging detailed output to: manifest.json.bz2.download.log
Preparing files for download
Downloading files using 8 threads
Downloaded 11904/1098469 MB     124/11465 Parts (104.0 MB written to disk in the last 60s)

A continuous report on download progress is written to the screen. Prior to starting the data transfer, a check is made to see that there is sufficient disk space for the entire list of files. If not, an error is reported, and nothing is downloaded. Download speed reflects not only network bandwidth, but also the IO capability of your machine.

A download log contains more detailed information about the download should an error occur. If an error does occur and you do not understand how to deal with it, please contact support@dnanexus.com with the log file attached and we will assist you.

Please note that rerunning the dx-download-agent download command will NOT re-download any previously downloaded files that were subsequently moved, deleted, or modified. Please run dx-download-agent inspect (described below) to detect any changes to previously downloaded files and mark them for re-download. See Moving downloaded files for more details.

You can query the progress of an existing download in a separate terminal:

dx-download-agent progress exome_bams_manifest.json.bz2

and you will get a brief summary of the status of the downloads:

21.6 MB/sec	1184/27078 MB	18/327 Parts Downloaded and written to disk

To check the integrity of the downloaded files, you can run

dx-download-agent inspect exome_bams_manifest.json.bz2

This command will perform an inspection of the files and ensure that their MD5sums match the manifest. If a file is missing or an MD5sum does not match, the download agent will report the affected files and you can then run dx-download-agent download again to re-download the affected files.

Execution options

  • -max_threads (integer): maximum # of concurrent threads to use when downloading or inspecting files

For example, the command

dx-download-agent download -max_threads=20 exome_bams_manifest.json.bz2

will create a worker pool of 20 threads that will download parts of files in parallel. At most 20 workers will perform downloads at any time. Varying this number gives some control over the download rate.

Manifest stats database spec

Information about which parts have been downloaded is maintained in a sqlite3 database file. It contains information similar to that in the JSON manifest, plus an additional bytes_fetched field.

Table name: manifest_stats

Fields (all fields are strings unless otherwise specified)

  • file_id: file ID for file part
  • project: project ID for file part
  • name: name of file
  • folder: folder containing file on DNAnexus
  • part_id (integer): part ID for file
  • md5: md5sum for part ID
  • size (integer): size of the part
  • block_size (integer): primary block size of file (assumed equal to size except for the last part)
  • bytes_fetched (integer <= size): total number of bytes downloaded

It is up to the implementation to decide whether or not bytes_fetched is updated in a more coarse- vs. fine-grained fashion. For example, bytes_fetched can be updated only when the part download is complete. In this case, its values will only be 0 or the value of size.
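
The fields above are enough to locate a part inside its file and to summarize progress. The following Go sketch shows both, under two labeled assumptions that the spec does not state outright: part IDs start at 1, and a part's byte offset is (part_id - 1) * block_size. The type and function names are hypothetical.

```go
package main

import "fmt"

// partRow is a hypothetical in-memory copy of one manifest_stats row.
type partRow struct {
	PartID       int
	Size         int64
	BlockSize    int64
	BytesFetched int64
}

// offset assumes 1-based part IDs and a uniform block size for all but
// the last part, so earlier parts each occupy exactly BlockSize bytes.
func (p partRow) offset() int64 {
	return int64(p.PartID-1) * p.BlockSize
}

// complete reports whether the whole part has been fetched.
func (p partRow) complete() bool {
	return p.BytesFetched == p.Size
}

// progress totals the completed parts and the fetched and expected bytes.
func progress(rows []partRow) (partsDone int, bytesDone, bytesTotal int64) {
	for _, r := range rows {
		bytesTotal += r.Size
		bytesDone += r.BytesFetched
		if r.complete() {
			partsDone++
		}
	}
	return partsDone, bytesDone, bytesTotal
}

func main() {
	rows := []partRow{
		{PartID: 1, Size: 10, BlockSize: 10, BytesFetched: 10},
		{PartID: 2, Size: 5, BlockSize: 10, BytesFetched: 0},
	}
	done, fetched, total := progress(rows)
	fmt.Printf("%d/%d parts, %d/%d bytes, part 2 starts at byte %d\n",
		done, len(rows), fetched, total, rows[1].offset())
}
```

With coarse-grained updates, as in this example, bytes_fetched is either 0 or equal to size, so the byte count only advances when a part finishes.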

The manifest includes four fields for each file: file_id, project, name, and parts. If all four are specified, the file is assumed to be live and closed, making it available for download. If the parts field is omitted, the file will be described on the platform. Bulk describes are used to do this efficiently for many files in batch. Files that are archived or not closed cannot be downloaded, and will trigger an error.

It is possible to download DNAx symbolic links, which do not have parts. The required fields for symbolic links are file_id, project, and name. Note that a symbolic link has a global MD5 checksum, which is checked at the end of the download.

Running using Docker

Besides the self-contained Go binary, we also provide a Dockerized version. It can be used in much the same way as the standalone executable, except that you must mount your local folder and provide your DX API token.

Currently, we offer the following image tags:

  • latest - the most recent build of the master branch
  • 0.5.11, 0.5.12, ... - dedicated tags for each release (starting from 0.5.11)
  • <commit_hash> - development builds for each commit on the master branch

Example usage:

$ docker run -v $PWD:/workdir -w /workdir -e DX_API_TOKEN=$DX_API_TOKEN dnanexus/dxda:latest download -max_threads=20 manifest.json.bz2

where:

  • $PWD is the path to the directory on your computer where files will be downloaded
  • DX_API_TOKEN is a token to access our platform (see Quick Start)

Proxy and TLS settings

To direct dx-download-agent to a proxy, please set the HTTP_PROXY environment variable, for example export HTTP_PROXY=hostname:port. HTTPS_PROXY is also supported.

By default, dx-download-agent uses certificates installed on the system to create secure connections. If your system requires an additional TLS certificate and dx-download-agent doesn't appear to be using a certificate installed on your system, there are two options, in order of preference. First, set the DX_TLS_CERTIFICATE_FILE environment variable to the path of the PEM-encoded TLS certificate file required by your parent organization. As a last resort, you can connect insecurely, skipping certificate verification altogether, by setting DX_TLS_SKIP_VERIFY=true. Use this for testing purposes only.

Creating and filtering manifest files

For convenience, the create_manifest.py file in the scripts/ directory is one way to create manifest files for the download agent. This script requires that the dx-toolkit is installed on your system and that you are logged in to the DNAnexus platform. An example of how it can be used:

python3 create_manifest.py "Project:/Folder" --recursive --output_file "myfiles.manifest.json.bz2"

Here, a manifest is created recursively for all files in the folder Folder of the project named Project.

The manifest can be subsequently filtered using the filter_manifest.py script. For example, if you want to capture files in a particular folder (e.g. Folder) with testcall in them (e.g. /Folder/ALL.chr22._testcall_20190222.genotypes.vcf.gz), you can run the command:

$ python3 filter_manifest.py manifest.json.bz2 '^/Folder.*testcall.*'

where the second argument given to the script is a regular expression on the entire path (folder + filename).
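
The same filtering idea can be expressed in Go. The sketch below matches a regular expression against folder plus filename, as described above. It is not a port of filter_manifest.py: the names entry and filterEntries are hypothetical, and Go's MatchString searches anywhere in the path, so the pattern relies on its own ^ anchor.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
)

// entry is a hypothetical, reduced manifest entry: just folder and name.
type entry struct {
	Folder string
	Name   string
}

// filterEntries keeps the entries whose full path matches the pattern.
func filterEntries(entries []entry, pattern string) ([]entry, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var kept []entry
	for _, e := range entries {
		if re.MatchString(path.Join(e.Folder, e.Name)) {
			kept = append(kept, e)
		}
	}
	return kept, nil
}

func main() {
	entries := []entry{
		{Folder: "/Folder", Name: "ALL.chr22._testcall_20190222.genotypes.vcf.gz"},
		{Folder: "/Folder", Name: "README.txt"},
		{Folder: "/Other", Name: "chr1_testcall.vcf.gz"},
	}
	kept, err := filterEntries(entries, `^/Folder.*testcall.*`)
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	for _, e := range kept {
		fmt.Println(path.Join(e.Folder, e.Name))
	}
}
```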

Splitting manifest files

In some cases it may be desirable to split the download manifest into multiple manifest files for testing purposes or to manage multiple downloads of an entire data set across different environments. To split the file, we provide a simple Python utility that requires no additional packages in the scripts/ directory. For example, executing the command:

python3 scripts/split_manifest.py manifest.json.bz2 -n 100

will create manifest files each containing 100 files per project. For example, if there are 300 total files in manifest.json.bz2, this command will create three files named manifest_001.json.bz2, manifest_002.json.bz2, and manifest_003.json.bz2. Each of these files can be used independently with the download agent.
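
The splitting logic can be sketched in Go as follows. This is an illustration only and not a port of split_manifest.py. It assumes that output number k holds the k-th group of n files from every project, which is one plausible reading of "100 files per project". The names splitManifest and outputName are hypothetical, and files are reduced to their IDs.

```go
package main

import "fmt"

// splitManifest divides each project's file list into groups of n and
// places the k-th group of every project into the k-th output manifest.
func splitManifest(m map[string][]string, n int) []map[string][]string {
	if n < 1 {
		n = 1
	}
	var out []map[string][]string
	for project, files := range m {
		for i := 0; i*n < len(files); i++ {
			// Grow the output list until chunk i exists.
			for len(out) <= i {
				out = append(out, map[string][]string{})
			}
			end := (i + 1) * n
			if end > len(files) {
				end = len(files)
			}
			out[i][project] = files[i*n : end]
		}
	}
	return out
}

// outputName follows the naming pattern shown above (1-based, zero padded).
func outputName(index int) string {
	return fmt.Sprintf("manifest_%03d.json.bz2", index)
}

func main() {
	files := make([]string, 300)
	for i := range files {
		files[i] = fmt.Sprintf("file-%04d", i)
	}
	chunks := splitManifest(map[string][]string{"project-AAAA": files}, 100)
	for i, c := range chunks {
		fmt.Println(outputName(i+1), "holds", len(c["project-AAAA"]), "files")
	}
}
```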

Development environment

This repository can be used directly as a Go module as well. In the cmd/dx-download-agent directory, the dx-download-agent.go file is an example of how it can be used.

For developing and experimenting with the source inside an isolated Docker environment, the Dockerfile in this repository may be a good start.

Moving downloaded files

After a successful download (and, optionally, a post-download inspection), it should be safe to move files to your desired location.

WARNING: In general we advise against moving files during the course of a download, although moving them can be safe in certain special cases. The download agent works primarily from a lightweight database that records which parts of files have and haven't been downloaded. This means that even if you move files, the download agent won't notice until you run the inspect subcommand, which performs post-download checks of file integrity on disk. The inspect command will notice that the files are missing and update the database, and the next download command will attempt to download them again. Therefore, if you move completed files and don't run the inspect subcommand, the download agent should continue from where it left off. That said, the danger in moving files is that a file's download may not yet be complete, in which case you will have moved an incomplete file.

Additional notes

Documentation

Constants

const (
	KiB = 1024
	MiB = 1024 * KiB
	GiB = 1024 * MiB
)
const (

	// Extracted automatically with a shell script, so keep the format:
	// version = XXXX
	Version = "v0.6.2"
)

Variables

var UserAgent = fmt.Sprintf("dxda/%s (%s)", Version, runtime.GOOS)

Example: 'dxda/v0.1.2 (linux)'

Functions

func DxAPI added in v0.4.0

func DxAPI(
	ctx context.Context,
	client *http.Client,
	numRetries int,
	dxEnv *DXEnvironment,
	api string,
	payload string) ([]byte, error)

DxAPI - Function to wrap a generic API call to DNAnexus

func DxDescribeBulkObjects added in v0.4.0

func DxDescribeBulkObjects(
	ctx context.Context,
	httpClient *http.Client,
	dxEnv *DXEnvironment,
	projectId string,
	objIds []string) (map[string]DxDescribeDataObject, error)

func DxHttpRequest added in v0.4.0

func DxHttpRequest(
	ctx context.Context,
	client *http.Client,
	numRetries int,
	requestType string,
	URL string,
	headers map[string]string,
	data []byte) (*http.Response, error)

Add retries around the core http-request method

func DxHttpRequestData added in v0.4.0

func DxHttpRequestData(
	ctx context.Context,
	httpClient *http.Client,
	requestType string,
	url string,
	headers map[string]string,
	data []byte,
	dataLen int,
	memoryBuf []byte) error

Read data from a remote URL.

Add retries around the core http-request method, especially in the case of short reads.

func MinInt64 added in v0.4.0

func MinInt64(x, y int64) int64

func NewHttpClient added in v0.4.0

func NewHttpClient() *http.Client

These clients are intended for reuse in the same host. Throwing them away will gradually leak file descriptors.

func PrintLogAndOut added in v0.1.5

func PrintLogAndOut(a string, args ...interface{})

print to the log and to stdout

Types

type DBPart

type DBPart interface {
	// contains filtered or unexported methods
}

a part to be downloaded. Can be: 1) part of a regular file, or 2) part of a symbolic link (a web address)

type DBPartRegular added in v0.4.0

type DBPartRegular struct {
	FileId           string
	Project          string
	FileName         string
	Folder           string
	PartId           int
	Offset           int64
	Size             int
	MD5              string
	BytesFetched     int
	DownloadDoneTime int64 // The time when it completed downloading
}

Part of a dnanexus file

type DBPartSymlink

type DBPartSymlink struct {
	FileId           string
	Project          string
	FileName         string
	Folder           string
	PartId           int
	Offset           int64
	Size             int
	BytesFetched     int
	DownloadDoneTime int64 // The time when it completed downloading
	Url              string
}

symlink parts do not have checksums. There is only a global MD5 checksum on the entire file. There is also no need to get a pre-auth URL for the file.

type DXAuthorization

type DXAuthorization struct {
	AuthToken     string `json:"auth_token"`
	AuthTokenType string `json:"auth_token_type"`
}

DXAuthorization - Basic variables regarding DNAnexus authorization

type DXConfig

type DXConfig struct {
	DXSECURITYCONTEXT    string `json:"DX_SECURITY_CONTEXT"`
	DXAPISERVERHOST      string `json:"DX_APISERVER_HOST"`
	DXPROJECTCONTEXTNAME string `json:"DX_PROJECT_CONTEXT_NAME"`
	DXPROJECTCONTEXTID   string `json:"DX_PROJECT_CONTEXT_ID"`
	DXAPISERVERPORT      string `json:"DX_APISERVER_PORT"`
	DXUSERNAME           string `json:"DX_USERNAME"`
	DXAPISERVERPROTOCOL  string `json:"DX_APISERVER_PROTOCOL"`
	DXCLIWD              string `json:"DX_CLI_WD"`
}

DXConfig - Basic variables regarding DNAnexus environment config

type DXDownloadURL

type DXDownloadURL struct {
	URL     string            `json:"url"`
	Headers map[string]string `json:"headers"`
}

DXDownloadURL ...

type DXEnvironment added in v0.2.2

type DXEnvironment struct {
	ApiServerHost     string `json:"apiServerHost"`
	ApiServerPort     int    `json:"apiServerPort"`
	ApiServerProtocol string `json:"apiServerProtocol"`
	Token             string `json:"token"`
	DxJobId           string `json:"dxJobId"`
}

A subset of the configuration parameters that the dx-toolkit uses.

func GetDxEnvironment added in v0.2.2

func GetDxEnvironment() (DXEnvironment, string, error)

Construct the environment structure. Return an additional string describing the source of the security token.

The DXEnvironment has its fields set from the following sources, in order (with later items overriding earlier items):

  1. Hardcoded defaults
  2. Environment variables of the format DX_*
  3. Configuration file ~/.dnanexus_config/environment.json

If no token can be obtained from these methods, an empty environment is returned. If the token was received from the 'DX_API_TOKEN' environment variable, the second return value will be the string 'environment'. If it is obtained from a DNAnexus configuration file, the second return value will be '.dnanexus_config/environment.json'.

type DXFile

type DXFile interface {
	// contains filtered or unexported methods
}

one interface representing both symbolic links and data files

type DXFileRegular added in v0.4.0

type DXFileRegular struct {
	Folder string
	Id     string
	ProjId string
	Name   string
	Size   int64
	Parts  []DXPart
}

Data file on dnanexus

type DXFileSymlink

type DXFileSymlink struct {
	Folder string
	Id     string
	ProjId string
	Name   string
	Size   int64
	MD5    string
}

type DXPart

type DXPart struct {
	// we add the part-id in a post-processing step
	Id int

	// these fields are in the input JSON
	MD5  string `json:"md5"`
	Size int    `json:"size"`
}

description of part of a file

type DXSymlink

type DXSymlink struct {
	Drive string
	MD5   string
}

a full URL for symbolic links, with a corresponding MD5 checksum for the entire file. Drive and MD5 of symlink

type DownloadStatus

type DownloadStatus struct {
	NumParts         int64
	NumBytes         int64
	NumPartsComplete int64
	NumBytesComplete int64

	// periodicity of progress report
	ProgressInterval time.Duration

	// Size of window in nanoseconds where to look for
	// completed downloads
	MaxWindowSize int64
}

DownloadStatus ...

type DxDescribeDataObject added in v0.4.0

type DxDescribeDataObject struct {
	Id            string
	ProjId        string
	Name          string
	State         string
	ArchivalState string
	Folder        string
	Size          int64
	Parts         map[string]DXPart // a list of parts for a DNAx file
	Symlink       *DXSymlink
}

Description of a DNAx data object

type DxDescribeRaw added in v0.4.0

type DxDescribeRaw struct {
	Id            string            `json:"id"`
	ProjId        string            `json:"project"`
	Name          string            `json:"name"`
	State         string            `json:"state"`
	ArchivalState string            `json:"archivalState"`
	Size          int64             `json:"size"`
	Parts         map[string]DXPart `json:"parts"`
	Symlink       *DxSymlinkRaw     `json:"symlinkPath,omitempty"`
	MD5           *string           `json:"md5,omitempty"`
	Drive         *string           `json:"drive,omitempty"`
}

type DxDescribeRawTop added in v0.4.0

type DxDescribeRawTop struct {
	Describe DxDescribeRaw `json:"describe"`
}

type DxError added in v0.4.0

type DxError struct {
	EType                 string
	Message               string
	HttpCode              int
	HttpCodeHumanReadable string
}

func (*DxError) Error added in v0.4.0

func (dxErr *DxError) Error() string

type DxErrorJson added in v0.4.0

type DxErrorJson struct {
	E DxErrorJsonInternal `json:"error"`
}

type DxErrorJsonInternal added in v0.4.0

type DxErrorJsonInternal struct {
	EType   string `json:"type"`
	Message string `json:"message"`
}

type DxSymlinkRaw added in v0.4.0

type DxSymlinkRaw struct {
	Url string `json:"object"`
}

type HttpError added in v0.4.0

type HttpError struct {
	Message             []byte
	StatusCode          int
	StatusHumanReadable string
}

func (*HttpError) Error added in v0.4.0

func (hErr *HttpError) Error() string

implement the error interface

type JobInfo

type JobInfo struct {
	// contains filtered or unexported fields
}

JobInfo ...

type Manifest

type Manifest struct {
	Files []DXFile
}

Manifest.

  1. a map from file-id to a description of a regular file
  2. a map from file-id to a description of a symbolic link

func ReadManifest

func ReadManifest(fname string, dxEnv *DXEnvironment) (*Manifest, error)

read the manifest from a file into a memory structure

type ManifestRaw added in v0.4.0

type ManifestRaw map[string][]ManifestRawFile

Raw manifest. A list provided by the user of projects and files within them that need to be downloaded.

The representation is a mapping from project-id to a list of files

type ManifestRawFile added in v0.4.0

type ManifestRawFile struct {
	Folder string             `json:"folder"`
	Id     string             `json:"id"`
	Name   string             `json:"name"`
	Parts  *map[string]DXPart `json:"parts,omitempty"`
}

File description in the manifest. Additional details will be gathered with an API call.

type Opts

type Opts struct {
	NumThreads int  // number of workers to process downloads
	Verbose    bool // verbose logging
	GcInfo     bool // Garbage collection statistics
}

Configuration options for the download agent

type Reply added in v0.4.0

type Reply struct {
	Results []DxDescribeRawTop `json:"results"`
}

type Request added in v0.4.0

type Request struct {
	Objects         []string                   `json:"objects"`
	DescribeOptions map[string]map[string]bool `json:"describe"`
}

type RequestWithScope added in v0.5.4

type RequestWithScope struct {
	Objects         []string                   `json:"id"`
	Scope           map[string]string          `json:"scope"`
	DescribeOptions map[string]map[string]bool `json:"describe"`
}

type State added in v0.4.0

type State struct {
	// contains filtered or unexported fields
}

func NewDxDa added in v0.4.0

func NewDxDa(dxEnv DXEnvironment, fname string, optsRaw Opts) *State

Initialize the state

func (*State) CheckDiskSpace added in v0.4.0

func (st *State) CheckDiskSpace() error

CheckDiskSpace ... Check that we have enough disk space for all downloaded files

func (*State) CheckFileIntegrity added in v0.4.0

func (st *State) CheckFileIntegrity() bool

check the on-disk integrity of all files. Return false if there is an integrity problem.

func (*State) Close added in v0.4.0

func (st *State) Close()

func (*State) CreateManifestDB added in v0.4.0

func (st *State) CreateManifestDB(manifest Manifest, fname string)

Read the manifest file, and build a database with an empty state for each part in each file.

func (*State) DownloadManifestDB added in v0.4.0

func (st *State) DownloadManifestDB(fname string)

Download all the files that are mentioned in the manifest.

func (*State) DownloadProgressOneTime added in v0.4.0

func (st *State) DownloadProgressOneTime(timeWindowNanoSec int64) string

DownloadProgressOneTime ... Report on progress so far

func (*State) InitDownloadStatus added in v0.4.0

func (st *State) InitDownloadStatus()

InitDownloadStatus ...

func (*State) PrepareFilesForDownload added in v0.5.5

func (st *State) PrepareFilesForDownload(m Manifest)

create an empty file for each download filepath.

TODO: Optimize this for only files that need to be downloaded

Directories

Path Synopsis
cmd
