ENTREZ DIRECT - README
ENTREZ DIRECT: COMMAND LINE ACCESS TO NCBI ENTREZ DATABASES
Searching, retrieving, and parsing data from NCBI databases through the Unix command line.

INTRODUCTION

Entrez Direct (EDirect) provides access to the NCBI's suite of interconnected databases (biomedical literature, nucleotide and protein sequence, molecular structure, gene, genome assembly, gene expression, clinical variation, etc.) from a Unix terminal window. Search terms are given in command-line arguments. Individual operations are connected with Unix pipes to allow construction of multi-step queries. Selected records can then be retrieved in a variety of formats.

EDirect also includes an argument-driven function that simplifies the extraction of data from document summaries or other results that are in structured XML format. This can eliminate the need for writing custom software to answer ad hoc questions. Queries can move seamlessly between EDirect commands and Unix utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.

EDirect consists of a set of scripts that are downloaded to the user's computer. If you extract the archive in your home directory, you may need to enter:

PATH=$PATH:$HOME/edirect

in a terminal window to temporarily add EDirect functions to the PATH environment variable so they can be run by name. You can then try EDirect by copying the sample query below and pasting it into the terminal window for execution:

esearch -db pubmed -query "Beadle AND Tatum AND Neurospora" |
efetch -format abstract

PROGRAMMATIC ACCESS
Several underlying network services provide access to different facets of Entrez. These include searching by indexed terms, looking up precomputed neighbors or links, filtering results by date or category, and downloading record summaries or reports. The same functionalities are available on the web or when using programmatic methods.
EDirect navigation programs (esearch, elink, efilter, and efetch) communicate by means of a small structured message, which can be passed invisibly between operations with a Unix pipe. The message includes the current database, so it does not need to be given as an argument after the first step.
All EDirect commands are designed to work on large sets of data. There is no need to write a script to loop over records one at a time. Intermediate results are stored on the Entrez history server. For best performance, obtain an API Key from NCBI, and place the following line in your .bash_profile file:
export NCBI_API_KEY=user_api_key_goes_here
Each program also has a -help command that prints detailed information about available arguments.
NAVIGATION FUNCTIONS
Esearch performs a new Entrez search using terms in indexed fields. It requires a -db argument for the database name and uses -query to obtain the search terms. For PubMed, without field qualifiers, the server uses automatic term mapping to compose a search strategy by translating the supplied query:
esearch -db pubmed -query "selective serotonin reuptake inhibitor"
Search terms can also be qualified with bracketed field names:
esearch -db nucleotide -query "insulin [PROT] AND rodents [ORGN]"
Elink looks up precomputed neighbors within a database, or finds associated records in other databases:
elink -related
elink -target gene
Efilter limits the results of a previous query, with shortcuts that can also be used in esearch:
efilter -molecule genomic -location chloroplast -country sweden
Efetch downloads selected records or reports in a designated format:
efetch -format abstract
ENTREZ EXPLORATION
Individual query commands are connected by a Unix vertical bar pipe symbol:
esearch -db pubmed -query "transposition immunity" | efetch -format medline
PubMed related articles are calculated by a statistical algorithm using the title, abstract, and medical subject headings (MeSH terms). These connections between papers can be used for knowledge discovery.
Lycopene cyclase converts lycopene to beta-carotene, the immediate precursor of vitamin A. An initial search on the enzyme results in 232 articles. Looking up precomputed neighbors returns 14,387 PubMed papers, some of which might be expected to discuss adjacent steps in the biosynthetic pathway:
esearch -db pubmed -query "lycopene cyclase" |
elink -related |
efilter -query "NOT historical article [FILT]" |
efetch -format docsum |
xtract -pattern DocumentSummary -if Author -and Title \
-element Id -first "Author/Name" -element Title |
grep -i -e enzyme -e synthesis |
sort -t $'\t' -k 2,3f |
column -s $'\t' -t |
head -n 10 |
cut -c 1-80
This query returns the PubMed ID, first author name, and article title for PubMed "neighbors" (related citations) of the original publications. It then requires specific words in the resulting rows, sorts alphabetically by author name and title, aligns the columns, and truncates the lines for easier viewing:
2960822 Anton IA A eukaryotic repressor protein, the qa-1S gene prod
5264137 Arroyo-Begovich A In vitro formation of an active multienzyme complex
14942736 BONNER DM Gene-enzyme relationships in Neurospora.
5361218 Caroline DF Pyrimidine synthesis in Neurospora crassa: gene-enz
123642 Case ME Genetic evidence on the organization and action of
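The Unix tail of that pipeline (case-insensitive sort on the author and title columns, then truncation for display) can be tried in isolation. The following is a minimal sketch on toy tab-delimited rows whose values only mimic the output above:

```shell
# Toy rows in the shape (PMID, author, title); sort case-insensitively
# on the author and title fields, then truncate each line for display
printf '14942736\tBONNER DM\tGene-enzyme relationships\n2960822\tAnton IA\tA eukaryotic repressor protein\n' |
sort -t $'\t' -k 2,3f |
cut -c 1-40
```

The column -s $'\t' -t step of the full pipeline is omitted here; it only aligns fields into columns for easier viewing.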
elink -target protein |
efilter -organism mouse |
efetch -format fasta
Linking to the protein database finds 251,887 sequence records, each of which has standardized organism information from the NCBI taxonomy. Limiting to proteins in mice returns 39 records. (Animals do not encode the genes involved in carotene biosynthesis.) Records are then retrieved in FASTA format. As anticipated, the results include the enzyme that splits beta-carotene into two molecules of retinal:
...
>NP_067461.2 beta,beta-carotene 15,15'-dioxygenase isoform 1 [Mus musculus]
MEIIFGQNKKEQLEPVQAKVTGSIPAWLQGTLLRNGPGMHTVGESKYNHWFDGLALLHSFSIRDGEVFYR
SKYLQSDTYIANIEANRIVVSEFGTMAYPDPCKNIFSKAFSYLSHTIPDFTDNCLINIMKCGEDFYATTE
TNYIRKIDPQTLETLEKVDYRKYVAVNLATSHPHYDEAGNVLNMGTSVVDKGRTKYVIFKIPATVPDSKK
KGKSPVKHAEVFCSISSRSLLSPSYYHSFGVTENYVVFLEQPFKLDILKMATAYMRGVSWASCMSFDRED
KTYIHIIDQRTRKPVPTKFYTDPMVVFHHVNAYEEDGCVLFDVIAYEDSSLYQLFYLANLNKDFEEKSRL
TSVPTLRRFAVPLHVDKDAEVGSNLVKVSSTTATALKEKDGHVYCQPEVLYEGLELPRINYAYNGKPYRY
IFAAEVQWSPVPTKILKYDILTKSSLKWSEESCWPAEPLFVPTPGAKDEDDGVILSAIVSTDPQKLPFLL
ILDAKSFTELARASVDADMHLDLHGLFIPDADWNAVKQTPAETQEVENSDHPTDPTAPELSHSENDFTAG
HGGSSL
...
STRUCTURED DATA EXTRACTION
The xtract program uses command-line arguments to direct the conversion of XML data into a tab-delimited table. The -pattern argument divides the results into rows, while placement of data into columns is controlled by -element.
Formatting arguments allow extensive customization of the output. The line break between -pattern objects can be changed with -ret, and the tab character between -element fields can be replaced by -tab.
The -sep argument is used to distinguish multiple elements of the same type, and controls their separation independently of the -tab argument. The -sep value also applies to unrelated -element arguments that are grouped with commas. The query:
efetch -db pubmed -id 6271474,1413997,16589597 -format docsum |
xtract -pattern DocumentSummary -sep "|" -element Id PubDate Name
returns a table with individual author names separated by vertical bars:
6271474 1981 Casadaban MJ|Chou J|Lemaux P|Tu CP|Cohen SN
1413997 1992 Oct Mortimer RK|Contopoulou CR|King JS
16589597 1954 Dec Garber ED
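Because -sep marks the boundaries between same-name values, downstream Unix tools can split the joined field back apart. A small sketch with one hypothetical row in the shape of the output above:

```shell
# Split the |-separated author column into one author per line,
# carrying along the Id from the first column (sample row is made up)
printf '6271474\t1981\tCasadaban MJ|Chou J|Lemaux P|Tu CP|Cohen SN\n' |
awk -F '\t' '{ n = split($3, a, "|"); for (i = 1; i <= n; i++) print $1 "\t" a[i] }'
```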
Selection arguments are specialized derivatives of -element. Among these are positional commands (-first and -last) and numeric processing operations (including -num, -len, -sum, -min, -max, and -avg). There are also functions that perform sequence coordinate conversion (-0-based, -1-based, and -ucsc-based).
NESTED EXPLORATION
Exploration arguments (-pattern, -group, -block, and -subset) limit data extraction to specified regions of the XML, visiting all relevant objects one at a time. This design allows nested exploration of complex, hierarchical data to be controlled by a linear chain of command-line argument statements.
PubmedArticle XML contains the MeSH terms applied to a publication. Each MeSH term can have its own unique set of qualifiers. A single level of nested exploration within the current pattern:
esearch -db gene -query "beta-carotene oxygenase 1" -organism human |
elink -target pubmed | efilter -released last_year | efetch -format xml |
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block MeshHeading \
-pfc "\n" -sep "/" -element DescriptorName,QualifierName
retains the proper association of subheadings for each MeSH term:
30396924
Age Factors
Animals
Cell Cycle Proteins/deficiency/genetics/metabolism
Cellular Senescence/physiology
...
CONDITIONAL EXECUTION
Conditional processing arguments (-if and -unless) restrict exploration by object name and value. These may be used in conjunction with string or numeric constraints:
esearch -db pubmed -query "Casadaban MJ [AUTH]" |
efetch -format xml |
xtract -pattern PubmedArticle -if "#Author" -lt 6 \
-block Author -if LastName -is-not Casadaban \
-sep ", " -tab "\n" -element LastName,Initials |
sort-uniq-count-rank
to select papers with fewer than 6 authors and print a table of the most frequent coauthors:
11 Chou, J
8 Cohen, SN
7 Groisman, EA
...
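sort-uniq-count-rank is a small EDirect helper script; its core behavior can be approximated with standard Unix tools, shown here on a hypothetical list of repeated names (the real helper formats its counts slightly differently):

```shell
# Count identical lines and rank them by descending frequency,
# approximating what sort-uniq-count-rank does
printf 'Chou, J\nCohen, SN\nChou, J\nGroisman, EA\nChou, J\nCohen, SN\n' |
sort | uniq -c | sort -k1,1nr
```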
SAVING DATA IN VARIABLES
A value can be recorded in a variable and used wherever needed. Variables are created by a hyphen followed by a name consisting of a string of capital letters or digits (e.g., -PMID). Values are retrieved by placing an ampersand before the variable name (e.g., "&PMID") in an -element statement:
efetch -db pubmed -id 3201829,6301692,781293 -format xml |
xtract -pattern PubmedArticle -PMID MedlineCitation/PMID \
-block Author -element "&PMID" \
-sep " " -tab "\n" -element Initials,LastName
producing a list of authors, with the PubMed Identifier (PMID) in the first column of each row:
3201829 JR Johnston
3201829 CR Contopoulou
3201829 RK Mortimer
6301692 MA Krasnow
6301692 NR Cozzarelli
781293 MJ Casadaban
The variable can be used even though the original object is no longer visible inside the -block section.
SEQUENCE QUALIFIERS
The NCBI represents sequence records in a data model based on the central dogma of molecular biology. A sequence can have multiple features, which carry information about the biology of a given region, including the transformations involved in gene expression. A feature can have multiple qualifiers, which store specific details about that feature (e.g., name of the gene, genetic code used for translation).
The data hierarchy is easily explored using a -pattern {sequence} -group {feature} -block {qualifier} construct. As a convenience, an -insd helper function is provided for generating the appropriate nested extraction commands from feature and qualifier names on the command line. Processing the results of a search on cone snail venom:
esearch -db protein -query "conotoxin" -feature mat_peptide |
efetch -format gpc |
xtract -insd complete mat_peptide "%peptide" product peptide |
grep -i conotoxin | sort -t $'\t' -u -k 2,2n
returns the accession, length, name, and sequence for a sample of neurotoxic peptides:
ADB43131.1 15 conotoxin Cal 1b LCCKRHHGCHPCGRT
ADB43128.1 16 conotoxin Cal 5.1 DPAPCCQHPIETCCRR
AIC77105.1 17 conotoxin Lt1.4 GCCSHPACDVNNPDICG
ADB43129.1 18 conotoxin Cal 5.2 MIQRSQCCAVKKNCCHVG
ADD97803.1 20 conotoxin Cal 1.2 AGCCPTIMYKTGACRTNRCR
AIC77085.1 21 conotoxin Bt14.8 NECDNCMRSFCSMIYEKCRLK
ADB43125.1 22 conotoxin Cal 14.2 GCPADCPNTCDSSNKCSPGFPG
AIC77154.1 23 conotoxin Bt14.19 VREKDCPPHPVPGMHKCVCLKTC
...
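The closing sort -t $'\t' -u -k 2,2n step above keeps a single record per peptide length. A toy sketch of that behavior, using made-up accessions rather than real data:

```shell
# -u with the sort key restricted to the numeric second column keeps
# one line per distinct length value (toy data)
printf 'AAA00001.1\t15\tX\nAAA00002.1\t15\tY\nAAA00003.1\t16\tZ\n' |
sort -t $'\t' -u -k 2,2n |
wc -l
```

The two length-15 records collapse to one, so the count printed is 2.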
INSTALLATION
EDirect consists of a set of scripts and programs that are downloaded to the user's computer.
EDirect will run on Unix and Macintosh computers that have the Perl language installed, and under the Cygwin Unix-emulation environment on Windows PCs.
To install the EDirect software, copy the following commands and paste them into a terminal window:
cd ~
/bin/bash
perl -MNet::FTP -e \
'$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
$ftp->login; $ftp->binary;
$ftp->get("/entrez/entrezdirect/edirect.tar.gz");'
gunzip -c edirect.tar.gz | tar xf -
rm edirect.tar.gz
builtin exit
export PATH=${PATH}:$HOME/edirect >& /dev/null || setenv PATH "${PATH}:$HOME/edirect"
./edirect/setup.sh
This downloads several scripts into an "edirect" folder in the user's home directory. The setup.sh script then downloads any missing Perl modules, and may print an additional command for updating the PATH environment variable in the user's configuration file. Copy that command, if present, and paste it into the terminal window to complete the installation process. The editing instructions will look something like:
echo "export PATH=\$PATH:\$HOME/edirect" >> $HOME/.bash_profile
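Note the backslash-escaped dollar signs in that echo command: they prevent $PATH and $HOME from being expanded when the line is written, so the literal export statement lands in the profile. A sketch of the same pattern, using a temporary file as a stand-in for .bash_profile:

```shell
# Append the export line to a temporary stand-in for .bash_profile
profile=$(mktemp)
echo "export PATH=\$PATH:\$HOME/edirect" >> "$profile"
# The variable references were written literally, not expanded
grep 'edirect' "$profile"
rm -f "$profile"
```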
DOCUMENTATION
Documentation for EDirect is on the web at:
http://www.ncbi.nlm.nih.gov/books/NBK179288
Information on how to obtain an API Key is described in this NCBI blogpost:
https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities
Questions or comments on EDirect may be sent to info@ncbi.nlm.nih.gov.
@@ -21,6 +21,7 @@ if [ "$#" -gt 0 ]
then
target="$1"
MASTER=$(cd "$target" && pwd)
CONFIG=${MASTER}
shift
else
if [ -z "${EDIRECT_PUBMED_MASTER}" ]
@@ -70,7 +71,7 @@ do
mkdir -p "$MASTER/$dir"
done
for dir in Current Indexed Inverted Merged Pubmed
for dir in Indexed Inverted Merged Pubmed
do
mkdir -p "$WORKING/$dir"
done
@@ -98,3 +99,20 @@ fetch-pubmed -path "$MASTER/Archive" |
xtract -pattern Author -if Affiliation -contains Medicine \
-pfx "Archive is " -element Initials
echo ""
if [ -n "$CONFIG" ]
then
target=bash_profile
if ! grep "$target" "$HOME/.bashrc" >/dev/null 2>&1
then
if [ ! -f $HOME/.$target ] || grep 'bashrc' "$HOME/.$target" >/dev/null 2>&1
then
target=bashrc
fi
fi
echo ""
echo "For convenience, please execute the following to save the archive path to a variable:"
echo ""
echo " echo \"export EDIRECT_PUBMED_MASTER='${CONFIG}'\" >>" "\$HOME/.$target"
echo ""
fi
@@ -43,7 +43,7 @@ use File::Spec;
# EDirect version number
$version = "10.5";
$version = "10.9";
BEGIN
{
@@ -116,6 +116,7 @@ sub clearflags {
$batch = false;
$chr_start = -1;
$chr_stop = -1;
$class = "";
$clean = false;
$cmd = "";
$compact = false;
@@ -143,6 +144,7 @@ sub clearflags {
$http = "";
$id = "";
$input = "";
$internal = false;
$journal = "";
$json = false;
$just_num = false;
@@ -170,6 +172,7 @@ sub clearflags {
$query = "";
$raw = false;
$related = false;
$result = 0;
$rldate = 0;
$seq_start = 0;
$seq_stop = 0;
@@ -192,6 +195,7 @@ sub clearflags {
$verbose = false;
$volume = "";
$web = "";
$released = "";
$word = false;
$year = "";
@@ -210,10 +214,20 @@ sub clearflags {
$api_key = "";
$api_key = $ENV{NCBI_API_KEY} if defined $ENV{NCBI_API_KEY};
$abbrv_flag = false;
if (defined $ENV{EDIRECT_DO_AUTO_ABBREV} && $ENV{EDIRECT_DO_AUTO_ABBREV} eq "true" ) {
$abbrv_flag = true;
}
}
sub do_sleep {
if ( $internal ) {
Time::HiRes::usleep(1000);
return;
}
if ( $api_key ne "" ) {
if ( $log ) {
print STDERR "sleeping 1/10 second\n";
@@ -360,8 +374,17 @@ sub read_aliases {
sub adjust_base {
if ( $basx ne "" ) {
$internal = false;
}
if ( $basx eq "" ) {
if ( $internal ) {
$base = "https://eutils-internal.ncbi.nlm.nih.gov/entrez/eutils/";
return;
}
# if base not overridden, check URL of previous query, stick with main or colo site,
# since history server data for EUtils does not copy between locations, by design
@@ -512,12 +535,14 @@ sub get_count {
$output = get ($url);
if ( ! defined $output ) {
print STDERR "Failure of '$url'\n";
print STDERR "Failure of get_count '$url'\n";
$result = 1;
return "", "";
}
if ( $output eq "" ) {
print STDERR "No get_count output returned from '$url'\n";
$result = 1;
return "", ""
}
@@ -531,10 +556,12 @@ sub get_count {
if ( $errx ne "" ) {
close (STDOUT);
$result = 1;
die "ERROR in count output: $errx\nURL: $url\n\n";
}
if ( $numx eq "" ) {
$result = 1;
die "Count value not found in count output - WebEnv $webx\n";
}
@@ -607,7 +634,8 @@ sub get_uids {
if ( defined $data ) {
$keep_trying = false;
} else {
print STDERR "Failure of '$url'\n";
print STDERR "Failure of get_uids '$url'\n";
$result = 1;
}
}
if ( $keep_trying ) {
@@ -687,12 +715,14 @@ sub do_post_yielding_ref {
$rslt = get ($urlx);
if ( ! defined $rslt ) {
print STDERR "Failure of '$urlx'\n";
print STDERR "Failure of do_get '$urlx'\n";
$result = 1;
return "";
}
if ( $rslt eq "" ) {
print STDERR "No do_get output returned from '$urlx'\n";
$result = 1;
return "";
}
@@ -717,7 +747,15 @@ sub do_post_yielding_ref {
if ( $res->is_success) {
$rslt = $res->content_ref;
} else {
print STDERR $res->status_line . "\n";
$stts = $res->status_line;
print STDERR $stts . "\n";
if ( $stts eq "429 Too Many Requests" ) {
if ( $api_key eq "" ) {
print STDERR "PLEASE REQUEST AN API_KEY FROM NCBI\n";
} else {
print STDERR "TOO MANY REQUESTS EVEN WITH API_KEY\n";
}
}
}
if ( $$rslt eq "" ) {
@@ -1052,12 +1090,32 @@ sub write_edirect {
# wrapper to detect command line errors
my $abbrev_help = qq{
To enable argument auto abbreviation resolution, run:
export EDIRECT_DO_AUTO_ABBREV="true"
in the terminal, or add that line to your .bash_profile configuration file.
};
sub MyGetOptions {
my $help_msg = shift @_;
if ( $abbrv_flag ) {
Getopt::Long::Configure("auto_abbrev");
} else {
Getopt::Long::Configure("no_auto_abbrev");
}
if ( !GetOptions(@_) ) {
if ( $abbrv_flag ) {
die $help_msg;
} else {
print $help_msg;
die $abbrev_help;
}
} elsif (@ARGV) {
die ("Entrez Direct does not support positional arguments.\n"
. "Please remember to quote parameter values containing\n"
@@ -1089,6 +1147,7 @@ sub ecntc {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
"internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -1151,8 +1210,10 @@ Spell Check
Publication Filters
-pub abstract, clinical, english, free, historical,
journal, last_week, last_month, last_year,
medline, preprint, published, review, structured
journal, medline, preprint, published, review,
structured
-journal pnas, "j bacteriol", ...
-released last_week, last_month, last_year, prev_years
Sequence Filters
@@ -1170,6 +1231,11 @@ Gene Filters
-status alive
-type coding, pseudo
SNP Filters
-class acceptor, donor, frameshift, indel, intron,
missense, nonsense, synonymous
Miscellaneous Arguments
-label Alias for query step
@@ -1180,6 +1246,8 @@ sub process_extras {
my $frst = shift (@_);
my $publ = shift (@_);
my $rlsd = shift (@_);
my $jrnl = shift (@_);
my $ctry = shift (@_);
my $fkey = shift (@_);
my $locn = shift (@_);
@@ -1188,8 +1256,11 @@ sub process_extras {
my $sorc = shift (@_);
my $stat = shift (@_);
my $gtyp = shift (@_);
my $clss = shift (@_);
$publ = lc($publ);
$rlsd = lc($rlsd);
$jrnl = lc($jrnl);
$ctry = lc($ctry);
$fkey = lc($fkey);
$bmol = lc($bmol);
@@ -1198,6 +1269,7 @@ sub process_extras {
$sorc = lc($sorc);
$stat = lc($stat);
$gtyp = lc($gtyp);
$clss = lc($clss);
%pubHash = (
'abstract' => 'has abstract [FILT]',
@@ -1219,6 +1291,15 @@ sub process_extras {
'trial' => 'clinical trial [FILT]',
);
%releasedHash = (
'last_month' => 'published last month [FILT]',
'last month' => 'published last month [FILT]',
'last_week' => 'published last week [FILT]',
'last week' => 'published last week [FILT]',
'last_year' => 'published last year [FILT]',
'last year' => 'published last year [FILT]',
);
@featureArray = (
"-10_signal",
"-35_signal",
@@ -1362,6 +1443,17 @@ sub process_extras {
'viruses' => 'viruses [FILT]',
);
%snpHash = (
'acceptor' => 'splice acceptor variant [FXN]',
'donor' => 'splice donor variant [FXN]',
'frameshift' => 'frameshift [FXN]',
'indel' => 'cds indel [FXN]',
'intron' => 'intron [FXN]',
'missense' => 'missense [FXN]',
'nonsense' => 'nonsense [FXN]',
'synonymous' => 'synonymous codon [FXN]',
);
%sourceHash = (
'ddbj' => 'srcdb ddbj [PROP]',
'embl' => 'srcdb embl [PROP]',
@@ -1387,34 +1479,62 @@ sub process_extras {
my @working = ();
my $suffix = "";
my $suffix1 = "";
my $suffix2 = "";
my $is_published = false;
my $is_prev_year = false;
if ( $frst ne "" ) {
push (@working, $frst);
}
if ( $publ ne "" ) {
if ( defined $pubHash{$publ} ) {
$val = $pubHash{$publ};
# -pub can use comma-separated list
my @pbs = split (',', $publ);
foreach $pb (@pbs) {
if ( defined $pubHash{$pb} ) {
$val = $pubHash{$pb};
push (@working, $val);
} elsif ( $publ eq "published" ) {
$suffix = "published";
} elsif ( $pb eq "published" ) {
$is_published = true;
} else {
die "\nUnrecognized -pub argument '$publ', use efilter -help to see available choices\n\n";
die "\nUnrecognized -pub argument '$pb', use efilter -help to see available choices\n\n";
}
}
}
if ( $rlsd ne "" ) {
if ( defined $releasedHash{$rlsd} ) {
$val = $releasedHash{$rlsd};
push (@working, $val);
} elsif ( $rlsd eq "prev_years" ) {
$is_prev_year = true;
} else {
die "\nUnrecognized -released argument '$rlsd', use efilter -help to see available choices\n\n";
}
}
if ( $jrnl ne "" ) {
$val = $jrnl . " [JOUR]";
push (@working, $val);
}
if ( $ctry ne "" ) {
$val = "country " . $ctry . " [TEXT]";
push (@working, $val);
}
if ( $fkey ne "" ) {
if ( grep( /^$fkey$/, @featureArray ) ) {
$val = $fkey . " [FKEY]";
# -feature can use comma-separated list
my @fts = split (',', $fkey);
foreach $ft (@fts) {
if ( grep( /^$ft$/, @featureArray ) ) {
$val = $ft . " [FKEY]";
push (@working, $val);
} else {
die "\nUnrecognized -feature argument '$fkey', use efilter -help to see available choices\n\n";
die "\nUnrecognized -feature argument '$ft', use efilter -help to see available choices\n\n";
}
}
}
@@ -1475,12 +1595,25 @@ sub process_extras {
}
}
if ( $clss ne "" ) {
if ( defined $snpHash{$clss} ) {
$val = $snpHash{$clss};
push (@working, $val);
} else {
die "\nUnrecognized -class argument '$clss', use efilter -help to see available choices\n\n";
}
}
my $xtras = join (" AND ", @working);
if ( $suffix eq "published" ) {
if ( $is_published ) {
$xtras = $xtras . " NOT ahead of print [FILT]";
}
if ( $is_prev_year ) {
$xtras = $xtras . " NOT published last year [FILT]";
}
return $xtras;
}
@@ -1514,6 +1647,7 @@ sub efilt {
MyGetOptions(
$filt_help,
"query=s" => \$query,
"q=s" => \$query,
"sort=s" => \$sort,
"days=i" => \$rldate,
"mindate=s" => \$mndate,
@@ -1523,7 +1657,9 @@ sub efilt {
"field=s" => \$field,
"spell" => \$spell,
"pairs=s" => \$pair,
"journal=s" => \$journal,
"pub=s" => \$pub,
"released=s" => \$released,
"country=s" => \$country,
"feature=s" => \$feature,
"location=s" => \$location,
@@ -1532,6 +1668,7 @@ sub efilt {
"source=s" => \$source,
"status=s" => \$status,
"type=s" => \$gtype,
"class=s" => \$class,
"api_key=s" => \$api_key,
"email=s" => \$emaddr,
"tool=s" => \$tuul,
@@ -1539,6 +1676,7 @@ sub efilt {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
"internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -1554,7 +1692,7 @@ sub efilt {
}
# process special filter flags, add to query string
$query = process_extras ( $query, $pub, $country, $feature, $location, $molecule, $organism, $source, $status, $gtype );
$query = process_extras ( $query, $pub, $released, $journal, $country, $feature, $location, $molecule, $organism, $source, $status, $gtype, $class );
if ( -t STDIN ) {
if ( $query eq "" ) {
@@ -1585,8 +1723,8 @@ sub efilt {
$email = $emaddr;
}
if ( $query eq "" && $rldate < 1 and $mndate eq "" and $mxdate eq "" ) {
die "Must supply -query or -days or -mindate and -maxdate arguments on command line\n";
if ( $query eq "" && $sort eq "" && $rldate < 1 and $mndate eq "" and $mxdate eq "" ) {
die "Must supply -query or -sort or -days or -mindate and -maxdate arguments on command line\n";
}
binmode STDOUT, ":utf8";
@@ -1897,6 +2035,7 @@ sub esmry {
my $silent = shift (@_);
my $verbose = shift (@_);
my $debug = shift (@_);
my $internal = shift (@_);
my $log = shift (@_);
my $http = shift (@_);
my $alias = shift (@_);
@@ -2175,6 +2314,7 @@ Format Examples
summary Summary
gene
full_report Detailed Report
gene_table Gene Table
native Gene Report
native asn.1 Entrezgene ASN.1
@@ -2357,9 +2497,7 @@ sub xml_to_json {
my $conv = $xc->XMLin($data);
convert_bools($conv);
my $jc = JSON::PP->new->ascii->pretty->allow_nonref;
my $result = $jc->encode($conv);
$data = "$result";
$data = $jc->encode($conv);
return $data;
}
@@ -2405,6 +2543,7 @@ sub eftch {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
"internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"raw" => \$raw,
@@ -2456,6 +2595,10 @@ sub eftch {
$style = "withparts";
}
if ( $type eq "gbc" and $mode eq "" ) {
$mode = "xml";
}
if ( -t STDIN and not @ARGV ) {
} elsif ( $db ne "" and $id ne "" ) {
} else {
@@ -2517,7 +2660,7 @@ sub eftch {
if ( $type eq "docsum" or $fnc eq "-summary" ) {
esmry ( $dbase, $web, $key, $num, $id, $mode, $min, $max, $tool, $email,
$silent, $verbose, $debug, $log, $http, $alias, $basx );
$silent, $verbose, $debug, $internal, $log, $http, $alias, $basx );
return;
}
@@ -2683,10 +2826,11 @@ sub eftch {
$arg = "db=$dbase&id=$id";
if ( $type eq "gb" ) {
if ( $type eq "gb" or $type eq "gbc" ) {
if ( $style eq "withparts" or $style eq "master" ) {
$arg .= "&rettype=gbwithparts";
$arg .= "&rettype=$type";
$arg .= "&retmode=$mode";
$arg .= "&style=$style";
} elsif ( $style eq "conwithfeat" or $style eq "withfeat" or $style eq "contigwithfeat" ) {
$arg .= "&rettype=$type";
$arg .= "&retmode=$mode";
@@ -2826,10 +2970,11 @@ sub eftch {
$arg = "db=$dbase&query_key=$key&WebEnv=$web";
if ( $type eq "gb" ) {
if ( $type eq "gb" or $type eq "gbc" ) {
if ( $style eq "withparts" or $style eq "master" ) {
$arg .= "&rettype=gbwithparts";
$arg .= "&rettype=$type";
$arg .= "&retmode=$mode";
$arg .= "&style=$style";
} elsif ( $style eq "conwithfeat" or $style eq "withfeat" or $style eq "contigwithfeat" ) {
$arg .= "&rettype=$type";
$arg .= "&retmode=$mode";
@@ -3034,6 +3179,7 @@ sub einfo {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
"internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -3125,6 +3271,7 @@ sub einfo {
if ( ! defined $output ) {
print STDERR "Failure of '$url'\n";
$result = 1;
return;
}
@@ -3575,6 +3722,7 @@ sub elink {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
"internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -3929,6 +4077,7 @@ sub entfy {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
"internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -4102,6 +4251,7 @@ sub epost {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
"internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -4317,6 +4467,7 @@ sub epost {
}
if ( $combo eq "" ) {
$result = 1;
die "Failure of post to find data to load\n";
}
@@ -4369,6 +4520,7 @@ sub espel {
$spell_help,
"db=s" => \$db,
"query=s" => \$query,
"q=s" => \$query,
"api_key=s" => \$api_key,
"email=s" => \$emaddr,
"tool=s" => \$tuul,
@@ -4376,6 +4528,7 @@ sub espel {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
"internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -4472,6 +4625,7 @@ sub ecitmtch {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
"internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -4586,6 +4740,7 @@ sub eprxy {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
"internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -4821,13 +4976,16 @@ sub esrch {
$srch_help,
"db=s" => \$db,
"query=s" => \$query,
"q=s" => \$query,
"sort=s" => \$sort,
"days=i" => \$rldate,
"mindate=s" => \$mndate,
"maxdate=s" => \$mxdate,
"datetype=s" => \$dttype,
"label=s" => \$lbl,
"journal=s" => \$journal,
"pub=s" => \$pub,
"released=s" => \$released,
"country=s" => \$country,
"feature=s" => \$feature,
"location=s" => \$location,
@@ -4836,6 +4994,7 @@ sub esrch {
"source=s" => \$source,
"status=s" => \$status,
"type=s" => \$gtype,
"class=s" => \$class,
"clean" => \$clean,
"field=s" => \$field,
"word" => \$word,
@@ -4853,6 +5012,7 @@ sub esrch {
"silent" => \$silent,
"verbose" => \$verbose,
"debug" => \$debug,
"internal" => \$internal,
"log" => \$log,
"compact" => \$compact,
"http=s" => \$http,
@@ -4899,7 +5059,7 @@ sub esrch {
binmode STDOUT, ":utf8";
# support all efilter shortcut flags in esearch (undocumented)
$query = process_extras ( $query, $pub, $country, $feature, $location, $molecule, $organism, $source, $status, $gtype );
$query = process_extras ( $query, $pub, $released, $journal, $country, $feature, $location, $molecule, $organism, $source, $status, $gtype, $class );
if ( $query eq "" ) {
die "Must supply -query search expression on command line\n";
@@ -5250,3 +5410,5 @@ if ( scalar @ARGV > 0 and $ARGV[0] eq "-version" ) {
close (STDIN);
close (STDOUT);
close (STDERR);
exit $result;
@@ -21,6 +21,7 @@ if [ "$#" -gt 0 ]
then
target="$1"
MASTER=$(cd "$target" && pwd)
CONFIG=${MASTER}
shift
else
if [ -z "${EDIRECT_PUBMED_MASTER}" ]
@@ -70,7 +71,7 @@ do
mkdir -p "$MASTER/$dir"
done
for dir in Current Indexed Inverted Merged Pubmed
for dir in Indexed Inverted Merged Pubmed
do
mkdir -p "$WORKING/$dir"
done
@@ -108,19 +109,9 @@ echo "$seconds seconds"
REF=$seconds
echo ""
seconds_start=$(date "+%s")
echo "Collecting PubMed Records"
pm-current "$WORKING/Current" "$MASTER/Archive"
seconds_end=$(date "+%s")
seconds=$((seconds_end - seconds_start))
echo "$seconds seconds"
COL=$seconds
echo ""
seconds_start=$(date "+%s")
echo "Indexing PubMed Records"
cd "$WORKING/Current"
pm-index "$WORKING/Indexed"
pm-index "$MASTER/Archive" "$WORKING/Indexed"
seconds_end=$(date "+%s")
seconds=$((seconds_end - seconds_start))
echo "$seconds seconds"
@@ -160,7 +151,6 @@ echo ""
echo "DWN $DWN seconds"
echo "POP $POP seconds"
echo "REF $REF seconds"
echo "COL $COL seconds"
echo "IDX $IDX seconds"
echo "INV $INV seconds"
echo "MRG $MRG seconds"
@@ -172,3 +162,20 @@ fetch-pubmed -path "$MASTER/Archive" |
xtract -pattern Author -if Affiliation -contains Medicine \
-pfx "Archive and Index are " -element Initials
echo ""
if [ -n "$CONFIG" ]
then
target=bash_profile
if ! grep "$target" "$HOME/.bashrc" >/dev/null 2>&1
then
if [ ! -f $HOME/.$target ] || grep 'bashrc' "$HOME/.$target" >/dev/null 2>&1
then
target=bashrc
fi
fi
echo ""
echo "For convenience, please execute the following to save the archive path to a variable:"
echo ""
echo " echo \"export EDIRECT_PUBMED_MASTER='${CONFIG}'\" >>" "\$HOME/.$target"
echo ""
fi
#!/bin/sh
target=""
mode="query"
debug=false
while [ $# -gt 0 ]
do
case "$1" in
-h | -help | --help )
mode=help
break
;;
-debug )
debug=true
shift
;;
-path | -master )
target=$2
shift
shift
;;
-count )
mode="count"
shift
;;
-counts )
mode="counts"
shift
;;
-countr )
mode="countr"
shift
;;
-countp )
mode="countp"
shift
;;
-query | -phrase )
mode="query"
shift
;;
-search )
mode="search"
shift
;;
-exact )
mode="exact"
shift
;;
-mock )
mode="mock"
shift
;;
-mocks )
mode="mocks"
shift
;;
-mockx )
mode="mockx"
shift
;;
-* )
exec >&2
echo "$0: Unrecognized option $1"
exit 1
;;
* )
break
;;
esac
done
if [ $mode = "help" ]
then
cat <<EOF
USAGE: $0
[-path path_to_pubmed_master]
-count | -counts | -search | -exact | [-query]
query arguments
EXAMPLE: local-phrase-search -query catabolite repress* AND protease inhibit*
EOF
exit
fi
if [ -z "$target" ]
then
if [ -z "${EDIRECT_PUBMED_MASTER}" ]
then
echo "Must supply path to postings files or set EDIRECT_PUBMED_MASTER environment variable"
exit 1
else
MASTER="${EDIRECT_PUBMED_MASTER}"
MASTER=${MASTER%/}
target="$MASTER/Postings"
fi
else
argument="$target"
target=$(cd "$argument" && pwd)
target=${target%/}
case "$target" in
*/Postings ) ;;
* ) target=$target/Postings ;;
esac
fi
osname=`uname -s | sed -e 's/_NT-.*$/_NT/; s/^MINGW[0-9]*/CYGWIN/'`
if [ "$osname" = "CYGWIN_NT" -a -x /bin/cygpath ]
then
target=`cygpath -w "$target"`
fi
target=${target%/}
if [ "$debug" = true ]
then
echo "mode: $mode, path: '$target', args: '$*'"
exit
fi
case "$mode" in
count )
rchive -path "$target" -count "$*"
;;
counts )
rchive -path "$target" -counts "$*"
;;
countr )
rchive -path "$target" -countr "$*"
;;
countp )
rchive -path "$target" -countp "$*"
;;
query )
rchive -path "$target" -query "$*"
;;
search )
rchive -path "$target" -search "$*"
;;
exact )
rchive -path "$target" -exact "$*"
;;
mock )
rchive -path "$target" -mock "$*"
;;
mocks )
rchive -path "$target" -mocks "$*"
;;
mockx )
rchive -path "$target" -mockx "$*"
;;
esac
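The option loop above scans leading flags, remembers the last mode flag seen, and hands the remaining arguments to rchive as the query. A minimal sketch of the same dispatch logic in Go (the `parseMode` name is hypothetical, and the `-path`/`-debug` handling is omitted for brevity):

```go
package main

import (
	"fmt"
	"strings"
)

// parseMode mirrors the shell option loop: leading flags select the
// rchive mode, the first non-flag argument starts the query string,
// and an unrecognized flag is an error.
func parseMode(args []string) (mode string, query string, err error) {
	mode = "query"
	i := 0
	for i < len(args) {
		arg := args[i]
		switch arg {
		case "-count", "-counts", "-countr", "-countp",
			"-search", "-exact", "-mock", "-mocks", "-mockx":
			mode = strings.TrimPrefix(arg, "-")
			i++
		case "-query", "-phrase":
			mode = "query"
			i++
		default:
			if strings.HasPrefix(arg, "-") {
				return "", "", fmt.Errorf("unrecognized option %s", arg)
			}
			// remaining arguments form the query, joined as in "$*"
			return mode, strings.Join(args[i:], " "), nil
		}
	}
	return mode, "", nil
}

func main() {
	mode, query, _ := parseMode([]string{"-exact", "catabolite", "repression"})
	fmt.Println(mode, "->", query)
}
```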
......@@ -43,7 +43,7 @@ use File::Spec;
# nquire version number
$version = "10.5";
$version = "10.9";
BEGIN
{
......@@ -63,10 +63,12 @@ BEGIN
}
use lib $LibDir;
use JSON::PP;
use LWP::UserAgent;
use POSIX;
use URI::Escape;
use Net::FTP;
use XML::Simple;
# definitions
......@@ -81,6 +83,7 @@ sub clearflags {
$alias = "";
$debug = false;
$http = "";
$j2x = false;
$output = "";
}
......@@ -229,6 +232,33 @@ sub do_uri_escape {
return $rslt;
}
sub convert_bools {
my %unrecognized;
local *_convert_bools = sub {
my $ref_type = ref($_[0]);
if (!$ref_type) {
# Nothing.
}
elsif ($ref_type eq 'HASH') {
_convert_bools($_) for values(%{ $_[0] });
}
elsif ($ref_type eq 'ARRAY') {
_convert_bools($_) for @{ $_[0] };
}
elsif (
$ref_type eq 'JSON::PP::Boolean' || $ref_type eq 'Types::Serialiser::Boolean'
) {
$_[0] = $_[0] ? 1 : 0;
}
else {
++$unrecognized{$ref_type};
}
};
&_convert_bools;
}
# nquire executes an external URL query from command line arguments
my $nquire_help = qq{
......@@ -439,8 +469,103 @@ Federated Query
}" |
xtract -pattern result -block binding -element "binding\@name" literal
BioThings Queries
nquire -variant variant "chr6:g.26093141G>A" -fields dbsnp.gene |
xtract -pattern gene -element \@geneid
nquire -gene query -q "symbol:OPN1MW" -species 9606 |
xtract -pattern hits -element "\@_id"
nquire -gene query -q "symbol:OPN1MW AND taxid:9606" |
xtract -pattern hits -element "\@_id"
nquire -gene gene 2652 -fields pathway.wikipathways |
xtract -pattern pathway -element "\@id"
nquire -gene query -q "pathway.wikipathways.id:WP455" -size 300 |
xtract -pattern hits -element "\@_id"
nquire -chem query -q "drugbank.targets.uniprot:P05231 AND drugbank.targets.actions:inhibitor" -fields hgvs |
xtract -pattern hits -element "\@_id"
EDirect Expansion
ExtractIDs() {
xtract -pattern BIO_THINGS -block Id -tab "\\n" -element "Id"
}
WrapIDs() {
xtract -wrp BIO_THINGS -pattern opt -wrp "Type" -lbl "\$1" \\
-wrp "Count" -num "\$2" -block "\$2" -wrp "Id" -element "\$3" |
xtract -format
}
nquire -gene query -q "symbol:OPN1MW AND taxid:9606" |
WrapIDs entrezgene hits "\@entrezgene" |
ExtractIDs |
while read geneid
do
nquire -gene gene "\$geneid" -fields pathway.wikipathways
done |
WrapIDs pathway.wikipathways.id pathway "\@id" |
ExtractIDs |
while read pathid
do
nquire -gene query -q "pathway.wikipathways.id:\$pathid" -size 300
done |
WrapIDs entrezgene hits "\@entrezgene" |
ExtractIDs |
sort -n
};
my @pubchem_properties = qw(
MolecularFormula
MolecularWeight
CanonicalSMILES
IsomericSMILES
InChI
InChIKey
IUPACName
XLogP
ExactMass
MonoisotopicMass
TPSA
Complexity
Charge
HBondDonorCount
HBondAcceptorCount
RotatableBondCount
HeavyAtomCount
IsotopeAtomCount
AtomStereoCount
DefinedAtomStereoCount
UndefinedAtomStereoCount
BondStereoCount
DefinedBondStereoCount
UndefinedBondStereoCount
CovalentUnitCount
Volume3D
XStericQuadrupole3D
YStericQuadrupole3D
ZStericQuadrupole3D
FeatureCount3D
FeatureAcceptorCount3D
FeatureDonorCount3D
FeatureAnionCount3D
FeatureCationCount3D
FeatureRingCount3D
FeatureHydrophobeCount3D
ConformerModelRMSD3D
EffectiveRotorCount3D
ConformerCount3D
Fingerprint2D
);
sub nquire {
# nquire -url http://... -tag value -tag value | ...
......@@ -450,10 +575,19 @@ sub nquire {
$pfx = "";
$amp = "";
$pat = "";
$sfx = "";
@args = @ARGV;
$max = scalar @args;
%biothingsHash = (
'-gene' => 'http://mygene.info/v3',
'-variant' => 'http://myvariant.info/v1',
'-chem' => 'http://mychem.info/v1',
'-drug' => 'http://c.biothings.io/v1',
'-taxon' => 'http://t.biothings.io/v1',
);
if ( $max < 1 ) {
return;
}
......@@ -515,7 +649,6 @@ sub nquire {
$ftp->cwd($dir) or die "Unable to change to $dir: ", $ftp->message;
$ftp->binary or warn "Unable to set binary mode: ", $ftp->message;
if (! -e $fl) {
if (! $ftp->get($fl, "/dev/stdout") ) {
my $msg = $ftp->message;
chomp $msg;
......@@ -524,12 +657,11 @@ sub nquire {
}
}
}
}
return;
}
}
# if present, -http get or -get must be next
# if present, -http get or -get must be next (-http post or -post are now also allowed)
# nquire -get -url "http://collections.mnh.si.edu/services/resolver/resolver.php" -voucher "Birds:625456"
......@@ -544,6 +676,9 @@ sub nquire {
} elsif ( $pat eq "-get" ) {
$i++;
$http = "get";
} elsif ( $pat eq "-post" ) {
$i++;
$http = "post";
}
}
......@@ -605,6 +740,13 @@ sub nquire {
if ( $i < $max ) {
$url = "https://eutilstest.ncbi.nlm.nih.gov/entrez/eutils";
}
} elsif ( $pat eq "-qa" ) {
# shortcut for eutils QA base (undocumented)
$i++;
if ( $i < $max ) {
$url = "http://qa.ncbi.nlm.nih.gov/entrez/eutils";
}
} elsif ( $pat eq "-hydra" ) {
# internal citation match request (undocumented)
$i++;
......@@ -617,6 +759,7 @@ sub nquire {
$amp = "&";
$i++;
}
} elsif ( $pat eq "-revhist" ) {
# internal sequence revision history request (undocumented)
$i++;
......@@ -627,6 +770,47 @@ sub nquire {
$amp = "&";
$i++;
}
} elsif ( $pat eq "-pubchem" ) {
# shortcut for PubChem Power User Gateway REST service base (undocumented)
# nquire -pubchem "compound/name/creatine/property" "IUPACName,MolecularWeight,MolecularFormula" "XML"
$i++;
if ( $i < $max ) {
$url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug";
if ( $i + 2 == $max && $args[$i] eq "compound" ) {
# even shorter shortcut
# nquire -pubchem compound creatine
$pat = $args[$i + 1];
if ( $pat =~ /^-(.+)/ ) {
} elsif ( $pat !~ /\// ) {
$i = $i + 2;
$url .= "/compound/name/";
$pat = map_macros ($pat);
$url .= $pat;
$url .= "/property/";
$sfx = join(",", @pubchem_properties);
$url .= $sfx;
$url .= "/XML";
}
}
}
} elsif ( defined $biothingsHash{$pat} ) {
# shortcuts for biothings services (undocumented)
$i++;
$url = $biothingsHash{$pat};
if ( $http eq "" ) {
$http = "get";
}
$j2x = true;
} elsif ( $pat eq "-wikipathways" ) {
# shortcut for webservice.wikipathways.org (undocumented)
$i++;
if ( $i < $max ) {
$url = "http://webservice.wikipathways.org";
}
} elsif ( $pat eq "-biosample" ) {
# internal biosample_chk request on live database (undocumented)
$i++;
......@@ -709,6 +893,33 @@ sub nquire {
# perform query
$output = do_post ($url, $arg);
if ( $j2x ) {
my $jc = JSON::PP->new->ascii->pretty->allow_nonref;
my $conv = $jc->decode($output);
convert_bools($conv);
my $result = XMLout($conv, SuppressEmpty => undef);
# remove newlines, tabs, space between tokens, compress runs of spaces
$result =~ s/\r/ /g;
$result =~ s/\n/ /g;
$result =~ s/\t//g;
$result =~ s/ +/ /g;
$result =~ s/> +</></g;
# remove <opt> flanking object
if ( $result =~ /<opt>\s*?</ and $result =~ />\s*?<\/opt>/ ) {
$result =~ s/<opt>\s*?</</g;
$result =~ s/>\s*?<\/opt>/>/g;
}
$output = "$result";
# restore newlines between objects
$output =~ s/> *?</>\n</g;
binmode(STDOUT, ":utf8");
}
print "$output";
}
......
#!/bin/sh
if [ "$#" -eq 0 ]
then
echo "Must supply path for cleaned files"
exit 1
fi
target="$1"
target=${target%/}
for fl in *.xml.gz
do
base=${fl%.xml.gz}
if [ -f "$target/$base.xml.gz" ]
then
continue
fi
echo "$base"
gunzip -c "$fl" |
xtract -mixed -format flush |
gzip > "$target/$base.xml.gz"
done
#!/bin/sh
if [ "$#" -eq 0 ]
then
echo "Must supply path for current files"
exit 1
fi
target="$1"
shift
target=${target%/}
if [ "$#" -eq 0 ]
then
echo "Must supply path for archive files"
exit 1
fi
archive="$1"
shift
archive=${archive%/}
find "$target" -name "*.xml.gz" -delete
fr=0
chunk_size=250000
if [ -n "${EDIRECT_CHUNK_SIZE}" ]
then
chunk_size="${EDIRECT_CHUNK_SIZE}"
fi
to=$((chunk_size - 1))
loop_max=$((50000000 / chunk_size))
seq 1 $((loop_max)) | while read n
do
base=$(printf pubmed%03d $n)
if [ -f "$target/$base.xml.gz" ]
then
fr=$((fr + chunk_size))
to=$((to + chunk_size))
continue
fi
echo "$base XML"
seconds_start=$(date "+%s")
seq -f "%0.f" $fr $to | stream-pubmed -path "$archive" > "$target/$base.xml.gz"
fr=$((fr + chunk_size))
to=$((to + chunk_size))
seconds_end=$(date "+%s")
seconds=$((seconds_end - seconds_start))
echo "$seconds seconds"
fsize=$(wc -c <"$target/$base.xml.gz")
if [ "$fsize" -le 300 ]
then
rm "$target/$base.xml.gz"
exit 0
fi
sleep 1
done
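The loop above walks the PMID space in fixed-size chunks, streaming each inclusive fr-to range into a pubmed%03d file. The fr/to bookkeeping can be checked with a small Go sketch (the `chunkRange` helper is an illustration, not part of EDirect):

```go
package main

import "fmt"

// chunkRange returns the inclusive PMID range covered by archive
// chunk n (1-based), matching the fr/to arithmetic in the script:
// fr starts at 0 and both ends advance by chunk_size per iteration.
func chunkRange(n, chunkSize int) (fr, to int) {
	fr = (n - 1) * chunkSize
	to = n*chunkSize - 1
	return fr, to
}

func main() {
	for n := 1; n <= 3; n++ {
		fr, to := chunkRange(n, 250000)
		fmt.Printf("pubmed%03d covers PMIDs %d-%d\n", n, fr, to)
	}
}
```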
#!/bin/sh
if [ "$#" -eq 0 ]
then
echo "Must supply path to archive files"
exit 1
fi
target="$1"
target=${target%/}
rchive -trie -gzip |
while read dir
do
rm "$target/$dir"
done
......@@ -2,35 +2,74 @@
dir=`dirname "$0"`
if [ "$#" -eq 0 ]
then
echo "Must supply path for archive files"
exit 1
fi
archive="$1"
shift
archive=${archive%/}
if [ "$#" -eq 0 ]
then
echo "Must supply path for indexed files"
exit 1
fi
target="$1"
indexed="$1"
shift
target=${target%/}
indexed=${indexed%/}
find "$target" -name "*.e2x.gz" -delete
cd "$archive"
for fl in *.xml.gz
find "$indexed" -name "*.e2x.gz" -delete
q=0
fr=0
chunk_size=250000
if [ -n "${EDIRECT_CHUNK_SIZE}" ]
then
chunk_size="${EDIRECT_CHUNK_SIZE}"
fi
to=$((chunk_size - 1))
loop_max=$((50000000 / chunk_size))
seq 1 $((loop_max)) | while read n
do
base=${fl%.xml.gz}
echo "$base"
base=$(printf pubmed%03d $n)
if [ -f "$indexed/$base.e2x.gz" ]
then
fr=$((fr + chunk_size))
to=$((to + chunk_size))
continue
fi
echo "$base XML"
seconds_start=$(date "+%s")
if [ -s "$dir/meshtree.txt" ]
then
gunzip -c "$fl" |
seq -f "%0.f" $fr $to |
fetch-pubmed -path "$archive" |
xtract -transform "$dir/meshtree.txt" -e2index |
gzip -1 > "$target/$base.e2x.gz"
gzip -1 > "$indexed/$base.e2x.gz"
else
gunzip -c "$fl" |
seq -f "%0.f" $fr $to |
fetch-pubmed -path "$archive" |
xtract -e2index |
gzip -1 > "$target/$base.e2x.gz"
gzip -1 > "$indexed/$base.e2x.gz"
fi
fr=$((fr + chunk_size))
to=$((to + chunk_size))
seconds_end=$(date "+%s")
seconds=$((seconds_end - seconds_start))
echo "$seconds seconds"
fsize=$(wc -c < "$indexed/$base.e2x.gz")
if [ "$fsize" -le 300 ]
then
rm -f "$indexed/$base.e2x.gz"
exit 0
fi
sleep 1
done
#!/bin/sh
printAdditions() {
f="$1"
base=${f%.xml.gz}
gunzip -c "$f" |
xtract -strict -pattern PubmedArticle \
-block MedlineCitation/PMID -lbl "$base" -sep "." \
-element MedlineCitation/PMID,MedlineCitation/PMID@Version
}
printDeletions() {
f="$1"
base=${f%.xml.gz}
gunzip -c "$f" |
xtract -strict -pattern DeleteCitation \
-block PMID -lbl "$base" -tab "\tD\n" -sep "." -element "PMID,@Version"
}
for fl in *.xml.gz
do
printAdditions "$fl"
printDeletions "$fl"
done > transactions.txt
#!/bin/sh
for fl in *.xml.gz
do
echo "$fl"
base=${fl%.xml.gz}
gunzip -c "$fl" | xtract -strict -compress -format flush > "$base.tmp"
xtract -input "$base.tmp" -pattern PubmedArticle -element MedlineCitation/PMID > "$base.uid"
rchive -input "$base.tmp" -unique "$base.uid" -index MedlineCitation/PMID \
-head "<PubmedArticleSet>" -tail "</PubmedArticleSet>" -pattern PubmedArticle |
xtract -format indent -xml '<?xml version="1.0" encoding="UTF-8"?>' \
-doctype '<!DOCTYPE PubmedArticleSet SYSTEM "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_180601.dtd">' > "$base.xml"
rm "$base.tmp"
rm "$base.uid"
done
rm *.xml.gz
......@@ -45,7 +45,8 @@ deleteCitations() {
reportVersioned() {
inp="$1"
pmidlist=.TO-REPORT
xtract -input "$inp" -pattern PubmedArticle -block MedlineCitation/PMID -if "@Version" -gt 1 -element "PMID" |
xtract -input "$inp" -pattern PubmedArticle \
-block MedlineCitation/PMID -if "@Version" -gt 1 -element "PMID" |
sort -n | uniq > $pmidlist
if [ -s $pmidlist ]
then
......
#!/bin/sh
if [ "$#" -eq 0 ]
then
echo "Must supply path to archive files"
exit 1
fi
target="$1"
find "$target" -name "*.xml.gz" |
sed -e 's,.*/\(.*\)\.xml\.gz,\1,' |
sort -n | uniq
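The find | sed pipeline above strips each archive path down to its bare UID. The same transformation in Go (the `baseName` helper is hypothetical):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// baseName reproduces the sed expression: drop the directory
// components and the trailing .xml.gz suffix.
func baseName(path string) string {
	return strings.TrimSuffix(filepath.Base(path), ".xml.gz")
}

func main() {
	fmt.Println(baseName("/Archive/12/34/1234567.xml.gz")) // 1234567
}
```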
#!/bin/sh
for fl in *.gz
do
echo "$fl"
gunzip -c "$fl" | xtract -mixed -verify
done
......@@ -62,7 +62,7 @@ import (
// RCHIVE VERSION AND HELP MESSAGE TEXT
const rchiveVersion = "10.5"
const rchiveVersion = "10.9"
const rchiveHelp = `
Processing Flags
......@@ -5794,7 +5794,7 @@ func main() {
if dbug {
// drain results, but suppress normal output
for _ = range unsq {
for range unsq {
recordCount++
runtime.Gosched()
}
......@@ -5922,7 +5922,7 @@ func main() {
if dbug {
// drain results, but suppress normal output
for _ = range sptr {
for range sptr {
recordCount++
runtime.Gosched()
}
......@@ -6503,7 +6503,7 @@ func main() {
if dbug {
// drain results, but suppress normal output
for _ = range rslq {
for range rslq {
recordCount++
runtime.Gosched()
}
......
......@@ -73,7 +73,7 @@ my @lwp_deps = qw(Encode::Locale File::Listing
HTTP::Cookies HTTP::Date HTTP::Message HTTP::Negotiate
IO::Socket::SSL LWP::MediaTypes LWP::Protocol::https
Net::HTTP URI WWW::RobotRules Mozilla::CA);
for my $module (@lwp_deps, 'Time::HiRes', 'JSON::PP', 'XML::Simple') {
for my $module (@lwp_deps, 'Time::HiRes', 'JSON::PP', 'MIME::Base64', 'XML::Simple') {
if ( ! CheckAvailability($module) ) {
CPAN::Shell->install($module);
}
......
......@@ -23,7 +23,7 @@ do
fi
if [ -z "$ttl" ]
then
echo "$uid TRIM"
echo "$uid TRIM -- $ttl"
continue
fi
res=`phrase-search -exact "$ttl"`
......
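Further below, xtract factors its reverse-complement logic into a shared reverseComplement helper used by both -revcomp and the new -nucleic subrange argument. A standalone sketch of that logic; the complement table here is a deliberately minimal assumption, whereas the real revComp map also covers lower case and IUPAC ambiguity codes:

```go
package main

import "fmt"

// comp is a minimal complement table for illustration only.
var comp = map[rune]rune{'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'U': 'A'}

// reverseComplement reverses the sequence, then complements every
// base, substituting 'X' for any unrecognized letter.
func reverseComplement(str string) string {
	runes := []rune(str)
	// reverse sequence letters first
	for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
		runes[i], runes[j] = runes[j], runes[i]
	}
	// now complement every base
	for i, ch := range runes {
		if c, ok := comp[ch]; ok {
			runes[i] = c
		} else {
			runes[i] = 'X'
		}
	}
	return string(runes)
}

func main() {
	fmt.Println(reverseComplement("ATGCC")) // GGCAT
}
```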
......@@ -43,7 +43,7 @@ use File::Spec;
# transmute version number
$version = "10.5";
$version = "10.9";
BEGIN
{
......@@ -64,6 +64,7 @@ BEGIN
use lib $LibDir;
use JSON::PP;
use MIME::Base64;
use URI::Escape;
use XML::Simple;
......@@ -102,7 +103,11 @@ Transformation Commands
};
my $type = shift or die "Must supply -decode, -encode, -j2x, or -x2j on command line\n";
# read required function argument
my $type = shift or die "Must supply conversion type on command line\n";
# read optional parent object name
my $obj = shift;
sub transmute {
......@@ -131,7 +136,7 @@ sub transmute {
# perform specific conversions
if ( $type eq "decode" || $type eq "-decode" ) {
if ( $type eq "unescape" || $type eq "-unescape" ) {
$data = uri_unescape($data);
......@@ -144,7 +149,7 @@ sub transmute {
print "$data";
}
if ( $type eq "encode" || $type eq "-encode" ) {
if ( $type eq "escape" || $type eq "-escape" ) {
# compress runs of spaces
$data =~ s/ +/ /g;
......@@ -154,6 +159,20 @@ sub transmute {
print "$data";
}
if ( $type eq "decode64" || $type eq "-decode64" ) {
$data = decode_base64($data);
print "$data";
}
if ( $type eq "encode64" || $type eq "-encode64" ) {
$data = encode_base64($data);
print "$data";
}
if ( $type eq "plain" || $type eq "-plain" ) {
# remove embedded mixed-content tags
......@@ -262,6 +281,96 @@ sub transmute {
print "$data\n";
}
if ( $type eq "docsum" || $type eq "-docsum" ) {
# remove newlines, tabs, space between tokens, compress runs of spaces
$data =~ s/\r/ /g;
$data =~ s/\n/ /g;
$data =~ s/\t//g;
$data =~ s/ +/ /g;
$data =~ s/> +</></g;
# move UID from attribute to object
if ($data !~ /<Id>\d+<\/Id>/i) {
$data =~ s/<DocumentSummary uid=\"(\d+)\">/<DocumentSummary><Id>$1<\/Id>/g;
}
$data =~ s/<DocumentSummary uid=\"\d+\">/<DocumentSummary>/g;
# fix bad encoding
my @accum = ();
my @working = ();
my $prefix = "";
my $suffix = "";
my $docsumset_attrs = '';
if ( $data =~ /(.+?)<DocumentSummarySet(\s+.+?)?>(.+)<\/DocumentSummarySet>(.+)/s ) {
$prefix = $1;
$docsumset_attrs = $2;
my $docset = $3;
$suffix = $4;
my @vals = ($docset =~ /<DocumentSummary>(.+?)<\/DocumentSummary>/sg);
foreach $val (@vals) {
push (@working, "<DocumentSummary>");
if ( $val =~ /<Title>(.+?)<\/Title>/ ) {
my $x = $1;
if ( $x =~ /\&amp\;/ || $x =~ /\&lt\;/ || $x =~ /\&gt\;/ || $x =~ /\</ || $x =~ /\>/ ) {
while ( $x =~ /\&amp\;/ || $x =~ /\&lt\;/ || $x =~ /\&gt\;/ ) {
HTML::Entities::decode_entities($x);
}
# remove mixed-content tags
$x =~ s|<b>||g;
$x =~ s|<i>||g;
$x =~ s|<u>||g;
$x =~ s|<sup>||g;
$x =~ s|<sub>||g;
$x =~ s|</b>||g;
$x =~ s|</i>||g;
$x =~ s|</u>||g;
$x =~ s|</sup>||g;
$x =~ s|</sub>||g;
$x =~ s|<b/>||g;
$x =~ s|<i/>||g;
$x =~ s|<u/>||g;
$x =~ s|<sup/>||g;
$x =~ s|<sub/>||g;
# Re-encode any resulting literal less-than or greater-than signs to avoid breaking the XML.
$x =~ s/</&lt;/g;
$x =~ s/>/&gt;/g;
$val =~ s/<Title>(.+?)<\/Title>/<Title>$x<\/Title>/;
}
}
if ( $val =~ /<Summary>(.+?)<\/Summary>/ ) {
my $x = $1;
if ( $x =~ /\&amp\;/ ) {
HTML::Entities::decode_entities($x);
# Re-encode any resulting literal less-than or greater-than signs to avoid breaking the XML.
$x =~ s/</&lt;/g;
$x =~ s/>/&gt;/g;
$val =~ s/<Summary>(.+?)<\/Summary>/<Summary>$x<\/Summary>/;
}
}
push (@working, $val );
push (@working, "</DocumentSummary>");
}
}
if ( scalar @working > 0 ) {
push (@accum, $prefix);
push (@accum, "<DocumentSummarySet$docsumset_attrs>");
push (@accum, @working);
push (@accum, "</DocumentSummarySet>");
push (@accum, $suffix);
$data = join ("\n", @accum);
$data =~ s/\n\n/\n/g;
}
# restore newlines between objects
$data =~ s/> *?</>\n</g;
print "$data\n";
}
if ( $type eq "json2xml" || $type eq "-json2xml" || $type eq "j2x" || $type eq "-j2x" ) {
# convert JSON to XML
......@@ -284,9 +393,6 @@ sub transmute {
$result =~ s/>\s*?<\/opt>/>/g;
}
# read optional parent object name
my $obj = shift;
if ( defined($obj) && $obj ne "" ) {
my $xml = '<?xml version="1.0" encoding="UTF-8"?>';
......
......@@ -53,7 +53,7 @@ import (
// XTRACT VERSION AND HELP MESSAGE TEXT
const xtractVersion = "10.5"
const xtractVersion = "10.9"
const xtractHelp = `
Overview
......@@ -202,6 +202,7 @@ Text Processing
-terms Partition text at spaces
-words Split at punctuation marks
-pairs Adjacent informative words
-reverse Reverse words in string
-letters Separate individual letters
-clauses Break at phrase separators
-indices Index normalized words
......@@ -209,6 +210,7 @@ Text Processing
Sequence Processing
-revcomp Reverse-complement nucleotide sequence
-nucleic Subrange determines forward or revcomp
Sequence Coordinates
......@@ -270,7 +272,7 @@ Notes
-num and -len selections are synonyms for Object Count (#) and Item Length (%).
-words, -pairs, and -indices convert to lower case.
-words, -pairs, -reverse, and -indices convert to lower case.
Examples
......@@ -457,10 +459,11 @@ Formatted Authors
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
-block PubDate -sep "-" -element Year,Month,MedlineDate \
-block Author -sep " " -tab "" \
-element "&COM" Initials,LastName -COM "(, )"
-element "&COM" Initials,LastName -COM "(|)" |
perl -pe 's/(\t[^\t|]*)\|([^\t|]*)$/$1 and $2/; s/\|([^|]*)$/, and $1/; s/\|/, /g'
1413997 1992-Oct RK Mortimer, CR Contopoulou, JS King
6301692 1983-Apr MA Krasnow, NR Cozzarelli
1413997 1992-Oct RK Mortimer, CR Contopoulou, and JS King
6301692 1983-Apr MA Krasnow and NR Cozzarelli
781293 1976-Jul MJ Casadaban
Medical Subject Headings
......@@ -1640,6 +1643,7 @@ const (
TERMS
WORDS
PAIRS
REVERSE
LETTERS
CLAUSES
INDICES
......@@ -1697,6 +1701,7 @@ const (
ONEBASED
UCSCBASED
REVCOMP
NUCLEIC
ELSE
VARIABLE
VALUE
......@@ -1794,6 +1799,7 @@ var argTypeIs = map[string]ArgumentType{
"-terms": EXTRACTION,
"-words": EXTRACTION,
"-pairs": EXTRACTION,
"-reverse": EXTRACTION,
"-letters": EXTRACTION,
"-clauses": EXTRACTION,
"-indices": EXTRACTION,
......@@ -1823,6 +1829,7 @@ var argTypeIs = map[string]ArgumentType{
"-bed-based": EXTRACTION,
"-bed-coords": EXTRACTION,
"-revcomp": EXTRACTION,
"-nucleic": EXTRACTION,
"-else": EXTRACTION,
"-pfx": CUSTOMIZATION,
"-sfx": CUSTOMIZATION,
......@@ -1854,6 +1861,7 @@ var opTypeIs = map[string]OpType{
"-terms": TERMS,
"-words": WORDS,
"-pairs": PAIRS,
"-reverse": REVERSE,
"-letters": LETTERS,
"-clauses": CLAUSES,
"-indices": INDICES,
......@@ -1917,6 +1925,7 @@ var opTypeIs = map[string]OpType{
"-bed-based": UCSCBASED,
"-bed-coords": UCSCBASED,
"-revcomp": REVCOMP,
"-nucleic": NUCLEIC,
"-else": ELSE,
}
......@@ -2692,8 +2701,8 @@ func ParseArguments(cmdargs []string, pttrn string) *Block {
op := &Operation{Type: status, Value: ""}
comm = append(comm, op)
status = UNSET
case ELEMENT, FIRST, LAST, ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, TERMS, WORDS, PAIRS, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED:
case NUM, LEN, SUM, MIN, MAX, INC, DEC, SUB, AVG, DEV, MED, BIN, BIT, ZEROBASED, ONEBASED, UCSCBASED, REVCOMP:
case ELEMENT, FIRST, LAST, ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, TERMS, WORDS, PAIRS, REVERSE, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED:
case NUM, LEN, SUM, MIN, MAX, INC, DEC, SUB, AVG, DEV, MED, BIN, BIT, ZEROBASED, ONEBASED, UCSCBASED, REVCOMP, NUCLEIC:
case TAB, RET, PFX, SFX, SEP, LBL, PFC, DEQ, PLG, ELG, WRP, DEF, COLOR:
case UNSET:
fmt.Fprintf(os.Stderr, "\nERROR: No -element before '%s'\n", str)
......@@ -2872,8 +2881,8 @@ func ParseArguments(cmdargs []string, pttrn string) *Block {
switch status {
case UNSET:
status = nextStatus(str)
case ELEMENT, FIRST, LAST, ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, TERMS, WORDS, PAIRS, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED,
NUM, LEN, SUM, MIN, MAX, INC, DEC, SUB, AVG, DEV, MED, BIN, BIT, ZEROBASED, ONEBASED, UCSCBASED, REVCOMP:
case ELEMENT, FIRST, LAST, ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, TERMS, WORDS, PAIRS, REVERSE, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED,
NUM, LEN, SUM, MIN, MAX, INC, DEC, SUB, AVG, DEV, MED, BIN, BIT, ZEROBASED, ONEBASED, UCSCBASED, REVCOMP, NUCLEIC:
for !strings.HasPrefix(str, "-") {
// create one operation per argument, even if under a single -element statement
op := &Operation{Type: status, Value: str}
......@@ -3349,6 +3358,27 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
return "", false
}
// reverseComplement reverse-complements a nucleotide sequence
reverseComplement := func(str string) string {
runes := []rune(str)
// reverse sequence letters - middle base in odd-length sequence is not touched, so cannot also complement here
for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
runes[i], runes[j] = runes[j], runes[i]
}
found := false
// now complement every base, also handling uracil, leaving case intact
for i, ch := range runes {
runes[i], found = revComp[ch]
if !found {
runes[i] = 'X'
}
}
str = string(runes)
return str
}
// processElement handles individual -element constructs
processElement := func(acc func(string)) {
......@@ -3467,13 +3497,36 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
}
}
doRevComp := false
doUpCase := false
if status == NUCLEIC {
// -nucleic uses direction of range to decide between forward strand or reverse complement
if min+1 > max {
min, max = max-1, min+1
doRevComp = true
}
doUpCase = true
}
// numeric range now calculated, apply slice to string
if min == 0 && max == 0 {
if doRevComp {
str = reverseComplement(str)
}
if doUpCase {
str = strings.ToUpper(str)
}
acc(str)
} else if max == 0 {
if min > 0 && min < len(str) {
str = str[min:]
if str != "" {
if doRevComp {
str = reverseComplement(str)
}
if doUpCase {
str = strings.ToUpper(str)
}
acc(str)
}
}
......@@ -3481,6 +3534,12 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
if max > 0 && max <= len(str) {
str = str[:max]
if str != "" {
if doRevComp {
str = reverseComplement(str)
}
if doUpCase {
str = strings.ToUpper(str)
}
acc(str)
}
}
......@@ -3488,6 +3547,12 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
if min < max && min > 0 && max <= len(str) {
str = str[min:max]
if str != "" {
if doRevComp {
str = reverseComplement(str)
}
if doUpCase {
str = strings.ToUpper(str)
}
acc(str)
}
}
......@@ -3501,8 +3566,8 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
sendSlice(str)
}
})
case TERMS, WORDS, PAIRS, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED,
VALUE, LEN, SUM, MIN, MAX, SUB, AVG, DEV, MED, BIN, BIT, REVCOMP:
case TERMS, WORDS, PAIRS, REVERSE, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED,
VALUE, LEN, SUM, MIN, MAX, SUB, AVG, DEV, MED, BIN, BIT, REVCOMP, NUCLEIC:
exploreElements(func(str string, lvl int) {
if str != "" {
sendSlice(str)
......@@ -3744,7 +3809,7 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
buffer.WriteString(single)
between = sep
}
case ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, VALUE, NUM, INC, DEC, ZEROBASED, ONEBASED, UCSCBASED:
case ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, VALUE, NUM, INC, DEC, ZEROBASED, ONEBASED, UCSCBASED, NUCLEIC:
processElement(func(str string) {
if str != "" {
ok = true
......@@ -3955,20 +4020,7 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
if str != "" {
ok = true
buffer.WriteString(between)
runes := []rune(str)
// reverse sequence letters - middle base in odd-length sequence is not touched, so cannot also complement here
for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
runes[i], runes[j] = runes[j], runes[i]
}
found := false
// now complement every base, also handling uracil, leaving case intact
for i, ch := range runes {
runes[i], found = revComp[ch]
if !found {
runes[i] = 'X'
}
}
str = string(runes)
str = reverseComplement(str)
buffer.WriteString(str)
between = sep
}
......@@ -4270,6 +4322,37 @@ func ProcessClause(curr *Node, stages []*Step, mask, prev, pfx, sfx, plg, sep, d
}
}
})
case REVERSE:
processElement(func(str string) {
if str != "" {
words := strings.FieldsFunc(str, func(c rune) bool {
return !unicode.IsLetter(c) && !unicode.IsDigit(c)
})
for lf, rt := 0, len(words)-1; lf < rt; lf, rt = lf+1, rt-1 {
words[lf], words[rt] = words[rt], words[lf]
}
for _, item := range words {
item = strings.ToLower(item)
if DeStop {
if IsStopWord(item) {
continue
}
}
if DoStem {
item = porter2.Stem(item)
item = strings.TrimSpace(item)
}
if item == "" {
continue
}
ok = true
buffer.WriteString(between)
buffer.WriteString(item)
between = sep
}
}
})
case LETTERS:
processElement(func(str string) {
if str != "" {
......@@ -4447,8 +4530,8 @@ func ProcessInstructions(commands []*Operation, curr *Node, mask, tab, ret strin
str := op.Value
switch op.Type {
case ELEMENT, FIRST, LAST, ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, TERMS, WORDS, PAIRS, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED,
NUM, LEN, SUM, MIN, MAX, INC, DEC, SUB, AVG, DEV, MED, BIN, BIT, ZEROBASED, ONEBASED, UCSCBASED, REVCOMP:
case ELEMENT, FIRST, LAST, ENCODE, UPPER, LOWER, TITLE, YEAR, TRANSLATE, TERMS, WORDS, PAIRS, REVERSE, LETTERS, CLAUSES, INDICES, MESHCODE, MATRIX, ACCENTED,
NUM, LEN, SUM, MIN, MAX, INC, DEC, SUB, AVG, DEV, MED, BIN, BIT, ZEROBASED, ONEBASED, UCSCBASED, REVCOMP, NUCLEIC:
txt, ok := ProcessClause(curr, op.Stages, mask, tab, pfx, sfx, plg, sep, def, op.Type, index, level, variables, transform)
if ok {
plg = ""
......@@ -4474,11 +4557,16 @@ func ProcessInstructions(commands []*Operation, curr *Node, mask, tab, ret strin
case LBL:
lbl := str
accum(tab)
accum(plg)
accum(pfx)
if plain {
accum(lbl)
} else {
printInColor(lbl)
}
accum(sfx)
plg = ""
lst = elg
tab = col
ret = lin
case PFC:
......@@ -5697,7 +5785,7 @@ func ProcessINSD(args []string, isPipe, addDash, doIndex bool) []string {
acc = append(acc, "-element", "INSDSeq_accession-version", "-clr", "-rst", "-tab", "\\n")
}
} else {
acc = append(acc, "-pattern", "INSDSeq", "-ACCN", "INSDSeq_accession-version")
acc = append(acc, "-pattern", "INSDSeq", "-ACCN", "INSDSeq_accession-version", "-SEQ", "INSDSeq_sequence")
}
if doIndex {
......@@ -5868,6 +5956,30 @@ func ProcessINSD(args []string, isPipe, addDash, doIndex bool) []string {
// report capitalization or vocabulary failure
checkAgainstVocabulary(str, "element", insdtags)
} else if str == "sub_sequence" {
// special sub_sequence qualifier shows sequence under feature intervals
acc = append(acc, "-block", "INSDFeature_intervals")
if isPipe {
acc = append(acc, "-lbl", "")
} else {
acc = append(acc, "-lbl", "\"\"")
}
acc = append(acc, "-subset", "INSDInterval", "-FR", "INSDInterval_from", "-TO", "INSDInterval_to")
if isPipe {
acc = append(acc, "-pfx", "", "-tab", "", "-nucleic", "&SEQ[&FR:&TO]")
} else {
acc = append(acc, "-pfx", "\"\"", "-tab", "\"\"", "-nucleic", "\"&SEQ[&FR:&TO]\"")
}
acc = append(acc, "-subset", "INSDFeature_intervals")
if isPipe {
acc = append(acc, "-lbl", "\\t")
} else {
acc = append(acc, "-lbl", "\"\\t\"")
}
} else {
acc = append(acc, "-block", "INSDQualifier")
......@@ -8364,7 +8476,7 @@ func main() {
// -e2index shortcut for experimental indexing code (documented in rchive.go)
if args[0] == "-e2index" {
// e.g., xtract -transform meshtree.txt -e2index
// e.g., xtract -transform "$EDIRECT_MESH_TREE" -e2index
args = args[1:]
......