citation-exporter
This is a citation exporting web service, based on the following excellent open-source tools:
Quick start
To run this service, identify the jar file with its version number, and run:
java -jar pmc-citation-exporter-<version>.jar
This starts the server with the test item source, which reads article citation data from the samples directory. To run with real data, you will have to configure an item source to use.
Configuration - system properties
Configuration is controlled with system properties, which you can set on the command line. For example, when running via the jetty maven plugin:
mvn jetty:run -Djetty.port=9876 -Did_cache=true -Did_cache_ttl=8
Or, when running the executable jar (note that property settings must come before
the -jar option):
java -Djetty.port=9876 -Did_cache=true \
-jar pmc-citation-exporter-<version>.jar
The properties that control the behavior of the service are:
- com.sun.management.jmxremote.authenticate - Set this to "false" to turn on the JMX monitor console without user authentication. Note that this is only suitable for a development environment.
- com.sun.management.jmxremote.port - Set this to a port number if you want to use the JMX monitor console.
- id_cache - Either "true" or "false". Default is "false".
- id_cache_size - The size of the cache for IDs. Default is 50000.
- id_cache_ttl - Time-to-live for each of the IDs in the ID cache, in seconds. Default is 86400.
- id_converter_params - Query string parameters to send to the PMC ID converter API. Default is "showaiid=yes&format=json&tool=ctxp&email=[email protected]".
- id_converter_url - URL of the PMC ID converter API. Default is "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/".
- item_source - String specifying which ItemSource to use. The default is "test", which selects the TestItemSource, which loads data items from files in the class path. If not "test", then the value should be the fully qualified name of the class. Possible values are:
  - gov.ncbi.pmc.cite.StcachePubOneItemSource - requires item_source_loc to also be set
  - gov.ncbi.pmc.cite.StcacheNxmlItemSource - requires item_source_loc to also be set
  - gov.ncbi.pmc.cite.ConvAppNxmlItemSource - Get NXML from an HTTP web service. Requires item_source_loc to also be set
- item_source_loc - When item_source is:
  - An Stcache: this needs to be the full pathname of the stcache image file.
  - ConvAppNxmlItemSource: this should be the URL of the converter app.
  - PubmedPubOneItemSource: this should be a URL template, with ${id} in the place where the numerical PubMed ID should be inserted.
- item_source_id_type - Used only with the test item_source; this specifies which ID type that item_source prefers. Default is "aiid".
- java.io.tmpdir - Used when running as an "uber jar"; this is where the application is unpacked.
- jetty.port - The IP port number that the service will listen on. The default for this is set in the pom.xml file, as 11999.
- log - Location of the log files. Defaults to a subdirectory named "log" of the directory from which the app is run. The default for this is set in the pom.xml file.
- log_level - Sets the logging level for the root logger, as one of the log4j levels; e.g. "DEBUG", "INFO". The defaults for this are specified in the pom.xml file, and are DEBUG when running from the jetty maven plugin, and INFO when running as an "uber jar".
- xml.catalog.files - Used by the Apache commons CatalogResolver; this is the pathname of the OASIS catalog file to use when parsing XML files. See below for more info. Default value is "catalog.xml".
- proxy - In the form "host:port". This is for use with the PubmedPubOneItemSource. If set, then the service will access the item source backend through the indicated HTTP proxy.
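As an illustration of the item_source_loc URL template used with PubmedPubOneItemSource, the ${id} placeholder could be expanded roughly as follows. This is a minimal sketch rather than the service's own code; the class name ItemSourceUrl, the method expand, and the example.org URL mentioned below are hypothetical.

```java
/** Hypothetical helper, for illustration only; not part of the service. */
final class ItemSourceUrl {
    private ItemSourceUrl() {}

    /** Replaces every literal "${id}" in the template with the numeric PubMed ID. */
    static String expand(String template, long pmid) {
        // String.replace performs literal (non-regex) substitution,
        // so the "$" and braces need no escaping.
        return template.replace("${id}", Long.toString(pmid));
    }
}
```

For example, ItemSourceUrl.expand("http://example.org/pubone/${id}.xml", 20840586L) would yield "http://example.org/pubone/20840586.xml".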
DTDs and XML catalog files
The repository comes with an OASIS catalog file, catalog.xml, that is used, by default, to find DTDs. This contains:
<nextCatalog catalog="catalog-local.xml"/>
<nextCatalog catalog="jats/catalog.xml"/>
This causes the resolver to try to resolve IDs from:
- catalog-local.xml, if it exists. If you create this file, then you can override any definitions from other catalogs.
- jats/catalog.xml, if it exists. This file is included in the repository, and you can use the jats/get-dtds.sh script to download the corresponding DTDs from the JATS site. (This is executed by default in a Maven build).
If the JATS (and other) DTDs are located somewhere else on your system, then there are two ways to override the default behavior.
- Set the xml.catalog.files system property to point to some other master catalog file. For example:
  mvn test -Dxml.catalog.files=/pmc/load/catalog/linux-oxygen-pmc3-catalog.xml
- Create a catalog-local.xml file in the root directory of the repo, and override specific DTDs there.
API
Special URLs
The following URLs are special:
- /samples - provides a list of links to sample outputs of various documents, in various formats
- /errortest - strictly for testing, this causes the application to generate an error page
- /echotest - for testing, causes the application to echo a very simple text page
- /info - some status information
Parameters:
- id or ids - List of IDs, comma-delimited. The types and expected patterns of the values given here are the same as for the PMC ID converter API. The type can either be specified explicitly with the idtype parameter, or can be inferred. IDs are always resolved to one of aiid or pmid.
- idtype - Specifies the type of the IDs given in the ids parameter. Any of these types is allowed:
  - pmcid - includes versioned ids
  - pmid
  - mid
  - doi
- report - Specifies the embedded format of the data. Defaults to "html". See the table below for allowed values.
- format - corresponds to the returned "content-type". Can be used as substitute for content negotiation. Default depends on report; see the table below.
- style or styles - CSL style name, or a list of stylenames. If just one ID is given, this can include multiple names, comma-delimited. If multiple IDs are given, then this must be only one style name. In other words, you can have multiple IDs or multiple styles, but not both. Defaults to "modern-language-association".
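The rule that a request may carry multiple IDs or multiple styles, but not both, can be sketched as a small check. This is an illustration under stated assumptions, not the service's actual request handling: the class RequestParams and its methods are hypothetical, and treating an empty ID list as an error is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the ids/styles rule; not the service's code. */
final class RequestParams {
    static final String DEFAULT_STYLE = "modern-language-association";

    private RequestParams() {}

    /** Splits a comma-delimited parameter value, dropping empty items. */
    static List<String> splitList(String value) {
        List<String> items = new ArrayList<>();
        if (value == null) {
            return items;
        }
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                items.add(trimmed);
            }
        }
        return items;
    }

    /** Returns the styles to apply, or throws if the combination is not allowed. */
    static List<String> resolveStyles(String idsParam, String stylesParam) {
        List<String> ids = splitList(idsParam);
        List<String> styles = splitList(stylesParam);
        if (ids.isEmpty()) {
            // Assumption: a request without any ID is rejected.
            throw new IllegalArgumentException("at least one ID is required");
        }
        if (styles.isEmpty()) {
            styles.add(DEFAULT_STYLE); // documented default
        }
        if (ids.size() > 1 && styles.size() > 1) {
            throw new IllegalArgumentException(
                    "multiple IDs or multiple styles are allowed, but not both");
        }
        return styles;
    }
}
```

In the service itself, a violation of this rule would presumably surface as a 400 response; the sketch simply throws IllegalArgumentException.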
Value combinations of report and format are listed in the following table.
report           format  Comments
------           ------  --------
html             html    Styled citations in raw HTML format. Same as citeproc-node.
ris              ris     Machine-readable citation, in RIS format. Media type is application/x-research-info-systems.
nbib             nbib    Machine-readable citation, in NBIB (MEDLINE) format. Media type is application/nbib.
citeproc         json    Machine-readable citation, in citeproc-json format.
pub-one or pub1  xml     New unified literature format.
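The defaulting described by the table can be written down as a simple lookup. The following is only a sketch: the class ReportFormats and the method defaultFormat are hypothetical names, and rejecting unknown report values with an exception is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical lookup of the default format for each report value. */
final class ReportFormats {
    private static final Map<String, String> DEFAULT_FORMAT = new HashMap<>();
    static {
        DEFAULT_FORMAT.put("html", "html");
        DEFAULT_FORMAT.put("ris", "ris");
        DEFAULT_FORMAT.put("nbib", "nbib");
        DEFAULT_FORMAT.put("citeproc", "json");
        DEFAULT_FORMAT.put("pub-one", "xml");
        DEFAULT_FORMAT.put("pub1", "xml");
    }

    private ReportFormats() {}

    /** Returns the default format for a report; a missing report means "html". */
    static String defaultFormat(String report) {
        String effective = (report == null || report.isEmpty()) ? "html" : report;
        String format = DEFAULT_FORMAT.get(effective);
        if (format == null) {
            // Assumption: unknown report values are treated as a bad request.
            throw new IllegalArgumentException("unknown report: " + effective);
        }
        return format;
    }
}
```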
Error responses
If everything goes well, the service will return a status of 200, of course. If there is a problem, one of the following codes will be returned:
- 400 Bad request - if the request parameters can't be deciphered, or similar problems
- 404 Not found - for IDs that are of the correct form, but can't be found in the data
- 500 Internal server error - if there is a problem with an upstream service, like the ID converter, or a runtime exception in this service's software.
Styled citation responses
Each styled citation will be in its own <div> element, which will have
some special attributes to identify it:
- data-id - the ID used in the request for the resource. This will have a type prefix, followed by a colon, and then the ID value. E.g. pmdi
- data-style - the name of the citation style
- data-resolved-id - optional, if the requested ID doesn't match the resolved ID.
The individual record <div>s will be wrapped in an outer <div> element.
If there were requested IDs that couldn't be resolved, then they will be
listed in the data-not-found attribute on this outer <div>.
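On the client side, these attributes can be read with a standard XML parser, provided the response is well-formed XML, as the samples in this section are. The following is a minimal sketch under that assumption; the class StyledCitations and its methods are hypothetical and not part of the service.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical client-side reader for styled citation responses. */
final class StyledCitations {
    private StyledCitations() {}

    /** Parses the response and returns the outer <div> element. */
    private static Element parseOuterDiv(String response) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(response)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("response is not well-formed XML", e);
        }
    }

    /** Returns the raw data-not-found value, or "" if the attribute is absent. */
    static String notFound(String response) {
        return parseOuterDiv(response).getAttribute("data-not-found");
    }

    /** Returns data-resolved-id where present, otherwise data-id, per record. */
    static List<String> effectiveIds(String response) {
        List<String> ids = new ArrayList<>();
        NodeList children = parseOuterDiv(response).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                Element record = (Element) child;
                // getAttribute returns "" when the attribute is not set.
                String resolved = record.getAttribute("data-resolved-id");
                ids.add(resolved.isEmpty() ? record.getAttribute("data-id") : resolved);
            }
        }
        return ids;
    }
}
```

The sketch returns the data-not-found value unparsed, because the delimiter used when several IDs are missing is not specified here.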
For example, a single record styled in a single citation style (request "?ids=PMC3000436"):
<div>
<div class="csl-entry" data-id="aiid:3000436"
data-style="modern-language-association">Barash,
Uri et al. “Proteoglycans in Health and Disease: New Concepts for Heparanase Function in Tumor
Progression and Metastasis.” <i>FEBS J</i> 277.19 (2010): n. pag.</div>
</div>
A response for a request for multiple IDs, one of which cannot be resolved (request "?ids=PMC3000436,PMC99999999,PMC3155436"):
<div data-not-found="pmcid:PMC99999999">
<div class="csl-entry" data-id="pmcid:PMC3000436"
data-resolved-id="aiid:3000436"
data-style="modern-language-association">Barash, Uri et al. “Proteoglycans in Health
and Disease: New Concepts for Heparanase Function in Tumor Progression and Metastasis.”
<i>FEBS J</i> 277.19 (2010): n. pag.</div>
<div class="csl-entry" data-id="pmcid:PMC3155436"
data-resolved-id="aiid:3155436"
data-style="modern-language-association">“Correction.” <i>Can Fam Physician</i> 57.8
(2011): 879–879. Print.</div>
</div>
PubOne format responses
The response for a single record in PubOne format will look something like the following (request "?ids=PMC3000436&report=pub1"):
<pub-one-record xmlns:mml="http://www.w3.org/1998/Math/MathML"
xmlns:xlink="http://www.w3.org/1999/xlink"
record-type="article" xml:lang="en">
<source-meta>
<object-id pub-id-type="nlm-journal-id">101229646</object-id>
...
</source-meta>
<document-meta>
<object-id pub-id-type="doi">10.1111/j.1742-4658.2010.07799.x</object-id>
<object-id pub-id-type="manuscript">nihpa226684</object-id>
<object-id pub-id-type="pmcid">3000436</object-id>
<object-id pub-id-type="pmid">20840586</object-id>
...
</document-meta>
</pub-one-record>
If more than one record is returned, they will be wrapped in an outer
<pub-one-records> element, and each identified with attributes in the
namespace "http://www.ncbi.nlm.nih.gov/ns/search". For example, a response
for two good IDs and one bad one (request
"?ids=PMC3000436,PMC99999999,PMC3155436&report=pub1"):
<pub-one-records xmlns:s="http://www.ncbi.nlm.nih.gov/ns/search"
s:not-found="pmcid:PMC99999999">
<pub-one-record s:id="pmcid:PMC3000436"
s:resolved-id="aiid:3000436" ...>...</pub-one-record>
<pub-one-record s:id="pmcid:PMC3155436"
s:resolved-id="aiid:3155436" ...>...</pub-one-record>
</pub-one-records>
citeproc-json format responses
The response for a single record in this format will be a JSON object. For example (request "?ids=PMC3000436&report=citeproc"):
{
"id": "aiid:3000436",
"title": "Proteoglycans in health and disease: New concepts for
heparanase function in tumor progression and metastasis",
"author": [ ... ],
...
"issue": "19",
"PMID": "20840586",
"PMCID": "3000436",
"DOI": "10.1111/j.1742-4658.2010.07799.x"
}
Note that the value used for the id field is the resolved, canonical
identifier, and that the other forms of identifier appear in the
citeproc-json-specific fields PMID, PMCID, etc.
If there are multiple records, they will be wrapped in a JSON array. For example, the response with two good records and one bad one will look like this (request ""):
[
{
"not-found": "pmcid:PMC99999999"
},
{
"id": "aiid:3000436",
"title": "Proteoglycans in health and disease: New concepts for heparanase function in
tumor progression and metastasis",
...
"PMID": "20840586",
"PMCID": "3000436",
"DOI": "10.1111/j.1742-4658.2010.07799.x"
},
{
"id": "aiid:3155436",
"title": "Correction",
...
"PMCID": "3155436"
}
]
Testing
Run unit tests as follows:
mvn test
To run just one specific set of tests (all the tests defined in the TestRequests class):
mvn -Dtest=TestRequests test
To run a single method of a test class:
mvn -Dtest=TestCitationProcessor#testBiblFromPubMedPubOne test
You can use wildcards; for example:
mvn '-Dtest=Test*' test
mvn -Dtest='*Transform*' test
See documentation on the Maven Surefire plugin for more options.
Test samples
A good set of samples is listed in the application's samples page. The source files for these are in src/main/resources/samples, and the
Data-driven validation unit tests
Among the unit tests that mvn test will run are two sets that use data files
to verify the output of transformations, and the responses to requests:
TestTransforms.java and TestRequests.java.
TestTransforms.java
This provides data-driven schematron and regular-expression matching tests of the XSLT transforms. The individual test cases are defined in the transform-tests.json file, which is read into a List of TransformTestCase objects.
You can use the test_cases system property to select which specific test
case to run:
- If omitted, or empty, all tests are run
- Otherwise, it's matched against the description, as a regular expression
So, for example, to test all the cases that have "PubOne" in the description, run:
mvn -Dtest=TestTransforms -Dtest_cases=PubOne test
The format of the transform-tests.json file is defined in comments there.
When checking XML output, the tests use Schematron files, which are also in the src/test/resources directory.
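The test_cases selection rule can be pictured as a regular-expression filter over the test case descriptions. The sketch below is not the code in TestTransforms.java: the class TestCaseFilter is hypothetical, and it assumes the expression may match anywhere in the description (Matcher.find), which is consistent with the "PubOne" example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical filter mirroring the documented test_cases behavior. */
final class TestCaseFilter {
    private TestCaseFilter() {}

    /** Returns the descriptions selected by the test_cases value. */
    static List<String> select(List<String> descriptions, String testCases) {
        // Omitted or empty: every test case is selected.
        if (testCases == null || testCases.isEmpty()) {
            return new ArrayList<>(descriptions);
        }
        Pattern pattern = Pattern.compile(testCases);
        List<String> selected = new ArrayList<>();
        for (String description : descriptions) {
            // Assumption: a match anywhere in the description is enough.
            if (pattern.matcher(description).find()) {
                selected.add(description);
            }
        }
        return selected;
    }
}
```

A caller would typically pass System.getProperty("test_cases") as the second argument.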
TestRequests.java
This provides unit tests for the Request class, which handles HTTP requests. It uses Mockito to mock HttpServletRequest and HttpServletResponse objects. It reads test cases from the request-tests.json file into a List of RequestTestCase objects.
As with TestTransforms, you can use the test_cases property to
select which tests to run.
For example, to test all the cases that have "style" in the description, run:
mvn -Dtest=TestRequests -Dtest_cases=style test
Monitoring
Logs
The location of log files is controlled by the system property log, which
is usually set to the value "log" using the -Dlog=log command-line switch.
Logging is controlled by properties set in the
src/main/resources/log4j.properties file. The log level is controlled by
the log4j.rootLogger property, and can be set to one of TRACE, DEBUG, INFO,
WARN, ERROR or FATAL.
JMX Monitor Console
You can use JMX monitoring to check the service, either locally or remotely.
When running with "jetty:run", local JMX monitoring is enabled by default.
After starting the server, from another terminal, enter the command jconsole,
and then open the connection to the "org.codehaus.plexus..." process that is
listed.
When running as an "uber jar", start it with the following options (for example):
mvn package
java -Dcom.sun.management.jmxremote.port=11997 \
-Dcom.sun.management.jmxremote.authenticate=false \
-Dcom.sun.management.jmxremote.ssl=false \
-jar target/pmc-citation-exporter-*.jar
Then, you can connect to a monitor console by using the command:
jconsole localhost:11997
Development
Runtime environments
There are two ways to run the server:
- Using the Jetty Maven plugin, with mvn jetty:run. This is useful for development.
- As an executable "uber jar", created using the Apache Maven Shade plugin, with mvn package; java -jar target/pmc-citation-exporter...jar.
Running with Jetty Maven plugin
To use the Jetty Maven plugin, run with, for example:
mvn clean jetty:run
The main entry point to the app is in MainServlet.java. This environment does not use:
- the src/main/webapp/jetty.xml configuration file, or
- The code in WebServer.java
Some options are set in the pom.xml so that, when running with jetty:run,
the plugin scans for changes to source files and automatically redeploys
whenever any are detected. Also, in the pom file, some default values are
given for system properties that are used to configure the app.
For example:
<!-- Jetty Maven plugin -->
<plugin>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-maven-plugin</artifactId>
<version>${jetty.version}</version>
<configuration>
<scanIntervalSeconds>2</scanIntervalSeconds>
<scanTargetPatterns>
<scanTargetPattern>
<directory>src/main/resources</directory>
<includes>
<include>**/*.*</include>
</includes>
</scanTargetPattern>
</scanTargetPatterns>
<systemProperties>
<systemProperty>
<name>jetty.port</name>
<value>11999</value>
</systemProperty>
<systemProperty>
<name>log</name>
<value>log</value>
</systemProperty>
<systemProperty>
<name>log_level</name>
<value>DEBUG</value>
</systemProperty>
</systemProperties>
</configuration>
</plugin>
Bug
But note that there seems to be a bug in the "hot redeploy" feature of the Jetty Maven plugin. When changes are made to sample files under src/main/resources/samples, the application is restarted, but for some reason, those changed files are not copied into the target/classes/samples directory as they should be. Those sample files are copied, however, when the application restarts as a result of a Java source file changing.
Running the Jetty shaded uber-jar
An executable jar is created with the mvn package command, and put into the
target subdirectory (target/pmc-citation-exporter-<version>.jar). This one
"uber jar" includes all of the dependencies, including Jetty itself.
This process is controlled by the Apache Maven Shade plugin, which is configured by a <plugin> section of the pom.xml.
In addition, another plugin section, build-helper-maven-plugin, is required to specify additional directories to copy into the target jar file.
To run the server from this executable jar, execute something like this:
java -Djava.io.tmpdir=./jetty-temp-dir \
-jar target/pmc-citation-exporter-*.jar
Note that system properties must be set on the command line before the
-jar option.
When running this way, Jetty is configured by the src/main/webapp/jetty.xml file. The main entry point to the application is in WebServer.java.
System properties
By convention, in the source code, we try to encapsulate the handling of System properties in the App class. All other classes should use plain Java APIs for controlling options.
Property values are set in a number of ways:
- src/main/resources/ctxp.properties - This is a template, and values for these properties are computed and set at build time. The result is then written to the root directory of the project (./ctxp.properties). Properties set in this file are:
  - ctxp.version
  - ctxp.sha
  - ctxp.config.sha
- pom.xml - These properties are applied when running with the jetty maven plugin (mvn jetty:run). Currently, the ones specified in the pom.xml file are:
  - jetty.port
  - log
  - log_level
- WebServer.java - These are applied when running the executable jar.
- App.java - In a few cases, defaults are specified in this class.
- On the command line. For example:
  mvn jetty:run -Djetty.port=9876 -Did_cache=true -Did_cache_ttl=8
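To illustrate the convention of keeping system property handling in one place, here is a minimal sketch. It is not the actual App class: the class name AppProps and its method names are hypothetical, while the property names and defaults are the ones documented under "Configuration - system properties".

```java
/** Hypothetical single point of access to a few documented properties. */
final class AppProps {
    private AppProps() {}

    /** id_cache: "true" or "false"; documented default is "false". */
    static boolean idCacheEnabled() {
        return Boolean.parseBoolean(System.getProperty("id_cache", "false"));
    }

    /** id_cache_size: documented default is 50000. */
    static int idCacheSize() {
        return Integer.parseInt(System.getProperty("id_cache_size", "50000"));
    }

    /** id_cache_ttl, in seconds: documented default is 86400. */
    static int idCacheTtlSeconds() {
        return Integer.parseInt(System.getProperty("id_cache_ttl", "86400"));
    }

    /** item_source: documented default is "test". */
    static String itemSource() {
        return System.getProperty("item_source", "test");
    }
}
```

Other classes would then call, for example, AppProps.idCacheTtlSeconds() instead of reading System.getProperty directly.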
Build environments
This repository has been configured such that, by default, it can run stand-alone, without any dependencies on NCBI-internal libraries or services.
In the production environment, however, we require access to a Java library which has not been released openly (groupId=gov.ncbi.pmc, artifactId=pmc-lib). References to that library exist in two class files, StcachePubOneItemSource and StcacheNxmlItemSource; so, by default, those are excluded from compilation. This is accomplished by making "dev" the default build profile, which explicitly excludes those two class files, and doesn't include the dependency on the pmc-lib library.
Building for production is done with the use of the "prod" Maven profile
(mvn -Pprod). That profile doesn't exclude those class files from
compilation, and does declare the dependency on the pmc-lib library.
Eclipse setup
To work in Eclipse, import this project into your workspace.
Turn off validation of some subdirectories.
Turn off validation of the DTD subdirectories of jats by right-clicking on them, selecting "Properties", and then check "derived". This will cause all of the files under these directories to be excluded from validation. You might have to select "Project" -> "Clean" to get rid of existing errors.
It might help even more, to speed up builds, to turn off all validation in the project. Right click on the project, then select Properties, then Validation, and check "Suspend all validators".
Set up some file types
In Preferences → General → Content Types, select "Text", then "XML". Then add two new types: ".nxml" (JATS PMC NXML format) and ".sch" (Schematron files).
Test item provider
When the value of item_source is "test", the citation data is mock data from the src/main/webapp/test directory.
Javadocs
When you run mvn package, Javadocs are generated from the sources, and
written to target/apidocs, as well as to their own Jar file
target/pmc-citation-exporter--javadoc.jar
Exception handling
Exception classes used when handling a request:
- CiteException
- BadParamException - e.g. an id value that doesn't match one of the known patterns
- NotFoundException - when an id is of the correct form, but not resolved, or the pub-one not found in the stcache
- ServiceException - some problem with upstream service; results in 500
- java.io.IOException - from, for example, java xml processing, error creating CSL object, etc. Results in 500
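The relationship between these exceptions and the status codes listed under "Error responses" can be sketched as follows. The nested classes below are stand-ins declared only so the example is self-contained; they are not the service's real classes. The mapping of BadParamException to 400 and NotFoundException to 404, and the assumption that the three specific exceptions extend CiteException, are inferred from the descriptions above.

```java
/** Hypothetical mapping from exception type to HTTP status code. */
final class StatusMapping {
    private StatusMapping() {}

    // Stand-in exception types, for illustration only.
    static class CiteException extends Exception {
        CiteException(String message) { super(message); }
    }
    static class BadParamException extends CiteException {
        BadParamException(String message) { super(message); }
    }
    static class NotFoundException extends CiteException {
        NotFoundException(String message) { super(message); }
    }
    static class ServiceException extends CiteException {
        ServiceException(String message) { super(message); }
    }

    /** Returns the status code a request handler might send for an exception. */
    static int statusFor(Exception e) {
        if (e instanceof BadParamException) {
            return 400; // request parameters can't be deciphered
        }
        if (e instanceof NotFoundException) {
            return 404; // ID is well-formed but not found
        }
        // ServiceException, java.io.IOException and anything unexpected.
        return 500;
    }
}
```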
When handling a request, the PrintWriter for the page is instantiated last. If there is an exception when that is done, then the code just writes to the error log and returns.
This is implemented with the pattern:
resp.setContentType("text/html;charset=UTF-8");
resp.setCharacterEncoding("UTF-8");
resp.setStatus(HttpServletResponse.SC_OK);
initPage();
if (page == null) return;
So, this means that the whole page is prepared as a String first, and then written out after the page is initialized.
Dependencies
Dependencies are declared in the pom.xml file, and are resolved automatically by Maven.
Below is a list of some of the stable dependencies, along with links to documentation, useful when doing development, and more details, where warranted.
Several of the dependencies use a free third-party service Jitpack.io in order to freeze specific revisions from GitHub repositories. These are libraries that are not on Maven Central, and Jitpack.io provides a way to ensure that we are using a stable version. This requires adding the following to the <repositories> section of the pom:
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
Java
Requires Java version 8.
- Platform / library Javadocs
Saxon Home Edition
It uses Saxon-HE for XSLT transformations.
Jetty
citeproc-java
The official source for this code is on GitHub at
The current version of citation exporter uses a fork of this at
It uses Jitpack.io to freeze the version at a specific commit.
To use the latest development version of this library, rather than the release package, clone the GitHub repository to your local machine. Then (see the build instructions) do the following to create the jar file and install it in your local Maven cache:
./gradlew install
In the citation-exporter pom, make sure you change the dependency to use the "real" coordinates, instead of the ones from Jitpack.io:
<dependency>
<!-- Use these coordinates for the jitpack.io version: -->
<groupId>com.github.Klortho</groupId>
<artifactId>citeproc-java</artifactId>
<version>3f30f220c279557892d69c9ec95b92d84cf2114d</version>
<!-- Use these when running from a local, development version:
<groupId>de.undercouch</groupId>
<artifactId>citeproc-java</artifactId>
<version>0.7-SNAPSHOT</version> -->
</dependency>
citeproc-js
This is included by reference, from citeproc-java. The repository URL and the exact commit are specified in the citeproc-java/build.gradle file.
As of 12/10/2015, we are using a fork of this, on BitBucket at klortho/citeproc-js, branch pmc-22661-epub-date. Note that we've tried to update to a later version, that includes a fix for this issue. In this comment, the author indicates the fix is in version 1.0.559, but the API must have changed, because updating citeproc-java to use this commit causes its unit tests to break.
Citation style language (CSL) libraries
These are two libraries that are used to format citations. The originals are here:
styles
Currently (12/10/2015) we're using forks of these, at:
Changes to the styles repo include support of an epublication date and an indication of ahead-of-print.
These branches also include the addition of pom.xml files, so that they can be brought in as Maven dependencies.
These use Jitpack.io to freeze the version at a specific commit.
In order to use a development version of these, instead of the default version, merely clone one or the other (or both). Then, in these working copies of the repos, install the packages to your local Maven cache with:
mvn install
Then, in the citation-exporter pom, change the dependency (dependencies) to use the "real" coordinates, instead of the ones from Jitpack.io. For example,
<dependency>
<!-- Use these coordinates for the jitpack.io version: -->
<groupId>com.github.Klortho</groupId>
<artifactId>styles</artifactId>
<version>2b644a6c51743991d6d26b44961f9063dbd92c49</version>
<!-- Use these when running from a local, development version:
<groupId>org.citationstyles</groupId>
<artifactId>styles</artifactId>
<version>1.0.1-pmc-22661-olf</version> -->
</dependency>
Note that the official GitHub repos for these are not "mavenized": they do not have pom files. Michel Kraemer provides a service that mavenizes them automatically, and uploads SNAPSHOT versions of them to Sonatype's OSS repo daily, see here.
PMC ID Converter API
Jackson
We're using the Jackson library to read JSON objects. Here are some handy links:
- Home page
- Data binding - includes tutorial on how to use it.
- Javadocs
- Jackson annotations - how to annotate the classes that map to JSON objects
kitty-cache
The library is on Google code here, and is also mirrored to GitHub, at treeder/kitty-cache, but it is not in Maven Central.
It is declared in the citation-exporter pom.xml, using JitPack.io to build and deploy it from a fork of the repository on GitHub, at Klortho/kitty-cache. (The reason for using a fork is to protect against the possibility of the original repo being removed.)
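For orientation, the behavior that the id_cache_size and id_cache_ttl properties describe, a size-bounded cache whose entries expire after a time-to-live, can be sketched with the standard library alone. This is not kitty-cache's implementation or API: the class TtlCache and its methods are hypothetical, eviction of the oldest inserted entry is an assumption, and the clock is passed in explicitly to keep the example deterministic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical size-bounded cache with a per-entry time-to-live. */
final class TtlCache<K, V> {
    /** A cached value together with its expiry time. */
    private static final class Timed<T> {
        final T value;
        final long expiresAtMillis;
        Timed(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    private final Map<K, Timed<V>> map;

    TtlCache(final int maxSize, long ttlSeconds) {
        this.ttlMillis = ttlSeconds * 1000L;
        // Insertion-ordered map that drops its oldest entry once maxSize is exceeded.
        this.map = new LinkedHashMap<K, Timed<V>>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                return size() > maxSize;
            }
        };
    }

    synchronized void put(K key, V value, long nowMillis) {
        map.put(key, new Timed<V>(value, nowMillis + ttlMillis));
    }

    /** Returns the cached value, or null if it is absent or has expired. */
    synchronized V get(K key, long nowMillis) {
        Timed<V> entry = map.get(key);
        if (entry == null) {
            return null;
        }
        if (nowMillis >= entry.expiresAtMillis) {
            map.remove(key); // expired: drop the entry and report a miss
            return null;
        }
        return entry.value;
    }
}
```

With the documented defaults this would be constructed as new TtlCache<String, String>(50000, 86400), passing System.currentTimeMillis() as the clock value.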
PubOne
This is a library of XSLTs that convert to and from NCBI's PubOne format, and is on GitHub at klortho/pub-one.
We use Jitpack.io to freeze the version and to make it available from a Maven repository.
In order to use a development version of this, instead of the default version, clone it to your local machine, and build and install the package to your local Maven cache with:
mvn install
Then, in the citation-exporter pom, change the dependency to use the "real" coordinates, instead of the ones from Jitpack.io. For example,
<dependency>
<!-- Use these coordinates for the jitpack.io version: -->
<groupId>com.github.Klortho</groupId>
<artifactId>pub-one</artifactId>
<version>v1.0.2</version>
<!-- Use these when running from a local, development version:
<groupId>gov.ncbi.pmc</groupId>
<artifactId>pub-one</artifactId>
<version>1.0.2</version> -->
</dependency>
Public Domain notice
National Center for Biotechnology Information.
This software is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as United States Government employees and thus cannot be copyrighted. This software is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.
Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.
Please cite NCBI in any work or product based on this material.