This repository is dedicated to the curation of published data from the polymer nanocomposites literature into a structured format for NanoMine.
- Data Science Productivity Tools: A free-to-audit course on edX covering RStudio, Git/GitHub, and Unix/Linux.
- WebPlotDigitizer: Tool for semi-automated extraction of raw data from plots.
- NanoMine Schema: XML Schema Definition (.xsd) files containing the full list of terms (refer to most recent schema; filenames contain date in MMDDYY format)
- NanoMine Excel Template: Microsoft Excel workbook into which data for a given sample are assembled
- Tidy Data: Principles for data tidiness laid out in R for Data Science by Hadley Wickham
- Tetherless World repo for NanoMine: Ontology and Knowledge Graph approach to data storage (see nanomine.ttl), using the RDF data model
- RDF Primer: W3C introduction to the RDF data model
- NanoMine SPARQL Endpoint: Direct querying of RDF data in the NanoMine Knowledge Graph, using the SPARQL query language
- Semantic Data Dictionaries: Developed by our collaborators at RPI, a specification and method for mapping tabular data into RDF format
- NanoMine Ontology spreadsheet: Google Sheet used to collaboratively develop the NanoMine ontology
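
Because the schema filenames carry their date as MMDDYY, sorting them alphabetically does not put the most recent schema last (a December 2017 file would sort after an August 2018 one). The sketch below shows one way to pick the latest file by parsing the date stamp. The filenames and the assumption that the six digits sit directly before the `.xsd` extension are hypothetical and only for illustration.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Assumption: the six-digit MMDDYY stamp sits directly before ".xsd".
_DATE_STAMP = re.compile(r"(\d{6})\.xsd$")


def schema_date(filename: str) -> Optional[datetime]:
    """Return the date encoded in a schema filename, or None if there is none."""
    match = _DATE_STAMP.search(filename)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%m%d%y")
    except ValueError:
        # Six digits that do not form a valid MMDDYY date.
        return None


def most_recent_schema(filenames: Iterable[str]) -> Optional[str]:
    """Pick the filename whose MMDDYY stamp is the latest date."""
    dated = [(schema_date(name), name) for name in filenames]
    dated = [(date, name) for date, name in dated if date is not None]
    if not dated:
        return None
    return max(dated)[1]


# Hypothetical filenames, for illustration only.
candidates = ["schema_121517.xsd", "schema_060418.xsd", "schema_081218.xsd"]
print(most_recent_schema(candidates))  # schema_081218.xsd
```
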
Within a curation job's sub-directory, the file organization depends on what makes the most sense for the curator. However, the sub-directory should contain a "Traveler" (README.md) with the DOI and other information relevant to the curation process. Because data in NanoMine are uploaded on a per-sample basis, it is suggested to give each "sample" its own child sub-directory containing the completed Excel template along with any supplemental data files (.csv, .jpg, etc.).
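
As a rough illustration of the layout described above, the following sketch creates a job sub-directory with a Traveler and one child sub-directory per sample. The function name, the Traveler's contents, the job and sample names, and the DOI are all placeholders rather than a required format. The demo runs inside a temporary directory so nothing is written to the repository.

```python
import tempfile
from pathlib import Path
from typing import Sequence


def init_curation_job(stage_dir: Path, job_name: str, doi: str,
                      samples: Sequence[str]) -> Path:
    """Create a job sub-directory with a Traveler and per-sample folders."""
    job_dir = stage_dir / job_name
    job_dir.mkdir(parents=True, exist_ok=True)

    # The Traveler is a README.md holding the DOI and curation notes.
    traveler = job_dir / "README.md"
    if not traveler.exists():
        traveler.write_text(
            f"# Traveler: {job_name}\n\n- DOI: {doi}\n- Notes:\n",
            encoding="utf-8",
        )

    # One child sub-directory per sample, to hold the completed Excel
    # template and any supplemental data files.
    for sample in samples:
        (job_dir / sample).mkdir(exist_ok=True)
    return job_dir


# Demo with placeholder names and a placeholder DOI.
with tempfile.TemporaryDirectory() as tmp:
    job = init_curation_job(Path(tmp) / "Wishlist", "example_job",
                            "10.0000/example", ["sample_1", "sample_2"])
    created = sorted(p.name for p in job.iterdir())
    traveler_text = (job / "README.md").read_text(encoding="utf-8")
print(created)
```
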
There are five directories in this repository that can be considered "stages" of the curation process: "Wishlist", "In-Progress", "Completed", "Stalled", and "Revisited".
At the "Wishlist" stage, a curation job is prepared by creating a sub-directory, initializing a "Traveler" (README.md file) in the sub-directory, and identifying figures/data of interest. Once the raw data have been retrieved (either provided by the original authors or obtained with a digital extraction tool), the curation job can be moved to the "In-Progress" stage.
The "In-Progress" stage should be kept as uncluttered as possible, with only those curation jobs that are actively in progress. Curation jobs should spend as little time as reasonable in this directory and should be moved to either "Completed" or "Stalled."
The "Completed" stage is designed to keep track of curation jobs that have already been uploaded to NanoMine QA. This is ideally the final location for the curation job, unless there are modifications or updates to make, in which case the sub-directory should be moved to the "Revisited" directory.
If a significant roadblock is encountered, the curation job can be moved to the "Stalled" stage. Documenting the issue as clearly as possible will help the team make the necessary improvements or updates to the system. Once a solution has been identified, the job can be returned to the "In-Progress" directory.
If a curation job in the NanoMine system ("Completed") requires some revision, the sub-directory should be moved to the "Revisited" stage. At this point, the issue should be clearly described before moving the curation job to the "In-Progress" stage.
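
The moves between stages described above can be summarized as a small table of allowed transitions. The sketch below is one possible way to encode them and to move a job's sub-directory accordingly. It assumes the stage directories are named exactly as quoted in the text (with the revision stage written as "Revisited") and that a job leaves "Wishlist" for "In-Progress". The helper name `move_job` is hypothetical. In a Git working tree, `git mv` would be the more natural command; `shutil.move` is used here only to keep the example self-contained.

```python
import shutil
import tempfile
from pathlib import Path

# Allowed stage-to-stage moves, as read from the descriptions above.
ALLOWED_MOVES = {
    "Wishlist": {"In-Progress"},
    "In-Progress": {"Completed", "Stalled"},
    "Completed": {"Revisited"},
    "Stalled": {"In-Progress"},
    "Revisited": {"In-Progress"},
}


def move_job(repo_root: Path, job_name: str, source: str, target: str) -> Path:
    """Move a job sub-directory between stages if the workflow allows it."""
    if target not in ALLOWED_MOVES.get(source, set()):
        raise ValueError(f"Move from {source!r} to {target!r} is not in the workflow")

    src = repo_root / source / job_name
    if not src.is_dir():
        raise FileNotFoundError(f"No such curation job: {src}")

    dst = repo_root / target / job_name
    if dst.exists():
        raise FileExistsError(f"Job already present in target stage: {dst}")

    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))
    return dst


# Demo in a temporary directory with a placeholder job name.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Wishlist" / "example_job").mkdir(parents=True)
    new_location = move_job(root, "example_job", "Wishlist", "In-Progress")
    moved_to = new_location.parent.name
    still_in_wishlist = (root / "Wishlist" / "example_job").exists()
print(moved_to)  # In-Progress
```

Rejecting moves that are not in the table (for example, straight from "Wishlist" to "Completed") is a design choice of this sketch, not a rule enforced by the repository itself.
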
The overall workflow is illustrated in the diagram below.

To collaboratively manage and keep track of changes to curation-related files, this repository adopts a Git workflow. Raw data tables and the code used to prepare the raw data should be kept in a shared storage location (e.g., Dropbox, Google Drive). This GitHub repository is not designed to host raw data, so any curation jobs in the "Completed" stage should be configured to ignore the raw data files and only track the Traveler and other small files (such as the master Excel template and any R code).
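
One way to configure a job this way is an allow-list `.gitignore` in the job's sub-directory: ignore everything by default, then re-include only the files that should be tracked. The sketch below writes such a file. The specific patterns (`README.md`, `*.xlsx`, `*.R`) are assumptions based on the file types mentioned above and should be adjusted to the files a job actually contains.

```python
import tempfile
from pathlib import Path

# Allow-list patterns; the extensions are assumptions, adjust as needed.
GITIGNORE_LINES = [
    "# Ignore everything in this job by default",
    "*",
    "# Keep sub-directories visible so the rules below apply inside them",
    "!*/",
    "# Track only the Traveler, Excel templates, R code, and this file",
    "!README.md",
    "!*.xlsx",
    "!*.R",
    "!.gitignore",
]


def write_gitignore(job_dir: Path) -> Path:
    """Write an allow-list .gitignore into a curation job's sub-directory."""
    path = job_dir / ".gitignore"
    path.write_text("\n".join(GITIGNORE_LINES) + "\n", encoding="utf-8")
    return path


# Demo in a temporary directory standing in for a job sub-directory.
with tempfile.TemporaryDirectory() as tmp:
    written = write_gitignore(Path(tmp)).read_text(encoding="utf-8")
print(written)
```

With this file in place, supplemental files such as `.csv` or `.jpg` stay untracked, while the Traveler, Excel templates, and R scripts remain under version control.
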