
FORRT Replication Database (FReD): Data Processing & API

This repository contains the complete data processing pipeline for the FORRT Replication Database (FReD) and FLoRA datasets, plus the backend API for the Zotero Replication Checker plugin.


Overview

This repository manages two independent datasets:

  1. FReD (FORRT Replication Effects Database)

    • Effect-level data: individual effect sizes from replications
    • Created by merging individual effects with paper-level success coding
    • Output: output/FReD.xlsx
  2. FLoRA (FORRT Literature on Replications and Reproductions Archive)

    • Paper-level data: metadata about replication and reproduction studies
    • Combines both replications and reproductions from two separate Google Sheets
    • Deduplicated by original-replication/reproduction DOI pairs
    • Enriched with keywords and language metadata from OpenAlex
    • Output: output/flora.csv

Both datasets are augmented with:

  • CrossRef metadata (titles, authors, years)
  • APA-formatted references with manual override support
  • Author overlap detection
  • OpenAlex keywords

The datasets power the Zotero Replication Checker API backend via privacy-preserving DOI hash lookups.


Repository Structure

fred_data/
├── R/                              # Shared helper functions
│   ├── cache_config.R              # Cache paths by data type
│   ├── data_cleaning.R             # FReD data cleaning
│   ├── crossref_cache.R            # Citation & author caching
│   ├── augmentation.R              # Augmentation functions
│   └── release_helpers.R           # OSF release automation
│
├── pipelines/                      # Independent data pipelines
│   ├── fred/
│   │   ├── prepare_fred.qmd        # Download → clean → augment → save
│   │   └── release_fred.qmd        # Release to OSF (optional)
│   │
│   └── flora/
│       ├── prepare_flora.qmd       # Download → deduplicate → augment → save
│       └── release_flora.qmd       # Release to OSF (optional)
│
├── cache/                          # Cache files (gitignored)
│   ├── crossref_doi_cache.rds      # DOI metadata
│   ├── crossref_citations.rds      # APA/BibTeX references
│   ├── crossref_authors.xlsx       # Author lists
│   ├── author_overlap.xlsx         # Overlap calculations
│   ├── manual_references.xlsx      # Manual reference overrides
│   └── openalex_keywords.csv       # Keywords cache
│
├── output/                         # Generated datasets (gitignored)
│   ├── FReD.xlsx                   # Effect-level dataset
│   └── flora.csv                   # Paper-level dataset
│
├── cos_integration/                # COS test data (optional)
│   ├── cos_test_set_phase1.csv
│   ├── cos_test_set_phase1_prepared.xlsx
│   ├── prepare_cos_data.R
│   └── README.md                   # COS toggle instructions
│
├── fred_dynamodb_loader/           # API backend loader
├── release/                        # Release automation scripts
├── COS Reports/                    # COS competition reports
├── archive/                        # Historical files
│   └── old_scripts/                # Previous pipeline versions
│
└── [Documentation files]
    ├── README.md                   # This file
    ├── .env.example                # Environment variables template
    ├── REORGANIZATION_PROGRESS.md
    ├── PHASE2_SUMMARY.md
    ├── PHASE3-4_SUMMARY.md
    └── IMPLEMENTATION_STATUS.md

Quick Start

Installation

# Clone repository
git clone https://github.com/forrtproject/FReD-data.git
cd FReD-data

# Set up environment variables
cp .env.example .env
# Edit .env and add:
# - OSF_TOKEN (for releases)
# - ENABLE_COS_MERGE (TRUE/FALSE)
# - OPENALEX_MAILTO (your email for API)

# Install R dependencies (one-time setup)
Rscript -e "install.packages(c('tidyverse', 'readxl', 'openxlsx', 'rcrossref', 'osfr', 'quarto'))"

Running Pipelines

# Prepare FReD (effect-level dataset)
quarto render pipelines/fred/prepare_fred.qmd

# Prepare FLoRA (paper-level dataset)
quarto render pipelines/flora/prepare_flora.qmd

# Output files created:
# - output/FReD.xlsx (effect-level data with augmentation)
# - output/flora.csv (paper-level data with augmentation)

Data Processing Pipelines

FReD Pipeline (Effect-level)

File: pipelines/fred/prepare_fred.qmd

8-step process:

  1. Load helpers - Source R functions for cleaning and augmentation
  2. Download - Fetch validated FReD data from Google Sheets
  3. COS Integration (optional) - Merge COS Phase 1 test data if enabled
  4. Clean - Standardize formatting, remove duplicates, fix DOIs
  5. Validate - Check data quality (placeholder until the validation module is complete)
  6. Generate IDs - Create fred_id, entry_id, effect_id
  7. Augment:
    • Author overlap detection (% shared authors)
    • Clean references (APA-formatted with manual overrides)
    • Keywords from OpenAlex (optional)
  8. Save - Output to output/FReD.xlsx
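
To make the "% shared authors" idea in step 7 concrete, here is a minimal Python sketch. It is not the repository's R implementation in R/crossref_cache.R: the name-normalization rule and the choice of denominator (replication authors) are assumptions made for illustration.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def author_overlap_pct(original: list[str], replication: list[str]) -> float:
    """Percentage of replication authors who also appear on the original paper.

    Assumption: the denominator is the replication author count; the real
    pipeline may define overlap differently.
    """
    orig = {normalize(a) for a in original}
    repl = {normalize(a) for a in replication}
    if not repl:
        return 0.0
    return 100.0 * len(orig & repl) / len(repl)


# Hypothetical author lists, for illustration only.
original = ["Ada Lovelace", "Alan Turing"]
replication = ["alan  turing", "Grace Hopper", "Edsger Dijkstra", "Barbara Liskov"]
overlap = author_overlap_pct(original, replication)
print(f"{overlap:.1f}% shared authors")
```

Exact string matching after normalization will miss initials and transliterated names, so a production version would likely need fuzzier matching.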

Run:

# Without COS data (default)
quarto render pipelines/fred/prepare_fred.qmd

# With COS data merged
export ENABLE_COS_MERGE=TRUE
quarto render pipelines/fred/prepare_fred.qmd

# Output: output/FReD.xlsx
# HTML report: pipelines/fred/prepare_fred.html

FLoRA Pipeline (Paper-level: Replications + Reproductions)

File: pipelines/flora/prepare_flora.qmd

12-step process:

  1. Load helpers - Source R functions for augmentation
  2. Download & Combine - Fetch both replications and reproductions from Google Sheets, combine on common columns
  3. Prepare - Select relevant columns
  4. Deduplicate - Remove duplicate (doi_o, doi_r) pairs
  5. Validate DOIs - Ensure format starts with "10."
  6. Fetch metadata - CrossRef/DataCite lookup (framework ready)
  7. Augment with references - Clean references (APA-formatted with manual overrides)
  8. Add IDs - Privacy-preserving 3-char DOI hash prefixes
  9. Add language & keywords - Fetch from OpenAlex API (only fills empty fields)
  10. Format - Reorder columns for output (includes keywords and language)
  11. Save - Output to output/flora.csv
  12. Summary - Report statistics on papers, coverage, and augmentation success
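
Steps 4 and 5 can be sketched as follows. This is an illustrative Python version, not the code in prepare_flora.qmd. The column names doi_o and doi_r come from the list above, while normalizing DOIs (lowercasing, stripping a resolver prefix) before comparing them is an assumption.

```python
def normalize_doi(doi: str) -> str:
    """Trim, lowercase, and strip a resolver prefix (assumed normalization)."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def is_valid_doi(doi: str) -> bool:
    """Format check from step 5: a DOI must start with '10.'."""
    return doi.startswith("10.")


def deduplicate_pairs(rows: list[dict]) -> list[dict]:
    """Keep the first row for each (doi_o, doi_r) pair, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = (normalize_doi(row["doi_o"]), normalize_doi(row["doi_r"]))
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


# Hypothetical rows, for illustration only.
rows = [
    {"doi_o": "10.1000/orig1", "doi_r": "10.1000/rep1"},
    {"doi_o": "https://doi.org/10.1000/ORIG1", "doi_r": "10.1000/rep1"},  # duplicate pair
    {"doi_o": "10.1000/orig1", "doi_r": "10.1000/rep2"},  # same original, new replication
    {"doi_o": "not-a-doi", "doi_r": "10.1000/rep3"},  # fails the format check
]
unique_rows = deduplicate_pairs(rows)
valid_rows = [
    r for r in unique_rows
    if is_valid_doi(normalize_doi(r["doi_o"])) and is_valid_doi(normalize_doi(r["doi_r"]))
]
print(len(rows), len(unique_rows), len(valid_rows))
```

Deduplicating on the pair rather than on a single DOI keeps one original paper with several replications as separate rows.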

Run:

quarto render pipelines/flora/prepare_flora.qmd

# Output: output/flora.csv
# HTML report: pipelines/flora/prepare_flora.html

Helper Functions

All helper functions are in R/ and ready to use:

Data Cleaning (R/data_cleaning.R)

  • clean_fred_data(data) - Standardize formatting, fix DOIs, remove non-printable characters

CrossRef Caching (R/crossref_cache.R)

  • get_apa_references(dois) - Get APA references (manual → cache → API lookup)
  • get_crossref_authors(dois) - Fetch author lists from CrossRef
  • compute_author_overlap(data) - Calculate author overlap

Augmentation (R/augmentation.R)

  • augment_with_author_overlap(data) - Add author overlap columns
  • augment_with_clean_references(data) - Add APA reference columns
  • augment_with_keywords(data) - Add OpenAlex keywords

Release (R/release_helpers.R)

  • release_to_osf(dataset_path, ...) - Release dataset to OSF with versioning

Usage:

source("R/augmentation.R")

# Augment data
data <- augment_with_author_overlap(data)
data <- augment_with_clean_references(data)
data <- augment_with_keywords(data)

COS Integration

COS (Center for Open Science) Phase 1 test data can be optionally merged with FReD.

Enabling COS Integration

# Method 1: Environment variable
export ENABLE_COS_MERGE=TRUE
quarto render pipelines/fred/prepare_fred.qmd

# Method 2: .env file
# Edit .env and set: ENABLE_COS_MERGE=TRUE
quarto render pipelines/fred/prepare_fred.qmd

# Disable (default)
export ENABLE_COS_MERGE=FALSE
quarto render pipelines/fred/prepare_fred.qmd

How It Works

When ENABLE_COS_MERGE=TRUE:

  1. FReD pipeline downloads main dataset
  2. Merges with COS data on common columns
  3. Both processed identically (cleaning, validation, augmentation)
  4. Single output: FReD.xlsx with both datasets
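
A rough Python sketch of the toggle and the merge is shown below. It reflects one reading of "merges on common columns" (stack the rows and keep only columns present in both tables). The function names and column names are hypothetical, and the actual pipeline is written in R.

```python
import os


def cos_merge_enabled(env=os.environ) -> bool:
    """Treat ENABLE_COS_MERGE as on only when it is set to TRUE (case-insensitive)."""
    return env.get("ENABLE_COS_MERGE", "FALSE").strip().upper() == "TRUE"


def merge_on_common_columns(main: list[dict], extra: list[dict]) -> list[dict]:
    """Stack two tables, keeping only the columns that appear in both.

    Assumption: every row in a table has the same keys as that table's first row.
    """
    if not main or not extra:
        return list(main) + list(extra)
    common = [col for col in main[0] if col in extra[0]]
    return [{col: row[col] for col in common} for row in main + extra]


# Hypothetical rows and column names, for illustration only.
fred = [{"doi_o": "10.1000/a", "es_replication": 0.21, "coder": "x"}]
cos = [{"doi_o": "10.1000/b", "es_replication": 0.05, "phase": 1}]
env = {"ENABLE_COS_MERGE": "true"}

data = merge_on_common_columns(fred, cos) if cos_merge_enabled(env) else fred
print(data)
```

Because the toggle defaults to off, an unset or misspelled variable leaves the FReD data unchanged.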

See: cos_integration/README.md for detailed instructions


Caching Strategy

Caches are organized by data type rather than by purpose, with one file per kind of data:

| Cache File             | Type         | Purpose                       |
|------------------------|--------------|-------------------------------|
| crossref_doi_cache.rds | DOI Metadata | CrossRef/DataCite results     |
| crossref_citations.rds | References   | APA/BibTeX formatted citations |
| crossref_authors.xlsx  | Authors      | Author lists by DOI           |
| author_overlap.xlsx    | Overlap Data | Computed author overlaps      |
| manual_references.xlsx | Overrides    | Manual reference corrections  |
| openalex_keywords.csv  | Keywords     | OpenAlex keywords by DOI      |

Three-tier lookup for references (fastest to slowest):

  1. Manual reference overrides (manual_references.xlsx)
  2. Cached references (RDS cache files)
  3. Live CrossRef API call
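
The three tiers can be sketched in Python as below. This is an illustration of the lookup order, not the R code behind get_apa_references. The fetch callable stands in for the live CrossRef request, and writing a successful API result back to the cache is an assumption.

```python
from typing import Callable, Optional


def lookup_reference(
    doi: str,
    manual: dict,
    cache: dict,
    fetch: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Resolve a reference by checking manual overrides, then the cache, then the API."""
    if doi in manual:  # tier 1: manual overrides always win
        return manual[doi]
    if doi in cache:  # tier 2: previously fetched result
        return cache[doi]
    ref = fetch(doi)  # tier 3: live lookup (slowest)
    if ref is not None:
        cache[doi] = ref  # assumed: store the result so the next run skips the API
    return ref


# Stand-in for the API call; records which DOIs actually reached it.
calls = []


def fake_fetch(doi: str) -> Optional[str]:
    calls.append(doi)
    return f"Reference for {doi}"


# Hypothetical data, for illustration only.
manual = {"10.1000/a": "Manual, A. (2020). Corrected title."}
cache = {"10.1000/b": "Cached, B. (2021). Cached title."}

for doi in ["10.1000/a", "10.1000/b", "10.1000/c", "10.1000/c"]:
    print(doi, "->", lookup_reference(doi, manual, cache, fake_fetch))
print("API calls:", calls)
```

Only the first lookup of 10.1000/c reaches the stand-in API; the repeat is served from the cache.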

API Documentation

See the original API README for full details. API endpoints:

  • Prefix Lookup (privacy-preserving 3-char hash lookups)
  • Original DOI Lookup (direct DOI searches)

The API is powered by the FLoRA dataset (output/flora.csv) loaded into DynamoDB.

API Backend: fred_dynamodb_loader/load_fred_to_dynamodb.py
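
The prefix lookup can be pictured with the Python sketch below. The README only states that 3-character DOI hash prefixes are used. The hash function (SHA-256), the hex encoding, the lowercase normalization, and all function names here are assumptions and may not match load_fred_to_dynamodb.py.

```python
import hashlib


def doi_hash(doi: str) -> str:
    """Hex digest of a normalized DOI (SHA-256 is an assumption)."""
    return hashlib.sha256(doi.strip().lower().encode("utf-8")).hexdigest()


def build_index(dois: list[str]) -> dict:
    """Server side: group full hashes under their 3-character prefix."""
    index = {}
    for doi in dois:
        h = doi_hash(doi)
        index.setdefault(h[:3], []).append(h)
    return index


def has_replication(doi: str, index: dict) -> bool:
    """Client side: send only the prefix, then match the full hash locally."""
    h = doi_hash(doi)
    candidates = index.get(h[:3], [])  # the server sees only h[:3]
    return h in candidates


# Hypothetical DOIs, for illustration only.
index = build_index(["10.1000/orig1", "10.1000/orig2"])
print(has_replication("10.1000/ORIG1", index))
print(has_replication("10.9999/unknown", index))
```

A 3-character hex prefix has only 4096 possible values, so many DOIs share each bucket and the server cannot tell which one the client holds.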


Breaking Changes from Previous Version

This reorganization introduces breaking changes:

  • Old scripts moved to archive/old_scripts/
  • All helper functions now in R/
  • Pipelines now in pipelines/fred/ and pipelines/flora/ (not at root)
  • Output files now in output/ (use pipelines to generate)
  • No symlinks created at root

Migration path:

  1. Run new pipelines: quarto render pipelines/fred/prepare_fred.qmd
  2. Use output/FReD.xlsx and output/flora.csv as outputs
  3. All helper functions available via source("R/...") in your scripts

Configuration

Environment Variables

Set in .env file (or shell environment):

# OSF Release Authentication
OSF_TOKEN=your_osf_token_here

# COS Integration Toggle
ENABLE_COS_MERGE=FALSE  # Set to TRUE to merge COS test data

# OpenAlex API Contact
OPENALEX_MAILTO=your_email@example.com

Cache Configuration

All cache paths defined in R/cache_config.R:

CACHE_DIR <- "cache"
CROSSREF_DOI_CACHE <- "cache/crossref_doi_cache.rds"
CROSSREF_CITATIONS_CACHE <- "cache/crossref_citations.rds"
# ... etc

To change cache locations, edit R/cache_config.R and update the paths.


Troubleshooting

Pipeline fails to download data

  • Check internet connection
  • Verify Google Sheets URLs are still active
  • Check that CSV format is still used for export

Missing cache files

  • Caches are auto-generated on first run
  • Ensure cache/ directory exists
  • Check file permissions

COS data not merging

  • Ensure ENABLE_COS_MERGE=TRUE
  • Verify cos_integration/cos_test_set_phase1_prepared.xlsx exists
  • Run Rscript cos_integration/prepare_cos_data.R to prepare COS data

Reference lookup fails

  • Check internet connection for CrossRef API
  • Verify manual_references.xlsx exists if using overrides
  • Check that OPENALEX_MAILTO is set if OpenAlex keyword lookups fail (OSF_TOKEN is only needed for OSF releases)

Contributing

Running with Debug Output

# Enable verbose logging
quarto render pipelines/fred/prepare_fred.qmd --quiet false

Testing Individual Functions

# Test cleaning
source("R/data_cleaning.R")
result <- clean_fred_data(sample_data)

# Test augmentation
source("R/augmentation.R")
data <- augment_with_author_overlap(sample_data)

# Test caching
source("R/crossref_cache.R")
refs <- get_apa_references(c("10.1234/example"))

Adding New Augmentations

  1. Create function in R/augmentation.R
  2. Follow pattern: augment_with_[feature](data)
  3. Add cache management as needed
  4. Call from appropriate pipeline file
  5. Document in pipeline comments

Related Resources

  • FORRT Project: https://forrt.org
  • FReD Dataset: https://osf.io/9r62x (OSF project)
  • Zotero Plugin: Replication Checker plugin in Zotero marketplace
  • API Backend: fred_dynamodb_loader/ (AWS Lambda + DynamoDB)

License

[See LICENSE file in repository]


Contact

For questions about the data processing pipeline:

  • Open an issue on GitHub
  • Contact FORRT team

For API issues:

  • See API documentation in this README
  • Check fred_dynamodb_loader/ for backend code

Last Updated: 2025-12-17
Version: 2.0 (Reorganized with independent pipelines)
Status: Production-ready


📊 FLoRA Dataset Update Log

This chart is automatically updated after each pipeline run, showing the size of output/flora.csv over time.

xychart-beta
    title "FLoRA Dataset Size Over Time (Total / Replications / Reproductions)"
    x-axis ["2026-03-16", "2026-03-17", "2026-03-18", "2026-03-19"]
    line [1126, 1436, 1473, 1474]
    line [1118, 1428, 1465, 1466]
    line [8, 8, 8, 8]

Latest (2026-03-19): Total = 1474 | Replications = 1466 | Reproductions = 8
