FORRT Replication Database (FReD) — Data Processing & API

This repository contains the complete data processing pipeline for the FORRT Replication Database (FReD) and FLoRA datasets, plus the backend API for the Zotero Replication Checker plugin.

📋 Table of Contents

Overview
Repository Structure
Quick Start
Data Processing Pipelines
- FReD Pipeline (Effect-level)
- FLoRA Pipeline (Paper-level)
COS Integration
API Documentation
Contributing

Overview

This repository manages two independent datasets:

FReD (FORRT Replication Database)
- Effect-level data: individual effect sizes from replications
- Created by merging individual effects with paper-level success coding
- Output: output/FReD.xlsx
FLoRA (FORRT Literature on Replications and Reproductions Archive)
- Paper-level data: metadata about replication and reproduction studies
- Combines both replications and reproductions from two separate Google Sheets
- Deduplicated by original-replication/reproduction DOI pairs
- Enriched with keywords and language metadata from OpenAlex
- Output: output/flora.csv

Both datasets are augmented with:

CrossRef metadata (titles, authors, years)
APA-formatted references with manual override support
Author overlap detection
OpenAlex keywords

The datasets power the Zotero Replication Checker API backend via privacy-preserving DOI hash lookups.

Repository Structure

fred_data/
├── R/                              # Shared helper functions
│   ├── cache_config.R              # Cache paths by data type
│   ├── data_cleaning.R             # FReD data cleaning
│   ├── crossref_cache.R            # Citation & author caching
│   ├── augmentation.R              # Augmentation functions
│   └── release_helpers.R           # OSF release automation
│
├── pipelines/                      # Independent data pipelines
│   ├── fred/
│   │   ├── prepare_fred.qmd        # Download → clean → augment → save
│   │   └── release_fred.qmd        # Release to OSF (optional)
│   │
│   └── flora/
│       ├── prepare_flora.qmd       # Download → deduplicate → augment → save
│       └── release_flora.qmd       # Release to OSF (optional)
│
├── cache/                          # Cache files (gitignored)
│   ├── crossref_doi_cache.rds      # DOI metadata
│   ├── crossref_citations.rds      # APA/BibTeX references
│   ├── crossref_authors.xlsx       # Author lists
│   ├── author_overlap.xlsx         # Overlap calculations
│   ├── manual_references.xlsx      # Manual reference overrides
│   └── openalex_keywords.csv       # Keywords cache
│
├── output/                         # Generated datasets (gitignored)
│   ├── FReD.xlsx                   # Effect-level dataset
│   └── flora.csv                   # Paper-level dataset
│
├── cos_integration/                # COS test data (optional)
│   ├── cos_test_set_phase1.csv
│   ├── cos_test_set_phase1_prepared.xlsx
│   ├── prepare_cos_data.R
│   └── README.md                   # COS toggle instructions
│
├── fred_dynamodb_loader/           # API backend loader
├── release/                        # Release automation scripts
├── COS Reports/                    # COS competition reports
├── archive/                        # Historical files
│   └── old_scripts/                # Previous pipeline versions
│
└── [Documentation files]
    ├── README.md                   # This file
    ├── .env.example                # Environment variables template
    ├── REORGANIZATION_PROGRESS.md
    ├── PHASE2_SUMMARY.md
    ├── PHASE3-4_SUMMARY.md
    └── IMPLEMENTATION_STATUS.md

Quick Start

Installation

# Clone repository
git clone https://github.com/forrtproject/FReD-data.git fred_data
cd fred_data

# Set up environment variables
cp .env.example .env
# Edit .env and add:
# - OSF_TOKEN (for releases)
# - ENABLE_COS_MERGE (TRUE/FALSE)
# - OPENALEX_MAILTO (your email for API)

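# Install R dependencies (one-time setup)
# Note: `quarto render` needs the Quarto CLI installed separately; the 'quarto' R package only wraps it
Rscript -e "install.packages(c('tidyverse', 'readxl', 'openxlsx', 'rcrossref', 'osfr', 'quarto'), repos = 'https://cloud.r-project.org')"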

Running Pipelines

# Prepare FReD (effect-level dataset)
quarto render pipelines/fred/prepare_fred.qmd

# Prepare FLoRA (paper-level dataset)
quarto render pipelines/flora/prepare_flora.qmd

# Output files created:
# - output/FReD.xlsx (effect-level data with augmentation)
# - output/flora.csv (paper-level data with augmentation)

Data Processing Pipelines

FReD Pipeline (Effect-level)

File: pipelines/fred/prepare_fred.qmd

8-step process:

1. Load helpers - Source R functions for cleaning and augmentation
2. Download - Fetch validated FReD data from Google Sheets
3. COS Integration (optional) - Merge COS Phase 1 test data if enabled
4. Clean - Standardize formatting, remove duplicates, fix DOIs
5. Validate - Check data quality (pending completion of the validation module)
6. Generate IDs - Create fred_id, entry_id, effect_id
7. Augment:
   - Author overlap detection (% shared authors)
   - Clean references (APA-formatted with manual overrides)
   - Keywords from OpenAlex (optional)
8. Save - Output to output/FReD.xlsx

Run:

# Without COS data (default)
quarto render pipelines/fred/prepare_fred.qmd

# With COS data merged
export ENABLE_COS_MERGE=TRUE
quarto render pipelines/fred/prepare_fred.qmd

# Output: output/FReD.xlsx
# HTML report: pipelines/fred/prepare_fred.html

FLoRA Pipeline (Paper-level: Replications + Reproductions)

File: pipelines/flora/prepare_flora.qmd

12-step process:

1. Load helpers - Source R functions for augmentation
2. Download & Combine - Fetch both replications and reproductions from Google Sheets, combine on common columns
3. Prepare - Select relevant columns
4. Deduplicate - Remove duplicate (doi_o, doi_r) pairs
5. Validate DOIs - Ensure format starts with "10."
6. Fetch metadata - CrossRef/DataCite lookup (framework ready)
7. Augment with references - Clean references (APA-formatted with manual overrides)
8. Add IDs - Privacy-preserving 3-char DOI hash prefixes
9. Add language & keywords - Fetch from OpenAlex API (only fills empty fields)
10. Format - Reorder columns for output (includes keywords and language)
11. Save - Output to output/flora.csv
12. Summary - Report statistics on papers, coverage, and augmentation success
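
The deduplication and DOI validation steps above can be sketched as follows. This is a Python illustration, not the pipeline's R code: the row layout and the helper names `is_valid_doi` and `deduplicate_pairs` are assumptions, as is the case-insensitive comparison. Only the `doi_o`/`doi_r` column names and the "10." prefix check come from the step list.

```python
def is_valid_doi(doi):
    """Return True if the value looks like a DOI, i.e. starts with '10.'."""
    return isinstance(doi, str) and doi.strip().startswith("10.")


def deduplicate_pairs(rows):
    """Keep the first row for each (doi_o, doi_r) pair, compared case-insensitively."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["doi_o"].strip().lower(), row["doi_r"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample rows for illustration only
rows = [
    {"doi_o": "10.1000/orig1", "doi_r": "10.1000/rep1"},
    {"doi_o": "10.1000/ORIG1", "doi_r": "10.1000/rep1"},      # same pair, different case
    {"doi_o": "10.1000/orig1", "doi_r": "10.1000/rep2"},      # same original, new replication
    {"doi_o": "doi:10.1000/orig3", "doi_r": "10.1000/rep3"},  # fails the "10." check
]

deduped = deduplicate_pairs(rows)
valid = [r for r in deduped if is_valid_doi(r["doi_o"]) and is_valid_doi(r["doi_r"])]
print(len(rows), len(deduped), len(valid))
```

Deduplicating on the pair rather than on a single DOI keeps several replications of the same original as separate rows.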

Run:

quarto render pipelines/flora/prepare_flora.qmd

# Output: output/flora.csv
# HTML report: pipelines/flora/prepare_flora.html

Helper Functions

All helper functions are in R/ and ready to use:

Data Cleaning (`R/data_cleaning.R`)

clean_fred_data(data) - Standardize formatting, fix DOIs, remove non-printable characters
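
To illustrate the kind of cleaning described here, the sketch below normalizes DOIs and strips non-printable characters. It is a Python illustration under stated assumptions, not the contents of `clean_fred_data`: the function names, the set of DOI prefixes handled, and the choice to lowercase DOIs are all hypothetical.

```python
import re
import unicodedata

# Resolver and label prefixes that commonly precede a bare DOI (assumed set)
DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(value):
    """Strip surrounding whitespace and resolver prefixes, then lowercase."""
    return DOI_PREFIX.sub("", value.strip()).lower()


def strip_non_printable(text):
    """Drop control and format characters (Unicode categories starting with 'C')."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))


print(normalize_doi(" https://doi.org/10.1000/ABC "))
print(strip_non_printable("Smith\u200b et al.\x00"))
```

Note that `strip_non_printable` as written also removes tabs and newlines, which suits single-line spreadsheet cells but not multi-line text.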

CrossRef Caching (`R/crossref_cache.R`)

get_apa_references(dois) - Get APA references (manual → cache → API lookup)
get_crossref_authors(dois) - Fetch author lists from CrossRef
compute_author_overlap(data) - Calculate author overlap
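
As a rough illustration of author overlap, the Python sketch below computes the percentage of replication authors who also appear on the original paper. The exact definition used by `compute_author_overlap` is not stated in this README, so the denominator, the name matching by lowercased string, and the function name `author_overlap_pct` are assumptions.

```python
def author_overlap_pct(original_authors, replication_authors):
    """Percentage of replication authors who also appear on the original paper."""
    orig = {name.strip().lower() for name in original_authors}
    rep = {name.strip().lower() for name in replication_authors}
    if not rep:
        return None  # overlap is undefined without replication authors
    return 100.0 * len(orig & rep) / len(rep)


# Hypothetical author lists for illustration only
print(author_overlap_pct(["A. Smith", "B. Jones"],
                         ["b. jones", "C. Lee", "D. Kim", "E. Wu"]))
```

Real author matching usually needs more than string equality (initials, diacritics, name order), so treat exact matching as a lower bound on overlap.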

Augmentation (`R/augmentation.R`)

augment_with_author_overlap(data) - Add author overlap columns
augment_with_clean_references(data) - Add APA reference columns
augment_with_keywords(data) - Add OpenAlex keywords
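
The FLoRA pipeline notes that OpenAlex lookups only fill empty fields. A minimal Python sketch of that rule is shown below; whether `augment_with_keywords` behaves exactly this way is an assumption, and the function name `fill_empty_fields` and the sample values are hypothetical.

```python
def fill_empty_fields(row, fetched, fields=("keywords", "language")):
    """Return a copy of row where only missing or blank fields take fetched values."""
    updated = dict(row)
    for field in fields:
        current = updated.get(field)
        is_empty = current is None or str(current).strip() == ""
        if is_empty and fetched.get(field):
            updated[field] = fetched[field]
    return updated


# Existing row: keywords blank, language already curated by hand
row = {"doi_r": "10.1000/rep1", "keywords": "", "language": "en"}
fetched = {"keywords": "priming; replication", "language": "de"}

print(fill_empty_fields(row, fetched))
```

Leaving populated fields untouched means manually curated values are never overwritten by an automated lookup.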

Release (`R/release_helpers.R`)

release_to_osf(dataset_path, ...) - Release dataset to OSF with versioning

Usage:

source("R/augmentation.R")

# Augment data
data <- augment_with_author_overlap(data)
data <- augment_with_clean_references(data)
data <- augment_with_keywords(data)

COS Integration

COS (Center for Open Science) Phase 1 test data can optionally be merged with FReD.

Enabling COS Integration

# Method 1: Environment variable
export ENABLE_COS_MERGE=TRUE
quarto render pipelines/fred/prepare_fred.qmd

# Method 2: .env file
# Edit .env and set: ENABLE_COS_MERGE=TRUE
quarto render pipelines/fred/prepare_fred.qmd

# Disable (default)
export ENABLE_COS_MERGE=FALSE
quarto render pipelines/fred/prepare_fred.qmd

How It Works

When ENABLE_COS_MERGE=TRUE:

FReD pipeline downloads main dataset
Merges with COS data on common columns
Both processed identically (cleaning, validation, augmentation)
Single output: FReD.xlsx with both datasets
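
The toggle and merge described above could look roughly like this. It is a Python sketch, not the pipeline's R implementation: it assumes "merging on common columns" means stacking the rows of both datasets while keeping only the columns they share, and the function names and sample columns are hypothetical.

```python
import os


def cos_merge_enabled(env=os.environ):
    """Read the ENABLE_COS_MERGE toggle; anything other than TRUE counts as off."""
    return env.get("ENABLE_COS_MERGE", "FALSE").strip().upper() == "TRUE"


def merge_on_common_columns(main_rows, cos_rows):
    """Stack two row lists, keeping only the columns present in both."""
    if not main_rows or not cos_rows:
        return list(main_rows) + list(cos_rows)
    common = [col for col in main_rows[0] if col in cos_rows[0]]
    return [{col: row[col] for col in common} for row in list(main_rows) + list(cos_rows)]


# Hypothetical rows; the real datasets have many more columns
fred_rows = [{"doi_o": "10.1000/a", "es": 0.3, "coder": "x"}]
cos_rows = [{"doi_o": "10.1000/b", "es": 0.1, "score": 0.8}]

merged = fred_rows
if cos_merge_enabled({"ENABLE_COS_MERGE": "TRUE"}):
    merged = merge_on_common_columns(fred_rows, cos_rows)
print(merged)
```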

See: cos_integration/README.md for detailed instructions

Caching Strategy

Caches are organized by data type rather than by purpose, so each kind of lookup has a single cache file:

| Cache File | Type | Purpose |
| --- | --- | --- |
| `crossref_doi_cache.rds` | DOI Metadata | CrossRef/DataCite results |
| `crossref_citations.rds` | References | APA/BibTeX formatted citations |
| `crossref_authors.xlsx` | Authors | Author lists by DOI |
| `author_overlap.xlsx` | Overlap Data | Computed author overlaps |
| `manual_references.xlsx` | Overrides | Manual reference corrections |
| `openalex_keywords.csv` | Keywords | OpenAlex keywords by DOI |

Three-tier lookup for references (fastest to slowest):

Manual reference overrides (manual_references.xlsx)
Cached references (RDS cache files)
Live CrossRef API call
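
The three tiers above can be sketched as a single lookup function. This is a Python illustration of the pattern, not the code in `R/crossref_cache.R`: the function name, the dictionary-based stores, and the `fake_fetch` stand-in for a CrossRef request are all hypothetical.

```python
def lookup_reference(doi, manual, cache, fetch):
    """Resolve a reference: manual override, then cache, then live lookup."""
    if doi in manual:
        return manual[doi], "manual"
    if doi in cache:
        return cache[doi], "cache"
    reference = fetch(doi)       # slowest tier: would be a network call
    if reference is not None:
        cache[doi] = reference   # store it so the next lookup skips the API
    return reference, "api"


calls = []


def fake_fetch(doi):
    """Stand-in for a live CrossRef request (no network access)."""
    calls.append(doi)
    return f"Reference for {doi}"


manual = {"10.1000/a": "Manually corrected reference"}
cache = {"10.1000/b": "Cached reference"}

for doi in ["10.1000/a", "10.1000/b", "10.1000/c", "10.1000/c"]:
    print(doi, lookup_reference(doi, manual, cache, fake_fetch)[1])
```

Checking manual overrides first means a hand-corrected reference always wins, even if a stale version is still in the cache.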

API Documentation

See the original API README for full details. The API exposes two endpoints:

Prefix Lookup (privacy-preserving 3-char hash lookups)
Original DOI Lookup (direct DOI searches)

The API is powered by the FLoRA dataset (output/flora.csv), loaded into DynamoDB.

API Backend: fred_dynamodb_loader/load_fred_to_dynamodb.py
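
One common way to make prefix lookups privacy-preserving is for the client to send only a short hash prefix and compare full hashes locally. The Python sketch below illustrates that idea. Only the 3-character prefix length comes from this README; the use of SHA-256, the DOI normalization, and all function names are assumptions rather than a description of the actual API.

```python
import hashlib


def doi_hash(doi):
    """Hex digest of the normalized DOI (SHA-256 is an assumption)."""
    return hashlib.sha256(doi.strip().lower().encode("utf-8")).hexdigest()


def build_index(dois, prefix_len=3):
    """Server side: group full hashes by their first prefix_len hex characters."""
    index = {}
    for doi in dois:
        digest = doi_hash(doi)
        index.setdefault(digest[:prefix_len], []).append(digest)
    return index


def client_has_match(doi, index, prefix_len=3):
    """Client side: request candidates by prefix only, then compare locally."""
    digest = doi_hash(doi)
    candidates = index.get(digest[:prefix_len], [])  # the server only sees the prefix
    return digest in candidates


# Hypothetical DOIs for illustration only
index = build_index(["10.1000/orig1", "10.1000/orig2"])
print(client_has_match("10.1000/ORIG1", index))
print(client_has_match("10.1000/unknown", index))
```

Because many DOIs share any given 3-character prefix, the server cannot tell which specific paper the client was checking.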

Breaking Changes from Previous Version

This reorganization introduces breaking changes:

Old scripts moved to archive/old_scripts/
All helper functions now in R/
Pipelines now in pipelines/fred/ and pipelines/flora/ (not at root)
Output files now in output/ (use pipelines to generate)
No symlinks created at root

Migration path:

Run new pipelines: quarto render pipelines/fred/prepare_fred.qmd
Use output/FReD.xlsx and output/flora.csv as outputs
All helper functions available via source("R/...") in your scripts

Configuration

Environment Variables

Set in .env file (or shell environment):

# OSF Release Authentication
OSF_TOKEN=your_osf_token_here

# COS Integration Toggle (set to TRUE to merge COS test data)
ENABLE_COS_MERGE=FALSE

# OpenAlex API Contact
OPENALEX_MAILTO=your_email@example.com

Cache Configuration

All cache paths defined in R/cache_config.R:

CACHE_DIR <- "cache"
CROSSREF_DOI_CACHE <- "cache/crossref_doi_cache.rds"
CROSSREF_CITATIONS_CACHE <- "cache/crossref_citations.rds"
# ... etc

To change cache locations, edit R/cache_config.R and update the paths.

Troubleshooting

Pipeline fails to download data

Check internet connection
Verify Google Sheets URLs are still active
Check that CSV format is still used for export

Missing cache files

Caches are auto-generated on first run
Ensure cache/ directory exists
Check file permissions

COS data not merging

Ensure ENABLE_COS_MERGE=TRUE
Verify cos_integration/cos_test_set_phase1_prepared.xlsx exists
Run Rscript cos_integration/prepare_cos_data.R to prepare COS data

Reference lookup fails

Check internet connection for CrossRef API
Verify manual_references.xlsx exists if using overrides
Check that OPENALEX_MAILTO is set for OpenAlex lookups (OSF_TOKEN is only used for OSF releases)

Contributing

Running with Debug Output

# Rendering prints progress by default; add --execute-debug for extra detail from code execution
quarto render pipelines/fred/prepare_fred.qmd --execute-debug

Testing Individual Functions

# Test cleaning (sample_data is a small data frame you supply)
source("R/data_cleaning.R")
result <- clean_fred_data(sample_data)

# Test augmentation
source("R/augmentation.R")
data <- augment_with_author_overlap(sample_data)

# Test caching
source("R/crossref_cache.R")
refs <- get_apa_references(c("10.1234/example"))

Adding New Augmentations

Create function in R/augmentation.R
Follow pattern: augment_with_[feature](data)
Add cache management as needed
Call from appropriate pipeline file
Document in pipeline comments

Related Resources

FORRT Project: https://forrt.org
FReD Dataset: https://osf.io/9r62x (OSF project)
Zotero Plugin: Replication Checker plugin in Zotero marketplace
API Backend: fred_dynamodb_loader/ (AWS Lambda + DynamoDB)

License

[See LICENSE file in repository]

Contact

For questions about the data processing pipeline:

Open an issue on GitHub
Contact FORRT team

For API issues:

See API documentation in this README
Check fred_dynamodb_loader/ for backend code

Last Updated: 2025-12-17
Version: 2.0 (Reorganized with independent pipelines)
Status: Production-ready

📊 FLoRA Dataset Update Log

This chart is automatically updated after each pipeline run, showing the size of output/flora.csv over time.

xychart-beta
    title "FLoRA Dataset Size Over Time (Total / Replications / Reproductions)"
    x-axis ["2026-03-16", "2026-03-17", "2026-03-18", "2026-03-19"]
    line [1126, 1436, 1473, 1474]
    line [1118, 1428, 1465, 1466]
    line [8, 8, 8, 8]

Latest (2026-03-19): Total = 1474 | Replications = 1466 | Reproductions = 8
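
A summary line like the one above could be produced by counting rows per entry type. The Python sketch below shows one way to do that. The column name `type` and its values `replication` and `reproduction` are assumptions about the layout of output/flora.csv, and the sample rows are made up.

```python
import csv
import io
from collections import Counter


def summarize(csv_text, type_column="type"):
    """Count rows per entry type and format a one-line summary."""
    counts = Counter(row[type_column] for row in csv.DictReader(io.StringIO(csv_text)))
    total = sum(counts.values())
    return (f"Total = {total} | Replications = {counts['replication']} "
            f"| Reproductions = {counts['reproduction']}")


# Hypothetical three-row extract, not real data
sample = (
    "doi_o,doi_r,type\n"
    "10.1000/a,10.1000/b,replication\n"
    "10.1000/c,10.1000/d,replication\n"
    "10.1000/e,10.1000/f,reproduction\n"
)
print(summarize(sample))
```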
