Add normalise_player_name() and match_player_ids() to R/utils.R.
Three-pass matching strategy:
Pass 1: exact 'Last, First' match
Pass 2: normalised names (strips accents, suffixes, asterisks,
expands initials JD->J D, fixes UTF-8 mojibake)
Pass 3: year-active disambiguation for ambiguous names
Improves USA Today match rate from ~77% to 95.5% and Spotrac from
~83% to 95.4%. Stars like Harper, Acuna, Altuve, Realmuto, Tatis
now correctly matched.
Update R/scrape.R and data-raw/salaries.R to use new matcher.
Add 13 new tests in test-utils.R (35 total, 147 suite-wide).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
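The normalisation step in Pass 2 can be illustrated with a small base-R sketch. This is not the package's normalise_player_name(); the function name normalise_name_sketch() and the suffix list are assumptions made for illustration, and accent stripping and mojibake repair are deliberately left out because their behaviour depends on platform and encoding.

```r
# Hypothetical sketch of name normalisation (not the package implementation).
# Covers asterisks, generational suffixes, and initials; accents and mojibake
# are intentionally not handled here.
normalise_name_sketch <- function(x) {
  # Drop asterisks such as the trailing marker in "JD Martinez*"
  x <- gsub("*", "", x, fixed = TRUE)
  # Drop a trailing generational suffix (assumed list: Jr, Sr, II, III, IV)
  x <- gsub("[ ,]+(Jr|Sr|II|III|IV)\\.?[[:space:]]*$", "", x, perl = TRUE)
  # Treat periods as separators so "J.D." becomes "J D"
  x <- gsub(".", " ", x, fixed = TRUE)
  # Expand a two-letter block of initials: "JD" -> "J D"
  x <- gsub("\\b([A-Z])([A-Z])\\b", "\\1 \\2", x, perl = TRUE)
  # Collapse whitespace, trim, and lower-case for comparison
  tolower(trimws(gsub("[[:space:]]+", " ", x)))
}

normalise_name_sketch(c("JD Martinez*", "J.D. Martinez", "Ronald Acuna Jr."))
```

Under these assumptions, "JD Martinez*" and "J.D. Martinez" normalise to the same key, which is what lets a second pass match names that fail the exact comparison in Pass 1.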
Add team_name_map() -- maps 60+ team display names (USA Today, Spotrac,
standard abbreviations) to Lahman teamID codes.
Add Pass 4 to match_player_ids(): when team column present, constrain
candidates to team-year roster (~50 players). Within a team-year, last
name alone resolves 96.4% and last+initial resolves 99.6% -- no nickname
table or complex normalization needed.
Results:
USA Today: 95.5% -> 99.0% rows, 97.4% -> 99.6% payroll
Spotrac: 95.4% -> 98.2% rows, 97.6% -> 99.5% payroll
Remaining ~1% are genuine edge cases: Jr. in last name position,
hyphenated names (Kepler-Rozycki), two same-name teammates.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
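The idea behind the team-constrained pass can be sketched as follows. This is a minimal illustration and not the code in match_player_ids(): the function match_on_roster_sketch(), the toy roster, and its player IDs and names are all made up, and the column names simply follow Lahman conventions.

```r
# Hypothetical roster: IDs and names are invented for illustration only.
roster <- data.frame(
  playerID  = c("p001", "p002", "p003"),
  teamID    = c("NYN", "NYN", "CHN"),
  yearID    = c(2019L, 2019L, 2019L),
  nameLast  = c("Smith", "Smith", "Jones"),
  nameFirst = c("Alan", "Bob", "Carl"),
  stringsAsFactors = FALSE
)

# Sketch of a team-constrained lookup: restrict candidates to one team-year,
# try last name alone, then fall back to last name plus first initial.
match_on_roster_sketch <- function(last, first, team, year, roster) {
  cand <- roster[roster$teamID == team & roster$yearID == year, , drop = FALSE]
  # Step a: last name alone within the team-year
  hit <- cand[tolower(cand$nameLast) == tolower(last), , drop = FALSE]
  if (nrow(hit) == 1L) return(hit$playerID)
  # Step b: same-surname teammates, disambiguated by first initial
  same_initial <- toupper(substr(hit$nameFirst, 1, 1)) ==
    toupper(substr(first, 1, 1))
  hit <- hit[same_initial, , drop = FALSE]
  if (nrow(hit) == 1L) return(hit$playerID)
  # No unique candidate: leave unmatched rather than guess
  NA_character_
}

match_on_roster_sketch("Jones", "Carl", "CHN", 2019L, roster)
match_on_roster_sketch("Smith", "Bob", "NYN", 2019L, roster)
```

Because the candidate set is a single roster rather than every player in the database, a surname is usually unique, so no nickname table is needed in this sketch. A player listed under the wrong team returns NA instead of a false match.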
- Remove years >= 2002 restriction on FanGraphs pitching WAR fetch
  (API works back to 1985, adding 8,481 pitcher-seasons of WAR data)
- Extract loaders.R from utils.R (WAR + ChadwickIDs loading)
- Add write_mcp_config() helper for AI tool database access
- Add analytical views: PlayerAcquisitionType, TeamPayroll,
  LeagueMedianSalary, SalaryPerWAR, PlayerWAR, era_label() macro
- Update BattingStats/PitchingStats/FieldingStats with COALESCE fixes
- Add AGENTS.md, update CONTRIBUTING.md, README.md, NEWS.md
- All 147 tests pass
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
roxygenise() adds 4 missing exports: load_chadwick_ids,
load_fangraphs_war, load_statcast, write_mcp_config.
Generates .Rd man pages for all exported functions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- BattingStats: verify AVG, OBP, SLG, OPS, ISO, BABIP, BB%, K% with
  hand-calculated values; test zero-AB edge case returns NULL
- PitchingStats: verify IP, WHIP, K/9, BB/9, HR/9, K/BB, Win%, FIP with
  era-adjusted constant; test zero-IPouts edge case
- FieldingStats: verify FPCT, RF/9, RF/G
- match_player_ids Pass 4a: team + last name resolution
- match_player_ids Pass 4b: same-lastname teammates disambiguated by
  first initial
- match_player_ids Pass 4: teamID column path + wrong-team failure
- team_name_map: 30 franchises, no duplicates, common abbreviation
  mappings (NYM->NYN, CHC->CHN, etc.)
- scrape_salaries: input validation for unknown year slugs
147 -> 227 tests (0 failures)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
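As an illustration of the kind of hand calculation these tests check, here is a base-R sketch of the standard batting rate formulas with a guarded division. It is not the BattingStats view itself, which lives in SQL: batting_rates_sketch() and safe_div() are hypothetical names, the input numbers are invented, and R's NA stands in for the SQL NULL returned in the zero-AB case.

```r
# Hypothetical sketch of standard batting rate stats (not the SQL view).
# X2B and X3B stand for doubles and triples.
batting_rates_sketch <- function(AB, H, X2B, X3B, HR, BB, HBP, SF) {
  # Return NA instead of dividing by zero (mirrors the NULL edge case)
  safe_div <- function(num, den) ifelse(den == 0, NA_real_, num / den)
  TB  <- H + X2B + 2 * X3B + 3 * HR          # total bases
  AVG <- safe_div(H, AB)                     # batting average
  OBP <- safe_div(H + BB + HBP, AB + BB + HBP + SF)
  SLG <- safe_div(TB, AB)                    # slugging percentage
  data.frame(AVG = AVG, OBP = OBP, SLG = SLG,
             OPS = OBP + SLG,                # on-base plus slugging
             ISO = SLG - AVG)                # isolated power
}

# Invented season line: 150 hits in 500 at-bats
batting_rates_sketch(AB = 500, H = 150, X2B = 30, X3B = 5, HR = 20,
                     BB = 50, HBP = 5, SF = 5)
```

For the invented line above, total bases are 150 + 30 + 10 + 60 = 250, so AVG is 150/500 = 0.300, SLG is 250/500 = 0.500, and ISO is 0.200. A row with zero at-bats and zero plate appearances yields NA in every column.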
- Intro now highlights salary extension (2025), WAR (1985+), and MCP
  querying
- Added WAR views section (PlayerIDs, PlayerWAR, SalaryPerWAR) to
  derived views
- Fixed FangraphsPitchingWAR date range: 2002 -> 1985
- Updated war_reliable note (now always TRUE for salary era)
- Fixed view count: 8 -> 10; table+view count: 3+2 -> 3+3
- Added mcp_config.R to package structure listing
- Updated NEWS.md pitching WAR date range
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Attribution section now covers all data sources with
  license/obligations: Lahman (CC BY-SA 3.0), Chadwick (ODC-BY 1.0),
  FanGraphs, Statcast, scrapers
- Clarifies package is a tooling layer that does not bundle third-party
  data
- Credits baseballr (MIT, Bill Petti) as data-fetching layer
- DESCRIPTION updated to mention FanGraphs WAR, Chadwick, MCP config,
  and that no data is bundled
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
New since 0.1.0:
- Extended salary coverage 1985-2025 (Spotrac + USA Today)
- FanGraphs WAR loaders (batting + pitching, 1985+)
- Chadwick Bureau player ID crosswalk
- Multi-pass player name matcher (4 passes, team-constrained)
- Statcast pitch-level data loader
- 6 new analytical views + era_label() macro
- write_mcp_config() for GitHub Copilot CLI / Claude integration
- 227 tests (0 failures)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#   .github/copilot-instructions.md
#   .gitignore
#   AGENTS.md
#   R/loaders.R
#   R/setup_db.R
#   README.md
#   tests/testthat/test-connect.R
#   tests/testthat/test-loaders.R
lahmanTools 0.2.0
What's new
Salary data extended to 2025
- data-raw/salaries.R
- scrape_salaries()
- SalariesAll view unions all three sources; filter is_actual = TRUE
FanGraphs WAR loaders (1985–present)
- load_fangraphs_war(): batting + pitching WAR via baseballr
- load_chadwick_ids(): Chadwick Bureau player ID crosswalk (ODC-BY 1.0)
- load_statcast(): Baseball Savant pitch-level data (2015+)
- PlayerIDs, PlayerWAR, SalaryPerWAR (dollars/WAR by era)
Multi-pass player name matcher
- match_player_ids(): 4-pass matching (exact → normalised → year-constrained → team-constrained)
- normalise_player_name(), team_name_map()
MCP config for AI-assisted querying
- write_mcp_config(): generates config to connect GitHub Copilot CLI or Claude to baseball.duckdb via DuckDB MCP server
New analytical views + macro
- PlayerAcquisitionType, LeagueMedianSalary, TeamPayroll (now implemented)
- era_label(yr) SQL macro: replaces repeated CASE WHEN era blocks
Tests: 227 passing, 0 failures
Attribution: expanded to cover Lahman (CC BY-SA 3.0), Chadwick (ODC-BY 1.0), FanGraphs, Statcast, and baseballr