feat: repository analytics & repo populated & repo health score & health score refactor (IN-1054)#3987
gaspergrom wants to merge 2 commits into main
…lth score refactor Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
| sum(failedAssessments) AS failed, | ||
| sum(passedAssessments) AS passed, | ||
| sum(failedAssessments + passedAssessments) AS total, | ||
| round(100 * (passed / total)) AS percentage |
Security score can divide by zero
Medium Severity
repo_health_score_security_category computes percentage as round(100 * (passed / total)) without guarding total = 0. When a repo/category has no remaining assessments after filtering, total can be zero and this expression can fail or yield invalid results, breaking repository security scoring.
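One possible guard is sketched below as a standalone ClickHouse query. The inline rows from `values(...)` are hypothetical stand-ins for the filtered assessments, and returning NULL for an empty category is an assumption; the pipe may prefer 0 or dropping the row.

```sql
-- Inline sample rows stand in for the filtered assessments (hypothetical data).
SELECT
    repo,
    sum(failedAssessments) AS failed,
    sum(passedAssessments) AS passed,
    sum(failedAssessments + passedAssessments) AS total,
    -- Skip the division when nothing is left to score and return NULL instead.
    if(total = 0, NULL, round(100 * (passed / total))) AS percentage
FROM values(
    'repo String, passedAssessments UInt32, failedAssessments UInt32',
    ('repo-a', 3, 1),
    ('repo-b', 0, 0)
)
GROUP BY repo
ORDER BY repo
```

Keeping the result NULL lets a downstream rollup tell "no assessments" apart from "all assessments failed".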
| AND repo != '' | ||
| {% if defined(repoUrl) %} | ||
| AND repo = {{ String(repoUrl, description="Repository URL", required=False) }} | ||
| {% end %} |
Excluded repos bypass security endpoint filtering
Low Severity
repo_health_score_security.pipe does not apply the repos_to_channels_excluded check used by other repository health pipes. Queries by repoUrl can return security scores for repositories intentionally excluded from analytics, producing inconsistent behavior across repository health endpoints.
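A minimal sketch of the missing filter, assuming the `repo` column holds the same URL that `repos_to_channels_excluded` stores in `channel`. The CTE and the example URLs are hypothetical stand-ins so the query runs on its own.

```sql
-- `excluded` mimics repos_to_channels_excluded (hypothetical row).
WITH excluded AS (
    SELECT 'https://example.test/org/hidden' AS channel
)
SELECT repo, securityPercentage
FROM values(
    'repo String, securityPercentage Float64',
    ('https://example.test/org/visible', 80),
    ('https://example.test/org/hidden', 95)
)
WHERE repo != ''
  -- Same exclusion pattern the other repo health score pipes apply to channel.
  AND repo NOT IN (SELECT channel FROM excluded)
```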
Pull request overview
This PR adds repository-level analytics (populated repo metadata + repo health score metrics and rollup) and extends the existing project insights/health-score pipeline to support both project and repository records, while refactoring project health-score benchmark logic into shared Tinybird includes.
Changes:
- Introduces repository “populated” enrichment copy pipe + datasource and a daily repository health score copy pipe + datasource.
- Adds repository health score metric pipes (dual API/batch modes) and a copy/rollup pipe that computes category percentages and overall score.
- Extends project_insights_copy_ds to include both project and repo records, adds a new combined insights endpoint, and refactors project health score benchmark logic into shared includes/*.incl.
Reviewed changes
Copilot reviewed 42 out of 42 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| services/libs/tinybird/pipes/repositories_populated_copy.pipe | New copy pipe to materialize enriched repository metadata into repositories_populated_ds. |
| services/libs/tinybird/pipes/repo_health_score_stars.pipe | New repo-level stars metric pipe + shared benchmark include. |
| services/libs/tinybird/pipes/repo_health_score_security.pipe | New repo-level security metric pipe. |
| services/libs/tinybird/pipes/repo_health_score_retention.pipe | New repo-level retention metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_pull_requests.pipe | New repo-level PR metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_organization_dependency.pipe | New repo-level org dependency metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_merge_lead_time.pipe | New repo-level merge lead time metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_issues_resolution.pipe | New repo-level issue resolution metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_forks.pipe | New repo-level forks metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_copy.pipe | New daily copy pipe to join repo metric pipes and compute category/overall scores into repo_health_score_copy_ds. |
| services/libs/tinybird/pipes/repo_health_score_contributor_dependency.pipe | New repo-level contributor dependency metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_contributions_outside_work_hours.pipe | New repo-level outside-work-hours metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_active_days.pipe | New repo-level active-days metric pipe + shared include. |
| services/libs/tinybird/pipes/repo_health_score_active_contributors.pipe | New repo-level active-contributors metric pipe + shared include. |
| services/libs/tinybird/pipes/project_repo_insights.pipe | New endpoint serving combined project + repo insights from project_insights_copy_ds. |
| services/libs/tinybird/pipes/project_insights.pipe | Updated to filter type = 'project' after project_insights_copy_ds becomes mixed-type. |
| services/libs/tinybird/pipes/project_insights_copy.pipe | Extended copy logic to UNION project records with repo records, sourcing repo base from repositories_populated_ds. |
| services/libs/tinybird/pipes/health_score_stars.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_retention.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_pull_requests.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_organization_dependency.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_merge_lead_time.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_issues_resolution.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_forks.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_contributor_dependency.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_contributions_outside_work_hours.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_active_days.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/pipes/health_score_active_contributors.pipe | Refactor to use shared benchmark include. |
| services/libs/tinybird/includes/health_score_stars.incl | New shared benchmark/include logic for stars scoring. |
| services/libs/tinybird/includes/health_score_retention.incl | New shared benchmark/include logic for retention scoring. |
| services/libs/tinybird/includes/health_score_pull_requests.incl | New shared benchmark/include logic for PR scoring. |
| services/libs/tinybird/includes/health_score_organization_dependency.incl | New shared processing + benchmark logic for org dependency scoring. |
| services/libs/tinybird/includes/health_score_merge_lead_time.incl | New shared benchmark/include logic for merge lead time scoring. |
| services/libs/tinybird/includes/health_score_issues_resolution.incl | New shared benchmark/include logic for issue resolution scoring. |
| services/libs/tinybird/includes/health_score_forks.incl | New shared benchmark/include logic for forks scoring. |
| services/libs/tinybird/includes/health_score_contributor_dependency.incl | New shared processing + benchmark logic for contributor dependency scoring. |
| services/libs/tinybird/includes/health_score_contributions_outside_work_hours.incl | New shared benchmark/include logic for outside-work-hours scoring. |
| services/libs/tinybird/includes/health_score_active_days.incl | New shared benchmark/include logic for active-days scoring. |
| services/libs/tinybird/includes/health_score_active_contributors.incl | New shared benchmark/include logic for active-contributors scoring. |
| services/libs/tinybird/datasources/repositories_populated_ds.datasource | New datasource for enriched repository metadata. |
| services/libs/tinybird/datasources/repo_health_score_copy_ds.datasource | New datasource for repository health score rollups and raw metrics/benchmarks. |
| services/libs/tinybird/datasources/project_insights_copy_ds.datasource | Updated schema to include type + repoUrl and adjusted sorting key for mixed-type records. |
| SELECT | ||
| channel, | ||
| uniq(CASE WHEN memberId != '' THEN memberId ELSE NULL END) AS contributorCount, | ||
| uniq(CASE WHEN organizationId != '' THEN organizationId ELSE NULL END) AS organizationCount | ||
| FROM activityRelations_deduplicated_cleaned_bucket_union |
uniq(CASE ... ELSE NULL END) will still count NULL as a distinct value in ClickHouse aggregate functions, so repositories that have any rows with empty memberId/organizationId can be overcounted by 1. Use uniqIf(memberId, memberId != '') / uniqIf(organizationId, organizationId != '') (or countDistinctIf) to exclude empty IDs without introducing NULL into the aggregation.
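The two forms can be compared side by side on inline data before deciding how urgent the change is. The rows below are hypothetical. ClickHouse's documentation states that aggregate functions skip NULL arguments, so the two columns may well agree on the deployed version; in that case `uniqIf` is a readability improvement rather than a correctness fix.

```sql
SELECT
    channel,
    -- Form currently used in the pipe.
    uniq(CASE WHEN memberId != '' THEN memberId ELSE NULL END) AS viaCaseNull,
    -- Form proposed in the review: the condition is explicit and no NULL is introduced.
    uniqIf(memberId, memberId != '') AS viaUniqIf
FROM values(
    'channel String, memberId String',
    ('repo-a', 'm1'),
    ('repo-a', ''),
    ('repo-b', '')
)
GROUP BY channel
ORDER BY channel
```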
| {% if defined(startDate) %} | ||
| AND timestamp | ||
| > {{ DateTime(startDate, description="Filter after date", required=False) }} | ||
| {% end %} | ||
| {% if defined(endDate) %} | ||
| AND timestamp | ||
| < {{ DateTime(endDate, description="Filter before date", required=False) }} | ||
| {% end %} |
In single-repo (repoUrl) mode there’s no default previous-quarter time window unless startDate/endDate are provided, but batch mode restricts to the previous quarter. This makes the API endpoint return all-time active contributors by default. Consider applying the same previous-quarter bounds when repoUrl is defined and no dates are passed.
| {% if defined(startDate) %} | |
| AND timestamp | |
| > {{ DateTime(startDate, description="Filter after date", required=False) }} | |
| {% end %} | |
| {% if defined(endDate) %} | |
| AND timestamp | |
| < {{ DateTime(endDate, description="Filter before date", required=False) }} | |
| {% end %} | |
| {% if defined(startDate) or defined(endDate) %} | |
| {% if defined(startDate) %} | |
| AND timestamp | |
| > {{ DateTime(startDate, description="Filter after date", required=False) }} | |
| {% end %} | |
| {% if defined(endDate) %} | |
| AND timestamp | |
| < {{ DateTime(endDate, description="Filter before date", required=False) }} | |
| {% end %} | |
| {% else %} | |
| AND timestamp >= toStartOfQuarter(now() - toIntervalQuarter(1)) | |
| AND timestamp < toStartOfQuarter(now()) | |
| {% end %} |
| AND timestamp | ||
| > {{ DateTime(startDate, description="Filter after date", required=False) }} | ||
| {% end %} | ||
| {% if defined(endDate) %} | ||
| AND timestamp | ||
| < {{ DateTime(endDate, description="Filter before date", required=False) }} |
In single-repo (repoUrl) mode there’s no default 365-day window unless startDate/endDate are provided, but batch mode is explicitly “last 365 days”. This makes the API endpoint return all-time results by default. Apply the same 365-day bounds in the repoUrl branch when no explicit dates are passed.
| AND timestamp | |
| > {{ DateTime(startDate, description="Filter after date", required=False) }} | |
| {% end %} | |
| {% if defined(endDate) %} | |
| AND timestamp | |
| < {{ DateTime(endDate, description="Filter before date", required=False) }} | |
| AND timestamp > {{ DateTime(startDate, description="Filter after date", required=False) }} | |
| {% else %} | |
| AND timestamp >= toStartOfDay(now() - toIntervalDay(365)) | |
| {% end %} | |
| {% if defined(endDate) %} | |
| AND timestamp < {{ DateTime(endDate, description="Filter before date", required=False) }} | |
| {% else %} | |
| AND timestamp < toStartOfDay(now()) |
| WHERE | ||
| (type = 'pull_request-opened' OR type = 'merge_request-opened' OR type = 'changeset-created') | ||
| AND channel = {{ String(repoUrl, description="Repository URL", required=False) }} | ||
| AND channel NOT IN (SELECT channel FROM repos_to_channels_excluded) | ||
| {% if defined(startDate) %} |
In single-repo (repoUrl) mode there’s no default 365-day window unless startDate/endDate are provided, but batch mode is “last 365 days”. This makes the API endpoint return all-time PR counts by default. Add a default 365-day filter in the repoUrl branch when no explicit dates are passed.
| WHERE | ||
| memberId != '' | ||
| AND (type, platform) IN (SELECT activityType, platform FROM activityTypes_filtered) | ||
| AND channel = {{ String(repoUrl, description="Repository URL", required=False) }} | ||
| AND channel NOT IN (SELECT channel FROM repos_to_channels_excluded) |
In single-repo (repoUrl) mode there’s no default 365-day window unless startDate/endDate are provided, but batch mode restricts to the last 365 days. This makes the API endpoint return all-time dependency metrics by default. Apply the same default window in the repoUrl branch when no explicit dates are passed.
| WHERE | ||
| category NOT IN ('Documentation', 'Vulnerability Management') | ||
| AND repo != '' | ||
| {% if defined(repoUrl) %} | ||
| AND repo = {{ String(repoUrl, description="Repository URL", required=False) }} |
This pipe doesn’t exclude repos/channels in repos_to_channels_excluded, unlike the other repo health score metrics. As a result, calling the endpoint with an excluded repoUrl can still return a security score. Consider adding an exclusion filter to match the other health-score pipes’ behavior.
| COALESCE(owh.contributionsOutsideWorkHours, 0) AS contributionsOutsideWorkHours, | ||
| COALESCE(owh.contributionsOutsideWorkHoursBenchmark, 0) AS contributionsOutsideWorkHoursBenchmark, | ||
| COALESCE(sec.securityPercentage, 0) AS securityPercentage |
securityPercentage is COALESCE’d to 0 when there is no matching row from repo_health_score_security, which penalizes repos that simply lack security evaluation data (and differs from how other missing metrics are excluded via arrayFilter(... >= 0)). Consider keeping it NULL when absent and handling it explicitly in overallScore.
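One way to treat a missing security score like the other missing metrics is sketched below. It assumes the rollup averages category percentages and uses -1 as the "missing" sentinel that `arrayFilter(... >= 0)` drops. The `contributorPercentage` column and the averaging formula are hypothetical; the real `overallScore` expression is not shown in this conversation.

```sql
SELECT
    repo,
    -- Map a missing security score to the sentinel instead of 0.
    ifNull(securityPercentage, -1) AS securityOrMissing,
    -- Keep only the categories that actually have data.
    arrayFilter(x -> x >= 0, [contributorPercentage, securityOrMissing]) AS present,
    if(empty(present), 0, round(arrayAvg(present))) AS overallScore
FROM values(
    'repo String, contributorPercentage Float64, securityPercentage Nullable(Float64)',
    ('repo-a', 80, 60),
    ('repo-b', 80, NULL)
)
ORDER BY repo
```

With this shape, a repo without security data is scored on its remaining categories rather than being pulled down by an artificial 0.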
| CASE WHEN organizationId != '' THEN organizationId ELSE NULL END | ||
| ) AS activeOrganizationsLast365Days | ||
| FROM activityRelations_deduplicated_cleaned_bucket_union | ||
| WHERE timestamp <= now() |
The “last 365 days” repo metrics node only filters timestamp <= now() (no lower bound), so it actually counts all historical activity. Add a timestamp >= now() - INTERVAL 365 DAY bound (or equivalent) to match the column name/description.
| WHERE timestamp <= now() | |
| WHERE timestamp >= now() - INTERVAL 365 DAY | |
| AND timestamp <= now() |
| CASE WHEN organizationId != '' THEN organizationId ELSE NULL END | ||
| ) AS activeOrganizationsPrevious365Days | ||
| FROM activityRelations_deduplicated_cleaned_bucket_union | ||
| WHERE timestamp < now() - INTERVAL 365 DAY |
The “previous 365 days” repo metrics node only filters timestamp < now() - INTERVAL 365 DAY (no lower bound), so it counts all activity older than 365 days rather than the 365–730 day window implied by the column name. Add a lower bound (e.g., timestamp >= now() - INTERVAL 730 DAY).
| WHERE timestamp < now() - INTERVAL 365 DAY | |
| WHERE timestamp >= now() - INTERVAL 730 DAY | |
| AND timestamp < now() - INTERVAL 365 DAY |
| uniq(CASE WHEN memberId != '' THEN memberId ELSE NULL END) AS activeContributorsLast365Days, | ||
| uniq( | ||
| CASE WHEN organizationId != '' THEN organizationId ELSE NULL END | ||
| ) AS activeOrganizationsLast365Days |
uniq(CASE ... ELSE NULL END) will count NULL as a distinct value in ClickHouse, so a repo with any rows missing memberId/organizationId can be overcounted by 1. Prefer uniqIf(memberId, memberId != '') / uniqIf(organizationId, organizationId != '') (or countDistinctIf).
Signed-off-by: Gašper Grom <gasper.grom@gmail.com>
Cursor Bugbot has reviewed your changes and found 3 potential issues.
There are 5 total unresolved issues (including 2 from previous reviews).
| pm.starsPrevious365Days AS starsPrevious365Days, | ||
| pm.forksPrevious365Days AS forksPrevious365Days, | ||
| pm.activeContributorsPrevious365Days AS activeContributorsPrevious365Days, | ||
| pm.activeOrganizationsPrevious365Days AS activeOrganizationsPrevious365Days |
Stale projects can break insights copy
High Severity
project_insights_copy_project_results selects pm.* metrics without COALESCE, but project_insights_copy_period_metrics only emits rows for segments with activity in the last 730 days. Projects with older/no recent activity get NULL period metrics, which conflicts with non-null UInt64 columns in project_insights_copy_ds and can fail the copy.
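A hedged sketch of the defensive form follows. Whether unmatched rows of a LEFT JOIN arrive as NULL or as column defaults depends on the ClickHouse `join_use_nulls` setting (0 by default), so this example sets it to 1 to reproduce the NULL case described above. The subqueries and their rows are hypothetical stand-ins for the project base and `project_insights_copy_period_metrics`.

```sql
SELECT
    p.id AS id,
    -- Fall back to 0 so the value fits a non-nullable UInt64 column.
    toUInt64(coalesce(pm.starsPrevious365Days, 0)) AS starsPrevious365Days
FROM (
    SELECT 'p1' AS id
    UNION ALL
    SELECT 'p2' AS id
) AS p
LEFT JOIN (
    -- Only p1 has recent activity; p2 has no period-metrics row.
    SELECT 'p1' AS id, toUInt64(12) AS starsPrevious365Days
) AS pm ON pm.id = p.id
ORDER BY id
SETTINGS join_use_nulls = 1
```

If the workspace runs with `join_use_nulls = 0`, the COALESCE is redundant but harmless, and it documents the intent.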
| rp.softwareValue AS softwareValue, | ||
| rp.firstCommit AS firstCommit | ||
| FROM repositories_populated_ds AS rp | ||
| JOIN repositories r FINAL ON r.id = rp.id |
Repo status checks missing in insights join
Medium Severity
project_insights_copy_repo_base joins repositories by id without checking enabled, excluded, or deletedAt. Because project_insights_copy runs daily while repositories_populated_ds refreshes hourly, stale rows in rp can still be emitted as active repo insights for a full day after a repo is disabled, excluded, or deleted.
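The status checks could look roughly like the sketch below. The column types are assumptions (`enabled` and `excluded` as UInt8 flags, `deletedAt` as a nullable DateTime), since the `repositories` schema is not shown here. The inline rows are hypothetical and replace both `repositories_populated_ds` and `repositories` so the query runs standalone.

```sql
SELECT rp.id AS id
FROM (
    SELECT arrayJoin(['r1', 'r2', 'r3']) AS id
) AS rp
JOIN (
    SELECT *
    FROM values(
        'id String, enabled UInt8, excluded UInt8, deletedAt Nullable(DateTime)',
        ('r1', 1, 0, NULL),
        ('r2', 0, 0, NULL),
        ('r3', 1, 0, '2024-01-01 00:00:00')
    )
) AS r ON r.id = rp.id
-- Keep only repos that are enabled, not excluded, and not soft-deleted.
WHERE r.enabled = 1
  AND r.excluded = 0
  AND r.deletedAt IS NULL
```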
| SELECT | ||
| channel, | ||
| uniq(CASE WHEN memberId != '' THEN memberId ELSE NULL END) AS contributorCount, | ||
| uniq(CASE WHEN organizationId != '' THEN organizationId ELSE NULL END) AS organizationCount |
Null counted as contributor and organization
Medium Severity
repositories_populated_copy_contributor_org_counts uses uniq(CASE ... ELSE NULL END), and NULL is treated as a distinct value in uniq. Repositories with only empty memberId or organizationId can get counts of 1 instead of 0, which propagates incorrect contributorCount and organizationCount into project_insights_copy_ds.


Summary
Details
Shared includes (includes/)
Introduced 11 .incl files containing parameterized benchmark/processing logic ($GROUP_COL, $SOURCE_NODE). These are used by both project-level and
repo-level health score pipes, ensuring benchmark thresholds (e.g., stars 0-9 → 0, 1000+ → 5) are defined in a single place.
time
Repository health score pipes (repo_health_score_*.pipe)
12 new pipes, each with dual mode:
Metrics: active contributors, contributor dependency, organization dependency, retention, stars, forks, issue resolution, pull requests, merge lead time,
active days, contributions outside work hours, security.
Repository health score copy pipe & datasource
score. Runs daily at 01:50 UTC.
Repository populated data
Runs hourly.
Project insights extension
Project health score refactor
Refactored 11 existing project-level health_score_*.pipe files to replace inline benchmark nodes with INCLUDE directives. No functional change — same
columns, same endpoint behavior. health_score_copy.pipe unchanged.
Note
Medium Risk
Adds new Tinybird datasources and scheduled copy pipes that materialize repository-level metrics and union them into existing insights tables, which can affect analytics outputs and query performance if joins/filters are incomplete. Existing project health score pipes are refactored via shared includes, so functional risk is low there but rollout touches many production queries.
Overview
Adds repository-level analytics in Tinybird: a new repo_health_score_copy_ds plus metric pipes and a daily copy/aggregation pipeline to compute per-repo benchmarks, category percentages, and an overall score.
Extends project_insights_copy_ds to store a union of project and repo insight records (type, repoUrl), updates project_insights.pipe to filter type = 'project', and introduces project_repo_insights.pipe to query both with combined filtering.
Refactors multiple project health_score_*.pipe definitions to reuse new parameterized benchmark includes (includes/health_score_*.incl), reducing duplicated CASE logic without intended behavior changes.
Written by Cursor Bugbot for commit e72807b. This will update automatically on new commits.