Conversation

Contributor

Copilot AI commented Dec 22, 2025

A user attempted to implement CORR() as a custom aggregate but found that AlaSQL's aggregate architecture supported only single-column inputs. Instead of hardcoding CORR, this PR extends the REDUCE mechanism so that users can define custom aggregate functions that take multiple columns.
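One plausible reading of the {corr: 3} result reported in the issue, assuming the pre-PR engine invoked every aggregate with exactly three positional arguments (value, accumulator, stage): a function declared with four parameters then receives the stage number in its accumulator slot and undefined as its stage. The sketch below simulates that situation directly. The names fourParamAggregate and legacyCall are hypothetical stand-ins, not AlaSQL internals, and the function body is a simplified version of the one in the issue.

```javascript
// Simplified version of the issue's function: four declared parameters.
function fourParamAggregate(valueX, valueY, accumulator, stage) {
    if (stage === 1) {
        accumulator = {count: 0};
    }
    if (stage === 1 || stage === 2) {
        accumulator.count++;
    }
    if (stage === 3) {
        return accumulator.count;
    }
    // Reached whenever `stage` matches none of the branches above.
    return accumulator;
}

// Hypothetical stand-in for a single-column calling convention:
// exactly three positional arguments (value, accumulator, stage).
function legacyCall(fn, value, accumulator, stage) {
    return fn(value, accumulator, stage);
}

let acc = legacyCall(fourParamAggregate, 10, undefined, 1);       // stage 1
acc = legacyCall(fourParamAggregate, 20, acc, 2);                 // stage 2
const result = legacyCall(fourParamAggregate, undefined, acc, 3); // stage 3

console.log(result); // 3
```

Under this assumption the function never sees a defined stage, so every call falls through to the final return and hands back whatever landed in the accumulator parameter, which in the last call is the stage number 3.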

Changes

Extended REDUCE aggregate mechanism for multi-column support:

  • Modified the parser grammar (src/alasqlparser.jison and the generated src/alasqlparser.js) to store all arguments in the args array when creating REDUCE aggregates
  • Replaced exprlist.pop() with exprlist[0] and added args: exprlist, so that every column expression is preserved

Updated GROUP BY compiler (src/423groupby.js):

  • Stage 1 & 2: Detect multi-column aggregates and pass all column values to the user-defined function
  • Maintains backward compatibility with single-column aggregates

Updated finalization logic (src/38query.js):

  • Stage 3: Pass one undefined value per column argument, followed by the accumulator and the stage, so the function signature is the same in every stage
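The three stages described above can be mimicked outside the engine. The driver below is a hypothetical sketch (runAggregate and sumProduct are not part of AlaSQL). It assumes that stage 1 receives the first row's column values with an undefined accumulator, stage 2 receives each later row with the running accumulator, and stage 3 receives one undefined per column followed by the accumulator.

```javascript
// Hypothetical driver mimicking the three-stage protocol for an
// aggregate that takes `arity` column arguments.
function runAggregate(fn, rows, arity) {
    let acc;
    rows.forEach((row, i) => {
        const stage = i === 0 ? 1 : 2;
        // Stages 1 and 2: column values, then accumulator, then stage.
        acc = fn(...row.slice(0, arity), acc, stage);
    });
    // Stage 3: one undefined per column, then accumulator and stage.
    const blanks = new Array(arity).fill(undefined);
    return fn(...blanks, acc, 3);
}

// Two-column example: sum of the products x * y.
function sumProduct(x, y, acc, stage) {
    if (stage === 1) return x * y;       // initialize from the first row
    if (stage === 2) return acc + x * y; // accumulate later rows
    return acc;                          // stage 3: finalize
}

const total = runAggregate(sumProduct, [[1, 2], [3, 4], [5, 6]], 2);
console.log(total); // 44
```

Because the accumulator and stage always follow the column values, the same driver works for any arity, including the single-column case with arity 1.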

Added comprehensive tests (test/test2600.js):

  • 2-column CORR function (Pearson correlation) with exact deepEqual assertions
  • 3-column WEIGHTED_AVG function with exact deepEqual assertions
  • 4-column MULTI_CALC function with exact deepEqual assertions
  • Backward compatibility test with single-column aggregate
  • NULL/NaN value handling with exact deepEqual assertions
  • Parameterized query support using INSERT VALUES (?, ?)

Usage

Users can now define custom aggregates with multiple columns:

alasql.aggr.CORR = function(valueX, valueY, accumulator, stage) {
    if (stage === 1) {
        // Initialize accumulator with first row
        return {sumX: valueX, sumY: valueY, sumXY: valueX * valueY, 
                sumX2: valueX * valueX, sumY2: valueY * valueY, count: 1};
    } else if (stage === 2) {
        // Accumulate subsequent rows
        accumulator.sumX += valueX;
        accumulator.sumY += valueY;
        accumulator.sumXY += valueX * valueY;
        accumulator.sumX2 += valueX * valueX;
        accumulator.sumY2 += valueY * valueY;
        accumulator.count++;
        return accumulator;
    } else if (stage === 3) {
        // Finalize and return result
        const numerator = accumulator.count * accumulator.sumXY - accumulator.sumX * accumulator.sumY;
        const denominator = Math.sqrt(
            (accumulator.count * accumulator.sumX2 - accumulator.sumX ** 2) *
            (accumulator.count * accumulator.sumY2 - accumulator.sumY ** 2)
        );
        return numerator / denominator;
    }
};

// Use in SQL
alasql('SELECT CORR(x, y) as correlation FROM dataset');
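The example above divides by zero when either column has no variance, and it does not filter non-numeric inputs. A more defensive variant might look like the sketch below. The name safeCorr, the choice to skip non-numeric pairs, and the choice to return null for an undefined correlation are assumptions borrowed from the original issue, not behaviour this PR prescribes. It would be registered with alasql.aggr.CORR = safeCorr.

```javascript
// Defensive Pearson correlation aggregate (hypothetical name: safeCorr).
function safeCorr(valueX, valueY, accumulator, stage) {
    if (stage === 1) {
        // Start from an empty accumulator so the first row is validated too.
        accumulator = {sumX: 0, sumY: 0, sumXY: 0, sumX2: 0, sumY2: 0, count: 0};
    }
    if (stage === 1 || stage === 2) {
        // Assumption: pairs with a null, undefined, or NaN member are skipped.
        if (Number.isFinite(valueX) && Number.isFinite(valueY)) {
            accumulator.sumX += valueX;
            accumulator.sumY += valueY;
            accumulator.sumXY += valueX * valueY;
            accumulator.sumX2 += valueX * valueX;
            accumulator.sumY2 += valueY * valueY;
            accumulator.count++;
        }
        return accumulator;
    }
    // Stage 3: finalize.
    if (!accumulator) return null;
    const {sumX, sumY, sumXY, sumX2, sumY2, count} = accumulator;
    const varX = count * sumX2 - sumX * sumX;
    const varY = count * sumY2 - sumY * sumY;
    // Assumption: an undefined correlation (fewer than two valid pairs,
    // or zero variance in either column) is reported as null.
    if (count < 2 || varX <= 0 || varY <= 0) return null;
    return (count * sumXY - sumX * sumY) / Math.sqrt(varX * varY);
}

// Drive the stages by hand: x = 1, 2 and y = 5, 5, plus one skipped pair.
let flat = safeCorr(1, 5, undefined, 1);
flat = safeCorr(2, 5, flat, 2);
flat = safeCorr(null, 7, flat, 2);
const flatResult = safeCorr(undefined, undefined, flat, 3);
console.log(flatResult); // null, because y has zero variance
```

Initializing an empty accumulator in stage 1 and then sharing the accumulation branch with stage 2 means the first row goes through the same validation as every other row.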

Test Coverage

All tests use assert.deepEqual with exact expected output values (including floating-point precision):

  • Test A: 2-column CORR - {corr: 0.9999999999999999}
  • Test B: 3-column WEIGHTED_AVG - {wavg: 46.666666666666664}
  • Test C: Single-column backward compatibility
  • Test D: NULL/NaN handling - {corr: 1.0000000000000002}
  • Test E: Parameterized data inserted with INSERT VALUES (?, ?), then aggregated with CORR
  • Test F: 4-column MULTI_CALC - {result: 27.333333333333332}

Backward Compatibility

Single-column user-defined aggregates continue to work exactly as before. The system automatically detects the number of arguments and handles them appropriately.
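For comparison, a single-column aggregate keeps the three-parameter shape (value, accumulator, stage). The sketch below is an assumed example: spread is a hypothetical aggregate returning the maximum minus the minimum, driven by hand here rather than through the engine. It could be registered with alasql.aggr.SPREAD = spread and called as SELECT SPREAD(x) FROM dataset.

```javascript
// Hypothetical single-column aggregate: max(value) - min(value).
function spread(value, accumulator, stage) {
    if (stage === 1) {
        // Initialize from the first row.
        return {min: value, max: value};
    }
    if (stage === 2) {
        // Fold in each subsequent row.
        accumulator.min = Math.min(accumulator.min, value);
        accumulator.max = Math.max(accumulator.max, value);
        return accumulator;
    }
    // Stage 3: finalize.
    return accumulator.max - accumulator.min;
}

let state = spread(4, undefined, 1);
state = spread(9, state, 2);
state = spread(1, state, 2);
const range = spread(undefined, state, 3);
console.log(range); // 8
```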

Original prompt

This section details the original issue you should resolve

<issue_title>Trying to implement CORR() using custom function logic, but it is not working</issue_title>
<issue_description>Hello. I am trying to implement CORR() using custom function logic provided here:
https://github.com/alasql/alasql/wiki/User-Defined-Functions

This is my code.

alasql.aggr.CORR = function(valueX, valueY, accumulator, stage) {
    if (stage === 1) {
        // Initialize the accumulator object
        accumulator = {
            sumX: 0,
            sumY: 0,
            sumXY: 0,
            sumX2: 0,
            sumY2: 0,
            count: 0
        };
        console.log('Stage 1: Initialized accumulator');
    }

    if (stage === 1 || stage === 2) {
        // Check if valueX and valueY are valid numbers
        if (typeof valueX === 'number' && typeof valueY === 'number') {
            accumulator.sumX += valueX;
            accumulator.sumY += valueY;
            accumulator.sumXY += valueX * valueY;
            accumulator.sumX2 += valueX * valueX;
            accumulator.sumY2 += valueY * valueY;
            accumulator.count++;
            console.log('Stage 1/2: Updated accumulator with values:', valueX, valueY);
        } else {
            console.log('Stage 1/2: Skipped non-numeric values:', valueX, valueY);
        }
    }

    if (stage === 3) {
        console.log('Stage 3: Final accumulator state:', accumulator);

        // Calculate the Pearson correlation coefficient
        const count = accumulator.count;
        const sumX = accumulator.sumX;
        const sumY = accumulator.sumY;
        const sumXY = accumulator.sumXY;
        const sumX2 = accumulator.sumX2;
        const sumY2 = accumulator.sumY2;

        const numerator = (count * sumXY) - (sumX * sumY);
        const denominatorX = Math.sqrt((count * sumX2) - (sumX * sumX));
        const denominatorY = Math.sqrt((count * sumY2) - (sumY * sumY));
        const denominator = denominatorX * denominatorY;

        console.log('Stage 3: Calculated values:');

        // Check if the denominator is zero or if there are no valid data points
        if (denominator === 0 || count === 0) {
            console.log('Stage 3: Undefined correlation');
            return null; // Return null to indicate undefined correlation
        }

        const result = numerator / denominator;
        console.log('Stage 3: Correlation coefficient:', result);
        return result;
    }

    return accumulator;
};

But somehow this returned {0: {corr: 3}}, which suggests the calculation is not being done properly, since the Pearson correlation always lies between -1 and 1.
Any guess or help to figure out what to do?
Thank you in advance.

</issue_description>

<agent_instructions>Please look into this issue and determine if there is a bug for us to solve, or if he is using SQL the wrong way. </agent_instructions>

Comments on the Issue (you are @copilot in this section)

@mathiasrw If you provide some example data and expected output it makes it easier to help out.

From the AlaSQL Bot

Your implementation of the CORR() function logic looks well-structured, but there seems to be an issue with the Pearson correlation coefficient calculation, possibly due to either the data or the implementation of the accumulation logic.

Here are a few potential issues to check:

  1. Data Points: Make sure that the data passed to CORR() contains valid numeric values and there are no undefined or non-numeric values. If valueX or valueY is non-numeric or undefined, it should not contribute to the calculation.

  2. Numerator and Denominator Calculation:

    • Numerator: Ensure that the numerator calculation (count * sumXY) - (sumX * sumY) is correct and makes sense given the data.
    • Denominator: Ensure that the denominator (denominatorX * denominatorY) does not lead to a division by zero. If denominatorX or denominatorY is zero, the correlation should be undefined, hence returning null (which you already handle).
  3. Edge Cases: Ensure your function handles edge cases, such as when there is only one data point or when all valueX or valueY are the same, which would result in a zero denominator.

Here's an improved implementation suggestion:

  • Verify the calculation at each step by logging intermediate results.
  • Ensure stage 1/2 accumulates data correctly.
  • Return null when invalid data (e.g., single data point or zero variance) is encountered.
alasql.aggr.CORR = function(valueX, valueY, accumulator, stage) {
    if (stage === 1) {
        // Initialize the accumulator object
        accumulator = {
            sumX: 0,
            sumY: 0,
            sumXY: 0,
            sumX2: 0...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes AlaSQL/alasql#1966


Co-authored-by: mathiasrw <1063454+mathiasrw@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement custom CORR() function logic" to "Add CORR() aggregate function for two-column correlation" Dec 22, 2025
Copilot AI requested a review from mathiasrw December 22, 2025 02:36
@mathiasrw
Member

@copilot There is a slight misunderstanding.

We don't want to hardcode the CORR function. We want to make it possible for users to write aggregate functions that take more than one column. I imagine we would have to look into managing multiple inputs in custom functions and then make sure they are passed to the function in a generic way without breaking changes.

Co-authored-by: mathiasrw <1063454+mathiasrw@users.noreply.github.com>
Contributor Author

Copilot AI commented Dec 22, 2025

@copilot There is a slight misunderstanding.

We don't want to hardcode the CORR function. We want to make it possible for users to write aggregate functions that take more than one column. I imagine we ...

You're absolutely right! I've updated the implementation to enable generic multi-column user-defined aggregates instead of hardcoding CORR.

The changes allow users to create custom aggregate functions with multiple columns by defining them like:

alasql.aggr.MY_FUNC = function(col1, col2, ..., accumulator, stage) {
    // Function logic with stage 1 (init), stage 2 (accumulate), stage 3 (finalize)
}

Then use them in SQL: SELECT MY_FUNC(column1, column2) FROM table
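As a concrete instance of that signature, here is a minimal sketch of a two-column aggregate. The function weightedMean and the null result for a zero total weight are assumptions made for illustration; this is not the WEIGHTED_AVG function from the test suite. It would be registered with alasql.aggr.WEIGHTED_MEAN = weightedMean and called as SELECT WEIGHTED_MEAN(value, weight) FROM some_table, where the table and column names are likewise hypothetical.

```javascript
// Hypothetical two-column aggregate: mean of `value` weighted by `weight`.
function weightedMean(value, weight, accumulator, stage) {
    if (stage === 1) {
        // Initialize from the first row.
        return {weightedSum: value * weight, weightTotal: weight};
    }
    if (stage === 2) {
        // Accumulate subsequent rows.
        accumulator.weightedSum += value * weight;
        accumulator.weightTotal += weight;
        return accumulator;
    }
    // Stage 3: finalize. Assumption: a zero total weight yields null.
    if (accumulator.weightTotal === 0) return null;
    return accumulator.weightedSum / accumulator.weightTotal;
}

// Drive the stages by hand with rows (10, 1) and (20, 3).
let wm = weightedMean(10, 1, undefined, 1);
wm = weightedMean(20, 3, wm, 2);
const mean = weightedMean(undefined, undefined, wm, 3);
console.log(mean); // 17.5
```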

The implementation is backward compatible - single-column aggregates work exactly as before. All changes are in the parser and groupby compiler to pass multiple arguments through the 3-stage aggregation process.

See commit 2e20581 for the full implementation with test cases for 2-column and 3-column user-defined aggregates.

Copilot AI changed the title from "Add CORR() aggregate function for two-column correlation" to "Enable multi-column user-defined aggregate functions" Dec 22, 2025
test/test2600.js Outdated
Comment on lines 100 to 101
assert.deepEqual(res.length, 1);
assert(Math.abs(res[0].corr - 1) < 0.0001, 'Expected correlation close to 1');
Member

I want you to assert a deep equal against the exact output you are expecting. This applies to all tests.

Member

Can we please also test with data coming in as parameters (corr(?,?))?

Member

Do we have any tests of 3 or 4 parameters?

@mathiasrw
Member

@copilot Continue

…s, and 4-column test

Co-authored-by: mathiasrw <1063454+mathiasrw@users.noreply.github.com>
Contributor Author

Copilot AI commented Jan 4, 2026

@copilot Continue

I've addressed all the review feedback in commit e794f15:

  1. Exact deepEqual assertions: All tests now use assert.deepEqual with precise expected values instead of tolerance checks
  2. Parameterized data tests: Added test E that uses INSERT VALUES (?, ?) for parameterized queries
  3. 4-parameter test: Added test F with a MULTI_CALC function that takes 4 columns

All 2301 tests pass successfully.

@mathiasrw
Member

mathiasrw commented Jan 4, 2026

Will fix #1966
