Skip to content

Comments

[ANE-2795] Fix UTF-8 encoding for ficus output on Windows#1646

Open
Conor-FOSSA wants to merge 3 commits intomasterfrom
conor/ANE-2795-ficus-stderr-encoding
Open

[ANE-2795] Fix UTF-8 encoding for ficus output on Windows#1646
Conor-FOSSA wants to merge 3 commits intomasterfrom
conor/ANE-2795-ficus-stderr-encoding

Conversation

@Conor-FOSSA
Copy link
Contributor

@Conor-FOSSA Conor-FOSSA commented Feb 19, 2026

Overview

This PR fixes UTF-8 encoding issues when running fossa analyze --x-vendetta --debug on native Windows. The CLI was crashing when ficus output contained Unicode characters (like box-drawing characters from Rust's SPANTRACE formatting).

Acceptance criteria

When users run fossa analyze --x-vendetta --debug on native Windows behind a corporate TLS proxy, the CLI should:

  • Successfully read ficus stdout/stderr containing Unicode characters
  • Successfully write debug log files containing Unicode characters
  • No longer crash with commitBuffer: invalid argument (cannot encode character '\9473')

Testing plan

  1. On a native Windows machine (not WSL), run fossa analyze --x-vendetta --debug against a project
  2. If ficus encounters an error with SPANTRACE output containing box-drawing characters (U+2501 ), the CLI should handle it gracefully
  3. Check that fossa.ficus-stdout.log and fossa.ficus-stderr.log are created without encoding errors

Note: Full reproduction requires a Windows machine behind a corporate TLS-inspecting proxy to trigger the ficus TLS error that produces the problematic SPANTRACE output.

Risks

Low risk - this is a targeted fix that only affects:

  • How stderr is read from the ficus subprocess (now uses Conduit with decodeUtf8Lenient)
  • How debug log files are opened (now explicitly set to UTF-8 encoding)

Metrics

N/A

References

  • ANE-2795: Vendetta analysis fails on native Windows with encoding error

Checklist

  • I added tests for this PR's change (or explained in the PR description why tests don't make sense).
    • This is difficult to test automatically as it requires Windows with a specific code page setting. The fix is straightforward and based on well-documented Haskell behavior.
  • If this PR introduced a user-visible change, I added documentation into docs/.
    • No user-facing documentation needed - this is a bug fix.
  • If this PR added docs, I added links as appropriate to the user manual's ToC in docs/README.ms and gave consideration to how discoverable or not my documentation is.
    • N/A
  • If this change is externally visible, I updated Changelog.md. If this PR did not mark a release, I added my changes into an ## Unreleased section at the top.
    • Will add changelog entry.
  • If I made changes to .fossa.yml or fossa-deps.{json.yml}, I updated docs/references/files/*.schema.json AND I have updated example files used by fossa init command. You may also need to update these if you have added/removed new dependency type (e.g. pip) or analysis target type (e.g. poetry).
    • N/A
  • If I made changes to a subcommand's options, I updated docs/references/subcommands/<subcommand>.md.
    • N/A

🤖 Generated with Claude Code

Conor-FOSSA and others added 2 commits February 13, 2026 11:04
Use Conduit with decodeUtf8Lenient for stderr handling, matching the
approach already used for stdout. This prevents crashes on Windows
where the default system encoding (CP1252) cannot decode UTF-8 characters
like box-drawing characters (U+2501) used in ficus progress output.

The decodeUtf8Lenient function safely handles any byte sequence by
replacing invalid UTF-8 with the Unicode replacement character (U+FFFD)
instead of throwing an exception.

Fixes: ANE-2795

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Set explicit UTF-8 encoding on debug log file handles. On Windows,
file handles default to the system code page (CP1252), which cannot
encode Unicode characters like box-drawing characters (U+2501) used
in ficus's Rust SPANTRACE output.

Without this fix, running `fossa analyze --x-vendetta --debug` on
Windows crashes with:
  commitBuffer: invalid argument (cannot encode character '\9473')

This complements the previous fix for reading ficus output by ensuring
the writing side also handles UTF-8 correctly.

Fixes: ANE-2795

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@Conor-FOSSA Conor-FOSSA requested a review from a team as a code owner February 19, 2026 17:43
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 19, 2026

Walkthrough

The pull request refactors stderr handling to use a Conduit-based streaming pipeline with lenient UTF-8 decoding, line splitting, optional tee to a debug file, timestamping, and line prefixing with "STDERR". It accumulates and retains only the most recent 50 lines. The change sets UTF-8 encoding on debug file handles and aligns stderr processing with the existing Conduit-based stdout pipeline. No public or exported interfaces were changed.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: fixing UTF-8 encoding for ficus output on Windows, which is the primary objective of this PR.
Description check ✅ Passed The description follows the template with all major sections completed: Overview, Acceptance criteria, Testing plan, Risks, References, and a thorough Checklist with explanations for each item.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/App/Fossa/Ficus/Analyze.hs (1)

358-389: Core decodeUtf8Lenient approach is correct; count tracking and String intermediates can be cleaned up.

The switch to CC.sourceHandle .| CC.decodeUtf8Lenient .| CC.linesUnbounded directly mirrors how streamFicusOutput handles stdout and is the right fix for Windows CP encoding issues.

Two optional cleanups worth considering:

1. Redundant count tracking. The _count tuple element is discarded at the call-site ((acc, _count) <-), so it is never used outside the fold. The claim that tracking count gives O(1) truncation doesn't hold: take 50 on a list of ≤50+1 elements is already O(50) = O(1), matching the constant bound. The fold state can be simplified to just [Text]:

♻️ Simplify accumulator to drop redundant count
-      (acc, _count) <-
-        Conduit.runConduit $
-          CC.sourceHandle handle
-            .| CC.decodeUtf8Lenient
-            .| CC.linesUnbounded
-            .| CC.foldM
-              ( \(acc, count) line -> do
-                  -- Tee raw line to file if debug mode
-                  traverse_ (\fileH -> hPutStrLn fileH (toString line)) maybeFile
-                  now <- getCurrentTime
-                  let ts = formatTime defaultTimeLocale "%H:%M:%S.%3q" now
-                  let msg = toText $ "[" ++ ts ++ "] STDERR " <> toString line
-                  -- Keep at most the last 50 lines of stderr
-                  -- I came up with 50 lines by looking at a few different error traces and making
-                  -- sure that we captured all of the relevant error output, and then going a bit higher
-                  -- to make sure that we didn't miss anything. I'd rather capture a bit too much than not enough.
-                  -- Use cons (:) for O(1) prepending, track count explicitly for O(1) truncation
-                  let newAcc =
-                        if count >= 50
-                          then take 50 (msg : acc)
-                          else msg : acc
-                  let newCount = min (count + 1) 50
-                  pure (newAcc, newCount)
-              )
-              ([], 0 :: Int)
-      pure (reverse acc) -- Reverse at the end to get correct order
+      acc <-
+        Conduit.runConduit $
+          CC.sourceHandle handle
+            .| CC.decodeUtf8Lenient
+            .| CC.linesUnbounded
+            .| CC.foldM
+              ( \acc line -> do
+                  -- Tee raw line to file if debug mode
+                  traverse_ (\fileH -> hPutStrLn fileH (toString line)) maybeFile
+                  now <- getCurrentTime
+                  -- Keep at most the last 50 lines of stderr (newest first during accumulation)
+                  let ts = Text.pack (formatTime defaultTimeLocale "%H:%M:%S.%3q" now)
+                  let msg = "[" <> ts <> "] STDERR " <> line
+                  pure (take 50 (msg : acc))
+              )
+              []
+      pure (reverse acc) -- Reverse at the end to get correct order

2. String intermediates in msg construction. Line 376 builds msg via toText $ "[" ++ ts ++ "] STDERR " <> toString line, mixing ++/<> on String before converting back to Text. This violates the coding guideline to prefer Text over String. The fix above also addresses this by constructing msg entirely in Text using Text.pack on the formatTime result and <> on Text values.

As per coding guidelines: "In Haskell, use Text instead of String."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/App/Fossa/Ficus/Analyze.hs` around lines 358 - 389, The fold in
consumeStderr can be simplified: drop the explicit count from the fold state and
use a single [Text] accumulator (remove the (acc,count) tuple and _count
binding), then prepend each new Text line and trim to 50 via take 50 (fold
state: [Text]); also avoid String intermediates when building msg by packing the
timestamp (formatTime result) to Text (e.g. Text.pack) and constructing msg
entirely with Text concatenation/<> instead of using ++ or toString
conversions—update the CC.foldM lambda (currently using (acc, count) and
toText/toString) to work with Text-only values and the simplified accumulator.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/App/Fossa/Ficus/Analyze.hs`:
- Around line 358-389: The fold in consumeStderr can be simplified: drop the
explicit count from the fold state and use a single [Text] accumulator (remove
the (acc,count) tuple and _count binding), then prepend each new Text line and
trim to 50 via take 50 (fold state: [Text]); also avoid String intermediates
when building msg by packing the timestamp (formatTime result) to Text (e.g.
Text.pack) and constructing msg entirely with Text concatenation/<> instead of
using ++ or toString conversions—update the CC.foldM lambda (currently using
(acc, count) and toText/toString) to work with Text-only values and the
simplified accumulator.

- Remove redundant count tracking from stderr fold (take 50 on a bounded
  list is already O(1))
- Use Text throughout msg construction instead of String intermediates
- Remove trailing blank line to satisfy formatter

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/App/Fossa/Ficus/Analyze.hs (1)

370-377: Optional: avoid Text → String round-trip when teeing to file.

Both here and in streamFicusOutput (line 317), hPutStrLn fileH (toString line) converts Text to String only for System.IO.hPutStrLn to re-encode it. Using Data.Text.IO.hPutStrLn would write Text directly through the handle's UTF-8 encoder, avoiding the intermediate String allocation.

♻️ Suggested change (applies to both line 317 and 372)

Add the import:

import Data.Text.IO qualified as Text.IO

Then in both tee sites:

-  traverse_ (\fileH -> hPutStrLn fileH (toString line)) maybeFile
+  traverse_ (\fileH -> Text.IO.hPutStrLn fileH line) maybeFile

As per coding guidelines, "In Haskell, use Text instead of String".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/App/Fossa/Ficus/Analyze.hs` around lines 370 - 377, The tee currently
converts Text to String via toString when writing to handles (see the hPutStrLn
usage inside the lambda that updates stderr accumulators and the similar site in
streamFicusOutput), causing an unnecessary Text→String→encoding round-trip;
import Data.Text.IO qualified as Text.IO and replace the System.IO.hPutStrLn
calls that pass (toString line) with Text.IO.hPutStrLn using the Text value
(e.g. write Text.IO.hPutStrLn fileH line), keeping the rest of the logic
(maybeFile, acc update, timestamping) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/App/Fossa/Ficus/Analyze.hs`:
- Around line 370-377: The tee currently converts Text to String via toString
when writing to handles (see the hPutStrLn usage inside the lambda that updates
stderr accumulators and the similar site in streamFicusOutput), causing an
unnecessary Text→String→encoding round-trip; import Data.Text.IO qualified as
Text.IO and replace the System.IO.hPutStrLn calls that pass (toString line) with
Text.IO.hPutStrLn using the Text value (e.g. write Text.IO.hPutStrLn fileH
line), keeping the rest of the logic (maybeFile, acc update, timestamping)
unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant