Run the VS2013 (MSVC 18.00) cl.exe/ml64.exe toolchain under wibo, byte-identical to native #127

Open
jeffmcjunkin wants to merge 9 commits into decompals:main from jeffmcjunkin:vs2013-cl-exe-support

@jeffmcjunkin

Depends on #126 — please merge that first. This branch is stacked on fix-fls-per-thread-and-cs-handoff, so its first two commits (per-thread FLS; Windows-faithful CRITICAL_SECTION) belong to #126 and will drop out of this diff once #126 lands and I rebase onto main. Review the commits above those two.

What this does

Makes wibo run the VS2013 Update 5 (MSVC 18.00.40629) 32-bit-host cross toolchain, VC/bin/x86_amd64/cl.exe and ml64.exe (both PE32 / i386), well enough to compile C/C++ and assemble MASM into COFF objects that are byte-identical to native Windows output. wibo loads the real msvcr120.dll/msvcp120.dll from disk; this PR fixes the wibo-side gaps underneath them (kernel32 / ntdll / loader / TLS).

This is unrelated to the older msvc-wip / msvc-broken branches (which implemented the builtin legacy msvcrt.dll) — VS2013 imports msvcr120, so this is a different and complete approach.

Commits (above the #126 base)

Validation

  • Determinism: .text (incl. all /Gy COMDATs), .xdata, .pdata, .rdata, .drectve and relocations are byte-identical to native x86_amd64 cl.exe; only .debug$S (embedded absolute paths) and the COFF TimeDateStamp differ.
  • Full-tree sweep: 4,231 / 4,233 real units compile cleanly under wibo with zero missing-import aborts and zero crashes. The 2 failures are genuine MSVC source errors that fail identically on native cl.exe.
  • Byte-identity at scale: a seeded 150-unit native-vs-wibo sample was 150 / 150 byte-identical (1,459 code/data sections compared).
  • ml64.exe assembles .asm byte-identically too.
  • With #126 (kernel32: fix per-thread FLS and Windows-faithful CRITICAL_SECTION, which fixes an intermittent MSVC c2.dll deadlock): 0 hangs across thousands of parallel stress compiles, and parallel compiles got ~6× faster (the CRITICAL_SECTION fix removed a lock-handoff convoy).
  • Exercised on WSL2 Ubuntu 24.04 (32-bit release-clang wibo); Wine 9.0 (old-WoW64) is the byte-identical baseline this matches.
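
The byte-identity check described above can be approximated with a small COFF comparer. This is only a sketch of the idea, not the tooling used for this PR: every name here (`SketchSection`, `parseSections`, `codeAndDataIdentical`) is hypothetical, it assumes a classic 20-byte COFF header (not /bigobj) and section names that fit in 8 bytes, and it does not compare the symbol table. It compares each section's raw bytes and relocation records, skips `.debug$S`, and never looks at the header's TimeDateStamp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Read an n-byte little-endian integer (n <= 4) at offset off.
static std::uint32_t readLe(const Bytes &b, std::size_t off, int n) {
    assert(n >= 1 && n <= 4 && off + static_cast<std::size_t>(n) <= b.size());
    std::uint32_t v = 0;
    for (int i = 0; i < n; ++i) v |= static_cast<std::uint32_t>(b[off + i]) << (8 * i);
    return v;
}

struct SketchSection {
    std::string name;
    Bytes raw;    // section contents
    Bytes relocs; // relocation records, 10 bytes each
};

// Copy [ptr, ptr + size) out of obj; returns false if the range is out of bounds.
static bool slice(const Bytes &obj, std::size_t ptr, std::size_t size, Bytes &out) {
    if (size > obj.size() || ptr > obj.size() - size) return false;
    out.assign(obj.begin() + ptr, obj.begin() + ptr + size);
    return true;
}

// Parse section headers of a classic COFF object (assumption: not /bigobj).
bool parseSections(const Bytes &obj, std::vector<SketchSection> &out) {
    if (obj.size() < 20) return false;
    const std::size_t count = readLe(obj, 2, 2);    // NumberOfSections
    const std::size_t optSize = readLe(obj, 16, 2); // SizeOfOptionalHeader
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t h = 20 + optSize + i * 40;
        if (h + 40 > obj.size()) return false;
        SketchSection s;
        s.name.assign(reinterpret_cast<const char *>(&obj[h]), 8);
        const std::size_t nul = s.name.find('\0');
        if (nul != std::string::npos) s.name.resize(nul);
        const std::size_t rawSize = readLe(obj, h + 16, 4);
        const std::size_t rawPtr = readLe(obj, h + 20, 4);
        const std::size_t relPtr = readLe(obj, h + 24, 4);
        const std::size_t relCount = readLe(obj, h + 32, 2);
        if (rawPtr != 0 && !slice(obj, rawPtr, rawSize, s.raw)) return false;
        if (relCount != 0 && !slice(obj, relPtr, relCount * 10, s.relocs)) return false;
        out.push_back(std::move(s));
    }
    return true;
}

// True when every section except .debug$S matches in name, bytes and relocations.
// The file header (and therefore TimeDateStamp) is deliberately not compared.
bool codeAndDataIdentical(const Bytes &a, const Bytes &b) {
    std::vector<SketchSection> sa, sb;
    if (!parseSections(a, sa) || !parseSections(b, sb) || sa.size() != sb.size()) return false;
    for (std::size_t i = 0; i < sa.size(); ++i) {
        if (sa[i].name != sb[i].name) return false;
        if (sa[i].name == ".debug$S") continue;
        if (sa[i].raw != sb[i].raw || sa[i].relocs != sb[i].relocs) return false;
    }
    return true;
}
```

Feeding it the native and wibo outputs for the same unit would return `true` exactly when the differences are confined to `.debug$S` and the file header.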

Out of scope (not needed for compile/assemble → obj)

link.exe / lib.exe / rc.exe, PDB generation (/Zi → the mspdbsrv.exe IPC server), /MP, and the 64-bit-host (PE32+) tools (wibo is a 32-bit loader; the x86_amd64 32-bit-host tools emit x64 objects and are sufficient). cl.exe + ml64.exe cover the matching-decompilation workflow this targets.

Jeff McJunkin and others added 9 commits June 11, 2026 18:10
FlsAlloc/FlsGetValue/FlsSetValue stored values in a single process-global
array, so every thread shared one cell per FLS index. Fiber-local storage
without fibers is thread-local storage on Windows: each thread is exactly
one fiber, and FlsGetValue must return the value the calling thread set.

The shared cell breaks msvcr120.dll (VS2013 CRT) thread creation. The CRT
stores its per-thread data block (_ptd) -- which carries the
_beginthreadex entry point and argument -- via FlsSetValue, and the new
thread's _threadstartex/_callthreadstartex re-read it via FlsGetValue.
With a process-global cell, two concurrently-starting threads overwrite
each other's _ptd and can both start with the same argument.

Observed with VS2013 cl.exe: c2.dll's parallel codegen pool creates four
worker threads back-to-back; ~1-2% of compiles deadlocked forever. An API
trace of a hung process shows two workers waiting on the same per-worker
dispatch event while another worker's event has a signal and no waiter:

    CreateEvent h=7c / 88 / 94 / a0      (per-worker "go" events)
    t251103 WaitForSingleObject h=88     <- worker 1
    t251102 WaitForSingleObject h=88     <- different thread, SAME event
    boss SetEvent 7c, 88, 94, a0         <- 94 never gets a waiter
    boss WaitForMultipleObjects({done events}, bWaitAll, INFINITE)
      -> hangs forever: the orphaned worker never runs its work item,
         so its "done" event is never set.

Fix: keep the index allocation map process-wide (FLS indices are
process-wide on Windows) but store the values in a thread_local array;
wibo maps guest threads 1:1 onto host threads, so thread_local is exactly
per-guest-thread. New threads observe zero-initialized values, matching
Windows. Index alloc/free is guarded by a mutex.
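
A minimal sketch of the layout described in the fix, assuming a fixed slot table: the index map is process-wide and mutex-guarded, while values live in a `thread_local` array. The names (`sketchFlsAlloc`, `kMaxFlsSlots`, and so on) and the slot limit are illustrative, not wibo's actual identifiers, and the sketch does not check that an index was actually allocated.

```cpp
#include <array>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>

constexpr std::size_t kMaxFlsSlots = 128;              // hypothetical slot limit
constexpr std::uint32_t kFlsOutOfIndexes = 0xFFFFFFFF; // same value as FLS_OUT_OF_INDEXES

// Process-wide: which indices are allocated. Guarded by g_flsIndexMutex.
std::mutex g_flsIndexMutex;
std::bitset<kMaxFlsSlots> g_flsIndexInUse;

// Per-thread: the stored values. Zero-initialized for every new thread.
thread_local std::array<void *, kMaxFlsSlots> t_flsValues{};

std::uint32_t sketchFlsAlloc() {
    std::lock_guard<std::mutex> lock(g_flsIndexMutex);
    for (std::size_t i = 0; i < kMaxFlsSlots; ++i) {
        if (!g_flsIndexInUse[i]) {
            g_flsIndexInUse[i] = true;
            return static_cast<std::uint32_t>(i);
        }
    }
    return kFlsOutOfIndexes;
}

bool sketchFlsSetValue(std::uint32_t index, void *value) {
    if (index >= kMaxFlsSlots) return false;
    t_flsValues[index] = value; // touches only the calling thread's cell
    return true;
}

void *sketchFlsGetValue(std::uint32_t index) {
    return index < kMaxFlsSlots ? t_flsValues[index] : nullptr;
}

// Demo helper: read a slot from a brand-new thread.
void *readOnFreshThread(std::uint32_t index) {
    void *seen = nullptr;
    std::thread worker([&] { seen = sketchFlsGetValue(index); });
    worker.join();
    return seen;
}
```

With this layout a value set on one thread is invisible to a freshly created thread, which is the property the CRT's `_ptd` handoff relies on.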

Caveat (unchanged behavior): FLS destructor callbacks are still never
invoked, and freeing then reallocating an index does not clear other
threads' stale values for it. The VS2013 CRT allocates its index once per
process, so neither is reachable for it.

With this fix the cl.exe hang rate dropped from ~1-2% to ~0.1-0.3% over
1000-run stress batches; the remainder was a separate CRITICAL_SECTION
issue fixed in the following commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two divergences from the real Windows state machine, both observed
breaking VS2013 cl.exe (c2.dll multithreaded codegen):

1. Contended EnterCriticalSection waited until OwningThread was observed
   to be 0 and then claimed the section with a plain store. Two waiters
   can both observe 0 after a single Leave and both enter the critical
   section simultaneously. The fingerprint -- a free section left with
   LockCount == 0 instead of -1, created when the second "owner"'s Leave
   failed the ownership check and returned without decrementing -- was
   captured with gdb in hung c2.dll worker pools.

2. LeaveCriticalSection bailed out when the calling thread did not match
   OwningThread. Real Windows Leave performs no caller validation at all:
   it unconditionally decrements RecursionCount/LockCount and releases
   exactly one waiter; mutual exclusion is carried entirely by the
   LockCount/semaphore state machine. Guest lock usage that Windows
   tolerates therefore silently strands wibo's lock state instead.
   Instrumented runs caught exactly this on c2.dll's work-queue section:

       leave-not-owner cs=<work queue CS> tid=B owner=A lock=1 rec=1

   after which the queue's "active worker" counter and the LockCount were
   each stranded one too high, the queue's drain condition became
   unsatisfiable, and the compiler deadlocked (~0.1-0.3% of compiles even
   after the FLS fix).

Fix, mirroring the Windows protocol:

- Model LockSemaphore as a ticket count: every contended Leave posts
  exactly one ticket (InterlockedIncrement + WakeByAddressSingle); every
  blocked Enter consumes exactly one ticket (CAS decrement, WaitOnAddress
  otherwise) and then owns the section by construction. There is no claim
  race on OwningThread. TryEnterCriticalSection cannot steal while
  waiters exist because each waiter's LockCount increment persists until
  it owns and leaves.
- Leave validates nothing, like Windows: if --RecursionCount != 0,
  decrement LockCount; otherwise clear the owner, decrement, and release
  one waiter if the result is >= 0.
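
The protocol above can be sketched in portable C++. This is an illustration under stated assumptions, not the code in this commit: the commit describes InterlockedIncrement with WaitOnAddress/WakeByAddressSingle, whereas this sketch swaps in a mutex and condition variable for the ticket count, and all names (`SketchCriticalSection`, `sketchEnter`, `sketchLeave`) are hypothetical.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

struct SketchCriticalSection {
    std::atomic<long> lockCount{-1};        // -1 = free; otherwise (holders + waiters) - 1
    std::atomic<std::thread::id> owner{};   // default id = no owner
    long recursionCount = 0;                // touched only by the current owner
    std::mutex ticketMutex;                 // stand-in for the LockSemaphore word
    std::condition_variable ticketCv;
    long tickets = 0;                       // one ticket per contended Leave
};

void sketchEnter(SketchCriticalSection &cs) {
    const std::thread::id self = std::this_thread::get_id();
    if (cs.lockCount.fetch_add(1) + 1 == 0) { // uncontended: -1 -> 0
        cs.owner.store(self);
        cs.recursionCount = 1;
        return;
    }
    if (cs.owner.load() == self) {            // recursive entry by the owner
        ++cs.recursionCount;
        return;
    }
    // Contended: consume exactly one ticket, then own the section by construction.
    {
        std::unique_lock<std::mutex> lock(cs.ticketMutex);
        cs.ticketCv.wait(lock, [&] { return cs.tickets > 0; });
        --cs.tickets;
    }
    cs.owner.store(self);
    cs.recursionCount = 1;
}

void sketchLeave(SketchCriticalSection &cs) {
    // No caller validation, mirroring the behavior described for Windows.
    if (--cs.recursionCount != 0) {
        cs.lockCount.fetch_sub(1);
        return;
    }
    cs.owner.store(std::thread::id{});
    if (cs.lockCount.fetch_sub(1) - 1 >= 0) {  // someone is waiting: post one ticket
        {
            std::lock_guard<std::mutex> lock(cs.ticketMutex);
            ++cs.tickets;
        }
        cs.ticketCv.notify_one();
    }
}
```

Because a waiter's increment of `lockCount` persists until that waiter has owned and left the section, the count can only return to -1 once every ticket has been consumed, so a wake-up is neither lost nor duplicated.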

Measured with VS2013 cl.exe under stress (18KB C unit, 25s timeout =
hang): ~1-2% hangs before, 0 hangs in 5,100 runs with this commit plus
the FLS fix (2x1000 + 2000 at 24-way parallelism, 300 sequential, 800 on
two other source units). Compiler output stays byte-identical to native
Windows. Side effect: heavily contended compiles got ~6x faster (1000
parallel compiles: 26s -> 4s wall) because the old wait loop woke every
waiter to race on each release.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Resource-only DLLs (e.g. MSVC's clui.dll) are linked /NOENTRY with their
relocations stripped. When their preferred image base is already occupied
they must be mapped elsewhere, at which point loadPE bailed with
"relocation required but no relocation directory present". Such images
have no executable section and address their resources via the actual
mapped base, so it is safe to continue without applying relocations.
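
The decision can be sketched as a small pure function. The struct and enum below are hypothetical, and whether the real loadPE checks for executable sections explicitly (rather than relying on the image being resource-only) is an assumption of this sketch.

```cpp
#include <cstdint>

struct SketchImage {
    std::uintptr_t preferredBase;
    std::uintptr_t mappedBase;
    bool hasRelocationDirectory;
    bool hasExecutableSection;
};

enum class RelocAction { None, Apply, SkipResourceOnly, Fail };

RelocAction chooseRelocAction(const SketchImage &img) {
    if (img.mappedBase == img.preferredBase) return RelocAction::None; // nothing to fix up
    if (img.hasRelocationDirectory) return RelocAction::Apply;         // normal rebasing
    // Relocations stripped and the image moved: tolerable only when no code
    // could contain absolute addresses (resource-only DLLs such as clui.dll).
    if (!img.hasExecutableSection) return RelocAction::SkipResourceOnly;
    return RelocAction::Fail; // "relocation required but no relocation directory present"
}
```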

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Backed by wibo's module registry. The GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS
form (moduleInfoFromAddress) is how the MSVC CRT/compiler finds its own
module to build the localized-resource (1033\clui.dll) path.
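
The from-address lookup amounts to a range search over loaded images. A minimal sketch, assuming the registry can be viewed as a list of (base, size, path) entries; `SketchModuleEntry` and `sketchModuleFromAddress` are illustrative names, not wibo's moduleInfoFromAddress itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct SketchModuleEntry {
    std::uintptr_t base; // mapped image base
    std::size_t size;    // size of the mapped image
    std::string path;    // hypothetical: full path of the module
};

// Return the module whose mapped range contains address, or nullptr.
const SketchModuleEntry *sketchModuleFromAddress(const std::vector<SketchModuleEntry> &registry,
                                                 std::uintptr_t address) {
    for (const SketchModuleEntry &m : registry) {
        if (address >= m.base && address - m.base < m.size) return &m;
    }
    return nullptr;
}
```

A caller passes the address of one of its own functions and uses the returned module's directory as the root for the localized-resource path.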

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds InterlockedPushEntrySList / InterlockedPopEntrySList /
InterlockedFlushSList / QueryDepthSList (wibo already had SLIST_HEADER and
InitializeSListHead), serialized with a mutex.
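
A mutex-serialized singly linked list with the same return conventions as the Windows SList API might look like the following. The types are simplified stand-ins (the real SLIST_HEADER is a packed union), a single global mutex is an assumption of this sketch, and all names are hypothetical.

```cpp
#include <cstdint>
#include <mutex>

struct SketchSListEntry {
    SketchSListEntry *next = nullptr;
};

struct SketchSListHeader {
    SketchSListEntry *first = nullptr;
    std::uint16_t depth = 0;
};

std::mutex g_sketchSListMutex; // one global lock, for simplicity

// Push: returns the previous first entry (nullptr if the list was empty).
SketchSListEntry *sketchPushEntry(SketchSListHeader &head, SketchSListEntry &entry) {
    std::lock_guard<std::mutex> lock(g_sketchSListMutex);
    SketchSListEntry *previous = head.first;
    entry.next = previous;
    head.first = &entry;
    ++head.depth;
    return previous;
}

// Pop: returns the removed entry, or nullptr if the list was empty.
SketchSListEntry *sketchPopEntry(SketchSListHeader &head) {
    std::lock_guard<std::mutex> lock(g_sketchSListMutex);
    SketchSListEntry *entry = head.first;
    if (entry != nullptr) {
        head.first = entry->next;
        --head.depth;
    }
    return entry;
}

// Flush: detaches the whole chain and returns its first entry.
SketchSListEntry *sketchFlush(SketchSListHeader &head) {
    std::lock_guard<std::mutex> lock(g_sketchSListMutex);
    SketchSListEntry *all = head.first;
    head.first = nullptr;
    head.depth = 0;
    return all;
}

std::uint16_t sketchQueryDepth(SketchSListHeader &head) {
    std::lock_guard<std::mutex> lock(g_sketchSListMutex);
    return head.depth;
}
```

A lock sidesteps the ABA hazards of a lock-free implementation, at the cost of serializing all list operations.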

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tubs

LCMapStringEx/CompareStringEx forward to LCMapStringW/CompareStringW. A
small compat TU adds conservative stubs the MSVC CRT/compiler probe at
startup (GetEnabledXStateFeatures, Get/SetThreadPreferredUILanguages,
IsValidLocaleName, InitializeSRWLock, WaitForSingleObjectEx/MultipleEx).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…icodeString

The MSVC frontend opens source/output files through the NT file API.
NtCreateFile maps OBJECT_ATTRIBUTES onto kernel32 CreateFileW (with
FILE_FLAG_BACKUP_SEMANTICS for FILE_DIRECTORY_FILE); Rtl*UnicodeString use
the process heap; NtQueryDirectoryFile enumerates via std::filesystem.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
General threading fix (not MSVC-specific). Two defects left
__declspec(thread) data broken on guest-created threads (NULL
ThreadLocalStoragePointer), crashing MSVC c2.dll's parallel-codegen workers:
1. initializeTib() set up the module-TLS array via
   ensureModuleArrayCapacityLocked(g_moduleArrayCapacity), which early-
   returns when required <= current capacity, so a thread created after the
   TLS-bearing DLLs loaded got no array. Allocate it directly for the new TIB.
2. notifyDllThreadAttach() only allocated static TLS for modules passing
   shouldDeliverThreadNotifications(); a DLL that calls
   DisableThreadLibraryCalls (c2.dll) is excluded, yet Windows still
   allocates its static TLS. Allocate static TLS for every hasTls module.
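
Both fixes can be illustrated with a toy thread-attach routine. This is a sketch under simplifying assumptions: TLS blocks are modelled as byte vectors, the structs and `sketchThreadAttach` are hypothetical, and none of it reflects wibo's real data structures.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct SketchTlsModule {
    std::string name;
    bool hasTls;
    bool threadCallsDisabled;              // module called DisableThreadLibraryCalls
    std::size_t tlsIndex;
    std::vector<unsigned char> tlsTemplate; // initial image of the module's TLS data
};

struct SketchTib {
    // Stand-in for ThreadLocalStoragePointer: one block per TLS index.
    std::vector<std::vector<unsigned char>> tlsBlocks;
};

// Sets up a new thread's TLS and returns the modules that should also
// receive a DLL_THREAD_ATTACH notification.
std::vector<std::string> sketchThreadAttach(SketchTib &tib,
                                            const std::vector<SketchTlsModule> &modules,
                                            std::size_t moduleArrayCapacity) {
    // Fix 1: size the array for this TIB directly instead of going through an
    // "ensure capacity" helper that returns early when the process-wide
    // capacity is already large enough.
    tib.tlsBlocks.assign(moduleArrayCapacity, std::vector<unsigned char>());

    std::vector<std::string> notify;
    for (const SketchTlsModule &m : modules) {
        // Fix 2: static TLS is allocated for every TLS-bearing module,
        // independent of whether it wants thread notifications.
        if (m.hasTls && m.tlsIndex < tib.tlsBlocks.size()) {
            tib.tlsBlocks[m.tlsIndex] = m.tlsTemplate;
        }
        if (!m.threadCallsDisabled) notify.push_back(m.name);
    }
    return notify;
}
```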

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n path

VS2013 c1.dll selects its source-file resolution strategy from the OS
version reported by RtlGetVersion. 6.2 (Windows 8) steers it into a
directory-canonicalization path that dead-ends on wibo's Z:-mapped volumes
(C1083 without ever opening the source); 6.1 makes it use the direct
CreateFileW(source) path. Windows 7 (6.1, build 7601) was VS2013's
contemporary host.
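
For illustration, reporting Windows 7 through an RtlGetVersion-style call could look like this. The struct mirrors the field layout of RTL_OSVERSIONINFOW, but the type and function names are hypothetical, and leaving szCSDVersion empty is an assumption of the sketch rather than something the commit states.

```cpp
#include <cstdint>

struct SketchOsVersionInfoW {
    std::uint32_t dwOSVersionInfoSize;
    std::uint32_t dwMajorVersion;
    std::uint32_t dwMinorVersion;
    std::uint32_t dwBuildNumber;
    std::uint32_t dwPlatformId;
    char16_t szCSDVersion[128];
};

constexpr std::int32_t kStatusSuccess = 0;      // STATUS_SUCCESS
constexpr std::uint32_t kPlatformWin32Nt = 2;   // VER_PLATFORM_WIN32_NT

std::int32_t sketchRtlGetVersion(SketchOsVersionInfoW &info) {
    const std::uint32_t callerSize = info.dwOSVersionInfoSize; // preserve the caller's size field
    info = SketchOsVersionInfoW{};
    info.dwOSVersionInfoSize = callerSize;
    info.dwMajorVersion = 6; // 6.1 = Windows 7, avoiding the 6.2 path described above
    info.dwMinorVersion = 1;
    info.dwBuildNumber = 7601;
    info.dwPlatformId = kPlatformWin32Nt;
    return kStatusSuccess;
}
```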

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>