Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
193 changes: 193 additions & 0 deletions PolyPilot.Tests/SessionOrganizationTests.cs
Original file line number Diff line number Diff line change
Expand Up @@ -2398,6 +2398,199 @@ public void GetStrengths_ReturnsDescription()
Assert.NotEqual("Unknown model", strengths);
Assert.Contains("reasoning", strengths, StringComparison.OrdinalIgnoreCase);
}

// --- InferFromName: family branches ---

[Theory]
[InlineData("claude-opus-99", ModelCapability.ReasoningExpert | ModelCapability.CodeExpert | ModelCapability.ToolUse)]
[InlineData("my-opus-model", ModelCapability.ReasoningExpert | ModelCapability.CodeExpert | ModelCapability.ToolUse)]
public void InferFromName_OpusFamily_ReturnsReasoningExpertCodeExpertToolUse(string slug, ModelCapability expected)
{
var caps = ModelCapabilities.InferFromName(slug);
Assert.Equal(expected, caps);
}

[Theory]
[InlineData("claude-sonnet-5", ModelCapability.CodeExpert | ModelCapability.ToolUse | ModelCapability.Fast)]
[InlineData("sonnet-next", ModelCapability.CodeExpert | ModelCapability.ToolUse | ModelCapability.Fast)]
public void InferFromName_SonnetFamily_ReturnsCodeExpertToolUseFast(string slug, ModelCapability expected)
{
var caps = ModelCapabilities.InferFromName(slug);
Assert.Equal(expected, caps);
}

[Theory]
[InlineData("claude-haiku-5", ModelCapability.Fast | ModelCapability.CostEfficient)]
[InlineData("haiku-lite", ModelCapability.Fast | ModelCapability.CostEfficient)]
public void InferFromName_HaikuFamily_ReturnsFastCostEfficient(string slug, ModelCapability expected)
{
var caps = ModelCapabilities.InferFromName(slug);
Assert.Equal(expected, caps);
}

[Theory]
[InlineData("gemini-4-pro")]
[InlineData("gemini-flash")]
public void InferFromName_GeminiFamily_ReturnsReasoningExpertLargeContextVision(string slug)
{
var caps = ModelCapabilities.InferFromName(slug);
// Core gemini-family flags must be present
Assert.True(caps.HasFlag(ModelCapability.ReasoningExpert));
Assert.True(caps.HasFlag(ModelCapability.LargeContext));
Assert.True(caps.HasFlag(ModelCapability.Vision));
// Note: "gemini" contains "mini" as a substring, so the mini-variant logic also fires,
// adding Fast|CostEfficient. This is expected production code behaviour.
Assert.True(caps.HasFlag(ModelCapability.Fast));
Assert.True(caps.HasFlag(ModelCapability.CostEfficient));
Comment on lines +2441 to +2444
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Test cements a production code bug as "expected behaviour"

InferFromName (ModelCapabilities.cs:80) uses lower.Contains("mini"), which falsely matches "ge**mini**". Calling this "expected production code behaviour" discourages fixing it, and any future fix to use word-boundary matching would break these assertions.

Suggestion: Reword the comment to acknowledge this as a known limitation:

// Known issue: "gemini" contains "mini" as a substring, causing the mini-variant
// logic to also fire. If InferFromName is updated to use word-boundary matching,
// these two asserts should be removed.

Comment on lines +2441 to +2444
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Test locks in a substring-collision defect as "expected behaviour"

"gemini" contains "mini" at positions 2–5, so the production code's lower.Contains("mini") check in InferFromName (ModelCapabilities.cs:80) fires on every Gemini-family slug. This incorrectly tags premium Gemini models with Fast | CostEfficient — flags that semantically mean "cheap/lightweight."

The test comment calls this "expected production code behaviour," which effectively approves the defect and will prevent future maintainers from recognising it as a bug. Any novel Gemini model inferred from its name will be misclassified, potentially skewing role-assignment logic in GetRoleWarnings.

Suggestion: Reword the test comment to acknowledge this as a known defect (e.g., // KNOWN BUG: "gemini" contains "mini" substring — see #<issue>), and consider a production fix using word-boundary matching (e.g., Regex.IsMatch(lower, @"\bmini\b") or checking for -mini/_mini tokens).

Flagged by: 3/3 reviewers

}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Test locks in a production bug — "gemini" contains "mini" as a substring

The production code at ModelCapabilities.cs:80 uses lower.Contains("mini"), which matches the "mini" inside "ge**mini**". This causes every Gemini model inferred via InferFromName to incorrectly receive Fast | CostEfficient.

This test asserts those incorrect flags and documents it as "expected production code behaviour", which locks in the bug — any future fix to the variant matching would break these assertions.

Concrete impact: GetStrengths("gemini-4-pro") returns "Inferred: reasoning, fast, cost-efficient, multimodal, large context", mislabeling a hypothetical premium model as "fast, cost-efficient".

Suggested fix — in ModelCapabilities.cs line 80, change:

if (lower.Contains("mini")) caps |= ...

to:

if (lower.Contains("-mini")) caps |= ...

All real "mini" variant slugs use a -mini suffix (gpt-5-mini, gpt-5.1-codex-mini). Then update this test to not assert Fast/CostEfficient for Gemini models.


[Fact]
public void InferFromName_CodexVariant_AddsCodeExpert()
{
Comment on lines +2444 to +2449
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR · 2/3 reviewers

Enshrining a production quirk as required test behavior. These assertions lock in the "gemini""mini" substring false-positive as expected, which means a future fix to use word-boundary matching in InferFromName would break these tests.

Suggestion: Consider splitting the gemini assertions — keep ReasoningExpert | LargeContext | Vision as the required core, and assert Fast | CostEfficient separately with a comment like // TODO: Known quirk — "gemini" contains "mini". Remove when InferFromName uses word-boundary matching. This way a future fix only needs to update the quirk assertion, not the core gemini test.

var caps = ModelCapabilities.InferFromName("new-model-codex");
Assert.True(caps.HasFlag(ModelCapability.CodeExpert));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Variant-only tests use HasFlag instead of Assert.Equal (3/3 reviewers)

These inputs ("new-model-codex", "new-model-mini", "new-model-max") don't match any family branch, so the expected flags are fully deterministic (e.g., codex → exactly CodeExpert). HasFlag can't detect regressions that add spurious flags — for example, if a code change accidentally added ToolUse to codex variants, this test would still pass.

Suggestion: Use Assert.Equal to match the pattern of the family tests:

Assert.Equal(ModelCapability.CodeExpert, caps);

Same for the mini and max variant tests.

}

[Fact]
public void InferFromName_MiniVariant_AddsFastCostEfficient()
{
var caps = ModelCapabilities.InferFromName("new-model-mini");
Assert.True(caps.HasFlag(ModelCapability.Fast));
Assert.True(caps.HasFlag(ModelCapability.CostEfficient));
}

[Fact]
public void InferFromName_MaxVariant_AddsReasoningExpert()
{
var caps = ModelCapabilities.InferFromName("new-model-max");
Assert.True(caps.HasFlag(ModelCapability.ReasoningExpert));
}

[Fact]
public void InferFromName_SonnetCodexCombined_HasBothFamilyAndVariantFlags()
{
// sonnet family gives CodeExpert|ToolUse|Fast; codex adds CodeExpert (idempotent via flags)
var caps = ModelCapabilities.InferFromName("sonnet-codex-experimental");
Assert.True(caps.HasFlag(ModelCapability.CodeExpert));
Assert.True(caps.HasFlag(ModelCapability.ToolUse));
Assert.True(caps.HasFlag(ModelCapability.Fast));
}

[Fact]
Comment on lines +2471 to +2479
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Tautological test — codex branch is unobservable

The sonnet family already provides CodeExpert (line 74 of ModelCapabilities.cs), and CodeExpert is the only flag the codex variant adds (line 79). So the |= from the codex branch is a no-op here — the test result is identical whether or not the codex variant code exists.

Concrete failing scenario: Delete line 79 (if (lower.Contains("codex")) caps |= ModelCapability.CodeExpert;) → this test still passes, because sonnet already supplied CodeExpert|ToolUse|Fast.

Suggested fix: Use a family that does NOT include CodeExpert, so the codex variant's contribution is observable:

[Fact]
public void InferFromName_HaikuCodexCombined_AddsBothFamilyAndVariantFlags()
{
    // haiku → Fast|CostEfficient; codex → |= CodeExpert (observable addition)
    var caps = ModelCapabilities.InferFromName("haiku-codex-v2");
    Assert.True(caps.HasFlag(ModelCapability.Fast));
    Assert.True(caps.HasFlag(ModelCapability.CostEfficient));
    Assert.True(caps.HasFlag(ModelCapability.CodeExpert)); // fails if codex branch removed
}

public void InferFromName_CompletelyUnknown_ReturnsNone()
{
var caps = ModelCapabilities.InferFromName("totally-unknown-xyz");
Assert.Equal(ModelCapability.None, caps);
}

// --- GetCapabilities: fuzzy prefix matching ---

[Theory]
[InlineData("claude-opus-4.6-1m")]
[InlineData("claude-opus-4.6-preview")]
[InlineData("claude-opus-4.6-extended")]
public void GetCapabilities_FuzzyPrefix_OpusVariant_ReturnsOpusFlags(string slug)
{
var caps = ModelCapabilities.GetCapabilities(slug);
// claude-opus-4.6 registry entry: ReasoningExpert|CodeExpert|ToolUse|LargeContext
Assert.True(caps.HasFlag(ModelCapability.ReasoningExpert));
Comment on lines +2492 to +2496
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Fuzzy-prefix tests use HasFlag instead of Assert.Equal (3/3 reviewers after follow-up)

All three fuzzy-prefix tests (OpusVariant, SonnetVariant, GptMiniVariant) use Assert.True(caps.HasFlag(...)), which only verifies minimum flags are present. If GetCapabilities starts returning extra unintended flags for fuzzy-prefix slugs, these tests would silently pass.

Since fuzzy-prefix resolution is expected to mirror a specific registry entry exactly, Assert.Equal(expected, caps) is the stronger and more appropriate contract check — consistent with the InferFromName family tests above which already use Assert.Equal.

Note: GetCapabilities_FuzzyPrefix_MatchesSameAsExact (line 2526) already uses Assert.Equal for opus, which is good — extending this pattern to all fuzzy-prefix tests would close the gap.

Assert.True(caps.HasFlag(ModelCapability.CodeExpert));
Assert.True(caps.HasFlag(ModelCapability.ToolUse));
Assert.True(caps.HasFlag(ModelCapability.LargeContext));
}

[Theory]
[InlineData("claude-sonnet-4.5-20250514")]
[InlineData("claude-sonnet-4.5-preview")]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Non-deterministic fuzzy match: claude-sonnet-4.5-preview matches two registry keys (2/3 reviewers)

"claude-sonnet-4.5-preview" satisfies StartsWith(key) for both "claude-sonnet-4.5" and "claude-sonnet-4" in the registry dictionary. Dictionary<> enumeration order is unspecified in .NET, so which key "wins" the fuzzy match is nondeterministic.

The test passes today only because both keys share identical flags (CodeExpert|ToolUse|Fast) and the same Strengths string. If either entry diverges in a future registry update, this test becomes unpredictably flaky.

The same ambiguity affects GetStrengths_FuzzyMatchedModels_ReturnRegistryStrengths (line 2551) which also uses "claude-sonnet-4.5-preview".

Fix: Use a slug that unambiguously matches exactly one registry key, or add an assertion that verifies the exact matched key.

public void GetCapabilities_FuzzyPrefix_SonnetVariant_ReturnsSonnetFlags(string slug)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Fuzzy-prefix test relies on Dictionary iteration order

"claude-sonnet-4.5-preview" prefix-matches two registry keys: "claude-sonnet-4" and "claude-sonnet-4.5". The foreach over the internal Dictionary<string, ...> has no guaranteed iteration order in .NET. This test passes today because both keys happen to share identical capability flags (CodeExpert | ToolUse | Fast), but if either entry is updated independently, this test could become non-deterministically fragile.

Suggestion: Either use test inputs that unambiguously match a single registry key, or add a comment documenting that correctness depends on both keys sharing the same flags.

Flagged by: 2/3 reviewers

{
var caps = ModelCapabilities.GetCapabilities(slug);
// claude-sonnet-4.5 registry entry: CodeExpert|ToolUse|Fast
Assert.True(caps.HasFlag(ModelCapability.CodeExpert));
Assert.True(caps.HasFlag(ModelCapability.ToolUse));
Assert.True(caps.HasFlag(ModelCapability.Fast));
}
Comment on lines +2502 to +2512
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Fuzzy-match test relies on non-deterministic Dictionary iteration order

"claude-sonnet-4.5-20250514".StartsWith("claude-sonnet-4") is true, so this slug matches both the "claude-sonnet-4.5" and "claude-sonnet-4" registry keys in the foreach loop. The code returns whichever the Dictionary iterator yields first — which isn't guaranteed across .NET versions.

Both entries have identical caps today (CodeExpert | ToolUse | Fast), so this is safe for now. If they ever diverge, this test becomes flaky.

Suggestion: Add a comment noting the assumption:

// Both claude-sonnet-4.5 and claude-sonnet-4 share identical caps today;
// if they diverge, fuzzy matching will need longest-prefix-wins logic.


[Fact]
public void GetCapabilities_FuzzyPrefix_GptMiniVariant_ReturnsGpt4Flags()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Misleading test method name

This tests "gpt-4.1-preview" fuzzy-matching to "gpt-4.1", but the name says "GptMiniVariant" — there's no "mini" in the slug. Suggest renaming to GetCapabilities_FuzzyPrefix_GptPreviewVariant_ReturnsGpt41Flags.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Misleading test name

GetCapabilities_FuzzyPrefix_GptMiniVariant_ReturnsGpt4Flags tests "gpt-4.1-preview", which is a preview variant, not a mini variant. The name will mislead future maintainers into thinking this covers mini-model behaviour.

Suggestion: Rename to GetCapabilities_FuzzyPrefix_GptPreviewVariant_ReturnsGpt41Flags (or similar) to match the actual input slug.

Flagged by: 2/3 reviewers

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Misleading test name GptMiniVariant (2/3 reviewers)

"gpt-4.1-preview" contains no "mini" substring and no mini-variant logic fires — this test exercises the basic prefix fuzzy path (gpt-4.1-preview → gpt-4.1). The name will confuse future readers searching for mini-variant coverage.

Suggestion: Rename to something like GetCapabilities_FuzzyPrefix_Gpt41Preview_ReturnsGpt41Flags.

{
// "gpt-4.1" has Fast|CostEfficient|ToolUse; "gpt-4.1-preview" unambiguously fuzzy-matches it
Comment on lines +2510 to +2517
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 False-positive test — cannot detect fuzzy-matching regressions

Both sonnet registry entries (claude-sonnet-4.5, claude-sonnet-4) have flags CodeExpert|ToolUse|Fast. InferFromName for any slug containing "sonnet" also returns CodeExpert|ToolUse|Fast (line 74 of ModelCapabilities.cs). The flag sets are identical.

This means if the entire fuzzy-prefix loop in GetCapabilities (lines 54-57) were deleted, these tests would still pass — InferFromName produces the exact same result. Compare with the opus fuzzy tests above, which do provide real coverage because LargeContext is in the registry entry but NOT in InferFromName's opus output.

Concrete failing scenario: Delete the foreach fuzzy loop from GetCapabilities → opus fuzzy tests fail (good), sonnet fuzzy tests still pass (bad — zero coverage of the code path they claim to test).

Suggested fix: Replace with a model whose registry entry diverges from InferFromName, or add an Assert.Equal against the exact registry value paired with a separate test proving InferFromName returns a different value. Alternatively, test a haiku variant (registry has ToolUse, InferFromName doesn't):

[Fact]
public void GetCapabilities_FuzzyPrefix_HaikuVariant_ReturnsRegistryFlags()
{
    var caps = ModelCapabilities.GetCapabilities("claude-haiku-4.5-fast");
    // Registry: Fast|CostEfficient|ToolUse; InferFromName: Fast|CostEfficient (no ToolUse)
    Assert.True(caps.HasFlag(ModelCapability.ToolUse)); // only passes via fuzzy
}

// (no shorter registry key matches "gpt-4.1-preview")
var caps = ModelCapabilities.GetCapabilities("gpt-4.1-preview");
Assert.True(caps.HasFlag(ModelCapability.Fast));
Assert.True(caps.HasFlag(ModelCapability.CostEfficient));
Assert.True(caps.HasFlag(ModelCapability.ToolUse));
}

[Fact]
public void GetCapabilities_FuzzyPrefix_MatchesSameAsExact()
{
// Fuzzy-matched variant should return identical flags to the exact registry entry
var exact = ModelCapabilities.GetCapabilities("claude-opus-4.6");
var fuzzy = ModelCapabilities.GetCapabilities("claude-opus-4.6-1m");
Assert.Equal(exact, fuzzy);
}

// --- GetStrengths: all paths ---

[Theory]
[InlineData("gpt-5")]
[InlineData("claude-opus-4.5")]
[InlineData("claude-haiku-4.5")]
[InlineData("gemini-3-pro")]
public void GetStrengths_KnownModels_ReturnNonEmptyRegistryString(string slug)
{
var strengths = ModelCapabilities.GetStrengths(slug);
Assert.NotEmpty(strengths);
Assert.DoesNotContain("Unknown", strengths, StringComparison.OrdinalIgnoreCase);
Assert.DoesNotContain("Inferred", strengths, StringComparison.OrdinalIgnoreCase);
Comment on lines +2544 to +2546
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 MINOR — Known-model strengths test uses only negative assertions (2/3 reviewers)

NotEmpty + DoesNotContain("Unknown") + DoesNotContain("Inferred") confirms the registry-match path was taken, but can't catch a regression where GetStrengths returns the wrong registry entry's text. One reviewer disagreed, noting that adding positive assertions on prose content couples tests to copy edits — a reasonable tradeoff concern.

Suggestion (optional): If desired, add one keyword per model: e.g., Assert.Contains("reasoning", ...) for opus, Assert.Contains("cost", ...) for haiku. Alternatively, accept the current level as sufficient for a "path coverage" test.

}

[Theory]
[InlineData("claude-opus-4.6-1m")]
[InlineData("claude-sonnet-4.5-preview")]
public void GetStrengths_FuzzyMatchedModels_ReturnRegistryStrengths(string slug)
{
var strengths = ModelCapabilities.GetStrengths(slug);
Assert.NotEmpty(strengths);
Assert.DoesNotContain("Unknown", strengths, StringComparison.OrdinalIgnoreCase);
Assert.DoesNotContain("Inferred", strengths, StringComparison.OrdinalIgnoreCase);
Comment on lines +2554 to +2557
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MODERATE — Fuzzy-match strengths test doesn't verify correct registry mapping (3/3 reviewers)

These negative-only assertions (NotEmpty, DoesNotContain("Unknown"), DoesNotContain("Inferred")) prove the result didn't fall through to the inferred/unknown fallback paths, but can't verify which registry entry was matched. If the fuzzy loop's iteration order changes and "claude-opus-4.6-1m" resolves to a different model's strengths, this test still passes.

Suggestion: Pin the fuzzy result to its expected canonical entry:

Assert.Equal(ModelCapabilities.GetStrengths("claude-opus-4.6"), strengths);
// (and similarly for "claude-sonnet-4.5-preview" → "claude-sonnet-4.5")

This mirrors the pattern already used in GetCapabilities_FuzzyPrefix_MatchesSameAsExact.

}

[Theory]
[InlineData("claude-opus-99")]
[InlineData("claude-sonnet-10")]
[InlineData("gemini-5-pro")]
public void GetStrengths_InferredModels_ReturnInferredPrefix(string slug)
{
var strengths = ModelCapabilities.GetStrengths(slug);
Assert.StartsWith("Inferred:", strengths, StringComparison.OrdinalIgnoreCase);
Assert.True(strengths.Length > "Inferred:".Length, "Inferred strengths must contain content after the prefix");
}

[Fact]
public void GetStrengths_InferredOpus_ContainsReasoningAndCode()
{
var strengths = ModelCapabilities.GetStrengths("claude-opus-99");
Assert.Contains("reasoning", strengths, StringComparison.OrdinalIgnoreCase);
Assert.Contains("code", strengths, StringComparison.OrdinalIgnoreCase);
}

[Fact]
public void GetStrengths_InferredMini_ContainsFastAndCostEfficient()
{
var strengths = ModelCapabilities.GetStrengths("new-model-mini");
Assert.StartsWith("Inferred:", strengths, StringComparison.OrdinalIgnoreCase);
Assert.Contains("fast", strengths, StringComparison.OrdinalIgnoreCase);
Assert.Contains("cost-efficient", strengths, StringComparison.OrdinalIgnoreCase);
}

[Fact]
public void GetStrengths_CompletelyUnknown_ReturnsUnknownModel()
{
var strengths = ModelCapabilities.GetStrengths("totally-unknown-xyz-abc");
Assert.Equal("Unknown model", strengths);
}
}

public class GroupPresetTests
Expand Down
Loading