INDEX
    Explanations

    words related to choices or decision-making

    references to the word "which" in various contexts

    Negative Logits
    swick      -0.89
    renheit    -0.79
    ISTORY     -0.78
    bart       -0.76
    ibaba      -0.76
    wn         -0.76
    emp        -0.74
    ̶          -0.74
    shi        -0.73
    yrinth     -0.73
    Positive Logits
     ones           1.18
     side           1.16
     direction      1.09
     wavelengths    1.03
     kinds          1.00
     aspects        0.99
     hemisphere     0.98
     parts          0.94
     subset         0.91
     facets         0.90
    Activation Density: 0.044%

    No Known Activations