INDEX
    Explanations

    what
    Method used: 1
    Reason: MAX_ACTIVATING_TOKENS are all the same token

    Negative Logits
    Token           Logit
    "Did"           0.52
    " Did"          0.49
    "Do"            0.49
    " Do"           0.47
    "Does"          0.46
    "에서는"        0.45
    "에서도"        0.45
    "로는"          0.45
    "では"          0.44
    " Does"         0.44
    Positive Logits
    Token           Logit
    " constitutes"  1.19
    " happens"      1.09
    " kind"         1.05
    " happened"     1.05
    " transpired"   0.94
    " motivates"    0.92
    " kinds"        0.82
    " resonates"    0.81
    " constituye"   0.80
    " excites"      0.79
    Act Density 0.268%

    No Known Activations