INDEX
    Explanations

    leading or instructive phrases

    Negative Logits
    Token             Logit
    "Khan"            0.49
    "João"            0.47
    "THERE"           0.46
    " Referenced"     0.45
    " nhớ"            0.44
    "Wrong"           0.44
    "worst"           0.44
    "Wikipedia"       0.43
    "angi"            0.43
    "Worst"           0.43
    Positive Logits
    Token             Logit
    " spit"           0.45
    " spits"          0.42
    " lac"            0.42
    " pi"             0.41
    " illustrative"   0.40
    " Bol"            0.40
    " boulevard"      0.39
    " instructive"    0.39
    " ramps"          0.38
    " pitching"       0.38
    Activation Density: 0.001%

    No Known Activations