INDEX
    Explanations

    the specific phrase or phrases mentioned in the activation

    repeated mentions of phrases and their variations

    Negative Logits
    Token           Logit
    "ÄŁ"            -0.81
    "DERR"          -0.77
    " Emirates"     -0.72
    "llah"          -0.70
    " Thro"         -0.69
    " hemor"        -0.68
    "fman"          -0.67
    "Fal"           -0.66
    " Indies"       -0.65
    " Brotherhood"  -0.62
    Positive Logits
    Token           Logit
    "phrase"        1.06
    "ology"         1.03
    " phrases"      1.01
    " phrase"       0.91
    "witz"          0.89
    "terday"        0.84
    "stress"        0.82
    "mith"          0.81
    "atre"          0.78
    " uttered"      0.78
    Activation Density: 0.021%

    No Known Activations