INDEX
    Explanations
    Negative Logits
    Token        Logit
    "angan"      -0.76
    "NAS"        -0.73
    "arters"     -0.72
    "eways"      -0.71
    "claimed"    -0.71
    "ursed"      -0.71
    "arta"       -0.70
    "office"     -0.70
    "ritic"      -0.69
    "unal"       -0.68
    Positive Logits
    Token        Logit
    " wolves"    1.41
    " wolf"      1.19
    "wolves"     1.11
    " Wolves"    1.07
    " Fenrir"    1.01
    "wolf"       0.95
    "hound"      0.94
    "gang"       0.93
    " Wolf"      0.90
    "enstein"    0.90
    Activation Density: 0.010%

    No Known Activations