INDEX
    Explanations

    negative attributes or qualities

    expressions of negativity or criticism

    NEGATIVE LOGITS
    "ĸļ"            -0.80
    "arov"          -0.76
    "ynthesis"      -0.73
    "ovember"       -0.73
    "hens"          -0.73
    "ktop"          -0.72
    "illation"      -0.71
    "ellation"      -0.71
    "agos"          -0.70
    "Revolution"    -0.69
    POSITIVE LOGITS
    "dest"          0.98
    " karma"        0.88
    "enough"        0.80
    " bye"          0.80
    "dies"          0.78
    " vib"          0.77
    " Samar"        0.77
    " enough"       0.76
    " undermin"     0.75
    "die"           0.74
    Activation Density: 0.020%

    No Known Activations