INDEX
    Explanations

    mentions of controversial topics or discussions surrounding moral or ethical dilemmas

    Negative Logits
    Token             Logit
    "malink"          -0.18
    "elib"            -0.16
    "Actually"        -0.15
    "actually"        -0.15
    " надо"           -0.14
    "Asked"           -0.14
    "too"             -0.14
    "oka"             -0.14
    " потом"          -0.14
    " Actually"       -0.13

    Positive Logits
    Token             Logit
    " According"      0.23
    " according"      0.22
    " Although"       0.22
    " While"          0.21
    " Though"         0.21
    "According"       0.21
    " Due"            0.21
    "Furthermore"     0.20
    " although"       0.20
    " due"            0.20
    Act Density 0.264%

    No Known Activations