    Explanations

    explanations or discussions about moral and ethical dilemmas

    Negative Logits
    "งหมด"         -0.15
    " огра"        -0.15
    "ãİ" (incomplete UTF-8 sequence)   -0.13
    "だって"       -0.13
    "riad"         -0.13
    "?,?,?,?,"     -0.13
    "اگ"           -0.13
    " иму"         -0.12
    "aeda"         -0.12
    "ghest"        -0.12
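Dashboards like this one often render token strings in the byte-level alphabet used by GPT-2-style BPE tokenizers, which turns non-Latin text into sequences such as "огÑĢа" or "ãģłãģ£ãģ¦". The page does not name its model or tokenizer, so the sketch below assumes that encoding. It rebuilds GPT-2's byte-to-character table and maps a displayed token back to UTF-8 text. The helper names gpt2_bytes_to_unicode, CHAR_TO_BYTE and decode_token are hypothetical, chosen for illustration.

```python
def gpt2_bytes_to_unicode() -> dict[int, str]:
    """Rebuild GPT-2's table mapping each byte 0-255 to a printable character."""
    # Bytes that are already printable map to the same code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    # Every other byte (space, control characters, ...) is shifted to U+0100 and up,
    # in ascending byte order.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


# Inverse lookup: displayed character -> original byte.
CHAR_TO_BYTE = {ch: b for b, ch in gpt2_bytes_to_unicode().items()}


def decode_token(shown: str) -> str:
    """Heuristically recover the text of a token rendered in the byte alphabet.

    Characters in the byte alphabet are mapped back to their byte. Anything
    else (text a viewer already decoded, such as the Cyrillic letters in
    " огÑĢа") is kept by re-encoding it as UTF-8. Incomplete multi-byte
    sequences become U+FFFD.
    """
    raw = bytearray()
    for ch in shown:
        if ch in CHAR_TO_BYTE:
            raw.append(CHAR_TO_BYTE[ch])
        else:
            raw.extend(ch.encode("utf-8"))
    return raw.decode("utf-8", errors="replace")


for token in ["à¸ĩหมà¸Ķ", " огÑĢа", "ãģłãģ£ãģ¦", " имÑĥ", "ãİ"]:
    print(repr(token), "->", repr(decode_token(token)))
```

Because the page mixes raw byte characters with already-decoded text, this is only a heuristic. A genuinely decoded Latin-1 character such as "é" would be misread as a single byte. "ãİ" is the bytes E3 8E, which begin a three-byte UTF-8 character but stop short, so it cannot be shown as text on its own.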
    Positive Logits
    " both"        1.43
    "both"         1.32
    " Both"        1.24
    " BOTH"        1.23
    "Both"         1.20
    "_both"        1.02
    " beide"       0.99
    " ambos"       0.98
    " neither"     0.84
    " obou"        0.84
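The page does not say how the two logit lists were produced. A common approach for sparse-autoencoder features is to project the feature's decoder direction through the model's unembedding matrix and read off the most positive and most negative entries. The sketch below assumes that method. Whether a real dashboard also folds in the final layer norm is not stated here. The function name, argument layout and toy numbers are hypothetical.

```python
from typing import Sequence


def feature_logit_effects(
    w_dec: Sequence[float],          # one feature's decoder direction, length d_model
    w_u: Sequence[Sequence[float]],  # unembedding: d_model rows x vocab columns
    vocab: Sequence[str],
    k: int = 10,
) -> tuple[list[tuple[str, float]], list[tuple[str, float]]]:
    """Return the k most positive and the k most negative (token, logit) pairs."""
    n_vocab = len(vocab)
    # Direct logit effect on token t: sum_i w_dec[i] * w_u[i][t].
    logits = [
        sum(w_dec[i] * w_u[i][t] for i in range(len(w_dec)))
        for t in range(n_vocab)
    ]
    order = sorted(range(n_vocab), key=lambda t: logits[t])  # ascending
    negative = [(vocab[t], logits[t]) for t in order[:k]]
    positive = [(vocab[t], logits[t]) for t in order[::-1][:k]]
    return positive, negative


# Toy example with d_model = 2 and a four-token vocabulary (made-up values).
vocab = [" both", " neither", "riad", "ghest"]
w_dec = [1.0, 0.5]
w_u = [[1.0, 0.5, -0.1, 0.0],
       [0.2, 0.6, 0.0, -0.3]]
positive, negative = feature_logit_effects(w_dec, w_u, vocab, k=2)
print(positive)
print(negative)
```

Reversing the sorted order before slicing, rather than using order[-k:], keeps k = 0 from returning the whole vocabulary.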
    Activation density: 1.899%
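Activation density is usually reported as the fraction of token positions, over some evaluation corpus, on which the feature's activation is nonzero. At 1.899%, the feature fires on roughly 19 of every 1,000 tokens. The page does not state its corpus or threshold, so the zero threshold and the helper name activation_density below are assumptions.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation exceeds `threshold`."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations given")
    return active / total


# Five token positions, two of which fire.
print(activation_density([0.0, 0.3, 0.0, 0.0, 1.2]))
```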

    No Known Activations