INDEX
    Explanations

    phrases related to topics of debate and controversy

    Negative Logits
    367
    -0.14
    tul
    -0.14
    whatever
    -0.14
    oi
    -0.13
    ội
    -0.13
    ERM
    -0.12
    ANNEL
    -0.12
    inders
    -0.12
     helf
    -0.12
     아니
    -0.12
    Positive Logits
     how
    0.40
    如何
    0.28
    how
    0.28
     whether
    0.27
     why
    0.27
     cómo
    0.24
     HOW
    0.19
    是否
    0.19
    -how
    0.19
    whether
    0.18
    Act Density 0.147%

    No Known Activations