INDEX
    Explanations

    expressions of skepticism or criticism towards authority figures or systems

    Negative Logits
    Token       Logit
    "485"       -0.16
    "raft"      -0.15
    "agon"      -0.15
    "ÃĸL"       -0.15
    ".cgi"      -0.14
    "abay"      -0.14
    "御"        -0.14
    "iar"       -0.14
    "igne"      -0.14
    "INI"       -0.14
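
    Positive Logits
    Token       Logit
    " when"     0.24
    "when"      0.23
    " When"     0.19
    " khi"      0.19
    " cuando"   0.19
    " quando"   0.18
    "When"      0.18
    "\twhen"    0.18
    " WHEN"     0.18
    "adian"     0.17
    (Tokens are quoted so that leading spaces are visible; \t marks a leading tab.)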
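
    The two lists above are the tokens whose logits this feature pushes down most and up most. A minimal sketch of that ranking step, assuming each vocabulary token already has a single logit-effect score (how those scores are computed is not stated here). The helper name top_logits is hypothetical, and the example dict is a small subset of the values listed above.

```python
import heapq

def top_logits(effects, k=10):
    """Return the k most negative and k most positive (token, logit) pairs.

    Assumes `effects` maps each token string to one logit-effect score.
    """
    # nsmallest returns pairs in ascending order (most negative first).
    negative = heapq.nsmallest(k, effects.items(), key=lambda kv: kv[1])
    # nlargest returns pairs in descending order (most positive first).
    positive = heapq.nlargest(k, effects.items(), key=lambda kv: kv[1])
    return negative, positive

# Hypothetical subset of per-token scores, copied from the lists above.
effects = {" when": 0.24, "when": 0.23, " When": 0.19,
           "485": -0.16, "raft": -0.15, ".cgi": -0.14}
neg, pos = top_logits(effects, k=2)
print(neg)  # [('485', -0.16), ('raft', -0.15)]
print(pos)  # [(' when', 0.24), ('when', 0.23)]
```

    Using heapq avoids sorting the full vocabulary when only the top few entries are needed.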
    Activation Density: 0.159%
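
    A rough reading of this figure, assuming activation density means the fraction of sampled tokens on which the feature's activation is above zero (the page does not define it): 0.159% is 0.00159, or about 1.6 tokens per 1,000, roughly 1 token in 629. The sketch below computes such a density; the function name and the zero threshold are assumptions for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumes density is defined as (active tokens) / (all tokens);
    the threshold of 0.0 is a hypothetical default.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)

print(activation_density([0.0, 1.2, 0.0, 0.0]))  # 0.25
print(round(1 / 0.00159))  # 629, i.e. about 1 active token in 629
```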

    No Known Activations