INDEX
    Explanations

    references to violence and physical harm

    Negative Logits
    "_handlers"   -0.15
    "issor"       -0.14
    "atica"       -0.14
    "族自治"       -0.14
    "раста"       -0.14
    "433"         -0.14
    "nech"        -0.14
    "esson"       -0.14
    "ebek"        -0.13
    "ouched"      -0.13
    Positive Logits
    " silly"      0.31
    " sense"      0.30
    " clean"      0.25
    " rotten"     0.24
    " beyond"     0.24
    " worse"      0.23
    " sav"        0.22
    " bad"        0.22
    " dry"        0.22
    " flat"       0.22
    Act Density 0.227%

    No Known Activations