INDEX
    Explanations

    references to acts of violence or destruction

    Negative Logits
     Fou
    -0.17
    ocr
    -0.15
     Fut
    -0.15
     Flood
    -0.15
    Foo
    -0.15
     Fog
    -0.14
    Flush
    -0.14
    人民共和国
    -0.14
     Fauc
    -0.14
    aida
    -0.14
    Positive Logits
     fire
    0.74
    fire
    0.57
    -fire
    0.56
     Fire
    0.54
     fires
    0.52
    Fire
    0.51
    .fire
    0.49
    _fire
    0.48
    火
    0.48
     FIRE
    0.47
    Act Density 0.074%

    No Known Activations