Explanations
references to acts of violence or destruction
Negative Logits
Fou           -0.17
ocr           -0.15
Fut           -0.15
Flood         -0.15
Foo           -0.15
Fog           -0.14
Flush         -0.14
人民共和国    -0.14
Fauc          -0.14
aida          -0.14
Positive Logits
fire          0.74
fire          0.57
-fire         0.56
Fire          0.54
fires         0.52
Fire          0.51
.fire         0.49
_fire         0.48
火            0.48
FIRE          0.47
Activations Density: 0.074%