INDEX
Explanations
references to violence and physical harm
Negative Logits
Token        Logit
_handlers    -0.15
issor        -0.14
atica        -0.14
æĹıèĩªæ²»    -0.14
ÑĢаÑģÑĤа     -0.14
433          -0.14
nech         -0.14
esson        -0.14
ebek         -0.13
ouched       -0.13
Positive Logits
Token        Logit
silly        0.31
sense        0.30
clean        0.25
rotten       0.24
beyond       0.24
worse        0.23
sav          0.22
bad          0.22
dry          0.22
flat         0.22
Activations Density: 0.227%