INDEX
Explanations
terms related to physical struggle or conflict
Negative Logits
ç£        -0.16
wayne     -0.15
.Label    -0.15
(Layout   -0.15
imizi     -0.14
argent    -0.14
imizin    -0.14
POSITE    -0.13
533       -0.13
          -0.13
Positive Logits
led       0.70
les       0.63
ling      0.61
le        0.57
ler       0.55
lers      0.50
li        0.50
let       0.49
lo        0.48
la        0.47
Activations Density 0.184%
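
The density figure can be read as the share of token positions on which the feature fires at all. The sketch below is a minimal illustration under that assumption; the page itself does not state how the number is computed. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature is
    active (activation > threshold). This is an illustrative definition, not
    one confirmed by the source page.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical activations for one feature over 10,000 token positions:
# zero almost everywhere, positive on every 1,000th position.
sample = [0.0] * 10_000
for i in range(0, 10_000, 1000):
    sample[i] = 1.5

density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.184% would mean the feature is active on roughly 18 of every 10,000 token positions.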