INDEX
Explanations
references to emotional and harsh descriptors
Negative Logits
Token    Logit
antan    -0.15
ickers   -0.15
elf      -0.15
iegel    -0.14
ík       -0.14
orman    -0.14
agas     -0.14
шила     -0.13
ella     -0.13
ита      -0.13
Positive Logits
Token    Logit
ksam     0.15
eru      0.14
rox      0.14
ucz      0.14
oux      0.14
Pont     0.14
BRO      0.14
etxt     0.14
è¥       0.13
ativ     0.13
Activations Density 0.023%