INDEX
Explanations
phrases related to accusations and social discourse
Negative Logits
oral     -0.18
plit     -0.16
inh      -0.15
oren     -0.15
uta      -0.14
oker     -0.14
olas     -0.13
mand     -0.13
Finn     -0.13
zag      -0.13
Positive Logits
Berger   0.16
گان      0.15
argo     0.15
jest     0.15
ampaign  0.14
истра    0.14
ارت      0.14
addon    0.14
Khal     0.14
ereco    0.13
Activations Density 0.120%