Explanations
phrases related to authority figures and social complaints
Negative Logits
obot          -0.15
apper         -0.14
majority      -0.14
InProgress    -0.14
kvin          -0.14
ód            -0.14
dogs          -0.14
TestCase      -0.14
adesh         -0.14
466           -0.13
Positive Logits
282           0.15
ussen         0.14
cka           0.14
imu           0.14
Morales       0.14
ượng          0.14
*@            0.14
урн           0.14
uentes        0.14
ikal          0.13
Activations Density 0.491%