Explanations
references to specific groups or categories of people
Negative Logits
Token       Logit
Suom        -0.57
Puig        -0.56
Wilber      -0.55
Gedichte    -0.55
میل         -0.53
Isma        -0.51
pytanie     -0.50
Lancelot    -0.50
Änder       -0.49
Jum         -0.49
Positive Logits
Token       Logit
those       1.31
Those       1.25
those       1.19
Those       1.19
THOSE       1.13
these       1.00
pesky       1.00
%")         0.94
चीज़ों      0.93
Những       0.91
Activations Density: 0.031%