INDEX
Explanations
words related to political entities or affiliations
terminologies related to representation and identity within specific groups or contexts
Negative Logits
Token      Logit
ahime      -0.84
icing      -0.70
sidx       -0.64
vent       -0.64
terness    -0.63
thood      -0.61
Canaver    -0.60
Kissinger  -0.60
theorem    -0.60
Verb       -0.58
Positive Logits
Token      Logit
ét         0.72
ild        0.66
erville    0.66
etts       0.65
emouth     0.62
urg        0.60
anded      0.60
iciary     0.60
olulu      0.59
AL         0.59
Activations Density: 0.123%