INDEX
Explanations
politically related terms, particularly focusing on party affiliations
references to political parties and gender
Negative Logits
mun          -0.76
Stard        -0.67
adobe        -0.63
ANN          -0.61
RELE         -0.61
WARN         -0.60
Producer     -0.58
abc          -0.57
atana        -0.57
andestine    -0.57
Positive Logits
counterpart     0.88
counterparts    0.85
equivalents     0.76
itto            0.68
versions        0.66
versa           0.65
flakes          0.65
д               0.65
captivity       0.64
ngth            0.64
Activations Density 0.325%