INDEX
Explanations
political references and themes
Negative Logits
Token    Logit
ury      -0.18
eus      -0.17
mind     -0.16
ected    -0.15
eum      -0.15
uded     -0.14
Ston     -0.14
oğ       -0.14
eil      -0.14
uali     -0.14
Positive Logits
Token    Logit
ere      0.22
heits    0.22
heid     0.20
heit     0.18
hei      0.18
es       0.17
este     0.16
erer     0.16
orient   0.16
olan     0.16
Activations Density 0.049%
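
For context on the density figure, here is a minimal sketch of how such a number is commonly computed, assuming it means the fraction of token positions on which the feature's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the sample values are all hypothetical and are not taken from this page or from any particular tool.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature fires.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical activations over 8 token positions (illustrative only).
sample = [0.0, 0.0, 0.0, 1.2, 0.0, 0.0, 0.0, 0.0]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.049% would mean the feature fires on roughly 1 in 2,000 token positions.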