INDEX
Explanations
words related to politics
Negative Logits
Token     Logit
uteur     -0.18
iyet      -0.17
ertas     -0.16
å¢ĥ       -0.15
ez        -0.15
ênh       -0.15
eson      -0.15
cit       -0.14
ungal     -0.14
ighet     -0.14
Positive Logits
Token         Logit
correct       0.23
Correct       0.21
incorrect     0.21
correct       0.21
icians        0.20
correctness   0.19
Incorrect     0.18
.correct      0.17
incorrect     0.17
ically        0.17
Activations Density: 0.007%