INDEX
Explanations
mentions or references to political figures and events
references to political events and figures
Negative Logits
Sund      -0.58
reek      -0.53
Hawks     -0.53
Els       -0.51
ãĥ³ãĤ¸    -0.49
tho       -0.49
Nar       -0.49
Samar     -0.48
pac       -0.48
oy        -0.48
Positive Logits
.''.      1.00
*.        0.89
ãĢĤ       0.89
.''       0.80
.         0.80
.).       0.79
.*        0.78
.(        0.76
.�        0.75
attRot    0.73
Activations Density 1.260%