INDEX
Explanations
references to political actions and beliefs
Negative Logits
-               -0.59
infamous        -0.58
unsuccessfully  -0.56
secret          -0.56
nicknamed       -0.55
Downloadha      -0.54
irin            -0.53
unknown         -0.52
amon            -0.52
cheon           -0.52
Positive Logits
ASAP            0.85
transparency    0.83
accountability  0.83
accountable     0.79
unbiased        0.79
sooner          0.79
respect         0.79
honest          0.78
cknow           0.78
decency         0.77
Activations Density 0.963%