INDEX
Explanations
phrases related to political ideology and criticism, particularly focusing on deception and manipulation of information
Negative Logits
zens                              -0.64
Sob                               -0.64
gra                               -0.62
rawdownloadcloneembedreportprint  -0.61
ivid                              -0.59
stown                             -0.59
chenko                            -0.58
Survey                            -0.58
abulary                           -0.58
pora                              -0.58
Positive Logits
innocuous                          0.87
innocence                          0.80
invincible                         0.79
UL                                 0.76
OPA                                0.72
benign                             0.65
harmless                           0.65
neutrality                         0.62
glamorous                          0.61
Cure                               0.61
Activations Density 16.267%