Explanations
phrases questioning or challenging societal norms or behaviors
phrases that express denial or rejection of certain actions or beliefs
Negative Logits
Vers        -0.67
uled        -0.65
verning     -0.64
Province    -0.59
Renew       -0.58
orsche      -0.56
Located     -0.56
iversary    -0.56
ouver       -0.56
entials     -0.56
Positive Logits
trolling      0.78
shaming       0.77
cynicism      0.76
misguided     0.75
frankly       0.74
honestly      0.74
instead       0.73
subconscious  0.72
goddamn       0.72
trolls        0.71
Activations Density: 1.377%