INDEX
Explanations
phrases related to self-definition and self-description
expressions of opinion about individuals
Negative Logits
teness     -0.80
Dism       -0.65
shock      -0.65
Kot        -0.63
Extend     -0.63
Indust     -0.60
win        -0.60
irm        -0.59
Awakens    -0.59
Ship       -0.59
Positive Logits
supposed       0.86
Interstitial   0.82
happening      0.81
gonna          0.77
nt             0.76
going          0.75
akin           0.74
destined       0.74
omorphic       0.73
alian          0.71
Activations Density: 0.203%