Explanations
phrases related to self-identity and self-awareness
references to identity and self-perception
Negative Logits
Token         Logit
Refresh       -0.69
uish          -0.69
refresh       -0.66
heny          -0.65
teness        -0.65
Highlights    -0.63
idates        -0.62
Dism          -0.61
stray         -0.61
Kills         -0.59
Positive Logits
Token           Logit
supposed        0.97
gonna           0.86
destined        0.85
going           0.83
doing           0.82
able            0.81
happening       0.78
presented       0.77
weakest         0.76
experiencing    0.76
Activations Density: 0.157%