INDEX
Explanations
phrases related to self-related concepts or actions
references to self-identity or self-related concepts
Negative Logits
Token      Logit
IUM        -0.75
ICAN       -0.71
nis        -0.69
ONT        -0.67
pheus      -0.62
ondo       -0.62
Nights     -0.61
dayName    -0.61
andum      -0.60
oS         -0.60
Positive Logits
Token        Logit
lessly       0.97
same         0.95
esteem       0.94
destruct     0.93
destruct     0.93
-            0.93
ridges       0.90
explanatory  0.88
proclaimed   0.86
less         0.83
Activations Density 0.032%