INDEX
Explanations
phrases related to the concept of self
words and phrases indicating self-reference or self-descriptions
Negative Logits
Token    Logit
IUM      -0.78
ICAN     -0.78
Ashe     -0.77
XIII     -0.70
ONT      -0.69
etter    -0.68
rium     -0.68
oice     -0.65
oric     -0.65
IENCE    -0.64
Positive Logits
Token          Logit
destruct       1.11
lessly         1.05
-              1.01
same           1.01
destruct       0.93
explanatory    0.93
proclaimed     0.92
ridges         0.89
âĢij            0.88
less           0.86
Activations Density 0.016%