INDEX
Explanations
words related to self-identity and self-description
Negative Logits
Token        Logit
Nights       -0.79
Ashe         -0.77
XIII         -0.74
GOODMAN      -0.74
ī            -0.70
Syndicate    -0.69
IUM          -0.68
Rouge        -0.67
Slay         -0.67
LAT          -0.66
Positive Logits
Token        Logit
destruct     1.30
destruct     1.00
conscious    1.00
ridges       0.97
same         0.94
explanatory  0.92
lessly       0.91
upload       0.89
esteem       0.84
pecially     0.81
Activations Density: 0.041%