INDEX
Explanations
references to royalty or princesses
mentions of the word "Princess."
Negative Logits
Token    Logit
grate    -0.67
sych     -0.66
neur     -0.66
oker     -0.65
smoker   -0.65
spaced   -0.64
funn     -0.63
OUT      -0.62
appa     -0.62
ulhu     -0.62
Positive Logits
Token     Logit
Leia      1.10
Princess  1.01
Diana     1.01
Bride     1.00
Celest    0.99
anova     0.93
Peach     0.87
princess  0.86
Fiona     0.83
cess      0.82
Activations Density 0.019%