INDEX
Explanations
mentions of personal preferences or favorites
references to favorite things or preferences
Negative Logits
Token        Logit
redits       -0.73
ulative      -0.73
avis         -0.73
lam          -0.73
aping        -0.71
uid          -0.70
DEF          -0.70
proof        -0.70
usted        -0.70
compliance   -0.69
Positive Logits
Token        Logit
haun         1.01
haunt        0.91
beverage     0.87
hobby        0.86
tunes        0.86
hobbies      0.86
snack        0.83
underdog     0.81
childhood    0.81
meal         0.78
Activations Density: 0.059%