INDEX
Explanations
descriptions or accounts of experiences
instances of the word "described."
Negative Logits
cot           -0.66
aghetti       -0.66
alos          -0.64
iasm          -0.64
Bus           -0.64
think         -0.63
ificial       -0.63
ammy          -0.62
ffic          -0.59
isdom         -0.59
Positive Logits
descriptions  0.78
urated        0.77
symptoms      0.76
markings      0.71
uron          0.69
details       0.69
REDACTED      0.69
urally        0.69
aloud         0.65
ribing        0.65
Activations Density: 0.029%