INDEX
Explanations
quotes or phrases in a question-response format
question marks and conversational cues indicating uncertainty or requests for confirmation
Negative Logits
multiplied  -0.78
fanc        -0.77
fleeing     -0.77
trave       -0.75
chants      -0.75
harassing   -0.75
pict        -0.74
stray       -0.74
migr        -0.74
forgotten   -0.73
Positive Logits
JM      1.46
JB      1.41
Answer  1.37
JV      1.36
RH      1.32
EH      1.32
MH      1.29
JP      1.26
DW      1.26
JS      1.25
Activations Density 0.069%