INDEX
Explanations
questions or statements involving knowledge or information
references to guessing and knowledge-based questioning
Negative Logits
Token      Logit
Contents   -0.76
egu        -0.73
UTC        -0.71
Dialogue   -0.69
imm        -0.68
itals      -0.68
egal       -0.68
enge       -0.68
roman      -0.67
²¾         -0.67
Positive Logits
Token      Logit
Oops       0.80
Pledge     0.74
Want       0.74
Won        0.74
Bike       0.71
Spice      0.70
naughty    0.70
kidding    0.70
Bastard    0.69
Didn       0.68
Activations Density: 0.319%