INDEX
Explanations
examples or instances of behaviors or characteristics
instances of examples being cited in various contexts
Negative Logits
Enlarge     -0.74
hunt        -0.72
ettes       -0.70
emies       -0.68
forts       -0.67
EEP         -0.65
task        -0.65
querade     -0.65
agues       -0.65
culosis     -0.64
Positive Logits
how          1.37
why          1.29
what         0.97
hypocrisy    0.94
unintended   0.89
why          0.88
how          0.87
lazy         0.87
WHY          0.83
misplaced    0.82
Activations Density 0.115%