INDEX
Explanations
instances of specific items or groups within larger categories
the word "which" in various contexts
Negative Logits
let        -0.72
nor        -0.70
politics   -0.66
hat        -0.63
LET        -0.61
fitting    -0.60
Binding    -0.59
quote      -0.59
-+         -0.59
dl         -0.59
Positive Logits
originated    0.91
akespeare     0.82
lasted        0.81
consisted     0.79
consists      0.75
specialize    0.75
are           0.74
resulted      0.74
survives      0.73
contributed   0.73
Activations Density: 0.023%