Explanations
people and their associated actions
Negative Logits
Token            Value
.”               0.16
。”              0.15
Contains         0.15
?.               0.15
decomposition    0.15
.}               0.14
(blank)          0.14
Charging         0.14
."               0.14
().              0.14
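Positive Logits
Token            Value
have             0.21
recognize        0.18
spend            0.17
perceive         0.17
spends           0.16
들은             0.16    (Korean: plural marker 들 + topic particle 은)
たちは           0.16    (Japanese: plural suffix たち + topic particle は)
engage           0.16
would            0.15
们               0.15    (Chinese: plural suffix for people)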
Activations Density 0.167%