Explanations
instances of the word "acting."
Negative Logits
Token     Logit
↵↵        -0.70
[blank]   -0.67
<eos>     -0.61
↵         -0.59
The       -0.56
.         -0.54
,         -0.53
And       -0.53
(         -0.52
[blank]   -0.51
Positive Logits
Token     Logit
acted     1.68
acting    1.64
act       1.57
acting    1.51
Acting    1.50
cted      1.50
acts      1.46
Acting    1.43
act       1.40
ACT       1.39
Activations Density 0.073%
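
As an aid to reading the density figure: activation density is commonly reported as the fraction of token positions on which a feature's activation is nonzero, so 0.073% would correspond to roughly 7 active positions per 10,000 tokens. The sketch below assumes that definition, since the page does not state how the figure was computed. The function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and used only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: density means "share of positions where the feature fires";
    the dashboard may use a different definition or sample.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical toy data: 10,000 token positions, 7 of which are active.
toy = [0.0] * 9993 + [1.2, 0.4, 3.1, 0.9, 2.2, 0.1, 5.0]
density = activation_density(toy)

# Format as a percentage with three decimals, matching the dashboard's style.
print(f"{density:.3%}")
```

A density this low means the feature is sparse, which is consistent with it being tied to a narrow family of tokens such as the "act" variants listed under Positive Logits.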