INDEX
Explanations
what
Method used: 1
Reason: MAX_ACTIVATING_TOKENS are all the same token
Negative Logits
Did       0.52
Did       0.49
Do        0.49
Do        0.47
Does      0.46
에서는    0.45
에서도    0.45
로는      0.45
では      0.44
Does      0.44
Positive Logits
constitutes   1.19
happens       1.09
kind          1.05
happened      1.05
transpired    0.94
motivates     0.92
kinds         0.82
resonates     0.81
constituye    0.80
excites       0.79
Activations Density: 0.268%