Explanations
previous steps or hidden states
Negative Logits
Token        Value
Ná           0.42
começo       0.41
違う         0.40
lateribus    0.38
重複         0.37
leves        0.37
люс          0.37
ρχ           0.36
Burj         0.36
Wade         0.36
Positive Logits
Token          Value
Implications   0.38
wildfires      0.38
aiian          0.38
anal           0.38
inspired       0.38
qualche        0.38
ム             0.37
переда         0.37
informed       0.36
결             0.36
Activations Density: 0.007%