INDEX
Explanations
Copilot, sales pitches, internet knowledge
Negative Logits
ان  0.53
ル  0.49
인  0.47
で  0.46
expensive  0.46
hundred  0.45
拶  0.45
ک  0.45
ப்பில்  0.44
λ  0.44
Positive Logits
odio  0.51
x  0.49
velké  0.49
aquello  0.48
getir  0.47
lector  0.47
feitos  0.47
diversas  0.46
ix  0.46
queso  0.46
Activations Density: 0.007%