INDEX
Explanations
kena, suay, hacked, getroffen
Negative Logits
Hearst       0.38
unhealthy    0.37
emergent     0.37
عبير         0.37
чества       0.36
শির          0.36
ịnh          0.36
честве       0.35
൩            0.35
다운         0.34
Positive Logits
terken       0.69
kena         0.61
getroffen    0.56
pata         0.55
hab          0.54
robbed       0.50
puk          0.49
trampled     0.49
hab          0.48
geb          0.48
Activations Density 0.001%