Explanations
movie titles starting with these words
Negative Logits
Token    Value
in       0.66
ia       0.63
ي        0.63
م        0.61
et       0.58
and      0.54
ik       0.54
ق        0.54
ل        0.53
ان       0.52
Positive Logits
Token    Value
misog    0.50
лор      0.49
musul    0.48
때는     0.48
तीन      0.48
трех     0.47
лардын   0.47
而在     0.46
figur    0.46
жка      0.46
Activations Density: 0.018%