Explanations
bad followed by descriptions
Negative Logits
Token    Value
ال    1.58
માં    1.48
の    1.47
ın    1.45
의    1.45
Ма    1.34
ন    1.27
ر    1.24
0    1.23
ها    1.22
Positive Logits
Token    Value
on    1.36
c    1.14
bad    1.02
y    1.02
h    1.00
২    0.95
t    0.95
ontal    0.91
𝟮    0.90
bad    0.89
Activations Density: 0.022%