INDEX
Explanations
end of sentences with specific subsequent words
Negative Logits
t     0.87
h     0.80
d     0.74
k     0.71
f     0.70
c     0.64
n     0.63
ia    0.62
e     0.60
א     0.58
Positive Logits
؟     0.61
يد    0.56
۔     0.55
      0.54
)،    0.51
ým    0.50
ة     0.50
؟     0.50
До    0.50
؛     0.49
Activations Density 0.089%