INDEX
Explanations
possessive markers
prepositional phrase indicators
Negative Logits
a       0.61
ו       0.60
ów      0.57
erson   0.56
was     0.54
ids     0.54
art     0.53
V       0.52
was     0.51
verts   0.51
Positive Logits
in          0.77
ين          0.68
daki        0.52
인해        0.52
ని          0.50
larından    0.50
ల           0.49
larını      0.49
ྛ           0.48
في          0.48
Activations Density 2.344%