Explanations
importance followed by clauses
NEGATIVE LOGITS
Detail       0.37
southern     0.36
"></         0.35
brunes       0.34
GD           0.33
нут          0.32
并将         0.32
Sav          0.32
vět          0.32
testaceis    0.31
POSITIVE LOGITS
bahwa     0.79
أن        0.77
να        0.68
أنه       0.65
that      0.65
ότι       0.64
dass      0.62
että      0.61
rằng      0.61
bahawa    0.59
Activations Density 0.022%