Explanations
referring to entities and actions
Negative Logits
h        0.31
The      0.28
They     0.28
[blank]  0.27
They     0.27
den      0.26
stabil   0.26
It       0.25
cd       0.25
I        0.25
Positive Logits
ל        0.30
for      0.29
começ    0.29
ную      0.29
μα       0.28
تي       0.28
ید       0.28
льне     0.27
ால்      0.27
م        0.27
Activations Density: 0.085%