INDEX
Explanations
phrases indicating outcomes or conclusions
Negative Logits
Token        Logit
259          -0.15
emas         -0.15
../../../    -0.15
iras         -0.14
iveau        -0.14
rikes        -0.14
inz          -0.14
DNA          -0.14
owell        -0.14
odd          -0.14
Positive Logits
Token        Logit
stile        0.28
pike         0.26
tables       0.22
coat         0.19
-around      0.18
heads        0.18
into         0.18
šek          0.18
ipse         0.18
table        0.18
Activations Density: 0.041%