INDEX
Explanations
language related to arguments or inconsistencies in reasoning
Negative Logits
ÁCT        -0.53
ennen      -0.46
Pender     -0.44
statt      -0.44
مث         -0.41
års        -0.41
orex       -0.40
Зак        -0.40
glabrous   -0.40
مرئيه      -0.40
Positive Logits
aside      1.86
side       1.66
Side       1.52
aside      1.50
Side       1.50
side       1.45
SIDE       1.37
SIDE       1.33
Aside      1.32
sides      1.31
Activations Density 0.128%