INDEX
Explanations
phrases indicating consequence or logical reasoning
conjunctions that imply causality
Negative Logits
Rumble       -0.69
Defenders    -0.69
Feld         -0.66
Fram         -0.65
MM           -0.64
Tacoma       -0.63
MM           -0.62
nurs         -0.60
Twist        -0.60
straw        -0.59
Positive Logits
forth        1.09
facto        0.79
ĵĺ           0.75
ettings      0.75
manuel       0.74
uration      0.73
otent        0.72
ptions       0.70
akings       0.70
xual         0.70
Activations Density 0.018%