INDEX
Explanations
phrases related to causality and explanation
expressions indicating uncertainty or speculation
Negative Logits
ukong    -0.74
yna      -0.68
Goat     -0.63
uggle    -0.62
mop      -0.61
Ping     -0.61
Oro      -0.60
unks     -0.59
kie      -0.59
76561    -0.59
Positive Logits
nevertheless    1.72
nonetheless     1.58
etheless        1.18
still           0.93
still           0.93
theless         0.83
remains         0.74
undeniably      0.73
retained        0.71
])              0.69
Activations Density: 0.300%