INDEX
Explanations
negations or contradictions
negations or phrases that indicate a lack of something
Negative Logits
Token       Logit
spor        -0.66
indirectly  -0.65
tein        -0.64
creations   -0.62
arts        -0.61
Js          -0.58
rotated     -0.57
oriented    -0.57
towed       -0.56
jointly     -0.56
Positive Logits
Token       Logit
unanim      0.93
hin         0.87
enough      0.86
enough      0.81
xus         0.81
room        0.80
ibaba       0.79
Enough      0.77
ANY         0.76
anymore     0.73
Activations Density 0.070%
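
As a rough illustration of how a density figure like the one above could be read, the sketch below treats activation density as the fraction of token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, the `threshold` parameter, and the sample data are all assumptions made for illustration; the page itself does not say how the number was computed.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds threshold.

    Assumption: "density" means the share of token positions on which
    the feature fires at all. The source page does not define it.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 7 of which activate the feature.
sample = [0.0] * 9993 + [1.2, 0.4, 3.1, 0.9, 2.2, 0.1, 0.7]

density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.070%
```

Under this reading, a density of 0.070% would mean the feature is active on roughly 7 out of every 10,000 tokens, which is typical of a sparse feature.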