INDEX
Explanations
sentences indicating understanding or comprehension
expressions of comprehension or understanding
Negative Logits
DOWN          -0.66
mage          -0.65
aunder        -0.64
rouse         -0.63
izons         -0.62
artney        -0.62
patch         -0.61
psychiat      -0.60
ield          -0.60
onding        -0.60
Positive Logits
myself        0.94
Citation      0.63
ANA           0.61
firsthand     0.60
Kahn          0.59
count         0.58
reiterate     0.58
poke          0.57
regrets       0.57
exaggeration  0.56
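The two lists above rank vocabulary tokens by how strongly this feature pushes their logits down or up. A common way sparse-autoencoder dashboards derive such lists is to project the feature's decoder direction through the model's unembedding matrix and take the most negative and most positive entries. This page does not state that it uses this method, so treat the following as a minimal sketch under that assumption. The helper `top_logit_tokens`, the names `w_dec_row` and `w_u`, and the random toy weights are all hypothetical stand-ins for a real model.

```python
import numpy as np

def top_logit_tokens(w_dec_row, w_u, vocab, k=10):
    """Rank tokens by one feature's direct logit effect (hypothetical helper).

    w_dec_row: (d_model,) decoder direction of the feature (assumed input)
    w_u:       (d_model, vocab_size) unembedding matrix (assumed input)
    vocab:     list of vocab_size token strings
    """
    # Dot the feature direction with every token's unembedding column.
    logit_effect = w_dec_row @ w_u          # shape: (vocab_size,)
    order = np.argsort(logit_effect)        # indices, most negative first
    negative = [(vocab[i], float(logit_effect[i])) for i in order[:k]]
    positive = [(vocab[i], float(logit_effect[i])) for i in order[::-1][:k]]
    return negative, positive

# Toy example: random weights stand in for a real model (assumption).
rng = np.random.default_rng(0)
d_model, vocab_size = 16, 50
vocab = [f"tok{i}" for i in range(vocab_size)]
w_dec_row = rng.normal(size=d_model)
w_u = rng.normal(size=(d_model, vocab_size))

neg, pos = top_logit_tokens(w_dec_row, w_u, vocab, k=3)
print("negative:", neg)
print("positive:", pos)
```

Real dashboards may also rescale or normalize before ranking; the sketch only shows the raw projection and sort.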
Activation Density: 0.371%
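Activation density is the share of sampled tokens on which the feature fires. At 0.371%, this feature is active on roughly 1 token in 270. The sketch below assumes that "fires" means a strictly positive activation, as with a ReLU-style sparse autoencoder. The helper `activation_density` and the toy activations are hypothetical and are not this feature's data.

```python
import numpy as np

def activation_density(acts, threshold=0.0):
    """Percentage of tokens whose activation exceeds `threshold` (hypothetical helper)."""
    acts = np.asarray(acts, dtype=float)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size

# Toy activations: 3 of 1000 tokens fire (assumption, not real data).
acts = np.zeros(1000)
acts[[10, 500, 999]] = [1.2, 0.4, 3.1]
print(f"{activation_density(acts):.3f}%")  # -> 0.300%
```

If a dashboard uses a nonzero firing threshold, pass it as `threshold`. The percentage then counts only activations above that value.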