INDEX
Explanations
phrases indicating contrast or emphasis
phrases that express limitation or negation
Negative Logits
osion      -0.72
cies       -0.69
insula     -0.65
VIDEOS     -0.62
ctors      -0.60
Increases  -0.59
ortmund    -0.59
Blaz       -0.58
soType     -0.58
UW         -0.57
Positive Logits
forth        0.92
employed     0.74
been         0.74
considered   0.73
confined     0.71
entertained  0.70
immortal     0.70
ãĥĩãĤ£       0.68
agree        0.68
always       0.67
Activations Density: 0.197%