Explanations
phrases indicating contrasting viewpoints
phrases contrasting with societal expectations or norms
Negative Logits
Token         Logit
kamp          -0.75
velt          -0.72
stru          -0.69
age           -0.68
ixel          -0.67
pires         -0.65
FAQ           -0.64
eur           -0.62
aspers        -0.62
Age           -0.61
Positive Logits
Token         Logit
necessarily    1.50
bothering      1.00
epad           0.95
withstanding   0.95
icably         0.93
eworthy        0.88
unlike         0.84
merely         0.82
ifying         0.80
ific           0.78
Activations Density: 0.068%