INDEX
Explanations
texts related to respect and social norms
words related to perspective or viewpoints
Negative Logits
pmwiki     -0.73
cedes      -0.70
udeb       -0.69
educ       -0.68
puting     -0.66
fml        -0.65
rama       -0.64
anyl       -0.64
driving    -0.63
izzard     -0.63
Positive Logits
pect       1.09
Ratio      0.94
orate      0.94
terness    0.93
rons       0.88
pects      0.88
lihood     0.85
olate      0.81
eur        0.79
anamo      0.78
Activations Density: 0.006%