Explanations
mentions of how people are treated by others
references to the concept of treatment or being treated in various contexts
Negative Logits
Token       Logit
aer         -0.72
azi         -0.71
audi        -0.68
sky         -0.63
direction   -0.61
sign        -0.60
Origin      -0.60
adra        -0.59
Rae         -0.58
activated   -0.58
Positive Logits
Token       Logit
ttes        0.87
ricular     0.83
treated     0.79
reatment    0.78
terson      0.76
iments      0.75
ivated      0.75
illance     0.74
pione       0.73
htaking     0.72
Activations Density: 0.019%