INDEX
Explanations
actions or stances taken by individuals related to ethical or controversial topics
words and phrases related to ethical choices and societal issues
Negative Logits
Bellev        -0.66
amina         -0.64
waters        -0.63
Ramos         -0.61
Ambrose       -0.60
isters        -0.59
resa          -0.59
verbs         -0.57
enthal        -0.56
throb         -0.56
Positive Logits
depending     1.17
depending     1.17
thereof       1.06
alike         0.90
versa         0.87
SPONSORED     0.82
whichever     0.81
atever        0.80
anywhere      0.79
respectively  0.78
Activations Density: 0.320%