INDEX
Explanations
concepts or situations that evoke discomfort or unease
instances of discomfort or unease
Negative Logits
Token        Logit
ership       -1.07
ework        -0.95
ebook        -0.80
ribution     -0.78
runner       -0.77
successful   -0.77
uilding      -0.77
ivism        -0.73
lass         -0.73
enforcement  -0.73
Positive Logits
Token          Logit
uncomfortable  0.98
discomfort     0.88
adolesc        0.74
une            0.74
awkward        0.73
truths         0.72
NESS           0.72
tiss           0.70
Osw            0.68
nesses         0.68
Activations Density 0.025%