INDEX
Explanations
terms related to unethical or exploitative behavior
terms related to profanity and unethical behavior
Negative Logits
Token    Logit
empty    -0.77
warm     -0.75
wolves   -0.75
20439    -0.74
forth    -0.72
ment     -0.68
MENTS    -0.67
WAY      -0.67
WAYS     -0.66
DAY      -0.65
Positive Logits
Token     Logit
prof      1.41
mathemat  1.02
thous     0.95
luent     0.92
licted    0.89
eatures   0.88
predec    0.87
inances   0.85
essor     0.84
concess   0.83
Activations Density: 0.006%