INDEX
Explanations
words related to intelligence and judgment (e.g., 'stupid', 'dumb', 'smart')
Negative Logits
Token        Logit
riott        -0.81
AUT          -0.79
accompan     -0.76
APH          -0.76
apers        -0.75
ILA          -0.74
orthy        -0.73
Reviewed     -0.70
OHN          -0.68
20439        -0.66
Positive Logits
Token        Logit
founded      1.16
found        0.97
nesses       0.90
ness         0.90
fuck         0.89
ly           0.88
est          0.87
asses        0.85
itude        0.84
stru         0.84
Activations Density: 0.036%
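
For a sense of scale, a density of 0.036% corresponds to roughly 36 active positions per 100,000 tokens. The sketch below assumes the density is the fraction of sampled token positions on which the feature's activation exceeds a threshold; the page does not state its exact definition. The function name `activation_density`, the zero threshold, and the sample data are all hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions on which
    the feature fires. The source page does not define it explicitly.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 36 of which are active.
sample = [0.0] * 99_964 + [1.5] * 36
density = activation_density(sample)
print(f"{density:.3%}")
```

Raising `threshold` above zero would count only stronger activations and so can only lower the reported density.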