INDEX
Explanations
discussion around preferences and societal norms
Negative Logits
sculptured    -0.64
noOf          -0.64
!!!           -0.64
maktadır      -0.63
omiast        -0.61
="#"><        -0.59
میباشد        -0.59
又は          -0.58
již           -0.58
אשר           -0.57
Positive Logits
weirdly       1.00
vaguely       0.97
goddamn       0.93
whatnot       0.91
iirc          0.89
shitty        0.89
ostensibly    0.85
pretty        0.83
awkwardly     0.82
lemme         0.82
Activations Density: 1.833%