Explanations
words related to personal attributes, feelings, and actions
expressions of luck and fortune
Negative Logits
"},"              -0.65
harming           -0.63
Achieve           -0.63
etc               -0.63
Mankind           -0.62
".[               -0.61
harmful           -0.60
discriminatory    -0.56
.","              -0.56
çķ                -0.55
Positive Logits
guesses           0.94
caveat            0.90
analogy           0.87
caveats           0.82
assumption        0.82
prisingly         0.75
disclaimer        0.74
spoilers          0.74
understatement    0.72
guess             0.71
Activations Density 0.782%