Explanations
phrases expressing opinions or judgements
phrases that indicate making a statement or providing examples
Negative Logits
xual        -0.63
iru         -0.63
uzz         -0.62
Completed   -0.62
elsen       -0.61
arnaev      -0.60
fram        -0.60
sidx        -0.58
resa        -0.58
enegger     -0.57
Positive Logits
laughs      0.73
nothing     0.71
least       0.67
hem         0.62
Credit      0.57
_>          0.57
pecially    0.56
detract     0.56
suffice     0.56
gap         0.55
Activations Density 0.087%