INDEX
Explanations
negations or expressions of disagreement
Negative Logits
rape      -0.17
asia      -0.14
writeln   -0.14
reu       -0.14
wire      -0.14
bilt      -0.14
aoke      -0.14
finity    -0.13
drivers   -0.13
claimer   -0.13
Positive Logits
longer     0.32
different  0.31
doubt      0.29
xious      0.28
thin       0.27
exception  0.26
stranger   0.25
laughing   0.25
match      0.25
ordinary   0.23
Activations Density 0.018%