INDEX
Explanations
profane words or phrases
linguistic patterns related to specific suffixes and potentially notable keywords
Negative Logits
rift    -0.80
iry     -0.75
Wond    -0.74
MQ      -0.74
romy    -0.70
dq      -0.69
DOM     -0.67
IR      -0.67
Soc     -0.66
ilk     -0.64
Positive Logits
banter      0.82
insult      0.78
gluc        0.74
fres        0.74
plet        0.74
heck        0.73
eah         0.72
boo         0.66
onsense     0.66
outburst    0.66
Activations Density 0.055%
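
For readers unfamiliar with the density figure, the following is a minimal sketch of one common reading of it: the fraction of tokens on which the feature is active. This page does not state how its number was computed, so the definition, the function name `activation_density`, the `threshold` parameter, and the toy data are all assumptions made for illustration.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a non-zero
    (above-threshold) activation. The source page does not confirm this.
    """
    values = list(activations)
    if not values:
        return 0.0
    active = sum(1 for a in values if a > threshold)
    return active / len(values)


# Hypothetical toy data: 11 active tokens out of 20,000 in total.
toy = [0.0] * 19989 + [1.0] * 11
print(f"{activation_density(toy):.3%}")
```

Under this reading, a density of 0.055% would mean the feature fires on roughly 5 or 6 tokens in every 10,000.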