INDEX
Explanations
phrases or sentences warning about the content of the text
phrases indicating the presence of content or materials within various contexts
Negative Logits
Token     Logit
doms      -0.79
Sabha     -0.75
apo       -0.69
urai      -0.67
Seym      -0.65
icably    -0.65
laus      -0.64
sett      -0.63
liner     -0.63
zai       -0.62
Positive Logits
Token        Logit
ttes         0.77
contents     0.76
ãĤ¼ãĤ¦ãĤ¹    0.73
encies       0.72
Contains     0.72
ãĤ©          0.69
ãĤ£          0.69
iveness      0.68
Material     0.68
ãĤ·ãĥ£       0.67
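Several of the positive-logit tokens (for example ãĤ¼ãĤ¦ãĤ¹) look like encoding debris but are probably raw vocabulary strings from a byte-level BPE tokenizer. The sketch below assumes the GPT-2 byte-to-character table, which this page does not confirm; `decode_token` is a hypothetical helper name. Under that assumption it maps each character back to its byte and decodes the result as UTF-8.

```python
import unicodedata


def bytes_to_unicode():
    """Rebuild the GPT-2 style byte -> printable character table (assumed tokenizer)."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to a code point at 256 and above.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Hypothetical helper: turn a raw vocabulary string into readable text."""
    # Normalize first in case the page text was copied in decomposed form.
    token = unicodedata.normalize("NFC", token)
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # Tokens may end mid-character, so replace incomplete sequences.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ãĤ¼ãĤ¦ãĤ¹", "ãĤ©", "ãĤ£", "ãĤ·ãĥ£", "contents"]:
        print(repr(tok), "->", decode_token(tok))
```

If the assumption holds, the four tokens decode to the katakana strings ゼウス, ォ, ィ and シャ, while plain ASCII tokens such as contents pass through unchanged. A character outside the table (a literal space, for instance) raises `KeyError`, which is a useful signal that the token came from a different tokenizer.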
Activations Density 0.024%