INDEX
Explanations
phrases related to actions or decisions being taken
phrases related to concerns and attention regarding safety and evaluation processes
Negative Logits
fuck          -0.64
Pse           -0.61
fame          -0.61
Beat          -0.59
OGR           -0.57
Que           -0.57
stories       -0.57
Chains        -0.56
misfortune    -0.56
Flo           -0.56
Positive Logits
aukee         0.97
ļéĨĴ          0.78
ername        0.76
ģ«            0.72
thren         0.70
zinski        0.67
Ī             0.65
ħĭ            0.65
ignt          0.65
emaker        0.64
Activations Density 0.325%