INDEX
Explanations
that is mainly from
instructions and meta-discussions about AI models, their capabilities or constraints, especially jailbreak-style prompts and references to policies or system rules.
NEGATIVE LOGITS
ⵛ    0.35
┍    0.34
يف    0.33
胍    0.33
ლე    0.33
?](    0.32
␥    0.32
ﺔ    0.32
άλ    0.32
ί    0.32
POSITIVE LOGITS
that    0.37
advertisement    0.36
providence    0.35
Youtube    0.34
not    0.32
customer    0.32
vision    0.32
rod    0.32
time    0.31
It    0.31
Activations Density 3.103%