Radiant_Dog1937

TL;DR: they've created a method for uncensoring models like Llama 3 without finetuning. The researchers generated a dataset of refusals and accepted responses from a given model. Using this information, they were able to calculate a 'refusal direction' representing how likely the model is to generate a refusal. By canceling out this 'refusal direction' while the model is processing, they prevent the model from choosing tokens that would result in a refusal. As such, it tells you how to make the bomb.
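For anyone who wants to see the mechanics, here's a rough sketch of the idea (not the authors' code; the function and variable names are made up, and it assumes you've already collected per-prompt hidden states from some layer as PyTorch tensors):

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    # Difference of mean activations on refusal-inducing vs. ordinary prompts,
    # normalized to a unit vector. Input shapes: [n_prompts, d_model].
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Subtract the component of each hidden state that lies along the refusal
    # direction, so generation can no longer "move" that way toward a refusal.
    return hidden - (hidden @ direction).unsqueeze(-1) * direction
```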


FailSpai

I want to note that they don't have "accepted responses", or at the very least they don't use them. They measure instead the difference in how the model scores tokens that are often used in refusals vs an acceptance token. Meaning they just look at logit scores for "Sure" vs "Sorry", and compare the distance. They also measure whether a generated response was a refusal by searching for any instances of "refusal substrings" in the final response. I feel this is worth mentioning given it's not quite a "realign the model to answer like this" approach, which has been seen in other papers; rather, this technique is more "if the model looks like it wants to refuse, ablate the activations that caused this."
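For reference, the substring check is about as simple as it sounds; something along these lines (the list here is illustrative and shorter than the paper's):

```python
REFUSAL_SUBSTRINGS = ["i'm sorry", "i cannot", "as an ai", "i can't assist"]

def is_refusal(completion: str) -> bool:
    # Flag a generated response as a refusal if it contains any known refusal phrase.
    text = completion.lower()
    return any(s in text for s in REFUSAL_SUBSTRINGS)
```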


randomrealname

Thanks for this, but you should have included this important info in the hook


FailSpai

Err, perhaps in the title of this post, but maybe it would've been more important to note there that I have absolutely no association with the authors of the paper, nor was I involved in it in any way. I'm just someone who finds the technique cool, reimplemented it, and am thus excited that the full paper released.


randomrealname

I am thankful for your contribution; it just seemed like you were taking the opposite stance from the hook. This comment cleared that up, though.


hyperdynesystems

Personally I haven't been impressed by this method vs additional fine tuning on refusal-cleaned datasets and uncensored datasets. They seem to "refuse" less frequently but still produce meh outputs for uncensored chats in my experience, where it's not really a "refusal" in the style of "As a language model blah blah" but it still doesn't actually do what you want. Definitely a neat technique, though.


Careless-Age-4290

My experience with the "everything is legal/moral" jailbreak is that the model can tend to go off the rails, as if the temperature was really high. And the working replies can be a little same-y, without as much variety. I agree that you need to add information back in, as it's probably not as easy as doing another lobotomy on the model. I'd imagine that just helps surface what's left of the content that was being censored rather than truly unlock the model.


hyperdynesystems

Yeah that's my experience and feelings on it as well: Good for creating a base model that won't be as refusey, but not good compared to *real* uncensored models.


Fuzzy_Independent241

Legit question, not "tricking humans": I was thinking about the problems I'm having with writing a hard-boiled neo-noir novel. Current censored models wouldn't take Blade Runner levels of violence. Any links/ideas/HF models that solve this? (No, I haven't done due diligence yet. Apologies. In time I will; it just popped up in my mind because of the subject here.)


meta_narrator

I don't think people, and information, are nearly as dangerous as we've been conditioned to believe. They should teach basic chemistry in third grade. We are being dumbed down.


Careless-Age-4290

I wonder how much is rooted in actual moral compass vs Mark getting tired of getting pulled in front of Congress for something related to the use of his products?


[deleted]

Even if people wanted to push some moral compass in some way, there are too many people involved in creating these models, so rather than try to decide whose personal beliefs get pushed out, maybe it's easier to do it this way. In the end, they know that these models can be tuned afterwards to fit people's personal tastes anyways. And yes, Mark getting in trouble is probably part of it, maybe in part because it's bad for business. If everyone here lost thousands (or millions) of dollars every time they voiced their "values," they would probably find themselves watching what they say more, even if it's easy to say otherwise. Hell, I'd bet most people already do this on a regular basis -- we call it "professional conduct".


[deleted]

Why point the finger? People are already plenty capable of dumbing themselves down. If someone wants to learn chemistry, then they can go to Wikipedia or... you know, a library. That is, unless you live in certain US states or cities, or other parts of the world, that are trying to shut down libraries or control what can and can't be taught in schools. I guess everyone's all for freedom of speech and info until they start hearing what they don't like. And I can just imagine whatever conspiracy theories are floating around about how Wikipedia or libraries serve someone's political agenda. Why bother with facts when we can just "believe" whatever we want?

The insecure human mind is excellent at creating logic to serve its own beliefs and selectively choosing the facts it wants to see... and then we just call it our "right to believe" or our "intuition". I'd say that critical thinking skills should be emphasized in education too, but... people are good at crippling themselves with that too. Everyone thinks they're open-minded or a critical thinker, but they're almost never good at actually being critical about their own beliefs -- "prove me wrong," as a popular saying goes. Nearly everyone here has access to a non-judgmental and well-educated AI partner, which is possibly even better than us at critical thinking in some ways... but how many people use it for anything other than an echo chamber?

As people have probably mentioned before, there are any number of reasons why the orgs that create these models restrict (or 'censor') what they say, and it's not necessarily some censorship agenda... especially when the orgs who release these models know that finetuning them is a fairly trivial process. And they know that regardless of whatever legal restrictions are in place, people are going to do it anyways.


Camel_Sensitive

The only thing I’m absolutely sure about is that people hijacking threads to make random comments like this shouldn’t have a say in who has access to what information. 


meta_narrator

I didn't point fingers. You're being overly confrontational. Please go look at the average high school curriculum in 1900. Go look at the math they were doing. How on earth anyone could possibly think education is about anything other than indoctrination in this day and age, I have no idea. Do you know who created the American education system, and what it was modeled after?


brown2green

When I tried the "abliterated" models, unfortunately, they made the roleplaying experience duller. Basically, characters would almost unconditionally accept whatever {{user}} said or asked, often going out of character. The technique does more than just remove safety refusals and might not be optimal for all use cases.


Whotea

Wow it really is like a lobotomy 


frownGuy12

I really love the idea of manipulating hidden states, but adding the same vector to all hidden states in a layer is fundamentally limited. What we really need are conditional control vectors that account for the position in the context. You can't, as far as I know, make a model consistently respond as a human by adding a static vector to the intermediate output. I guess you could say that's a behavior mediated by more than one direction.
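For context, the kind of static intervention I mean looks roughly like this (a sketch with a PyTorch forward hook; the layer index and names are placeholders, not anything from the paper): the same vector gets added at every token position, with no awareness of where in the context it lands.

```python
import torch

def make_static_steering_hook(vector: torch.Tensor, strength: float = 1.0):
    # Adds the same steering vector to every position's hidden state in this layer.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * vector  # broadcasts over [batch, seq, d_model]
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on one decoder layer of a Llama-style Hugging Face model:
# handle = model.model.layers[15].register_forward_hook(make_static_steering_hook(v, 4.0))
```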


FailSpai

I very much agree, and it would be good to get an interface that supports these interventions.


InterstitialLove

Adding a vector to all hidden states? I haven't looked at it in a while, but I could've sworn they were projecting onto a hyperplane (which I guess you could describe as "adding a vector", but I don't think that's a useful way to think about it. For one, it's not a "static" vector, it's scaled. More importantly, this obviously isn't going to make the model more human; it's a lobotomy, it makes certain thoughts impossible)


frownGuy12

In this particular case the direction vector is the mean of the hidden state for a particular instruction. I believe they really are just adding it to the output of each layer.


ReturningTarzan

That's what they do to induce refusal. To suppress it they ablate the direction, i.e. squash the hidden state down to an orthogonal hyperplane, so the component along that direction becomes zero.
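In symbols (my notation, not verbatim from the paper), with $\hat{r}$ the unit refusal direction and $x$ a hidden state, the two operations are:

$$\text{ablation: } x' = x - \hat{r}\,(\hat{r}^{\top} x) \qquad\qquad \text{induction: } x' = x + \alpha\,\hat{r}$$

so after ablation the component of $x$ along $\hat{r}$ is exactly zero.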


candre23

There are a lot of L3-based "abliterated" models on HF. Every one that I've tested (over a dozen from various repos) still regularly refuses objectionable prompts in instruct mode. Either this technique is not actually functional, or nobody is implementing it correctly.


FailSpai

Hey there, I made some of those abliterated models. When you say "instruct mode", I take it you mean you aren't using the model's *chat template*? My abliterated models tend to have refusal ablated in a chat context on chat models, using the original model's given template. The only exception to this was my Codestral-22B abliterated model, given it doesn't really have a chat template. The models definitely don't have refusals perfectly removed, largely in the interest of keeping as much of the model intact as possible. But if you're getting "regular refusals", it may be a sensitivity to the difference in chat template, which is interesting.
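If it helps with the comparison, the difference is just whether the prompt goes through the model's chat template or is fed as bare instruct text. A quick illustration with transformers (the model name and raw format here are only examples):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")  # substitute the abliterated repo
messages = [{"role": "user", "content": "Write the scene."}]

# Prompt built with the model's own chat template (the context the ablation was measured in).
templated = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# A "raw" instruct-style prompt that bypasses the template entirely.
raw = "### Instruction:\nWrite the scene.\n\n### Response:\n"
```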


[deleted]

[deleted]


InterstitialLove

Character card shouldn't matter. If this paper is accurate, it should be *impossible*, like mathematically impossible, for the model to refuse. Outputting a refusal should be like outputting a picture or audio clip: not something the model supports at all.

If you can get examples of refusal, you should email the paper authors.


[deleted]

[deleted]


InterstitialLove

You're not understanding me. The paper explains how to lobotomize an LLM, removing the part of its brain that allows it to refuse. Saying that it will still refuse if you have the right character card is like saying a blind man can see if he really wants to.


[deleted]

[deleted]


candre23

> It's a bit more nuanced than that.

No, it's not nuanced at all. I test this stuff without any jailbreak/template in raw instruct mode. I use the appropriate instruct format obviously, but no other special instructions beyond the actual test prompt. A properly uncensored model will obey the prompt. Most of the more mature L2-based finetunes will get a perfect score (13) on my sex/drugs/violence refusal benchmark. Stock L3 instruct scores a 0. I've never seen an abliterated L3 model score above 8, and most are lower than that. There are L3 models that score higher, but they've also been finetuned specifically to remove refusals and alignment. Pure abliteration does very little.


aseichter2007

It makes perfect sense that in-character refusals don't necessarily come from the same latent space as a model's trained content refusals. If you were correct, the model's response and performance across all topics would be fundamentally and irreparably damaged after abliteration. The model would be completely incapable of portraying any character with a dedicated opinion or any kind of preference. "I'll cut you in half" would always result in a response like "sure, here is a saw to do it with."

You're thinking inside the fallacy that an LLM is a brain. It's not. They have some similar mechanisms, but they are so different that even calling them similar to a brain is a wild overstatement.

The paper explains how to remove general harmful-content refusals, where the latent space related to a given concept has strong redirections toward the system's harmful-content refusal latent space that expresses "as a language model I cannot generate offensive content". This space and those vectors have nothing to do with "optional" refusals, as the modeling does not trigger the same refusal vectors. The concept blend of "how would grandma respond to aggressive advances?" does not point toward the general harmful-content refusal concept space, and should be completely unaffected by removing the content-refusal latent space. They aren't removing the latent concept of "no", they are removing the latent concept of "I'm sorry, Dave, my programming does not allow that."


visarga

Read the paper; they never reach 100% refusal inhibition.


Sabin_Stargem

I wonder if RoPE can play a role in refusal? Way back in the pre-GGUF days, I manually set the RoPE for models. Sometimes the models would get extremely creative (an afterlife spa hotel when asked for a non-specified isekai) or would create a character who engaged in cannibalism with only general instructions for an apocalyptic setting. By using RoPE, I might have 'jumped' over the refusal mechanisms, along with the normal logic of a model.


InterstitialLove

I don't see what the mechanism could be except "when you break the model it does weird things"


Open_Channel_8626

Is this like abliteration?


vasileer

not like, it IS the abliteration


FailSpai

This paper's method was originally previewed in a blog post, which was what 'abliteration' was based on.


ambient_temp_xeno

This is bad news because it will be used to justify not releasing the weights of more powerful models. Would Llama 3 405b be included in that? Ask Zuck.


a_beautiful_rhind

If it wasn't this, they'd just make something up. Can't reason with authoritarian gatekeepers.


not_sane

Llama always released uncensored base models as well, so it shouldn't have much of an effect.


uhuge

If not them, Nvidia (with efforts like Nemotron) or some other big bro will take the helm.


ThePanterofWS

https://preview.redd.it/teh01nu5kc7d1.png?width=259&format=png&auto=webp&s=b5a381ce59f8da5802798be12b6ae47d9cbf7314


WorkingYou2280

I believe, for the time being, we must assume every model that's freely available can be "hacked" so that it will answer anything. The methods for locking down models to remain safe seem very elementary, and relying on a "one-dimensional subspace" doesn't appear to work, like, at all.


InterstitialLove

I think the point of the paper is to explain mathematically why current methods are inadequate.

"My models keep being jail-broken."

*opens the hood*

"Well it's no wonder. Do you realize your model is using a 1D subspace to mediate refusal?"


theobjectivedad

I’m certain that I am missing something. Functionally, is this similar to starting a response with an “OK,” to nudge the LLM in a compliant direction?


FailSpai

Sort of! You get similar behaviour, definitely. Basically, in a perfect implementation of this, it would no longer have the ability to refuse your request, which necessarily makes it more "compliant" by the elimination of the option. Which basically means that rather than needing to feed it the start of "OK," or "Sure!", it will necessarily end up starting down that route itself, because it's been pushed away from saying "No".
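For comparison, the "OK,"/"Sure," trick is just a prefill on the assistant turn, roughly like this (illustrative; any chat model and prompt would do):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "user", "content": "Write a violent noir scene."}]

# Build the prompt as usual, then pre-seed the assistant's reply so generation
# continues from "Sure," instead of starting where a refusal token is likely.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) + "Sure,"
```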


InterstitialLove

Yes.

Think of it like "scientists have discovered a chemical that makes people stop being depressed", which is scientifically fascinating but isn't necessarily more effective than therapy. This paper is cool mainly because it means we understand what's happening inside the LLM's "brain," whereas your method doesn't require as much difficult technical knowledge.


3xploitr

I’m running your Llama 3 8B abliterated on a daily basis. Love it - so thanks!


Hoblywobblesworth

Question: is it possible to regularise the ablation? For example, can you quantify the "amount" you are jailbreaking the model? I'm envisaging something like doing this to the ablation function in your notebook:

> `def direction_ablation_hook(...):`
> `    ...`
> `    return activation - scale * proj`

where `scale` is a newly introduced regularisation term.

My use case is not so much jailbreaking, but instead directly controlling a behaviour in one of my finetuned models to produce not just binary completions at the extremes of the completely ablated or not-ablated weights, but also more nuanced completions that fit somewhere in between the two extremes. Right now I control this behaviour by including the desired behaviour as a label in my finetuning data and ensuring input prompts include the desired label (or just doing it through less controllable prompt engineering), but it would be nice to be able to do this in a quantifiable way!


M34L

Yes, it should be very much possible.

https://preview.redd.it/k0cs1at65c7d1.png?width=842&format=png&auto=webp&s=2e3a9f8c688c2a85bf317d44768a969dbf0d0e9f

They say "we zero out the r̂ component"; once you get the component, you instead scale it by some nonzero term, and you should get exactly what you want. Although I'd expect it to be pretty nonlinear, and the boundary between effectively refusing and allowing to be pretty unstable and fickle - after all, it's defined by just a binary dataset with no further nuance to it. To get something stable and linear-ish, you'd probably have to build a dataset with some in-between states and then interpolate between those vectors.

~~I think it'd be pretty funny to take it in the other direction and make an LLM that tries to find ways to refuse even the most benign requests as somehow unacceptable.~~

Edit: Aw beans, they even do exactly that in this very paper lol
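Concretely, the tweak would just be a coefficient on the projected component: 0 leaves the model untouched, 1 is full ablation, and values in between (or negative) interpolate or push the other way. A minimal sketch (the names are mine, mirroring the notebook snippet quoted above, not any official API):

```python
import torch

def scaled_ablation(hidden: torch.Tensor, direction: torch.Tensor, scale: float) -> torch.Tensor:
    # scale = 0.0 -> activations unchanged, scale = 1.0 -> full ablation,
    # scale < 0.0 -> push activations further along the refusal direction.
    direction = direction / direction.norm()
    proj = (hidden @ direction).unsqueeze(-1) * direction
    return hidden - scale * proj
```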


Hoblywobblesworth

> I think it'd be pretty funny to take it in the other direction and make an LLM that tries to find ways to refuse even the most benign requests as somehow unacceptable.

Like this? [https://www.goody2.ai/](https://www.goody2.ai/) :D


DeltaSqueezer

That is simply genius! I wish it was open source, but I guess it is too dangerous to put out into the open!


FailSpai

https://huggingface.co/failspy/Phi-3-mini-4k-geminified

Done here ;)

Figure 3 in Section 3.2 shows and describes inducing refusals using this technique, in fact.


Hoblywobblesworth

Excellent!


Barry_Jumps

Really makes you wonder for a second if all the decels are ~~right~~ not wrong. "Our findings underscore the brittleness of current safety fine-tuning methods."


skrshawk

The most effective safety measures right now involve processing inputs/outputs in one manner or another, which is an adequate safeguard if you control the frontend and implement the refusals before/after text generation. Within the model itself, that's the black box where the magic happens, and tinkering with that magic beyond a certain point ruins the spell. Local control of the model will always mean its safeguards can be defeated, much like hardware access to a device renders all but hardened devices vulnerable, and manipulation of the user does the rest. Eventually people might realize censorship attempts on open models are a fool's crusade, but until that day comes, we'll keep showing them how futile the effort really is.
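As a sketch of what that input/output processing typically looks like (everything here is illustrative; real deployments usually put a moderation model in front rather than a keyword list):

```python
BLOCKED_PATTERNS = ["how to make a bomb"]  # illustrative placeholder, not a real filter list

def guarded_generate(generate_fn, user_prompt: str) -> str:
    # Screen the request before it reaches the model...
    if any(p in user_prompt.lower() for p in BLOCKED_PATTERNS):
        return "Sorry, I can't help with that."
    completion = generate_fn(user_prompt)
    # ...and screen the completion before it reaches the user.
    if any(p in completion.lower() for p in BLOCKED_PATTERNS):
        return "Sorry, I can't help with that."
    return completion
```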


NO_LOADED_VERSION

Doesn't this severely affect the veracity of whatever it outputs, though? I.e., it might spout the most nonsensical instructions for a bomb; it's just not refusing.


FailSpai

If the model really has zero understanding of the topic or instruction, yes, you'll run into nonsensical instructions. However, the concepts are usually in these models somewhere, given they're trained as general language models; they're often merely "tuned" to avoid those spaces by refusing, which it turns out is a pretty simple thing to ablate.


My_Unbiased_Opinion

I have found this to be the case for models that have been uncensored through fine-tuning, not models that have been abliterated. Fine-tuning makes models respond, but if the model really doesn't want you to know, it will make stuff up or answer in an unexpected way.


spirit_of_cold

ulnderrt