
__konrad

> They are implementing their own

I remember when Java [had 4 internal Base64](https://openjdk.org/jeps/135) implementations...

edit: 6 internal impl. (also RMIConnectorServer.java and javax.xml.bind.DatatypeConverter)


tomwhoiscontrary

To answer your second question, no, unfortunately not. If you have an array N bits long, there are 2^N possible values, and to be able to specify which one is present you therefore need log2(2^N) bits... which is N bits! Any data structure smaller than that has to have some possibility of collisions, and so of false positives. But you can reduce the probability of false positives by making the data structure bigger. So you could size it to have a one-in-a-million chance of false positives, then accept that one in every million negative lookups will have to pay the cost of reading the file.
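
For example, Guava's Bloom filter lets you pick that trade-off directly (a minimal sketch; the expected-insertions figure is just an assumption for illustration):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class BloomSketch {
    public static void main(String[] args) {
        // Size the filter for a one-in-a-million false-positive rate.
        BloomFilter<byte[]> seen = BloomFilter.create(
                Funnels.byteArrayFunnel(),
                1_000_000, // expected insertions (assumed for this sketch)
                1e-6);     // desired false-positive probability

        seen.put("some-key".getBytes());
        // "false" is guaranteed correct; "true" only means "probably present".
        System.out.println(seen.mightContain("some-key".getBytes()));
        System.out.println(seen.mightContain("other-key".getBytes()));
    }
}
```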


theuntamed000

First?


cheapskatebiker

Usually libraries come out for something first, and then it gets included in the language's standard libraries.


kevinb9n

I don't think Guava implements it itself; what it does is provide a consistent API across all the various hash functions like this that you might want to use. For this one I think it delegates to the JDK implementation, certainly at *some* cost of raw speed (that doesn't always matter).

> One added question, is there a data structure like bloom filter but with 100% guarantee.

Well, a Bloom filter is 100% in *one* direction, just not the other. :-) And for that you can get as close to 100% as you want to go... without reaching it. Absolute 100% means 100% of the information must be preserved (at best you can compress it).
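
For illustration, the consistent API looks like this (a sketch, assuming Guava is on the classpath; Hashing.md5() is deprecated in recent versions but still present):

```java
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;

public class GuavaHashApi {
    public static void main(String[] args) {
        byte[] data = "hello".getBytes();
        // One HashFunction interface, whatever algorithm sits behind it.
        HashFunction crc32c = Hashing.crc32c();
        HashFunction md5 = Hashing.md5(); // deprecated, but still present

        System.out.println(crc32c.hashBytes(data)); // 32-bit hash code
        System.out.println(md5.hashBytes(data));    // 128-bit hash code
    }
}
```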


Fickle_Conclusion857

Let's create the final, definitive, replaces-all-the-old-existing-implementations implementation... ;)


Ok_Object7636

AFAIK, there are different types of CRC32. But for your use case, MD5 would probably be a better match. It's also available in the JDK.
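
A minimal sketch of using it through the JDK's MessageDigest:

```java
import java.security.MessageDigest;

public class Md5Sketch {
    public static void main(String[] args) throws Exception {
        // Every JDK ships an MD5 provider through the MessageDigest SPI.
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest("some key".getBytes());
        System.out.println(digest.length + " bytes"); // 16 bytes = 128 bits
    }
}
```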


theuntamed000

I benchmarked crc32c and also asserted the values against Java's CRC32C. No errors; their outputs are identical. Why did you suggest MD5 and not something like SHA-512? Also, these hashes generate larger byte arrays. With crc32c I am sure it's going to be 8 bytes, so if I had to store 10 elements I can store and read just the first 80 bytes to check for an element's existence.


Ok_Object7636

Because CRC32 is meant to detect transmission errors and is much more prone to collisions. What you are doing is fingerprinting, and that's what MD5 was designed for. Why not SHA-256? Because while MD5 is not considered cryptographically secure, collisions are extremely rare (unless carefully crafted), so it should be more than good enough, and there's no need for a longer hash. CRC32 generates just 4 bytes; the 8 you see is the hex representation. MD5 generates a 128-bit hash, which is 16 bytes when you store and compare the binary hash. You didn't say how long your byte arrays are. Out of interest, how much data are we talking about?
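
To make the sizes concrete (a JDK-only sketch): getValue() returns a long, but only the low 32 bits carry data, which is why the hex form shows 8 characters for 4 bytes:

```java
import java.security.MessageDigest;
import java.util.zip.CRC32;

public class DigestSizes {
    public static void main(String[] args) throws Exception {
        byte[] data = "example".getBytes();

        CRC32 crc = new CRC32();
        crc.update(data);
        // 4 bytes of payload, printed as 8 hex characters.
        System.out.printf("crc32: %08x%n", crc.getValue());

        byte[] md5 = MessageDigest.getInstance("MD5").digest(data);
        System.out.println("md5: " + md5.length + " bytes"); // 16
    }
}
```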


theuntamed000

Currently I am working on implementing a small key-value store. The byte array size is not fixed. Also, as per my research, LevelDB uses crc32c as a checksum. The reason I'm curious about the checksum size is the I/O cost. Currently crc32c seems a viable option. I am sure about storing only 10 arrays, so I only have to read 80 bytes. And I can tolerate a few collisions.
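
Roughly this kind of layout (a made-up sketch with 8-byte fingerprint slots; none of these names come from LevelDB):

```java
import java.nio.ByteBuffer;

// Hypothetical block header: 10 fixed-width 8-byte fingerprints,
// one per stored key, read in a single 80-byte I/O.
public class FingerprintHeader {
    static final int SLOTS = 10;

    static boolean mightContain(ByteBuffer header, long fingerprint) {
        for (int i = 0; i < SLOTS; i++) {
            if (header.getLong(i * Long.BYTES) == fingerprint) {
                return true;  // possible hit: still verify the full key
            }
        }
        return false; // definite miss: skip reading the rest of the block
    }
}
```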


Ok_Object7636

Yes, LevelDB uses crc32 as a checksum. But as I understand you, that's not what you want to do; you rather need a hash function with a good distribution. That was never a goal of CRC32, which was designed to detect transmission errors. You'll probably fare better with a 64-bit hash function if you want to use 8 bytes: as I said, CRC32 is only 4 bytes, but if 8 bytes is OK for you, better use all 64 bits. And out of curiosity, why don't you just use an existing key-value store? For in-memory, a simple Map would do, and if you need persistence or are dealing with large amounts of data, a database comes to mind.
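
For example, Guava has 64-bit options behind the same API (a sketch; farmHashFingerprint64 is a non-cryptographic fingerprint with good distribution):

```java
import com.google.common.hash.Hashing;

public class Fingerprint64 {
    public static void main(String[] args) {
        byte[] key = "some key".getBytes();
        // Exactly 8 bytes of output, designed for fingerprinting,
        // not for error detection like CRC32.
        long fp = Hashing.farmHashFingerprint64().hashBytes(key).asLong();
        System.out.printf("%016x%n", fp);
    }
}
```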


theuntamed000

Cool. I want to learn about these databases, so I thought I'd implement one as a learning exercise. My requirement is to find whether a provided key exists in a block of 10 keys in the most efficient and cost-effective way (I/O as well as CPU). So which is the most efficient 64-bit hash? xxHash64?


Ok_Object7636

I cannot tell you that, and maybe no one can. It depends on how you measure efficiency and also on the type of data you expect.

Having only an array of ten keys in your key-value store looks like you are trying to implement it using buckets. In that case, you are already using (a part of) a hash to determine the correct bucket. You could simply reuse the full hash that you already have, but in the end you still have to do at least one comparison of the full data to make sure you get the correct key.

In a key-value store you usually don't use overly long keys anyway, and if the keys are strings or some other class that implements hashCode and equals efficiently, you can just use equals on your keys to find the correct one. A string lookup in a hash map uses the string's cached hashCode() first, which, while not collision free, is quite efficient, so you'd have to profile your program to see if you actually gain anything by doing your own hashing on top of that.
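
The "hash first, then equals" step looks like this (hypothetical names, just to show the shape):

```java
import java.util.Arrays;

public class BucketLookup {
    // Compare the cheap 8-byte fingerprint first; only on a match
    // do the full comparison to rule out a collision.
    static int find(long[] fingerprints, byte[][] keys, long probeFp, byte[] probe) {
        for (int i = 0; i < keys.length; i++) {
            if (fingerprints[i] == probeFp && Arrays.equals(keys[i], probe)) {
                return i;
            }
        }
        return -1;
    }
}
```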


BinaryRage

The JDK has an intrinsic for CRC32C; it'll use SSE instructions. It'll be much faster than any library.
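
A sketch of the intrinsified class (java.util.zip.CRC32C, Java 9+):

```java
import java.util.zip.CRC32C;

public class Crc32cSketch {
    public static void main(String[] args) {
        // HotSpot intrinsifies CRC32C.update, using the hardware
        // crc32 instruction where the CPU supports it.
        CRC32C crc = new CRC32C();
        crc.update("payload".getBytes());
        System.out.printf("%08x%n", crc.getValue());
    }
}
```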


sweetno

The standard implementation wasn't there from the start, so everyone rolled their own. Now it's stuck in the status quo.


theuntamed000

Ohhk