Algorithm
hash = file_size + sum_uint64_le(first_64KB) + sum_uint64_le(last_64KB)
The hash is the file size plus a 64-bit checksum of the first and last 64 KB
(even if they overlap because the file is smaller than 128 KB).
All arithmetic is unsigned 64-bit with natural overflow (wrapping).
Data is read as little-endian uint64 values (8192 values per chunk).
File size requirement: minimum 131,072 bytes (128 KB). Files smaller than this cannot be hashed.
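The steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the names `oshash`, `sum_uint64_le`, `CHUNK_SIZE`, and `MASK64` are chosen here for the example, and the 16-digit lowercase hex output is assumed from the format of the expected hashes in the Test Vectors table.

```python
import os
import struct

CHUNK_SIZE = 64 * 1024            # 65,536 bytes = 8192 little-endian uint64 values
MASK64 = 0xFFFFFFFFFFFFFFFF       # emulates unsigned 64-bit wrapping in Python


def sum_uint64_le(chunk: bytes) -> int:
    """Sum a chunk as little-endian uint64 values, wrapping at 64 bits."""
    values = struct.unpack(f"<{len(chunk) // 8}Q", chunk)
    return sum(values) & MASK64


def oshash(path: str) -> str:
    """Return the hash as 16 lowercase hex digits (format assumed from the test vectors)."""
    size = os.path.getsize(path)
    if size < 2 * CHUNK_SIZE:
        raise ValueError("file is smaller than 131,072 bytes and cannot be hashed")
    with open(path, "rb") as f:
        head = f.read(CHUNK_SIZE)
        f.seek(size - CHUNK_SIZE)  # seek to the tail instead of streaming the file
        tail = f.read(CHUNK_SIZE)
    total = (size + sum_uint64_le(head) + sum_uint64_le(tail)) & MASK64
    return f"{total:016x}"
```

Python integers are arbitrary precision, so the mask is what reproduces the wrapping that a native `uint64` gives for free. Calling `oshash("breakdance.avi")` on the test file should return the expected hash listed in the Test Vectors table.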
History
GenerateOSHash function).
moviehash parameter (and the moviehash_match flag on results).
"moviehash" is simply the API's name for OSHash; they are the same algorithm.
It remains the fastest identification method for local files.
Test Vectors
| File | Size (bytes) | Expected Hash | Download |
|---|---|---|---|
| breakdance.avi | 12,909,756 | 8e245d9679d31e12 | Download (12.3 MB) |
| dummy.rar | 4,295,033,890 | 61f7751fc2a72bfb | Download RAR (2.4 MB). Unpack to get the 4 GB test file |
Verify your implementation against both files before deployment.
breakdance.avi confirms the basic algorithm, but the
4 GB file (from dummy.rar) is the important one:
its size exceeds 2^32, so it's the only vector that exposes the
64-bit overflow / large-file seek bugs that plague many
implementations floating around the internet — a hash that's correct on
breakdance.avi can still be wrong here.
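A small arithmetic sketch shows why the larger vector matters. It assumes a hypothetical buggy implementation that keeps the file size or seek offset in a 32-bit variable. The sizes are taken from the table above; the variable names are illustrative only.

```python
CHUNK = 64 * 1024

SMALL_SIZE = 12_909_756        # breakdance.avi
LARGE_SIZE = 4_295_033_890     # unpacked 4 GB test file

# What a 32-bit size or offset variable would hold after truncation.
small_truncated = SMALL_SIZE & 0xFFFFFFFF
large_truncated = LARGE_SIZE & 0xFFFFFFFF

print(small_truncated == SMALL_SIZE)   # True: the small file fits in 32 bits
print(large_truncated)                 # 66594: the size term added to the hash is wrong

correct_tail_offset = LARGE_SIZE - CHUNK
broken_tail_offset = large_truncated - CHUNK

print(correct_tail_offset)             # 4294968354
print(broken_tail_offset)              # 1058: the "last 64 KB" is read from near the start
```

With the truncated values, both the size term and the tail chunk are wrong, while the same code still produces the right answer for the smaller file.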
Limitations & Security
Not Cryptographic
OSHash is not a cryptographic hash. It was designed for speed, not security. Do not use it for integrity verification, authentication, or any security-sensitive purpose. Use SHA-256 or BLAKE3 for those.
Trivial Collisions
Two files with the same size, same first 64 KB, and same last 64 KB will produce the same hash, regardless of what's in the middle. This means ~99.99% of the file content is not hashed for typical video files.
Hash Forgery
An attacker can craft a file with any desired hash by manipulating the first or last 64 KB. Since the hash is just addition, finding a preimage is trivial arithmetic, not a computational puzzle. You can also transplant the head/tail of one file onto another.
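To make the "trivial arithmetic" concrete, here is a hedged sketch that forges a chosen hash by overwriting only the first 8 bytes of a buffer. `oshash_bytes` and `forge` are hypothetical helpers written for this example; they work on in-memory data of at least 128 KB so that the head and tail chunks do not overlap.

```python
import struct

CHUNK = 64 * 1024
MASK64 = 0xFFFFFFFFFFFFFFFF


def oshash_bytes(data: bytes) -> int:
    """OSHash of an in-memory buffer (hypothetical helper for this demo)."""
    def chunk_sum(chunk: bytes) -> int:
        return sum(struct.unpack(f"<{len(chunk) // 8}Q", chunk)) & MASK64
    return (len(data) + chunk_sum(data[:CHUNK]) + chunk_sum(data[-CHUNK:])) & MASK64


def forge(data: bytes, target: int) -> bytes:
    """Return a copy of data whose first uint64 is chosen so the hash equals target."""
    if len(data) < 2 * CHUNK:
        raise ValueError("need at least 128 KB so head and tail do not overlap")
    zeroed = bytes(8) + data[8:]                    # hash with the first value set to 0
    fix = (target - oshash_bytes(zeroed)) & MASK64  # the single value that closes the gap
    return struct.pack("<Q", fix) + data[8:]


victim = bytes(range(256)) * 1024        # 256 KiB of arbitrary content
target = 0x8E245D9679D31E12              # expected hash of breakdance.avi from the table
forged = forge(victim, target)
print(f"{oshash_bytes(forged):016x}")    # 8e245d9679d31e12
```

One subtraction modulo 2^64 is enough, because every input value contributes to the hash by plain addition.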
Second Preimage Attack
Given a file and its hash, creating a different file with the same hash is trivial: keep the same size, copy the first and last 64 KB, and put anything in the middle. This makes the hash unsuitable for verifying file authenticity.
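The recipe above can be illustrated with a short sketch. `oshash_bytes` and `second_preimage` are hypothetical names used only here, and the buffer stands in for a real file. The middle is changed by inverting every byte, which guarantees the result differs from the original.

```python
import struct

CHUNK = 64 * 1024
MASK64 = 0xFFFFFFFFFFFFFFFF


def oshash_bytes(data: bytes) -> int:
    """OSHash of an in-memory buffer (hypothetical helper for this demo)."""
    def chunk_sum(chunk: bytes) -> int:
        return sum(struct.unpack(f"<{len(chunk) // 8}Q", chunk)) & MASK64
    return (len(data) + chunk_sum(data[:CHUNK]) + chunk_sum(data[-CHUNK:])) & MASK64


def second_preimage(original: bytes) -> bytes:
    """Keep the size and the first/last 64 KB, replace everything in between."""
    if len(original) <= 2 * CHUNK:
        raise ValueError("need more than 128 KB so there is a middle to change")
    middle = original[CHUNK:-CHUNK]
    altered = bytes(b ^ 0xFF for b in middle)   # invert every middle byte
    return original[:CHUNK] + altered + original[-CHUNK:]


original = bytes(range(256)) * 4096             # 1 MiB stand-in for a media file
twin = second_preimage(original)

print(twin != original)                              # True
print(oshash_bytes(twin) == oshash_bytes(original))  # True
```

Here 896 KiB of the 1 MiB buffer changed and the hash did not move, which is why a matching hash says nothing about the authenticity of the content in between.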
Appropriate Use Cases
OSHash is well-suited for: subtitle database lookups, media library deduplication (combined with file size), and quick file identification in trusted environments. Its O(1) read cost (always 128 KB regardless of file size) is its main advantage.
Performance Profile
Only reads 128 KB total, regardless of file size. Hashing a 50 GB file takes the same time as hashing a 200 KB file. No CPU-intensive cryptographic operations — just integer addition. Typically completes in under 1 ms for local files.
Performance
Each implementation is timed on the same run — hashing the 4 GB test file on the reference machine. Because OSHash only reads the first and last 64 KB, even a 4 GB file is trivial work (a correct implementation seeks rather than streams), so these numbers are dominated by interpreter / VM startup and runtime footprint, not throughput. That is exactly what makes them interesting — they show the fixed cost of reaching for each language, and confirm that none choke on a multi-gigabyte input.
What the numbers say:
- Native compiled (C, C++, Rust, Go, Zig, Nim, D, Crystal, Fortran, Pascal, Ada, Assembly, V) all land around ~0.01 s and 3–4 MB, which is essentially just process spawn. They are indistinguishable here; the difference is noise.
- Scripting (Python, Perl, PHP, Ruby, Lua, Tcl, AWK) adds a small interpreter-startup tax: tens of milliseconds, single-digit to tens of MB.
- JVM (Java, Kotlin, Scala, Groovy, Clojure) pays for JVM boot and heap: 40–210 MB and up to a couple of seconds. Groovy and Clojure are the heaviest of the set.
- BEAM (Erlang, Elixir) and JIT/runtime-heavy languages (Julia, Dart, Raku, PowerShell) sit in between, with Julia's JIT and PowerShell's runtime the most memory-hungry (180–240 MB).
Bash is the outlier at ~1.5 s: it shells out to
dd/od and does the 64-bit arithmetic in shell, so it
pays per-byte. CPU % above 100 % means the runtime used
multiple cores during startup (common for the JVM and BEAM).