OpenSubtitles Hash (OSHash)

Implementation reference, verified test suite & security analysis

Algorithm

hash = file_size + sum_uint64_le(first_64KB) + sum_uint64_le(last_64KB)

The hash is the file size plus a 64-bit checksum of the first 64 KB and a 64-bit checksum of the last 64 KB. The formula itself is defined even when the two chunks overlap (a file smaller than 128 KB), but the size requirement below excludes such files. All arithmetic is unsigned 64-bit and wraps on overflow (modulo 2^64). Each chunk is read as 8,192 little-endian uint64 values.

File size requirement: minimum 131,072 bytes (128 KB). Files smaller than this cannot be hashed.
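The formula and the size requirement above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the names oshash, sum_uint64_le, CHUNK_SIZE, MIN_SIZE and MASK64 are chosen for this example, and the 16-digit lowercase hex output is assumed from the format of the test vectors.

```python
import os
import struct

CHUNK_SIZE = 64 * 1024          # 65,536 bytes = 8,192 little-endian uint64 words
MIN_SIZE = 2 * CHUNK_SIZE       # 131,072 bytes, the minimum stated above
MASK64 = 0xFFFFFFFFFFFFFFFF     # Python ints are unbounded, so wrap to 64 bits by masking


def sum_uint64_le(chunk: bytes) -> int:
    """Sum one 64 KB chunk as 8,192 little-endian uint64 values, modulo 2**64."""
    words = struct.unpack("<8192Q", chunk)
    return sum(words) & MASK64


def oshash(path: str) -> str:
    """Return the OSHash of the file at `path` as 16 lowercase hex digits."""
    size = os.path.getsize(path)
    if size < MIN_SIZE:
        raise ValueError(f"file is {size} bytes; at least {MIN_SIZE} are required")
    with open(path, "rb") as f:
        head = f.read(CHUNK_SIZE)
        f.seek(size - CHUNK_SIZE)   # jump straight to the tail instead of streaming
        tail = f.read(CHUNK_SIZE)
    total = (size + sum_uint64_le(head) + sum_uint64_le(tail)) & MASK64
    return f"{total:016x}"
```

Masking with MASK64 stands in for the natural wrap-around that a native uint64 type provides. In languages with fixed-width unsigned integers the masks are unnecessary.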

History

~2004
Origin in Media Player Classic. The hash algorithm was first implemented in Media Player Classic, an open-source media player for Windows created by Gabest (Guliverkli project) and later continued as the MPC-HC fork. The algorithm was designed to quickly identify video files for automatic subtitle matching, prioritizing speed over collision resistance. The name "Gibest hash" (sometimes "Gabest hash") comes from this origin. The original C++ implementation can be found in SubtitlesProvidersUtils.cpp (the GenerateOSHash function).
2006
Adopted by OpenSubtitles.org. OpenSubtitles adopted the hash as the primary file identification method for their XML-RPC API. It became the standard way for media players and subtitle tools to look up subtitles automatically. The hash, combined with file size, creates a lookup key in the OpenSubtitles database. The original hash source codes wiki page collected community implementations.
2006–2020
Widespread adoption. Implementations appeared in dozens of languages. Media players like VLC, Kodi, Plex, Stremio, and many subtitle tools (e.g. Bazarr, Sublight) integrated the hash for automatic subtitle downloads. The algorithm's simplicity made it easy to port, though some early implementations contained bugs (especially around 64-bit overflow handling in PHP, JavaScript, and Perl).
2023+
OpenSubtitles REST API (v2). The newer REST API still uses this hash for file lookups — it's the moviehash parameter (and the moviehash_match flag on results). "moviehash" is simply the API's name for OSHash; they are the same algorithm. It remains the fastest identification method for local files.

Test Vectors

File            Size (bytes)     Expected Hash      Download
breakdance.avi  12,909,756       8e245d9679d31e12   Download (12.3 MB)
dummy.rar       4,295,033,890    61f7751fc2a72bfb   Download RAR (2.4 MB); unpack to get the 4 GB test file

Verify your implementation against both files before deployment. breakdance.avi confirms the basic algorithm, but the 4 GB file (unpacked from dummy.rar) is the important one: its size exceeds 2^32 bytes, so it is the only vector that exposes the 64-bit overflow and large-file seek bugs that plague many implementations circulating online. A hash that is correct on breakdance.avi can still be wrong here.
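As a small worked example of why the 4 GB vector matters, the snippet below shows what happens to its size when it passes through a 32-bit unsigned variable. It only illustrates the arithmetic of one plausible failure mode. The variable names are made up for this example, and real buggy implementations may fail in other ways.

```python
SIZE_4GB_VECTOR = 4_295_033_890    # size of the unpacked test file, from the table above
CHUNK_SIZE = 64 * 1024

# A 32-bit unsigned size silently drops everything at or above 2**32.
truncated_size = SIZE_4GB_VECTOR & 0xFFFFFFFF
print(truncated_size)                      # 66594
print(SIZE_4GB_VECTOR - truncated_size)    # 4294967296, exactly 2**32

# The tail offset computed from the truncated size points near the start of the file.
print(SIZE_4GB_VECTOR - CHUNK_SIZE)        # 4294968354 (correct offset)
print(truncated_size - CHUNK_SIZE)         # 1058 (wrong offset)
```

Both the size term and the tail chunk end up wrong in this scenario, while breakdance.avi (about 12.9 MB) is unaffected because its size fits in 32 bits.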

Limitations & Security

Not Cryptographic

OSHash is not a cryptographic hash. It was designed for speed, not security. Do not use it for integrity verification, authentication, or any security-sensitive purpose. Use SHA-256 or BLAKE3 for those.

Trivial Collisions

Two files with the same size, same first 64 KB, and same last 64 KB will produce the same hash, regardless of what's in the middle. For a typical video file of 1 GB or more, the 128 KB that is read is about 0.01% of the content, so roughly 99.99% of the file is never hashed.

Hash Forgery

An attacker can craft a file with any desired hash by manipulating the first or last 64 KB. Since the hash is just addition, finding a preimage is trivial arithmetic, not a computational puzzle. You can also transplant the head/tail of one file onto another.
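The "trivial arithmetic" can be made concrete with a short sketch. It works on in-memory buffers for brevity; oshash_bytes and forge are hypothetical helpers written for this example, and the target value is simply the breakdance.avi hash from the test vectors. The idea: changing the first 8-byte word by some delta changes the hash by the same delta (modulo 2^64).

```python
import struct

CHUNK = 64 * 1024
MASK64 = 0xFFFFFFFFFFFFFFFF


def oshash_bytes(data: bytes) -> int:
    """OSHash of an in-memory buffer (hypothetical helper for this demo)."""
    if len(data) < 2 * CHUNK:
        raise ValueError("need at least 128 KB")
    head = sum(struct.unpack("<8192Q", data[:CHUNK]))
    tail = sum(struct.unpack("<8192Q", data[-CHUNK:]))
    return (len(data) + head + tail) & MASK64


def forge(data: bytes, target: int) -> bytes:
    """Return a copy of `data` with its first 8-byte word adjusted so the hash equals `target`."""
    first_word = struct.unpack_from("<Q", data, 0)[0]
    delta = (target - oshash_bytes(data)) & MASK64     # how far the hash must move
    new_word = (first_word + delta) & MASK64
    return struct.pack("<Q", new_word) + data[8:]


original = bytes(range(256)) * 1024                    # 262,144 bytes of arbitrary content
forged = forge(original, 0x8E245D9679D31E12)           # target: breakdance.avi's hash
print(f"{oshash_bytes(forged):016x}")                  # 8e245d9679d31e12
```

Only eight bytes change and the file size stays the same. The forged buffer still has its own size, so a lookup that also compares file size would need the size to match as well.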

Second Preimage Attack

Given a file and its hash, creating a different file with the same hash is trivial: keep the same size, copy the first and last 64 KB, and put anything in the middle. This makes the hash unsuitable for verifying file authenticity.
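The recipe above (same size, same head and tail, different middle) can be checked in a few lines. This is an illustrative sketch on in-memory buffers; oshash_bytes is a hypothetical helper for this example and the buffer contents are arbitrary.

```python
import os
import struct

CHUNK = 64 * 1024
MASK64 = 0xFFFFFFFFFFFFFFFF


def oshash_bytes(data: bytes) -> int:
    """OSHash of an in-memory buffer (hypothetical helper for this demo)."""
    if len(data) < 2 * CHUNK:
        raise ValueError("need at least 128 KB")
    head = sum(struct.unpack("<8192Q", data[:CHUNK]))
    tail = sum(struct.unpack("<8192Q", data[-CHUNK:]))
    return (len(data) + head + tail) & MASK64


head = os.urandom(CHUNK)                               # shared first 64 KB
tail = os.urandom(CHUNK)                               # shared last 64 KB
file_a = head + b"\x00" * 1_000_000 + tail             # middle: all zero bytes
file_b = head + b"\xff" * 1_000_000 + tail             # middle: all 0xFF bytes, same length

assert file_a != file_b                                # the contents differ...
assert oshash_bytes(file_a) == oshash_bytes(file_b)    # ...but the hashes are equal
```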

Appropriate Use Cases

OSHash is well-suited for: subtitle database lookups, media library deduplication (combined with file size), and quick file identification in trusted environments. Its O(1) read cost (always 128 KB regardless of file size) is its main advantage.

Performance Profile

OSHash reads only 128 KB in total, regardless of file size, so hashing a 50 GB file takes about the same time as hashing a 200 KB file. There are no CPU-intensive cryptographic operations, just integer addition. It typically completes in under 1 ms for local files.

Performance

Each implementation is timed on the same run — hashing the 4 GB test file on the reference machine. Because OSHash only reads the first and last 64 KB, even a 4 GB file is trivial work (a correct implementation seeks rather than streams), so these numbers are dominated by interpreter / VM startup and runtime footprint, not throughput. That is exactly what makes them interesting — they show the fixed cost of reaching for each language, and confirm that none choke on a multi-gigabyte input.

What the numbers say:

  • Native compiled (C, C++, Rust, Go, Zig, Nim, D, Crystal, Fortran, Pascal, Ada, Assembly, V) all land at about 0.01 s and 3–4 MB, which is essentially just process spawn. They are indistinguishable here; the difference is noise.
  • Scripting (Python, Perl, PHP, Ruby, Lua, Tcl, AWK) adds a small interpreter-startup tax: tens of milliseconds, single-digit to tens of MB.
  • JVM (Java, Kotlin, Scala, Groovy, Clojure) pays for JVM boot and heap — 40–210 MB and up to a couple of seconds. Groovy and Clojure are the heaviest of the set.
  • BEAM (Erlang, Elixir) and JIT/runtime-heavy languages (Julia, Dart, Raku, PowerShell) sit in between, with Julia's JIT and PowerShell's runtime the most memory-hungry (180–240 MB).

Bash is the outlier at ~1.5 s: it shells out to dd/od and does the 64-bit arithmetic in shell, so it pays per-byte. CPU % above 100 % means the runtime used multiple cores during startup (common for the JVM and BEAM).

Implementations