A modest word


Cover image from Wikipedia.

Word sizes

Computers have come with a range of word sizes. If you have used a wide enough range of machines you may even have seen bytes with more or fewer than 8 bits, though we have standardized on the 8-bit byte over the last four decades. Characters can be 8, 16, or 32 bits, or a variable number of bytes, depending on what you need to represent. The word, though, tends to match the most common register size in the machine.

Example CPU                       Word size
6502, Z80                         8-bit
80286                             16-bit
80386, 80486, Pentium, 68000      32-bit
DEC Alpha, x86-64                 64-bit

The word size tends to describe where the CPU is comfortable processing integers and addressing memory. Granted, the 8-bit CPUs used 16-bit addressing, but the ALU was very 8-bit centric and the addressing modes had their own 8-bit limitations. On the opposite end, we know it’s kind of ridiculous to expect to fill 64 bits of addressable memory, right?

64 bits

How big is the space 64 bits can address? Well, it’s a lot. Let me introduce you to a quick way to estimate the range of an X-bit number. (I have talked about this in the past as part of an excellent way of doing mental math; I’ll describe it more in another post.)

First:

Every 10 extra bits is a thousand times larger.

Second:

Every bit doubles the size that can be addressed.

Using the first rule we know that a 30-bit range gets us into the billions of addressable entities. If you see a limit of 2 billion or 4 billion, a 32-bit value is driving it (2 billion when one bit is reserved for the sign of the number). A billion is fairly large, until you start using it: 4 GB of RAM is small today, even on a phone. When you get into storage we are talking about Terabytes most of the time. A Terabyte is roughly a trillion bytes, or using our first rule of thumb, something we can handle with 40 bits.

To handle all of this, modern CPUs and storage systems went to 64 bits some time ago. It was an easy move to double the standard 32-bit word and just move on. We saw this start in the 1990s, and it was fairly common by around 2006. Phones followed suit a few years later.

So back to the question at hand: how big is 64 bits of addressable space? Using our two rules we can quickly say 16 billion billion (roughly; it is actually about 18 billion billion). Rule one turns the first 60 bits into a billion billion (six factors of a thousand), and rule two doubles the remaining 4 bits, which is a factor of 16. To be honest, that doesn’t really answer the question if you don’t have a concept for a billion, let alone a billion billion.
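
To make the rules concrete, here is a quick sketch in Python (the function name is mine) comparing the rule-of-thumb estimate with the exact value:

    def estimate_range(bits):
        # Rule one: every 10 bits is roughly a factor of a thousand.
        # Rule two: every leftover bit doubles the size.
        thousands, leftover = divmod(bits, 10)
        return (1000 ** thousands) * (2 ** leftover)

    print(estimate_range(64))  # 16000000000000000000 -> "16 billion billion"
    print(2 ** 64)             # 18446744073709551616 -> about 18 billion billion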

Equivalence                                                                       Rough number of bits to represent
Humans on Earth, 7.8 billion                                                      33 bits
Inches to the Moon, 15 billion                                                    34 bits
Inches to the Sun, 5.8 trillion                                                   43 bits
Inches from the Sun to Pluto, 232,000,000,000,000                                 48 bits
Square meters on the Earth’s surface, 510,000,000,000,000                         49 bits
Estimated number of ants on Earth, 1,000,000,000,000,000                          50 bits
Transistors produced worldwide in 2008, 6,000,000,000,000,000,000                 63 bits
Stars within range of a telescope, 70,000,000,000,000,000,000,000                 76 bits
Mass of the Sun (333,000 Earths), 1,989,000,000,000,000,000,000,000,000,000 kg    101 bits

64 bits would let you address every living ant in the world more than 18,000 times over. You could also track every square meter of the Earth’s surface. Given there are 10,000 square centimeters in a square meter, you could actually track every square centimeter a few times over.
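
The bit counts in the table are just the ceiling of a base-2 logarithm. A quick sketch, using a few of the quantities from the table:

    import math

    quantities = {
        "humans on Earth": 7_800_000_000,
        "inches to the Moon": 15_000_000_000,
        "inches to the Sun": 5_800_000_000_000,
        "estimated ants on Earth": 1_000_000_000_000_000,
    }

    for name, count in quantities.items():
        print(name, "->", math.ceil(math.log2(count)), "bits")  # 33, 34, 43, 50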

Are 64-bit words enough?

The industry went to 64 bits as a way of future-proofing technology. For RAM and file references it can address more bytes than there are square centimeters on Earth’s surface. More practically, 64 bits can address an array of 1 million 16 TB hard drives.

With Moore’s law in mind, some people feel we will reach the limits of 64-bit addressing in roughly 30 years. When ZFS was created, 128 bits was chosen so that, at that same estimated rate of growth, it would last for hundreds of years of future advances.

All of that is somewhat naive, though. Growth in the storage density of RAM and hard drives is slowing. Without a revolutionary advance in the fundamentals of solid-state physics, we are approaching the lower limits of feature size. To exceed 64 bits of addressing we would need to grow in physical size and complexity. Very possible on a global scale, but not likely for a single person, family, or small community.

The sheer amount of data is huge. For personal consumption and production, 64 bits is overkill.

A modest word

In many cases, carrying around 64 bits to address things isn’t a huge problem. There were cases in the early days where pointers consumed too much memory and developers chose to compile for 32 bits to keep memory consumption down; these days memory is cheap. For files, keeping 64 bits around for sizes isn’t a huge burden, and it really opened up file formats like Zip. Each file now carries two numbers widened from 32 to 64 bits, an overhead of a handful of bytes per file in the archive. Nothing to notice.

The problem starts to crop up when each record needs to carry around multiple sizes and pointers. If I have a file of 10 billion records, do I really need to store the size and location of each record in 64 bits? I would have nowhere to store the records if the collection came anywhere close to the 64-bit limit in size. There is no technology on the horizon that will make a useful record collection approaching that size.

For this reason, many people still use 32-bit words. They are half the size and can refer to a few billion items, or to a single item up to a couple of GB in size. That is useful in many cases, but not an answer in itself.
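
To make the trade-off concrete, here is a sketch of what field width costs at scale, assuming a hypothetical index that stores one offset and one length per record:

    RECORDS = 10_000_000_000  # the 10 billion record file from above

    for bits in (32, 48, 64):
        entry_bytes = 2 * bits // 8  # one offset plus one length per record
        total_gb = RECORDS * entry_bytes / 1e9
        print(f"{bits}-bit fields: {entry_bytes} bytes/entry, {total_gb:.0f} GB of index")

The 48-bit index weighs in at 120 GB against 160 GB for 64-bit fields, while still able to address far more data than the records could plausibly occupy.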

We can take a page out of networking for a modest solution. IPv4 has a 32-bit address space, and we have to take special measures to make it work worldwide. IPv6 is 128 bits and a monster to handle, so much so that it has essentially failed. The hardware address each network device is given at the factory is 48 bits, and it looks to be just about right. The cover image is one of these, known as a MAC address.

48 bits lets us refer to roughly 280 trillion things (256 trillion by our rule of thumb). Sure, I can buy a drive array bigger than that, but it would also heat my house. I’m also not likely to fill that much space, let alone use that much data for anything sensible.

48 bits will let me measure the size of any reasonable file I will have in the foreseeable future. It will also let me count any set of records or items that I care to count in one place. If Zip had gone to 48 bits for file sizes, no one would have seen a reduction in usability. The same would have been true of most file systems.

I’m not recommending that CPUs switch to 48-bit words or address RAM with 48 bits (although a custom CPU architecture could be fairly optimal at that word size). 64-bit integers are a decent size for representing values and tracking concepts. Where I’m making a modest suggestion is everywhere outside the CPU.

Databases, file formats, and network references can mostly be handled by 48 bits. Think I’m wrong? How many databases have more than 256 trillion records? At that size they should be handled by a custom database anyway. How many file systems, or individual files, are more than 256 trillion bytes in size? If you have that much storage, you can use ZFS and be happy. Until then, my modest 5 TB is seeing some waste, especially with my largest file measuring in at 80 GB.

For a network, any single addressable set of machines is going to fit within 256 trillion until we can communicate faster than the speed of light. Communicating with another planet or solar system is going to involve long latencies and probably shouldn’t be in the same address space. Even when a galactic-sized civilization emerges, a 48-bit system id with a 48-bit address will handle anything we throw at it in 96 bits, fewer than the 128 IPv6 peddles today.

When we read or write our large size and reference numbers from I/O channels we often have to switch byte order anyway. Doing that for 48 bits instead of 64 is not a computational disadvantage, yet it is a large savings in space as the records become more numerous.
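
Python’s int.to_bytes and int.from_bytes take a byte count directly, so a 48-bit field is no harder to serialize than a 64-bit one. A minimal sketch, with helper names of my own invention:

    def write_u48(value):
        # Serialize an unsigned 48-bit integer as 6 bytes in network (big-endian) order.
        return value.to_bytes(6, "big")

    def read_u48(buf):
        # Deserialize 6 big-endian bytes back into an unsigned 48-bit integer.
        return int.from_bytes(buf[:6], "big")

    offset = 212_000_000_000_000  # well within the 48-bit range
    assert read_u48(write_u48(offset)) == offset

Let’s look at a few examples; the arithmetic is spot-checked in a short sketch after the list.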

  • On my largest computer at home I have 5 TB of SSD that is about half used. It holds 1,749,470 files as of this writing. If each file were stored with a 48-bit pointer and a 48-bit size instead of 64-bit ones, I would save 6,997,880 bytes (4 bytes per file). With file system implementation details and overhead, the real number is probably a few times that.

  • The NYC Taxi data from 2009-2015 covers 1.1 billion trips. Storing that in a database means the index needs at least 1.1 billion pointers. Using 48 bits is a saving of 2.2 GB for that one index.

  • IPv6 has a minimum packet header of 40 bytes and requires additional headers for a lot of things. If we took the 20-byte IPv4 header and changed it to 48-bit addresses, the size would be 24 bytes. That adds up on the wire: 10 Gbps Ethernet can carry about 14 million small packets per second, at which rate the extra 16 bytes per packet comes to 1.792 Gbps of overhead for IPv6 versus 48-bit addresses. That is up to 18% additional overhead on the wire!

  • The Wayback Machine claims to have 624 billion web pages made searchable from the Internet. Just enumerating them with 48-bit numbers instead of 64-bit ones would save about 1.25 TB. We can assume they also need pointers into storage, but they are one of the few places where 64 bits should really be needed.
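
The savings in these examples are easy to spot-check with the numbers given above:

    # File table: two fields each shrink by 2 bytes going from 64 to 48 bits.
    print(1_749_470 * 4)                       # 6,997,880 bytes saved

    # Taxi index: one pointer per trip, 2 bytes saved each.
    print(1_100_000_000 * 2 / 1e9, "GB")       # 2.2 GB saved

    # Ethernet: 40-byte IPv6 header vs a 24-byte 48-bit one, 16 bytes per packet.
    print(16 * 8 * 14_000_000 / 1e9, "Gbps")   # 1.792 Gbps of overhead

    # Wayback Machine: 2 bytes saved per page number.
    print(624_000_000_000 * 2 / 1e12, "TB")    # about 1.25 TB saved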

Parting thoughts

While I don’t think it is reasonable to run out and make wholesale changes to working systems, you should consider this modest word size when you start new work: a new file format, database core, network structure, whatever. If it makes sense to use a 48-bit number in place of a 64-bit number, maybe you should. You could be saving resources 2 bytes at a time.