Whisper

Posted on Wed 15 February 2012

In 2012 I had an idea, inspired by the public outcry after ACTA was signed by our government, that I wanted to share. After 7 years I still think there's some merit to it, though I'd construct it vastly differently, and I'm quite certain it wouldn't work for practical reasons (people don't care).

Intro

The Internet has centralized around major IT companies: most user content is shared on social networks, and the rest of the Internet has gone strictly commercial. The core of the concept is a distributed data-sharing platform with enough high-level functions to make it easy to build various distributed applications.

The original idea assumed:

  • Built-in search engine able to gather results from distributed nodes.
  • Built-in ECC cryptography: data is generally public and nodes are not trusted, so content can be signed to prove identity, or encrypted.
  • Handle search for messages-signed-by, messages-encrypted-for, #-tags, @-references, etc.
  • No blockchain / required proof-of-work (spam, or being forced to host child porn, would be a hard problem anyway).
  • Message length limited by the UDP packet size / Ethernet MTU - around 500 bytes.

I don't believe this could work as a zero-trust distributed network with thousands of nodes - but it could be useful as a kind of high-level database, where you set up servers with your friends to host a resilient multi-server distributed database for your own use. The minimum functionality would have to suffice for: twitter-like or email-like messaging, forums, and wiki- or blog-like pages.

Networks with valuable content could get backed up automatically by others - the database API would be common and public.

What follows is an almost unmodified org export of the 2012 brainstorm.

Basis of the idea

Success of the social networking short message passing

OK - in fact - of Twitter. But not only: this also extends the more basic ideas of email and IM. For reference, Twitter is said to handle 300×10⁶ tweets per day.

Censorship in certain countries

Twitter was recently forced to introduce censorship in certain countries. Others, while talking about limiting computer "piracy", usually limit people's freedom.

People value their freedom

Or at least so it seems. In my country there was recently a pretty big outcry after the government signed the ACTA agreement. Still, this freedom needs to be well-packaged and interesting to get people to defend it.

Sister projects

GNUtella - for its distributed search and list of active hosts.

Freenet - works in a pretty similar way and has an encrypted local datastore. It's way more general, heavier, and data-neutral - you can't decide not to help share pedophile material while using it. Some people might dislike that.

Bitcoin - Similar keying idea + computational generation of more resources using SHA.

TOR - "circuits" and message passing, but no local datastore.

CAN - Content Aware Networks - the idea is similar in some places.

Email - decentralized, but much less anonymous, and private by default rather than public.

Name

"Whisper" is a name taken by multiple protocols and systems and Weespur is stupid. I'd focus on the database side of the project though, not message-passing. "The Internet" would be a cool name, not sure if it's taken.

Required features

Whispers

A whisper is a message with a timestamp and authorship data, connected (or not) to a thread, which may contain hashes (#icecream) for searching, plus locality and language data.

Some #tags could be reserved, like '#userprofile' - so that user data is always tagged the same way and searching for a user plus this hashtag returns their description.
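
To make the shape concrete, here is a minimal sketch of a whisper as a record; the field names are mine, not a spec:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Whisper:
    """One short, signed message. Field names are illustrative only."""
    timestamp: int                     # UTC timestamp
    author_key: bytes                  # ECC public key; zeroed = anonymous
    signature: bytes                   # ECC signature over the content
    content: str                       # ~500 bytes of UTF-8 at most
    thread: Optional[bytes] = None     # SHA1 of the whisper this answers
    recipient: Optional[bytes] = None  # public key, for encrypted messages
    tags: List[str] = field(default_factory=list)  # e.g. ["#userprofile"]
    language: Optional[str] = None
    country: Optional[str] = None
```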

Totally decentralised

Clients keep a list of known parts of the network and dynamically refresh it while communicating with other nodes. Connections are made to X randomly selected hosts out of the Y most recently contacted.

No mandatory centralised servers, although it's possible to set up a headless server for caching/proxying etc.; it should even be promoted. In theory, self-moderation and self-filtering of the network require most participants to run their own clients rather than use WWW access, although smart WWW gates could fix that.

Users can setup headless servers for:

  • proxying publication/search requests
  • keeping normal cache
  • archival servers (with an unlimited/big cache and no search capability - the search DB is larger, but, well).
  • proxies with a web interface for browsing their resources.

All nodes should participate in forwarding publication/search requests.

For example, a university might create such a server so all its students can communicate with each other (the local cache would increase the chance of locating each other's whispers).

Although there's no 'user of the server' concept in the system, cache filtering should allow the university to never evict the data of 'their' users. It would still cache all search requests/answers.

Cryptographical ensurance of authorship

Similar in idea to Bitcoin. Whispers are published along with the user's key and a signature. Users themselves might keep a key→name list to decipher the names of their close friends. Whispers contain a human-readable 'signature' which can be set to anything; it might be remembered in the list and validated for later whispers (e.g. the signature shown in red if it doesn't match the remembered key).
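
As a sketch of that flow using Python's cryptography package - note the notes here assume 160-bit ECC keys, while I use the more common SECP256K1 curve purely for illustration:

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec

# Each user holds a long-lived EC key pair; the public key is their identity.
private_key = ec.generate_private_key(ec.SECP256K1())
public_key = private_key.public_key()

message = "first whisper #hello".encode("utf-8")

# Whispers are published along with the public key and this signature.
signature = private_key.sign(message, ec.ECDSA(hashes.SHA256()))

# Any node can verify authorship; raises InvalidSignature on a mismatch.
public_key.verify(signature, message, ec.ECDSA(hashes.SHA256()))
```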

Local whisper cache

The network holds data in a decentralised fashion, so there's only some probability of finding a given piece of information in the network. It's important to both maximize this probability and accept that it can't reach 1. More popular whispers will be found easily.

Users can set filters on what they want to cache, can force caching of their own whispers, etc. You can also publish a message and turn off your client - other clients will cache it for some time, so anyone interested will be able to read it.

The local cache is not encrypted, but it could be - accessing the client would then require a password.

The local cache must be structured in a way which allows fast searches.

Distributed search (think: GNUtella)

The network can be searched for data. Searches can be automated (search for new things of certain types every 2 hours) or manual. You are also periodically sent new publications, which might appear automatically on your 'wall' if they match your search criteria. You can have a set of personal favourite searches which make up your 'buddies' (a search for whispers of a certain user) and your 'wall' (the searches you're most interested in; probably a huge combined request…).

A distributed cache also means that searches won't be deterministic. Searching for the same data twice might return different results; popular whispers will always appear, while less popular ones might pop in and out at random. Searches are naturally sorted by the whisper timestamp. All searches are locally cached. The number of results is limited - if the user doesn't find what he needs in about… 1000 whispers, he should specify his search criteria better.

There's a limit on the length of a query (packet size).

Optional: Private messaging with the encryption

Easy: encrypt "for" someone. Users might filter those out and even forbid their caching.

Popularity, self-filtering

Fresh entries are promoted by the network; old ones might vanish over time. This depends mostly on the servers' caches - they have some total capacity. If a user wants to retain his entries forever, he should start his own node.

What users can do
  • Can set filters on content they dislike (hashes, authors, threads - anything we search by). The client can do this automatically when the user selects "dislike message" or "dislike author".

  • Can set filters on content they like.

These filters affect the search, publication and caching algorithms.

  • Users can respond to a whisper, which increases its chance of being visible to people who like the author (and filter them positively). (+1!)

To fulfil this, some actions might trigger republication of a whisper, which increases its popularity. Searches alone cause many nodes to cache the searched data, thus propagating interesting whispers.

Interface

Thick clients could work, but servers hosting web interfaces would probably be required. Not sure if the apps should be hosted by the DB itself - rather not. Let's do one thing, and do it right.

With a JS-only interface, though, it could be possible to host the app's JS code itself, fetch it directly from a server and run it in a sandbox.

Bitcoin has a similar problem - almost nobody uses thick clients, because they require many GBs and it's not easy to migrate the data between machines, mobiles, etc.

Details and how it works

Connectivity

We keep a list of network addresses which have participated in the network. As the number of headless/constantly running servers rises, it might make sense to keep a 'static list' as well as a 'dynamic list'.

Both lists might contain IPs and domains, although hosts automatically added to the dynamic list will be added by IP only.

Hosts which contact us are added to the list automatically. The list is sorted by host activity; more active hosts are more likely to be contacted for searches/publication, and long-forgotten hosts are eventually removed from the list.

List length might be limited.
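
A sketch of such a host list with recency ordering and a hard cap (the sizes and the X-of-Y parameters are invented):

```python
import random
from collections import OrderedDict

class HostList:
    """Most-recently-seen hosts, capped; pick X random hosts of the top Y."""

    def __init__(self, max_size=1000):
        self.hosts = OrderedDict()   # insertion order == recency order
        self.max_size = max_size

    def touch(self, addr):
        """Record activity: move the host to the front, evict the oldest."""
        self.hosts.pop(addr, None)
        self.hosts[addr] = True
        if len(self.hosts) > self.max_size:
            self.hosts.popitem(last=False)   # drop the long-forgotten host

    def pick(self, x=3, y=50):
        """Select X random hosts out of the Y most recently active."""
        top = list(self.hosts)[-y:]
        return random.sample(top, min(x, len(top)))
```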

Performance

Communication might be kept inside a SINGLE UDP packet, which makes searches/publications fast and light. In general there are the following request types:

  • Whisper / Publication
  • Search
  • Revoke (the owner might revoke their message; this might decrease its popularity, but it's doubtful it would remove it completely)
  • PING / PONG

Most of them contain a 'TTL' field which has a certain allowed range.

Whisper / Message

  • Timestamp: 8 bytes UTC timestamp
  • Thread: 'In response to' field. This whisper responds to another one by including its SHA1 sum (20 bytes).
  • Destination / to: Recipient key: 20bytes
  • Author: Total: 120 bytes
    1. Key: Elliptic Curve key - 160 bits ~ 20 bytes
    2. Signature: ECC signature - 640 bits? ~ 80 bytes
    3. Human readable sig: 20 bytes of UTF-8. De facto it is possible to publish an unsigned/anonymous whisper by including zeroed key/signature fields. It might be possible to filter messages using this method too.
    4. Optional: Creation timestamp. It might be fine to include a creation timestamp for an account verification algorithm.
    5. Optional: Verification block. This field (4 bytes) would be used if account creation should be paid for by the creator (spam limiting).
  • Recipient: Public key of the user this message is destined to (20 bytes).
  • Language / locale: Might be interesting for filtering information (only English, only something)
    1. Optional: Geographical location (country)
    2. Optional: timezone
    3. Optional: language
  • Bitfields: Is content encrypted?

Totals

  • IP header: 20 bytes
  • UDP header: 8 bytes

8 + 20 + 20 + 100 + 10 + 20 + 1 = 179 bytes → 1321 bytes left of a 1500-byte MTU. Let's reserve 500 bytes for other headers; we can safely assume there are 800 bytes for the rest.

200 characters should fit in at most 500 bytes… so let's assume 500 bytes per UTF-8 message.

Might be nice to be able to tunnel the traffic inside DNS requests.

Packet

TYPE | HEADERS | CONTENT

Protocol version: 0x80 - 0.01

TTL: Valid range is from 0 (don't send further) to 25.

Type: I'd rather keep a set of values in case I wanted to differentiate those… (for example, publishing like/dislike?)

  • 0x10 - 0x1E - Publication (15 values)
  • 0x1F - Revoke
  • 0x20 - 0x2E - Search (15 values)
  • 0x2F - Search response

  • 0xAX - management space
  • 0xA0 - Ping (max TTL=0)
  • 0xA1 - Pong
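
For reference, the same values as Python constants (the names are mine):

```python
PROTOCOL_VERSION = 0x80           # version 0.01

# Request types
TYPE_PUBLICATION = range(0x10, 0x1F)   # 0x10-0x1E, 15 values
TYPE_REVOKE = 0x1F
TYPE_SEARCH = range(0x20, 0x2F)        # 0x20-0x2E, 15 values
TYPE_SEARCH_RESPONSE = 0x2F

# Management space (0xAX)
TYPE_PING = 0xA0                  # max TTL = 0
TYPE_PONG = 0xA1

MAX_TTL = 25                      # valid TTL range: 0 (don't forward) to 25
```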

Publication packet

  1. Structure bitfield (1 byte)

    bit 0: Do we have a 'Response to' field
    bit 1: Do we have a 'Recipient' field
    bit 2: Do we have a timezone
    bit 3: Do we have a language
    bit 4: Do we have a country
    bit 5: reserved
    bit 6: Is the message encrypted
    bit 7: Is the message binary

    This tells us which fields to read later and shrinks the size of a message (see the encoding sketch after this list).

  2. Content length (2 bytes)

  3. Timestamp (8 bytes)

    I don't want a 2038 problem, and 8 bytes is MORE than enough. 5-minute resolution would also be fine, so one byte could be shaved off - but it might not be worth the complications.

  4. Author

    Key (around 20 bytes)
    Optional: verification block (4 bytes)

  5. – Up to this point length/data positions are constant - 38 bytes

  6. Optional: Recipient

    Key of the recipient (20 bytes)

  7. Optional: Thread

    SHA1 sum of the answered publication (20 bytes)

  8. Localization

    5 bytes of language (ASCII or NULLs)
    5 bytes of country (ASCII or NULLs)
    1 byte of timezone (-127 to 127; -128 means 'not given')

  9. Readable sig

    1 byte of length
    Up to 20 bytes of signature

  10. Content

  11. Message signature

    Signs pretty much everything except: the version, TTL, request type, structure bitfield and content length.
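
A sketch of an encoder for the constant 38-byte prefix plus the optional recipient/thread fields (localization, the readable sig and the trailing signature are omitted for brevity; the function and its defaults are my own, not a spec):

```python
import struct
import time

# Structure bitfield bits, from the list above
HAS_THREAD, HAS_RECIPIENT = 1 << 0, 1 << 1
HAS_TIMEZONE, HAS_LANGUAGE, HAS_COUNTRY = 1 << 2, 1 << 3, 1 << 4
IS_ENCRYPTED, IS_BINARY = 1 << 6, 1 << 7

def encode_publication(author_key: bytes, content: bytes,
                       thread: bytes = None, recipient: bytes = None,
                       verification: bytes = b"\x00" * 4) -> bytes:
    """Serialize the constant prefix, then the optional fields, then content."""
    assert len(author_key) == 20
    bitfield = (HAS_THREAD if thread else 0) | (HAS_RECIPIENT if recipient else 0)
    packet = struct.pack(
        "!BBBBHQ",            # network byte order (an assumption)
        0x80,                 # protocol version 0.01
        25,                   # initial TTL
        0x10,                 # request type: publication
        bitfield,
        len(content),         # content length, 2 bytes
        int(time.time()),     # timestamp, 8 bytes
    )
    packet += author_key + verification      # 38 bytes so far - the constant part
    if recipient:
        packet += recipient                  # 20-byte public key
    if thread:
        packet += thread                     # SHA1 of the answered whisper
    return packet + content                  # the signature would follow
```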

Search packet

TODO

  1. Structure bitfield (1 byte)
  2. Query length
  3. Query contents

Local databases

Contains a cache of all whispers.

  • Has a user-settable limit on its size.
  • Old entries are removed when the limit is hit.
  • Assuming a pessimistic 1500 bytes per whisper, a 2GB DB could store over 1 mln entries - but without the search data.

Separate DB for searching

Index contains and can search for:

  • words (content/hashes/anything)
  • hashes
  • language
  • locality
  • encryption
  • author
  • thread
  • recipient

It will grow and be bigger than the main DB itself. Can be easily recreated.

Probably a Xapian DB. It might be possible to merge the two databases into one with Xapian, too; not sure if that's such a good idea. Reimplementing a DB certainly wouldn't be OK; maybe use Tokyo Cabinet… MongoDB is OK too, but this shouldn't be so heavy. A single Xapian DB, maybe with SQLite or another SQL database.
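
Purely as an illustration of the two-database split, here is how it could look with SQLite plus its FTS5 full-text extension (the schema is invented; inserts would have to keep the two rowids in sync):

```python
import sqlite3

db = sqlite3.connect("whisper-cache.db")

# Main cache: raw whispers keyed by their SHA1.
db.execute("""CREATE TABLE IF NOT EXISTS whispers (
    sha1 BLOB PRIMARY KEY, timestamp INTEGER, author BLOB,
    thread BLOB, recipient BLOB, language TEXT, country TEXT,
    encrypted INTEGER, content TEXT)""")

# Separate full-text index; can always be recreated from the main table.
db.execute("""CREATE VIRTUAL TABLE IF NOT EXISTS whisper_index
              USING fts5(content)""")

def search(words, limit=1000):
    """Word search; results naturally sorted by the whisper timestamp."""
    return db.execute(
        """SELECT w.* FROM whisper_index i
           JOIN whispers w ON w.rowid = i.rowid
           WHERE whisper_index MATCH ?
           ORDER BY w.timestamp DESC LIMIT ?""",
        (words, limit)).fetchall()
```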

DoS filter lists

Filter/delay IPs which behave badly. If you're getting too many packets from a certain IP, increase the probability of dropping them. This should either be implemented well, with good (adaptive?) limits, or not at all.

If creating a user key is expensive, it might make sense to limit not per IP but per author.
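
A sketch of per-source throttling with a token bucket; all the limits are invented, and `source` can be an IP or an author key:

```python
import time
from collections import defaultdict

class Throttle:
    """Token bucket per source. Refills `rate` tokens/s up to `burst`."""

    def __init__(self, rate=5.0, burst=20.0):
        self.rate, self.burst = rate, burst
        self.tokens = defaultdict(lambda: burst)
        self.last = defaultdict(time.monotonic)

    def allow(self, source):
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last packet.
        self.tokens[source] = min(
            self.burst,
            self.tokens[source] + (now - self.last[source]) * self.rate)
        self.last[source] = now
        if self.tokens[source] >= 1.0:
            self.tokens[source] -= 1.0
            return True
        return False          # too chatty: silently drop the packet
```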

Connectivity in distributed network

To send a publication/search request, a client keeps a list of hosts seen in the network. The list is sorted from the most recently contacted host to the least. From the top Y entries, the software randomly selects X computers to contact.

The TTL of a packet is randomized from a range of values, so the contacted host can't determine whether the previous host produced the packet or is just proxying it. Each proxying host inserts random delays when resending packets, so that time-analysis of the packet source is complicated as well.

Each time the packet is resent, its TTL is decreased by some (more or less random) amount (1-3). The search/publication stops when the TTL reaches zero.

The initial delay before proxying should range between 0.1 and 2s; the delay between each packet should be smaller (0.1-0.5s).

If there is a list of static, "always to be contacted" addresses, they might be randomly included among the dynamically selected hosts.

A host keeps lists of recently published messages and forwarded search queries; if a newly received packet matches any entry on those lists, it is ignored.
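
The forwarding rules from the last few paragraphs in sketch form; the `hosts` selection and the `send` callback stand in for the real machinery:

```python
import random
import time

seen = set()   # fingerprints of recently forwarded packets

def forward(packet, ttl, hosts, send):
    """Re-send a packet to X randomly chosen hosts, following the rules above.

    `hosts` is the X-of-Y selection from the host list; `send(host, packet,
    ttl)` does the actual UDP write. Both are assumptions of this sketch.
    """
    key = hash(packet)
    if key in seen or ttl <= 0:
        return                                    # duplicate or expired: ignore
    seen.add(key)

    time.sleep(random.uniform(0.1, 2.0))          # initial proxying delay
    new_ttl = max(ttl - random.randint(1, 3), 0)  # randomized TTL decrement
    for host in hosts:
        send(host, packet, new_ttl)
        time.sleep(random.uniform(0.1, 0.5))      # delay between packets
```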

Publication

Publication packets don't get any response.

  1. Probability

    What % of the network can we reach with a publication, and in what time? What % of the network can we search? What is the estimated probability of finding a published whisper?

    Should we scale our requests with the number of known nodes?

  2. Anonymity

    The described connectivity algorithm doesn't ensure TOR-grade anonymity of publishers, but it makes it hard to determine who published a whisper without observing the traffic closely.

    From (random) time to time (in silence periods?) a client spontaneously republishes a whisper it has cached which has a high 'like' value (filters) and is relatively fresh.

    Whispers with an invalid (future) timestamp, fields, length or signature must be ignored.

  3. Filtering / self-moderation

    A node receiving a publication request might resend it to a larger or smaller set of nodes, depending on its local like/dislike filters. It may cache it or ignore it. It might decrease the TTL a little or a lot.

  4. What's the point?

    It should be hard to determine whether a user published the message, republished a whisper they had cached previously, or is forwarding a publication request they got some random time ago.

    By observing the network as a whole it will be hard - random delays plus TTL randomization and alteration make it impossible to determine a node's position in the forward chain in a large, world-wide network.

    Of course, if all traffic to and from a node is tracked, it is possible to prove that the node is a publication source/author.

    It would be possible to implement a TOR-like circuit algorithm, but it doesn't really make sense - if someone wants, they can publish using TOR. What matters most is that after publication it is impossible to determine the publisher.

  5. Further ways to increase anonymity / accessibility

    It wouldn't be hard to tunnel the initial publication packet inside a DNS request, if the message is not too long. It could then be forwarded via any selected public DNS resolver to a DNS-whisper gate.
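
A sketch of the encoding side: base32 the packet into 63-byte DNS labels under a (hypothetical) gate domain, staying below the 253-byte name limit - which is also why only short messages qualify:

```python
import base64

GATE = "whisper-gate.example.org"   # hypothetical DNS-whisper gate
MAX_NAME = 253                      # DNS name length limit

def to_dns_name(packet: bytes) -> str:
    """Encode a packet as DNS labels; each label is limited to 63 bytes."""
    encoded = base64.b32encode(packet).decode("ascii").rstrip("=").lower()
    labels = [encoded[i:i + 63] for i in range(0, len(encoded), 63)]
    name = ".".join(labels + [GATE])
    if len(name) > MAX_NAME:
        raise ValueError("message too long for a single DNS query")
    return name
```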

Searching

Each host which receives the search packet searches its local cache for X matching whispers. Any found are marked as interesting (to someone). The host then sends responses to the previous host, one datagram per found whisper, at most 10 whispers, and that host forwards them back to the node which sent the search query.

A search can start with a local cache search; only if the newest packet found is older than X[s] is the external search started.

The querying host adds all new whispers to its database (ignoring ones already known) and runs the search on its own cache to get good search relevance/ordering. The user can force searches to be local-only.
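
The per-node search handling described above, in sketch form (`cache`, `reply` and `forward_deeper` are stand-ins for the real machinery):

```python
MAX_RESPONSES = 10   # at most 10 whispers per search, as described above

def handle_search(query, cache, reply, forward_deeper):
    """Answer from the local cache first; forward only if results are scarce."""
    hits = cache.search(query, limit=MAX_RESPONSES)
    for whisper in hits:
        whisper.interesting = True      # mark as interesting (to someone)
        reply(whisper)                  # one datagram per found whisper
    if len(hits) < MAX_RESPONSES:       # not enough fresh results locally
        forward_deeper(query)
```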

Search responses are given the search ID (SHA1 of the search packet fields) and otherwise carry similar data as the publication packet.

Security

The author's private key might be passphrase-encrypted - that makes it harder to steal an identity and to prove authorship of whispers.

Statistical analysis of the local cache might be enough to prove authorship (do we always cache our own messages?).

Analysis of the total traffic to/from a node might reveal its like/dislike filters.

The user is at risk only during publication, so in fact one could use TOR to publish whispers.

Malicious behaviour

SPAM

As usual. Twitter has a problem with spam too. Good user-set filtering will limit the problem (filter out too-short messages, messages with just a link, etc.), but the problem will still persist.

DOS

Sending a single simple packet can induce large network traffic - that's one of the reasons the packets are rather small. The problem can be limited in a few ways:

  • The network is probabilistic, so we can limit traffic at the node and silently drop incoming packets. This might DoS the network a bit, but should save the internet.
  • Keep a list of too-active hosts (publications/searches) and delay their requests.
  • Require computing some value which is hard to compute yet pretty easy to check; it should also scale nicely… Resending the same computed value will cause most nodes to drop it (already-answered list). See the sketch below.
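
A hashcash-style stamp is the classic shape of that last bullet; a minimal sketch, with an invented difficulty:

```python
import hashlib
from itertools import count

DIFFICULTY = 20   # leading zero bits required; tune to taste

def stamp(payload: bytes) -> int:
    """Find a nonce so SHA1(payload + nonce) has DIFFICULTY leading zero bits."""
    for nonce in count():
        digest = hashlib.sha1(payload + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") >> (160 - DIFFICULTY) == 0:
            return nonce   # hard to find...

def check(payload: bytes, nonce: int) -> bool:
    """...yet trivial to verify: a single hash."""
    digest = hashlib.sha1(payload + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (160 - DIFFICULTY) == 0
```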

It would be possible to make publication harder - force such a computation for every whisper.

Publishing files

The network is designed for short messages, but of course a file can be split into multiple messages and stored anyway. This can slow down the network, so it's better to control it than to ignore it.

Files could be stored as a thread of messages, with the first message describing the name/title and the following ones containing nothing but the 'thread' id and the contents. The 'binary' and 'multipart' bitfields should be set.

Users might of course filter such messages from their cache.

Basic math

In general, long/narrow chains of connections are preferred to short/wide ones (number of directly contacted servers vs. TTL value). This increases anonymity and allows far better distribution of data and network self-filtering (by adjusting the TTL).

C - sending a message in parallel to C computers
T - TTL

  • With a higher C the chance of hitting a not-connected computer is higher.
  • Users can't be controlled in what C they use, so we must limit the TTL correctly… Still, they can only change C at their own host, not farther down the chain.
| C \ T |   2 |     3 |      4 |       5 |         6 |          7 |           8 |             9 |             10 |        11 |         12 |
|-------|-----|-------|--------|---------|-----------|------------|-------------|---------------|----------------|-----------|------------|
|     1 |   2 |     3 |      4 |       5 |         6 |          7 |           8 |             9 |             10 |        11 |         12 |
|     2 |   6 |    14 |     30 |      62 |       126 |        254 |         510 |         1.022 |          2.046 |     4.094 |      8.190 |
|     3 |  12 |    39 |    120 |     363 |     1.092 |      3.279 |       9.840 |        29.523 |         88.572 |   265.719 |    797.160 |
|     4 |  20 |    84 |    340 |   1.364 |     5.460 |     21.844 |      87.380 |       349.524 |      1.398.100 | 5.592.404 | 22.369.620 |
|     5 |  30 |   155 |    780 |   3.905 |    19.530 |     97.655 |     488.280 |     2.441.405 |     12.207.030 |         X |          X |
|    10 | 110 | 1.110 | 11.110 | 111.110 | 1.111.110 | 11.111.110 | 111.111.110 | 1.111.111.110 | 11.111.111.110 |         X |          X |
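
Each contacted host forwards to C more, so with TTL = T the total is the geometric sum C + C² + … + C^T. A quick sanity check of the table:

```python
def coverage(c: int, t: int) -> int:
    """Hosts reached when every node forwards to C others and TTL = T."""
    return sum(c ** i for i in range(1, t + 1))

assert coverage(3, 3) == 39
assert coverage(4, 5) == 1364
assert coverage(10, 4) == 11110
```

With C=3 and T=9 this gives 29.523 hosts - scaled by the 0.8 delivery probability assumed below, that's roughly the 24000 publication packets used in the traffic estimate.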

Network background

I find it hard to estimate correct/good/bad values, so let's start by defining some background.

Downloading a 650MB file over the network (many media downloads, I guess): assuming only 10 bytes for the protocol, that's at least 444k packets, each of which will be ACKed -> probably over 900 thousand packets.

A 3.5MB file (mp3?) - 4787 packets. Opening a portal page (1.4MB) - 1915 packets.

Twitter has 300 mln tweets per day. Each tweet takes more than a single packet; let's assume 5 packets per tweet, which gives 17.361 packets per second (that would take less than 26MB/s of traffic).

In Whisper's case the load is distributed, so it might be larger. For the sake of calculations, let's assume we can reach 1 mln users.

There's also a probability of losing a packet (or sending to a dead host, or to a host which already got that packet) each time. The final result depends strongly on the moment of the packet drop.

Publication

With a publication we want to reach only some of the network; the rest should be reached by searches/republication. We don't get responses, therefore we can reach more hosts than during a search.

Assuming C=3, T=9 and a 0.2 probability of packet loss, the network will send 24000 packets during the publication phase, which in the worst case transfers 36MB of data.

Initial network coverage is not great - around 2.4% - but the whisper will later be republished to other hosts during searches.

Search

During a search, hosts running the query return response packets, which greatly increases total traffic.

Each host searches its own cache first. If it has more than X (fresh) results, the request might not be forwarded further. Otherwise the host forwards the request and gathers responses from those hosts. If there aren't enough results for the query, or the query is very specific (SHA1 of a whisper), the search can be deepened.

The requesting host doesn't want to receive too much traffic for a single search query; 1MB is at least 666 packets. Assuming C=3, that would give about 200 packets/responses per host.

No duplicates should be returned, which decreases the total number of responses.

C=3:

| TTL        |   5 |   4 |  3 |  2 | 1 | 0 |
|------------|-----|-----|----|----|---|---|
| Depth:     |   1 |   2 |  3 |  4 | 5 | 6 |
| Responses: | 600 | 200 | 66 | 22 | 7 | 2 |

C=4:

| TTL        |   4 |   3 |  2 | 1 | 0 |
|------------|-----|-----|----|---|---|
| Depth:     |   1 |   2 |  3 | 4 | 5 |
| Responses: | 600 | 150 | 37 | 9 | 2 |

Well, 1000 results might make a bit more sense; 10 / 10 pages (+ local cache)...

C=3:

| TTL        |    5 |   4 |   3 |  2 |  1 | 0 |
|------------|------|-----|-----|----|----|---|
| Depth:     |    1 |   2 |   3 |  4 |  5 | 6 |
| Responses: | 1000 | 333 | 111 | 37 | 12 | 4 |

C=4:

| TTL        |    4 |   3 |  2 |  1 | 0 |
|------------|------|-----|----|----|---|
| Depth:     |    1 |   2 |  3 |  4 | 5 |
| Responses: | 1000 | 250 | 62 | 15 | 3 |

This could be tweaked. For example, the maximum allowed TTL could be 9: if no data has been found yet, search deeper; if you get a lot of pretty fresh responses from the local cache, decrease the TTL. The total responses might vary… but let's estimate traffic first.

The node starting the search would receive all responses - 1000 to 5000 packets, i.e. 1.5MB to 7.5MB. That's pretty pessimistic, as nodes lower in the tree would remove duplicates and decrease the total number of results. Nodes one level deeper in the search tree will only receive 1/C of the responses - 250-1250 packets, i.e. 0.4-1.875MB. (I'm assuming the pessimistic case of each message taking 1500 bytes.)

Spam protection

If the network is going to be open for publication, there needs to be some kind of spam protection, for example proof-of-work. But then we end up with a blockchain-like solution, which I don't like for a generic system.

Using network accounts with federated networks would probably be better.

Using UDP packets can end up creating DDoS-amplification gadgets. Care must be taken.

