WEESPUR -*- mode: org; coding: utf-8; fill-column: 75; mode: auto-fill -*-

Table of Contents

1 Basis of the idea

1.1 Success of the social networking short message passing

Ok - in fact - of Twitter. But not only, this also extends more basic ideas of Email and IM. For reference Twitter is said to handle 300*106 tweets per day.

1.2 Censorship in certain countries

Twitter recently was forced to introduce censorship in certain countries. Other, while talking about limiting computer "piracy", usually limit peoples freedom.

1.3 People value their freedom

Or at least so it seems. In my country recently there was pretty big outcry after the goverment signed the ACTA agreement. But still, this freedom, needs to be well-packaged and interesting to get people to defend it.

2 Sister projects

GNUtella - for it's distributed search and list of active hosts. Freenet - works in pretty similar way, has a local datastore which is encrypted. It's way more general, heavier and it's data-neutral - you can't decide not to help share pedophile material while using it. Some people might dislike it.

Bitcoin - Similar keying idea + computational generation of more resources using SHA.

TOR - "circuits" and message passing, but no local datastore. CAN - Content Aware Networks - idea is similar in come places.

Email - is decentralized, but is much less anonymous and by default private, not public.

3 Name

Weespur. 'We' is nice. 'Spur' (or 'stir' or spear) is not bad either. Pronounced just like a 'whisper'. Also I'm going to call messages 'whispers' (in short whisp/wheesp), which is longer but better than 'tweet'. Whatever.

4 Required features

4.1 Whispers

It is a message with a timestamp, authorship data, connected (or not) to a thread, which may contain hashes (#icecream) for searching and locality + language data.

Some #tags could be reserved. Like: '#userprofile' - so that user data is always tagged the same way and searching for a user + this hashtag returns his description.

4.2 Totally decentralised

Clients keep a list of a known parts of the network. Dynamically refresh it while communicating with other nodes. Connections are made to a X, randomly selected hosts, of Y most recently contacted.

No mandatory centralised servers; although it's possible to set a headless server for caching/proxying etc. It should even be promoted. In theory self-moderation and self-filtering of the network requires most of the participators to set their own clients and not use WWW access. Although smart WWW gates could fix such issue.

Users can setup headless servers for:

  • proxying publication/search requests
  • keeping normal cache
  • archival servers (with unlimited/big cache and none searching capabilities - as search DB is larger, but, well).
  • proxies with web interface for browsing their resources.

All nodes should participate in forwarding publication/search requests.

For example university might create such a server so all it's students might communicate with each other (this would increase the chance of locating each other whispers by creating a local cache).

Although there's no 'user of the server' idea in the system the cache filtering should allow university to never remove the data of 'their' users. It would still cache all search requests/answers

4.3 Cryptographical ensurance of authorship

Similar in the idea to the bitcoin. Whispers are published along the user key and signature. Users themselves might keep a key->name list to decypher names of their close friends. Whispers contain a human-readable 'signature' which can be set to anything. It might be remembered in the list and validated for later whispers (like - signature in red if it doesn't match remembered key)

4.4 Local whisper cache

Network holds data in decentralised fashion. Therefore there's only some probability of finding certain information in the network. It's important to both maximize this probability and accept that it can't reach 1. More popular whispers will be easily found.

User can set filters to what he wants to cache, can force caching of his own whispers etc. You also can publish a message and turn off your client - other clients will cache it for some time so anyone interested will be able to read it.

Local cache is not encrypted, but it could - accessing the client would require a password then.

Local cache must be structured in a way which allows fast searches.

4.5 Distributed search (think: GNUtella)

Network can be searched for the data. Searches can be automated (search for new things of certain types every 2 hours), or manual. You are also periodically send new publications which might appear automatically on your 'wall' if they match your search criteria. You can have a set of your personal favorite searches which create your 'buddies' (search for whispers of certain user), 'wall' (searches you're mostly interested in. Probably a huge combined request…).

Distributed cache also means that the searches won't be deterministic. Searching for the same data twice might return different results. Popular whispers will appear always, while less popular might pop in and out at random. Searches are naturally sorted by the whisper timestamp. All searches are locally cached. Number of results is limited - if the user doesn't find what he needs in about… 1000 whispers - he should specify search criteria better.

There's a limit on the length of a query (packet size).

4.6 Optional: Private messaging with the encryption

Easy. Encrypt "for" someone. Users might filter those out and even forbid it's caching.

4.7 Optional: Tor-like anonimity

I weren't sure it was worth it and I believe it's not. If someones wants tor-like anonimity he can use tor for publishing. There might even be hidden weespur proxies inside TOR.

After publication entries will be totally anonymous. During publication there're certain measures taken to make it difficult for anyone to determine who published a message, but it's not Tor-grade.

4.8 Popularity, self-filtering

4.8.1 Fresh data first

Fresh entries are promoted by the network, old might in some time vanish. This depends mostly on the servers cache - They have some total capacity. If user wants to retain his entries forever - he should start his own node.

4.8.2 What users can do

  • Can set filters to the content they dislike (hashes, authors, threads, anything we search by) This can be done automatically by client after selecting "dislike message", "dislike author".
  • Can set filters to the content they like.

These filters have impact on algorithms of searching, publication and caching.

  • User can respond to a whisper which will increase it's chance to be visible by people who like the guy (and filter him positively). (+1!) To fullfil this

Some actions might trigger republication of a whisper which will increase it's popularity. The searches alone cause many nodes to cache the searched data thus propagating interesting whispers.

4.9 Interface

The program should be split into few parts:

4.9.1 Library

Handles local database, caching, connectivity, publishing, searching and all other low-level things.

4.9.2 GUI Interface

Creates a stand-alone application with user GUI. Basic app. It ensures best anonimity, has a local cache and actively works in the network. Is certainly recommended.

4.9.3 Web interface

It is possible to create a web application for browsing the network. It consist of a stand-alone server with his cache performing all low-level functions (proxying) + web interface.

This approach looses user anonimity, decreases network cache but otherwise is just fine. The server can with its cache fix this problems.

Browsing the network should naturally work without problems, self-filtering might be altered badly, but smart web servers could fix it (by averaging users likes/dislikes).

5 Details and how it works

5.1 Connectivity

We keep a list of network addresses which participated in the list. As the number of headless/constantly working servers will rise it might make sense to keep a 'static list' as well as 'dynamic list'.

Both lists might contain IPs and domains, although hosts automatically added to the dynamic list will be added by IP only.

We add to the list automatically the hosts which have contacted us. List is sorted according to host activity; more active hosts are more likely to be contacted for searches/publication, and long-forgotten hosts might be eventually removed from the list.

List length might be limited.

5.1.1 Performance

Communication might be kept inside a SINGLE UDP packet which makes the searches/publications fast and light. In general there are following requests:

  • Whisper / Publication
  • Search
  • Revoke (owner might revoke it's message; might decrease it's popularity, but doubtful if it will remove it totally)
  • PING / PONG

Most of them contain a 'TTL' field which has a certain allowed range.

5.2 Whisper

5.2.1 Timestamp

8 bytes UTC timestamp

5.2.2 Thread

'In response to' field. This whisper responds to another one by including it's SHA1 sum (20 bytes).

5.2.3 Destination / to

Recipient key: 20bytes

5.2.4 Author

Total: 120 bytes

  • Key
    Elliptic Curve key - 160 bits ~ 20bytes
  • Signature
    ECC signature - 640bits? ~ 80 bytes
  • Human readable sig
    20 bytes for UTF-8

    De facto it is possible to publish an unsigned/anonymous whisper by including zeroed key/signature fields. It might be possible to filter messages using this method too.

  • Optional: Creation timestamp
    It might be fine to include a creation timestamp, for account verification algorithm.
  • Optional: Verification block
    This field (4bytes) would be used if the account creation should be paid by the creator (spam limiting).

5.2.5 Recipient

Public key of a user this message is destined to. (20bytes)

5.2.6 Language / locale

Might be interesting for filtering information (only English, only something) Optional: Geographical location (country) Optional: timezone Optional: language

3 language, 3 country, 3 timezone Should fit in 10 bytes

5.2.7 Bitfields

Is content encrypted? At least byte.

5.2.8 Totals

IP header: 20 bytes UDP header: 8 bytes

8 + 20 + 20 + 100 + 10 + 20 + 1 = 179 –> 1321 bytes left Let's reserve 500 bytes for other headers, we can safely assume there's 800 bytes for the rest.

200 characters should fit in at most 500 bytes… Assuming 500 bytes per UTF-8 message.

Might be nice to be able to tunnel the traffic inside DNS requests.

5.3 Packet

<VERSION|TTL|REQUESTTYPE| HEADERS | CONTENT > Protocol Version: 0x80 - 0.01

TTL: Valid range is from 0 (don't send further) to 25.

Type: I'd rather keep a set of values if I wanted to differentiate those… (For example publicate like/dislike?) 0x10 - 0x1E - Publication (15 values) 0x1F - Revoke 0x20 - 0x2E - Search (16 values) 0x2F - Search response

0xAX - management space 0xA0 - Ping (Max TTL=0) 0xA1 - Pong

5.3.1 Publication packet

  • Structure bitfield (1 byte)
    01234567 bit 0: Do we have a 'Response to' field bit 1: Do we have 'Recipient' field bit 2: Do we have a timezone bit 3: Do we have language bit 4: Do we have a country bit 5: reserved bit 6: Is message encrypted bit 7: Is message binary This tells us what fields to read later and shrinks the size of a message.
  • Content length (2 bytes)
  • Timestamp (8 bytes)
    I don't want 2038 problem, yet 8 bytes is MORE than enough. It should be fine to have 5 minute resolution too, so 1 byte could be out - but it might not be worth complications.
  • Author
    Key (around 20 bytes) Optional: verification block (4 bytes)
  • – Up to this point length/data positions are constant - 38 bytes
  • Optional: Recipient
    Key of recipient (20bytes)
  • Optional: Thread
    SHA1 sum of answered publication (20bytes)
  • Localization
    5 bytes of language (ASCII or NULLS) 5 bytes of country (ASCII or NULLs) 1 byte of timezone (-127 to 127; -128 means 'not given')
  • Readable sig
    1 byte of length Up to 20 bytes of signature
  • Content
  • Message signature
    Does sign pretty much everything except: the version, TTL, requesttype, structure bitfield, content length

5.3.2 Search packet

TODO

  • Structure bitfield (1byte)
  • Query length
  • Query contents

5.4 Local databases

5.4.1 Contains a cache of all whispers.

  • Has a user settable limit on it's size.
  • Old entries are removed if limit is hit.
  • Assuming pesimistic 1500 bytes per each whisper, a 2GB DB could store over 1mln entries - but without a search data.

5.4.2 Separate DB for searching

Index contains and can search for:

  • words (content/hashes/anything)
  • hashes
  • language
  • locality
  • encryption
  • author
  • thread
  • recipient

It will grow and be bigger than the main DB itself. Can be easily recreated.

Probably Xapian DB. It might be possible to merge those databases into one with Xapian too. Not sure if it's such a good idea. It wouldn't be certainly ok to reimplement a DB; Maybe use tokyocabinet… mongodb is ok too, but this shouldn't be so heavy. Single XapianDB, maybe with sqlite/other SQL.

5.4.3 DoS filter lists

Filter/delay IPs which were behaving badly. If you're getting too much packets from certain IP - increase probability of dropping the packet. This should either be implemented fine with good limits (adaptive?), or not at all.

If creating a user key is expensive it might be fine to limit not per IP, but per author.

5.5 Connectivity in distributed network

To send a publication / search request, a client keeps a list of hosts seen in the network. The list is sorted from the most recently contacted host to least. From the top Y entries software selects randomly X computers which are going to be contacted.

TTL of a packet is randomized from a range of values, so the contacted host can't determine if the previous host produced the packet or is just proxying it. Each proxying host inserts random delays when resending the packets so that time-analysis of packet source is complicated as well.

Each time the packet is resend, it's TTL is decreased by some (more/less random) amount (1-3). The search/publication is stopped after the TTL zeroes.

Initial delay before proxying should range between 0.1 and 2s, delay between each packet should be smaller (0.1-0.5s).

If there is a list of static addresses, "always to be contacted", they might be included randomly in selected dynamically hosts.

Host keeps lists of recently published messages and forwarded search queries and in case the newly received packet matches any from the list - it is ignored.

5.5.1 Publication

Publication packets don't get any response.

  • Probability
    What % of the network can we reach with the publication and in what time? What % of the network can we search? What is an estimated probability of finding a published whisper?

    Should we scale our requests with the length of known nodes?

  • Anonimity
    Described connectivity algorithm doesn't ensure TOR-grade anonimity of publishers but makes it hard to determine who published a whisper without observing the traffic closely.

    From a (random) time to time (in silence periods?) client spontaneously republicates whisper it had cached which has a high 'like' value (filters) and is relatively fresh.

    Whispers with invalid (future) timestamp, fields, length, signature, must be ignored.

  • Filtering / self-moderation
    Node receiving publication request might resend it to a larger set of nodes or smaller - depending on local like/dislike filters. It may cache it or it may ignore it. It might decrease the TTL a little or a lot.
  • What's the point?
    It should be hard to determine if a user published the message or just republished the whisper it had cached previously, or if it is forwarding a publication request it got some random time ago.

    By observing the network as a whole it will be hard - random delays, TTL randomization and alteration makes it impossible to determine node position in the forward chain in a large world-wide network.

    If all traffic to and from the node is tracked it is possible to prove that the node is a publication source/author of course.

    If would be possible to implement TOR-like circuit algorithm, but it doesn't really make any sense. If someone wants - he might publish using TOR. It's most important that after the publication it is impossible to determine the publisher.

  • Further ways to increase anonimity / accessibility
    It wouldn't be hard to tunnel the initial publication packet inside a DNS request if the message is not too long. It could be then forwarded via any selected, public DNS resolver to a DNS-weespur gate.

5.5.2 Searching

Each hosts, which got the search packet, searches it's local cache for X matching whispers. If they are found, they are marked as an interesting (to someone). Afterwards the hosts sends a responses to the previous host, one datagram per found whisper, at most - 10 whispers. And host forwards those back to the node who sent the search query.

Search can be started with a local cache search, and only if the newest packet found is older than X[s] - the external search would be started.

Querying host adds all new whispers to it's database (ignores already known), and runs a search on it's own cache to get a good search relevance/ordering. Searches can by forced by user to be only-local.

Search responses are given the search ID (sha1 of search packet fields)

  • similar data as in the publication packet.

5.6 Security

The author private key might be passphrase encrypted - it will make it harder to steal personality and to prove authorship of whispers.

Statistical analysis of local cache might be enough to prove authorship (do we always cache our messages?).

Analysis of a total traffic to/from node might reveal it's like/dislike filters.

User is at risk only during publication so in fact one could use TOR to publicate whispers and

5.7 Malicious behaviour

5.7.1 SPAM

As usual. Twitter has a problem problem with spam too. Good user-set filtering will allow to limit the problem (filter out too short messages, messages with just a link, etc.) But the problem will still persist.

5.7.2 DOS

Sending a simple packet might induce a large network traffic. That's one of the reasons the packets are rather small. This problem can be limited in few ways:

  • Network is probabilistic, so we can limit traffic on the node and silently drop incoming packets. This might DoS the weespur network a bit, but should save the internet.
  • Keep a list of too active hosts (publication/searches) and delay their requests.
  • Require computation of some value which is hard to compute, yet - pretty easy to check. Also it should scale nicely… Resending the same/computed value will cause most nodes to drop it (already answered list).

It would be possible to make the publication harder - force

5.7.3 Publishing files

Network is designed for short messages but of course file can be split into multiple messages and stored nevertheless. This can slow down network and therefore it's better to control it rather than ignore.

Files could be stored as a thread of messages with the first message describing the name/title and next containing only the 'thread' id except for the contents. Bitfields 'binary' and 'multipart' should be set.

Users might of course filter such messages from their cache.

5.8 Basic math

In general long/narrow chains of connections are preferred to short/wide (number of servers directly connected/TTL value). This increases anonimity and allows far better distribution of data and network self-filtering (by toggling TTL).

C - sending message in parallel to C computers T - TTL

  • With higher C the chance of hitting not-connected computer is higher
  • Users can't be controlled on what C do they use, so we must limit the TTL correctly… Still they can only toggle C at their host, not farther.
C \ T23456789101112
123456789101112
2614306212625451010222.0464.0948.190
312391203631.0923.2799.84029.52388.572265.719797.160
420843401.3645.46021.84487.380349.5241.398.1005.592.40422.369.620
5301557803.90519.53097.655488.2802.441.40512.207.030XX
101101 11011.110111.1101.111.11011.111.110111.111.1101.111.111 11011.111.111.110XX

5.8.1 Network background

I find it hard to estimate correct/good/bad values so let's start with defining some background.

Downloading a 650MB file over network (many media downloads I guess), assuming only 10 bytes for the protocol at least 444k packets, each will be acked -> probably over 900 milion packets.

3.5MB file (mp3?) - 4787 packets Opening a portal (1.4MB) - 1915

Twitter has 300mln tweets per day. Each tweet takes more than a single packet. Let's assume - 5 packets per tweet, which gives 17.361 packets per second. (That would take less than 26MB/s traffic).

In case of weespur the load is distributed so it might be larger. For the sake of calculations let's assume we can reach 1mln users.

There's also probability of loosing packet (or sending to dead host, or to the host which already got that packet) each time. The final result depends strongly on the moment of the packet drop.

5.8.2 Publication

In case of publication we want to reach some of the network, rest should be reached by the searches/republication. We don't get responses therefore we can reach more hosts than during search.

Assuming C=3, T=9 and 0.2 probability of loosing, the network will send 24000 packets during publication phase which in worst case will transfer 36MB of data.

Initial network coverage is not great - around 2.4%, but the whisper will later be republished during searches to other hosts.

5.8.3 Search

During the search, hosts running a query return packet responses which greatly increases total traffic.

Each host searches it's own cache first. If it has more than X (fresh) results, the request might not be forwarder further. Otherwise, the host forwards the request and gathers responses from that hosts. If there's not enough results for the query or the query is very specific (sha1 of whisper) the search can be deepen.

The requesting host doesn't want to get too much traffic for a single search query. 1MB would fit in at least 666 packets. If we are assuming C=3, that would give 200 packets/responses per host.

No duplicates should be returned so it decreases total number of responses.

C=3:

TTL543210
9-TTL:123456
Responses:600200662240

C=4:

TTL43210
9-TTL:12345
Responses:6001503792

Well, 1000 results might make a bit more sense; 10 / 10 pages (+ local cache)…

C=3:

TTL543210
9-TTL:123456
Responses:100033311137124

C=4:

TTL43210
9-TTL:12345
Responses:100025062153

This could be tweaked. For example the maximal TTL allowed could equal 9. If there's no data yet found - search deeper. If you get a lot of pretty fresh responses from the local cache - decrease the TTL. The total responses might vary… but let's estimate traffic first.

Node starting the search would receive all responses - 1000 to 5000 - 1.5MB to 7.5MB. That's pretty pessimistic, as nodes lower would remove duplicates and decrease total number of results. Nodes one level deeper in the search tree will only receive 1/C part of responses - 250-1250 packets - 0.4 - 1.875MB. (I'm assuming pessimistic case of each message having 1500bytes).

5.9 Spam protection

Users during account creation might be forced to do some calculations to limit the number of accounts. Then - network nodes could limit searches/publications for an account (which would work better than limiting it per IP).

In smaller networks this protection should work pretty well, in larger networks - the load should be better distributed.

The same could be done per search/publication request, but with a highly decreased difficulty (to be calculated in 0.1s;

Proposed algorithm: For given input (20 bytes of user public key or id (SHA1) of publication/search request) calculate complementary 4 bytes (K) so that SHA512(input + K) has difficulty higher or equal to requested.

Difficulty: mult = 60 # Parameter difficulty = 0 for byte in sha512hash: difficulty += int( mult / (256 - byte) )

The calculated complement is added into packets along the user key

Date: 2012-02-15 18:21:07 CET

Author: Tomasz bla Fortuna

Org version 7.8.02 with Emacs version 23

Validate XHTML 1.0