WEESPUR -*- mode: org; coding: utf-8; fill-column: 75; mode: auto-fill -*-
#+STARTUP: showall

* Basis of the idea
** Success of social networking via short message passing
   OK - in fact, of Twitter. But not only: it also extends the more basic
   ideas of email and IM. For reference, Twitter is said to handle 300*10^6
   tweets per day.
   
** Censorship in certain countries
   Twitter was recently forced to introduce censorship in certain
   countries. Others, while talking about limiting computer "piracy",
   usually limit people's freedom.

** People value their freedom
   Or at least so it seems. In my country there was recently a pretty big
   outcry after the government signed the ACTA agreement. Still, this
   freedom needs to be well-packaged and interesting to get people to
   defend it.
   
* Sister projects
  GNUtella - for its distributed search and list of active hosts.
  Freenet - works in a pretty similar way; has an encrypted local
  datastore. It's way more general, heavier, and data-neutral - you can't
  decide not to help share pedophile material while using it. Some
  people might dislike that.

  Bitcoin - Similar keying idea + computational generation of more
  resources using SHA.

  TOR - "circuits" and message passing, but no local datastore.
  CAN - Content Addressable Networks - the idea is similar in some places.

  Email - decentralized, but much less anonymous and by default
  private, not public.

* Name
  Weespur.  'We' is nice. 'Spur' (or 'stir' or /spear/) is not bad
  either. Pronounced just like 'whisper'. I'm also going to call messages
  'whispers' ('whisp'/'wheesp' for short), which is longer but better than
  'tweet'. Whatever.

* Required features
** Whispers
   A whisper is a message with a timestamp and authorship data, connected
   (or not) to a thread, which may contain hashtags (#icecream) for
   searching, plus locality and language data.

   Some #tags could be reserved, like '#userprofile' - so that user data
   is always tagged the same way and searching for a user plus this
   hashtag returns his description.
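
   A rough sketch of the fields such a whisper might carry (the field
   names are mine, purely illustrative):

   #+BEGIN_SRC python
   from dataclasses import dataclass
   from typing import Optional

   @dataclass
   class Whisper:
       timestamp: int                     # UTC timestamp
       author_key: bytes                  # 20-byte public key
       signature: bytes                   # ECC signature over the fields
       readable_sig: str                  # human-readable 'signature'
       content: str                       # text, may contain #hashtags
       thread: Optional[bytes] = None     # SHA1 of the answered whisper
       recipient: Optional[bytes] = None  # set for private messages
       lang: Optional[str] = None         # language code
       country: Optional[str] = None      # locality
   #+END_SRC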

** Totally decentralised 
   Clients keep a list of known parts of the network and dynamically
   refresh it while communicating with other nodes. Connections are made
   to X hosts, randomly selected out of the Y most recently contacted.
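
   A minimal sketch of that selection rule, assuming the list is already
   sorted most-recent-first (the X and Y defaults are placeholders):

   #+BEGIN_SRC python
   import random

   def pick_targets(hosts, x=3, y=50):
       """Pick X random hosts out of the Y most recently contacted."""
       candidates = hosts[:y]      # the list is sorted most-recent-first
       return random.sample(candidates, min(x, len(candidates)))
   #+END_SRC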

   There are no mandatory centralised servers, although it's possible to
   set up a headless server for caching/proxying etc. It should even be
   promoted. In theory, self-moderation and self-filtering of the network
   requires most of the participants to set up their own clients and not
   use WWW access - although smart WWW gates could fix that issue.

   Users can set up headless servers for:
   - proxying publication/search requests
   - keeping a normal cache
   - archiving (with an unlimited/big cache and no search capabilities -
     the search DB is larger, but, well)
   - proxying with a web interface for browsing their resources

   All nodes should participate in forwarding publication/search requests.

   For example, a university might create such a server so that all its
   students can communicate with each other (this would increase the
   chance of locating each other's whispers by creating a local cache).

   Although there's no 'user of the server' concept in the system, the
   cache filtering should allow the university to never remove the data
   of 'their' users. It would still cache all search requests/answers.

** Cryptographical ensurance of authorship
   Similar in idea to Bitcoin. Whispers are published along with the
   user's key and a signature. Users themselves might keep a key->name
   list to resolve the names of their close friends. Whispers contain a
   human-readable 'signature' which can be set to anything. It might
   be remembered in the list and validated for later whispers (e.g. the
   signature shown in red if it doesn't match the remembered key).
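
   A sketch of that remembered-key check (the data structure is
   illustrative):

   #+BEGIN_SRC python
   known_sigs = {}  # author public key (bytes) -> remembered readable sig

   def readable_sig_ok(author_key, readable_sig):
       """True if the readable sig matches the remembered one.

       On first contact the sig is remembered; a False result would be
       shown to the user, e.g. by painting the signature red."""
       remembered = known_sigs.setdefault(author_key, readable_sig)
       return remembered == readable_sig
   #+END_SRC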

** Local whisper cache
   The network holds data in a decentralised fashion. Therefore there's
   only some probability of finding a given piece of information in the
   network. It's important both to maximize this probability and to
   accept that it can't reach 1. More popular whispers will be found
   easily.

   Users can set filters on what they want to cache, can force caching of
   their own whispers etc. You can also publish a message and turn off
   your client - other clients will cache it for some time, so anyone
   interested will be able to read it.

   The local cache is not encrypted, but it could be - accessing the
   client would then require a password.

   Local cache must be structured in a way which allows fast searches.

** Distributed search (think: GNUtella)
   The network can be searched for data. Searches can be automated
   (search for new things of certain types every 2 hours) or manual.  You
   are also periodically sent new publications, which might appear
   automatically on your 'wall' if they match your search criteria.  You
   can have a set of personal favourite searches which create your
   'buddies' (a search for whispers of a certain user) and your 'wall'
   (the searches you're most interested in - probably one huge combined
   request...).

   The distributed cache also means that searches won't be
   deterministic. Searching for the same data twice might return
   different results. Popular whispers will always appear, while less
   popular ones might pop in and out at random. Searches are naturally
   sorted by the whisper timestamp. All searches are locally cached. The
   number of results is limited - if the user doesn't find what he needs
   in about 1000 whispers, he should narrow the search criteria.

   There's a limit on the length of a query (packet size). 

** Optional: Private messaging with encryption
   Easy. Encrypt "for" someone. Users might filter those out
   and even forbid caching them.

** Optional: Tor-like anonymity
   I wasn't sure it was worth it and I believe it's not. If someone wants
   Tor-grade anonymity he can use Tor for publishing. There might even be
   hidden weespur proxies inside TOR.
   
   After publication, entries are totally anonymous. During publication
   certain measures are taken to make it difficult for anyone to
   determine who published a message, but it's not Tor-grade.


** Popularity, self-filtering
*** Fresh data first
    Fresh entries are promoted by the network; old ones might vanish in
    time. This depends mostly on the servers' caches - they have some
    total capacity. If a user wants to retain his entries forever, he
    should run his own node.

*** What users can do
    - Users can set filters for content they dislike
      (hashtags, authors, threads - anything we search by).
      This can be done automatically by the client after selecting
      "dislike message" or "dislike author".

    - Users can set filters for content they like.

    These filters impact the algorithms for searching, publication
    and caching.

    - A user can respond to a whisper, which increases its chance of
      being visible to people who like the author (and filter him
      positively). (+1!)

    Some actions might trigger republication of a whisper, which
    increases its popularity. Searches alone cause many nodes to cache
    the searched data, thus propagating interesting whispers.

** Interface
   The program should be split into a few parts:
*** Library
    Handles the local database, caching, connectivity, publishing,
    searching and all other low-level things.
*** GUI Interface
    Creates a stand-alone application with a user GUI. The basic app.  It
    ensures the best anonymity, has a local cache and actively works in
    the network. Certainly recommended.

*** Web interface
    It is possible to create a web application for browsing the
    network. It consists of a stand-alone server with its own cache,
    performing all low-level functions (proxying), plus a web interface.

    This approach loses user anonymity and decreases the network cache,
    but is otherwise just fine. The server, with its cache, can fix these
    problems.

    Browsing the network should naturally work without problems;
    self-filtering might be badly altered, but smart web servers could
    fix it (by averaging users' likes/dislikes).


* Details and how it works
** Connectivity
   We keep a list of network addresses which have participated in the
   network. As the number of headless/constantly-working servers rises,
   it might make sense to keep a 'static list' as well as a 'dynamic
   list'.

   Both lists might contain IPs and domains, although hosts automatically
   added to the dynamic list will be added by IP only.

   We automatically add to the list the hosts which have contacted us.
   The list is sorted according to host activity; more active hosts are
   more likely to be contacted for searches/publication, and
   long-forgotten hosts might eventually be removed from the list.

   The list length might be limited.
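
   One possible shape for the dynamic list - an LRU-style structure, with
   the length limit as an assumed parameter:

   #+BEGIN_SRC python
   import time
   from collections import OrderedDict

   MAX_HOSTS = 1000         # assumed limit on the list length

   hosts = OrderedDict()    # address -> last activity, most recent last

   def note_host(addr):
       """Record activity of a host; evict long-forgotten ones."""
       hosts.pop(addr, None)
       hosts[addr] = time.time()
       while len(hosts) > MAX_HOSTS:
           hosts.popitem(last=False)   # drop the least recently seen
   #+END_SRC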

*** Performance
    Communication might be kept inside a SINGLE UDP packet, which makes
    searches/publications fast and light. In general there are the
    following requests:
    - Whisper / Publication
    - Search
    - Revoke (an owner might revoke his message; this might decrease its
      popularity, but it's doubtful it will remove it totally)
    - PING / PONG

    Most of them contain a 'TTL' field which has a certain allowed range.

** Whisper
*** Timestamp
    8 bytes UTC timestamp

*** Thread
    'In response to' field. This whisper responds to another one by
    including its SHA1 sum (20 bytes).

*** Author
    Total: 120 bytes
**** Key
     Elliptic Curve key - 160 bits ~ 20bytes

**** Signature
     ECC signature - 640bits? ~ 80 bytes

**** Human readable sig
     20 bytes of UTF-8

     De facto it is possible to publish an unsigned/anonymous whisper by
     including zeroed key/signature fields. It is also possible to filter
     messages out using this property.

**** Optional: Creation timestamp
     It might be fine to include a creation timestamp, for the account
     verification algorithm.
**** Optional: Verification block
     This field (4 bytes) would be used if account creation is to be paid
     for by the creator (spam limiting).

*** Recipient
    Public key of a user this message is destined to. (20bytes)

*** Language / locale
    Might be interesting for filtering information (only English, only
    something).
    Optional: geographical location (country)
    Optional: timezone
    Optional: language

    3 bytes for language, 3 for country, 3 for timezone -
    should fit in 10 bytes.

*** Bitfields
    Is the content encrypted?
    At least one byte.

*** Totals
    IP header: 20 bytes
    UDP header: 8 bytes

    8 + 20 + 20 + 100 + 10 + 20 + 1 = 179 bytes of whisper fields.
    Out of a 1500-byte MTU that leaves 1321 bytes (1293 once the 28 bytes
    of IP+UDP headers are subtracted). Let's reserve 500 bytes for other
    headers; we can safely assume there are 800 bytes for the rest.

    200 characters should fit in at most 500 bytes...
    Assuming 500 bytes per UTF-8 message.

    Might be nice to be able to tunnel the traffic inside DNS
    requests.

** Packet
   <VERSION|TTL|REQUEST_TYPE| HEADERS | CONTENT >
   Protocol version:
   0x80 - 0.01

   TTL:
   Valid range is from 0 (don't send further) to 25.

   Type:
   I'd rather keep a range of values in case I want to differentiate
   later... (For example publishing a like/dislike?)
   0x10 - 0x1E - Publication (15 values)
   0x1F        - Revoke
   0x20 - 0x2E - Search (15 values)
   0x2F        - Search response

   0xAX - management space
     0xA0 - Ping (max TTL=0)
     0xA1 - Pong
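
   The 3-byte header could be built like this (constants follow the table
   above; the exact packing is my assumption):

   #+BEGIN_SRC python
   import struct

   VERSION = 0x80                     # protocol 0.01
   PUBLICATION, REVOKE = 0x10, 0x1F   # 0x10-0x1E is the publication space
   SEARCH, SEARCH_RESP = 0x20, 0x2F   # 0x20-0x2E is the search space
   PING, PONG = 0xA0, 0xA1            # management space

   def make_header(request_type, ttl):
       """<VERSION|TTL|REQUEST_TYPE> - the first 3 bytes of a packet."""
       assert 0 <= ttl <= 25          # valid TTL range
       return struct.pack('BBB', VERSION, ttl, request_type)
   #+END_SRC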

*** Publication packet
**** Structure bitfield (1 byte)
     01234567
     bit 0: Is there a 'Response to' field?
     bit 1: Is there a 'Recipient' field?
     bit 2: Is there a timezone?
     bit 3: Is there a language?
     bit 4: Is there a country?
     bit 5: reserved
     bit 6: Is the message encrypted?
     bit 7: Is the message binary?
     This tells us which fields to read later and shrinks the size of a
     message.
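
     For illustration, assuming bit 0 is the most significant bit (the
     on-the-wire bit order is not fixed yet):

     #+BEGIN_SRC python
     FLAGS = {               # bit number -> meaning; bit 0 = MSB (assumed)
         0: 'has_thread',    # 'Response to' field present
         1: 'has_recipient',
         2: 'has_timezone',
         3: 'has_language',
         4: 'has_country',
         6: 'encrypted',
         7: 'binary',
     }

     def parse_structure(byte):
         """Decode the structure bitfield into named booleans."""
         return {name: bool(byte & (1 << (7 - bit)))
                 for bit, name in FLAGS.items()}
     #+END_SRC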
**** Content length (2 bytes)
**** Timestamp (8 bytes)
     I don't want a 2038 problem, and 8 bytes is MORE than enough. A
     5-minute resolution would also be fine, so one byte could be dropped
     - but it might not be worth the complication.

**** Author 
     Key (around 20 bytes)
     Optional: verification block (4 bytes)
**** -- Up to this point length/data positions are constant - 38 bytes
**** Optional: Recipient
     Key of recipient (20bytes)
**** Optional: Thread
     SHA1 sum of answered publication (20bytes)
**** Localization
     5 bytes of language (ASCII or NULLS)
     5 bytes of country (ASCII or NULLs)
     1 byte of timezone (-127 to 127; -128 means 'not given')
     
**** Readable sig
     1 byte of length
     Up to 20 bytes of signature

**** Content
**** Message signature
     Signs pretty much everything except: the version, TTL,
     request_type, structure bitfield and content length.

*** Search packet
    TODO
**** Structure bitfield (1byte)
     
**** Query length
**** Query contents
     

** Local databases
*** Contains a cache of all whispers
    - Has a user-settable limit on its size.
    - Old entries are removed if the limit is hit.
    - Assuming a pessimistic 1500 bytes per whisper, a 2GB DB could
      store over 1 mln entries - but without the search data.

*** Separate DB for searching
    Index contains and can search for:
    - words (content/hashes/anything)
    - hashes
    - language
    - locality
    - encryption
    - author
    - thread
    - recipient
    It will grow and be bigger than the main DB itself.
    Can be easily recreated.

    Probably a Xapian DB. It might be possible to merge those databases
    into one with Xapian too; not sure that's such a good idea. It would
    certainly not be OK to reimplement a DB. Maybe use Tokyo Cabinet...
    MongoDB would work too, but this shouldn't be so heavy. A single
    Xapian DB, maybe with sqlite/another SQL database.
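
    A minimal sqlite sketch of the main cache table (the full-text index
    would live in Xapian; the schema is illustrative):

    #+BEGIN_SRC python
    import sqlite3

    db = sqlite3.connect('whispers.db')
    db.executescript('''
    CREATE TABLE IF NOT EXISTS whisper (
        id        BLOB PRIMARY KEY,  -- SHA1 of the whisper
        ts        INTEGER,           -- UTC timestamp
        author    BLOB,              -- 20-byte public key
        recipient BLOB,              -- NULL unless private
        thread    BLOB,              -- SHA1 of the answered whisper
        lang      TEXT,
        country   TEXT,
        content   BLOB
    );
    CREATE INDEX IF NOT EXISTS whisper_ts     ON whisper(ts);
    CREATE INDEX IF NOT EXISTS whisper_author ON whisper(author);
    ''')
    #+END_SRC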
    
*** DoS filter lists
    Filter/delay IPs which have been behaving badly. If you're getting
    too many packets from a certain IP - increase the probability of
    dropping the packet. This should either be implemented well, with
    good (adaptive?) limits, or not at all.

    If creating a user key is expensive, it might make sense to limit
    not per IP, but per author.
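
    A sketch of the probabilistic per-IP limiter (window and limit values
    are placeholders):

    #+BEGIN_SRC python
    import random, time
    from collections import defaultdict

    WINDOW = 10.0   # seconds (assumed)
    LIMIT = 200     # packets per window per IP (assumed)

    counters = defaultdict(lambda: [0, time.monotonic()])

    def should_drop(addr):
        """Drop with probability that grows once an IP exceeds the limit."""
        entry = counters[addr]
        now = time.monotonic()
        if now - entry[1] > WINDOW:
            counters[addr] = [1, now]
            return False
        entry[0] += 1
        overload = entry[0] / LIMIT
        return overload > 1.0 and random.random() < min(1.0, overload - 1.0)
    #+END_SRC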

** Connectivity in distributed network
   To send a publication/search request, a client keeps a list of hosts
   seen in the network. The list is sorted from the most recently
   contacted host to the least. From the top Y entries the software
   randomly selects X computers which are going to be contacted.

   The TTL of a packet is randomized from a range of values, so the
   contacted host can't determine whether the previous host produced the
   packet or is just proxying it. Each proxying host inserts random
   delays when resending packets, so that timing analysis of the packet
   source is complicated as well.

   Each time a packet is resent, its TTL is decreased by some (more or
   less random) amount (1-3). The search/publication stops once the TTL
   reaches zero.

   The initial delay before proxying should range between 0.1 and 2s;
   the delay between individual packets should be smaller (0.1-0.5s).

   If there is a list of static, "always to be contacted" addresses,
   they might be randomly included among the dynamically selected hosts.

   A host keeps lists of recently published messages and forwarded
   search queries; if a newly received packet matches any entry on the
   list - it is ignored.
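
   Putting those forwarding rules together (packet layout as in the
   Packet section: byte 0 version, byte 1 TTL; `send' is a placeholder):

   #+BEGIN_SRC python
   import hashlib, random, time

   recently_seen = set()   # SHA1s of recently handled packets

   def forward(packet, targets, send):
       """Forward a packet with TTL decrement, delays, duplicate check."""
       digest = hashlib.sha1(packet).digest()
       if digest in recently_seen:
           return                               # seen before - ignore
       recently_seen.add(digest)
       ttl = packet[1]
       if ttl == 0:
           return                               # don't send further
       new_ttl = ttl - random.randint(1, min(3, ttl))
       out = bytes([packet[0], new_ttl]) + packet[2:]
       time.sleep(random.uniform(0.1, 2.0))     # initial delay
       for target in targets:
           send(out, target)
           time.sleep(random.uniform(0.1, 0.5)) # inter-packet delay
   #+END_SRC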

*** Publication
    Publication packets don't get any response. 

**** Probability
     What % of the network can we reach with a publication, and in what
     time? What % of the network can we search? What is the estimated
     probability of finding a published whisper?

     Should we scale our requests with the number of known nodes?

**** Anonymity
     The described connectivity algorithm doesn't ensure Tor-grade
     anonymity of publishers, but it makes it hard to determine who
     published a whisper without observing the traffic closely.

     From time to time (at random - in silence periods?) a client
     spontaneously republishes a cached whisper which has a high 'like'
     value (filters) and is relatively fresh.

     Whispers with an invalid (future) timestamp, invalid fields, bad
     length or bad signature must be ignored.

**** Filtering / self-moderation
     A node receiving a publication request might resend it to a larger
     or smaller set of nodes - depending on its local like/dislike
     filters. It may cache it or it may ignore it. It might decrease the
     TTL a little or a lot.

**** What's the point?
     It should be hard to determine whether a node published the message
     itself, republished a whisper it had cached previously, or is
     forwarding a publication request it got some random time ago.

     By observing the network as a whole it will be hard - random delays,
     TTL randomization and alteration make it impossible to determine a
     node's position in the forwarding chain in a large, world-wide
     network.

     If all traffic to and from a node is tracked, it is of course
     possible to prove that the node is a publication source/author.

     It would be possible to implement a Tor-like circuit algorithm, but
     it doesn't really make sense. If someone wants that, he can publish
     using Tor. What matters most is that after the publication it is
     impossible to determine the publisher.

**** Further ways to increase anonymity / accessibility
     It wouldn't be hard to tunnel the initial publication packet inside
     a DNS request if the message is not too long. It could then be
     forwarded via any selected public DNS resolver to a DNS-weespur
     gate.
     
   
*** Searching
    Each host which receives the search packet searches its local cache
    for X matching whispers. Any found whispers are marked as interesting
    (to someone).  The host then sends responses to the previous host,
    one datagram per found whisper - at most 10 whispers - and that host
    forwards them back to the node which sent the search query.

    A search can start with a local cache search; only if the newest
    packet found is older than X[s] is the external search started.

    The querying host adds all new whispers to its database (ignoring
    already-known ones) and runs a search on its own cache to get good
    search relevance/ordering. The user can force searches to be
    local-only.

    Search responses carry the search ID (the SHA1 of the search packet
    fields) + similar data as in the publication packet.
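
    Matching a response to its query is then trivial (assuming the ID is
    taken over the serialized query fields):

    #+BEGIN_SRC python
    import hashlib

    def search_id(query_fields: bytes) -> bytes:
        """The 20-byte ID carried by every response to this search."""
        return hashlib.sha1(query_fields).digest()
    #+END_SRC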


** Security
   The author's private key might be passphrase-encrypted - that makes it
   harder to steal an identity and to prove authorship of whispers.

   Statistical analysis of the local cache might be enough to prove
   authorship (do we always cache our own messages?).

   Analysis of the total traffic to/from a node might reveal its
   like/dislike filters.

   A user is at risk only during publication, so in fact one could use
   TOR just to publish whispers.


** Malicious behaviour
*** SPAM
    As usual. Twitter has a problem with spam too. Good user-set
    filtering will help limit the problem (filter out too-short messages,
    messages with just a link, etc.), but the problem will still persist.


*** DOS
    Sending a single small packet might induce large network traffic.
    That's one of the reasons the packets are rather small. This problem
    can be limited in a few ways:
    - The network is probabilistic, so we can limit traffic on the node
      and silently drop incoming packets. This might DoS the weespur
      network a bit, but should save the internet.
    - Keep a list of overly active hosts (publications/searches) and
      delay their requests.
    - Require computation of some value which is hard to compute,
      yet pretty easy to check. It should also scale nicely...
      Resending the same/computed value will cause most nodes to drop it
      (already-answered list).

    It would be possible to make publication harder this way - see the
    Spam protection section below.

*** Publishing files
    The network is designed for short messages, but of course a file can
    be split into multiple messages and stored nevertheless. This can
    slow the network down, and therefore it's better to control it than
    to ignore it.

    Files could be stored as a thread of messages, with the first message
    describing the name/title and the next ones containing nothing but
    the 'thread' id and the contents. The bitfields 'binary' and
    'multipart' should be set.

    Users might of course filter such messages out of their cache.

** Basic math
   In general, long/narrow chains of connections are preferred to
   short/wide ones (the number of directly contacted servers vs. the TTL
   value). This increases anonymity and allows far better distribution
   of data and network self-filtering (by adjusting the TTL).

   C - the number of computers a message is sent to in parallel
   T - TTL

   - With a higher C, the chance of hitting an unconnected computer is
     higher.
   - We can't control what C users choose, so we must limit the TTL
     correctly... Still, they can only change C at their own host, not
     further down the chain.

   | C \ T |   2 |     3 |      4 |       5 |         6 |          7 |           8 |             9 |             10 |        11 |         12 |
   |-------+-----+-------+--------+---------+-----------+------------+-------------+---------------+----------------+-----------+------------|
   |     1 |   2 |     3 |      4 |       5 |         6 |          7 |           8 |             9 |             10 |        11 |         12 |
   |     2 |   6 |    14 |     30 |      62 |       126 |        254 |         510 |         1.022 |          2.046 |     4.094 |      8.190 |
   |     3 |  12 |    39 |    120 |     363 |     1.092 |      3.279 |       9.840 |        29.523 |         88.572 |   265.719 |    797.160 |
   |     4 |  20 |    84 |    340 |   1.364 |     5.460 |     21.844 |      87.380 |       349.524 |      1.398.100 | 5.592.404 | 22.369.620 |
   |     5 |  30 |   155 |    780 |   3.905 |    19.530 |     97.655 |     488.280 |     2.441.405 |     12.207.030 |         X |          X |
   |    10 | 110 | 1.110 | 11.110 | 111.110 | 1.111.110 | 11.111.110 | 111.111.110 | 1.111.111.110 | 11.111.111.110 |         X |          X |
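
   The table follows a geometric series: the number of hosts reached is
   N(C,T) = C + C^2 + ... + C^T = C*(C^T - 1)/(C - 1) for C > 1 (and
   simply T for C = 1). A quick check:

   #+BEGIN_SRC python
   def reach(c, t):
       """Hosts reached with fan-out c and TTL t, assuming no losses."""
       return t if c == 1 else c * (c**t - 1) // (c - 1)

   assert reach(3, 7) == 3279
   assert reach(4, 5) == 1364
   assert reach(10, 9) == 1111111110
   #+END_SRC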


*** Network background
    I find it hard to estimate correct/good/bad values, so let's start by
    defining some background.

    Downloading a 650MB file over the network (many media downloads, I
    guess), assuming only 10 bytes of protocol overhead, takes at least
    444k packets; each will be ACKed -> probably around 900 thousand
    packets.

    A 3.5MB file (an mp3?) - 4787 packets.
    Opening a portal page (1.4MB) - 1915.

    Twitter has 300 mln tweets per day. Each tweet takes more than a
    single packet. Let's assume 5 packets per tweet, which gives 17.361
    packets per second. (That would take less than 26MB/s of traffic.)

    In the case of weespur the load is distributed, so it might be
    larger.  For the sake of calculation let's assume we can reach 1 mln
    users.

    There's also a probability of losing each packet (or sending it to a
    dead host, or to a host which already got that packet). The final
    result depends strongly on the moment of the packet drop.

*** Publication
    In the case of publication we want to reach some part of the network;
    the rest should be reached via searches/republication. We don't get
    responses, therefore we can reach more hosts than during a search.

    Assuming C=3, T=9 and a 0.2 probability of losing a packet, the
    network will send about 24.000 packets during the publication phase,
    which in the worst case transfers 36MB of data.

    Initial network coverage is not great - around 2.4% - but the whisper
    will later be republished to other hosts during searches.
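
    One simple way to arrive at roughly those figures - scale the ideal
    reach by the delivery rate (a crude model; surviving packets are
    assumed to still fan out fully):

    #+BEGIN_SRC python
    def publication_estimate(c=3, t=9, loss=0.2, packet_bytes=1500):
        ideal = c * (c**t - 1) // (c - 1)       # 29.523 for C=3, T=9
        sent = ideal * (1 - loss)               # ~23.600, i.e. ~24.000
        return sent, sent * packet_bytes / 1e6  # packets, worst-case MB

    print(publication_estimate())               # -> (23618.4, ~35.4)
    #+END_SRC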
    

*** Search
    During a search, hosts handling the query return response packets,
    which greatly increases the total traffic.

    Each host searches its own cache first. If it has more than X (fresh)
    results, the request might not be forwarded further. Otherwise, the
    host forwards the request and gathers responses from those hosts. If
    there aren't enough results for the query, or the query is very
    specific (the SHA1 of a whisper), the search can be deepened.

    The requesting host doesn't want to get too much traffic for a single
    search query. 1MB fits in at least 666 packets. If we assume C=3,
    that gives about 200 packets/responses per host.

    No duplicates should be returned, which decreases the total number of
    responses.

    C=3:
    | TTL        |   5 |   4 |  3 |  2 | 1 | 0 |
    | Depth:     |   1 |   2 |  3 |  4 | 5 | 6 |
    | Responses: | 600 | 200 | 66 | 22 | 7 | 2 |

    C=4:
    | TTL        |   4 |   3 |  2 | 1 | 0 |
    | Depth:     |   1 |   2 |  3 | 4 | 5 |
    | Responses: | 600 | 150 | 37 | 9 | 2 |

    Well, 1000 results might make a bit more sense; 10 / 10 pages
    (+ the local cache)...

    C=3:
    | TTL        |    5 |   4 |   3 |  2 |  1 | 0 |
    | Depth:     |    1 |   2 |   3 |  4 |  5 | 6 |
    | Responses: | 1000 | 333 | 111 | 37 | 12 | 4 |

    C=4:
    | TTL        |    4 |   3 |  2 |  1 | 0 |
    | Depth:     |    1 |   2 |  3 |  4 | 5 |
    | Responses: | 1000 | 250 | 62 | 15 | 3 |

    This could be tweaked. For example, the maximal allowed TTL could
    equal 9. If no data has been found yet - search deeper. If you get a
    lot of pretty fresh responses from the local cache - decrease the
    TTL.  The total number of responses might vary... but let's estimate
    the traffic first.

    The node starting the search would receive all the responses - 1000
    to 5000, i.e. 1.5MB to 7.5MB. That's pretty pessimistic, as nodes
    lower in the tree would remove duplicates and decrease the total
    number of results. Nodes one level deeper in the search tree will
    only receive 1/C of the responses - 250-1250 packets, i.e.
    0.4-1.875MB.  (I'm assuming the pessimistic case of each message
    taking 1500 bytes.)


** Spam protection
   During account creation users might be forced to do some computation
   to limit the number of accounts. Network nodes could then rate-limit
   searches/publications per account (which would work better than
   limiting per IP).

   In smaller networks this protection should work pretty well; in
   larger networks the load should be better distributed.

   The same could be done per search/publication request, but with a
   greatly decreased difficulty (tuned to be computed in about 0.1s).

   Proposed algorithm:
   For a given input (the 20 bytes of the user's public key, or the id
   (SHA1) of a publication/search request) calculate a complementary 4
   bytes K so that SHA512(input + K) has a difficulty higher than or
   equal to the requested one.

   Difficulty:
   #+BEGIN_SRC python
   def difficulty(sha512_hash: bytes, mult: int = 60) -> int:
       """Sum of per-byte scores; bytes close to 0xFF score highest."""
       total = 0
       for byte in sha512_hash:
           total += mult // (256 - byte)
       return total
   #+END_SRC

   The calculated complement is added to the packets along with the user
   key.
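
   A brute-force search for the complement, built on the difficulty
   function above (the packing and the target value are assumptions):

   #+BEGIN_SRC python
   import hashlib
   import struct

   def find_complement(data: bytes, target: int, mult: int = 60) -> bytes:
       """Find 4 bytes K so that SHA512(data + K) reaches the target."""
       for k in range(2**32):
           K = struct.pack('>I', k)
           if difficulty(hashlib.sha512(data + K).digest(), mult) >= target:
               return K
       raise ValueError('no 4-byte complement reaches this difficulty')
   #+END_SRC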