FAQ - Frequently Asked Questions
General
What is this?
YaCy is a distributed Web Search Engine, based on a peer-to-peer network.
What can I use YaCy for?
Use cases include:
- site search for your own website (shared with the p2p network or isolated),
- a specialised narrow-domain search engine (e.g. gene-manipulation scientific magazines, sites about trains, computer security, encyclopedias, German law, a language domain...) with the advantage of crawling all the pages, not a selection as other search engines do,
- a whole-world alternative search engine, using the P2P network (RWI distribution) to share your crawled index with other peers and take advantage of theirs, bypassing the censorship of local laws,
- intranet search behind the firewall, without sharing the data with any 3rd party,
- personal web search used as a cache, indexing whatever you're browsing (now limited by widespread HTTPS),
- a news retrieval tool with a user-defined recrawl time,
- an experimental project (such as an onion web search), connected to other software via the API,
- research in the field of decentralised search engines.
What does indexing mean?
Indexing means that a web page is split into the individual words it contains, and the URLs of the pages containing each word are stored in a database under a reference to that word. A search for a word (or several words) can then be performed easily by fetching all URLs "belonging" to the search term.
What is a DHT?
A distributed hash table (DHT) is a class of a decentralized distributed system that provides a lookup service similar to a hash table; (key, value) pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. (source: Wikipedia)
What's the meaning of "to crawl"?
A so-called "crawler" fetches a web page and parses out all links on it; this is the first step, "depth 0". It then fetches all web pages linked from the first document, which is called "depth 1", and does the same for all documents found in that step. The crawler can be limited to a specified depth or can even crawl indefinitely, and so can cover the whole "indexable Web", including those parts that are censored by commercial search engines and therefore normally not part of what most people are presented as the visible Web.
What is a P2P network?
Peer-to-peer (P2P) computing or networking is a distributed application architecture that partitions tasks or workloads between peers. Peers are equally privileged, equipotent participants in the application and are said to form a peer-to-peer network of nodes. Peers make a portion of their resources, such as processing power, disk storage or network bandwidth, directly available to other network participants, without the need for central coordination by servers or stable hosts. (source: Wikipedia)
What is a Peer?
A peer is a communications endpoint in a computer network. Each peer offers its services and uses the services of other peers. In YaCy, a peer provides web indexing services to other peers on the YaCy search network.
What is RWI?
RWI is an acronym for Reverse Word Index. This is generated by the indexer from the collected data and stored in the database.
Isn't P2P illegal?
No. P2P (peer-to-peer) only describes the technology by which computers exchange data amongst themselves. Past legal disputes have been over what types of data were exchanged over such networks, namely copyrighted material. The P2P file-sharing technique itself is legal, despite the fact that it has been used to facilitate the transfer of copyrighted data. The only files shared amongst YaCy peers are indexes of the publicly accessible internet.
YaCy in general
What are global and local indexes?
When you use the proxy, a local index of the pages you visit is created automatically. The global index, which is effectively the combination of all local indexes, is accessible only while your peer has a connection to the YaCy network.
I am not a technician. Can I install YaCy easily and use it to index my own web pages?
YaCy is very easy to install. You don't need any special knowledge or additional software, and you don't need to set up an extra database engine. Indexing your own website isn't hard either: simply crawl it and turn off DHT Distribution and DHT Receive to keep the index of your site on your peer.
Can I crawl and index the web with YaCy?
Yes. You can start your own crawl and you may also trigger distributed crawling, which means that your own YaCy peer asks other peers to perform specific crawl tasks. You can specify many parameters that focus your crawl to a limited set of web pages.
Is there a central server? Does the search engine network need one?
No. The YaCy network architecture does not need a central server, and there is none. There are currently four so-called seed-list servers hard-coded into the source code because they are mostly available and carry accurate seed-list information (see the FAQ for details).
Search engines need terabytes of space, don't they? How much space do I need on my machine?
The global index is shared, but not copied to the peers. If you run YaCy, on average you need about as much disk space for the index as for the cache. The global index may indeed reach terabytes in total, but not all of that is on your machine!
Do I need a fast machine? Search Engines need big server farms, don't they?
You don't need a fast machine to run YaCy. You also don't need a lot of space. You can configure the number of megabytes that you want to spend on the cache and (currently only indirectly) on the index. Any time-critical task is delayed automatically and takes place while your surfing is idle (this works only if you use YaCy as an HTTP proxy).
How long does a search take?
The time a search takes depends on many factors. One is whether you perform a local-only search or a global, DHT-based search. Another comes into play when you enable search heuristics. A third, important factor is whether you have searched for the same terms recently (a short time between two identical searches): in that case the results may still be cached. Yet another factor is the parameter verify: with verify=true, each fetched HTML snippet is verified, that is, fetched again. Using verify=false may give you less accurate (possibly wrong) search results, but they arrive much faster.
In general, YaCy's architecture does not do peer-hopping and has no TTL (time-to-live); search results are expected to be returned to the requester immediately. This is achieved by asking the index-owning peer directly, which is possible thanks to the DHT (distributed hash table). Because YaCy needs some redundancy to compensate for missing peers, it asks several peers simultaneously and waits at most 6 seconds (by default; you can change that) to collect their responses.
Do I need to set up and run a separate database?
No. YaCy contains a built-in database engine (an embedded Solr), which does not need any extra set-up or configuration. Alternatively, you can use a standalone external Solr instance if you wish.
Will running YaCy jeopardize my privacy?
YaCy respects user privacy. All password- or cookies-protected pages are excluded from indexing. Additionally, pages loaded using GET or POST parameters are not indexed by default. Thus, only publicly accessible, non-password-protected pages will be indexed.
For a detailed explanation of the technique, see: How YaCy protects your privacy with respect to personalized pages.
Can other people find-out about my browsing log/history?
There's no way to browse the pages that are stored on a peer. The pages can only be found by searching for a precise word, and the words themselves are dispatched over the peers by the distributed hash table (DHT). Because the hash tables of the peers are mixed, retrieving the browsing history of a particular peer is impossible.
Getting started
How do I install YaCy?
See the download and installation guide to install on Linux, Windows, macOS or various Unixes. The README contains further information.
How do I access my peer?
After YaCy has started successfully, it runs on localhost, port 8090, so you can access it by entering http://localhost:8090 in your browser.
How do I search?
Just enter your query into the search field. Your instance will ask the other peers for results and collect them on the search results page. This may take some time. By default, the results are transferred to your peer as "RWI" and stored locally, so the next identical search will find the results more quickly.
You can also use modifiers to refine your search. For example, /date added to a query will sort the results by date (of indexing). The inurl: modifier filters the results by URL, so inurl:nytimes.com will show only results from the New York Times.
Usage
My YaCy search page doesn't show!
The default address for the YaCy search and administration page is http://localhost:8090. If you are using Internet Explorer, be sure to add http:// before localhost:8090. If you have changed the default port of YaCy from 8090 to another one, you will have to open the new port in your firewall or router (and perhaps close port 8090 if you no longer use it). Another reason could be a bad proxy setting, in which case you need to deactivate the proxy for local pages.
Why does YaCy show different results from Google?
We expect YaCy to show different results than Google, for several reasons. As long as YaCy has only a few peers working, it cannot compete with Google. Hence the importance of having a great number of YaCy peers working. But even then YaCy will provide different and better results than Google, since it can be adapted to the user's own preferences and is not influenced by commercial aspects.
Network
What does Virgin, Junior, Senior, Principal Status mean?
virgin
Status Virgin means your peer has not had contact with the network yet. Simply put, you are "offline". You can search the local index only.
junior
Status Junior means your peer has contact with the YaCy network but cannot be reached by other peers. One reason could be a firewall or a missing router configuration. You can search the local index only. Junior peers can contribute to the network by actively submitting index files to senior/principal peers.
senior
Status Senior means your peer has contact with the YaCy network and can be reached by other peers. It is now an access point for index sharing and distribution, and you can search both the local and the global index. This is the best status to run in, as it supports the network.
principal
A Principal is a Senior peer that additionally uploads a peer-list to a server. This list helps other peers get in contact with the existing YaCy network to perform a global search. If you are able to upload a file to an FTP server, you can become a Principal by uploading your peer-list.
Why should I run my peer in Senior Mode?
Some p2p-based file-sharing software assigns non-contributing peers a very low priority. We think this is not always fair, since sometimes the operator does not have the option of opening the firewall or configuring the router accordingly. Our idea of 'information wares' and their exchange can also be applied to junior peers: they must contribute to the global index by submitting their index actively, while senior peers contribute passively. Therefore we don't need to give junior peers low priority: they contribute equally, so they may participate equally. But enough senior peers are needed to make this architecture work.
Since every peer contributes almost equally, either actively or passively, Senior Mode is not mandatory. However, because anything a peer adds to the index can only be stored on, and found through, senior peers, you should run in Senior Mode if you can.
Even if only 1/10 of the peers which were Junior as of March 2012 become Senior, the network’s capacity will grow considerably.
My peer says it runs in 'Junior Mode'. How can I run it in Senior Mode?
Open your firewall for port 8090 (or the port you configured) or program your router to forward this port to your computer.
Alternatively, if you can run an SSH tunnel via a host with a public IP, you can run:
ssh -f -N -R remotehost.org:8090:localhost:8090 remotelogin@remotehost.org
which creates a tunnel to remotehost.org, port 8090. (Binding to a public address like this requires GatewayPorts to be enabled in the server's sshd_config.)
How can I change the Connection Timeout value?
This can be done on the configuration page "Admin Console" -> "Advanced behavior" http://localhost:8090/ConfigProperties_p.html. Just search for the line client-timeout and change the value there. The timeout is in milliseconds.
Do not forget to restart YaCy after the change.
Alternatively, you can edit the configuration file httpProxy.conf in DATA/SETTINGS. If you configure it this way, YaCy must be stopped before you edit the file.
Troubleshooting
Something doesn't seem to be working properly; what should I do?
YaCy is still under development, so you should opt for a stable version. The latest stable version can be downloaded from the YaCy homepage. If you are experiencing strange behaviour, first search the community forum for known issues. If the issue is unknown, you can ask for help on the forum (providing the YaCy version, details on when the issue occurs, and if possible an excerpt from the log file to help fix the bug) or open an issue on GitHub.
The first thing to check when experiencing errors is the log located at DATA/LOG/yacy00.log. You can monitor it live using the tail command. Because the log rotates when a certain size is reached, it is better to use the -F option:
tail -F DATA/LOG/yacy00.log
See more about setting up and using the YaCy log.
YaCy is running terribly slow; what should I do?
As an indexing and search host, YaCy is quite resource hungry. It's written in Java. Fast disks (SSD or RAID) and plenty of RAM will help.
It occupies only the amount of RAM specified in “Maximum Used Memory”, so if you have more physical RAM, increasing this value should help.
Sometimes also ‘Database Optimisation’ helps, but it takes some time to run.
For more tips see the Performance Tuning page.
I cannot uninstall because YaCy is still running
First check whether YaCy is really still running. If it is not, it may not have been shut down properly: start YaCy again, shut it down cleanly, and then uninstall. Alternatively, delete the yacy.running file in the yacy/DATA/ directory and then uninstall.
DHT - Distributed Hash Table
How do I give the index of one peer to another?
This actually happens automatically through the DHT distribution of the words. However, it is also possible to transfer the entire index to another peer, either through a so-called Index-Transfer (link needs update) or an Index-Import (link needs update).
Why does the number of RWIs (P2P Chunks) decrease?
YaCy maintains two indexes: the RWI ("Reverse Word Index") and the Solr index. The RWI is the distributed word index that is generated locally and then waits to be distributed to other peers according to a distributed hash table scheme. The target peers then host those RWIs, while they are deleted from your own peer.
This results in larger RWIs on the target peers but a smaller number of RWIs on your peer. That is not a contradiction: distribution increases the size of some RWIs while decreasing their number. The same applies to all peers.
Are DHT entries unique in a search network or can URLs also appear twice or three times?
URLs may be analyzed more than once, so that a delayed peer does not lose its part of the search index. The index entries themselves are stored redundantly on multiple peers.
Crawling / indexing
How do I avoid indexing of some files?
One way is to limit the crawler using regular expressions in the “filters” section of the advanced crawler. For example, entering .*\.tar\.gz in the “Load filter on URLs” field of the “crawler filter” section means that no tar.gz files will be loaded. You can combine several patterns with the “or” operator (|): for example, .*\.tar\.gz|.*\.zip will ignore URLs that end with .tar.gz or .zip.
There are two separate filters: one for crawling (the “crawler filter”) and one for actual indexing (the “document filter”).
Note that the expression is not a “normal regexp” but a Java Pattern.
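You can preview what such a pattern matches before entering it in the crawler. A sketch using grep -E on a made-up URL list: YaCy applies Java Pattern syntax to the whole URL, so the grep pattern is anchored with ^ and $ to mimic that behaviour (for simple suffix patterns like these, the two syntaxes agree):

```shell
# Show which of these sample URLs the filter '.*\.tar\.gz|.*\.zip' matches.
printf '%s\n' \
  'http://example.org/index.html' \
  'http://example.org/dump.tar.gz' \
  'http://example.org/files.zip' \
| grep -E '^.*\.tar\.gz$|^.*\.zip$'
```

Only the two archive URLs are printed; used in a must-not-match filter, those are exactly the URLs the crawler would skip.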
Will already-indexed pages (i.e. indexed and index-exchanged) automatically be reindexed after a few days/years?
Unfortunately no. However, there is the possibility for a chronological "recrawl" to be executed for a URL (or an entire website if desired). Learn more about this feature under "Index Control" -> "Index Creation."
How can I index Tor or Freenet pages?
Indexing of Tor or Freenet pages is for the moment deliberately avoided in the source code, because indexing these pages is not desired at this stage of YaCy's development. However, crawling of such sites is planned for the future, and there have been attempts. Most likely the crawl results will not be distributed globally, but will only be available to the local peer.
How can I crawl with YaCy when I am behind a proxy?
You can set up the proxy at http://<host>:<port>/Settings_p.html?page=proxy or in the configuration file DATA/SETTINGS/yacy.conf:
remoteProxyUse=true
remoteProxyHost=localhost # hostname or address of proxy
remoteProxyPort=8118 # proxy port
How do I remove a certain type of file from the Solr index (e.g. .png or .svg)?
That's easy. Go to Index Deletion (/IndexDeletion_p.html):
- In the first text field, “Delete by URL Matching”, enter e.g. .*\.png for PNG files.
- Check the radio button “matching with regular expression”.
- Hit “Simulate Deletion”. This does not actually delete anything, but it enables the button “Engage Deletion” and shows how many documents would be deleted.
- If you decide you really want to do this, click on “Engage Deletion”. This cannot be undone.
The string that you enter here is a Java Pattern. The actual deletion is performed later, during clean-up, so deleted pages disappear from the index after some time.
What is postprocessing?
After the crawl is finished, the CollectionConfiguration process is executed by the Switchboard to compute all Citation values and, furthermore, to check and mark whether each document is unique in the index (for later down-ranking of non-unique documents). The status and progress of postprocessing are displayed in the Crawler Monitor.
The postprocessing calculates the PageRank, which is computationally very expensive. Therefore postprocessing is disabled in recent releases; if it is not disabled in yours, please disable it.
To enable postprocessing again, you must switch on a specific index field (“process_sxt”) in the index schema, which you can find at http://localhost:8090/IndexSchema_p.html. Freshly crawled content can then be processed, but postprocessing starts only after the complete crawl has finished and the crawl stack is completely empty. It does not start instantly, but only when the clean-up job runs, which happens every 10 minutes.
Another condition is that the Web Structure Index is switched on, which you can find at http://localhost:8090/IndexFederated_p.html - but that should be on by default.
What is Citation Reference?
While the values for the reference evaluation are computed, a backlink structure can also be discovered and written to the index. The host browser shows such backlinks for each presented link and can therefore show where a document is linked from. The citation reference is computed as the likelihood of a random click path, with recursive use of previously computed likelihoods. This process is repeated until the likelihood converges to a specific number, which is then normalized to a ranking value CRn, 0 <= CRn <= 1. The value CRn can therefore be used to rank popularity within intra-domain link structures.
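Written out, the iteration described above resembles a PageRank-style recurrence. A sketch in standard notation; the damping factor d and the exact normalization used by YaCy are assumptions, not taken from this FAQ:

```latex
% CR_k(p): citation likelihood of page p after k iterations; N pages in total;
% q -> p ranges over pages linking to p; d is an assumed damping factor.
CR_{k+1}(p) = \frac{1-d}{N} + d \sum_{q \to p} \frac{CR_k(q)}{\operatorname{outdeg}(q)},
\qquad \text{iterated until convergence, then normalized so that } 0 \le CR_n \le 1
```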
What is the difference between Citation Reference (reverse link index) and Webgraph?
Both contain the same data: the links leading from page to page, used to calculate the CitationRank and hence 'popularity'.
The only difference is in storage: the Webgraph is stored in a second Solr core, while the Citation Reference is stored internally (e.g. in DATA/INDEX/freeworld/SEGMENTS/default/citation*).
The number of Solr Webgraph entries is limited to 2147483519, which is reached after several million pages have been indexed (each page contributes many link entries). This limitation could be overcome by using a Solr cluster.
Passwords
I can no longer log in to YaCy because I forgot my password. How do I reset it?
If you have lost your password, you can reset it (or choose a new one). There are two methods:
Password reset while YaCy is running
This is the most convenient way, as you don't need to shut down YaCy:
Open a command-line terminal, log in to the user account running YaCy, and then execute <yacy-app>/bin/passwd.sh <new-password>
This changes the password of the admin account named 'admin' only.
Password reset while YaCy is shut down: edit the file DATA/SETTINGS/yacy.conf:
- remove the entry serverAccountBase64MD5
- remove the entry adminAccount (if any)
- choose a new password by setting the entry serverAccount to <account>:<password>, for example serverAccount=admin:mysecretpassword
The next time you start YaCy the account/password combination will be read, encrypted and then deleted from yacy.conf, so that it will not be available in plain text anywhere anymore.
You will then be able to log in to YaCy again with the account/password combination you entered in the yacy.conf file.
Disk space
How can I limit the size of single files to be downloaded?
The maximum file size can be set under Advanced settings -> Crawler settings. Maximum sizes can be specified for HTTP and FTP. The file size is in bytes.
How many links/words and how much disk space can a YaCy instance manage?
The number of storable links/words is theoretically unlimited, but in practice it is limited by the slowdown of the indexing process as the number of links/words grows. There are users with more than 10 million web pages indexed in their YaCy instance. The space needed for the index of a web page depends on the size and nature of the document; with 10 million web pages indexed, an index size of 20 GB is not uncommon.
Can I limit the size of the indexes on my hard-drive?
Not directly, for the moment. Limiting that size automatically would mean deleting stored indexes, which is not desirable.
You can set two minimums of free disk space at /Performance_p.html: one for crawls and one for DHT-in. The value for crawls apparently has to be equal to or larger than the value for DHT-in. Note that with DHT-in disabled, global searching via the peer's UI is disabled, and proxy/crawling privacy might suffer. You can also simply disable “Index Receive” at /ConfigNetwork_p.html, so that your index only grows through crawling (over which you have some control). As a very indirect additional limit, you can change the Index Reference Size at /IndexControlRWIs_p.html.
Further info
Where do I find more documentation?
You can consult the legacy wiki. Not all information has been transferred to this FAQ yet.
You can search the community forum or ask questions there.
For more theoretical concepts behind YaCy, you can see slides for talks of Michael Christen, the main developer, slides for lecture Information Retrieval (partly in German) and a scientific paper about YaCy.
How can I help?
First of all: run YaCy in senior mode. This helps to enrich the global index and to make YaCy more attractive.
If you're an advanced user, you can be a big help to newbies in the community forum. Answering questions and sharing your knowledge and experience keeps the community alive.
If you want to add your own code, you are welcome; but please contact the author first and discuss your idea to see how it may fit into the overall architecture.
If you can do Java, you can try to fix some of the issues on github. Every Java developer is warmly welcomed.
The documentation also needs a lot of improvement, and you can help a lot by editing it or adding your own remarks and experience.
And if you find a bug or an uncovered use case, we welcome your bug report. Please describe the problem precisely (expected and actual behavior), try to provide as much information as possible to reconstruct the problem, and attach the relevant log entries.
You can help a lot by simply giving us feedback or telling us about new ideas.
You can also help by telling other people about this software.