Solr Schema of YaCy

YaCy stores the information about crawled webpages and the links between them in two Solr databases: collection1 for page indexing and webgraph for link structure indexing.

The schema of collection1 is defined in defaults/solr.collection.schema and source/net/yacy/search/schema/CollectionSchema.java and can be modified in Index Administration > Solr Schema Editor.

Field names usually carry a suffix that indicates the type: _dt for a date, _dts for multiple dates, _i for an integer number, _l for a long number, _b for a boolean (true/false), _s for a string, _txt for text, _sxt for multiple strings, and _p for geo-coordinates.
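
The suffix convention can be sketched as a small lookup. This is an illustrative Python sketch, not YaCy code: the names SUFFIX_TYPES and suffix_type are hypothetical, the table only contains the suffixes named in this section, and the meaning given for _sxt is an assumption inferred from the list-valued (string) fields documented here. Fields without a listed suffix, such as id or sku, return None.

```python
# Suffix -> meaning, taken from the naming convention described in this section.
# The "_sxt" meaning is an assumption inferred from the list-valued string fields.
SUFFIX_TYPES = {
    "_dts": "multiple dates",
    "_dt": "date",
    "_i": "integer",
    "_l": "long",
    "_b": "boolean",
    "_s": "string",
    "_txt": "text",
    "_sxt": "multiple strings",
    "_p": "geo-coordinates",
}


def suffix_type(field_name):
    """Return the meaning of the field's suffix, or None if no suffix matches."""
    # Try longer suffixes first so "_dts" is not mistaken for a shorter suffix.
    for suffix in sorted(SUFFIX_TYPES, key=len, reverse=True):
        if field_name.endswith(suffix):
            return SUFFIX_TYPES[suffix]
    return None


for name in ("load_date_dt", "dates_in_content_dts", "host_s", "sku"):
    print(name, "->", suffix_type(name))
```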

The fields are:

mandatory

id (string) - primary key of the document, the URL hash

sku (string) - url of document - a 'sku' is a stock-keeping unit, a unique identifier and a default field in unmodified solr.

last_modified (date) - last-modified from http header - date document was last modified, needed for media search and /date operator

load_date_dt (date) - time when resource was loaded

content_type (string) - mime-type of document

title (text_general) - content of title tag

host_id_s (string) - id of the host, a 6-byte hash that is part of the document id - String hosthash();

host_s (string) - host of the url

size_i (integer) - the size of the raw source - int size();

failreason_s (string) - fail reason if a page was not loaded; if the page was loaded, this field is empty

failtype_s (string) - fail type if a page was not loaded. This field is either empty, 'excl' or 'fail'

httpstatus_i (num_integer) - HTTP status return code (e.g. "200" for ok), -1 if not loaded

url_file_ext_s (string) - the file name extension

host_organization_s (string) - either the second level domain or, if a ccSLD is used, the third level domain - needed to search in the url

inboundlinks_urlstub_sxt (string) - internal links, the url only without the protocol - needed for IndexBrowser

inboundlinks_protocol_sxt (string) - internal links, only the protocol - for correct assembly of inboundlinks inboundlinks_protocol_sxt + inboundlinks_urlstub_sxt is needed

outboundlinks_protocol_sxt (string) - external links, only the protocol - for correct assembly of outboundlinks outboundlinks_protocol_sxt + outboundlinks_urlstub_sxt is needed

outboundlinks_urlstub_sxt (string) - external links, the url only without the protocol - needed to enhance the crawler

images_urlstub_sxt (string) - all image links without the protocol and '://'

images_protocol_sxt (string) - all image link protocols - for correct assembly of image url images_protocol_sxt + images_urlstub_sxt is needed
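
Because links are stored as separate protocol and urlstub lists, a reader of the index has to join them again. The following Python sketch shows one way to do that. It assumes that the two lists are parallel (same length, same order) and that the urlstub omits '://', as stated for images_urlstub_sxt; the function name assemble_urls and the sample document are hypothetical.

```python
def assemble_urls(protocols, urlstubs):
    """Join parallel protocol and urlstub lists into full URLs."""
    # Assumption: entry i of both lists belongs to the same link.
    if len(protocols) != len(urlstubs):
        raise ValueError("protocol and urlstub lists must have the same length")
    return [f"{protocol}://{stub}" for protocol, stub in zip(protocols, urlstubs)]


# Hypothetical fragment of an indexed document with two image links.
doc = {
    "images_protocol_sxt": ["https", "http"],
    "images_urlstub_sxt": ["example.org/logo.png", "example.org/img/photo.jpg"],
}
image_urls = assemble_urls(doc["images_protocol_sxt"], doc["images_urlstub_sxt"])
print(image_urls)
```

The same helper would apply to the inboundlinks_* and outboundlinks_* field pairs under the same assumption.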

optional but recommended, part of index distribution

fresh_date_dt (date) - date until resource shall be considered as fresh

referrer_id_s (string) - id of the referrer to this document, discovered during crawling - byte[] referrerHash();

publisher_t (text_general) - the name of the publisher of the document - String dc_publisher();

language_s (string) - the language used in the document - byte[] language();

audiolinkscount_i (num_integer) - number of links to audio resources - int laudio();

videolinkscount_i (num_integer) - number of links to video resources - int lvideo();

applinkscount_i (num_integer) - number of links to application resources - int lapp();

optional but recommended

title_exact_signature_l (num_long) - the 64 bit hash of the org.apache.solr.update.processor.Lookup3Signature of title, used to compute title_unique_b

title_unique_b (bool) - flag shows if title is unique within all indexable documents of the same host with status code 200; if yes and another document appears with same title, the unique-flag is set to false

exact_signature_copycount_i (num_integer) - counter for the number of documents which are not unique (== count of not-unique-flagged documents + 1)

fuzzy_signature_text_t (text_general) - intermediate data produced in EnhancedTextProfileSignature: a list of word frequencies

fuzzy_signature_copycount_i (num_integer) - counter for the number of documents which are not unique (== count of not-unique-flagged documents + 1)

process_sxt (string) - needed (post-)processing steps on this metadata set

dates_in_content_dts (date) - if date expressions can be found in the content, these dates are listed here as date objects in order of the appearances

dates_in_content_count_i (num_integer) - the number of entries in dates_in_content_dts

startDates_dts (date) - content of itemprop attributes with content='startDate'

endDates_dts (date) - content of itemprop attributes with content='endDate'

references_i (num_integer) - number of unique http references, should be equal to references_internal_i + references_external_i

references_internal_i (num_integer) - number of unique http references from same host to referenced url

references_external_i (num_integer) - number of unique http references from external hosts

references_exthosts_i (num_integer) - number of external hosts which provide http references

crawldepth_i (num_integer) - crawl depth of web page according to the number of steps that the crawler did to get to this document; if the crawl was started at a root document, then this is equal to the clickdepth

harvestkey_s (string) - key from a harvest process (i.e. the crawl profile hash key) which is needed for near-realtime postprocessing. This shall be deleted as soon as postprocessing has been terminated.

http_unique_b (bool) - unique-field which is true when an url appears the first time. If the same url which was http then appears as https (or vice versa) then the field is false

www_unique_b (bool) - unique-field which is true when an url appears the first time. If the same url within the subdomain www then appears without that subdomain (or vice versa) then the field is false
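
The two unique flags can be illustrated by building the "other" variant of a URL and checking whether it is already known. This Python sketch is only one reading of the descriptions above, not YaCy's implementation; the function names and the known_urls set are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def protocol_variant(url):
    """Return the same URL with http and https swapped, or None for other schemes."""
    parts = urlsplit(url)
    swapped = {"http": "https", "https": "http"}.get(parts.scheme)
    if swapped is None:
        return None
    return urlunsplit(parts._replace(scheme=swapped))


def www_variant(url):
    """Return the same URL with the leading 'www.' subdomain added or removed."""
    parts = urlsplit(url)
    netloc = parts.netloc
    netloc = netloc[4:] if netloc.startswith("www.") else "www." + netloc
    return urlunsplit(parts._replace(netloc=netloc))


def unique_flags(url, known_urls):
    """Compute both flags for a new URL against a set of already indexed URLs."""
    return {
        "http_unique_b": protocol_variant(url) not in known_urls,
        "www_unique_b": www_variant(url) not in known_urls,
    }


known_urls = {"http://example.org/a"}
print(unique_flags("https://example.org/a", known_urls))
```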

exact_signature_l (num_long) - the 64 bit hash of the org.apache.solr.update.processor.Lookup3Signature of text_t

exact_signature_unique_b (bool) - flag shows if exact_signature_l is unique at the time of document creation, used for double-check during search

fuzzy_signature_l (num_long) - 64 bit of the Lookup3Signature from EnhancedTextProfileSignature of text_t

fuzzy_signature_unique_b (bool) - flag shows if fuzzy_signature_l is unique at the time of document creation, used for double-check during search

coordinate_p (location) - point in degrees of latitude,longitude as declared in WGS84

coordinate_p_0_coordinate (coordinate) - automatically created subfield, (latitude)

coordinate_p_1_coordinate (coordinate) - automatically created subfield, (longitude)

ip_s (string) - ip of host of url (after DNS lookup)

author (text_general) - content of author-tag

author_sxt (string) - content of author-tag as copy-field from author. This is used for facet generation

description_txt (text_general) - content of description-tag(s)

description_exact_signature_l (num_long) - the 64 bit hash of the org.apache.solr.update.processor.Lookup3Signature of description, used to compute description_unique_b

description_unique_b (bool) - flag shows if description is unique within all indexable documents of the same host with status code 200; if yes and another document appears with same description, the unique-flag is set to false

keywords (text_general) - content of keywords tag; words are separated by comma, semicolon or space

charset_s (string) - character encoding

wordcount_i (num_integer) - number of words in visible area

linkscount_i (num_integer) - number of all outgoing links; including linksnofollowcount_i

linksnofollowcount_i (num_integer) - number of all outgoing links with nofollow tag

inboundlinkscount_i (num_integer) - number of outgoing inbound (to same domain) links; including inboundlinksnofollowcount_i

inboundlinksnofollowcount_i (num_integer) - number of outgoing inbound (to same domain) links with nofollow tag

outboundlinkscount_i (num_integer) - number of outgoing outbound (to other domain) links, including outboundlinksnofollowcount_i

outboundlinksnofollowcount_i (num_integer) - number of outgoing outbound (to other domain) links with nofollow tag

imagescount_i (num_integer) - number of images

responsetime_i (num_integer) - response time of target server in milliseconds

text_t (text_general) - all visible text

synonyms_sxt (string) - additional synonyms to the words in the text

h1_txt (text_general) - h1 header

h2_txt (text_general) - h2 header

h3_txt (text_general) - h3 header

h4_txt (text_general) - h4 header

h5_txt (text_general) - h5 header

h6_txt (text_general) - h6 header

unused, delete candidates

md5_s (string) - the md5 of the raw source - String md5() - Deprecated

httpstatus_redirect_s (string) - redirect url if the error code is 299 < httpstatus_i < 310 - TODO: delete candidate, not used so far (2014-12-26)

optional values, not part of standard YaCy handling

(but useful for external applications)

collection_sxt (string) - tags that are attached to crawls/index generation to separate the search result into user-defined subsets called collections

csscount_i (num_integer) - number of entries in css_tag_sxt and css_url_sxt

css_tag_sxt (string) - full css tag with normalized url

css_url_sxt (string) - normalized urls within a css tag

scripts_sxt (string) - normalized urls within a scripts tag

scriptscount_i (num_integer) - number of entries in scripts_sxt

robots_i (num_integer) - content of the <meta name="robots" content="#content#"> tag and of the "X-Robots-Tag" HTTP header, encoded as binary flags in an integer:

bit	value
bit 0	"all" contained in html header meta
bit 1	"index" contained in html header meta
bit 2	"follow" contained in html header meta
bit 3	"noindex" contained in html header meta
bit 4	"nofollow" contained in html header meta
bit 5	"noarchive" contained in html header meta
bit 8	"all" contained in http header X-Robots-Tag
bit 9	"noindex" contained in http header X-Robots-Tag
bit 10	"nofollow" contained in http header X-Robots-Tag
bit 11	"noarchive" contained in http header X-Robots-Tag
bit 12	"nosnippet" contained in http header X-Robots-Tag
bit 13	"noodp" contained in http header X-Robots-Tag
bit 14	"notranslate" contained in http header X-Robots-Tag
bit 15	"noimageindex" contained in http header X-Robots-Tag
bit 16	"unavailable_after" contained in http header X-Robots-Tag
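
A robots_i value can be decoded by testing each bit from the table. The Python sketch below assumes that bit 0 is the least significant bit of the integer; the names ROBOTS_BITS and decode_robots are hypothetical and not part of YaCy.

```python
# Bit position -> (source, directive), copied from the table above.
ROBOTS_BITS = {
    0: ("meta", "all"),
    1: ("meta", "index"),
    2: ("meta", "follow"),
    3: ("meta", "noindex"),
    4: ("meta", "nofollow"),
    5: ("meta", "noarchive"),
    8: ("x-robots-tag", "all"),
    9: ("x-robots-tag", "noindex"),
    10: ("x-robots-tag", "nofollow"),
    11: ("x-robots-tag", "noarchive"),
    12: ("x-robots-tag", "nosnippet"),
    13: ("x-robots-tag", "noodp"),
    14: ("x-robots-tag", "notranslate"),
    15: ("x-robots-tag", "noimageindex"),
    16: ("x-robots-tag", "unavailable_after"),
}


def decode_robots(robots_i):
    """Return the (source, directive) pairs whose bit is set in robots_i."""
    # Assumption: bit 0 is the least significant bit.
    return [ROBOTS_BITS[bit] for bit in sorted(ROBOTS_BITS) if robots_i & (1 << bit)]


# Example: "noindex" and "nofollow" in the meta tag, "noindex" in the HTTP header.
value = (1 << 3) | (1 << 4) | (1 << 9)
print(value, decode_robots(value))
```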

metagenerator_t (text_general) - content of the <meta name="generator" content="#content#"> tag

inboundlinks_anchortext_txt (text_general) - internal links, the visible anchor text

outboundlinks_anchortext_txt (text_general) - external links, the visible anchor text

icons_urlstub_sxt (string) - all icon links without the protocol and ://

icons_protocol_sxt (string) - all icon link protocols; split from icons_urlstub_sxt to provide some compression, as the http protocol is implied as default and not stored

icons_rel_sxt (string) - all icon link relationships, space separated (e.g. 'icon apple-touch-icon')

icons_sizes_sxt (string) - all icon sizes space separated (e.g. '16x16 32x32')

images_text_t (text_general) - all text/words appearing in image alt texts or the tokenized url

images_alt_sxt (string) - all image link alt tag - no need to index this; don't turn it into a txt field; use images_text_t instead

images_height_val (num_integer) - size of images:height

images_width_val (num_integer) - size of images:width

images_pixel_val (num_integer) - size of images as number of pixels (easier for a search restriction than width and height)

images_withalt_i (num_integer) - number of image links with alt tag

htags_i (num_integer) - binary pattern for the existence of h1..h6 headlines

canonical_s (string) - url inside the canonical link element

canonical_equal_sku_b (bool) - flag shows if the url in canonical_s is equal to sku

refresh_s (string) - link from the url property inside the refresh link element

li_txt (text_general) - all texts in <li> tags

licount_i (num_integer) - number of <li> tags

dt_txt (text_general) - all texts in <dt> tags

dtcount_i (num_integer) - number of <dt> tags

dd_txt (text_general) - all texts in <dd> tags

ddcount_i (num_integer) - number of <dd> tags

article_txt (text_general) - all texts in <article> tags

articlecount_i (num_integer) - number of <article> tags

bold_txt (text_general) - all texts inside of <b> or <strong> tags, without duplicates, listed in decreasing order of number of occurrences

boldcount_i (num_integer) - total number of occurrences of <b> or <strong>

italic_txt (text_general) - all texts inside of <i> tags, without duplicates, listed in decreasing order of number of occurrences

italiccount_i (num_integer) - total number of occurrences of <i>

underline_txt (text_general) - all texts inside of <u> tags, without duplicates, listed in decreasing order of number of occurrences

underlinecount_i (num_integer) - total number of occurrences of <u>

flash_b (bool) - flag that shows if an Adobe Flash .swf file is linked

frames_sxt (string) - list of all links to frames

framesscount_i (num_integer) - number of entries in frames_sxt

iframes_sxt (string) - list of all links to iframes

iframesscount_i (num_integer) - number of entries in iframes_sxt

hreflang_url_sxt (string) - url of the hreflang link tag, see https://developers.google.com/search/docs/specialty/international/localized-versions

hreflang_cc_sxt (string) - country code of the hreflang link tag, see https://developers.google.com/search/docs/specialty/international/localized-versions

navigation_url_sxt (string) - page navigation url, see https://webmasters.googleblog.com/2011/09/pagination-with-relnext-and-relprev.html

navigation_type_sxt (string) - page navigation rel property value, can contain one of {top,up,next,prev,first,last}

publisher_url_s (string) - publisher url as defined in https://web.archive.org/web/20140530224715/https://support.google.com/plus/answer/1713826?hl=de

url_protocol_s (string) - the protocol of the url

url_file_name_s (string) - the file name (which is the string after the last '/' and before the query part from '?' on) without the file extension

url_file_name_tokens_t (text_general) - tokens generated from url_file_name_s which can be used for better matching and result boosting

url_paths_count_i (num_integer) - number of all path elements in the url hpath (see: https://www.ietf.org/rfc/rfc1738.txt) without the file name

url_paths_sxt (string) - all path elements in the url hpath (see: https://www.ietf.org/rfc/rfc1738.txt) without the file name

url_parameter_i (num_integer) - number of key-value pairs in search part of the url

url_parameter_key_sxt (string) - the keys from key-value pairs in the search part of the url

url_parameter_value_sxt (string) - the values from key-value pairs in the search part of the url

url_chars_i (num_integer) - number of all characters in the url == length of sku field
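
How the url_* fields relate to a concrete URL can be sketched with Python's standard urllib.parse module. This is an illustration of the field descriptions above, not YaCy's own parser; the function name url_fields is hypothetical, and edge cases such as URLs without a file name may be handled differently by YaCy.

```python
import posixpath
from urllib.parse import parse_qsl, urlsplit


def url_fields(url):
    """Derive the url_* fields described above from a URL string."""
    parts = urlsplit(url)
    # Split the path into the directory part and the file name.
    directory, filename = posixpath.split(parts.path)
    name, ext = posixpath.splitext(filename)
    paths = [element for element in directory.split("/") if element]
    # Key-value pairs of the search (query) part, keeping empty values.
    params = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "url_protocol_s": parts.scheme,
        "url_file_name_s": name,
        "url_file_ext_s": ext.lstrip("."),
        "url_paths_sxt": paths,
        "url_paths_count_i": len(paths),
        "url_parameter_i": len(params),
        "url_parameter_key_sxt": [key for key, _ in params],
        "url_parameter_value_sxt": [value for _, value in params],
        "url_chars_i": len(url),
    }


url = "https://example.org/docs/guide/index.html?lang=en&page=2"
fields = url_fields(url)
for key, value in fields.items():
    print(key, "=", value)
```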

host_dnc_s (string) - the Domain Class Name, either the TLD or a combination of ccSLD+TLD if a ccSLD is used.

host_organizationdnc_s (string) - the organization and dnc concatenated with '.'

host_subdomain_s (string) - the remaining part of the host without organizationdnc
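
The split of a host name into host_subdomain_s, host_organization_s and host_dnc_s can be sketched as follows. The ccSLD set in this Python example is a tiny hypothetical sample for illustration only, and the function name host_parts is an assumption; YaCy's actual list and edge-case handling are not shown here.

```python
# Tiny hypothetical list of ccSLD+TLD combinations; a real list is much longer.
CC_SLDS = {"co.uk", "org.uk", "com.au", "co.jp"}


def host_parts(host):
    """Split a host name into subdomain, organization and domain class name."""
    labels = host.lower().split(".")
    # The dnc spans two labels if they form a known ccSLD+TLD, else only the TLD.
    dnc_len = 2 if len(labels) >= 3 and ".".join(labels[-2:]) in CC_SLDS else 1
    dnc = ".".join(labels[-dnc_len:])
    organization = labels[-dnc_len - 1] if len(labels) > dnc_len else ""
    subdomain = ".".join(labels[:-dnc_len - 1])
    return {
        "host_dnc_s": dnc,
        "host_organization_s": organization,
        "host_organizationdnc_s": f"{organization}.{dnc}" if organization else dnc,
        "host_subdomain_s": subdomain,
    }


print(host_parts("www.bbc.co.uk"))
print(host_parts("blog.example.org"))
```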

host_extent_i (num_integer) - number of documents from the same host; can be used to measure references_internal_i for likelihood computation

title_count_i (num_integer) - number of titles (counting the 'title' field) in the document

title_chars_val (num_integer) - number of characters for each title

title_words_val (num_integer) - number of words in each title

description_count_i (num_integer) - number of descriptions in the document. It does not count the 'description' field, since there is only one, but counts the number of descriptions that appear in the document (if any)

description_chars_val (num_integer) - number of characters for each description

description_words_val (num_integer) - number of words in each description

h1_i (num_integer) - number of h1 header lines

h2_i (num_integer) - number of h2 header lines

h3_i (num_integer) - number of h3 header lines

h4_i (num_integer) - number of h4 header lines

h5_i (num_integer) - number of h5 header lines

h6_i (num_integer) - number of h6 header lines

schema_org_breadcrumb_i (num_integer) - number of itemprop="breadcrumb" appearances in div tags

opengraph_title_t (text_general) - Open Graph Metadata from og:title metadata field, see https://ogp.me/

opengraph_type_s (text_general) - Open Graph Metadata from og:type metadata field, see https://ogp.me/

opengraph_url_s (text_general) - Open Graph Metadata from og:url metadata field, see https://ogp.me/

opengraph_image_s (text_general) - Open Graph Metadata from og:image metadata field, see https://ogp.me/

link structure for ranking

cr_host_count_i (num_integer) - the number of documents within a single host

cr_host_chance_d (num_double) - the chance to click on this page when randomly clicking on links within one host

cr_host_norm_i (num_integer) - normalization of chance: 0 for the lower half of cr_host_count_i urls, 1 for 1/2 of the remaining and so on; the maximum number is 10
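
One possible reading of this normalization is sketched below in Python. It assumes that the URLs of a host are ranked in ascending order of cr_host_chance_d and that rank is the 0-based position in that order; both the ranking and the function name cr_host_norm are assumptions, and YaCy's actual computation may differ.

```python
def cr_host_norm(rank, count):
    """Map a 0-based ascending rank among `count` URLs to a value from 0 to 10."""
    norm = 0
    lower = 0.0         # start of the current bucket
    span = count / 2    # size of the current bucket: half of what remains
    # Move to the next bucket while the rank lies beyond the current one.
    while norm < 10 and rank >= lower + span:
        lower += span
        span /= 2
        norm += 1
    return norm


# With 8 URLs: ranks 0-3 form the lower half, ranks 4-5 half of the rest, and so on.
print([cr_host_norm(rank, 8) for rank in range(8)])
```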

custom rating

values to influence the ranking in combination with boost rules

rating_i (num_integer) - custom rating; to be set with external rating information

special values

can only be used if '_val' type is defined in schema file; this is not standard

bold_val (num_integer) - number of occurrences of texts in bold_txt

italic_val (num_integer) - number of occurrences of texts in italic_txt

underline_val (num_integer) - number of occurrences of texts in underline_txt

ext_cms_txt (text_general) - names of cms attributes; if several are recognized then they are listed in decreasing order of number of matching criteria

ext_cms_val (num_integer) - number of attributes that count for a specific cms in ext_cms_txt

ext_ads_txt (text_general) - names of ad-servers/ad-services

ext_ads_val (num_integer) - number of attributes counts in ext_ads_txt

ext_community_txt (text_general) - names of recognized community functions

ext_community_val (num_integer) - number of attribute counts in ext_community_txt

ext_maps_txt (text_general) - names of map services

ext_maps_val (num_integer) - number of attribute counts in ext_maps_txt

ext_tracker_txt (text_general) - names of tracker servers

ext_tracker_val (num_integer) - number of attribute counts in ext_tracker_txt

ext_title_txt (text_general) - names matching title expressions

ext_title_val (num_integer) - number of matching title expressions

vocabularies_sxt (string) - collection of all vocabulary names that have a matcher in the document - use this to boost with vocabularies