Solr Schema of YaCy
YaCy stores the information about crawled webpages and the links between them in two Solr databases: collection1 for page indexing and webgraph for link structure indexing.
Schema of collection1
The schema is defined in defaults/solr.collection.schema and source/net/yacy/search/schema/CollectionSchema.java, and can be modified in Index Administration > Solr Schema Editor.
The names of fields are usually suffixed with _dt if they contain a date, _dts for multiple dates, _i for an integer number, _l for a long number, _b for a boolean (true/false), _s for a string, _txt for text, _sxt for multiple strings, and _p for geo-coordinates.
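As a small illustration of the naming convention above, the following sketch maps a field name to the type hint implied by its suffix. The function name describe_suffix and the table layout are made up for this example. The _sxt entry is an assumption based on the fields listed below, which are typed (string) and hold lists. Suffixes that appear later but are not part of the convention paragraph (such as _t, _d or _val) are deliberately left out.

```python
# Suffix-to-type table transcribed from the naming convention paragraph.
SUFFIX_TYPES = {
    "_dt": "date",
    "_dts": "multiple dates",
    "_i": "integer",
    "_l": "long",
    "_b": "boolean",
    "_s": "string",
    "_txt": "text",
    "_sxt": "multiple strings",  # assumption: inferred from the (string) list fields
    "_p": "geo-coordinates",
}


def describe_suffix(field_name):
    """Return the type hint implied by a field name's suffix, or None."""
    for suffix, type_hint in SUFFIX_TYPES.items():
        if field_name.endswith(suffix):
            return type_hint
    # Fields such as 'id', 'sku' or 'title' carry no suffix from the table.
    return None


if __name__ == "__main__":
    for name in ("load_date_dt", "dates_in_content_dts", "size_i", "title"):
        print(name, "->", describe_suffix(name))
```

Fields without a listed suffix return None, so callers can fall back to the type given in the field description.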
The fields are:
mandatory
id
(string) - primary key of document, the URL hash; mandatory field
sku
(string) - url of document - a 'sku' is a stock-keeping unit, a unique identifier and a default field in unmodified solr.
last_modified
(date) - last-modified from http header - date document was last modified, needed for media search and /date operator
load_date_dt
(date) - time when resource was loaded
content_type
(string) - mime-type of document
title
(text_general) - content of title tag
host_id_s
(string) - id of the host, a 6-byte hash that is part of the document id - String hosthash();
host_s
(string) - host of the url
size_i
(integer) - the size of the raw source - int size();
failreason_s
(string) - fail reason if a page was not loaded. if the page was loaded then this field is empty
failtype_s
(string) - fail type if a page was not loaded. This field is either empty, 'excl' or 'fail'
httpstatus_i
(num_integer) - HTTP status return code (e.g. "200" for ok), -1 if not loaded
url_file_ext_s
(string) - the file name extension
host_organization_s
(string) - either the second level domain or, if a ccSLD is used, the third level domain - needed to search in the url
inboundlinks_urlstub_sxt
(string) - internal links, the url only without the protocol - needed for IndexBrowser
inboundlinks_protocol_sxt
(string) - internal links, only the protocol - for correct assembly of inbound links, inboundlinks_protocol_sxt + inboundlinks_urlstub_sxt is needed
outboundlinks_protocol_sxt
(string) - external links, only the protocol - for correct assembly of outbound links, outboundlinks_protocol_sxt + outboundlinks_urlstub_sxt is needed
outboundlinks_urlstub_sxt
(string) - external links, the url only without the protocol - needed to enhance the crawler
images_urlstub_sxt
(string) - all image links without the protocol and '://'
images_protocol_sxt
(string) - all image link protocols - for correct assembly of image urls, images_protocol_sxt + images_urlstub_sxt is needed
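The split between *_protocol_sxt and *_urlstub_sxt fields means a full link has to be put back together from two values. The sketch below shows one way to do that. It assumes that both fields have already been read as parallel lists with exactly one protocol per urlstub. If the index stores the protocol list in a more compact form, it would have to be expanded first. The helper name assemble_urls and the sample values are hypothetical.

```python
def assemble_urls(protocols, urlstubs):
    """Join parallel protocol and urlstub lists into full URLs.

    Assumption: protocols[i] belongs to urlstubs[i]; the stubs carry
    neither the protocol nor the '://' separator.
    """
    if len(protocols) != len(urlstubs):
        raise ValueError("protocol and urlstub lists differ in length")
    return [f"{protocol}://{stub}" for protocol, stub in zip(protocols, urlstubs)]


if __name__ == "__main__":
    # Hypothetical values for images_protocol_sxt and images_urlstub_sxt.
    protocols = ["https", "http"]
    urlstubs = ["example.org/a.png", "example.org/img/b.jpg"]
    for url in assemble_urls(protocols, urlstubs):
        print(url)
```

The same helper would apply to the inboundlinks_* and outboundlinks_* pairs under the same assumption.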
optional but recommended, part of index distribution
fresh_date_dt
(date) - date until resource shall be considered as fresh
referrer_id_s
(string) - id of the referrer to this document, discovered during crawling - byte[] referrerHash();
publisher_t
(text_general) - the name of the publisher of the document - String dc_publisher();
language_s
(string) - the language used in the document - byte[] language();
audiolinkscount_i
(num_integer) - number of links to audio resources - int laudio();
videolinkscount_i
(num_integer) - number of links to video resources - int lvideo();
applinkscount_i
(num_integer) - number of links to application resources - int lapp();
optional but recommended
title_exact_signature_l
(num_long) - the 64 bit hash of the org.apache.solr.update.processor.Lookup3Signature of title, used to compute title_unique_b
title_unique_b
(bool) - flag shows if title is unique within all indexable documents of the same host with status code 200; if yes and another document appears with same title, the unique-flag is set to false
exact_signature_copycount_i
(num_integer) - counter for the number of documents which are not unique (== count of not-unique-flagged documents + 1)
fuzzy_signature_text_t
(text_general) - intermediate data produced in EnhancedTextProfileSignature: a list of word frequencies
fuzzy_signature_copycount_i
(num_integer) - counter for the number of documents which are not unique (== count of not-unique-flagged documents + 1)
process_sxt
(string) - needed (post-)processing steps on this metadata set
dates_in_content_dts
(date) - if date expressions can be found in the content, these dates are listed here as date objects in order of appearance
dates_in_content_count_i
(num_integer) - the number of entries in dates_in_content_dts
startDates_dts
(date) - content of itemprop attributes with content='startDate'
endDates_dts
(date) - content of itemprop attributes with content='endDate'
references_i
(num_integer) - number of unique http references, should be equal to references_internal_i + references_external_i
references_internal_i
(num_integer) - number of unique http references from same host to referenced url
references_external_i
(num_integer) - number of unique http references from external hosts
references_exthosts_i
(num_integer) - number of external hosts which provide http references
crawldepth_i
(num_integer) - crawl depth of web page according to the number of steps that the crawler did to get to this document; if the crawl was started at a root document, then this is equal to the clickdepth
harvestkey_s
(string) - key from a harvest process (i.e. the crawl profile hash key) which is needed for near-realtime postprocessing. This shall be deleted as soon as postprocessing has been terminated.
http_unique_b
(bool) - unique-field which is true when an url appears the first time. If the same url which was http then appears as https (or vice versa) then the field is false
www_unique_b
(bool) - unique-field which is true when an url appears the first time. If the same url within the subdomain www then appears without that subdomain (or vice versa) then the field is false
exact_signature_l
(num_long) - the 64 bit hash of the org.apache.solr.update.processor.Lookup3Signature of text_t
exact_signature_unique_b
(bool) - flag shows if exact_signature_l is unique at the time of document creation, used for double-check during search
fuzzy_signature_l
(num_long) - 64 bit of the Lookup3Signature from EnhancedTextProfileSignature of text_t
fuzzy_signature_unique_b
(bool) - flag shows if fuzzy_signature_l is unique at the time of document creation, used for double-check during search
coordinate_p
(location) - point in degrees of latitude,longitude as declared in WGS84
coordinate_p_0_coordinate
(coordinate) - automatically created subfield, (latitude)
coordinate_p_1_coordinate
(coordinate) - automatically created subfield, (longitude)
ip_s
(string) - ip of host of url (after DNS lookup)
author
(text_general) - content of author-tag
author_sxt
(string) - content of author-tag as copy-field from author. This is used for facet generation
description_txt
(text_general) - content of description-tag(s)
description_exact_signature_l
(num_long) - the 64 bit hash of the org.apache.solr.update.processor.Lookup3Signature of description, used to compute description_unique_b
description_unique_b
(bool) - flag shows if description is unique within all indexable documents of the same host with status code 200; if yes and another document appears with same description, the unique-flag is set to false
keywords
(text_general) - content of keywords tag; words are separated by comma, semicolon or space
charset_s
(string) - character encoding
wordcount_i
(num_integer) - number of words in visible area
linkscount_i
(num_integer) - number of all outgoing links; including linksnofollowcount_i
linksnofollowcount_i
(num_integer) - number of all outgoing links with nofollow tag
inboundlinkscount_i
(num_integer) - number of outgoing inbound (to same domain) links; including inboundlinksnofollowcount_i
inboundlinksnofollowcount_i
(num_integer) - number of outgoing inbound (to same domain) links with nofollow tag
outboundlinkscount_i
(num_integer) - number of outgoing outbound (to other domain) links, including outboundlinksnofollowcount_i
outboundlinksnofollowcount_i
(num_integer) - number of outgoing outbound (to other domain) links with nofollow tag
imagescount_i
(num_integer) - number of images
responsetime_i
(num_integer) - response time of target server in milliseconds
text_t
(text_general) - all visible text
synonyms_sxt
(string) - additional synonyms to the words in the text
h1_txt
(text_general) - h1 header
h2_txt
(text_general) - h2 header
h3_txt
(text_general) - h3 header
h4_txt
(text_general) - h4 header
h5_txt
(text_general) - h5 header
h6_txt
(text_general) - h6 header
unused, delete candidates
md5_s
(string) - the md5 of the raw source - String md5() - Deprecated
httpstatus_redirect_s
(string) - redirect url if the error code is 299 < httpstatus_i < 310 - TODO: delete candidate, not used so far (2014-12-26)
optional values, not part of standard YaCy handling (but useful for external applications)
collection_sxt
(string) - tags that are attached to crawls/index generation to separate the search result into user-defined subsets called collections
csscount_i
(num_integer) - number of entries in css_tag_sxt and css_url_sxt
css_tag_sxt
(string) - full css tag with normalized url
css_url_sxt
(string) - normalized urls within a css tag
scripts_sxt
(string) - normalized urls within a scripts tag
scriptscount_i
(num_integer) - number of entries in scripts_sxt
robots_i
(num_integer) - content of the <meta name="robots" content=#content#> tag and the "X-Robots-Tag" HTTP property, encoded as a binary value into an integer:
| bit | value |
|---|---|
| bit 0 | "all" contained in html header meta |
| bit 1 | "index" contained in html header meta |
| bit 2 | "follow" contained in html header meta |
| bit 3 | "noindex" contained in html header meta |
| bit 4 | "nofollow" contained in html header meta |
| bit 5 | "noarchive" contained in html header meta |
| bit 8 | "all" contained in http header X-Robots-Tag |
| bit 9 | "noindex" contained in http header X-Robots-Tag |
| bit 10 | "nofollow" contained in http header X-Robots-Tag |
| bit 11 | "noarchive" contained in http header X-Robots-Tag |
| bit 12 | "nosnippet" contained in http header X-Robots-Tag |
| bit 13 | "noodp" contained in http header X-Robots-Tag |
| bit 14 | "notranslate" contained in http header X-Robots-Tag |
| bit 15 | "noimageindex" contained in http header X-Robots-Tag |
| bit 16 | "unavailable_after" contained in http header X-Robots-Tag |
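To read a robots_i value back into its directives, the bit table above can be applied with a simple mask test. The following is a minimal sketch, assuming that bit 0 is the least significant bit of the integer. The function name decode_robots and the label format are chosen for this example only.

```python
# Bit positions transcribed from the robots_i table; bits 6 and 7 are not listed there.
ROBOTS_BITS = {
    0: "meta:all",
    1: "meta:index",
    2: "meta:follow",
    3: "meta:noindex",
    4: "meta:nofollow",
    5: "meta:noarchive",
    8: "X-Robots-Tag:all",
    9: "X-Robots-Tag:noindex",
    10: "X-Robots-Tag:nofollow",
    11: "X-Robots-Tag:noarchive",
    12: "X-Robots-Tag:nosnippet",
    13: "X-Robots-Tag:noodp",
    14: "X-Robots-Tag:notranslate",
    15: "X-Robots-Tag:noimageindex",
    16: "X-Robots-Tag:unavailable_after",
}


def decode_robots(robots_i):
    """Return the directives whose bit is set, ordered by bit position.

    Assumption: bit 0 is the least significant bit of robots_i.
    """
    return [label for bit, label in sorted(ROBOTS_BITS.items())
            if robots_i & (1 << bit)]


if __name__ == "__main__":
    # Bits 3 and 4 set: 8 + 16 = 24
    print(decode_robots(24))  # ['meta:noindex', 'meta:nofollow']
```

A value of 0 yields an empty list, meaning none of the listed directives was found.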
metagenerator_t
(text_general) - content of <meta name="generator" content=#content#> tag
inboundlinks_anchortext_txt
(text_general) - internal links, the visible anchor text
outboundlinks_anchortext_txt
(text_general) - external links, the visible anchor text
icons_urlstub_sxt
(string) - all icon links without the protocol and '://'
icons_protocol_sxt
(string) - all icon link protocols: split from icons_urlstub_sxt to provide some compression, as the http protocol is implied as default and not stored
icons_rel_sxt
(string) - all icon link relationships, space separated (e.g. 'icon apple-touch-icon')
icons_sizes_sxt
(string) - all icon sizes space separated (e.g. '16x16 32x32')
images_text_t
(text_general) - all text/words appearing in image alt texts or the tokenized url
images_alt_sxt
(string) - all image link alt tags - no need to index this; don't turn it into a txt field; use images_text_t instead
images_height_val
(num_integer) - size of images:height
images_width_val
(num_integer) - size of images:width
images_pixel_val
(num_integer) - size of images as number of pixels (easier for a search restriction than width and height)
images_withalt_i
(num_integer) - number of image links with alt tag
htags_i
(num_integer) - binary pattern for the existence of h1..h6 headlines
canonical_s
(string) - url inside the canonical link element
canonical_equal_sku_b
(bool) - flag shows if the url in canonical_s is equal to sku
refresh_s
(string) - link from the url property inside the refresh link element
li_txt
(text_general) - all texts in <li> tags
licount_i
(num_integer) - number of <li> tags
dt_txt
(text_general) - all texts in <dt> tags
dtcount_i
(num_integer) - number of <dt> tags
dd_txt
(text_general) - all texts in <dd> tags
ddcount_i
(num_integer) - number of <dd> tags
article_txt
(text_general) - all texts in <article> tags
articlecount_i
(num_integer) - number of <article> tags
bold_txt
(text_general) - all texts inside of <b> or <strong> tags, without duplicates, listed in decreasing order of number of occurrences
boldcount_i
(num_integer) - total number of occurrences of <b> or <strong>
italic_txt
(text_general) - all texts inside of <i> tags, without duplicates, listed in decreasing order of number of occurrences
italiccount_i
(num_integer) - total number of occurrences of <i>
underline_txt
(text_general) - all texts inside of <u> tags, without duplicates, listed in decreasing order of number of occurrences
underlinecount_i
(num_integer) - total number of occurrences of <u>
flash_b
(bool) - flag that shows if an Adobe Flash .swf file is linked
frames_sxt
(string) - list of all links to frames
framesscount_i
(num_integer) - number of entries in frames_sxt
iframes_sxt
(string) - list of all links to iframes
iframesscount_i
(num_integer) - number of entries in iframes_sxt
hreflang_url_sxt
(string) - url of the hreflang link tag, see https://developers.google.com/search/docs/specialty/international/localized-versions
hreflang_cc_sxt
(string) - country code of the hreflang link tag, see https://developers.google.com/search/docs/specialty/international/localized-versions
navigation_url_sxt
(string) - page navigation url, see https://webmasters.googleblog.com/2011/09/pagination-with-relnext-and-relprev.html
navigation_type_sxt
(string) - page navigation rel property value, can contain one of {top,up,next,prev,first,last}
publisher_url_s
(string) - publisher url as defined in https://web.archive.org/web/20140530224715/https://support.google.com/plus/answer/1713826?hl=de
url_protocol_s
(string) - the protocol of the url
url_file_name_s
(string) - the file name (which is the string after the last '/' and before the query part from '?' on) without the file extension
url_file_name_tokens_t
(text_general) - tokens generated from url_file_name_s which can be used for better matching and result boosting
url_paths_count_i
(num_integer) - number of all path elements in the url hpath (see: https://www.ietf.org/rfc/rfc1738.txt) without the file name
url_paths_sxt
(string) - all path elements in the url hpath (see: https://www.ietf.org/rfc/rfc1738.txt) without the file name
url_parameter_i
(num_integer) - number of key-value pairs in search part of the url
url_parameter_key_sxt
(string) - the keys from key-value pairs in the search part of the url
url_parameter_value_sxt
(string) - the values from key-value pairs in the search part of the url
url_chars_i
(num_integer) - number of all characters in the url == length of sku field
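The url_* fields above describe a decomposition of the document URL. The sketch below approximates that decomposition with Python's standard urllib.parse module. It is only an illustration of the field descriptions, not YaCy's own implementation. Details such as how a file name with several dots is split, or how empty path elements are treated, are assumptions of this example, and the function name url_fields is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl


def url_fields(url):
    """Derive approximate values for the url_* fields from a URL string."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # The string after the last '/' is the file; everything before it is the path.
    filename = segments[-1]
    paths = [segment for segment in segments[:-1] if segment]
    if "." in filename:
        # Assumption: the extension is the part after the last dot.
        name, _, ext = filename.rpartition(".")
    else:
        name, ext = filename, ""
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "url_protocol_s": parts.scheme,
        "url_file_name_s": name,
        "url_file_ext_s": ext,
        "url_paths_sxt": paths,
        "url_paths_count_i": len(paths),
        "url_parameter_i": len(pairs),
        "url_parameter_key_sxt": [key for key, _ in pairs],
        "url_parameter_value_sxt": [value for _, value in pairs],
        "url_chars_i": len(url),
    }


if __name__ == "__main__":
    fields = url_fields("https://example.org/docs/api/index.html?lang=en&page=2")
    for key, value in fields.items():
        print(key, "=", value)
```

For the sample URL, the path elements are docs and api, the file name is index, and the extension is html.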
host_dnc_s
(string) - the Domain Class Name, either the TLD or a combination of ccSLD+TLD if a ccSLD is used.
host_organizationdnc_s
(string) - the organization and dnc concatenated with '.'
host_subdomain_s
(string) - the remaining part of the host without organizationdnc
host_extent_i
(num_integer) - number of documents from the same host; can be used to measure references_internal_i for likelihood computation
title_count_i
(num_integer) - number of titles (counting the 'title' field) in the document
title_chars_val
(num_integer) - number of characters for each title
title_words_val
(num_integer) - number of words in each title
description_count_i
(num_integer) - number of descriptions in the document. It does not count the 'description' field since there is only one, but the number of descriptions that appear in the document (if any)
description_chars_val
(num_integer) - number of characters for each description
description_words_val
(num_integer) - number of words in each description
h1_i
(num_integer) - number of h1 header lines
h2_i
(num_integer) - number of h2 header lines
h3_i
(num_integer) - number of h3 header lines
h4_i
(num_integer) - number of h4 header lines
h5_i
(num_integer) - number of h5 header lines
h6_i
(num_integer) - number of h6 header lines
schema_org_breadcrumb_i
(num_integer) - number of itemprop="breadcrumb" appearances in div tags
opengraph_title_t
(text_general) - Open Graph Metadata from og:title metadata field, see https://ogp.me/
opengraph_type_s
(text_general) - Open Graph Metadata from og:type metadata field, see https://ogp.me/
opengraph_url_s
(text_general) - Open Graph Metadata from og:url metadata field, see https://ogp.me/
opengraph_image_s
(text_general) - Open Graph Metadata from og:image metadata field, see https://ogp.me/
link structure for ranking
cr_host_count_i
(num_integer) - the number of documents within a single host
cr_host_chance_d
(num_double) - the chance to click on this page when randomly clicking on links within one host
cr_host_norm_i
(num_integer) - normalization of chance: 0 for the lower half of the cr_host_count_i urls, 1 for 1/2 of the remaining and so on. The maximum number is 10
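The halving rule described for cr_host_norm_i can be read as a rank-based bucketing of the cr_host_chance_d values of one host. The following sketch shows one possible reading. How halves are rounded, how ties are ordered, and that the last level absorbs all remaining urls are assumptions of this example, and normalize_chances is a hypothetical name, not a YaCy function.

```python
def normalize_chances(chances, max_level=10):
    """Assign a level 0..max_level to each chance value of one host.

    The lowest-ranked half gets level 0, half of the remainder gets
    level 1, and so on; the highest level takes whatever is left.
    """
    # Indices of the input, ordered from lowest to highest chance.
    order = sorted(range(len(chances)), key=lambda i: chances[i])
    levels = [0] * len(chances)
    start = 0
    level = 0
    while start < len(order):
        remaining = len(order) - start
        # Assumption: halves round down, but every level takes at least one url.
        take = remaining if level == max_level else max(1, remaining // 2)
        for index in order[start:start + take]:
            levels[index] = level
        start += take
        level += 1
    return levels


if __name__ == "__main__":
    chances = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    print(normalize_chances(chances))  # [0, 0, 0, 0, 1, 1, 2, 3]
```

Under this reading, a host needs a few thousand documents before any page reaches the maximum level of 10.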
custom rating
values to influence the ranking in combination with boost rules
rating_i
(num_integer) - custom rating; to be set with external rating information
special values
can only be used if '_val' type is defined in schema file; this is not standard
bold_val
(num_integer) - number of occurrences of texts in bold_txt
italic_val
(num_integer) - number of occurrences of texts in italic_txt
underline_val
(num_integer) - number of occurrences of texts in underline_txt
ext_cms_txt
(text_general) - names of cms attributes; if several are recognized then they are listed in decreasing order of number of matching criteria
ext_cms_val
(num_integer) - number of attributes that count for a specific cms in ext_cms_txt
ext_ads_txt
(text_general) - names of ad-servers/ad-services
ext_ads_val
(num_integer) - number of attribute counts in ext_ads_txt
ext_community_txt
(text_general) - names of recognized community functions
ext_community_val
(num_integer) - number of attribute counts in ext_community_txt
ext_maps_txt
(text_general) - names of map services
ext_maps_val
(num_integer) - number of attribute counts in ext_maps_txt
ext_tracker_txt
(text_general) - names of tracker servers
ext_tracker_val
(num_integer) - number of attribute counts in ext_tracker_txt
ext_title_txt
(text_general) - names matching title expressions
ext_title_val
(num_integer) - number of matching title expressions
vocabularies_sxt
(string) - collection of all vocabulary names that have a matcher in the document - use this to boost with vocabularies