Further to previous emails (6th Nov 2015), we are all set to kick off the web harvest in
the new year.
We’ll be starting the crawl on the 11th Jan 2016, and expect to be crawling for
approximately 3 to 4 weeks.
The crawlers will be using the user agent string “NLNZ_IAHarvester2016”, so if you do need
to set any specific rules for our crawlers, this is the identifier to use.
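
For anyone who wants to check a rule before the crawl starts, here is a minimal sketch of a robots.txt entry aimed at this user agent, tested with Python's standard urllib.robotparser module. The /private/ path and the example.org URLs are hypothetical placeholders, and the standard-library parser only approximates how a crawler reads the file; it is not the harvester's own logic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: the /private/ path is an illustrative
# assumption, not a rule taken from any real site.
ROBOTS_TXT = """\
User-agent: NLNZ_IAHarvester2016
Disallow: /private/

User-agent: *
Disallow:
"""


def can_fetch(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt allows `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("NLNZ_IAHarvester2016", "http://example.org/private/page.html"),
        ("NLNZ_IAHarvester2016", "http://example.org/index.html"),
        ("OtherBot", "http://example.org/private/page.html"),
    ]:
        print(agent, url, can_fetch(agent, url))
```

In this sketch only the harvest user agent is kept out of /private/; every other crawler falls through to the catch-all entry, which disallows nothing.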
The robots.txt and Robots META tag exclusions on crawled sites will be obeyed, with some
exceptions:
We’ll strictly obey all rules that relate to the user agent (apart from slash pages, which
will be harvested regardless).
Facebook and some other curated social media sites will be harvested regardless of their
robots.txt rules.
The Crawl Notification (Notice to Webmasters) page is located on the Library’s web site.
If you have any questions or concerns about the harvest, please drop me a line. I’ll check
email at various points over the Christmas break, and will be back in the office on the 4th
Jan to address any questions/concerns.
Jay Gattuso | Digital Preservation Analyst | Preservation, Research and Consultancy
National Library of New Zealand | Te Puna Mātauranga o Aotearoa
PO Box 1467 Wellington 6140 New Zealand | +64 (0)4 474 3064