public class SearchEngineExtractor
extends EvalFunc<String>
SearchEngineExtractor takes a URL string and extracts the name of the search engine it points to. For example, given
http://www.google.com/search?hl=en&safe=active&rls=GGLG,GGLG:2005-24,GGLG:en&q=purpose+of+life&btnG=Search
then
Google
would be extracted.
From Pig Latin, usage looks something like
searchEngine = FOREACH row GENERATE
org.apache.pig.piggybank.evaluation.util.apachelogparser.SearchEngineExtractor(referer);
Supported search engines include abacho.com, alice.it, alltheweb.com, altavista.com, aolsearch.aol.com,
as.starware.com, ask.com, blogs.icerocket.com, blogsearch.google.com, blueyonder.co.uk, busca.orange.es,
buscador.lycos.es, buscador.terra.es, buscar.ozu.es, categorico.it, cuil.com, excite.com, excite.it,
fastweb.it, feedster.com, godado.com, godado.it, google.ad, google.ae, google.af, google.ag, google.am,
google.as, google.at, google.az, google.ba, google.be, google.bg, google.bi, google.biz, google.bo,
google.bs, google.bz, google.ca, google.cc, google.cd, google.cg, google.ch, google.ci, google.cl,
google.cn, google.co.at, google.co.bi, google.co.bw, google.co.ci, google.co.ck, google.co.cr,
google.co.gg, google.co.gl, google.co.gy, google.co.hu, google.co.id, google.co.il, google.co.im,
google.co.in, google.co.it, google.co.je, google.co.jp, google.co.ke, google.co.kr, google.co.ls,
google.co.ma, google.co.mu, google.co.mw, google.co.nz, google.co.pn, google.co.th, google.co.tt,
google.co.ug, google.co.uk, google.co.uz, google.co.ve, google.co.vi, google.co.za, google.co.zm,
google.co.zw, google.com, google.com.af, google.com.ag, google.com.ai, google.com.ar, google.com.au,
google.com.az, google.com.bd, google.com.bh, google.com.bi, google.com.bn, google.com.bo, google.com.br,
google.com.bs, google.com.bz, google.com.cn, google.com.co, google.com.cu, google.com.do, google.com.ec,
google.com.eg, google.com.et, google.com.fj, google.com.ge, google.com.gh, google.com.gi, google.com.gl,
google.com.gp, google.com.gr, google.com.gt, google.com.gy, google.com.hk, google.com.hn, google.com.hr,
google.com.jm, google.com.jo, google.com.kg, google.com.kh, google.com.ki, google.com.kz, google.com.lk,
google.com.lv, google.com.ly, google.com.mt, google.com.mu, google.com.mw, google.com.mx, google.com.my,
google.com.na, google.com.nf, google.com.ng, google.com.ni, google.com.np, google.com.nr, google.com.om,
google.com.pa, google.com.pe, google.com.ph, google.com.pk, google.com.pl, google.com.pr, google.com.pt,
google.com.py, google.com.qa, google.com.ru, google.com.sa, google.com.sb, google.com.sc, google.com.sg,
google.com.sv, google.com.tj, google.com.tr, google.com.tt, google.com.tw, google.com.uy, google.com.uz,
google.com.ve, google.com.vi, google.com.vn, google.com.ws, google.cz, google.de, google.dj, google.dk,
google.dm, google.ec, google.ee, google.es, google.fi, google.fm, google.fr, google.gd, google.ge,
google.gf, google.gg, google.gl, google.gm, google.gp, google.gr, google.gy, google.hk, google.hn,
google.hr, google.ht, google.hu, google.ie, google.im, google.in, google.info, google.is, google.it,
google.je, google.jo, google.jobs, google.jp, google.kg, google.ki, google.kz, google.la, google.li,
google.lk, google.lt, google.lu, google.lv, google.ma, google.md, google.mn, google.mobi, google.ms,
google.mu, google.mv, google.mw, google.net, google.nf, google.nl, google.no, google.nr, google.nu,
google.off.ai, google.ph, google.pk, google.pl, google.pn, google.pr, google.pt, google.ro, google.ru,
google.rw, google.sc, google.se, google.sg, google.sh, google.si, google.sk, google.sm, google.sn,
google.sr, google.st, google.tk, google.tm, google.to, google.tp, google.tt, google.tv, google.tw,
google.ug, google.us, google.uz, google.vg, google.vn, google.vu, google.ws, gps.virgin.net, hotbot.com,
ilmotore.com, ithaki.net, kataweb.it, libero.it, lycos.it, mamma.com, megasearching.net, mirago.co.uk,
netscape.com, search.aol.co.uk, search.arabia.msn.com, search.bbc.co.uk, search.conduit.com,
search.icq.com, search.live.com, search.lycos.co.uk, search.lycos.com, search.msn.co.uk, search.msn.com,
search.myway.com, search.mywebsearch.com, search.ntlworld.com, search.orange.co.uk, search.prodigy.msn.com,
search.sweetim.com, search.virginmedia.com, search.yahoo.co.jp, search.yahoo.com, search.yahoo.jp,
simpatico.ws, soso.com, suche.fireball.de, suche.t-online.de, suche.web.de, technorati.com, tesco.net,
thespider.it, tiscali.co.uk, uk.altavista.com, uk.ask.com, uk.search.yahoo.com
Thanks to Spiros Denaxas for his URI::ParseSearchString, which is the basis for the lookups.
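To illustrate the kind of host-based lookup described above, here is a minimal sketch in plain Java. It is not the piggybank implementation: the class name SearchEngineLookupSketch, the method extract, the small host table, the engine names other than Google, and the choice to return null for unknown hosts are all assumptions made for the example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; NOT the actual SearchEngineExtractor code.
class SearchEngineLookupSketch {

    // Hypothetical host-to-name table covering a tiny subset of the hosts
    // listed above. Names other than "Google" are assumed for illustration.
    private static final Map<String, String> ENGINES = new HashMap<>();
    static {
        ENGINES.put("google.com", "Google");
        ENGINES.put("google.co.uk", "Google");
        ENGINES.put("search.yahoo.com", "Yahoo"); // assumed name
        ENGINES.put("ask.com", "Ask");            // assumed name
    }

    // Returns the engine name for the URL's host, or null if the URL cannot
    // be parsed or the host is not in the table (null is an assumption here).
    static String extract(String url) {
        if (url == null) {
            return null;
        }
        String host;
        try {
            host = new URI(url.trim()).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
        if (host == null) {
            return null;
        }
        host = host.toLowerCase(Locale.ROOT);

        // Try the full host first, then drop leading labels one at a time,
        // so that "www.google.com" falls back to "google.com".
        while (true) {
            String name = ENGINES.get(host);
            if (name != null) {
                return name;
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return null;
            }
            host = host.substring(dot + 1);
        }
    }
}
```

With this sketch, calling SearchEngineLookupSketch.extract on the Google URL from the example above yields Google, while a host that is not in the table yields null. Matching on whole host labels rather than substrings keeps unrelated hosts that merely contain a supported name from matching.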