apache pig - How to read the .doc or .docx file -


How can I read a .doc file with Apache Pig Latin running on MapReduce?


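-- load the file line by line as plain text
a = load './pig/test.docx' using TextLoader();
-- split each line into words, one word per record
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
-- group identical words together
c = group b by word;
-- count the occurrences of each word
d = foreach c generate COUNT(b), group;
store d into './wordcountone';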


You need to create a custom load function for your Pig script. First, start with simple .doc or .docx parsing in Java; an example is available here: How to read a doc or docx file in Java? I'm sure you can find more on Google.

Once you know how to get the data out of the Word document, you need to implement a Pig load function.

An example of a custom Pig loader (step by step) can be found here
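As a rough illustration of the Java parsing step, here is a minimal sketch that pulls paragraph text out of a .docx file using only the JDK. It relies on the fact that a .docx file is a ZIP archive whose main text lives in word/document.xml. The class name DocxText and the method extractParagraphs are made up for this example and are not part of Pig or of any library. It does not handle the older binary .doc format (a library such as Apache POI would be needed for that), and it ignores details such as headers, footnotes and text boxes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only: extracts paragraph text from a .docx stream.
public class DocxText {
    // XML namespace used by WordprocessingML elements such as w:p and w:t.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per w:p (paragraph) element found in word/document.xml.
    public static List<String> extractParagraphs(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // Copy the current ZIP entry into memory before parsing it.
                    ByteArrayOutputStream xml = new ByteArrayOutputStream();
                    byte[] buffer = new byte[8192];
                    int n;
                    while ((n = zip.read(buffer)) != -1) {
                        xml.write(buffer, 0, n);
                    }
                    return parse(xml.toByteArray());
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a .docx file?");
    }

    // Parses the document XML and joins the w:t text runs of each paragraph.
    private static List<String> parse(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

        List<String> paragraphs = new ArrayList<>();
        NodeList ps = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < ps.getLength(); i++) {
            NodeList runs = ((Element) ps.item(i)).getElementsByTagNameNS(W_NS, "t");
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            paragraphs.add(text.toString());
        }
        return paragraphs;
    }

    // Usage: java DocxText path/to/file.docx
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            for (String paragraph : extractParagraphs(in)) {
                System.out.println(paragraph);
            }
        }
    }
}
```

A custom loader could call a method like this on each input file and emit one tuple per paragraph, which the word-count steps in the script could then tokenize.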

