Parquet tools installation

Hi @abhinav,

Could you please install the ‘parquet-tools’ package on CloudxLab? It would help students like us learn about the Parquet file format.

Hi @raviteja,

Yes, I will install it globally.

In the meantime, please follow this link to set it up locally in your home directory from the web console.

Hope this helps.

Thanks

Thanks @abhinav for the prompt response.

I have already tried installing it locally and am facing the issue below.
Step 1: Cloned the Parquet repository and tried to build it locally using Maven:

git clone https://github.com/Parquet/parquet-mr.git
cd parquet-mr/parquet-tools/
mvn clean package -Plocal

This is the error I’m getting:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.091s
[INFO] Finished at: Mon Dec 25 09:01:12 UTC 2017
[INFO] Final Memory: 11M/234M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project parquet-tools: Could not resolve dependencies for project org.apache.parquet:parquet-tools:jar:1.9.1-SNAPSHOT: Failed to collect dependencies for [org.apache.parquet:parquet-format:jar:2.4.0 (compile), org.apache.parquet:parquet-hadoop:jar:1.9.1-SNAPSHOT (compile), org.apache.hadoop:hadoop-client:jar:2.7.3 (compile), commons-cli:commons-cli:jar:1.3.1 (compile), com.google.guava:guava:jar:20.0 (compile), org.slf4j:slf4j-log4j12:jar:1.7.22 (compile), junit:junit:jar:4.12 (test), org.easymock:easymock:jar:3.4 (test), commons-httpclient:commons-httpclient:jar:3.1 (test)]: Failed to read artifact descriptor for org.apache.parquet:parquet-hadoop:jar:1.9.1-SNAPSHOT: Could not transfer artifact org.apache.parquet:parquet-hadoop:pom:1.9.1-SNAPSHOT from/to jitpack.io (https://jitpack.io): peer not authenticated -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException

Please help me resolve this.
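
The useful part of a long Maven dependency error like the one above is usually the final "Could not transfer artifact" clause, which names the artifact, the repository, and the reason. As a rough sketch (the function name `parse_transfer_error` and the regular expression are illustrative assumptions, not something Maven provides), it can be pulled out of the line like this:

```python
import re

# Illustrative pattern for Maven's "Could not transfer artifact ..." clause.
# It is an assumption based on the message format seen in the log above.
TRANSFER_ERROR = re.compile(
    r"Could not transfer artifact (?P<artifact>\S+) "
    r"from/to (?P<repo_id>\S+) \((?P<repo_url>[^)]+)\): "
    r"(?P<reason>.+?)(?: -> \[Help \d+\])?$"
)


def parse_transfer_error(line):
    """Return artifact, repository and reason as a dict, or None if no match."""
    match = TRANSFER_ERROR.search(line)
    return match.groupdict() if match else None


sample = (
    "[ERROR] ... Could not transfer artifact "
    "org.apache.parquet:parquet-hadoop:pom:1.9.1-SNAPSHOT from/to jitpack.io "
    "(https://jitpack.io): peer not authenticated -> [Help 1]"
)
print(parse_transfer_error(sample))
```

Here the reason is "peer not authenticated", which typically points to an SSL/TLS problem between the JVM and the repository rather than to a problem in the Parquet sources themselves.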

Hi @abhinav,

The above issue has been resolved. Below are the steps to follow for a local installation of ‘parquet-tools’.

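Java::

  1. Download the stable Parquet release:

$ wget https://github.com/apache/parquet-mr/archive/apache-parquet-1.8.2.tar.gz

  2. Extract it and build parquet-tools locally with Maven (the directory name below is what a GitHub archive of this tag normally extracts to):

$ tar xzf apache-parquet-1.8.2.tar.gz
$ cd parquet-mr-apache-parquet-1.8.2/parquet-tools && mvn clean package -Plocal

  3. Test it (Maven places the built jar under target/):

$ java -jar target/parquet-tools-1.8.2.jar schema sample.parquet

Note: The Git clone builds an unstable 1.9.1-SNAPSHOT version and has a few build issues, so I downloaded the stable release and built that locally instead.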

Now I can start working with Parquet files.
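
If you want to call the built jar from a script instead of typing the command each time, a thin wrapper is enough. This is only a sketch: the helper names `schema_command` and `read_schema` are made up for illustration, and it assumes `java` is on the PATH and that you pass the paths of your own jar and Parquet file.

```python
import subprocess


def schema_command(jar_path, parquet_path):
    """Build the same command line as the 'Test it' step above."""
    return ["java", "-jar", str(jar_path), "schema", str(parquet_path)]


def read_schema(jar_path, parquet_path):
    """Run parquet-tools and return whatever it prints for the schema."""
    result = subprocess.run(
        schema_command(jar_path, parquet_path),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output as text
        check=True,           # raise CalledProcessError if java exits non-zero
    )
    return result.stdout


# Example call (needs the jar and a real Parquet file to exist):
# print(read_schema("parquet-tools-1.8.2.jar", "sample.parquet"))
```

Keeping the command construction in its own function makes it easy to swap `schema` for another parquet-tools subcommand later.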

Python::
Please follow the steps below to work with Parquet files using Python instead of Java:

  1. Create a virtualenv (or install directly):

$ virtualenv parquet-tools
$ source parquet-tools/bin/activate
$ pip install parquet

  2. Print the metadata of a Parquet file:

$ parquet --metadata test.parquet

  3. Print the data in a Parquet file:

$ parquet test.parquet

Note that the 'parquet' command is available directly only after activating the virtualenv.
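
Since the goal is to learn the file format itself, it can also help to look at the raw bytes. A Parquet file starts and ends with the 4-byte magic `PAR1`, and the 4 bytes just before the trailing magic hold the length of the metadata footer. A minimal sketch using only the standard library (the function name `footer_length` is illustrative, and it assumes a plain, unencrypted file):

```python
import os
import struct

MAGIC = b"PAR1"  # Parquet files start and end with these four bytes


def footer_length(path):
    """Return the metadata footer size in bytes, or raise ValueError."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, os.SEEK_END)   # jump to the end to learn the file size
        size = f.tell()
        # Smallest possible layout: magic + footer length + magic = 12 bytes.
        if size < 12:
            raise ValueError("too small to be a Parquet file")
        f.seek(-8, os.SEEK_END)  # last 8 bytes: footer length + magic
        tail = f.read(8)
    if head != MAGIC or tail[4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The footer length is stored as a 4-byte little-endian integer.
    return struct.unpack("<I", tail[:4])[0]


# Example call (needs a real Parquet file to exist):
# print(footer_length("test.parquet"))
```

A `ValueError` here is a quick hint that the file is truncated or is not Parquet at all, before reaching for the heavier tools.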

Thanks @raviteja for providing the steps for Python :)
