
[Innovative Growth Young Talent] Training Course for Security Experts Using Artificial Intelligence

Loading the EMBER dataset | Mounting my Drive in Colab to store the data

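I was planning to use the EMBER dataset in a project, but in the end I couldn't. So here is just a quick summary of how to load it!

I did everything in Colab with my Google Drive mounted. The EMBER archive is large (about 1.6 GB compressed), so the download takes a while.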

 

# Mount Google Drive so the EMBER dataset can be downloaded into My Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive

# Move to My Drive
cd /content/gdrive/My\ Drive
/content/gdrive/My Drive

# Create a folder for the project
mkdir YAK_project

# Move into the folder just created
cd YAK_project/
/content/gdrive/My Drive/YAK_project

# Install ember
!pip install git+https://github.com/endgameinc/ember.git
Collecting git+https://github.com/endgameinc/ember.git
 Cloning https://github.com/endgameinc/ember.git to /tmp/pip-req-build-d0i0rbg6
 Running command git clone -q https://github.com/endgameinc/ember.git /tmp/pip-req-build-d0i0rbg6
Collecting lief>=0.9.0
(... rest of the installation output omitted ...)
Successfully built ember
Installing collected packages: lief, ember
Successfully installed ember-0.1.0 lief-0.10.1

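# Download the EMBER dataset
# (--no-check-certificate is used because the server's certificate had expired, as the log below shows)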
!wget --no-check-certificate https://pubdata.endgame.com/ember/ember_dataset_2018_2.tar.bz2
--2020-11-03 00:16:34-- https://pubdata.endgame.com/ember/ember_dataset_2018_2.tar.bz2
Resolving pubdata.endgame.com (pubdata.endgame.com)... 64.250.189.21
Connecting to pubdata.endgame.com (pubdata.endgame.com)|64.250.189.21|:443... connected.
WARNING: cannot verify pubdata.endgame.com's certificate, issued by ‘CN=Go Daddy Secure Certificate Authority - G2,OU=http://certs.godadd'
 Issued certificate has expired.
HTTP request sent, awaiting response... 200 OK
Length: 1696539273 (1.6G) [application/octet-stream]
Saving to: ‘ember_dataset_2018_2.tar.bz2’
ember_dataset_2018_ 100%[===================>] 1.58G 13.1MB/s in 1m 48s
2020-11-03 00:18:24 (14.9 MB/s) - ‘ember_dataset_2018_2.tar.bz2’ saved [1696539273/1696539273]

# Check that the download succeeded
ls -la
total 1656777
-rw------- 1 root root 1696539273 Jul 30 2019 ember_dataset_2018_2.tar.bz2
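
Since the download skipped certificate verification, a cheap extra check is to compare the size on disk with the "Length" value wget reported (1696539273 bytes). This is only a minimal sketch: the helper name check_download_size is my own, and a matching size does not prove the contents are intact.

```python
import os

# Byte count taken from the wget log above ("Length: 1696539273").
EXPECTED_SIZE = 1696539273

def check_download_size(path, expected_size=EXPECTED_SIZE):
    """Return True if `path` is an existing file of exactly `expected_size` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_size

# Usage in the notebook (run from the folder that holds the archive):
# check_download_size('ember_dataset_2018_2.tar.bz2')
```

If this returns False, the download was probably cut off and is worth repeating before extracting.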

# Extract the archive
! tar -xvf ember_dataset_2018_2.tar.bz2
ember2018/
ember2018/train_features_1.jsonl
ember2018/train_features_0.jsonl
ember2018/train_features_3.jsonl
ember2018/test_features.jsonl
ember2018/ember_model_2018.txt
ember2018/train_features_5.jsonl
ember2018/train_features_4.jsonl
ember2018/train_features_2.jsonl

# Check that the extraction succeeded
ls -la
total 1656781
drwx------ 2 root root 4096 Jul 29 2019 ember2018/
-rw------- 1 root root 1696539273 Jul 30 2019 ember_dataset_2018_2.tar.bz2
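
The extraction can also be done from Python with the standard-library tarfile module instead of the shell. A minimal sketch, assuming the archive comes from a source you trust; extract_archive is a hypothetical helper name, not part of ember.

```python
import tarfile

def extract_archive(archive_path, dest_dir='.'):
    """Extract a tar archive into dest_dir and return the member names.

    Mode 'r:*' lets tarfile detect the compression (bz2 here) on its own.
    """
    with tarfile.open(archive_path, 'r:*') as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names

# Usage in the notebook:
# extract_archive('ember_dataset_2018_2.tar.bz2')
```

The returned list plays the same role as the file listing that tar -xvf prints.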

# Parse the JSONL files

import json
    
def load_jsonl(input_path) -> list:
    """
    Read list of objects from a JSON lines file.
    """
    data = []
    with open(input_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Strip the trailing newline / carriage return before parsing
            data.append(json.loads(line.rstrip('\r\n')))
    print('Loaded {} records from {}'.format(len(data), input_path))
    return data
    
import pandas as pd
import numpy as np

# Load the EMBER data (first training shard)
ember_data = load_jsonl('ember2018/train_features_0.jsonl')
Loaded 50000 records from ember2018/train_features_0.jsonl
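
Loading a whole shard into a list keeps every record in memory at once. When only a summary is needed, the file can be streamed line by line instead. A minimal sketch: iter_jsonl and count_labels are hypothetical helper names, and it assumes every record carries a 'label' key, as the record printed further down does.

```python
import json
from collections import Counter

def iter_jsonl(input_path):
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    with open(input_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def count_labels(input_path):
    """Count the 'label' values without holding the records in memory."""
    return Counter(record['label'] for record in iter_jsonl(input_path))

# Usage in the notebook:
# count_labels('ember2018/train_features_0.jsonl')
```

As far as I understand the EMBER documentation, label 0 means benign, 1 means malicious, and -1 means unlabeled, so the counts give a quick picture of how each shard is split.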

# Store the records as a DataFrame
ember_df = pd.DataFrame(ember_data)

# Check the shape
ember_df.shape
(50000, 14)

# Look at the record at index 0 of the EMBER data
ember_df.iloc[0]
sha256 0abb4fda7d5b13801d63bee53e5e256be43e141faa077a...
md5 63956d6417f8f43357d9a8e79e52257e
appeared 2006-12
label 0
avclass
histogram [45521, 13095, 12167, 12496, 12429, 11709, 118...
byteentropy [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
strings {'numstrings': 14573, 'avlength': 5.9720716393...
general {'size': 3101705, 'vsize': 380928, 'has_debug'...
header {'coff': {'timestamp': 1124149349, 'machine': ...
section {'entry': '.text', 'sections': [{'name': '.tex...
imports {'KERNEL32.dll': ['SetFileTime', 'CompareFileT...
exports []
datadirectories [{'name': 'EXPORT_TABLE', 'size': 0, 'virtual_...
Name: 0, dtype: object
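
Several columns (strings, general, header, ...) hold nested dicts, which are awkward to work with directly. One way to spread such a column into ordinary columns is pandas.json_normalize. A minimal sketch on toy data: flatten_column is a hypothetical helper, and the values in sample_records are made up. Only the key names 'size', 'vsize', and 'has_debug' come from the output above.

```python
import pandas as pd

# Toy stand-in for ember_df; the values are illustrative, not real EMBER data.
sample_records = [
    {'sha256': 'aaa', 'label': 0,
     'general': {'size': 100, 'vsize': 200, 'has_debug': 0}},
    {'sha256': 'bbb', 'label': 1,
     'general': {'size': 300, 'vsize': 400, 'has_debug': 1}},
]

def flatten_column(df, column):
    """Replace a column of dicts with one column per key, named '<column>.<key>'."""
    expanded = pd.json_normalize(df[column].tolist()).add_prefix(column + '.')
    expanded.index = df.index  # keep rows aligned with the original frame
    return df.drop(columns=[column]).join(expanded)

sample_df = pd.DataFrame(sample_records)
flat_df = flatten_column(sample_df, 'general')
print(flat_df.columns.tolist())
```

On the real data the call would be flatten_column(ember_df, 'general'). Columns that hold lists, such as histogram and byteentropy, need different handling.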