Pure Textual Features /scikit-learn/countvectorize

Pure Textual Features

Document preprocessing in scikit-learn

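This is how CountVectorizer() works in scikit-learn.

It builds word tokens from a collection of documents, counts how often each word occurs, and produces a bag-of-words (BOW) encoded vector for each document.

Because the encoding only counts words, it ignores grammar and word order,

so it is a powerful way to handle text whose grammar or ordering is a mess.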
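The counting step can be sketched in plain Python. This is only an illustration, not scikit-learn's implementation: the `bag_of_words` helper and the `TOKEN` pattern are hypothetical names, and the pattern assumes the vectorizer's default behavior of lowercasing and keeping tokens of two or more word characters. The two sentences differ only in word order, so they end up with the same vector.

```python
import re
from collections import Counter

# Assumption: approximate CountVectorizer's defaults
# (lowercase everything, keep tokens of 2+ word characters).
TOKEN = re.compile(r"\b\w\w+\b")

def bag_of_words(corpus):
    """Return (sorted vocabulary, one count row per document)."""
    tokenized = [TOKEN.findall(doc.lower()) for doc in corpus]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)  # missing words count as 0
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows

vocab, rows = bag_of_words([
    "Harry ran faster than Authman.",
    "Authman ran faster than Harry.",
])
print(vocab)
print(rows)  # both rows are identical: word order is not encoded
```

The flip side is that the two sentences mean different things, and a BOW vector cannot tell them apart.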

>>> from sklearn.feature_extraction.text import CountVectorizer
	
>>> corpus = [
...  "Authman ran faster than Harry because he is an athlete.",
...  "Authman and Harry ran faster and faster.",
... ]
	
>>> bow = CountVectorizer()
>>> X = bow.fit_transform(corpus) # Sparse Matrix
	
>>> bow.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
['an', 'and', 'athlete', 'authman', 'because', 'faster', 'harry', 'he', 'is', 'ran', 'than']
	
>>> print(X.toarray())
[[1 0 1 1 1 1 1 1 1 1 1]
 [0 2 0 1 0 2 1 0 0 1 0]]

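Note that X is not stored as the usual [n_samples, n_features] dataframe. fit_transform() returns a SciPy sparse matrix instead.

That is not a problem, because SciPy provides the tools to work with it.

SciPy is a library of mathematical algorithms and convenience functions that builds on and extends NumPy.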
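A minimal sketch of handling such a sparse matrix, assuming SciPy's `csr_matrix` format. The dense array is typed in by hand from the count output above so that the snippet runs without scikit-learn; with a real vectorizer you would use the result of fit_transform() directly.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Counts copied by hand from the CountVectorizer example (2 documents x 11 words)
dense = np.array([
    [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 2, 0, 1, 0, 2, 1, 0, 0, 1, 0],
])

X = csr_matrix(dense)          # only the non-zero entries are stored
print(X.shape)                 # (n_samples, n_features)
print(X.nnz)                   # number of stored non-zero entries

# Words per document, computed without converting to a dense array first
words_per_doc = np.asarray(X.sum(axis=1)).ravel()
print(words_per_doc)

# Convert back to a regular NumPy array when a dense view is needed
restored = X.toarray()
```

For a small corpus the dense form is fine, but for a large vocabulary most counts are zero, which is why the sparse form is the default.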

Graphical Features

# Uses the Image module (PIL); scipy.misc.imread has been removed from SciPy
import numpy as np
from PIL import Image

# Load the image up as a NumPy array
img = np.asarray(Image.open('image.png'))

# Is the image too big? Downsample it by keeping every second pixel along each axis
img = img[::2, ::2]

# Scale colors from (0-255) to (0-1), then reshape to 1D array per pixel, e.g. grayscale
# If you had color images and wanted to preserve all color channels, use .reshape(-1,3)
X = (img / 255.0).reshape(-1)

# To-Do: Machine Learning with X!
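To see what the slicing and reshaping do to the array shapes, here is a small sketch that uses a synthetic array in place of a real file. The 4x6 RGB "image" is a made-up stand-in, not data from the original post.

```python
import numpy as np

# Hypothetical stand-in for a loaded image: 4 rows x 6 columns x 3 color channels
img = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)

# Keep every second row and every second column
small = img[::2, ::2]
print(small.shape)

# Scale to the 0-1 range and flatten everything into one feature vector
flat = (small / 255.0).reshape(-1)
print(flat.shape)

# Alternative: one row per pixel, keeping the three color channels together
per_pixel = (small / 255.0).reshape(-1, 3)
print(per_pixel.shape)
```

Which layout to use depends on whether one sample should be a whole image (flat) or a single pixel (per_pixel).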

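# Uses the Image module (PIL); scipy.misc.imread has been removed from SciPy
import numpy as np
import pandas as pd
from PIL import Image

# Load the images up, one flattened sample per image
# (`files` is assumed to be a list of paths to images of identical size)
dset = []
for fname in files:
  img = np.asarray(Image.open(fname))
  dset.append(  (img[::2, ::2] / 255.0).reshape(-1)  )

# One row per image, one column per pixel value
dset = pd.DataFrame( dset )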

Audio Features

import scipy.io.wavfile as wavfile
	
sample_rate, audio_data = wavfile.read('sound.wav')
print(audio_data)
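A self-contained sketch of the same read, assuming no 'sound.wav' is at hand: it first writes a synthetic one-second tone to a temporary file and then reads it back. The tone, the 8000 Hz rate, and the scaling to the -1 to 1 range are illustrative assumptions (the scaling mirrors what was done for images), not steps from the original post.

```python
import os
import tempfile

import numpy as np
import scipy.io.wavfile as wavfile

# Synthetic one-second 440 Hz tone as 16-bit samples (made-up test data)
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

# Write it to a temporary WAV file so there is something to read
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
wavfile.write(path, sample_rate, tone)

# Same call as in the post: returns the sample rate and a NumPy array of samples
rate, audio_data = wavfile.read(path)
print(rate, audio_data.dtype, audio_data.shape)

# Assumption: scale 16-bit samples to floats in [-1, 1] before using them as features
X = audio_data / 32768.0
```

For a mono file the array is 1D with one entry per sample, so a longer recording simply gives a longer feature vector.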
